Portals eNewsletters Web Seminars dataWarehouse.com DM Review Magazine
DM Review | Covering Business Intelligence, Integration & Analytics
   Covering Business Intelligence, Integration & Analytics Advanced Search
advertisement

RESOURCE PORTALS
View all Portals

WEB SEMINARS
Scheduled Events

RESEARCH VAULT
White Paper Library
Research Papers

CAREERZONE
View Job Listings
Post a job

Advertisement

INFORMATION CENTER
DM Review Home
Newsletters
Current Magazine Issue
Magazine Archives
Online Columnists
Ask the Experts
Industry News
Search DM Review

GENERAL RESOURCES
Bookstore
Buyer's Guide
Glossary
Industry Events Calendar
Monthly Product Guides
Software Demo Lab
Vendor Listings

DM REVIEW
About Us
Press Releases
Awards
Advertising/Media Kit
Reprints
Magazine Subscriptions
Editorial Calendar
Contact Us
Customer Service

We are in the process of loading a data warehouse from a mainframe source.

    Ask The Experts published in DMReview.com
February 1, 2000
 
  By David Marco and Sid Adelman

Q:  

We are in the process of loading a data warehouse from a mainframe source. Some of the files contain millions of rows. I am interested in how accurate the data is. Instead of checking all the records, I would only like to sample a few hundred or so. Is there a formula or rule of thumb that would give me a 95 percent or 99 percent confidence level that the data is correct assuming the data is either right or wrong?

A:  

David Marco's Answer: Sampling only a few hundred out of millions of records would not give me a 99 percent confidence level. However, here is the formula that I use: test the first record, the last record and at each interval for the remaining number of records that you wish to check. For example, if I only wished to test 11 out of 100 records with ids and sorted 1 - 100, I would test the following record numbers 1, 100, 10, 20, 30, 40, 50, 60, 70, 80, and 90.

Sid Adelman's Answer: Unless you really know sampling and are confident you can pull a representative sample, you're taking a chance. Also, will your users be comfortable hearing there is a 95 percent chance that their results are correct? The nice thing about checking the accuracy of the data is that it's usually not window dependent; you should be able to run these checks during off-hours. Unless you are terribly machine constrained, I'd run the checks against all the rows.

...............................................................................

For more information on related topics visit the following related portals...
Data Quality and ETL.

David Marco is an internationally recognized expert in the fields of enterprise architecture, data warehousing and business intelligence and is the world's foremost authority on meta data. He is the author of Universal Meta Data Models (Wiley, 2004) and Building and Managing the Meta Data Repository: A Full Life-Cycle Guide (Wiley, 2000). Marco has taught at the University of Chicago and DePaul University, and in 2004 he was selected to the prestigious Crain's Chicago Business "Top 40 Under 40."  He is the founder and president of Enterprise Warehousing Solutions, Inc., a GSA schedule and Chicago-headquartered strategic partner and systems integrator dedicated to providing companies and large government agencies with best-in-class business intelligence solutions using data warehousing and meta data repository technologies. He may be reached at (866) EWS-1100 or via e-mail at DMarco@EWSolutions.com.

Sid Adelman is a principal in Sid Adelman & Associates, an organization specializing in planning and implementing data warehouses, in data warehouse and BI assessments, and in establishing effective data architectures and strategies. He is a regular speaker at DW conferences. Adelman chairs the "Ask the Experts" column on www.dmreview.com. He is a frequent contributor to journals that focus on data warehousing. He co-authored Data Warehouse Project Management and is the principal author on Impossible Data Warehouse Situations with Solutions from the Experts. His new book, Data Strategy, is scheduled for publication this year. He can be reached at 818-783-9634 or sidadelman@aol.com.  Visit his Web site at www.sidadelman.com.

Solutions Marketplace
Provided by IndustryBrains

Data Validation Tools: FREE Trial
Protect against fraud, waste and excess marketing costs by cleaning your customer database of inaccurate, incomplete or undeliverable addresses. Add on phone check, name parsing and geo-coding as needed. FREE trial of Data Quality dev tools here.

Backup SQL Server or Exchange Continuously
FREE WHITE PAPER. Recover SQL Server, Exchange or NTFS data within minutes with TimeSpring?s continuous data protection (CDP) software. No protection gaps, no scheduling requirements, no backup related slowdowns and no backup windows to manage.

Manage Data Center from Virtually Anywhere!
Learn how SecureLinx remote IT management products can quickly and easily give you the ability to securely manage data center equipment (servers, switches, routers, telecom equipment) from anywhere, at any time... even if the network is down.

Design Databases with ER/Studio: Free Trial
ER/Studio delivers next-generation data modeling. Multiple, distinct physical models based on a single logical model give you the tools you need to manage complex database environments and critical metadata in an intuitive user interface.

Free EII Buyer's Guide
Understand EII - Trends. Tech. Apps. Calculate ROI. Download Now.

Click here to advertise in this space


E-mail This Ask The Experts E-Mail This Ask The Experts
Printer Friendly Version Printer-Friendly Version
Related Content Related Content
Request Reprints Request Reprints
Advertisement
advertisement
Site Map Terms of Use Privacy Policy
SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.
SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.