DM Review Published in DM Review Online in February 2000.
Printed from DMReview.com


We are in the process of loading a data warehouse from a mainframe source.

by David Marco  and by Sid Adelman

Summary: Is there a rule of thumb that would give me a 95 to 99 percent confidence level that the data in our warehouse is correct?

Q:  

We are in the process of loading a data warehouse from a mainframe source. Some of the files contain millions of rows. I am interested in how accurate the data is. Instead of checking all the records, I would only like to sample a few hundred or so. Is there a formula or rule of thumb that would give me a 95 percent or 99 percent confidence level that the data is correct assuming the data is either right or wrong?

A:  

David Marco?s Answer: Sampling only a few hundred out of millions of records would not give me a 99 percent confidence level. However, here is the formula that I use: test the first record, the last record and at each interval for the remaining number of records that you wish to check. For example, if I only wished to test 11 out of 100 records with ids and sorted 1 ? 100, I would test the following record numbers 1, 100, 10, 20, 30, 40, 50, 60, 70, 80, and 90.

Sid Adelman?s Answer: Unless you really know sampling and are confident you can pull a representative sample, you?re taking a chance. Also, will your users be comfortable hearing there is a 95 percent chance that their results are correct? The nice thing about checking the accuracy of the data is that it?s usually not window dependent; you should be able to run these checks during off-hours. Unless you are terribly machine constrained, I?d run the checks against all the rows.


David Marco is an internationally recognized expert in the fields of enterprise architecture, data warehousing and business intelligence and is the world's foremost authority on meta data. He is the author of Universal Meta Data Models (Wiley, 2004) and Building and Managing the Meta Data Repository: A Full Life-Cycle Guide (Wiley, 2000). Marco has taught at the University of Chicago and DePaul University, and in 2004 he was selected to the prestigious Crain's Chicago Business "Top 40 Under 40." He is the founder and president of Enterprise Warehousing Solutions, Inc., a GSA schedule and Chicago-headquartered strategic partner and systems integrator dedicated to providing companies and large government agencies with best-in-class business intelligence solutions using data warehousing and meta data repository technologies. He may be reached at (866) EWS-1100 or via e-mail at DMarco@EWSolutions.com.

Sid Adelman is a principal in Sid Adelman & Associates, an organization specializing in planning and implementing data warehouses and in establishing effective data architectures and strategies. He jointly developed a methodology, MapXpert for Data Warehouse, that provides a master plan for implementing a data warehouse. Adelman is a regular speaker at The Data Warehouse Institute and IBM's DB2 and Data Warehouse Conference. He chairs the "Ask the Experts" column on DMReview.com and is a founding member of the Business Intelligence Alliance. Adelman is a frequent contributor to journals that focus on the data warehouse. He coauthored Data Warehouse Project Management with Larissa Moss and Impossible Data Warehouse Situations: Solutions from the Experts. He can be reached at (818) 783-9634 or sidadelman@aol.com. Visit his Web site at www.sidadelman.com.

Copyright 2004, Thomson Media and DM Review.