Portals eNewsletters Web Seminars dataWarehouse.com DM Review Magazine
DM Review | Covering Business Intelligence, Integration & Analytics
   Covering Business Intelligence, Integration & Analytics Advanced Search
advertisement

Resource Portals
Analytic Applications
Business Intelligence
Business Performance Management
Data Integration
Data Quality
Data Warehousing Basics
EDM
EII
ETL
More Portals...

Advertisement

Information Center
DM Review Home
Conference & Expo
Web Seminars & Archives
Newsletters
Current Magazine Issue
Magazine Archives
Online Columnists
Ask the Experts
Industry News
Search DM Review

General Resources
Bookstore
Industry Events Calendar
Vendor Listings
White Paper Library
Glossary
Software Demo Lab
Monthly Product Guides
Buyer's Guide

General Resources
About Us
Press Releases
Awards
Media Kit
Reprints
Magazine Subscriptions
Editorial Calendar
Contact Us
Customer Service

Ask the Experts Question and Answer

Ask the Expert

Meet the Experts
Ask a Question (Names of individuals and companies will not be used.)
Question Archive
Ask the Experts Home

Q:

We are in the process of loading a data warehouse from a mainframe source. Some of the files contain millions of rows. I am interested in how accurate the data is. Instead of checking all the records, I would only like to sample a few hundred or so. Is there a formula or rule of thumb that would give me a 95 percent or 99 percent confidence level that the data is correct assuming the data is either right or wrong?

A:

David Marco?s Answer: Sampling only a few hundred out of millions of records would not give me a 99 percent confidence level. However, here is the formula that I use: test the first record, the last record and at each interval for the remaining number of records that you wish to check. For example, if I only wished to test 11 out of 100 records with ids and sorted 1 ? 100, I would test the following record numbers 1, 100, 10, 20, 30, 40, 50, 60, 70, 80, and 90.

Sid Adelman?s Answer: Unless you really know sampling and are confident you can pull a representative sample, you?re taking a chance. Also, will your users be comfortable hearing there is a 95 percent chance that their results are correct? The nice thing about checking the accuracy of the data is that it?s usually not window dependent; you should be able to run these checks during off-hours. Unless you are terribly machine constrained, I?d run the checks against all the rows.

(Posted )


ARCHIVE OF QUESTIONS & ANSWERS FOR DATA QUALITY
BACK TO THE LIST OF CATEGORIES



Advertisement
advertisement
Site Map Terms of Use Privacy Policy

Thomson Media

2005 The Thomson Corporation and DMReview.com. All rights reserved.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.