DM Review | Covering Business Intelligence, Integration & Analytics

Ask the Experts Question and Answer


When we are cleaning data from a relational source and some values cannot be cleaned, should they be put in an exception/error table? What structure do you suggest for this table?


Sid Adelman’s Answer: You have a number of choices when you find data that can’t be cleaned. Depending on the importance of the data and the need for completeness, you may decide to throw the row away completely, substitute a value indicating that it is invalid, insert a null value or, as you suggest, write it to an exception/error table. Your decision will depend on what you expect the users to need (do they need all the rows?) and on their ability to understand missing, flagged and null values. If you do put the unclean data in an exception table, what are you expecting the users to do with it? What purpose will this table serve? (It would be very difficult for the users to work with this exception table.)
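The four choices above can be sketched as a single routing step in a cleaning job. This is a minimal illustration, not a recommended design; the function, field and flag names are hypothetical.

```python
# Hypothetical sketch of the four options for an uncleanable value:
# discard the row, flag it, null it, or divert it to an exception table.
INVALID_FLAG = "INVALID"

def handle_dirty_row(row, field, clean_rows, exception_rows, policy="flag"):
    """Route a row whose `field` failed cleaning, per the chosen policy."""
    if policy == "discard":
        return                             # throw the row away completely
    if policy == "flag":
        row[field] = INVALID_FLAG          # substitute a value marking it invalid
    elif policy == "null":
        row[field] = None                  # load a null in place of the bad value
    elif policy == "exception":
        exception_rows.append(dict(row))   # divert to the exception/error table
        return
    clean_rows.append(row)

clean, errors = [], []
handle_dirty_row({"cust_id": 7, "birth_date": "13/45/1999"},
                 "birth_date", clean, errors, policy="null")
```

Note that the policy is an argument rather than hard-coded, which matches Adelman’s point: the right choice depends on what the users need from each field.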

David Marco’s Answer: The answer to your question is maybe. Sometimes I load these values into an error table; other times they are loaded into the data warehouse. Which you do is up to the business users assessing the system. As for the table layout, I recommend going to the Real-World Decision Support (RWDS) newsletter at http://www.EWSolutions.com/Newsletter.asp and looking up Mike Jennings’ excellent two-part article on operational meta data, in which he provides a good beginning layout. Look at the first two issues of RWDS in the archives.

Larissa Moss’s Answer: Dirty data, regardless of its source, is a time bomb waiting to go off for most companies that are moving toward CRM, BI, B2B and B2C. The quality of information used and sold as a product can make or break a company. My suggestions go beyond the question of how to dispense with or suspend these data exceptions.

Certainly, as a short-term solution, the ETL program/tool can write the "error report" into a table and populate such columns as: load date, source file (table) name, source field (column) name, target table name, target column name and value of the rejected data item. This table can then be used to assign each rejected data item to an analyst for resolution, and the assignment date, assigned-to name, resolution date, resolution comment and even reprocess date can be added to the table and updated manually.

For the long term, however, the company will need policies governing its data. Such a policy must identify its critical data, its important but not critical data, its less important data and its unimportant data, as seen through the eyes of business executives who make strategic decisions based on this data, or through the eyes of business operations units that sell this data as a product. The policy must also include guidelines for the acceptability of data exceptions in each of those categories: in some cases 30 percent "dirty data" may be acceptable; in others, 5 percent is too high. It must also include procedures for dispensing with or suspending data depending on its significance and its exception threshold; writing exceptions out to a table may be acceptable in one situation but not another. The policy then has to be followed with procedures to minimize the proliferation of bad data from its inception on the business operations side to its use on the business intelligence side.

This would include establishing a data quality management team whose responsibility it is to monitor the quality of the company’s strategically significant and important data. It would also include changes to development methods, such as testing guidelines for the various categories of data, asking downstream users to participate in operational requirements definition sessions, and setting data standards from naming to definitions to meta data. It would even include changes to current job descriptions and employee incentives, from data creation on the business side to data distribution on the IT side, so that data quality becomes part of performance appraisal guidelines. In other words, writing data exceptions into a table is barely the beginning of raising a company’s (or maybe only a business unit’s) consciousness of data quality. It is not a long-term solution to a problem that could severely impact a company’s competitive edge in the best case and put it out of business in the worst case, as it did some automobile makers a few decades ago.
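The short-term suspense table Moss describes, with her manually updated resolution-tracking columns, could look something like the following. This is a sketch only; the table and column names are assumptions, and SQLite is used here purely so the DDL is runnable.

```python
import sqlite3

# Hypothetical layout for the exception/suspense table: one row per rejected
# data item, plus columns filled in manually as analysts resolve each item.
DDL = """
CREATE TABLE data_exception (
    load_date        TEXT NOT NULL,
    source_table     TEXT NOT NULL,   -- source file (table) name
    source_column    TEXT NOT NULL,   -- source field (column) name
    target_table     TEXT NOT NULL,
    target_column    TEXT NOT NULL,
    rejected_value   TEXT,            -- value of the rejected data item
    assignment_date  TEXT,            -- resolution tracking, updated manually
    assigned_to      TEXT,
    resolution_date  TEXT,
    resolution_note  TEXT,
    reprocess_date   TEXT
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
conn.execute(
    "INSERT INTO data_exception (load_date, source_table, source_column, "
    "target_table, target_column, rejected_value) VALUES (?, ?, ?, ?, ?, ?)",
    ("2005-01-15", "cust_stage", "birth_dt",
     "dim_customer", "birth_date", "13/45/1999"),
)
```

Keeping the rejected value as text, whatever its source type, lets one table serve every target column.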

Ross Armstrong’s Answer: The answer may not be as straightforward as just putting the data in an error table. Your strategy should be to put as much data into the warehouse as possible without manual intervention. Using default values, minimum or maximum values, eye catchers, etc., can accomplish this. As long as the data isn’t a dimensional key or a fact, this can save a lot of effort correcting data that may be infrequently analyzed. I also design an exception table which contains the suspect data, the source it came from, the ETL process that detected it, the error reason code and a timestamp.
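Armstrong’s approach, substituting a default so the row still loads while logging the suspect value with its provenance, might be sketched like this. The field names, default values and reason code are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical defaults for non-key, non-fact columns.
DEFAULTS = {"discount_pct": 0.0}

def load_with_default(row, field, source, process, exceptions):
    """Replace an unparseable numeric field with a default, logging the
    suspect value, its source, the detecting ETL process, a reason code
    and a timestamp, as described in the answer above."""
    try:
        row[field] = float(row[field])
    except (TypeError, ValueError):
        exceptions.append({
            "suspect_value": row[field],   # logged before it is overwritten
            "source": source,
            "etl_process": process,
            "error_reason": "NOT_NUMERIC",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        row[field] = DEFAULTS[field]       # row still loads, no manual step
    return row

exc = []
row = load_with_default({"order_id": 1, "discount_pct": "n/a"},
                        "discount_pct", "orders.csv", "load_orders", exc)
```

The row reaches the warehouse with `discount_pct` defaulted to 0.0, and the exception record preserves everything needed to correct it later if anyone ever analyzes that column.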



SourceMedia (c) 2005 DM Review and SourceMedia, Inc. All rights reserved.