DM Review
Published in DM Review Online in March 2001.


When we are cleaning data from a relational source and some values cannot be cleaned, should they be put in an exception/error table?

by Sid Adelman, David Marco and Larissa Moss

Summary: When we are cleaning data from a relational source, some values cannot be cleaned. Should they be put in an exception/error table? What do you suggest the structure of this table be?

Q:  

When we are cleaning data from a relational source, some values cannot be cleaned. Should they be put in an exception/error table? What do you suggest the structure of this table be?

A:  

Sid Adelman’s Answer: You have a number of choices when you find data that can’t be cleaned. Depending on the importance of the data and the need for completeness, you may decide to throw the row away completely, to insert a value indicating that it is invalid, to insert a null value or, as you suggest, to put the row in an exception/error table. Your decision will depend on what you expect the users to need (do they need all the rows?) and on their ability to understand missing, flagged and null values. If you do put the unclean data in an exception table, what are you expecting the users to do with it? What purpose will this table serve? (It would be very difficult for the users to work with this exception table.)
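As a rough illustration of the choices Sid lists, the sketch below assumes a simple Python cleansing step; the policy names, the "amount" column and the validation rule are hypothetical, not part of his answer.

```python
# Minimal sketch of the dispositions described above, assuming a simple
# Python ETL step. Rule, column and table names are illustrative only.
INVALID_FLAG = "INVALID"

def disposition(row, is_valid, policy):
    """Return (cleansed_row, exception_row) according to the chosen policy."""
    if is_valid(row):
        return row, None
    if policy == "drop":                 # throw the row away completely
        return None, None
    if policy == "flag":                 # keep the row, mark the bad value
        return dict(row, amount=INVALID_FLAG), None
    if policy == "null":                 # keep the row, blank the bad value
        return dict(row, amount=None), None
    # default: route the offending row to an exception/error table
    return None, row

# Example: a row whose numeric column failed validation
row = {"customer_id": 42, "amount": "12,34X"}
clean, exception = disposition(row, lambda r: str(r["amount"]).isdigit(), "exception")
```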

David Marco’s Answer: The answer to your question is maybe. Sometimes I load these values into an error table; other times they are loaded into the data warehouse. Which approach you take is up to the business users assessing the system. As for the table layout, I recommend going to the Real-World Decision Support (RWDS) newsletter at http://www.EWSolutions.com/Newsletter.asp and looking up Mike Jennings’ excellent two-part article on operational meta data, in which he provides a good beginning layout. Look at the first two issues of RWDS in the archives.

Larissa Moss’s Answer: Dirty data, regardless of its source, is a time bomb waiting to go off for most companies that are moving toward CRM, BI, B2B and B2C. The quality of information used and sold as a product can make or break a company. My suggestions go beyond the question of how to dispense or suspend these data exceptions.

Certainly, as a short-term solution, the ETL program/tool can write the "error report" into a table and populate such columns as: load date, source file (table) name, source field (column) name, target table name, target column name and value of the rejected data item. This table can then be used to assign each rejected data item to an analyst for resolution, and the assignment date, assigned-to name, resolution date, resolution comment and even reprocess date can be added to this table and updated manually.

But for the long term, the company will need policies governing its data. This policy must include identification of its critical data, its important but not critical data, its less important data and its unimportant data, as seen through the eyes of the business executives who make strategic decisions based on this data, or as seen through the eyes of the business operations units that sell this data as a product. The policy must also include guidelines for the acceptability of data exceptions in each of those categories. In some cases 30 percent "dirty data" may be acceptable; in other cases 5 percent is too high. The policy must also include procedures for dispensing or suspending data depending on its significance and its exception threshold. Writing exceptions out to a table may be acceptable in one situation but not in another.

This policy has to be followed by procedures to minimize the proliferation of bad data from its inception on the business operations side to its use on the business intelligence side. This would include establishing a data quality management team whose responsibility it is to monitor the quality of the company's strategically significant and important data. It would also include changes to development methods, such as adopting testing guidelines for the various categories of data, asking downstream users to participate in operational requirements definition sessions, setting data standards from naming to definitions to meta data, and so on. It would even include changes to current job descriptions and to the incentives paid to employees, from data creation on the business side to data distribution on the IT side, so that data quality becomes part of the performance appraisal guidelines.

In other words, writing data exceptions into a table is barely the beginning of raising a company’s (or maybe even only a business unit’s) consciousness of data quality. It is not a long-term solution to a problem that could severely impact a company’s competitive edge in the best case and put it out of business in the worst case, as it did some automobile makers a few decades ago.
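Larissa's short-term suggestion maps directly onto a table definition. The following is a minimal sketch using Python's built-in sqlite3 module; the database, table and column names and the sample insert are illustrative, not a layout she prescribes.

```python
# Sketch of the short-term exception table described above, using SQLite.
import sqlite3

conn = sqlite3.connect("dw_exceptions.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS data_exception (
        load_date          TEXT,   -- when the ETL run rejected the value
        source_table_name  TEXT,
        source_column_name TEXT,
        target_table_name  TEXT,
        target_column_name TEXT,
        rejected_value     TEXT,   -- value of the rejected data item
        -- columns maintained manually as each exception is worked:
        assignment_date    TEXT,
        assigned_to        TEXT,
        resolution_date    TEXT,
        resolution_comment TEXT,
        reprocess_date     TEXT
    )
""")
conn.execute(
    "INSERT INTO data_exception (load_date, source_table_name, source_column_name, "
    "target_table_name, target_column_name, rejected_value) VALUES (?, ?, ?, ?, ?, ?)",
    ("2001-03-01", "CUST_MASTER", "BIRTH_DT", "DIM_CUSTOMER", "BIRTH_DATE", "99/99/9999"),
)
conn.commit()
```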

Ross Armstrong’s Answer: The answer may not be as straightforward as just putting the data in an error table. Your strategy should be to put as much data into the warehouse as possible without manual intervention. Using default values, minimum or maximum values, eye-catchers, etc., can accomplish this. As long as the data isn’t a dimensional key or a fact, this can save a lot of effort correcting data that may be analyzed only infrequently. I also design an exception table that contains the suspect data, the source it came from, the ETL process that detected it, an error reason code and a timestamp.
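A small sketch of Ross's two ideas, assuming a Python ETL step: substitute a default for a bad non-key, non-fact value so the row still loads, and record the suspect value, its source, the detecting process, a reason code and a timestamp. All names and defaults below are assumptions for illustration.

```python
from datetime import datetime, timezone

DEFAULTS = {"region_code": "UNK", "discount_pct": 0.0}   # illustrative defaults

def load_with_defaults(row, bad_columns, source, etl_process, exceptions):
    """Replace bad non-key, non-fact values with defaults and log each one."""
    loaded = dict(row)
    for col in bad_columns:
        exceptions.append({
            "suspect_value": row[col],          # the suspect data
            "source": source,                   # where it came from
            "etl_process": etl_process,         # the process that detected it
            "error_reason_code": "E_FORMAT",    # illustrative reason code
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        loaded[col] = DEFAULTS.get(col)         # default / eye-catcher value
    return loaded

exceptions = []
row = load_with_defaults(
    {"customer_id": 7, "region_code": "??"}, ["region_code"],
    "CRM_EXTRACT", "stg_customer_load", exceptions,
)
```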


Sid Adelman is a principal in Sid Adelman & Associates, an organization specializing in planning and implementing data warehouses and in establishing effective data architectures and strategies. He jointly developed a methodology, MapXpert for Data Warehouse, that provides a master plan for implementing a data warehouse. Adelman is a regular speaker at The Data Warehouse Institute and IBM's DB2 and Data Warehouse Conference. He chairs the "Ask the Experts" column on DMReview.com and is a founding member of the Business Intelligence Alliance. Adelman is a frequent contributor to journals that focus on the data warehouse. He coauthored Data Warehouse Project Management with Larissa Moss and Impossible Data Warehouse Situations: Solutions from the Experts. He can be reached at (818) 783-9634 or sidadelman@aol.com. Visit his Web site at www.sidadelman.com.

David Marco is an internationally recognized expert in the fields of enterprise architecture, data warehousing and business intelligence and is the world's foremost authority on meta data. He is the author of Universal Meta Data Models (Wiley, 2004) and Building and Managing the Meta Data Repository: A Full Life-Cycle Guide (Wiley, 2000). Marco has taught at the University of Chicago and DePaul University, and in 2004 he was selected to the prestigious Crain's Chicago Business "Top 40 Under 40."  He is the founder and president of Enterprise Warehousing Solutions, Inc., a GSA schedule and Chicago-headquartered strategic partner and systems integrator dedicated to providing companies and large government agencies with best-in-class business intelligence solutions using data warehousing and meta data repository technologies. He may be reached at (866) EWS-1100 or via e-mail at DMarco@EWSolutions.com.

Larissa Moss is founder and president of Method Focus Inc., a company specializing in improving the quality of business information systems. She has more than 20 years of IT experience with information asset management. Moss is coauthor of three books: Data Warehouse Project Management (Addison-Wesley, 2000), Impossible Data Warehouse Situations (Addison-Wesley, 2002) and Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications (Addison-Wesley, 2003). Moss can be reached at methodfocus@earthlink.net.

Copyright 2005, SourceMedia and DM Review.