Portals eNewsletters Web Seminars dataWarehouse.com DM Review Magazine
DM Review | Covering Business Intelligence, Integration & Analytics
   Covering Business Intelligence, Integration & Analytics Advanced Search

View all Portals

Scheduled Events

White Paper Library
Research Papers

View Job Listings
Post a job


DM Review Home
Current Magazine Issue
Magazine Archives
Online Columnists
Ask the Experts
Industry News
Search DM Review

Buyer's Guide
Industry Events Calendar
Monthly Product Guides
Software Demo Lab
Vendor Listings

About Us
Press Releases
Advertising/Media Kit
Magazine Subscriptions
Editorial Calendar
Contact Us
Customer Service

When we are cleaning data from a relational source, and some values

    Ask The Experts published in DMReview.com
March 20, 2001
  By Sid Adelman and David Marco and Larissa Moss


When we are cleaning data from a relational source, and some values cannot be cleaned. Should it be put in an exception/error table? What do you suggest the structure of this table be?


Sid Adelman’s Answer: You have a number of choices when you find data that can’t be cleaned and, depending on the importance of the data and the need for completeness, you may decide to throw the row away completely, to put in a value indicating that it is invalid, insert a null value or, as you suggest, put in an exception/error table. Your decision will depend upon what you expect the users to need (do they need all the rows?), their ability to understand missing, flagged and null values. If you do put the unclean data in an exception table, what are you expecting the users to do with it? What purpose will this table serve? (It would be very difficult for the users to work with this exception table.)

David Marco’s Answer: The answer to your question is maybe. Sometimes I load these values into an error table, other times they are loaded into the data warehouse. Which you do is up to the business users assessing the system. As far as the table layout, I recommend going to the Real-World Decision Support (RWDS) newsletter http://www.EWSolutions.com/Newsletter.asp and looking up Mike Jenning’s excellent two-part article on operational meta data. In this article he provides a good beginning layout. Look at the first two issues of RWDS in the archives.

Larissa Moss’s Answer: Dirty data, regardless from what source, is a time bomb waiting to go off for most companies who are moving toward CRM, BI, B2B and B2C. The quality of information used and sold as a product can make or brake a company. My suggestions go beyond the question of how to dispense or suspense these data exceptions. Certainly, as a short-term solution, the ETL program/tool can write the "error report" into a table and populate such columns as: Load Date, Source file (table) name, source field (column) name, target table name, target column name, value of rejected data item. This table can then be used to assign each rejected data item to an analyst for resolution and the assignment date, assigned-to name, resolution date, resolution comment, and even reprocess date can be added to this table and updated manually. But for the long term, the company will need policies governing its data. This policy must include identification of its critical data, its important but not critical data, its less important data, and its not important data, as seen through the eyes of business executives who make strategic decisions based on this data, or as seen through the eyes of business operations units that sell this data as a product. This policy must also include guidelines for acceptability of data exceptions for each of those categories. In some cases 30 % "dirty data" may be acceptable, in other cases 5% is too high. This policy must also include procedures for dispensing or suspending data depending on its significance and its exception threshold. Writing exceptions out to a table may be acceptable for one situation but not another. This policy has to be followed with procedures to minimize the proliferation of bad data from its inception on the business operations side to the its use on the business intelligence side. This would include establishing a data quality management team whose responsibility it is to monitor the quality of strategically significant and important data of the company. It would also include changes to development methods, such as including certain testing guidelines for the various categories of data, asking down-stream users to participate in operational requirements definition sessions, setting data standards from naming to definitions to meta data, and so on. It would even include changes to current job-descriptions and incentives paid to employees, from data creation on the business side to data distribution on the IT side, to include data quality in the performance appraisal guidelines. In other words, writing data exceptions into a table is barely the beginning of raising a company’s or maybe even only a business unit’s consciousness for data quality. But it is not a long-term solution to a problem that could severely impact a company’s competitive edge in the best case, and put it out of business in the worst case, as it did some automobile makers a few decades ago.

Ross Armstrong’s Answer: The answer may not as straight forward as just putting the data in an error table. Your strategy should be to put as much data into the warehouse as possible without manual intervention. Using default values, minimum or maximum values, eye catchers, etc can accomplish this. As long as the data isn’t dimensional keys or facts this can save a lot of effort correcting data that may be infrequently analyzed. I also design an exception table which contains the suspected data, the source it came from, the ETL process that detected it and the error reason code and a timestamp.


For more information on related topics visit the following related portals...
Data Quality.

Sid Adelman is a principal in Sid Adelman & Associates, an organization specializing in planning and implementing data warehouses, in data warehouse and BI assessments, and in establishing effective data architectures and strategies. He is a regular speaker at DW conferences. Adelman chairs the "Ask the Experts" column on www.dmreview.com. He is a frequent contributor to journals that focus on data warehousing. He co-authored Data Warehouse Project Management and is the principal author on Impossible Data Warehouse Situations with Solutions from the Experts. His new book, Data Strategy, is scheduled for publication this year. He can be reached at 818-783-9634 or sidadelman@aol.com.  Visit his Web site at www.sidadelman.com.

David Marco is an internationally recognized expert in the fields of enterprise architecture, data warehousing and business intelligence and is the world's foremost authority on meta data. He is the author of Universal Meta Data Models (Wiley, 2004) and Building and Managing the Meta Data Repository: A Full Life-Cycle Guide (Wiley, 2000). Marco has taught at the University of Chicago and DePaul University, and in 2004 he was selected to the prestigious Crain's Chicago Business "Top 40 Under 40."  He is the founder and president of Enterprise Warehousing Solutions, Inc., a GSA schedule and Chicago-headquartered strategic partner and systems integrator dedicated to providing companies and large government agencies with best-in-class business intelligence solutions using data warehousing and meta data repository technologies. He may be reached at (866) EWS-1100 or via e-mail at DMarco@EWSolutions.com.

Larissa Moss is founder and president of Method Focus Inc., a company specializing in improving the quality of business information systems. She has more than 20 years of IT experience with information asset management. Moss is coauthor of three books: Data Warehouse Project Management (Addison-Wesley, 2000), Impossible Data Warehouse Situations (Addison-Wesley, 2002) and Business Intelligence Roadmap: The Complete Project Lifecycle for Decision- Support Applications (Addison-Wesley, 2003). Moss can be reached at methodfocus@earthlink.net.

Solutions Marketplace
Provided by IndustryBrains

Data Validation Tools: FREE Trial
Protect against fraud, waste and excess marketing costs by cleaning your customer database of inaccurate, incomplete or undeliverable addresses. Add on phone check, name parsing and geo-coding as needed. FREE trial of Data Quality dev tools here.

Recover SQL Server or Exchange in minutes
FREE WHITE PAPER. Recover SQL Server, Exchange or NTFS data within minutes with TimeSpring?s continuous data protection (CDP) software. No protection gaps, no scheduling requirements, no backup related slowdowns and no backup windows to manage.

Manage Data Center from Virtually Anywhere!
Learn how SecureLinx remote IT management products can quickly and easily give you the ability to securely manage data center equipment (servers, switches, routers, telecom equipment) from anywhere, at any time... even if the network is down.

Design Databases with ER/Studio: Free Trial
ER/Studio delivers next-generation data modeling. Multiple, distinct physical models based on a single logical model give you the tools you need to manage complex database environments and critical metadata in an intuitive user interface.

Free EII Buyer's Guide
Understand EII - Trends. Tech. Apps. Calculate ROI. Download Now.

Click here to advertise in this space

E-mail This Ask The Experts E-Mail This Ask The Experts
Printer Friendly Version Printer-Friendly Version
Related Content Related Content
Request Reprints Request Reprints
Site Map Terms of Use Privacy Policy
SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.
SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.