Portals eNewsletters Web Seminars dataWarehouse.com DM Review Magazine
DM Review | Information Is Your Business
   Information Is Your Business Advanced Search

Business Intelligence
Corporate Performance Management
Data Integration
Data Quality
Data Warehousing Basics
Master Data Management
View all Portals

Scheduled Events

White Paper Library
Research Papers



DM Review Home
Current Magazine Issue
Magazine Archives
DM Review Extended Edition
Online Columnists
Ask the Experts
Industry News
Search DM Review

Tech Evaluation Center:
Evaluate IT solutions
Buyer's Guide
Industry Events Calendar
Software Demo Lab
Vendor Listings

About Us
Press Releases
Advertising/Media Kit
Magazine Subscriptions
Editorial Calendar
Contact Us
Customer Service

Intelligent Solutions:
Improving Data Quality

  Column published in DM Review Magazine
February 2007 Issue
  By Jonathan G. Geiger

Data profiling does not improve data quality directly. It provides a better understanding of the challenges you face in delivering information that meets business expectations. To improve data quality, you must act on the information.

Symptoms are the visible results provided by data profiling work. These represent the errors found in the data sources; you must decide how to deal with these.

When you encounter a data quality deficiency, there are four potential actions you can take: reject, ignore, fix or substitute.

Reject. Rejection consists of dealing with data that is so bad that you cannot use it. For example, if you encounter nonnumeric data in a field reserved for the sales amount, you cannot process the record element (or entire record). When this happens, you need to capture the errors and build a process to enable correction of the data before forwarding it to the data warehouse. Sometimes, when accuracy is more important than completeness, the decision may be made to simply move forward with other data.

Ignore. The next option is to ignore the error. In your data profiling, you may discover some data deficiencies. Your business rules, however, may be such that these deficiencies do not materially impact your ability to make strategic decisions with the data warehouse. In this case, you may simply capture the data, knowing that it is not perfect. You should also provide information in the metadata to ensure that business users understand the level of quality in the data.

Fix. Fixing the data is sometimes an option. It is possible that data elsewhere in the environment can be used to derive the correct value. For example, if you have a ZIP code and the state is invalid, you could use the ZIP code to derive the correct state.

Substitute. Substitution is similar to fixing in that you are inserting a value that enables you to move forward. In the case of substitution, however, you may not know the correct value and may be able to substitute a value of "unknown" to have a value in each field. When a substitution is performed, the downstream impact needs to be considered so that the business community understands that "unknown" is a valid value.

When you fix the data or substitute a value, you have two additional options. First, you may choose to record both the original value and the adjusted value in the data warehouse. This is sometimes needed to provide an audit trail and to provide information in defense of a business decision made with the incorrect data. Second, you may be able to generate a transaction that will update information in the source data and correct it. While that may appear to solve the problem, it really doesn't - you are still dealing with the symptom. You've simply built a process to ensure that the error doesn't impact the data warehouse. To truly solve the problem, you must understand the root causes so that you can prevent the symptom from occurring in the future.

The root cause is uncovered by repeatedly asking why the problem occurs. With a flat tire, when you initially ask why, you may learn that the tire had a bald spot. Asking why again may reveal that the bald spot occurred because the car was out of alignment. Further investigation may reveal that the driver was not aware of how to maintain tires, an activity that includes having proper air pressure and periodically performing visual inspections. Once you identify this, then you can take action (learn how to maintain tires) that will prevent a similar problem from occurring in the future.

Root cause analysis is often done during data profiling. When you uncover significant data problems, you need to perform additional analysis to understand the causes. Instinctively, you may conclude that the source program is deficient. For example, maybe the data validation is not thorough or there are some program coding errors. When you've come to a conclusion, you should document the problem and ensure that a change request is submitted and appropriately prioritized to fix the source system.

Very often, the problem is external to the source system. For example, you may discover that 20 percent of your population was born on the same date. In this case, the source system checks each date individually and lets the data through because the date is valid. It cannot detect that when the date is missing or unclear, a data entry person inserts any date (e.g., 11/11/11) into the field because it's a mandatory field. Similarly, when you make a sale, the salesperson provides information that may be inaccurate (but good enough to pass reasonable validation steps) about a customer. In these cases, your business rules are such that people don't have the training or incentives to capture correct data. Fixing the source system won't solve these problems. If the root cause is in the business process, the corrections must be implemented there.

Data profiling makes data quality issues visible. When it comes to loading the data warehouse, you have two challenges. First, you must find a way to load the required data by taking one of four actions - reject the data (if accuracy is more critical than completeness), accept the data (if it is within your tolerance levels), fix the data (if you have information to do so) or substitute a value (when that would meet your business needs).

The second challenge is the one with the greater payback;  understand why the problems occurred by performing analysis so that you can take actions that substantially address the root causes.


For more information on related topics visit the following related portals...
Data Quality.

Jonathan G. Geiger is executive vice president at Intelligent Solutions, Inc.  Geiger has been involved in many corporate information factory and customer relationship management projects within the utility, telecommunications, manufacturing, education, chemical, financial and retail industries.  In his 30 years as a practitioner and consultant, Geiger has managed or performed work in virtually every aspect of information management.  He has authored or co-authored numerous articles and three books, presents frequently at national and international conferences, and teaches several public seminars.  Geiger may be reached at JGeiger@IntelSols.com.

Solutions Marketplace
Provided by IndustryBrains

Free DB Modeling Trial with ER/Studio
Design and Build More Powerful Databases with ER/Studio.

Recover SQL Server or Exchange in minutes
FREE WHITE PAPER. Recover SQL Server, Exchange or NTFS data within minutes with TimeSpring?s continuous data protection (CDP) software. No protection gaps, no scheduling requirements, no backup related slowdowns and no backup windows to manage.

DeZign for Databases - Database Design Made Easy
Create, design & reverse engineer databases with DeZign for Databases, a database design tool for developers and DBA's with support for Oracle, MySQL, MS SQL, MS Access, DB2, PostgreSQL, InterBase, Firebird, NexusDB, dBase and Pervasive.

Free Webinar: The IT infrastructure of the future
Learn about how the HP BladeSystem is revolutionizing the enterprise. Sign up and download here.

Data Mining Courses: Strategy, Methods & Apps
Learn how experts build and deploy predictive models by attending The Modeling Agency's vendor-neutral courses. Leverage valuable information hidden within your data through predictive analytics. Click through to view upcoming events.

Click here to advertise in this space

View Full Issue View Full Magazine Issue
E-mail This Column E-Mail This Column
Printer Friendly Version Printer-Friendly Version
Related Content Related Content
Request Reprints Request Reprints
Site Map Terms of Use Privacy Policy
SourceMedia (c) 2007 DM Review and SourceMedia, Inc. All rights reserved.
SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.