Data Quality: Horizontal or Vertical?

Article published in DM Review Magazine
August 2006 Issue
By Frank Dravis

When IT professionals first embark on a data quality initiative, a common question they ask is, "What industries or market verticals have the best data quality?" Underlying that question is the desire and need to establish a data quality benchmark. All firms want to know how they are doing relative to their peers, and most data managers want to hear encouraging words that their industry is achieving quality goals more easily than others.

The desire for data quality benchmarks is even more acute because of the dearth of data quality success stories published in the media. There are two reasons for this palpable absence. First, organizations do not like to air their dirty laundry. Publicizing a data quality success story in many ways is good public relations, but for some markets, such as health care and financial services, touting improved quality is implicitly admitting you had a previous problem, and no one wants to think their hospital or bank had problems with their patient or investor data. Second, no firm wants to give up a hard-earned competitive advantage. I've worked with a number of clients who told me point blank I could not publicize their successes because it would educate their competition about how they gained advantage.

There is another issue lurking behind the question, "What industries or market verticals are doing the best with data quality?" The questioner is looking for confirmation that his or her company is in a market vertical fertile for the adoption of data quality practices. Fortunately, data quality practices, methods, processes and technology are generic. They span industries and markets, and are equally applicable. I know this will be disappointing for some people to hear because we all want to think we are special and the industry we work in is unique and requires custom solutions. The fact is, data quality processes and practices perform equally well for one industry as for another.

That is what enterprise information management (EIM) is all about - enabling an organization to create a comprehensive strategy to ensure it is using trustworthy information. Data quality is a critical part of EIM, and like EIM, data quality is not just a technology. A successful data quality initiative is 80 percent people and process. A firm can create unique data categories, types and objects that may temporarily defy existing cleansing technologies, but data quality is mostly about people and process. Even a unique, proprietary product SKU code is a candidate for cleansing: the code follows specified patterns, those patterns can be represented by business rules, and the rules can be loaded into parsing, standardization and correction routines in the form of program parameters.

Why are data quality practices applicable horizontally, across all markets? In the simplest terms, data is facts about things. Data is made for human consumption, and we humans like our data served up in the same general way, regardless of industry or market. Whether it is financial, telemetry, environmental, product or customer data, we want it broken down to its discrete components, fielded out and grouped into records that create a full picture of all the available business information at hand. We want similar records grouped into tables, and related records and events linked to the records in question; for example, a customer address record is linked to the customer's credit history and also to their purchase history. Indeed, the ultimate purpose behind mining unstructured data is to move the important facts into a structured environment.

The practice of ensuring data accuracy applies to all industries and markets equally. Every firm needs data to manage its operations. Data, and hence its quality, is foundational to every industry. A false perception exists that gaining a competitive advantage depends on specialized treatment of the data. We can dissolve this misperception by simply exposing the standard process everyone uses to ensure high-quality data. The first step is measuring and analyzing your data: what are the defects, and what caused them? Data profiling solutions are designed for quantifying the defects and providing metadata to help analyze their cause.
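A minimal profiling pass can be sketched in a few lines of Python. The column values and the pattern scheme (digits mapped to 9, letters to A) are illustrative assumptions, not the output of any particular profiling product:

```python
import re
from collections import Counter

def profile_column(values):
    """Summarize a column: null rate plus value-pattern frequencies.

    Mapping digits to '9' and letters to 'A' is a common profiling
    technique for surfacing inconsistent formats in a field.
    """
    nulls = sum(1 for v in values if v is None or str(v).strip() == "")
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v)))
        for v in values
        if v not in (None, "")
    )
    return {"null_rate": nulls / len(values), "patterns": patterns}

# Hypothetical product numbers; the odd third value and the null are
# exactly the kind of defects a profiling step should quantify.
report = profile_column(["P-1001", "P-1002", "1003-P", None])
```

Here `report["patterns"]` would show two records matching the dominant `A-9999` pattern and one outlier, pointing the analyst at the records that need a closer look.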

Second comes the process of parsing the data into its individual components, as in Figure 1. Here the client's data (two different records with just Product No. and Description fields) started out bundled into, in the case of the first record, one unformatted contiguous string - Bolt 2,5 x 20 mm Coated Zn. Because we can understand it mentally, we can define a set of rules to load into a data quality package that will programmatically parse the data into its requisite components, in this case product, dimension, type and compound. Once the data is componentized, it can be standardized, which is the third step, such as converting m.m. to MM. The fourth step is to ensure the data is accurate, in other words, correct. To do this you need some form of trusted data source to compare the records against; in the case of master data management (MDM), that would be a master parts or product list. In our example in Figure 1, the elements found in the type column were compared to the truth data and stainl was corrected to stainless. The same thing could have been done if a wrong value had been found in the dimension column.

Figure 1: Parsing Data into Components
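The parse, standardize and correct steps (two through four) can be sketched as follows, using the Figure 1 example. The regular expression and the truth-data table are assumptions for illustration; in practice the business rules would be loaded into the data quality package as parameters:

```python
import re

# Hypothetical rules modeled on Figure 1's records: a dimension is
# digits (optionally with a decimal comma), an 'x', more digits,
# then a unit written as 'mm' or 'm.m.' in any case.
DIMENSION = re.compile(r"(\d+(?:,\d+)?\s*x\s*\d+)\s*(mm|m\.m\.)", re.IGNORECASE)
TRUTH_TYPES = {"stainl": "stainless"}  # assumed truth data for correction

def parse_product(description):
    """Parse a contiguous product string into components (step 2),
    standardize the unit to MM (step 3) and correct the type field
    against truth data (step 4)."""
    tokens = description.split()
    product = tokens[0]
    m = DIMENSION.search(description)
    dimension = m.group(1) + " MM" if m else None   # standardize unit
    rest = description[m.end():].split() if m else tokens[1:]
    ptype = TRUTH_TYPES.get(rest[0].lower(), rest[0]) if rest else None
    compound = rest[1] if len(rest) > 1 else None
    return {"product": product, "dimension": dimension,
            "type": ptype, "compound": compound}

record = parse_product("Bolt 2,5 x 20 mm Coated Zn")
```

Running the first record through yields product Bolt, dimension 2,5 x 20 MM, type Coated and compound Zn; a record carrying the defective value stainl would come back with its type corrected to stainless.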

The fifth step in what we will call the data quality function framework is enhancement. Enhancing the data means adding facts or attributes that increase the value of the records for specific downstream operations. Using our example, we could append a preferred provider code for the parts from an industry association list. To this point, the steps in the framework have prepared the data for matching and consolidation. Everyone has duplicate records, and supply chain management operations are no exception.
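The enhancement step can be sketched as a simple lookup. The preferred-provider table and its codes are hypothetical stand-ins for the industry association list mentioned above:

```python
# Hypothetical preferred-provider codes from an industry association list
PREFERRED_PROVIDERS = {"bolt": "PP-104", "plate": "PP-221"}

def enhance(record):
    """Step 5: append an attribute that raises the record's value
    for downstream procurement operations."""
    code = PREFERRED_PROVIDERS.get(record["product"].lower())
    return {**record, "preferred_provider": code}

enriched = enhance({"product": "Bolt", "dimension": "2,5 x 20 MM"})
```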

The data in Figure 2 has been extracted, as part of the MDM activity, from an equipment assets database. The goal of this MDM project is to consolidate all the various equipment parts cataloged across the enterprise into one database so parts procurement can be consolidated. By increasing the accuracy and oversight of the data, the firm can decrease the number of purchases and increase their volume, thereby gaining pricing leverage with its vendors. The problem is the duplicates. To identify the duplicate records, a matching operation, the sixth step, is run. The left two columns of Figure 2 contain match codes posted by the operation. Wherever there is an identical value in the group number column, the matching operation has determined, according to the user's business rules, that those records are duplicates.

Figure 2: An Example of Duplicate Records
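A match-code operation like the one posted in Figure 2 can be sketched as follows. The business rule here, matching on product plus a normalized dimension, is an assumption; real deployments express such rules as parameters to the matching engine:

```python
import re

def match_code(record):
    """Step 6: derive a match code from the parsed components.
    Records sharing a code are treated as duplicates under the
    (assumed) rule: same product and same normalized dimension."""
    dim = re.sub(r"[\s,.]", "", record.get("dimension", "").lower())
    return f"{record['product'].lower()}|{dim}"

# Two differently formatted records that describe the same part
a = {"product": "Bolt", "dimension": "2,5 x 20 MM"}
b = {"product": "bolt", "dimension": "2.5x20 mm"}
```

Because casing, spacing and punctuation are stripped before comparison, records a and b receive the same match code and land in the same duplicate group.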

Using the match codes, the seventh operation in the data quality framework - consolidation - can either eliminate the dupes or consolidate them into best-of records, as seen in Figure 3.

Figure 3: Best-Of Records

Here the consolidation function was programmed to ignore extraneous data, such as the usage of the steel plates, and to eliminate those extraneous records. Whether the data is supply chain information, addresses, personal names, diagnosis codes or equity descriptions, the process is the same. Business rules can be defined to identify any element regardless of industry, and even special truth data unique to a market can be created to support corrections.
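The consolidation step can be sketched as follows. The survivorship rule used here, keeping the most complete (longest) value per field within each match group, is one assumed rule for illustration; real consolidation functions let the user choose among several such rules:

```python
from collections import defaultdict

def consolidate(records):
    """Step 7: collapse records sharing a match group number into
    one best-of record, keeping the most complete value per field."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["group"]].append(rec)
    survivors = []
    for recs in groups.values():
        best = {}
        for rec in recs:
            for field, value in rec.items():
                # Prefer the longest non-empty value as "best"
                if value and len(str(value)) > len(str(best.get(field, ""))):
                    best[field] = value
        survivors.append(best)
    return survivors

# Hypothetical duplicates already tagged with match group numbers
dupes = [
    {"group": 1, "desc": "Steel Plate", "dim": ""},
    {"group": 1, "desc": "Steel Plate 10mm", "dim": "10 x 200 MM"},
    {"group": 2, "desc": "Bolt", "dim": "2,5 x 20 MM"},
]
best_of = consolidate(dupes)
```

The three input records collapse to two survivors, with group 1's best-of record carrying the fuller description and the dimension that only one of its duplicates supplied.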

An additional proof point that data quality cuts across market verticals is the fact that so many data quality projects are driven by business intelligence (BI), customer relationship management (CRM), customer data integration (CDI) and data integration (ETL) operations, in addition to MDM. They are all deployed across industries and are not the exclusive domain of any one market.

Where verticalization (industry specialization) comes into play is in the application of standard data quality functionality against custom vertical data sets, such as ISO country codes, Department of Justice compliance lists or USPS address delivery points. For example, compliance solutions are marketed to firms seeking to identify their customers against any one of dozens of domestic and international watch lists. While the data may be unique to government agencies and needed by firms to comply with identity resolution regulations, the underlying techniques and technologies that match those vertical lists to horizontal customer files is applicable to all industries.

As I said earlier, 80 percent of a total data quality solution is people and process. However, we live in the information age, where we have megabytes, gigabytes and now terabytes of data. These data volumes defy cost-efficient manual cleansing and matching. So while technology may be only 20 percent of the solution, it is indeed a critical 20 percent. Why do I say 20 percent? Because when we consider the total effort of a complete information quality solution cycle, from the research in the awareness phase to the installation in the implementation phase, we find that the bulk of the time is invested in researching the problem, educating stakeholders, designing the solution, improving the processes, developing a strategy and planning the project. Only at the end of that lengthy process, in the implementation phase, do we finally deploy technology to manipulate our mountains of data.

At the end of the information quality solution cycle, through the use of EIM technologies (metadata management, data integration, data quality, etc.), that data is delivered to applications and business operations cleanly and efficiently, formatted and standardized as required by those operations, regardless of the market or industry vertical. Consider the example of a high-tech manufacturer that saved 50 percent of its direct marketing budget - almost $12 million - by consolidating duplicate customer records and eliminating redundant and wasted brochure mailings. I still can't identify the market or product, but their story applies to any firm that markets to customers. So the next time you hear someone ask which industry gets the most value from data quality, understand the answer: it's important to all industries. Data quality is a competitive advantage regardless of market or function.



Frank Dravis is vice president of EIM Strategies for Business Objects. Dravis has nineteen years of experience in data quality and enterprise information management (EIM) solutions design, implementation and consulting. At Business Objects he leads the EIM product marketing effort, where he researches and aids in the formulation of EIM market strategies and the planning of EIM implementations in business intelligence technologies. As a benefit of the research, Dravis delivers data quality best-practice advice and consulting to Business Objects' extensive list of industry-leading clientele. He is a former director of the International Association for Information and Data Quality and an instructor at The Data Warehousing Institute (TDWI), where he teaches data quality strategy development. His primary research of data quality issues and technology trends has made him a frequent speaker at industry events, particularly the annual MIT conferences on information quality. Dravis is a contributing author for magazines, a columnist in the IQ View newsletter and the author of a globally recognized Web log on data quality (http://weblogs.firstlogic.com). Prior to Business Objects, Dravis was the VP of Information Quality at Firstlogic, Inc., where he consulted with clients and served as a member of the executive team providing direction for the firm.

