Portals eNewsletters Web Seminars dataWarehouse.com DM Review Magazine
DM Review | Covering Business Intelligence, Integration & Analytics
   Covering Business Intelligence, Integration & Analytics Advanced Search

View all Portals

Scheduled Events

White Paper Library
Research Papers

View Job Listings
Post a job


DM Review Home
Current Magazine Issue
Magazine Archives
Online Columnists
Ask the Experts
Industry News
Search DM Review

Buyer's Guide
Industry Events Calendar
Monthly Product Guides
Software Demo Lab
Vendor Listings

About Us
Press Releases
Advertising/Media Kit
Magazine Subscriptions
Editorial Calendar
Contact Us
Customer Service

Plain English About Information Quality:
Data Format Standards as an IQ Tool

  Column published in DM Review Magazine
May 2003 Issue
  By Larry English

Thanks to a reader who inspired this month's column by asking about the meaning of the terms "standardization" and "remediation" as applied to the work of creating data format standards for the data element "Organization Name."

The IQ Problem

Two problems exist with an attribute such as Organization Name. (While this column discusses Organization Name, the principles also apply to other formatted text fields such as Product Name, Asset Name, Facility Name and other structured text attributes.)

  1. As a single-value attribute, it does not reflect real-world objects. Organizations in real life tend to have more than one name. Most data models and database designs fail to properly model this. Because an enterprise comes to know an organization through any of its name forms and needs to know multiple names (such as legal name, doing business as name, common name or abbreviated name), the data model and database must support this. Otherwise, you get the problems of enterprises having 87 different representations of the name of the company popularly known as DuPont, legally, EI du Pont de Nemours and Company, with names miscaptured as Dupont, Du Pont, E. I. Du Pont, etc.1 My company, Information Impact International, Inc., is known as Information Impact International and Information Impact, with a short name of InfoImpact from the Web site name - infoimpact.com. An assessment of my mail over a three-month period a few years ago found forty different name spellings and formats.
  2. Without clear data format standards for capturing organization names (or product names), information producers have no guidance for consistency of names across the various organization (or product) records. Without this guidance, duplicate records are easily created for the same real-world organization, fragmenting an enterprise's view of its customers or business partners.

The problems this creates are horrific. Databases with 20 to 40 percent duplication are not unheard of. The real problem is the organization's inability to understand its customers or partners, the process failure caused and the complexity created by having multiple records that represent a single real-world object.

The problems of matching and consolidating records across multiple source files are exceedingly complex, compounded when source files also have different lengths and formats for the data.

The reader's question revolved around when creating a common name value for the different occurrences really "standardization" and when is it "remediation." To answer this, we must define the following terms:

  • Standardization: From a data perspective, data standardization means data has a common name, definition and value set or format. From a process perspective, standardization means "documentation of the best way to do the job."2 In Kaizen, a quality management method of continuous process improvement, standardization is one of the three foundations of process improvement. Information quality requires both data and process standardization.
  • Remediation: In general, remediation means "the act or process of correcting a fault or deficiency."3 In quality management, this is "scrap and rework." In information quality, remediation is "information scrap and rework." The data has been created and maintained in a way that it is defective and must be corrected, so that the knowledge workers who require it can use it to perform their work properly.

Standardization means that we define the data and process in a way to create data that minimizes subsequent scrap and rework. To accomplish this, we may establish "standards" in one area of control, such as a data warehouse, and then improve the upstream processes to comply with the new standards.

To qualify as true data standards, they must apply across the entire enterprise or scope of the data impact. If each business area has different "standards" for creating organization name, does the enterprise have data standards? Let's look at a business analogy. If every business area created its own standard budget accounts for developing and reporting on its budget expenditures, does the enterprise have a standard general ledger chart of accounts? Of course not. Is there an enterprise standard format for defining new budget codes? You bet. Do business area managers follow that when creating codes for new accounts? Absolutely.

The same is true for data standards. Data "standards" are standards if they apply to the full scope of the organization affected by that information type, whether it is organization, product or asset.

Root-Cause Analysis

The root cause of inconsistent organization-name formats is twofold: 1) failure to recognize that organization name is not a single-value fact in order to model and design the database properly, and 2) failure to standardize the create processes and provide standards and guidelines, including training for correct data creation.

Recommended Improvements

  • Always model data based on the characteristics of the real-world objects and events about which the enterprise must know.
  • Do not "under" model (i.e., create one attribute that may have multiple values, such as name). This forces information producers to determine which name they will capture, with inconsistency of different - yet correct - name values for the same object.
  • Assure the data model contains the relationships that support the relationships of the real-world objects and events. The real relationship between organization and organization name is one (organization) to many (names), based on name type (legal, known-as name, doing-business-as-name, abbreviated name or nickname).
  • Develop format standards for text and other structured data element values, such as product name, that provide guidelines to information producers on how to create consistency in data name values.
  • Create a validation table to map multiple synonyms of names to a standard name or to map different name values to the one occurrence of the organization.
  • Access the validation table in the source application as one of the edit tests to identify potential duplicate records to prevent creating duplicate occurrences.
  • Provide phonetic edit and validation tests, such as Soundex, to match common phonetic spelling variations of names.

Questions for Improvements

Some questions to ask:

  • Is there a data standards steering or governance group that operates at an enterprise level?
  • If so, does this standards group include representatives of business areas whose data is affected by the standards?

If the answer to both of these questions is yes, work with this group to create or revise data standards and guidelines, including format structure.

If the answer to either of these questions is no, then create a data standards team to create universal and global data standards and guidelines.

Once data standards have been defined and adopted by representatives of the information stakeholders, document them in an enterprise repository or data dictionary that is accessible to all knowledge-workers, and train information producers and knowledge-workers on the data standards and guidelines.

If legacy processes and applications cannot support the new standards, analyze how to capture data close to the standardized data; and map that data value to the standard data value in migrating data for downstream standardized databases.

When data standards are in place, prioritize any data correction to existing data based upon costs of the nonstandard data (cost of process failure, inability to reconcile data, rework activities required and miscommunication, etc.), the cost of standardizing it and the resources available to convert it.

Data standards and guidelines for creating data values are part of the "information product specification" along with the definition and business rules. They are an important part of data definition for information quality improvement.

What do you think?

1. DuPont is a registered trademark of EI du Pont de Nemours and Company.
2. Imai, Masaaki. Gemba Kaizen. New York: McGraw-Hill, 1997, p. xxviii.
3. American Heritage Dictionary.


For more information on related topics visit the following related portals...
Data Quality.

Larry P. English is president and principal of INFORMATION IMPACT International, Inc., Brentwood, Tennessee, and the author of the widely acclaimed book, Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. English is cofounder of the International Association for Information and Data Quality (www.iaidq.org). English is an internationally recognized speaker, teacher, consultant and author and may be reached at larry.english@infoimpact.com or through his Web site at www.infoimpact.com. For more on how to improve your IQ principles and techniques, and prevent your organization from wasting millions in information scrap and rework, join the IAIDQ (visit www.iaidq.org).

Solutions Marketplace
Provided by IndustryBrains

Data Validation Tools: FREE Trial
Protect against fraud, waste and excess marketing costs by cleaning your customer database of inaccurate, incomplete or undeliverable addresses. Add on phone check, name parsing and geo-coding as needed. FREE trial of Data Quality dev tools here.

Speed Databases 2500% - World's Fastest Storage
Faster databases support more concurrent users and handle more simultaneous transactions. Register for FREE whitepaper, Increase Application Performance With Solid State Disk. Texas Memory Systems - makers of the World's Fastest Storage

Manage Data Center from Virtually Anywhere!
Learn how SecureLinx remote IT management products can quickly and easily give you the ability to securely manage data center equipment (servers, switches, routers, telecom equipment) from anywhere, at any time... even if the network is down.

Design Databases with ER/Studio: Free Trial
ER/Studio delivers next-generation data modeling. Multiple, distinct physical models based on a single logical model give you the tools you need to manage complex database environments and critical metadata in an intuitive user interface.

Free EII Buyer's Guide
Understand EII - Trends. Tech. Apps. Calculate ROI. Download Now.

Click here to advertise in this space

View Full Issue View Full Magazine Issue
E-mail This Column E-Mail This Column
Printer Friendly Version Printer-Friendly Version
Related Content Related Content
Request Reprints Request Reprints
Site Map Terms of Use Privacy Policy
SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.
SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.