DM Review | Covering Business Intelligence, Integration & Analytics
Data Warehousing Lessons Learned:
Inflexion Point in Information Quality: Data Profiling Assimilated by IQ

Column published in DM Review Magazine
May 2004 Issue
By Lou Agosta

Data profiling (DP) has always had a special relationship to information quality (IQ) and has often, though not always, been represented by Forrester as a subset of the larger IQ market. The market has now validated the hierarchical relationship again with the announcement of the Harte-Hanks/Trillium acquisition of Avellino in February 2004. Meanwhile, Firstlogic has shipped data profiling capabilities with IQ Insight 2.3, and DataFlux has enhanced the data profiling capability of its version 6.1 IQ technology. In a preemptive move, Innovative Systems is promoting Synchronous, its customer data hub, and will continue to support Avellino's Discovery, which it had previously sold as Innovative Discovery.

Best-of-breed data profiling will now be inbred, which will enrich the gene pool of the software DNA of the assimilating enterprise. Data profiling capabilities such as redundant data identification, parent-child relationship analysis, data validation and sampling have always been a part of data standardization. Vendors such as Vality (now a part of Ascential), Trillium, Similarity Systems, Innovative Systems and Firstlogic have been featuring their data profiling functions in briefings to Forrester industry analysts, including this author, for years. What is happening now that makes this an inflexion point? Three causal factors follow:

Reality has now caught up with the rhetoric. Data profiling capabilities - to which lip service has always been devoted - are being strengthened by vendors in their latest product shipments with more powerful functions and a greater diversity of choice. As is often the case, the promises preceded the results. In this case, the capabilities are now delivering on their promises.

Data standardization is being differentiated from information quality. Both data profiling and data standardization are subsets of information quality. Standardization can result in the loss of information unless it is based on an understanding of how the standards interact with what is given by the raw data. Data profiling is being integrated with data standardization so that the one leads naturally to the other in the order of implementation.

Defect inspection is giving way to a design for information quality. There is a world of difference between inspecting the content of every individual data element and designing a process that produces the correct output by design. The latter is pursued as part of an integrated methodology for information quality.

Data profiling determines which values are valid in a population of data values whose validity is uncertain. It is essential to have reports on frequency analysis, word counts, patterns and related occurrences of tokens and labels. Such reports improve both usability and productivity. Data profiling tools should be able to report concisely on questions such as:

  • What are the values contained in the data elements?
  • What keys are inferred from what is in the data?
  • What is the proposed parsing of the free-form text field?
  • What allegedly different data elements are actually aliases (synonyms) for the same data element, even though in different files?
  • What are the data dependencies, constraints and rules implied by the data?
  • What is the normalized logical and physical data model or relational design that represents the business rules of the existing, analyzed (legacy) data?
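A few of the questions above can be illustrated with a small sketch. The Python below is not taken from any vendor's product; it assumes string-valued columns, and the function names, the sample records and the letter/digit pattern notation ("A" for a letter, "9" for a digit) are all invented for illustration. It reports value frequencies, format patterns and candidate keys for a sample.

```python
from collections import Counter


def pattern_of(value):
    """Map letters to 'A' and digits to '9'; keep other characters as-is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def profile_column(values):
    """Return basic profiling facts for one column of string values."""
    non_empty = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "missing": len(values) - len(non_empty),
        "distinct": len(set(non_empty)),
        "top_values": Counter(non_empty).most_common(3),
        "patterns": Counter(pattern_of(v) for v in non_empty).most_common(3),
    }


def candidate_keys(rows, columns):
    """Columns whose values are all present and unique in this sample."""
    keys = []
    for col in columns:
        vals = [row[col] for row in rows]
        complete = all(v not in (None, "") for v in vals)
        if complete and len(set(vals)) == len(vals):
            keys.append(col)
    return keys


# Hypothetical sample records, invented for illustration.
rows = [
    {"cust_id": "1001", "zip": "60201", "phone": "847-555-0101"},
    {"cust_id": "1002", "zip": "60201", "phone": "8475550102"},
    {"cust_id": "1003", "zip": "6020", "phone": "847-555-0103"},
    {"cust_id": "1004", "zip": "", "phone": "847-555-0104"},
]

zip_profile = profile_column([r["zip"] for r in rows])
print(zip_profile["patterns"])
print(candidate_keys(rows, ["cust_id", "zip", "phone"]))
```

In this sample the phone column comes back as a candidate key alongside cust_id, simply because no phone number happens to repeat. An inferred key is a hypothesis for an analyst to confirm, not a finding.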

In the final analysis, semantics is messy because the real world is messy. A dictionary of special words, phrases and patterns remains essential in identifying noise words, suffixes and prefixes, such as "in care of," "Dr." and "Ph.D." It is a semantic problem to know whether the word "church" occurring in a free-form text refers to an individual named Alonzo Church, Church Street in Evanston, Illinois, or the First Church of God. Such dictionaries will continue to be a part of the solution, whether in the form of cartridges for text mining or dictionaries, more narrowly defined, in data standardization.
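The "church" example can be made concrete with a minimal sketch of dictionary-driven labeling. Everything here is an assumption made for illustration: the tiny dictionaries, the function names and the neighbor-based rules are hypothetical and far cruder than what a commercial standardization tool would ship.

```python
import re

# Hypothetical dictionaries; a real tool would carry far larger ones.
PREFIXES = {"dr", "mr", "mrs", "ms"}
SUFFIXES = {"phd", "md", "jr", "sr"}
STREET_WORDS = {"street", "st", "avenue", "ave", "road", "rd"}
ORG_LEADERS = {"first", "second", "saint"}
NOISE_RE = re.compile(r"\b(?:in care of|c/o)\b")


def normalize(text):
    """Lowercase, drop noise phrases, strip periods, split into tokens."""
    lowered = NOISE_RE.sub(" ", text.lower())
    return re.findall(r"[a-z0-9]+", lowered.replace(".", ""))


def label_tokens(text):
    """Attach a dictionary label to each token that survives normalization."""
    labels = []
    for tok in normalize(text):
        if tok in PREFIXES:
            kind = "prefix"
        elif tok in SUFFIXES:
            kind = "suffix"
        elif tok in STREET_WORDS:
            kind = "street_word"
        else:
            kind = "word"
        labels.append((tok, kind))
    return labels


def sense_of(word, text):
    """Guess how `word` is used, looking only at its immediate neighbors."""
    toks = normalize(text)
    if word not in toks:
        return "absent"
    i = toks.index(word)
    nxt = toks[i + 1] if i + 1 < len(toks) else ""
    prev = toks[i - 1] if i > 0 else ""
    if nxt in STREET_WORDS:
        return "street"
    if nxt == "of" or prev in ORG_LEADERS:
        return "organization"
    return "person_or_unknown"


print(label_tokens("In care of Dr. Alonzo Church, Ph.D."))
print(sense_of("church", "Church Street, Evanston, Illinois"))
print(sense_of("church", "First Church of God"))
```

The fallback label is deliberately "person_or_unknown" rather than "person": neighbor rules of this kind narrow the possibilities but do not settle the semantics.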

The Future of Data Profiling

Data profiling is the first step in a readiness assessment for information quality improvement. You need to know what you have and what you are up against prior to engaging and transforming it. In the future, the results of the profiling and analysis activity will be incorporated into a meta data repository of values for further inspection and validation. This will be a source of inputs to other downstream enterprise processes and technologies such as data warehousing, data mining and data standardization. By certifying up front that the data inputs conform by design to quality standards, the downstream applications will be protected against externalities due to variations in data quality (though obviously internal system design defects will still be an issue). After having profiled and parsed the semantics of an opaque data element or text, a logical next step is to standardize the results. Standardization leads directly into the functionality provided by information quality tools.
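One way to picture certifying inputs before they reach downstream applications is a simple gate driven by rules recorded during profiling. The sketch below is an assumption-laden illustration: the rule table stands in for a repository of profiling results, and the field names, formats and function names are hypothetical.

```python
import re

# Hypothetical rules, standing in for what a repository of profiling
# results might hold for each field.
RULES = {
    "cust_id": re.compile(r"\d{4}"),
    "zip": re.compile(r"\d{5}(-\d{4})?"),
}


def violations(record):
    """Return the names of fields that are missing or fail their rule."""
    bad = []
    for field, rule in RULES.items():
        value = record.get(field) or ""
        if not rule.fullmatch(value):
            bad.append(field)
    return bad


def certify(records):
    """Split records into those passed downstream and those held back."""
    accepted, rejected = [], []
    for rec in records:
        problems = violations(rec)
        if problems:
            rejected.append((rec, problems))
        else:
            accepted.append(rec)
    return accepted, rejected


# Hypothetical input batch, invented for illustration.
batch = [
    {"cust_id": "1001", "zip": "60201"},
    {"cust_id": "1002", "zip": "6020"},
    {"cust_id": "A7", "zip": None},
]

accepted, rejected = certify(batch)
print(len(accepted), "accepted")
for rec, problems in rejected:
    print(rec, "failed on", problems)
```

Rejected records are kept together with the reasons they failed, so they can be routed to standardization or manual review rather than silently dropped.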



Lou Agosta, Ph.D. is a technology analyst specializing in data warehousing, data mining and data quality - keyword: data. Agosta is the author of The Essential Guide to Data Warehousing (Prentice Hall PTR, 2000); and he offers unbiased research, advice and commentary on the interrelation between business and information technology. Please send comments or questions to him at LAgosta@acm.org.



SourceMedia (c) 2005 DM Review and SourceMedia, Inc. All rights reserved.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.