DM Review | Covering Business Intelligence, Integration & Analytics


Data Warehousing Lessons Learned:
Data Quality and ETL Will Converge, Not Merge

  Column published in DM Review Magazine
November 2003 Issue
  By Lou Agosta

The idea of invoking data quality processes at extract, transform and load (ETL) time is a compelling one. It is a good time to apply validation, transformation, filtering or standardization in the interest of data quality. Disk I/O is one of the most expensive operations; because the read has already been completed, it makes sense to subject the data to multiple processes before writing it back or sending it out through a network interface. Convergence occurs at three levels:

Design time integration: The functions are available in a palette of transformations at the developer's fingertips to support a diversity of transformations, including those relevant to data quality.

Execution time integration: The processes are applied in the application that is generated and promoted to production.

Meta data integration: The information is stored in the local meta data repository, but can then be interchanged, thanks to a variety of bridges (available at a modest extra fee), with other tools (such as a query and reporting interface, data modeling or data mining tool) in a federated design and execution environment.
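The I/O argument above can be made concrete with a short sketch. The code below is purely illustrative (not taken from any vendor tool); the record fields and rules are assumptions. It chains filtering, standardization and a structural transformation in a single in-memory pass, so each record is read once and written once.

```python
# Hedged sketch: data quality steps chained inside one ETL transform pass.
# All field names and rules are hypothetical.

def validate(record):
    # Hypothetical DQ rule: a record must carry a customer id and a date.
    return bool(record.get("customer_id")) and bool(record.get("order_date"))

def standardize(record):
    # Hypothetical DQ standardization: trim names, uppercase country codes.
    record["country"] = record.get("country", "").strip().upper()
    record["name"] = record.get("name", "").strip()
    return record

def transform(record):
    # Structural ETL step: derive a load-ready key for the target schema.
    record["load_key"] = f'{record["customer_id"]}-{record["order_date"]}'
    return record

def etl_pass(extracted_rows):
    """Apply filter -> standardize -> transform in one in-memory pass."""
    for record in extracted_rows:
        if not validate(record):      # filtering at ETL time
            continue
        yield transform(standardize(record))

rows = [
    {"customer_id": "C1", "order_date": "2003-11-01",
     "country": " us ", "name": " Ann "},
    {"customer_id": "", "order_date": "2003-11-02",
     "country": "de", "name": "Bob"},
]
loaded = list(etl_pass(rows))
# Only the first record survives: the second fails validation and is
# filtered out before any write-back occurs.
```

Because the invalid record is dropped before the load step, no write or network transfer is wasted on it, which is the efficiency case the column makes for doing these steps together.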

This convergence of ETL and data quality (DQ) technologies has been in progress at least since 1999, when Oracle acquired Carleton and its data quality product Pure. SAS's acquisition of DataFlux followed a few months later. Ascential then acquired Vality in the spring of 2002. Group 1 reversed the direction of the trend of ETL vendors acquiring data quality vendors: here the DQ vendor acquired the ETL vendor. Group 1's primary focus had been on data quality in the direct marketing vertical, and Sagent was initially an end-to-end business intelligence software provider with an ETL tool. See Figure 1 for a summary.

Figure 1: Convergence of ETL and Data Quality

However, in spite of significant convergence, the merging of features and functions across data quality and ETL will remain incomplete. Some clients will find that transforming operational data into a star schema format is disconnected from issues of data quality, which are best addressed upstream in the transactional system. Others will find that addressing data quality issues requires semantic analyses and content updates that are significantly different from the structural and syntactic transformations in which ETL tools excel. Furthermore, many ETL tools now accept a near real-time data feed from message brokers such as MQ Series; in practice, however, actual deployments of ETL tools remain batch oriented, whereas data quality tools support real-time operation. Finally, differences exist in the form and uses of meta data. The meta data of the ETL tool is a grammar-like repository of data models and data structures, whereas the meta data of the DQ tool is a dictionary-like repository of valid contents. Two separate problem spaces (and markets) exist here, even though it is often operationally convenient to perform related functions at the same time. Both categories of tools -- DQ and ETL -- will continue to address separate requirements and will continue to exist separately in spite of productive collaborations.
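The grammar-versus-dictionary distinction can be illustrated with a minimal sketch. The mapping table and code values below are invented for illustration: ETL-style meta data describes structure (which source fields map to which target fields, with what types), while DQ-style meta data enumerates valid content.

```python
# Hedged illustration of the two kinds of meta data described above.
# All field names and code values are hypothetical.

# ETL-style meta data: a grammar of structure -- source field, target
# field, and type cast.
etl_mapping = {
    "CUST_NM": ("customer_name", str),
    "ST_CD": ("state_code", str),
}

# DQ-style meta data: a dictionary of valid contents.
valid_state_codes = {"IL", "NY", "CA", "TX"}

def restructure(source_row):
    """Syntactic ETL transformation: rename and type-cast fields."""
    return {target: cast(source_row[src])
            for src, (target, cast) in etl_mapping.items()}

def content_check(target_row):
    """Semantic DQ check: is the value in the dictionary of valid codes?"""
    return target_row["state_code"] in valid_state_codes

row = restructure({"CUST_NM": "Acme Corp", "ST_CD": "ZZ"})
# The row is structurally well-formed -- the ETL grammar is satisfied --
# yet semantically invalid: "ZZ" is not in the dictionary of states.
ok = content_check(row)
```

A row can pass the ETL grammar while failing the DQ dictionary, which is why the two repositories of meta data serve different purposes even when the tools run side by side.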

One open question for clients is: Will the convergence of the two technologies, partial though it may be, provide end users additional technology "at no extra charge" following the model that has been characteristic of software innovation, or will the perception of additional value be employed by the vendor to propose a price increase? When the two vendors remain separate entities (Informatica/Trillium), then separate license agreements and separate fees are to be expected, though discounts based on local circumstances are always possible. When the two become one (Group 1/Sagent or SAS/DataFlux), the opportunities for flexible pricing are enhanced. In either case, clients should do their homework to determine the internal costs and benefits of their own data warehousing, data transformation and data quality applications. If the buyer has not yet made a commitment, he/she enjoys the maximum leverage in negotiating for additional technology at no extra charge, whether ETL or DQ. This may include training or a long-term maintenance contract, locking in low prices or premium service where the commitment warrants. If the costs for upgrading are high, clients should present a case to the vendor for additional support, discounts and related "investment protection."



Lou Agosta, Ph.D., joined IBM WorldWide Business Intelligence Solutions in August 2005 as a BI strategist focusing on competitive dynamics. He is a former industry analyst with Giga Information Group, has served as an enterprise consultant with Greenbrier & Russel and has worked in the trenches as a database administrator in prior careers. His book The Essential Guide to Data Warehousing is published by Prentice Hall. Agosta may be reached at LoAgosta@us.ibm.com.

