Data Integration Adviser:
The Trial-and-Error Method for Data Integration

Column published in DM Review Magazine, February 2007 Issue
By Rick Sherman

In last month's column, I discussed some common data architecture mistakes. This month, I point out areas where companies tend to go wrong in data integration or extract, transform and load (ETL) processing.

Data Integration Mistakes

1. Not developing an overall architecture and workflow. The usual development approach for data integration is to gather the data requirements, determine what data is needed from the source systems, create the target databases, such as a data warehouse, and then start coding. This is an incomplete, bottom-up approach. It needs to be coupled with a top-down approach that emphasizes an overall data integration architecture and workflow.

Some of the design considerations often lost with a bottom-up approach include:

  • How and where do you implement a refresh (replacing all your data) rather than a change data capture (CDC) approach?
  • How do you process dimensional data, such as products and customers, in relation to facts, such as business transactions?
  • When do you filter and aggregate data?
  • When do you use staging tables, and are they persistent or transient?
  • How do you handle data quality?
  • Do you reject, suspend or flag data with perceived data quality problems?
  • How do you handle the changes and gaps in historical data?
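
The first design question above, refresh versus change data capture, can be illustrated with a minimal sketch. The table shape, the `modified_at` watermark column and the function names are invented for illustration; real CDC implementations typically rely on database logs or triggers rather than timestamps:

```python
from datetime import datetime

def full_refresh(source_rows):
    """Replace the target entirely with the current source snapshot."""
    return list(source_rows)

def incremental_extract(source_rows, watermark):
    """CDC-style pull: only rows changed since the last successful load."""
    return [r for r in source_rows if r["modified_at"] > watermark]

source = [
    {"id": 1, "modified_at": datetime(2007, 1, 10)},
    {"id": 2, "modified_at": datetime(2007, 1, 25)},
    {"id": 3, "modified_at": datetime(2007, 2, 1)},
]

last_load = datetime(2007, 1, 15)
changed = incremental_extract(source, last_load)   # rows 2 and 3 only
everything = full_refresh(source)                  # all three rows
```

The trade-off the architecture must settle up front: a full refresh is simpler but reprocesses everything; CDC moves less data but requires a reliable way to detect what changed.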

Failing to address these considerations in the beginning can delay your project, increase costs, reduce perceived data quality and cause business users to question the value of your business intelligence (BI)/data warehousing (DW) efforts.

2. Thinking that data quality is a product rather than a process. People often assume that data quality problems are simply data errors or inconsistencies in the transactional systems that can be fixed with data quality products. They overlook, and therefore do not try to prevent, the problems that arise when you integrate data from disparate source systems into a data warehouse.

They're not seeing that many data quality problems are really data consistency and integrity issues that arise when you integrate data from multiple transaction systems. Differences in dimensional data, such as products (part IDs, codes and hierarchies), customers (businesses and/or people), suppliers, partners and employees, become an issue at the enterprise level, which is precisely the target for DW and performance management reporting and analysis.
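
The consistency issue above is commonly addressed with a conformed dimension: one enterprise-level key that the differing source identifiers map to. The source systems, part IDs and product keys below are invented for the sketch:

```python
# Two hypothetical source systems identify the same product with different
# part IDs; a conformed product dimension maps both to one enterprise key
# so facts from either system can be compared and aggregated together.
conformed_products = {
    ("ERP", "A-100"): "PROD-001",   # invented IDs for illustration
    ("CRM", "100-A"): "PROD-001",   # same product, different source code
    ("ERP", "B-200"): "PROD-002",
}

def conform(source_system, part_id):
    """Resolve a source-specific part ID to the conformed product key."""
    key = conformed_products.get((source_system, part_id))
    if key is None:
        # Unmapped source values are a data quality event, not a silent skip.
        raise KeyError(f"unmapped part {part_id!r} from {source_system}")
    return key
```

Because both source codes resolve to the same conformed key, `conform("ERP", "A-100")` and `conform("CRM", "100-A")` agree, which is what makes enterprise-level reporting possible.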

Even when data quality problems are not your fault, you still need to take responsibility to proactively measure, monitor and report on data quality metrics as you load your DW and data marts. You might not own fixing the problems, but you certainly own measuring them. The business should not make decisions using enterprise data of unknown quality.

Measure data quality by obtaining data quality requirements in your initial business requirements phase, incorporating data quality metrics into your architecture, monitoring those metrics in all your data integration processes and reporting on data quality so the business users understand the data on which they are basing their decisions.
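
A minimal sketch of what "monitoring those metrics in all your data integration processes" can look like. The metric names, key column and flagging convention are assumptions for illustration; the point is that problem rows are measured and flagged rather than silently dropped:

```python
def load_with_metrics(rows, key="customer_id"):
    """Load rows while counting simple data quality metrics.

    Rows with a missing or duplicate key are flagged (not rejected), so
    business users can see the quality of the data behind their reports.
    """
    metrics = {"rows": 0, "missing_key": 0, "duplicates": 0}
    seen, loaded = set(), []
    for row in rows:
        metrics["rows"] += 1
        k = row.get(key)
        if k is None:
            metrics["missing_key"] += 1
            row["dq_flag"] = "missing_key"
        elif k in seen:
            metrics["duplicates"] += 1
            row["dq_flag"] = "duplicate"
        else:
            seen.add(k)
        loaded.append(row)
    return loaded, metrics
```

Reporting the resulting metrics alongside the data itself is what turns data quality from a one-time cleanup into an ongoing process.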

3. Assuming custom coding is faster than ETL development. While most large enterprises have embraced ETL development as a best practice, the reality is that custom coding is still prevalent, especially in the following areas:

  • Small-to-medium businesses are not using ETL tools as extensively as their larger brethren.
  • Companies that have used database vendors' ETL tools (particularly older versions of these tools, such as Microsoft DTS and Oracle Warehouse Builder) may be using them just to run SQL scripts or stored procedures. Although they are technically using ETL tools, in reality they are writing custom code and just using the ETL tool to run their code.
  • Many BI applications need to build summary or aggregation tables (these should be data marts, but many times they are not). These reporting tables are often built using SQL scripts or stored procedures. The BI developers know SQL, feel comfortable in creating tables with it and do not feel the need to use an ETL tool.
  • Business groups are using custom coding to create countless data shadow systems in large and small enterprises (defined in my April 2004 column). They may have asked for a flat file extract from IT, or they used a BI tool to create their own flat file extract. They then use a combination of Microsoft Access and Microsoft Excel to perform their ETL work.
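
The hand-coded summary-table pattern described in the bullets above often amounts to little more than a `CREATE TABLE ... AS SELECT` with an aggregation. A sketch, using sqlite3 as a stand-in for whatever database is in use; table and column names are invented:

```python
import sqlite3

# Hypothetical example of a "summary table" built with plain SQL rather
# than an ETL tool -- the pattern BI developers comfortable with SQL reach for.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('East', 100), ('East', 50), ('West', 75);
    CREATE TABLE sales_summary AS
        SELECT region, SUM(amount) AS total_amount
        FROM sales
        GROUP BY region;
""")
rows = con.execute(
    "SELECT region, total_amount FROM sales_summary ORDER BY region"
).fetchall()
# rows == [('East', 150.0), ('West', 75.0)]
```

The SQL itself is easy to write, which is exactly why this custom coding stays below the radar: the hard parts (scheduling, restartability, lineage, error handling) are what an ETL tool would otherwise provide.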

In all the cases just mentioned, I get the usual feedback: "Rick, how can it be faster to code with the ETL tool, and how can the ETL code perform as well as the code I create?" Without debating the merits of custom coding versus ETL development, let's just say that when I see a large block of custom code, I see a great opportunity for change and improvement. But custom coding often stays below the radar because most people just assume that everyone is using ETL tools.

In general, people tend to take an oversimplified view of data integration. By not seeing it as a process to be managed from a top-down perspective, they tend to run into problems and learn the hard way: through trial and error.



Rick Sherman has more than 20 years of business intelligence and data warehousing experience, having worked on more than 50 implementations as a director/practice leader at PricewaterhouseCoopers and while managing his own firm. He is the founder of Athena IT Solutions, a Boston-based consulting firm that provides data warehouse and business intelligence consulting, training and vendor services. Sherman is a published author of over 50 articles, an industry speaker, a DM Review World Class Solution Awards judge, a data management expert at searchdatamanagement.com and has been quoted in CFO and Business Week. Sherman can be found blogging on performance management, data warehouse and business intelligence topics at The Data Doghouse. You can reach him at rsherman@athena-solutions.com or (617) 835-0546.

In addition to teaching at industry conferences, Sherman offers customized on-site data warehouse/business intelligence training and teaches public courses in the Boston area. He also teaches data warehousing at Northeastern University's graduate school of engineering.


SourceMedia (c) 2007 DM Review and SourceMedia, Inc. All rights reserved.