The Conformed ETL Architecture:
Pounding Square Pegs into Round Holes (without Cutting Corners)

Article published in DM Review Magazine, September 2004 issue

By Pat Minton and Don Steffen

Requirements have been defined, the data model is complete, source systems have been identified, tool selections have been made, and now the only thing left to do is connect the dots. Sounds easy, right? Tasked with creating the extract, transform and load (ETL) architecture to move the source data into the warehouse, you begin drawing boxes and lines depicting the individual ETL processes that must be created in order to reconcile the idiosyncratic source system data into the warehouse's generic business view of the world. The solution seems obvious: create an ETL process for each source you must introduce.

And they're off! The ETL analysts charge forward, creating the detail design (source-to-target mappings) for each of the processes you identified. However, upon reviewing those designs, you begin to discover that many of them repeat the same actions (and not always consistently). Maybe there's a better way.

ETL architects and data warehouse designers are faced with the task of homogenizing data into a standard and consistent format. Whether by geography, business unit or anything (and everything!) else, data in the source systems feeding the data warehouse is inherently different. This article examines the conformed ETL architecture - an approach that eliminates proliferating, redundant ETL actions by distinguishing between where to apply source-specific and non source-specific business rules during the ETL load processes within the data warehouse.

Defining ETL Actions

Creating a stable, timely, maintainable and extensible framework for converging like, but disparate, data sources into homogenized entities is the goal of any successful ETL architecture. This meshing of source system-specific data into the business view represented in the warehouse requires two distinctly different types of ETL actions:

  • Source-Specific Actions: Those actions which apply source-specific business rules in order to transform and decode the data from a format that is recognizable only by the source systems into a standardized format.
  • Non Source-Specific Actions: Those actions performed on data, regardless of the source system of origination. Non source-specific actions include the application of business rules, as well as the ETL logic necessary to effect the change in the warehouse (insert/update).

A logical ETL design outlines the necessary actions and mapping of specific elements that must be performed in order to successfully integrate disparate data sources. This detailed abstraction provides insight into the transformations that the source data must undergo at an atomic level, regardless of the ETL tool used for the physical implementation. Accurately categorizing these actions into their appropriate designations (source- or non source-specific) can be a difficult undertaking. Consider the following scenario where the customer's status (either ACTIVE or INACTIVE) and the customer's billing status (either BILLED or UNBILLED) are being determined (see Figure 1).


Figure 1: Source-Specific and Non Source-Specific Business Rules

It would be easy to mistakenly determine that the assignment of the customer's billing status is an action specific to the source. However, by isolating the individual transformations and business rules applied to source data, we gain better insight into the true scope of each business rule, allowing us to create generic, reusable business rules and definitions.
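
This distinction can be sketched in code. The following Python fragment is a minimal illustration, not the authors' implementation: the status codes, field names and the open-invoice rule are invented. Each source supplies its own decode of raw status codes (a source-specific action), while the billing-status rule operates only on already-standardized values (a non source-specific action) and is therefore written and tested once.

  # Hypothetical illustration of the two kinds of ETL actions.
  # Status codes, field names and the invoice rule are invented for this sketch.

  # Source-specific actions: decode each system's private status codes
  # into the standardized customer status (ACTIVE / INACTIVE).
  def decode_status_system_x(raw_code: str) -> str:
      return "ACTIVE" if raw_code in ("A", "P") else "INACTIVE"

  def decode_status_system_y(raw_code: str) -> str:
      return "ACTIVE" if raw_code == "01" else "INACTIVE"

  # Non source-specific action: the billing-status rule is expressed only
  # in terms of standardized fields, so it exists in exactly one place.
  def derive_billing_status(customer_status: str, open_invoices: int) -> str:
      return "BILLED" if customer_status == "ACTIVE" and open_invoices > 0 else "UNBILLED"

  # The same generic rule serves records from either source.
  print(derive_billing_status(decode_status_system_x("A"), open_invoices=2))   # BILLED
  print(derive_billing_status(decode_status_system_y("99"), open_invoices=0))  # UNBILLED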

Traditional Versus Conformed ETL Architecture

Following is a simple example as context for this discussion.

Company ABC has two sales systems (systems X and Y), which contain information surrounding the sale of widgets. Each sale of a widget needs to be captured as part of the data warehouse implementation. The sales are qualified by the date of the sale, the customer purchasing the widget, the type of widget purchased (commercial or residential) and the total dollar amount of the sale (see Figure 2).


Figure 2: Widget Sales Example - Sources and Targets
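
To make the example concrete, here is a minimal sketch of what the two source layouts and the warehouse target might look like. The column names, date formats and type codes are assumptions; the article defines only the business-level fields (sale date, customer, widget type and sale amount).

  # Hypothetical record layouts for the widget sales example.
  # Column names, date formats and type codes are assumptions.
  from dataclasses import dataclass

  # System X might deliver dates as YYYYMMDD strings and a one-letter type code.
  sales_system_x = [
      {"sale_dt": "20040901", "cust_no": "C100", "widget_cd": "C", "amt": "150.00"},
  ]

  # System Y might deliver ISO dates and spelled-out widget types.
  sales_system_y = [
      {"sold_on": "2004-09-02", "customer": "9001", "widget_type": "RESIDENTIAL", "total": 89.5},
  ]

  # The warehouse target: one row per widget sale, qualified by date,
  # customer, widget type (COMMERCIAL or RESIDENTIAL) and sale amount.
  @dataclass
  class WidgetSaleFact:
      sale_date: str        # ISO date
      customer_id: str
      widget_type: str      # "COMMERCIAL" or "RESIDENTIAL"
      sale_amount: float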

When faced with the issue of how to create the necessary ETL processes to converge like data sources into standardized entities within the warehouse, there are two choices:

Traditional ETL Architecture:
Create individual ETL processes for each source system (as shown in Figure 3).


Figure 3: Traditional ETL Architecture

A traditional ETL architecture would create one ETL process to perform all of the logic necessary to transform the source data into its target destination. The advantage of this approach is that there are fewer ETL processes to create and maintain. The hidden shortcoming is that each individual ETL process must redundantly perform many of the same actions (the non source-specific actions). This duplication results in additional effort throughout the life cycle of the data warehouse:

  • During development, this redundancy results in an increase in the amount and complexity of ETL code to create and test, thereby lengthening the development time line.
  • Upon implementation, modifications require more effort as changes potentially must be made in multiple places.
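
The redundancy is easiest to see in code. In the following hypothetical sketch, each source gets one monolithic load process, and the non source-specific work (the foreign key lookups and the insert into the fact table) is repeated verbatim in both. The dimension contents and field names are invented for the illustration.

  # Hypothetical sketch of the traditional architecture: one monolithic
  # ETL process per source, each repeating the same generic steps.
  date_dim = {"2004-09-01": 101, "2004-09-02": 102}   # date -> surrogate key
  customer_dim = {"C100": 501, "9001": 502}            # customer -> surrogate key
  widget_sales_fact = []                               # target fact table

  def load_system_x(rows):
      for r in rows:
          # Source-specific: decode System X's formats.
          sale_date = f"{r['sale_dt'][0:4]}-{r['sale_dt'][4:6]}-{r['sale_dt'][6:8]}"
          widget_type = "COMMERCIAL" if r["widget_cd"] == "C" else "RESIDENTIAL"
          # Non source-specific, duplicated here: FK lookups and insert.
          widget_sales_fact.append({
              "date_key": date_dim[sale_date],
              "customer_key": customer_dim[r["cust_no"]],
              "widget_type": widget_type,
              "sale_amount": float(r["amt"]),
          })

  def load_system_y(rows):
      for r in rows:
          # Source-specific: System Y already uses ISO dates and full type names.
          sale_date = r["sold_on"]
          widget_type = r["widget_type"]
          # Non source-specific, duplicated again: the same FK lookups and insert.
          widget_sales_fact.append({
              "date_key": date_dim[sale_date],
              "customer_key": customer_dim[r["customer"]],
              "widget_type": widget_type,
              "sale_amount": float(r["total"]),
          })

  load_system_x([{"sale_dt": "20040901", "cust_no": "C100", "widget_cd": "C", "amt": "150.00"}])
  load_system_y([{"sold_on": "2004-09-02", "customer": "9001", "widget_type": "RESIDENTIAL", "total": 89.5}])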

Conformed ETL Architecture:
Create individual ETL processes for each source to apply source-specific business rules necessary to conform the source into a generic format (conformed widget sales) and create reusable, generic processes to execute non source-specific business rules and actions (see Figure 4).


Figure 4: Conformed ETL Architecture

The conformed ETL architecture involves creating an intermediary definition of the data entity (the conformed table). This operational data entity contains the standardized format of the converged source systems, clearly defines the boundary between source-specific and reusable generic actions and rules, and serves as the gateway for moving data into the warehouse. Obviously, the disadvantage of this approach is that there are more objects and processes to create and maintain. However, the underlying advantages of this architecture are plentiful:

  • Modularization of ETL processes: The creation of smaller, less complex ETL processes makes troubleshooting problems and creating future enhancements easier.
  • Reusability of post-conform processes: Often, the "simple" process of enforcing referential integrity (looking up foreign key assignments) and performing inserts and/or updates to the final target destination is "not so simple." The conformed approach avoids duplicating the effort and logic necessary to apply non source-specific actions to data. Removing redundant processing and ETL logic reduces the chances of error, ultimately improving data quality.
  • Extensibility: The reusable aspects of the conformed ETL architecture allow for the rapid acquisition of additional source systems. When new source data is introduced to the warehouse, only the source-specific ETL processes to conform the data must be created.
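
By contrast with the traditional sketch above, a conformed sketch under the same invented assumptions separates a small source-specific conform step per system from a single, reusable post-conform loader. Adding a third sales system would require only one new conform function; the generic loader is untouched.

  # Hypothetical sketch of the conformed architecture: small per-source
  # conform steps feed one reusable, generic post-conform load process.
  date_dim = {"2004-09-01": 101, "2004-09-02": 102}
  customer_dim = {"C100": 501, "9001": 502}
  conformed_widget_sales = []   # intermediary "conformed" table
  widget_sales_fact = []        # final warehouse target

  # Source-specific: conform System X rows into the standard layout.
  def conform_system_x(rows):
      for r in rows:
          conformed_widget_sales.append({
              "sale_date": f"{r['sale_dt'][0:4]}-{r['sale_dt'][4:6]}-{r['sale_dt'][6:8]}",
              "customer_id": r["cust_no"],
              "widget_type": "COMMERCIAL" if r["widget_cd"] == "C" else "RESIDENTIAL",
              "sale_amount": float(r["amt"]),
          })

  # Source-specific: conform System Y rows into the same layout.
  def conform_system_y(rows):
      for r in rows:
          conformed_widget_sales.append({
              "sale_date": r["sold_on"],
              "customer_id": r["customer"],
              "widget_type": r["widget_type"],
              "sale_amount": float(r["total"]),
          })

  # Non source-specific: written once, reused for every current and future source.
  def load_conformed_sales():
      for c in conformed_widget_sales:
          widget_sales_fact.append({
              "date_key": date_dim[c["sale_date"]],            # FK lookup
              "customer_key": customer_dim[c["customer_id"]],  # FK lookup
              "widget_type": c["widget_type"],
              "sale_amount": c["sale_amount"],
          })

  conform_system_x([{"sale_dt": "20040901", "cust_no": "C100", "widget_cd": "C", "amt": "150.00"}])
  conform_system_y([{"sold_on": "2004-09-02", "customer": "9001", "widget_type": "RESIDENTIAL", "total": 89.5}])
  load_conformed_sales()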

To Conform or Not to Conform?

While the conformed ETL architecture has many benefits, careful consideration should be given to where the approach is applied, as it is not suitable in every scenario. Analysis should be performed to decide which sources to conform, ensuring that the extra development overhead yields quantifiable benefits (increased performance, flexibility in ETL scheduling, modularization of processes, etc.). In deciding whether or not to follow a conformed architecture, be sure to keep in mind the future requirements of the warehouse (additional sources of data, etc.), not just the current state.

Redundancy within a data warehouse is a given - where the data is concerned. By properly identifying and categorizing the business rules and logic to be applied to source data, many processing redundancies can be reduced as well. The conformed ETL architecture allows for the creation of a scalable, extensible framework, which helps to ensure that timely information is consistently delivered to the masses.

Pat Minton is a manager with DiamondCluster International, a premier global management consulting firm that helps leading organizations develop and implement growth strategies, improve operations and capitalize on technology. He has more than 10 years of database architecture experience in the telecommunications, media and transportation industries. He may be reached at patrick.minton@diamondcluster.com.

Don Steffen is a cofounder and partner of AmberLeaf Partners, Inc. (formerly BI Solutions, Inc.), a consulting firm dedicated to enabling innovative companies with the information to make critical investment decisions. Steffen has been designing and delivering technical architecture and solutions in the business intelligence and data warehouse industry for more than a decade. He can be reached at dsteffen@amberleaf.net.
