DM Review Magazine: Covering Business Intelligence, Integration & Analytics
The Conformed ETL Architecture
Requirements have been defined, the data model is complete, source systems have been identified, tool selections have been made, and now the only thing left to do is connect the dots. Sounds easy, right? Tasked with creating the extract, transform and load (ETL) architecture to move the source data into the warehouse, you begin drawing boxes and lines depicting the individual ETL processes that must be created to reconcile the idiosyncratic source system data into the warehouse's generic business view of the world. The solution seems obvious: create an ETL process for each source you must introduce.
And they're off! The ETL analysts charge forward, creating the detailed designs (source-to-target mappings) for each of the processes you identified. Upon reviewing those designs, however, you discover that many of them perform the same actions again and again (and not always consistently). Maybe there's a better way.
ETL architects and data warehouse designers are faced with the task of homogenizing data into a standard and consistent format. Whether by geography, business unit or anything (and everything!) else, data in the source systems feeding the data warehouse is inherently different. This article examines the conformed ETL architecture, an approach that eliminates proliferating, redundant ETL actions by distinguishing where source-specific and non source-specific business rules should be applied during the ETL load processes within the data warehouse.
Creating a stable, timely, maintainable and extensible framework for converging like, but disparate, data sources into homogenized entities is the goal of any successful ETL architecture. This meshing of source system-specific data into the business view represented in the warehouse requires two distinctly different types of ETL actions:

Source-specific actions, which apply the business rules needed to conform each individual source into a standard format.

Non source-specific actions, which apply the business rules common to all sources once the data has been conformed.
A logical ETL design outlines the necessary actions and mapping of specific elements that must be performed in order to successfully integrate disparate data sources. This detailed abstraction provides insight into the transformations that the source data must undergo at an atomic level, regardless of the ETL tool used for the physical implementation. Accurately categorizing these actions into their appropriate designations (source- or non source-specific) can be a difficult undertaking. Consider the following scenario where the customer's status (either ACTIVE or INACTIVE) and the customer's billing status (either BILLED or UNBILLED) are being determined (see Figure 1).
Figure 1: Source-Specific and Non Source-Specific Business Rules
It would be easy to mistakenly classify the assignment of the customer's billing status as a source-specific action. However, by isolating the individual transformations and business rules applied to the source data, we gain better insight into the true scope of each business rule, allowing us to create generic, reusable business rules and definitions.
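The distinction illustrated in Figure 1 can be sketched in code. This is a minimal sketch only; the field names ("cust_no", "stat_cd", "amt") and the billing rule itself are hypothetical stand-ins, not taken from the article's source systems.

```python
# Hypothetical field names and rules, for illustration only.

def conform_customer_from_system_x(row: dict) -> dict:
    """Source-specific: map one system's codes into the standard format."""
    return {
        "customer_id": row["cust_no"],
        # This particular system happens to store status as 'A'/'I'.
        "status": "ACTIVE" if row["stat_cd"] == "A" else "INACTIVE",
        "billed_amount": float(row["amt"]),
    }

def derive_billing_status(customer: dict) -> str:
    """Non source-specific: the same rule applies to every conformed source."""
    return "BILLED" if customer["billed_amount"] > 0 else "UNBILLED"

customer = conform_customer_from_system_x(
    {"cust_no": 7, "stat_cd": "A", "amt": "19.95"})
print(customer["status"], derive_billing_status(customer))  # ACTIVE BILLED
```

Only the first function would need to be rewritten for a new source; the billing-status rule is defined once against the conformed shape.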
The following simple example provides context for this discussion.
Company ABC has two sales systems (systems X and Y), which contain information surrounding the sale of widgets. Each sale of a widget needs to be captured as part of the data warehouse implementation. The sales are qualified by the date of the sale, the customer purchasing the widget, the type of widget purchased (commercial or residential) and the total dollar amount of the sale (see Figure 2).
Figure 2: Widget Sales Example - Sources and Targets
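To make the widget sales example concrete, the sketch below conforms records from the two sales systems into one standard shape. The source layouts (date formats, field names, type codes) are hypothetical assumptions; the real layouts would come from the source-to-target mappings.

```python
from datetime import date

def conform_from_x(row: dict) -> dict:
    """Source-specific rules for system X (hypothetical: YYYYMMDD date
    strings, coded widget types, string amounts)."""
    return {
        "sale_date": date(int(row["dt"][:4]), int(row["dt"][4:6]), int(row["dt"][6:])),
        "customer": row["cust"],
        "widget_type": "COMMERCIAL" if row["type"] == "C" else "RESIDENTIAL",
        "total_amount": float(row["amt"]),
    }

def conform_from_y(row: dict) -> dict:
    """Source-specific rules for system Y (hypothetical: native dates,
    spelled-out widget types, numeric amounts)."""
    return {
        "sale_date": row["sale_dt"],
        "customer": row["customer_name"],
        "widget_type": row["widget_kind"].upper(),
        "total_amount": row["total"],
    }

x_sale = conform_from_x({"dt": "20240115", "cust": "ACME", "type": "C", "amt": "250.00"})
y_sale = conform_from_y({"sale_dt": date(2024, 1, 16), "customer_name": "Globex",
                         "widget_kind": "residential", "total": 99.50})
assert set(x_sale) == set(y_sale)  # both now share the standard shape
```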
When faced with the issue of how to create the necessary ETL processes to converge like data sources into standardized entities within the warehouse, there are two choices:
Traditional ETL Architecture:
Create individual ETL processes for each source system (as shown in Figure 3).
Figure 3: Traditional ETL Architecture
A traditional ETL architecture would create one ETL process per source to perform all of the logic necessary to transform that source's data into its target destination. The advantage of this approach is that there are fewer ETL processes to create and maintain. Its hidden shortcoming is that each individual ETL process must redundantly perform many of the same (non source-specific) actions. This duplication of logic creates additional work throughout the life cycle of the data warehouse:

Redundant development and testing, as the same generic logic is built into each per-source process.

Higher maintenance costs, since a change to a shared business rule must be applied to every process.

A risk of inconsistency, as duplicated logic tends to drift between processes over time.
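The duplication can be sketched as follows. The "large sale" rule here is a hypothetical generic business rule, invented for illustration; note how each monolithic per-source process carries its own copy of it.

```python
# Hypothetical layouts and rules, for illustration only.

def load_widget_sales_from_x(rows):
    """Monolithic process for system X: source-specific and generic logic mixed."""
    for row in rows:
        sale = {
            "widget_type": "COMMERCIAL" if row["type"] == "C" else "RESIDENTIAL",
            "total_amount": float(row["amt"]),
        }
        # Generic, non source-specific rule duplicated here ...
        sale["large_sale_flag"] = sale["total_amount"] >= 1000
        yield sale

def load_widget_sales_from_y(rows):
    """Monolithic process for system Y: the same generic logic, rebuilt."""
    for row in rows:
        sale = {
            "widget_type": row["widget_kind"].upper(),
            "total_amount": row["total"],
        }
        # ... and again here; any change to the rule must be made twice.
        sale["large_sale_flag"] = sale["total_amount"] >= 1000
        yield sale
```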
Conformed ETL Architecture:
Create individual ETL processes for each source to apply source-specific business rules necessary to conform the source into a generic format (conformed widget sales) and create reusable, generic processes to execute non source-specific business rules and actions (see Figure 4).
Figure 4: Conformed ETL Architecture
The conformed ETL architecture involves creating an intermediary definition of the data entity (the conformed table). This operational data entity contains the standardized format of the converged source systems, clearly defines the boundary between source-specific and reusable generic actions and rules, and serves as the gateway for moving data into the warehouse. Obviously, the disadvantage of this approach is that there are more objects and processes to create and maintain. However, the underlying advantages of this architecture are plentiful:

Non source-specific business rules are built, tested and maintained in one place rather than in every per-source process.

New sources can be added by writing only the source-specific conform process; the generic processes are reused as is.

Processes are modularized, allowing flexibility in ETL scheduling and opportunities for improved performance.

Business rules are applied consistently across all sources, since only one implementation of each generic rule exists.
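Under the same hypothetical widget-sales rules, the conformed split can be sketched like this: small source-specific conform steps produce one standard shape, and the generic rule is written exactly once against that shape.

```python
# Hypothetical layouts and rules, for illustration only.

def conform_x(row):
    """Source-specific: system X uses coded types and string amounts."""
    return {"widget_type": "COMMERCIAL" if row["type"] == "C" else "RESIDENTIAL",
            "total_amount": float(row["amt"])}

def conform_y(row):
    """Source-specific: system Y spells types out and stores numeric amounts."""
    return {"widget_type": row["widget_kind"].upper(),
            "total_amount": row["total"]}

def load_conformed(sales):
    """Non source-specific: applied once to all conformed records."""
    for sale in sales:
        sale["large_sale_flag"] = sale["total_amount"] >= 1000
        yield sale

conformed = [conform_x({"type": "C", "amt": "1500"}),
             conform_y({"widget_kind": "residential", "total": 40.0})]
warehouse_rows = list(load_conformed(conformed))
print([s["large_sale_flag"] for s in warehouse_rows])  # [True, False]
```

Adding a third sales system now means writing only one more small conform function; `load_conformed` and any other generic processes are untouched.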
While the conformed ETL architecture has many benefits, consideration should be given to where to apply the approach, as it is not applicable in all scenarios. Analysis should be performed to decide which sources to conform, ensuring that the extra development overhead will have quantifiable benefits (increased performance, flexibility in ETL scheduling, modularization of processes, etc.). In deciding whether to follow a conformed architecture, be sure to keep in mind the future requirements of the warehouse (additional sources of data, etc.), not just the current state.
Redundancy within a data warehouse is a given - where the data is concerned. By properly identifying and categorizing the business rules and logic to be applied to source data, many processing redundancies can be reduced as well. The conformed ETL architecture allows for the creation of a scalable, extensible framework, which helps to ensure that timely information is consistently delivered to the masses.
Pat Minton is a manager with DiamondCluster International, a premier global management consulting firm that helps leading organizations develop and implement growth strategies, improve operations and capitalize on technology. He has more than 10 years of database architecture experience in the telecommunications, media and transportation industries. He may be reached at email@example.com.
Don Steffen is a cofounder and partner of AmberLeaf Partners, Inc. (formerly BI Solutions, Inc.), a consulting firm dedicated to enabling innovative companies with the information to make critical investment decisions. Steffen has been designing and delivering technical architecture and solutions in the business intelligence and data warehouse industry for more than a decade. He can be reached at firstname.lastname@example.org.