Data Integration:
The Alphabet Soup of Data Integration

By Greg Mancuso and Al Moreno
Column published in DMReview.com, December 9, 2004

Do you feel like you have fallen into a large bowl of alphabet soup when talking about data integration? Are you totally frustrated trying to keep up with all the acronyms? You are not alone! The world of IT tools has been expanding to meet the ever-increasing needs of the global business environment. When business was simple, there was little to worry about other than the operational systems. In the prehistoric world of IT, the operational systems required to run the business were typically from the same vendor and, in many cases, were modules, such as G/L, payables, HR and sales, which allowed data to be tied together relatively easily. In the many cases where standard modules were not available, the IT organization developed custom applications with data keys and structures based on the purchased applications. The demands for information were easily satisfied by canned reports.

Now, enter the modern world of conglomerates. They have sales tracking modules from one vendor, accounting modules from another and a separate HR system. And, just for good measure, throw in ERP and CRM systems. The new dashboards and scorecards the top executives are asking for require a broad cross-section of data, which is scattered throughout every system in the company. Explore the new world of extraction, transformation and load (ETL), enterprise application integration (EAI) and enterprise information integration (EII) solutions. Over the course of the next few months, this column will concentrate on deciphering the ins and outs of the current integration strategies, focusing on theories and functionality rather than on specific products.

This month, the discussion will focus on the world of ETL. This is a term that we all use, even the business users, but seldom does everyone use it to mean the same processes, technologies or functions. In the traditional sense, ETL - extraction, transformation and load - is an acronym associated with a variety of processes. ETL processes and technology arose from the business need to draw information from disparate data systems and either move it to a central repository or create load files that feed a reporting system. Figure 1 depicts the traditional ETL deployment as it was originally conceived.


Figure 1: Traditional ETL Process

During the extraction process, the data needed is identified and mapped to existing sources, such as the financials, CRM and ERP systems, and one or more files are created containing the extracted data values. The data is extracted from the sources and placed into flat files or temporary database tables in a staging area. These files and/or tables are then fed into a set of processes that perform the transformation activities, such as data validation, creation of aggregated or derived values, and other transformations, such as converting Y to 1 and N to 2, key replacement and address standardization. Load files containing valid information are created during the transformation process, regardless of whether that process uses flat files or temporary database tables. Because the transformation process has already performed full validation, the receiving systems perform little, if any, validation, and the target databases' native bulk loaders are the most commonly selected method of insertion. All steps of the transformation and load processes are optimized to run concurrently, unless there are specific referential integrity or ordering issues. It should also be noted that, as a rule, data runs in one direction: from sources, to transform, to targets. Figure 2 depicts a non-traditional or alternative methodology that has evolved over the last few years.
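Before turning to Figure 2, here is a rough illustration of the traditional flow just described, as a minimal Python sketch. The source and target tables, the staging file and the Y/N-to-1/2 rule are hypothetical, and a real deployment would rely on a vendor tool and the target database's native bulk loader rather than row-at-a-time inserts.

    # Minimal sketch of the traditional ETL flow (hypothetical schema and rules).
    import csv, sqlite3, tempfile, os

    # --- Extract: pull the needed columns from a source system into a staging file.
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE crm_customers (cust_id INTEGER, name TEXT, active TEXT)")
    source.executemany("INSERT INTO crm_customers VALUES (?, ?, ?)",
                       [(1, "Acme Corp", "Y"), (2, "Globex", "N")])

    staging_path = os.path.join(tempfile.gettempdir(), "stage_customers.csv")
    with open(staging_path, "w", newline="") as f:
        csv.writer(f).writerows(
            source.execute("SELECT cust_id, name, active FROM crm_customers"))

    # --- Transform: validate rows and apply simple conversions (e.g., Y/N -> 1/2).
    def transform(row):
        cust_id, name, active = row
        if not name:                      # basic validation: reject rows with no name
            return None
        return (int(cust_id), name.strip(), 1 if active == "Y" else 2)

    with open(staging_path, newline="") as f:
        load_rows = [t for t in (transform(r) for r in csv.reader(f)) if t]

    # --- Load: insert the validated rows into the target; a real target would
    # use its native bulk loader instead of row-at-a-time inserts.
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE dim_customer (cust_id INTEGER, name TEXT, active_flag INTEGER)")
    target.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", load_rows)
    print(target.execute("SELECT * FROM dim_customer").fetchall())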


Figure 2: Non-Traditional ETL Process

Non-traditional ETL extends the ETL process by adding business metadata and reference data management. The reference data includes hierarchy definitions, business rule management and data flow control information, which the ETL process uses to optimize each of its steps. The reference data manager allows one set of master hierarchies and data mappings to be set up by the organization. These hierarchies and mappings are then validated by all relevant parties and may be pushed out to all data sources and targets. For example, a company may define its G/L and organization hierarchies in a reference data manager and push them back to its Oracle Financials application (source) and G/L OLAP cube (target) to ensure all applications are looking at the same data in the same structures. The hierarchies and mappings may also be fed into the ETL tool, which uses them to optimize the data extract from the sources and the mapping to the target data repositories. In many cases, the transformation server performs the cleansing and transformations just as it does in the traditional ETL process. In other cases, the source and destination systems are fed the details of any required validations and/or transformations, and the most appropriate system performs those functions prior to the data insertion. While it is possible for the transformation server to create load files, the more common practice is for it to use the database server's bulk insertion utility and stream the load data directly to the database.
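The reference data idea can be sketched in a few lines, assuming a hypothetical master G/L mapping held in one place: the same structure is pushed to source and target systems and also drives the transform step, so every system sees identical hierarchies. The account codes, system names and function shapes below are illustrative, not any particular vendor's API.

    # Illustrative sketch: one master mapping/hierarchy drives source, target and ETL.
    # All names (accounts, rollup nodes, systems) are hypothetical.

    # Master reference data: raw G/L account -> (standard account, rollup node).
    MASTER_GL_MAP = {
        "4000": ("Revenue",  "Income Statement/Revenue"),
        "5000": ("COGS",     "Income Statement/Cost of Sales"),
        "6100": ("Salaries", "Income Statement/Operating Expense"),
    }

    def push_to_system(system_name, mapping):
        """Stand-in for publishing the validated hierarchy to a source or target
        (e.g., an Oracle Financials source or a G/L OLAP cube target)."""
        print(f"pushed {len(mapping)} mappings to {system_name}")

    def transform_with_reference(rows, mapping):
        """Use the same master mapping inside the ETL transform step."""
        out = []
        for account, amount in rows:
            if account not in mapping:      # unmapped accounts are rejected
                continue
            std_account, rollup = mapping[account]
            out.append((std_account, rollup, amount))
        return out

    push_to_system("oracle_financials_source", MASTER_GL_MAP)
    push_to_system("gl_olap_cube_target", MASTER_GL_MAP)
    print(transform_with_reference([("4000", 1200.0), ("6100", 300.0), ("9999", 5.0)],
                                   MASTER_GL_MAP))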

The decision to select the traditional or non-traditional methodology is a function of the environment and the level of complexity required by the ETL process. Vendors of traditional tools have created very sophisticated user interfaces and scripting languages that provide all the tools necessary to perform the extractions, edits and transformations that are most commonly needed. They work well when there is a 1:1 correspondence between the number of systems and the number of extractions. They begin to suffer performance degradation when extracts must be performed from disparate systems and require heavy consistency and standardization processing and business rule enforcement. For example, creating a set of load files that combine information from several sources is significantly more involved than creating a load file based on information from a single source database. The information is extracted into temporary structures - files or tables - and the business rules are applied to combine the data into a consolidated output format, which may comprise multiple load files. When the time and effort required by the ETL process' business rules become overwhelming, alternatives, such as non-traditional ETL methods, should be considered.
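To make the multi-source point concrete, the small sketch below merges customer records from two hypothetical source extracts and applies a single business rule (the billing system's address wins) before writing a consolidated load file. With more sources and more rules, this consolidation logic is what grows quickly.

    # Sketch: consolidating two hypothetical source extracts into one load file.
    import csv, io

    crm_extract = [
        {"cust_id": "1", "name": "Acme Corp", "address": "1 Old Rd"},
        {"cust_id": "2", "name": "Globex",    "address": "9 Elm St"},
    ]
    billing_extract = [
        {"cust_id": "1", "address": "100 New Blvd"},  # billing is the trusted address source
    ]

    def consolidate(crm_rows, billing_rows):
        billing_by_id = {r["cust_id"]: r for r in billing_rows}
        for row in crm_rows:
            merged = dict(row)
            # Business rule: when both systems carry an address, billing wins.
            if row["cust_id"] in billing_by_id:
                merged["address"] = billing_by_id[row["cust_id"]]["address"]
            yield merged

    # Write the consolidated load file (an in-memory CSV here for illustration).
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["cust_id", "name", "address"])
    writer.writeheader()
    writer.writerows(consolidate(crm_extract, billing_extract))
    print(buf.getvalue())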

Non-traditional ETL was developed to reduce this level of complexity; however, it creates other problems. As long as referential integrity issues and the complexity of the required transformations from source(s) to target(s) are minimal, this method will likely move data faster and will reduce the number of complex processes that must run to prepare the data for bulk inserts. However, because many of the products that use the non-traditional method do so by generating complex SQL, the performance of this method will degrade in proportion to the complexity of the resulting SQL.
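The SQL-generation point can be seen in a toy example. The sketch below assembles a single INSERT ... SELECT that pushes the join and the Y/N conversion down to the database; the tables, columns and rules are hypothetical, and the point is simply that each additional source or rule makes the generated statement, and therefore the database's work, larger.

    # Sketch: non-traditional tools often emit one large SQL statement that the
    # database executes directly; table names and rules here are hypothetical.

    def generate_load_sql(target, source_tables, join_key, rules):
        """Build an INSERT ... SELECT whose size grows with the number of
        sources and transformation rules."""
        select_cols = ",\n       ".join(rules)
        first = source_tables[0]
        joins = ""
        for t in source_tables[1:]:
            joins += f"\nJOIN {t} ON {t}.{join_key} = {first}.{join_key}"
        return (f"INSERT INTO {target}\n"
                f"SELECT {select_cols}\n"
                f"FROM {first}{joins};")

    sql = generate_load_sql(
        target="dim_customer",
        source_tables=["crm_customers", "billing_accounts"],
        join_key="cust_id",
        rules=[
            "crm_customers.cust_id",
            "crm_customers.name",
            "CASE WHEN crm_customers.active = 'Y' THEN 1 ELSE 2 END AS active_flag",
            "billing_accounts.address",
        ],
    )
    print(sql)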

It therefore becomes incumbent on users to fully understand their ETL needs and to select the tools and methods most appropriate for their particular circumstances. While vendors of both methods claim high performance and product superiority, the appropriate toolset is the one that will actually work in a given situation. During the vendor selection process, ask plenty of questions and use proofs of concept from the vendors to identify the best-fitting tool(s). It is safe to say that the only absolute in data integration is that using a purchased ETL tool is almost always superior to writing and maintaining one in-house. As to which method should be used, the best rule of thumb is to use whichever is most appropriate for your environment.


Greg Mancuso and Al Moreno are principals with Sinecon, a business intelligence consultancy specializing in data integration and BI/DW solution architecture design. Together they have more than 29 years of data warehouse and business intelligence experience and have implemented many large-scale solutions in both the U.S. and European markets. They may be reached at gmancuso@sinecon-llc.com or amoreno@sinecon-llc.com.
