Portals eNewsletters Web Seminars dataWarehouse.com DM Review Magazine
DM Review | Covering Business Intelligence, Integration & Analytics
   Covering Business Intelligence, Integration & Analytics Advanced Search
advertisement

RESOURCE PORTALS
View all Portals

WEB SEMINARS
Scheduled Events

RESEARCH VAULT
White Paper Library
Research Papers

CAREERZONE
View Job Listings
Post a job

Advertisement

INFORMATION CENTER
DM Review Home
Newsletters
Current Magazine Issue
Magazine Archives
Online Columnists
Ask the Experts
Industry News
Search DM Review

GENERAL RESOURCES
Bookstore
Buyer's Guide
Glossary
Industry Events Calendar
Monthly Product Guides
Software Demo Lab
Vendor Listings

DM REVIEW
About Us
Press Releases
Awards
Advertising/Media Kit
Reprints
Magazine Subscriptions
Editorial Calendar
Contact Us
Customer Service

Consultant’s Corner:
Getting Technical – ETL Selection Criteria, Part 1

online columnist Dan Meers     Column published in DMReview.com
June 20, 2002
 
  By Dan Meers

I am frequently asked by vendors, clients or consulting partners to aid in the selection, testing and adoption of extraction, transformation and load (ETL) tools. I have stopped providing product reviews and related work for vendors (as of 12/31/2001) and am under no pressure to suggest anything other than my own preferences. Our team uses a variety of tools under diverse circumstances. I try very hard to focus on technical and business criteria, not product preferences when asked for my help. I also try to shape customer expectations so that they can rapidly discern the best possible candidates from an over-crowded field. This article outlines categories of ETL tools and criteria for consideration in their selection.

ETL Tool Categories

There are many ways to categorize ETL tools, price, complexity, breadth of product line and so on. It is important to separate these tools based on their primary platform support and methodology. That is because specialized platforms (mainframe, application servers) require specialized products. The breadth of solutions provided by ETL products is a good way to discern basic product categories. These start at the point solutions and grow to enterprise class applications. Point solutions are designated by their ability to support complexity and volume, not by their price. Enterprise class applications are very broad in terms of both features and data volume capacity. The purpose, features and pricing of enterprise class products separate discrete tools from overall solution sets.

First Generation ETL Tools

ETL has grown from a simple data-mart- in-a-box function to a robust data integration and information construction tool. First generation tools were originally focused on direct-to-mart strategies. These attempted to solve problems of data integration for business intelligence (BI) delivery by way of the dedicated data mart. These were often called federated or so-called independent data marts. They were meant to be freestanding data stores built using dimensional models (or no models) with direct (hard-coded) connections to the source applications. The emphasis was on the extraction of data into some (typically custom or proprietary) form of dimensional model. This is due to the comparative ease of navigation provided by dimensional data models (which require a certain amount of selectivity and focus).

Second Generation ETL Tools

The introduction of data warehousing by Bill Inmon and others provided the path to the second generation of ETL tools. These are focused on a broad array of targets including data warehouses, data marts, operational data stores, data mining and exploration warehouses and analytic applications. The flexibility of these second generation tools provided far superior returns to their owners since the rapid proliferation of federated data marts was already proving to be a massive failure. These independent data marts were impossible to conform to one another (or anything else) so that there was never a single version of the truth. The need for a base data warehouse using a more relational model to provide the foundation for all manner of dimensional analysis has since become the accepted method of data integration. Inconsistent, incomparable and incomprehensible results are hardly the stuff of ROI.

ETL Selection Criteria: Primary Processing Platforms

I start with this issue because many IT shops have specific processing platform requirements that must be respected. There are fewer and fewer specialized platforms running proprietary operating systems, but those that remain are important to their owners and users. The mainframe (IBM 390 and others) is the best known and most pervasive of these specialized platforms. Please remember these tools are mentioned for customers who require native mainframe execution. Code generators can provide COBOL and JCL to allow ETL work to be completed entirely on the mainframe or largely on the mainframe with an external load to non-mainframe systems.

The other major segment of the specialized computing market is the iSeries of eServers (formerly known as AS/400, System3x). There are more than 250,000 iSeries customers throughout the world who rely on these application servers for some or all of their data processing requirements. IBM has continually developed this product line to provide massive scalability and a broad array of IBM and partner software applications.

Many iSeries customers have a high reliance on this technology and a vested interest in the addition of data warehousing success without the addition of new platforms and issues. IBM has provided a path to this extended success using partner products specifically designed to run on the iSeries and support IBM DB2 for the iSeries directly.

Key criteria for iSeries native ETL tools include:

  • Native access (R/W) to DB2 iSeries (bit-mapped indexes, etc.)
  • Complexity of transforms supported (relational and dimensional)
  • Ease of user interface - (graphic, no code environment)
  • Distributed architecture (thin client design)
  • Meta data repository (relational)
  • Scalability - utilizes iSeries scalability
  • Training and support services
  • Installed base - scope of usage, complexity, ROI
  • IBM partnership status

The bottom line for mainframe and iSeries customers is that their processing platforms can support data warehousing success if tools specific to that environment are employed. Customers that require complex sourcing from these platforms to more open targets (Wintel, UNIX) may still want to consider these native applications for iSeries and mainframe, since much of their application knowledge and expertise is tied to these platforms and their respective operational systems.

Selection Criteria: Scope of Usage

Point solutions abound in the world of ETL tools. Unfortunately, many of them come with an enterprise-class price tag. What is the distinction between point and enterprise solutions? Generally, the use of an ETL tool across projects and time is what distinguishes enterprise class solutions. Many ETL tools are used for projects and maintained for many years without additional adoption within the enterprise. This is not always due to product limitations. The ETL products may be quite adequate for additional usage, but the tool vendor or system integrator fail to service the customer relationship. Many software vendors are known for their low-hanging-fruit approach to quick sales. They often abandon their best prospects in search of the next quick hit. Don't hesitate to demand ongoing attention from your vendor of choice. They will come to understand your "lifetime" value. They may actually learn a little about your use of their product and the improvements they should make. Your willingness to share a positive purchase and implementation experience will go a long way towards establishing an open conduit for ongoing communications.

Driving enterprise adoption requires a broader array of features and benefits than point solutions typically provide. There are two basic drivers for enterprise adoption of ETL tools. The first is the use of an independent engine to actually execute the ETL process. Many tools use the underlying database (RDBMS) engine of the target to complete their work. This is fine until you generate significant complexity and volume. The other factor is the variety and complexity of source systems. More complexity requires more robust solutions. This is particularly true of complex sources from nonrelational environments. Mainframe derived flat files, third-party data and other complex sources require much more transformation work to integrate into the data warehouse.

Enterprise utilization also requires features that support shared meta data, process management and control, and managed scalability. Enterprise use implies some level of distribution of sources, ETL processing and targets. It also requires support for the end users of integrated information via business intelligence tools. ETL becomes the clearinghouse for all manner of transactional data and must support some exposure of this process to distributed constituents. Managed scalability must provide for an orderly, cost effective means of increasing data sources, targets, volumes, varieties and frequencies without compelling customers to reengineer the library of ETL jobs they have amassed.

A note about the data warehouse: our approach is based upon Bill Inmon's groundbreaking work in the area. We support the iterative methodology for the design and construction of the data warehouse. Industry logical data models provide the foundation for the enterprise logical and physical models. The data warehouse is built iteratively by subject area. Each new subject area requires some integration to the existing subject areas, hence the use of relational (ER) modeling techniques. Dimensional models (stars) also have a place in the warehouse. We do not support the use of relational- or dimensional-only modeling techniques as both provide value and can be combined. The data warehouse is populated via time-based additions, or snapshots, that are moment in time updates. Each snapshot is serialized to provide longitudinal integrity. These are our guiding design principles, yours may be different. Our review of ETL tools and the criteria for their selection are the direct result of our approach. If you use a markedly different approach then adjust our criteria accordingly.

Issues for consideration in scope of usage:

  • Code generator for specialized processing platform
    1. Full procedural and job control generation and management
    2. Full utilization of processing environment resources
    3. Distributed design and meta data usage
  • Use of ETL engine or code generator relying on RDBMS
    1. Independent, server-based engine (best)
    2. SQL, VB or other code generator for execution by OS or RDBMS
  • Robust design interface that supports complex transformation design and testing
    1. Ease of use (apparent)
    2. Range of use (when does the GUI stop and the code start)
    3. Effect of design methodology (best)
  • No forced sub-setting of data with repeated execution to cycle thru data
  • Unlimited use of direct to memory, tokenization or hashing options
  • Unlimited portability of design elements (objects)
  • Meta data management and control
    1. Serialization and version control
    2. Exposure via relational models for users and other tools
    3. Some form of damage control to predict the impact of change.
  • Process Management and Control
    1. Check pointing of jobs in design AND execution
    2. Versioning of job modules or components (OO)
    3. Remote monitoring and control of processes
    4. Distributed process mechanics (servers and distance)
  • Managed Scalability
    1. Growth of sources, targets, volumes, and varieties at controlled cost
    2. Reasonable support for platform growth and additions (processors and boxes)
  • Enterprise Feature Sets
    1. Data quality - contents, standards, matching and beyond
    2. Real-time support - messaging, EAI and other transports
    3. MPP support - optional use of massively parallel environments
    4. Advanced source mapping and discovery - ERP, CRM, other
...............................................................................

For more information on related topics visit the following related portals...
ETL.

Dan Meers has spent almost twenty years integrating data. He has co- authored two books and several articles and presentations for industry publications and portals. Meers is president of The Inmon Meers Group, a strategic consulting partnership with Bill Inmon focused on supporting "Just In Time BI Delivery" through consulting partnerships. Dan can be reach at: dmeers@inmonmeers.com.

Solutions Marketplace
Provided by IndustryBrains

Data Validation Tools: FREE Trial
Protect against fraud, waste and excess marketing costs by cleaning your customer database of inaccurate, incomplete or undeliverable addresses. Add on phone check, name parsing and geo-coding as needed. FREE trial of Data Quality dev tools here.

Speed Databases 2500% - World's Fastest Storage
Faster databases support more concurrent users and handle more simultaneous transactions. Register for FREE whitepaper, Increase Application Performance With Solid State Disk. Texas Memory Systems - makers of the World's Fastest Storage

Manage Data Center from Virtually Anywhere!
Learn how SecureLinx remote IT management products can quickly and easily give you the ability to securely manage data center equipment (servers, switches, routers, telecom equipment) from anywhere, at any time... even if the network is down.

Design Databases with ER/Studio: Free Trial
ER/Studio delivers next-generation data modeling. Multiple, distinct physical models based on a single logical model give you the tools you need to manage complex database environments and critical metadata in an intuitive user interface.

Free EII Buyer's Guide
Understand EII - Trends. Tech. Apps. Calculate ROI. Download Now.

Click here to advertise in this space


E-mail This Column E-Mail This Column
Printer Friendly Version Printer-Friendly Version
Related Content Related Content
Request Reprints Request Reprints
Advertisement
advertisement
Site Map Terms of Use Privacy Policy
SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.
SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.