Data Warehousing Lessons Learned:
The Case for an Objective ETL Benchmark
Benchmarks can be a source of marketing hype and spin unless they are carefully defined and audited by independent third parties. Even then, many hazards and risks threaten objectivity. In spite of these risks, industry-standard, independently audited benchmarks, such as those sponsored by the Transaction Processing Performance Council (TPC), have value. Used properly, benchmarks can drive development of new technology features that benefit real-world customers (not just "benchmark specials"), create an economic reference point about cost, set a performance bar at a point in time for a given piece of software and reveal lessons about how to tune products that can be shared with the end-user community. Auditing of technology benchmarks is a subset of objective research and, as such, is the kind of activity likely to continue to be in the news. Business ethics (integrity in messaging, transparency in metrics and accountability in leadership) is arguably the latest post-Enron business and technology trend.
Ascential Software is launching an initiative to create an extract, transform and load (ETL) benchmark that would be objectively defined and audited. This is not a trivial undertaking, and in many ways the odds are against it. Such an undertaking faces a number of challenges, including the need for an objective audit, the need to resist the lure of "benchmarketing" and the inevitable task of making inferences from laboratory work to real-world experiences in client data centers. Nevertheless, because of the latest flap about integrity (an arguably objective need in the market) and because one of the vendors has stolen a march on the other by acquiring parallel technology, now is an opportune time to make sense of conflicting performance claims. The ultimate issue is whether the dynamics of unenlightened self-interest can be contained by professionalism and objectivity.
Using the customer and product data model defined by the TPC-H benchmark, Ascential Software reported an unaudited benchmark in which it moved 1TB of data through its DataStage XE Parallel Extender and into the TPC-H data model in 1.4 hours.
Ascential's next step was to execute the ETL processes with actual data transformation. Ascential extracted 1TB of data including mixed binary and character data types, requiring complex data conversion and transformation. Each row was 534 bytes wide with 97 columns of real-life data including customer data, integers, packed decimals and strings. Advanced transforms were applied on 68 percent of the 97 columns. The execution environment consisted of a 24-processor IBM p680 machine, an enterprise storage array with 96GB of memory and IBM DB2 EEE. The ETL tool used was DataStage XE Parallel Extender. The result: At an IBM development center, 1TB of data was transformed and loaded in eight hours and 43 minutes. I disagree with the assertion that this can be automatically scaled to run in slightly more than three hours on a 64-way machine. Such an operation might be feasible; no one is saying it is impossible. What is being said is that it has not yet been proven. As a general rule, extrapolating results across such processor configurations is (or ought to be) disallowed by standard benchmark rules. It is "benchmarketing." Ascential is to be commended for posting its results publicly; however, the result is still far short of an objective, industry-standard benchmark, which no vendor can create alone. If the hardware and configuration differ from test to test or case to case, then the test measures not only the ETL technology, but also the entire technology stack. This reinforces the importance of having an audit performed by an objective third party.
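The arithmetic behind the 64-way projection can be made explicit. The following Python sketch first assumes perfectly linear scaling, which reproduces the "slightly more than three hours" figure, and then applies Amdahl's law to show how the projection changes if some of the work cannot be parallelized. The function names and the serial fraction values are illustrative assumptions; nothing here was measured in Ascential's test.

```python
# Measured result reported in the article:
# 1TB transformed and loaded in 8 hours 43 minutes on 24 processors.
MEASURED_HOURS = 8 + 43 / 60
MEASURED_CPUS = 24
TARGET_CPUS = 64


def linear_projection(hours, cpus_from, cpus_to):
    """Projected run time if the work divides perfectly across processors."""
    return hours * cpus_from / cpus_to


def amdahl_projection(hours, cpus_from, cpus_to, serial_fraction):
    """Projected run time under Amdahl's law.

    serial_fraction is the share of the single-processor run time that
    cannot be parallelized. It is a hypothetical input, not a measurement.
    """
    def relative_time(cpus):
        # Run time on `cpus` processors, with single-processor time set to 1.
        return serial_fraction + (1 - serial_fraction) / cpus

    return hours * relative_time(cpus_to) / relative_time(cpus_from)


if __name__ == "__main__":
    linear = linear_projection(MEASURED_HOURS, MEASURED_CPUS, TARGET_CPUS)
    print(f"linear scaling: {linear:.2f} hours")
    for fraction in (0.01, 0.05):  # assumed values, for illustration only
        hours = amdahl_projection(
            MEASURED_HOURS, MEASURED_CPUS, TARGET_CPUS, fraction
        )
        print(f"serial fraction {fraction:.0%}: {hours:.2f} hours")
```

Under the linear assumption the projection is about 3.27 hours. With even a 1 percent serial fraction it rises to roughly 4.3 hours, which is why an extrapolated figure is no substitute for an audited run on the target configuration.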
In conversation with the author, Ascential states it is participating in the TPC meetings and working toward the creation of standards that will enable ETL results to be audited. This is not a trivial undertaking because the current TPC-H and TPC-R benchmarks are designed to highlight the execution of a set of SQL inquiries against a specific data model at various volume points. This is not applicable in any obvious way to the execution of ETL tools that operate as transformation engines. However, the TPC-H data model is applicable. Given the importance of the relational database and underlying hardware servers to the process, the TPC organizational framework may also be relevant. Readers are cautioned not to look for results soon. However, it might be feasible, given the usual amount of hard work, to include a testing template that contains basic and useful data transformations (such as character to integer) and operations such as aggregation. Indeed, aggregators, look-ups, joiners and sorting are resource-intensive operations and susceptible to rigorous, objective definition. This is likely to be controversial as the various vendors lobby to have what are perceived to be their strong suits accommodated. "Benchmark specials" will have to be identified and avoided (i.e., ruled illegal) in the context of ETL. Doing so would facilitate the improved self-knowledge of the technology and software providers. This is typically one of the benefits of benchmarking in the positive sense: to drive technical innovations that actually improve real-world performance.
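To make the idea of a testing template concrete, here is a minimal Python sketch of two of the operations mentioned above: a character-to-integer conversion and an aggregation, wrapped in a simple timer. Every name in it (char_to_int, aggregate_by_key, run_timed, the sample field names) is hypothetical. It is not drawn from any TPC specification or from DataStage, and a real benchmark would define the data, volumes and rules far more rigorously.

```python
import time
from collections import defaultdict


def char_to_int(value):
    """Convert a character field such as ' 0042 ' to an integer."""
    return int(value.strip())


def aggregate_by_key(rows, key_field, amount_field):
    """Sum the converted amount field for each distinct key value."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_field]] += char_to_int(row[amount_field])
    return dict(totals)


def run_timed(operation, *args):
    """Run one template operation and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = operation(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Hypothetical sample rows; quantities arrive as padded character data.
    sample_rows = [
        {"customer": "A", "quantity": " 0010"},
        {"customer": "B", "quantity": "7 "},
        {"customer": "A", "quantity": "005"},
    ]
    totals, seconds = run_timed(
        aggregate_by_key, sample_rows, "customer", "quantity"
    )
    print(totals, f"{seconds:.6f}s")
```

Defining each operation as a small, separately timed unit is one way to keep a template auditable: the same inputs and the same operation definitions can be rerun by a third party, independent of any one vendor's tooling.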
Lou Agosta is the lead industry analyst at Forrester Research, Inc. in data warehousing, data quality and predictive analytics (data mining), and the author of The Essential Guide to Data Warehousing (Prentice Hall PTR, 2000). Please send comments or questions to firstname.lastname@example.org.