Optimal ETL Processing – Changed Data Capture Strategy

Article published in DM Direct Newsletter, October 29, 2004 issue
By Soumendra Mohanty

A good data warehouse model is a hybrid representing the diversity of different components and processes required to acquire, store, package and deliver data.

Expectations from a data warehouse implementation are growing at an exponential rate, even as data volumes keep increasing and acceptable latency keeps shrinking. If we think about it, we really should not be surprised at all! Organizations today demand a high degree of access to accurate and timely information so that decision making is fast and effective. At the same time, the volume of data is growing every single day. In addition, the increasing number of source systems, differing data formats and system complexities directly impact system resources and the ROI of the data warehouse.

One can choose industry-standard ETL tools, design robust and scalable ETL architectures, and budget a considerable amount of time and money for performance enhancements; still, the key to a successful data warehouse implementation lies in defining and implementing an optimal data acquisition strategy.

What is Data Acquisition?


Figure 1

Data acquisition can be defined as a strategy: a set of data-centric processes for assembling data from disparate sources and systems throughout an enterprise, or between enterprises, and enriching that data in a manner that creates valuable and reusable information.

Invariably, the data warehousing vision of the enterprise and the business drivers determine which data to source. At the same time, the requirement-gathering process establishes the business justification for each of the source systems identified and the type of business data required. In this phase, business analysts and functional analysts do not pay much attention to the low-level details of technical data transfer requirements; the focus lies in identifying and getting the right business data into the data warehouse environment. However, data acquisition as a strategy should not be limited only to identifying and getting the right business data into the data warehouse environment; it should also focus on the following:

  • Extract data from the source systems efficiently and effectively;
  • Identify and document service level agreements with the source systems;
  • Provide guidelines on the data transfer mechanism and escalation procedures in case of failures during data transmission;
  • Supplement/enrich the data from the source systems for easy updates and changes as applicable in the data warehouse environment;
  • Provide a framework to permit successful reconciliation strategies at both the source end and the target end (a minimal reconciliation check is sketched after this list).
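The reconciliation point lends itself to simple automation. Below is a minimal sketch, not a prescribed implementation: the table and column names are hypothetical, and Python's built-in sqlite3 module stands in for real source and warehouse connections. It compares a row count and a column checksum at both ends.

import sqlite3

def reconcile(source_conn, target_conn, table, amount_col):
    """Compare row counts and a simple sum checksum between source and target."""
    checks = {}
    for name, conn in (("source", source_conn), ("target", target_conn)):
        checks[name] = conn.execute(
            f"SELECT COUNT(*), COALESCE(SUM({amount_col}), 0) FROM {table}"
        ).fetchone()
    matched = checks["source"] == checks["target"]
    return matched, checks

# Hypothetical demonstration with in-memory databases standing in for
# the operational source and the warehouse staging area.
if __name__ == "__main__":
    src, tgt = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    for conn in (src, tgt):
        conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])
    ok, details = reconcile(src, tgt, "orders", "amount")
    print("reconciled" if ok else f"mismatch: {details}")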

Having said that, let us take a closer look at the operational challenges that we face in the data warehouse environment:

  • Data volume growth - Over the years, the size of the source data grows, resulting in a shrinking load window.
  • Business drivers change - Business focus changes, demanding more in-time data and resulting in a renewed focus on reducing latency.
  • Cost containment - Competitive factors pose a severe challenge to infrastructure cost. Once the data warehouse has become operational, business focus shifts to analytics and reporting.

Most of these challenges are related to the size of the data, the data refreshment policy and the data retention policy.

Data Capture Strategy and CDC

Once the data warehouse is built, attention shifts from the building of the data warehouse to its day-to-day operations. Inevitably, the discovery is made that the cost of operating and maintaining a data warehouse is high, and the volume of data in the warehouse is growing faster than anyone had predicted. The widespread and unpredictable usage of the data warehouse by the end-user DSS analyst causes contention on the server managing the warehouse. Yet the largest unexpected expense associated with the operation of the data warehouse is the periodic refreshment of data. What starts as an almost incidental expense quickly turns very significant.

Data Capture Options

Traditional data capture methods have relied heavily on the following options:

Extract all the data to flat files - These extracts usually come from legacy source systems, where a batch job extracts all the data and puts it in a COBOL copybook-like structure (a minimal full-extract sketch follows the list of drawbacks below).

  • Cumbersome process
  • Prone to errors, hence confidence in data quality is low
  • Results in large OS files
  • Data transfer time is high and demands robust network infrastructure and bandwidth
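For illustration only, here is a minimal sketch of such a full extract. The table name, the pipe-delimited layout and the use of Python's built-in sqlite3 and csv modules are assumptions standing in for a real legacy source and its fixed extract format.

import csv
import sqlite3

def full_extract(conn, table, out_path):
    """Dump every row of a source table to a delimited flat file (full extract)."""
    cur = conn.execute(f"SELECT * FROM {table}")
    headers = [col[0] for col in cur.description]
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="|")
        writer.writerow(headers)
        writer.writerows(cur)  # every row is written, changed or not

if __name__ == "__main__":
    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
    src.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Acme"), (2, "Beta")])
    full_extract(src, "customer", "customer_full_extract.dat")

The comment on the final write is the point: the whole table travels on every run, which is why this approach produces large files and long transfer times.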

Triggers - This option requires that the source system uses a DBMS. Triggers are attached to the business events of interest; they capture the information and record it (a minimal trigger sketch follows the list of drawbacks below).

  • Introduces a point of failure
  • Requires maintaining multiple objects on the source systems
  • Involves a lot of triggers, capture tables, database links and custom code
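A minimal sketch of the idea, assuming hypothetical table and trigger names and using Python's built-in sqlite3 (which supports triggers) purely for illustration: every insert on the operational table also writes a row to a capture table that a later warehouse load can read.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    -- Hypothetical capture table populated by the trigger below.
    CREATE TABLE orders_capture (
        order_id INTEGER, amount REAL,
        change_type TEXT, captured_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- The trigger fires on every business event (here, inserts) and records it.
    CREATE TRIGGER trg_orders_insert AFTER INSERT ON orders
    BEGIN
        INSERT INTO orders_capture (order_id, amount, change_type)
        VALUES (NEW.order_id, NEW.amount, 'I');
    END;
""")

# Normal operational activity; each statement now does double work.
conn.execute("INSERT INTO orders VALUES (1, 99.50)")
conn.execute("INSERT INTO orders VALUES (2, 10.00)")

# The downstream ETL reads only the capture table, not the base table.
print(conn.execute("SELECT order_id, change_type, captured_at FROM orders_capture").fetchall())

The double write on every transaction is exactly the extra workload on the operational system that the drawbacks above refer to.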

Merge-purge option - This option performs extensive verification and reconciliation against the source system data (a sketch of the compare step follows the list below).

  • Extracts the entire set of information
  • Checks whether records already exist in the target system, compares and then processes
  • A large amount of data is extracted every time
  • Very complex comparison code
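A sketch of the compare step, assuming hypothetical customer tables and using sqlite3 in-memory databases for illustration: the full source extract is checked row by row against the target to classify each record as an insert, an update or unchanged.

import sqlite3

def merge_purge_compare(source_conn, target_conn, table):
    """Classify every source row as an insert or update by comparing against
    the target; the first column is assumed to be the primary key."""
    target_rows = {row[0]: row for row in target_conn.execute(f"SELECT * FROM {table}")}
    inserts, updates = [], []
    for row in source_conn.execute(f"SELECT * FROM {table}"):
        if row[0] not in target_rows:
            inserts.append(row)
        elif row != target_rows[row[0]]:
            updates.append(row)
        # identical rows are simply dropped (the "purge" part)
    return inserts, updates

if __name__ == "__main__":
    src, tgt = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    for conn in (src, tgt):
        conn.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
    src.executemany("INSERT INTO customer VALUES (?, ?)",
                    [(1, "Acme"), (2, "Beta Ltd"), (3, "Gamma")])
    tgt.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Acme"), (2, "Beta")])
    print(merge_purge_compare(src, tgt, "customer"))
    # -> inserts: [(3, 'Gamma')], updates: [(2, 'Beta Ltd')]

Even in this tiny sketch, the entire source table is read on every run, and the comparison code only grows more complex as real keys, deletes and data type quirks are added.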

The first step most organizations take in the refreshment of data warehouse data is to read the old legacy databases. For some kinds of processing and under certain circumstances, directly reading the older legacy files is the only way refreshment can be achieved, for instance, when data must be read from different legacy sources to form a single unit that is to go into the data warehouse. In addition, when a transaction has caused the simultaneous update of multiple legacy files, a direct read of the legacy data may be the only way to refresh the warehouse.

As a general-purpose strategy, however, repeated and direct reads of the legacy data are very costly. The expense of direct legacy database reads mounts in two ways. First, the legacy DBMS must be online and active during the read process. The window of opportunity for lengthy sequential processing in the legacy environment is always limited, and stretching that window to refresh the data warehouse is never welcome. Second, the same legacy data is needlessly passed many times. The refreshment scan must process 100 percent of a legacy file when only one or two percent of the legacy file is actually needed. This gross waste of resources occurs each time the refreshment process is done. Because of these inefficiencies, repeatedly and directly reading the legacy data for refreshment is a strategy with limited usefulness and applicability.

A much more appealing approach is to trap the data in the legacy environment as it is being updated. By trapping the data, full table scans of the legacy environment are unnecessary when the data warehouse must be refreshed. In addition, because the data can be trapped as it is being updated, there is no need to have the legacy DBMS online for a long sequential scan. Instead, the trapped data can be processed offline.

Two basic techniques are used to trap data as update is occurring in the legacy operational environment. One technique is called data replication; the other is called change data capture, where the changes that have occurred are pulled out of log or journal tapes created during online update. Each approach has its pros and cons.

Data Replication

Replication requires that the data to be trapped be identified prior to the update. Then, as the update occurs, the data is trapped: a trigger is set that causes the update activity to be captured. One advantage of replication is that the process of trapping can be selectively controlled; only the data that needs to be captured is, in fact, captured. Another advantage is that the format of the data is "clean" and well defined: the content and structure of the trapped data are well documented and readily understandable to the programmer. The disadvantages of replication are that extra I/O is incurred as a result of trapping the data, and that, because of the unstable, ever-changing nature of the data warehouse, the system requires constant attention to the definition of the parameters and triggers that control trapping. The amount of I/O required is usually nontrivial. Furthermore, the I/O that is consumed is taken out of the middle of the high-performance day, at the time when the system can least afford it.

Changed Data Capture


Figure 2

The second approach to efficient refreshment is changed data capture (CDC). One approach to CDC is to use the log tape to capture and identify the changes that have occurred throughout the online day. In this approach, the log or journal tape is read. Reading a log tape is no small matter, however. Many obstacles are in the way, including the following:

  • The log tape contains much extraneous data.
  • The log tape format is often arcane.
  • The log tape contains spanned records.
  • The log tape often contains addresses instead of data values.
  • The log tape reflects the idiosyncrasies of the DBMS and varies widely from one DBMS to another.

The main obstacle in CDC, then, is that of reading and making sense out of the log tape. But once that obstacle is passed, there are some very attractive benefits to using the log for data warehouse refreshment. The first advantage is efficiency. Unlike replication processing, log tape processing requires no extra I/O. The log tape will be written regardless of whether it will be used for data warehouse refreshment. Therefore, no incremental I/O is necessary. The second advantage is that the log tape captures all update processing. There is no need to go back and redefine parameters when a change is made to the data warehouse or the legacy systems environment. The log tape is as basic and stable as you can get.
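The log format itself is vendor specific, so the following sketch assumes a hypothetical, already-decoded journal format (one JSON record per line) just to illustrate the filtering step: most log entries are extraneous, and only committed changes to tables of interest survive into the refreshment stream.

import json

TABLES_OF_INTEREST = {"orders", "customer"}  # hypothetical warehouse source tables

def changes_from_log(lines):
    """Filter a (hypothetical, already-decoded) journal down to the committed
    inserts, updates and deletes on tables the warehouse actually cares about."""
    for line in lines:
        entry = json.loads(line)
        if entry.get("op") not in ("insert", "update", "delete"):
            continue  # checkpoints, undo records and the like are extraneous
        if entry.get("table") not in TABLES_OF_INTEREST:
            continue  # changes to irrelevant tables are dropped
        if not entry.get("committed", False):
            continue  # uncommitted work never reaches the warehouse
        yield entry

sample_log = [
    '{"op": "checkpoint", "lsn": 101}',
    '{"op": "update", "table": "orders", "key": 7, "after": {"amount": 12.5}, "committed": true}',
    '{"op": "insert", "table": "audit_trail", "key": 9, "committed": true}',
]
print(list(changes_from_log(sample_log)))  # only the orders update survives

In practice the hard part is the decoding that this sketch takes for granted: spanned records, physical addresses and DBMS-specific idiosyncrasies all have to be resolved before filtering can even begin.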

There is a second approach to CDC: lift the changed data out of the DBMS buffers as change occurs. In this approach the change is reflected immediately. So reading a log tape becomes unnecessary, and there is a time-savings from the moment a change occurs to when it is reflected in the warehouse. However, because more online resources are required, including system software sensitive to changes, there is a performance impact. Still, this direct buffer approach can handle large amounts of processing at a very high speed.

Once the initial load of the data has been completed in the ODS and data warehouse, you may now choose to use delta or change data mechanisms to capture only the data that has changed since the last extract process. A changed data capture process deals with data volumes that are significantly smaller than the entire set of data in the operational database. This mechanism of capture is useful for the ODS where data currency is critical and for the data warehouse where data volumes are enormous.
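A minimal sketch of such a delta extract, assuming a hypothetical last_updated timestamp column on the source table and a high-water mark persisted from the previous run (sqlite3 used only for illustration).

import sqlite3

def delta_extract(conn, table, last_watermark):
    """Pull only the rows changed since the previous extract and return the
    new high-water mark to persist for the next run."""
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE last_updated > ?", (last_watermark,)
    ).fetchall()
    new_mark = conn.execute(
        f"SELECT COALESCE(MAX(last_updated), ?) FROM {table}", (last_watermark,)
    ).fetchone()[0]
    return rows, new_mark

if __name__ == "__main__":
    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, last_updated TEXT)")
    src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
        (1, 10.0, "2004-10-01 08:00:00"),
        (2, 25.5, "2004-10-02 09:30:00"),
    ])
    changed, mark = delta_extract(src, "orders", "2004-10-01 12:00:00")
    print(changed, mark)  # only order 2 is newer than the stored watermark

Only the changed rows travel, which is what makes the approach attractive for the ODS (currency) and the data warehouse (volume).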

Five methods for capturing data changes are available:

  1. Use source timestamps. Advances in operating systems and RDBMS functionality have made it possible, to a large extent, to identify record creation and updates down to the nanosecond range. Source time-stamped records can serve as a powerful platform to facilitate CDC.
  2. Read the DBMS log. Technology advances have made it possible to capture the history of all the changes occurring within a DBMS. The changes are brought into the staging area and matched up with partially constructed records to complete them, ready for the next integration process.
  3. Compare before and after images of the database. To make this process easier, it is recommended that you use the backup files (image copies) and a compare utility; comparison may be your only choice for older batch systems. It can be a slow process to identify the records that have changed since the last compare. Once identified, the records or fields are extracted and put into a file for further integration (a sketch of this compare is shown after this list).
  4. Create snapshots of the operational systems and use these in the capture process. This usually requires a complete reload of the tables and works best in the ODS. If used for the data warehouse, then all history must be rebuilt.
  5. Use database triggers from the operational system. This mechanism requires that the operational source has a relational DBMS that saves inserted, updated or deleted records of interest. Unfortunately, this also creates twice the workload on the operational system. For example, upon the insertion of a new record, the database fires a trigger not only to insert the record into its own database, but also to update another table for our use in data capture.
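As referenced in method 3 above, here is a sketch of a before/after image compare. The snapshot contents are hypothetical; each image copy is represented simply as a dictionary keyed by primary key.

def compare_images(before, after):
    """Diff two snapshot images (dicts keyed by primary key) into the inserts,
    updates and deletes that occurred between image copies."""
    inserts = {k: v for k, v in after.items() if k not in before}
    deletes = {k: v for k, v in before.items() if k not in after}
    updates = {k: after[k] for k in after if k in before and after[k] != before[k]}
    return inserts, updates, deletes

# Hypothetical image copies of a customer table, keyed by customer id.
before_image = {1: ("Acme", "NY"), 2: ("Beta", "CA")}
after_image = {1: ("Acme", "NJ"), 3: ("Gamma", "TX")}
print(compare_images(before_image, after_image))
# -> an insert for key 3, an update for key 1, a delete for key 2

The slowness the method warns about follows directly from this shape: both full images must be held and walked to find what are often only a handful of changes.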

Usually the CDC option operates on a publish-subscribe methodology.


Figure 3: Functional Flow - Publish


Figure 4: Functional Flow - Subscribe
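Figures 3 and 4 show the functional flow. Below is a minimal sketch of the same publish-subscribe idea, with hypothetical names and the Python standard-library queue standing in for the CDC publication layer: a publisher pushes captured change records onto a queue, and a subscriber drains them and applies them to its target.

import queue

change_queue = queue.Queue()  # stands in for the CDC publication layer

def publish(change):
    """Publisher side: emit one captured change record."""
    change_queue.put(change)

def subscribe(apply_fn):
    """Subscriber side: drain the queue and apply each change to a target."""
    while not change_queue.empty():
        apply_fn(change_queue.get())

# Hypothetical captured changes flowing from source to warehouse.
publish({"table": "orders", "op": "insert", "key": 1, "amount": 99.5})
publish({"table": "orders", "op": "update", "key": 1, "amount": 75.0})
subscribe(lambda change: print("applying to warehouse:", change))

A real publication layer would additionally track each subscriber's window or position so that several targets can consume the same changes independently.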

Technology Enablers

Informatica

Informatica PowerExchange's patented architecture for delivering changed data capture utilizes source-specific capture agents, leveraging published APIs, log files, journal files and other non-invasive resources to capture changed data. By combining visual, SQL-based access with a non-invasive architecture, PowerExchange enables organizations to easily access only changed data, while reducing demands on system and staff resources.

And as data volumes grow and delivery windows shrink, Informatica PowerExchange allows a seamless transition from batch and bulk, to batch and changed data capture, to real-time changed data capture delivery.

Through a single architecture, data access requirements can quickly adapt without the need for intensive hand coding. PowerExchange's Changed Data Capture (CDC) architecture provides native capabilities to detect data updates and to deliver them in real time or batch through the same common API used for bulk updates. This significantly reduces the amount of data to be moved and the need to rebuild tables, which in turn reduces the processing demands on networks and hardware. Multitasking agents can run in parallel, delivering changed data to multiple projects concurrently to ensure real-time synchronization of data across the enterprise.

Oracle

Oracle provides a robust framework to enable the CDC option as shown in Figure 5.


Figure 5

Starting with the Oracle9i release, the following packages and utilities are available for implementing a CDC architecture:

  • CDC Views
  • DBMS_CDC_SUBSCRIBE
  • DBMS_CDC_PUBLISH

Capturing changed data makes for a very efficient process but may not always work in your environment with your particular set of technologies. A chat with your operations people, application maintenance staff, database administrators and technology vendor may help you determine if this will work for you and which mechanism is best suited.

The progression described here mimics the mind-set of organizations as they mature in their understanding and operation of the data warehouse. First, the organization reads legacy databases directly to refresh its data warehouse. Then it tries replication. Finally, the economics and the efficiencies of operation lead it to CDC as the primary means to refresh the data warehouse. Along the way it is discovered that a few files require a direct read. Other files work best with replication. But for industrial-strength, full-bore, general-purpose data warehouse refreshment, CDC looms as the long-term final approach to data warehouse refreshment.

...............................................................................


Soumendra Mohanty is a program manager at Accenture, India, where he leads the Data Warehousing/Business Intelligence Capability Group, providing architectural solutions across various industry domains and DW/BI technology platforms. He has worked with several Fortune 500 clients and executed projects in various industry domains as well as technology platform areas. He can be reached at Soumendra.Mohanty@accenture.com.

 
