Real Time Data Warehousing:
Customers, Budget and System Architecture

    Column published in DMReview.com
January 8, 2004
 
  By Simon Terr

Continuing with the series on real-time data warehousing building blocks, in this installment I am going to discuss three building blocks: customers, budget and system architecture.

Because customers and budget are well-covered topics, I'm going to touch upon them only briefly and dedicate the rest of this article to system architecture. If you would like me to elaborate on customers and budget in detail, send me an e-mail and I will revisit these topics in a future article.

Customers

I am a firm believer in using methodologies when interacting with customers. There are a variety of proven customer and project management methodologies, and they all provide a framework that ensures customers are involved in all key phases and milestones of any significant undertaking. Whichever methodology you prefer, select one that's appropriate for you and use it.

Budget

A budget should realistically reflect the costs associated with the successful delivery of the goals and projects identified in the road map. Taking into account the anticipated system architecture, hardware, software and staff prior to finalizing any budget will uncover hidden expenditures, thereby making the budget that much more accurate.

System Architecture

A system architecture for real-time data warehousing comprises four major components: volumetrics, data acquisition method, data recovery architecture and storage.

  • Volumetrics. The purpose of volumetrics is to provide enough information to determine whether the current systems can handle the demands of real-time data warehousing, both today and tomorrow, and, if they cannot, to identify the challenged systems. Size of source tables, available network bandwidth, storage availability, server capacity, current and anticipated resource consumption rates and data growth rates are some of the key metrics that comprise volumetrics.
  • Data Acquisition Methods. In real-time data warehousing, change data capture is the modus operandi. Attempting to do real-time data warehousing by replicating whole tables, database partitions or the like is simply not practical unless there is a specific requirement for it. With this in mind, here are some of the more popular change data capture methods and how they work: application, log, replication, trigger and continuous.
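The volumetrics bullet above lends itself to a simple back-of-the-envelope check. The following Python sketch tests whether projected daily volume fits within spare network bandwidth and storage; the function and every figure in it are illustrative assumptions, not from this column:

```python
# Illustrative volumetrics check: given measured rates, estimate whether the
# current infrastructure can absorb real-time feeds today and after growth.

def capacity_ok(rows_per_day: int, bytes_per_row: int,
                growth_per_year: float, years: int,
                network_bytes_per_day: int, storage_bytes_free: int) -> bool:
    """Return True if network and storage can absorb the feed after growth."""
    future_rows = rows_per_day * (1 + growth_per_year) ** years
    daily_bytes = future_rows * bytes_per_row          # network load per day
    yearly_bytes = daily_bytes * 365                   # storage load per year
    return daily_bytes <= network_bytes_per_day and yearly_bytes <= storage_bytes_free

# Hypothetical figures: 2M rows/day at 500 bytes, 30% annual growth, a
# 2-year horizon, 50 GB/day of spare bandwidth, 2 TB of free storage.
print(capacity_ok(2_000_000, 500, 0.30, 2, 50 * 10**9, 2 * 10**12))  # True
```

The point is not the particular formula but that volumetrics turns "can our systems handle it?" into an arithmetic question you can answer before committing to an architecture.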

Application Method

The application method works by incorporating acquisition code directly into the source system that generates the source data. The premise here is that once a transaction is complete, the source application commits the transaction in its own database and in the data warehouse.

The adage "straight from the horse's mouth" is pretty accurate here. Since there are typically no intermediary systems between the source system and the data warehouse, this is a straightforward and pretty enticing data acquisition method.

Unfortunately, this method has a couple of significant limitations. For custom and in-house-built applications, the first limitation is that the acquisition code becomes part of the source system. This integration ties the acquisition code directly to the code base and life cycle of the source system. Hence, any changes to the acquisition code will have to be made by the source system developers unless the data warehouse development team can arrange a development role in the source system. In the case of packaged applications, this method may not even be feasible.

The second limitation is double work. Every time the source system writes a transaction to its own database, it also writes the transaction to the data warehouse. Hence, the source system does double the work and incurs a significant performance burden.
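Under the description above, the application method can be sketched in a few lines of Python. In-memory SQLite stands in for both databases here, and the table name and schema are invented purely for illustration:

```python
import sqlite3

# Sketch of the application method: the source application commits each
# transaction twice, once to its own database and once to the data warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")
for db in (source, warehouse):
    db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

def record_order(order_id: int, amount: float) -> None:
    """Write the transaction to the source system AND the warehouse."""
    source.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
    warehouse.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
    source.commit()      # double work: every write happens twice,
    warehouse.commit()   # which is the performance burden noted above

record_order(1, 99.50)
print(warehouse.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 1
```

The two commits per transaction make the double-work cost visible: every write the source system performs is performed again for the warehouse.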

Log Method

The log method works by acquiring change data from the database transaction logs. Because this method reads the logs, acquisition code does not have to be incorporated into either the source application or its database. Database logs are well documented, and reading them does not typically require unusual procedures or code. Furthermore, this method is utilized by several ETL tools and data replication engines, making it relatively popular, at least among ETL software vendors.
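Real transaction log formats are vendor-specific, so the following Python sketch invents a simple pipe-delimited log purely to illustrate the idea of recovering changes by reading a log rather than instrumenting the source application:

```python
# Sketch of the log method: change data is recovered from the transaction log
# without touching the source application. The pipe-delimited format and the
# sample entries below are invented for illustration only.
log_lines = [
    "1001|INSERT|orders|id=1,amount=99.50",
    "1002|UPDATE|orders|id=1,amount=89.50",
    "1003|INSERT|customers|id=7,name=Acme",
]

def changes_for(table: str, lines):
    """Yield (sequence number, operation, column values) for one table."""
    for line in lines:
        seq, op, tbl, values = line.split("|")
        if tbl == table:
            yield int(seq), op, dict(kv.split("=") for kv in values.split(","))

print(list(changes_for("orders", log_lines)))
```

Filtering by table and replaying entries in sequence order is, in miniature, what log-based ETL tools and replication engines do against the real vendor log.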

Replication Method

The replication method works by acquiring change data using a data replication engine. Since data replication engines typically read database logs, this method shares many similarities with the log acquisition method. Replication engines typically provide queues and robust error handling, which make this method interesting enough that several ETL vendors use replication engines to provide change data capture functionality. Replication engines typically do not include data transformation functionality, however, so an ETL tool is frequently a necessary component of this data acquisition method.

Trigger Method

The trigger method works by adding triggers to the source system to capture changed data. Triggers are robust and can be tailored to specific database tables, fields and events, thereby capturing only the desired data. Furthermore, triggers are traditionally compiled, and the compiled code is stored directly in the database server, making triggers fast. If not used in conjunction with an ETL tool, however, triggers can become complex and unwieldy over time.
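As a self-contained illustration of the trigger method, the following Python sketch uses SQLite (table names and schema invented) to route every insert and update on a source table into a capture table that a warehouse load could later drain:

```python
import sqlite3

# Sketch of the trigger method: triggers on the source table copy every
# change into a capture table. SQLite keeps the sketch self-contained.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id INTEGER, amount REAL);
CREATE TABLE orders_changes (op TEXT, id INTEGER, amount REAL);
CREATE TRIGGER orders_ins AFTER INSERT ON orders
BEGIN
    INSERT INTO orders_changes VALUES ('INSERT', NEW.id, NEW.amount);
END;
CREATE TRIGGER orders_upd AFTER UPDATE ON orders
BEGIN
    INSERT INTO orders_changes VALUES ('UPDATE', NEW.id, NEW.amount);
END;
""")
db.execute("INSERT INTO orders VALUES (1, 99.50)")
db.execute("UPDATE orders SET amount = 89.50 WHERE id = 1")
print(db.execute("SELECT * FROM orders_changes").fetchall())
# [('INSERT', 1, 99.5), ('UPDATE', 1, 89.5)]
```

One trigger pair per table is already three database objects; multiply that by dozens of tables and the "complex and unwieldy over time" caveat becomes concrete.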

Continuous Method

The continuous method works by having an ETL process continuously monitor the source system for new and changed data. Some ETL vendors utilize data replication engines to implement continuity, while others offer special versions of their software, typically dubbed "real time," to provide this functionality. The continuous method is probably one of the best ways to achieve real-time performance, simply by virtue of all systems being integrated and working together. Unfortunately, it is also one of the most complex architectures to recover from failure. Unless there is a robust data recovery architecture in place, the failure of one component will have an adverse cascading effect on all downstream components.
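A minimal sketch of the continuous method, assuming a polling loop with a high-water-mark checkpoint (all names invented; a production version would loop indefinitely with a short sleep and persist the checkpoint so a crash does not lose its place):

```python
import sqlite3

# Sketch of the continuous method: an ETL process repeatedly polls the
# source for rows newer than its last checkpoint.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (seq INTEGER PRIMARY KEY, payload TEXT)")

warehouse_rows = []
checkpoint = 0  # highest seq already loaded; persisted in a real system

def poll_once():
    """One polling pass: load everything newer than the checkpoint."""
    global checkpoint
    rows = source.execute(
        "SELECT seq, payload FROM events WHERE seq > ? ORDER BY seq",
        (checkpoint,)).fetchall()
    for seq, payload in rows:
        warehouse_rows.append(payload)
        checkpoint = seq  # advance only after the row is safely loaded

source.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])
poll_once()
source.execute("INSERT INTO events VALUES (3, 'c')")
poll_once()
print(warehouse_rows, checkpoint)  # ['a', 'b', 'c'] 3
```

Advancing the checkpoint only after a row is safely loaded is what allows a restart after failure to resume without loss, which previews the data recovery concerns discussed next.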

  • Data Recovery Architecture. Although requirements vary on a case-by-case basis, there are three common data recovery architecture guidelines to consider. The first guideline is to schedule processes around maintenance periods. Typically, every system has regularly scheduled maintenance windows during which system updates, backups and other administrative tasks get done. Processes should be scheduled not to run during maintenance windows to avoid dropped connections and other system dependency-related issues.

    Decoupling systems is another guideline. Decoupling simplifies data acquisitions by breaking them up into smaller, logical units while minimizing the impact of failure by any one system. To illustrate, consider a single ETL process that moves data from a source system to the data warehouse. If the source system becomes inaccessible, the ETL process typically experiences a dropped connection followed by failure. To reduce this impact, the source system and data warehouse can be decoupled by setting up two data acquisitions: one moves source data into a staging area, and the second moves data from the staging area into the data warehouse. The first acquisition can be a simple data movement affair while the second contains the core ETL logic. If the source system becomes inaccessible, only the first acquisition requires a restart; the second retains its integrity without significant impact.

    The third guideline is to plan for downtime. To illustrate, consider a system with two data acquisitions: acquisition A moves source transactions into a staging database, and acquisition B moves transactions from the staging database into the data warehouse. Suppose acquisition B fails while acquisition A is still up and running. In this case, planning for downtime implies two things. First, the staging database has to be big enough to hold the accumulating transactions while acquisition B is down. Second, acquisition B has to be fast enough to process the accumulated transactions and catch up to the newly accumulating transactions.

  • Storage. Real-time data warehousing has several impacts on storage, such as increased data storage requirements and the need for timely data retrieval. Although there are a variety of storage options, each has its own impact on the system architecture. For instance, the impacts of a SAN versus a Fibre Channel solution on the network and the system architecture as a whole are very different.

    Atomic and even fact tables tend to grow to millions and sometimes billions of rows, and retrieving data from a large table can be a challenge. Traditional performance-enhancing techniques such as indexing are sometimes insufficient to achieve acceptable performance. Consequently, alternative methods have to be employed, such as breaking up the storage of a large table across different media, such as hard disk and tape. The introduction of mixed-media storage solutions to increase data retrieval performance, reduce storage cost or both impacts the system architecture as well.
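The plan-for-downtime guideline above reduces to simple arithmetic: the staging database must hold the backlog that accumulates during an outage, and acquisition B must run faster than the arrival rate to ever catch up. A Python sketch, with hypothetical figures:

```python
# Sketch of the plan-for-downtime calculation from the data recovery
# guidelines. All rates and the outage duration below are hypothetical.

def downtime_plan(arrival_rows_per_min: float, b_rows_per_min: float,
                  outage_min: float):
    """Return (backlog rows to stage, minutes for B to catch up).

    The catch-up time is None when B is not faster than the arrival rate,
    meaning the backlog grows forever.
    """
    backlog = arrival_rows_per_min * outage_min
    surplus = b_rows_per_min - arrival_rows_per_min
    if surplus <= 0:
        return backlog, None
    return backlog, backlog / surplus

# 1,000 rows/min arriving, B processes 4,000 rows/min, 60-minute outage:
print(downtime_plan(1000, 4000, 60))  # (60000, 20.0)
```

The two outputs map directly to the two requirements in the guideline: the staging database must be sized for the backlog, and acquisition B must have enough surplus throughput that the catch-up time is finite and acceptably short.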

All in all, utilizing a customer and project management methodology ensures that there is customer involvement in key phases and milestone events. Utilizing traditional budgeting techniques while taking into account the system architecture, hardware, software and staff requirements increases the accuracy of the budget. Finally, volumetrics, data acquisition method, data recovery architecture and storage are the key components of a sound system architecture that can support real-time data warehousing today and tomorrow.

...............................................................................

For more information on related topics visit the following related portals...
DW Design, Methodology and Real-Time Enterprise.

Simon Terr is CEO and founder of KnowledgeSoft, Inc., a provider of products and services in the areas of real time data warehousing and business intelligence. Simon Terr can be reached at knowledgesoft@att.net.


