Portals eNewsletters Web Seminars dataWarehouse.com DM Review Magazine
DM Review | Covering Business Intelligence, Integration & Analytics
   Covering Business Intelligence, Integration & Analytics Advanced Search

View all Portals

Scheduled Events

White Paper Library
Research Papers

View Job Listings
Post a job


DM Review Home
Current Magazine Issue
Magazine Archives
Online Columnists
Ask the Experts
Industry News
Search DM Review

Buyer's Guide
Industry Events Calendar
Monthly Product Guides
Software Demo Lab
Vendor Listings

About Us
Press Releases
Advertising/Media Kit
Magazine Subscriptions
Editorial Calendar
Contact Us
Customer Service

Which tools would you suggest to load data in near real time in a Siebel database?

    Ask The Experts published in DMReview.com
December 12, 2005
  By Tom Haughey


  • We have a Siebel database that registers daily data (LOGs) to the data warehouse in a batch process at night. We are evaluating how to load data in near real time. Which tools would you suggest?

Tom Haughey's Answer:

  • There are three issues with near real-time population of a warehouse, and the tools are only one part. The three issues are definition of real time, design of a solution and selection of tools. We will deal briefly with each.

    Definition of Real Time
    Real time can have several definitions; here are two possible definitions:

    1. Nearly synchronous. This is usually achieved via by message brokers and queuing systems which also handle guaranteed delivery. In this scenario, data movement is one for one with the events or transactions. Nobody does purely synchronous because it has too many dependencies.
    2. Very short intervals. This is store and forward of relatively small batches at short intervals. ETL can do this quite well, performing a transformation and load, say, every five or 15 minutes.

    Design of the Solution
    Here there are three challenges:

    1. Finding the change data.
    2. Not disrupting either the operational system or the data warehouse.
    3. Providing a process and database design to accommodate this.

    Finding Change Data.
    There are about five ways for dealing with change data, namely,

    1. Read database transaction logs
    2. Application writes changes to change file
    3. Scan database for timestamps
    4. Compare database snapshots, including creating and comparing a cyclic redundancy check (CRC)
    5. Full replication. The main point of these is to avoid disrupting the operational system.

    Database Design
    Consider implementing a process of “trickle load.” Here is one scenario for doing this. Assume there are multiple levels of data in the warehouse, such as base grains and several levels of summarization. Base grains are the most detailed level of data in the warehouse, such as transactions. Summaries are aggregations off of base grains and/or external data. Both base and summary data can be used for querying. There is another set of tables, called “reporting tables”, much like the base tables but with further optimization and maybe some summarization for querying. For example, what is a complex and ragged dimension hierarchy in the base tables can be flattened out in the reporting tables to improve querying. Finally, at the detail grain level, there are two sets of identical tables, one for load (which we will call “load tables”) and one for querying (which we called “base tables” above). Queries can run off the base tables, the reporting tables or the summary tables, but not the load tables. Load tables are used strictly for loading data. Views are used to control access to base and reporting tables; this is important to the process.

    Data is loaded into the load tables in short intervals. When a given load is done, the database is momentarily frozen and the newly loaded data is instantly copied into the base tables. This usually takes seconds. Remember, views are used to control access to base and reporting tables. Next we flip the views: the former load tables become the base tables and the former base tables now become the load tables. Flipping the views make this transparent to the user. Data is now loaded into the (new) load tables, as before, and the process is repeated for each interval. The goal of all this is to avoid disrupting the warehouse

    DBMSs, ETL products and other third-party tools can play a role, depending on the type of near real time you select. It used to be that Sybase was best at replication and has almost a corner on the market. They replicate off the DBMS log. They still are very good at it but other DBMS vendors, such as IBM and Oracle, have enhanced their support to be competitive with Sybase. DBMSs are best at nearly synchronous replication because they generally work off the log. ETL tools can deal with near-real time in short intervals. Third party tools are also available. I recommend that you deal with the first two issues first (your definition of real-time and the database design), see what tools match your needs, and then seriously consider doing a “shoot out” among the candidate vendors.


For more information on related topics visit the following related portals...
Database/Application Performance, Databases and DW Administration, Mgmt., Performance.

Tom Haughey is the president of InfoModel LLC, a training and consulting company specializing in data warehousing and data management. He has worked on dozens of database and data warehouse projects for more than two decades. Haughey was former CTO for Pepsi Bottling Group and director of enterprise data warehousing for PepsiCo. He may be reached at (201) 337-9094 or via e-mail at tom.haughey@InfoModelUSA.com.

Solutions Marketplace
Provided by IndustryBrains

Data Validation Tools: FREE Trial
Protect against fraud, waste and excess marketing costs by cleaning your customer database of inaccurate, incomplete or undeliverable addresses. Add on phone check, name parsing and geo-coding as needed. FREE trial of Data Quality dev tools here.

Recover SQL Server or Exchange in minutes
FREE WHITE PAPER. Recover SQL Server, Exchange or NTFS data within minutes with TimeSpring?s continuous data protection (CDP) software. No protection gaps, no scheduling requirements, no backup related slowdowns and no backup windows to manage.

Speed Databases 2500% - World's Fastest Storage
Faster databases support more concurrent users and handle more simultaneous transactions. Register for FREE whitepaper, Increase Application Performance With Solid State Disk. Texas Memory Systems - makers of the World's Fastest Storage

Design Databases with ER/Studio: Free Trial
ER/Studio delivers next-generation data modeling. Multiple, distinct physical models based on a single logical model give you the tools you need to manage complex database environments and critical metadata in an intuitive user interface.

Free EII Buyer's Guide
Understand EII - Trends. Tech. Apps. Calculate ROI. Download Now.

Click here to advertise in this space

E-mail This Ask The Experts E-Mail This Ask The Experts
Printer Friendly Version Printer-Friendly Version
Related Content Related Content
Request Reprints Request Reprints
Site Map Terms of Use Privacy Policy
SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.
SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.