Published in DM Review Online in February 2004.|
Printed from DMReview.com
Real-Time Data Warehousing: Hardware and Softwareby Simon Terr
In this final installment of the real-time data warehousing building blocks series, I am going to discuss hardware, software and staff.
The challenge with real-time data warehousing is to ensure that data flows continuously. Letís consider the plumbing through which the data flows starting with the source system and ending with the customerís desktop:
There are two principal challenges associated with source systems. The first challenge is how to extract the initial load and the second challenge is how to capture changed data without impacting the source system. The first challenge typically requires the source data to be copied to an intermediary location from which the data can be prepared and loaded into the data warehouse. Although the initial load is a one-time requirement, this requirement repeats itself for each new acquisition. Hence, having a dedicated server that can be configured appropriately to handle all anticipated initial loads may be a necessity, especially in organizations that have multiple source systems.
Meeting the second challenge depends largely on the ETL strategy for capturing changed data. In the case of non-invasive ETL strategies, such as those utilizing changed data views or log tables, the impact is typically limited to ensuring the source system can handle the additional stresses of the changed data read operations. For invasive ETL strategies, such as those utilizing triggers, the resource impacts can be more extensive. If captured data is stored in the source system, the source system has to have the storage capacity to hold the captured data. If triggers are used to capture data, depending on their complexity, the triggers may require additional processing resources.
The job of processing the initial load and ongoing changed data capture is relegated to the ETL servers. The ETL servers have to be able to perform the initial load fast enough so the changed data does not accumulate to unmanageable volumes. The change data capture ETL job has to process changed data, including the accumulated data during the initial load, fast enough to live up to the "real time" in real-time data warehousing.
To accomplish all of the above, the ETL servers need the hardware and software to handle the initial load and changed data requirements. The servers need to have enough CPUs, memory and storage to perform their job. In addition, the servers have to be scalable and allow for additional CPUs, memory and/or storage as needed.
The ETL software has to have the features that will allow it to take full advantage of the ETL serversí hardware. If the ETL servers begin to reach high utilization rates and more processing power is needed, the ETL software needs to have the features to allow adding more ETL servers so processing can be distributed seamlessly across the ETL servers with minimal, if any, down time.
Moving data between servers requires network bandwidth. The network must be able to handle the initial load and changed data bandwidth requirements, especially during peak hours. The network should be tested to see if it can handle the data bandwidth requirements. Interestingly enough, sometimes the network is not the issue although it may appear to be so at first blush. For instance, a bad network card on the source server may give the appearance that the network is the culprit. In some instances, a long haul network segment such as a transatlantic connection may be the bottleneck and upgrading the segment may be too costly. Hence, prioritizing the network traffic may be a solution. The message here is if network performance is an issue, the network needs to be analyzed to find out exactly what is causing the performance issues and why.
Data Warehouse Server
Since there are so many flavors of CPUs, memory, storage devices and software available for servers, choosing a particular combination of hardware and software for a data warehouse server can be a daunting task. The first order of business is to capacitate the data warehouse server by calculating the volumetrics. For example, consider Acme Corporation that has a point-of-sale (POS) system that has about five years of data, about 100GB. Acme expects its sales to go grow about five percent per year which translates to about 21GB/year or 58MB/day. Volumetrics should drive the performance requirements the data warehouse needs to meet, and the data warehouse should be capacitated accordingly.
Although the data warehouse typically stores data in a format that is easy to report on, rarely is that sufficient to satisfy all reports which is the crux of the problem. To illustrate what I mean, reports that go against OLAP cubes that are built overnight from data retrieved from the data warehouse Ė the data in the data warehouse may be available in real time, but the OLAP cubes negate this timeliness by being built nightly.
Real-time data warehousing adds requirements that the reporting server may not be able to meet either today or tomorrow. Consequently, reporting volumetrics need to be computed and utilized to either purchase or upgrade a reporting server solution.
There are a lot of moving parts to real-time data warehousing and these parts are impacted differently by the ebb and flow of business. The staff must not only have the technical skills to respond effectively to these impacts but also have the fortitude, self-motivation, attitude and persistence to initiate and manage change when needed. Growth in the business is bound to tax systems. Mergers and acquisitions are bound to change the architecture. New functional requirements will evolve systems.
Source system, ETL server, network, data warehouse server and reporting server are the major components in a real- time data warehousing framework. That is a lot of moving parts and understanding the role of each component ensures that it can be appropriately selected or tuned to let data flow continuously. Having the right staff ensures data flows continuously.
Simon Terr is CEO and founder of KnowledgeSoft, Inc., a provider of products and services in the areas of real time data warehousing and business intelligence. Simon Terr can be reached at email@example.com.
Copyright 2005, SourceMedia and DM Review.