Data Warehousing Lessons Learned:
Very Large Data Warehouses Challenge Backup

Column published in DM Review Magazine, June 2005 Issue
By Lou Agosta

But Backup Is Still Required!

As the share of enterprises with data warehouses of at least one terabyte has grown from 21 percent in 2002 to 43 percent in 2004, many have asked Forrester: Is it feasible to back up the database?1 Complaints have included: 1) the backup won't fit in our batch window, 2) we don't have a batch window, and 3) the mean time between failures (MTBF) of modern storage technology is better than ever -- do we need a backup at all? The answer is that unless the system is of no consequence to the business (in which case, why is it in operation?), a periodic backup must be made, based on the time horizon in which the business process operates.

Getting started with a data warehousing backup strategy means managing how:

  • Different parts of the warehouse change at different rates.
  • Backup frequency slides toward that of transactional systems.
  • "Read mostly" in practice means "read and write."

Optimize the Backup!

Even if the data warehouse is of a heroic size -- that is, in the multiterabyte range -- several tactics can increase the likelihood of handling the backup operation using standard database utilities. The options are straightforward and depend, in part, on the capabilities of the underlying database (such as IBM DB2, Microsoft SQL Server, Oracle 10g, Sybase or Teradata) in operation. Use these approaches:

Partitioning to increase operational flexibility. Partition the database in such a way that an individual partition can be backed up in a reasonable window of time, even if the entire data structure is too large to back up all at once. For example, 12 partitions of 250GB each could be backed up periodically (daily, weekly or monthly, depending on the requirements). The schedule handles one partition at a time, not all three terabytes at once.
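As a rough illustration of the rotation idea, here is a minimal Python sketch that cycles through twelve monthly partitions, backing up one per night. The partition names and the site_backup_tool command are hypothetical placeholders, not any vendor's actual utility.

import datetime
import subprocess

# Twelve monthly range partitions of roughly 250GB each (names are made up).
PARTITIONS = [f"sales_2004_m{m:02d}" for m in range(1, 13)]

def partition_for_tonight(today: datetime.date) -> str:
    """Cycle through the list so each partition is backed up every 12 nights."""
    return PARTITIONS[today.toordinal() % len(PARTITIONS)]

def backup_partition(partition: str) -> None:
    """Hand one partition to the site's backup wrapper (a placeholder command)."""
    subprocess.run(
        ["site_backup_tool", "--table", "sales", "--partition", partition],
        check=True,
    )

if __name__ == "__main__":
    backup_partition(partition_for_tonight(datetime.date.today()))

Run nightly from the scheduler, this keeps each individual window small while still covering the full three terabytes over the course of the rotation.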

Incremental backup to reduce elapsed time to complete an operation. Use a database whose backup and recovery utility allows for incremental backup and restore; both Oracle and DB2 provide for it. Because the backup copy contains only the deltas, the recovery operation requires less time than if the entire data structure had to be restored. However, a complete baseline copy of the data, to which the incremental copies can be applied, must first be made.
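The restore side of incremental backup follows a fixed order: lay down the most recent full copy, then apply each incremental on top of it, oldest first. The Python sketch below captures only that ordering logic; restore_image() and the file handling are illustrative assumptions, not a real database utility.

from pathlib import Path

def restore_image(image: Path) -> None:
    """Placeholder for the DBMS restore call applied to one backup image."""
    print(f"restoring {image}")

def restore(full_copy: Path, incrementals: list[Path]) -> None:
    """Lay down the complete baseline copy, then apply the deltas oldest to newest."""
    restore_image(full_copy)
    for delta in sorted(incrementals):
        restore_image(delta)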

Parallel processing to divide and conquer large data volumes. The operations environment can increase the bandwidth of the backup or restore activity by increasing the number of tape devices used and executing the operation in parallel. For example, if the operation calls for 24 data structures to be backed up and each requires one hour to complete, and 24 tape devices are available, then an exclusive backup window of only one hour is needed. In contrast, if only one tape drive is available, then the entire process will require 24 hours.
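To see how that arithmetic plays out, the Python sketch below fans 24 backup jobs out across however many tape drives are configured; with 24 workers the jobs overlap and finish in roughly one job's time, with one worker they run serially. The tablespace names and backup_one() are placeholders for the real utility invocation.

from concurrent.futures import ThreadPoolExecutor

TABLESPACES = [f"ts_{i:02d}" for i in range(24)]   # 24 structures, ~1 hour apiece
TAPE_DRIVES = 24                                    # set to 1 to see the serial case

def backup_one(tablespace: str) -> str:
    """Placeholder: stream one tablespace to a tape drive and block until done."""
    return tablespace

with ThreadPoolExecutor(max_workers=TAPE_DRIVES) as pool:
    for finished in pool.map(backup_one, TABLESPACES):
        print(f"backed up {finished}")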

In this example, the tradeoff is between investing up front in additional tape drive infrastructure and accepting an inefficient, serial process day after day. Such a parallel architecture also applies to massively parallel databases, such as Teradata or the IBM Scalable Parallel framework, where each node can be mapped to a tape drive.

Concurrent backup to extend availability. The major database vendors now allow concurrent backup - the table or table space can be backed up while the data structure is still online and in use. Log changes must also be applied if there are concurrent updates, but that is not the case for an inquiry-only data warehouse. This concurrent backup greatly relieves the problem of having to fit the backup into an exclusive batch window.
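Conceptually, an online ("fuzzy") backup still recovers to a consistent state because the log records written while the copy was in flight are replayed on restore. The Python sketch below models only that idea; the LogRecord structure and the log sequence number (LSN) handling are simplified assumptions, not any vendor's log format.

from dataclasses import dataclass

@dataclass
class LogRecord:
    lsn: int        # log sequence number
    page: int       # page the change applies to
    value: str      # new contents of that page (simplified)

def recover(fuzzy_image: dict[int, str], copy_start_lsn: int,
            log: list[LogRecord]) -> dict[int, str]:
    """Apply the image copy, then redo every change logged since the copy began."""
    restored = dict(fuzzy_image)
    for rec in log:
        if rec.lsn >= copy_start_lsn:
            restored[rec.page] = rec.value
    return restored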

Of course, in the case of a data warehouse that is inquiry (i.e., read) only, the problem is much less severe. Once an initial backup is taken, the database does not change. The data management function must still provide for a backup of the additions to the tables, but these are usually much smaller than a full backup and can be handled by incremental backup of just the deltas (what has changed).

Redundant arrays of independent disks (RAID) to recover from media failure. If the installation is dealing with a data structure that is truly extreme in size -- such as the hundreds of terabytes reportedly in place at Wal-Mart -- then use a storage technology mechanism such as RAID 1 or RAID 5. In case of disk failure, the disk's contents can be reconstructed from the redundant storage. However, note that this only addresses availability and data loss due to media (i.e., disk drive) failure. Other provisions, such as a custom application that can rebuild the data structure, must be made to address damage or loss of data integrity due to a rogue application or incorrect update.
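The reason a RAID array can ride out a single disk failure is that parity is the bitwise XOR of the data blocks, so any one lost block can be recomputed from the survivors. The short Python sketch below demonstrates the principle only; a real array does this in the controller or driver.

def parity(blocks: list[bytes]) -> bytes:
    """XOR the blocks together; over N data blocks this yields the parity block."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]   # contents of three data disks
p = parity(data)                                  # parity written to a fourth disk
rebuilt = parity([data[0], data[2], p])           # disk 1 fails: XOR the survivors
assert rebuilt == data[1]                         # the lost contents come back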

A combination of tactics to satisfy complex, real-world data center scenarios. In the real world, all five of these methods are combined as applicable. Partitioned tables are backed up in rotation, incremental backup is used alongside concurrent processing, and RAID 1 or RAID 5 protects the structures that are too big to fit in the batch window. It is always necessary to have a backup. In some instances where the table or data structure never changes, the installation can save the input file so that the table can be restored in case of damage; the restore is then performed by an application rather than a formal database utility, but the function is the same and the input file serves as the backup. The cost and complexity are greater than if a database utility could be used, but still less than going out of business!

Reference:

  1. Twenty-one percent of respondents to a 2001 data warehousing questionnaire reported having a data warehouse of more than one terabyte; by early 2004 that figure had risen to 43 percent, according to the February 2004 Data Warehousing Institute-Forrester Quarterly Technology Survey.


Lou Agosta, Ph.D., is a business intelligence strategist with IBM World Wide Business Intelligence Solutions, focusing on competitive dynamics. He is a former industry analyst with Giga Information Group and spent many years in the trenches as a database administrator. His book The Essential Guide to Data Warehousing is published by Prentice Hall. This article is © IBM. Agosta would like to hear from you, so please send comments and questions to him at LoAgosta@us.ibm.com.
