Data Warehousing Lessons Learned:
SMP or MPP for Data Warehousing
For data warehousing clients with multiterabyte, high-end decision support requirements and a need for superior price/performance across significant numbers of users, massively parallel processing (MPP) will remain an operational necessity. Although symmetric multiprocessing (SMP) is raising the price/performance bar and pushing the crossover point upward, SMP and MPP will coexist in a gray transition area at the high end, and clients will benefit from the choices and competition. Many more data warehousing clients will be able to make use of standard SMP approaches than was previously the case, but those with multiterabyte volumes combined with high-performance requirements (active data warehousing) will still require a special-purpose data warehouse server.
Innovations in memory-to-CPU interconnection, such as crossbar switching, have reduced memory contention in SMP designs and lowered the coordination costs of the hybrid clustered-SMP approach. However, linear (slope-of-one) scalability across hundreds of processors still requires an MPP database. The central trade-off between single-image (SMP) and parallel-processing (MPP) data warehousing servers is between ease of administration and scalability. Eventually the coordination costs of refreshing and synchronizing cache render SMP scalability sub-linear: the addition of another processor does not produce a full processor's quota of work and throughput. A similar dynamic can also affect shared-disk clusters where, absent an abstraction layer to map disks to nodes, a global lock or database synchronization mechanism is needed to preserve data integrity. In contrast, while MPP scales linearly to hundreds of nodes, troubleshooting so many processors can be an issue for administrators trained in the SMP world, and more moving parts reduce the mean time between failures (MTBF). Advocates of SMP versus MPP (e.g., HP, Sun and IBM versus Teradata and IBM) argue that the performance cost of data movement through the high-speed switch (a defining characteristic of clustered hardware and MPP databases) can be significant, and data placement remains a critical success factor. This is true, but it is the required trade-off for high-performance results given complex queries against large data volumes. Further trade-offs include:
- MPP offers superior scalability of computing power and throughput; SMP offers the best price/performance, especially below the gray area in Figure 1.
- MPP has (regardless of the DBMS) an issue of leaving MIPS unused throughout the day. (Consider a perfectly distributed MPP database. Now run a query that joins two tables on "date." Redistribution occurs, and more data get hashed to certain nodes. Real-time imbalance occurs.) SMP overcomes this imbalance if the software can dynamically allocate parallel tasks within a single node (e.g., DB2 and Oracle intrapartition parallelism).
- MPP has more "moving parts" and, hence, potentially worse MTBF numbers - a drawback diminished in the case of true fault-tolerant architectures such as Compaq Himalaya (with NonStop SQL), IBM DB2 EEE with HACMP, or Teradata (NCR).
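The redistribution imbalance described above can be sketched with a short simulation. This is an illustrative model, not vendor code: rows are hash-partitioned to nodes by a join key ("date"), and because the key has few distinct values with a skewed distribution, some nodes receive far more than their even share. The table size, node count and date distribution below are invented for illustration.

```python
import hashlib
import random
from collections import Counter

NODES = 8
random.seed(42)

def node_for(key: str) -> int:
    """Deterministic hash partitioning: join-key value -> node number."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NODES

# A fact table of 100,000 rows with only 30 distinct "date" values;
# later dates are more frequent (weights grow with the day number).
days = random.choices(range(1, 31), weights=range(1, 31), k=100_000)
rows = [f"2001-01-{d:02d}" for d in days]

# Perfectly distributed layout: round-robin by row position.
round_robin = Counter(i % NODES for i in range(len(rows)))

# Redistribution on the join key: every row with the same date must
# land on the same node, so popular dates pile up on a few nodes.
redistributed = Counter(node_for(date) for date in rows)

def imbalance(counts: Counter) -> float:
    """Heaviest node's load divided by the ideal even share (1.0 = perfect)."""
    ideal = sum(counts.values()) / NODES
    return max(counts.values()) / ideal

print(f"round-robin imbalance:   {imbalance(round_robin):.2f}")
print(f"redistributed imbalance: {imbalance(redistributed):.2f}")
```

Because a parallel query finishes only when its slowest node finishes, the overloaded node becomes the bottleneck while MIPS sit idle on the others - the real-time imbalance noted in the bullet above.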
Figure 1: SMP to MPP Crossover for DW Workloads
As a result of these trade-offs, a move toward clustered architectures is occurring with SMPs as the building blocks, resulting in a hybrid configuration. (The need for a high-speed switch and the implied complexity means that the hybrid approach is a special-purpose solution.) This allows clients to leverage the price/performance, reliability, and real-time memory and CPU management of the SMP world along with the superior scalability promise of the MPP approach. Clustered approaches with shared disks (for example, Oracle9i Real Application Clusters) will be preferred by clients for high availability, and clustered approaches with pure shared-nothing databases will be preferred for scalability and performance, pure and simple.
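The cache-coherency cost that flattens SMP throughput, and the near-linear behavior a shared-nothing design buys back, can be illustrated with Gunther's Universal Scalability Law. The parameter values below are illustrative assumptions, not measurements of any platform discussed here: sigma models contention (serialized work) and kappa models coherency (cache/lock synchronization), which grows with pairwise communication among processors.

```python
def throughput(n: int, sigma: float = 0.05, kappa: float = 0.002) -> float:
    """Universal Scalability Law: relative throughput of n processors.

    sigma: contention coefficient (fraction of work that serializes).
    kappa: coherency coefficient (cost of keeping caches/locks in sync,
           proportional to the n*(n-1) pairwise exchanges).
    """
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# With nonzero kappa (SMP-style cache coherency), throughput peaks and
# then declines; with kappa = 0 (shared-nothing style), it keeps growing.
for n in (1, 8, 16, 32, 64, 128):
    print(f"{n:4d} CPUs -> speedup {throughput(n):6.2f} "
          f"(shared-nothing model: {throughput(n, kappa=0.0):6.2f})")
```

The model makes the trade-off concrete: adding a processor to the coherency-bound configuration eventually returns less than a full processor's quota of work, while the kappa-free curve continues to scale, subject only to contention.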
In general, those installations that favor consolidation on a single SMP hardware/database platform will look to leverage the new 64-bit nonclustered servers from Sun, HP and IBM with a standard, nonparallel relational database, such as IBM's DB2 UDB or Oracle8i/9i. Those enterprises with data volumes in excess of 3TB (and growing) and with complex, high-performance, active data warehousing requirements will require special-purpose database servers, such as Teradata (on NCR WorldMark hardware) or the IBM Regatta with the SP framework and DB2 EEE. (For clients already on a special-purpose data warehouse, such as Red Brick or Informix XPS, the IBM acquisition should be a source of assurance. Those clients can not only rely on continued IBM support for the products, but can also look forward to future enhancements.)
Special situations do arise that require a dedicated data warehouse database. The bar on those situations continues to rise as SMP processors handle terabyte volume points. However, the situations still exist. These include:
- Volume points above 3TB of raw data.
- A service level agreement (SLA) requiring rapid responses in real time.
- Significant volumes of update activity to the data warehouse concurrent with inquiries.
Enterprises will need to understand the data profile of the data warehousing application in question -- especially with reference to the second and third situations above -- and trace their requirements to the particulars of the available platforms. In general, multiterabyte volumes, high-performance SLAs and active data warehousing will continue to require special-purpose decision support servers.
Lou Agosta, Ph.D., joined IBM WorldWide Business Intelligence Solutions in August 2005 as a BI strategist focusing on competitive dynamics. He is a former industry analyst with Giga Information Group, has served as an enterprise consultant with Greenbrier & Russel and has worked in the trenches as a database administrator in prior careers. His book The Essential Guide to Data Warehousing is published by Prentice Hall. Agosta may be reached at LoAgosta@us.ibm.com.