When planning for an enterprise data warehouse what types of investment should be planned for in the way of technology infrastructure?
When planning for an enterprise data warehouse, what types of investment should be planned for in the way of technology infrastructure - mainly computer hardware and software platforms (servers, databases, storage, OS, etc...)? Let's say this is a 1 terabyte EDW to start, please estimate requirements and rough costs.
Sid Adelman's Answer: Your cost will be based on many more factors other than just the size of the data warehouse. Your cost will be based on:
- The functions you are delivering
- The number and characteristics of the source files, levels of quality, and requirement for quality.
- The size of the database
- The number of users, both concurrent and total
- What software you buy and support (ETL, BI, data cleansing, DBMS, etc.)
- Your need for consultants and contractors - this will be a function of your skills levels and the schedule for implementation
- Your service level agreements for performance and availability.
- The architecture of your EDW, use of ODS, staging areas, data marts, metadata
- How well you plan to support your users
- What training you intend to provide for your users
- The training required for your IT staff
It's important to include all these costs as you establish a budget for the EDW. If the other costs are not included, they will not be allocated and you will have a substandard EDW or, if the moneys are allocated, you will be criticized for going over budget.
Les Barbusinski's Answer: Whoa! That's a tall order, but here goes. Exact infrastructure requirements and costs are dependent on a lot of factors including:
- The purpose of the data warehouse (e.g., BI vs. data integration vs. compliance, etc.) as well as the functionality that is required (e.g., 24x7 availability, real-time vs. batch data acquisition, etc.)
- Refresh frequencies and latency requirements
- Projected transaction volumes and concurrency requirements
- Service level agreements
- How much of the required HW/SW is already in house
- Shop standards
- The degree of competition between potential vendors
Hence, I can't give you exact system configurations or cost estimates. However, I can give you a list of infrastructure components that you'll need to assess in order to lay the groundwork for your terabyte data warehouse. This includes:
- Application Servers: You'll need a collection of different application servers to host the layers of application services that your data warehouse will need including ETL, Business Intelligence, Reporting, Messaging, Web Portal and Web Services, Business Rules Engines, etc. Each application will have an optimal configuration that you'll need to address. You'll need to size the server(s) for each application layer (i.e. number of servers, number of processors, RAM, etc.) in order to handle the expected transaction volumes. You'll also have to address the need (if any) for clustering, active/passive failover, 24x7 availability, etc. Also, don't forget to plan for different environments (e.g. development, test, and production). Some of these servers will be UNIX, while others will be Windows ... depending on the application.
- Database Servers: You'll need a collection of database servers to handle all of the database layers inherent in a data warehouse (e.g. historical data store, operational data stores, data marts, extract areas, etc.). First, you'll need to decide on how best to distribute your database nodes (e.g. all centralized, de-centralized data marts, de-centralized ODS instances, etc.), then size each node (i.e. CPUs and RAM per node, servers per node, etc.). As with the app servers, you'll have to address such issues as failover, clustering, 24x7 availability, etc. ... as well as make provisions for handling multiple environments. Keep in mind that your choice of RDBMS will - in large part - dictate your database server requirements.
- Disk (SAN) Space: You've already stated that you need to store roughly one terabyte of usable data. Now you need to plan the space requirements for such things as indexes, logs, sort-work areas, staging areas, distributed data marts, extract areas, etc. You also have to make adequate provisions for replication, mirroring, and database backup (i.e. most backups are done to disk then spooled to tape asynchronously). Finally, you'll have to adjust all these estimates to handle the expected growth rate (taking into account whatever archiving scheme you have).
- Tape Drives: Database and file backups from terabyte-plus sized data warehouses are usually spooled to streaming RAID-5 tape arrays. You'll need to assess the frequency and volume of these backups, choose an appropriate model of tape array, plan for an adequate supply of tapes, and insure adequate bandwidth between the servers being backed up and the tape arrays.
- Archiving Devices: The growth of a data warehouse must be controlled in order to assure adequate performance. Aged data that is no longer used, must be periodically culled and archived from the database. Normally, this function is performed with tape drives. However, you may want to consider archiving to a CD-ROM "jukebox" if there is a chance that the data will have to be (a) restored, or (b) retrieved at some point in the future.
- Networks: Obviously, you need to assess your company's LANs and WANs to insure that you have adequate bandwidth for (a) pulling data from source systems into the data warehouse, (b) replicating tables or refreshing materialized views to distributed nodes, and (c) handling query and reporting traffic from end users. You also need to insure that your network is sufficiently redundant to insure that the loss of a single trunk line will not disable your whole system. Also, if any portion of your data warehouse will be exposed to the Internet, you need to assess your need for security devices such as firewalls.
From an infrastructure perspective, all you really need to know about the software is a) the number of licenses that will be needed (remember, three environments ... development, test and production), b) which server(s) they'll be hosted on, and c) their cost (which is usually related to the size of the server they'll run on). Some of the software that you'll need to evaluate, price, and procure for your data warehouse include the following:
- Database (e.g., Oracle, DB2, SQL Server, Teradata, et al)
- ETL/Data Integration (e.g., Ascential, Informatica, et al)
- Business Intelligence (e.g., Microstrategy, Essbase, et al)
- Operational Reporting (e.g., Actuate, Crystal, et al)
- Web Portal (e.g., Plumtree, Sharepoint, et al)
- Messaging (e.g., WebSphere MQ, BEA WebLogic, et al)
- Data Mining (e.g., Clementine, IBM Data Miner, SAS, et al)
- Job Scheduling (e.g., AutoSys, Tivoli, et al)
- Backup (e.g., Veritas, Legato, et al)
Pheeeeew! Hope that helps.
For more information on related topics visit the following related portals...
DW Basics and
Sid Adelman is a principal in Sid Adelman & Associates, an organization specializing in planning and implementing data warehouses and in establishing effective data architectures and strategies. He jointly developed a methodology, MapXpert for Data Warehouse, that provides a master plan for implementing a data warehouse. Adelman is a regular speaker at The Data Warehouse Institute and IBM's DB2 and Data Warehouse Conference. He chairs the "Ask the Experts" column on DMReview.com and is a founding member of the Business Intelligence Alliance. Adelman is a frequent contributor to journals that focus on the data warehouse. He coauthored Data Warehouse Project Management with Larissa Moss and Impossible Data Warehouse Situations: Solutions from the Experts. He can be reached at (818) 783-9634 or email@example.com. Visit his Web site at www.sidadelman.com.
Les Barbusinski is vice president of technology and co-founder of Digital Symmetry, LLC, a consulting firm that specializes in the design and development of data warehousing and business intelligence solutions. He has more than 20 years of experience in data warehouse and operational systems development and provides hands-on expertise in data warehouse design, development and project management. Les can be reached at firstname.lastname@example.org.
Provided by IndustryBrains
|Embarcadero ER/Studio 6.6|
ER/Studio delivers next-generation data modeling. Multiple, distinct physical models based on a single logical model give you the tools you need to manage complex database environments and critical metadata in an intuitive user interface.
|Help Desk Software Co-Winners HelpSTAR and Remedy|
Help Desk Technology's HelpSTAR and BMC Remedy have been declared co-winners in Windows IT Pro Readers' Choice Awards for 2004. Discover proven help desk best practices right out of the box.
|Dedicated Server Hosting: High Speed, Low Cost|
Outsource your web site and application hosting to ServePath, the largest dedicated server specialist on the West Coast. Enjoy better reliability and performance with our screaming-fast network and 99.999% uptime guarantee. Custom built in 24 hours.
|Click here to advertise in this space|