-
Marketplace
-
Channel Resources
Articles from this Site
A Scoring Model and Choice Model for Multistage Cross Selling in the Insurance Industry, Part 2
Lawrence Technological University Uses Metastorm ProVision
Information Builders to Extend WebFOCUS to Predictive Analysis
A Statistical Stocking Stuffer for the Holidays
What are your views on the advantages and/or disadvantages ETL tools and data modeling versus code?
White Papers
Best Practices: Eight Tips for Improving Your Professional Services Business
Metadata Management for Enterprise Applications
UML for C#
PHP Code Design
Domain-Specific Modeling: 10x Faster than UML
Web Seminars
Modeling Unstructured Data
Creative Strategies for Achieving 24/7 Uptime
Closing the Loop: Real-Time Event Detection and Response
Learning from Others: Best Practices for Data Governance
Supercharging Enterprise Information Quality with Web Services
Books
Data Mining Cookbook: Modeling Data for Marketing, Risk and Customer Relationship Management
Data Modeler's Workbench: Tools and Techniques for Analysis and Design
The Data Modeling Handbook: A Best-Practice Approach to Building Quality Data Models
Data Mining Using SAS Applications
Data Mining: Concepts, Models, Methods and Algorithms
What is the best data modeling approach for a staging area?
Question: What is the best data modeling approach for a staging area? Please note that the staging area in my case would be the source for both the warehouse (dimensional model) and also an OLTP (relational model) system.
Chuck Kelley's Answer: In a business intelligence staging area, I prefer to have the structures be a replica of the source system with a date or key that says when the data was pulled. That way, I can build my extraction processes to yank the data as quickly as possible, not to bog down the source system. From here, I start the transformation processes.
Clay Rehm's Answer: I don't think this can be answered without seeing the data elements and how the data will be used as well as how the data will be loaded to the staging area. It is safe to say that a fair amount of flexibility will be needed.
You could prototype several different model types and test them against the load approaches as well as the extract approaches.
Joe Oates' Answer: First, let me describe my view of what a staging area is. A staging area is a place where data from source systems is extracted to in order to perform the transformation and load steps of the ETL process. The staging area is not often on the same box as an OLTP system (source system to the data warehouse) because the ETL process may take enough resources so that the OLTP system has unacceptably slow response to its users.
The physical location of a staging area can be on the same box as the data warehouse or it can be on a separate box if huge volumes of data is being extracted and transformed. Keep in mind that the staging area, except for the initial (historical) load, only contains data that has changed, been added to, or deleted from the source systems since the last ETL process was run, not all of the data in the source systems.
The staging area will contain tables or files that have exactly the same structure as the source system tables or files. The staging area will also contain tables or files that have the same structure as the data warehouse dimensions and fact tables. These tables, unlike their counterparts in the data warehouse, are not indexed.
The transformation process takes the data from the staging area source system tables or files and transforms them into the data warehouse structures within the staging area based on mappings and transformation rules that have been developed in the design stage of the project.
Once the transformation process has completed the data that is in the tables or files with the same structure as the data warehouse are loaded into the data warehouse by a bulk loading utility. With the large volumes of data in a data warehouse, inserting rows into the data warehouse tables takes an unacceptably long time.
It is wise to partition some of the data warehouse tables, especially the very large fact tables, because before the data from the staging area is bulk loaded into the data warehouse, it is necessary to drop the indexes on the data warehouse tables. Once the bulk loading has finished, these indexes must be rebuilt. It takes much less time to rebuild the indexes and a partition that contains a few hundred thousand rows then for a fact table that contains many billions of rows.
So to sum up, I would recommend that your OLTP and data warehouse (including the staging area) not be on the same box. You need two data models, one for the OLTP data structures and one for the data warehouse data structures.
Joe Oates is an internationally known speaker, author and consultant on data warehousing. Oates has more than 30 years of experience in the successful management and technical development of business, real-time and data warehouse applications for industry and government clients. He has designed or helped design and implement more than 30 successful data warehouse projects.
Chuck Kelley is an internationally known expert in database and data warehousing technology. He has 30 years of experience in designing and implementing operational/production systems and data warehouses. Kelley has worked in some facet of the design and implementation phase of more than 50 data warehouses and data marts. He also teaches seminars, co-authored four books on data warehousing and has been published in many trade magazines on database technology, data warehousing and enterprise data strategies. He can be contacted at chuckkelley@usa.net.
Clay Rehm, CCP, PMP, is president of Rehm Technology (www.rehmtech.com), a consulting firm specializing in data integration solutions. Rehm provides hands-on expertise in project management, assessments, methodologies, data modeling, database design, metadata and systems analysis, design and development. He has worked in multiple platforms and his experience spans operational and data warehouse environments. Rehm is a technical book editor and is a co-author of the book, Impossible Data Warehouse Situations with Solutions from the Experts. In addition, he is a Certified Computing Professional (CCP), a certified Project Management Professional (PMP), holds a Bachelors of Science degree in Computer Science and a Masters Degree in Software Engineering from Carroll College. He can be reached at clay.rehm@rehmtech.com.
For more information on related topics, visit the following channels:


