Portals eNewsletters Web Seminars dataWarehouse.com DM Review Magazine
DM Review | Covering Business Intelligence, Integration & Analytics
   Covering Business Intelligence, Integration & Analytics Advanced Search
advertisement

RESOURCE PORTALS
View all Portals

WEB SEMINARS
Scheduled Events

RESEARCH VAULT
White Paper Library
Research Papers

CAREERZONE
View Job Listings
Post a job

Advertisement

INFORMATION CENTER
DM Review Home
Newsletters
Current Magazine Issue
Magazine Archives
Online Columnists
Ask the Experts
Industry News
Search DM Review

GENERAL RESOURCES
Bookstore
Buyer's Guide
Glossary
Industry Events Calendar
Monthly Product Guides
Software Demo Lab
Vendor Listings

DM REVIEW
About Us
Press Releases
Awards
Advertising/Media Kit
Reprints
Magazine Subscriptions
Editorial Calendar
Contact Us
Customer Service

Should the operational data store be designed using 3NF approach or a dimensional approach?

    Ask The Experts published in DMReview.com
April 10, 2006
 
  By Tom Haughey and Chuck Kelley and Larissa Moss

Q: Should the operational data store be designed using 3NF approach or a dimensional approach? What is the best design approach if I am planning for staging -> ODS -> DW -> DM approach?

Chuck Kelley's Answer:

I believe the ODS is a 3NF... It is an operational system. I also think that you have your approach a little backward. I think it is ODS -> staging -> DW -> DM. I also believe source systems feed both ODS and staging. In my opinion - ODS, staging, and DW are 3NF, and DM are based on the tools the user community is using. They would either be star, snowflake, 3NF, multidimensional, etc.

Larissa Moss' Answer:

The "best" design approach is always to use the most appropriate schema for the specific usage of a database and not to follow anyone's religion on how a database "should" be designed. Having said that, let's examine the most common usage of those four databases:

1. Staging area. The purpose for a staging area is to receive data from the source files and prepare the data for loading the target end-user databases (ODS, DW, DM). Therefore, any files or tables in this area are designed to support the ETL process in the most useful and appropriate way. Since many companies have a variety of different types of end-user databases (ODS with DM, or DW with DM, or ODS with DW and DM), they usually design the staging area as close to 3NF as possible for maximum flexibility (i.e., so that they can create load files for any current and future relational or dimensional end-user target database). In addition, they also utilize plenty of flat files, summary tables, and even dimensional schemas to support very specific ETL processes.

2. ODS. The reason why the ODS was invented is to provide ad-hoc querying capabilities as well as canned or parameterized reporting capabilities to business managers and knowledge workers to assist them with their tactical management needs as well as their operational activities. This usually involves the need to be able to access the data in the same way and through the same data relationships as the data would be accessible on the operational systems. Since operational systems are designed using the 3NF approach (for the purpose of minimizing update anomalies), the most appropriate design for ODS is relational - not necessarily 3NF but denormalized to the degree necessary to achieve the desired query and report performance. (Example: If two tables that have a One-to-Many relationship are always accessed together, they would be denormalized [collapsed], which would violate 1NF.) Frequently run reports that contain operational metrics would best be split off into "Oper Marts" (OM), which are designed using a dimensional schema, just like most other DM.

3. DW. Enterprise data warehouses (EDW or DW) mostly exist in two different scenarios: a) in a multi-tier hub-and-spoke architecture, where many additional DM exist and b) on Teradata platforms, where separate DM are discouraged because the desired performance and ease-of-use can be achieved through special Teradata features that mimic the dimensional views of DM. That means that metric-based reporting, which is most appropriately delivered through dimensional schemas is handled in both cases either through additional DM (first scenario) or through special product features (second scenario). That allows the EDW to remain in a relational schema - again, not necessarily 3NF but denormalized to the degree necessary to achieve the desired query and report performance - and allows it to support ad-hoc query needs in addition to canned reporting. The difference between ODS and EDW is that an ODS only contains current data (no history) and is volatile (can be updated), whereas an EDW contains historical data and is non-volatile (read only).

4. DM. DM, specifically dimensional DM, were invented to solve severe performance problems when companies ran trend analysis reports against their relational DW that required multi-table joins and table scans. In addition to the performance issues, end users found it difficult to deal with the multitude of tables and data relationships, when all they needed were a few of them to run their reports. Since most reports are metric-based facts delimited by a handful (or two) of dimensions, most DM are designed using the dimensional approach. However, if a group of end users needs extensive ad-hoc query capabilities against a specific subset of data for a specific purpose that requires the data to be navigable the way it is in the real world (i.e., the way it is related in the operational systems), then the most appropriate design for their DM would be relational - again, not necessarily 3NF but denormalized to the degree necessary to achieve the desired query and report performance. The most important distinction of DM is that they are tailored reporting solutions, which means that each solution must be designed (tailored) for its specific intended usage and access pattern.

Tom Haughey's Answer:

The operational data store (ODS) is a tactical environment which stores detailed, near-real time results of committed transactions for a certain period of time for immediate reporting needs and which can sometimes be updated by users. An ODS is often created for one of three purposes:

  • Integrating data from multiple sources (because modifying the source systems would be too costly)
  • Tactical reporting
  • Providing consolidated update processing

While the ODS can be used to stage data for the DW, that is not its primary purpose. An ODS needs to have an operational purpose, reflecting the three reasons just mentioned.

Figure 1 summarizes the differences across the system types you refer to, except for staging.

Figure 1: System Differences

Given this, let us discuss how normalization fits in.

First and foremost, the concept of a normalized structure is not well understood in data warehousing. Kimball's original Dimensional Modeling Manifesto is predicated on the completely false assumption that the designer has the choice of using a dimensional design of informational data or a normalized design of operational data. His examples all emphasize this. This assumption is completely wrong. In warehousing the second choice is for a normalized model of informational data. Actually, if you take most of Kimball's examples and provide their normalized equivalent, the primary difference would be that the normalized model would be a snowflake. Another difference in some examples, such as his Orders examples, would be the separation of Order Header and Order Item rather than just the fact table Order Item as in his example. More about that later.

My colleague Joe Oates published an interesting paper in which he described data warehousing normalization. Warehouse normalization is different than operational normalization, he contends. The reality is that there is no such thing as DW normalization, but there is the application of normalization to informational (or DW) data. The rules or normalization are the same. The difference is in the data. It just comes out different than if you normalize informational data.

Margy Ross (and Kimball along with her) are jousting at windmills when they say that the main reason proposed for normalizing DW data is to save space. This is nave. First, the main reason for normalizing the data in an operational environment is to prevent update anomalies. An update anomaly occurs when you need to update in several redundant places - and don't. Then you have an anomaly. In the data warehouse the main purpose of normalization is to produce the most flexible model possible. The normalized model is better suited to satisfy any ad hoc query. Any time you collapse data, and that is what happens in a star schema, you reduce the flexibility of the data at least to some degree. How much depends on the degree of flattening. Then you have to do other things to circumvent this, such as introduce a greater amount of redundancy to cover the variations you just flattened out. Any time you flatten the data into a structure that violates the functional dependencies of the data, you constrict its flexibility because the functionally dependent model is the most open. It shows all the granular possibilities.

In addition, when one talks about implementing a normalized model, that does not necessarily imply that the implemented model is totally normalized. Most fully normalized logical models get physically denormalized to some degree to accommodate performance and other purposes.

Back to the ODS. The ODS model is often a synthesis of different operational models. Sometimes, but rarely does it look like any one individual operational model. Above we talked about the three main reasons for building an ODS. Notice that some ODSs are updatable. For this reason and for the reason of greater flexibility, the ODS model is better off more normalized. If a lot of reporting is done off the ODS, some aggregate structures could be used as well to supplement the base data. Remember the state of being normalized and the introduction of some denormalization and even some supplementary summary (and denormalized) structures are not incompatible.

An ODS can be used for staging, but that is not one of its justifications. An ODS has to have some operational purpose as described above. Staging data can be kept in files or on relational structures, normalized or not as best suits the needs of pure staging.

The base data in the central data warehouse (that is, the main database in the data warehouse) should also be more normalized. Again, one of the main purposes of the CDW is to answer any query or extract needs, especially ad hoc queries. The normalized model is the best model for providing such flexibility. Again, some denormalization and some stored summarization are normal.

As data gets pushed out to data marts, it can (but doesn't have to) get more dimensional. There are different types of data marts, namely:

  • summary marts (where the data is aggregated),
  • subset marts (where the data is as detailed as the DW but with either fewer rows or columns), and
  • string marts (files extracts that go to data mining).

Summary marts are naturally dimensional, as is all aggregated data.

Finally, remember that the level of normalization required for efficiency is significantly influenced by the platform. If you a running on a robust parallel platform, such as Teradata, it would squander the resources of that system to reduce everything to a purely dimensional structure. On the other hand, if you are running on hand-me-down server and a left-over DBMS license, you may have to do everything you can to optimize the data.

...............................................................................

For more information on related topics visit the following related portals...
Data Management, Data Modeling and DW Administration, Mgmt., Performance.

Tom Haughey is the president of InfoModel LLC, a training and consulting company specializing in data warehousing and data management. He has worked on dozens of database and data warehouse projects for more than two decades. Haughey was former CTO for Pepsi Bottling Group and director of enterprise data warehousing for PepsiCo. He may be reached at (201) 337-9094 or via e-mail at tom.haughey@InfoModelUSA.com.

Chuck Kelley is a senior architect in the business intelligence practice for Hitachi Consulting (www.HitachiConsulting.com), a globally recognized leader in delivering value-based business and IT Solutions. Kelley is an internationally known expert in database and data warehousing technology. He has 30 years of experience in designing and implementing operational/production systems and data warehouses. Kelley has worked in some facet of the design and implementation phase of more than 50 data warehouses and data marts. He also teaches seminars, co-authored three books on data warehousing and has been published in many trade magazines on database technology, data warehousing and enterprise data strategies. He can be contacted at chuckkelley@hitachiconsulting.com.

Larissa Moss is founder and president of Method Focus Inc., a company specializing in improving the quality of business information systems. She has more than 20 years of IT experience with information asset management. Moss is coauthor of three books: Data Warehouse Project Management (Addison-Wesley, 2000), Impossible Data Warehouse Situations (Addison-Wesley, 2002) and Business Intelligence Roadmap: The Complete Project Lifecycle for Decision- Support Applications (Addison-Wesley, 2003). Moss can be reached at methodfocus@earthlink.net.

Solutions Marketplace
Provided by IndustryBrains

Data Validation Tools: FREE Trial
Protect against fraud, waste and excess marketing costs by cleaning your customer database of inaccurate, incomplete or undeliverable addresses. Add on phone check, name parsing and geo-coding as needed. FREE trial of Data Quality dev tools here.

Speed Databases 2500% - World's Fastest Storage
Faster databases support more concurrent users and handle more simultaneous transactions. Register for FREE whitepaper, Increase Application Performance With Solid State Disk. Texas Memory Systems - makers of the World's Fastest Storage

Design Databases with ER/Studio: Free Trial
ER/Studio delivers next-generation data modeling. Multiple, distinct physical models based on a single logical model give you the tools you need to manage complex database environments and critical metadata in an intuitive user interface.

Data Mining: Levels I, II & III
Learn how experts build and deploy predictive models by attending The Modeling Agency's vendor-neutral courses. Leverage valuable information hidden within your data through predictive analytics. Click through to view upcoming events.

Free EII Buyer's Guide
Understand EII - Trends. Tech. Apps. Calculate ROI. Download Now.

Click here to advertise in this space


E-mail This Ask The Experts E-Mail This Ask The Experts
Printer Friendly Version Printer-Friendly Version
Related Content Related Content
Request Reprints Request Reprints
Advertisement
advertisement
Site Map Terms of Use Privacy Policy
SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.
SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.