FREE DM Review Site Registration!
Sign-up today and access DM Review on the Web!

Your FREE registration entitles you to:

FREE email newsletters

FREE access to all DM Review content

FREE access to web seminars, resource portals, our white paper library and more!

   

What is the most appropriate schema architecture for a data warehouse with downstream dependent data marts?

Question:

  • What is the most appropriate schema architecture for a data warehouse with downstream dependent data marts?

Chuck Kelley's Answer:

  • I would tend to utilize a normalized data structure for my data warehouse. The biggest exception would be if some in the user community are going to access the data directly from the data warehouse. Then I might consider a star as well.

Tom Haughey's Answer:

  • A couple of preliminary points. First, let me use the term, central data warehouse (CDW), to describe the main database within the data warehouse. Second, there will be multiple levels of data in the overall warehouse architecture, going from base (and fairly normalized) data to highly summarized (and more denormalized) data. Third, aggregated data is naturally dimensional. Fourth, the term normalized means to produce a non-redundant structure; it does not force you to go back to the operational level of detail. Fifth, any time you flatten data, you compromise it in some way, however slightly.

    For the CDW, there is one major question: what grain of data should you keep - both for events (facts) and reference (dimension) data. Don’t be concerned about which is a fact and which is a dimensions per se in the CDW schema. Not all data in the CDW will fit nicely into this pattern. The CDW should be reasonably normalized; remember this does not mean normalized back to operational levels of data; stay within analytical data. For example, say you are a credit card company and decide to keep the credit card transaction. This will not be identical to the operational transaction. It will be an enhanced snapshot of the operational transaction. There is a new grain of data in the warehouse; normalize within it. Do not be overly concerned about predefining data as facts and dimensions. Focus more on the kinds of data you need to keep and keep it in a flexible structure so that it can be sliced, diced, queried and extracted to suit needs. Be sure in the CDW to keep data wider and deeper than requirements initially indicate.

    The CDW should ideally contain data supporting multiple management processes (cross-referenced) such as sales, marketing and finance. When the CDW contains the base data from all (or most) management processes, then the data warehouse itself is considered an enterprise data warehouse (EDW). EDWs exist and plenty of them. To paraphrase Mark Twain, “Reports of their death have been greatly exaggerated.” But remember two important guidelines:

    • Build the warehouse in short increments. Never attempt to do it all at once. The first release should be three to nine months from commencement, depending on size and complexity. All subsequent releases are quarterly.
    • Deliver an early prototype of the database one-third of the way through the release, so that user can test the data to make sure it satisfies their needs. Remember, data warehousing is a rapid application development (RAD) project type.

    Consider also Moore’s law: technology doubles in power and halves in price every two years. The practice 10-15 years ago was that 80 percent of queries ran off data marts and only 20 percent used base data (CDW). This no longer needs to be the case. Many contemporary warehouses run most reporting off the base grains of the CDW. Of course, you cannot do this on a second-hand server and a left-over DBMS license. You need a significant platform for this but many exist today.

    If you build a centralized warehouse as described above, the need for conformed dimensions will be reduced. Conformed dimensions presumes multiple copies of the same dimension. If you build a cross-functional CDW, there will probably not be conformed dimensions within it - for the simple reason that there will be only one copy of the dimensions within it. This includes embedded data marts, that is, data marts that are contained within the CDW. Conformed dimensions will be more applicable to dependent data marts, that is, data marts that draw their data from the data warehouse. Here there will be multiple copies of the same dimension, and, of course, these should be coordinated.

    The above architecture implies a reduced need for aggregates to improve query performance. Aggregates are needed for two reasons: to improve performance and to establish a consistent set of summary numbers for everyone to use. In today’s technology, the reliance on aggregates to improve performance is not eliminated but it is reduced. Aggregates are a two-edged sword: they improve performance, but they take resources to build and maintain. Build aggregates where base data is dense, for frequently run queries, where the resulting aggregate is stable (meaning that is will not have to be changed), and where the reduction in data volume is significant (say 10:1). If aggregation still does not give you adequate performance, then off-load the data to a separate server as a dependent data mart.

    With aggregated data, you don’t have to worry about whether to use a dimensional model because the data will take care of that for you: aggregated data is inherently and naturally dimensional. “Give me sales revenue and volume (facts) by customer by product by market segment by month (dimensions)” is naturally a dimensional structure of facts and dimensions. Whether you use a star of a snowflake is a separate question. Test to see which is better in different cases in your environment. For skinny and simple dimensions, a star with its flattened dimensions will work. For very wide dimensions, very populous dimensions, or very complex dimensions, it may be better to split one large dimension into a snowflake with multiple dimensions.

    Here are some examples. Take a very wide household dimension. It may be better to split it into (say) base household data, financial household data and demographic household data. The resulting snowflake components have a 1:1 relationship to each other. This will likely perform better than a pure star with all flattened into one dimension. Here is another example. The relation between Customer and Policy in an insurance warehouse or Customer and Account in a financial warehouse is M:M, not 1:M, so a more complex structure than just fact and dimension (1:M) will be necessary. Attempting to flatten this into a pure star will cause more problems than it is worth. Remember, the world is not a star, not even the world of analysis. A third example is that of a complex Employee history. Some parts of Employee do not change (and if they do you do not care), whereas other parts of Employee data may change at different intervals. It may be best to split these into several separate but related dimensions.


Chuck Kelley is an internationally known expert in database and data warehousing technology. He has 30 years of experience in designing and implementing operational/production systems and data warehouses. Kelley has worked in some facet of the design and implementation phase of more than 50 data warehouses and data marts. He also teaches seminars, co-authored four books on data warehousing and has been published in many trade magazines on database technology, data warehousing and enterprise data strategies. He can be contacted at chuckkelley@usa.net.

Tom Haughey is the president of InfoModel LLC, a training and consulting company specializing in data warehousing and data management. He has worked on dozens of database and data warehouse projects for more than two decades. Haughey was former CTO for Pepsi Bottling Group and director of enterprise data warehousing for PepsiCo. He may be reached at (201) 337-9094 or via e-mail at tom.haughey@InfoModelUSA.com.

For more information on related topics, visit the following channels:



Industry Vendors