Data Warehousing: What’s Next? Part 5: Increase Your Architectural Flexibility
The true mission of data warehousing is cross- functional integration and integrity of results. This was a dominant conclusion of the last article in this series (Part 4: The New Islands of Information). Many of us are still struggling with the more modest, yet still daunting, task of providing information access and decision support capabilities one group at a time. This series explores various avenues to prepare us for this broader goal.
This week we continue our discussion of architectural overhaul. We will build on ideas introduced in previous sections. If you have not read them yet, I recommend that you do so before proceeding. (DM Direct beginning May 10, 2002.)
One ultimate goal is to move toward facilitating the flow of information. Before we can begin, we must first master the selection of optimal structures for our data warehouse architecture. The incorrect match between data structure and its intended role in our architecture can be a barrier, rather than an enabler, to the flow of information.
Confusion of Methods
The history of data warehousing is dominated by the war of competing design frameworks. I call them the Rationalists, the Stargazers and the Cubists. The Rationalists believe in the purity of the relational model as expressed by the rules of normalization. The Stargazers are religiously devoted to the star schema and the practice of dimensional modeling. The Cubists proclaim the supremacy of multidimensional database structures.
The Cubists have always been a niche faction in the data warehouse pantheon. They are actually the oldest cult since multidimensional databases (MDDB) pre- date the relational DBMS that supports the other two factions. The MDDB has a strong following in select functional areas, such as finance, but is not suitable for core data warehouse functionality.
The Cubists have no single spokesperson today. However, their passion was given credence when the original father of relational theory, E.F. Codd, defined the rules for online analytic processing (OLAP) in the mid-1990s. This manifesto supported the arrival of a new, more capable generation of multidimensional products, such as Essbase, and later Microsoft's Analysis Services for SQL Server.
The Rationalists were originally the dominant faction. Relational design unified what was previously a chaos of proprietary databases, each with a unique methodology. For most of the early DW architects this was not only all they knew, it was a proven approach for traditional application design. The pure Rationalist line is that the whole of the data warehouse must be implemented in at least third normal, non-redundant form.
Bill Inmon is often touted as the champion of the Rationalist faction. This is a fundamental misreading of his much more sophisticated framework. From the very beginning, he advocated layers of lightly and heavily summarized data sets in a multiplicity of time variant forms to serve as the access layer for a data warehouse. These are what we refer to as summary/aggregates. They feature various degrees of denormalization and definitely violate the pristine notions of non-redundancy. Summary/aggregates are a necessary component of a functional data warehouse.
The Stargazers are the newest and fastest growing faction. The concept of a star scheme is about as old as relational theory, and variants of this method were used in early mainframe-based information centers on tabular databases such as Ramis, Focus and Nomad. The Metaphor marketing analysis workstation, introduced in 1983, was based on a proprietary star schema database.
Ralph Kimball is responsible for introducing data warehouse practitioners to this design approach. It is the dominant model for data marts and a growing component of more cross-functional data warehouses. Dimensional modeling offers a more business-intuitive approach than entity- relationship techniques and result in a more access- optimized design. He and his advocates are strong opponents of purely normalized data warehouse implementations, with good reason. The success rates are much higher for a pure star schema solution.
Let's explore the benefits and deficits of each structure:
Normal (3rd- 5th)
The traditional justifications for normalization either don't apply to data warehouse design or are down right dysfunctional.
- Eliminate redundancy - Eliminating duplicate rows is good. Eliminating "redundant" attributes in a star schema dimension table will actually destroy its high- access efficiency. When this concept is applied to reducing the total number of tables, it runs counter to the data warehouse principle of optimization via specialization. A summary table is not redundant if it supports more efficient access by multiple users. A maintained summary can be used to accurately reproduce historical results and improve consistency. The desire to eliminate redundancy is the most pernicious fallacy carried over from application design.
- Save space - This corollary to eliminating redundancy is a holdover from another era. Saving cost by saving space is low on the totem pole of optimization techniques. The relative impact of storage on cost is way down. The loss of access efficiency has far greater cost impact.
- Support efficient update - Does not apply at all since update is not allowed. The load methods for relational tables in a star schema design can actually be more efficient than a load of normalized transaction and snow- flaked reference data.
- Avoid corruption - The use of DBMS controlled referential integrity is actually counter-productive. Referential integrity is designed to support the original capture of data in the source system. A data warehouse needs to detect, flag and possibly correct data anomalies BUT rejecting incoming records can only reduce the integrity of results produced from DW data.
- Representation advantages - Entity/relational modeling evolved to ensure completeness of the design for the intended purpose and to offer a more intuitive representation of the real world. In the ambiguous world of data warehouse information needs, no technique can ensure completeness of design. In fact, the standard methods dictate that you only include attributes and relationships for which there is a current expressed need. This is suboptimal for data warehouse design. As for offering a more intuitive representation, it seems clear that dimensional design wins hands down.
The benefits of normalized design do not derive from traditional application justifications. They are attributes of the technique that have unique expressions in data warehousing.
- Transactions - Normalized design is the best for retention of transaction details. Dimensional design, whether in a star schema or an MDDB cube, does not support transactional keys and granularity. A normalized transaction table can serve as the drill-down target for summary analysis.
- Reference history is best maintained in a third normal form table with the addition of start and end dates to define the validity range. The start date becomes the lowest order component of the key. This way you can maintain a complete sequence of changes to reference data such as customer or product attributes.
- Base tables - These normalized event (transaction) and reference tables can act as an efficiency base to construct (or reconstruct) star schema dimensional and fact tables or summary/aggregate tables. They also serve as a base for export of detail.
The deficits of normalized design arise from its nature and from misconception of purpose.
- Poor access efficiency - Normalized design is the worst, by far, for most query access. A normalized design is optimized for key- based, record-at-a-time inquiry or table-level query that efficiently uses the provided indexes. Rarely are query variants limited enough to build an efficient array of indexes. This makes ad hoc support extremely poor.
- Complexity of query specification is high.
- Suboptimal for repetitive actions - Repetitive actions include canned reporting and repeated queries. Joins and summaries are executed repeatedly at much higher cost than operating from a summary/aggregate, star or cube.
- Limited exploratory access - Exploration can only be simulated by use of a relational OLAP front-end tool. This provides more configuration options but much lower performance than a MDDB.
- ODS mode - A normalized table can easily be used in ODS rather than DW mode. Updates can be applied violating the non-volatility rule. Often only current data is maintained. Net change history is not implemented.
- Staging areas - Sometimes relational tables are used as a staging area. The data is not retained. A strong case can be made for NOT using relational tables solely to build summary/aggregates, star schemas or cubes. Flat files can be far more efficient and actually offer more load, integrity and transformation options. They also eliminate maintaining an additional tier of tables.
A star schema embeds more business logic. The dichotomy of dimensions and facts is a natural representation of how data is analyzed. The fact table contains numerical event data generally derived from multiple source tables. It is a specialized form of summary/aggregate table. Each dimension represents the entire set of characteristics and reporting rollup for a single reference subject area. The design is organized via formal rules that are stricter, yet more easily understood, than normalization rules.
In a sense dimensional modeling rules are a twist on normalization. Both facts and dimensions satisfy the rules for the basic level of normalization. They require strict key dependency and disallow repeating groups. A fact table satisfies at least 3rd normal form rules while adding additional constraints. The number of keys match the number of dimensions; one key per dimension. Each key is single valued. Only numeric, non-encoded values are (generally) allowed in a fact table other than the keys. The fact that dimensions are NOT 3rd normal form is the key to the efficiency of the star schema.
The star schema is optimized for access. This is the core of its benefits that include:
- Analytic Flexibility - The facts can be accessed and analyzed across an array of levels of a variety of dimensions. It has fewer constraints than a cube.
- Analytic efficiency - It is far more efficient at producing summaries than a normalized set of tables and eliminates the need to join tables to aggregate facts. The dimensional join scheme is the source of its selective and summary power.
- More easily reconfigured than a cube - Dimensional attributes and fact elements can be added by separate table mods that do not affect the other tables. A cube (generally) has to be reloaded in its entirety.
- Best for ad hoc - For all the above reasons, interactive access is more optimal, on average, than with other methods. Ad hoc access is only constrained by the choice of elements in the fact table. Joining to other tables to get additional facts wrecks the optimization of the schema set.
- Excellent source for summary tables - The star schema's flexibility and efficiency make it a better choice for generating summary tables. The constraint is the availability of the full set of facts needed.
The star schema deficits are a factor of the compromises that make it work.
- More narrow scope than relational - The star schema is configured for a specific constituency in terms of facts assembled, base granularity offered and dimensions represented. This is a factor of the trade- off slanting toward greater specificity.
- Retention of history is complex and limited - Several different, mutually exclusive methods exist to record net change history in a star schema. The selection of any one method may further reduce the range of usage. They reduce efficiency of access. Many star schemas present a current- only data view. Relational net change history maintenance is easier and more broadly applicable. It will not impact query performance when the relational tables are used exclusively as base tables and not for access.
- Moderate performance - The structure offers generally better performance than relational and lower performance than a comparable cube. A specially designed summary/aggregate should always beat a star schema for its designated usage.
- Suboptimal for repetitive action - Canned reporting and any repeating query is best executed against summary/aggregate tables.
- Not suitable for detail.
An aggregate is a pre-joined but not summarized data set. It is used to accumulate data elements from multiple source tables to reduce runtime processing. They are not fully denormalized (1st or 2nd normal form). They are generally retained until the source data is updated but are not maintained. They are created, used for multiple queries and then dropped or truncated.
- High availability - Aggregates accumulate data needed for multiple executions of the same query (possibly with different search criteria) or execution of multiple similar queries using the same data subset.
- High performance for specific tasks
- Good for repetitive action such as reporting.
- Good source for summaries based on the defined subset of elements.
- Potential export source for the defined subset.
The deficits of an aggregate include:
- Must be re-aggregated to reflect changes in source data.
- No exploratory analysis - except as a component of a ROLAP implementation.
- Narrow applicability.
- Limited interactive use due to the narrow scope of coverage.
Pre-summarized data is often a summary/aggregate since data normally comes from more than one table. Summary tables may be used temporarily but are generally retained or maintained. A vertical summary keeps time (period) as a key for selection and sorting. A periodic collection is a form of vertical summary that is maintained. For instance, each month a new set of monthly summary rows is appended to the table. A horizontal summary contains a time series (array). This form greatly aids in inter-period comparisons or cumulative calculations. The horizontal form is rebuilt each period.
Summary tables are the dominant mechanism for improving performance of repetitive queries. Maintained versions provide consistency over time. The benefits of summary/aggregates are much like base aggregates plus:
- Essential for repetitive action - Summary/aggregates are the preferred structure for reporting and repetitive queries. Performance is increased and consistency is ensured across multiple all queries or reports.
- Time series variant optimized for inter-period operations - This is the only easy and efficient mechanism to support inter- period analysis without resorting to OLAP tools.
- Periodic collections are excellent for history - Maintaining summary data in this form ensures the ability to reproduce prior period values as reported.
- Vertical summaries may be cascaded - High levels of summary can be produced from lower levels. For instance, quarter summaries can be produced from monthly summaries. This has both performance and consistency advantages.
- Potential source for cubes when it is more efficient to feed pre- summarized data to the MDDB loader. Some loaders today are fairly efficient at accepting transaction level detail but the mapping process requires foreign keys that match the desired dimensionality. If you are producing a summary for another purpose, consistency argues for using this existing table even when performance advantages are nominal.
- Excellent Source for Export.
The deficits of summaries are the same as aggregates. The single exception is that the range of applicability of some summary/aggregates can be quite high (not a deficit). For instance, virtually every company that has something to sell produces some form of monthly summary by product category by market or channel or customer type. These high reusable data sets are prime candidates for construction and maintenance as periodic collections.
Cube is short for hypercube. A hypercube is a data structure that stores multiple levels of pre- computed summary data. It is a physical manifestation of multidimensionality.
The dominant rationale for an MDDB is to present access to any level of detail with the same almost instantaneous performance.
- Extremely high performance for a specific scope of activity.
- Optimized for exploratory action - Exploratory analysis is a dynamic process supporting discovery actions seamlessly and continuous during an engagement session. Common actions are to dive (drill down), ascend (drill up), surf (drill across) and spin (pivot) to explore the relationship between values. This is opposed to interactive access which presents one data perspective at a time. A new query must be run to get the next data slice.
- Inter-period comparisons - This is an OLAP advantage that is shared only with time-series summary data sets, which are rarely implemented.
- Inter-cell comparisons and calculations - This is an exclusive OLAP feature.
- Native support for drill through - The new generation of MDDB products support the ability to drill from a cell to access the underlying transaction detail.
- Best choice for specific constituencies - Financial analysis, for instance, is heavily oriented to multi-dimensional presentation with the performance of a true MDDB.
The cube structure is not suited for the broad range of typical data warehouse operations. For this reason, the list of deficits is proportionally larger. Keep in mind that it is the superior choice within its design parameters.
- Narrowest constituency range - A MDDB is constrained by many factors to a narrow range of usage. In addition to the factors below, it stores a fixed set of dimensions and attributes that requires redesign and a reload to extend.
- Poor choice for ad hoc - A cube contains a highly specialized subset of data and is not organized to support query access.
- Suboptimal for reporting - The reasons are similar to ad hoc. In addition, MDDB products are weak in formatting capability and are not suited for listing-style reports.
- Not suitable for detail - Like a star schema, a cube is a dimensional structure that cannot support the incompatible transaction keys. This is partially made up for by support for drill-through.
- Must be reloaded to capture changes to underlying data.
- Load performance is the trade- off to access performance - The load must calculate and store summaries for all (sometimes fewer) combinations of all levels of all dimensions. Load times have come down significantly with new product releases but are still much greater than alternative structures.
- Not a good vehicle for history - History in this context means net change history of slowly changing dimensions; not retention of old data. An MDDB is geared only to presentation of one set of dimensional values per cell. For instance, if a product name changed recently either all historical values are represented with one (generally the new) name or the old and new products are treated as distinct, unrelated things. You can't choose on the fly whether to treat them as the same, as distinct but linked, or as discrete products.
Additional Decision Variables
Clearly, no single structure is optimal except in a very narrow range of usage. The broader the scope of the warehouse, the more suboptimal any one approach becomes. As we have seen in prior installments, usage can vary as much within a function (such as finance or sales) as it does across functions. It can differ widely by consumer role. Some people just need raw transactional data. Others need highly refined metrics drawing data from diverse sources and multiple levels of detail.
Another factor we have not explored is the variety of time perspectives. Usage differs in terms of frequency of data refresh or periodicity of relevant summaries. The tolerance for latency varies widely. Some need yesterday's data today, while others can wait to get daily data at the end of the week. The difficult ones want today's data now.
As if our task is not difficult enough, factors such as volume, volatility and variability can affect which structural choice is optimal. Volume of data, both absolute and incremental, can invalidate the use of a cube structure if it is very high or eliminate the need for a summary/aggregate if it is very low. Volume of usage can perversely have the opposite effect. The cost of building a cube or summary aggregate is reduced when amortized over a greater base of usage.
Volatility of data, the rate of change of values, may have different effects depending on whether it is event data or reference data. Highly volatile event (or transaction) data will require fact tables, summary/aggregates and cubes all to be recomputed. Some MDDB products now support selective revisions which reduce this burden. Highly volatile reference (or dimension) data can require cubes and highly denormalized (1st normal form) summary/aggregates to be reloaded. Only the impacted dimension tables of a star scheme need to be replaced. Summaries that carry only coded values (2nd normal form) are not affected at all.
Volatility of need is a measure of change in usage frequency. This can throw a kink into the optimization process if the volatility is unpredictable. Variability of usage is when different data sets and different analyses are used by the same group over time. Strategic planning and market research groups operate on a project basis where the needs vary widely from case to case. These factors must be understood when attempting to align usage and structure.
It is easy to see why some folks fall back to the default of normalized design. Its saving grace is the ability to serve the widest range of unknown needs. This "solution" is clearly least common denominator. It is suboptimal in the majority of cases. Lack of knowledge regarding patterns of usage is the gotcha. As noted in earlier installments, lack of usage visibility is a flaw that must be corrected for any data warehouse to succeed in the long run.
This time we explored the options for a more diverse and flexible architecture. Each structural alternative has its place in a complete solution. In the next installment, we will present a set of architectural principles that will serve as a guide to architectural overhaul.
For more information on related topics visit the following related portals...
DW Basics and
DW Design, Methodology.
Michael Haisten, vice president of business intelligence at Daman Consulting, is considered one of a handful of visionaries who have helped shape the data management industry. He has accrued more than 22 years of leadership in information management and architecture development. Haisten served as chief information architect at Apple Computer Corporation, where he developed the data warehouse management team. He has been chief architect, designer and technical expert for more than 72 data warehouse and decision support projects. In addition, Haisten is the author of Data Access Architecture Guide and Data Warehouse Master Plan and has published extensively on data warehouse planning, data access facilitation and other key aspects of data warehousing. You can contact him at email@example.com.
Provided by IndustryBrains
|Data Validation Tools: FREE Trial|
Protect against fraud, waste and excess marketing costs by cleaning your customer database of inaccurate, incomplete or undeliverable addresses. Add on phone check, name parsing and geo-coding as needed. FREE trial of Data Quality dev tools here.
|Speed Databases 2500% - World's Fastest Storage|
Faster databases support more concurrent users and handle more simultaneous transactions. Register for FREE whitepaper, Increase Application Performance With Solid State Disk. Texas Memory Systems - makers of the World's Fastest Storage
|Backup SQL Server or Exchange Continuously|
FREE WHITE PAPER. Recover SQL Server, Exchange or NTFS data within minutes with TimeSpring?s continuous data protection (CDP) software. No protection gaps, no scheduling requirements, no backup related slowdowns and no backup windows to manage.
|Design Databases with ER/Studio: Free Trial|
ER/Studio delivers next-generation data modeling. Multiple, distinct physical models based on a single logical model give you the tools you need to manage complex database environments and critical metadata in an intuitive user interface.
|Free EII Buyer's Guide|
Understand EII - Trends. Tech. Apps. Calculate ROI. Download Now.
|Click here to advertise in this space|