|Sign-Up for Free Exclusive Services:||Portals|||||eNewsletters|||||Web Seminars|||||dataWarehouse.com|||||DM Review Magazine|
|Covering Business Intelligence, Integration & Analytics||Advanced Search|
The Exploration Warehouse
In 199 when the world was first exposed to data warehousing, a data warehouse was a vague structure for DSS that was something other than an operational environment. While the idea of a data warehouse in 1990 was very appealing, the form and structure of the warehouse was unclear.
The Emergence of Structure
But as time passed, the different components (or architectural forms) of the warehouse began to emerge.
First came the operational data store (or ODS). The ODS sat one half in the operational environment and one half in the DSS environment. For those organizations trying to integrate unruly and undisciplined operational systems, the ODS provided a viable migration path.
Then came data marts. Each department wanted its own warehouse with its own unique interpretation of departmental data. The departmental data warehouse turned into a data mart.
The data marts were unique structures that bore a resemblance to the larger enterprise data warehouse but kept a distinct departmental flavor at the same time. Soon different types of technology that supported different types of data marts appeared--ROLAP, MOLAP, and so forth.
The next architectural component of the DSS environment that became evident was that of multiple enterprise data warehouses. Organizations found that some business areas were so large that they required their own enterprise detailed data. Their enterprise data warehouses were at the lowest level of detail and contained large amounts of history. Soon multiple enterprise data warehouses were linked together.
Enter a New Structure
Today a new DSS architectural structure is emerging. The structure can be called the "exploration warehouse." The exploration warehouse is just now being defined in terms of its form and structure.
The Evolution of the Exploration Warehouse
The exploration warehouse provides two fundamentally unique capabilities heretofore seen as irreconcilable. The first capability (and here the exploration warehouse may be called a "prototype" warehouse) provides a prototyping and iterative development capability for database designers and developers. In this role, the exploration warehouse allows the designer to truly practice iterative development of the data warehouse. The second capability of the exploration warehouse is to provide powerful analytical processing for very sophisticated business analysis.
The prototype warehouse provides prototyping and iterative development capabilities:
The combinatorial effect of these advantages of a prototype warehouse used in a development mode are such that the corporation is no longer hamstrung by the dearth of technical skills required to create and manage a database.
The exploration warehouse used in analytical processing provides:
The advantages provided by an exploration warehouse are essential to the competitive success of corporations struggling for profitability and market share.
The exploration warehouse--like all of the forms of DSS warehouses--was shaped out of need. Consider the organization with an enterprise data warehouse. The enterprise data warehouse contains detail data and an extensive amount of history. It serves many purposes: as a clearing house for feeding data marts, as a basis for farmer queries, as a source of reconciliation, as a historical repository of data, and so forth. Because of the volume of data in the enterprise data warehouse, the cost of the processor the enterprise data warehouse resides on and the volume of data flowing into and out of the enterprise data warehouse, resources are always scarce.
It is in this world that the database designers can fully demonstrate the worth of DSS and data warehousing. The data warehouse administrators can, at long last, deliver on the potential of data warehousing. They have the tools and the capacity to cope with the ever-changing requirements of the end user.
Enter the explorer. The explorer needs to look at large amounts of data in a random manner. The explorer operates in a random, non-repetitive mode. When the explorer embarks on an analytical quest trying to use the enterprise warehouse, the explorer is pushed away. The volume of data the explorer requires, the unusual and unpredictable way in which the explorer wishes to look at the data, and the time frame in which the explorer wishes to look at the data never fit easily with the enterprise data warehouse. A very convenient way to accommodate the needs of the explorer is by the creation of a separate exploration warehouse.
But, there is a second, very important reason why a separate exploration warehouse makes sense; and that reason is that the exploration warehouse is built on a foundation of technology that is optimal for exploration-- "token-based technology." Token-based technology allows data to be condensed to the point that it can be placed in memory. Once placed in memory, the speed of analysis and retrieval is a fraction of the speed that could be obtained in a standard enterprise data warehouse. Furthermore, once the exploration warehouse is created, the data warehouse administrator has moved troublesome exploration processing away from the enterprise data warehouse.
There are some powerful reasons why an exploration warehouse has evolved: the displacement of expensive machine cycles that the enterprise data warehouse resides on, an awkward fit between the needs of the explorer and the structure and technology of data residing on the enterprise warehouse, and the speed and economy of processing that can be achieved by creating an exploration warehouse with technology that is most appropriate for the needs of the explorers.
The Exploration Warehouse
As its name implies, the exploration warehouse is a structure useful for dealing with the unknown in a DSS environment. The exploration warehouse is a DSS architectural structure whose purpose is to provide a safe haven for exploratory and very ad hoc processes away from the primary enterprise data warehouse.
The exploration warehouse and its alternate form, the prototype warehouse, are shown in Figure 1.
When the exploration warehouse is used as a basis for exploratory processing of data that already exists, the exploration warehouse is fed directly and only from the enterprise data warehouse. When the exploration warehouse is refreshed, it is simply reloaded, whole cloth, from the freshest data found in the enterprise data warehouse.
Unlike the exploration warehouse, the prototype warehouse is used in the mode of providing a "what if" foundation for the initial building of the enterprise data warehouse. The prototype warehouse is fed directly from the legacy systems environment and possibly from external data as well. When used in this manner, the prototype processing done by the end user is one that forms the basis for design for the enterprise data warehouse.
In most cases, the nature of the exploration warehouse is one of a temporary, transitory nature. An exploration warehouse may have a life of several months or a life as short as a half hour. The very nature of the exploration warehouse is constant construction and reconstruction. It is very likely that once built, an exploration warehouse may never be built again in exactly the same form and content. And it is unusual for an exploration warehouse to physically exist in a single state for more than a few weeks.
The exploration warehouse serves the purpose of satisfying the need for very unstructured processing within the DSS data warehouse environment. The exploration warehouse is for the explorer who is mining data and the initial designer of the warehouse who is unsure of the requirements for the warehouse and wishes to create a database "straw man."
The Exploration Warehouse and Data Mining
The explorer is an end user who looks at the warehouse on a random, non-repetitive basis. The explorer often looks at very large amounts of data because the explorer needs to work with detail and history. The level of detail that the explorer examines and the length of time that history is spread over conspire to make the queries done by the explorer very large. Furthermore, the queries done by the explorer are very heuristic and iterative. There is no pattern of repetition in the queries submitted by the explorer.
In some cases, the explorer is looking for patterns of data. In other cases, the explorer is looking for associations and relationships that are hidden in the bulk of the data. In yet other cases, the explorer is looking at the quality and consistency of data that is being entered into the warehouse.
Because of the very random nature of explorer processing, data is highly indexed in the exploration warehouse.
The Prototype Warehouse and Initial Development
When the exploration warehouse is used in a prototype mode, it serves as a "trial balloon" for the testing of the initial design of the data warehouse. In this case, the designer is interested in examining and probing what data and what relationships are in (or are going to be in) the warehouse. The nature of the prototype warehouse in this mode is one that allows an enterprise warehouse to be constructed and reconstructed quickly when the designer finds data that is not quite right or that there are relationships that just don't add up.
A very real case can be made for the prototype warehouse to be used as the first iteration of development of the enterprise data warehouse. As data becomes integrated, it is often not clear exactly how the enterprise data warehouse should be shaped. Using the prototype warehouse as an early warehouse allows design to be done very quickly. The mistakes in design (which are a fact of life with every warehouse) can be done in miniature and on a small and inexpensive scale. Once the early iterations of development have been validated, the full-fledged enterprise data warehouse can be created. In many cases, the same technology deployed in the prototype warehouse will serve to house the enterprise warehouse provided appropriate scalability exists. This can further shorten the cycle of enterprise warehouse creation.
When the prototype warehouse is used, it is mandatory that data be integrated before being placed into the prototype warehouse. Merely copying or replicating data into a prototype warehouse creates (and proliferates) the problem of the lack of integration of the legacy, operational environment. Therefore, merely copying data into a prototype warehouse without integration is an exceedingly poor idea.
"Throw-Away" Exploration Warehouses
One of the appealing aspects of the exploration warehouse is that because of the ease and speed of development and because of the often transitory nature of data residing inside the warehouse, in many cases the exploratory warehouse can be considered to be "throw away." Indeed, for an analyst studying a problem that will be changing a week from now, it does not make sense to create a special structure that will stand forever and be used only once or twice. The cost of creating and operating an exploration warehouse is such that a throw-away database is a real possibility. Under these circumstances--which coincide very nicely with the needs of an explorer--throw- away exploration warehouses make imminent sense.
However, there is nothing to say that you have to throw away an exploration warehouse. There will arise circumstances where a body of data will be created that needs to be refreshed and used indefinitely inside the exploration warehouse. In addition to the possibility of an exploration warehouse being used as a throw-away warehouse, there is then the possibility of a long-term exploration warehouse. The possibilities are equally enticing.
By the same token, a prototype warehouse can be moved into the role as the enterprise data warehouse when the early iterations of development have been done.
A Bridge to Data Mining
There is another role played by the exploration warehouse and that is as a bridge to the world of data mining. The question can be raised: Why do we need another warehouse for data mining? The answer is that indeed data mining can be done on the enterprise data warehouse. But there are many reasons why a separate database makes sense.
The first reason why a separate exploration warehouse makes sense is that performance in the enterprise data warehouse is not affected when the explorer builds an exploration warehouse and does the exploration against it. Consider what happens when there is no exploration warehouse. The explorer sends very large queries to the enterprise data warehouse. Each of the exploration queries executes and, in doing so, consumes a large number of resources. Performance for everything but the explorer's processing comes to a halt in the enterprise data warehouse. The cessation of predictable service to the enterprise data warehouse is not a good thing as the enterprise data warehouse serves many purposes. If the explorer executes only a few transactions operating only against a reasonably small amount of data, then the resource load against the enterprise data warehouse may not be too bad. But in the face of explorers who want to run an unlimited amount of processes against an unlimited amount of data in an unpredictable manner, then the enterprise data warehouse does not serve as a viable foundation for corporate exploration. It is into these circumstances that the exploration warehouse plays exceedingly well.
Housing the Exploration Warehouse
The exploration and the prototype warehouses can be housed in standard DBMS technology, but a much better alternative is for the exploration and the prototype warehouses to be housed in token database technology.
Token database technology differs radically from standard database technology. In standard database technology, when a record is added to the system, a physical representation of the data is appended onto disk.
Each time a transaction is completed, a new record is added to the standard database. The scaling of data is said to be linear because the volume of data is a function of the number of records.
Once the records have been reduced to tokens, the volume required for storing the records shrinks dramatically. Furthermore, the larger the amount of data (i.e., the greater the number of records), the greater the difference in storage requirements for a standard, linearly scalable database and a token database. Stated differently, the more records there are, the greater the advantage of storage of a token- based DBMS.
In a token database technology, several very beneficial things happen:
Once token-based databases are placed in memory, the speed of processing is greatly accelerated. Depending on the particulars, the acceleration of a query may be two or three orders (or even more) of magnitude. Once the speed of processing is greatly accelerated, many possibilities become manifest. For example, very high speed in memory processing allows the analyst to employ techniques for heuristic analysis that otherwise would not be feasible. The analyst can submit queries that scan entire databases with impunity.
Another major benefit of the gross shrinkage of data is the possibility of indexing all attributes. Once all attributes are indexed, heuristic analysis is unlimited. The analyst can look at any field in any manner desired. The speed of the query is such that if the analyst wants to refine the results, a new query can be reformulated and rerun. All the reformulation and recalculation can be done in a fraction of the time that would have been required if the database had been built in a standard record-based DBMS.
Is the idea and power of an in-memory database new? No, in-memory databases have been around since the early spreadsheets. All spreadsheets--Lotus, Excel--are forms of in-memory databases. But the underlying technology surrounding the spreadsheets supports only a limited amount of data.
And is the idea of tokens in a DBMS a new idea? In fact, Model 204 used tokens as a standard data type many years ago. It is only now that Sand Technology's Nucleus has appeared that the ideas of tokens and in- memory databases have been combined to provide one of the most powerful technological advances in databases since the inception of computer technology. Now an in-memory database can give you the end-user satisfaction of a spreadsheet coupled with the volume of data found in a data warehouse. With the token technology provided by Sand, it is possible to reduce the cost of computation by one to two orders of magnitude while at the same time improving the speed of computation by two to three orders of magnitude.
In truth, the efficacy of the exploration and the prototype warehouses depends on the existence and economies of token-based database technology. Using standard DBMS technology, it is questionable whether exploration warehouse structures can ever become a standard part of the environment.
The database structure commonly deployed in an exploration warehouse is the normalized structure. The normalized structure is optimal because the exploration warehouse is servicing people who do not know what they want. If there happens to be a pattern of data access that is used by everyone, then the structure of the exploration warehouse may start to resemble a star join. But creating star joins in an exploration warehouse presupposes that everyone will want to look at the data in the same way. This supposition goes against the grain of the explorer who, by definition, does not know what will be examined or how it will be examined.
The explorer can take an entire enterprise data warehouse and condense it into a token-based exploration warehouse. Or the explorer can choose a subset of the enterprise data warehouse and use that as a basis for condensation. For example, the enterprise data warehouse may have five years of history in the database, and the explorer may choose three years for the exploration warehouse. Because of token-based DBMS technology, there are very few restrictions as to the data that can be selected for the population of the exploration warehouse.
In addition, more than one explorer can build an exploration warehouse. A financial explorer can build one exploration warehouse and a marketing explorer can build another exploration warehouse. As long as the different exploration warehouses have the same foundation enterprise data warehouse, there is always a single point of reconciliation.
One of the questions that can be asked is whether OLAP/multidimensional (i.e., "cube") technology can be used as a basis for doing exploration warehouse processing, rather than token DBMS technology. The answer is that superficially cube technology can be used. But there are some severe restrictions that greatly limit cube as a basis for exploration warehouse processing. The essence of cube technology is that all possible outcomes are calculated in a cube before the user starts to do processing. By precalculating the answers that might be asked, cube technology can provide very quick response time. The problem is that there are only so many possible outcomes that can be imagined before any processing starts to occur. The explorer will always have more ways to look at the data than can be predetermined by the design of any cube. And when the cube has to go back and try to calculate what the explorer wanted but did not exist in the cube, performance suffers.
Another implication of the approach of precalculation is that only so much detailed data can be managed. When there are many dimensions to the cube, there is a limitation on what detailed data can go into the dimensions. While there are limitations on token-based technology, it is fair to say that since token-based technology does not try to precalculate every possible permutation, that far, far more detailed data can go into a token-based exploration warehouse than a cube-based exploration warehouse.
One of the benefits of an exploration warehouse is that the data is stable once loaded. No updates are done to the exploration warehouse until the exploration warehouse is completely recast. This provides a very solid foundation for heuristic analysis. When an analyst runs an analysis, then reformulates the analysis and runs it again, the analyst knows that any difference achieved is due to reformulation, not to changing data. In other words, if data were to be updated between one iteration of analysis and another and the results of analysis were compared, the analyst could never know whether the difference in results was due to a change in data or a change in analysis formulation.
The usage of token-based DBMS technology for exploration and prototype warehouse processing is especially appropriate in the face of the need for periodic reformulation of the entire database. When data can be condensed into memory-sized databases, it is possible to construct one version of a database, examine it, decide that it is not quite properly structured and then reconstruct the database--all within a half hour's time. The load time alone precludes this sort of reformulation of a standard record-based implementation of an exploration warehouse.
Exploration Warehouse and Meta Data
Meta data plays an important role in all parts of the DSS environment, and the exploration and prototype warehouse environments are no exception. Because explorers and designers are looking at the exploration and prototype warehouses in many ways--some of which have never been examined before--meta data plays an especially important role. In order for there to be effective meta data at the exploration and prototype warehouse, there needs to be an effective meta data layer at the enterprise data warehouse. That layer needs to be able to be transported to the exploration and prototype warehouse environment every time there is a reconstruction of the exploration or prototype warehouse.
For more information on related topics visit the following related portals...
DW Engines, Exploration Warehouse.
Bill Inmon is universally recognized as the father of the data warehouse. He has more than 35 years of database technology management experience and data warehouse design expertise. His books have been translated into nine languages. He is known globally for his seminars on developing data warehouses and has been a keynote speaker for many major computing associations. For more information, visit www.inmongif.com and www.inmoncif.com. Inmon may be reached at (303) 681-6772.