Published in DM Review in July 2001.

Integrating Data and Document Warehouses

by Dan Sullivan

Summary: Adding text to the BI environment is a logical next step in the evolution of data warehousing.

Editor's Note: This article introduces Dan Sullivan, DM Review's newest columnist. His monthly column, Document Warehousing and Content Management, will begin in our September issue.

Data warehousing methodology is as well developed as any area of software engineering; yet, only 32 percent of respondents to a recent online poll (as of April 30, 2001) report that their data warehouse projects are successful or at least adequate. Why aren't we meeting our customers' expectations? There is no single answer to this question. Most likely, we don't provide enough information, we don't provide the right information or users can't get the information. In this article, we will look at one type of the "not enough information" problem and discuss how to incorporate text into our business intelligence (BI) infrastructure.

Numbers Are Not Enough

With 80 percent of business information stored in textual form, there is no avoiding the mountains of documents before us. The documents we need to deal with are not uniform. They range from short pieces such as e-mails, status memos and press releases to longer, more detailed news stories, marketing plans and research results. All of these, and more, help to describe the big picture of an organization and the environment in which it operates. A BI environment needs to make this information accessible to decision-makers just as it has made numeric data available.

Most data warehouses tend to gather data from in-house transactional databases. In larger warehouses and specialized data marts, outside data, such as demographics and psychographics, supplements internal sources. This setup allows us to answer questions about who, what, when and where, but not why. It also allows us to build elaborate models to forecast projected sales, revenues and other critical metrics based upon information garnered from past transactional data. It does not help us understand changes that might influence those same metrics such as market shifts, competitive pressures and changes in technology. This inability to answer why and to adequately deal with the future is in part tied to the lack of textual information.

Consider the following scenario. You are an analyst for a national drugstore chain planning inventory for the near future, and you need to decide how much to invest in a wide range of products. You are currently working on smoking cessation products and want to determine which types of products to carry, including the particular brands and dosages. In a numbers-only data warehouse, you could look at past sales trends, perhaps by geographical region or demographic characteristics of the population near the store, and forecast future sales. For mature products with well-understood sales patterns, this might work; but even those can experience snags. As we often hear, past performance is not a guarantee of future performance. If you had a document warehouse that monitors multiple sources, you could have access to consumer opinions, studies on the effectiveness of various products and doses, clinical practice guidelines and news stories from consumer health magazines. For example, you might find that:

  • Recent studies show 16- and 24-hour patches work equally well.
  • 6- to 8-week programs are as effective as the recommended 10- to 18-week plans.
  • Manufacturers are considering raising the levels of nicotine in their products.

With this information, you might want to lower your estimated sales projection for 24-hour patches as well as the total sales of all products, as shorter programs seem to be working for most smokers. Neither of these decisions would be warranted given past sales figures alone.

Adding relevant documents to the decision processes expands the pool of information available to analysts and serves two critical objectives.

First, decision-makers need access to all kinds of information and should have access to it from a single BI environment. According to Bernard Liautaud, author of e-Business Intelligence, 80 percent of the time dedicated to decision making is actually spent gathering information, leaving only 20 percent for actual analysis. Linking relevant numeric and textual data can improve this ratio.

Second, we want to improve the success of our forecasts. While we cannot make precise statements such as, "Sales will increase 2.8 percent over the next quarter" based upon what we just read, we can develop a more critical eye toward our forecasting models that are driven by historical data and a handful of user-specified parameters. Obviously, these models work well enough in many cases, but we want to reduce the number of times they do not work because of "unanticipated" issues.

The key to meeting the information needs of decision-makers is to integrate the data warehouse with a document warehouse. Just as data warehouses are designed to accommodate numeric data and high-volume operations on that data, so too are document warehouses built to house textual information. A document warehouse is a repository of textual information that is categorized and organized in such a way as to integrate semantically related texts so that they are accessible to end users and provide relatively high levels of information retrieval precision and recall in a decision-support environment. The three key processes associated with the document warehouse are acquiring content, extracting meta data and retrieving documents.

The Core Processes

Document warehouse content is added in several ways. It can be manually added, much as documents are added to document management systems. Most content, however, is likely to be added automatically through the use of search and retrieval programs or subscription services.

Search and retrieval processes operate on both internal and external data sources. Marketing reports, customer e-mails, status memos and competitive intelligence assessments can be found in internal file systems or document management systems. External sources include fee-for-service providers, such as news feeds and market analysis, as well as publicly available sources such as company Web sites, government agencies and industry portals. Documents from these sources are generally gathered using a Web crawler or more elaborate Web harvesting tools.

Once target documents have been gathered (which is similar to the extraction process in data warehousing), they need to be processed by text analysis tools to extract meta data about the contents of each document (similar to the transformation phase of data warehousing). Meta data in a document warehouse shares some similarities with data warehouse meta data, such as data source and format, but it also includes attributes specific to text. The four basic types of meta data are: document content meta data, search and retrieval meta data, text mining meta data and storage meta data.

Document content meta data describes general attributes of a document such as its creator, title, publisher, publication date and language. Search and retrieval meta data tracks the addition of documents to the warehouse, document sources, how to handle multiple versions of files and how to control crawler and harvester programs. Text mining meta data tracks topic categories and summaries, as well as the names of organizations, persons and places extracted from texts. This type of meta data is extracted with specialized tools such as Solutions-United's MetaTagger, IBM's Intelligent Miner for Text and Oracle's Open Text. Finally, storage meta data describes how a document should be handled after it has been retrieved. For example, should the entire document be stored or just a summary or URL along with an index of key terms? Should the text be translated from the original language? When does the document expire? Each type of meta data serves a distinct purpose, but the types all are used collectively to support the primary purpose of the document warehouse: to deliver relevant content to BI users.
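To make the four types of meta data concrete, the following Python sketch models one document's meta data as a set of small records. It is an illustration only: the class and field names are assumptions chosen to mirror the attributes listed above, not the schema of any particular product, and the sample values are invented.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ContentMetadata:
    """General attributes of the document itself."""
    title: str
    creator: str
    publisher: str
    publication_date: date
    language: str


@dataclass
class RetrievalMetadata:
    """Where the document came from and when it entered the warehouse."""
    source: str
    date_added: date
    version: int = 1


@dataclass
class TextMiningMetadata:
    """Features extracted from the text by analysis tools."""
    topics: List[str] = field(default_factory=list)
    summary: str = ""
    organizations: List[str] = field(default_factory=list)
    persons: List[str] = field(default_factory=list)
    places: List[str] = field(default_factory=list)


@dataclass
class StorageMetadata:
    """How the document should be handled after retrieval."""
    store_full_text: bool = False          # False: keep only a summary or URL plus key terms
    translate_to: Optional[str] = None     # target language, if translation is wanted
    expires_on: Optional[date] = None      # None: the document never expires


@dataclass
class DocumentRecord:
    """All four meta data types for one document (hypothetical layout)."""
    content: ContentMetadata
    retrieval: RetrievalMetadata
    text_mining: TextMiningMetadata
    storage: StorageMetadata

    def is_expired(self, today: date) -> bool:
        """True once the storage meta data says the document has expired."""
        expires = self.storage.expires_on
        return expires is not None and today > expires


# Invented sample values, for illustration only.
record = DocumentRecord(
    content=ContentMetadata("Patch study", "Staff writer", "Example Health Magazine",
                            date(2001, 5, 1), "en"),
    retrieval=RetrievalMetadata("health-magazine.example", date(2001, 5, 2)),
    text_mining=TextMiningMetadata(topics=["smoking cessation"]),
    storage=StorageMetadata(expires_on=date(2001, 12, 31)),
)
```

Keeping the four types in separate records reflects the point that each serves a distinct purpose, while the enclosing record lets them be used together.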

The rubber finally meets the road in the last core process: information retrieval (IR). Obtaining relevant information, and only relevant information, is the goal of the IR processes. We have several different types of tools at our disposal to deliver IR services to BI users. The most important are: ad hoc querying, visualization, automatic routing and notification, and user interest modeling.

We are all familiar with ad hoc querying using Web search engines. We are also painfully aware of the two persistent problems with search engines: poor precision (the proportion of returned documents that are actually relevant) and poor recall (the proportion of all relevant documents that are actually returned). Many tools, including Autonomy and Oracle Open Text, support both key word and thematic searching, which can potentially improve precision and recall.
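Precision and recall are simple ratios, and a small worked example may help. The Python sketch below uses the standard information retrieval definitions; the function name and the document identifiers are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: identifiers of the documents the search gave back
    relevant: identifiers of all documents that should have been given back
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents returned, five documents truly relevant.
returned_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d5", "d6", "d7"]
precision, recall = precision_recall(returned_docs, relevant_docs)
print(precision, recall)  # 0.5 0.4
```

Here two of the four returned documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4). The two measures often pull against each other: returning more documents tends to raise recall and lower precision.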

A broad class of visualization tools is now available for information retrieval. InXight's hyperbolic tree provides an easily navigated method for assessing the links between a large number of documents. Miner3D offers a way to view the results of an ad hoc query visually and in relative proportion to relevancy of the document to the query. Other techniques group documents into hierarchical clusters, allowing the user to drill from the general to the specific topics.

Routing and notification applications build upon classification programs and provide a mechanism to automatically notify users that documents of interest have been added to the repository. Most classification tools use either a rules-based or a statistical approach to categorizing documents, and neither technique is foolproof.
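As a rough illustration of the rules-based approach, the Python sketch below assigns categories by key word and then looks up which users subscribed to those categories. The rules, category names and user names are all hypothetical, and real classification tools are considerably more sophisticated.

```python
# Hypothetical key word rules: a document belongs to a category
# if it contains at least one of the category's key words.
RULES = {
    "smoking cessation": {"nicotine", "patch", "smoking"},
    "regulatory": {"fda", "recall"},
}

# Hypothetical user subscriptions by category.
SUBSCRIPTIONS = {
    "analyst_a": {"smoking cessation"},
    "analyst_b": {"regulatory"},
}


def classify(text, rules=RULES):
    """Return the set of categories whose key words appear in the text."""
    words = set(text.lower().split())
    return {category for category, keywords in rules.items() if words & keywords}


def users_to_notify(text, rules=RULES, subscriptions=SUBSCRIPTIONS):
    """Return the users, sorted by name, subscribed to any matching category."""
    categories = classify(text, rules)
    return sorted(user for user, interests in subscriptions.items()
                  if interests & categories)


print(users_to_notify("FDA announces recall of nicotine patch lot"))
# ['analyst_a', 'analyst_b']
print(users_to_notify("Software patch released"))
# ['analyst_a']  (a false positive: "patch" here has nothing to do with nicotine)
```

The second call shows why such rules are not foolproof: a key word match says nothing about the sense in which the word is used.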

Modeling user interests is a key to successful routing and notification. Tools such as Autonomy can build a statistical model of user interest by analyzing users' queries. These models are then applied much like filters to provide yet another mechanism for delivering relevant content to users.
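The idea of a statistical interest model can be sketched in a few lines of Python. The example below builds a simple term-frequency profile from a user's past queries and scores new documents against it. This is a deliberately naive stand-in, not a description of how Autonomy or any other product works; the function names and sample queries are assumptions, and the sketch ignores refinements such as stop words and stemming.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def build_profile(queries):
    """Weight each term by its share of all terms in the user's past queries."""
    counts = Counter(term for query in queries for term in tokenize(query))
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def interest_score(profile, document_text):
    """Sum the profile weights of the terms that occur in the document."""
    terms = set(tokenize(document_text))
    return sum(weight for term, weight in profile.items() if term in terms)


# Hypothetical query history for one analyst.
profile = build_profile([
    "nicotine patch effectiveness",
    "nicotine gum dosage",
    "patch side effects",
])

score_relevant = interest_score(profile, "New study compares nicotine patch durations")
score_unrelated = interest_score(profile, "Quarterly earnings for grocery chains")
print(score_relevant > score_unrelated)  # True
```

Used as a filter, a profile like this would pass along only documents whose score exceeds some threshold; where to set that threshold is a tuning decision that trades precision against recall.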

The core processes of document warehousing (acquiring content, extracting meta data and retrieving documents) provide the basis for integrating text into the BI environment.

(Future DM Review columns will address issues in meta data management, user interest modeling and effective use of visualization.)

Document Acquisition

The document acquisition process consists of three components: one or more processes that execute both internal and external document gathering, a set of locations or sites to search and a set of patterns to search for at those locations or sites. Administering the document-gathering processes is relatively straightforward and independent of the area we are researching. The locations we want to search and the patterns we will use, such as sales and marketing for a drugstore chain, however, are directly related to a subject area. Compiling a list of all the possible topics we would want to know about would be a huge task if we had to start from scratch. Fortunately, there is a better way.

Dimensional data defined in the data warehouse is an ideal source of information to bootstrap the creation of search patterns. Dimensions contain short but descriptive information about items relevant to a subject area. For example, product dimensions contain manufacturer names, product names and product categories. These can be used to generate search patterns. For example, we would obviously want to search consumer health sites and online magazines for "nicotine patch" and "nicotine gum" to populate our drugstore chain document warehouse. More precise searches require that we drill further down in the hierarchy to specific product names, but we quickly arrive at the point of diminishing returns. Unless we are searching sites that deal with particular products (for example, the U.S. Food and Drug Administration's database of recalled drugs), precise product names will not find additional useful information. In fact, overly specific searches will create two problems. First, the search process will end up with extremely long queues of search patterns (e.g., Nicorette Nicotine Gum Mint 4 mg. Refill, Nicorette Nicotine Gum 2 mg. Refill, etc.). Because many patterns are similar, they will repeatedly hit the same documents, reducing the effective throughput of the search process.
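A short Python sketch shows how search patterns might be generated from a product dimension. The column names and the patch product are hypothetical; the two gum products are the ones named above. Choosing the hierarchy level controls how specific the patterns are, and removing duplicates keeps the queue short.

```python
# Rows of a hypothetical product dimension (column names are assumptions).
PRODUCT_DIMENSION = [
    {"category": "Smoking Cessation", "subcategory": "Nicotine Gum",
     "product": "Nicorette Nicotine Gum Mint 4 mg. Refill"},
    {"category": "Smoking Cessation", "subcategory": "Nicotine Gum",
     "product": "Nicorette Nicotine Gum 2 mg. Refill"},
    {"category": "Smoking Cessation", "subcategory": "Nicotine Patch",
     "product": "Example Brand Nicotine Patch 21 mg"},  # invented product name
]


def search_patterns(rows, level):
    """Return the distinct values at one hierarchy level, in first-seen order."""
    seen = set()
    patterns = []
    for row in rows:
        value = row[level].strip().lower()
        if value not in seen:
            seen.add(value)
            patterns.append(value)
    return patterns


print(search_patterns(PRODUCT_DIMENSION, "subcategory"))
# ['nicotine gum', 'nicotine patch']
print(len(search_patterns(PRODUCT_DIMENSION, "product")))
# 3
```

Even in this tiny example, the product level yields one pattern per row while the subcategory level yields two patterns in total. In a real dimension with thousands of products, the gap is what produces the long queues of near-duplicate patterns.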

The second problem is that overly specific searches will lead to documents with narrowly relevant hits. For example, a search for "Nicorette Nicotine Gum 2 mg. Refill" could hit a high number of online catalog pages. These could be useful if you are looking for comparative pricing information, but the meta data extraction process will need to identify these as catalog pages and use specialized routines to extract particular pieces of data. In other cases, we would want to use a filter (created with a statistical categorization tool) to identify these irrelevant hits before they make it to the meta data extraction phase. Even then, statistical filtering is not always effective because these tools have limits to their discriminatory power.

Hierarchically organized structures, such as a product category, are an ideal hook for linking data and document warehouses. Just as we can use fine-grain descriptive data to define search patterns, we can use hierarchical data to manage the sites searched for specific patterns. Consider a simplified hierarchy for the smoking cessation example (see Figure 1).

For our drugstore chain, it is not likely that we would want to search a target site for all the products in our inventory, but we would want to remain informed on Food and Drug Administration publications regarding all over-the-counter medicines. For relatively new products, such as nicotine patches, we might want to monitor consumer opinion sites or newsgroup postings.

To take advantage of the work that has already been done to create dimensions, we need to compile a set of target sites to monitor. Next, we must link these sites to particular levels in the hierarchy so that patterns generated from items lower in the hierarchy are used when searching those sites. In this scenario, if we linked a site to smoking cessation products, the site would be searched for information on nicotine patches and nicotine gum as well as any other products below that point in the hierarchy. More elaborate combinations of dimensional data will, in some cases, further improve performance.
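The linking step can be sketched in Python as a small tree walk. The hierarchy below is an assumed simplification of the smoking cessation example, and the site names are placeholders. The sketch also assumes that a site linked to a node is searched for that node's own name as well as everything beneath it, so that a site linked to a leaf still has something to search for.

```python
# Assumed product hierarchy: each node maps to its children.
HIERARCHY = {
    "over-the-counter medicines": ["smoking cessation products"],
    "smoking cessation products": ["nicotine patches", "nicotine gum"],
    "nicotine patches": [],
    "nicotine gum": [],
}

# Placeholder sites, each linked to one level of the hierarchy.
SITE_LINKS = {
    "regulator.example": "over-the-counter medicines",
    "health-magazine.example": "smoking cessation products",
    "consumer-opinions.example": "nicotine patches",
}


def patterns_for(node, hierarchy):
    """Return the node and all of its descendants, depth first."""
    patterns = [node]
    for child in hierarchy.get(node, []):
        patterns.extend(patterns_for(child, hierarchy))
    return patterns


def search_plan(site_links, hierarchy):
    """Map each site to the patterns it should be searched for."""
    return {site: patterns_for(node, hierarchy) for site, node in site_links.items()}


plan = search_plan(SITE_LINKS, HIERARCHY)
print(plan["health-magazine.example"])
# ['smoking cessation products', 'nicotine patches', 'nicotine gum']
print(plan["consumer-opinions.example"])
# ['nicotine patches']
```

Linking a site high in the hierarchy gives it a broad set of patterns; linking it low keeps the search narrow. Adding a new product under an existing node automatically extends the searches of every site linked at or above that node.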

Text analysis tools use heuristics, or rules of thumb, and do not guarantee accurate results in all cases. Our objective is to improve the precision and recall of the information retrieval process and make additional information available to decision-makers; however, this is not a 100 percent automated process.

Expanding the Scope of BI

Numbers do not tell the whole story and decision-makers spend too much time gathering information. Adding text to the BI environment is a logical next step in the evolution of data warehousing. Document warehousing techniques are designed to integrate with our existing infrastructure. By exploiting the work already completed for the data warehouse, we can bootstrap the next phase in the development of our BI infrastructures.

Dan Sullivan is president of the Ballston Group and author of Proven Portals: Best Practices in Enterprise Portals (Addison Wesley, 2003). Sullivan may be reached at

Copyright 2005, SourceMedia and DM Review.