DM Review Published in BI Report in December 2004.

The Integration of Unstructured Data into a Business Intelligence System

by William N. McCrosky

Summary: This article discusses the challenges of merging business intelligence and content management technologies and provides an example where the two can be merged to solve an organizational problem.

Business intelligence systems have traditionally been built with "structured" data - data that has a known format (integer, character, scientific and the like) and a known position within a source (electronic) record. While building a business intelligence system is still a daunting exercise, finding the data within a source record has never been the primary challenge.

Unstructured data - data from such sources as forms, e-mail or documents - contains a great deal of information that can be usefully employed in a business intelligence system. Some of this information is vitally important. The Enron and related scandals have made the location and retrieval of unstructured data - in the form of e-mails and corporate documents - a business process that can determine the very survival of a corporation.

The Sarbanes-Oxley Act [1] has greatly spurred the extension of structured database management systems into the realm of content management - the addition of unstructured data objects into the safe harbor and support of an enterprise database. Sarbanes-Oxley requires corporations to be able to retrieve any data (especially documents and e-mail) that pertains to the reliability of a corporation's financial statements. Database vendors are scrambling to deploy enterprise-class content management solutions that ensure compliance with this politically sensitive and highly visible legislation.

To date, tools used to build business intelligence systems have been developed independently of those designed to solve content management problems. However, the two are beginning to merge. This article discusses the challenges to merging the technologies and provides an example where the two can be merged to solve an organizational problem.

The Nature of the Problem

Finding data in an unstructured document is a challenge. Several examples illustrate this challenge:

  1. The source document is paper, not electronic. Insurance, medical and human resource forms are often paper-based. Note that the data of interest, in this case, is reasonably structured. Its position on the source document can be spatially located.
  2. The source document is structured - as in example 1 - but it is already in an electronic form. A Web technology such as XML may be used.
  3. The source document is electronic, but the data of interest is not structured. E-mail and word processing documents fall into this category.
  4. The source document is paper, and the data is unstructured. There is no electronic representation of the document - perhaps a historical document prepared before the advent of word processing.
  5. The source is a "blob," not a document - such as pictures, voice or video.

It is useful to classify unstructured data as one of two types: "structured/unstructured" (types 1 and 2 above) and "unstructured/unstructured" (types 3, 4 and 5). In type 1, the location of the data of interest to our business intelligence system is known - if we can only bridge the gap from paper form to electronic form. Type 2 is the easiest, from a data extraction point of view. The data is really no different from a traditional, record-oriented source of a business intelligence system. It is electronic and it is form-based (structured).

In types 3 and 4, we have a greater challenge. Even if the source document is already electronic (type 3), we cannot spatially locate the data of interest in the document - that is, the data we want is not always on a certain line and at a certain position within that line. Type 4 has all the problems of type 3, with the additional complication of being on paper. Types 3 and 4 are text-based, implying that some form of text processing or text mining might help us find the data of interest.

Type 5 seems the most difficult of all, since the source is impervious to common data- or text-processing techniques. This type of data is undoubtedly of great interest to intelligence agencies and is probably the focus of intense research efforts.
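The five types and two categories described above can be restated as a small decision rule. The following is only a sketch: the `Source` attributes and function names are assumptions introduced for illustration, not part of any product discussed in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical description of a data source."""
    is_document: bool       # False for "blobs" such as pictures, voice or video
    is_electronic: bool     # False when the source exists only on paper
    has_known_layout: bool  # True when fields sit at known positions (a form)


def source_type(src: Source) -> int:
    """Return the article's type number (1-5) for a source."""
    if not src.is_document:
        return 5
    if src.has_known_layout:
        return 2 if src.is_electronic else 1
    return 3 if src.is_electronic else 4


def category(src: Source) -> str:
    """Map a source to one of the two categories named in the article."""
    if source_type(src) in (1, 2):
        return "structured/unstructured"
    return "unstructured/unstructured"


paper_form = Source(is_document=True, is_electronic=False, has_known_layout=True)
email = Source(is_document=True, is_electronic=True, has_known_layout=False)
```

Here `paper_form` falls into type 1 and `email` into type 3, which matches the examples given in the list.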

Until recently, content management and business intelligence capabilities have been developing in parallel. There has been relatively little effort - or motivation - to integrate these two areas before now.

An Example of Integrating Structured and Unstructured Data

Why is there interest now? I do not attempt to list all of the possible motivations, but can share an example I encountered on a recent project.

My client was a very large enterprise, employing many hundreds of thousands of people. These employees can participate in more than one retirement program. Retirement services - such as benefits calculation and coverage determination (which retirement plan an employee is eligible for) - are still based on manual analysis of paper documents. This paper is becoming a storage problem (expense and space) as well as a drag on efficient delivery of retirement services. The client is actively investigating processes and technologies for capturing this paper in electronic form and converting an image of a data element - for example, an employee's SSN or years of service - into a digital format amenable to traditional data processing.

This enterprise is moving rapidly toward "e-services," both internally and externally. Of particular interest is the fact that the enterprise is rapidly consolidating its human resource and payroll systems into a few, highly electronic processes. These systems contain retirement-related data - in particular, they contain employee contributions to various retirement programs. To build a more electronic system for providing retirement benefits, the enterprise is faced with the challenge of integrating data on paper forms with data in electronic systems. [2]

How Can This Problem Be Solved?

The electronic sources of data for this enterprise database present a well-understood exercise of integrating structured data from multiple sources into an integrated database. Of course, saying that the problem is well understood does not mean that it is "easy." There are a myriad of technical and business challenges in the extract, transform and load (ETL) process that have to be addressed to ensure the resulting database meets enterprise objectives.

The harder problem is capturing data from the forms - there are millions of them - and converting the image of data into digital data. However, once this problem is solved, the problem becomes "easy" again - the paper-based data can now be integrated into the enterprise database using a variety of ETL techniques.

The solution to this problem requires the involvement of a third information technology discipline, which has also been developing in parallel to the other two (ETL and content management) - scanning technologies.

Paper management has long been a problem for large enterprises. Insurance companies are probably at the forefront of this process, due to the paper- and document-intensive nature of many of their business processes. Companies such as Xerox and Kodak have long served this market. These vendors have developed scanning technologies that can quickly turn paper into an image and - equally important from the business intelligence perspective - produce meta data that enables one to find a document easily. Examples of commonly used meta data include SSN, name and form ID.

These "image vendors" also provide data conversion technologies - for example, a data entry operator can use a light pen and inform the computer of the coordinates of the SSN field on the form. The conversion technology can read the field and convert it to data.
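The idea of reading a field from known coordinates can be sketched in a few lines. This is a minimal illustration under loud assumptions: rows of plain text stand in for the scanned image, the zone coordinates are (row, start column, end column) rather than pixel positions, and `FORM_ZONES` and `extract_fields` are hypothetical names, not a vendor's API.

```python
# Hypothetical zone template: field name -> (row, start column, end column).
# A real system would store image coordinates and run character recognition
# on the cropped region; plain text rows stand in for the image here.
FORM_ZONES = {
    "ssn": (0, 5, 16),
    "name": (1, 6, 26),
    "years_of_service": (2, 18, 20),
}


def extract_fields(page_rows, zones):
    """Read each field from its known position on the page."""
    fields = {}
    for name, (row, start, end) in zones.items():
        fields[name] = page_rows[row][start:end].strip()
    return fields


# Made-up sample page; the SSN shown is deliberately not a valid number.
page = [
    "SSN: 000-00-0000",
    "Name: Jane Q. Example",
    "Years of service: 12",
]

fields = extract_fields(page, FORM_ZONES)
```

The point of the sketch is that extraction depends entirely on the zone table: if a field moves on the form, the table has to change with it.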

Much of this data conversion process is manual, however. Depending on the nature and business complexity of the conversion, it may be cheaper or more reliable to employ two technicians to transcribe the same field and then compare the results.
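The double-transcription check mentioned above might look like the following sketch. The function name and the dictionary-per-form representation are assumptions made for illustration.

```python
def compare_transcriptions(first, second):
    """Split double-keyed fields into agreed values and fields needing review.

    Each argument maps field names to the value one technician keyed in.
    """
    accepted, disputed = {}, {}
    for field in sorted(first.keys() | second.keys()):
        a, b = first.get(field), second.get(field)
        if a is not None and a == b:
            accepted[field] = a
        else:
            # Disagreement, or a field only one technician keyed in.
            disputed[field] = (a, b)
    return accepted, disputed


tech_a = {"ssn": "000-00-0000", "years_of_service": "12"}
tech_b = {"ssn": "000-00-0000", "years_of_service": "21"}

accepted, disputed = compare_transcriptions(tech_a, tech_b)
```

Only the disputed fields need a third look, which is where the cost or reliability advantage of this approach would come from.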

There are usually two outputs from this conversion process - the scanned image and the converted data. Frequently, the scanned image needs to be retained - often for evidentiary purposes in case the form becomes involved in litigation. The content management components of the database are responsible for storing this image, indexing it with the appropriate meta data, and - perhaps most importantly of all - retaining a paper trail. Sarbanes-Oxley and DOD 5015.2 impose strict requirements on this document management process. Database vendors are continuously upgrading content management functionality - particularly the records management components - to meet these requirements.

The second output - the converted data - is typically a flat file. For seasoned designers of a business intelligence system, this flat file becomes an input for an ETL process.
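As a rough illustration of that hand-off, the sketch below extracts rows from a flat file, cleans them and loads them into a table. The pipe-delimited layout, the column names and the `paper_service` table are all assumptions; the article does not describe the actual file format or target schema.

```python
import csv
import io
import sqlite3

# Hypothetical flat file produced by the conversion step (pipe-delimited).
# The SSNs are deliberately invalid sample values.
FLAT_FILE = """form_id|ssn|years_of_service
F-001|000-00-0001| 12
F-002|000-00-0002|7
"""


def load_flat_file(text, conn):
    """Extract rows from the flat file, clean them, and load them into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS paper_service "
        "(form_id TEXT PRIMARY KEY, ssn TEXT, years_of_service INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    # Transform: trim stray whitespace and convert the numeric column.
    rows = [
        (r["form_id"].strip(), r["ssn"].strip(), int(r["years_of_service"]))
        for r in reader
    ]
    conn.executemany("INSERT INTO paper_service VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_flat_file(FLAT_FILE, conn)
```

An in-memory SQLite database is used only to keep the example self-contained; an enterprise ETL tool and database would take its place in practice.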

Conceptually, at least, we are done. We have a solution to the types 1 and 2 problems. We have used three previously unrelated technologies - structured data management, unstructured content management and image scanning/conversion technologies - to bring structured and unstructured data together in a single enterprise database.

Practically speaking, however, there are a number of issues that need to be addressed. One in particular, in the case of this client, is the issue of data primacy. Retirement data obtained from a paper document may conflict with that obtained from an electronic source.
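One way to make a primacy rule explicit is a precedence table, as in the sketch below. Which source should win is a business decision that this article leaves open, so the priorities shown, along with the source names and the `resolve` function, are purely hypothetical.

```python
# Hypothetical precedence: the lower number wins. Which source should be
# authoritative is a business decision, not something the code can settle.
SOURCE_PRIORITY = {"payroll_system": 0, "paper_form": 1}


def resolve(candidates, priority=SOURCE_PRIORITY):
    """Pick one value per field and report fields whose sources disagree.

    candidates maps a field name to {source name: value}.
    """
    chosen, conflicts = {}, []
    for field, by_source in candidates.items():
        winner = min(by_source, key=lambda source: priority[source])
        chosen[field] = by_source[winner]
        if len(set(by_source.values())) > 1:
            conflicts.append(field)
    return chosen, conflicts


record = {
    "years_of_service": {"payroll_system": 12, "paper_form": 13},
    "plan": {"payroll_system": "A", "paper_form": "A"},
}

chosen, conflicts = resolve(record)
```

Returning the list of conflicting fields alongside the chosen values keeps the disagreement visible for later review instead of silently discarding it.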

A second significant problem is that the structure of a form changes over time. For example, a form to capture the particulars of a new employee hire will change, as the enterprise needs to know something new or different about new employees. Affirmative action data is but one of many examples. The addition of new data means that the position of the new data has to be captured - and possibly causes a position shift of other data on the form. Data positionality is almost transparent in relational data sources, but a significant and ongoing concern in document capture and conversion.
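A common way to cope with changing layouts is to key the field positions by form revision. The sketch below assumes positions expressed as (row, start column, end column) and uses a made-up form ID, revisions and field names; none of these come from the project described here.

```python
# Hypothetical registry: (form ID, revision) -> field zones as (row, start, end).
# Revision "B" adds a field on row 1, which pushes years_of_service down a row.
TEMPLATES = {
    ("HR-100", "A"): {
        "ssn": (0, 5, 16),
        "years_of_service": (1, 18, 20),
    },
    ("HR-100", "B"): {
        "ssn": (0, 5, 16),
        "hire_category": (1, 15, 25),
        "years_of_service": (2, 18, 20),
    },
}


def zones_for(form_id, revision):
    """Return the field positions for one revision of a form."""
    try:
        return TEMPLATES[(form_id, revision)]
    except KeyError:
        raise LookupError(
            f"no template for form {form_id!r} revision {revision!r}"
        ) from None


old_row = zones_for("HR-100", "A")["years_of_service"][0]
new_row = zones_for("HR-100", "B")["years_of_service"][0]
```

Failing loudly on an unknown revision is a deliberate choice in this sketch: reading a new form with an old template would produce plausible-looking but wrong data.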

Types 3 and 4 problems may be solved by integrating text mining technologies into an architecture similar to the one described in this article. Type 5 problems probably await the publication of current research before a solution can be envisioned.


  1. Similar pressures are creating interest in the public sector, especially in the U.S. Department of Defense (DOD). DOD Standard 5015.2 attempts to standardize public sector policies and procedures with respect to document storage and retrieval.
  2. Technically, this system is probably more accurately classified as an operational data store (ODS), rather than a business intelligence system. The Federal government does not, at present, envision building analytics on this ODS.

The author would like to thank Christopher Weller of IBM's Software Group for his invaluable assistance on the project described in this article.

William N. McCrosky is a business intelligence consultant in IBM's Business Consulting Services. McCrosky has been a project manager and a technical architect on several significant IBM business intelligence engagements. He recently was the project manager on an engagement to build an integrated operational data store (ODS) from structured and unstructured data sources. He has a Master of Computer Science degree from the University of Virginia and 25 years of data management experience.

Copyright 2005, SourceMedia and DM Review.