DM Review | Covering Business Intelligence, Integration & Analytics

Information Management:
Unstructured ETL

  Column published in DMReview.com
September 1, 2005
  By Bill Inmon

For years, two environments have grown up side by side - the unstructured environment and the structured environment. The unstructured environment is filled with informal systems built on e-mail, spreadsheets, texts and reports. The structured environment is formal and is filled with transactions, databases and operating systems. Important business of the corporation occurs in both places. However, these worlds might as well be as far apart as Beijing, China, and Rio de Janeiro, Brazil.

In years past, the practitioners of the structured world learned about the evils of stovepipe systems. With stovepipe systems, there was no integration of data across the corporation. There was no foundation of reusable data. There was no historical data, to any great extent. In short, stovepipe systems caused more long-term grief to the IT department than the Y2K problem ever did.

Stovepipe systems originated because of the inability of application developers and systems personnel to look beyond their immediate surroundings and see the larger picture. The result was a long-term, architectural nightmare from which some organizations, including the IT departments of the government, are still wondering how to extract themselves.

Does any of this sound familiar? Are the structured people and the unstructured people merely building today's silos of unintegrated information? How often does the user of unstructured systems stop to wonder how e-mail messages will integrate with structured systems? The answer is either almost never or never. When we step back and look at the larger picture, it is clear that we are busy building stovepipe systems once again in the worlds of structured systems and unstructured systems. Didn't we learn anything the first time around?

So you ask: How do I integrate these two very different environments? There is so much that is different about them - is it even possible to achieve integration from one environment to the other? The answer is yes. There are challenges; however, integration between the two worlds is absolutely a possibility.

In order to achieve integration between the two worlds, it is necessary to contemplate "unstructured ETL" (extract, transform and load) processing. ETL processing has been around for a long time. However, in the early renditions of ETL, the transformation was always from legacy structured applications into a decision support system (DSS) data warehouse environment, which is also structured.

In order to integrate the structured environment and the unstructured environment, it is necessary to create a completely different form of ETL - unstructured ETL.

In order to build an unstructured ETL environment, it is necessary to accomplish three tasks:

  • The access and selection of unstructured data,
  • The editing and manipulation of unstructured data, and
  • The integration of unstructured data into the structured environment.

Access of unstructured data: The first challenge is accessing unstructured data in its native format. This means being able to read unstructured files such as e-mail, .txt, .doc and PDF files, among many others. However, reading the files is only the first step. The next step is selecting the important text from the unstructured data. One of the features of the unstructured environment is that it contains a lot of blather that is not germane to business. Part of the access process is separating the blather from the business.
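The access-and-selection step can be illustrated with a minimal sketch, assuming the unstructured source is a plain e-mail message and that "business" text can be approximated by a keyword list. The function names, the BUSINESS_TERMS vocabulary and the sample message are hypothetical and are not part of any particular ETL tool; real selection logic would be considerably more sophisticated.

```python
from email import message_from_string

# Hypothetical vocabulary; a real system would derive it from the business.
BUSINESS_TERMS = {"invoice", "contract", "order", "shipment", "payment"}


def plain_text_body(msg):
    """Return the concatenated text/plain parts of an e-mail message."""
    parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                charset = part.get_content_charset() or "utf-8"
                parts.append(payload.decode(charset, errors="replace"))
    return "\n".join(parts)


def select_business_lines(text, terms=BUSINESS_TERMS):
    """Keep only the lines that mention at least one business term."""
    kept = []
    for line in text.splitlines():
        words = {w.strip(".,;:!?()\"'").lower() for w in line.split()}
        if words & terms:
            kept.append(line.strip())
    return kept


# Made-up sample message used only for illustration.
RAW = """\
From: alice@example.com
To: bob@example.com
Subject: Friday

Hope you had a good weekend!
The invoice for order 1187 is attached.
See you at lunch.
"""

msg = message_from_string(RAW)
selected = select_business_lines(plain_text_body(msg))
print(selected)
```

A keyword filter is the crudest possible way of separating blather from business; it is used here only because it makes the two sub-steps (read the native format, then select) easy to see.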

Editing/manipulation of data: After the unstructured data has been read, the next step is to edit and manipulate that data. Some forms of simple editing include "stop word" analysis, where common words (such as a, an, the, of, which, when, and that) are removed. After the stop words are removed, the remaining words are edited, reducing them to their stems. In doing so, it can be recognized that the words move, moving, moved and moves are all branches of the same stem. A much more meaningful understanding of words results from working with word stems. Relationships between words can then be established, and so forth. In fact, depending on what the ETL tool is meant to do, many forms of editing can occur.
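As a rough illustration of stop word removal followed by stemming, the sketch below uses the stop words listed above and a deliberately naive suffix-stripping rule. The suffix list and function names are assumptions made for this example; it is not the Porter stemmer or any other established algorithm, and it will mis-stem many English words.

```python
import re

# Stop words taken from the examples in the text.
STOP_WORDS = {"a", "an", "the", "of", "which", "when", "that"}

# Naive suffix list, checked in order; an assumption for illustration only.
SUFFIXES = ("ing", "ed", "es", "e", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def edit_text(text):
    """Lowercase, tokenize, drop stop words, then reduce words to stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


stems = [naive_stem(w) for w in ["move", "moving", "moved", "moves"]]
print(stems)
print(edit_text("The moving of a box"))
```

With this rule, move, moving, moved and moves all reduce to the same stem, which is the property the editing step relies on.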

Integration into the structured environment: After access and editing occur, the next step is integration into the structured environment. If it is desired to have a firm integration into the structured environment, there must be an assurance that one piece of data is the same from one environment to the next. For example, suppose there is a "Dan Meers" in the structured environment and a "Dan Meers" in the unstructured environment. Certainly the names match, but that does not necessarily mean that both references are to the same person. In order to integrate the information, it is necessary to go to a much deeper level of matching.
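One way to picture a "deeper level of matching" is to require corroborating attributes in addition to the name. The sketch below is only an assumption about how such a check might look: the PersonRef record, its e-mail and phone fields, and the sample values are hypothetical, and production matching would typically weigh many more signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonRef:
    """A reference to a person found in either environment (hypothetical)."""
    name: str
    email: str = ""
    phone: str = ""


def normalize_phone(phone):
    """Keep digits only so formatting differences do not block a match."""
    return "".join(ch for ch in phone if ch.isdigit())


def same_person(a, b):
    """Match only when the names agree AND one other attribute agrees."""
    if a.name.strip().lower() != b.name.strip().lower():
        return False
    if a.email and b.email and a.email.lower() == b.email.lower():
        return True
    phone_a, phone_b = normalize_phone(a.phone), normalize_phone(b.phone)
    return bool(phone_a) and phone_a == phone_b


# Made-up records for illustration.
structured = PersonRef("Dan Meers", email="dan.meers@example.com")
from_email = PersonRef("Dan Meers", email="DAN.MEERS@example.com")
name_only = PersonRef("Dan Meers")

print(same_person(structured, from_email))
print(same_person(structured, name_only))
```

The design choice here is conservative: a name match alone is never treated as proof of identity, so a reference that carries only a name stays unlinked until more evidence is available.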

These are just a few thoughts about what is meant by unstructured ETL. Of course, these thoughts require an extension from theory into reality; however, given the tremendous push to integrate, it is predictable that unstructured ETL is right around the corner.



Bill Inmon is universally recognized as the father of the data warehouse. He has more than 35 years of database technology management experience and data warehouse design expertise. His books have been translated into nine languages. He is known globally for his seminars on developing data warehouses and has been a keynote speaker for many major computing associations. For more information, visit www.inmongif.com and www.inmoncif.com. Inmon may be reached at (303) 681-6772.

SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.