DM Review | Covering Business Intelligence, Integration & Analytics
Information Management:
Looking Ahead: Unstructured Data

  Column published in DM Review Magazine
December 2005 Issue
 
  By Bill Inmon

Data warehousing has come a long way, baby. Not so long ago, database theoreticians derided the data warehouse as setting the industry back 25 years. Today, data warehousing is conventional wisdom and a standard part of the corporate information infrastructure.

The past is only a prelude to the future. Looking into the crystal ball, one sees many things - very, very large data warehouses, exploration processing, enterprise resource planning (ERP) vendor support and analytical applications. Perhaps the most intriguing and most promising advances in data warehousing are the possibilities of bridging unstructured data with structured data.

There are two basic forms of unstructured systems - external and internal unstructured systems. External unstructured systems are those that embrace the data found outside the corporation. The Internet is easily the most vibrant example of external unstructured systems. There is also an internal world of unstructured data within the organization's walls, and it holds a wealth of information.

For years, corporations have had two types of systems - formal systems and informal systems. The formal systems have been dominated by databases and transaction processors. Indeed, the worlds of banking, finance and manufacturing make their day-to-day decisions based on transactional processing systems. There is another very important part of the information infrastructure that is not formal - the unstructured informal systems of the corporation. When people think of the unstructured informal systems, their first thought is usually of e-mail. Indeed, e-mail makes up a tremendous part of the informal systems environment, but there is much more to the informal decision-making environment. There are many, many different kinds of unstructured information including spreadsheets, reports and documents.

Internal unstructured data comes in two basic flavors - documents and records. Unstructured documents hold voluminous amounts of text and are notorious for having no form. One unstructured document can differ greatly from another unstructured document. There is no uniformity whatsoever to the unstructured documents.

Unstructured records are a different story. Even though there is no rigid format among unstructured records, there is a marked similarity between the records. Typical unstructured records are contracts, insurance policies, warranties, medical records, financial records and so forth. In addition, e-mails can be considered a form of unstructured records. With unstructured records, there is no fixed or even well-defined format, but the same kinds of data repeat from one record to the next.

Another major difference between unstructured records and unstructured documents is that unstructured documents do not normally have what can be called a "key" or "primary identifier" value. Trying to match the content of unstructured documents to similar or related data in the structured environment is strictly a hit-and-miss affair. However, trying to match content between the unstructured record environment and the structured environment is a fairly straightforward process, given the repeating nature of data found in the unstructured record environment.
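As a rough illustration of why records are easier to bridge than documents, the sketch below pulls a key out of free text with a regular expression and looks it up in a structured table. The policy-number format, the sample texts and the names `POLICY_KEY`, `structured_policies` and `match_record` are all hypothetical, invented for this example; it is a sketch under those assumptions, not a description of any product.

```python
import re

# Hypothetical structured table: policy number -> customer name.
structured_policies = {
    "POL-100234": "A. Jones",
    "POL-100377": "B. Smith",
}

# Assumed key format for this example: "POL-" followed by six digits.
POLICY_KEY = re.compile(r"\bPOL-\d{6}\b")


def match_record(text):
    """Return (key, customer) pairs for every known policy key found in text."""
    matches = []
    for key in POLICY_KEY.findall(text):
        customer = structured_policies.get(key)
        if customer is not None:
            matches.append((key, customer))
    return matches


# An unstructured record: loosely formatted, but it carries a key.
record = "Renewal notice for policy POL-100234, effective next quarter."
# An unstructured document: free text with no key to match on.
document = "Thoughts on where the insurance market is heading this year."

print(match_record(record))    # the key ties the text to a structured row
print(match_record(document))  # nothing to match on
```

The record matches because it carries a recognizable key; the document offers nothing comparable, which is the hit-and-miss situation described above.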

Trying to bridge the gap between the unstructured environment and the structured data warehouse environment is reminiscent of the early days of extract, transform and load (ETL), when people were not sure they needed a data warehouse and were even less sure that they needed an ETL tool. What exists to bridge the gap between unstructured data and structured data is crude and unfocused. The best that can be said is that some products have some capabilities, and those capabilities appear to be an afterthought. The focus of vendor products for the unstructured environment has been on external unstructured data, not internal unstructured data.

There are some basic problems facing the organization that wishes to create a bridge between the two worlds:

  • Access to data. The technology used to support and manipulate unstructured data is quite different from the technology used to support and manipulate the structured world. For the most part, the unstructured vendors have been content to remain in the unstructured world and the structured vendors in the structured world.
  • Cross-pollination of environment content. Unstructured data simply does not have the discipline and integrity surrounding it that structured data has. When a value is found in the unstructured world, it is questionable whether the matching value in the structured world actually refers to the same thing. When "bill inmon" is found in the structured world, is it the same as "bill inmon" in the unstructured world? Consider an e-mail that says, "It is high time that we bill inmon floral services."
  • Synchronization. How do you keep track of changes in one environment and keep them synchronized with changes in the other environment?
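
The "bill inmon" problem in the second point can be made concrete with a small sketch. A plain substring search treats both e-mails below as mentions of the customer, while a deliberately crude heuristic rejects the occurrence where the preceding word suggests "bill" is a verb. The word list `VERB_CUES`, the function names and the second sample e-mail are assumptions made for illustration; a real matching process would need far more than this.

```python
import re

# Customer name as it appears in the structured world (from the example above).
CUSTOMER = "bill inmon"

# Hypothetical heuristic: if one of these words comes right before "bill",
# then "bill" is probably a verb ("we bill ..."), not a first name.
VERB_CUES = {"we", "will", "should", "must", "they", "i"}


def naive_match(text):
    """Substring search: true whenever the characters appear in the text."""
    return CUSTOMER in text.lower()


def cautious_match(text):
    """Reject occurrences where the preceding word suggests 'bill' is a verb."""
    words = re.findall(r"[a-z']+", text.lower())
    first, last = CUSTOMER.split()
    for i in range(len(words) - 1):
        if words[i] == first and words[i + 1] == last:
            if i == 0 or words[i - 1] not in VERB_CUES:
                return True
    return False


invoice_mail = "It is high time that we bill inmon floral services."
customer_mail = "Bill Inmon called about his account."

for mail in (invoice_mail, customer_mail):
    print(naive_match(mail), cautious_match(mail), "-", mail)
```

The naive search accepts both e-mails; the cautious version accepts only the second. The heuristic is fragile (a phrase such as "remember to bill inmon" would slip past it), which is itself a small demonstration of how little discipline the unstructured world offers.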

What are the implications of opening the world of data warehousing to unstructured data? Quite frankly, a whole new world opens up. The world of data warehouses and data marts has been almost exclusively a world of numbers - roll-ups, summaries and drill-downs. From an analytical standpoint, 99 percent of the analysis is numerically based. The advent of unstructured data into the world of the data warehouse means that there are entirely new and unexplored possibilities.

In today's world, there is much talk about the 360-degree view of the customer. The 360-degree view is a wonderful concept, except where are the communications that have transpired between the customer and the corporation? How good is it for the corporation to know wonderful demographics about a customer when the customer has written an acerbic e-mail the previous week?

The truth is, there is an almost limitless number of ways that unstructured data enhances a data warehouse. It provides a dimension that is not possible through the standard quantitative analytical tools that are available today.


Bill Inmon is universally recognized as the father of the data warehouse. He has more than 35 years of database technology management experience and data warehouse design expertise. His books have been translated into nine languages. He is known globally for his seminars on developing data warehouses and has been a keynote speaker for many major computing associations. For more information, visit www.inmongif.com and www.inmoncif.com. Inmon may be reached at (303) 681-6772.

SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.
SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.