Document Warehousing & Content Management:
Structuring the Unstructured

Column published in DM Review Magazine
September 2001 Issue
By Dan Sullivan

The lines that divide business intelligence (BI) and content management are blurring. BI has traditionally been the province of data warehouses, star schemas and numeric data. Content management coexisted with business intelligence in portal applications designed to provide a single point of access for a wide range of information. The problem was, and continues to be, a lack of integration between the two. The situation is understandable. Business intelligence focused on providing information on the aggregate state of an organization while content management focused on collecting and making accessible unstructured assets. Web-based applications replaced their client/server counterparts and eventually led to portals as the linchpin tying BI and content management applications together along with external sources. Now we have well-developed tools in all three areas. Names such as Business Objects, Documentum and Plumtree are as familiar in many organizations as IBM, Sybase and Oracle. The next step in the evolution of these tools is the closer integration of database, content management, business intelligence and portal technologies, and that will be the focus of this monthly column in DM Review.

The first thing to understand about incorporating unstructured content into existing BI infrastructures is that no single tool or technique will meet every need. Instead, a range of applications is now available, each tackling the problems of unstructured text from a different vantage point. Here are some of the best known, or soon to be best known, vendors and their approaches.

Autonomy sees the challenge of managing unstructured text as a pattern recognition problem. Rather than try to analyze the content of text, Autonomy's tools break text down into small segments that can be compared, counted and manipulated using a combination of Bayesian inference and information theory. Bayesian inference estimates the likelihood of a fact based upon previously seen data. For example, if a user searches for the term "bank" and most instances of "bank" occur in documents about financial institutions, then it is most likely that the user is interested in finance and not river banks. Information theory provides the basis for determining how much information can be conveyed in a message or, in our case, a document. Statistical tools, such as Autonomy's, do not depend upon any language-specific knowledge.
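The "bank" example can be made concrete with a few lines of Python. This is only a minimal sketch of Bayes' rule applied to document counts, not a description of how Autonomy's products work internally; the toy corpus, the topic labels and the function name `topic_posterior` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: (topic label, text) pairs standing in for
# previously seen documents. Not taken from any real product or data set.
CORPUS = [
    ("finance", "the bank raised interest rates on savings accounts"),
    ("finance", "the central bank issued new lending rules"),
    ("finance", "shares rose after the merger"),
    ("geography", "the river bank eroded after the flood"),
    ("geography", "the delta widens near the coast"),
]


def topic_posterior(term, corpus):
    """Estimate P(topic | term) with Bayes' rule from document counts."""
    topic_totals = Counter(topic for topic, _ in corpus)
    term_hits = Counter(
        topic for topic, text in corpus if term in text.split()
    )
    n_docs = len(corpus)

    # Unnormalised posterior for each topic: P(term | topic) * P(topic).
    scores = {
        topic: (term_hits[topic] / total) * (total / n_docs)
        for topic, total in topic_totals.items()
    }
    evidence = sum(scores.values())
    if evidence == 0:
        # Term never seen: fall back to the prior P(topic).
        return {t: total / n_docs for t, total in topic_totals.items()}
    return {topic: score / evidence for topic, score in scores.items()}


posterior = topic_posterior("bank", CORPUS)
best_topic = max(posterior, key=posterior.get)
```

With this toy corpus, two of the three documents that mention "bank" are about finance, so the estimate favors the financial sense. Nothing in the calculation relies on knowledge of English, which is the sense in which such statistical tools are language independent.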

InXight's LinguistX Platform uses information about the structure and properties of language to analyze text. The LinguistX Platform is used in other InXight products such as Thing Finder and Categorizer as well as in Oracle Text (formerly Oracle interMedia Text). Statistical techniques are still used even in language-based tools, but they are not the sole means of analysis. The benefit is better precision and recall when searching for content, because known rules about language are used in addition to pattern analysis to disambiguate terms and measure similarity. For example, a search for the "Society for Archeology" should not return the "Society for Architecture" as a similar match simply because of shared sequences of letters. The downside is that linguistic analysis is more complex than pattern recognition, so linguistic tools take longer to process text.
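The "Society for Archeology" example can be illustrated with a small Python sketch. It contrasts a purely letter-based similarity score (Jaccard overlap of character trigrams) with a very crude word-level check that compares the final content word of each name. The word-level check is only a stand-in for real linguistic analysis and is not InXight's method; the stopword list and all function names are assumptions made for this example.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in the lowercased text."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Assumed, deliberately tiny stopword list for this illustration.
STOPWORDS = {"for", "of", "the"}


def content_words(text):
    """Lowercased words in order, with stopwords removed."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def head_word(text):
    """Crude stand-in for linguistic analysis: the last content word."""
    words = content_words(text)
    return words[-1] if words else ""


a = "Society for Archeology"
b = "Society for Architecture"

# Pattern view: how many letter sequences do the two names share?
char_sim = jaccard(char_ngrams(a), char_ngrams(b))

# Word-level view: do the names refer to the same field?
heads_match = head_word(a) == head_word(b)
```

Half of the distinct trigrams are shared between the two names, so a letter-pattern measure alone rates them as fairly similar, while the word-level comparison sees two different fields. A production linguistic tool would use far richer information (morphology, part of speech, phrase structure) than this one-word comparison.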

Megaputer's TextAnalyst supports information retrieval like Autonomy or Oracle Text, but its most distinctive feature is its navigation. TextAnalyst allows users to find key terms and their relationships to other terms. For example, while conducting a competitive intelligence analysis of a competitor's patent portfolio, a user could quickly see the relationship between key technologies by examining the co-occurrence of representative terms.
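Term co-occurrence, as used in the patent example, is simple to sketch in Python. The snippet below counts how often pairs of key terms appear in the same document. It illustrates the general idea only and does not reproduce TextAnalyst's processing; the sample abstracts, the key-term list and the function name `cooccurrence` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical patent abstracts and key terms, invented for illustration.
DOCS = [
    "lithium battery with solid electrolyte and ceramic separator",
    "solid electrolyte for lithium cells",
    "ceramic separator coating process",
]
KEY_TERMS = {"lithium", "electrolyte", "separator", "ceramic"}


def cooccurrence(docs, key_terms):
    """Count, for each pair of key terms, the documents containing both."""
    counts = Counter()
    for doc in docs:
        # Sort so each pair is always stored in the same (alphabetical) order.
        present = sorted(key_terms & set(doc.lower().split()))
        counts.update(combinations(present, 2))
    return counts


pairs = cooccurrence(DOCS, KEY_TERMS)

# Strongest relationships first.
strongest = pairs.most_common(2)
```

Pairs with high counts, such as "electrolyte" with "lithium" in this toy data, suggest technologies that tend to be discussed together. A real analysis would normalize for term frequency and work with phrases rather than single words.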

Autonomy, InXight and Megaputer all use different approaches to analyzing text, but they all work from the same basic principle: there is a discernible pattern in text that corresponds to its information content. By analyzing the text and making those patterns explicit, one can develop more effective information retrieval processes. In the case of Autonomy, the patterns sought are statistical; InXight exploits a range of linguistic patterns; and Megaputer bases its analysis on morphological preprocessing and neural-net processing. Clearly, the structure of text can be found and manipulated in a number of ways. For us, the questions are: Which techniques work best in which situations? What are the performance implications? How well do these techniques operate? Can we combine several techniques to offset the weaknesses of individual methods? Of course, these questions are all derived from the one question that really matters: How do we deliver the information to decision-makers when they need it and in the form they can use?

Business intelligence, content management and even knowledge management are overlapping domains without firm boundaries. Data warehousing has primarily focused on structured data, but that is changing as Richard Hackathorn documented in his article "The State of the BI Marketplace" (DM Review, April 2001). "Document Warehousing and Content Management" will examine one aspect of the changing nature of data warehousing: the inclusion of unstructured text into the business intelligence arena. Next month's column will examine how one data warehousing vendor, Oracle, is addressing the need for content with the new Ultra Search application available in Oracle9i.


Dan Sullivan is president of the Ballston Group and author of Proven Portals: Best Practices in Enterprise Portals (Addison Wesley, 2003). Sullivan may be reached at dsullivan@ballstongroup.com.

SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.