
Enterprise Content Management:
Information Extraction in Enterprise Content Management

Column published in DM Review Magazine, May 2002 Issue
By Dan Sullivan

Enterprise content management (ECM) is a broad domain that covers document management, information retrieval and portals. While these are the most widely recognized elements of ECM, a fourth thread, information extraction, is beginning to emerge.

Information extraction is the process of identifying essential pieces of information within a text, mapping them to standard forms and extracting them for use in later processing. At this point, information extraction tools work best at finding the names of persons, places and things; dates and times; and monetary amounts within single documents. These elements, collectively known as named entities, are mapped to a standard form so that their relative frequency in the document can be determined. For example, a news article that mentions "George Bush," "George W. Bush" and "Bush" once each would yield a single named entity, "George W. Bush," occurring three times. The relative frequency of these entities is then used to determine the most important named entities in the document. Because the basic operation of information extraction is looking for patterns, the same techniques can be used in a number of applications.
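The mapping-and-counting step can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the alias table, the function name count_entities and the sample article are all assumptions made up for the example, and a real tool would build its alias table from much larger resources.

```python
import re
from collections import Counter

# Hypothetical alias table: surface form -> standard form.
ALIASES = {
    "George W. Bush": "George W. Bush",
    "George Bush": "George W. Bush",
    "Bush": "George W. Bush",
}

def count_entities(text, aliases=ALIASES):
    """Count standard-form entities found in text."""
    # Try longer aliases first so "George W. Bush" is not matched as just "Bush".
    surface_forms = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in surface_forms) + r")\b"
    )
    return Counter(aliases[m.group(0)] for m in pattern.finditer(text))

# Invented sample text for illustration only.
article = (
    "George W. Bush spoke on Tuesday. George Bush said the plan was ready. "
    "Later, Bush took questions."
)
counts = count_entities(article)
print(counts.most_common(1))  # [('George W. Bush', 3)]
```

Ranking the counter with most_common gives the "most important" entities in the sense used above, namely those mentioned most often once all variants are folded together.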

There are four main reasons to perform information extraction: to improve information retrieval, extract structured data elements from unstructured text, reformat content and mine text.

Enterprise-scale search engines allow users to specify criteria based upon a fixed set of parameters (date of creation, author and category). Some of these parameters, such as creation dates, are easily determined during indexing; and some, such as category, can be determined by a statistical analysis of text. While categorizers can identify merger and acquisition (M&A) news stories with reasonable accuracy, they cannot pinpoint M&A stories about deals worth more than $50 million. This is where categorization plus information extraction is needed to reach the next level of precision in searching.
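As a rough sketch of how the two steps combine, the following Python assumes each story already carries a category label from a categorizer, then uses a deliberately narrow regular expression to extract monetary amounts and filter on them. The story records, the field names and the function names are hypothetical, and real text would need a far more tolerant amount pattern.

```python
import re

# Matches amounts such as "$75 million" or "$1.2 billion" (narrow on purpose).
AMOUNT = re.compile(r"\$(\d+(?:\.\d+)?)\s*(million|billion)", re.IGNORECASE)
SCALE = {"million": 1_000_000, "billion": 1_000_000_000}

def extract_amounts(text):
    """Return every monetary amount in text as a number of dollars."""
    return [float(num) * SCALE[unit.lower()] for num, unit in AMOUNT.findall(text)]

def large_deals(stories, category="M&A", threshold=50_000_000):
    """Ids of stories in the category that mention an amount above threshold."""
    return [
        story["id"]
        for story in stories
        if story["category"] == category
        and any(amount > threshold for amount in extract_amounts(story["text"]))
    ]

# Invented stories; the category field stands in for a categorizer's output.
stories = [
    {"id": 1, "category": "M&A", "text": "Acme agreed to buy Widgets Inc. for $75 million."},
    {"id": 2, "category": "M&A", "text": "Beta Corp. acquired a rival for $12 million."},
    {"id": 3, "category": "Earnings", "text": "Gamma reported revenue of $1.2 billion."},
]
print(large_deals(stories))  # [1]
```

Neither step is enough alone: the category filter keeps story 2, and the amount filter keeps story 3; only the combination isolates the large M&A deal.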

When dealing with unstructured texts of limited scope, such as customer e-mails, resumes or financial reports, information extraction techniques can identify and tag typical pieces of information. For example, customer e-mails often contain information about a product, price, delivery date and billing; resumes have contact information and educational history; financial reports contain company names, dates and boilerplate text that identifies sections of structured reports such as SEC 10-K reports. Information extraction techniques can identify, to varying degrees of accuracy, these recurring types of information and map them to a relational format suitable for use with ad hoc query tools.
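The mapping to a relational format might look like the sketch below, which uses Python's standard sqlite3 module as a stand-in for whatever database sits behind the query tool. The patterns, the table name email_facts and the sample e-mail are assumptions for illustration; they fit only one narrow style of message, which is why accuracy varies in practice.

```python
import re
import sqlite3

# Hypothetical patterns for one narrow style of customer e-mail.
PATTERNS = {
    "product": re.compile(r"order for (?:the |an? )?(.+?) at \$"),
    "price": re.compile(r"\$(\d+(?:\.\d{2})?)"),
    "delivery_date": re.compile(r"delivery by (\d{4}-\d{2}-\d{2})"),
}

def extract_fields(email_text):
    """Return one row of tagged fields; a field is None when nothing matches."""
    row = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(email_text)
        row[field] = match.group(1) if match else None
    if row["price"] is not None:
        row["price"] = float(row["price"])
    return row

# Invented sample message.
email = (
    "I placed an order for the Model X printer at $249.99. "
    "Please confirm delivery by 2002-06-15."
)
row = extract_fields(email)

# Load the extracted row into a table that ad hoc SQL can query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email_facts (product TEXT, price REAL, delivery_date TEXT)")
conn.execute("INSERT INTO email_facts VALUES (:product, :price, :delivery_date)", row)
result = conn.execute(
    "SELECT product, delivery_date FROM email_facts WHERE price > 100"
).fetchall()
print(result)  # [('Model X printer', '2002-06-15')]
```

Fields that cannot be found are stored as NULL rather than guessed, so downstream queries can distinguish missing information from extracted values.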

Another application is reformatting content. HTML content can be mapped to XML schemas. This would be especially useful for static HTML which contains both data and formatting information.
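A minimal sketch of that reformatting, using only Python's standard library: the parser keeps the text of each table cell and discards formatting tags, and the result is written out under element names supplied by the caller. The class and function names, the element names and the sample HTML are assumptions, and the code expects every td to sit inside a tr.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class CellCollector(HTMLParser):
    """Collects the text of each <td> cell, row by row, ignoring formatting tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data

def html_table_to_xml(html, fields, record_tag="product"):
    """Map each table row to a record element with one child per field name."""
    parser = CellCollector()
    parser.feed(html)
    root = ET.Element(record_tag + "s")
    for cells in parser.rows:
        record = ET.SubElement(root, record_tag)
        for name, value in zip(fields, cells):
            ET.SubElement(record, name).text = value.strip()
    return ET.tostring(root, encoding="unicode")

# Invented static HTML that mixes data with formatting.
html = (
    '<table><tr><td><b>Widget</b></td>'
    '<td><font color="red">19.95</font></td></tr></table>'
)
xml_text = html_table_to_xml(html, ["name", "price"])
print(xml_text)
```

The bold and font tags disappear in the output, leaving only the data under the name and price elements.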

Text mining, the process of detecting patterns within and across text documents, depends upon information extraction techniques. By identifying key entities in a text, one can find correlations between terms and identify unsuspected links between related topics. For example, the connection between migraines and magnesium deficiencies was discovered by applying text-mining techniques to abstracts of online medical articles. This type of text mining is especially relevant to research-intensive industries such as pharmaceuticals and genomics. In many organizations, data- and text-mining techniques can be used in combination to analyze databases that use both coded data and free-form text. From CRM to electronic patient records, notes fields are used to save relevant but unusual or unanticipated information that does not neatly fit into coded data elements. This is exactly the type of information we want to find; and without text-mining techniques, we'll miss it.
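One simple way to look for such unsuspected links is to count which entities co-occur in the same document, then look for terms that never appear together but share intermediate terms. The sketch below assumes entities have already been extracted per document. The function names are hypothetical, the toy data uses placeholder terms rather than real medical findings, and this is not a reconstruction of the method used in the migraine study.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs_entities):
    """Count how many documents mention each unordered pair of entities."""
    pairs = Counter()
    for entities in docs_entities:
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs

def indirect_links(pairs, start):
    """Terms never seen with start, keyed to the intermediate terms they share."""
    neighbors = {}
    for a, b in pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    direct = neighbors.get(start, set())
    candidates = {}
    for middle in direct:
        for far in neighbors[middle] - direct - {start}:
            candidates.setdefault(far, set()).add(middle)
    return candidates

# Placeholder entity lists, one per abstract; not real findings.
docs = [
    ["migraine", "intermediate_1"],
    ["migraine", "intermediate_2"],
    ["magnesium", "intermediate_1"],
    ["magnesium", "intermediate_2"],
    ["migraine", "other_term"],
]
links = indirect_links(cooccurrence(docs), "migraine")
print(links)
```

Here "magnesium" never shares a document with "migraine," yet it surfaces as a candidate because both co-occur with the same two intermediate terms. Candidates of this kind are leads for a researcher to evaluate, not conclusions.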

Here are a few things to keep in mind when considering information extraction tools. First, statistical techniques are not sufficient for high-quality information extraction. These tools need gazetteers and other databases of information about names of persons and places. Specialized dictionaries, sometimes called authority files, might be needed to support industry-specific terminology. Some tools are incorporating WordNet, a publicly available lexical database of English from Princeton University, to improve semantic analysis. Choose a tool that is flexible enough to adapt to the demands of your domain.
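To make the role of these resources concrete, here is a small sketch of dictionary-driven tagging in which a general gazetteer and an industry-specific authority file are simply merged at lookup time. The entries, the type labels and the tag_entities function are assumptions for illustration; production gazetteers are far larger and are usually combined with statistical models.

```python
import re

# Hypothetical general-purpose gazetteer of names.
GAZETTEER = {
    "Princeton University": "ORGANIZATION",
    "IBM": "ORGANIZATION",
    "Boston": "PLACE",
}
# Hypothetical industry-specific authority file.
AUTHORITY_FILE = {"10-K": "FILING_TYPE"}

def tag_entities(text, *lexicons):
    """Return (term, type) for each lexicon term found in text, in text order."""
    merged = {}
    for lexicon in lexicons:
        merged.update(lexicon)
    # Longest terms first, and no match in the middle of a longer word.
    terms = sorted(merged, key=len, reverse=True)
    pattern = re.compile(
        "|".join(r"(?<!\w)" + re.escape(term) + r"(?!\w)" for term in terms)
    )
    return [(m.group(0), merged[m.group(0)]) for m in pattern.finditer(text)]

tags = tag_entities("IBM filed its 10-K from Boston.", GAZETTEER, AUTHORITY_FILE)
print(tags)
# [('IBM', 'ORGANIZATION'), ('10-K', 'FILING_TYPE'), ('Boston', 'PLACE')]
```

Because the lexicons are passed in rather than built in, adapting the tagger to a new domain means supplying another dictionary, which is the kind of flexibility worth looking for in a tool.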

Second, tools in this category vary widely in functionality. Visual Text, from Text Analysis, is an integrated development environment for developing rule-based information extraction and assumes the user has at least passing familiarity with parsing techniques. Megaputer's TextAnalysis is best for text mining across a wide range of documents. ClearForest's ClearTags is designed for marking up unstructured texts. Whiz Bang Labs develops custom information extraction solutions. Understanding your functional requirements should make the choice of tool clear.

Finally, this is an emerging industry. With the exception of IBM and Insightful, most of the offerings in this area are from vendors specializing in information extraction. This offers the potential for some cutting-edge technology from young, nimble firms. However, these firms have not had time to develop the kind of track record that some customers require.



Dan Sullivan is president of the Ballston Group and author of Proven Portals: Best Practices in Enterprise Portals (Addison Wesley, 2003). Sullivan may be reached at dsullivan@ballstongroup.com.

SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.
SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.