Enterprise Content Management:
When the Perfect Search Tool Is Not Enough
The stereotypical enterprise content management (ECM) application combines elements of document management, Web content management, search and taxonomies, but that is about to change. These techniques are sufficient if you are interested in information retrieval, that is, simply identifying and presenting a set of articles, documents and Web pages about a particular topic to a user. In many cases, however, solving the information retrieval problem still leaves the user with an unmanageable amount of data. Many of us in ECM expend a great deal of effort developing techniques and domain-specific heuristics to improve the effectiveness of our information retrieval applications. However, these efforts will never address one fundamental and growing problem: even if we correctly retrieve only relevant content, there is still too much information for users to analyze. The next step in the evolution of ECM is the adoption of information extraction techniques, which provide users with distilled information, not just documents.
Consider the problems in medical research and bioinformatics. Technical advances in experimental instruments in these fields have created vast amounts of new information that is published in scientific journals. Much of the information is available online from sources such as Medline, a database of scientific abstracts. With sophisticated search techniques, users can find the abstracts relevant to their work; but they are still left with the task of culling through those documents to find particular pieces of information, such as protein X activates Y and molecule A binds to B at location C. Information extraction techniques that identify patterns such as these allow us to create structured representations of the relationships between objects, such as proteins and genes. Once we have structured representations, we can apply many of the same analytic techniques that have been used in decision support and business intelligence, such as visualization and link analysis.
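The idea of turning statements such as "protein X activates Y" into structured records can be sketched in a few lines of code. The patterns, sample text and protein names below are hypothetical illustrations, not the method of any particular extraction product; real systems use far more sophisticated linguistic analysis than simple pattern matching.

```python
import re

# A minimal sketch of pattern-based relation extraction: scan abstract
# text for statements like "X activates Y" or "X binds to Y" and emit
# (subject, relation, object) triples as a structured representation.
PATTERNS = [
    re.compile(r"(?P<subj>\w+) (?P<rel>activates|inhibits) (?P<obj>\w+)"),
    re.compile(r"(?P<subj>\w+) (?P<rel>binds) to (?P<obj>\w+)"),
]

def extract_facts(text):
    """Return a list of (subject, relation, object) triples found in text."""
    facts = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            facts.append((match["subj"], match["rel"], match["obj"]))
    return facts

# Hypothetical abstract text for illustration only.
abstract = "Protein RAS activates RAF. Molecule GTP binds to RAS."
print(extract_facts(abstract))
# [('RAS', 'activates', 'RAF'), ('GTP', 'binds', 'RAS')]
```

Once facts are in this triple form, they can be loaded into a database and analyzed with the same tools used for any other structured data.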
The most recent KDD Cup data mining competition sponsored by the Association for Computing Machinery (ACM) posed a problem involving mining facts from biological research, underscoring the need to address both structured and unstructured data sources with analytic techniques. The fact that two commercial ventures, ClearForest Corporation, the winner of the competition, and Verity, an honorable mention, finished in the top ranks along with academic researchers demonstrates the commercial availability of state-of-the-art information extraction.
With a database of facts extracted from text, the tasks performed by researchers and other knowledge workers change. Users no longer search for documents; they search for connections between facts. With a fact database, one can search for a series of links between two entities: for example, A causes B, which inhibits C, which lowers levels of D. The relationship between dietary magnesium deficiencies and migraines was discovered using just such a method with information extraction techniques and medical research abstracts. Of course, this approach is not limited to scientific work. Law enforcement, government agencies and financial services have all used information extraction to manage and analyze large volumes of unstructured data.
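Searching a fact database for a chain of links between two entities amounts to a path search over a graph of triples. The sketch below illustrates the idea with a breadth-first search; the facts themselves are invented placeholders loosely inspired by the magnesium-migraine example, not actual findings from the medical literature.

```python
from collections import deque

# Hypothetical (subject, relation, object) triples standing in for a
# fact database built by information extraction.
facts = [
    ("stress", "causes", "vasoconstriction"),
    ("magnesium", "inhibits", "vasoconstriction"),
    ("vasoconstriction", "triggers", "migraine"),
    ("magnesium_deficiency", "lowers", "magnesium"),
]

def find_chain(start, goal):
    """Breadth-first search: return the shortest chain of facts linking
    start to goal, or None if no chain exists."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        entity, path = queue.popleft()
        if entity == goal:
            return path
        for subj, rel, obj in facts:
            if subj == entity and obj not in seen:
                seen.add(obj)
                queue.append((obj, path + [(subj, rel, obj)]))
    return None

for subj, rel, obj in find_chain("magnesium_deficiency", "migraine"):
    print(subj, rel, obj)
# magnesium_deficiency lowers magnesium
# magnesium inhibits vasoconstriction
# vasoconstriction triggers migraine
```

The same traversal, applied to millions of extracted facts, is what lets an analyst surface indirect connections that no single document states outright.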
Will information extraction provide an adequate return on investment for your organization? To answer that, consider three factors.
First, information extraction techniques work best with a reasonably homogeneous set of documents, such as scientific abstracts, patent applications, news stories and SEC filings. The specific topics can vary widely even within these limited groups of content, but the types of information extracted are well focused. From scientific abstracts, we can extract relationships between chemical compounds or anatomical relationships. From patent applications, we can find researchers and related patents. From news stories we can find companies, information about earnings reports and new business relationships.
Second, the payoff with information extraction comes when the volume of information is too great to manage manually and the cost of missing information is high. Missing a change in a competitor's sales promotion is not nearly as important as missing an experimental result that could save the cost of early drug trials.
Third, information extraction techniques are not perfect. Facts will be missed, particularly when the text is as linguistically complex as that in many scientific papers. Erroneous facts may be extracted as well. Getting high-quality results from information extraction programs often requires human review and editing.
As we develop better tools and techniques for retrieving information, we realize that even if we had a perfect search tool, we would still have too much information to process. The next step in the evolution of enterprise content management is underway in a few very narrow domains; however, as the tools mature and the need grows, expect to see wider adoption of information extraction.
Dan Sullivan is president of the Ballston Group and author of Proven Portals: Best Practices in Enterprise Portals (Addison Wesley, 2003). Sullivan may be reached at email@example.com.