DM Review | Covering Business Intelligence, Integration & Analytics

Making Unstructured Data Findable Using Tagging and Annotation

  Article published in DM Direct Special Report
May 9, 2006 Issue
  By Ashish Sureka

The amount of data produced and stored within an organization is growing at a very rapid pace, which makes finding relevant and accurate information quickly a nontrivial exercise. Even small and medium-sized enterprises can hold millions of documents or digital assets, archived in a variety of formats across a variety of storage systems. A fast and effective mechanism to search and retrieve the right information is critical to decision-making and to the proper functioning of the business. This need has led to the deployment of many enterprise search engines within organizations.

Despite the presence of advanced search engines that can crawl and index a huge volume of data, people often cannot find the information they are looking for. Even though the document or data is present in a repository and has been indexed by the search engine, it either does not show up in the search results or is ranked very low. This happens because the content is searchable but not findable.

Let me explain the distinction between searchability and findability with an analogy. A library typically contains tens of thousands of books, yet we can quickly find the particular book that meets our needs. This is because the books are organized by subject, shelved methodically and cataloged by title, author and keyword: all computer science books are generally kept together, and all history books are kept together. A book kept on a shelf where it does not belong becomes hard to find. It is still somewhere in the library, which means it is searchable, but it is unfindable because it is not cataloged properly. Likewise, if books are not labeled properly and are shelved at random, a person looking for a book on history will not find the right one quickly, because history books are spread across many shelves among books on other subjects. It is not enough for searchable content to be present somewhere; the content must also be findable.

Figure 1: Result of Text Tagging and Annotation to a Sentence from a Newswire Article

In this article, I present an application of a technique called text tagging and annotation to making unstructured content, such as free-form text, findable. Text tagging and annotation is a popular technique based on natural language processing and machine learning, and it forms an important component of document processing and information extraction systems. It consists of analyzing free-form text and identifying terms (for example, proper nouns and numerical expressions) that correspond to domain-specific entities. The input to a text annotator is free-form text, and the output is a set of named annotations over sections of the text. Text annotation is also referred to as named entity (NE) extraction. In its early days, NE extraction was used to identify common entities such as persons, locations, organizations, dates and monetary amounts in newswire text. NE detection has been a subject of research for more than a decade and has resulted in many open source as well as commercial systems. Current NE detection systems offer good accuracy and are widely used in diverse domains, with applications in text mining, information extraction and natural language processing.
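To make "a set of named annotations over sections of the text" concrete, here is a minimal Python sketch. It is not the method of any particular NE system: the `Annotation` class, the `PATTERNS` table and the two hand-written rules (a date pattern and a tiny organization list) are assumptions made for illustration, where a real annotator would rely on trained models or much richer rules.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    label: str  # entity type, e.g. "DATE" or "ORGANIZATION"
    start: int  # character offset where the span begins
    end: int    # character offset just past the end of the span
    text: str   # the surface text covered by the span


MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Hypothetical hand-written rules, for illustration only.
PATTERNS = {
    "DATE": re.compile(r"\b(?:" + MONTHS + r") \d{1,2}, \d{4}\b"),
    "ORGANIZATION": re.compile(r"\b(?:IBM|Collation)\b"),  # toy gazetteer
}


def annotate(text):
    """Return the annotations found in text, ordered by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append(
                Annotation(label, match.start(), match.end(), match.group())
            )
    return sorted(found, key=lambda a: a.start)
```

Each annotation keeps its character offsets, so the original text stays untouched and several annotators can contribute spans over the same document.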

Consider the company acquisition news illustrated in Figure 1. The text of the news is "On November 16, 2005, IBM announced it had acquired Collation, a privately held company based in Redwood City, California for an undisclosed amount." The entity types present in this text are date, acquiring organization, acquired organization, place and amount. As shown in Figure 1, a text annotator identifies such entities in the text and produces an output in which they are tagged. The output can be either an XML document or a table in a database. Tagging the important named entities makes entity link and relationship analysis easy. The XML tags or the schema of the database table are domain specific and have to be predefined by the user. The text annotator also needs to be customized or programmed to detect the specific entities. Once the output of text tagging and annotation is in a table, it is straightforward to run SQL queries against it.
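The two output forms can be sketched in Python for the example sentence. This is an illustration under stated assumptions, not the annotator shown in Figure 1: the regular expression is hand-tuned to this one sentence shape, and the tag names, the table name `acquisition` and its columns are hypothetical stand-ins for the user-defined, domain-specific schema.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

SENTENCE = ("On November 16, 2005, IBM announced it had acquired Collation, "
            "a privately held company based in Redwood City, California "
            "for an undisclosed amount.")

# Hypothetical pattern that only covers sentences shaped like the example.
ACQUISITION = re.compile(
    r"On (?P<event_date>\w+ \d{1,2}, \d{4}), (?P<acquirer>[A-Z]\w*) "
    r"announced it had acquired (?P<acquired>[A-Z]\w*), .*? based in "
    r"(?P<place>[A-Z][\w ]*, [A-Z]\w*) for (?P<amount>.+?)\.$"
)


def extract(sentence):
    """Return a dict of tagged entities, or None if the rule does not apply."""
    match = ACQUISITION.search(sentence)
    return match.groupdict() if match else None


def to_xml(fields):
    """Serialize the tagged entities as a small XML document."""
    root = ET.Element("acquisition")
    for tag, value in fields.items():
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


def to_table(fields):
    """Store the tagged entities as one row of an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE acquisition "
        "(event_date TEXT, acquirer TEXT, acquired TEXT, place TEXT, amount TEXT)"
    )
    conn.execute(
        "INSERT INTO acquisition VALUES "
        "(:event_date, :acquirer, :acquired, :place, :amount)",
        fields,
    )
    return conn
```

With the row in place, a query such as `SELECT acquired FROM acquisition WHERE acquirer = 'IBM'` can be run against the connection returned by `to_table(extract(SENTENCE))`.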

Figure 2: Making Content Findable with Tagging and Annotation

Figure 2 illustrates the potential of text tagging and annotation in making content findable. The queries listed in Figure 2 cannot be answered without tagging the newswire text and combining it with an external knowledge base. The answers exist as free-form text stored in a content repository, but without text tagging they will not show up in the search results, which is a lost opportunity. This is a typical example of content being searchable but not findable. For example, to answer the query "List companies acquired by IBM in the first quarter," one needs to dig through a large amount of unstructured content in the form of news articles and form relationships between IBM, the companies it acquired and the dates of the acquisitions. The news text reporting an acquisition by IBM may not directly mention Q1 or Quarter One as the time of the event. The knowledge that the months of January through March make up the first quarter of a financial year comes from an external knowledge base or taxonomy.
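The first-quarter query can be sketched in Python as a combination of tagged records with a small piece of external knowledge. Everything here is a simplified assumption: `MONTH_TO_QUARTER` stands in for the knowledge base and uses the January-to-March definition of the first quarter given above, and apart from the Collation record taken from Figure 1, the records are made-up placeholders rather than real acquisitions.

```python
from datetime import datetime

# Stand-in for the external knowledge base: month number -> quarter,
# assuming the first quarter runs from January through March.
MONTH_TO_QUARTER = {month: (month - 1) // 3 + 1 for month in range(1, 13)}

# Output of tagging. Only the first record comes from the article's example;
# the other two are hypothetical placeholders.
TAGGED = [
    {"acquirer": "IBM", "acquired": "Collation", "date": "November 16, 2005"},
    {"acquirer": "IBM", "acquired": "ExampleSoft", "date": "February 3, 2006"},
    {"acquirer": "OtherCorp", "acquired": "SampleWorks", "date": "March 20, 2006"},
]


def quarter_of(date_text):
    """Map a tagged date such as 'November 16, 2005' to its quarter number."""
    parsed = datetime.strptime(date_text, "%B %d, %Y")
    return MONTH_TO_QUARTER[parsed.month]


def acquired_in_quarter(records, acquirer, quarter):
    """List companies the acquirer bought in the given quarter."""
    return [
        record["acquired"]
        for record in records
        if record["acquirer"] == acquirer
        and quarter_of(record["date"]) == quarter
    ]
```

Neither source can answer the query alone: the tagged text supplies who acquired whom and when, and the knowledge base supplies what "first quarter" means.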

Another example is a query in which a user wishes to know the number of companies acquired in the U.S. with a deal size greater than a certain amount. Very often the news article does not mention the country name, because it is obvious from the state or city name. If a text mentions California or New York, it is common knowledge that the reference is to a state in the U.S. A software program, however, is not aware of the relationship between California and the U.S. unless that knowledge is fed into it. In this example query, the list of states within the U.S. comes from an external knowledge base, which can be an ontology, and the link between those states and the events taking place there comes from the newswire text. The point is that the answer shows up in the search results only because of the additional process of text tagging and annotation.
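This query can be sketched as a SQL join between the tagged output and a knowledge-base table, shown here in Python with SQLite. The table names, column names and all rows are assumptions made for illustration: `region` is a stand-in for the ontology, and the acquisition rows are invented placeholders, not real deals.

```python
import sqlite3


def build_db():
    """Create an in-memory database with tagged events and a toy knowledge base."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE acquisition (acquired TEXT, state TEXT, deal_usd_millions REAL);
        CREATE TABLE region (state TEXT PRIMARY KEY, country TEXT);
    """)
    # Knowledge base: which state or province belongs to which country.
    conn.executemany(
        "INSERT INTO region VALUES (?, ?)",
        [("California", "U.S."), ("New York", "U.S."), ("Ontario", "Canada")],
    )
    # Tagged newswire output; hypothetical rows. None marks an undisclosed amount.
    conn.executemany(
        "INSERT INTO acquisition VALUES (?, ?, ?)",
        [
            ("AlphaCo", "California", 120.0),
            ("BetaCo", "New York", 40.0),
            ("GammaCo", "Ontario", 300.0),
            ("DeltaCo", "California", None),
        ],
    )
    return conn


def count_deals(conn, country, min_usd_millions):
    """Count acquisitions in a country with a deal size above the threshold."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM acquisition AS a
        JOIN region AS r ON r.state = a.state
        WHERE r.country = ? AND a.deal_usd_millions > ?
        """,
        (country, min_usd_millions),
    ).fetchone()
    return row[0]
```

The news text never has to mention the country: the join supplies it from the state name. Rows with an undisclosed amount are stored as NULL and therefore fall out of the size comparison.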

A domain ontology containing entities and their relationships is prepared by a domain expert and is then combined with the output produced by processing the unstructured content. It is clear to the end user of such an application that the information displayed is retrieved from multiple sources.

Figure 3: A High-Level Architectural Diagram Illustrating Text Tagging and Annotation



Ashish Sureka is a researcher and practitioner in the area of data mining. He works for Software Engineering and Technology Labs (SETLabs) of Infosys Technologies Limited, India. He holds a Ph.D. degree in computer science and can be reached at Ashish_Sureka@infosys.com.

SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.