Metadata at Work

Navigating through a used book sale at my local public library, I was struck by the disorganization evident on each shelf. Cookbooks were stacked with children's books, and sci-fi novels were mixed with historical biographies. Once I rifled through the mess for about an hour, I finally came out with a novel I wanted - and a lot of wasted time.

Just as a library does not spend much time categorizing the books at a used book sale, many enterprises treat content management as a second-tier priority. The primary focus of most enterprises is to generate income, not to sift through and tag each employee's electronic documents, even though doing so could save the company millions in lost revenue in the long run.

Consider that, according to a U.S. job retention poll, 75 percent of employees are seeking new jobs [1]. Furthermore, according to the U.S. Bureau of Labor Statistics, the average rate of employee turnover (in all non-farm companies in the U.S.) as a percentage of total employment is 3.3 percent [2]. At that rate, a company of 1,000 employees will see 33 leave by the end of the year. This illustrates a nation of people on the move. Organizations must ask themselves what happens to each departing employee's documents, and to all of the intellectual capital that employee generated for the company over the course of his or her tenure. Too often, companies are left with a disorganized collection of data, similar to a library's stack of used books.

An autocategorization metadata system, the backbone of a successful content management system (CMS), is the solution for better search and retrieval. It not only improves accuracy and efficiency, but also saves time, money and resources. Any enterprise, regardless of the nature of its business, can capitalize on these benefits. The following examples demonstrate the importance of an automatic metatagging system.

The Homeland Security Digital Library

Launched in September 2005, the Homeland Security Digital Library (HSDL) is the primary online research tool for faculty and students of the Center for Homeland Defense and Security (CHDS). It operates out of the Naval Postgraduate School and is sponsored by the Department of Homeland Security's (DHS) Office of Grants and Training (OG&T).

An electronic repository of scholarly works, relevant Web items and Department of Defense-written articles, the library receives a constant influx of new content. For example, when avian flu emerged, an entirely new set of metatags was created, and documents on biological outbreaks were updated with these new tags. Given the time-sensitive nature of homeland security topics, this tagging and retagging must be done quickly and accurately.

With 200 to 300 documents added per day, it is neither practical nor efficient to do this work manually. The HSDL sought an automated process that would let the library's technicians use a workflow tool to edit automatically generated metatags, or to add more rules, categories and taxonomies when new topics emerge. They found it in a semantic, rules-based model. The HSDL now automatically categorizes documents saved in multiple formats for easy search and retrieval. Sample topics include law and justice, borders and immigration, infrastructure protection, terrorism and society, weapons and weapons systems, emergency management and public health. Each day, HSDL content developers add new documents in PDF, video and audio formats to the system, ensuring that these documents are properly categorized and available to approved users.
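To make the idea of editable, rules-based tagging concrete, here is a minimal sketch in Python. It is not the HSDL's actual system: the `Rule` and `RuleBasedTagger` names, the trigger phrases and the sample documents are all hypothetical, and simple substring matching stands in for a real semantic model. It shows only the workflow described above: tag documents automatically, then add a rule when a new topic emerges and retag the collection.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical tagging rule: apply `tag` if any trigger phrase appears."""
    tag: str
    triggers: tuple


@dataclass
class RuleBasedTagger:
    """Hypothetical tagger holding an editable list of rules."""
    rules: list = field(default_factory=list)

    def add_rule(self, tag, *triggers):
        # Store triggers in lowercase so matching is case-insensitive.
        self.rules.append(Rule(tag, tuple(t.lower() for t in triggers)))

    def tag(self, text):
        lowered = text.lower()
        matched = {
            rule.tag
            for rule in self.rules
            if any(trigger in lowered for trigger in rule.triggers)
        }
        return sorted(matched)


tagger = RuleBasedTagger()
tagger.add_rule("public health", "outbreak", "quarantine")
tagger.add_rule("borders and immigration", "border crossing", "visa")

# Made-up sample documents, for illustration only.
documents = {
    "doc-1": "Planning for an avian flu outbreak at regional hospitals.",
    "doc-2": "Visa processing times at the northern border crossing.",
}
tags = {doc_id: tagger.tag(text) for doc_id, text in documents.items()}

# A new topic emerges: a technician adds a rule, then the collection is retagged.
tagger.add_rule("avian flu", "avian flu", "h5n1")
retagged = {doc_id: tagger.tag(text) for doc_id, text in documents.items()}
```

Because the rules live in data rather than in code, a technician can add or edit them without touching the tagging logic. Substring matching is deliberately naive here; a production system would need word-boundary handling and richer linguistic analysis.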

The World Bank

The World Bank is a $20 billion global financial organization that supports 184 countries with financial or technical expertise to help reduce poverty. It is the world's largest funder of education and of the fight against AIDS and worldwide corruption, and it supports the basic needs of people in conflict. Much like the HSDL, the World Bank required a better way to organize and retrieve the millions of documents stored within its global repositories. By automatically applying metatags to documents in several different file formats, the World Bank created a system that also works across multiple languages.

Locating the right information, in the right language, in real time can be an immense challenge. The first step of the process is to identify each document's language. When a document is read by a language ID program, it is automatically assigned a language project from one of many language dictionaries the bank has licensed. These include European, Eastern European, Asian and Arabic languages. The document is then ready for categorization, concept extraction and summarization.
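The language identification step can be illustrated with a small Python sketch. This is not the bank's language ID program: the tiny stopword lists below are stand-ins for licensed language dictionaries, and the `identify_language` function and `STOPWORDS` table are hypothetical names. The sketch scores a text against each list and picks the language with the most matches.

```python
import re

# Tiny illustrative stopword lists (assumption); a real system would rely on
# full language dictionaries rather than a handful of words.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "in", "is"},
    "french": {"le", "la", "et", "les", "des", "est"},
    "spanish": {"el", "la", "y", "los", "de", "es"},
}


def identify_language(text):
    """Return the language whose stopwords occur most often, or 'unknown'."""
    # Match runs of letters only, including accented characters.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        language: sum(word in stopwords for word in words)
        for language, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


english_text = "The bank reduces poverty in the region and is active."
french_text = "Le rapport de la banque est publié et les données des pays sont disponibles."
spanish_text = "El informe del banco es público y los datos de los países están disponibles."
```

Once a language is assigned, the document can be routed to the categorization, concept extraction and summarization steps configured for that language. Texts with no recognizable words come back as `"unknown"` so they can be set aside for review.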

Metadata tags are applied to documents based on pre-defined projects within the Bank's electronic infrastructure. This is often called data driving. The World Bank scours each document for more than 5,000 key words, applying them to 1,000 category classes. One example of a Bank category is "environment." Within that category, key words such as "biodiversity" or "pollution management" help to categorize that document. But not all words fit so easily into a category. "Contagion," a noun that describes the transmission of disease, means something much different in the financial category than it does in the health category. Extracting the meaning requires a few more important steps. The document then runs through a content extractor, which searches for key conceptual IDs, and a content summarizer identifies the most important and relevant sentences within the document.
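The key word matching and the "contagion" problem described above can be sketched in Python. The key word table, the `AMBIGUOUS` table and the `categorize` function are hypothetical and far smaller than the Bank's 5,000 key words and 1,000 category classes. The disambiguation rule used here (assign an ambiguous word to whichever candidate category the rest of the document supports most strongly) is an assumption chosen for illustration, not the Bank's documented method.

```python
import re
from collections import Counter

# Hypothetical key word table mapping unambiguous phrases to categories.
KEYWORDS = {
    "biodiversity": "environment",
    "pollution management": "environment",
    "vaccination": "health",
    "epidemic": "health",
    "exchange rate": "finance",
    "capital flows": "finance",
}

# Hypothetical ambiguous words and the categories they might belong to.
AMBIGUOUS = {"contagion": ("health", "finance")}


def count_phrase(phrase, text):
    """Count whole-word occurrences of a phrase in already-lowercased text."""
    return len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))


def categorize(text):
    lowered = text.lower()
    votes = Counter()
    for phrase, category in KEYWORDS.items():
        votes[category] += count_phrase(phrase, lowered)

    # Snapshot the unambiguous evidence before resolving ambiguous words.
    context = Counter(votes)
    for word, candidates in AMBIGUOUS.items():
        hits = count_phrase(word, lowered)
        if hits:
            best = max(candidates, key=lambda c: context[c])
            # Only count the word if the surrounding text supports a category.
            if context[best] > 0:
                votes[best] += hits

    return [category for category, n in votes.most_common() if n > 0]


finance_doc = "Financial contagion spread as capital flows reversed and the exchange rate fell."
health_doc = "Vaccination slowed the contagion during the epidemic."
environment_doc = "Pollution management protects biodiversity."
```

A document that mentions only "contagion," with no supporting key words, receives no category in this sketch, which leaves it for a human reviewer rather than guessing.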

Prior to automatic metatagging, World Bank personnel categorized three electronic documents per hour. Now the bank drives 50,000 PDF pages per hour through its platform, dramatically improving the processing rate while putting vital information into the hands of those who need it most, in real time.

Metadata at Work

These examples make it easy to see why, in an enterprise with hundreds or even thousands of workers producing knowledge, it may take years to tag each employee's electronic documents by hand. Organizations could struggle to capture each document's actual concept (what its author was thinking when creating it), and a human tagger's own subjectivity may begin to erode the core meaning of what the work contains. In many cases, especially in enterprise knowledge management, the objectivity provided by metatagging software is essential to a project's success.

It is clear that the key to managing a company's content quickly and easily is the ability to generate metadata automatically. Enterprises should treat their content the same way that online publishers do, creating metadata on the fly, the instant that content is added to the CMS. This prevents disorganized collections of documents from piling up and enables businesses to operate more efficiently.

References:

  1. Wall Street Journal's CareerJournal.com and the Society for Human Resource Management, December 2006.
  2. U.S. Bureau of Labor Statistics, February 2006.


Dr. Yves Schabes co-founded multilingual natural language technology company Teragram Corporation (www.teragram.com) with Dr. Emmanuel Roche in 1997. Schabes has spent the past fifteen years working on issues relating to natural language processing and computer science. He is the author or editor of more than 50 international scientific publications. Schabes is also an associate to the Division of Applied Science, Harvard University. Prior to founding Teragram, Schabes was a senior scientist at Mitsubishi Electric Research Laboratories.
