DM Review Published in DM Review Online in September 2005.

Volume Analytics: IBM's UIMA - and Why You Should Care

by Guy Creese

On August 8th, IBM took the covers off UIMA (Unstructured Information Management Architecture) and pledged to make the framework available to the open source community. For those unfamiliar with the latest IBM acronym, UIMA makes it easier for a variety of text and multimedia management applications (e.g., knowledge management, search and text mining) to work together by defining interoperability standards. At first blush, UIMA may seem only distantly related to business intelligence. True, its fundamental focus - unstructured data, rather than structured data - is a different universe. But UIMA will be a significant generator of data for BI to analyze in the coming years. As such, it's worthwhile to have a fundamental understanding of UIMA and what the technology can do.

What is UIMA?

First of all, UIMA is a specification, not a product per se, although IBM has already used it within WebSphere Information Integrator OmniFind Edition. (Given that IBM developed the spec, it's not surprising that they beat everyone else to implementing it.) The UIMA Java-based SDK is available from IBM's alphaWorks Web site.

UIMA is one of those many R&D projects that continually hum along within IBM Research. IBM, allied with DARPA, has been working on UIMA since 2001. Other companies, such as BBN Technologies, MITRE Corporation and SAIC, have been involved as well, as have universities such as Carnegie Mellon, Columbia and the University of Massachusetts, Amherst. In short, this project has benefited from some high-powered thinking over the course of four years.

Why IBM Developed UIMA

While harnessing collective brainpower is a nice thing to do, multibillion-dollar firms such as IBM invest in such work only if they expect to get some sort of payback. In IBM's case, the company expects to reap benefits in terms of technology depth, lower professional services costs and enhanced partner channels.

For the past several years, IBM has been chanting the mantra of enterprise content management (ECM), or how corporations can better manage the unstructured data - memos, reports, e-mails, videos - that they generate. UIMA, by leveraging third-party solutions that characterize and search unstructured content, helps IBM deliver more robust ECM solutions sooner. It also helps IBM's Global Services division, by making it easier and less time-consuming for its consultants to integrate third-party content analysis solutions during a consulting project. Finally, IBM is betting that UIMA will make IBM an attractive partner for small search and categorization companies, as being UIMA-compliant suddenly includes them in IBM's global marketing and sales universe.

Stringing Together Content Analysis Engines

So much for the history lesson. UIMA's power comes from its ability to string together, concatenate, pipe - pick your term - a set of content analysis processes. By using a UIMA-defined Common Analysis Structure (CAS) to both read content and write findings, different analysis engines can generate their own characterizations of unstructured data, whether that data is a document, image or video.

This is important because there is no one characterization technology - whether it be machine learning, statistical or rule-based natural language processing (NLP) or ontologies, to name a few - that works well in all situations. Instead, different search and categorization technologies are typically good at different tasks - one might be perfect at extracting entities, such as personal names or physical locations, while another is better at summarizing text. This technological specialization has hamstrung search and categorization for years.

UIMA finally lets these disparate technologies work together. For example, by using UIMA, three very different analysis engines can now all analyze the same content their own way and pass their combined findings to a common database. (For a quick, visual overview of how all these components work together, see the latter slides of my At-a-Glance report on UIMA.)
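To make the idea concrete, here is a minimal sketch of three engines sharing one analysis structure. It is written in plain Python for brevity and does not use the UIMA SDK; the names SharedAnalysis, Annotation, run_pipeline and the three toy engines are all hypothetical, invented only to illustrate the pattern of reading content from and writing findings to a common structure.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    engine: str  # which engine produced the finding
    kind: str    # e.g. "company", "location", "summary"
    value: str


@dataclass
class SharedAnalysis:
    """Toy stand-in for a common analysis structure: content plus findings."""
    text: str
    annotations: list = field(default_factory=list)


def company_engine(cas):
    # Rule-based engine: looks for names from a small, hypothetical list.
    for name in ("IBM", "Cognos", "SAS"):
        if re.search(rf"\b{name}\b", cas.text):
            cas.annotations.append(Annotation("company_engine", "company", name))


def location_engine(cas):
    # A second engine with its own (equally naive) technique.
    for place in ("Boston", "Amherst"):
        if place in cas.text:
            cas.annotations.append(Annotation("location_engine", "location", place))


def summary_engine(cas):
    # Naive summarizer: keeps only the first sentence.
    first_sentence = cas.text.split(".")[0].strip() + "."
    cas.annotations.append(Annotation("summary_engine", "summary", first_sentence))


def run_pipeline(text, engines):
    # Every engine reads the same content and appends to the same findings list.
    cas = SharedAnalysis(text)
    for engine in engines:
        engine(cas)
    return cas


cas = run_pipeline(
    "IBM met with Cognos in Boston. SAS did not attend.",
    [company_engine, location_engine, summary_engine],
)
for annotation in cas.annotations:
    print(annotation.engine, annotation.kind, annotation.value)
```

The point of the design is that no engine needs to know about the others: each one only agrees on the shape of the shared structure, which is what lets very different technologies be chained together.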

Generating More and Previously Unavailable Data

This is where UIMA becomes interesting to the business intelligence world. As vendors (and enterprises) adopt UIMA, they will generate valuable, structured data that data warehouses, OLAP cubes and other analytical repositories haven't captured in the past.

One example is a list of the companies referenced within the corporation's e-mails over the last month - a report of interest, perhaps, to the company's compliance officer or the VP of sales striving to understand how the prospect base is evolving.
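As a rough illustration of how such a report could end up as structured data, the sketch below counts company mentions across a handful of e-mail bodies and loads the totals into a table that a BI tool could query. Everything here is an assumption made for the example: the watch list, the sample e-mails, the function names and the company_mentions table are hypothetical, and a real deployment would rely on a proper entity-extraction engine rather than a fixed list of names.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical watch list; a real system would use an entity extractor.
KNOWN_COMPANIES = ("Attensity", "Endeca", "Inxight")


def count_company_mentions(emails):
    # Tally whole-word occurrences of each known company name.
    counts = Counter()
    for body in emails:
        for name in KNOWN_COMPANIES:
            counts[name] += len(re.findall(rf"\b{re.escape(name)}\b", body))
    return counts


def load_into_table(counts):
    # Store only companies that were actually mentioned.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE company_mentions (company TEXT, mentions INTEGER)")
    conn.executemany(
        "INSERT INTO company_mentions VALUES (?, ?)",
        [(company, n) for company, n in counts.items() if n > 0],
    )
    return conn


emails = [
    "Endeca called about a renewal. Endeca also asked for a demo.",
    "Follow up with Inxight next week.",
]
conn = load_into_table(count_company_mentions(emails))
rows = conn.execute(
    "SELECT company, mentions FROM company_mentions "
    "ORDER BY mentions DESC, company"
).fetchall()
print(rows)
```

Once the findings sit in an ordinary table like this, they can be joined to sales or compliance data in the same way as any other structured source.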

Another example is a summary of support problems, based on the call center's telephone support logs. Such text mining-generated summaries, listing topics by themes and keywords, are typically much more nuanced than the predefined check-off categories used by companies today. By drilling down into this new data, both development managers and support managers would gain greater insight into which bugs to fix and what support training is needed.

As search and categorization technology evolves and UIMA connects the pieces together, the amount of generated data will only grow larger.

What You Need to Do

At the moment, there's not a lot to do, other than be aware of UIMA's capabilities. The technology is still in its early stages, and the early adopters - companies such as Attensity, ClearForest, Cognos, Endeca, Factiva, Kana, Inquira, iPhrase, Inxight, SAS and SPSS - haven't necessarily released their affiliated products. However, by 2006, UIMA-compliant applications will start to arrive. So, when a colleague comes and asks you to come up with ideas on how to better analyze a section of the business, check to see if it generates a lot of unstructured data. If so, UIMA might be just the thing to bring your company's unstructured data into the structured world.

Guy Creese is an analyst with the Burton Group, covering content management and search. Creese has worked in the high-tech industry for 25 years, at both Fortune 500 companies and small startups, in positions ranging from programmer to product manager to customer support engineer.

Copyright 2007, SourceMedia and DM Review.