DM Review Published in DM Review Online in October 2005.

Beyond the Data Warehouse: Primer for Unstructured Data and Semantics

by John Ladley

The long-predicted barrage of unstructured content is here. Over the last two years, regulatory changes (Sarbanes-Oxley, etc.) and some technology advances have made this type of content a relevant target for companies and organizations seeking to manage risk more effectively and accrue business value.

Now that the traditional gurus (Inmon, Kimball, etc.) are addressing this issue, there is a sense of legitimacy to the area. Prior to this, companies felt it was too esoteric to address, even though the problems and opportunities already existed. (Hey, Ralph and Bill are smart guys, but they didn't invent data, for gosh sakes.) However, pain is a great motivator, and many organizations are now in pain due to risk and a total failure to fully leverage the BI technologies of the past few years.

Therefore, in this information architecture column we are going to address the architectural aspects of the amorphous "stuff" known as unstructured data. We will briefly review classification of this complicated type of content. Primarily we will focus on the architectural aspects of processing and using unstructured data. Remember, there is no value in data/information unless it is used.

Why Unstructured Data/Information (UD/UI) is Complicated

In his December 21, 2004, column entitled "The Integration of Unstructured Data into a Business Intelligence System," William McCrosky presents a nice view of the unstructured data spectrum. I repeat it here for reference.

1. "The source document is paper, not electronic. Insurance, medical and human resource forms are often paper-based." Note that the data of interest, in this case, is reasonably structured. Its position on the source document can be spatially located.

2. The source document is structured - as in example 1 - but it is already in an electronic form. A Web technology such as XML may be used.

3. The source document is electronic, but the data of interest is not structured. E-mail and word processing documents fall into this category.

4. The source document is paper, and the data is unstructured. There is no electronic representation of the document - perhaps a historical document prepared before the advent of word processing.

5. The source is a "blob," not a document - such as pictures, voice or video.
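
The spectrum above turns on three questions: is the source a document at all, is it electronic, and is the data of interest structured? As a minimal sketch only (the class, field and function names here are hypothetical and are not taken from McCrosky's column), the five categories could be expressed in Python like this:

```python
from dataclasses import dataclass
from enum import Enum


class SourceCategory(Enum):
    """The five points on the spectrum, numbered as in the list above."""
    PAPER_STRUCTURED = 1         # e.g., paper insurance or medical forms
    ELECTRONIC_STRUCTURED = 2    # e.g., XML
    ELECTRONIC_UNSTRUCTURED = 3  # e.g., e-mail, word processing documents
    PAPER_UNSTRUCTURED = 4       # e.g., historical paper documents
    BLOB = 5                     # pictures, voice or video


@dataclass
class Source:
    # Hypothetical descriptor; a real inventory would carry far more detail.
    is_document: bool    # False for pictures, voice or video
    is_electronic: bool
    is_structured: bool  # can the data of interest be located by position or tag?


def categorize(source: Source) -> SourceCategory:
    """Map a source descriptor onto one of the five categories."""
    if not source.is_document:
        return SourceCategory.BLOB
    if source.is_structured:
        if source.is_electronic:
            return SourceCategory.ELECTRONIC_STRUCTURED
        return SourceCategory.PAPER_STRUCTURED
    if source.is_electronic:
        return SourceCategory.ELECTRONIC_UNSTRUCTURED
    return SourceCategory.PAPER_UNSTRUCTURED
```

For example, an e-mail would be described as `Source(is_document=True, is_electronic=True, is_structured=False)` and would land in category 3.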

Bill Inmon expressed it this way recently: "Reading unstructured data is merely the first step in starting to filter it out. After the unstructured data is read, it needs to be edited and prioritized. The problem is that the unstructured data is exactly that - unstructured. There is no structure or format for the data; therefore, getting a handle on what is important and what is not important is no small feat." (DM Review Magazine, December 2004.)

These two excerpts point out the difficulty and complexity of getting a handle on what is important within UI/UD. Fortunately, enough work has been done with this "stuff" to present a few options for developing an architecture and technology to manage UI/UD.

Fundamental Components of UD/UI Architecture

There are many generic components to consider when developing UD/UI solutions for your organization.

These are:

  • Taxonomy - A taxonomy is a hierarchical classification structure that descends from broad to specific, or from parent to child. The UI/UD architecture demands an effective taxonomy. This is a step that cannot be avoided. In our structured data warehouse-focused decades, most of us were able to sneak up on delivering a product without substantial (if any) meta data. UI/UD demands organized meta data.
  • Ontology - An ontology describes the rules and views of the taxonomy. Think of taxonomy as a hierarchical logical model, and ontology more as logical tables or networks, i.e., views with triggers - crude, but it gets you there. Alternatively, an ontology is a way to organize taxonomies (and other expressions of data relationships): "An ontology is a formal way to organize knowledge and terms. Typically ontologies are represented as graphical relationships or networks, as opposed to taxonomies which are usually represented hierarchically." An example would be a query that looks for Cabernet. The ontology would know that Cabernet is also a type of wine.
  • Content acquisition - regardless of how it is viewed and arranged, content is the "stuff" that is read, the real instances of UI/UD.
  • Parse - no matter what tool or approach, at some point, UI/UD needs to be chopped up into bits to be presented, summarized or analyzed.
  • Tag or ascribe - Semantics is a growing science, but content still needs to have meaning and context. Semantic engines combine taxonomy and ontology into an expression of context. Or, to wax philosophical, meaning vs. definition. Whether the meaning is extracted out of the data and stored as structure or the content is tagged in some meaningful way is irrelevant. All UI/UD needs to be examined and have some context assigned.
  • Management of UI/UD - like any other content to be managed, UI/UD needs some basic functions that are made quirky due to the nature of the content.
    • Memory and storage - Most likely UI/UD will occupy lots more disk and require lots more memory to deal with. Some of the products on the market are calling for many gigabytes of memory to manage the ontology schemes.
    • Community management - UI/UD is useless unless it can be moved about and shared. Determining who can collaborate, view and share is a function of business needs, cross-functional process design and regulatory governance.
    • Content management - Loading, tracking and storing with a good address, index and viewing platform is mandatory. UI/UD may go beyond the capability of some content management packages, however, so be prepared to look into less common software such as that used in the film and news industries.
    • View and use - Invariably, you might say, "There is a fact in this document" and another party will say, "Fine, show me the document and the fact." Therefore, query tools and reporting take on combining UI/UD with traditional "rows and columns."
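
To make the taxonomy, ontology and tagging components above a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any product: the terms, the relationships (`pairs_with`, `grown_in`) and the function names are all hypothetical, and the "semantic engine" is reduced to a whole-word lookup. The taxonomy is stored as child-to-parent links, the ontology as a small network of typed relationships.

```python
import re

# Taxonomy: a hierarchy stored as child -> parent (hypothetical sample terms).
TAXONOMY = {
    "beverage": None,
    "wine": "beverage",
    "red wine": "wine",
    "cabernet": "red wine",
}

# Ontology: a network of (subject, relationship, object) triples that goes
# beyond the parent/child hierarchy (hypothetical sample relationships).
ONTOLOGY = [
    ("cabernet", "pairs_with", "steak"),
    ("cabernet", "grown_in", "napa valley"),
]


def ancestors(term):
    """Walk up the taxonomy from a term to the broadest category."""
    chain = []
    parent = TAXONOMY.get(term)
    while parent is not None:
        chain.append(parent)
        parent = TAXONOMY.get(parent)
    return chain


def is_a(term, category):
    """True if the taxonomy places term somewhere beneath category."""
    return category in ancestors(term)


def related(term):
    """Non-hierarchical relationships the ontology records for a term."""
    return [(rel, obj) for subj, rel, obj in ONTOLOGY if subj == term]


def tag(text):
    """Tag each known term found in the text with its broader categories."""
    lowered = text.lower()
    tags = {}
    for term in TAXONOMY:
        # Whole-word match so that "wine" does not match inside "swine".
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            tags[term] = ancestors(term)
    return tags
```

With this in place, `is_a("cabernet", "wine")` is true, so a query for wine can pick up a document that only mentions Cabernet, and `tag("We ordered a Cabernet with dinner.")` assigns that sentence the context "red wine", "wine" and "beverage".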

Technology Approaches

How the functions and components just mentioned are implemented is extremely varied right now. Do not permit any vendor to say their way is best. No one knows. Remember relational DBMSs in the early 1980s? They were kludged-up file handlers, fraught with performance issues, and required (what was then considered) enormous resources to work correctly. Many of the tools around UI/UD are at the same point of evolution. Many of the vendors also give the appearance of solutions in search of problems. This means that at this point in product evolution, they may all claim to do it all. This author has seen several instances of products being applied to solve a problem with some of the features listed above, only to find out that a more conventional solution was cheaper, or that the vendor really was not yet capable.

There are many approaches to doing the same thing. One vendor urges an ETL-type tool that scans unstructured data to create structured information. Another uses a semantic engine to read and execute queries based on tagged content and ontologies. Another uses set theory processing versus unit record processing. Another uses a parsing approach to break the data out into facts that can then be queried. And several others offer combinations of the above.
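
As a rough illustration of the parsing approach (break the content into facts, then query the facts), the Python sketch below pulls dates and dollar amounts out of free text and loads them into an in-memory SQLite table. Everything in it is an assumption made for the example: the sample documents, the two regular expressions and the table layout are hypothetical, and real products are far more sophisticated than pattern matching. Each fact keeps its document ID and character position, so the source document can be produced when someone asks to see it.

```python
import re
import sqlite3

# Hypothetical sample content standing in for e-mail or memo text.
DOCUMENTS = {
    "memo-001": "Claim approved on 2005-03-14 for $1,250.00 after review.",
    "memo-002": "Customer called; no decision yet.",
}

# Hypothetical patterns; a real parser would use much richer rules.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "amount": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
}


def extract_facts(doc_id, text):
    """Yield (doc_id, fact_type, value, char_pos) for every pattern match."""
    for fact_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            yield (doc_id, fact_type, match.group(), match.start())


def load(documents):
    """Parse every document and store the extracted facts as rows and columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE facts "
        "(doc_id TEXT, fact_type TEXT, value TEXT, char_pos INTEGER)"
    )
    for doc_id, text in documents.items():
        conn.executemany(
            "INSERT INTO facts VALUES (?, ?, ?, ?)",
            extract_facts(doc_id, text),
        )
    conn.commit()
    return conn


conn = load(DOCUMENTS)
amounts = conn.execute(
    "SELECT doc_id, value FROM facts WHERE fact_type = 'amount'"
).fetchall()
```

Here `amounts` holds the one dollar figure found, together with the ID of the memo it came from. Keeping `doc_id` and `char_pos` alongside each value is the design choice that lets a conventional query lead back to the original unstructured document.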

Basics of the UI/UD Architecture

Remember that architecture is a blend of people, process and technology. And remember that alignment to business usage and requirements is a formula for sustainable information architecture. UI/UD is no different in that regard. There can be no "data dumping" with UI/UD. The potential costs are too high.

However, unlike structured data, there are no metrics, reports or formats to drive requirements. Therefore, the usage, or process, side of architecture development becomes important. Business needs and problems must drive a kind of "perfect world" scenario, where your UI/UD architect identifies specific processes that will leverage UI/UD to address business issues. The blend of processes, business functions, context and timing allows the architect to define communities of practice. The demands and needs of the community of practice subsequently create the parameters that will drive technology selection and development of a road map for implementation. The key here is business communities - hierarchical functional areas must be transcended by communities.

John Ladley is a director for Navigant Consulting, a management consulting firm specializing in knowledge and information asset management and strategic business intelligence planning and delivery. Ladley is an internationally recognized speaker and, more importantly, a hands-on practitioner of information and knowledge management solutions. He can be reached at Comments, ideas, questions and corroborating or contradictory examples are welcomed.

Copyright 2006, SourceMedia and DM Review.