Published in DM Review Online in October 2005.
Beyond the Data Warehouse: Primer for Unstructured Data and Semantics
by John Ladley
The long-predicted barrage of unstructured content is here. Over the last two years, regulatory changes (Sarbanes-Oxley, etc.) and some technology advances have made this type of content a relevant target for companies and organizations seeking to manage risk more effectively and accrue business value.
Now that the traditional gurus (Inmon, Kimball, etc.) are addressing this issue, there is a sense of legitimacy to the area. Prior to this, companies felt it was too esoteric to address. That is not to say the problems or opportunities did not exist previously. (Hey, Ralph and Bill are smart guys, but they didn't invent data, for gosh sakes.) However, pain is a great motivator, and many organizations are now in pain due to risk and a total failure to fully leverage the BI technologies of the past few years.
Therefore, in this information architecture column we are going to address the architectural aspects of the amorphous "stuff" known as unstructured data. We will briefly review classification of this complicated type of content. Primarily we will focus on the architectural aspects of processing and using unstructured data. Remember, there is no value in data/information unless it is used.
Why Unstructured Data/Information (UD/UI) is Complicated
In his column entitled, "The Integration of Unstructured Data into a Business Intelligence System" on December 21, 2004, William McCrosky presents a nice view of the unstructured data spectrum. I repeat it here for reference.
1. "The source document is paper, not electronic. Insurance, medical and human resource forms are often paper-based." Note that the data of interest, in this case, is reasonably structured. Its position on the source document can be spatially located.
2. The source document is structured - as in example 1 - but it is already in an electronic form. A Web technology such as XML may be used.
3. The source document is electronic, but the data of interest is not structured. E-mail and word processing documents fall into this category.
4. The source document is paper, and the data is unstructured. There is no electronic representation of the document - perhaps a historical document prepared before the advent of word processing.
5. The source is a "blob," not a document - such as pictures, voice or video.
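The five categories above can be read as combinations of three questions: is the source a document at all, is it electronic, and is the data of interest structured? The following is a minimal sketch of that reading in Python. The names `SourceCategory` and `classify` are hypothetical and do not come from McCrosky's column; the three yes/no inputs are an assumption about how the spectrum decomposes.

```python
from enum import Enum


class SourceCategory(Enum):
    """The five points on the spectrum (names are illustrative only)."""
    PAPER_STRUCTURED = 1         # paper forms whose fields can be located spatially
    ELECTRONIC_STRUCTURED = 2    # structured and already electronic, e.g. XML
    ELECTRONIC_UNSTRUCTURED = 3  # e-mail, word processing documents
    PAPER_UNSTRUCTURED = 4       # e.g. historical paper documents
    BLOB = 5                     # pictures, voice or video


def classify(is_document: bool, is_electronic: bool,
             is_structured: bool) -> SourceCategory:
    """Map three yes/no answers onto one of the five categories."""
    if not is_document:
        # Category 5: the source is a "blob," not a document.
        return SourceCategory.BLOB
    if is_structured:
        return (SourceCategory.ELECTRONIC_STRUCTURED if is_electronic
                else SourceCategory.PAPER_STRUCTURED)
    return (SourceCategory.ELECTRONIC_UNSTRUCTURED if is_electronic
            else SourceCategory.PAPER_UNSTRUCTURED)
```

For example, `classify(True, False, True)` corresponds to a paper insurance form (category 1), while `classify(True, True, False)` corresponds to an e-mail (category 3).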
Bill Inmon expressed it this way recently: "Reading unstructured data is merely the first step in starting to filter it out. After the unstructured data is read, it needs to be edited and prioritized. The problem is that the unstructured data is exactly that - unstructured. There is no structure or format for the data; therefore, getting a handle on what is important and what is not important is no small feat." (DM Review Magazine, December 2004.)
These two excerpts point out the difficulty and complexity of getting a handle on what is important within the UI. Fortunately, enough work has been done with this "stuff" to present a few options for developing an architecture and technology to manage UI/UD.
Fundamental Components of UD/UI Architecture
There are many generic components to consider when developing the UD/UI solutions for your organization.
How the functions and components just mentioned are implemented is extremely varied right now. Do not permit any vendor to say their way is best. No one knows. Remember relational DBMSs in the early 1980s? They were kludged-up file handlers, fraught with performance issues, that required (what was then considered) enormous resources to work correctly. Many of the tools around UI/UD are at the same point of evolution. Many of the vendors also give the appearance of offering solutions in search of problems. This means that, at this point in product evolution, they may all claim to do it all. This author has seen several instances of a product being applied to solve a problem with some of the features listed above, only to find that a more conventional solution was cheaper or that the vendor really was not yet capable.
There are many approaches to doing the same thing. One vendor urges an ETL-type tool that scans unstructured data to create structured information. Another uses a semantic engine to read and execute queries based on tagged content and ontologies. Another uses set theory processing rather than unit record processing. Another uses a parsing approach to break the data out into facts that can then be queried. And several others offer combinations of the above.
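To make the parsing approach a little more concrete, here is a toy sketch in Python that breaks free text into typed "facts" and then queries them. It is not how any particular vendor's product works. The `Fact` record, the three regular-expression patterns and the `extract_facts` and `query` helpers are all assumptions made for illustration; a real tool would rely on far richer rules, tagging or ontologies.

```python
import re
from typing import List, NamedTuple


class Fact(NamedTuple):
    kind: str      # what sort of fact this is, e.g. "date"
    value: str     # the matched text
    position: int  # character offset in the source text


# Hypothetical patterns, chosen only to keep the example small.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "amount": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def extract_facts(text: str) -> List[Fact]:
    """Scan unstructured text and return the facts found, in document order."""
    facts = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            facts.append(Fact(kind, match.group(), match.start()))
    return sorted(facts, key=lambda fact: fact.position)


def query(facts: List[Fact], kind: str) -> List[str]:
    """Return the values of all facts of one kind."""
    return [fact.value for fact in facts if fact.kind == kind]


if __name__ == "__main__":
    note = "Invoice sent 2005-10-03 to jane@example.com for $1,250.00."
    found = extract_facts(note)
    print(query(found, "amount"))
```

Once the text has been reduced to records like these, the facts can be loaded into a conventional table and queried alongside structured data, which is the appeal of this family of approaches.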
Basics of the UI/UD Architecture
Remember that architecture is a blend of people, process and technology. And remember that alignment to business usage and requirements is a formula for sustainable information architecture. UI/UD is no different in that regard. There can be no "data dumping" with UI/UD. The potential costs are too high.
However, unlike structured data, there are no metrics, reports or formats to drive requirements. Therefore, the usage, or process, side of architecture development becomes important. Business needs and problems must drive a kind of "perfect world" scenario, in which your UI/UD architect identifies specific processes that will leverage UI/UD to address business issues. The blend of processes, business functions, context and timing allows the architect to define communities of practice. The demands and needs of each community of practice subsequently create the parameters that will drive technology selection and development of a road map for implementation. The key here is business communities: communities must transcend hierarchical functional areas.
John Ladley is a director for Navigant Consulting, a management consulting firm specializing in knowledge and information asset management and strategic business intelligence planning and delivery. Ladley is an internationally recognized speaker and, more importantly, a hands-on practitioner of information and knowledge management solutions. He can be reached at email@example.com. Comments, ideas, questions and corroborating or contradictory examples are welcomed.
Copyright 2006, SourceMedia and DM Review.