DM Review | Covering Business Intelligence, Integration & Analytics
Volume Analytics:
Structured + Unstructured Data = N-Structured Data

By Guy Creese
Column published in DMReview.com, December 16, 2004

For decades, vendors and enterprises have treated applications that mine structured data (databases) and unstructured data (text) as completely separate. An analyst drilling into an OLAP cube navigates through numbers, not memos; a person submitting a search query views text results, not report totals. However, smart companies are starting to recognize that this application-centric separation of data types is a time-waster - after all, users just want information, no matter what application it resides in - and these application boundaries are starting to blur.

Users Want to Navigate Across All Data Structures

The result is that BI vendors are starting to investigate search technology; text miners are getting better at generating metrics; and forward-thinking vendors are investigating ways to connect structured and unstructured data (often via XML) so that users can move seamlessly between the two worlds. Put simply, the design problem is being turned on its head. Rather than assuming a data structure and then designing an application so the user can navigate through it, a curious user is assumed and a morphing data structure (n-structure) handles the user's navigation across all types of data.

Options 1 and 2: Structured Informs Unstructured and Vice Versa

For many years, BI vendors have depended on their users' familiarity with their software for navigation: "Sales by region? Oh, that's in report 62." However, as large corporate rollouts increase the BI user population, many of the users arrive untutored and don't use the software enough to become experts. In this case, using search to help users find the appropriate report makes a lot of sense.
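
As a rough illustration of search helping an untutored user find the right report, the sketch below ranks a small report catalog by how many query words appear in each report's title and description. The catalog entries (other than the "Sales by region" report 62 mentioned above), the Report class and the find_reports function are all hypothetical, and the word-overlap scoring is a deliberately naive stand-in for a real search engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    report_id: int
    title: str
    description: str


# Hypothetical report catalog; only "Sales by region" (report 62) comes from the text.
CATALOG = [
    Report(62, "Sales by region", "Quarterly sales totals broken out by sales region"),
    Report(17, "Inventory aging", "Stock on hand grouped by days in warehouse"),
    Report(45, "Customer profitability", "Revenue and margin by customer account"),
]


def tokenize(text):
    """Lowercase the text and return its distinct words, minus edge punctuation."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if w}


def find_reports(query, catalog=CATALOG):
    """Return reports sharing at least one word with the query, best match first."""
    terms = tokenize(query)
    scored = []
    for report in catalog:
        words = tokenize(report.title + " " + report.description)
        score = len(terms & words)  # naive relevance: count of shared words
        if score:
            scored.append((score, report))
    # Highest score first; ties broken by report number for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1].report_id))
    return [report for _, report in scored]


if __name__ == "__main__":
    for report in find_reports("Sales by region?"):
        print(report.report_id, report.title)
```

A production system would at least drop stop words such as "by" and weight title matches more heavily; the point here is only that the user types a question instead of remembering a report number.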

Unstructured data technology can help its counterpart; the reverse is true as well. Nowadays, text mining and categorization software can help characterize text, whether it be in memos, e-mails or other forms. For example, text mining, by trawling through e-mails and customer support call logs, can score a customer's propensity to buy or default on his bills. All of a sudden, unstructured text can supply metrics that can be crunched by BI applications.
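
To make the idea of text supplying metrics concrete, here is a minimal sketch that turns free-text support notes into per-customer counts that a BI tool could aggregate like any other measure. The keyword lists, function names and sample notes are assumptions made up for illustration; real text mining products rely on far richer linguistic and statistical models than simple keyword counting.

```python
# Hypothetical signal words; a real text miner would use trained models instead.
BUY_TERMS = {"upgrade", "quote", "renew", "order"}
RISK_TERMS = {"late", "overdue", "dispute", "cancel"}


def score_text(text):
    """Count buying and payment-risk signal words in one piece of text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return {
        "buy_signals": sum(w in BUY_TERMS for w in words),
        "risk_signals": sum(w in RISK_TERMS for w in words),
    }


def customer_metrics(logs):
    """Roll (customer, text) pairs up into one row of numeric metrics per customer."""
    totals = {}
    for customer, text in logs:
        scores = score_text(text)
        row = totals.setdefault(customer, {"buy_signals": 0, "risk_signals": 0})
        row["buy_signals"] += scores["buy_signals"]
        row["risk_signals"] += scores["risk_signals"]
    return totals


if __name__ == "__main__":
    # Made-up notes standing in for e-mails and call logs.
    sample_logs = [
        ("Acme", "Please send a quote, we want to upgrade."),
        ("Acme", "Invoice is overdue."),
        ("Globex", "We may cancel. Payment will be late."),
    ]
    for customer, row in customer_metrics(sample_logs).items():
        print(customer, row)
```

The output is ordinary structured data (one row per customer, numeric columns), which is what lets a BI application crunch it alongside sales figures.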

Option 3: Connecting the Structured and Unstructured Worlds

The third option is admittedly the least mature, but also the one with the greatest promise. This is the ability to identify common entities (e.g., names of customers, products and companies) as a way to tie the structured and unstructured data together. The arrival of XML and its support of semi-structured data is integral to this form of integration.

Today, an analyst at a component manufacturer may drill into an OLAP cube to discover that sales to HP are down, compared to previous quarters. At that point the analyst is at an analytical dead-end - time to pick up the phone, call some fellow worker and ask, "What's up?" If, on the other hand, the analyst were able to drill "sideways" into a repository of memos and e-mails discussing HP, he might be able to quickly understand the history and cause of the sales decline.

The reverse would apply as well - a vice president receiving a memo discussing sales to HP could click on the term "HP" and instantly receive internal corporate metrics about the HP account, including total sales and profitability. Such an auto-hyperlinking capability is not science fiction - Microsoft ships a form of it in its smart tags technology within Office 2003.
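
The two-way navigation described above can be sketched with a shared entity index. In the example below, the sales figures, document texts, entity list and function names are all hypothetical; the sketch assumes entities can be spotted by exact name match, whereas a real system would need proper entity extraction to cope with aliases and misspellings.

```python
import re

# Hypothetical structured data: sales keyed by (customer, quarter).
SALES = {
    ("HP", "2004-Q2"): 120_000,
    ("HP", "2004-Q3"): 85_000,
    ("Dell", "2004-Q3"): 90_000,
}

# Hypothetical unstructured data: memos and e-mails keyed by document id.
DOCUMENTS = {
    "memo-1": "HP has delayed its autumn order pending a redesign.",
    "email-7": "Dell confirmed the Q3 shipment schedule.",
    "memo-2": "Follow-up call with HP purchasing scheduled for Monday.",
}

# The common entities that tie the two worlds together.
KNOWN_ENTITIES = {"HP", "Dell"}


def build_entity_index(documents, entities):
    """Map each entity name to the ids of documents that mention it as a whole word."""
    index = {entity: [] for entity in entities}
    for doc_id, text in documents.items():
        for entity in entities:
            if re.search(rf"\b{re.escape(entity)}\b", text):
                index[entity].append(doc_id)
    return index


def drill_sideways(entity, index):
    """From a cube cell about an entity, list the related documents."""
    return sorted(index.get(entity, []))


def metrics_for(entity, sales):
    """From a term clicked in a document, look up that entity's sales by quarter."""
    return {quarter: amount for (customer, quarter), amount in sales.items()
            if customer == entity}


if __name__ == "__main__":
    index = build_entity_index(DOCUMENTS, KNOWN_ENTITIES)
    print(drill_sideways("HP", index))  # structured -> unstructured
    print(metrics_for("HP", SALES))     # unstructured -> structured
```

The design choice worth noting is that neither data store has to change: the entity name acts as the join key, and the index is the only new artifact.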

The Drivers: Ubiquitous PCs, Inexpensive Storage and the Web Browser

It is worth noting that this push toward integration will not go away; rather, it will intensify due to ongoing changes in information infrastructure. Years ago, structured and unstructured data were kept separate because they were stored differently - structured data lived in online databases, while unstructured data resided in printed memos. However, three drivers have made the two equally accessible online: ubiquitous PCs, inexpensive storage and the Web browser.

Twenty-five years ago, the PC was just being invented; computer terminals were for data entry, not data browsing, and most of a corporation's data resided in file cabinets. Today, with a full-featured PC costing less than $1,000, virtually every office worker has a terminal on their desk. Furthermore, almost all corporate data is created digitally - via Microsoft Word or Excel or in e-mail packages. Corporate data being online is now the rule, rather than the exception.

Echoing the price drop of PCs, disk drives and memory chips have also become inexpensive. Data warehouses of 150+ terabytes are no longer uncommon; 80GB disk drives now go for under $100; 1GB secure digital cards are available for PDAs. The upshot of all this inexpensive storage is that all of a corporation's transactions and musings can be accessed at the touch of a button - it is now cheaper to store everything than to spend time deciding what to keep and what to archive.

Finally, the Web browser has melded the two data types into a single viewer. In the past, users viewed structured data within a BI application or Microsoft Excel and perused unstructured data with Microsoft Word or Adobe Acrobat. Today, with a little programming wizardry behind the scenes, a Web browser displays both types of data equally effectively; in fact, when browsing a Web page on Amazon.com, it's hard to discern which section of the page is generated from a database and which is free-standing text.

Use the N-Structured Data Viewpoint to Attack Switching Time Loss

Due to this sea change in infrastructure, both vendors and enterprises need to rethink how users discover and access information. Both BI and search vendors have done a wonderful job of improving usability and cutting query time within their solutions; productivity within standalone applications is now quite high. The main productivity drag is now "switching" time - that is, the time users spend bouncing back and forth between applications searching for answers. Software developers must stop thinking, "We handle only databases" or "We handle only text." Only by thinking in terms of n-structured data will application builders free themselves of past viewpoints and start attacking the next hurdle in user productivity.



Guy Creese is an analyst with the Burton Group, covering content management and search. Creese has worked in the high tech industry for 25 years, at both Fortune 500 companies and small startups, in positions ranging from programmer to product manager to customer support engineer.  He can be reached at gcreese@burtongroup.com.
