DM Review | Covering Business Intelligence, Integration & Analytics

Data Integration:
Understanding Your Data

By Greg Mancuso and Al Moreno, DMReview.com online columnists
June 9, 2005

If you look at how enterprises have done analysis over time, there is a definite shift in both the inputs and the outputs of enterprise information use. Sarbanes-Oxley, HIPAA and Basel II now force a much tighter relationship between management accountability and the enterprise's data repositories.

Traditional analysis started by letting an enterprise take historically stored information and explain to management what had occurred. Data was used to create management reports that gave a picture of what a business had done. Financial information consisted of sales, booked income, inventory and other pertinent data. The issue is that these reports offered only information on what had occurred, with no insight into why it occurred or what might occur next. Any extrapolation or predictive trending required a data-knowledgeable individual to look at the reports and interject management assumptions and specialized knowledge of the business in order to formulate conclusions, goals and objectives useful for driving the business forward.

Today, analysis follows a much different methodology. The advent of business intelligence and business performance management suites has opened analysis up to a whole new way of viewing the enterprise. While the traditional model used operational data alone, the new tool sets take historical data and incorporate management assumptions and predictive modeling techniques to formulate a series of possible outcomes. This technology is evolving further still, incorporating not only the traditional operational information from structured sources but also the massive volumes of unstructured data held within an enterprise's back-office systems.

Perhaps the biggest single challenge facing an enterprise today is how to incorporate these massive volumes of unstructured information. Consider that new legislation now makes management accountable for all information contained within its data systems, structured and unstructured. This means new integration techniques must be developed that allow a total view of any given area of business within the enterprise. Consider, for example, the massive volumes of customer e-mails held within a distribution company's systems, or the amount of information contained in the notes taken by a large call center operation.

Each piece of information describes some critical aspect of a customer: an insurance claim, a complaint about a product or some other area crucial to the organization. If the information is captured, it must be assumed to be useful and important to someone within the organization. Traditionally, only data from operational systems with a digitized representation has been available for analysis. As large as information systems have grown, the structured world accounts for only 20 to 30 percent of all available enterprise information. The remaining 70 to 80 percent is unstructured, and it is virtually untapped and unavailable as a source of analysis. The challenge is how to integrate the structured and unstructured worlds to create an information environment that complies with the new legislation and forms a meaningful repository from which an enterprise can draw information.

When considering the sheer volumes of unstructured information available to any one organization, it becomes evident that it is not sufficient to simply capture and keyword everything. As with any ETL process developed for your organization's data warehouse, business intelligence, business performance management, ERP or what-have-you application, it is absolutely critical to spend the necessary time to review the analytical and reporting requirements and to catalog the types of unstructured information available.

In most cases, the horse must come before the cart: you have to understand the types of information your enterprise has available and decide how to handle them. This will typically be e-mail, documents, audio, graphics, video and so on. Next, segregate the major content classifications (financial, customer service, sales, etc.) and the sources of information (call center, e-mail, internal office automation applications, etc.). Obviously, it is not feasible to wade through every document in the organization and catalog each one before inclusion in the analytical platform; rather, these categories should be gathered at a fairly high level using broad groupings. The resulting metrics are then used in a way analogous to the candidate source data system analysis of the structured ETL process. You'll need to decide what questions and analyses you want answered and what data answers them: "What grouping of e-mails and documents is available that will satisfy this specific business requirement?"
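The high-level cataloging described above can be sketched as a simple rule-based pass over document metadata. The category names, keyword lists and document fields below are illustrative assumptions, not something the column prescribes:

```python
# A minimal sketch of high-level cataloging of unstructured content.
# Categories, sources and keyword rules are illustrative assumptions.
from collections import Counter

CONTENT_RULES = {
    "financial": ("invoice", "budget", "revenue", "forecast"),
    "customer service": ("complaint", "refund", "support", "claim"),
    "sales": ("quote", "proposal", "order", "discount"),
}

def classify(doc):
    """Assign a document {'source': ..., 'text': ...} to a broad grouping."""
    text = doc["text"].lower()
    for category, keywords in CONTENT_RULES.items():
        if any(k in text for k in keywords):
            return (category, doc["source"])
    return ("uncategorized", doc["source"])

def catalog(docs):
    """Count documents per (content classification, source) pair."""
    return Counter(classify(d) for d in docs)

docs = [
    {"source": "e-mail", "text": "Please review the attached Q2 budget forecast."},
    {"source": "call center", "text": "Customer filed a complaint about a damaged unit."},
    {"source": "e-mail", "text": "Sending the revised quote for the Acme order."},
]
print(catalog(docs))
```

Counts like these give the broad groupings the article recommends without cataloging each document individually.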

But how would one derive these metrics? While the major ETL vendors are incorporating some unstructured ETL capabilities, it is the new dedicated unstructured ETL toolsets that are leading the way with more advanced information-profiling capabilities. For example, the dedicated products can let an automated process "loose" against the company's e-mail server to evaluate the contents of specific e-mail accounts, all accounts or only those messages that meet specific criteria. The output of this process is a fairly detailed mapping of the information contained in these e-mails. Groupings could be based on keywords, common phrases, sender, subject or content (e.g., those e-mails that include Microsoft Excel spreadsheet attachments and come from a list of accounts in the finance department).
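The profiling pass described above can be approximated with a few predicates over message metadata. The field names, the sample account list and the example criterion (Excel attachments from finance accounts, echoing the article's own example) are assumptions for illustration:

```python
# A sketch of the e-mail profiling pass: group messages by sender,
# subject keywords and attachment type. Field names and the account
# list are illustrative assumptions.
from collections import defaultdict

FINANCE_ACCOUNTS = {"cfo@example.com", "controller@example.com"}  # assumed

def profile_emails(emails, criteria):
    """Map each criterion name to the ids of messages that satisfy it."""
    groups = defaultdict(list)
    for msg in emails:
        for name, predicate in criteria.items():
            if predicate(msg):
                groups[name].append(msg["id"])
    return dict(groups)

criteria = {
    "finance_excel": lambda m: m["sender"] in FINANCE_ACCOUNTS
    and any(a.endswith((".xls", ".xlsx")) for a in m["attachments"]),
    "claims": lambda m: "claim" in m["subject"].lower(),
}

emails = [
    {"id": 1, "sender": "cfo@example.com", "subject": "Q3 numbers",
     "attachments": ["q3.xlsx"]},
    {"id": 2, "sender": "agent@example.com", "subject": "Insurance claim #881",
     "attachments": []},
]
print(profile_emails(emails, criteria))
```

A real toolset would run predicates like these at mail-server scale, but the output shape, a mapping from criterion to matching messages, is the same.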

By themselves, these data mappings provide invaluable information to the organization, and they are extremely useful to the ETL designer in identifying the subset of unstructured information to include in the solution to a given business requirement. The designer is then free to tackle the next big question: is a repository of unstructured information with relationship and data density mapping sufficient; should some or all of this information be logically linked to the related structured information in the data warehouse; or should it be physically incorporated into the warehouse itself and made directly accessible to users from inside the data warehouse reporting environment? How useful would a structured repository of all of your customer data be if it let you pull the source e-mails behind particular issues or product complaints?
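The "logically linked" option above amounts to joining unstructured documents to warehouse rows on a shared key. The table layout and the matching rule (looking up the sender's e-mail address in a customer dimension) are illustrative assumptions:

```python
# A sketch of logically linking unstructured documents to structured
# warehouse rows via a shared customer key. The table layout and
# matching rule are illustrative assumptions.
customers = {  # stand-in for a data warehouse customer dimension
    "C001": {"name": "Acme Corp", "email": "buyer@acme.example"},
    "C002": {"name": "Globex", "email": "ops@globex.example"},
}

emails = [
    {"id": 101, "from": "buyer@acme.example", "subject": "Complaint: late shipment"},
    {"id": 102, "from": "unknown@other.example", "subject": "Newsletter"},
]

def link_emails(customers, emails):
    """Return (customer_id, email_id) pairs where the sender is a known customer."""
    by_address = {c["email"]: cid for cid, c in customers.items()}
    return [(by_address[m["from"]], m["id"])
            for m in emails if m["from"] in by_address]

print(link_emails(customers, emails))
```

With such a bridge in place, a report on a customer can surface the source e-mails behind a complaint, which is the capability the closing question asks about.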

Undoubtedly, the future of BI and BPM tools must include the unstructured side. This technology is in its infancy, and the need to develop the capability is being driven by the new financial accountability laws and standards. As a result, enterprises are now realizing the incredible value of the volumes of unstructured data contained within their systems. Future articles will detail more in-depth analysis techniques and describe the emerging products being developed to let an enterprise unlock the power of its unstructured data. Understanding the value of this information is the beginning of getting a total view of your enterprise.



Greg Mancuso and Al Moreno are principals with Sinecon, a business intelligence consultancy specializing in data integration and BI/DW solution architecture design. Together they have more than 29 years of data warehouse and business intelligence experience and have implemented many large-scale solutions in both the U.S. and European markets. They may be reached at gmancuso@sinecon-llc.com or amoreno@sinecon-llc.com.

SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.