Business Intelligence:
Mining the Whole Enchilada

Column published in DM Review Magazine, March 1998 Issue
By Susan Osterfelt

You recall some of the great debates of our times: Kennedy vs. Nixon, Internet Explorer vs. Netscape, Java vs. ActiveX and MOLAP vs. ROLAP. One of these debates had a clear winner (Kennedy). The others still rage on, with no clear winner declared. The MOLAP vs. ROLAP debate, most familiar to those of us who are enabling our organizations to gain insight through business intelligence tools, has been heated for a while. In 1996, Gartner Group declared that ROLAP would be the winner; however, Arbor (with Essbase) and other MOLAP solution providers have been quietly and very successfully pressing the opposing viewpoint.

Now a new debate is commencing--one that promises to contend for the time and attention of data warehouse and business intelligence managers. This debate centers on whether to do data mining on the whole database (the whole enchilada) or on a sample set of records. The issue arises as data warehouses approach multi-terabyte size and detail is retained for all source records within an enterprise. Technology has progressed to the point where the computing power to mine larger datasets is available. Where previously we could not mine it all, SMP and MPP solutions now make it possible. That leads some (including vendors of high-performance decision support database engines) to say that sampling is becoming less critical, if not irrelevant.

Others, including Gartner, argue that just because mining the whole enchilada can be done does not necessarily mean that it should be done. Gartner contends that successful data mining efforts should focus more on data quality than on the size of the database. Data miners typically spend 60-80 percent of their time addressing data quality issues before they can get down to the task at hand. Data quality in a large heterogeneous data warehouse populated from tens or hundreds of sources is suspect, and as a result, data cleansing becomes a critical aspect of data mining. In addition, the potential exists within large heterogeneous databases for semantic misalignment of data. "Balance" could mean ledger balance, average balance, ending balance, collected balance, cycle-to-date balance, net balance, etc. Mining on balance data without reconciling these semantic differences can yield deceptive results. The data may be clean; it just means different things. A practical use of sampling or subsetting based on knowledge of these semantic differences could produce more believable results.
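The subsetting idea above can be sketched in a few lines: before mining, partition records into semantically consistent groups so that each group contains only one meaning of "balance." This is a minimal illustration, not any vendor's tool; the field names ("balance_type", "amount") are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows drawn from a heterogeneous warehouse. In practice the
# balance_type tag would come from source-system metadata during cleansing.
records = [
    {"balance_type": "ledger",  "amount": 1200.00},
    {"balance_type": "average", "amount": 980.50},
    {"balance_type": "ledger",  "amount": 640.25},
    {"balance_type": "ending",  "amount": 1100.00},
]

def partition_by_semantics(rows, key="balance_type"):
    """Group rows into subsets that each carry one meaning of 'balance',
    so a mining run never mixes, say, ledger and average balances."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[key]].append(row)
    return dict(subsets)

subsets = partition_by_semantics(records)
# Each subset can now be cleansed and mined separately.
```

Each subset can then be mined (or sampled) on its own, which is one way to act on the semantic knowledge the column describes.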

What about sampling? One side of the argument says that sampling can bias the results of data mining and that mining more data will yield more interesting patterns and relationships among the data. Yet sampling has been used very successfully for years to reduce the record set to a manageable size; statisticians have found ways to mitigate the risks of creating possibly spurious relationships through the process of sampling. In fact, one question to ask when engaging in an internal organizational debate on the subject is whether there is a need for quick turnaround with an answer. Mining the whole enchilada could take some time; sampling can usually obtain results in much less time.
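A simple random sample of the kind statisticians have long relied on can be sketched as follows; this is an illustrative fragment, not a production data mining pipeline, and the fraction and seed values are arbitrary. Fixing the seed makes the draw reproducible, one of the standard ways to keep sampling risks auditable.

```python
import random

def sample_records(records, fraction=0.1, seed=42):
    """Draw a simple random sample without replacement.
    A fixed seed makes the same subset come back on every run,
    so results from the sample can be reproduced and checked."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)

population = list(range(1_000_000))   # stand-in for warehouse rows
subset = sample_records(population, fraction=0.01)
# subset now holds 10,000 of the 1,000,000 rows, ready to mine quickly.
```

Mining the 1 percent subset finishes far sooner than mining the whole enchilada, which is exactly the quick-turnaround trade-off described above.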

Procedural or mathematical complexity is another basis for dispute within the debate. Larger datasets can require more data-specific, and thus more complex, procedures to extract value from them. At least one vendor--Tandem--is putting specialized data manipulation functions required by data mining algorithms into its NonStop SQL/MX database engine, allowing mining against the whole enchilada. This may mitigate the mathematical complexity issue somewhat; however, mining against smaller subsets of data remains a less complex task.

Which side will win the data mining debate? As usual, there is no easy answer. Indeed, there may not be a clear winner for some time to come. Certainly the largest organizations with significant parallel processing resources available to them will want to pursue data mining against the whole enchilada. They will undoubtedly be able to obtain business value by uncovering trends, patterns and relationships that may be obscured by sampling. However, the cost and complexity of this solution will encourage many organizations to see what can be gained from intelligent, judicious use of clean, semantically reconciled data subsets. We will assuredly see more announcements from vendors claiming expertise in mining the whole enchilada or in mining a sample/subset. These announcements will be thought-provoking. While we yearn for the proverbial silver bullet, our choice, as always, remains situational.



Susan Osterfelt is senior vice president at Bank of America, in Charlotte, North Carolina. She can be reached at susan.osterfelt@bankofamerica.com.


SourceMedia (c) 2005 DM Review and SourceMedia, Inc. All rights reserved.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.