
Notes from the Giga Advisor:
How to Distinguish a Data Miner from a Data Hacker

Column published in DM Review Magazine, November 2001 Issue
By Lou Agosta

Data alone is not knowledge, but merely data, because it lacks structure, organization, direction, coherence and conceptual focus. Just as an interpretation without supporting data would be empty, data without a unifying interpretation is meaningless and leaves the data collector blind. The challenge for the data miner - and what distinguishes the data miner from the data hacker - is to systematically discover and verify meaning amid apparent disorder and, occasionally, even chaos.

One important goal of data mining is to discover, define and determine the relationship between variables. These variables, in turn, represent the levers and mechanisms that move business operations. A variety of pitfalls relating to extraneous, hidden and distorter variables can result in misunderstandings and inaccurate conclusions.

For example, within any given age group, people who are in hospitals experience a higher death rate than those who are not. Hospital stay is the independent, upstream variable; death rate is the dependent, downstream variable. Age is the candidate control variable, but it is a poor one: it is an extraneous variable here. When the relationship between hospital admission and outcome (death) is controlled for diagnosis code, the relationship is clarified. The death rate for people in the hospital with a given diagnosis code is lower than that for people with the same diagnosis who are not admitted. Diagnosis code is actually the independent variable here.
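The effect of controlling for diagnosis code can be sketched in a few lines of Python. The counts below are hypothetical, invented only to illustrate the pattern for a single age group; the diagnosis labels "severe" and "mild" and the function name `death_rate` are likewise assumptions, not anything taken from real hospital data.

```python
# Hypothetical counts for one age group, invented for illustration only.
# Key: (diagnosis, admitted_to_hospital) -> (deaths, people)
ADMISSIONS = {
    ("severe", True): (200, 1000),
    ("severe", False): (60, 200),
    ("mild", True): (2, 200),
    ("mild", False): (20, 1000),
}


def death_rate(counts, admitted, diagnosis=None):
    """Deaths per person for one admission status, optionally within one diagnosis."""
    deaths = people = 0
    for (dx, adm), (d, n) in counts.items():
        if adm == admitted and diagnosis in (None, dx):
            deaths += d
            people += n
    return deaths / people


# Uncontrolled comparison: pool every diagnosis together.
pooled = {adm: death_rate(ADMISSIONS, adm) for adm in (True, False)}

# Controlled comparison: compare admitted vs. not admitted within each diagnosis.
by_diagnosis = {
    dx: {adm: death_rate(ADMISSIONS, adm, dx) for adm in (True, False)}
    for dx in ("severe", "mild")
}

print(f"pooled: admitted {pooled[True]:.1%}, not admitted {pooled[False]:.1%}")
for dx, rates in by_diagnosis.items():
    print(f"{dx}: admitted {rates[True]:.1%}, not admitted {rates[False]:.1%}")
```

With these made-up counts, the pooled death rate is higher for admitted patients, yet within each diagnosis the admitted patients fare better. The pooled figure mostly reflects who gets admitted, not what admission does.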

A hidden variable is exemplified by the case where investment-grade corporate bond prices ordinarily move with the price of Treasury bills. The hidden variable is the market's willingness to incur risk at all. It surfaces in a crisis: when Russia issued a moratorium on payments of its corporate debts in August 1998, suspending the convertibility of the ruble, T-bills soared and bonds plunged.
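One way to look for a hidden variable of this kind is to compute the correlation separately for each state of the suspected variable. The sketch below assumes a hypothetical `regime` label ("calm" or "stress") standing in for the market's willingness to bear risk; the price changes are invented for illustration and are not market data.

```python
from math import sqrt

# Hypothetical observations, invented for illustration only:
# (regime, T-bill price change, corporate bond price change)
OBSERVATIONS = [
    ("calm", 0.1, 0.2),
    ("calm", -0.2, -0.3),
    ("calm", 0.3, 0.4),
    ("calm", -0.1, -0.2),
    ("stress", 1.0, -2.0),
    ("stress", 1.5, -3.1),
    ("stress", 0.8, -1.5),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_by_regime(observations):
    """Group observations by regime and correlate T-bill and bond changes in each."""
    groups = {}
    for regime, tbill, bond in observations:
        xs, ys = groups.setdefault(regime, ([], []))
        xs.append(tbill)
        ys.append(bond)
    return {regime: pearson(xs, ys) for regime, (xs, ys) in groups.items()}


by_regime = correlation_by_regime(OBSERVATIONS)
for regime, r in by_regime.items():
    print(f"{regime}: correlation {r:+.2f}")
```

In this toy data the two series move together while the market is calm and in opposite directions under stress. A single correlation computed over all rows would hide that the relationship depends on the regime.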

A distorter variable is exemplified by the apparent finding that more married people than single people commit suicide. However, when the population is segmented by age, it is found that in each age group the single suicides outnumber the married. This actually supports the thesis that marriage reduces suicide. A distorter variable - in this case age - reveals that the correct interpretation is exactly the opposite of what the original data suggested.
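The reversal can be made concrete by comparing crude rates with age-standardized rates. The sketch below uses direct standardization, a standard technique that the column itself does not name: each group's age-specific rates are applied to one common age distribution. All counts are hypothetical and chosen so that married people skew older; the function names are illustrative.

```python
# Hypothetical counts, invented for illustration only.
# Key: (marital_status, age_group) -> (suicides, population)
SUICIDES = {
    ("single", "young"): (90, 900_000),
    ("single", "older"): (40, 100_000),
    ("married", "young"): (5, 100_000),
    ("married", "older"): (270, 900_000),
}


def crude_rate(counts, status):
    """Suicides per 100,000 people for one marital status, ignoring age."""
    cases = sum(c for (s, _), (c, _) in counts.items() if s == status)
    people = sum(n for (s, _), (_, n) in counts.items() if s == status)
    return 1e5 * cases / people


def standardized_rate(counts, status):
    """Directly age-standardized rate per 100,000 for one marital status.

    The standard population is the combined population of each age group,
    so both marital statuses are weighted by the same age distribution.
    """
    std_pop = {}
    for (_, age), (_, n) in counts.items():
        std_pop[age] = std_pop.get(age, 0) + n
    expected = sum(
        cases / n * std_pop[age]
        for (s, age), (cases, n) in counts.items()
        if s == status
    )
    return 1e5 * expected / sum(std_pop.values())


crude = {s: crude_rate(SUICIDES, s) for s in ("single", "married")}
adjusted = {s: standardized_rate(SUICIDES, s) for s in ("single", "married")}

for s in ("single", "married"):
    print(f"{s}: crude {crude[s]:.1f}, age-standardized {adjusted[s]:.1f} per 100,000")
```

With these invented counts the crude rate is higher for married people, but once both groups are weighted by the same age distribution the single rate is higher, matching the age-by-age comparison described above.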

One run of undirected knowledge discovery showed that wealthy customers whose children had graduated from college were starting to draw on home equity lines of credit. One might expect those with children of college age to do so - but why those whose children had already graduated? This finding might have been dismissed as useless or unintelligible. However, further investigation indicated that these people were starting home-based businesses. This was an actionable result.

The recommendation? To understand the relationship between independent and dependent variables, the data miner must take account of other related variables (test factors) and stratify, control, decompose and associate them with the independent and dependent variables in turn. "There are no spurious relationships - only spurious interpretations," runs a reflection, circa 1968, from an early data miner.

If we don't quite know what we're looking for - as in undirected knowledge discovery - then we might not recognize it when it shows up. The job is to define and progressively refine the business result being sought through data mining. Management judgment, business experience and insight into the dynamics of the market are required to sort through the anomalies that turn up and to prioritize which of them are worthy of further resources, effort and investigation. Plan on leveraging management's understanding of the business and its operations as a data mining resource.

In other words - and this cannot be said enough - if neural networks and genetic algorithms can learn from experience (or at least from an encoding of experience), then management must do so as well. When management learns from experience and makes that experience available in a non-dogmatic, supportive way to the application of the technology to the business, win/win results become possible.

This is where data mining is distinguished from data hacking. The data miner has a depth of understanding, experience, and perspective and has systematically organized them; the data hacker has, well, data.



Lou Agosta is the lead industry analyst at Forrester Research, Inc. in data warehousing, data quality and predictive analytics (data mining), and the author of The Essential Guide to Data Warehousing (Prentice Hall PTR, 2000). Please send comments or questions to lagosta@acm.org.



SourceMedia (c) 2005 DM Review and SourceMedia, Inc. All rights reserved.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.