DM Review | Covering Business Intelligence, Integration & Analytics
Data Mining In Depth:
Description is Not Prediction

Column published in DM Review Magazine, March 2003 Issue
By Herb Edelstein

What does "data mining" really mean? As a consultant and instructor in this field, I see a lot of confusion among people looking to extract the most value from their databases.

Usually they assume that data analysis is the same thing as data mining. In the traditional query-driven approach, the analyst generates a series of questions based on his or her domain knowledge, perhaps guided by an idea (hypothesis) to be tested. The answers to these questions are used to deduce a pattern or verify the hypothesis about the data.

Businesses that rely on queries, reports and OLAP systems often consider these activities to be data mining; at best, they are only the first step. These businesses run into trouble when they try to generalize from the information they've uncovered and use it as a guide to future behavior. A description is not the same as a prediction.

Data mining uses a variety of data analysis tools to discover patterns and relationships in data that can be used to make reasonably accurate predictions. It is a process, not a particular technique or algorithm. I want to emphasize that the goal of data mining is prediction, generalizing a pattern to other data. Exploring and describing the database is merely the starting point.

The traditional approach falls short on several counts when it comes to making useful predictions. First, the analyst may fail to select the most appropriate attributes (columns in the database). It may be easy to decide that annual purchases is a more significant number than customer ID; but when you're dealing with 5 million cases, each of which has 200 attributes, it is extremely difficult to identify everything that is important.

As databases grow larger and more complex (e.g., 50 million cases each with 2,000 attributes), it becomes virtually impossible for any individual to know the data well enough to say with confidence which variables affect behavior. The difficulty is exacerbated by the fact that the best predictors may not be individual attributes, but rather combinations of attributes.
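To illustrate how a combination of attributes can predict what no single attribute does, here is a minimal Python sketch. The records and the attribute names (owns_home, has_children) are invented for illustration and do not come from any real database.

```python
from collections import defaultdict

# Hypothetical toy records: (owns_home, has_children, bought)
records = [
    (0, 0, 0), (0, 0, 0),
    (0, 1, 1), (0, 1, 1),
    (1, 0, 1), (1, 0, 1),
    (1, 1, 0), (1, 1, 0),
]

def purchase_rate_by(records, key):
    """Return {group: share of buyers}, where group = key(record)."""
    totals = defaultdict(int)
    buyers = defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        buyers[group] += rec[2]  # rec[2] is 1 for a buyer, 0 otherwise
    return {group: buyers[group] / totals[group] for group in totals}

# Each attribute on its own: the purchase rate is identical in every group.
by_home = purchase_rate_by(records, lambda r: r[0])
by_children = purchase_rate_by(records, lambda r: r[1])

# Both attributes together: the groups separate buyers from non-buyers.
by_both = purchase_rate_by(records, lambda r: (r[0], r[1]))

print("by owns_home:    ", by_home)
print("by has_children: ", by_children)
print("by both:         ", by_both)
```

In this contrived data set, a query on either column alone would suggest that neither attribute matters, while the pair of attributes determines the outcome completely. Real data is rarely this clean, but the same effect is why examining attributes one at a time can miss the best predictors.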

Because data mining is essentially an iterative process, quantitative results go through a reality check and are revised as needed until a meaningful predictive model evolves. The knowledge of the domain expert guides the analysis of the data and the manipulation of variables.

Data mining also addresses another failing of the descriptive approach. Even after a pattern is unearthed through a series of queries, the analyst can't be sure whether that pattern holds true for anything other than the collection of data used to find it. The analyst may try to identify potential buyers of a certain product after building a profile of customers who have already bought that product, but will this profile apply to people who are not yet customers?

For example, analysis may show that 75 percent of purchasers for a certain retail product are male. Therefore, the retailer decides to target men as the likeliest potential buyers in the future. However, if the store's overall customer distribution is 75 percent male and 25 percent female, there's not much new information in the fact that 75 percent of this particular product's buyers are male. Data mining might reveal that education and age are better predictors of buying behavior than gender. Perhaps this product will be especially popular with a particular demographic segment of women, implying a very different promotional strategy than initially planned.
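The comparison in this example can be written as a simple ratio, often called lift: a segment's share among buyers divided by its share among all customers. The Python sketch below uses invented counts chosen to match the 75 percent figures above; the second segment (women aged 25 to 34) and its counts are purely hypothetical.

```python
def lift(segment_buyers, all_buyers, segment_customers, all_customers):
    """Segment's share among buyers divided by its share among all customers.

    A value near 1.0 means the segment buys at about the same rate as
    everyone else, so membership in it carries little new information.
    The ratio is rearranged to avoid intermediate rounding.
    """
    return (segment_buyers * all_customers) / (all_buyers * segment_customers)

# Hypothetical counts: 10,000 customers, 400 of whom bought the product.
ALL_CUSTOMERS = 10_000
ALL_BUYERS = 400

# Men: 75 percent of customers and 75 percent of buyers.
male_lift = lift(300, ALL_BUYERS, 7_500, ALL_CUSTOMERS)

# Hypothetical segment: women aged 25 to 34 are 5 percent of customers
# but 15 percent of buyers.
segment_lift = lift(60, ALL_BUYERS, 500, ALL_CUSTOMERS)

print("lift for men:", male_lift)
print("lift for women aged 25 to 34:", segment_lift)
```

With these assumed numbers, the lift for men is 1.0, which is why the 75 percent figure tells the retailer nothing new, while the hypothetical segment buys at three times the overall rate.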

Data mining methodology, on the other hand, tries to verify that the patterns you find can be used for prediction (i.e., that they are applicable beyond the original database). It does this using a variety of techniques, such as dividing the database and developing a predictive model on one portion that is then tested on the other portion. Data mining can assess both the mathematical accuracy and the potential costs and revenues of a particular predictive pattern. (If it costs $100 each to reach the ideal buyer for your $25 product, you might want to modify your marketing plan.)
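The split-and-test idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the data is synthetic, the "model" is a single spending threshold, and the 70/30 split is just one common choice. The final function applies the cost check from the parenthetical example.

```python
import random

def make_cases(n, seed):
    """Synthetic cases (annual_purchases, bought). Hypothetical data."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        spend = rng.uniform(0, 1000)
        # Higher spenders buy more often; noise keeps the pattern imperfect.
        buy_probability = 0.8 if spend > 600 else 0.2
        bought = 1 if rng.random() < buy_probability else 0
        cases.append((spend, bought))
    return cases

def split(cases, train_fraction, seed):
    """Shuffle a copy of the cases and divide it into two portions."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(cases, threshold):
    """Share of cases where 'spend > threshold' matches the actual outcome."""
    hits = sum((spend > threshold) == bool(bought) for spend, bought in cases)
    return hits / len(cases)

def fit_threshold(train):
    """Pick the candidate threshold that is most accurate on the training portion."""
    return max(range(0, 1001, 50), key=lambda t: accuracy(train, t))

def expected_profit_per_contact(response_rate, revenue_per_sale, cost_per_contact):
    """Expected revenue from one contact minus the cost of making it."""
    return response_rate * revenue_per_sale - cost_per_contact

cases = make_cases(2000, seed=1)
train, test = split(cases, train_fraction=0.7, seed=2)

threshold = fit_threshold(train)        # the model is built on one portion...
train_acc = accuracy(train, threshold)
test_acc = accuracy(test, threshold)    # ...and checked on the other

print("threshold:", threshold)
print("accuracy on training portion:", round(train_acc, 3))
print("accuracy on held-out portion:", round(test_acc, 3))

# Even if every contacted person bought, a $25 sale cannot cover a $100 contact.
print("profit per contact:", expected_profit_per_contact(1.0, 25, 100))
```

The number to watch is the accuracy on the held-out portion, since those cases played no part in choosing the threshold. A pattern that scores well on the training portion but poorly on the held-out portion describes the original data without predicting anything beyond it.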

Clearly, there is more to data mining than just summarizing and querying the database, but running algorithms should only require 10 to 20 percent of a project's time and resources. The bulk of the effort needs to be spent on data preparation, which includes building the data mining database, exploring the data and transforming the data for mining. As predictive models are generated, they need to be evaluated to ensure that they are meaningful. The ultimate results can be very rewarding.

This column will offer a series of short explorations of important and interesting data mining issues, based on the questions and concerns of my consulting clients and the people who attend my classes on data mining. I recognize that there is a diversity of approaches and opinions within the data warehousing, data mining and business intelligence communities. Therefore, I invite you to share your ideas with me. Please e-mail your comments, questions or suggestions about subjects you would like me to address to feedback@twocrows.com.



Herb Edelstein is an internationally recognized expert in data mining, data warehousing and CRM, consulting to both computer vendors and users. A popular speaker and teacher, he is also a co-founder of The Data Warehousing Institute. He can be reached at feedback@twocrows.com.


SourceMedia (c) 2005 DM Review and SourceMedia, Inc. All rights reserved.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.