Data Mining Tools: Which One is Best for CRM? Part 3 Continued

Article published in DM Direct Special Report, March 28, 2006 Issue

By Robert A. Nisbet

This article is continued from its first part, which appeared in a previous DM Direct issue.

Insightful Miner

The next tool in order of overall usefulness (see Table 1), Insightful Miner, follows naturally after Clementine for another reason: it has the best selection of ETL functions of any data mining tool on the market. These functions include:

  • Merging, appending, sorting and filtering (similar to Clementine and Statistica Data Miner)
  • Slicing and dicing of input data for purposes of data exploration. These functions are common in database management and business intelligence (BI) tools, but it is very rare to find them in a data mining tool.
  • Joining: creates a new data set by combining columns of two other data sets.
  • Stacking and unstacking: stacking creates a new column by combining two or more columns; unstacking reverses the process.

The only other data mining toolset with greater ETL capability was Torrent Orchestrate, which was purchased by Ascential Software in 2001; IBM, in turn, acquired Ascential in 2005.

The rich ETL capability of Insightful Miner is mated with a graphical programming interface, similar to that of Clementine, Statistica Data Miner and SAS Enterprise Miner. In addition, many useful algorithms are integrated as analysis nodes (neural networks, classification and regression trees, logistic regression and naïve Bayes models).

Insightful Miner provides a number of very valuable capabilities for data mining. Firstly, it is built around the S statistical language, providing a rich statistical analysis and graphics capability via the menu-based S-Plus implementation of the S language. The statistical abilities of Insightful Miner rival those of Statistica Data Miner in their power and completeness.

In addition, Insightful Miner is built on a pipeline architecture that permits easy scaling to the analysis of large data sets. This means that the data analysis algorithms operate in streaming mode, using incremental forms of statistical analysis (e.g., based on provisional means and standard deviations). For many data miners, this may not be important. But if you must analyze large data sets from massive data warehouses, this feature can be very important: it means that you don't have to extract data to external data sets; you can stream data directly from the source data structures through the analysis algorithms and build solutions incrementally. The only other tools that can do this are Statistica Data Miner (the first tool to provide this capability) and one algorithm in Fair Isaac's Model Builder tool (not reviewed here).
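To make the streaming idea concrete, here is a minimal Python sketch of incremental statistics in the spirit described above, using Welford's online algorithm for provisional means and variances (an illustration of the general technique, not Insightful Miner's actual implementation):

```python
class RunningStats:
    """Incrementally tracks mean and variance (Welford's algorithm),
    so a data stream never has to be held in memory all at once."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n         # provisional mean
        self.m2 += delta * (x - self.mean)  # accumulates the variance numerator

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Usage: feed records one at a time, e.g., straight from a database cursor.
stats = RunningStats()
for value in (4.0, 7.0, 13.0, 16.0):
    stats.update(value)
print(stats.mean, stats.variance)  # 10.0 30.0
```

Because each record is folded into the running totals and then discarded, the same pattern scales to data sets far larger than available memory.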

The other extremely useful part of Insightful Miner is its set of model evaluation tools. The tool outputs both a coincidence (confusion) matrix and the percent classification accuracy for each output state. Often it is very useful to know how accurate the model is for the positives and negatives separately, rather than report just the global accuracy. Like the model comparison node in Statistica Data Miner, the Lift Chart node in Insightful Miner will accept multiple inputs to produce an overlaid lift chart, although this tool does not provide for a final classification based on a voting process among the algorithms. Finally, an overlaid ROC (receiver operating characteristic) chart can be output from multiple inputs. The area under the ROC curve represents the classification power of an algorithm, reporting how it performs at different cut-points along the range of classification probabilities that create the binary classifications.
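The value of per-class accuracy is easy to demonstrate. The short Python sketch below (with made-up predictions, not Insightful Miner output) shows how a confusion matrix exposes a model that looks strong on global accuracy but performs poorly on the rare positive class:

```python
# Illustrative only: a model that misses most positives can still
# post a flattering global accuracy on an imbalanced data set.
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

print(f"Confusion matrix: TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"Global accuracy:   {(tp + tn) / len(actual):.0%}")  # 80% looks fine...
print(f"Positive accuracy: {tp / (tp + fn):.0%}")  # ...but only 33% of positives caught
print(f"Negative accuracy: {tn / (tn + fp):.0%}")
```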

These features are orchestrated by Insightful Miner to provide an analytical platform that is configurable and extendable throughout the business enterprise. It is scalable and can grow as your data analysis needs grow. The very flexible S-Plus framework provides a powerful and extensible programming environment. Maybe best of all, Insightful Miner provides perpetual licensing without annual rental agreements.

The Future for Insightful Miner?

The scalable architecture of Insightful Miner should be leveraged to create tool kits for analyzing massive data sets. As data sets increase in size, traditional data mining tools become less and less efficient. Two approaches to analytical scalability can be followed: parallelism and streaming-mode operation. The large parallel hardware systems of IBM and NCR are very expensive, so if you don't already need parallelism for the storage and retrieval of massive data sets for other operational purposes, streaming-mode operation is the way to go. Streaming-mode versions of machine learning programs (e.g., neural nets and decision trees) can also be built. These capabilities could become the "killer apps" in the world of mining massive data sets.

KXEN

In the past, KXEN stood alone in many respects: it was the only implementation of statistical learning theory, it was highly automated, it minimized the amount of data preparation necessary before modeling, and its predictive accuracy was the highest among competitors in many cases. This situation is still largely true today, but the gap is narrowing.

KXEN is composed of several modules:

  • K2C - Consistent Coder
  • K2R - Robust Regression
  • K2S - Smart Segmenter
  • KSVM - Support Vector Machine
  • KTS - Time Series
  • KEL - Event Logger
  • KMX - Model Export
  • KAR - Association Rules
  • KXEN Assistant - Menu-based Interface

All of these modules are available through the menu interface, and they can be packaged separately or in bundles. For example, SmartFocus (based in Bristol, UK) offers a suite of smart marketing support packages, including SmartModeler (composed of KXEN K2C and K2R). In fact, the primary focus of KXEN is to provide other companies with embedded data mining capabilities. This business model can't fail to win in the future.

Data mining must become function-based, rather than tool-based. The analytical functions necessary to support mining of nonlinear data sets must become integrated into the very structure of other software tools, much as arithmetic operations are. Business users of standard vertical industry tools must be able to take data mining tools for granted. Today, there is as much art as science in data mining, due primarily to the structure of the analytical tools. Yes, there are problems in every data set that must be solved, and wrinkles that must be smoothed, before running the algorithm. But many of these issues can be handled automatically, at least from a theoretical standpoint. The trick is to invent automated tools that either perform the same operations humans perform or obviate them. KXEN does both, to a great extent.

KXEN automatically performs:

  • Data standardization and recoding,
  • Data segmentation into "smart" segments,
  • Creation of many derived variables from combinations and transforms of existing variables,
  • Handling of missing data by inserting intuitive inferred values.
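As a rough illustration of what such automated preparation involves, the following pandas sketch performs standardization, dummy recoding and missing-value imputation by hand; KXEN's own algorithms are proprietary and certainly more sophisticated:

```python
# A rough sketch (in pandas, not KXEN's algorithms) of the preparation
# steps KXEN automates: standardization, recoding and missing values.
import pandas as pd

df = pd.DataFrame({
    "income":  [52000, 61000, None, 48000],    # numeric with a missing value
    "channel": ["web", "branch", "web", None],  # categorical with a missing value
})

# Handle missing data with inferred values (median / explicit "missing" level).
df["income"] = df["income"].fillna(df["income"].median())
df["channel"] = df["channel"].fillna("missing")

# Standardize numerics to zero mean, unit variance.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Recode categoricals as dummy variables.
df = pd.get_dummies(df, columns=["channel"], prefix="channel")
print(df)
```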

Other requirements that usually must be met in data preparation are removed completely. A good example is classification model evaluation. Most data mining tools evaluate models using the global accuracy approach (see Part 1), which evaluates the accuracy of both the positive and negative classes of a binary classification. But many data mining applications (e.g., CRM applications) focus on one or the other. KXEN focuses its evaluation on the accuracy of the positives only, as reflected in its Ki metric. Many candidate models can be built with KXEN using different parameters, and the best model can be chosen by comparing Ki values. Earlier versions of KXEN provided no expression language for creating new variables, but Version 3.3 adds a nice facility for doing so. Version 3.3 also provides a scripting capability that permits a user to save a modeling session into a script for running later.

All you have to do with KXEN is enter the names and paths of the input and output data sets, select the variables to include in the model and hit the "generate" button; in a very short time, the solution is found. How it is found is part of the genius of KXEN. The approach to identifying the best model follows the structural risk minimization theory of Vladimir Vapnik. Vapnik's theories represent the next generation of statistical analysis: the first generation was the parametric analysis of Fisher; the second, the general linear model, parametric nonlinear analysis and categorical analysis; the third, machine learning; and the fourth, statistical learning theory. I expect that all the data mining tools will gravitate toward statistical learning theory, because its theoretical basis is much more generalized and it avoids many of the assumptions that constrain other approaches (normality, linearity, independence, etc.).

Present Uses of KXEN

Even though KXEN is designed for ease in embedded use, over 80 percent of its customers use it in standalone mode. Why? The reason is not that KXEN has the best user interface; it does not. Nor is the reason that it enables a large variety of statistical and machine learning analyses; KXEN uses only one learning technique. I believe the reason is two-fold: velocity of model building activities and automation of the data mining process.

Velocity

KXEN goes beyond rapid prototyping to rapid final model production. Users at a major bank in the UK, for example, create thousands of models to direct specific customer interactions to optimize business via the myriad of combinations of offers, channels, and customers. One-to-one inbound marketing programs based on customer profiles driven by propensity models can generate huge rewards for companies like the UK bank.

Automation

When using other data mining tools, as much as 90 percent of the time spent in building a model is consumed by data preparation. KXEN does almost all of this automatically. The only data preparation that must be done is to create the Customer Analytic Record (CAR). CARs suitable for analysis must consist of all of the variables associated with a given customer present in the same data record. Also, appropriate dummy variables must be created to eliminate codes like "other" and "unknown"; otherwise the modeling algorithm will pick up on these as modeling variables. Creation of the CAR can be done easily in SQL. Modeling of the CAR can be done via ODBC, flat file, or via a call-level interface. Scoring of the database can be done in SQL, because KXEN is the only tool that I know of that can output models directly in SQL. The closest tool to KXEN in this capacity is SAS, which can output models in SAS code to run against the host database.
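Here is a minimal sketch of the CAR-building step. As noted above, this is typically done in SQL; pandas is used here so the example is self-contained, and all table and column names are hypothetical:

```python
# A minimal sketch of building a Customer Analytic Record (CAR): flatten all
# of a customer's activity into a single row per customer.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "channel":     ["web", "branch", "web", "web", "other"],
    "amount":      [120.0, 80.0, 35.0, 60.0, 15.0],
})

# Turn catch-all codes like "other" into an explicit dummy variable so the
# modeling algorithm does not treat them as a real category.
transactions["channel_other"] = (transactions["channel"] == "other").astype(int)

car = transactions.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    n_purchases=("amount", "count"),
    n_web=("channel", lambda s: (s == "web").sum()),
    n_other_code=("channel_other", "sum"),
).reset_index()
print(car)  # one row per customer: ready for modeling
```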

Preview of KXEN Version 3.3

This version is still in beta form, but I did get a preview of it. My initial impression is ... wow! When this version is released, it will provide the capabilities of a complete data integration and data mining workbench. Insightful Miner is the only tool that comes close. It will have many new tools to facilitate a full range of data operations, from legacy source systems to automated model scoring. The new ETL capabilities will work with most common databases to permit virtually anything you can do in SQL, but without the pain of writing the code. This ETL function generator is so strong, I can imagine that some of my old data warehousing cronies might use it just as a SQL front end. My favorite new tool, though, is a wonderful automatic iteration tool, which permits you to find the minimum number of variables that produce at least a user-definable percentage of the total performance. For example, you can set the threshold to 98 percent, and it will find the fewest variables that meet the threshold, using one of several loss functions. I have spent many days performing this operation manually.
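The logic of such an iteration tool can be sketched as a greedy forward search: keep adding the most helpful variable until the model retains a set percentage of the full model's performance. The Python sketch below (using scikit-learn on synthetic data) illustrates the idea; it is not KXEN's actual algorithm:

```python
# Greedy forward selection: fewest variables that keep >= 98% of the
# full model's cross-validated score. (Illustrative, not KXEN's method.)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)

def score(cols):
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, cols], y, cv=5).mean()

full_score = score(list(range(X.shape[1])))
threshold = 0.98 * full_score  # keep 98 percent of total performance

selected, best = [], 0.0
remaining = list(range(X.shape[1]))
while best < threshold and remaining:
    # add the single variable that improves the score most
    gains = [(score(selected + [c]), c) for c in remaining]
    best, chosen = max(gains)
    selected.append(chosen)
    remaining.remove(chosen)

print(f"{len(selected)} variables reach {best:.3f} "
      f"(threshold {threshold:.3f}, full model {full_score:.3f})")
```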

The Future for KXEN?

As with Clementine, the future of KXEN may lie outside the box of its GUI. I expect to hear of many more analytical and planning packages (e.g., SAP) incorporating data mining capabilities as enablers. Data mining is not really an "end" per se, but a means to an end. These "means" will become progressively submerged in the infrastructure of the products they serve until they are as natural to use as standard arithmetic and graphical techniques. KXEN will undoubtedly lead the charge in this direction.

XL-Miner

This tool scored the lowest in the features analysis, but not by much. It was overshadowed by the other tools largely because of its lack of ETL and descriptive statistical capabilities in the interface. But that lack (and several others in the tool) is partially compensated for by the integration of XL-Miner into Excel. When you take into account all the capabilities of the Excel spreadsheet (including the analysis plug-in), many of these apparent gaps disappear. Eleven percent of KDnuggets readers report using Excel as a data mining tool; presumably, they use Excel for data exploration and analysis prior to modeling in another tool. For that reason, XL-Miner was given a moderate score for the capabilities absent in the tool but present in Excel.

The great benefit of its integration with Excel is offset to some degree by XL-Miner's greatest weakness - limitation of data set size by spreadsheet limits (65,536 rows and 256 columns). This is not a fatal weakness; many CRM models can be trained acceptably on relatively small samples of the data universe. However, for medium to large applications, another tool must be used. Another plus for XL-Miner is its low cost ($850 general; $100 student), which makes it affordable even alongside the purchase of a tool with larger analysis capacity. This means that you can do all of your data assessment, data exploration and reduction of the number of variables to be submitted to the modeling algorithm (dimensionality) directly in Excel, on samples of the total data set if necessary. The only major weaknesses left (for CRM applications) are the lack of ETL capabilities and the limitation on data set size.

Apart from those weaknesses, XL-Miner is a very complete tool for CRM analyses. It includes menu options for data sampling, handling of missing values, binning of continuous data (for categorical analysis), transformation of categorical data, and data set partitioning. And the partitioning option includes one capability that exists in no other tool except Clementine: over-sampling (balancing). Clementine does it via the Distribution and Balance nodes; XL-Miner does it via the menu option "partitioning with over-sampling." This option permits the user to set the desired relative frequency of the positive class (50 percent by default) to apply to data sampling. When analyzing a data set for direct marketing (for example), the tool collects all of the rare responder class and randomly samples the non-responder class to equal the number of responders (if the desired relative frequency is set to 50 percent). The balanced data set is suitable for submission to a neural net or decision tree algorithm. This is a very valuable capability for CRM data mining; with many data mining tools, data set balancing must be done manually, either by using the general database/spreadsheet functions of the tool or by some other tool.
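A minimal sketch of this balancing step, written in pandas with made-up data rather than XL-Miner's menus, looks like this:

```python
# Over-sampling for a rare responder class: keep every responder and
# randomly sample non-responders to match, yielding a 50/50 training set.
import pandas as pd

df = pd.DataFrame({
    "responder": [1]*20 + [0]*980,  # 2% response rate, typical of direct mail
    "feature":   range(1000),
})

responders = df[df["responder"] == 1]
non = df[df["responder"] == 0].sample(n=len(responders), random_state=0)

balanced = pd.concat([responders, non]).sample(frac=1, random_state=0)  # shuffle
print(balanced["responder"].mean())  # 0.5: ready for a neural net or tree
```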

Three options are provided for data reduction (reduction of dimensionality): principal components analysis, hierarchical clustering and K-means clustering. These tools can help to identify the set of variables that have a sufficient relationship with the target variable to include them in the analysis. Data exploration is enhanced beyond the capabilities of Excel by the provision of a scatter plot matrix to help identify the final short list of variables used for modeling. Only Statistica Data Miner, among the tools evaluated here, has that capability. Prediction and classification are accomplished with a relatively rich set of standard data mining algorithms, including CART and a naïve Bayes classifier. There is even an association algorithm for creating association rules.
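As an illustration of the data reduction step, the following scikit-learn sketch (synthetic data, not XL-Miner's implementation) finds how many principal components are needed to retain 95 percent of the variance:

```python
# Data reduction via principal components analysis: how many components
# capture most of the variance before modeling?
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[:, 4:] = X[:, :4] + 0.05 * rng.normal(size=(200, 4))  # redundant columns

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cum, 0.95)) + 1  # first index reaching 95%
print(f"{n_keep} of 8 components explain 95% of the variance")  # ~4 here
```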

Finally, XL-Miner provides a flexible metadata mapping option for relating variables in the model to those in the data sets to be scored by the model. The only other tool that does that is Clementine. This capability permits scoring of data sets from different systems with different metadata. Variables among the modeling and scoring data sets may be identical but named differently, or they might not be identical but close enough to be mapped to each other. For example, a model might be trained on household income, but the deployment data set might be scored on the basis of median income in the census block where a given prospect lives (because household income is not available for the deployment population). This capability can be of enormous benefit for deploying models in one industry vertical market that were developed in another, and it can help to jumpstart modeling operations for a new product or a new vertical market lacking historical data to support modeling.
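Conceptually, the mapping step amounts to renaming or proxying deployment variables onto the names the model expects. A minimal pandas sketch, with hypothetical column names, follows:

```python
# Metadata mapping at scoring time: deployment variables are renamed (or
# proxied) onto the names the model was trained with. (Hypothetical names;
# not XL-Miner's actual interface.)
import pandas as pd

# The model was trained on "household_income"; the deployment file only has
# the census-block median income, used here as a close-enough proxy.
mapping = {"cust_ref": "customer_id", "blk_median_income": "household_income"}

deploy = pd.DataFrame({
    "cust_ref": [101, 102],
    "blk_median_income": [58000, 43000],
})

scoring_input = deploy.rename(columns=mapping)
print(scoring_input.columns.tolist())  # now matches the model's metadata
```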

The Future for XL-Miner?

One growth path for XL-Miner is to supplement the relatively sparse set of database sources for input of data to Excel. Integration with other data mining tool vendors is a hole that should be plugged. Particularly desirable are capabilities to import and export SAS and SPSS data sets. If XL-Miner is to be used in conjunction with other data mining tools, it must be able to interface with the common tools that data miners use. Currently, database import capabilities are constrained by those of Excel (SQL Server, Access, dBase/FoxBase, Oracle and Paradox). It is a relatively easy task to develop ODBC drivers for other database systems configured for use in Excel. Other candidates for inclusion are NCR Teradata, IBM UDB and SP2, SAP and PeopleSoft. Both input and output capabilities should be provided.

The other growth path for XL-Miner is removal of the current data set size limitation of 65,536 records. XL-Miner could include "paging" operations through large data sets by analyzing blocks of data in different tabs of the spreadsheet, each of which can contain 65,536 rows. This means that the XL-Miner macros must be modified to page through large data sets like word processors page through large text documents, keeping track of top and bottom virtual block references. Large data sets could be read into a set of spreadsheet tabs (sheets) in blocks of 65,536 records. By including sheet references in processing streams of the macros, a neural net (for example) could be trained on data sets much larger than 65,536 records. This sort of processing was commonplace in the old DOS world of PC applications limited by the 640K addressable memory constraint. Maybe Microsoft will follow this path someday with the development of Excel itself. Another fascinating prospect is for Microsoft to acquire XL-Miner and add it to the list of available plug-ins furnished with the tool. Until then, XL-Miner will just have to hoof it alone.
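The paging idea can be prototyped today with incremental learners. The sketch below (Python with a recent scikit-learn; the file and column names are hypothetical) trains on 65,536-row blocks one at a time:

```python
# A rough sketch of "paging": train incrementally on 65,536-row blocks so a
# model sees data sets far larger than one spreadsheet tab. (Hypothetical
# file and column names; not an XL-Miner feature today.)
import pandas as pd
from sklearn.linear_model import SGDClassifier

BLOCK = 65536  # one spreadsheet tab's worth of rows
model = SGDClassifier(loss="log_loss")  # supports incremental fitting

reader = pd.read_csv("large_training_set.csv", chunksize=BLOCK)
for i, block in enumerate(reader):
    X = block.drop(columns="target").to_numpy()
    y = block["target"].to_numpy()
    model.partial_fit(X, y, classes=[0, 1])  # learn one "tab" at a time
    print(f"trained on block {i} ({len(block)} rows)")
```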

How do you choose the best data mining tool for your use in CRM? The answer is not a simple one. Some considerations, and the tools best suited to each, include:

Expected Data Mining Venue

In the data mining tool?
  • SPSS Clementine
  • SAS Enterprise Miner
  • Statistica Data Miner
  • Insightful Miner
In the database?
  • Statistica Data Miner
  • SAS Enterprise Miner
Embedded in an application?
  • KXEN
  • SAS Enterprise Miner
In financial operations based on spreadsheets?
  • XL-Miner
  • SAS Add-in for Microsoft Office
Academics?
  • XL-Miner
  • SAS (Academic license)
  • SPSS Clementine (student edition)

Expected Purpose of the Models

To support direct mail operations?
  • SPSS Clementine
  • Statistica Data Miner
  • SAS Enterprise Miner
To support management rules reporting?
  • SAS Enterprise Miner (decision trees)
  • SPSS Clementine
  • XL-Miner
To support sales forecasting?
  • Statistica Data Miner (time-series algorithms)
  • SAS Forecast Server, in conjunction with SAS-EM
To support strategic marketing operations?
  • Insightful Miner (slicing and dicing capability)
To support customer behavior modeling?
  • SPSS Clementine
  • Statistica Data Miner
  • SAS Enterprise Miner
To support Six-Sigma industrial applications?
  • Statistica Data Miner (Six-Sigma algorithms)

Robert A. Nisbet, Ph.D., is an independent data mining consultant with over 35 years experience in analysis and modeling in science and business. You can contact him at Bob@rnisbet.com or (805) 685-0053.
