DM Review | Covering Business Intelligence, Integration & Analytics

Volume Analytics:
Duo-Mining: Combining Data and Text Mining

Column published in DMReview.com
September 16, 2004
By Guy Creese

DMReview.com welcomes Guy Creese as a monthly columnist. He will share his more than 25 years of expertise in this Volume Analytics column, which focuses on ways to optimize your Web sites via best practices in content management, search, personalization and Web analytics.

As standalone capabilities, the pattern-finding technologies of data mining and text mining have been around for years. However, it is only recently that enterprises have started to use the two in tandem - and have discovered that it is a combination that is worth more than the sum of its parts.

First of all, what are data mining and text mining? They are similar in that they both "mine" large amounts of data, looking for meaningful patterns. However, what they analyze is quite different.

Data Mining

Data mining looks for patterns within structured data, that is, databases. The underlying technologies are based on statistics and artificial intelligence, littering the field with buzzwords such as classification and regression trees (CART), chi-squared automatic interaction detection (CHAID), neural networks and genetic algorithms. As a process, data mining is not for the uninitiated. Typically, a statistician selects the appropriate algorithm(s) for the business problem, prepares the data for analysis and then fine-tunes the model based on the results.

Even though the process is labor-intensive, it can have significant payoffs. For example, enterprises can use data mining to understand what "clusters" their customers fall into (and plan accordingly) as well as save money by sending catalogs only to those customers with a high propensity to buy.

Text Mining

Text mining looks for patterns in unstructured data - memos and documents. Consequently, it often uses language-based techniques, such as semantic analysis and taxonomies, in addition to statistics and artificial intelligence. As with data mining, you don't just press a button and have magic happen. Depending on the technology used, sometimes documents need to be "tagged" - an editor may need to manually note what the document is about. At other times, a text mining system may need to be "trained" to recognize a certain type of document. In this case, a person familiar with the content would need to collect a representative set of documents to feed into the system.

Also similar to data mining, text mining can discern patterns that have significant business value. Companies can use text mining to find overall trends in their trove of bug reports or customer complaints, for example.
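To make the "training" step concrete, here is a minimal Python sketch. It is not how any particular product works: the two document types, the sample texts and the simple word-overlap rule are all assumptions chosen for illustration. A person who knows the content supplies a few labeled examples, and the system then assigns new documents to the type they most resemble.

```python
from collections import Counter


def train(labeled_docs):
    """Build one word-count profile per label from (label, text) examples."""
    profiles = {}
    for label, text in labeled_docs:
        profiles.setdefault(label, Counter()).update(text.lower().split())
    return profiles


def classify(profiles, text):
    """Pick the label whose profile shares the most word occurrences with the text."""
    words = Counter(text.lower().split())

    def overlap(label):
        # A Counter returns 0 for words it has never seen.
        return sum(min(count, profiles[label][word]) for word, count in words.items())

    return max(profiles, key=overlap)


# Hypothetical representative set collected by someone familiar with the content.
training_set = [
    ("bug_report", "application crashes when saving the report"),
    ("bug_report", "error message appears after login"),
    ("complaint", "delivery was late and support was rude"),
    ("complaint", "i was charged twice and nobody called back"),
]

profiles = train(training_set)
print(classify(profiles, "the application shows an error after saving"))
```

Real text mining systems use far richer models, but the workflow is the same: the quality of the representative set largely determines the quality of the results.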

Put Them Together and You Get High Value

Recently, vendors such as Intelligent Results, SAS and SPSS have started to recommend to their customers that they combine data and text mining. And the results have been interesting, to say the least.

This is not surprising, for two reasons. First, the enterprise has vastly expanded the universe in which to find patterns - always a good thing. Secondly, a pattern in data or text can amplify or clarify patterns in its counterpart. In both cases, there is a multiplier effect going on.

But rather than being theoretical, let's be specific. Collections and recovery departments in banks and credit card companies have used duo-mining to good effect. Using data mining to look at repayment trends, these enterprises have a good idea of who is going to default on a loan, for example. When logs from the collection agents are added to the mix, the understanding gets even better. For example, text mining can understand the difference in intent between "I will pay," "I won't pay" and "I paid," and generate a propensity-to-pay score - which, in turn, can be data mined. To take another example, if a customer says, "I can't pay because a tree fell on my house," all of a sudden it is clear that it's not a "bad" delinquency - but rather a sales opportunity for a home loan.
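The collections example can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the phrase rules, the score values and the field names (`days_late`, `agent_note`) are hypothetical, and a real text mining product would rely on much more than pattern matching. The point is only to show the hand-off: text mining turns an agent's note into a number, and that number becomes one more column for data mining.

```python
import re

# Hypothetical phrase rules mapping collection-note wording to an intent label.
# Refusals and past-tense payments are checked before the bare promise.
INTENT_RULES = [
    ("refuses", re.compile(r"\b(won't|will not|can't|cannot) pay\b")),
    ("paid", re.compile(r"\bpaid\b")),
    ("promises", re.compile(r"\bwill pay\b")),
]

# Illustrative scores only; a real system would learn these from outcomes.
INTENT_SCORE = {"refuses": 0.1, "promises": 0.7, "paid": 0.9, "unknown": 0.5}


def classify_intent(note):
    """Return the first intent label whose pattern appears in the note."""
    text = note.lower()
    for label, pattern in INTENT_RULES:
        if pattern.search(text):
            return label
    return "unknown"


def add_text_feature(account):
    """Return a copy of the structured record with a text-derived score added."""
    label = classify_intent(account["agent_note"])
    return {**account, "intent": label, "propensity_to_pay": INTENT_SCORE[label]}


accounts = [
    {"id": 1, "days_late": 45, "agent_note": "Customer said: I will pay on Friday."},
    {"id": 2, "days_late": 45, "agent_note": "Customer said: I won't pay."},
    {"id": 3, "days_late": 10, "agent_note": "Customer said: I paid last week."},
]

enriched = [add_text_feature(a) for a in accounts]
for row in enriched:
    print(row["id"], row["days_late"], row["intent"], row["propensity_to_pay"])
```

Accounts 1 and 2 look identical in the structured data (both are 45 days late), yet the text-derived score separates them. That is the kind of extra signal the combined approach contributes.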

By using data mining and text mining in tandem, enterprises have been able to improve "lift" by around 20 percent on average over using just one technology, with the improvement ranging from 5 to 50 percent. Other areas where duo-mining has paid off include analyzing product wish lists, open-ended survey questions and customer attrition patterns at cell phone companies.
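For readers unfamiliar with the term, the sketch below assumes the common definition of lift: the response rate in the group a model selects, divided by the response rate in the population as a whole. The column does not spell out its definition, and all of the counts are made up to produce round numbers; they are not results from any of the vendors mentioned.

```python
def lift(targeted_hits, targeted_total, all_hits, all_total):
    """Response rate in the targeted segment divided by the overall response rate."""
    return (targeted_hits / targeted_total) / (all_hits / all_total)


# Hypothetical portfolio: 500 of 10,000 accounts repay (5 percent overall).
ALL_HITS, ALL_TOTAL = 500, 10_000

# Each model picks its 1,000 most promising accounts (made-up counts).
data_only_lift = lift(125, 1_000, ALL_HITS, ALL_TOTAL)   # 12.5 percent vs. 5 percent
duo_mining_lift = lift(150, 1_000, ALL_HITS, ALL_TOTAL)  # 15 percent vs. 5 percent

improvement = (duo_mining_lift - data_only_lift) / data_only_lift

print(f"data mining only: {data_only_lift:.2f}")
print(f"duo-mining:       {duo_mining_lift:.2f}")
print(f"improvement:      {improvement:.0%}")
```

With these invented counts, lift rises from about 2.5 to about 3.0, a 20 percent improvement.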

Some Practical Hints

Companies looking to do duo-mining in such applications need to be wary of several things, especially with regard to text mining. First, some text mining technologies need large amounts of text to analyze - several-page memos, for example - while call logs are sometimes just snippets in comparison. Second, "stemming," a popular technique in text analysis in which various forms of a word are distilled into one word - "pay," "paid," "will pay," "won't pay" = "pay" - may need to be turned off. To take the collections example, stemming would erase the very distinctions that reveal the customer's intent. Therefore, companies need to ensure that the technology they're using is tuned to the problem at hand.
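The stemming problem is easy to demonstrate. The Python sketch below is a toy normalizer that mimics the reduction described above; it is not a real stemming algorithm, and the word lists are assumptions chosen to match the collections example.

```python
# Toy normalizer that mimics the reduction described in the column.
# Not a real stemming algorithm; the word lists are illustrative assumptions.
STOP_WORDS = {"i", "will", "won't"}
STEMS = {"paid": "pay", "pays": "pay", "paying": "pay"}


def normalize(phrase):
    """Lowercase, drop stop words and map each remaining word to its stem."""
    tokens = phrase.lower().strip(".!?").split()
    return [STEMS.get(t, t) for t in tokens if t not in STOP_WORDS]


phrases = ["I will pay", "I won't pay", "I paid"]
for p in phrases:
    print(p, "->", normalize(p))
```

All three phrases reduce to the single token "pay", so a promise, a refusal and a completed payment become indistinguishable. Leaving the normalization off, or restricting it to words that carry no intent, preserves the signal that the collections application depends on.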

In addition, some companies' solutions are more toolkit-oriented (SAS and SPSS) while others are more application-oriented (Intelligent Results). Which is more appropriate depends on what the company wants to do and the level of in-house expertise.

With those caveats in mind, enterprises should investigate duo-mining. It's a combination of two time-tested technologies that can lead to big payoffs.



Guy Creese is managing partner at Ballardvale Research, an analyst firm that investigates how organizations can optimize their Web sites via best practices in content management, search, personalization and Web analytics. Creese has worked in the high tech industry for 25 years, at both Fortune 500 companies and small start-ups, in positions ranging from programmer to product manager to customer support engineer. He can be reached at guy.creese@ballardvale.com.

SourceMedia (c) 2005 DM Review and SourceMedia, Inc. All rights reserved.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.