DM Review | Covering Business Intelligence, Integration & Analytics

Enterprise Content Management:
Conducting Your First Text Mining Project, Part 1

Column published in DM Review Magazine, February 2004 Issue
By Dan Sullivan

In my January column, I discussed the trend in business intelligence (BI) to support unstructured as well as structured data. Statistical software vendors, long the bastion of structured data analysis, are early proponents of expanding the scope of analysis to include free-form text. Now that we have data mining, text mining and other BI tools at our disposal, how do we get started?

The first step is obvious: identify a business issue that lends itself to structured analysis, such as customer reaction to a new product line or assessment of customer comments in call centers. For the first project, look at quantifiable questions, such as how many customers complained about shipping delays or how many customers commented positively on the quality of their purchase. To keep things manageable, limit the scope to free-form text collected along with structured data -- survey results, call center records or customer relationship management (CRM) databases.

Let's assume you are working with the CRM database of a consumer electronics manufacturer. Your job is to investigate the types of problems customers have when they first purchase a digital camera. Structured data from warranty registrations provides basic information, such as name and address. This is combined with third-party demographic data to create a broad set of basic customer information. Because you are interested in customers' initial experience and problems they have using digital cameras, the set of customers is limited to those that contacted the call center within 15 days of purchasing the camera.
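The 15-day filter can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular CRM product: the record layout, the customer IDs and the function name early_callers are all hypothetical, and "within 15 days" is assumed to include day 15.

```python
from datetime import date, timedelta

# Hypothetical records: (customer_id, purchase_date, first_call_date or None).
records = [
    ("c1", date(2004, 1, 2), date(2004, 1, 10)),   # called after 8 days
    ("c2", date(2004, 1, 5), date(2004, 2, 20)),   # called after 46 days
    ("c3", date(2004, 1, 7), None),                # never called
    ("c4", date(2004, 1, 9), date(2004, 1, 24)),   # called after exactly 15 days
]

def early_callers(rows, window_days=15):
    """Return IDs of customers who called within window_days of purchase."""
    window = timedelta(days=window_days)
    return [
        customer_id
        for customer_id, purchased, called in rows
        if called is not None and timedelta(0) <= called - purchased <= window
    ]

print(early_callers(records))  # ['c1', 'c4']
```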

This brings us to the second step: review the unstructured text available for analysis to identify a fixed set of attributes that can be extracted from the comments. Some of the call center records will include comments from the customers that provide detail not captured by the structured attributes in a CRM system. These are generally short and simple comments such as: "the battery does not last long enough," "flash is erratic" and "outside shots are hazy but inside shots are OK." Once the most relevant topics are identified, map them to yes/no attributes such as "battery problem," "flash problem" and "picture quality problem."
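One way to picture this step: a reviewer reads a sample of comments, tags each with a topic, and the most frequent topics become the yes/no attributes. The sketch below assumes that workflow. The sample comments beyond the three quoted above, the topic tags and the function name choose_attributes are invented for illustration.

```python
from collections import Counter

# Hypothetical sample of comments, each tagged with a topic by a human reviewer.
reviewed = [
    ("the battery does not last long enough", "battery"),
    ("flash is erratic", "flash"),
    ("outside shots are hazy but inside shots are OK", "picture quality"),
    ("battery dies quickly", "battery"),
    ("strap is too short", "strap"),
    ("flash never fires", "flash"),
    ("pictures look blurry", "picture quality"),
]

def choose_attributes(tagged_comments, top_n=3):
    """Turn the top_n most frequent topics into yes/no attribute names."""
    counts = Counter(topic for _, topic in tagged_comments)
    return [f"{topic} problem" for topic, _ in counts.most_common(top_n)]

attributes = choose_attributes(reviewed)

# Every call center record then starts with all derived attributes set to "no".
blank_record = dict.fromkeys(attributes, False)
print(sorted(blank_record))
```

The rare "strap" topic falls outside the top three here, which reflects the idea of keeping the attribute set fixed and small.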

The third step of the process is to identify text patterns that correspond to each of the derived attributes. We can use a number of approaches here, and this is where the art of text mining comes into play. We'll keep it simple and look at two broad approaches -- a statistical approach and a linguistic approach.

In the statistical approach, we identify a set of positive examples in the CRM database for each attribute, such as "battery problem." We then eliminate commonly used words, known as stop words, from the comments. The remaining words are then statistically analyzed to determine which terms are good indicators of the attribute. The simplest analysis only looks at word occurrences. Other techniques look at word pairs or triplets. For example, from "the battery does not last long enough," we could extract word pairs "battery does," "does not," "not last," "last long" and "long enough." (The word "the" is common and thus is removed; the word "not" is common but carries important meaning and therefore is not removed.) These pairs are called 2-grams or, more generally, n-grams. The same technique can be applied to characters as well as words, although character n-grams typically use several characters.
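The stop-word and n-gram extraction described above can be sketched as follows. The stop-word list is a small hypothetical one (real lists are much longer), "not" is deliberately left off it, and the function names are illustrative only.

```python
# Hypothetical stop-word list; "not" is kept out of it because it carries meaning.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to"}

def content_words(comment):
    """Lowercase the comment, strip punctuation and drop stop words."""
    cleaned = (word.strip('.,;:!?"\'') for word in comment.lower().split())
    return [word for word in cleaned if word and word not in STOP_WORDS]

def word_ngrams(words, n=2):
    """Return every run of n consecutive words as a single string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n=4):
    """Return every run of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

words = content_words("the battery does not last long enough")
print(word_ngrams(words))
# ['battery does', 'does not', 'not last', 'last long', 'long enough']
print(char_ngrams("battery"))
# ['batt', 'atte', 'tter', 'tery']
```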

Whether words, word n-grams or character n-grams are used, the analysis is basically the same. We use statistics to identify patterns that occur frequently in the positive examples and infrequently in the negative examples. When those patterns are found in other records, we assume the attribute (e.g., "battery problem") is present; otherwise, it is not.
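As a rough illustration of "frequent in the positive examples, infrequent in the negative examples," the sketch below scores single words by the gap between the share of positive comments and the share of negative comments that contain them. It is a deliberately simple stand-in for the statistical methods a text mining tool would use. The example comments beyond those quoted earlier, the stop-word list, the 0.5 threshold and the function names are all assumptions.

```python
from collections import Counter

# Hypothetical stop-word list; "not" is intentionally kept as a content word.
STOP_WORDS = {"the", "a", "an", "is", "are", "but", "after", "too"}

def features(comment):
    """Single-word features: lowercase words minus stop words."""
    return [w for w in comment.lower().split() if w not in STOP_WORDS]

def indicator_terms(positives, negatives, min_gap=0.5):
    """Keep terms whose positive rate exceeds their negative rate by min_gap.

    Both example lists must be non-empty.
    """
    pos_counts = Counter(t for c in positives for t in set(features(c)))
    neg_counts = Counter(t for c in negatives for t in set(features(c)))
    indicators = set()
    for term, n_pos in pos_counts.items():
        pos_rate = n_pos / len(positives)
        neg_rate = neg_counts[term] / len(negatives)
        if pos_rate - neg_rate >= min_gap:
            indicators.add(term)
    return indicators

def has_attribute(comment, indicators):
    """Flag a record when any indicator term appears in its comment."""
    return any(term in indicators for term in features(comment))

battery_positive = [
    "the battery does not last long enough",
    "battery dies after a few shots",
    "battery drains too fast",
]
battery_negative = [
    "flash is erratic",
    "outside shots are hazy but inside shots are OK",
    "the flash does not fire",
]

indicators = indicator_terms(battery_positive, battery_negative)
print(indicators)                                                     # {'battery'}
print(has_attribute("my battery will not hold a charge", indicators)) # True
print(has_attribute("flash does not work", indicators))               # False
```

Note that "does" and "not" appear in both sets of examples, so they score a gap of zero and are not treated as indicators.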

Linguistic approaches are different. Rather than treating the text as a string of characters, linguistic approaches identify characteristics of words, such as their part of speech and, to some extent, their meaning. In our example, "the battery" would be tagged as a noun phrase and "does not last" would be tagged as an active verb phrase. Phrases such as "the battery" and "the power system" are treated identically for our purposes; similarly, "does not last" and "dies" are equivalent. The rule in this approach is that when a power system phrase appears near a poor performance phrase, the record is flagged as having a battery problem. This level of analysis may be overkill for many problems; however, when text is long and covers two or more topics, linguistic approaches can render more precise distinctions than statistical approaches alone.
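The proximity rule can be sketched without a full linguistic toolkit. The example below substitutes two small hand-built phrase lists for real part-of-speech tagging and phrase recognition, which is a significant simplification. The phrase lists, the four-token distance limit and the function names are assumptions made for illustration.

```python
import re

# Hand-built stand-ins for the phrase classes a linguistic tool would recognize.
POWER_PHRASES = ["battery", "power system"]
POOR_PERFORMANCE_PHRASES = ["does not last", "dies"]

def phrase_positions(tokens, phrases):
    """Return the starting token index of every occurrence of any phrase."""
    hits = []
    for phrase in phrases:
        words = phrase.split()
        for i in range(len(tokens) - len(words) + 1):
            if tokens[i:i + len(words)] == words:
                hits.append(i)
    return hits

def battery_problem(comment, max_distance=4):
    """Flag the comment when a power phrase starts near a poor-performance phrase."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    power = phrase_positions(tokens, POWER_PHRASES)
    poor = phrase_positions(tokens, POOR_PERFORMANCE_PHRASES)
    return any(abs(p - q) <= max_distance for p in power for q in poor)

print(battery_problem("the battery does not last long enough"))  # True
print(battery_problem("The power system dies quickly"))          # True
print(battery_problem("flash is erratic"))                       # False
# Two topics in one comment: "dies" is too far from "battery" to be linked to it.
print(battery_problem("the battery is fine but the flash unit dies"))  # False
```

The last call shows the case the column mentions: when a comment covers two topics, requiring the phrases to be near each other avoids attributing the flash complaint to the battery.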

The final step is to apply data mining techniques to the expanded set of structured attributes: the originally structured attributes and those derived from free-form text.
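Once the derived yes/no attributes sit alongside the structured ones, any technique that works on structured data applies. The sketch below uses a simple cross-tabulation as a stand-in for a data mining technique. The customer rows, the age_group field and the function name problem_rate_by are hypothetical.

```python
from collections import defaultdict

# Hypothetical expanded records: structured fields plus text-derived flags.
customers = [
    {"id": "c1", "age_group": "18-34", "battery_problem": True,  "flash_problem": False},
    {"id": "c2", "age_group": "18-34", "battery_problem": False, "flash_problem": True},
    {"id": "c3", "age_group": "35-54", "battery_problem": True,  "flash_problem": False},
    {"id": "c4", "age_group": "35-54", "battery_problem": True,  "flash_problem": True},
]

def problem_rate_by(rows, group_field, flag_field):
    """Share of records in each group for which the yes/no flag is set."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        totals[row[group_field]] += 1
        hits[row[group_field]] += row[flag_field]  # True counts as 1
    return {group: hits[group] / totals[group] for group in totals}

print(problem_rate_by(customers, "age_group", "battery_problem"))
# {'18-34': 0.5, '35-54': 1.0}
```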

When undertaking your first text mining project, keep these basics in mind. You just might find the process isn't all that different from the analysis you do today. Next month, I will discuss tools that provide the capabilities described here.



Dan Sullivan is president of the Ballston Group and author of Proven Portals: Best Practices in Enterprise Portals (Addison Wesley, 2003). Sullivan may be reached at dsullivan@ballstongroup.com.

SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.