
Marketing Systems: Text Analysis Systems

  Column published in DM Review Magazine
August 2003 Issue
  By David M. Raab

I had planned to stop writing about homeland surveillance (and the government, no doubt, has the e-mails to prove it); but then a newspaper article about Pentagon advisor Richard Perle's conflicts of interest casually mentioned that Perle is a director of Autonomy Corporation. Autonomy, a well-known provider of text search and retrieval software, seemed an oddly pacific interest for the famously bellicose Perle. However, a little research revealed that Autonomy has long received significant revenue from government security agencies. More exploration found that Autonomy competitors including Inxight, Stratify, Attensity and ClearForest also have large intelligence contracts. Interesting. Just how does this technology fit into a surveillance infrastructure?

Or, to put first things first, what technology do these companies offer? The most common answer would probably be search engines -- software to locate information in word-processing files, Web pages and other unstructured text formats. (Structured formats refer to databases and files where each element has a specified location. XML documents, which tag elements within an unstructured format, are sometimes called semi-structured.)

Yet search is just one reason to access text. Users also want to extract information, generate lists of related documents, visualize results and associate content with individuals. Therefore, a better label for these systems might be text analysis, indicating that they extract meaning from text in the same way that data analysis extracts meaning from data. In concrete terms, the core function of these systems is to attach labels to text so that the labels can be searched, sorted, grouped and otherwise processed like structured data. The labels might describe the whole document (an article about surveillance systems) or extract specific information (Richard Perle is a director of Autonomy Corporation).
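The idea of labels that can be processed like structured data can be sketched in a few lines of Python. This is only an illustration: the document records, the (kind, value) label format and the build_index helper are assumptions made for the example, not a description of any vendor's product.

```python
from collections import defaultdict

# Hypothetical labeled documents. Each label is a (kind, value) pair that
# either describes the whole document or records an extracted fact.
documents = [
    {"id": 1, "labels": {("topic", "surveillance"), ("person", "Richard Perle")}},
    {"id": 2, "labels": {("topic", "search engines")}},
    {"id": 3, "labels": {("topic", "surveillance"), ("company", "Autonomy")}},
]


def build_index(docs):
    """Map each label to the sorted list of document ids that carry it."""
    index = defaultdict(list)
    for doc in docs:
        for label in doc["labels"]:
            index[label].append(doc["id"])
    return {label: sorted(ids) for label, ids in index.items()}


index = build_index(documents)

# Once labels exist, retrieval is an ordinary lookup on structured data.
surveillance_docs = index[("topic", "surveillance")]
print(surveillance_docs)
```

Once the labels are in place, grouping, sorting and counting work exactly as they would on rows in a database table.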

Two different techniques are commonly used to assign the labels. Statistical techniques analyze the frequencies and patterns of words in a document; basically, they develop statistical profiles of documents in different categories. New documents are then analyzed and assigned to the categories their profiles most resemble. Semantic techniques use dictionaries and syntax rules to identify key words and relationships. They also use these to assign documents to categories.
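A rough sketch of the statistical approach follows. It assumes the simplest possible profile (raw word counts per category) and uses cosine similarity to decide which profile a new document most resembles. Commercial products use far more sophisticated models; the function names and training sentences here are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real systems do much more."""
    return text.lower().split()


def build_profiles(training):
    """training: list of (category, text) pairs -> category word-count profiles."""
    profiles = {}
    for category, text in training:
        profiles.setdefault(category, Counter()).update(tokenize(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, profiles):
    """Return the category whose profile the text most resembles, with its score."""
    doc = Counter(tokenize(text))
    scores = {category: cosine(doc, profile) for category, profile in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical training data, standing in for previously classified documents.
profiles = build_profiles([
    ("finance", "bank loan interest rate bank credit"),
    ("sports", "team score goal match team player"),
])

category, score = classify("the bank raised its interest rate", profiles)
print(category)
```

Note that nothing in this sketch depends on grammar or word meaning, only on which words appear and how often.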

Most text analysis systems are based on one method or the other, although vendors increasingly apply elements from both. They also typically employ supplemental techniques such as key words, rules and weights based on how often a document is used. In general, statistical methods are less language-dependent and more able to recognize complex concepts, while semantic methods are better at identifying specific facts and relationships. Both types of systems are often trained with previously classified documents during implementation. The classification scheme itself, called a taxonomy, is also usually provided by the user, although most systems can automatically generate a rough taxonomy when necessary.
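By way of contrast, a semantic approach to extracting a specific fact might look like the sketch below. It assumes a tiny dictionary of role words and a single hand-written syntax rule of the form "Person is a Role of Organization", expressed as a regular expression. The rule, the role list and the extract_facts helper are illustrative assumptions; real products rely on much larger dictionaries and proper parsing.

```python
import re

# Hypothetical dictionary of role words the rule should recognize.
ROLES = ["director", "president", "advisor"]

# One syntax rule: "<Capitalized Name> is a|the <role> of <Capitalized Org>".
RULE = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) is (?:a|the) "
    r"(?P<role>" + "|".join(ROLES) + r") of "
    r"(?P<org>[A-Z]\w*(?: [A-Z]\w*)*)"
)


def extract_facts(text):
    """Return (person, role, organization) triples found by the rule."""
    return [
        (m.group("person"), m.group("role"), m.group("org"))
        for m in RULE.finditer(text)
    ]


facts = extract_facts("Richard Perle is a director of Autonomy Corporation.")
print(facts)
```

The rule captures the relationship precisely, but it is tied to English word order and vocabulary, which illustrates why semantic methods tend to be more language-dependent.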

Once documents have been labeled, the results can be used to retrieve search results, list related documents, create document summaries, populate databases with extracted facts, find trends in document contents over time, profile user interests (based on what they read) and expertise (based on what they write), alert users to relevant new information, and create graphic displays of related items. More fundamentally, attaching consistent labels to documents from different sources enables users to integrate information that would otherwise be searched separately or not at all.

These capabilities have many commercial applications. Text analysis products power search engines, analyze customer comments, respond automatically to e-mails, build communities of users with shared interests, select the best offer for individual customers, gather prospect or competitor data from Web sites and generate personalized news reports. Intelligence agencies have similar requirements, and most of the text analysis systems used by these agencies apparently serve similar purposes -- most, but not all. Text analysis systems can also be used for direct surveillance -- reading and classifying personal messages. Although such surveillance can also be conducted by human monitors, the software makes it possible on a much larger scale. In fact, one of Autonomy's publicized features is a capability to transcribe verbal communications and then analyze the resulting text. The company points to non-surveillance applications such as indexing television broadcasts and capturing multimedia presentations. However, the surveillance possibilities are self-evident.

Still, how useful could it be to listen in on millions of conversations? Presumably any terrorists bright enough to be dangerous would be bright enough to communicate in code. Additionally, if the software can only assign documents to categories it has been trained to recognize, how could it recognize conversations about something new?

Interestingly, those problems may not be insurmountable. A document's failure to match any existing category may itself be significant. For example, a string of gibberish is worth examining more closely to see if it represents a message in cipher. This contrasts with pattern recognition software, such as fraud detection systems, which can only look for patterns defined in advance. Text analysis systems can also identify new categories by examining documents that are currently unclassifiable. Therefore, if a group of suspects suddenly starts talking about hunting baboons, the very oddity of the phrase could set off alarms -- although other types of intelligence would still be needed to find what the suspects really meant.
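The point that a failure to match can itself be a signal can be sketched as follows. The example assumes each category is represented by a small vocabulary, scores a message by word overlap (Jaccard similarity), and sets aside anything whose best score falls below a cutoff. The vocabularies, the 0.2 threshold and the function names are all hypothetical.

```python
def best_match(text, category_vocab):
    """Return (category, score) for the vocabulary with the highest Jaccard overlap."""
    words = set(text.lower().split())
    best_category, best_score = None, 0.0
    for category, vocab in category_vocab.items():
        union = words | vocab
        score = len(words & vocab) / len(union) if union else 0.0
        if score > best_score:
            best_category, best_score = category, score
    return best_category, best_score


def triage(text, category_vocab, threshold=0.2):
    """Assign a category, or flag the text as unclassified for closer review."""
    category, score = best_match(text, category_vocab)
    if score < threshold:
        return "unclassified"
    return category


# Hypothetical category vocabularies.
category_vocab = {
    "travel": {"flight", "hotel", "ticket", "airport"},
    "finance": {"bank", "loan", "rate", "credit"},
}

routine = triage("book a flight and hotel ticket", category_vocab)
odd = triage("hunting baboons next week", category_vocab)
print(routine, odd)
```

Messages that land in the unclassified pile could then be grouped by their own shared vocabulary to suggest new categories, which is the step the column describes.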

Of course, mass surveillance of this type would generate many false alarms, and serious conspirators could almost surely avoid detection. Surveillance limited to known suspects would be more effective, but this raises the question of how those suspects will be identified in the first place.

Text analysis could be extremely effective at spotting political or religious opinions, but the security value of doing this is questionable; real terrorists don't make speeches. Monitoring opinions also raises privacy and civil liberties issues, although these involve political rather than technical judgments.

In short, text analysis systems can clearly help surveillance organizations work more efficiently through better research, integration and collaboration. They may also have some value in performing automated surveillance, although this is probably less effective than claimed by their supporters or feared by their critics. While the potential for abuse is real, any transgressions are ultimately the responsibility of the people and agencies that use the systems, not the software itself.



David M. Raab is President of Client X Client, a consulting and software firm specializing in customer value management.  He may be reached at info@raabassociates.com.
