DM Review | Covering Business Intelligence, Integration & Analytics

Data Mining In Depth:
Using Data Mining to Find Terrorists

  Column published in DM Review Magazine
May 2003 Issue
 
  By Herb Edelstein

My last column addressed some of the fallacies about using data mining to find terrorists. This column will look further at certain misconceptions about data analysis and data mining, and how those technologies can be effective tools for investigators.

It was recently reported that a few days after the September 11 attacks, FBI agents visited one of the largest providers of consumer data to see whether the 9/11 terrorists were in its database, and quickly found five of them. One of the terrorists had been in the country for less than two years, had 30 credit cards and a quarter of a million dollars in debt with a payment schedule of $9,800 per month. Mohammed Atta, the ringleader, had also been here less than two years and had 12 addresses under the names Mohammed Atta, Mohammed J. Atta, J. Atta and others. Surely, the report speculated, with patterns like this, we can use the databases we presently have to ferret out terrorists in our midst. Unfortunately, the answer is, "It depends."

There are limitations to treating the agents' observations as patterns. We need to ask, first, how the records were found and, second, whether the observed characteristics are indeed repeated patterns or merely isolated instances. Because I am not privy to any knowledge other than what was published in the report, my analysis is based on surmise.

More than likely, the FBI started their search with database queries using the suspected terrorists' names and likely variants. They found the terrorists' records and then noticed the number of credit cards, addresses and the amount of debt. However, they probably would not have known in advance to look for these attributes. Furthermore, the terrorists' records probably didn't show that they had been in the country for only two years; that is knowledge the FBI brought to the search.

We also don't know how easily the observations generalize to other terrorists or how many non-terrorists have these same attributes. Combing the database for people who have a number of credit cards, big debts or multiple addresses would undoubtedly yield both criminals (most of whom aren't terrorists) and perfectly innocent folks.
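The arithmetic behind this concern can be sketched in a few lines of Python. Every number below is a hypothetical assumption chosen only for illustration (the population size, the count of true targets, and both rates); none of them comes from the report or from any real database.

```python
def screen_outcome(population, true_targets, hit_rate, false_alarm_rate):
    """Estimate how many flagged records are real targets versus innocent people."""
    true_hits = true_targets * hit_rate
    false_alarms = (population - true_targets) * false_alarm_rate
    precision = true_hits / (true_hits + false_alarms)
    return true_hits, false_alarms, precision


# Hypothetical inputs: 200 million records, 100 real targets, a screen that
# flags 90% of the targets and 0.1% of everyone else.
true_hits, false_alarms, precision = screen_outcome(
    population=200_000_000,
    true_targets=100,
    hit_rate=0.90,
    false_alarm_rate=0.001,
)

print(f"real targets flagged: {true_hits:,.0f}")
print(f"innocent people flagged: {false_alarms:,.0f}")
print(f"share of flagged records that are real targets: {precision:.4%}")
```

Under these assumed rates, a screen that looks accurate would still flag roughly 200,000 innocent people for every 90 real targets, so nearly every lead would be a dead end.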

The large number of addresses for Atta may be an even more difficult screening criterion to use, considering that we don't know the names of unknown terrorists, let alone their aliases. It would be nearly impossible to conduct an aggregation across the hundreds of millions of individuals in this database to calculate the number of addresses, especially because all a terrorist would have to do to defeat such a search is use different aliases.
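A minimal sketch of why aliases defeat this kind of aggregation follows. The names, addresses and the threshold of five are invented placeholders, and the sketch assumes all six records belong to one person.

```python
from collections import defaultdict

# Hypothetical records (name as recorded, address), all for the same person.
records = [
    ("John Q. Sample", "1 Elm St"),
    ("John Q. Sample", "2 Oak Ave"),
    ("John Sample", "3 Pine Rd"),
    ("John Sample", "4 Birch Ln"),
    ("J. Sample", "5 Cedar Ct"),
    ("J. Sample", "6 Maple Dr"),
]


def addresses_per_name(rows):
    """Count distinct addresses for each exact name string."""
    seen = defaultdict(set)
    for name, address in rows:
        seen[name].add(address)
    return {name: len(addresses) for name, addresses in seen.items()}


counts = addresses_per_name(records)

# A screen for "five or more addresses" keyed on the name finds nobody.
flagged = [name for name, n in counts.items() if n >= 5]

# The person actually has six addresses, but only someone who already
# knows that the three names are aliases can add them up.
total_addresses = len({address for _, address in records})

print(counts)
print(flagged)
print(total_addresses)
```

Each alias shows only two addresses, so the name-keyed count never reaches the threshold even though the individual has six.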

As I indicated last month, we don't have enough known terrorists or a consistent set of behaviors to use data mining to build predictive models. Thus, it would not be particularly productive to search for a signature.

If we can't inductively find a pattern from the data, perhaps we can just find exceptional behaviors sufficiently far from the norm to be worth investigating, such as 30 credit cards or 12 addresses. This problem (called outlier detection) is easy if you're simply searching for something very different on one dimension. It's much more difficult when you're looking for combinations of attributes whose individual values are typical, but which taken together are unusual. For example, being male or pregnant is not unusual, but pregnant males are rather uncommon! It's even more difficult to find outliers in categorical variables (data that fall into discrete classes) because the way to measure differences is not obvious. For example, what is the measure of the difference between a Ford and a Chevy?
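The "pregnant male" case can be illustrated with a small frequency count. This is only a sketch on made-up records, and the 1 percent threshold is an arbitrary assumption: each value is common when looked at alone, while one combination of values is rare.

```python
from collections import Counter

# Hypothetical toy records of (sex, pregnant); the counts are invented.
records = (
    [("male", "no")] * 490
    + [("female", "no")] * 450
    + [("female", "yes")] * 59
    + [("male", "yes")] * 1
)


def marginal_shares(rows, column):
    """Share of records taking each value in a single column."""
    counts = Counter(row[column] for row in rows)
    return {value: n / len(rows) for value, n in counts.items()}


def joint_shares(rows):
    """Share of records taking each combination of values."""
    counts = Counter(rows)
    return {combo: n / len(rows) for combo, n in counts.items()}


sex_share = marginal_shares(records, 0)
pregnant_share = marginal_shares(records, 1)
combo_share = joint_shares(records)


def rare_combinations(threshold):
    """Combinations whose joint share falls below the threshold."""
    return sorted(c for c, share in combo_share.items() if share < threshold)


print(sex_share["male"], pregnant_share["yes"])
print(rare_combinations(0.01))
```

Checking one column at a time would flag nothing here. With only two attributes the joint count is trivial, but the number of combinations to examine grows very quickly as attributes are added, which is part of what makes the general problem hard.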

Another trap is that if you look at enough variables, sooner or later you'll find at least one that correlates well with what you are trying to predict purely by chance. This practice is called a specification search. When you are searching through large databases with many attributes, it is easy to find such false predictors. The problem with relying on data mining or query software as a primary line of defense is that it produces too many false positives.
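A small simulation makes the trap concrete. This is a sketch under stated assumptions: the target and every candidate attribute are pure random noise, and the record count, attribute count and seed are arbitrary choices. Any strong correlation it finds is therefore spurious by construction.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_records=20, n_attributes=1000, seed=0):
    """Search many random attributes for the one that best 'predicts' a random target."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_records)]
    best = 0.0
    for _ in range(n_attributes):
        attribute = [rng.gauss(0, 1) for _ in range(n_records)]
        best = max(best, abs(pearson(attribute, target)))
    return best


best = strongest_chance_correlation()
print(f"strongest correlation found among pure noise: {best:.2f}")
```

With few records and many attributes, the best of the random candidates will typically show a correlation that looks meaningful, even though none of them carries any information about the target.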

What is the best way to use databases, search technology and data mining? First, recognize that "data" is more important than "mining." Resources should be spent working with the existing databases and setting up new ones that allow investigators to easily share information. Second, humans are more important than computers. Once trained investigators have generated lists of suspects, it's time to follow their tracks through the databases to verify information and check whether apparent anomalies are genuinely unusual and suspicious. Third, while the profiling and prediction aspects of data mining will be of limited use, other techniques, such as those used for finding fraud, can help investigators spread their nets beyond the original suspects. For example, visualizations and algorithms have been used to locate doctors and lawyers who work together to defraud insurance providers. As investigations help uncover behaviors of terrorists that differentiate them from the rest of us, profiles that trigger further investigations will emerge.

Thus, we cannot rely on the magic of data mining to find terrorists or protect us from attack. No shortcuts can substitute for careful investigative work supported by good databases and a management structure that listens to and supports its investigators.

Herb Edelstein is an internationally recognized expert in data mining, data warehousing and CRM, consulting to both computer vendors and users. A popular speaker and teacher, he is also a co-founder of The Data Warehousing Institute. He can be reached at feedback@twocrows.com.

Thomson Media

2005 The Thomson Corporation and DMReview.com. All rights reserved.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.