-
Marketplace
-
Channel Resources
Articles from this Site
San Diego Software Industry Council Issues Call for Technical Papers on Analytics
M2007 Data Mining Conference Hitting 10th Year and Going Strong
A Raid on the Inarticulate: Text Mining Hits an Inflection Point
KXEN Analytic Framework Version 4 Pushes Data Mining Automation to Text Data Sources
SPSS Launches Enhanced Predictive Analytics Platform
White Papers
A Quantitative Market Research Primer
Reducing Churn in the Prepaid Segment in the Telecom Industry
Data Mining Tools for Analytics Applications in Retail
Web Seminars
Books
Business Intelligence: The IBM Solution: Data Warehousing and OLAP
Data Mining: Technologies, Techniques, Tools, and Trends
Machine Learning and Data Mining; Methods & Applications
Data Mining Solutions: Methods and Tools for Solving Real-World Problems
Mining Very Large Databases With Parallel Processing
KXEN and the ABCs of SVMs
Data Mining in Depth
Support vector machines (SVMs) have become a hot, but unfortunately confusing, topic. I had an interesting and illuminating conversation recently with Rob Cooley, the U.S. technical director for KXEN (Knowledge Extraction Engines), who helped elucidate SVMs while also explaining what KXEN is doing. At this point, KXEN is the only data mining company I know of to base their product entirely on the work of Vladimir Vapnik, who is best known for developing SVMs. Other companies, including Statistica and SAS, have added SVMs to their collection of algorithms. I want to share with you what I learned about KXEN's implementation.
The goal of SVMs is to address the problem of generalization; that is, how to build a model that is applicable to a wide range of data beyond that which was used to create the model. Typically in data mining, we separate our data into a training data set and a test data set. When we have minimized the error on the test data set, we stop the training in the belief that the model thus built will be most generalizable. If we continue training, the error on the training data set will go down, but the error on the test data set will go up. This is called overfitting. Conversely, if we stop training too early, we fit neither the training nor the test data set particularly well. This is called underfitting. The more the real data differs from the training and test data, the more accuracy will suffer.
SVMs are based on a general principle Vapnik developed called structured risk minimization (SRM) to build the most generalizable model. SRM requires minimizing both a number called the VC (Vapnik-Chervonenkis) dimension and the error on the training data set, resulting in a very generalizable model.
Unfortunately, calculating the VC dimension precisely for all situations is difficult. It belongs to a class of problems for which the time it takes any algorithm to solve the problem grows very fast with the complexity. In practice, this means you must approximate the VC dimension. According to Cooley, one of KXEN's competitive advantages is the speed and accuracy of their algorithm for approximating the VC dimension.
SVMs are used primarily in classification problems (although they can be used in regression) and seem to work particularly well when the number of columns greatly exceeds the number of rows. This occurs, for example, in text mining; microarray analysis; and image, speech and handwriting recognition.
The word "machine" in SVM refers to the classifier that is found to best "shatter" or classify the data. SVMs work by taking the data and transforming it into a higher dimensional space (this is similar to adding variables) using something known as a kernel transformation. SVMs then find in this higher dimensional space a hyperplane (the equivalent of a straight line in two-dimensional space) that maximizes its distance from each collection of data points in the same class. SVMs pick the best hyperplane using SRM; in other words, they do so by minimizing both the error and a penalty term for complexity that is based on the VC dimension. Although there are many possible separators, this one will generalize best because it is furthest from each class. New data that might cross over a different separator often will not be able to cross over this hyperplane. The SVM then translates the hyperplane back into the dimensionality of the problem space.
There is a lot of activity going on in SVM research and development to make SVMs as easy to use effectively as more commonly employed data mining algorithms, as well as addressing such questions as scalability and interpretation of results.
While KXEN includes an SVM algorithm, the main thrust of the product is to use SRM to optimize a linear regression that starts with all the variables. The regression coefficients are chosen using the VC dimension in a penalty term for complexity, thus effectively eliminating excess variables. The goal is to get enough variables to avoid underfitting, while avoiding overfitting by not including too many. Thus the regression with the VC term results in a good model that generalizes well.
Like any linear regression, all variables must be numeric, and this is another area in which KXEN has developed interesting technology. Not only do they convert nominal fields (short text data) into numbers, but they encode all the data in a way that simplifies the calculation, resulting in better speed. Experimentation has shown that KXEN's algorithm (unlike some others) gives consistent answers and is not sensitive to different encodings.
The result is that KXEN does a nice job of automated variable selection and model building. Is it as good as a hand-tuned model with appropriately transformed variables? No, and it is not intended to be. KXEN's goal is to allow an unsophisticated user to easily come up with a reasonable answer or to provide the expert data miner with an exploratory tool that can help select variables. The product itself is designed to be embedded within an application such as a task-specific analytic application or a general-purpose analysis tool.
By automating variable selection along with model building, KXEN is making it possible for a broader range of applications to incorporate data mining effectively.
Herb Edelstein is an internationally recognized expert in data mining, data warehousing and CRM, consulting to both computer vendors and users. A popular speaker and teacher, he is also a co-founder of The Data Warehousing Institute. He can be reached at feedback@twocrows.com.
For more information on related topics, visit the following channels:


