Published in DM Review in August 2004.|
Printed from DMReview.com
The Future of Data Mining – Predictive Analyticsby Lou Agosta
Summary: The technology innovations in data mining since 2000 have been truly Darwinian and show promise of consolidating and stabilizing around predictive analytics.
The future of data mining lies in predictive analytics. The technology innovations in data mining since 2000 have been truly Darwinian and show promise of consolidating and stabilizing around predictive analytics. Variations, novelties and new candidate features have been expressed in a proliferation of small start-ups that have been ruthlessly culled from the herd by a perfect storm of bad economic news. Nevertheless, the emerging market for predictive analytics has been sustained by professional services, service bureaus (rent a recommendation) and profitable applications in verticals such as retail, consumer finance, telecommunications, travel and leisure, and related analytic applications. Predictive analytics have successfully proliferated into applications to support customer recommendations, customer value and churn management, campaign optimization, and fraud detection. On the product side, success stories in demand planning, just in time inventory and market basket optimization are a staple of predictive analytics. Predictive analytics should be used to get to know the customer, segment and predict customer behavior and forecast product demand and related market dynamics. Be realistic about the required complex mixture of business acumen, statistical processing and information technology support as well as the fragility of the resulting predictive model; but make no assumptions about the limits of predictive analytics. Breakthroughs often occur in the application of the tools and methods to new commercial opportunities.
Unfulfilled Expectations: In addition to a perfect storm of tough economic times, now improving measurably, one reason data mining technology has not lived up to its promise is that "data mining" is a vague and ambiguous term. It overlaps with data profiling, data warehousing and even such approaches to data analysis as online analytic processing (OLAP) and enterprise analytic applications. When high-profile success has occurred (see the front-page article in the Wall Street Journal, "Lucky Numbers: Casino Chain Mines Data on Its Gamblers, And Strikes Pay Dirt" by Christina Binkley, May 4, 2000), this has been a mixed blessing. Such results have attracted a variety of imitators with claims, solutions and products that ultimately fall short of the promises. The promises build on the mining metaphor and typically are made to sound like easy money - "gold in them thar hills." This has resulted in all the usual dilemmas of confused messages from vendors, hyperbole in the press and unfulfilled expectations from end-user enterprises.
Common Goals: The goals of data warehousing, data mining and the emerging trend in predictive analytics overlap. All aim at understanding consumer behavior, forecasting product demand, managing and building the brand, tracking performance of customers or products in the market and driving incremental revenue from transforming data into information and information into knowledge. However, they cannot be substituted for one another. Ultimately, the path to predictive analytics lies through data mining, but the latter is like the parent who must step aside to let the child develop her or his full potential. This is a trends analysis, not a manifesto in predictive analytics. Yet the slogan rings true, "Data mining is dead! Long live predictive analytics!" The center of design for cutting-edge technology and breakthrough commercial business results has shifted from data warehousing and mining to predictive analytics. From a business perspective, they employ different methods. They are positioned in different places in the technology hierarchy. Finally, they are at different stages of growth in the life cycle of technology innovation.
Technology Cycle: Data warehousing is a mature technology, with approximately 70 percent of Forrester Research survey respondents indicating they have one in production. Data mining has endured significant consolidation of products since 2000, in spite of initial high-profile success stories, and has sought shelter in encapsulating its algorithms in the recommendation engines of marketing and campaign management software. Statistical inference has been transformed into predictive modeling. As we shall see, the emerging trend in predictive analytics has been enabled by the convergence of a variety of factors, with the market really taking off in the late 1990s after a long gestation period (see Figure 1).
Technology Hierarchy: In the technology hierarchy, data warehousing is generally considered an architecture for data management. Of course, when implemented, a data warehouse is a database providing information about (among many other things) what customers are buying or using which products or services and when and where are they doing so. Data mining is a process for knowledge discovery, primarily relying on generalizations of the "law of large numbers" and the principles of statistics applied to them. Predictive analytics emerges as an application that both builds on and delimits these two predecessor technologies, exploiting large volumes of data and forward-looking inference engines, by definition, providing predictions about diverse domains.
Methods: The method of data warehousing is structured query language (SQL) and its various extensions. Data mining employs the "law of large numbers" and the principles of statistics and probability that address the issues around decision making in uncertainty. Predictive analytics carries forward the work of the two predecessor domains. Though not a silver bullet, better algorithms in operations research, risk minimization and parallel processing, when combined with hardware improvements and the lessons of usability testing, have resulted in successful new predictive applications emerging in the market. (Again, see Figure 1 on predictive analytics enabling technologies.) Widely diverging domains such as the behavior of consumers, stocks and bonds, and fraud detection have been attacked with significant success by predictive analytics on a progressively incremental scale and scope. The work of the past decade in building the data warehouse and especially of its closely related techniques, particularly parallel processing, are key enabling factors. Statistical processing has been useful in data preparation, model construction and model validation. However, it is only with predictive analytics that the inference and knowledge are actually encoded into the model that, in turn, is encapsulated in a business application.
This results in the following definition of predictive analytics: Methods of directed and undirected knowledge discovery, relying on statistical algorithms, neural networks and optimization research to prescribe (recommend) and predict (future) actions based on discovering, verifying and applying patterns in data to predict the behavior of customers, products, services, market dynamics and other critical business transactions. In general, tools in predictive analytics employ methods to identify and relate independent and dependent variables - the independent variable being "responsible for" the dependent one and the way in which the variables "relate," providing a pattern and a model for the behavior of the downstream variables. Differentiators are summarized in Figure 2.
In data warehousing, the analyst asks a question of the data set with a predefined set of conditions and qualifications, and a known output structure. The traditional data cube addresses: What customers are buying or using which product or service and when and where are they doing so? Typically, the question is represented in a piece of SQL against a relational database. The business insight needed to craft the question to be answered by the data warehouse remains hidden in a black box - the analyst's head. Data mining gives us tools with which to engage in question formulation based primarily on the "law of large numbers" of classic statistics. Predictive analytics have introduced decision trees, neural networks and other pattern-matching algorithms constrained by data percolation. It is true that in doing so, technologies such as neural networks have themselves become a black box. However, neural networks and related technologies have enabled significant progress in automating, formulating and answering questions not previously envisioned. In science, such a practice is called "hypothesis formation," where the hypothesis is treated as a question to be defined, validated and refuted or confirmed by the data. The confirmation or refutation of the hypothesis counts as knowledge in the strict sense. In neither data mining nor predictive analytics is a decision made. A prediction is a prediction, not a decision. The ultimate determining mark of predictive analytics (and applications) is that the prediction is inside the model.
A few examples will make clear the differentiators and sharpen the distinction between data mining and predictive analytics, and how the predictive analytics emerge from the context of the former.
Borderline examples are abundant. These include cases such as fraud detection, which are not primarily predictive, but also deserve attention. They map to the current approach because the assertion that the credit card transaction or insurance claim is fraudulent is like a hypothesis to be further tested and invalidated or confirmed. Because the hypothesis formulation and verification is similar to the scientific method, the results of predictive analytics are often dignified by being described as "knowledge discovery." For example, when Farmer's Insurance determined that drivers of sports cars were at higher risk for collisions than drivers of Volvos, they were predicting the obvious. However, they used IBM's Intelligent Miner to predict that drivers of sports cars who also owned "safe" cars such as Volvos fit the profile of the "safe" family car, not the risky sport car driver. Therefore, they were able to make them an offer of a modest discount and pocket the difference, creating a win/win scenario. This is something that a packaged solution is often unable to do because it lacks the flexibility and the collaborative context needed to explore variations in the independent variables.
In spite of precise differentiators between data mining and other specialties, the average person is likely to continue to refer to "data mining" as any computationally intense process that uses a large volume of data. No one is proposing to legislate how people speak. Gold mining metaphors will continue to be pervasive in marketing messages and industry discussions. The suggestion here is to substitute "predictive (analytics)" for the term when it is used in data mining requirements analyses, marketing messages, analyst reports or other professional conversations to see if it holds up to the differentiators detailed in this article. The predictable result is to deflate the hyperbole as well as that of "wanna-be" data warehousing, business intelligence or data mining vendors aspiring to predictive analytics without predictive technology. In spite of laboring mightily to differentiate between the terms, do not forget that data warehousing, data mining, OLAP and predictive analytics can often be complementary and strengthen one another. The point is simply that functionality must be diligently qualified. While the term "data mining" will continue to be used, it is important to realize that the truth and future of the term data mining lies in predictive analytics.
Lou Agosta, Ph.D., joined IBM WorldWide Business Intelligence Solutions in August 2005 as a BI strategist focusing on competitive dynamics. He is a former industry analyst with Giga Information Group, has served as an enterprise consultant with Greenbrier & Russel and has worked in the trenches as a database administrator in prior careers. His book The Essential Guide to Data Warehousing is published by Prentice Hall. Agosta may be reached at LoAgosta@us.ibm.com.
Copyright 2006, SourceMedia and DM Review.