Multitier Architecture for High Performance Data Mining
Data mining is a powerful technology that converts data into competitive intelligence, which businesses can use to proactively predict future trends, uncover the meaning of historical events and discover business imperatives that were hitherto unknown to them.
The common perception of data mining as a tool, or as the mere application of an algorithm to data, is not entirely correct. Data mining is a process of discovering and interpreting previously unknown patterns in data to help businesses make better decisions. By nature it is iterative: each pass refines the results and prompts further, continuous probing into the data.
From a data management point of view, the data mining process requires exploring the data, creating analytic data sets for evaluation, generating patterns and building forecasting models.
An organization planning to tap into the vast potential of data mining needs an IT environment that satisfies a number of prerequisites; these are discussed in detail later in this article.
From its inception, data mining was largely limited to medical diagnosis, scientific research and behavioral profiling. It has lately become an integral part of the business landscape, where it adds a new dimension of predictive analysis.
Traditional OLAP Versus Data Mining
Reports, data profiling cubes, ad hoc queries, etc. provide valuable insight into the data. But these tools and methods focus more on status reporting than on finding the hidden patterns in the data.
What are the Business Drivers?
Before diving deep into architectural considerations for a high-performance data mining solution, we first have to understand what the business drivers are and what business value can be derived from them.
As Figure 2 shows, business drivers and the business values subsequently derived from them are defined with specific data analysis points in mind. Typically, a descriptive or factual question is formulated and either validated or refuted through ad hoc queries. For example, the business may ask, "What are the purchasing habits of people aged 18-22 in New York during the Thanksgiving weekend?" The results of this query are factual answers that enable the business to validate assumptions and make a decision.
Data mining, on the other hand, is a form of discovery-driven analysis in which statistical learning techniques based on patterns and algorithms are used to make predictions or estimates about outcomes or traits before their true values are known.
Data Mining Techniques
Following are the commonly used data mining techniques.
Analytical Model. A model is a set of rules or mathematical formulas that represents patterns found in data which are useful for a business purpose. Once a model has been built on one set of data, it can be reused with other similar data. These models are sometimes referred to as predictive models, since they can be used to predict behaviors that relate to the discovered patterns.
Association. This modeling technique is also referred to as an affinity model and is used to identify items that occur together during a particular event. It is commonly used in market-basket analysis to identify which combinations of products are most likely to be purchased together.
Another form of the same technique is sequence analysis which focuses on the sequence of events leading to a particular behavior. For example, this technique can be used to understand the order in which customers tend to purchase specific products. The results can be successfully applied to effective cross-selling marketing strategies.
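As an illustration, the co-occurrence counting at the heart of market-basket analysis can be sketched in a few lines of Python; the basket data below is invented for the example:

```python
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    """Count how often each pair of items is purchased together."""
    counts = Counter()
    for basket in transactions:
        # sort so that (a, b) and (b, a) map to the same pair
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "beer"],
    ["bread", "milk", "butter"],
]
print(pair_counts(baskets).most_common(1))  # the most frequent pair
```

Real association-rule engines add support and confidence thresholds on top of these raw counts, but the pairing logic is the same.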
Clustering. This modeling technique helps in identifying individual items that can be placed into groups based on like characteristics. The goal of clustering is to create groups of items that are similar based on their attributes within a given group, but which are very different from items in other groups. Clustering is frequently used to create customer segments based on a customer's behavior or other characteristics.
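A minimal sketch of the clustering idea, using a hand-rolled k-means on hypothetical one-dimensional customer-spend data (not the algorithm of any particular product):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[nearest].append(p)
        # keep the old centroid if a cluster happens to be empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# two obvious "customer segments": low spenders vs. high spenders
spend = [1.0, 1.2, 0.9, 10.0, 10.5, 9.8]
print(kmeans(spend, 2))  # approximately [1.03, 10.1]
```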
Data Visualization. Data visualization is the process of taking large amounts of data and converting them into more easily interpreted graphs, charts or tables. The focus is more on presentation style.
Decision Tree. This technique produces a tree-shaped structure that represents a set of decisions to predict a value of the target variable. This algorithm leverages a variety of techniques to separate or classify data based upon rules. Decision trees are commonly used to model good/bad risk or loan approval/rejection.
Linear Regression. A statistical technique used to find the best-fitting linear relationship between a numeric target variable and a set of predictor variables. Linear regression can be used, for example, to predict the amount of overdraft protection to offer a customer based on account balances, years of service and other characteristics.
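For a single predictor, the best-fitting line can be computed directly from the least-squares formulas; the balance/overdraft figures below are hypothetical:

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x (single-predictor sketch)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# hypothetical account balances vs. overdraft protection offered
balance = [1000, 2000, 3000, 4000]
overdraft = [100, 200, 300, 400]
a, b = linear_fit(balance, overdraft)
print(a, b)  # intercept ~0, slope ~0.1
```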
Logistic Regression. A statistical technique used to find the best-fitting linear relationship between a categorical target variable and a set of predictors. It is commonly used to predict binary valued results such as yes or no. A common example would be to determine whether or not a particular transaction is likely to be fraudulent.
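A toy sketch of logistic regression fitted by per-sample gradient descent; the transaction amounts and fraud flags are invented for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_fit(xs, ys, lr=0.1, epochs=2000):
    """One-predictor logistic regression via stochastic gradient descent."""
    a = b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            a += lr * (y - p)        # gradient step on the intercept
            b += lr * (y - p) * x    # gradient step on the slope
    return a, b

# hypothetical transaction amounts and fraud flags (1 = fraudulent)
amount = [0.1, 0.3, 0.5, 2.0, 2.5, 3.0]
fraud = [0, 0, 0, 1, 1, 1]
a, b = logistic_fit(amount, fraud)
print(sigmoid(a + b * 3.0) > 0.5)  # a large amount scores as likely fraud
```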
Neural Networks. This is a non-linear predictive modeling technique, loosely based on the structure of the human thought process, that learns through training. It is commonly used to predict a future outcome based on historical data. However, it frequently requires substantial expertise to understand the rationale behind the decisions and predictions it makes.
Score. A score is the output of a model, representing a predicted or inferred value of some trait or characteristic of interest. For example, if the model calculates customer value, the score for each customer is a number indicating that particular customer's value.
The Architecture and Challenges
So far we have discussed the various techniques applied to mine the data and come up with meaningful business imperatives.
In this section we will learn more about alternative architectures for a data mining system. First, let us define a set of basic components of a data mining system and then evaluate the approaches based on several prerequisites for large-scale data mining in an enterprise environment.
Prerequisites for Large Scale Data Mining
No limits on the data set sizes. Data mining is especially interesting for large enterprises which have huge data sets. Since they want to derive patterns from all of their data, the architecture should not limit the size of data sets that can be handled, for example to main memory capacity.
Optimized performance for large data sets. A data mining system should incorporate optimization strategies especially for large data sets in order to enable data mining with acceptable response times. The system architecture should enable a wide range of optimization strategies like parallelism and caching.
Flexibility for different data mining techniques. Users in an enterprise environment have different business goals in mind when they want to discover hidden trends and patterns in their data. Hence the architecture should be flexible enough to support various data mining techniques and algorithms like classification, clustering or association discovery.
Support for multiple users and concurrency. In an enterprise environment, several users may concurrently start data mining sessions on overlapping data sets. The data mining system should therefore support user-specific priorities and user groups as well as concurrent session management, i.e., multiuser and multisession handling.
Full control of system resources. A data mining system is part of an enterprise IT infrastructure in which several other applications run concurrently. The data mining system needs full control of the bandwidth and CPU cycles consumed by each user. This allows data mining activities to run in parallel with other applications without impairing them.
Full control of access to the data. In most enterprises data mining techniques are applied on data from a central data warehouse. If the warehouse data undergoes transformations while the data mining sessions are active, it may create unpredictable results. The data mining system should implement strict access control routines to maintain data consistency and prevent any unauthorized access.
Remote administration and maintenance. In a distributed enterprise environment there are many clients of the data mining system at different locations. Depending on the architecture the system might also incorporate several servers. Remote administration and maintenance is vital and should include installation and upgrading of software components.
These prerequisites will be set as a base for objective evaluation of system architectures that support large scale data mining.
Basic Components of a Data Mining System
The basic components of a data mining system are the user interface, data mining services, data access services and the data itself. The user interface allows the user to select and prepare data sets and apply data mining techniques to them.
Formatting and presenting the results of a data mining session is also an important task of the user interface. Data mining services consist of all components of the system that process a special data mining algorithm, for example, association rule discovery.
These components access data through data access services. Access services can be optimized for special database management systems or can offer a standard interface such as ODBC.
The data itself constitutes the fourth component of a data mining system and is typically sourced from the enterprise data warehouse environment.
The traditional approach to data mining has been a one-tier architecture. Such a system is completely client based. The user has to select a small subset of data warehouse data and load it on the client in order to make it accessible to the data mining tool. This tool may offer several data mining techniques.
The most obvious drawbacks of the one-tier architecture are the limit on the size of the data set that can be mined and the speed of the mining process. This is often overcome by selecting a random sample from the data. A truly random (unbiased) sample is needed to ensure the accuracy of the mined patterns, and even then patterns relating to small segments of the data can be lost.
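Drawing such an unbiased sample is straightforward with a pseudo-random generator; this sketch uses Python's standard library, and the row counts are arbitrary:

```python
import random

def unbiased_sample(rows, n, seed=42):
    """Simple random sample without replacement: every row is equally
    likely to be chosen, so the sample does not distort mined patterns."""
    return random.Random(seed).sample(rows, n)

warehouse_rows = list(range(1_000_000))   # stand-in for warehouse records
subset = unbiased_sample(warehouse_rows, 10_000)
print(len(subset))
```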
Another disadvantage is the absence of a multiuser functionality. Each user has to define his/her own subset of data and load it separately onto the client machine. Thus, there is a risk of operating on uncontrolled data points.
In a two-tier architecture the data mining tool completely resides on the client but there is no need to copy data to it in advance. The data mining application may choose to load parts of the data during different stages of the mining process and computations.
Following are a few approaches for running data mining algorithms in this architecture:
Download Approach. Data can be downloaded to the client from the data warehouse through on-demand database connectivity parameters. This is done dynamically, thereby avoiding the problems of storing huge data sets on the client. Even if data is loaded in advance, this approach is superior compared to the one-tier architecture. The automatic loading of data by the client enables it to store preprocessed data depending on the user's needs. Preprocessed data may be of reduced size and stored in a way that supports the data mining algorithm. Hence, better performance of the discovery process and less space consumption is achieved.
Query Approach. For some data mining techniques it is possible to formulate parts of the algorithm in a query language such as SQL. The client sends SQL statements to the data warehouse and uses the results for the data mining process. One advantage compared to the download approach is that only data which is really needed is sent to the client, because filtering and aggregation is already carried out by the database system. Since parts of the application logic are formulated in SQL, query processing capabilities of the data warehouse system can be exploited.
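A small illustration of the query approach, using an in-memory SQLite database as a stand-in for the warehouse; the table and column names are invented for the example:

```python
import sqlite3

# In-memory stand-in for the data warehouse.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (tx_id INTEGER, item TEXT)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [(1, "beer"), (1, "nappies"), (2, "beer"),
                (2, "nappies"), (3, "milk")])

# Part of the mining algorithm expressed in SQL: the database system
# counts each item's support and filters out infrequent items, so only
# the aggregates -- not the raw transactions -- reach the client.
rows = db.execute("""
    SELECT item, COUNT(DISTINCT tx_id) AS support
    FROM sales
    GROUP BY item
    HAVING support >= 2
""").fetchall()
print(rows)
```

The filtering and aggregation happen inside the database, which is exactly the advantage over the download approach described above.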
Database Approach. In this approach the complete data mining algorithm is processed by the database system. This can be realized by stored procedures and user defined functions. Only the data mining results have to be sent to the client which is responsible for displaying them. The data mining process is able to exploit the efficient processing capabilities offered by the data warehouse.
The two-tier architecture has evident advantages over purely client-based data mining. It enables direct access to the data warehouse. No data extraction is necessary as a prerequisite for data mining. The two-tier architecture does not limit the size of a database that can be mined. New information can be discovered in the masses of data stored in the data warehouse. Additionally, the data mining process can take advantage of the query processing capabilities provided by special data warehousing hardware and software.
Besides the advantages of this approach, some problems remain. One is limited access to the data warehouse system. Data warehouse systems are implemented with strict access controls in mind, and installing and configuring applications on them is usually not permitted; only the download approach and the query approach are then applicable. Another is limited control over system resources. When all users access the data warehouse directly, it is not possible to control the bandwidth and CPU cycles each user consumes for data mining. Many users access the warehouse concurrently for data mining purposes, and in the two-tier environment there is no way to govern this access through data-mining-specific user priorities and user groups. The last drawback we want to mention here is the limited scope for optimization. There are only two strategies to make the data mining process more efficient: exploiting the query processing capabilities of the data warehouse and enhancing the data mining algorithm itself. There is little scope for parallel algorithms or for reusing results across different clients.
A three-tier architecture addresses the problems that remain with a two-tier architecture. This is achieved by an additional layer holding the data access services and parts of the data mining services. Data mining services may also be present on the client; which parts should be client based depends on the data mining techniques and algorithms used.
The data mining process works as follows in this architecture. First, the user defines the parameters for data mining in the graphical user interface. The data mining services on the client perform some preprocessing prior to calling the data mining services on the middle tier. The first task on the middle tier is authentication and authorization of the users. Then the data mining services queue and execute the tasks of several clients and send back the results. These are used in the post-processing of the client, which computes the final outcome and presents it to the user. A client may start several data mining tasks in one session. Each of them includes a number of calls to the middle tier. Data mining services use the data access services on the middle tier in order to read from different types of data sources.
This three-tier approach has several advantages compared to the two-tier architecture. First, the data mining services can control the number of connections to the warehouse as well as the number of statements currently executed by the database system. The middle tier can control the number and kind of data mining tasks that are processed in parallel. This enables the system to influence the usage of system resources for data mining purposes, especially bandwidth and CPU cycles. Second, the system can service users according to their priority and membership in user groups. This includes restricted access to data mining tables as well as user specific response behavior. Third, a wide range of optimization strategies can be realized. The tasks of the data mining services can be distributed over the client and the middle tier. The middle tier can exploit parallelism by parallel processing on the middle tier hardware and parallel connections to the database layer. Additionally, the data mining services can reuse the outcome of data mining sessions and precompute common intermediate results. In summary, the main advantage of three-tier architecture is that mining can be done in a controlled and manageable way for multiple users.
Association Rule Discovery in a Three-Tier Architecture
Mining for association rules involves two activities. The first one finds all frequent item sets whose support is greater than a given minimum support threshold. The second one generates the desired rules from the frequent item sets found in the first step. They are required to have a confidence that is greater than a given minimum confidence threshold.
The first step is computationally more expensive than the second one. This is because in all but the most trivial cases the amount of data to be processed is much larger in the first step (a database of transactions) than in the second one (frequent subsets of transactions). Furthermore, the first step often involves multiple passes over the database. This is the reason why most of the algorithms in the literature for mining association rules deal with the first step only.
The middle tier is responsible for generating frequent item sets which are sent to the client. The association rules are computed on the client based on the frequent item sets and a minimum confidence parameter. If the user varies the minimum confidence the rules have to be recomputed, but no communication with the middle tier is necessary.
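The client-side rule-generation step can be sketched as follows; the frequent item sets and support counts are hypothetical, and for simplicity only rules with a single-item consequent are generated:

```python
def rules_from_itemsets(freq, min_conf):
    """Client-side step: derive rules A -> b from frequent item sets.
    `freq` maps frozenset -> support count. Varying min_conf only
    re-runs this function -- no new call to the middle tier is needed."""
    rules = []
    for items, support in freq.items():
        if len(items) < 2:
            continue
        for b in items:
            antecedent = items - {b}
            confidence = support / freq[antecedent]
            if confidence >= min_conf:
                rules.append((set(antecedent), b, confidence))
    return rules

# hypothetical frequent item sets returned by the middle tier
freq = {
    frozenset({"beer"}): 6,
    frozenset({"nappies"}): 5,
    frozenset({"beer", "nappies"}): 4,
}
for lhs, rhs, conf in rules_from_itemsets(freq, 0.7):
    print(lhs, "->", rhs, round(conf, 2))
```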
Of course, a variation of the minimum support or surprise parameters requires the middle tier to run a new frequent item set discovery. Frequent item set discovery is well placed on the middle tier for several reasons. First, the middle tier is likely to run on a more powerful machine than the client. Second, the middle tier is designed for high performance, incorporating parallelism. Third, it can exploit the query processing capabilities of the data warehouse.
The dynamic item set counting algorithm reduces the number of passes for frequent item set discovery. Hence, the amount of data transferred to the middle tier is much smaller than for other algorithms. The data transfer can further be reduced by tokenization and filtering. The idea of tokenization is to map long multi-attribute identifiers on short integer values. This reduces the number of bytes that are sent to the middle tier for each identifier. Filters can be applied in two ways. First, items are selected according to the work specification. Second, all items that already turned out not to be frequent are discarded.
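The tokenization described above can be sketched as a simple lookup table; the multi-attribute identifiers below are invented:

```python
def make_tokenizer():
    """Map long multi-attribute identifiers onto short integers so fewer
    bytes travel to the middle tier (sketch of the tokenization idea)."""
    table = {}
    def tokenize(identifier):
        if identifier not in table:
            table[identifier] = len(table)   # next unused integer
        return table[identifier]
    return tokenize, table

tokenize, table = make_tokenizer()
raw = [("DE", "store-0042", "SKU-998877"),
       ("DE", "store-0042", "SKU-112233"),
       ("DE", "store-0042", "SKU-998877")]
tokens = [tokenize(r) for r in raw]
print(tokens)  # [0, 1, 0] -- the repeated identifier reuses its token
```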
The data partitions are processed by parallel threads that insert new frequent item sets into a shared hash-tree data structure. The association rule discovery facility integrated into the prototype has several features that end users consider important, among them the ability to include and exclude certain items prior to rule discovery and support for item hierarchies. For example, many organizations maintain a hierarchy of product groups in their data warehouse, which allows for more general rules such as beverages/baby care in addition to specific rules such as beer/nappies. Both features not only help the user concentrate on interesting aspects of the data, but also reduce the amount of data to be processed, yielding faster response times.
Decision Tree Induction in a Three-Tier Architecture
The technique for building decision trees uses statistical information on data held in a single data mining table. The user specifies one field of the table to be the outcome. The data of the table is then classified with respect to this outcome by applying binary splits to the decision tree.
The algorithm for decision tree induction requires information on the data in the form of contingency tables (sometimes called cross tabulations or pivot tables). A contingency table is a two-dimensional matrix. One axis represents the values of an attribute and the other axis represents the values of the outcome. The cells of the matrix contain counters for the occurrences of attribute/outcome combinations. For example, an outcome attribute could be a text field with the three values yes, no and maybe.
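Building such a contingency table is a simple aggregation; the customer records and field names below are hypothetical:

```python
from collections import Counter

def contingency_table(rows, attr, outcome):
    """Cross-tabulate one attribute against the outcome: each cell counts
    occurrences of an (attribute value, outcome value) combination."""
    counts = Counter((r[attr], r[outcome]) for r in rows)
    attr_vals = sorted({a for a, _ in counts})
    out_vals = sorted({o for _, o in counts})
    # Counter returns 0 for combinations that never occur
    return {a: {o: counts[(a, o)] for o in out_vals} for a in attr_vals}

# hypothetical customer records; "renewed" is the outcome field
rows = [
    {"region": "east", "renewed": "yes"},
    {"region": "east", "renewed": "no"},
    {"region": "west", "renewed": "yes"},
    {"region": "west", "renewed": "yes"},
]
print(contingency_table(rows, "region", "renewed"))
```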
Once the user has specified the table to be used for decision tree induction, the client requests statistical information to describe the attributes within that table. This information allows users to decide which attributes they will use for building decision trees. Decision tree construction involves repeated requests for contingency tables. A contingency table is requested per attribute per node of a tree.
Decision tree induction can also be decomposed into two steps. In the first step contingency tables are computed based on raw data. This task is performed on the middle tier. The contingency tables are created from the subset of data that is represented by the node to be split. In a second step the client applies statistics based on a high level algorithm to these contingency tables in order to induce a split of the node. The results of the split are two child nodes which represent two disjoint subsets of the parent node. For each of them a filter can be defined. The high level information traveling between the client and the middle tier is designed to satisfy any bandwidth constraints between these two tiers.
It can be seen that determining a split point in a decision tree consists of two parts. One is defining a subset of the raw data using a filter. The other part is aggregation in order to create contingency tables. It should be noted that the order of these two operations is an implementation decision. It is possible to build suitably indexed contingency tables for the full data set and then extract subsets for a given node as defined by a filter.
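One way to choose a split from a contingency table is to score candidate binary splits by weighted Gini impurity; this is a common criterion, though the article does not name the statistic its algorithm uses:

```python
def gini(counts):
    """Gini impurity of a dict of outcome counts."""
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values()) if total else 0.0

def best_binary_split(table):
    """Score each 'value v vs. all other values' binary split of a
    contingency table (value -> outcome counts) by weighted Gini impurity
    and return the (value, score) pair with the lowest score."""
    outcomes = {o for row in table.values() for o in row}
    total = sum(c for row in table.values() for c in row.values())
    best = None
    for v in table:
        left = table[v]
        right = {o: sum(table[w].get(o, 0) for w in table if w != v)
                 for o in outcomes}
        nl, nr = sum(left.values()), sum(right.values())
        score = (nl * gini(left) + nr * gini(right)) / total
        if best is None or score < best[1]:
            best = (v, score)
    return best

# the hypothetical region/renewal contingency table from earlier
table = {"east": {"yes": 1, "no": 1}, "west": {"yes": 2, "no": 0}}
print(best_binary_split(table))
```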
To develop analytic solutions that can be applied throughout your enterprise, a powerful infrastructure along with a flexible data mining architecture is required for analytic processing. The volume of data being created and captured and the amount of transaction data can cause massive bottlenecks in the decision flow: thousands of variables, millions of transactions per day, and millions of customers.
The data mining system requires timely, accurate, and sophisticated analysis of the data to maintain a competitive advantage. Reports and OLAP techniques provide the capabilities for navigating massive data warehouses but not the insight required to stay ahead of your competitors.
Data mining offers the analytic foundation to unlock the intelligence from the enterprise data warehouse.
Soumendra Mohanty is a program manager at Accenture, India, where he leads the Data Warehousing/Business Intelligence Capability Group, providing architectural solutions across various industry domains and DW/BI technology platforms. He has worked with several Fortune 500 clients and executed projects in a range of industry domains and technology platform areas. He can be reached at Soumendra.Mohanty@accenture.com.