Portals eNewsletters Web Seminars dataWarehouse.com DM Review Magazine
DM Review | Covering Business Intelligence, Integration & Analytics
   Covering Business Intelligence, Integration & Analytics Advanced Search

View all Portals

Scheduled Events

White Paper Library
Research Papers

View Job Listings
Post a job


DM Review Home
Current Magazine Issue
Magazine Archives
Online Columnists
Ask the Experts
Industry News
Search DM Review

Buyer's Guide
Industry Events Calendar
Monthly Product Guides
Software Demo Lab
Vendor Listings

About Us
Press Releases
Advertising/Media Kit
Magazine Subscriptions
Editorial Calendar
Contact Us
Customer Service

Self-Healing Application Architecture, Part 2: Predictive Data Management

  Article published in DM Direct Special Report
February 16, 2006 Issue
  By Ira Whiteside

In part 1, I described the methodology for a self-healing application architecture. Part 2 will further describe the process of predictive data management (PDM).

PDM is an approach for implementing a predictive and automated process for data cleansing, quality and analysis. The premise is to use data profiling maps and associated process maps to predict the actions necessary to cleanse data or determine the further analysis steps on a column-by-column basis.

Specifically, when this process is applied to the problem of customer data integration (CDI), the cost and time savings are enormous.

Let's examine the traditional logical, discrete steps required for data integration as defined by Jill Dyche in "Making the Case for CDI:"

Figure 1: Data Integration as a Discrete Set of Steps

As you examine these steps, notice that they tend to happen serially or like a waterfall, each step being completed prior to the next. Any errors simply cascade forward. While these represent the generally accepted logical steps for data integration, the physical implementation is often chaotic and filled with unnecessary repetition.

Source data mapping is a manually intensive effort and totally dependent on human experience. Many times, organizational and political factors will determine how accurate and timely this information is. Often the staff providing the information is the target of the replacement application. A more objective process is needed.

Data extraction is difficult and time consuming, again dependent on the current staff for proper metadata and exception definitions. The information needed for extraction and column definition are usually manually recorded and retyped for the actual extraction process.

Initial staging is a generally accepted practice and represents the only true from of redundant data in a CDI effort. Much time and effort is expended in developing the processes to extract the data and load it onto a server where further source analysis and data profiling can be attempted. The staging areas are typically worked and reworked as data quality problems and exceptions to the original source definitions are stumbled upon. Usually as problems are discover and fixed, they are recorded only in the application logic of the extract process, and will need to be copied and pasted into the final application.

Data profiling is performed in every data migration effort. In most cases it is performed as part of the previous step due to requirement for load data. In most cases it is the serial process of attempting to load staging, finding a problem, manually running queries to narrow the problem, adjusting the code and rerunning the process about a million times or until you fall asleep. The best part is in a geographically dispersed team with several staff members who will be individually experiencing the same process for the same files.

Data quality/cleansing in most cases is accomplished with custom applications. As with profiling in many projects, this logic is added to the staging data. Additional data profiling activities are usually performed as manual queries, normally to create cross reference tables or master-code tables. These cross-reference tables represent the business domain values and hierarchies. The greatest difficulty here is that the cross reference data is either created as a hodge-podge collection of tables by the team loading the staging tables or prematurely forced into a target data model designed to support the application being targeted (not necessarily the model best suited for the business).

Standardization/reconciliation is the key to data quality. This is an area where many specialty vendors exist either at the high end or low end of the cost and complexity spectrum. In many cases this is treated as a commodity or point solution - to just clean the data. This presents an obvious problem in that the exceptions required are implemented in a proprietary solution and are not available to other applications. In addition, with the existing vendor applications, there is not an automated process for capturing exceptions and correcting them. The typical solution, as described in Part 1, is to reject the entire record and send it for manual correction.

Data transformation is the process where the data is reformatted to load into the target data model. Again, in what seems to be a straightforward process, many hours are spent in trying to convert data from one database format to another. It requires expertise in both databases and the functions sometimes necessary as intermediate steps.

Figure 2: Data Integration via Predictive Data Management

As you can see in the diagram above, the PDM process is cyclical and iterative. All of the discrete processes are now connected to the data maps and process maps. As you progress in each step, your results are recorded in the data maps and process maps. As you make progress, you can access and use the information (knowledge) learned in previous steps.

The impact of the PDM approach is that all knowledge is captured automatically in the data and process maps. This allows all team members to have direct access to previously learned information.

As a review Part 1, five constructs accommodate all changes in data format and content:

  • Surveying - Gathering technical metadata describing structure for a point in time.
  • Profiling - Measuring tolerances (minimum, maximum, modes, etc.) for a point in time.
  • Pattern Analysis - Correcting format issues.
  • Domain Analysis - Maintaining and automatically updating business domains and hierarchies for a point in time.
  • Relationship Analysis - Validating required business domain relationships for a point in time.

When used together these five constructs provide the ability to predict rejected records, transactions failing business rules and invalid or new domain values, and either correct the input or adjust the target application data to the new values.

When linked in a data model, these five constructs combine to create a dynamic metadata repository or data map. We use the term "data map" as related to the art of cartography. This new approach is used to understand, analyze and navigate the data structures currently available to an individual user, database or data administrator.

As a result, the user will be able to understand what has changed in their data and, more importantly, how that change has affected the data relationships. This process map can then be used to navigate among the diverse data resources available to business users, enabling them to gain productivity, effectiveness and speed in the continual pursuit for improved business analysis and better decision-making.

The user's data map is dynamic. The map will change over time. The effort to manually develop the code necessary for this is extensive. An experienced consultant or developer will recognize this as a commonsense but time-consuming approach.

The obvious difference is that the staff completing the activities in the first process has been trained (rules) and has experience (heuristic knowledge) to rely on when making decisions. IT organizations do not have a common effective means to acquiring application knowledge and experience and retaining it. We propose that the IT organizations utilize a data map as the equivalent to heuristic knowledge and a container for the training or rules used by the staff.

Let's examine the impact of utilizing the PDM approach to our data integration example. The processes are now logically grouped into only four categories:

  1. Physical database design/source data mapping/data extraction/initial staging/data profiling: By using the data map associated with an automated data profiling component, these processes happen simultaneously. These processes will automatically capture all data profiling information as well as pattern information and store it for analysis in the process map. This isaccomplished without requiring the movement of entire databases or tables to a staging area for further analysis.
  1. Data quality/cleansing: These processes are accomplished via process maps. We will further define process maps in a later article. Process maps are generated from data maps and are also stored. Since the process maps will tell the user what the problems are for each column, in most cases the process map can be automatically generated from the data map.
  1. Standardization/reconciliation: As part of the first process, the column domain information is captured for the required columns. For columns requiring standardization the column domain values are use to prepopulate and validate the necessary standardization categories. This has the effect of editing or validating the values in the data against the standardize values without any staging or temporary extraction.
  1. Data transformation: Because the process map has knowledge of both the source and target table, the processes for data transformation are also automatically generated.

Finally, the process for master data management (MDM) is also greatly enhanced with access to the data maps and process maps.

The key to MDM, as with any other data integration activity, is providing the knowledge worker with real-time access to current knowledge regarding the duplicate records they are attempting to group or merge. Given that the data maps and process maps are available and integrated into the MDM Console application, the user has access to all information regarding the source data and any changes made to it during integration. In addition, the user has knowledge of all the existing domain values, applied standardizations and any relationships this record has to other records in other associated tables.

The impact of using a PDM approach is to greatly reduce the time and effort to move and integrate data. The knowledge acquired in the integration process will be remembered in the data maps and can be recalled when maintenance or further integration is acquired.


For more information on related topics visit the following related portals...
Business Intelligence (BI), Customer Data Integration, Master Data Management and Reference Data.

Ira W. Whiteside is the CTO of AMB Dataminers Inc. He may be reached at iwhiteside@amb-dataminers.com.

E-mail This Article E-Mail This Article
Printer Friendly Version Printer-Friendly Version
Related Content Related Content
Request Reprints Request Reprints
Site Map Terms of Use Privacy Policy
SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.
SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.