-
Marketplace
-
Channel Resources
Articles from this Site
Appian and MEGA Unite Software Tools
Hagemeyer Chooses Ceedo Virtualization
Software AG Expands Real-Time Data Interoperability for Mainframe and Open Systems
VMware Helps World's Top Universities Cut Costs and Go Green
Time and Time Again
White Papers
Oracle Builds Comprehensive SOA Platform
Leverage Your Network for Greater Productivity
ITIL - Measuring the ROI of Internal ITIL Investments
Domain-Specific Modeling: 10x Faster than UML
Secure Tracking, Managing and Deploying Business Rules Across the Enterprise
Books
Enterprise Architecture Planning: Developing a Blueprint for Data, Applications, and Technology
Web Services and Service-Oriented Architecture: The Savvy Manager's Guide
Distributed Systems Architecture
Network Administrator Street Smarts:A Real World Guide to CompTIA Network+Skills
Implementing Enterprise Data Warehousing
Fuzzy Data Integration, Part 1
Enterprise Architecture View
How do you deal with integrating data from heterogeneous sources that do not share common keys?
One of the greatest challenges and most time-consuming efforts in a data warehouse project is integrating data between disparate data sources into a meaningful collection of information for users to make decisions. Significant resources need to be applied to understanding not only the data and structures, but also gathering and analyzing the meta data of source systems to fully appreciate the context of the information. Further efforts may also need to be utilized or developed if the data exists in nonstructured or proprietary formats, exasperating an already very complex process. Once this basic step of understanding and accessing your data sources is somewhat manageable, you need to determine the best approach to integrating this data. Depending on your business requirements and source system data state, integrating data from these systems can be exigent if common data columns exist between the applications. The task becomes even more difficult if common elements do not exist between source systems and new, inexact methods need to be designed and developed in order to meet business needs to integrate this data.
This column is the first part of a two-part series examining a data integration strategy for matching and integrating data between heterogeneous source systems when no common elements exist to facilitate the process. In this installment of the series, we will look at some typical data integration approaches and provide an example case. In the second part of this series, we will look at the logic needed to implement this data matching method and some of the details that need to be considered before attempting this process. We are making the assumption that the data quality and data integration product or method being used by your firm does not have this type of capability.
Data integration into the data warehouse typically occurs through one of several methods. Your data integration process should support these methods whether used individually or in combination to improve data quality in the data warehouse environment. Data validation should be able to be performed on data originating from a particular source system to identify and mark poor quality records. Validation can include use of lookup tables (e.g., ISO code standards) or data type checking (valid numeric, date and null) to qualify source records for inclusion in further processing to the data warehouse. Records that do not qualify are excluded from the primary data integration workflow and rerouted to an exception queue for alternate processing or review. The first data integration process method consists of cross matching records from several source systems by same or similar key fields (e.g., employee ID). Similar key field matching may consist of some type of masking of the comparison fields to eliminate character differences (e.g., SMITH100321 <-> SMITH, JOHN using SMITH*). The second method involves selection of record values based on the most common values occurring across the various source systems (e.g., last name Smith exists in two of the four source systems). Third, new target column values can be selected based on a prioritized hierarchy of the source systems (for example, use call center customer address data over billing unless call center data for customer X does not exist).
Because most firms do not use the same keys across all of their systems, fuzzy matching can be used on several similar columns on each source system looking for closely matching patterns such as similar last names (e.g., Smith versus Smyth) or phone numbers (e.g., 312-555-1212 versus 312-555-1221). Alternately, a pattern matching process can be set up to reconcile the dissimilar column values to integrate the data. In next month's column, we will look at one approach to this logic.
Michael Jennings is a recognized expert with more than 20 years of information technology experience and speaks frequently on business intelligence/architecture issues at major industry conferences and has been an instructor at the University of Chicago's Graham School. He is a co-author of the book Universal Meta Data Models and a contributing author of the book Building and Managing the Meta Data Repository. He works for EWSolutions, a GSA schedule and Chicago-headquartered strategic partner and systems integrator dedicated to providing companies and large government agencies with best-in-class business intelligence solutions using data warehousing, enterprise architecture and managed meta data environment technologies (www.EWSolutions.com). He may be reached directly via e-mail at MJennings@EWSolutions.com.
For more information on related topics, visit the following channels:


