-
Marketplace
-
Channel Resources
Articles from this Site
DataFlux and SAS Introduce Project Unity
Experian QAS Announced QAS Email and Phone
Attachmate Unveils Verastream Host Integrator 6.6
Electronic Arts Out-Games the Competition with Informatica
Jaspersoft Delivers BI Development Platform for NetBeans and MySQL
White Papers
Why and How to Build a Continuous Integration Environment for the .NET Platform
Informatica - Handling Variable Length Files Using XML
Maximize Business Value with Right-Time Information Using Data Services
EAI - Refine the Economics of Integration
Profiling: Calculating Return on Investment for Data Migration and Data Integration Projects
Web Seminars
Espresso Shot: Optimize Sales and Marketing with Advanced Reporting and Dashboards
Trends and Tactics for Improving Data Quality
Getting In Synch: Creative Ways to Reconcile Data Between Apps
The Trouble with Success: Methods for Addressing Shrinking Batch Windows
Books
Fuzzy Data Integration, Part 2
Enterprise Architecture View
How do you deal with integrating data from heterogeneous sources that do not share common keys?
In the previous installment of this series (see my column in the August issue of DM Review), we examined some of the various data integration strategies for matching and integrating data between heterogeneous source systems when no common data elements exist to facilitate the process. In the second part of this series, we will look at the logic needed to implement a fuzzy logic data matching method and some of the details that need to be considered before attempting this process.
Scenario
During the building of our data warehouse, we are faced with the challenge of trying to integrate heterogeneous client data from different source systems in the enterprise - some of which do not have a common key to match against. Because the source systems cannot be readily matched by common columns, the decision is made to apply fuzzy client name matching as an alternative data integration method. For some of the sources, the client name column is the only available method of integrating the data between sources. The task is made more different due to client names being entered in each system in dissimilar methods (e.g., Smith Bros, SMITH Brothers Inc., The Smith Bros. Inc). The extraction, transformation and load (ETL) logic will accept data from each source system that the business users indicate can be matched by client name.Logic Summary
The logic used:- Retrieve all client information for the specific source system, which needs to be processed into the data warehouse since the last refresh cycle.
- Process the source client name through the fuzzy name match logic to determine the standardized client name of the specific record.
a) An enterprise word translation table is used by this process to convert source system client name words into an enterprise standard format for comparison purposes. The method to create this standardized enterprise client name is explained later.
b) The data warehouse cross-reference table maintains a cross reference between the various source system key fields and the one unique data warehouse client surrogate ID. The standardized enterprise name for the client is also maintained on this table. - Determine if any of the records from the source system have a matching enterprise client name in the data warehouse surrogate key cross-reference table. If so, use the corresponding client surrogate ID and insert a record into the data warehouse surrogate key cross-reference table for future processing cycles. Update the warehouse client table with specific data from the source system using the selected update method (e.g., SCD 1, SCD 2).
- If no enterprise client name matches the source record, then insert a new warehouse surrogate key in the cross-reference table with the enterprise standard client name. Insert a new record into the warehouse client table with specific data from the source system.
Fuzzy Logic Name Match
This process uses a pattern matching method to reconcile client names from different source systems into an enterprise standard format so records from these disparate systems and the data warehouse can be compared. This process takes a client name (the legal description) as input and provides a standardized enterprise name as its output. A translate table is created and used by the process to take words found in the client name of sources and convert them into a standard enterprise format. For example, consider the word INTERNATIONAL. This word may arrive from the source in a variety of formats. Data profiling can be used on the different source systems to build the translate table for the fuzzy logic process. A company name such as Smith Bros International or Smith Brothers or Smith Brothers Int. found in three source systems would all be translated into an enterprise standard format of SMITH BROTHERS INTERNATIONAL.The process uses the following logic:- Get rid of any character not in upper or lowercase (A-Z) and (0-9).
- Tokenize the remaining words using the space character as the separator.
- For each word, determine if a matching word exists on the word translation table. If so, replace the word with the standardized word version. For example, BROS is replaced by BROTHERS. Words such as COMPANY are regarded as noise words and are removed.
- Concatenate all the standardized words into a standard enterprise client name for comparison purposes with the data warehouse surrogate key cross-reference table.
Michael Jennings is a recognized expert with more than 20 years of information technology experience and speaks frequently on business intelligence/architecture issues at major industry conferences and has been an instructor at the University of Chicago's Graham School. He is a co-author of the book Universal Meta Data Models and a contributing author of the book Building and Managing the Meta Data Repository. He works for EWSolutions, a GSA schedule and Chicago-headquartered strategic partner and systems integrator dedicated to providing companies and large government agencies with best-in-class business intelligence solutions using data warehousing, enterprise architecture and managed meta data environment technologies (www.EWSolutions.com). He may be reached directly via e-mail at MJennings@EWSolutions.com.
For more information on related topics, visit the following channels:


