
CDI, and Names and Addresses

Knowledge Integrity

Although I have been saying for years that data quality is not all about names and addresses, I don't want people to think that name and address quality isn't a big part of the data quality process. On the contrary, any organization that needs to deal with customers is bound to have problems with the contact information for those customers. There are two growing trends that warrant a closer look at name and address quality: enterprise information integration (EII) and customer data integration (CDI).

Both of these techniques focus on consolidating information from across the enterprise so that each entity resolves into a single view. The EII approach provides the means for accessing disparate data sources in place, presenting an integrated view of the data without actually moving it. Alternatively, the goal of CDI (and other master data management approaches) is to collect and aggregate entity detail (be it for a customer or another type of reference object) into a single repository that serves as a single source of truth. Even though these might be seen as competing approaches, they share the need for name and address parsing and standardization in order to provide that single view, whether it is a repository-based view or a virtual view.
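To make "standardization" concrete, here is a minimal Python sketch of how two address variants might be reduced to one comparable form. The abbreviation table and the function name `standardize_address` are illustrative assumptions, not part of any particular tool; a production parser would rely on far larger, locale-aware reference data.

```python
import re

# Hypothetical abbreviation table, for illustration only.
STREET_ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "road": "rd",
    "suite": "ste",
}


def standardize_address(raw: str) -> str:
    """Lowercase, strip punctuation, and map common words to one abbreviation."""
    # Replace anything that is not a word character or whitespace with a space.
    cleaned = re.sub(r"[^\w\s]", " ", raw.lower())
    tokens = cleaned.split()
    return " ".join(STREET_ABBREVIATIONS.get(token, token) for token in tokens)


if __name__ == "__main__":
    print(standardize_address("123 Main Street, Suite 4"))
    print(standardize_address("123 MAIN ST. STE 4"))
```

Both calls produce the same string, which is the property that matters: once variants share a standardized form, they can be compared or indexed regardless of whether the data stays in place (EII) or is moved to a repository (CDI).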

In fact, in these kinds of environments, one might consider name and address quality to be the most critical component of the process. If one of the goals is to provide a unified view of the individuals whose information is recorded in various databases across the enterprise, then the inability to recognize and resolve aliases and variations into a single entity will ultimately defeat the purpose.

We clearly need name and address parsing and cleansing as part of our EII or CDI processing. The difficulty is that name and address cleansing has traditionally been seen as a batch process followed by a series of interactive review sessions. The data is extracted, the data sets are compared to each other, and the result is trisected into records that are definitely matches, records that definitely don't match any others, and questionable matches that require manual review. However, the synchronization and potential real-time demands of applications that rely on an EII or CDI platform will not tolerate mountains of record pairs destined for the analyst's screen. Yet those manual review records are the ones that carry the most value, because merging the obvious duplicates is an automated no-brainer. Matching the ones that are too close to call is the one process that really needs to be automated!
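The trisection described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the similarity measure is the standard library's `difflib.SequenceMatcher`, used here only as a stand-in for a real matching engine, and the two thresholds (0.9 and 0.7) are arbitrary values chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0 (stand-in scorer)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def trisect(score: float, match_at: float = 0.9, review_at: float = 0.7) -> str:
    """Place a similarity score into one of the three traditional buckets."""
    if score >= match_at:
        return "match"
    if score >= review_at:
        return "review"  # too close to call: goes to the analyst
    return "non-match"


def classify_pair(a: str, b: str) -> str:
    """Score a pair of records and return its bucket."""
    return trisect(similarity(a, b))


if __name__ == "__main__":
    print(classify_pair("David Loshin", "david loshin"))
```

Everything that lands in the "review" bucket is what ends up on the analyst's screen, so the width of the band between the two thresholds directly determines the manual workload.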

This poses two challenges to the data quality tools community. The first lies in moving from the typical batch approach to a more services-based process. The hard part of that move is aggregating the meta knowledge necessary for performing the matching process; in other words, an application faced with a new customer record needs to be able to scan the set of potential aliases without necessarily having access to the entire extracted data sets. Yet because duplicate elimination and householding applications employ the variant data for the purpose of entity resolution, the absence of this data is likely to hobble the process. The task, therefore, is to maintain the variant/alias knowledge without needing to hold onto all of the data.
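One way to picture keeping alias knowledge without keeping the source data is a compact index from a normalized match key to an entity identifier. The Python sketch below is purely illustrative: the class name `AliasIndex`, its methods, and the key scheme (lowercased, punctuation-free, sorted name tokens) are all assumptions made for the example, and a real service would need far richer keys to handle nicknames, initials and misspellings.

```python
import re
from typing import Dict, Optional


class AliasIndex:
    """Maps normalized name keys to entity ids; stores no source records."""

    def __init__(self) -> None:
        self._keys: Dict[str, str] = {}  # match key -> entity id

    @staticmethod
    def match_key(name: str) -> str:
        """Build an order-insensitive key from the letters in a name."""
        tokens = re.sub(r"[^a-z\s]", " ", name.lower()).split()
        return " ".join(sorted(tokens))

    def register(self, name: str, entity_id: str) -> None:
        """Record a known variant of an entity's name."""
        self._keys[self.match_key(name)] = entity_id

    def lookup(self, name: str) -> Optional[str]:
        """Return the entity id for a known variant, or None if unseen."""
        return self._keys.get(self.match_key(name))


if __name__ == "__main__":
    index = AliasIndex()
    index.register("Loshin, David", "E1")
    print(index.lookup("david loshin"))  # known variant
    print(index.lookup("D. Loshin"))     # not yet known
```

A service built this way can answer a single-record lookup on demand, which is the shape of request an EII or CDI platform is likely to make, while the index itself holds only keys and identifiers.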

The second, and greater, challenge is to incorporate a degree of automated trainability into the data quality application. If the true bottleneck is the manual review, then enabling the application to internalize the data analyst's approach to decision making would allow it to "learn" and consequently become less dependent on the analyst. It is likely that the kinds of corrections applied by the analyst are neither random nor sophisticated, which suggests that they follow patterns an application could capture.
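As a deliberately simple illustration of what "learning" from the analyst might mean, the Python sketch below picks a match threshold from past review decisions. It is a toy stand-in, not a description of any vendor's feature: the function name `learn_threshold` and the input format (a list of similarity score and analyst decision pairs) are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def learn_threshold(reviewed: List[Tuple[float, bool]]) -> Optional[float]:
    """Choose the score cutoff that disagrees least with the analyst.

    `reviewed` holds (similarity score, analyst said "match") pairs.
    A pair is predicted to be a match when its score >= the cutoff.
    Returns None when there are no reviewed pairs to learn from.
    """
    best_cutoff: Optional[float] = None
    best_errors: Optional[int] = None
    # Only observed scores need to be tried as candidate cutoffs.
    for cutoff in sorted({score for score, _ in reviewed}):
        errors = sum((score >= cutoff) != is_match for score, is_match in reviewed)
        if best_errors is None or errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff


if __name__ == "__main__":
    decisions = [(0.72, False), (0.75, False), (0.78, False), (0.81, True), (0.85, True)]
    print(learn_threshold(decisions))
```

Each batch of analyst decisions can then tighten the cutoff, so that fewer pairs need manual review over time. A real implementation would presumably learn from individual field-level corrections rather than a single score, but the feedback loop is the same.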

I suspect that some of the vendors are already implementing some of these capabilities, although I have yet to see a true integration of any kind of knowledge discovery or machine learning applied to data quality analyst processes. Still, I am confident that as the minute name and address differences that will clog up either an EII or CDI project become more acute problems, data quality tool vendors will take on these challenges to gain the competitive advantage.


David Loshin is the president of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of Enterprise Knowledge Management - The Data Quality Approach (Morgan Kaufmann, 2001) and Business Intelligence - The Savvy Manager's Guide and is a frequent speaker on maximizing the value of information. Loshin may be reached at loshin@knowledge-integrity.com.
