Portals eNewsletters Web Seminars dataWarehouse.com DM Review Magazine
DM Review | Covering Business Intelligence, Integration & Analytics
   Covering Business Intelligence, Integration & Analytics Advanced Search

View all Portals

Scheduled Events

White Paper Library
Research Papers

View Job Listings
Post a job


DM Review Home
Current Magazine Issue
Magazine Archives
Online Columnists
Ask the Experts
Industry News
Search DM Review

Buyer's Guide
Industry Events Calendar
Monthly Product Guides
Software Demo Lab
Vendor Listings

About Us
Press Releases
Advertising/Media Kit
Magazine Subscriptions
Editorial Calendar
Contact Us
Customer Service

Data Integration:
The Common Problem Working with Merge/Purge and Householding

online columnist Greg Mancuso and Al Moreno     Column published in DMReview.com
July 8, 2004
  By Greg Mancuso and Al Moreno

Of all the issues related to accumulating and identifying unique data during a data warehouse implementation, perhaps the single most difficult area to control centers around data quality issues. And perhaps the most difficult task that must be accomplished once the data is cleansed and deemed "good" centers around the topic of how to do merge/purge or householding. Householding consists of grouping like data from multiple sources. It is the act of uniquely identifying a set of data records from one source system, such as order entry, and how they are related to another set of data records from the company's order fulfillment system. Householding is perhaps the single most major effort during the ETL process, and it is a task that many companies will underestimate. It takes significant time and effort to accomplish, and it takes even more effort to insure that it is done correctly.

Householding is the process of identifying related information from multiple sources and storing it in the data warehouse such that all related information about a given party is accessible when looking at any information about the identified party. A key point is that householding does not simply refer to the matching up of data about a family. The need for householding began with a requirement for clear and concise marketing data. As such, the goal was to identify all individuals that reside in the same physical residence, hence the name householding. A marketer could now send one piece of mail to the household, which saved considerable dollars in postage and mail preparation costs. Today, householding refers to any grouping of information about a given person, family, household or company. The need today for accurate identification of the same person, family, household or company across all systems is significantly greater than when householding was first identified as a required business rule for data migration. Today's marketing programs target people at many levels - as an individual, as head of household or as a member of a given group of people (i.e., club, residence and/or business).

When first considering the effort to perform householding, it is easy to assume the effort will be fairly insignificant. Nothing could be further from the truth. Before getting into specific issues for householding, let's take a quick look at the checklist of necessary tasks for any form of householding. First, all data in the data warehouse must be scrubbed and cleansed. Secondly, you must come up with standardization rules for matching, including exact data attributes to be used for matching. Third, run the match logic on the incoming records and check the data in the warehouse. Sounds easy, right? For the purposes of this column, we'll assume all data is clean.

Let's look specifically at householding in terms of grouping data from multiple sources about a given individual. Everyone has a social security number (SSN), and every SSN is a unique identifier. Therefore, match people by SSN and you are done. It would be except that historically very few systems capture and store SSN for a person. And those that do rarely validate the SSN against any official sources. When householding a given person's data from many sources, it is clear that the effort is much greater than first anticipated. Unless, of course, you are using data sources that have all been written in the past five years or so which have all been standardized to require the entry of SSN. Unfortunately, it is probably safe to assume a more likely scenario. While some source systems carry the SSN, because none of the records are standardized, we have to dig a little deeper for the additional identifying information.

Other common data elements are: surname, given name(s), other ID numbers, address fields, phone number(s) and employer's name. We include ID number in the list as some companies may have had the foresight to assign a unique identifier to a person's data records and carried this number forward to all systems. It is unlikely to be any more reliable than SSN, but it is worth investigating. Can anyone identify the first concern with the list of candidate matching elements? Remember, we said to assume all the data was clean. But, a clean U.S. address is simply one that follows valid address form - number, street, apartment, city, state and ZIP. Likewise, a valid U.S. phone number contains 11 digits - country code (1), area code, number. Further, it may include an optional extension that is often stored in the same database column as the number itself. The issue is one of name and access standardization.

Standardization must occur at every candidate data element, including name. Look at our names - Greg Mancuso and Al Moreno. Every day we receive letters addressed to Greg Mancuso, Gregory Mancuso, G Mancuso, Al Moreno, Albert Moreno and A Moreno. When one compares Greg Mancuso and Gregory Mancuso, will these two names be considered a match? More often than not, an automated routine will say "no," but a human being looking at the same names will likely say, "Maybe, but I need more information." The key point here is that business rules must be set that will guide the householding routines on what is a match and what is not.

The same holds true for address fields. In many ways, addresses are more difficult to standardize. The hardest part about standardizing addresses is to ensure the elements are uniquely identifiable. That is, the address is not stored in a 100-byte freeform text field in the database. Once the address elements - there is an attribute for house number, for street name, for city, state and ZIP - are stored in discrete fields, there are a variety of ways to ensure the addresses are standardized.

Now, assuming all candidate elements are cleansed and standardized it becomes a much easier task to identify matching elements. Business rules may be put in place to guide the logic and determine how closely all elements must match for the records to be considered a match. For example, in a householding routine with a business rule of 100 percent match or there is no match, the following two records would not match: Greg Mancuso of 123 Main St., City, ST 12345-1234 and Gregory Mancuso of 123 Main St., City, ST 12345-1234. On the other hand, a business rule of last name and address must match or no match would identify the above records as a match and merge the detail information into the same person in the target data warehouse

From this simple example it becomes clear that householding is a difficult task that must be dealt with during the ETL process. It is also a necessary task. The cost savings to a marketing group alone will likely pay for the effort with the first couple of mail campaigns, especially given the cost of postage and preparation of mail pieces that are sent to prospects today. But, it is not just marketing that benefits - all groups within an organization stand to gain. Customer service is better able to get a complete 360-degree view of their customer base thus providing an improved CRM prospective; sales can better understand the true cost of sale and cost of acquisition (how many pieces, how many phone calls, etc.); and management can get a clear understand of the key financial costs related to the various groups when looking at cost of service, cost of sale, cost of product by prospect, cost of product by customer, etc.

While this column deals with some basic rules of householding, future columns will discuss some alternatives for performing householding and delve further into the issues surrounding householding activities.


For more information on related topics visit the following related portals...
CRM, Data Quality and Database Marketing.

Greg Mancuso and Al Moreno are principals with Sinecon, a business intelligence consultancy specializing in data integration and BI/DW solution architecture design. Together they have more than 29 years of data warehouse and business intelligence experience and have implemented many large-scale solutions in both the U.S. and European markets. They may be reached at gmancuso@sinecon-llc.com or amoreno@sinecon-llc.com.

Solutions Marketplace
Provided by IndustryBrains

Customer Relationship Management for IT
Web-based CRM and more with Autotask: Great business management software optimizes resources and track billable project and service work. Get a demo, then try it free with sample data. Click here for your free trial!

Numara Track-It! Help Desk & CRM Software
Numara IT Solutions provides Track-It! - the leading help desk software solution for employee & customer self-help, call tracking, problem resolution, remote control, asset management, LAN/PC auditing, and electronic software distribution. Free demo

Stop Bad Data from Entering Your CRM Database
Protect against fraud, waste and high marketing costs by cleaning your customer database of inaccurate, incomplete or undeliverable addresses. Add on phone checking, name parsing and geocoding as needed. FREE trial of Data Quality dev tools here.

ACT! CRM Helps You Help Clients.
Discover how ACT! Database automation can make your life easier. Learn More.

Help Desk On-Demand
CRMdesk automates online customer support and knowledge base creation through the Internet. It is a highly efficient technology solution for companies looking to dramatically improve quality and reduce costs of Web-based customer support.

Click here to advertise in this space

E-mail This Column E-Mail This Column
Printer Friendly Version Printer-Friendly Version
Related Content Related Content
Request Reprints Request Reprints
Site Map Terms of Use Privacy Policy
SourceMedia (c) 2006 DM Review and SourceMedia, Inc. All rights reserved.
SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.