Building Business Intelligence:
The following column is excerpted from the white paper, "The Perfect Match: 7 Steps to a Match" by Cory Shouse. For a copy of the full paper, please visit www.csiwhq.com/news/whitepaper_requests.asp.
In part 1, we covered the three preparation steps for matching our data to third-party data. In part 2, we cover the steps required to actually perform the match.
Once the data is ready to be matched, it is important to understand latency and physical architecture, and to have a well-defined workflow and scoring system in place so the master data cleansing process remains ongoing.
Real Time versus Batch
Weigh your business requirements carefully when assessing the latency of a match. The first consideration is real time versus batch. For example, if the requirement is to set a credit limit, a real-time match may be critical to get a true assessment of total customer spend; however, if the requirement is to generate a deduped mailing list for shipping a new catalog, a one-time batch match may suffice. The cost of these two options can differ substantially when using a third party's product.
On Site versus Off Site
In addition to real time versus batch, on site versus off site must also be considered. Many vendors offer an updated reference file to keep on site at your location. You can use the file without ever having to send the vendor a file or integrate the matching process directly with the vendor's database. However, that on-site file is only as good as the last update received from the vendor. You need to consider how much decay you can afford to live with (i.e., should you receive monthly, quarterly or yearly updates of the file?).
Manual versus Automated
Another element to consider is whether to use a manual or an automated match. Hiring temporary staff to "eyeball" records across different systems using common elements is a cost-effective way of matching records; however, it introduces the human element, and the method is only as good as the staff performing the match. Where you are dealing with millions of records (and new records arriving at a high rate), automating the process may be the right answer. A hybrid approach is also common: an automated system will have cases in which it cannot find a match, requiring a manual review of the records before determining the appropriate path.
Scoring the Match
How do you know if a match exists? The best approach is to sample a set of records and evaluate the results. Start by comparing the standardized elements between the two records, one element at a time. For example, if we match the address values from our system with those of a vendor, we may apply a grade such as that shown in Figure 1.
Figure 1: Example of a Match Grading
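Since Figure 1 is not reproduced here, the following is a minimal sketch of element-by-element grading, assuming a simple letter scheme (A for an exact match, B for a partial match, F for no match). The element names, grading rules and sample values are illustrative assumptions, not the paper's actual grading table:

```python
# Hypothetical element-by-element grading of two standardized address
# records. The grade scheme (A = exact, B = partial, F = no match) is
# illustrative only -- real vendors define their own scales.

def grade_element(ours: str, theirs: str) -> str:
    """Grade one standardized element from each record."""
    a, b = ours.strip().upper(), theirs.strip().upper()
    if a == b:
        return "A"          # exact match
    if a.split() and b.split() and a.split()[0] == b.split()[0]:
        return "B"          # same leading token, e.g., same street number
    return "F"              # no match

def grade_record(ours: dict, theirs: dict) -> dict:
    """Apply the grade to every element the two records share."""
    return {k: grade_element(ours[k], theirs[k]) for k in ours if k in theirs}

ours = {"street": "100 Main St", "city": "Dallas", "zip": "75201"}
theirs = {"street": "100 Main Street", "city": "Dallas", "zip": "75201"}
print(grade_record(ours, theirs))
# → {'street': 'B', 'city': 'A', 'zip': 'A'}
```

Note that the grades only become meaningful after standardization (step covered in part 1): "St" versus "Street" should ideally be normalized away before grading.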
After grading each element, we need to score the record as a whole. D&B calls this score a "confidence code," a 1-to-10 scale indicating the probability of a match. For example, after performing a match, we may see results as shown in Figure 2.
Figure 2: Example of Matching Results
Figure 2 shows that record 1 is a perfect match while record 5 is an absolute no match. The difficulty comes in determining what to do with the records in between. After performing this same analysis on a large sample, we may come to the conclusion that records with a confidence code greater than or equal to 8 will be flagged as a match, records with a confidence code less than 5 are an absolute no match, and those in between will be flagged for manual review.
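The thresholds described above can be sketched directly in code. This is a minimal illustration, assuming the confidence codes have already been produced by the matching engine; the sample record IDs and codes are invented, not taken from Figure 2:

```python
# Route records by confidence code using the thresholds from the text:
# >= 8 is an automatic match, < 5 is an automatic no-match, and anything
# in between is queued for manual review. The confidence codes themselves
# come from the matching engine (e.g., D&B's 1-to-10 scale).

def route(confidence_code: int) -> str:
    if confidence_code >= 8:
        return "match"
    if confidence_code < 5:
        return "no_match"
    return "manual_review"

records = {1: 10, 2: 8, 3: 6, 4: 5, 5: 1}   # record id -> confidence code
decisions = {rec_id: route(code) for rec_id, code in records.items()}
print(decisions)
# → {1: 'match', 2: 'match', 3: 'manual_review', 4: 'manual_review', 5: 'no_match'}
```

Where exactly to draw the two cutoffs is the business decision: lower the manual-review band and you save labor but accept more bad auto-matches; widen it and review costs grow.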
Taking into account all the factors discussed thus far, the pieces must now be put together into a well-defined, repeatable process for performing the match. When defining your workflow, understand how quickly your files will decay and what latency the match requires. Figure 3 is one example of a workflow supporting a match across an enterprise resource planning (ERP) system, a customer master file and a vendor's external reference file, taking these same factors into account.
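Figure 3 itself is not reproduced here, but a batch workflow of this general shape might be sketched as follows. Every function name and data shape below is an assumption for illustration, not the paper's actual workflow, and it presumes the vendor's reference file has already been standardized:

```python
# A hypothetical end-to-end batch workflow: pull records from the customer
# master, standardize them, match each against a vendor reference file,
# score the candidates, and route the result by confidence code.
from __future__ import annotations

def standardize(record: dict) -> dict:
    """Uppercase values and collapse whitespace (a stand-in for the
    real standardization step covered in part 1)."""
    return {k: " ".join(str(v).upper().split()) for k, v in record.items()}

def best_match(record: dict, reference: list[dict]) -> tuple[dict | None, int]:
    """Return the reference record sharing the most equal elements,
    scaled to a crude 1-to-10 confidence code."""
    best, best_score = None, 0
    for candidate in reference:
        shared = [k for k in record if k in candidate]
        if not shared:
            continue
        hits = sum(record[k] == candidate[k] for k in shared)
        score = round(10 * hits / len(shared))
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

def run_batch(customer_master: list[dict], reference: list[dict]) -> list[dict]:
    """Standardize, match, score and route every customer record."""
    results = []
    for raw in customer_master:
        record = standardize(raw)
        candidate, code = best_match(record, reference)
        status = ("match" if code >= 8
                  else "no_match" if code < 5
                  else "manual_review")
        results.append({"record": raw, "candidate": candidate,
                        "code": code, "status": status})
    return results
```

A real pipeline would add the feedback loop the column emphasizes: records routed to manual review feed corrections back into the standardization and scoring rules.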
Perfecting the match takes time and recalibration of the process. Remember playing the game Memory as a kid? My four-year-old daughter loves playing it and enjoys winning by getting the most matched cards. The game itself demonstrates just how difficult matching can be. It is almost assured that a match will not be achieved when the initial two cards are turned over. However, as the players begin to uncover other cards and learn what to look for and where to look, the probability of a match increases until all cards are accounted for with a match. This same concept applies when we begin to match our customer, vendor and product master files across different systems. Continually reevaluate the status of your matching process and the results of your scoring system. Repeat steps 1 through 6 until the findings show that the matching process is working and results are positive.
One hundred percent match success is not guaranteed. Numerous factors must be considered and documented when designing your own process and pursuing success in your master data match approach. However, for a match made in heaven, you must properly prepare the match, perform the match and perfect the match.
William wishes to thank Cory Shouse for his contribution to this month's column.
Cory Shouse is a senior architect with Conversion Services International. With more than 10 years of experience in business intelligence, Shouse specializes in helping companies establish, organize and deliver value to the business. He has assisted a number of Fortune 500 companies in defining quality assurance programs, organizational and staffing plans, change control procedures, and appropriate information and technical architectures. He may be reached at firstname.lastname@example.org or (469) 939-5385.
William McKnight has architected and directed the development of several of the largest and most successful business intelligence programs in the world and has experience with more than 50 business intelligence programs. He is senior vice president, Data Warehousing for Conversion Services International, Inc. (CSI), a leading provider of a new category of professional services focusing on strategic consulting, data warehousing, business intelligence and information technology management solutions. McKnight is a Southwest Entrepreneur of the Year Finalist, keynote speaker, an international speaker, a best practices judge, widely quoted on BI issues in the press, an expert witness, master's level instructor, author of the Reviewnet competency exams for data warehousing and has authored more than 80 articles and white papers. He is the business intelligence expert at www.searchcrm.com. McKnight is a former Information Technology Vice President of a Best Practices Business Intelligence Program and holds an MBA from Santa Clara University. He may be reached at (214) 514-1444 or email@example.com.