Published in DM Review in July 2003.
Plain English About Information Quality: Defining and Measuring Accuracy
by Larry English
I have been amazed in recent months to find how many people have an inaccurate understanding of the information quality (IQ) characteristic of "accuracy" and how to measure it. I am not referring to practitioners, but to consultants, authors and educators who write and teach about it.
In this column, I address how to define accuracy, measure accuracy, design accuracy measurement tests and solve accuracy problems.
Accuracy is one of the most fundamental and important of all IQ characteristics. Without accuracy of data values, some processes may operate acceptably, but other processes will fail. The meaning of accuracy is, or should be, crystal clear.
Information, whether electronic or on paper, is simply a representation of real world objects or events. Data elements hold values that are facts that represent some attribute of a real world object or event. Therefore, the definition is: Accuracy is the degree to which data correctly reflects the real world object or event being described.1
Either the value of the attribute is correct or it is not. It is that simple. Some analog attributes, such as the weight or latitude/longitude of an object, may be correct only within some allowable variation or tolerance, but that tolerance is a measure of a different IQ characteristic, the "precision" of the value. For example, the U.S. official time provided by the United States Naval Observatory and the National Institute of Standards and Technology via www.time.gov is accurate to within 0.2 seconds. At the exact moment the screen displayed 11:49:00 CDT, the real U.S. official time could have been anywhere from 11:48:59.8 CDT to 11:49:00.2 CDT. For data whose correctness is binary, however, such as birth date and product (selling) price, the value is simply right or wrong when compared to the object or event.
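The distinction between a tolerance check and a right-or-wrong check can be sketched in a few lines of Python. This is only an illustration: the function names `within_tolerance` and `exactly_correct` are hypothetical, and the 0.2-second tolerance is taken from the time example above.

```python
from datetime import date


def within_tolerance(recorded: float, actual: float, tolerance: float) -> bool:
    """Analog value: correct if it lies within the stated tolerance of the real value."""
    return abs(recorded - actual) <= tolerance


def exactly_correct(recorded, actual) -> bool:
    """Right-or-wrong value: correct only if it equals the real-world value."""
    return recorded == actual


# Displayed time versus real time, as offsets in seconds (hypothetical readings).
print(within_tolerance(0.0, 0.15, tolerance=0.2))   # inside the 0.2 s tolerance
print(within_tolerance(0.0, 0.30, tolerance=0.2))   # outside the 0.2 s tolerance

# A birth date has no tolerance: a date ten days off is simply wrong.
print(exactly_correct(date(1970, 5, 1), date(1970, 5, 11)))
```

In both cases the second argument is the real-world value, which still has to come from observing the object or event itself.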
Kaoru Ishikawa, the great Japanese quality guru who gave us the fishbone diagram tool we use for cause-and-effect analysis, also gave us the key to measuring "accuracy." When you take a sample of manufactured goods from an entire lot of products produced, you measure its quality by comparing the characteristics of the product to the data (the product specification data).2 Therefore, for physical manufactured object quality, you measure the object and compare the measurements to the data.
You measure data accuracy the other way around, by comparing the data values to the real world object or event. Accuracy of nearly all business attributes, such as person name, birth date and marital status, cannot be measured electronically with software. It can only be measured by going to the object itself, or to an observation or recording of an event, to confirm that the data values match the actual characteristic of the object or event.3
Some examples illustrate this. When in London, Diane and I frequently attend classical concerts. On a trip a few years ago, we noticed in Time Out (a weekly calendar of events) that Placido Domingo, the great tenor, was singing that Friday. Diane purchased tickets from the local ticketing service. When we arrived at the Royal Albert Hall on Friday to pick up the tickets, the hall was strangely silent. Only after we entered did we find out the concert had been on Thursday, the night before. The concert date in the calendar of events was not accurate. The date listed was a valid and reasonable date, but we missed the concert regardless. In another example, an assessment of 2,000 persons found no invalid values for marital status. However, when the persons were contacted, 23.3 percent – nearly one out of four – of those valid values of marital status were not accurate.
One technique for attempting to measure "accuracy" is to compare data to other reference data, such as postal address or change of address data, or other transaction data collected by third-party information sources. Technically, however, this does not measure accuracy as it is reflected in the real world object, but as reflected in some reference or surrogate source "considered" to be accurate. The reliability of this assessment therefore depends on the accuracy of the data in that reference source. You must know the reference data accuracy level to understand the confidence level and bounds (margin of error) in the accuracy level of your own data. One of my students used a data cleansing service to "cleanse" name and address data against such reference data. Afterward, a physical accuracy assessment of the results showed that 12 percent of the "cleansed" addresses still had inaccuracies, from address number errors to people no longer living at stated addresses.
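To see why the reference source's accuracy matters, consider a rough worst-case calculation, written here as a short Python sketch. It is not a method from this column. It assumes that two values match whenever both are correct, and that a correct value never matches an incorrect one. The function name `accuracy_bounds` and the rates used are hypothetical.

```python
def accuracy_bounds(match_rate: float, reference_accuracy: float) -> tuple[float, float]:
    """Worst-case bounds on your data's accuracy from a comparison to reference data.

    match_rate: fraction of your records that agree with the reference source.
    reference_accuracy: fraction of the reference records that are truly correct.
    """
    for value in (match_rate, reference_accuracy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("rates must lie between 0 and 1")
    # Records that are correct in the reference but disagree with yours must be
    # wrong in your data, so at least (match + reference - 1) of yours are right.
    lower = max(0.0, match_rate + reference_accuracy - 1.0)
    # Your accuracy cannot differ from the reference's by more than the mismatch
    # rate, and it cannot exceed the matches plus the reference's own errors.
    upper = min(1.0, 1.0 - abs(match_rate - reference_accuracy))
    return lower, upper


# Hypothetical figures: 95 percent of records match a reference that is 90 percent accurate.
low, high = accuracy_bounds(0.95, 0.90)
print(f"true accuracy lies somewhere between {low:.2f} and {high:.2f}")
```

With a perfectly accurate reference the two bounds coincide at the match rate. As reference accuracy falls the range widens, which is why a match rate alone says little about accuracy.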
The message is clear. Validity, meaning conformance to a set of valid values and to defined business rules, can be measured electronically. Accuracy, however, cannot be measured electronically; it can only be measured through physical inspection.
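A minimal Python sketch makes the difference concrete. The record layout, the set of valid values and the `confirmed` dictionary are all hypothetical. The point is that the validity test needs only the data, while the accuracy test needs values confirmed outside the database, for example by contacting each person.

```python
# Hypothetical domain of valid values for the marital status attribute.
VALID_MARITAL_STATUS = {"single", "married", "divorced", "widowed"}


def validity_rate(records: list[dict]) -> float:
    """Electronic test: fraction of records whose value is in the valid-value set."""
    if not records:
        raise ValueError("no records to assess")
    valid = sum(1 for r in records if r["marital_status"] in VALID_MARITAL_STATUS)
    return valid / len(records)


def accuracy_rate(records: list[dict], confirmed: dict[int, str]) -> float:
    """Physical test: fraction of verified records whose value equals the confirmed value.

    confirmed maps a record id to the value obtained by contacting the person.
    """
    checked = [r for r in records if r["id"] in confirmed]
    if not checked:
        raise ValueError("no records were physically verified")
    correct = sum(1 for r in checked if r["marital_status"] == confirmed[r["id"]])
    return correct / len(checked)


records = [
    {"id": 1, "marital_status": "married"},
    {"id": 2, "marital_status": "single"},
    {"id": 3, "marital_status": "married"},
    {"id": 4, "marital_status": "divorced"},
]
# Values confirmed by contacting each person (illustrative only).
confirmed = {1: "married", 2: "married", 3: "married", 4: "divorced"}

print(validity_rate(records))             # 1.0: every value is a valid value
print(accuracy_rate(records, confirmed))  # 0.75: record 2 is valid but not accurate
```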
Measurement of accuracy is complicated and expensive, but it must be done. Accuracy (to reality) tests require physical comparison of the data to the real world object or event, so the form of the test depends on the category of object or event. People must be contacted via telephone, mail or e-mail. For physical objects, one must extract samples and measure them. For locations, one must survey and inspect the actual location. Events must be observed in real time (this measures the current process effectiveness) or recorded so the data can be confirmed. For example, measuring the accuracy of a medical insurance claim requires a qualified person to review the actual patient file at the medical provider's office. For details on how to measure accuracy, see Improving Data Warehouse and Business Information Quality, pp. 182-188.
For the best of both worlds, measure validity electronically to exploit the efficiency of electronic tests. In addition, design accuracy tests and apply them to a small yet statistically valid sample, so that you measure accuracy and know the difference between the validity and the accuracy of the data. You may conduct accuracy assessments less frequently to reduce costs, but you must conduct them periodically.
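One way to work with such a sample is sketched below in Python. The sketch draws a simple random sample of record ids for physical verification, then puts a confidence interval around the accuracy rate found in the sample. The Wilson score interval is my choice of technique here, not something prescribed by this column, and the function names, the sample size of 400 and the count of 340 accurate records are all hypothetical.

```python
import math
import random


def draw_sample(record_ids, sample_size: int, seed: int = 0) -> list:
    """Pick a simple random sample (without replacement) of record ids to verify."""
    rng = random.Random(seed)  # fixed seed so the selection can be reproduced
    return rng.sample(list(record_ids), sample_size)


def wilson_interval(correct: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% confidence."""
    if sample_size <= 0 or not 0 <= correct <= sample_size:
        raise ValueError("need 0 <= correct <= sample_size and sample_size > 0")
    p = correct / sample_size
    z2 = z * z
    denom = 1 + z2 / sample_size
    center = (p + z2 / (2 * sample_size)) / denom
    half_width = z * math.sqrt(p * (1 - p) / sample_size + z2 / (4 * sample_size ** 2)) / denom
    return center - half_width, center + half_width


# Select 400 of 10,000 records for physical verification (hypothetical sizes).
to_verify = draw_sample(range(10_000), 400)

# Suppose inspection confirms 340 of the 400 sampled records as accurate.
low, high = wilson_interval(correct=340, sample_size=400)
print(f"sample accuracy 85.0%, interval roughly {low:.1%} to {high:.1%}")
```

A larger sample narrows the interval at higher inspection cost, which is the trade-off behind "small yet statistically valid."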
When you report your assessment findings, you must differentiate between validity and accuracy assessments and label each correctly. If knowledge workers misinterpret a measure of validity as a measure of accuracy, they may have false expectations of the quality of their data.
The real solution to information quality problems is to conduct root-cause analysis on the types of problems you have, and then implement process improvements to eliminate recurrence of defective data. Implement processes that minimize information quality decay. Decay is the phenomenon in which characteristics about real world objects change without being updated in your database.
A contributing cause of our missing the Placido Domingo concert was that we did not verify the date of the concert when we ordered the tickets. The defect-prevention technique we now use is to verify the information (date, time and location) with the source to assure the information we have is accurate.
What do you think?
Larry P. English is president and principal of INFORMATION IMPACT International, Inc., Brentwood, Tennessee, and the author of the widely acclaimed book, Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. English is cofounder of the International Association for Information and Data Quality (www.iaidq.org). English is an internationally recognized speaker, teacher, consultant and author and may be reached at email@example.com or through his Web site at www.infoimpact.com.
Copyright 2005, Thomson Media and DM Review.