Analyzing Domain Statistics
Each day organizations around the world acquire information and enter it into their transaction processing system. That information is pushed, pulled, teased and twisted for a myriad of business purposes. For many organizations, its final destination is the data warehouse. Using the data warehouse, analysts take that information, format it and give it to their managers. Those managers use the analysis along with their instincts and make decisions which help the organization serve customers better, make the operation more efficient and maintain the profitability of the organization.
What if the information is wrong? The managers make the wrong decisions, those decisions propel the organization in the wrong direction, and goals are not met. This can lead to new managers, new analysts and new information technology team members.
In his article "The High Costs of Low Quality Data" (DM Review, January 1998), Larry English asks, "If the state of quality of your company's products and services was the same level of quality as the data in its databases, would your company survive?"
Does your organization realize the importance of data quality? Does anyone within your organization have a goal to improve or even measure the quality of the information that you use? If you had to measure data quality, would you know how? Recently I was assigned to attack this very problem. My team's approach involved four fundamental steps:

1. Identify the most important data.
2. Develop a method to measure data quality.
3. Report the results.
4. Automate quality assurance.
This article will review each of these steps and examine how we sought to improve the quality of the data. Because our project had a fixed time line and a fixed budget, our decisions were based on achieving some level of success within those constraints.
Identifying Important Data
The task of identifying important data is actually quite simple. The first step involves determining the subject-matter experts for the organization. These would be the analysts, the experts and the people who use the data on a daily basis. They know inherently what is important to them, and cumulatively they know what is important to the organization.
From each of these subject-matter experts, we compiled a list of the dimensions and facts that they used to measure the business. What we found was that many of the requirements of each subject-matter expert overlapped with those of the others. The resulting consolidated list was then reviewed by all of the subject-matter experts in a series of meetings.
The result was a list of the most important dimensions and facts within the organization.
Developing a Method to Measure Data Quality
After establishing a list of important facts and dimensions, we developed a method to measure the quality of the data. Early on in the project, we researched and discussed different methods to measure data quality. From our research, we determined that the best method to measure quality was a source-to-destination verification of a statistically valid sample. However, we had a number of issues with this method. The transformation process was incredibly complex, the source documents would be difficult to obtain and we had limited resources.
Another method was to validate the accuracy of the data as it came from the operational system to the data warehouse. This accuracy checking would at least verify that the data in the warehouse matched the data in the operational system. Unfortunately, we still had to replicate a complex transformation, and we still had limited resources. After researching and carefully reviewing a plan to perform accuracy checking, we determined it was beyond the scope of our project. So what could we do?
Analyzing Domain Statistics
Analyzing domain statistics reminds me of a cartoon. A group of students is sitting in an astronomy classroom waiting for the professor. One of the students asks what the difference is between astrology and astronomy. The professor answers, "A whole lot of math." In the next frame, the professor is alone in the classroom.
The two different methods of measuring data quality described earlier involved referencing the source data in comparison to the destination data to determine data quality. Analyzing domain statistics uses a different approach. This method involves looking at the domain of each column over time to determine if it is consistent. This will not verify the accuracy of the data. It will, however, show us patterns that may indicate a problem. At a minimum, the statistical information will expose characteristics of the data that indicate its stability and usability. Since this method does not involve replicating the transformation or acquiring the source documents, it became the only method that was practical for our project.
Domain statistics are statistics derived from the data within a specific domain. For example, if we count the number of rows where the data in a column is blank, we can determine the percentage of blank rows. We can also determine the number of distinct values in the column. For numeric columns, we can compute an average, minimum and maximum, to name a few. These statistics can then be compared across different periods. Using this trend analysis, we can graph the changes that happen within the data. These changes are at the heart of domain statistic analysis.
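Per-column, per-period measures of this kind are simple to compute. As a rough sketch (the column layout, period labels and helper names here are illustrative, not from the project), the domain statistics for one column might look like:

```python
# Sketch: domain statistics for a single column, assuming each row of
# input is a (load_period, value) pair. Illustrative only.
from collections import defaultdict

def domain_stats(rows):
    """Compute blank percentage, distinct count and min/avg/max
    for one column, grouped by load period."""
    by_period = defaultdict(list)
    for period, value in rows:
        by_period[period].append(value)
    stats = {}
    for period, values in sorted(by_period.items()):
        blanks = sum(1 for v in values if v in (None, ""))
        numeric = [v for v in values if isinstance(v, (int, float))]
        stats[period] = {
            "pct_blank": 100.0 * blanks / len(values),
            "distinct": len(set(values)),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
            "avg": sum(numeric) / len(numeric) if numeric else None,
        }
    return stats

sample = [("1997-01", 5), ("1997-01", ""), ("1997-02", 5), ("1997-02", 7)]
print(domain_stats(sample)["1997-01"])
```

Graphing one of these statistics (percent blank, say) across load periods produces exactly the trend pictures discussed below.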
Gathering data is an essential part of any analysis. In our project, we had to determine how much data was necessary. Initially we thought we would examine all of the data. We quickly determined that gathering the data would take a lot of time, and we would be left with little time to do the actual analysis.
In addition, the purpose of our investigation required only a relative level of precision. We were not striving to correct or even find every deficiency. We wanted to ensure that the processes that extracted, transformed and loaded the data were consistent. We determined that by using statistically valid samples, we could determine the consistency of the process. By using standard statistical equations, we determined that a sample of 67,000 rows would give us an error of less than .5 percent.
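For context, a sample of roughly 66,000 to 67,000 rows is what the standard sample-size formula for a proportion yields at a 99 percent confidence level and a 0.5 percent margin of error; the text states only the margin, so the confidence level here is my assumption:

```python
# Standard sample-size formula for estimating a proportion:
# n = z^2 * p * (1 - p) / e^2, with p = 0.5 as the worst case
# (maximum variance). The 99% confidence level is an assumption.
import math

def sample_size(z, margin, p=0.5):
    """Minimum sample size for a proportion at margin of error e."""
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

# z = 2.576 for 99% confidence, margin of error = 0.5%
print(sample_size(2.576, 0.005))  # → 66358
```

The result is consistent with the 67,000-row figure above.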
We had decided to analyze domain statistics, and we knew that we would use the load months as the periods. This made sense because the greatest potential for data corruption in our system occurred in the complex transformation between the operational system and the data warehouse. Since the data warehouse was loaded on a monthly basis, those loads correlated directly with the transformation. It seemed likely that we would detect data corruption more readily if we stratified our sample by load month.
Reporting the Results
The theory behind our analysis is that even at the micro level the business has an inherent stability that tends to prevent sudden and dramatic changes. Therefore, when something changes quickly or dramatically, it probably should be investigated, even if it is a positive change. The investigation may reveal that this is indeed a change in the underlying data and truly reflects the business. It may also reveal a flaw in a transformation program or in data acquisition.
Graphing the Domain
The best method to detect sudden and dramatic changes is to graph the domain statistics. These graphs become windows into the data, showing patterns that help us detect when things are not stable. Using further statistical processes, we can discover patterns that should be investigated. For an illustration, please examine Graph 1.
The actual data values are the basis for all of the other lines in the graph. The actual data values are taken from the data warehouse and correlate to the data loads. They show a domain statistic (distinct values in this case) for the period January 1997 to February 1999. The other three lines that run in parallel are derived from the actual data values. The centerline is the linear regression for the actual data values and is called the trend line. The trend line is important because it helps us establish the upper and lower boundaries. These boundaries are based on the standard deviation for the list of actual values. By using specific multiples of the standard deviation, we can detect different types of aberrations in the data that require further investigation.
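A minimal sketch of this construction follows: a least-squares trend line over the monthly statistic, with upper and lower boundaries set at a multiple of the standard deviation of the actual values. The multiplier `k` is an assumption; the article does not state which multiples the project used.

```python
# Sketch: trend line (linear regression) plus boundaries at k standard
# deviations of the actual values. Pure stdlib; k = 2.0 is an assumed
# multiplier, not the project's documented choice.
from statistics import pstdev

def trend_and_bounds(values, k=2.0):
    """Return (trend, upper, lower) lists for a series of monthly
    domain-statistic values (needs at least two points)."""
    n = len(values)
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    trend = [intercept + slope * x for x in xs]
    band = k * pstdev(values)          # one band width for the series
    upper = [t + band for t in trend]
    lower = [t - band for t in trend]
    return trend, upper, lower

# e.g. distinct values per load month
trend, upper, lower = trend_and_bounds([52, 55, 54, 58, 57, 60, 59, 61])
```

Plotting the actual values against these three derived lines reproduces the style of graph described above.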
The simplest and most common sort of data aberration is the spike. This is a single instance where the data is outside the upper or lower boundary. If we examine Graph 2, we notice that in November 1997, the percent of values (unknown in this case) is dramatically higher than in all of the other months. This could be because of a transformation problem, a load problem, a data acquisition problem or it may be an accurate representation of the data. The spike tells us this month was different and gives us a reason to investigate.
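A spike check along these lines can be written directly against the series of monthly values. As a simplification, this sketch measures distance from the series mean rather than from the regression trend line:

```python
# Sketch: flag months where the statistic lies more than k standard
# deviations from the mean -- the "spike" aberration. k = 2.0 is an
# assumed multiplier.
from statistics import mean, pstdev

def find_spikes(values, k=2.0):
    """Return indices of values outside mean ± k standard deviations."""
    m, s = mean(values), pstdev(values)
    return [i for i, v in enumerate(values) if abs(v - m) > k * s]

# e.g. percent unknown per month, with one sudden jump
pct_unknown = [1.1, 1.0, 1.2, 9.5, 1.1, 1.0, 1.2, 1.1]
print(find_spikes(pct_unknown))  # → [3]
```

The flagged index tells us which month was different and gives us a reason to investigate.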
Graph 3 presents a very different type of problem. The graph shows a trend beginning in April 1998, where the number of unknown values is increasing. The increase, however, is not dramatic enough to cause a spike until February 1999. This is because the standard deviation is based on how the values change, and when the change is gradual, the standard deviation simply grows with the change. Therefore, for this type of graph we need to understand what is being measured and whether this type of trend makes sense. If it does not make sense, further investigation will be necessary.
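Gradual drifts like this one can be caught with a run rule rather than a boundary test, for example by flagging several consecutive month-over-month increases. The run length of six used below is an arbitrary illustrative threshold, not the project's rule:

```python
# Sketch: detect a gradual upward trend as a run of consecutive
# month-over-month increases. run_length = 6 is an assumed threshold.
def find_drift(values, run_length=6):
    """Return the index where a run of `run_length` consecutive
    increases completes, or None if no such run occurs."""
    run = 0
    for i in range(1, len(values)):
        run = run + 1 if values[i] > values[i - 1] else 0
        if run >= run_length:
            return i
    return None

# e.g. percent unknown creeping upward month after month
print(find_drift([10, 10, 11, 12, 13, 14, 15, 16, 17]))  # → 7
```

A run rule of this kind would have flagged the Graph 3 pattern well before the values finally crossed a boundary.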
Using graphs similar to these, the subject-matter experts and my team reviewed the statistics for each of the "most important" dimensions and facts. Although this effort did not tell us whether the data was accurate, we found numerous examples of things that were inconsistent or illogical. Examples include:
We discovered many other opportunities to improve the quality of the data from this analysis. Although this information did not pinpoint specific problems, it told us where to look.
Automating Quality Assurance
In addition to determining the existing problems, the subject-matter experts and my team also determined expected values for each dimension and fact statistic. Using these tolerances, we discussed monitoring the quality of the data warehouse by evaluating the domain statistics on a monthly basis. The tolerances are based on historical fact, expert opinion and statistical method. The tolerances can be spike indicators, or they can be based on actual values. By using actual values, we could catch a problem that is similar to the problem represented on Graph 3.
After establishing tolerances with the subject-matter experts, my team set upon creating a program to compare a fresh statistical sample with the tolerances on a monthly basis. This program gathered data from the data warehouse, calculated the trends, boundaries and actual values. The program reported the results by dimension and fact, showing the actual value and the tolerance.
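In outline, the monthly comparison reduces to a check of each fresh statistic against its agreed tolerance range. The names and the tolerance structure below are illustrative; the real program read its samples from the data warehouse:

```python
# Sketch: compare each month's fresh domain statistics against the
# tolerances agreed with the subject-matter experts. All names are
# illustrative.
def check_tolerances(fresh_stats, tolerances):
    """fresh_stats: {column: value}; tolerances: {column: (low, high)}.
    Return the columns whose statistic falls outside tolerance,
    with the offending value and its allowed range."""
    violations = {}
    for column, value in fresh_stats.items():
        low, high = tolerances[column]
        if not (low <= value <= high):
            violations[column] = (value, (low, high))
    return violations

fresh = {"customer.pct_blank": 0.4, "region.distinct": 212}
tol = {"customer.pct_blank": (0.0, 1.0), "region.distinct": (45, 60)}
print(check_tolerances(fresh, tol))  # flags region.distinct
```

Anything the check flags becomes a candidate for investigation before the monthly load is accepted.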
Information is the critical component for the success of any organization. Strategically, each organization needs to understand not only what data is important, but also the quality of their data. With high-quality data and effective decision making, organizations can thrive.
The methods to get a glimpse of the quality of the data in the organization are varied. Using domain statistic analysis, one organization was able to identify problems and develop a method to catch them earlier and more effectively.
Nat Bongiovanni is a senior enterprise architect with the Keane e-Architecture Group. He has helped design and implement a number of data warehouses and data marts, and has also consulted with organizations to maximize their technology investments. His e-mail address is Nat_Bongiovanni@Keane.com.