FREE DM Review Site Registration!
Sign-up today and access DM Review on the Web!

Your FREE registration entitles you to:

FREE email newsletters

FREE access to all DM Review content

FREE access to web seminars, resource portals, our white paper library and more!

   

Publisher reserves the right to serve qualified requesters only.

Effective Data Profiling

Data profiling is an item frequently listed as a task on the high-level data warehouse project plan, but rarely completed in the implementation. When it is done, it is often left to the final stage of the business intelligence (BI) lifecycle, when the information delivery group is gearing up to finalize reporting requirements. What data will be available? What will it look like? How does the user want to see it displayed? Doing it at this stage can result in transformations and cleansing routines being pushed to the reporting layer out of convenience if anomalies are found. A discovery of a gap in the availability of data at this point can derail the whole project and cause a chain reaction impacting source systems, database design and extract, transform and load (ETL) processing. This is too late.

 

The process of data profiling is a systematic analysis of the data to ensure its integrity. It should be done on all the source data at the outset of the project and subsequently on the data warehouse tables, both the normalized central repository and the data mart fact and dimension tables.

 

The major ETL suites come packaged with profiling tools that sometimes involve an extra license fee. There are also stand alone solutions available, and it is possible to write SQL queries to produce the same results. While each comes at a cost, the gains can be significant.

 

Effective profiling at the beginning of the project will assist in many aspects of development, including:

 

  • Establishing scope,
  • Identifying domains of values,
  • Identifying data integration issues,
  • Confirming availability of data,
  • Establishing data volumes,
  • Defining transformations,
  • Identifying cleansing issues,
  • Confirming source fields are correctly labeled, and
  • Identifying appropriate data types.

Subsequent profiling on the data warehouse tables will help with:

 

  • Balance and control,
  • ETL unit testing, and
  • Reporting data validation.

There are three levels at which profiling can be conducted. The first is at the table level, giving high-level information about the tables. Where anomalies are apparent, each field can be profiled to reveal further details. Finally, an investigation can be performed to identify special characters, which may cause problems further down the line. We will now turn to a discussion of these three levels of profiling in more detail. The examples presented below are representative of a generic approach rather than a particular application.

 

Table Level

 

The table level profile in Figure 1 provides a look at each table’s fields to determine overall availability and integrity of the data.



Figure 1: Table Level Profile 

The number and percentage of distinct values can be used to confirm the uniqueness of a given field. In the sample below, the unique code (UNQ_CD) field meets the expectation of being 100 percent distinct. Any other result would be anomalous and require further investigation. This column will immediately highlight issues with fields that may contain too many or too few distinct values. For example, a field that functions as a flag should only contain two values, while a field for gender with only one distinct value may be a cause for concern.

 

The number and percentage of null values has flagged that the lookup short description (LKP_SHRT_DESC) field is entirely empty. This is a field that is not available for use and can be removed from the scope of the project. These columns are also particularly effective in quickly testing if an ETL process has successfully populated the table.

 

The minimum and maximum values indicate whether a field contains the correct range of values, or if it holds the expected data at all. Lookup code (LKP_CD) contains both numbers and alpha characters. This would signal that a question should be put to business users. The field may also be related to values in another table. Perhaps there is a discrepancy between the two tables’ ranges. Certainly the minimum value for the lookup description (LKP_DESC) field looks worthy of further investigation.

 

Similarly, the minimum and maximum range of the field lengths can provide clues to the quality of the data and a point of comparison across tables. Are lengths consistent with where they should be? If not, do they fall within an acceptable range?  

 

The total number of rows in the table can be used as a reference point for balance and control and provides the basis for the calculations in the other columns.

 

Field Level

 

A deeper level of investigation involves examining the distribution of values within a field. An analysis of the domain gives a complete breakdown of the cardinality of each value.

 

This technique should be used judiciously. It could be challenging to make sense of a query against a table with 10,000 rows containing 98 percent distinct values. However, for volumes of transactional data, this form of profiling can be informative in determining which values are in use or obsolete. Values with very low cardinality may indicate a problem or an acceptable exception. The field profile is also particularly valuable when it comes to confirming or identifying complete sets of values; for example, for lists of lookup codes or descriptions. These uses are sampled in Figure 2.

 




Figure 2: Values and Descriptions by Code

 

Perhaps a table profile of a lookup table has identified that there are more distinct values in the description field than in the codes. Query 1 reveals that there are two versions of coupon, “manufacturer coupon” and “manufacturer's coupon,” and that the latter is an anomaly in need of cleansing. For values with multiple occurrences, Query 2 shows the breakout of related codes that will help determine the ETL transformation logic.

 

The two occurrences of “percent discount” could be reduced to one and the “amount discount” and “dollar discount” values could potentially be combined into a single value. (But which one is required? Best to check with the business users.)

 



Figure 3: Profile of Transactional Data

 

The profile of the large volume of transactional data in Query 3 may show that some codes are never used, and others may be used that are unaccounted for in the list of codes and descriptions.

 

Character Level

 

At either of the higher levels, it may have been shown that unwanted characters appeared in given fields. It is equally likely that the special characters have remained hidden. They might be caused by human or system errors, or be deemed undesirable because of potential interference with ETL processing or the running of reports. For example, the ampersand and question mark can cause difficulties related to the Web browser. The dollar sign and asterisk are often reserved as wild cards in programming languages. Other characters that are frequently problematic include commas, apostrophes and double quotation marks.

 

Queries can be run to investigate the existence of any given special character, as the sample in Figure 4 demonstrates.

 



Figure 4: Query on a Special Character

 

A good tool will highlight areas of concern and allow batch processing of the profiling queries on tables and fields to facilitate an efficient compilation of results.

 

It needs to be emphasized that running these queries is only the first step. It is essential that a thorough analysis be performed on the profiles and actionable items be drawn up and addressed. A modicum of effort performed up front will save endless cycles of problem-solving and revisions throughout the project.

 


Justin Hay is founder of Justin Hay Consulting Corp. He is the designer of the Enterprise Query Platform reporting application and is a principal data architect for IBM’s BDW and RDW enterprise data warehouse models.

For more information on related topics, visit the following channels:



Industry Vendors