
Volume Analytics:
Winning the Battle Against Bad Data

Column published in DMReview.com
December 15, 2005
  By Guy Creese

We've all battled bad data from time to time - for example, trying to merge the customer records from different customer databases without ever being quite sure whether this "John Smith" matches that "Jack Smith." Most of the discussions on how to fix bad data recommend that companies purchase technology: for example, a data cleansing tool that integrates with an extract, transform and load (ETL) package. However, bad data is much more than an undeliverable postal address - and there are fixes other than buying a software package. Here are two problems and corresponding fixes that I've discovered over the years.

Problem 1: The Business Won't Give You the Data

An area that business intelligence (BI) professionals don't talk about much - I suspect because they view it as a political problem rather than a technological problem - is prying data loose from someone who is hoarding it. While it may not be inaccurate data, it can certainly lead to bad data, in the sense that a company can't understand the big picture without it.

If you've ever been responsible for a corporate data warehouse, you know what happens. The C-level folks announce, "We're all going to be one happy family and pool our information for the greater good." However, when you go to the people who own the data, some won't part with it. Fred, a colleague of mine, ran into this problem a number of years ago. The VP of manufacturing refused to allow his data to be used in the new corporate repository. "I know what the data means and my people know what it means. If I let some bozo in marketing look at my build plan they're going to screw up my world. The last time I checked, this was my data, and you aren't getting it."

After cajoling, threatening and trying every which way to obtain the data, Fred gave up - but he did have an ace up his sleeve. When he rolled out the reporting system to upper and middle management, he inserted a "?" in the spots where data was missing. Within the first five minutes of the company's president looking at the tool, Fred got a call. "Why do some of my screens have question marks here and there?"

"Oh, well, that's a stand-in for missing data," replied Fred. After further interrogation, Fred explained, "Well, George refuses to let me extract his manufacturing data. He says only he and his people can interpret it correctly, so he isn't releasing it."

"We'll see about that," growled the President. For some reason, within the next half hour, George decided that sharing his data with the rest of the company was a wonderful idea.

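The placeholder trick in this story is easy to reproduce in most reporting code. What follows is only a minimal sketch of the idea, not a description of the system Fred actually built: the function name render_report, the department names and the figures are all hypothetical. The point it illustrates is that a missing value is shown as a visible "?" rather than being silently dropped or shown as zero.

```python
from typing import Mapping, Optional, Sequence

MISSING = "?"  # visible stand-in for data that was never supplied


def render_report(
    rows: Mapping[str, Mapping[str, Optional[float]]],
    columns: Sequence[str],
) -> str:
    """Render a plain-text table, printing MISSING wherever a value is absent."""
    lines = [" | ".join(["department", *columns])]
    for department, values in rows.items():
        cells = [department]
        for column in columns:
            value = values.get(column)  # None if the key is absent or explicitly None
            cells.append(MISSING if value is None else f"{value:g}")
        lines.append(" | ".join(cells))
    return "\n".join(lines)


# Hypothetical figures, for illustration only.
report = render_report(
    {
        "sales": {"q1": 120.0, "q2": 135.5},
        "manufacturing": {"q1": None},  # the data owner has not released it
    },
    columns=["q1", "q2"],
)
print(report)
```

Printing the report shows a "?" in both manufacturing cells, which makes the gap obvious to whoever reads the screen. Showing a zero or leaving the row out would hide the problem instead of exposing it.
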
Problem 2: Trusted Data, Over Time, Becomes Suspect Data

This is a more insidious problem. With the issue above - when someone refuses to give you his or her data - at least you know where you stand. Problem 2 creeps up on you. A recent example is this year's brouhaha over first- and third-party cookies in Web analytics. Web cookies are small text files sent to a user's PC by a Web server that enable Web sites to recognize a user during the course of a visit or remember pertinent details about the user across visits. A first-party cookie is set by the site that the user is visiting, while a third-party cookie is set by a third-party site providing a service - such as Web analytics or ad generation - to the Web site.

Because ad networks and spyware use third-party cookies, browser vendors and anti-spyware suppliers now discriminate between the two types, and often block third-party cookies. Furthermore, users concerned about privacy are deleting third-party cookies at a rapid clip.

If one of the third-party cookies being deleted is the cookie from a site's Web analytics provider, the Web analytics service starts reporting skewed data to its client: that 1) the number of new visitors is increasing and 2) the number of return visitors is decreasing when, in fact, they aren't. In short, the business starts thinking that customer loyalty is plummeting, when it's actually stable or increasing.
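
The arithmetic behind this skew can be sketched in a few lines. This is a deliberately simplified model, not how any particular Web analytics product counts visitors: it assumes that every return visitor whose cookie has been deleted is counted as a new visitor, and the function name, the deletion rate and the visitor counts are all hypothetical.

```python
from typing import Tuple


def reported_counts(
    true_new: int, true_returning: int, deletion_rate: float
) -> Tuple[int, int]:
    """Return (reported_new, reported_returning) under a simple deletion model.

    Assumption: a return visitor whose analytics cookie was deleted is
    indistinguishable from a first-time visitor, so he or she is counted as new.
    """
    if not 0.0 <= deletion_rate <= 1.0:
        raise ValueError("deletion_rate must be between 0 and 1")
    misclassified = round(true_returning * deletion_rate)
    return true_new + misclassified, true_returning - misclassified


# Hypothetical month: 600 genuinely new visitors, 400 genuine return visitors,
# and a quarter of the return visitors have deleted the third-party cookie.
new, returning = reported_counts(true_new=600, true_returning=400, deletion_rate=0.25)
print(new, returning)
```

With these made-up numbers the service would report 700 new and 300 return visitors, so the return share appears to fall from 40 percent to 30 percent even though actual visitor behavior has not changed at all.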

This kind of bad data is hard to guard against. Because it usually happens due to circumstances beyond a company's control, the warning signs are difficult to discern. Perhaps the best prevention is thinking through how the quality of the data could change. Have we switched suppliers recently, and might they report the data differently? Have the incentives for supplying good data changed? Maybe customers who are angry at a company's behavior in one sector of the business are punishing it by submitting bogus data in another.

For example, Sony, recently under attack for installing a hard-to-remove digital rights management rootkit on customers' PCs, may find it prudent to spot-check data from its online registration forms for the next six months. Such backlash behavior may never happen - but it's better to be safe than sorry.
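
One modest way to support this kind of spot-checking is to compare each new figure with its own recent history and flag anything that moves unusually far. The sketch below is one possible approach under stated assumptions, not a complete data quality monitor: the function name is_suspect, the three-standard-deviation threshold and the weekly figures are all hypothetical. A flag says only that something changed, not why, so someone still has to ask the questions above.

```python
from statistics import mean, stdev
from typing import Sequence


def is_suspect(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    average = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is worth a look.
        return latest != average
    return abs(latest - average) / spread > threshold


# Hypothetical weekly share of return visitors.
weekly_return_share = [0.40, 0.41, 0.39, 0.40, 0.42, 0.38]
print(is_suspect(weekly_return_share, 0.30))  # a sharp drop
print(is_suspect(weekly_return_share, 0.41))  # within the normal range
```

In this example the drop to 0.30 is flagged while 0.41 is not. The threshold is a judgment call: set it too low and people learn to ignore the warnings, set it too high and a slow drift slips through.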

Battle Bad Data to Keep Your End Users' Trust

These two problems require different responses. When someone internal to the company won't give you necessary data, you stand a chance of creating incentives - or punishments - that will get you the data you need. With problem 2, you're more at the mercy of the outside world - figuring out what has changed and perhaps making the best of a bad situation.

These types of problems are not the kind of things that any of us enjoy dealing with. However, remaining vigilant and proactively battling bad data are actions that promote accurate data and ultimately lead to end-user trust. If your users begin to mistrust the data, your job will become immeasurably harder. Better an ounce of prevention than a pound of cure.



Guy Creese is managing partner at Ballardvale Research, an analyst firm that investigates how organizations can optimize their Web site via best practices in content management, search, personalization and Web analytics. Creese has worked in the high tech industry for 25 years, at both Fortune 500 companies and small start-ups, in positions ranging from programmer to product manager to customer support engineer. He can be reached at guy.creese@ballardvale.com.
