The Analyst: Turning Data into Discovery

John Q. Analyst had a serious problem. Over the last ten years, he had achieved demigod status at his insurance firm as the unquestioned data and subject matter expert in his department. It was a good life. Poor supplicants would wait for weeks for the demigod to answer their data questions: "What does this field mean?" "Where do I find this information?" "How come these numbers do not add up?" John took his job seriously. "Data cannot be rushed," he was fond of saying. "It is delicate detective work." He took his time, but he always delivered the answer.

Truth be told, he wasn't always sure his answers were correct, but who would know otherwise? It's not like anyone was going to dig through millions of rows of data to figure it out - until now. All of these ridiculous new regulations requiring that data be consistent, data lineage be documented, and nonpublic information be protected were causing John one huge headache after another.

Take this week. On Monday, the internal auditor came to inspect John's application. John proudly shared the data models he created and maintained, and it all went well until the auditor asked to see the data. John randomly pulled a record, displayed it in the application screen and explained the various fields - name, address and the youthful driver flag, which was used to calculate the insurance premium because young people had such poor driving habits. "Hmm," said the auditor. "Mr. Adolf Rabinowitz. That's an unusual name for a young person. I wonder how old Adolf is." "Let's take a look," said John. He logged into a different system, brought up Adolf's complete record, and, to his horror, discovered that Adolf was 87 years old. "Very interesting," said the auditor. "How many other people are you mistakenly overcharging?"

After the auditor left, promising to come back next week for answers, John got to work. The first problem was trying to relate the data in the system that had ages to the system that had the youthful driver flag set. John randomly picked a few records, but they all seemed to have the flag set correctly. Then John put his detective hat on - he picked a person for each year of age (an 18-year-old, a 19-year-old, and so on through a 35-year-old) and checked their youthful flags. It seemed that 18-24 year olds had the flag set, but 25-35 year olds did not. John was stumped. He tried another sample and found a 47-year-old woman with the youthful flag set and a 24-year-old who did not have it set. Four days and 17 samples later, the pattern began to emerge. John figured out that people 24 years old or younger usually had the flag set, as did people over 70, but there were seemingly random exceptions: some people in other age groups had the flag set, and some young people did not. Now it was Friday. John's wife expected him home for a family gathering, and he was stuck at work trying desperately to get ready for the following Monday's visit by the auditor.

The Monday inspection did not go well. "What do you mean, you cannot explain why 70-year-olds have the youthful driver flag set, how many people 24 years old and younger do not have it set, and how many people older than 24 do?" asked the auditor incredulously. "Call me when you figure it out!"

John Q. Analyst was always intense and focused. He prided himself on his analytical skills and loved games and puzzles. From chess to Sudoku, John would never pass up a good brainteaser. He even entered a worldwide data mapping contest sponsored by a data mapping software vendor, and, while he did not place, he scored in the top 10 percent. So there was no way he was going to let some lousy youthful driver flag stump him.

He got extract, transform and load (ETL) developers to dump data into spreadsheets and databases for him. He printed the spreadsheets with data and spent hours poring over them with a highlighter while downing a bottomless cup of coffee. He was losing sleep and weight. Another couple of weeks passed, and he was no closer to the answer, when he had an epiphany. He remembered his first manager and mentor, Claudia Copybook, and how she always told him: "John, no matter how beautiful your data model, no matter how well designed, those programmers are going to screw it up." Back then, in the spirit of youthful absolutism, John argued with her and insisted that if only data modelers had the guts and determination to reject programmers' attempts to corrupt the model and schema, all would be well.

Now, years later, after being paged in the middle of the night because some application that had to go into production required a new field for an emergency bug fix, after having identified a field the programmers could use "just for this special case" and only temporarily until they could implement a real fix (which was never actually implemented), after having contractors proudly deliver a system in record time only to spend the next two years discovering and fixing all the shortcuts they had taken, and after being personally involved in decisions about doing things right versus shipping on time, John knew all about overloaded fields, corrupted models, undocumented features and convoluted usage patterns. This must be it - the youthful driver flag must be overloaded ... but how?

John became obsessed with the flag. He could be seen at all hours staring into space, poring over spreadsheets, chewing on his highlighter and mumbling to himself. It took another month of intensive investigation to finally piece together the answer: the youthful driver flag had been overloaded to indicate high-risk drivers, including drivers under 24 years of age, drivers over 70 years of age and drivers with DUI convictions or a high number of accidents and driving violations. Young people who did not have the flag set were graduates of special safe driving courses who were 21 or older and had no driving violations or accidents. And, of course, some data was just dirty: the flag did not get set or changed correctly, or was manually set by an end user without going through the application logic. Finally, there were about 300,000 records, loaded three months earlier from a company they had acquired, that did not comply with this rule at all; the acquired company had marked the youthful driver flag only for drivers younger than 26 years of age.
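The rule John pieced together can be written down as a short function, which also makes the "dirty" records easy to list. The following Python sketch is only an illustration under stated assumptions: the `Driver` record, its field names and the `HIGH_INCIDENT_THRESHOLD` value are hypothetical (the story says only "a high number" of accidents and violations), and a real system would read these values from the policy and claims databases.

```python
from dataclasses import dataclass

# Assumption: the story gives no number for "a high number of accidents
# and driving violations", so this threshold is purely illustrative.
HIGH_INCIDENT_THRESHOLD = 3


@dataclass
class Driver:
    # All field names here are hypothetical, chosen for the example.
    age: int
    has_dui: bool = False
    incidents: int = 0                  # accidents plus driving violations
    safe_course_graduate: bool = False
    from_acquired_company: bool = False
    youthful_flag: bool = False         # the value actually stored


def expected_flag(d: Driver) -> bool:
    """Return the flag value the discovered rules say a record should have."""
    if d.from_acquired_company:
        # Acquired-company records only marked drivers younger than 26.
        return d.age < 26
    if d.has_dui or d.incidents >= HIGH_INCIDENT_THRESHOLD:
        return True
    if d.age > 70:
        return True
    if d.age < 24:
        # Safe-driving graduates aged 21+ with a clean record are exempt.
        exempt = d.age >= 21 and d.safe_course_graduate and d.incidents == 0
        return not exempt
    return False


def find_exceptions(drivers):
    """List records whose stored flag disagrees with the rule (dirty data)."""
    return [d for d in drivers if d.youthful_flag != expected_flag(d)]
```

Under these assumptions, the 87-year-old from the audit is not an error at all, because the flag is expected for drivers over 70, while a 47-year-old with a clean record and the flag set would show up in `find_exceptions` as a candidate for cleanup.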

Does this little parable sound familiar? I have been involved in many projects like this and have spoken with countless analysts who've been there. From a major European defense contractor trying to synchronize bills of materials for a jet built by dozens of companies in four countries, to an international conglomerate spending months to map one PeopleSoft HR system to another even though most fields were identical (but how do you know until you look?), to an insurance company where a one-week project to map service providers ended up taking six months, I have seen project after project, company after company drown in outdated data models, poor programming practices, dirty data and, above all, unbelievable complexity combined with a lack of automation and tools.

The problem of understanding and mapping the data has recently come to the forefront, driven by compliance requirements and business needs. Many large enterprises are embarking on data governance programs, establishing data governance councils and appointing data stewards to provide a single point of decision-making and responsibility.

Data governance is based on the principle that data has to be understood, secure, consistent, accessible and managed through people, processes and procedures. Because you cannot govern what you do not understand, understanding the data is the first, necessary step in securing it, making it consistent and accessible, and managing it. Unfortunately, as poor John Q. Analyst knows all too well, it is a difficult and painful process. After going through all the pain to discover the business rules governing a single field in a single application, imagine trying to uncover the data rules and lineage across hundreds and even thousands of applications and millions of fields! It is enough to have John Q. Analyst thinking about a career change.

Fortunately, there is hope. With all the focus on the problems of data models, data lineage and data consistency, there is a new set of tools for data relationship discovery and management available to help fight the problem.

  • New collaboration tools like wikis that provide a searchable, editable forum are ideal places to capture the collective knowledge of the data in the enterprise and allow analysts from different groups to collaborate on defining business rules and business terms.
  • New discovery tools are available that focus on the data analysis rather than metadata, help discover the patterns and rules hidden in the data and help reverse-engineer the various rules and identify exceptions. While these tools are not "push-button," they would help John Q. Analyst discover the meaning of the youthful driver flag in days rather than months.
  • Finally, there are new tools available to validate data consistency and manage remediation of data inconsistencies that exist between distributed systems.
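To give a feel for what data-driven discovery and consistency checking involve, here is a minimal Python sketch of the kind of profiling John did by hand. It is not how any particular product works. The function names, the dictionary-based inputs and the five-year band width are assumptions made for the example. It relates the system that holds ages to the system that holds the flag by a shared customer ID, then reports how often the flag is set in each age band.

```python
from collections import defaultdict


def flag_rate_by_age_band(ages, flags, band_width=5):
    """Share of records with the flag set, per age band.

    ages:  {customer_id: age} from the system that stores ages
    flags: {customer_id: bool} from the system that stores the flag
    Returns {band_start_age: rate}, ordered by band.
    """
    counts = defaultdict(lambda: [0, 0])  # band -> [flagged, total]
    for customer_id, age in ages.items():
        if customer_id not in flags:
            continue  # cannot profile a record missing from the other system
        band = (age // band_width) * band_width
        counts[band][1] += 1
        if flags[customer_id]:
            counts[band][0] += 1
    return {band: flagged / total
            for band, (flagged, total) in sorted(counts.items())}


def unmatched_ids(ages, flags):
    """Customer IDs present in only one of the two systems."""
    return sorted(set(ages) ^ set(flags))
```

Run over every record instead of 17 hand-picked samples, a table like this would show high rates in the youngest and oldest bands and low but nonzero rates in between, which is exactly the pattern that points to an overloaded field rather than a simple bug. The `unmatched_ids` helper hints at the consistency checks in the last bullet: records that exist in one system but not the other are the first inconsistencies to remediate.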

While the problem of understanding and untangling the current data mess is still formidable, companies armed with these new tools are finally beginning to make progress. So, take heart, John! You can have beautiful, consistent data models, documented data rules, clean data ... and weekends with your family.


Alex Gorelik is the co-founder and CTO of Exeros, Inc. and has over 20 years of experience developing cutting-edge data integration technology.
