FREE DM Review Site Registration!
Sign-up today and access DM Review on the Web!

Your FREE registration entitles you to:

FREE email newsletters

FREE access to all DM Review content

FREE access to web seminars, resource portals, our white paper library and more!

   

Is Data Modeling Sufficient for Database Design?

Data modeling has been with us for several decades now and has been wildly successful. It has provided IT professionals with a set of tools, techniques and methodologies that have not only allowed us to implement databases, but which have also given us a way to communicate and discuss data architecture. So pervasive is this success that today data modeling is often taken for granted. It looks mature and rather static. Entry-level staff have to learn it, and experienced professionals may need to extend their skill sets into advanced areas, such as dimensional modeling. Yet even if there is a need for individual growth, there seems to be an unspoken assumption that the body of knowledge built up around data modeling is all that is needed to implement successful databases. Few people seem to question if this is really so. Is it possible to entertain the thought that data modeling does not completely encompass everything that is needed for successful database design? I would submit that there really are limits to what data modeling - at least in its present form - can achieve, and that in practice it cannot fully describe the architecture of any database.

Code Tables, Indicators and Nulls

One of the areas that has troubled me about data modeling involves the use of code tables. These pesky little tables, consisting of just a few columns and a few rows, are incredibly persistent in all database designs. They generally comprise anywhere from 20 to 50 percent of the tables in any data model. Their names are quite familiar to us: County Code, Product Category, Customer Type and so on. What is strange about these tables is that they usually have to be populated before a system or database goes live. Could they possibly have something to do with design?

Similarly, indicators - otherwise known as flags or switches - are pervasive types of attributes in data models. We argue over whether they should be physically implemented as "true" or "false" versus "yes" or "no," or some other representational scheme, but we rarely stop to think what they are doing in a database in the first place. And yet, like code tables, they are quite common in database designs, even if it seems a little odd that a characteristic of an entity should have a value like yes or no. Again, perhaps we are looking at something that partly embodies design rather than purely representing data.

Attributes that are indicators, or which are foreign keys from code tables, are often found in entities recognized as master data. Such entities include perennial favorites like Customer and Product, but in more specialized databases they are still found in entities that are readily classified as master data. This is not a universal rule, and both indicator and code attributes can be found in transactional data. Nevertheless, these two kinds of attributes are found most frequently in master data, and it is instructive to ask why this is. The answer seems to be, as we shall see below, that they are used to describe structures within database designs that cannot be represented by current data modeling techniques.

A separate issue, which at first appears quite unrelated, involves specifying if columns are null or not null in data models. Anybody with any experience of data modeling is likely to have come across attributes that must be null in certain circumstances, but must be not null in other circumstances. Data modeling tools do not seem to be built to handle this requirement: they ask if an attribute is potentially nullable or always not null. They cannot record if a column must be null under certain conditions but must be not null under other conditions. Once again, this is a hint that there are limits to how precisely data modeling can specify database design.

Logical versus Physical

Before we look at an example of the limits of data modeling, it is important to recognize a cultural issue in data modeling, and indeed IT as a whole. This is the great divide between the logical and physical perspectives of data management. On the one hand we have data modelers and data administrators who deal in design and related metadata, and are perceived as the builders, if not the owners, of data models. On the other hand we have database administrators and programmers who have to get databases up and running, and who look to those who traffic in more logical constructs for as much practical help as possible. Nevertheless, in my experience data modelers and data administrators nearly always see themselves as responsible solely for designs, and think that the way in which business users choose to put data into a database is entirely up to the business users. This protective attitude simplifies the life of a great number of IT professionals, but it may set limits on the level to which databases can be designed.

One problem is that if physical values that populate database columns somehow describe the structure of a database, then this structure is never going to be captured in a data model since a model never contains physical values - only the implemented database does. The logical/physical divide between models and databases will always stand in the way of such design structure being represented.

The World of Hidden Subtypes

Returning to the issues presented by code tables, indicators, and nulls, it is possible to see that database design consists of more than can be represented in a data model.

Suppose we have a manufacturing company that makes kitchen appliances and plumbing fixtures. The Product entity for this company can be represented by the model fragment shown in Figure 1.

 Figure 1: Data Model Fragment for Product Entity

As can be seen, the Product entity has three associated code table entities: Product Development Status, Product Category, and Finish Type, which are defined as follows:

  • Product Development Status: The different stages of product development that are recognized at the company. These are Research and Development, Prototype, Production and Obsolescence.
  • Product Category: The major classes of Product. These are Kitchen Appliances and Plumbing Fixtures.
  • Finish Type: The finishes available for plumbing fixtures. These are Gold, Nickel and Oil Rubbed Bronze.

If we look more closely at the Product entity, we can see that certain attributes are only populated in certain circumstances:

  • Product Model Number is only populated once a Product is in the Production stage. It must not be populated for stages prior to that.
  • Product Recall Indicator is set to "yes" if a Production Product has ever been subject to a recall. Again, it can never be populated for products that have a Product Development Status that precedes Production.
  • Product Voltage Rating is only populated for products that are Kitchen Appliances.
  • Product Water Pressure Rating is only populated for products that are Plumbing Fixtures.
  • Finish Type Code is only populated for products that are Plumbing Fixtures.
  • Product Care Disclaimer Indicator. This nonintuitive term means that customers have to be supplied with a disclaimer if they buy a Plumbing Fixture that has a Finish Type Code representing Gold or Oil Rubbed Bronze. These finishes need special care if they are not to be damaged by the customer.

By now you get the picture. Certain attributes are only populated under certain circumstances. Traditionally, we have gotten around this need by creating subtypes in data models, thus spawning additional entities. Yet, in the above example we have four sets of supertypes and subtypes (Product Development Status, Product Category, Finish Type, and another for Finish Types of Gold and Oil Rubbed Bronze). It would be an absolute nightmare to try to model such a situation using subtyping, and anyone who did would certainly face a rebellion from the people charged with implementing it. Furthermore, traditional data modeling only permits one supertype and one set of subtypes per entity. It cannot deal with the various independent, overlapping and hierarchical natures of the subtypes we find even in the relatively simple situation shown in Figure 1. The only solution is to use code tables and indicators to deal with the problem in the context of a single Product entity.

Such a solution misses out on design. On the face of it, the Product entity on Figure 1 appears to say that every record in the physically implemented table will be uniform, in accordance with third normal form. In reality there are a number of subtypes hidden within the Product entity. These hidden subtypes are never formally identified, named or defined, let alone managed.

Business Rules to the Rescue?

Hidden subtypes are especially prevalent in master data entities such as Product and Customer. That is why these entities have so many i ndicators and code tables associated with them. The indicators and code tables serve to identify the hidden subtypes in practice. Indicators also serve to represent attributes that belong only to a certain hidden subtype, like the Product Care Disclaimer Indicator in the example above. It can also be seen that telling the database that a particular attribute is null or not null makes no sense when dealing with hidden subtypes. An attribute pertaining to a specific hidden subtype must always be null on records that do not belong to that hidden subtype. Yet data modeling tools have no way to specify this.

How, then, is it possible to deal with the complex design issues inherent in hidden subtypes? Without the formal recognition of hidden subtypes, we are limited to identifying sets of attributes that behave as a group for reasons that are not usually stated. This information is typically scattered about in written documentation or people's heads.

Sometimes, though, there are attempts to formalize the knowledge of hidden subtypes. This is particularly true of business rules projects. Whenever a project is undertaken to gather and document business rules, a vast expanse of low-hanging fruit is represented by the columns in a database table that gets populated based on the values present in other columns. These are the attributes associated with hidden subtypes, but the business rules approach does not see it that way. Furthermore, the output of business rules projects is typically English-language business rules statements that may at best be available for searching and grouping according to user-defined parameters. Although such business rules projects may be lauded for such results, the reality is that they completely miss the point when it comes to hidden subtypes. Furthermore, business rules projects could be greatly reduced in scope if they did not have to deal with hidden subtypes.

Dealing with Hidden Subtypes

The dependencies inherent in the sets of attributes that form hidden subtypes would be best handled as part of data modeling. This would need data models to be aware of code values in a way they are not today. Perhaps that is still a way off in the future. Until then, the most practical way to manage hidden subtypes is in a repository that contains data model metadata. Many enterprises have such repositories today, and it requires relatively little effort to extend them in this way.

Perhaps more important, however, is recognizing that hidden subtypes represent a valid and very important layer of database design that is missing from today's data models but which is absolutely vital for understanding and specifying database design.


Malcolm Chisholm is an independent consultant focusing on meta data engineering and data management. He is the author of How to Build a Business Rules Engine and Managing Reference Data in Enterprise Databases and frequently writes and speaks on these topics. Chisholm runs two Web sites http://www.bizrulesengine.com and http://www.refdataportal.com. You can contact him at mchisholm@refdataportal.com.

For more information on related topics, visit the following channels:



Industry Vendors