
Originally published 4 March 2009
Normalization is a technique that achieves a lot of good in database design. However, there are limits to its effectiveness, and master data tables seem to be particularly sensitive to these limits. Master data tables like Product and Customer tend to have hundreds or even thousands of columns. They incorporate many different types of Product or Customer, and each of these types often has a few attributes that are unique to it, but not shared with the other types. The result is that when we examine a table such as Product, we see that certain columns only have values for certain types of Product, and never have values for other kinds of Product.
In my personal experience, I have always had issues when I have pointed these facts out to more rigorous data modelers. They have invariably told me that I have modeled the situation incorrectly, and instead of a single Product table, I should have a set of subtypes. Some of them have told me that it might be acceptable to have a single Product table in the physical implementation, but the logical data model should show the subtypes.
These attitudes have baffled me. Creating the subtypes adds complexity that creates confusion when trying to understand the data model. Having the logical data model significantly different to the corresponding database adds to this confusion. I think that a reflexive attachment to normalization is counterproductive in this case. Let me try to prove this point with an example.
Suppose we have a company that is active in many areas. It delivers services and builds one-off specialty products. It treats all of its business activities as projects, and consequently has a master data entity called Project. This entity has two basic attributes: Project Number, and Project Title. Project Number is the primary key. There are additional attributes for Project, as follows:
All of this information is summarized in Table 1, which shows the attribute, its possible values, a code that represents each value, and ancillary attributes that exist only for the specific value.
| Attribute | Value | Code | Ancillary Attribute |
| Project Security Level | Top Secret | 1 | Security Officer |
| Classified | 2 | General Council Registration Date | |
| Confidential | 3 | NDA Number | |
| Public Domain | 4 | Web Page URL | |
| Project Status | Design | A | Project Inception Date |
| Development | B | Project Development Start Date | |
| Prototype | C | Project Prototype Date | |
| Production | D | Production Start Date | |
| Support | E | Production End Date | |
| Obsolete | F | Project Obsolete Date | |
| Project Sector | Military | ML | Armed Service |
| Communications | CM | Communication Type | |
| Environment | EV | Environment Goal | |
| Financial | FN | Financial Regulator |
Table 1: Summary of Three Attributes of Project Entity
If we follow the rules of normalization, we will begin by creating a single entity called Project, with five attributes, as shown in Figure 1.

Figure 1: Basic Project Entity
Now we are confronted with a problem. According to normalization, we need to create subtypes for Project Security Level, Project Status, and Project Sector. We need one subtype per type of Project Security Level – and Table 1 shows that there are 4 of these. But we also need one subtype per type of Project Status – and Table 1 shows there are 6 of these. Since Project Security Level and Project Status are independent of each other, they have to combine in 4 x 6 = 24 possible ways. If we add Project Sector into the mix, we end up with 4 x 6 x 4 = 96 possible combinations.
So, apparently we need 96 different subtypes in addition to the parent Project table. This is a number that it is not easy for a programmer or analyst to keep track of. In reality, it is even more complex.
Figure 2 illustrates this further complexity. It shows 3 of the possible 96 subtypes. Each of these subtypes has the attribute Project Inception Date, because each represents the Project Status of “Design.” In fact, 16 of the possible 96 subtypes will have the attribute Project Status Date. Suppose a programmer is tasked to produce a report of Project by Project Inception Date. He or she will have to figure out which 16 of the 96 tables to access, and then pull out the dates. 16-table joins (or unions) are neither easy nor graceful.

Figure 2: Example of Three Subtypes
The “subtype” tables are not actually subtypes. The “Design” subtype will be implemented in 16 of the 96 tables as we have just seen. In other words, if I want to fully understand the Projects that are in the “Design” phase of Project Status, I have to consult 16 tables. None of the individual tables map to one subtype – they are all unique intersections between the 14 subtypes listed in Table 1. We might call these tables subtypes, but they are not.
Having 97 tables for the Project entity (the parent plus 96 subtypes) will overload any human trying to make sense of the data model. Persuading a DBA to implement such a structure, and a programmer to build logic around it, are almost certain to be impossible. We are forced to fall back on the single Project table with the subtypes folded into it. I would submit that it is easier to figure out the situation of each record in such a table than it is to figure out which subtype table applies to a given record in the normalized design.
We still have to manage the subtypes, be they inside the Project table or scattered across the 96 tables. The principle of locality of reference tells us that it is easier to manage related elements if they are physically close to each other. The single Project table is a better design from this perspective. Clearly, therefore, there are limits to what normalization can give us, and in master data entities, there are design alternatives which are simpler and easier to maintain.
SOURCE: Normalization is Impractical for Master Data
Recent articles by Malcolm Chisholm
Comments
Want to post a comment? Login or become a member today!
Be the first to comment!