In a second article on the normal forms of a relational database, Peter Vogel discusses forms two through four and offers some advice in dealing with management.
One of the reasons I'm taking the time to look at the various normalized forms for databases is to fight a battle I lost a number of years ago. At the time, I was working on a system that had the keen interest of one of the members of upper management. This president had played around with computers to the point where he felt he knew the essentials of good database design. And, to be fair, he did know some of them. However, he didn't know enough to appreciate the absolute necessity of normalizing data. As I normalized the original database design and the number of tables increased, he would step in and "de-normalize" the design for efficiency's sake.
We never really got the system to work after these improvements, and it cost us a great deal of money just to keep it functioning on a day-to-day basis. All in all, it was an interesting display of "efficiency." In this article, I'm not only going to provide you with a better under-standing of the normal forms, I'm also going to provide you with the business case for normalizing data -- should you ever find yourself in a similar situation.
In the January 1998 issue, I discussed first normal form (see "Working SQL: The Trouble with Normal"). I tried to make it clear that it's not just a theoretical issue and it isn't really about data integrity. It's about cost and time and capability. Letting your database slide into an unnormalized state increases the cost and time spent in delivering and maintaining any system that uses it. In addition, there are many purposes that a database in an unnormalized form can't be used for, no matter how much money is spent -- end-user reporting, for instance. No matter how attractive repeating fields might look, they're always a bad idea.
I will say, however, that unlike unnormalized data-bases, databases that are only in first normal form do have a purpose in life. Don't get me wrong -- first normal form databases can't be used for systems that are updated online, at least not without incurring tremendous costs in development and maintenance. First normal form data-bases can, however, be used for end-user reporting. Data warehouses, for instance, are frequently -- and successfully -- left in first normal form. Note that data warehouses aren't generally updated online. Instead, the vast majority of data activity is the insertion of new records and is done on a batch basis. Typically, in a data warehouse, the only updates are the ones made to sum-mary records, and these are also done on a batch basis.
Second normal form
Second normal form requires that all of the fields in a record depend only on the primary key of the record. As I discussed in the January 1998 issue, a record without a primary key can't be normalized. In one sense, the primary key defines what the record is all about, and so defines what fields belong in the record with it.
In discussing second normal form, the example that works best for me is a customer record whose key is the customer number. When you change the customer number, all of the data in the customer record changes. You are, after all, changing to a different customer. Some of the fields in the record might change to the same value (two different customers might live in the same country), but they do change.
The important part of the definition is that every non-key field must depend only on the primary key. In the customer record, a failure in second normal form occurs when a field depends not just on the customer number (the primary key), but also on some other field in the record. If the customer record includes both a country code and a country name, for instance, one of the two fields doesn't belong in the record. Changing the country or the country code will force a change in the other field, which indicates that these fields don't depend only on the primary key of the record.
Don't assume that a failure in second normal form is always so obvious. For instance, a student record might contain a count of the number of classes completed and a status indicating how close a student is to completing the prescribed run of courses. If the count of classes is tied very closely to the student's status, you could have a situation where any change in the count of classes results in an update to the status field. If so, the record isn't in second normal form, because these two fields depend on each other and not just on the primary key.
One of the warning signs of a record not yet in second normal form is developers reminding each other, "Don't forget that when you update field A that you've got to update field B in the same record."
The problem with records not in second normal form is that, of course, someone will create a program that performs only half of the update. And, once that program goes into production, the integrity of the data will be lost: The value of field A won't be correct to the value of field B. Unlike failing to put a database into first normal form (which makes the data unusable), failing to put a record in second normal form makes the data untrustworthy.
Of course, you can prevent a program that performs only half the update from going into production. You will incur increased costs in training developers, updating programmers' documentation, and testing, but you can do it. A more subtle problem is the effect that databases not in second normal form have on the test system. Since the programs in the test system are untrustworthy (or they wouldn't be being tested) they will frequently corrupt records not in second normal form. As a result, a test database not in second normal form requires more effort to maintain than a fully normalized one.
By the way, when these programs fail, the naive will say that these programs have bugs. It would be more accurate to say that the database has a bug.
Third normal form
I think of third normal form as an extension of second normal form. Third normal form requires that the fields in the record depend on the whole primary key and not just on one part of the key. Taken together, second and third normal form, developed by Dr. Codd, the father of relational database theory, provide the oath of the DBA: "The key, the whole key, and nothing but the key, so help me Codd."
Many people assume that third normal form only applies when the primary key is a composite key -- a key made out of multiple fields. In a perfect world, this would be true. In the real world, it isn't.
One of the dangers that people fall into is creating an arbitrary primary key that has information embedded in it. As I said last month, there's often no need to create an arbitrary primary key at all: Why not use the customer name or phone number as the primary key of the customer record? However, if an arbitrary key is created, it must, like the customer name, not include any information. You can't, for instance, determine anything about a customer by examining the characters that make up their name. In the same way, you shouldn't be able to determine anything about the customer from their customer number.
However, there is a tendency to "code" information into these arbitrary numbers. The customer number 519321 might, for instance, include the customer's area code as the first three digits of the number. This works until the 999th customer in the 519 area code is added, the customer moves (some people are so inconsiderate), or area codes are reassigned.
Part codes seem to be especially prone to this kind of meddling. Database designers often seem to feel that they're being helpful to the users of their systems if a part number is "meaningful." A part might have a number 62134B, for instance, where 62 is the number of the product line that the part is used in. This works fine until the part is used in two product lines.
But what if the B in the code indicates that the part is brass-plated? Now imagine that, after years of service, it's decided that the part no longer requires the brass plating. For the part number to stay "meaningful," the part number must be changed. However, if you do change the part number, you'll lose all connection to the history of that part stored under the old number. In one company, changing the plating would be considered to create a new part, which should have a different part number. But it's not hard to imagine another business where losing the connection to the old information would be an unfor-tunate thing to have happen. The key point is that, once you embed information in the part number, you no longer have a choice: Either the part number must misrepresent the part, or you must change the part number.
What has this got to do with third normal form? Well, imagine that there's a field on the record that represents the thickness of the brass plating. Since that field is dependent on the part being brass-plated (a piece of infor-mation in the key of the record) the record is not in third normal form. If the brass plating is removed, only part of the key changes, but the field giving the thickness of the brass plate must also be changed, a sure sign that the rec-ord isn't in third normal form.
Fourth and fifth
Fourth normal form moves to the meaning of the data in the record. A record in fourth normal form contains data that belongs together in a meaningful way. To return to the parts record, that record might contain information about where the part is stored and what the part is used for. However, in many businesses, there's no connection between the purpose of a part and its storage location: one might vary independently of the other. If that's the case, then the data should be broken out into separate tables to bring the record to fourth normal form.
Fourth normal form could, presumably, result in a new table for every field in the record. As a result, design-ers tend to shy away from fourth normal form. Yet, if the fields don't vary together, then their ordinality might not always agree, either. For instance, it's easy to imagine that a part might be stored in many locations or have many uses. If the number of uses and the number of locations don't agree, then a record layout that includes them both won't work.
Rather than create a fourth normal database design to handle every situation, most database designers will attempt to determine what the current situation requires. In other words, if a part can currently be stored in several different locations, the designer will create a part/location table. If every part has only one location, the designer won't bother moving the record to fourth normal form.
However, to save yourself some rework in the future, you should probably at least probe to determine which situations merely reflect current practice and which ones are built into the nature of the business. A question like, "So, do you think that you'll ever want to have a part stored in more than one place?" can save hours of work down the line when the answer is "Yes."
Fifth normal form addresses problems with data that logically belongs together but is affected by external requirements. For instance, a purchasing system might list each vendor a product is bought from and the customer it's sold to in a Vendor/Product/Customer table. This table works well (it even satisfies fourth normal form) until a law is passed to prevent exclusive marketing relationships. The law requires that any part bought from a vendor must be bought from all vendors. Suddenly, the Vendor/Product/Customer table explodes in size, as a record for every vendor must be added for every part. However, if the part number/vendor relationship is broken out into a separate table, then the number of new records required can be held to a much smaller number. The reduction in disk space and update activity caused by moving to fifth normal form can be significant. It also emphasizes how data design is affected by the real world.
Again, most database designers will only move records to fifth normal form when a specific case arises.
Summary fields are another area that deserves discussion. While the problems with summary fields don't relate directly to the normal forms, there are problems with them, and I'd like to mention them before finishing this article.
Summary fields crop up in many places. Some are obvious: The customer record might have a field giving the total purchases to date, for instance. Some summary fields are less obvious: A customer's "credit limit remaining" field is a summary field because it really represents the total of the unpaid purchases subtracted from the customer's credit limit. A status field whose value is, or could be, determined by examining other data in the database is a summary field in disguise.
While they don't seem to violate the any of the rules that define the normal forms, these fields do violate a more basic tenet of good database design -- no redundant data. It's only a matter of time before that total purchases field in the customer record disagrees with the total calculated from the sales table. One of the symptoms of these problems is the existence of batch programs that recalculate fields in one record based on data in other records. Another symptom is a set of "audit" programs that validate field values by checking other data in the database.
Unfortunately, these summary fields are sometimes a necessary evil. In a data warehouse, these fields facilitate end-user reporting. If the warehouse is updated infrequently (once a day or less), in batch, and the fields are calculated from zero, the danger can be managed. In a production database with frequent online updating where the fields are incremented by the value of the transaction, these fields are dangerous. However, if the data in the field must be retrieved online, must be accurate as of the last transaction, and can't be calculated in a reasonable period of time because of the amount of data that must be processed, then these fields are required. The cost of these fields will show up in extra testing and the creation of the audit and reset programs.
Given the costs and dangers of summary fields, it's often worthwhile to challenge the assumption that the fields must be up to date to the last transaction. If the data as of the end of the business day is "good enough," then the information can be retrieved out of a data warehouse, provided the warehouse is updated frequently enough.
A joy forever
People have forgotten, but the reason that structured programming was introduced was to make it possible to prove, mathematically, that a program was correct. In the same way, people have forgotten that the relational database concept, and the subsequent definition of the normal forms, was done to make programmers more productive.
While the people who fund our projects might not appreciate how important it is to avoid "update anomalies," they do have a strong interest in not spending money. The purpose of the various normal forms is not to create well-designed, logical database structures (though that's the result). The purpose of the various normal forms is to create economical and efficient systems that are built for a minimum cost and to the forecasted schedule.
However, to move away from the commercial aspects of this discussion to a more personal note, I should add that I care very much about good database design. For me, the result of applying the normalization rules is the creation of a thing of beauty. And creating beauty is always a good thing to do.
Other Related Pages