In most online articles about databases you can easily find a lot of articles explaining cool, complex and new features. But what’s about some basics? I decided to release some articles about the basics. The articles will mostly cover topics usable in most relational databases, but the examples and the focus will be on MS SQL Server database engine.
There are so many definitions of what database normalization is. Maybe the simplest explanation could be as follows. Database normalization is a process used to organize a database into tables and columns. The idea is that a table should be about a specific topic and that and only supporting topics included. Normalization is a systematic approach of decomposing tables to eliminate data redundancy and ensuring that data dependencies make sense.
The process of database normalization is an iterative process of using normalization rules.
Normalization rules are divided into normal forms (NF): 1st NF, 2nd NF, 3rd NF, BCNF, 4th NF, etc. While in theory there are many normal forms, usually it is enough to implement the normalization up to the 3rd normal form. You can find various articles online about the theories of normalization and if you like you could start here.
Let’s explain in simple words what is the result of the first three normal forms:
- 1st NF - eliminates duplicate columns across all tables and adding a primary key.
- 2nd NF - create relationships between tables through foreign keys.
- 3rd NF - fields should not be derived from other fields (e.g. it is not acceptable to have a calculated value in a table. If you have already the value X and Y, there is no need to have stored the R value if the R = X*Y)
Examples, examples, examples 😊 Let’s try to figure out what a database normalization is using simple examples.
Both examples have multiple problems that would be solved with a simple normalization. Main problems in this design:
- Data duplication anomaly – in example 1 we see that every person is inserted multiple times. More data means more storage used. And more storage means decreased performances.
- Insert anomaly – we could have problems if we try to expand our design and add the column “BestFriendID” that would represent the person ID of the user best friend. But what if in the table the referring friend is not present? Maybe he don’t have movies in his collection. Also, if "John Doe" already have 100 movies and he buy a new movie, in the second example you must add additional columns to be able to store this values.
- Update anomaly - What would happen if for example “John Doe” moves from “Zagreb” to “Dublin”? We should update the value from “Zagreb” to “Dublin” in every row where “John Doe” is present. Or imagine what would happened if we made a typo and that we must change the release year of a certain movie in all rows and all columns?
- Delete anomaly – If you delete the only record containing the movie “A Clockwork Orange”, you would lose the complete info about this movie. You would eventually loose the data about “John Doe” if he is the only person that have this movie and this one is the only movie he has.
- Query anomaly – Look at the second example. How would you select every person that have in his collection the “Apocalypse Now” movie? You should write something like “WHERE Movie1 = ‘Apocalypse Now’ or … or MovieX = ‘Apocalypse Now’?
Implementing normalization rules, you could create a normalized table design. One suggestion is as follows:
If a value in your table has only two possible states (e.g. true/false or black/white), always consider using a Boolean column instead of putting another table in the design containing only two rows. This approach is not maybe by normalization rules, but it will save you some resources.