Normalization is a database design technique that reduces data redundancy and eliminates undesirable characteristics like Insertion, Update and Deletion Anomalies.
What do Insertion, Update and Deletion Anomalies look like? Let's suppose we have this table:
PID | Name | Reputation | Ship | Salary |
---|---|---|---|---|
p1 | Jack Sparrow | 5 | BP | 10k |
p2 | Elizabeth Swan | 6 | BP | 15k |
p3 | Captain Barbossa | 5 | BP | 10k |
p4 | Will Turner | 6 | FD | 15k |
p5 | Davy Jones | 4 | FD | 9k |
Suppose you want to change the salary for a particular reputation. If you only change it in a tuple and not the other ones, we will have an inconsistancy in the database. We have to make sure all the tuples are modified accordigly. (For repuration 5, you need to change both the first and third row)
Can we add a salary for a particular reputation value if currently, in the database, there is no pirate with that reputation ? (Let's say for reputation 7 we want 20k salary). There is no pirate, but can we add a row like "NULL, NULL, 7, NULL, 20K" ? No, because the PK is null. So we cannot do it. To add a new rep, we also need to add a pirate with that rep. It's too restrictive
What if we fire Sparrow and Barbossa (both with reputation 5 and salary 10k)? Then we lose the salary for reputation 5. We lose the association between them.
Normalization rules divides larger tables into smaller tables and links them using relationships. The purpose of Normalization in SQL is to eliminate redundant (repetitive) data and ensure data is stored logically.
We can eliminate those 3 anomalies by creating 2 tables: One that stores pirates and one that stores the association between reputation and salary:
Table 1 | Table 2 | ||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
A relation R is decomposed into relations R1, R2 .. Rm. A good decomposition means that we can recover the original relation (in our case R) from the smaller relations. If the decomposition is irreversible, then it's bad.
See more at Lecture 7 (or the readMe for relational algebra if it exists)
Also read about functional dependecy before continuing here
Here is a list of Normal Forms in SQL:
- 1NF (First Normal Form)
- 2NF (Second Normal Form)
- 3NF (Third Normal Form)
- BCNF (Boyce-Codd Normal Form)
- 4NF (Fourth Normal Form)
- 5NF (Fifth Normal Form)
- 6NF (Sixth Normal Form)
Definition from lecture: A relation is in the first normal form if it doesn't have any repeating attributes.(which is in accordance with the definition of a relation in the relational model)
If a relation contains a composite or multi-valued attribute, it violates the first normal form, or the relation is in first normal form if it does not contain any composite or multi-valued attribute. A relation is in first normal form if every attribute in that relation is singled valued attribute.
Rules:
- Each table cell should contain a single value.
- Each record needs to be unique.
Let's say we have this table:
Student(name, birthyear, group, course, grade) with the key "name".
We have a composite repeating attribute: the pair {COURSE, GRADE}
BOOK(Bid, AuthorsNames, Title, Publisher, PublishingYear, Keyphrases) with the key Bid.
We have a simple repeating attribute: AuthorsNames, KeyPhrases.
If you have a relation with repeating attribute, the idea would be to decompose it such that the collection the relations contains all the data in the original relation, and the decomposition is good.
You replace the relation with the repeating attribute by a collection of other relations, via a good decomposition and the repeating attribute is now not repeating anymore.
We have 2 separate simple repeating attributes, so we are gonna decompose the book relation into 3 tables:
- BOOKS[Bid, Title, Publisher, PublishingYear] - so with the attributes that weren't repeating in the original relation
- AUTHORS[Bid, AuthorName] - we took out the first repeating attribute
- KEYPHRASES[Bid, Keyphrase] - we took out the second repeating attribute
We have a composite repeating attributes, so we are gonna decompose the relation into 2 tables:
- GENERAL_DATA[Name, Birthyear, Group] - with the attributes that were not repeated.
- RESULTS[Name, Course, Grade] - we add the key and the composite repeating attribute.
Before talking about 2NF, take a look at "functional dependency".
Definition from lecture: A relation is in the second normal form if it's in the first normal form and every non-prime attribute is fully functionally dependent on the key
An attribute is prime if there is a key K in our relation and the attribute is a proper subset of the key. Ex: if our key consists of K1, K2 then K1 is a prime attribute.
Let's say we have a relation R[A1,A2,...,An], where alpha and beta are nonempty sets of attributes in R. Beta is fully functionally dependent on alpha if:
- alpha -> beta
- if gamma is any subset of alpha, then gamma -> beta does not hold.
- Example: we have alpha = [A,B,C] and beta = [D]. Then ABC->D is true, but A->D,B->D,AC->D and so on is not true.
- Basically: Beta is fully functionally dependent on alpha if it's functionally dependent on alpha but it's not functionally dependent on any proper subset of attributes in alpha.
Let's take an example. We have this table: EXAMS[StudentName, Course, Grade, FacultyMember]
StudentName | Course | Grade | Faculty Member |
---|---|---|---|
Pop Ioana | Computer Networks | 10 | Matei Ana |
Vlad Ana | Probabilities and Statistics | 10 | Simion Bogdan |
Vlad Ana | Computer Networks | 9.98 | Matei Ana |
Dan Andrei | Probabilities and Statistics | 10 | Simion Bogdan |
Popescu Alex | Operating Systems | 9.99 | Matei Ana |
The key is {StudentName, Couse}, and as we can oberse, we have this dependency: {Course}->{Faculty Member}. We need to check if it's in the 2NF:
- First rule: to be in 1NF. It doesn't have any repeating attributes, so it's okay.
- Second rule: every non-prime attribute is fully functionally dependent on every key.
- The prime attributes are StudentName(S), Course(C), and the non-prime attributes are Grade(G), FacultyMember(F).
- We check for the Grade attribute: is it fully functionally dependent on the key? Fully functionally dependent on the key means:
- Is grade functionally dependent on StudentName and Course? Yes. SC -> G
- Is grade not functionally dependent on any subsets of the key? S->G doesn't hold because a student name can't determine a grade, and C -> G doesn't hold because just a course can't determine a grade. So Grade is not functionally dependent on any subsets of the key.
- So, Grade is functionally dependent on the key, and it is not functionally dependent on any proper subset of the key => Grade is fully functionally dependent on the key.
- We check for the FacultyMember attribute: is it fully functionally dependent on the key? Fully functionally dependent on the key means:
- Is F functionally dependent on StudentName and Course? Yes. SC -> F
- Is F not functionally dependent on any subsets of the key? S->F doesn't hold because a student name can't determine a faculty member, but C->F does hold.
- So, while FacultyMember is functionally dependent on the key, it is also functionally dependent on a proper subset of the key => FacultyMember is not fully functionally dependent on the key.
The second rule isn't respected. We can get rid of this redundancy(due to this functional dependency) by decomposiong our original relation into two relations: RESULTS[StudentName, Course, Grade] and COURSES[Course, FacultyMember]. From the original relation we'll remove the dependent(FacultyMmember), and the determinant(Course) we'll be the primary key.
So this table:
StudentName | Course | Grade | Faculty Member |
---|---|---|---|
Pop Ioana | Computer Networks | 10 | Matei Ana |
Vlad Ana | Probabilities and Statistics | 10 | Simion Bogdan |
Vlad Ana | Computer Networks | 9.98 | Matei Ana |
Dan Andrei | Probabilities and Statistics | 10 | Simion Bogdan |
Popescu Alex | Operating Systems | 9.99 | Matei Ana |
Becomes these tables:
RESULTS | COURSES | ||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
Rules:
- Be in 2NF
- no non-prime attribute is transitevely dependent on any key
A transitive functional dependency is when changing a non-key column, might cause any of the other non-key columns to change(in the example below: changing either salary or reputation causes changes in the other column).
From lecture: attribute Z is transitively dependent on attribute X if it exists a Y sush that X->Y, Y->Z, Y->X does not hold.
Example:
PID | Name | Reputation | Salary |
---|---|---|---|
p1 | Jack Sparrow | 5 | 10k |
p2 | Elizabeth Swan | 6 | 15k |
p3 | Captain Barbossa | 5 | 10k |
p4 | Will Turner | 6 | 15k |
p5 | Davy Jones | 4 | 9k |
This table is in 2NF. We also see that {Reputation} -> {Salary}. We still have redundancy because of this dependency. Salary is transitively dependent on reputation.
- we said that: attribute Z is transitively dependent on attribute X if it exists a Y such that X->Y, Y->Z, Y->X does not hold.
- for this example, PID determines Reputation (because it's a key). Reputation determines Salary(which is not a subset of PID or Reputation). Reputation does not determine PID. By our definition, Salary is transitively dependent on PID, so the second rule is broken.
A decomposition can be:
Table 1 | Table 2 | ||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
BCNF = (Boyce-Codd Normal Form)
Definition from lecture: A relation is in BCNF if the determinant for each and every dependency is a candidate key.
Even when a database is in 3rd Normal Form, still there would be anomalies resulted if it has more than one Candidate Key. Sometimes is BCNF is also referred as 3.5 Normal Form.
Example: For table EX_SCHEDULE(Date, Hour, FacultyMember, Room, Group) we can have these keys:
- key{Date, Group}: a group of students has at most one exam per date.
- key{FacultyMember, Date, Hour}: on a certain date and time, a faculty member has at most one exam
- key{Room, Date, Hour}: on a certain date and time, in a certain room, we have at most one exam
- and this functional dependency: FD{FacultyMember, Date}->{Room}
This relation is in 3NF, but we still have redundancy. We decompose it like this:
- EX_SCHEDULE[Date, Hour, FacultyMember, Group] (we took out the dependent)
- we still have the keys {Date, Group} and {FacultyMember, Date, Hour}, but {Room, Date, Hour} is lost.
- ROOM_ALLOCATION[FacultyMember, Date, Room]
- here we move {Room, Date, Hour}
Now there are not any dependencies in which the left side is not a key. Reformulation: in the left side, we only need to have keys. Before, we had the key{Room, Date, Hour} and the dependency: FD{FacultyMember, Date}->{Room}, and FacultyMember was not a key.
Definition from lecture: A relation is in 4NF if and only if for every MVD that holds over our relation, one of the following statements is true:
- the MVD is trivial
- the left side of the MVD is a super key.
Basically: If no database table instance contains two or more, independent and multivalued data describing the relevant entity, then it is in 4th Normal Form.
Rules for 4th Normal Form:
- It should be in the Boyce-Codd Normal Form.
- And, the table should not have any Multi-valued Dependency.
A table is said to have multi-valued dependency, if the following conditions are true:
- For a dependency A → B, if for a single value of A, multiple value of B exists, then the table may have multi-valued dependency.
- Also, a table should have at-least 3 columns for it to have a multi-valued dependency.
- And, for a relation R(A,B,C), if there is a multi-valued dependency between, A and B, then B and C should be independent of each other.
If all these conditions are true for any relation(table), it is said to have multi-valued dependency.
Below we have a college enrolment table with columns s_id, course and hobby.
s_id | course | hobby |
---|---|---|
1 | Science | Cricket |
1 | Maths | Hockey |
2 | C# | Cricket |
2 | Php | Hockey |
As you can see in the table above, student with s_id 1 has opted for two courses, Science and Maths, and has two hobbies, Cricket and Hockey.
You must be thinking what problem this can lead to, right?
Well the two records for student with s_id 1, will give rise to two more records, as shown below, because for one student, two hobbies exists, hence along with both the courses, these hobbies should be specified.
s_id | course | hobby |
---|---|---|
1 | Science | Cricket |
1 | Maths | Hockey |
1 | Science | Hockey |
1 | Maths | Cricket |
And, in the table above, there is no relationship between the columns course and hobby. They are independent of each other. So there is multi-value dependency, which leads to un-necessary repetition of data and other anomalies as well.
How to satisfy 4th Normal Form? To make the above relation satify the 4th normal form, we can decompose the table into 2 tables.
CourseOpted Table | Hobbies Table | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
A table is in 5th Normal Form only if it is in 4NF and it cannot be decomposed into any number of smaller tables without loss of data.
From lecture: Relation 𝑅 is in 5NF if every non-trivial JD is implied by the candidate keys in 𝑅.
If a table can be recreated by joining multiple tables and each of this table have a subset of the attributes of the table, then the table is in Join Dependency. It is a generalization of Multivalued Dependency
That would mean that if by doing a join relation on some relations and the result is equal to our original relation, then it;s not in 5NF.
6th Normal Form is not standardized, yet however, it is being discussed by database experts for some time. Hopefully, we would have a clear & standardized definition for 6th Normal Form in the near future…