4. Data Normalization

Looking at an existing database and attempting to improve the design.

Information Design Guidelines

User view (logical):

Storage/base relation level:

Semantics of Attributes

Guideline 1: each tuple should represent one entity or relationship instance:

Guideline 2: design relations so that no update anomalies are present:

Guideline 3: design relations to have as few NULL values as possible:

Guideline 4: relations should be designed to satisfy the lossless join condition:

Decomposition

Dividing a relation into two or more relations. To do this, normalization is required - this is essentially a set of tests used to determine the ‘quality’ of a schema.

Functional Dependencies

Assume the universal relation schema is R(A1,A2,,An)R(A_1, A_2, \dots, A_n).

A set of attributes, XX, functionally determines a set of attributes YY if the value of XX determines a unique value for YY. XYX \rightarrow Y.

For each value of XX, there is exactly one value of YY. The simplest example of this is if XX is the primary key and YY is the tuple.

If two tuples have the same value for XX, they have the same value for YY; if tuple_1[X]=tuple_2[X]tuple\_1[X] = tuple\_2[X], then tuple_1[Y]=tuple_2[Y]tuple\_1[Y] = tuple\_2[Y].

Inference Rules

Given a set of FDs FF, we can infer additional FDs that hold whenever the FDs in FF hold.

The closure of a set FF is the set F+F+ which contains all FDs that can be inferred from FF.

If XYX \rightarrow Y is inferred from FF, then it is notated as FXYF \models X \rightarrow Y.

Armstrong’s inference rules state that:

These three rules form a sound and complete set of inference (all functional dependencies can be inferred from these rules, and all are correct).

Using these three rules, we can form deduce three more rules:

Example

Given R(A,B,C,D)R(A, B, C, D) and a set of functional dependencies F={AB,ACD,BCD,CDA}F = \{ A \rightarrow B, AC \rightarrow D, BC \rightarrow D, CD \rightarrow A \}, show ACDAC \rightarrow D is redundant.

Using Armstrong’s theorems:

Using the definition of equivalent sets of functional dependencies (see equivalence sets section):

G={AB,BCD,CDA} G = \{ A \rightarrow B, BC \rightarrow D, CD \rightarrow A \}

FF definitely covers GG, so only need to find out if GG covers FF.

For all FDs in FF:

Determining the closure

Closure of a set of attributes XX with respect to FF is the set X+X+ of all attributes that are functionally determined by XX.

{X}F+\{X\}_F^+ can be calculated by repeated applying the three primary rules:

X_plus = X
while changes_occur()
  for Y, Z in F: # F is a set of functional dependencies of the form Y -> Z
    if Y in X_plus:
      X_plus.append(Z)
Example

Given R(A,B,C,D,E)R(A, B, C, D, E) as a set of functional dependencies F={ABC,CDE,ACE,BD,EAB}F = \{ A \rightarrow BC, CD \rightarrow E, AC \rightarrow E, B \rightarrow D, E \rightarrow AB \}, determine all candidate keys in RR:

Start with single attributes:

{A}F+=A \{A\}_F^+ = { A }

Look at the functional dependencies:

As {A}F+\{A\}_F^+ contains all the attributes, AA is a candidate key.

For BB:

{C}F+={C}\{C\}_F^+ = \{ C \}: only determines itself. The same is true for DD.

After one iteration, {E}F+={A,B,E}\{E\}_F^+ = \{ A, B, E \}. After another iteration, CC and DD get added. Thus, EE is also a primary key.

Now, look at tuples with two attributes. Ignore tuples containing AA or EE as they are candidate keys by themselves:

Thus, AA, EE, BCBC and CDCD are candidate keys. No three-tuples can be created so the search stops here.

Equivalence of Sets of FDs

Two sets of FDs, FF and GG are equivalent if F+=G+F^+ = G^+: every FD in FF can be inferred from GG and vice-versa.

Alternatively, FF covers GG if G+F+G^+ \subseteq F^+. Hence, FF and GG are equivalent if FF covers GG and vice-versa. This is much easier to test:

Calculate {X}F+\{X\}_F^+ (X+X^+ with respect to FF) for each FD XYX \rightarrow Y in GG; if this includes all attributes in YY, the sets are equal.

Example

Check if FF and GG are equivalent:

F={AC,ACD,EAD,EH}G={ACD,EAH} \begin{aligned} F &= \{ A \rightarrow C, AC \rightarrow D, E \rightarrow AD, E \rightarrow H \} \\ G &= \{ A \rightarrow CD, E \rightarrow AH \} \end{aligned}

FDs in GG:

Hence, FF covers GG.

FDs in FF:

Hence, GG covers FF.

Hence, the sets are equivalent.

Minimal Sets of FDs

A set of FDs FF is minimal if:

Every set of FDs has an equivalent minimal set (or sets - there can be several equivalent minimal sets).

Example

Is GG minimal? If not, find the minimal set:

G={SSN{Name,Born,Address,DepartmentId},DepartmentId {Name,ManagerSSN}} \begin{aligned} & G = \{ \\ & \quad \text{SSN} \rightarrow \{\text{Name}, \text{Born}, \text{Address}, \text{DepartmentId} \}, \\ & \quad \text{DepartmentId } \rightarrow \{ \text{Name}, \text{ManagerSSN} \} \\ & \} \end{aligned}

First condition violated; RHS has multiple attributes for both dependencies.

Using the decomposition rule (ABCA \rightarrow BC means ABA \rightarrow B and BCB \rightarrow C):

The third rule is met as the LHS only has one element, so it cannot be reduced further.

The second rule is met; it is obvious that all dependencies are necessary.

Normal Forms Based on Primary Keys

Normalization is the process of decomposing ‘bad’ relations into smaller relations.

Keys

Superkey of relation RR: set of attributes SS, SRS \subseteq R such that no two legal tuples in RR will have the same SS values.

Key KK: superkey where removal of any attribute will cause it to not be a tuple.

If a relation has more than one key, each is called a candidate key, with one arbitrarily designated as the primary key and the others as secondary keys.

A prime attribute is a member of at least one candidate key.

First Normal Form

Every attribute is single and atomic; no composite or multivalued attributes or nested relations.

Second Normal Form

Full functional dependency: FD of the form YZY \rightarrow Z where removing any attribute from YY means the FD no longer holds. It is called a partial dependency if this does not hold.

Second normal form adds more conditions to the first normal form: no non-prime attribute is partially functionally dependent on the primary key.

To transform into second normal form, split into smaller tables: if the primary key is AA and BB, and attribute CC is fully dependent on both AA and BB but DD is dependent on only BB, create two tables, {A,B,C}\{ A, B, C\} and {B,D}\{B, D\}. The primary key of the secondary table is only BB, so DD is fully dependent on the primary key.

Third Normal Form

If XYX \rightarrow Y and YZY \rightarrow Z, XZX \rightarrow Z is a transitive functional dependency.

Relations in third normal form are in 2NF and no non-prime attributes are transitively dependent on the primary key i.e. PrimaryKeyNonPrimeAttribute\text{PrimaryKey} \rightarrow \text{NonPrimeAttribute}.

Example: SSN\text{SSN} identifies DepartmentID\text{DepartmentID} and DepartmentID\text{DepartmentID} identifies DepartmentName\text{DepartmentName}. Hence DepartmentName\text{DepartmentName} must be in a separate relation from SSN\text{SSN}, with DNumber\text{DNumber} acting as the primary key.

General Normal Form Definitions

These take into account candidate keys, not just the primary key.

Second normal form: no non-prime attribute is partially functionally dependent on any key of the relation.

Third normal form: for all FDs:

Only the first condition applies for BCNF.

Ensure at least one table contains the entire primary key, and check that it is a minimal cover before transforming.

Example

R(A,B,C,D,E,F,G,H,I,J)R(A, B, C, D, E, F, G, H, I, J) and F={ABC,ADE,BF,FGH,DIJ}F = \{ AB \rightarrow C, A \rightarrow DE, B \rightarrow F, F \rightarrow GH, D \rightarrow IJ \}.

Find the key of the table

Tip: look for any attribute that never appears on the RHS of any dependency.

{B}F+={B}\{B\}_F^+ = \{ B \}; first loop: {B,F}\{ B, F \}; second loop: {B,F,G,H}\{ B, F, G, H \}. This does not contain all attributes, so BB by itself is not a key

Try AA and BB: {AB}F+\{AB\}_F^+ = {A,B}\{ A, B \}; …; this contains all attributes, so it is the key.

Decompose RR into 2NF

All non-prime attributes are fully dependent on ABAB. Hence, there can be no dependencies of the form AXA \rightarrow X or BXB \rightarrow X. Hence, ADEA \rightarrow DE and BFB \rightarrow F violate this rule. Split the relation into multiple relations:

Decompose RR into 3NF

ADA \rightarrow D and DIJD \rightarrow IJ, so there are transitive dependencies:

BFB \rightarrow F and FGHF \rightarrow GH, so decompose:

R3R_3 has no transitive dependencies.

Boyce-Codd Normal Form

An even stricter form of 3NF: if XAX \rightarrow A, then XX is a superkey of RR.

It is not always desirable to transform a relation into BCNF as some FDs may be lost in the process.

In addition, it may not be possible to transform into BCNF.

Example

Area\text{Area} is a non-prime dependency so it violates BCNF.

Decompose Area\text{Area}, CountyName\text{CountyName} into their own table, with Area\text{Area} being the primary key, and another table with PropertyId,Area,LotNumber\text{PropertyId}, \text{Area}, \text{LotNumber}, with PropertyID\text{PropertyID} being the primary key.

Note that the dependencies that PropertyIDCountyName\text{PropertyID} \rightarrow \text{CountyName} and CountyName,LotNumberID\text{CountyName}, \text{LotNumber} \rightarrow \text{ID} have been lost. Hence, converting to BCNF can sometimes be unreliable.

Example

Determine which normal form the relation, R(A,B,C,D)R(A, B, C, D), is in. ABAB is the primary key. The functional dependencies are F={ABCD,CABD,DC}F = \{ AB \rightarrow CD, C \rightarrow ABD, D \rightarrow C \} If necessary, decompose into BCNF.

CC and DD are also candidate keys.

Example

TT lists dentist/patient data; patient given appointment at specific date and time with a dentist in a particular room; the dentist is in the same room for all appointments on a given day. Find examples of insertion/deletion/update anomalies, and describe the process of normalizing it to BCNF.

T(DNo,DName,PNo,PName,ADate,ATime,Room)T(\text{DNo}, \text{DName}, \text{PNo}, \text{PName}, \text{ADate}, \text{ATime}, \text{Room}). The primary key is DNo,PNo,ADate\text{DNo}, \text{PNo}, \text{ADate}.

Informal guideline: do not mix attributes from different entity types; the table has information on dentists, patients and appointments.

Insertion/deletion/update anomalies:

Functional dependencies:

1NF: all attributes are simple and atomic.

2NF: Non-prime attributes that fully functionally dependent on the primary key.

Decompose the table to meet 2NF:

Each table now represents information about a single entity.

3NF: LHS superkey or RHS prime attribute. BCNF: LHS superkey. For all tables, the LHS is the primary key so it it is in both 3NF and BCNF form.

Relational Synthesis

The universal relation, R, contains all attributes you wish to store. Decompose this into nn tables RiR_i with the following properties:

Algorithm

Finding a Minimal Cover

For a set of FDs to be minimal:

To find a minimal cover G for F:

Example

Find the minimal cover for F={ABCD,ABC,CB}F=\{ ABC \rightarrow D, AB \rightarrow C, C \rightarrow B \}.

GG is initially the set of:

CBC \rightarrow B cannot be minimized further

Check ABCDABC \rightarrow D:

GG is now the set of:

Now check ABCAB \rightarrow C:

Now, check if all three FDs are necessary:

Hence, the minimal cover is - G={ACD,ABC,CB}G = \{ AC \rightarrow D, AB \rightarrow C, C \rightarrow B \}:

Note that there are two minimal covers for this specific example; the other is G={ADD,ABC,CB}G = \{ AD \rightarrow D, AB \rightarrow C, C \rightarrow B \}.

Lossless Join and FD-Preserving Decomposition into 3NF

Adds a new step to ensure all functional dependencies are kept: if none of the relations’ keys contains the key of the universal relation, create another table that contains the key.