4. Data Normalization

Looking at an existing database and attempting to improve the design.

Information Design Guidelines

User view (logical):

Is the database design clear
Can users understand the whole schema

Storage/base relation level:

How are tuples stored
Is too much memory being used

Semantics of Attributes

Guideline 1: each tuple should represent one entity or relationship instance:

Don’t mix attributes from different entities
Only use foreign keys to refer to other entities
Separate entity and relationship attributes

Guideline 2: design relations so that no update anomalies are present:

If any anomalies are present, document them and ensure programs deal with them correctly

Guideline 3: design relations to have as few NULL values as possible:

If an attribute is NULL frequently, place it in a separate relation

Guideline 4: relations should be designed to satisfy the lossless join condition:

No spurious tuples should be generated from doing a natural join (i.e. all tuples are meaningful)
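
As a rough illustration of Guideline 4, the sketch below splits a small table on an attribute that is not a key of either part and joins the parts back together. The table contents and the helper names `project` and `natural_join` are made up for this example.

```python
def project(rows, attrs):
    """Project a list of dict rows onto attrs, removing duplicate rows."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return [dict(zip(attrs, values)) for values in sorted(seen)]


def natural_join(left, right):
    """Join two lists of dict rows on every attribute name they share."""
    joined = []
    for lrow in left:
        for rrow in right:
            shared = set(lrow) & set(rrow)
            if all(lrow[a] == rrow[a] for a in shared):
                joined.append({**lrow, **rrow})
    return joined


# Hypothetical data: which employee works on which project, and in which city.
works_on = [
    {"emp": "Ann", "project": "P1", "city": "Oslo"},
    {"emp": "Bob", "project": "P2", "city": "Oslo"},
]

# Lossy decomposition: the shared attribute "city" is not a key of either part.
part1 = project(works_on, ["emp", "city"])
part2 = project(works_on, ["project", "city"])
rejoined = natural_join(part1, part2)

# Rows that appear after the join but were never in the original table.
spurious = [row for row in rejoined if row not in works_on]
print(len(rejoined), len(spurious))
```

The join pairs every employee in Oslo with every project in Oslo, so it produces rows such as Ann on P2 that the original table never contained.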

Decomposition

Dividing a relation into two or more relations. To do this, normalization is required - this is essentially a set of tests used to determine the ‘quality’ of a schema.

Functional Dependencies

Assume the universal relation schema is $R(A_1, A_2, \dots, A_n)$ .

A set of attributes, $X$ , functionally determines a set of attributes $Y$ if the value of $X$ determines a unique value for $Y$ . $X \rightarrow Y$ .

For each value of $X$ , there is exactly one value of $Y$ . The simplest example of this is if $X$ is the primary key and $Y$ is the tuple.

If two tuples have the same value for $X$ , they have the same value for $Y$ ; if $tuple\_1[X] = tuple\_2[X]$ , then $tuple\_1[Y] = tuple\_2[Y]$ .
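
The tuple-based definition above can be checked mechanically against a table instance. The following is a minimal sketch, assuming rows are stored as dictionaries; `fd_holds` and the sample data are hypothetical. Note that an instance can only refute a functional dependency: a dependency is a statement about all legal states of the relation, so passing this check on one instance does not prove it.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # Remember the first Y value seen for this X value; any later
        # row with the same X must carry the same Y.
        if seen.setdefault(x, y) != y:
            return False
    return True


# Hypothetical sample data.
employees = [
    {"ssn": 1, "name": "Ann", "dept": "D1"},
    {"ssn": 2, "name": "Bob", "dept": "D1"},
    {"ssn": 3, "name": "Ann", "dept": "D2"},
]

print(fd_holds(employees, ["ssn"], ["name", "dept"]))  # ssn -> name, dept
print(fd_holds(employees, ["name"], ["dept"]))         # name -> dept
```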

Inference Rules

Given a set of FDs $F$ , we can infer additional FDs that hold whenever the FDs in $F$ hold.

The closure of a set $F$ is the set $F^+$ which contains all FDs that can be inferred from $F$ .

If $X \rightarrow Y$ is inferred from $F$ , then it is notated as $F \models X \rightarrow Y$ .

Armstrong’s inference rules state that:

IR1, Reflexive: if $Y \subseteq X$ , then $X \rightarrow Y$
IR2, Augmentation: if $X \rightarrow Y$ , then $XZ \rightarrow YZ$ (adding same set of attributes to both sides)
IR3, Transitive: if $X \rightarrow Y$ and $Y \rightarrow Z$ , then $X \rightarrow Z$

These three rules form a sound and complete set of inference rules (every functional dependency that follows from $F$ can be inferred using these rules, and everything they infer is correct).

Using these three rules, we can deduce three more rules:

IR4, Decomposition: if $X \rightarrow YZ$ , then $X \rightarrow Y$ and $X \rightarrow Z$
- By reflexivity, $YZ \rightarrow Y$
- By the transitive rule, $X \rightarrow YZ \rightarrow Y$
- Repeat for $Z$
IR5, Union: if $X \rightarrow Y$ and $X \rightarrow Z$ , $X \rightarrow YZ$
- By augmentation to $X \rightarrow Z$ , $XY \rightarrow YZ$
- By augmentation to $X \rightarrow Y$ , $XX \rightarrow XY$ so $X \rightarrow XY$
- By the transitive rule, $X \rightarrow YZ$
IR6, Pseudo-transitivity: if $X \rightarrow Y$ and $WY \rightarrow Z$ , then $WX \rightarrow Z$
- By augmentation to $X \rightarrow Y$ , $WX \rightarrow WY$
- By the transitive rule, $WX \rightarrow Z$

Example

Given $R(A, B, C, D)$ and a set of functional dependencies $F = \{ A \rightarrow B, AC \rightarrow D, BC \rightarrow D, CD \rightarrow A \}$ , show $AC \rightarrow D$ is redundant.

Using Armstrong’s inference rules:

By augmentation on $A \rightarrow B$ , $AC \rightarrow BC$
By the transitive rule on $AC \rightarrow BC$ and $BC \rightarrow D$ , $AC \rightarrow D$

Using the definition of equivalent sets of functional dependencies (see equivalence sets section):

$$G = \{ A \rightarrow B, BC \rightarrow D, CD \rightarrow A \}$$

$F$ definitely covers $G$ , so only need to find out if $G$ covers $F$ .

For all FDs in $F$ :

$A \rightarrow B$ definitely covered since $G$ contains $A \rightarrow B$
…
$AC \rightarrow D$ : $\{AC\}_G^+ = \{ A, C \}$
- After one iteration: $\{ A, C, B, D \}$ . All attributes are now in the set
- $D$ is in $\{AC\}_G^+$ and all others are covered, so $G$ covers $F$ . Hence, $G$ and $F$ are equivalent so $AC \rightarrow D$ is redundant
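
The redundancy test above can be sketched in code: drop the dependency, compute the closure of its LHS under the remaining dependencies, and see whether the RHS is still reached. This sketch assumes single-letter attribute names so that a string such as `"AC"` stands for the set $\{A, C\}$; `closure` and `is_redundant` are illustrative names.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) attribute strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire the FD only if its whole LHS is known and it adds something.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_redundant(fd, fds):
    """True if fd can be inferred from the other dependencies in fds."""
    lhs, rhs = fd
    others = [other for other in fds if other != fd]
    return set(rhs) <= closure(lhs, others)


F = [("A", "B"), ("AC", "D"), ("BC", "D"), ("CD", "A")]
print(is_redundant(("AC", "D"), F))
print(is_redundant(("A", "B"), F))
```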

Determining the closure

Closure of a set of attributes $X$ with respect to $F$ is the set $X^+$ of all attributes that are functionally determined by $X$ .

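$\{X\}_F^+$ can be calculated by repeatedly applying the three primary rules:

X_plus = set(X)
changed = True
while changed:
  changed = False
  for Y, Z in F:  # F is a set of functional dependencies of the form Y -> Z
    if set(Y) <= X_plus and not set(Z) <= X_plus:  # all of Y is known and Z adds something new
      X_plus |= set(Z)
      changed = True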
Given $R(A, B, C, D, E)$ and a set of functional dependencies $F = \{ A \rightarrow BC, CD \rightarrow E, AC \rightarrow E, B \rightarrow D, E \rightarrow AB \}$ , determine all candidate keys in $R$ :

Start with single attributes:

$$\{A\}_F^+ = \{ A \}$$

Look at the functional dependencies:

$A \rightarrow BC$ and $A$ is in $\{A\}_F^+$ , so $B$ and $C$ can both be added to the set
$CD \rightarrow E$ but $D$ is not in $\{A\}_F^+$ , so skip it
$AC \rightarrow E$ and both $A$ and $C$ are in $\{A\}_F^+$ , so add $E$
$B \rightarrow D$ and $B$ is in $\{A\}_F^+$ , so add $D$
We now have all attributes so we do not need to continue

As $\{A\}_F^+$ contains all the attributes, $A$ is a candidate key.

For $B$ :

Loop through the FDs once: $\{B\}_F^+ = \{ B, D \}$
An attribute was added so repeat
No changes, so $B$ only determines $D$ and hence, it is not a candidate key

$\{C\}_F^+ = \{ C \}$ : only determines itself. The same is true for $D$ .

After one iteration, $\{E\}_F^+ = \{ A, B, E \}$ . After another iteration, $C$ and $D$ get added. Thus, $E$ is also a candidate key.

Now, look at sets of two attributes. Ignore sets containing $A$ or $E$ as they are candidate keys by themselves:

$\{BC\}_F^+ = \{ B, C, D, E, A \}$
$\{BD\}_F^+ = \{ B, D \}$
$\{CD\}_F^+ = \{ C, D, E, A, B \}$

Thus, $A$ , $E$ , $BC$ and $CD$ are candidate keys. The only three-attribute set without $A$ or $E$ is $BCD$ , which already contains the key $BC$ and so is not minimal; the search stops here.
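
The search above can be automated by trying attribute sets in order of increasing size and skipping any set that already contains a smaller key. This is a brute-force sketch, assuming single-letter attribute names; it is exponential in the number of attributes and is only meant for small examples like this one. The function names are illustrative.

```python
from itertools import combinations


def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) attribute strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attributes, fds):
    """All minimal attribute sets whose closure is the whole relation."""
    all_attrs = set(attributes)
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) == all_attrs:
                keys.append(candidate)
    return keys


F = [("A", "BC"), ("CD", "E"), ("AC", "E"), ("B", "D"), ("E", "AB")]
keys = candidate_keys("ABCDE", F)
print(sorted("".join(sorted(key)) for key in keys))
```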

Equivalence of Sets of FDs

Two sets of FDs, $F$ and $G$ are equivalent if $F^+ = G^+$ : every FD in $F$ can be inferred from $G$ and vice-versa.

Alternatively, $F$ covers $G$ if $G^+ \subseteq F^+$ . Hence, $F$ and $G$ are equivalent if $F$ covers $G$ and vice-versa. This is much easier to test:

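Calculate $\{X\}_F^+$ ( $X^+$ with respect to $F$ ) for each FD $X \rightarrow Y$ in $G$ ; if this includes all attributes in $Y$ for every such FD, then $F$ covers $G$ . Repeat with the roles of $F$ and $G$ swapped; if both checks succeed, the sets are equivalent.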

Example

Check if $F$ and $G$ are equivalent:

$$\begin{aligned} F &= \{ A \rightarrow C, AC \rightarrow D, E \rightarrow AD, E \rightarrow H \} \\ G &= \{ A \rightarrow CD, E \rightarrow AH \} \end{aligned}$$

FDs in $G$ :

$A \rightarrow CD$ : $\{A\}_F^+ = \{ A, C, D \}$ . $C$ and $D$ are both in the set
$E \rightarrow AH$ : $\{E\}_F^+ = \{ E, A, D, H, C \}$ . $A$ and $H$ are both in the set

Hence, $F$ covers $G$ .

FDs in $F$ :

$A \rightarrow C$ : $\{A\}_G^+ = \{ A, C, D \}$ . $C$ is in the set
$AC \rightarrow D$ : $\{AC\}_G^+ = \{ A, C, D \}$ . $D$ is in the set
$E \rightarrow AD$ : $\{E\}_G^+ = \{ E, A, H, C, D \}$ . $A$ and $D$ are both in the set
$E \rightarrow H$ : $\{E\}_G^+ = \{ E, A, H, C, D \}$ . $H$ is in the set

Hence, $G$ covers $F$ .

Hence, the sets are equivalent.
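
The two cover checks in this example can be written as a short sketch. It assumes single-letter attribute names, and the names `covers` and `equivalent` are illustrative.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) attribute strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def covers(f, g):
    """True if every FD in g can be inferred from f."""
    return all(set(rhs) <= closure(lhs, f) for lhs, rhs in g)


def equivalent(f, g):
    """True if f and g cover each other."""
    return covers(f, g) and covers(g, f)


F = [("A", "C"), ("AC", "D"), ("E", "AD"), ("E", "H")]
G = [("A", "CD"), ("E", "AH")]
print(equivalent(F, G))
```

Note that `covers(f, g)` computes closures with respect to `f` for the dependencies in `g`, so the subscript on the closure is always the set doing the covering.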

Minimal Sets of FDs

A set of FDs $F$ is minimal if:

Every dependency in $F$ has a single attribute on its RHS
Removing any dependency will lead to a set that is not equivalent to $F$
Replacing any dependency $X \rightarrow A$ with $Y \rightarrow A$ , where $Y \subset X$ , will lead to a set that is not equivalent to $F$

Every set of FDs has an equivalent minimal set (or sets - there can be several equivalent minimal sets).

Example

Is $G$ minimal? If not, find the minimal set:

$$\begin{aligned} & G = \{ \\ & \quad \text{SSN} \rightarrow \{\text{Name}, \text{Born}, \text{Address}, \text{DepartmentId} \}, \\ & \quad \text{DepartmentId} \rightarrow \{ \text{Name}, \text{ManagerSSN} \} \\ & \} \end{aligned}$$

First condition violated; RHS has multiple attributes for both dependencies.

Using the decomposition rule ( $A \rightarrow BC$ means $A \rightarrow B$ and $A \rightarrow C$ ):

$\text{SSN} \rightarrow \text{Name}$
$\text{SSN} \rightarrow \text{Born}$
$\text{SSN} \rightarrow \text{Address}$
$\text{SSN} \rightarrow \text{DepartmentId}$
$\text{DepartmentId} \rightarrow \text{Name}$
$\text{DepartmentId} \rightarrow \text{ManagerSSN}$

The third rule is met as the LHS only has one element, so it cannot be reduced further.

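The second rule is met, provided the $\text{Name}$ determined by $\text{SSN}$ (the employee's name) and the $\text{Name}$ determined by $\text{DepartmentId}$ (the department's name) are distinct attributes; no dependency can then be inferred from the others. If they were the same attribute, $\text{SSN} \rightarrow \text{Name}$ would follow by transitivity and be redundant.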

Normal Forms Based on Primary Keys

Normalization is the process of decomposing ‘bad’ relations into smaller relations.

Keys

Superkey of relation $R$ : set of attributes $S$ , $S \subseteq R$ such that no two legal tuples in $R$ will have the same $S$ values.

Key $K$ : superkey where removal of any attribute will cause it to no longer be a superkey.

If a relation has more than one key, each is called a candidate key, with one arbitrarily designated as the primary key and the others as secondary keys.

A prime attribute is a member of at least one candidate key.

First Normal Form

Every attribute is single and atomic; no composite or multivalued attributes or nested relations.

Second Normal Form

Full functional dependency: FD of the form $Y \rightarrow Z$ where removing any attribute from $Y$ means the FD no longer holds. It is called a partial dependency if this does not hold.

Second normal form adds more conditions to the first normal form: no non-prime attribute is partially functionally dependent on the primary key.

To transform into second normal form, split into smaller tables: if the primary key is $\{A, B\}$ , and attribute $C$ is fully dependent on both $A$ and $B$ but $D$ is dependent on only $B$ , create two tables, $\{ A, B, C\}$ and $\{B, D\}$ . The primary key of the second table is only $B$ , so $D$ is fully dependent on the primary key.
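
Partial dependencies can be found by taking the closure of every proper subset of the key and seeing which non-key attributes it reaches. This is a minimal sketch, assuming single-letter attribute names and a relation with a single candidate key (so that non-key and non-prime attributes coincide); `partial_dependencies` is an illustrative name.

```python
from itertools import combinations


def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) attribute strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def partial_dependencies(key, attributes, fds):
    """Map each proper subset of key to the non-key attributes it determines."""
    key = set(key)
    non_key = set(attributes) - key
    found = {}
    for size in range(1, len(key)):  # proper, non-empty subsets only
        for subset in combinations(sorted(key), size):
            determined = closure(subset, fds) & non_key
            if determined:
                found["".join(subset)] = determined
    return found


# The example from the text: key {A, B}, C depends on both, D only on B.
fds = [("AB", "C"), ("B", "D")]
partial = partial_dependencies("AB", "ABCD", fds)
print(partial)
```

Each entry of the result suggests one table to split off: the key subset together with the attributes it determines.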

Third Normal Form

If $X \rightarrow Y$ and $Y \rightarrow Z$ , $X \rightarrow Z$ is a transitive functional dependency.

Relations in third normal form are in 2NF and no non-prime attribute is transitively dependent on the primary key, i.e. there is no chain $\text{PrimaryKey} \rightarrow Y \rightarrow \text{NonPrimeAttribute}$ where $Y$ is not itself a candidate key.

Example: $\text{SSN}$ identifies $\text{DepartmentID}$ and $\text{DepartmentID}$ identifies $\text{DepartmentName}$ . Hence $\text{DepartmentName}$ must be in a separate relation from $\text{SSN}$ , with $\text{DepartmentID}$ acting as the primary key.

General Normal Form Definitions

These take into account candidate keys, not just the primary key.

Second normal form: no non-prime attribute is partially functionally dependent on any key of the relation.

Third normal form: for all FDs:

The LHS is a superkey of $R$ OR
The RHS is a prime attribute of $R$

Only the first condition applies for BCNF.
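
The two conditions can be tested per dependency, as in the sketch below. It assumes single-letter attribute names and that the caller already knows the prime attributes (for example from a candidate key search); `normal_form_violations` and both sample relations are hypothetical.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) attribute strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def normal_form_violations(all_attrs, fds, prime_attrs):
    """Return (bcnf_violations, third_nf_violations) for the given FDs."""
    everything, prime = set(all_attrs), set(prime_attrs)
    bcnf, third = [], []
    for lhs, rhs in fds:
        extra = set(rhs) - set(lhs)  # ignore the trivial part of the FD
        if not extra or closure(lhs, fds) == everything:
            continue  # trivial FD, or LHS is a superkey: fine for both forms
        bcnf.append((lhs, rhs))
        if not extra <= prime:  # RHS contains a non-prime attribute
            third.append((lhs, rhs))
    return bcnf, third


# Hypothetical R(A, B, C): candidate keys AB and AC, so every attribute is prime.
bcnf_1, third_1 = normal_form_violations("ABC", [("AB", "C"), ("C", "B")], "ABC")
# Hypothetical R(A, B, C): the only key is A, so B and C are non-prime.
bcnf_2, third_2 = normal_form_violations("ABC", [("A", "B"), ("B", "C")], "A")
print(bcnf_1, third_1)
print(bcnf_2, third_2)
```

The first relation is in 3NF but not BCNF; the second violates both.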

Before transforming, check that the set of FDs is a minimal cover, and ensure at least one of the resulting tables contains the entire primary key.

Example

$R(A, B, C, D, E, F, G, H, I, J)$ and $F = \{ AB \rightarrow C, A \rightarrow DE, B \rightarrow F, F \rightarrow GH, D \rightarrow IJ \}$ .

Find the key of the table

Tip: look for any attribute that never appears on the RHS of any dependency.

$\{B\}_F^+ = \{ B \}$ ; first loop: $\{ B, F \}$ ; second loop: $\{ B, F, G, H \}$ . This does not contain all attributes, so $B$ by itself is not a key

Try $A$ and $B$ : $\{AB\}_F^+$ = $\{ A, B \}$ ; …; this contains all attributes, so it is the key.

Decompose $R$ into 2NF

In 2NF, all non-prime attributes must be fully dependent on the key $AB$ , so there can be no dependencies of the form $A \rightarrow X$ or $B \rightarrow X$ where $X$ is non-prime. $A \rightarrow DE$ and $B \rightarrow F$ violate this rule, so split the relation into multiple relations:

$R_1(A, D, E, I, J)$
- Start with $(A, D, E)$ , and as $D \rightarrow IJ$ , $I$ and $J$ need to be added
$R_2(B, F, G, H)$
- Start with $(B, F)$ , and then add $G$ and $H$ as they are dependent on $F$
$R_3(A, B, C)$

Decompose $R$ into 3NF

$A \rightarrow D$ and $D \rightarrow IJ$ , so there are transitive dependencies:

$R_{11}(A, D, E)$
$R_{12}(D, I, J)$

$B \rightarrow F$ and $F \rightarrow GH$ , so decompose:

$R_{21}(B, F)$
$R_{22}(F, G, H)$

$R_3$ has no transitive dependencies.

Boyce-Codd Normal Form

An even stricter form of 3NF: if $X \rightarrow A$ , then $X$ is a superkey of $R$ .

It is not always desirable to transform a relation into BCNF as some FDs may be lost in the process.

In addition, it may not be possible to transform into BCNF while preserving every dependency.

Example

$\text{ID} \rightarrow \text{CountyName}, \text{LotNumber}, \text{Area}$
$\text{CountyName}, \text{LotNumber} \rightarrow \text{ID}, \text{Area}$
$\text{Area} \rightarrow \text{CountyName}$

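$\text{Area}$ is not a superkey, so $\text{Area} \rightarrow \text{CountyName}$ violates BCNF (the relation is still in 3NF because $\text{CountyName}$ is a prime attribute).

Decompose $\text{Area}$ , $\text{CountyName}$ into their own table, with $\text{Area}$ being the primary key, and another table with $\text{ID}, \text{Area}, \text{LotNumber}$ , with $\text{ID}$ being the primary key.

Note that the dependency $\text{CountyName}, \text{LotNumber} \rightarrow \text{ID}, \text{Area}$ no longer fits within any single table and cannot be inferred from the dependencies that remain, so it has been lost. ( $\text{ID} \rightarrow \text{CountyName}$ no longer appears in one table either, but can still be inferred from $\text{ID} \rightarrow \text{Area}$ and $\text{Area} \rightarrow \text{CountyName}$ .) Hence, converting to BCNF does not always preserve dependencies.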

Example

Determine which normal form the relation, $R(A, B, C, D)$ , is in. $AB$ is the primary key. The functional dependencies are $F = \{ AB \rightarrow CD, C \rightarrow ABD, D \rightarrow C \}$ . If necessary, decompose into BCNF.

$C$ and $D$ are also candidate keys.

1NF: all attributes single and atomic
2NF: all attributes are prime attributes, so there are no partial dependencies
3NF: either a superkey on LHS or prime attribute on RHS; all attributes are prime so 3NF
BCNF: superkey on LHS; $AB$ , $C$ and $D$ are all superkeys

Example

$T$ lists dentist/patient data; patient given appointment at specific date and time with a dentist in a particular room; the dentist is in the same room for all appointments on a given day. Find examples of insertion/deletion/update anomalies, and describe the process of normalizing it to BCNF.

$T(\text{DNo}, \text{DName}, \text{PNo}, \text{PName}, \text{ADate}, \text{ATime}, \text{Room})$ . The primary key is $\text{DNo}, \text{PNo}, \text{ADate}$ .

Informal guideline: do not mix attributes from different entity types; the table has information on dentists, patients and appointments.

Insertion/deletion/update anomalies:

Insertion: can have two tuples with the same dentist number but different name (and same for patients)
Insertion: patient that has registered but has no appointment. There will be no $\text{DNo}$ or $\text{ADate}$ , so cannot insert it into the table (same for dentists)
Deletion: deleting an appointment could remove all information about the patient if it was their only appointment
Insertion: different rooms for the same dentist on the same day

Functional dependencies:

$\{ \text{DNo}, \text{PNo}, \text{ADate} \} \rightarrow \{ \text{DName}, \text{PName}, \text{ATime}, \text{Room} \}$ (decompose this into four FDs)
$\text{DNo} \rightarrow \text{DName}$
$\text{PNo} \rightarrow \text{PName}$
$\{ \text{DNo}, \text{ADate} \} \rightarrow \text{Room}$

1NF: all attributes are simple and atomic.

2NF: all non-prime attributes must be fully functionally dependent on the primary key.

Non-prime attributes: $\text{Room}$ , $\text{DName}$ , $\text{PName}$ , $\text{ATime}$
The second, third and fourth dependencies violate this, as their left-hand sides ( $\text{DNo}$ , $\text{PNo}$ and $\{ \text{DNo}, \text{ADate} \}$ ) are part of but not the whole primary key

Decompose the table to meet 2NF:

From FD2: $\text{DNo}, \text{DName}$ ; $\text{DNo}$ is key
- $\text{DNo} \rightarrow \text{DName}$
From FD3: $\text{PNo}, \text{PName}$ ; $\text{PNo}$ is key
- $\text{PNo} \rightarrow \text{PName}$
From FD4: $\text{DNo}, \text{ADate}, \text{Room}$ ; $\{ \text{DNo}, \text{ADate} \}$ is the key
- $\{ \text{DNo}, \text{ADate} \} \rightarrow \text{Room}$
From remaining attributes: $\text{DNo}, \text{PNo}, \text{ADate}, \text{ATime}$ ; $\{ \text{DNo}, \text{PNo}, \text{ADate} \}$ is the key
- $\{ \text{DNo}, \text{PNo}, \text{ADate} \} \rightarrow \text{ATime}$

Each table now represents information about a single entity.

3NF: LHS superkey or RHS prime attribute. BCNF: LHS superkey. For all tables, the LHS of every FD is the primary key, so the schema is in both 3NF and BCNF.

Relational Synthesis

The universal relation, R, contains all attributes you wish to store. Decompose this into $n$ tables $R_i$ with the following properties:

Attribute preservation: $\cup_{i}{R_i} = R$
Dependency preservation: each FD either appears in one of the decomposed relations or can be inferred
Lossless join: when tables are joined, there are no spurious tuples

Algorithm

Find a minimal cover G for F (minimize number of FDs)
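For each LHS $X$ of an FD in $G$ , create a relation $X \cup \{A_1\} \cup \dots \cup \{A_m\}$ , where $X \rightarrow A_1$ , …, $X \rightarrow A_m$ are all the FDs in $G$ with $X$ as the LHS (i.e. one table per LHS, holding it and everything it directly determines)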
Place any remaining (unplaced) attributes in a single relation schema to ensure attributes are preserved

Finding a Minimal Cover

For a set of FDs to be minimal:

There is one attribute on the RHS
You cannot replace the LHS of any FD with a smaller set, and still have an equivalent set of FDs
You cannot remove any FDs from a set

To find a minimal cover G for F:

Initialize G to F
Replace each FD $X \rightarrow A_1, \dots, A_n$ with $n$ FDs: $X \rightarrow A_1$ , …, $X \rightarrow A_n$ (decomposition rule)
For each $X \rightarrow A$ in $G$ :
- For each attribute $B$ in $X$ :
  - Let $Y = X - \{B\}$ and $J = (G - \{ X \rightarrow A \}) \cup \{ Y \rightarrow A \}$
    - That is, $J$ is $G$ with $B$ removed from the LHS of this FD
  - Compute $Y_{J}^+$ and $Y_{G}^+$ . If they are equal, replace $X \rightarrow A$ with $Y \rightarrow A$ in $G$ , and set $X=Y$
For each $X \rightarrow A$ in $G$ :
- Compute $X_{G - (X \rightarrow A)}^+$ ; if it contains $A$ , then the FD $X \rightarrow A$ is redundant so remove it from $G$
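
The steps above can be sketched as follows. This sketch assumes single-letter attribute names and uses a swapped-in but equivalent test for the LHS reduction: instead of comparing $Y_J^+$ with $Y_G^+$ , it checks whether $A$ is already in $Y_G^+$ (the two closures are equal exactly when that holds). The result depends on the order in which attributes and FDs are tried, so other minimal covers may exist.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def minimal_cover(fds):
    """Return one minimal cover as a list of (frozenset lhs, attribute) pairs."""
    # Step 1: decomposition rule, one attribute per RHS (duplicates dropped).
    g = list(dict.fromkeys((frozenset(lhs), a) for lhs, rhs in fds for a in rhs))
    # Step 2: remove extraneous attributes from each LHS.
    for i in range(len(g)):
        lhs, a = g[i]
        for b in sorted(lhs):
            reduced = lhs - {b}
            if reduced and a in closure(reduced, g):
                lhs = reduced
                g[i] = (lhs, a)
    # Step 3: remove FDs that follow from the remaining ones.
    for fd in list(g):
        rest = [other for other in g if other != fd]
        if fd[1] in closure(fd[0], rest):
            g = rest
    return g


cover = minimal_cover([("ABC", "D"), ("AB", "C"), ("C", "B")])
print(sorted(("".join(sorted(lhs)), a) for lhs, a in cover))
```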

Example

Find the minimal cover for $F=\{ ABC \rightarrow D, AB \rightarrow C, C \rightarrow B \}$ .

$G$ is initially the set of:

$ABC \rightarrow D$
$AB \rightarrow C$
$C \rightarrow B$

$C \rightarrow B$ cannot be minimized further

Check $ABC \rightarrow D$ :

Try remove $A$ : $BC \rightarrow D$
- Compute $\{BC\}_G^+$
  - $= \{ B, C \}$
  - Can’t add any more attributes
- Compute $\{BC\}_{G'}^+$ , where $G'=\{ BC \rightarrow D, AB \rightarrow C, C \rightarrow B \}$
  - $= \{ B, C \}$
  - $= \{ B, C, D \}$
- They are different, so you cannot remove $A$
Try remove $B$ : $AC \rightarrow D$
- Compute $\{AC\}_G^+$
  - $= \{ A, C \}$
  - $= \{ A, C, B \}$
  - $= \{ A, C, B, D \}$
- Compute $\{AC\}_{G'}^+$ , where $G'=\{ AC \rightarrow D, AB \rightarrow C, C \rightarrow B \}$
  - $= \{ A, C \}$
  - $= \{ A, C, B \}$
  - $= \{ A, C, B, D \}$
- They are the same, so $B$ can be removed

$G$ is now the set of:

$AC \rightarrow D$
$AB \rightarrow C$
$C \rightarrow B$

Try remove $C$ : $A \rightarrow D$ (using the new dependency, not $ABC \rightarrow D$ )
- Compute $\{A\}_G^+$
  - $= \{ A \}$
  - There is no functional dependency that can be used
- Compute $\{A\}_{G'}^+$ , where $G'=\{ A \rightarrow D, AB \rightarrow C, C \rightarrow B \}$
  - $= \{ A \}$
  - $= \{ A, D \}$
- They are different, so $C$ cannot be removed

Now check $AB \rightarrow C$ :

Try remove $A$ : $B \rightarrow C$
- Compute $\{B\}_G^+$ :
  - $= \{ B \}$
  - There is no functional dependency that can be used
- Compute $\{B\}_{G'}^+$ , where $G'=\{ AC \rightarrow D, B \rightarrow C, C \rightarrow B \}$ :
  - $= \{ B \}$
  - $= \{ B, C \}$
- They are different, so $A$ cannot be removed
Try remove $B$ : $A \rightarrow C$
- Compute $\{A\}_G^+$
  - $= \{ A \}$
  - There is no functional dependency that can be used
- Compute $\{A\}_{G'}^+$ , where $G'=\{ AC \rightarrow D, A \rightarrow C, C \rightarrow B \}$ :
  - $= \{ A \}$
  - $= \{ A, C \}$
  - $= \{ A, C, B \}$
  - $= \{ A, C, B, D \}$
- They are different, so $B$ cannot be removed

Now, check if all three FDs are necessary:

$AC \rightarrow D$ :
- $\{AC\}_{G - \{AC \rightarrow D\}}^+ = \{ A, C, B \}$ ; $D$ is not in the closure, so the FD cannot be removed
$AB \rightarrow C$ :
- $\{AB\}_{G - \{AB \rightarrow C\}}^+ = \{ A, B \}$ ; $C$ is not in the closure, so the FD cannot be removed
$C \rightarrow B$ :
- $\{C\}_{G - \{C \rightarrow B\}}^+ = \{ C \}$ ; $B$ is not in the closure, so the FD cannot be removed

Hence, the minimal cover is $G = \{ AC \rightarrow D, AB \rightarrow C, C \rightarrow B \}$ :

Create a table $A, C, D$ with $\{A, C\}$ as the key
Create a table $A, B, C$ with $\{A, B\}$ as the key
Create a table $C, B$ with $C$ as the key

Note that there are two minimal covers for this specific example; the other is $G = \{ AB \rightarrow D, AB \rightarrow C, C \rightarrow B \}$ , obtained by removing $C$ rather than $B$ from $ABC \rightarrow D$ .

Lossless Join and FD-Preserving Decomposition into 3NF

Adds a new step to the synthesis algorithm so that the join is lossless while all functional dependencies are still kept: if none of the relations contains a key of the universal relation, create another relation that contains a key.
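
Putting the synthesis steps and the extra key step together gives roughly the following sketch. It assumes single-letter attribute names, that the caller supplies a minimal cover and one key of the universal relation, and that every attribute appears in some FD or in that key; `synthesize_3nf` is an illustrative name.

```python
def synthesize_3nf(cover, key):
    """Build one relation per distinct LHS, then add a key relation if needed."""
    grouped = {}
    for lhs, rhs in cover:
        # All FDs sharing an LHS end up in the same relation.
        grouped.setdefault(frozenset(lhs), set(lhs)).update(rhs)
    relations = list(grouped.values())
    # Extra step: make sure some relation contains the key of the
    # universal relation, otherwise the join may be lossy.
    if not any(set(key) <= relation for relation in relations):
        relations.append(set(key))
    return relations


# Minimal cover of the earlier R(A, ..., J) example, whose key is AB.
cover = [("AB", "C"), ("A", "D"), ("A", "E"), ("B", "F"),
         ("F", "G"), ("F", "H"), ("D", "I"), ("D", "J")]
relations = synthesize_3nf(cover, "AB")
print(sorted("".join(sorted(r)) for r in relations))

# Hypothetical case where no relation contains the key AC, so one is added.
extra = synthesize_3nf([("A", "B"), ("C", "D")], "AC")
print(sorted("".join(sorted(r)) for r in extra))
```

For the first input this reproduces the five relations from the 3NF decomposition example; the relation over $A, B, C$ already contains the key, so nothing is added.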