Looking at an existing database and attempting to improve the design.
User view (logical):
- Is the database design clear
- Can users understand the whole schema
Storage/base relation level:
- How are tuples stored
- Is too much memory being used
Guideline 1: each tuple should represent one entity or relationship instance:
- Don’t mix attributes from different entities
- Only use foreign keys to refer to other entities
- Separate entity and relationship attributes
Guideline 2: design relations so that no update anomalies are present:
- If any anomalies are present, document them and ensure programs deal with them correctly
Guideline 3: design relations to have as few NULL values as possible:
- If an attribute is NULL frequently, place it in a separate relation
Guideline 4: relations should be designed to satisfy the lossless join condition:
- No spurious tuples should be generated from doing a natural join (i.e. all tuples are meaningful)
Dividing a relation into two or more relations. To do this, normalization is required - this is essentially a set of tests used to determine the ‘quality’ of a schema.
Assume the universal relation schema is R(A1,A2,…,An).
A set of attributes, X, functionally determines a set of attributes Y if the value of X determines a unique value for Y. X→Y.
For each value of X, there is exactly one value of Y. The simplest example of this is if X is the primary key and Y is the tuple.
If two tuples have the same value for X, they have the same value for Y; if tuple_1[X]=tuple_2[X], then tuple_1[Y]=tuple_2[Y].
Given a set of FDs F, we can infer additional FDs that hold whenever the FDs in F hold.
The closure of a set F is the set F+ which contains all FDs that can be inferred from F.
If X→Y is inferred from F, then it is notated as F⊨X→Y.
Armstrong’s inference rules state that:
- IR1, Reflexive: if Y⊂X, X→Y
- IR2, Augmentation: if X→Y, then XZ→YZ (adding same set of attributes to both sides)
- IR3, Transitive: if X→Y and Y→Z, then X→Z
These three rules form a sound and complete set of inference (all functional dependencies can be inferred from these rules, and all are correct).
Using these three rules, we can form deduce three more rules:
- IR4, Decomposition: if X→YZ, then X→Y and X→Z
- By reflexivity, YZ→Y
- By the transitive rule, X→YZ→Y
- Repeat for Z
- IR5, Union: if X→Y and X→Z, X→YZ
- By augmentation to X→Z, XY→YZ
- By augmentation to X→Y, XX→XY so X→XY
- By the transitive rule, X→YZ
- IR6, Pseudo-transitivity: if X→Y and WY→Z, then WX→Z
- By augmentation to X→Y, WX→WY
- By the transitive rule, WX→Z
Given R(A,B,C,D) and a set of functional dependencies F={A→B,AC→D,BC→D,CD→A}, show AC→D is redundant.
Using Armstrong’s theorems:
- By augmentation on A→B, AC→BC
- By the transitive rule on AC→BC and BC→D, AC→D
Using the definition of equivalent sets of functional dependencies (see equivalence sets section):
G={A→B,BC→D,CD→A}
F definitely covers G, so only need to find out if G covers F.
For all FDs in F:
-
A→B definitely covered since G contains A→B
- …
-
AC→D: {AC}G+={A,C}
- After one iteration: {A,C,B,D}. All attributes are now in the set
-
D is in {AC}G+ and all others are covered, so G covers F. Hence, G and F are equivalent so AC→D is redundant
Closure of a set of attributes X with respect to F is the set X+ of all attributes that are functionally determined by X.
{X}F+ can be calculated by repeated applying the three primary rules:
X_plus = X
while changes_occur()
for Y, Z in F:
if Y in X_plus:
X_plus.append(Z)
Given R(A,B,C,D,E) as a set of functional dependencies F={A→BC,CD→E,AC→E,B→D,E→AB}, determine all candidate keys in R:
Start with single attributes:
{A}F+=A
Look at the functional dependencies:
-
A→BC and A is in {A}F+, so B and C can both be added to the set
-
CD→E but D is not in {A}F+, so skip it
-
AC→E and both A and E are in {A}F+, so add E
-
B→D and B is in {A}F+, so add D
- We now have all attributes so we do not need to continue
As {A}F+ contains all the attributes, A is a candidate key.
For B:
- Loop through the FDs once: {B}F+={B,D}
- An attribute was added so repeat
- No changes, so B only determines D and hence, it is not a candidate key
{C}F+={C}: only determines itself. The same is true for D.
After one iteration, {E}F+={A,B,E}. After another iteration, C and D get added. Thus, E is also a primary key.
Now, look at tuples with two attributes. Ignore tuples containing A or E as they are candidate keys by themselves:
-
{BC}F+={B,C,D,E,A}
-
{BD}F+={B,D}
-
{CD}F+={C,D,E,A,B}
Thus, A, E, BC and CD are candidate keys. No three-tuples can be created so the search stops here.
Two sets of FDs, F and G are equivalent if F+=G+: every FD in F can be inferred from G and vice-versa.
Alternatively, F covers
G if G+⊆F+. Hence, F and G are equivalent if F covers G and vice-versa. This is much easier to test:
Calculate {X}F+ (X+ with respect to F) for each FD X→Y in G; if this includes all attributes in Y, the sets are equal.
Check if F and G are equivalent:
FG={A→C,AC→D,E→AD,E→H}={A→CD,E→AH}
FDs in G:
-
A→CD: {A}F+={A,C,D}. C and D are both in the set
-
E→AH: {E}F+={E,A,D,H,C}. A and H are both in the set
Hence, F covers G.
FDs in F:
-
A→C: {A}F+={A,C,D}. C is in the set
-
AC→D: {AC}F+={A,C,D}. D is in the set
-
E→AH: {E}F+={E,A,D,H,C}. A and H are both in the set
-
E→H: {E}F+={E,A,D,H,C}. H is in the set
Hence, G covers F.
Hence, the sets are equivalent.
A set of FDs F is minimal if:
- Every dependency in F has a single attribute on its RHS
- Removing any dependency will lead to a set that is not equivalent to F
- Replacing any dependency X→A with Y→A, where Y⊂X, will lead to a set that is not equivalent to F
Every set of FDs has an equivalent minimal set (or sets - there can be several equivalent minimal sets).
Is G minimal? If not, find the minimal set:
G={SSN→{Name,Born,Address,DepartmentId},DepartmentId →{Name,ManagerSSN}}
First condition violated; RHS has multiple attributes for both dependencies.
Using the decomposition rule (A→BC means A→B and B→C):
-
SSN→Name
-
SSN→Born
-
SSN→Address
-
SSN→DepartmentId
-
DepartmentId→Name
-
DepartmentId→ManagerSSN
The third rule is met as the LHS only has one element, so it cannot be reduced further.
The second rule is met; it is obvious that all dependencies are necessary.
Normalization is the process of decomposing ‘bad’ relations into smaller relations.
Superkey of relation R: set of attributes S, S⊆R such that no two legal tuples in R will have the same S values.
Key
K: superkey where removal of any attribute will cause it to not be a tuple.
If a relation has more than one key, each is called a candidate key, with one arbitrarily designated as the primary key and the others as secondary keys.
A prime attribute is a member of at least one candidate key.
Every attribute is single and atomic; no composite or multivalued attributes or nested relations.
Full functional dependency: FD of the form Y→Z where removing any attribute from Y means the FD no longer holds. It is called a partial dependency if this does not hold.
Second normal form adds more conditions to the first normal form: no non-prime attribute is partially functionally dependent on the primary key.
To transform into second normal form, split into smaller tables: if the primary key is A and B, and attribute C is fully dependent on both A and B but D is dependent on only B, create two tables, {A,B,C} and {B,D}. The primary key of the secondary table is only B, so D is fully dependent on the primary key.
If X→Y and Y→Z, X→Z is a transitive functional dependency.
Relations in third normal form are in 2NF and no non-prime attributes are transitively dependent on the primary key i.e. PrimaryKey→NonPrimeAttribute.
Example: SSN identifies DepartmentID and DepartmentID identifies DepartmentName. Hence DepartmentName must be in a separate relation from SSN, with DNumber acting as the primary key.
These take into account candidate keys, not just the primary key.
Second normal form: no non-prime attribute is partially functionally dependent on any key of the relation.
Third normal form: for all FDs:
- The LHS is a superkey of R OR
- The RHS is a prime attribute of R
Only the first condition applies for BCNF.
Ensure at least one table contains the entire primary key, and check that it is a minimal cover before transforming.
R(A,B,C,D,E,F,G,H,I,J) and F={AB→C,A→DE,B→F,F→GH,D→IJ}.
Tip: look for any attribute that never appears on the RHS of any dependency.
{B}F+={B}; first loop: {B,F}; second loop: {B,F,G,H}. This does not contain all attributes, so B by itself is not a key
Try A and B: {AB}F+ = {A,B}; …; this contains all attributes, so it is the key.
All non-prime attributes are fully dependent on AB. Hence, there can be no dependencies of the form A→X or B→X. Hence, A→DE and B→F violate this rule. Split the relation into multiple relations:
-
R1(A,D,E,I,J)
- Start with (A,D,E), and as D→IJ, I and J need to be added
-
R2(B,F,G,H)
- Start with (B,F), and then add G and H as they are dependent on F
-
R3(A,B,C)
A→D and D→IJ, so there are transitive dependencies:
-
R11(A,D,E)
-
R12(D,I,J)
B→F and F→GH, so decompose:
-
R21(B,F)
-
R22(F,G,H)
R3 has no transitive dependencies.
An even stricter form of 3NF: if X→A, then X is a superkey of R.
It is not always desirable to transform a relation into BCNF as some FDs may be lost in the process.
In addition, it may not be possible to transform into BCNF.
-
ID→CountryName,LotNumber,Area
-
CountyName,LotNumber→ID,Area
-
Area→CountyName
Area is a non-prime dependency so it violates BCNF.
Decompose Area, CountyName into their own table, with Area being the primary key, and another table with PropertyId,Area,LotNumber, with PropertyID being the primary key.
Note that the dependencies that PropertyID→CountyName and CountyName,LotNumber→ID have been lost. Hence, converting to BCNF can sometimes be unreliable.
Determine which normal form the relation, R(A,B,C,D), is in. AB is the primary key.
The functional dependencies are F={AB→CD,C→ABD,D→C} If necessary, decompose into BCNF.
C and D are also candidate keys.
- 1NF: all attributes single and atomic
- 2NF: all attributes are prime attributes, so there are no partial dependencies
- 3NF: either a superkey on LHS or prime attribute on RHS; all attributes are prime so 3NF
- BCNF: superkey on LHS; AB, C and D are all superkeys
T lists dentist/patient data; patient given appointment at specific date and time with a dentist in a particular room; the dentist is in the same room for all appointments on a given day. Find examples of insertion/deletion/update anomalies, and describe the process of normalizing it to BCNF.
T(DNo,DName,PNo,PName,ADate,ATime,Room). The primary key is DNo,PNo,ADate.
Informal guideline: do not mix attributes from different entity types; the table has information on dentists, patients and appointments.
Insertion/deletion/update anomalies:
- Insertion: can have two tuples with the same dentist number but different name (and same for patients)
- Insertion: patient that has registered but has no appointment. There will be no DNo or ADate, so cannot insert it into the table (same for dentists)
- Deletion: deleting an appointment could remove all information about the patient if it was their only appointment
- Insertion: different rooms for the same dentist on the same day
Functional dependencies:
-
{DNo,PNo,ADate}→{DName,PName,ATime,Room} (decompose this into four FDs)
-
DNo→DName
-
PNo→PName
-
{DNo,ADate}→Room
1NF: all attributes are simple and atomic.
2NF: Non-prime attributes that fully functionally dependent on the primary key.
- Non-prime attributes: Room, DName, PName, ATime
- The second, third and forth dependencies violate this as DNo and PNo are part of but not the whole primary key
Decompose the table to meet 2NF:
- From FD2: DNo,DName; DNo is key
-
DNo→DName
- From FD3: PNo,PName; PNo is key
-
PNo→PName
- From FD4: DNo,ADate,Room; DNo, ADate are the keys
-
{DNo,ADate}→Room
- From remaining attributes: DNo,PNo,ADate,Time; DNo, PNo and ADate are keys
-
{DNo,PNo,ADate}→ATime
Each table now represents information about a single entity.
3NF: LHS superkey or RHS prime attribute. BCNF: LHS superkey.
For all tables, the LHS is the primary key so it it is in both 3NF and BCNF form.
The universal relation, R, contains all attributes you wish to store. Decompose this into n tables Ri with the following properties:
- Attribute preservation: ∪iRi=R
- Dependency preservation: each FD either appears in one of the decomposed relations or can be inferred
- Lossless join: when tables are joined, there are no spurious tuples
- Find a minimal cover G for F (minimize number of FDs)
- For each LHS X of a FD in G, create a relation X∪A1∪⋯∪Am where X→A (i.e. table for each FD and their dependents)
- Place any remaining (unplaced) attributes in a single relation schema to ensure attributes are preserved
For a set of FDs to be minimal:
- There is one attribute on the RHS
- You cannot replace the LHS of any FD with a smaller set, and still have an equivalent set of FDs
- You cannot remove any FDs from a set
To find a minimal cover G for F:
- Initialize G to F
- Replace each FD X→A1,…,An with n FDs: X→A1, …, X→An (decomposition rule)
- For each X→A in G:
- For each attribute B in X:
- Let Y=X−B, J=(G−(X→A))∪{Y→A}
-
J Remove an attribute from the LHS and in G, replace the original FD with the one with the new LHS
- Compute YJ+ and YG+. If they are equal, replace X→A with Y→A in G, and set X=Y
- For each X→A in G:
- Compute XG−(X→A)+; if it contains A, then the FD X→A is redundant so remove it from G
Find the minimal cover for F={ABC→D,AB→C,C→B}.
G is initially the set of:
-
ABC→D
-
AB→C
-
C→B
C→B cannot be minimized further
Check ABC→D:
- Try remove A: BC→D
- Compute {BC}G+
-
={B,C}
- Can’t add any more attributes
- Compute {BC}G′+, where G′={BC→D,AB→C,C→B}
-
={B,C}
-
={B,C,D}
- They are different, so you cannot remove A
- Try remove B: AC→D
- Compute {AC}G+
-
={A,C}
-
={A,C,B}
-
={A,C,B,D}
- Compute {AC}G′+, where G′={AC→D,AB→C,C→B}
-
={A,C}
-
={A,C,B}
-
={A,C,B,D}
- They are the same, so B can be removed
G is now the set of:
-
AC→D
-
AB→C
-
C→B
-
Try remove C: A→D (using the new dependency, not ABC→D)
-
Compute {A}G+
-
={A}
- There is no functional dependency that can be used
-
Compute {A}G′+, where G′={A→D,AB→C,C→B}
-
={A}
-
={A,D}
-
They are different, so A cannot be removed
Now check AB→C:
- Try remove A: B→C
- Compute {B}G+:
-
={B}
-
={B,C}
- Compute {B}G′+, where G′={AC→D,B→C,C→B}:
- They are different, so A cannot be removed
- Try remove B: A→C
- Compute {A}G+
-
={A}
-
={A,C}
-
={A,C,B}
-
={A,C,B,D}
- Compute {A}G′+, where G′={AC→D,A→C,C→B}:
-
={A}
-
={A,C}
-
={A,C,B}
- They are different, so B cannot be removed
Now, check if all three FDs are necessary:
-
AC→D:
-
{AC}G−{AC→D}+={A,C,B}; D is not in the FD so it cannot be removed
-
AB→C:
-
{AB}G−{AB→C}+={A,B}; C is not in the FD so it cannot be removed
-
C→B:
-
{C}G−{C→B}+={C}; B is not in the FD so it cannot be removed
Hence, the minimal cover is - G={AC→D,AB→C,C→B}:
- Create a table A,C,D with A and C as keys
- Create a table A,B,C with A and B as keys
- Create a table C,B with C as the key
Note that there are two minimal covers for this specific example; the other is G={AD→D,AB→C,C→B}.
Adds a new step to ensure all functional dependencies are kept: if none of the relations’ keys contains the key of the universal relation, create another table that contains the key.