All Files in ‘COSC265 (2020-S2)’ Merged

0. Introduction

Definitions

Database Management System (DBMS)

A software system facilitating the creation/maintenance of a database:

Functionality:

Compared to files:

The schema, also known as the database intension, rarely changes, but the state of the database, called the extension, changes frequently. The actual data stored in the database at a particular moment in time is called the instance.

Three-Schema Architecture

Three-Level Data Model

DBMS Languages

1. Data Modelling

Phases of Database Design

Entity-Relationship

Attributes:

Relationships:

Weak Entities:

Cardinality Ratio:

Structural Constraints:

Enhanced Entity-Relationship

Two subclasses can merge if they have the same unique identifier - they share a root superclass. In this case, the structure that is formed is called a lattice.

Union: each instance of the category is a member of one of the superclasses. The category will have an artificial surrogate key.

2. Relational Data Model

Formal Definitions

| Formal      | Informal        |
| ----------- | --------------- |
| Relation    | Table           |
| Tuple       | Row             |
| Attribute   | Column          |
| Degree      | Num. attributes |
| Cardinality | Num. rows       |

Relation: subset of the cartesian product of the domains i.e. any valid combination of values for all the attributes of the table. $R$ is the intension - the schema of the relation - and $r$ of $R$ is an extension - a specific valid value of the relation.

The ordering of tuples is irrelevant, but the ordering of values within a tuple is relevant.

Constraints

Conditions that hold for all valid relation instances.

Domain constraint

Specifies that all values in a relation must be inside their respective domains.

Key Constraint

Entity Integrity Constraint

Primary key values must not be null.

Referential Integrity Constraint

The referencing relation has a foreign key attribute referencing the primary key of the referenced relation. The value can be null, and the primary key can itself be a foreign key.

Semantic Integrity Constraint

Constraints not expressed by the model and relating to the application semantics.

EER to Relation Mapping Algorithm

  1. Simple entities: make a new table, pick any key as the primary key
  2. Weak entities: make a new table, with the primary key being the combination of the owner's primary key and the partial key. If the partial key is composite, use its components
  3. 1:N: Put the foreign key in the ‘N’ side
  4. 1:1: Put the foreign key on the side with total participation
  5. M:N: make a new table with two foreign keys; the combination of the two form the primary key
  6. Multivalued attributes: make a new table with a foreign key to the owner. The combination of the value of the attribute and foreign key is the primary key
  7. N-ary relationships: make a new table, with the combination of the foreign keys being the primary key.
  8. Specialization/generalization:
    • Superclass has its own table with general attributes; subclasses have their own table with specific attributes; the primary key is a foreign key to superclass
    • Total specialization: tables just for the subclasses (duplicating superclass attributes)
    • Disjoint specialization: single table with attributes for every subclass, plus attribute denoting which subclass it is (called the type)
    • Overlapping specializations: same as above, but Boolean for every subclass indicating which subclasses it is a member of
  9. Categories: primary key is an artificial, automatically generated ID; all superclasses of the category have this as a foreign key

Relational Algebra

Variables

$VARNAME \leftarrow operation$

Select

Project

Rename

Set

Union-compatible relations: require the same degree and pairs of corresponding attributes to have the same domain.

Union and intersection are:

Joins

$R \Join_{condition} S$

Equijoin:

Natural join:

$\theta$ join:

The same type of joins are available for left, right and full outer joins (symbols have two horizontal lines sticking out of the top and bottom of the left/right end of the \Join symbol).

Division

‘For every’ queries.

If $R(X)$ and $S(Z)$, $R \div S = T(Y)$ where $Y = X - Z$; columns only in $S$ will not appear.

For a tuple $t$ to appear in $T$, the values in $t$ must appear in $R$ in combination with every tuple in $S$.

If $S$ contains a single key $A$ and $R$ contains an additional column $B$, then $T$ will be a table with a single column $B$. If a value $b$ exists in $B$, there exists a tuple $(a, b)$ for all values of $a$ found in $S$.

e.g. if $S = \{(a_1), (a_2)\}$ and $R = \{(a_1, b_1), (a_2, b_1), (a_1, b_2)\}$ then $T$ will contain $\{(b_1)\}$: $b_1$ appears in combination with every value in $S$, but $b_2$ does not. If $S$ is a table and $R$ is a table with the exact same columns plus a primary key:

Equivalent to:

$$\begin{aligned} T_1 &\leftarrow \pi_Y(R) \text{ (getting rid of columns only in } S \text{)} \\ T_2 &\leftarrow \pi_Y((S \times T_1) - R) \\ T_1 &\leftarrow T_1 - T_2 \end{aligned}$$

If $R \times S = T$, then $T \div S = R$.

Explanation of division:

Given $\delta = \pi_{A-B}(\alpha)$:

$$\alpha \div \beta = \delta - \pi_{A-B}((\delta \times \beta) - \alpha)$$

3. SQL Queries

Oracle SQL Data Definition

An SQL relation (table) is a multi-set, NOT a set, of tuples. Using PRIMARY KEY or UNIQUE, or the DISTINCT option in a query, constrains the result to a set. DISTINCT requires a search to ensure duplicate tuples are not returned, so it is slower than using ALL (the default).

Naming:

Expression

An attribute (i.e. column name) or combination of attributes, operators and functions.

Binary Operators

Functions

Scalar:

Aggregate/Set:

Predicates

Misc.

SQL+:

Create Schema

Schemas are used to group tables together. Permissions are set on a per-schema level.

CREATE SCHEMA schema-name
AUTHORIZATION owner

Create Table

CREATE TABLE [schema-name.]table-name
(
  column-name column-specification[,
  column-name column-specification]*

  [constraint-name constraint-specification]*
)

Use CREATE TABLE table-name AS (query) to populate the table with data.
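For example (the table name and predicate are illustrative, reusing the movie table from the query sections):

CREATE TABLE comedy_movies AS
  (SELECT * FROM movie WHERE type = 'comedy')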

Notes

Constraints can be inline or defined after the columns. If the latter, multiple columns can be referenced.

Inline:

column-definition constraints where constraints can be:

To define a primary key after column definitions: PRIMARY KEY (column-name). A composite primary key can be made by passing multiple arguments.

To define foreign keys after column definitions:

FOREIGN KEY (column-name)
REFERENCES foreign-table (foreign-table-key)
[ON DELETE integrity-option]
[ON UPDATE integrity-option]

To add constraints after defining the table:

ALTER TABLE table-name ADD CONSTRAINT constraint-name constraint-definition

Where integrity-option is one of RESTRICT, CASCADE, SET NULL or SET DEFAULT.
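Putting the pieces together, a minimal sketch (the movie/director tables mirror the examples in the query sections; the exact types and constraint name are assumptions):

CREATE TABLE movie
(
  mnumber  NUMBER PRIMARY KEY,   /* inline constraint */
  title    VARCHAR2(100) NOT NULL,
  awards   NUMBER,
  director NUMBER,
  CONSTRAINT movie_director_fk   /* named constraint, after the columns */
    FOREIGN KEY (director) REFERENCES director (id)
    ON DELETE SET NULL
)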

Domain: basically typedef for SQL. NOT in Oracle:

CREATE DOMAIN domain-name type-definition [CHECK check-statement]
/* e.g. `check-statement` could be `(VALUE IS NOT NULL)` */

Drop Table

Drops the relation (base table) and definition: DROP TABLE table-name.

Alter Table

Add, delete or modify attributes/constraints using ALTER commands.

Add/modify: ALTER TABLE table-name ADD|MODIFY column-name column-definition. The NOT NULL constraint is only allowed if it also has a DEFAULT default-value constraint. Otherwise, the column must be added without the NOT NULL constraint, then all rows updated, then the definition altered.

Delete column and all constraints: ALTER TABLE table-name DROP [COLUMN] column-name RESTRICT|CASCADE.

Delete constraint: ALTER TABLE table-name DROP [CONSTRAINT] constraint-name RESTRICT|CASCADE. If the constraint is a primary key or unique, CASCADE drops all referencing foreign keys; RESTRICT stops the operation if there are any referencing foreign keys.

Drop Schema

DROP SCHEMA schema-name RESTRICT|CASCADE. RESTRICT only allows the operation to run if the schema is empty; CASCADE drops all child objects as well.

Oracle SQL Queries

SELECT

SELECT [ALL|DISTINCT] attribute-list
FROM table-list
[WHERE condition]
[GROUP BY grouping-attributes]
[HAVING group-condition]
[ORDER BY attribute-list]
[OFFSET n]
[FETCH NEXT m ROWS ONLY] /* Oracle specific */

Six clauses; the last four are optional:

Table Alias

Oracle uses the syntax tableName alias.

Other SQL implementations use tableName AS alias.

This may be useful when referring to two entries in a single table (e.g. find rows where the last names are the same), as sketched below.
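A minimal self-join sketch using the director table from the later examples (the D1.id < D2.id condition is an assumption that avoids matching a row with itself and reporting each pair twice):

SELECT D1.id, D2.id
FROM director D1, director D2
WHERE D1.lname = D2.lname AND D1.id < D2.id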

The result of a nested select statement can also be aliased: (SELECT clause that is probably using JOINs) alias.

Attribute list

The attribute list can contain column names, aggregate functions and constants:

Some WHERE predicates

GROUP BY

For applying aggregate functions to groups of tuples; each group is a set of tuples where they have the same value for the grouping attribute(s) (comma-separated list). This is used for summary information, and is often phrased as for each entity ….

e.g. SELECT origin, COUNT(origin) FROM routes GROUP BY origin returns number of times each origin value appears.

e.g. SELECT COUNT(DISTINCT type), director FROM movie GROUP BY director returns the number of genres of movies each director has directed.

When using GROUP BY, all non-aggregate attributes need to be referenced. This is because grouping reduces each group to a single tuple, which may not be possible if they are not all referenced. For example, if (a, b) takes the values (0, 0) and (0, 1) and is grouped by a, the DBMS must either pick one of the b values or pick both, neither of which is allowed.

This can happen even if the primary key is used as the grouping attribute.

HAVING

Retrieves the values of these functions only for groups that satisfy certain conditions - i.e. it filters out groups.

e.g. SELECT origin, COUNT(origin) FROM routes GROUP BY origin HAVING COUNT(origin) > 10.

COUNT(*) gets the count for the group.

ORDER BY

Sorts tuples based on the values of some attributes:

ORDER BY [column-name ASC|DESC] [, column-name ASC|DESC]*

Default is ASC. Sorts by the first column name first and, when the values are equal, sorts by the second column etc. The sort is not stable.
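For instance, using the director table from the other examples, the following sorts by last name ascending, breaking ties by first name descending:

SELECT lname, fname FROM director ORDER BY lname ASC, fname DESC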

Nested Queries

A nested query/sub-query can be specified within the WHERE and HAVING clauses of an outer query. The sub-query must be enclosed within parentheses.

Examples:

EXISTS

Used to check if result of correlated sub-query is empty.

e.g. SELECT * FROM director D1 WHERE EXISTS (SELECT * FROM movie WHERE D1.id = director AND type='comedy') finds directors that have directed at least one comedy.

Multi-table queries

Queries that use nested queries with the = or IN conditions can always be expressed as a single query. This performs a JOIN in the background, and so is much more efficient than correlated nested queries. The FROM clause may contain more than one table.

e.g. SELECT title FROM movie, director WHERE dnumber=director AND lname='Nolan'.

In general, if you have n tables, you will need n-1 join conditions.

If there is no WHERE clause, all tuples of the relations are selected and the result is equivalent to the Cartesian product, most of which will be nonsense.

Multi-table query for the same table: SELECT DISTINCT M1.director FROM movie M1, movie M2 WHERE M1.director = M2.director AND M1.type='comedy' and M2.type='drama'. A similar result could be done using an INTERSECT operation.
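The INTERSECT formulation might look like this sketch (same movie table; each operand returns the set of directors of that type):

SELECT director FROM movie WHERE type='comedy'
INTERSECT
SELECT director FROM movie WHERE type='drama'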

Specifying Joins in SQL2

... FROM table1 JOIN_TYPE table2 ON col1=col2

Where the join type is one of:

The result of a join statement can also be used as a table e.g.:

SELECT title FROM
  (SELECT * FROM movie JOIN director ON movie.director=director.id) movieDirector
WHERE movieDirector.lname='Nolan' /* single quotes only */

To select all from both tables, use SELECT movie.*, director.*.

Set operations

UNION, EXCEPT (MINUS in Oracle) and INTERSECT. The results are sets - no duplicates, unless the keyword ALL is appended.

The two operands - select statements (in brackets) - must return tables with the same number of attributes and compatible data types.

ANY/ALL

Used with comparison operators and nested queries.

= ANY is equivalent to IN. ANY is sometimes called SOME.

<> ALL is equivalent to NOT IN.

The following will return movies that got more awards than all movies by a particular director:

SELECT * FROM movie WHERE awards > ALL (
  SELECT awards FROM movie, director WHERE movie.director = director.id AND lname = 'Nolan'
)

Division

$$R \div S = T(A)$$

$R$ and $S$ have at least one common attribute, e.g. the result of division with $R(A, B)$ and $S(B, C)$ will have $(A, B) - (B, C) = (A)$.

Directors that have directed at least one movie of each type:

SELECT fname, lname FROM director WHERE NOT EXISTS (
  SELECT * FROM movie M1 WHERE NOT EXISTS (
    SELECT * FROM movie M2 WHERE M2.director = director.id AND M2.type = M1.type
  )
)

The M2.type = M1.type condition is able to cover every single type of movie because M1 ranges over all movies and is not constrained to any particular director.

The following also gives the same result:

SELECT fname, lname FROM DIRECTOR
  JOIN movie ON director.id = movie.director
  GROUP BY director, fname, lname
  HAVING COUNT(DISTINCT type) = (
    SELECT COUNT(DISTINCT type) FROM movie
  )

The nested query is not correlated.

DUAL table

In Oracle, it is a dummy table with a single column and row which is used to get scalar values e.g. SELECT sysdate FROM DUAL.

INSERT

Adds one or more tuples to a relation. The following requires all attributes to be listed in the same order they were specified when the table was created:

INSERT INTO table_name VALUES (attr_1, ..., attr_n)

To specify the attributes to insert:

INSERT INTO table_name (attr_name_1, ..., attr_name_n) VALUES (attr_1, ..., attr_n)

Both forms allow you to insert multiple tuples, separating them with commas.

A select statement can be used to insert multiple tuples:

INSERT INTO table_name (attr_name_list) select_statement

NB: two single quotes escape a quote in a string literal.
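For example, a sketch of the escaping rule (the values are made up):

INSERT INTO director (id, fname, lname) VALUES (7, 'Peter', 'O''Toole')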

UPDATE

UPDATE table_name SET attr_name_1 = expression_1, ... WHERE where_condition

The where condition is optional.
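A small sketch against the movie table used in the query examples (the title value is illustrative):

UPDATE movie SET awards = awards + 1 WHERE title = 'Memento'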

DELETE

DELETE FROM table_name WHERE where_condition

You can only delete data from one table at a time (except if CASCADE is used on a foreign key).

If the where condition is missing, every row gets deleted.

Views

Views are virtual tables - essentially a stored query that allows customized access to data (e.g. remove access to some sensitive attributes of a table).

In general, for a view to be updatable (varies by implementation):

CREATE VIEW view_name AS select_statement

Once a view is created, it can be queried like a normal table.
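A sketch, again assuming the movie table from the earlier examples:

CREATE VIEW comedies AS
  SELECT title, director FROM movie WHERE type = 'comedy'

SELECT title FROM comedies /* queried like a normal table */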

To drop a view:

DROP VIEW view_name

In Oracle, to find out which views are updatable:

SELECT column_name, updatable FROM user_updatable_columns WHERE table_name=upper('view_name')

Indexes

Indexes are a physical-level feature; access structures to speed up queries.

CREATE INDEX index_name ON table_name (attr_name_1, ...)

In older versions of Oracle, indexes would not be created for keys. In this case, an index would need to be created with the UNIQUE keyword.

DROP INDEX index_name is used to delete indexes.

Notes:

Integrity Component

SQL 92 adds:

Domain Constraints

Not in Oracle:

CREATE DOMAIN domain_name AS data_type DEFAULT default_option CHECK (search_option)

Both DEFAULT and CHECK are optional, and search_option should reference the value being checked using VALUE.

Use DROP DOMAIN domain_name, optionally appending RESTRICT or CASCADE.

Referential Integrity Violation

The qualifiers are ON DELETE and ON UPDATE.

The options are SET NULL, CASCADE and SET DEFAULT.

Oracle only supports ON DELETE CASCADE and ON DELETE SET NULL.

Enterprise Constraints

Constraints
CONSTRAINT constraint_name CHECK (condition)

This is for a specific table, so the condition can reference multiple attributes. In SQL2, the condition may include a nested SELECT statement.
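For instance, a sketch of a table-level check (the constraint name and condition are assumptions, reusing the movie table):

ALTER TABLE movie ADD CONSTRAINT movie_awards_chk CHECK (awards >= 0)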

Assertions

Constraints spanning multiple tables can be made: these are called assertions.

CREATE ASSERTION assertion_name CHECK (search_condition)

The search condition will be a SQL query, probably using the EXISTS keyword.
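A hedged SQL2-style sketch (the condition is illustrative):

CREATE ASSERTION max_ten_awards CHECK (
  NOT EXISTS (SELECT * FROM movie WHERE awards > 10)
)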

This is not supported in Oracle.

Triggers

Triggers monitor the state of the database and perform some action when a specific condition occurs.

Triggers can:

A trigger has an event - a change that activates a trigger. This can be:

For DML events, there are row-level and statement-level triggers; the former occurs for each modified row while the latter runs once for each statement.

Once an appropriate event occurs, it must pass a condition: a query which determines if the trigger should be activated.

If the condition is met, the action will occur. This can be one or more statements (procedures).

The action can occur BEFORE, AFTER, or INSTEAD OF.

CREATE [OR REPLACE] TRIGGER [schema.]trigger_name
BEFORE|INSTEAD OF|AFTER
DELETE|INSERT|UPDATE OF attr_name_1, attr_name_2... OR DELETE|INSERT|UPDATE OF ...
ON [schema.]table_name|view_name
[REFERENCING OLD AS old_row_alias NEW AS new_row_alias]
FOR EACH ROW|STATEMENT
WHEN (condition)
  body
/

Within the SQL statements, if REFERENCING is not specified, :new.attribute_name and :old.attribute_name are used to reference attributes values for the row in question.

The / is used to end the trigger action. It must be on a new line.

Limitations:

To alter the trigger:

ALTER TRIGGER [schema.]trigger_name ENABLE|DISABLE|COMPILE [DEBUG]

SHOW ERRORS shows any compilation errors.

PL-SQL Example Code
DECLARE
  varname type := initial_value;
  varname2 type;
BEGIN
  SELECT attribute INTO varname FROM table_name WHERE ...;
  RAISE_APPLICATION_ERROR(num => error_code, msg => 'Message');
  -- Error code between -20999 and -20000
  ...;
EXCEPTION
  handlers;
END;

Cannot use EXISTS: must store COUNT(*) into variable.
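A minimal PL-SQL sketch of this workaround (the table, condition and message are illustrative):

DECLARE
  cnt NUMBER;
BEGIN
  SELECT COUNT(*) INTO cnt FROM movie WHERE type = 'comedy';
  IF cnt = 0 THEN
    RAISE_APPLICATION_ERROR(-20001, 'No comedies found');
  END IF;
END;
/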

Example: Audit Trigger
CREATE [OR REPLACE] TRIGGER audit_trigger BEFORE INSERT OR DELETE OR UPDATE ON table_name
FOR EACH ROW
begin
  if INSERTING then
    INSERT INTO audit_table VALUES (USER || 'inserting' || 'new id:' || :new.id);
  elsif DELETING then
    INSERT INTO audit_table VALUES (USER || 'deleting' || 'old id:' || :old.id);
  elsif UPDATING('attribute_name') then
    INSERT INTO audit_table VALUES 
      (USER || 'updating' || 'old value:' || :old.attribute_name || 'new value:' || :new.attribute_name);
  elsif UPDATING then
    INSERT INTO audit_table VALUES 
      (USER || 'updating' || 'old id:' || :old.id || 'new id:' || :new.id);
  end if;
end;
/
Procedures

Collection of statements; basically functions:

CREATE [OR REPLACE] PROCEDURE [schema.]procedure_name
(argument_name [IN|IN OUT|OUT] datatype, ...)
IS|AS
  body

The body cannot contain DDL statements.

Compiling a procedure: ALTER PROCEDURE [schema.]procedure_name COMPILE.
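A sketch of a complete procedure under these rules (the name, parameter and body are assumptions, reusing the movie table):

CREATE OR REPLACE PROCEDURE add_award
(p_mnumber IN NUMBER)
IS
BEGIN
  UPDATE movie SET awards = awards + 1 WHERE mnumber = p_mnumber;
END;
/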

Database Security

Discretionary and mandatory security mechanisms.

Control measures:

The DBA is responsible for the overall security of the database. They have the responsibility to grant privileges to users and assign security levels to users and data.

They can set privileges at the account and relation level:

Each table is assigned an owner account; they have all privileges for the table, and can give privileges to other users.

Privileges can also be specified using views, allowing, for example, users access to only some attributes of a table.

GRANT account_privilege|table_privilege|ALL PRIVILEGES
ON object
TO user|role|PUBLIC
WITH GRANT OPTION;


REVOKE account_privilege|table_privilege|ALL PRIVILEGES
ON object
FROM user|role|PUBLIC;

/* EXAMPLES */
GRANT CREATE TABLE TO user WITH GRANT OPTION;
GRANT INSERT, DELETE ON table1, table2 TO user;
GRANT UPDATE ON table1(attribute1) TO user;

Table privileges: SELECT; INSERT; UPDATE; ALTER; DELETE; INDEX; DEBUG; REFERENCES. INSERT, UPDATE and REFERENCES (allows the user to have a FK to the table) can be constrained to specific attributes.

If GRANT OPTION is set, the user granted the privilege can grant the privilege to other users (propagation). However, if the privilege is revoked, all propagated privileges are automatically revoked.

4. Data Normalization

Looking at an existing database and attempting to improve the design.

Information Design Guidelines

User view (logical):

Storage/base relation level:

Semantics of Attributes

Guideline 1: each tuple should represent one entity or relationship instance:

Guideline 2: design relations so that no update anomalies are present:

Guideline 3: design relations to have as few NULL values as possible:

Guideline 4: relations should be designed to satisfy the lossless join condition:

Decomposition

Dividing a relation into two or more relations. To do this, normalization is required - this is essentially a set of tests used to determine the ‘quality’ of a schema.

Functional Dependencies

Assume the universal relation schema is $R(A_1, A_2, \dots, A_n)$.

A set of attributes, $X$, functionally determines a set of attributes $Y$ if the value of $X$ determines a unique value for $Y$: $X \rightarrow Y$.

For each value of $X$, there is exactly one value of $Y$. The simplest example of this is if $X$ is the primary key and $Y$ is the tuple.

If two tuples have the same value for $X$, they have the same value for $Y$: if $tuple_1[X] = tuple_2[X]$, then $tuple_1[Y] = tuple_2[Y]$.

Inference Rules

Given a set of FDs $F$, we can infer additional FDs that hold whenever the FDs in $F$ hold.

The closure of a set $F$ is the set $F^+$ which contains all FDs that can be inferred from $F$.

If $X \rightarrow Y$ is inferred from $F$, then it is notated as $F \models X \rightarrow Y$.

Armstrong’s inference rules state that:

These three rules form a sound and complete set of inference rules (every functional dependency that holds can be inferred from them, and everything inferred is correct).

Using these three rules, we can deduce three more rules:

Example

Given $R(A, B, C, D)$ and a set of functional dependencies $F = \{ A \rightarrow B, AC \rightarrow D, BC \rightarrow D, CD \rightarrow A \}$, show $AC \rightarrow D$ is redundant.

Using Armstrong’s theorems:

Using the definition of equivalent sets of functional dependencies (see equivalence sets section):

$$G = \{ A \rightarrow B, BC \rightarrow D, CD \rightarrow A \}$$

$F$ definitely covers $G$, so we only need to find out if $G$ covers $F$.

For all FDs in $F$:

Determining the closure

Closure of a set of attributes $X$ with respect to $F$ is the set $X^+$ of all attributes that are functionally determined by $X$.

$\{X\}_F^+$ can be calculated by repeatedly applying the three primary rules, as in the sketch below:

def closure(X, F):  # F: list of (lhs, rhs) pairs for FDs lhs -> rhs
    x_plus, changed = set(X), True
    while changed:
        changed = False
        for lhs, rhs in F:
            if set(lhs) <= x_plus and not set(rhs) <= x_plus:
                x_plus |= set(rhs)
                changed = True
    return x_plus
Example

Given $R(A, B, C, D, E)$ and a set of functional dependencies $F = \{ A \rightarrow BC, CD \rightarrow E, AC \rightarrow E, B \rightarrow D, E \rightarrow AB \}$, determine all candidate keys in $R$:

Start with single attributes:

$$\{A\}_F^+ = \{ A \}$$

Look at the functional dependencies:

As $\{A\}_F^+$ contains all the attributes, $A$ is a candidate key.

For $B$:

$\{C\}_F^+ = \{ C \}$: only determines itself. The same is true for $D$.

After one iteration, $\{E\}_F^+ = \{ A, B, E \}$. After another iteration, $C$ and $D$ get added. Thus, $E$ is also a candidate key.

Now, look at combinations of two attributes. Ignore combinations containing $A$ or $E$ as they are candidate keys by themselves:

Thus, $A$, $E$, $BC$ and $CD$ are candidate keys. No three-attribute combinations that do not contain one of these can be created, so the search stops here.

Equivalence of Sets of FDs

Two sets of FDs, $F$ and $G$, are equivalent if $F^+ = G^+$: every FD in $F$ can be inferred from $G$ and vice-versa.

Alternatively, $F$ covers $G$ if $G^+ \subseteq F^+$. Hence, $F$ and $G$ are equivalent if $F$ covers $G$ and vice-versa. This is much easier to test:

Calculate $\{X\}_F^+$ ($X^+$ with respect to $F$) for each FD $X \rightarrow Y$ in $G$; if each closure includes all attributes in the corresponding $Y$, then $F$ covers $G$.

Example

Check if $F$ and $G$ are equivalent:

$$\begin{aligned} F &= \{ A \rightarrow C, AC \rightarrow D, E \rightarrow AD, E \rightarrow H \} \\ G &= \{ A \rightarrow CD, E \rightarrow AH \} \end{aligned}$$

FDs in $G$:

Hence, $F$ covers $G$.

FDs in $F$:

Hence, $G$ covers $F$.

Hence, the sets are equivalent.

Minimal Sets of FDs

A set of FDs $F$ is minimal if:

Every set of FDs has an equivalent minimal set (or sets - there can be several equivalent minimal sets).

Example

Is GG minimal? If not, find the minimal set:

$$G = \{ \text{SSN} \rightarrow \{\text{Name}, \text{Born}, \text{Address}, \text{DepartmentId}\},\ \text{DepartmentId} \rightarrow \{\text{Name}, \text{ManagerSSN}\} \}$$

The first condition is violated; the RHS has multiple attributes for both dependencies.

Using the decomposition rule ($A \rightarrow BC$ means $A \rightarrow B$ and $A \rightarrow C$):

The third rule is met as each LHS only has one attribute, so it cannot be reduced further.

The second rule is met; it is obvious that all dependencies are necessary.

Normal Forms Based on Primary Keys

Normalization is the process of decomposing ‘bad’ relations into smaller relations.

Keys

Superkey of relation $R$: a set of attributes $S \subseteq R$ such that no two legal tuples in $R$ will have the same $S$ values.

Key $K$: a superkey where removal of any attribute will cause it to no longer be a superkey.

If a relation has more than one key, each is called a candidate key, with one arbitrarily designated as the primary key and the others as secondary keys.

A prime attribute is a member of at least one candidate key.

First Normal Form

Every attribute is single and atomic; no composite or multivalued attributes or nested relations.

Second Normal Form

Full functional dependency: FD of the form $Y \rightarrow Z$ where removing any attribute from $Y$ means the FD no longer holds. It is called a partial dependency if this does not hold.

Second normal form adds more conditions to the first normal form: no non-prime attribute is partially functionally dependent on the primary key.

To transform into second normal form, split into smaller tables: if the primary key is $A$ and $B$, and attribute $C$ is fully dependent on both $A$ and $B$ but $D$ is dependent on only $B$, create two tables, $\{A, B, C\}$ and $\{B, D\}$. The primary key of the second table is only $B$, so $D$ is fully dependent on the primary key.

Third Normal Form

If $X \rightarrow Y$ and $Y \rightarrow Z$, $X \rightarrow Z$ is a transitive functional dependency.

Relations in third normal form are in 2NF and no non-prime attributes are transitively dependent on the primary key (i.e. via a chain $\text{PrimaryKey} \rightarrow X \rightarrow \text{NonPrimeAttribute}$).

Example: $\text{SSN}$ identifies $\text{DepartmentID}$ and $\text{DepartmentID}$ identifies $\text{DepartmentName}$. Hence $\text{DepartmentName}$ must be in a separate relation from $\text{SSN}$, with $\text{DepartmentID}$ acting as the primary key.

General Normal Form Definitions

These take into account candidate keys, not just the primary key.

Second normal form: no non-prime attribute is partially functionally dependent on any key of the relation.

Third normal form: for all FDs, either:

  • the LHS is a superkey, or
  • the RHS is a prime attribute.

Only the first condition applies for BCNF.

Ensure at least one table contains the entire primary key, and check that it is a minimal cover before transforming.

Example

$R(A, B, C, D, E, F, G, H, I, J)$ and $F = \{ AB \rightarrow C, A \rightarrow DE, B \rightarrow F, F \rightarrow GH, D \rightarrow IJ \}$.

Find the key of the table

Tip: look for any attribute that never appears on the RHS of any dependency.

$\{B\}_F^+ = \{ B \}$; first loop: $\{ B, F \}$; second loop: $\{ B, F, G, H \}$. This does not contain all attributes, so $B$ by itself is not a key.

Try $A$ and $B$: $\{AB\}_F^+ = \{ A, B \}$; …; this contains all attributes, so it is the key.

Decompose RR into 2NF

All non-prime attributes are fully dependent on $AB$. Hence, there can be no dependencies of the form $A \rightarrow X$ or $B \rightarrow X$. Hence, $A \rightarrow DE$ and $B \rightarrow F$ violate this rule. Split the relation into multiple relations:

Decompose RR into 3NF

$A \rightarrow D$ and $D \rightarrow IJ$, so there are transitive dependencies:

$B \rightarrow F$ and $F \rightarrow GH$, so decompose:

$R_3$ has no transitive dependencies.

Boyce-Codd Normal Form

An even stricter form of 3NF: if $X \rightarrow A$, then $X$ is a superkey of $R$.

It is not always desirable to transform a relation into BCNF as some FDs may be lost in the process.

In addition, it may not be possible to transform into BCNF.

Example

$\text{Area}$ determines $\text{CountyName}$ but is not a superkey, so the dependency violates BCNF.

Decompose $\text{Area}$ and $\text{CountyName}$ into their own table, with $\text{Area}$ being the primary key, and another table with $\text{PropertyId}$, $\text{Area}$, $\text{LotNumber}$, with $\text{PropertyId}$ being the primary key.

Note that the dependencies $\text{PropertyId} \rightarrow \text{CountyName}$ and $\{\text{CountyName}, \text{LotNumber}\} \rightarrow \text{PropertyId}$ have been lost. Hence, converting to BCNF does not always preserve dependencies.

Example

Determine which normal form the relation $R(A, B, C, D)$ is in. $AB$ is the primary key. The functional dependencies are $F = \{ AB \rightarrow CD, C \rightarrow ABD, D \rightarrow C \}$. If necessary, decompose into BCNF.

$C$ and $D$ are also candidate keys.

Example

$T$ lists dentist/patient data; a patient is given an appointment at a specific date and time with a dentist in a particular room; the dentist is in the same room for all appointments on a given day. Find examples of insertion/deletion/update anomalies, and describe the process of normalizing it to BCNF.

$T(\text{DNo}, \text{DName}, \text{PNo}, \text{PName}, \text{ADate}, \text{ATime}, \text{Room})$. The primary key is $\{\text{DNo}, \text{PNo}, \text{ADate}\}$.

Informal guideline: do not mix attributes from different entity types; the table has information on dentists, patients and appointments.

Insertion/deletion/update anomalies:

Functional dependencies:

1NF: all attributes are simple and atomic.

2NF: non-prime attributes must be fully functionally dependent on the primary key.

Decompose the table to meet 2NF:

Each table now represents information about a single entity.

3NF: LHS superkey or RHS prime attribute. BCNF: LHS superkey. For all tables, the LHS is the primary key, so each table is in both 3NF and BCNF.

Relational Synthesis

The universal relation, $R$, contains all attributes you wish to store. Decompose this into $n$ tables $R_i$ with the following properties:

Algorithm

Finding a Minimal Cover

For a set of FDs to be minimal:

To find a minimal cover G for F:

Example

Find the minimal cover for $F = \{ ABC \rightarrow D, AB \rightarrow C, C \rightarrow B \}$.

$G$ is initially the set of:

$C \rightarrow B$ cannot be minimized further.

Check $ABC \rightarrow D$:

$G$ is now the set of:

Now check $AB \rightarrow C$:

Now, check if all three FDs are necessary:

Hence, the minimal cover is $G = \{ AC \rightarrow D, AB \rightarrow C, C \rightarrow B \}$.

Note that there are two minimal covers for this specific example; the other is $G = \{ AB \rightarrow D, AB \rightarrow C, C \rightarrow B \}$.

Lossless Join and FD-Preserving Decomposition into 3NF

Adds a new step to ensure all functional dependencies are kept: if none of the relations’ keys contains the key of the universal relation, create another table that contains the key.

5. Physical Layer

Internal Schema

Good database design usually requires multiple tables and hence, many select operations will require access to multiple tables, which leads to performance costs.

DBMS designers must ensure the database, which will be stored in persistent memory, performs well. They must take into factors such as:

The database designers can affect performance through:

Memory Hierarchy

Primary memory - cache and main memory:

Secondary/tertiary memory - flash, HDDs, optical disks, tape:

Disk Storage Devices

Record Management

Multiple records are stored in a block:

File Organization

The DBMS:

The file descriptor/header contains metadata for the file, such as field names, data types and addresses of the file blocks on the disks. The physical disk blocks can be contiguous, linked or indexed.

Typical file operations:

Unordered files:

Ordered files:

Hashed Files (External Hashing)

Collision Resolution:

Misc. notes:

Redundant Arrays of Independent Disks (RAID)

Uses parallelism across multiple disks to increase read/write performance using data striping.

RAID levels:

Indexing

Indexes provide an efficient way to access records in different orders, without expensive operations like sorting or duplicating data, by creating a data structure that maps a key to an address.

Dense indexes have entries for every record, while sparse indexes contain entries for only some of the records (not a one-to-one relationship).

A non-clustered index stores the indexes in a structure separate from the actual data, containing the index key values and a pointer to the row.

A clustered index physically defines the order in which records are stored. Hence, there can only be one clustered index. This course limits clustering indexes to non-key fields, so a file has either a primary or a clustered index.

Under this definition, a clustered index will have an entry for each distinct value of the field and hence, it is a sparse index.

A primary index is indexed by the primary key and contains an entry for each block. Each entry contains a pointer to the block and the value of either the first or last record on the block. Hence, the primary index is a sparse index. To improve performance when inserting records, an overflow block may be used.

Secondary indexes are dense indexes that can be on either candidate keys or non-keys, and contain the value of the field and a pointer to either a block or record.

Example: Storage/Speed Tradeoffs

$r = 30{,}000$ records, record size $R = 100$ B, block size $B = 1024$ B, index key size $V = 9$ B, block pointer size $P = 6$ B. Blocks are unspanned.

Case 1: Ordered File

The blocking factor $BFR = floor(\frac{B}{R}) = floor(\frac{1024}{100}) = 10$. Hence, 10 records fit in a block.

Hence, the number of blocks required is $b = ceil(\frac{r}{BFR}) = ceil(\frac{30000}{10}) = 3000$.

For linear search:

For binary search, $log_2(b) \approx 12$ blocks must be read from the disk.

Case 2: Primary Index

The primary index will have an entry for each block. Each index entry has a size of $I = V + P = 9 + 6 = 15$ B.

Hence, the blocking factor for the index blocks is $BFRI = floor(\frac{B}{I}) = floor(\frac{1024}{15}) = 68$, so the index requires $ceil(\frac{b}{BFRI}) = ceil(\frac{3000}{68}) = 45$ blocks.

To find an individual record, binary search is done on the index, so the number of blocks read is $log_2(45) \approx 6$. There is an extra read required to read the block the record is stored in, so it requires $7$ reads in total.

The primary index must be ordered, so update/insertion/deletion operations are costly.

Case 3: Secondary Index

The secondary index is a dense index with an entry for each record.

Hence, the index requires $ceil(\frac{r}{BFRI}) = ceil(\frac{30000}{68}) = 442$ blocks.

Searching the index requires $log_2(442) \approx 9$ block reads, and then an additional read to read the block the record is stored in, so it requires $10$ reads in total.

Unlike the primary index, it does not need to be ordered so update/insertion/deletion operations are faster.

Multi-Level Indexes

The index itself is an ordered file, so a primary index to the index can be created; the original index is called the first-level index, the index to the index the second-level index etc.

This process can be repeated until all entries fit in a single block.

Assuming the first-level index is a primary index, one first-level index block covers $BFRI^1$ data blocks, one second-level block covers $BFRI^2$, and one $n$th-level block covers $BFRI^n$.

This has consequences for performance: as the indexes are ordered, update/insertion/deletion operations may require indexes at all levels to update. Hence, this is more useful in read-heavy scenarios.

B-Trees and B+-Trees

Due to the insertion/deletion problem, B or B+ tree structures are used.

In a B-tree, a node of order $p$ has the structure $P_1, <K_1, Pr_1>, P_2, <K_2, Pr_2>, \dots, P_{q-1}, <K_{q-1}, Pr_{q-1}>, P_q$ where:

A B-tree of order m has:

The keys divide the subtrees: all values in the subtree pointed to by $P_i$ lie between $K_{i-1}$ and $K_i$.

On insertion, if a node becomes full then the node is split into two nodes, with the mid-value being used as the key for the parent. If the parent then becomes full, this may cause a cascade. This behavior ensures the tree is always balanced.

On deletion, if the node has less than m / 2 keys, it can be merged with a sibling.

In a B+-tree, pointers to data records only exist in leaf nodes:

Example: Calculating Order

Find the order $p$ of a B+-tree if the block size $B = 512$ B, index key size $V = 9$ B, data pointer size $Pr = 7$ B, tree pointer size $P = 6$ B:

Each node can have up to $p$ tree pointers and $p - 1$ keys, taking up a total of $6p + 9(p - 1)$ bytes. This must fit within the block size:

$15p - 9 \le 512 \therefore p \le 521/15 \approx 34.7$. Hence, the order and maximum value of $p$ is $p = 34$.

To store one million records requires $log_{34}(1{,}000{,}000) \approx 4$ levels (data is stored in the leaf nodes, so one million leaf nodes are required).

6. Query Processing and Optimization

$$\pi_{lname, fname}(\sigma_{\text{snumber = star} \,\land\, \text{movie = mnumber} \,\land\, \text{title = 'Gone with the wind'}}(star \times stars \times movie))$$

If this is done naively, it will compute the Cartesian product of the three tables (if each has 100 rows, then this results in 1,000,000 rows) and then get the names of stars that performed in a particular movie - probably something like 10 rows. Needless to say, this produces a lot of garbage rows and is very inefficient. In a real database, query optimization is performed.

Query Processing

Code can be executed directly or stored and executed later (compiled).

Translating SQL Queries into Relational Algebra

Query block: the basic unit that can be translated into algebraic operators and optimized. This is a single SELECT ... FROM ... WHERE [... GROUP BY ... HAVING] expression. Nested queries will have multiple query blocks.

The query SELECT name FROM employee WHERE salary > (SELECT MAX(salary) FROM employee WHERE dept = 5) can be split into:

For the second query, an unoptimized process could be:

Side Note: Sort-Merge

The process can be done recursively.

Algorithms for WHERE

Simple selections have only a single condition; complex selections have multiple conditions.

Selectivity: ratio of the number of tuples that satisfy the condition to the total number of tuples.

For simple selection:

For complex selection:

Algorithms for JOIN

Cross join/Cartesian product is time consuming. Equi-join and natural joins are much more efficient. Join performance is affected by:

If an index/hash key exists for one of the join attributes, a single-loop join can be performed: read all rows from the table without the index and use the index to directly retrieve matching records. If this is not available, a nested-loop join is required (brute force).

Algorithms for SELECT

If the attribute list contains candidate keys, all that is required is removing unwanted attributes.

Otherwise, duplicates must be removed. This can be done through either sorting or hashing.

Algorithms for Set Operations

Sort both relations on the same attributes, then scan and merge the sorted files concurrently, deciding whether the tuple(s) should be kept by checking their equality.

Algorithms for Aggregate Functions

Min, max:

Sum, count, average:

Group by:

Query Representation

Query tree:

When execution of an internal node occurs, the node and its children are replaced with the resultant relation.

Canonical tree: standard tree corresponding to an SQL query without optimization. From this tree optimized alternatives can be generated.

Query Optimization

Heuristic Query Optimization

The parser generates an initial internal representation - the canonical tree. After this, heuristics should be used to optimize the internal representation.

The main heuristic is to apply operations that reduce the size of intermediate results as early as possible.

To determine its efficiency, we can do things such as:

Some optimizations that can be done on the first example:

Query Execution Plan

Query tree + information about access methods to be used for each relation + methods used to compute relational operators stored in the tree.

Materialized evaluation: when a temporary relation (the result of an operation) is created. For example, SELECT * FROM employee WHERE salary = (SELECT MAX(salary) FROM employee) will store the result of the nested query rather than computing it for each row.

Pipelined evaluation: when the result of an operation is forwarded to the next operator in the sequence.

Query Optimization in Oracle

Earlier versions used rule- or heuristic-based query optimization, where the optimizer chooses execution plans based on heuristically ranked operations.

With cost-based query optimization, the optimizer examines alternative access paths and operator algorithms, available indexes, past-access statistics etc. and chooses the plan with the lowest estimated cost (estimated IO, CPU, memory requirements etc.).

To view the cost of each operation in a query, use EXPLAIN PLAN: EXPLAIN PLAN SET STATEMENT_ID = 'some_identifier' FOR some_statement and then SELECT * FROM plan_table to view the estimated CPU, IO, time, temp space etc.
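A concrete sketch (the statement id and query are illustrative; plan_table is Oracle's default output table):

EXPLAIN PLAN SET STATEMENT_ID = 'q1' FOR
  SELECT title FROM movie WHERE awards > 5;

SELECT * FROM plan_table WHERE statement_id = 'q1';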

7. Data Catalogs

Data catalogs store metadata for database objects such as relations, attributes, domains, constraints, authorization and security information, owners, indexes.

The catalog is stored as a set of read-only relations and views (system tables).

The catalog is used:

The catalog consists of base tables and user-accessible views. The latter consists of:

Examples

Some interesting catalog tables:

ALL_TABLES

Contains information about all tables you have view permissions for. To view the number of rows for a specific table:

SELECT NUM_ROWS FROM ALL_TABLES WHERE OWNER = 'owner_name' AND TABLE_NAME = 'table_name'

Number of blocks used is also stored.

The catalog is updated regularly, so it may contain stale data. The LAST_ANALYZED column contains the time the statistics were last updated. To force an update, use the command ANALYZE TABLE table_name COMPUTE STATISTICS.

ALL_USERS

Contains information about all users you have view permissions for. The password is not viewable.

USER_USERS contains information about your own user.

ALL_INDEXES

Contains information about indexes: metadata such as the number of distinct keys, if it is primary/secondary etc.

ALL_VIEWS

Contains the definition (SQL query) for the view and other metadata. The definition cannot be modified directly as the catalog is read-only.

ALL_CONS_COLUMNS

Check constraints for all tables: constraint name, table name, owner, name of the column the constraint is for etc.

ALL_CONSTRAINTS

Contains more information about constraints including the search condition (SQL constraint definition) and constraint type (stored as a char - can be check constraint, primary/unique key, referential integrity constraint etc.).

TABLE_PRIVILEGES

Grantee, owner, grantor, table name and privileges (select, insert etc.) for each grantee.

8. Transaction Processing

A logical unit of work consisting of one or more statements. The effects of the transaction only become permanent when the transaction is completed successfully.

An application program may contain several transactions.

The basic data operations are read and write. A read operation will:

A write operation will:

A transaction can be in the following states:

The transaction can be aborted when in the active or partially committed states.

Concurrent Execution of Transactions

Multi-user systems:

Concurrency:

Lost Update Problem

When two transactions that update the same database items have their operations interleaved in a way that makes the value of some database item incorrect.

Example:

Transaction 1  |  Transaction 2
               |
read_item(X)   |
X := X - N     |
               |  read_item(X)
               |  X := X + M
               |
write_item(X)  <-- This is lost!
               |  write_item(X)

Temporary Update Problem

One transaction updates a database item but the transaction fails. Another transaction accesses the updated item before it is reverted to its original value.

Reading the uncommitted value is called a dirty read.

Example:

Transaction 1  | Transaction 2
read_item(X)   |
X := X - N     |
write_item(X)  |
               |  read_item(X) <- Dirty Read
               |  X := X + M
               |  write_item(X)
throw_error()  |
undo_change()  |

Incorrect Summary Problem

One transaction runs an aggregate function while other transactions update some of the records, causing the aggregate function to read values both before and after they are updated.

ACID Properties

Atomic: a transaction is either performed entirely or not at all.

Consistent: the transaction will move the database from one consistent state to another.

Isolated: the transaction’s changes should not be visible to other transactions until it is committed - this can solve the temporary update problem.

Durable: once a transaction is committed, system failures will not cause the changes to be lost.

The concurrency system is the primary means by which the isolation property is preserved, and supports the atomicity and consistency properties.

The recovery system is the primary means by which the atomicity and durability properties are preserved, and supports the consistency and isolation properties.

Concurrency Control and Recovery

Not covered due to time limitations.

Test Notes

Intro

ER Diagrams

Attributes:

Relationships:

Weak entities:

EER:

Relational Data Model

Domain constraint: value must be in their domain.

Key constraint:

Referential integrity constraint: foreign key matches existing primary key of relation, or is null.

Semantic integrity constraint: domain-specific constraints.

EER To Relation

  1. Simple entity: new relation, pick any key as PK
  2. Weak entity: new relation, PK is owner PK + partial key
  3. 1:N relationship: FK on N side
  4. 1:1 relationship: FK on side with total participation
  5. M:N relationship: new relation with both FKs as PK
  6. Multivalued attributes: new relation, PK is owner PK + attribute value
  7. N-ary relationships: new relation with FKs as PKs
  8. Specialization/generalization
    • Any specialization: superclass has its own relation; subclasses have their own relation with their attributes; PK is FK for the superclass
    • Total specialization: subclasses have their own relation, duplicating superclass attributes
    • Disjoint specialization: single relation with properties from every subclass, plus type attribute
    • Overlapping specialization: single relation with properties from every subclass, plus Boolean for each subclass indicating if the entity is a member of the subclass
  9. Category: PK is artificial, auto-generated ID. All superclasses of the category have this as a FK

Normalization

Guidelines:

Functional Dependencies

Armstrong’s Inference Rules

  1. Reflexivity: $Y \subseteq X \implies X \rightarrow Y$
  2. Augmentation: $X \rightarrow Y \implies XZ \rightarrow YZ$
  3. Transitivity: $X \rightarrow Y \land Y \rightarrow Z \implies X \rightarrow Z$

From these, the following can be generated:

  1. Decomposition: $X \rightarrow YZ \implies X \rightarrow Y \land X \rightarrow Z$
  2. Union: $X \rightarrow Y \land X \rightarrow Z \implies X \rightarrow YZ$
  3. Pseudo-transitivity: $X \rightarrow Y \land WY \rightarrow Z \implies WX \rightarrow Z$

Equivalence of Sets

$F$ covers $G$ if $G^+ \subseteq F^+$. Equivalent if $F$ covers $G$ and $G$ covers $F$.

To test if $F$ covers $G$, for each FD $X \rightarrow Y$ in $G$, check if $Y \subseteq \{X\}_F^+$.

Minimal Sets

Minimal if the RHS of all FDs is a single attribute and removing any FD, or any attribute from the LHS of any FD, leads to a set that is not equivalent.

Normal Forms

If attributes can be removed from the LHS and the FD still holds, then it is a partial FD. Otherwise, it is called a full FD.

If $X \rightarrow Y$ and $Y \rightarrow Z$, $X \rightarrow Z$ is a transitive functional dependency.

Minimal Cover Algorithm

# Uses closure(X, F) from the 'Determining the closure' section.
def minimal_cover(F):
    # F: list of (lhs, rhs) attribute sets. Step 1: single-attribute RHSs (decomposition).
    G = [(frozenset(lhs), frozenset({a})) for lhs, rhs in F for a in rhs]

    # Step 2: remove extraneous attributes from each LHS.
    changed = True
    while changed:
        changed = False
        for i, (lhs, rhs) in enumerate(G):
            for el in lhs:
                # el is redundant if the reduced LHS still determines the RHS
                if len(lhs) > 1 and rhs <= closure(lhs - {el}, G):
                    G[i] = (lhs - {el}, rhs)
                    changed = True
                    break

    # Step 3: drop FDs that are inferable from the remaining ones.
    for fd in list(G):
        rest = [g for g in G if g != fd]
        if fd[1] <= closure(fd[0], rest):
            G.remove(fd)
    return G

Physical Layer

Hashing

RAID 0: striping data; 1: mirroring.

Indexing

B-Trees

Structure with pointers to subtrees and (key value + pointer) pairs that divide the subtrees (all left-subtree values are less than the key value etc.): $k$ child nodes, $k - 1$ key-pointer pairs. Nodes must be at least half full.

B+ tree: internal nodes store only key values. Leaf nodes have key value + pointer pairs, plus pointer to next leaf node in the tree.

Transactions

ACID:

Misc.

Materialized evaluation: the result of an operation is stored as a temporary relation (e.g. comparing values against the max value from a nested query). c.f. pipelined evaluation.