All Files in ‘COSC265 (2020-S2)’ Merged

0. Introduction

Definitions

Database Management System (DBMS)

A software system facilitating the creation/maintenance of a database:

Functionality:

Compared to files:

The schema, also known as the database intension, rarely changes, but the state of the database, called the extension, changes frequently. The actual data stored in the database at a particular moment in time is called the instance.

Three-Schema Architecture

Three-Level Data Model

DBMS Languages

1. Data Modelling

Phases of Database Design

Entity-Relationship

Attributes:

Relationships:

Weak Entities:

Cardinality Ratio:

Structural Constraints:

Enhanced Entity-Relationship

Two subclasses can merge if they have the same unique identifier - they share a root superclass. In this case, the structure that is formed is called a lattice.

Union: each instance of the category is a member of one of the superclasses. The category will have an artificial surrogate key.

2. Relational Data Model

Formal Definitions

| Formal      | Informal        |
| ----------- | --------------- |
| Relation    | Table           |
| Tuple       | Row             |
| Attribute   | Column          |
| Degree      | Num. attributes |
| Cardinality | Num. rows       |

Relation: subset of the cartesian product of the domains i.e. any valid combination of values for all the attributes of the table. $R$ is the intension - the schema of the relation - and $r$ of $R$ is an extension - a specific valid value of the relation.

The ordering of tuples is irrelevant, but the ordering of values within a tuple is relevant.

Constraints

Conditions that hold for all valid relation instances.

Domain constraint

Specifies that all values in a relation must be inside their respective domains.

Key Constraint

Entity Integrity Constraint

Primary key values must not be null.

Referential Integrity Constraint

The referencing relation has a foreign key attribute referencing the primary key of the referenced relation. The value can be null, and the primary key can itself be a foreign key.

Semantic Integrity Constraint

Constraints not expressed by the model and relating to the application semantics.

EER to Relation Mapping Algorithm

  1. Simple entities: make a new table, pick any key as the primary key
  2. Weak entities: make a new table, with the primary key being the combination of the owner's primary key and the partial key. If the partial key is composite, use its components
  3. 1:N: Put the foreign key in the ‘N’ side
  4. 1:1: Put the foreign key on the side with total participation
  5. M:N: make a new table with two foreign keys; the combination of the two form the primary key
  6. Multivalued attributes: make a new table with a foreign key to the owner. The combination of the value of the attribute and foreign key is the primary key
  7. N-ary relationships: make a new table, with the combination of the foreign keys being the primary key.
  8. Specialization/generalization:
    • Superclass has its own table with general attributes; subclasses have their own table with specific attributes; the primary key is a foreign key to superclass
    • Total specialization: tables just for the subclasses (duplicating superclass attributes)
    • Disjoint specialization: single table with attributes for every subclass, plus attribute denoting which subclass it is (called the type)
    • Overlapping specializations: same as above, but Boolean for every subclass indicating which subclasses it is a member of
  9. Categories: primary key is an artificial, automatically generated ID; all superclasses of the category have this as a foreign key

Relational Algebra

Variables

$VARNAME \leftarrow operation$

Select

Project

Rename

Set

Union-compatible relations: require the same degree and pairs of corresponding attributes to have the same domain.

Union and intersection are:

Joins

$R \Join_{condition} S$

Equijoin:

Natural join:

$\theta$ join:

The same type of joins are available for left, right and full outer joins (symbols have two horizontal lines sticking out of the top and bottom of the left/right end of the \Join symbol).

Division

‘For every’ queries.

If $R(X)$ and $S(Z)$, $R \div S = T(Y)$ where $Y = X - Z$; columns only in $S$ will not appear.

For a tuple $t$ to appear in $T$, the values in $t$ must appear in $R$ in combination with every tuple in $S$.

If $S$ contains a single key $A$ and $R$ contains an additional column $B$, then $T$ will be a table with a single column $B$. If a value $b$ exists in $B$, there exists a tuple $(a, b)$ for all values of $a$ found in $S$.

e.g. if $S = \{(a_1), (a_2)\}$ and $R = \{(a_1, b_1), (a_2, b_1), (a_1, b_2)\}$ then $T$ will contain $\{(b_1)\}$: $b_1$ appears in combination with every value in $S$, but $b_2$ does not. If $S$ is a table and $R$ is a table with the exact same columns plus a primary key:

Equivalent to:

$$\begin{aligned} T_1 &\leftarrow \pi_Y(R) \text{ (getting rid of columns only in } S \text{)} \\ T_2 &\leftarrow \pi_Y((S \times T_1) - R) \\ T_1 &\leftarrow T_1 - T_2 \end{aligned}$$

If $R \times S = T$, then $T \div S = R$.

Explanation of division:

Given $\delta = \pi_{A-B}(\alpha)$:

$$\alpha \div \beta = \delta - \pi_{A-B}((\delta \times \beta) - \alpha)$$

3. SQL Queries

Oracle SQL Data Definition

An SQL relation (table) is a multi-set, NOT a set, of tuples. Using PRIMARY KEY or UNIQUE, or the DISTINCT option in a query, constrains the result to a set. DISTINCT requires a search to ensure duplicate tuples are not returned, so it is slower than using ALL (the default).

Naming:

Expression

An attribute (i.e. column name) or combination of attributes, operators and functions.

Binary Operators

Functions

Scalar:

Aggregate/Set:

Predicates

Misc.

SQL+:

Create Schema

Schemas are used to group tables together. Permissions are set on a per-schema level.

CREATE SCHEMA schema-name
AUTHORIZATION owner

Create Table

CREATE TABLE [schema-name.]table-name
(
  column-name column-specification[,
  column-name column-specification]*

  [constraint-name constraint-specification]*
)

Use CREATE TABLE table-name AS (query) to populate the table with data.
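For example (the table name and predicate are illustrative, reusing the movie table from the query sections):

CREATE TABLE comedy_movies AS
  (SELECT * FROM movie WHERE type = 'comedy')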

Notes

Constraints can be inline or defined after the columns. If the latter, multiple columns can be referenced.

Inline:

column-definition constraints where constraints can be:

To define a primary key after column definitions: PRIMARY KEY (column-name). A composite primary key can be made by passing multiple arguments.

To define foreign keys after column definitions:

FOREIGN KEY (column-name)
REFERENCES foreign-table (foreign-table-key)
[ON DELETE integrity-option]
[ON UPDATE integrity-option]

To add constraints after defining the table:

ALTER TABLE table-name ADD CONSTRAINT constraint-name constraint-definition

Where integrity-option is one of RESTRICT, CASCADE, SET NULL or SET DEFAULT.
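Putting the pieces together, a minimal sketch (the movie/director tables mirror the examples in the query sections; the exact types and constraint name are assumptions):

CREATE TABLE movie
(
  mnumber  NUMBER PRIMARY KEY,   /* inline constraint */
  title    VARCHAR2(100) NOT NULL,
  awards   NUMBER,
  director NUMBER,
  CONSTRAINT movie_director_fk   /* named constraint, after the columns */
    FOREIGN KEY (director) REFERENCES director (id)
    ON DELETE SET NULL
)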

Domain: basically typedef for SQL. NOT in Oracle:

CREATE DOMAIN domain-name type-definition [CHECK check-statement]
/* e.g. `check-statement` could be `(VALUE IS NOT NULL)` */

Drop Table

Drops the relation (base table) and definition: DROP TABLE table-name.

Alter Table

Add, delete or modify attributes/constraints using ALTER commands.

Add/modify: ALTER TABLE table-name ADD|MODIFY column-name column-definition. The NOT NULL constraint is only allowed if it also has a DEFAULT default-value constraint. Otherwise, the column must be added without the NOT NULL constraint, then all rows updated, then the definition altered.

Delete column and all constraints: ALTER TABLE table-name DROP [COLUMN] column-name RESTRICT|CASCADE.

Delete constraint: ALTER TABLE table-name DROP [CONSTRAINT] constraint-name RESTRICT|CASCADE. If the constraint is a primary key or unique, CASCADE drops all referencing foreign keys; RESTRICT stops the operation if there are any referencing foreign keys.

Drop Schema

DROP SCHEMA schema-name RESTRICT|CASCADE. RESTRICT only allows the operation to run if the schema is empty; CASCADE drops all child objects as well.

Oracle SQL Queries

SELECT

SELECT [ALL|DISTINCT] attribute-list
FROM table-list
[WHERE condition]
[GROUP BY grouping-attributes]
[HAVING group-condition]
[ORDER BY attribute-list]
[OFFSET n]
[FETCH NEXT m ROWS ONLY] /* Oracle specific */

Six clauses; the last four are optional:

Table Alias

Oracle uses the syntax tableName alias.

Other SQL implementations use tableName AS alias.

This may be useful when referring to two entries in a single table (e.g. find rows where the last names are the same), as sketched below.
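A minimal self-join sketch using the director table from the later examples (the D1.id < D2.id condition is an assumption that avoids matching a row with itself and reporting each pair twice):

SELECT D1.id, D2.id
FROM director D1, director D2
WHERE D1.lname = D2.lname AND D1.id < D2.id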

The result of a nested select statement can also be aliased: (SELECT clause that is probably using JOINs) alias.

Attribute list

The attribute list can contain column names, aggregate functions and constants:

Some WHERE predicates

GROUP BY

For applying aggregate functions to groups of tuples; each group is a set of tuples where they have the same value for the grouping attribute(s) (comma-separated list). This is used for summary information, and is often phrased as for each entity ….

e.g. SELECT origin, COUNT(origin) FROM routes GROUP BY origin returns number of times each origin value appears.

e.g. SELECT COUNT(DISTINCT type), director FROM movie GROUP BY director returns the number of genres of movies each director has directed.

When using GROUP BY, all non-aggregate attributes need to be referenced. This is because grouping reduces each group to a single tuple, which may not be possible if they are not all referenced. For example, if (a, b) takes the values (0, 0) and (0, 1) and is grouped by a, the DBMS must either pick one of the b values or pick both, neither of which is allowed.

This can happen even if the primary key is used as the grouping attribute.

HAVING

Retrieves the values of these functions only for groups that satisfy certain conditions - i.e. it filters out groups.

e.g. SELECT origin, COUNT(origin) FROM routes GROUP BY origin HAVING COUNT(origin) > 10.

COUNT(*) gets the count for the group.

ORDER BY

Sorts tuples based on the values of some attributes:

ORDER BY [column-name ASC|DESC] [, column-name ASC|DESC]*

Default is ASC. Sorts by the first column name first and, when the values are equal, sorts by the second column etc. The sort is not stable.
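For instance, using the director table from the other examples, the following sorts by last name ascending, breaking ties by first name descending:

SELECT lname, fname FROM director ORDER BY lname ASC, fname DESC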

Nested Queries

A nested query/sub-query can be specified within the WHERE and HAVING clauses of an outer query. The sub-query must be enclosed within parentheses.

Examples:

EXISTS

Used to check if result of correlated sub-query is empty.

e.g. SELECT * FROM director D1 WHERE EXISTS (SELECT * FROM movie WHERE D1.id = director AND type='comedy') finds directors that have directed at least one comedy.

Multi-table queries

Queries that use nested queries with the = or IN conditions can always be expressed as a single query. This performs a JOIN in the background, and so is much more efficient than correlated nested queries. The FROM clause may contain more than one table.

e.g. SELECT title FROM movie, director WHERE dnumber=director AND lname='Nolan'.

In general, if you have n tables, you will need n-1 join conditions.

If there is no WHERE clause, all tuples of the relations are selected and the result is equivalent to the Cartesian product, most of which will be nonsense.

Multi-table query for the same table: SELECT DISTINCT M1.director FROM movie M1, movie M2 WHERE M1.director = M2.director AND M1.type='comedy' and M2.type='drama'. A similar result could be done using an INTERSECT operation.
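The INTERSECT formulation might look like this sketch (same movie table; each operand returns the set of directors of that type):

SELECT director FROM movie WHERE type='comedy'
INTERSECT
SELECT director FROM movie WHERE type='drama'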

Specifying Joins in SQL2

... FROM table1 JOIN_TYPE table2 ON col1=col2

Where the join type is one of:

The result of a join statement can also be used as a table e.g.:

SELECT title FROM
  (SELECT * FROM movie JOIN director ON movie.director=director.id) movieDirector
WHERE movieDirector.lname='Nolan' /* single quotes only */

To select all from both tables, use SELECT movie.*, director.*.

Set operations

UNION, EXCEPT (MINUS in Oracle) and INTERSECT. The results are sets - no duplicates, unless the keyword ALL is appended.

The two operands - select statements (in brackets) - must return tables with the same number of attributes and compatible data types.

ANY/ALL

Used with comparison operators and nested queries.

= ANY is equivalent to IN. ANY is sometimes called SOME.

<> ALL is equivalent to NOT IN.

The following will return movies that got more awards than all movies by a particular director:

SELECT * FROM movie WHERE awards > ALL (
  SELECT awards FROM movie, director WHERE movie.director = director.id AND lname = 'Nolan'
)

Division

$$R \div S = T(A)$$

$R$ and $S$ have at least one common attribute, e.g. the result of division with $R(A, B)$ and $S(B, C)$ will have $(A, B) - (B, C) = (A)$.

Directors that have directed at least one movie of each type:

SELECT fname, lname FROM director WHERE NOT EXISTS (
  SELECT * FROM movie M1 WHERE NOT EXISTS (
    SELECT * FROM movie M2 WHERE M2.director = director.id AND M2.type = M1.type
  )
)

The M2.type = M1.type condition is able to cover every single type of movie because M1 ranges over all movies and is not constrained to any particular director.

The following also gives the same result:

SELECT fname, lname FROM DIRECTOR
  JOIN movie ON director.id = movie.director
  GROUP BY director, fname, lname
  HAVING COUNT(DISTINCT type) = (
    SELECT COUNT(DISTINCT type) FROM movie
  )

The nested query is not correlated.

DUAL table

In Oracle, it is a dummy table with a single column and row which is used to get scalar values e.g. SELECT sysdate FROM DUAL.

INSERT

Adds one or more tuples to a relation. The following requires all attributes to be listed in the same order they were specified when the table was created:

INSERT INTO table_name VALUES (attr_1, ..., attr_n)

To specify the attributes to insert:

INSERT INTO table_name (attr_name_1, ..., attr_name_n) VALUES (attr_1, ..., attr_n)

Both forms allow you to insert multiple tuples, separating them with commas.

A select statement can be used to insert multiple tuples:

INSERT INTO table_name (attr_name_list) select_statement

NB: two single quotes escape a quote in a string literal.
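For example, a sketch of the escaping rule (the values are made up):

INSERT INTO director (id, fname, lname) VALUES (7, 'Peter', 'O''Toole')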

UPDATE

UPDATE table_name SET attr_name_1 = expression_1, ... WHERE where_condition

The where condition is optional.
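A small sketch against the movie table used in the query examples (the title value is illustrative):

UPDATE movie SET awards = awards + 1 WHERE title = 'Memento'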

DELETE

DELETE FROM table_name WHERE where_condition

You can only delete data from one table at a time (except if CASCADE is used on a foreign key).

If the where condition is missing, every row gets deleted.

Views

Views are virtual tables - essentially a stored query that allows customized access to data (e.g. remove access to some sensitive attributes of a table).

In general, for a view to be updatable (varies by implementation):

CREATE VIEW view_name AS select_statement

Once a view is created, it can be queried like a normal table.
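A sketch, again assuming the movie table from the earlier examples:

CREATE VIEW comedies AS
  SELECT title, director FROM movie WHERE type = 'comedy'

SELECT title FROM comedies /* queried like a normal table */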

To drop a view:

DROP VIEW view_name

In Oracle, to find out which views are updatable:

SELECT column_name, updatable FROM user_updatable_columns WHERE table_name=upper('view_name')

Indexes

Indexes are a physical-level feature; access structures to speed up queries.

CREATE INDEX index_name ON table_name (attr_name_1, ...)

In older versions of Oracle, indexes would not be created for keys. In this case, an index would need to be created with the UNIQUE keyword.

DROP INDEX index_name is used to delete indexes.

Notes:

Integrity Component

SQL 92 adds:

Domain Constraints

Not in Oracle:

CREATE DOMAIN domain_name AS data_type DEFAULT default_option CHECK (search_option)

Both DEFAULT and CHECK are optional, and search_option should reference the value being checked using VALUE.

Use DROP DOMAIN domain_name, optionally appending RESTRICT or CASCADE.

Referential Integrity Violation

The qualifiers are ON DELETE and ON UPDATE.

The options are SET NULL, CASCADE and SET DEFAULT.

Oracle only supports ON DELETE CASCADE and ON DELETE SET NULL.

Enterprise Constraints

Constraints
CONSTRAINT constraint_name CHECK (condition)

This is for a specific table, so the condition can reference multiple attributes. In SQL2, the condition may include a nested SELECT statement.
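For instance, a sketch of a table-level check (the constraint name and condition are assumptions, reusing the movie table):

ALTER TABLE movie ADD CONSTRAINT movie_awards_chk CHECK (awards >= 0)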

Assertions

Constraints spanning multiple tables can be made: these are called assertions.

CREATE ASSERTION assertion_name CHECK (search_condition)

The search condition will be a SQL query, probably using the EXISTS keyword.
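A hedged SQL2-style sketch (the condition is illustrative):

CREATE ASSERTION max_ten_awards CHECK (
  NOT EXISTS (SELECT * FROM movie WHERE awards > 10)
)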

This is not supported in Oracle.

Triggers

Triggers monitor the state of the database and perform some action when a specific condition occurs.

Triggers can:

A trigger has an event - a change that activates a trigger. This can be:

For DML events, there are row-level and statement-level triggers; the former occurs for each modified row while the latter runs once for each statement.

Once an appropriate event occurs, it must pass a condition: a query which determines if the trigger should be activated.

If the condition is met, the action will occur. This can be one or more statements (procedures).

The action can occur BEFORE, AFTER, or INSTEAD OF.

CREATE [OR REPLACE] TRIGGER [schema.]trigger_name
BEFORE|INSTEAD OF|AFTER
DELETE|INSERT|UPDATE OF attr_name_1, attr_name_2... OR DELETE|INSERT|UPDATE OF ...
ON [schema.]table_name|view_name
[REFERENCING OLD AS old_row_alias NEW AS new_row_alias]
FOR EACH ROW|STATEMENT
WHEN (condition)
  body
/

Within the SQL statements, if REFERENCING is not specified, :new.attribute_name and :old.attribute_name are used to reference attributes values for the row in question.

The / is used to end the trigger action. It must be on a new line.

Limitations:

To alter the trigger:

ALTER TRIGGER [schema.]trigger_name ENABLE|DISABLE|COMPILE [DEBUG]

SHOW ERRORS shows any compilation errors.

PL-SQL Example Code
DECLARE
  varname type := initial_value;
  varname2 type;
BEGIN
  SELECT attribute INTO varname FROM table_name WHERE ...;
  RAISE_APPLICATION_ERROR(num => error_code, msg => 'Message');
  -- Error code between -20999 and -20000
  ...;
EXCEPTION
  handlers;
END;

Cannot use EXISTS: must store COUNT(*) into variable.
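A minimal PL-SQL sketch of this workaround (the table, condition and message are illustrative):

DECLARE
  cnt NUMBER;
BEGIN
  SELECT COUNT(*) INTO cnt FROM movie WHERE type = 'comedy';
  IF cnt = 0 THEN
    RAISE_APPLICATION_ERROR(-20001, 'No comedies found');
  END IF;
END;
/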

Example: Audit Trigger
CREATE [OR REPLACE] TRIGGER audit_trigger BEFORE INSERT OR DELETE OR UPDATE ON table_name
FOR EACH ROW
begin
  if INSERTING then
    INSERT INTO audit_table VALUES (USER || 'inserting' || 'new id:' || :new.id);
  elsif DELETING then
    INSERT INTO audit_table VALUES (USER || 'deleting' || 'old id:' || :old.id);
  elsif UPDATING('attribute_name') then
    INSERT INTO audit_table VALUES 
      (USER || 'updating' || 'old value:' || :old.attribute_name || 'new value:' || :new.attribute_name);
  elsif UPDATING then
    INSERT INTO audit_table VALUES 
      (USER || 'updating' || 'old id:' || :old.id || 'new id:' || :new.id);
  end if;
end;
/
Procedures

Collection of statements; basically functions:

CREATE [OR REPLACE] PROCEDURE [schema.]procedure_name
(argument_name [IN|IN OUT|OUT] datatype, ...)
IS|AS
  body

The body cannot contain DDL statements.

Compiling a procedure: ALTER PROCEDURE [schema.]procedure_name COMPILE.
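A sketch of a complete procedure under these rules (the name, parameter and body are assumptions, reusing the movie table):

CREATE OR REPLACE PROCEDURE add_award
(p_mnumber IN NUMBER)
IS
BEGIN
  UPDATE movie SET awards = awards + 1 WHERE mnumber = p_mnumber;
END;
/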

Database Security

Discretionary and mandatory security mechanisms.

Control measures:

The DBA is responsible for the overall security of the database. They have the responsibility to grant privileges to users and assign security levels to users and data.

They can set privileges at the account and relation level:

Each table is assigned an owner account; they have all privileges for the table, and can give privileges to other users.

Privileges can also be specified using views, allowing, for example, users access to only some attributes of a table.

GRANT account_privilege|table_privilege|ALL PRIVILEGES
ON object
TO user|role|PUBLIC
WITH GRANT OPTION;


REVOKE account_privilege|table_privilege|ALL PRIVILEGES
ON object
FROM user|role|PUBLIC;

/* EXAMPLES */
GRANT CREATE TABLE TO user WITH GRANT OPTION;
GRANT INSERT, DELETE ON table1, table2 TO user;
GRANT UPDATE ON table1(attribute1) TO user;

Table privileges: SELECT; INSERT; UPDATE; ALTER; DELETE; INDEX; DEBUG; REFERENCES. INSERT, UPDATE and REFERENCES (allows the user to have a FK to the table) can be constrained to specific attributes.

If GRANT OPTION is set, the user granted the privilege can grant the privilege to other users (propagation). However, if the privilege is revoked, all propagated privileges are automatically revoked.

4. Data Normalization

Looking at an existing database and attempting to improve the design.

Information Design Guidelines

User view (logical):

Storage/base relation level:

Semantics of Attributes

Guideline 1: each tuple should represent one entity or relationship instance:

Guideline 2: design relations so that no update anomalies are present:

Guideline 3: design relations to have as few NULL values as possible:

Guideline 4: relations should be designed to satisfy the lossless join condition:

Decomposition

Dividing a relation into two or more relations. To do this, normalization is required - this is essentially a set of tests used to determine the ‘quality’ of a schema.

Functional Dependencies

Assume the universal relation schema is $R(A_1, A_2, \dots, A_n)$.

A set of attributes, $X$, functionally determines a set of attributes $Y$ if the value of $X$ determines a unique value for $Y$: $X \rightarrow Y$.

For each value of $X$, there is exactly one value of $Y$. The simplest example of this is if $X$ is the primary key and $Y$ is the tuple.

If two tuples have the same value for $X$, they have the same value for $Y$: if $tuple_1[X] = tuple_2[X]$, then $tuple_1[Y] = tuple_2[Y]$.

Inference Rules

Given a set of FDs $F$, we can infer additional FDs that hold whenever the FDs in $F$ hold.

The closure of a set $F$ is the set $F^+$ which contains all FDs that can be inferred from $F$.

If $X \rightarrow Y$ is inferred from $F$, then it is notated as $F \models X \rightarrow Y$.

Armstrong’s inference rules state that:

These three rules form a sound and complete set of inference rules (every functional dependency that holds can be inferred from them, and everything inferred is correct).

Using these three rules, we can deduce three more rules:

Example

Given $R(A, B, C, D)$ and a set of functional dependencies $F = \{ A \rightarrow B, AC \rightarrow D, BC \rightarrow D, CD \rightarrow A \}$, show $AC \rightarrow D$ is redundant.

Using Armstrong’s theorems:

Using the definition of equivalent sets of functional dependencies (see equivalence sets section):

$$G = \{ A \rightarrow B, BC \rightarrow D, CD \rightarrow A \}$$

$F$ definitely covers $G$, so we only need to find out if $G$ covers $F$.

For all FDs in $F$:

Determining the closure

Closure of a set of attributes $X$ with respect to $F$ is the set $X^+$ of all attributes that are functionally determined by $X$.

$\{X\}_F^+$ can be calculated by repeatedly applying the three primary rules, as in the sketch below:

def closure(X, F):  # F: list of (lhs, rhs) pairs for FDs lhs -> rhs
    x_plus, changed = set(X), True
    while changed:
        changed = False
        for lhs, rhs in F:
            if set(lhs) <= x_plus and not set(rhs) <= x_plus:
                x_plus |= set(rhs)
                changed = True
    return x_plus
Example

Given $R(A, B, C, D, E)$ and a set of functional dependencies $F = \{ A \rightarrow BC, CD \rightarrow E, AC \rightarrow E, B \rightarrow D, E \rightarrow AB \}$, determine all candidate keys in $R$:

Start with single attributes:

$$\{A\}_F^+ = \{ A \}$$

Look at the functional dependencies:

As $\{A\}_F^+$ contains all the attributes, $A$ is a candidate key.

For $B$:

$\{C\}_F^+ = \{ C \}$: only determines itself. The same is true for $D$.

After one iteration, $\{E\}_F^+ = \{ A, B, E \}$. After another iteration, $C$ and $D$ get added. Thus, $E$ is also a candidate key.

Now, look at combinations of two attributes. Ignore combinations containing $A$ or $E$ as they are candidate keys by themselves:

Thus, $A$, $E$, $BC$ and $CD$ are candidate keys. No three-attribute combinations that do not contain one of these can be created, so the search stops here.

Equivalence of Sets of FDs

Two sets of FDs, $F$ and $G$, are equivalent if $F^+ = G^+$: every FD in $F$ can be inferred from $G$ and vice-versa.

Alternatively, $F$ covers $G$ if $G^+ \subseteq F^+$. Hence, $F$ and $G$ are equivalent if $F$ covers $G$ and vice-versa. This is much easier to test:

Calculate $\{X\}_F^+$ ($X^+$ with respect to $F$) for each FD $X \rightarrow Y$ in $G$; if each closure includes all attributes in the corresponding $Y$, then $F$ covers $G$.

Example

Check if $F$ and $G$ are equivalent:

$$\begin{aligned} F &= \{ A \rightarrow C, AC \rightarrow D, E \rightarrow AD, E \rightarrow H \} \\ G &= \{ A \rightarrow CD, E \rightarrow AH \} \end{aligned}$$

FDs in $G$:

Hence, $F$ covers $G$.

FDs in $F$:

Hence, $G$ covers $F$.

Hence, the sets are equivalent.

Minimal Sets of FDs

A set of FDs $F$ is minimal if:

Every set of FDs has an equivalent minimal set (or sets - there can be several equivalent minimal sets).

Example

Is GG minimal? If not, find the minimal set:

$$G = \{ \text{SSN} \rightarrow \{\text{Name}, \text{Born}, \text{Address}, \text{DepartmentId}\},\ \text{DepartmentId} \rightarrow \{\text{Name}, \text{ManagerSSN}\} \}$$

The first condition is violated; the RHS has multiple attributes for both dependencies.

Using the decomposition rule ($A \rightarrow BC$ means $A \rightarrow B$ and $A \rightarrow C$):

The third rule is met as each LHS only has one attribute, so it cannot be reduced further.

The second rule is met; it is obvious that all dependencies are necessary.

Normal Forms Based on Primary Keys

Normalization is the process of decomposing ‘bad’ relations into smaller relations.

Keys

Superkey of relation $R$: a set of attributes $S \subseteq R$ such that no two legal tuples in $R$ will have the same $S$ values.

Key $K$: a superkey where removal of any attribute will cause it to no longer be a superkey.

If a relation has more than one key, each is called a candidate key, with one arbitrarily designated as the primary key and the others as secondary keys.

A prime attribute is a member of at least one candidate key.

First Normal Form

Every attribute is single and atomic; no composite or multivalued attributes or nested relations.

Second Normal Form

Full functional dependency: FD of the form $Y \rightarrow Z$ where removing any attribute from $Y$ means the FD no longer holds. It is called a partial dependency if this does not hold.

Second normal form adds more conditions to the first normal form: no non-prime attribute is partially functionally dependent on the primary key.

To transform into second normal form, split into smaller tables: if the primary key is $A$ and $B$, and attribute $C$ is fully dependent on both $A$ and $B$ but $D$ is dependent on only $B$, create two tables, $\{A, B, C\}$ and $\{B, D\}$. The primary key of the second table is only $B$, so $D$ is fully dependent on the primary key.

Third Normal Form

If $X \rightarrow Y$ and $Y \rightarrow Z$, $X \rightarrow Z$ is a transitive functional dependency.

Relations in third normal form are in 2NF and no non-prime attributes are transitively dependent on the primary key (i.e. via a chain $\text{PrimaryKey} \rightarrow X \rightarrow \text{NonPrimeAttribute}$).

Example: $\text{SSN}$ identifies $\text{DepartmentID}$ and $\text{DepartmentID}$ identifies $\text{DepartmentName}$. Hence $\text{DepartmentName}$ must be in a separate relation from $\text{SSN}$, with $\text{DepartmentID}$ acting as the primary key.

General Normal Form Definitions

These take into account candidate keys, not just the primary key.

Second normal form: no non-prime attribute is partially functionally dependent on any key of the relation.

Third normal form: for all FDs, either:

  • the LHS is a superkey, or
  • the RHS is a prime attribute.

Only the first condition applies for BCNF.

Ensure at least one table contains the entire primary key, and check that it is a minimal cover before transforming.

Example

$R(A, B, C, D, E, F, G, H, I, J)$ and $F = \{ AB \rightarrow C, A \rightarrow DE, B \rightarrow F, F \rightarrow GH, D \rightarrow IJ \}$.

Find the key of the table

Tip: look for any attribute that never appears on the RHS of any dependency.

$\{B\}_F^+ = \{ B \}$; first loop: $\{ B, F \}$; second loop: $\{ B, F, G, H \}$. This does not contain all attributes, so $B$ by itself is not a key.

Try $A$ and $B$: $\{AB\}_F^+ = \{ A, B \}$; …; this contains all attributes, so it is the key.

Decompose RR into 2NF

All non-prime attributes are fully dependent on $AB$. Hence, there can be no dependencies of the form $A \rightarrow X$ or $B \rightarrow X$. Hence, $A \rightarrow DE$ and $B \rightarrow F$ violate this rule. Split the relation into multiple relations:

Decompose RR into 3NF

$A \rightarrow D$ and $D \rightarrow IJ$, so there are transitive dependencies:

$B \rightarrow F$ and $F \rightarrow GH$, so decompose:

$R_3$ has no transitive dependencies.

Boyce-Codd Normal Form

An even stricter form of 3NF: if $X \rightarrow A$, then $X$ is a superkey of $R$.

It is not always desirable to transform a relation into BCNF as some FDs may be lost in the process.

In addition, it may not be possible to transform into BCNF.

Example

$\text{Area}$ determines $\text{CountyName}$ but is not a superkey, so the dependency violates BCNF.

Decompose $\text{Area}$ and $\text{CountyName}$ into their own table, with $\text{Area}$ being the primary key, and another table with $\text{PropertyId}$, $\text{Area}$, $\text{LotNumber}$, with $\text{PropertyId}$ being the primary key.

Note that the dependencies $\text{PropertyId} \rightarrow \text{CountyName}$ and $\{\text{CountyName}, \text{LotNumber}\} \rightarrow \text{PropertyId}$ have been lost. Hence, converting to BCNF does not always preserve dependencies.

Example

Determine which normal form the relation $R(A, B, C, D)$ is in. $AB$ is the primary key. The functional dependencies are $F = \{ AB \rightarrow CD, C \rightarrow ABD, D \rightarrow C \}$. If necessary, decompose into BCNF.

$C$ and $D$ are also candidate keys.

Example

$T$ lists dentist/patient data; a patient is given an appointment at a specific date and time with a dentist in a particular room; the dentist is in the same room for all appointments on a given day. Find examples of insertion/deletion/update anomalies, and describe the process of normalizing it to BCNF.

$T(\text{DNo}, \text{DName}, \text{PNo}, \text{PName}, \text{ADate}, \text{ATime}, \text{Room})$. The primary key is $\{\text{DNo}, \text{PNo}, \text{ADate}\}$.

Informal guideline: do not mix attributes from different entity types; the table has information on dentists, patients and appointments.

Insertion/deletion/update anomalies:

Functional dependencies:

1NF: all attributes are simple and atomic.

2NF: non-prime attributes must be fully functionally dependent on the primary key.

Decompose the table to meet 2NF:

Each table now represents information about a single entity.

3NF: LHS superkey or RHS prime attribute. BCNF: LHS superkey. For all tables, the LHS is the primary key, so each table is in both 3NF and BCNF.

Relational Synthesis

The universal relation, $R$, contains all attributes you wish to store. Decompose this into $n$ tables $R_i$ with the following properties:

Algorithm

Finding a Minimal Cover

For a set of FDs to be minimal:

To find a minimal cover G for F:

Example

Find the minimal cover for $F = \{ ABC \rightarrow D, AB \rightarrow C, C \rightarrow B \}$.

$G$ is initially the set of:

$C \rightarrow B$ cannot be minimized further.

Check $ABC \rightarrow D$:

$G$ is now the set of:

Now check $AB \rightarrow C$:

Now, check if all three FDs are necessary:

Hence, the minimal cover is $G = \{ AC \rightarrow D, AB \rightarrow C, C \rightarrow B \}$.

Note that there are two minimal covers for this specific example; the other is $G = \{ AB \rightarrow D, AB \rightarrow C, C \rightarrow B \}$.

Lossless Join and FD-Preserving Decomposition into 3NF

Adds a new step to ensure all functional dependencies are kept: if none of the relations’ keys contains the key of the universal relation, create another table that contains the key.

5. Physical Layer

Internal Schema

Good database design usually requires multiple tables and hence, many select operations will require access to multiple tables, which leads to performance costs.

DBMS designers must ensure the database, which will be stored in persistent memory, performs well. They must take into factors such as:

The database designers can affect performance through:

Memory Hierarchy

Primary memory - cache and main memory:

Secondary/tertiary memory - flash, HDDs, optical disks, tape:

Disk Storage Devices

Record Management

Multiple records are stored in a block:

File Organization

The DBMS:

The file descriptor/header contains metadata for the file, such as field names, data types and addresses of the file blocks on the disks. The physical disk blocks can be contiguous, linked or indexed.

Typical file operations:

Unordered files:

Ordered files:

Hashed Files (External Hashing)

Collision Resolution:

Misc. notes:

Redundant Arrays of Independent Disks (RAID)

Uses parallelism across multiple disks to increase read/write performance using data striping.

RAID levels:

Indexing

Indexes provide an efficient way to access records in different orders, without expensive operations like sorting or duplicating data, by creating a data structure that maps a key to an address.

Dense indexes have entries for every record, while sparse indexes contain entries for only some of the records (not a one-to-one relationship).

A non-clustered index stores the indexes in a structure separate from the actual data, containing the index key values and a pointer to the row.

A clustered index physically defines the order in which records are stored. Hence, there can only be one clustered index. This course limits clustering indexes to non-key fields, so a file has either a primary or a clustered index.

Under this definition, a clustered index will have an entry for each distinct value of the field and hence, it is a sparse index.

A primary index is indexed by the primary key and contains an entry for each block. Each entry contains a pointer to the block and the value of either the first or last record on the block. Hence, the primary index is a sparse index. To improve performance when inserting records, an overflow block may be used.

Secondary indexes are dense indexes that can be on either candidate keys or non-keys, and contain the value of the field and a pointer to either a block or record.

Example: Storage/Speed Tradeoffs

$r = 30{,}000$ records, record size $R = 100$ B, block size $B = 1024$ B, index key size $V = 9$ B, block pointer size $P = 6$ B. Blocks are unspanned.

Case 1: Ordered File

The blocking factor $BFR = floor(\frac{B}{R}) = floor(\frac{1024}{100}) = 10$. Hence, 10 records fit in a block.

Hence, the number of blocks required is $b = ceil(\frac{r}{BFR}) = ceil(\frac{30000}{10}) = 3000$.

For linear search:

For binary search, $log_2(b) \approx 12$ blocks must be read from the disk.

Case 2: Primary Index

The primary index will have an entry for each block. Each index entry has a size of $I = V + P = 9 + 6 = 15$ B.

Hence, the blocking factor for the index blocks is $BFRI = floor(\frac{B}{I}) = floor(\frac{1024}{15}) = 68$, so the index requires $ceil(\frac{b}{BFRI}) = ceil(\frac{3000}{68}) = 45$ blocks.

To find an individual record, binary search is done on the index, so the number of blocks read is $log_2(45) \approx 6$. There is an extra read required to read the block the record is stored in, so it requires $7$ reads in total.

The primary index must be ordered, so update/insertion/deletion operations are costly.

Case 3: Secondary Index

The secondary index is a dense index with an entry for each record.

Hence, the index requires $ceil(\frac{r}{BFRI}) = ceil(\frac{30000}{68}) = 442$ blocks.

Searching the index requires $log_2(442) \approx 9$ block reads, and then an additional read to read the block the record is stored in, so it requires $10$ reads in total.

Unlike the primary index, it does not need to be ordered so update/insertion/deletion operations are faster.

Multi-Level Indexes

The index itself is an ordered file, so a primary index to the index can be created; the original index is called the first-level index, the index to the index the second-level index etc.

This process can be repeated until all entries fit in a single block.

Assuming the first-level index is a primary index, one first-level index block covers $BFRI^1$ data blocks, one second-level block covers $BFRI^2$, and one $n$th-level block covers $BFRI^n$.

This has consequences for performance: as the indexes are ordered, update/insertion/deletion operations may require indexes at all levels to update. Hence, this is more useful in read-heavy scenarios.

B-Trees and B+-Trees

Due to the insertion/deletion problem, B or B+ tree structures are used.

In a B-tree, a node of order $p$ has the structure $P_1, <K_1, Pr_1>, P_2, <K_2, Pr_2>, \dots, P_{q-1}, <K_{q-1}, Pr_{q-1}>, P_q$ where:

A B-tree of order m has:

The keys divide the subtrees: all values in the subtree pointed to by $P_i$ lie between $K_{i-1}$ and $K_i$.

On insertion, if a node becomes full then the node is split into two nodes, with the mid-value being used as the key for the parent. If the parent then becomes full, this may cause a cascade. This behavior ensures the tree is always balanced.

On deletion, if the node has less than m / 2 keys, it can be merged with a sibling.

In a B+-tree, pointers to data records only exist in leaf nodes:

Example: Calculating Order

Find the order $p$ of a B+-tree if the block size $B = 512$ B, index key size $V = 9$ B, data pointer size $Pr = 7$ B, tree pointer size $P = 6$ B:

Each node can have up to $p$ tree pointers and $p - 1$ keys, taking up a total of $6p + 9(p - 1)$ bytes. This must fit within the block size:

$15p - 9 \le 512 \therefore p \le 521/15 \approx 34.7$. Hence, the order and maximum value of $p$ is $p = 34$.

To store one million records requires $log_{34}(1{,}000{,}000) \approx 4$ levels (data is stored in the leaf nodes, so one million leaf nodes are required).

6. Query Processing and Optimization

$$\pi_{lname, fname}(\sigma_{\text{snumber = star} \,\land\, \text{movie = mnumber} \,\land\, \text{title = 'Gone with the wind'}}(star \times stars \times movie))$$

If this is done naively, it will compute the Cartesian product of the three tables (if each has 100 rows, then this results in 1,000,000 rows) and then get the names of stars that performed in a particular movie - probably something like 10 rows. Needless to say, this produces a lot of garbage rows and is very inefficient. In a real database, query optimization is performed.

Query Processing

Code can be executed directly or stored and executed later (compiled).

Translating SQL Queries into Relational Algebra

Query block: the basic unit that can be translated into algebraic operators and optimized. This is a single SELECT ... FROM ... WHERE [... GROUP BY ... HAVING] expression. Nested queries will have multiple query blocks.

The query SELECT name FROM employee WHERE salary > (SELECT MAX(salary) FROM employee WHERE dept = 5) can be split into:

For the second query, an unoptimized process could be:

Side Note: Sort-Merge

The process can be done recursively.

Algorithms for WHERE

Simple selections have only a single condition; complex selections have multiple conditions.

Selectivity: ratio of the number of tuples that satisfy the condition to the total number of tuples.

For simple selection:

For complex selection:

Algorithms for JOIN

Cross join/Cartesian product is time consuming. Equi-join and natural joins are much more efficient. Join performance is affected by:

If an index/hash key exists for one of the join attributes, a single-loop join can be performed: read all rows from the table without the index and use the index to directly retrieve matching records. If this is not available, a nested-loop join is required (brute force).

Algorithms for SELECT

If the attribute list contains candidate keys, all that is required is removing unwanted attributes.

Otherwise, duplicates must be removed. This can be done through either sorting or hashing.

Algorithms for Set Operations

Sort both relations on the same attributes, then scan and merge the sorted files concurrently, deciding whether the tuple(s) should be kept by checking their equality.

Algorithms for Aggregate Functions

Min, max:

Sum, count, average:

Group by:

Query Representation

Query tree:

When execution of an internal node occurs, the node and its children are replaced with the resultant relation.

Canonical tree: standard tree corresponding to an SQL query without optimization. From this tree optimized alternatives can be generated.

Query Optimization

Heuristic Query Optimization

The parser generates an initial internal representation - the canonical tree. After this, heuristics should be used to optimize the internal representation.

The main heuristic is to apply operations that reduce the size of intermediate results as early as possible.

To determine its efficiency, we can do things such as:

Some optimizations that can be done on the first example:

Query Execution Plan

Query tree + information about access methods to be used for each relation + methods used to compute relational operators stored in the tree.

Materialized evaluation: when a temporary relation (the result of an operation) is created. For example, SELECT * FROM employee WHERE salary = (SELECT MAX(salary) FROM employee) will store the result of the nested query rather than computing it for each row.

Pipelined evaluation: when the result of an operation is forwarded to the next operator in the sequence.

Query Optimization in Oracle

Earlier versions used rule- or heuristic-based query optimization, where the optimizer chooses execution plans based on heuristically ranked operations.

With cost-based query optimization, the optimizer examines alternative access paths and operator algorithms, available indexes, past-access statistics etc. and chooses the plan with the lowest estimated cost (estimated IO, CPU, memory requirements etc.).

To view the cost of each operation in a query, use EXPLAIN PLAN: EXPLAIN PLAN SET STATEMENT_ID = 'some_identifier' FOR some_statement and then SELECT * FROM plan_table to view the estimated CPU, IO, time, temp space etc.
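A concrete sketch (the statement id and query are illustrative; plan_table is Oracle's default output table):

EXPLAIN PLAN SET STATEMENT_ID = 'q1' FOR
  SELECT title FROM movie WHERE awards > 5;

SELECT * FROM plan_table WHERE statement_id = 'q1';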

7. Data Catalogs

Data catalogs store metadata for database objects such as relations, attributes, domains, constraints, authorization and security information, owners, indexes.

The catalog is stored as a set of read-only relations and views (system tables).

The catalog is used:

The catalog consists of base tables and user-accessible views. The latter consists of:

Examples

Some interesting catalog tables:

ALL_TABLES

Contains information about all tables you have view permissions for. To view the number of rows for a specific table:

SELECT NUM_ROWS FROM ALL_TABLES WHERE OWNER = 'owner_name' AND TABLE_NAME = 'table_name'

Number of blocks used is also stored.

The catalog is updated regularly, so it may contain stale data. The LAST_ANALYZED column contains the time the statistics were last updated. To force an update, use the command ANALYZE TABLE table_name COMPUTE STATISTICS.

ALL_USERS

Contains information about all users you have view permissions for. The password is not viewable.

USER_USERS contains information about your own user.

ALL_INDEXES

Contains information about indexes: metadata such as the number of distinct keys, if it is primary/secondary etc.

ALL_VIEWS

Contains the definition (SQL query) for the view and other metadata. The definition cannot be modified directly as the catalog is read-only.

ALL_CONS_COLUMNS

Check constraints for all tables: constraint name, table name, owner, name of the column the constraint is for etc.

ALL_CONSTRAINTS

Contains more information about constraints including the search condition (SQL constraint definition) and constraint type (stored as a char - can be check constraint, primary/unique key, referential integrity constraint etc.).

TABLE_PRIVILEGES

Grantee, owner, grantor, table name and privileges (select, insert etc.) for each grantee.

8. Transaction Processing

A logical unit of work consisting of one or more statements. The effects of the transaction only become permanent when the transaction is completed successfully.

An application program may contain several transactions.

The basic data operations are read and write. A read operation will:

A write operation will:

A transaction can be in the following states:

The transaction can be aborted when in the active or partially committed states.

Concurrent Execution of Transactions

Multi-user systems:

Concurrency:

Lost Update Problem

When two transactions that update the same database items have their operations interleaved in a way that makes the value of some database item incorrect.

Example:

Transaction 1  |  Transaction 2
               |
read_item(X)   |
X := X - N     |
               |  read_item(X)
               |  X := X + M
               |
write_item(X)  <-- This is lost!
               |  write_item(X)

Temporary Update Problem

One transaction updates a database item but the transaction fails. Another transaction accesses the updated item before it is reverted to its original value.

Reading the uncommitted value is called a dirty read.

Example:

Transaction 1  | Transaction 2
read_item(X)   |
X := X - N     |
write_item(X)  |
               |  read_item(X) <- Dirty Read
               |  X := X + M
               |  write_item(X)
throw_error()  |
undo_change()  |

Incorrect Summary Problem

One transaction runs an aggregate function while other transactions update some of the records, causing the aggregate function to read values both before and after they are updated.

ACID Properties

Atomic: a transaction is either performed entirely or not at all.

Consistent: the transaction will move the database from one consistent state to another.

Isolated: the transaction’s changes should not be visible to other transactions until it is committed - this can solve the temporary update problem.

Durable: once a transaction is committed, system failures will not cause the changes to be lost.

The concurrency system is the primary means by which the isolation property is preserved, and supports the atomicity and consistency properties.

The recovery system is the primary means by which the atomicity and durability properties are preserved, and supports the consistency and isolation properties.

Concurrency Control and Recovery

Not covered due to time limitations.

Test Notes

Intro

ER Diagrams

Attributes:

Relationships:

Weak entities:

EER:

Relational Data Model

Domain constraint: value must be in their domain.

Key constraint:

Referential integrity constraint: foreign key matches existing primary key of relation, or is null.

Semantic integrity constraint: domain-specific constraints.

EER To Relation

  1. Simple entity: new relation, pick any key as PK
  2. Weak entity: new relation, PK is owner PK + partial key
  3. 1:N relationship: FK on N side
  4. 1:1 relationship: FK on side with total participation
  5. M:N relationship: new relation with both FKs as PK
  6. Multivalued attributes: new relation, PK is owner PK + attribute value
  7. N-ary relationships: new relation with FKs as PKs
  8. Specialization/generalization
    • Any specialization: superclass has its own relation; subclasses have their own relation with their attributes; PK is FK for the superclass
    • Total specialization: subclasses have their own relation, duplicating superclass attributes
    • Disjoint specialization: single relation with properties from every subclass, plus type attribute
    • Overlapping specialization: single relation with properties from every subclass, plus Boolean for each subclass indicating if the entity is a member of the subclass
  9. Category: PK is artificial, auto-generated ID. All superclasses of the category have this as a FK

Normalization

Guidelines:

Functional Dependencies

Armstrong’s Inference Rules

  1. Reflexivity: $Y \subseteq X \implies X \rightarrow Y$
  2. Augmentation: $X \rightarrow Y \implies XZ \rightarrow YZ$
  3. Transitivity: $X \rightarrow Y \land Y \rightarrow Z \implies X \rightarrow Z$

From these, the following can be generated:

  1. Decomposition: $X \rightarrow YZ \implies X \rightarrow Y \land X \rightarrow Z$
  2. Union: $X \rightarrow Y \land X \rightarrow Z \implies X \rightarrow YZ$
  3. Pseudo-transitivity: $X \rightarrow Y \land WY \rightarrow Z \implies WX \rightarrow Z$

Equivalence of Sets

$F$ covers $G$ if $G^+ \subseteq F^+$. Equivalent if $F$ covers $G$ and $G$ covers $F$.

To test if $F$ covers $G$, for each FD $X \rightarrow Y$ in $G$, check if $Y \subseteq \{X\}_F^+$.

Minimal Sets

Minimal if the RHS of all FDs is a single attribute and removing any FD, or any attribute from the LHS of any FD, leads to a set that is not equivalent.

Normal Forms

If attributes can be removed from the LHS and the FD still holds, then it is a partial FD. Otherwise, it is called a full FD.

If $X \rightarrow Y$ and $Y \rightarrow Z$, $X \rightarrow Z$ is a transitive functional dependency.

Minimal Cover Algorithm

# Uses closure(X, F) from the 'Determining the closure' section.
def minimal_cover(F):
    # F: list of (lhs, rhs) attribute sets. Step 1: single-attribute RHSs (decomposition).
    G = [(frozenset(lhs), frozenset({a})) for lhs, rhs in F for a in rhs]

    # Step 2: remove extraneous attributes from each LHS.
    changed = True
    while changed:
        changed = False
        for i, (lhs, rhs) in enumerate(G):
            for el in lhs:
                # el is redundant if the reduced LHS still determines the RHS
                if len(lhs) > 1 and rhs <= closure(lhs - {el}, G):
                    G[i] = (lhs - {el}, rhs)
                    changed = True
                    break

    # Step 3: drop FDs that are inferable from the remaining ones.
    for fd in list(G):
        rest = [g for g in G if g != fd]
        if fd[1] <= closure(fd[0], rest):
            G.remove(fd)
    return G

Physical Layer

Hashing

RAID 0: striping data; 1: mirroring.

Indexing

B-Trees

Structure with pointers to subtrees and (key value + pointer) pairs that divide the subtrees (all left-subtree values are less than the key value etc.): $k$ child nodes, $k - 1$ key-pointer pairs. Nodes must be at least half full.

B+ tree: internal nodes store only key values. Leaf nodes have key value + pointer pairs, plus pointer to next leaf node in the tree.

Transactions

ACID:

Misc.

Materialized evaluation: the result of an operation is stored as a temporary relation (e.g. comparing values against the max value from a nested query). c.f. pipelined evaluation.