0. Introduction
Definitions
- Entity: concept/object about which information is stored
- Attribute: property of an entity
- Mini-world: part of the real world about which data is stored
- A database represents only relevant aspects of the mini-world
Database Management System (DBMS)
A software system facilitating the creation/maintenance of a database:
- Database software: DBMS + applications
- Database system: database + software
Functionality:
- Database definition
- Loading the database into memory
- Manipulation: queries/inserts/deletions/updates
- Concurrent processing by users/programs
- Security measures to prevent unauthorized access
Compared to files:
- Disadvantages: complexity, size, cost, performance
- Advantages: reduced development time, flexibility, economies of scale
The schema, also known as the database intension, rarely changes, but the state of the database, called the extension, changes frequently. The actual data stored in the database at a particular moment in time is called the instance.
Three-Schema Architecture
- Internal schema: deals with the physical level (e.g. path to database), but not the actual physical devices
- Conceptual schema: describes the structure of the database; entities, attributes, relationships, etc.
- External schema: view visible to end users. It may be a subset of the conceptual schema, allowing the user to access only what is relevant to them
Three-Level Data Model
- Conceptual/high-level/semantic: concepts close to the way users perceive the data
- Physical/low-level/internal: concepts describing details of how data is actually stored
- Implementation/representational: half-way point
DBMS Languages
- Data Definition/Description Language (DDL): specifies the schema of the database (e.g. CREATE, ALTER)
- View Definition Language (VDL): language used to create user views - a mapping from the conceptual to the external schema
- Storage Definition Language (SDL): language used to specify the internal schema - how the data is actually stored
- Data Manipulation Language (DML): language used to manipulate the database (e.g. SELECT, INSERT). DMLs can be high-level, non-procedural languages like SQL, or low-level, procedural languages where records are updated one at a time (a short sketch follows this list)
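As a minimal sketch of the DDL/DML split (the movie table here is hypothetical, not from the notes):

```sql
-- DDL: define the schema
CREATE TABLE movie (
    mnumber NUMBER PRIMARY KEY,
    title   VARCHAR2(100) NOT NULL
);

-- DML: manipulate the data
INSERT INTO movie VALUES (1, 'Gone with the Wind');
SELECT title FROM movie WHERE mnumber = 1;
```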
1. Data Modelling
Phases of Database Design
- Conceptual design: ER/EER diagrams
- Logical design: conversion into DBMS-specific form (SQL schema)
- Physical design: hardware the data is stored on
Entity-Relationship
Attributes:
- Composite attributes: Name(Sub, Property) - decompose into atomic attributes
- Multivalued: { attributeName } - double ellipse
- Derived: dotted ellipse
- Primary key:
  - Can be composite
  - Must be single-valued
  - Underlined
Relationships:
- Degree: number of participating entity types
- Recursive relationships are unary
  - Require labels for the roles
- Can have attributes, but not key attributes
- Relationship type: description of schema, constraints etc.
- Relationship set: current set of relationships
Weak Entities:
- Exactly one partial key
- May be composite
- Unique for a given owner
- Can have multiple owners in its identifying relationship
Cardinality Ratio:
- ‘{A} can be in {B cardinality} relationships with {B}’
Structural Constraints:
- (min, max): max can be n (or m)
- Read the opposite way to cardinality constraints
- Use for n-ary relationships
Enhanced Entity-Relation
- Subclass ----⊂---- Superclass: optional subclass
- (Subclass1 ----⊂, Subclass2 ----⊂) ----d---- Superclass: partial disjoint specialization
- (Subclass1 ----⊂, Subclass2 ----⊂) ----d==== Superclass: total disjoint specialization
- (Subclass1 ----⊂, Subclass2 ----⊂) ----o---- Superclass: partial overlapping specialization
Two subclasses can merge if they have the same unique identifier - they share a root superclass. In this case, the structure that is formed is called a lattice.
Category ====⊂==== u ---- (Class1 -----, Class2 =====): class 1 is partial, class 2 is total
Union: each instance of the category is a member of one of the superclasses. The category will have an artificial surrogate key.
2. Relational Data Model
Formal Definitions
| Formal | Informal |
|---|---|
| Relation | Table |
| Tuple | Row |
| Attribute | Column |
| Degree | Num. attributes |
| Cardinality | Num. rows |
- Relation: set of tuples
- Schema of relation: $R(A_1, A_2, \ldots, A_n)$
- Tuple: ordered set of values
- Domain: set of valid values for an attribute
Relation: subset of the Cartesian product of the domains, i.e. any valid combination of values for all the attributes of the table.
The ordering of tuples is irrelevant, but the ordering of values within a tuple is relevant.
Constraints
Conditions that hold for all valid relation instances.
Domain constraint
Specifies that all values in a relation must be inside their respective domains.
Key Constraint
- Superkey of R: a set of attributes SK such that no two tuples will have the same value for SK
- Key: a minimal superkey - removing any attribute stops it from being a superkey
- A primary key is arbitrarily chosen from the set of candidate keys. The rest are called secondary keys
- Attributes of the candidate keys are called prime attributes
Entity Integrity Constraint
Primary key values must not be null.
Referential Integrity Constraint
Referencing relation has foreign key attribute referencing the primary key of the referenced relation. The value can be null, and the primary key can itself be a foreign key.
Semantic Integrity Constraint
Constraints not expressed by the model and relating to the application semantics.
EER to Relation Mapping Algorithm
- Simple tables: make a new table, pick any key as the primary key
- Weak entity: make a new table, with the primary key being the combination of the owner's primary key and the partial key. If the partial key is composite, use the components
- 1:N: Put the foreign key in the ‘N’ side
- 1:1: Put the foreign key on the side with total participation
- M:N: make a new table with two foreign keys; the combination of the two forms the primary key (see the SQL sketch after this list)
- Multivalued attributes: make a new table with a foreign key to the owner. The combination of the value of the attribute and foreign key is the primary key
- N-ary relationships: make a new table, with the combination of the foreign keys being the primary key.
- Specialization/generalization:
- Superclass has its own table with general attributes; subclasses have their own table with specific attributes; the primary key is a foreign key to superclass
- Total specialization: tables just for the subclasses (duplicating superclass attributes)
- Disjoint specialization: single table with attributes for every subclass, plus attribute denoting which subclass it is (called the type)
- Overlapping specializations: same as above, but Boolean for every subclass indicating which subclasses it is a member of
- Categories: primary key is an artificial, automatically generated ID; all superclasses of the category have this as a foreign key
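A minimal SQL sketch of the 1:N and M:N rules above; the tables and columns are hypothetical, chosen to match the movie/director examples used later in these notes:

```sql
-- 1:N (one director, many movies): foreign key goes on the 'N' side
CREATE TABLE director (
    id    NUMBER PRIMARY KEY,
    lname VARCHAR2(50)
);
CREATE TABLE movie (
    mnumber  NUMBER PRIMARY KEY,
    title    VARCHAR2(100),
    director NUMBER REFERENCES director (id)
);

-- M:N (stars appear in movies): new table; the two foreign keys form the primary key
CREATE TABLE stars (
    snumber NUMBER,  -- would also REFERENCES star (snumber); star table omitted for brevity
    movie   NUMBER REFERENCES movie (mnumber),
    PRIMARY KEY (snumber, movie)
);
```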
Relational Algebra
Variables
Select
- $\sigma_{condition}(R)$
- Horizontal slices of a table
- Condition can be composite; use logical operators
Project
- $\pi_{attribute\ list}(R)$
- Vertical slices of a table
- Returns a set, so if two rows are identical the duplicate will not be added
Rename
- $\rho_{S(new\ attribute\ names)}(R)$
- Renames relation properties
Set
Union-compatible relations: require the same degree and pairs of corresponding attributes to have the same domain.
- Union: $R \cup S$
- Intersection: $R \cap S$
- Difference: $R - S$
- Cartesian Product: $R \times S$
Union and intersection are:
- Commutative: order of arguments doesn’t matter
- Associative: order of operations doesn’t matter
Joins
Equijoin:
- $R \Join_{A = B} S$
- The join condition uses equality; both $A$ and $B$ are present in the result
Natural join:
- $R * S$
- Special case of equijoin where primary and foreign keys have the same name
- The primary and foreign key columns are merged
The same types of joins are available as left, right and full outer joins (the symbols have two horizontal lines sticking out of the top and bottom of the left/right end of the join symbol).
Division
'For every' queries.
If $R(Z)$ and $S(X)$ with $X \subseteq Z$, let $Y = Z - X$.
For a tuple to appear in $R \div S$, it must appear in $R$ combined with every tuple in $S$:
- For an id to appear in the result, the id must appear along with every tuple in $S$
- The resultant table will only have the primary key (the columns in $R$ minus the columns in $S$)
Equivalent to:
$$T_1 \leftarrow \pi_Y(R), \qquad T_2 \leftarrow \pi_Y((S \times T_1) - R), \qquad T \leftarrow T_1 - T_2$$
Given this:
- $T_1 = \pi_Y(R)$ is a superset of the desired result; some tuples need to be removed
- $S \times T_1$ gives all possible tuples that could be contained in a relation with schema $Z$
- $(S \times T_1) - R$ gives all possible tuples not in $R$
- This can be subtracted from $T_1$ after the project operation
3. SQL Queries
Oracle SQL Data Definition
An SQL relation (table) is a multi-set, NOT a set, of tuples. Using PRIMARY KEY or UNIQUE, or the DISTINCT option in a query, constrains the result to a set. DISTINCT requires a search to ensure duplicate tuples are not returned, so it is slower than using ALL (the default).
Naming:
- Databases up to 8 characters
- Other names up to 30 characters
- Table names cannot begin with sys_
- Case insensitive
Expression
An attribute (i.e. column name) or combination of attributes, operators and functions.
Binary Operators
- Comparisons: <, <=, =, !=/<>, >= and >
- Logical: NOT, AND and OR, in that order of priority. Use parentheses if needed
- Arithmetic: +, -, *, and / can be used for numeric values
- String concatenation: ||
Functions
Scalar:
- Type conversion: TO_CHAR, TO_DATE, TO_NUMBER etc.
- Numeric: ABS, CEIL, FLOOR, EXP, LOG, MOD, POWER, ROUND, SQRT etc.
- String: SUBSTRING, UPPER, LOWER, TRANSLATE, CONVERT, LENGTH, LPAD etc.
- Date/time: SYSDATE, ADD_MONTHS, MONTHS_BETWEEN, NEXT_DAY etc.
Aggregate/Set: SUM, COUNT, MAX, MIN, AVG etc.
Predicates
LIKE, BETWEEN, IN, ANY, ALL, EXISTS, IS NULL
Misc.
SQL*Plus:
- set autocommit on
- describe schema.tablename
- set linesize n
- @sql-script-path
Create Schema
Schemas are used to group tables together. Permissions are set on a per-schema level.
CREATE SCHEMA schema-name
AUTHORIZATION owner
Create Table
CREATE TABLE [schema-name.]table-name
(
column-name column-specification[,
column-name column-specification]*
[constraint-name constraint-specification]*
)
Use CREATE TABLE table-name AS (query) to populate the table with data.
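For example, a sketch assuming a movie table with a type column already exists:

```sql
-- New table populated with the result of the query
CREATE TABLE comedy_movie AS
    SELECT * FROM movie WHERE type = 'comedy';
```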
Notes
Constraints can be inline or defined after the columns. If the latter, multiple columns can be referenced.
Inline: column-definition constraints, where constraints can be:
- NOT NULL
- DEFAULT default-value
- UNIQUE
- REFERENCES table-name (the foreign key references the primary key of that table)
- CONSTRAINT constraint-name constraint-definition, where constraint-definition can be:
  - PRIMARY KEY
  - REFERENCES referenced-table[.foreign-key-column]
  - CHECK (condition)
    - e.g. column-name > 10, column-1 != column-2
    - Value in array: column-name IN ('val1', 'val2')
    - If inline, the condition can only involve the name of the current column
To define a primary key after column definitions: PRIMARY KEY (column-name). A composite primary key can be made by passing multiple arguments.
To define foreign keys after column definitions:
FOREIGN KEY (column-name)
REFERENCES foreign-table (foreign-table-key)
[ON DELETE integrity-option]
[ON UPDATE integrity-option]
To add constraints after defining the table:
ALTER TABLE table-name ADD CONSTRAINT constraint-name constraint-definition
Where integrity-option is one of RESTRICT, CASCADE, SET NULL or SET DEFAULT.
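A sketch pulling these options together, using hypothetical account/customer tables (Oracle only supports the ON DELETE form of the integrity options):

```sql
CREATE TABLE account (
    acc_no  NUMBER PRIMARY KEY,
    owner   NUMBER NOT NULL,
    balance NUMBER DEFAULT 0 CHECK (balance >= 0),
    FOREIGN KEY (owner)
        REFERENCES customer (cust_no)
        ON DELETE CASCADE
);

ALTER TABLE account ADD CONSTRAINT positive_owner CHECK (owner > 0);
```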
Domain: basically a typedef for SQL. Not available in Oracle:
CREATE DOMAIN domain-name type-definition [CHECK check-statement]
/* e.g. `check-statement` could be `(VALUE IS NOT NULL)` */
Drop Table
Drops the relation (base table) and definition: DROP TABLE table-name.
Alter Table
Add, delete or modify attributes/constraints using ALTER commands.
Add/modify: ALTER TABLE table-name ADD|MODIFY column-name column-definition. The NOT NULL constraint is only allowed if the column also has a DEFAULT default-value constraint. Otherwise, the column must be added without the NOT NULL constraint, all existing rows updated, and then the column definition altered.
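For instance, a sketch of the add-then-backfill approach (the table and column are hypothetical):

```sql
ALTER TABLE movie ADD (rating NUMBER);        -- add without NOT NULL
UPDATE movie SET rating = 0;                  -- backfill existing rows
ALTER TABLE movie MODIFY (rating NOT NULL);   -- constraint can now be added
```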
Delete column and all constraints: ALTER TABLE table-name DROP [COLUMN] column-name RESTRICT|CASCADE.
Delete constraint: ALTER TABLE table-name DROP [CONSTRAINT] constraint-name RESTRICT|CASCADE. If the constraint is a primary key or unique, CASCADE drops all referencing foreign keys; RESTRICT stops the operation if there are any referencing foreign keys.
Drop Schema
DROP SCHEMA schema-name RESTRICT|CASCADE. RESTRICT only allows the operation to run if the schema is empty; CASCADE drops all child objects as well.
Oracle SQL Queries
SELECT
SELECT [ALL|DISTINCT] attribute-list
FROM table-list
[WHERE condition]
[GROUP BY grouping-attributes]
[HAVING group-condition]
[ORDER BY attribute-list]
[OFFSET n ROWS]
[FETCH NEXT m ROWS ONLY] /* Oracle specific */
Six clauses; the last four are optional:
- SELECT is like the project ($\pi$) operator
  - ALL is the default; DISTINCT removes duplicate tuples
  - The latter requires comparisons for each tuple so may be slower
- attribute-list is a comma-separated list where each item is:
  - * for all columns in the table
  - A column name
  - The result of some operation (e.g. column_name * 10)
- WHERE is like the select ($\sigma$) operator
- ORDER BY does sorting
Table Alias
Oracle uses the syntax, tableName alias.
Other SQL implementations use tableName AS alias.
This may be useful when referring to two entries in a single table (e.g. find rows where the last names are the same).
The result of a nested select statement can also be aliased: (SELECT clause that is probably using JOINs) alias.
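Two small sketches of alias use, assuming the hypothetical movie/director tables from the other examples:

```sql
-- Self-join with table aliases: pairs of different directors sharing a last name
SELECT D1.id, D2.id
FROM director D1, director D2
WHERE D1.lname = D2.lname AND D1.id < D2.id;

-- Aliasing a nested SELECT
SELECT md.title
FROM (SELECT * FROM movie JOIN director ON movie.director = director.id) md
WHERE md.lname = 'Nolan';
```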
Attribute list
The attribute list can contain column names, aggregate functions and constants:
- Except for COUNT(*), aggregate functions ignore NULL values
- Can also use DISTINCT within aggregate functions: COUNT(DISTINCT origin) will return the number of distinct origin values
- Can also be a nested select statement returning a single row and column
Some WHERE predicates
- Range (inclusive): column-name BETWEEN min AND max
  - Can be used for numbers, dates and strings
- Membership: column-name IN (val_1, ..., val_n)
- Pattern matching with LIKE: column-name LIKE pattern
  - Enclose pattern in single quotes
  - % acts like .* in a regular expression
  - _ acts like . in a regular expression
  - Can set a custom escape character: 'Pattern\_Using\_Underscore' ESCAPE '\'
- Pattern matching with regular expressions: REGEXP_LIKE(column-name, pattern, [param])
  - param: i (case-insensitive), c (case-sensitive), x (ignore whitespace), m (multiline)
- NULL: IS [NOT] NULL; NULL != NULL, so equality comparisons cannot be used
GROUP BY
For applying aggregate functions to groups of tuples; each group is a set of tuples where they have the same value for the grouping attribute(s) (comma-separated list). This is used for summary information, and is often phrased as for each entity ….
e.g. SELECT origin, COUNT(origin) FROM routes GROUP BY origin returns number of times each origin value appears.
e.g. SELECT COUNT(DISTINCT type), director FROM movie GROUP BY director returns the number of genres of movies each director has directed.
When using GROUP BY, all non-aggregate attributes in the SELECT list need to be referenced in the GROUP BY clause. This is because grouping reduces each group to a single tuple, which may not be possible if they are not all referenced. For example, if (a, b) = (0, 0), (0, 1) is grouped by a, it must either pick one of the b values or pick both, neither of which is allowed.
This can happen even if the primary key is used as the grouping attribute.
HAVING
Retrieves the values of these functions only for groups that satisfy certain conditions - i.e. it filters out groups.
e.g. SELECT origin, COUNT(origin) FROM routes GROUP BY origin HAVING COUNT(origin) > 10.
COUNT(*) gets the count for the group.
ORDER BY
Sorts tuples based on the values of some attributes:
ORDER BY [column-name ASC|DESC] [, column-name ASC|DESC]*
Default is ASC. Sorts by the first column name first, and when the values are equal, it sorts by the second column etc. The sort is not stable.
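For example (hypothetical data):

```sql
-- Most-awarded movies first; ties broken alphabetically by title
SELECT title, awards
FROM movie
ORDER BY awards DESC, title ASC;
```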
Nested Queries
A nested query/sub-query can be specified within the WHERE and HAVING clauses of an outer query. The sub-query must be enclosed within parentheses.
- ORDER BY is not allowed
- The sub-query SELECT list must consist of a single attribute/expression unless the outer query uses EXISTS
  - If using the IN predicate, it should return a table with one column and n rows
  - If using an equality predicate, it should return a table with one column and one row
- Attributes from the outer query's table can be referenced in the sub-query; this is called correlation, and means that the result of the sub-query is different for each row of the outer query
  - A table alias may need to be used
- A sub-query can appear on both sides of a comparison
Examples:
- SELECT * FROM customer WHERE num_orders > (SELECT AVG(num_orders) FROM customer)
- SELECT * FROM movie WHERE director IN (SELECT id FROM director WHERE lname='Nolan')
- SELECT * FROM movie M1 WHERE M1.awards = (SELECT MAX(awards) FROM movie M2 WHERE M1.director = M2.director) returns the movies that have won the most awards for a given director
EXISTS
Used to check if result of correlated sub-query is empty.
e.g. SELECT * FROM director D1 WHERE EXISTS (SELECT * FROM movie WHERE D1.id = director AND type='comedy') finds directors that have directed at least one comedy.
Multi-table queries
Queries that use nested queries with the = or IN conditions can always be expressed as a single-block query. This does a JOIN in the background, and so is much more efficient than correlated nested queries. The FROM clause may contain more than one table.
e.g. SELECT title FROM movie, director WHERE dnumber=director AND lname='Nolan'.
In general, if you have n tables, will need n-1 join conditions.
If there is no WHERE clause, all tuples of the relations are selected and the result is equivalent to the Cartesian product, most of which will be nonsense.
Multi-table query for the same table: SELECT DISTINCT M1.director FROM movie M1, movie M2 WHERE M1.director = M2.director AND M1.type='comedy' and M2.type='drama'. A similar result could be done using an INTERSECT operation.
Specifying Joins in SQL2
... FROM table1 JOIN_TYPE table2 ON col1=col2
Where the join type is one of:
- JOIN
- NATURAL JOIN
- LEFT OUTER JOIN
- RIGHT OUTER JOIN
- FULL OUTER JOIN
- CROSS JOIN (Cartesian product)
The result of a join statement can also be used as a table e.g.:
SELECT title FROM
(SELECT * FROM movie JOIN director ON movie.director=director.id) movieDirector
WHERE movieDirector.lname='Nolan' /* single quotes only */
To select all from both tables, use SELECT movie.*, director.*.
Set operations
UNION, EXCEPT (MINUS in Oracle) and INTERSECT. The results are sets - no duplicates, unless the keyword ALL is appended.
The two operands (select statements, optionally in brackets) must return tables with the same number of columns and compatible data types.
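A small sketch, assuming director and star tables that both have an lname column:

```sql
-- Last names that appear as a director or as a star (duplicates removed)
SELECT lname FROM director
UNION
SELECT lname FROM star;
```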
ANY/ALL
Used with comparison operators and nested queries.
= ANY is equivalent to IN. ANY is sometimes called SOME.
<> ALL is equivalent to NOT IN.
The following will return movies that got more awards than all movies by a particular director:
SELECT * FROM movie WHERE awards > ALL (
SELECT awards FROM movie, director WHERE movie.director = director.id AND lname = 'Nolan'
)
Division
Directors that have directed at least one movie of each type:
SELECT fname, lname FROM director WHERE NOT EXISTS (
SELECT * FROM movie M1 WHERE NOT EXISTS (
SELECT * FROM movie M2 WHERE M2.director = director.id AND M2.type = M1.type
)
)
The M1.type = M2.type comparison is able to cover every single type of movie, as M1 is not constrained to any particular director.
The following also gives the same result:
SELECT fname, lname FROM DIRECTOR
JOIN movie ON director.id = movie.director
GROUP BY director, fname, lname
HAVING COUNT(DISTINCT type) = (
SELECT COUNT(DISTINCT type) FROM movie
)
The nested query is not correlated.
DUAL table
In Oracle, it is a dummy table with a single column and row which is used to get scalar values e.g. SELECT sysdate FROM DUAL.
INSERT
Adds one or more tuples to a relation. The following requires all attributes to be listed in the same order they were specified when the table was created:
INSERT INTO table_name VALUES (attr_1, ..., attr_n)
To specify the attributes to insert:
INSERT INTO table_name (attr_name_1, ..., attr_name_n) VALUES (attr_1, ..., attr_n)
Both forms allow you to insert multiple tuples, separating them with commas.
A select statement can be used to insert multiple tuples:
INSERT INTO table_name (attr_name_list) select_statement
NB: two single quotes escape a quote in a string literal.
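A sketch of both forms, including an escaped quote (the columns are hypothetical):

```sql
INSERT INTO director VALUES (7, 'Christopher', 'Nolan');
INSERT INTO director (id, fname, lname) VALUES (8, 'Peter', 'O''Brien');
```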
UPDATE
UPDATE table_name SET attr_name_1 = expression_1, ... WHERE where_condition
The where condition is optional.
DELETE
DELETE FROM table_name WHERE where_condition
You can only delete data from one table at a time (except if CASCADE is used on a foreign key).
If the where condition is missing, every row gets deleted.
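For example (hypothetical data):

```sql
UPDATE movie SET awards = awards + 1 WHERE title = 'Gone with the Wind';
DELETE FROM movie WHERE awards = 0;
```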
Views
Views are virtual tables - essentially a stored query that allows customized access to data (e.g. remove access to some sensitive attributes of a table).
In general, for a view to be updatable (varies by implementation):
- Must be defined over a single table
- Must contain a primary or candidate key of the base relation
- Cannot use grouping or aggregate functions
CREATE VIEW view_name AS select_statement
Once a view is created, it can be queried like a normal table.
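A sketch, reusing the hypothetical movie/director tables:

```sql
CREATE VIEW nolan_movies AS
    SELECT title, awards
    FROM movie
    WHERE director IN (SELECT id FROM director WHERE lname = 'Nolan');

SELECT * FROM nolan_movies WHERE awards > 5;
```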
To drop a view:
DROP VIEW view_name
In Oracle, to find out which views are updatable:
SELECT column_name, updatable FROM user_updatable_columns WHERE table_name=upper('view_name')
Indexes
Indexes are a physical-level feature; access structures to speed up queries.
CREATE INDEX index_name ON table_name (attr_name_1, ...)
In older versions of Oracle, indexes would not be created automatically for keys. In this case, an index would need to be created with the UNIQUE keyword.
DROP INDEX index_name is used to delete indexes.
Notes:
- Indexes should be created before tables are populated
- Indexes should not be created for attributes that are frequently updated
- Current versions of Oracle create indexes for primary and secondary keys
- Indexes are often used in search and join conditions
Integrity Component
SQL 92 adds:
- Required data (NOT NULL)
- Domain constraints
- Entity integrity (PRIMARY KEY, UNIQUE)
- Referential integrity (FOREIGN KEY, REFERENCES)
- Referential integrity constraint violation behavior (ON UPDATE, ON DELETE)
- Enterprise constraints
Domain Constraints
Not in Oracle:
CREATE DOMAIN domain_name AS data_type DEFAULT default_option CHECK (search_option)
Both DEFAULT and CHECK are optional, and search_option should reference the value being checked using VALUE.
Use DROP DOMAIN domain_name, optionally appending RESTRICT or CASCADE.
Referential Integrity Violation
The qualifiers are ON DELETE and ON UPDATE.
The options are SET NULL, CASCADE and SET DEFAULT.
Oracle only supports ON DELETE CASCADE and ON DELETE SET NULL.
Enterprise Constraints
Constraints
CONSTRAINT constraint_name CHECK (condition)
This is for a specific table, so the condition can reference multiple attributes. In SQL2, the condition may include a nested SELECT statement.
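For example, a table-level constraint referencing two columns (the nominations column is hypothetical):

```sql
ALTER TABLE movie ADD CONSTRAINT awards_le_nominations
    CHECK (awards <= nominations);
```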
Assertions
Constraints spanning multiple tables can be made: these are called assertions.
CREATE ASSERTION assertion_name CHECK (search_condition)
The search condition will be a SQL query, probably using the EXISTS keyword.
This is not supported in Oracle.
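A sketch of what an SQL2 assertion could look like (not runnable in Oracle; the 50-movie limit is made up for illustration):

```sql
CREATE ASSERTION max_movies_per_director CHECK (
    NOT EXISTS (
        SELECT director
        FROM movie
        GROUP BY director
        HAVING COUNT(*) > 50
    )
);
```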
Triggers
A trigger monitors the state of the database and performs some action when a specific condition occurs.
Triggers can:
- Calculate/update values of derived attributes
- Enforce additional constraints
- Prevent invalid transactions
- Automatically perform actions
- Audit changes
- Maintain replicated tables
A trigger has an event - a change that activates a trigger. This can be:
- A DML event such as UPDATE, INSERT or DELETE
- A DDL event such as CREATE
- A database event such as SERVERERROR ON DATABASE
For DML events, there are row-level and statement-level triggers; the former occurs for each modified row while the latter runs once for each statement.
Once an appropriate event occurs, it must pass a condition; a query which determines if the trigger should be activated.
If the condition is met, the action will occur. This can be one or more statements (procedures).
The action can occur BEFORE, AFTER, or INSTEAD OF.
CREATE [OR REPLACE] TRIGGER [schema.]trigger_name
BEFORE|INSTEAD OF|AFTER
DELETE|INSERT|UPDATE OF attr_name_1, attr_name_2... OR DELETE|INSERT|UPDATE OF ...
ON [schema.]table_name|view_name
[REFERENCING OLD AS old_row_alias NEW AS new_row_alias]
FOR EACH ROW|STATEMENT
WHEN (condition)
body
/
Within the SQL statements, if REFERENCING is not specified, :new.attribute_name and :old.attribute_name are used to reference attributes values for the row in question.
The / is used to end the trigger action. It must be on a new line.
Limitations:
- The name of the trigger must be unique
- The body cannot contain DDL or transaction statements
- The trigger cannot read from or modify the mutating table, with some exceptions
- INSTEAD OF is only available for views
- WHEN is only available for row-level triggers. The condition cannot be a query
To alter the trigger:
ALTER TRIGGER [schema.]trigger_name ENABLE|DISABLE|COMPILE [DEBUG]
SHOW ERRORS shows any compilation errors.
PL-SQL Example Code
DECLARE
    varname type := initial_value;
    varname2 type;
BEGIN
    SELECT attribute INTO varname FROM table_name WHERE ...;
    RAISE_APPLICATION_ERROR(num => error_code, msg => 'Message');
    -- Error code between -20999 and -20000
    ...;
EXCEPTION
    handlers;
END;
Cannot use EXISTS: must store COUNT(*) into variable.
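A minimal sketch of the COUNT(*)-into-a-variable workaround (the table and condition are hypothetical):

```sql
DECLARE
    n NUMBER;
BEGIN
    SELECT COUNT(*) INTO n FROM movie WHERE director = 7;
    IF n = 0 THEN
        RAISE_APPLICATION_ERROR(-20001, 'No movies found for this director');
    END IF;
END;
/
```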
Example: Audit Trigger
CREATE [OR REPLACE] TRIGGER audit_trigger BEFORE INSERT OR DELETE OR UPDATE ON table_name
FOR EACH ROW
begin
if INSERTING then
INSERT INTO audit_table VALUES (USER || 'inserting' || 'new id:' || :new.id);
elsif DELETING then
INSERT INTO audit_table VALUES (USER || 'deleting' || 'old id:' || :old.id);
elsif UPDATING('attribute_name') then
INSERT INTO audit_table VALUES
(USER || 'updating' || 'old value:' || :old.attribute_name || 'new value:' || :new.attribute_name);
elsif UPDATING then
INSERT INTO audit_table VALUES
(USER || 'updating' || 'old id:' || :old.id || 'new id:' || :new.id);
end if;
end;
/
Procedures
Collection of statements; basically functions:
CREATE [OR REPLACE] PROCEDURE [schema.]procedure_name
(argument_name [IN|IN OUT|OUT] datatype, ...)
IS|AS
body
The body cannot contain DDL statements.
Compiling a procedure: ALTER PROCEDURE [schema.]procedure_name COMPILE.
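A small sketch of a procedure (the table and logic are hypothetical):

```sql
CREATE OR REPLACE PROCEDURE add_award (p_movie IN NUMBER)
IS
BEGIN
    UPDATE movie SET awards = awards + 1 WHERE mnumber = p_movie;
END;
/
-- Called from SQL*Plus with e.g. EXECUTE add_award(1);
```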
Database Security
Discretionary and mandatory security mechanisms.
Control measures:
- Access control: user accounts
- Inference control: operations a user can make
The DBA is responsible for the overall security of the database. They have the responsibility to grant privileges to users and assign security levels to users and data.
They can set privileges at the account and relation level:
- Account: specifies operations that the account is allowed to run
- Relation: specifies operations the account is allowed to run on individual tables
Each table is assigned an owner account; they have all privileges for the table, and can give privileges to other users.
Privileges can also be specified using views, allowing, for example, users access to only some attributes of a table.
GRANT account_privilege|table_privilege|ALL PRIVILEGES
ON object
TO user|role|PUBLIC
WITH GRANT OPTION;
REVOKE account_privilege|table_privilege|ALL PRIVILEGES
ON object
FROM user|role|PUBLIC;
/* EXAMPLES */
GRANT CREATE TABLE TO user WITH GRANT OPTION;
GRANT INSERT, DELETE ON table1, table2 TO user;
GRANT UPDATE ON table1(attribute1) TO user;
Table privileges: SELECT; INSERT; UPDATE; ALTER; DELETE; INDEX; DEBUG; REFERENCES. INSERT, UPDATE and REFERENCES (allows the user to have an FK to the table) can be constrained to specific attributes.
If GRANT OPTION is set, the user granted the privilege can grant the privileges to other users; propagation. However, if the privilege is revoked, all propagated privileges are automatically revoked.
4. Data Normalization
Looking at an existing database and attempting to improve the design.
Information Design Guidelines
User view (logical):
- Is the database design clear
- Can users understand the whole schema
Storage/base relation level:
- How are tuples stored
- Is too much memory being used
Semantics of Attributes
Guideline 1: each tuple should represent one entity or relationship instance:
- Don’t mix attributes from different entities
- Only use foreign keys to refer to other entities
- Separate entity and relationship attributes
Guideline 2: design relations so that no update anomalies are present:
- If any anomalies are present, document them and ensure programs deal with them correctly
Guideline 3: design relations to have as few NULL values as possible:
- If an attribute is NULL frequently, place it in a separate relation
Guideline 4: relations should be designed to satisfy the lossless join condition:
- No spurious tuples should be generated from doing a natural join (i.e. all tuples are meaningful)
Decomposition
Dividing a relation into two or more relations. To do this, normalization is required - this is essentially a set of tests used to determine the ‘quality’ of a schema.
Functional Dependencies
Assume the universal relation schema is $R = \{A_1, A_2, \ldots, A_n\}$.
A set of attributes $X$ functionally determines a set of attributes $Y$ (written $X \to Y$) if the value of $X$ determines a unique value for $Y$:
- For each value of $X$, there is exactly one value of $Y$
- If two tuples have the same value for $X$, they must have the same value for $Y$
Inference Rules
Given a set of FDs $F$, additional FDs can be inferred from it.
The closure of a set $F$, written $F^+$, is the set of all FDs that can be inferred from $F$.
If an FD can be inferred from $F$, every relation state that satisfies $F$ also satisfies it.
Armstrong's inference rules state that:
- IR1, Reflexive: if $Y \subseteq X$, then $X \to Y$
- IR2, Augmentation: if $X \to Y$, then $XZ \to YZ$ (adding the same set of attributes to both sides)
- IR3, Transitive: if $X \to Y$ and $Y \to Z$, then $X \to Z$
These three rules form a sound and complete set of inference rules (all functional dependencies in $F^+$ can be inferred from these rules, and all inferred dependencies are correct).
Using these three rules, we can deduce three more rules:
- IR4, Decomposition: if $X \to YZ$, then $X \to Y$ and $X \to Z$
  - By reflexivity, $YZ \to Y$
  - By the transitive rule, $X \to YZ$ and $YZ \to Y$ give $X \to Y$
  - Repeat for $Z$ (by reflexivity, $YZ \to Z$)
- IR5, Union: if $X \to Y$ and $X \to Z$, then $X \to YZ$
  - By augmentation to $X \to Y$ (adding $X$), $X \to XY$
  - By augmentation to $X \to Z$ (adding $Y$), $XY \to YZ$
  - By the transitive rule, $X \to XY$ and $XY \to YZ$ give $X \to YZ$
- IR6, Pseudo-transitivity: if $X \to Y$ and $WY \to Z$, then $WX \to Z$
  - By augmentation to $X \to Y$ (adding $W$), $WX \to WY$
  - By the transitive rule, $WX \to WY$ and $WY \to Z$ give $WX \to Z$
Example
Given
Using Armstrong’s theorems:
- By augmentation on
, - By the transitive rule on
and ,
Using the definition of equivalent sets of functional dependencies (see equivalence sets section):
For all FDs in
-
definitely covered since contains - …
-
: - After one iteration:
. All attributes are now in the set -
is in and all others are covered, so covers . Hence, and are equivalent so is redundant
- After one iteration:
Determining the closure
Closure of a set of attributes $X$ under $F$, written $X^+$: the set of all attributes functionally determined by $X$.
X_plus = set(X)
changed = True
while changed:
    changed = False
    for Y, Z in F:  # F is a set of FDs (Y, Z) meaning Y -> Z, with Y and Z sets of attributes
        if Y <= X_plus and not Z <= X_plus:
            X_plus |= Z
            changed = True
Example
Given
Start with single attributes:
Look at the functional dependencies:
-
and is in , so and can both be added to the set -
but is not in , so skip it -
and both and are in , so add -
and is in , so add - We now have all attributes so we do not need to continue
As
For
- Loop through the FDs once:
- An attribute was added so repeat
- No changes, so
only determines and hence, it is not a candidate key
After one iteration,
Now, look at tuples with two attributes. Ignore tuples containing
Thus,
Equivalence of Sets of FDs
Two sets of FDs, $F$ and $G$, are equivalent if $F^+ = G^+$, i.e. $F$ covers $G$ and $G$ covers $F$.
Alternatively, $F$ covers $G$ if every FD in $G$ can be inferred from $F$.
Calculate $X^+$ (with respect to $F$) for each FD $X \to Y$ in $G$; if $Y \subseteq X^+$ for every such FD, then $F$ covers $G$.
Example
Check if
FDs in
-
: . and are both in the set -
: . and are both in the set
Hence,
FDs in
-
: . is in the set -
: . is in the set -
: . and are both in the set -
: . is in the set
Hence,
Hence, the sets are equivalent.
Minimal Sets of FDs
A set of FDs $F$ is minimal if:
- Every dependency in $F$ has a single attribute on its RHS
- Removing any dependency from $F$ will lead to a set that is not equivalent to $F$
- Replacing any dependency $X \to A$ with $Y \to A$, where $Y \subset X$, will lead to a set that is not equivalent to $F$
Every set of FDs has an equivalent minimal set (or sets - there can be several equivalent minimal sets).
Example
Is
First condition violated; RHS has multiple attributes for both dependencies.
Using the decomposition rule (
The third rule is met as the LHS only has one element, so it cannot be reduced further.
The second rule is met; it is obvious that all dependencies are necessary.
Normal Forms Based on Primary Keys
Normalization is the process of decomposing ‘bad’ relations into smaller relations.
Keys
Superkey of relation $R$: a set of attributes $SK$ such that no two tuples have the same value for $SK$.
Key: a minimal superkey.
If a relation has more than one key, each is called a candidate key, with one arbitrarily designated as the primary key and the others as secondary keys.
A prime attribute is a member of at least one candidate key.
First Normal Form
Every attribute is single and atomic; no composite or multivalued attributes or nested relations.
Second Normal Form
Full functional dependency: an FD $X \to Y$ where removing any attribute from $X$ means the dependency no longer holds; if it still holds after removing an attribute, the dependency is partial.
Second normal form adds more conditions to the first normal form: no non-prime attribute is partially functionally dependent on the primary key.
To transform into second normal form, split into smaller tables: if the primary key is composite and a non-prime attribute depends on only part of it, move that part of the key and its dependent attributes into a new relation.
Third Normal Form
If $X \to Y$ and $Y \to Z$, then $Z$ is transitively dependent on $X$ via $Y$.
Relations in third normal form are in 2NF and no non-prime attributes are transitively dependent on the primary key, i.e. there is no chain $PK \to Y \to Z$ where $Y$ is a non-key set of attributes and $Z$ is a non-prime attribute.
Example:
General Normal Form Definitions
These take into account candidate keys, not just the primary key.
Second normal form: no non-prime attribute is partially functionally dependent on any key of the relation.
Third normal form: for all FDs:
- The LHS is a superkey of $R$, OR
- The RHS is a prime attribute of $R$
Only the first condition applies for BCNF.
Ensure at least one table contains the entire primary key, and check that it is a minimal cover before transforming.
Example
Find the key of the table
Tip: look for any attribute that never appears on the RHS of any dependency.
Try
Decompose into 2NF
All non-prime attributes are fully dependent on
-
- Start with
, and as , and need to be added
- Start with
-
- Start with
, and then add and as they are dependent on
- Start with
-
Decompose into 3NF
Boyce-Codd Normal Form
An even stricter form of 3NF: for every non-trivial FD $X \to A$ that holds on the relation, $X$ must be a superkey.
It is not always desirable to transform a relation into BCNF as some FDs may be lost in the process.
In addition, it may not be possible to transform into BCNF.
Example
Decompose
Note that the dependencies that
Example
Determine which normal form the relation,
- 1NF: all attributes single and atomic
- 2NF: all attributes are prime attributes, so there are no partial dependencies
- 3NF: either a superkey on LHS or prime attribute on RHS; all attributes are prime so 3NF
- BCNF: superkey on LHS;
, and are all superkeys
Example
Informal guideline: do not mix attributes from different entity types; the table has information on dentists, patients and appointments.
Insertion/deletion/update anomalies:
- Insertion: can have two tuples with the same dentist number but different name (and same for patients)
- Insertion: patient that has registered but has no appointment. There will be no
or , so cannot insert it into the table (same for dentists) - Deletion: deleting an appointment could remove all information about the patient if it was their only appointment
- Insertion: different rooms for the same dentist on the same day
Functional dependencies:
-
(decompose this into four FDs) -
-
-
1NF: all attributes are simple and atomic.
2NF: non-prime attributes must be fully functionally dependent on the primary key.
- Non-prime attributes:
- The second, third and fourth dependencies violate this, as their left-hand sides are part of, but not the whole, primary key
Decompose the table to meet 2NF:
- From FD2:
; is key -
- From FD3:
; is key -
- From FD4:
; , are the keys -
- From remaining attributes:
; , and are keys -
Each table now represents information about a single entity.
3NF: LHS superkey or RHS prime attribute. BCNF: LHS superkey. For all tables, the LHS is the primary key, so each is in both 3NF and BCNF.
Relational Synthesis
The universal relation, R, contains all the attributes you wish to store. Decompose this into a set of relations $D = \{R_1, R_2, \ldots, R_m\}$ such that the following hold:
- Attribute preservation: every attribute of R appears in at least one of the decomposed relations
- Dependency preservation: each FD either appears in one of the decomposed relations or can be inferred
- Lossless join: when tables are joined, there are no spurious tuples
Algorithm
- Find a minimal cover G for F (minimize number of FDs)
- For each LHS $X$ of an FD in G, create a relation with schema $X \cup A_1 \cup \ldots \cup A_k$, where $X \to A_1, \ldots, X \to A_k$ are all the FDs in G with LHS $X$ (i.e. a table for each LHS and its dependents)
- Place any remaining (unplaced) attributes in a single relation schema to ensure attributes are preserved
Finding a Minimal Cover
For a set of FDs to be minimal:
- There is one attribute on the RHS
- You cannot replace the LHS of any FD with a smaller set, and still have an equivalent set of FDs
- You cannot remove any FDs from a set
To find a minimal cover G for F:
- Initialize G to F
- Replace each FD $X \to \{A_1, \ldots, A_n\}$ in G with the FDs $X \to A_1$, …, $X \to A_n$ (decomposition rule)
- For each FD $X \to A$ in G (minimize each LHS):
  - For each attribute $B$ in $X$:
    - Let $Y = X - \{B\}$ (remove an attribute from the LHS)
    - Compute $X^+$ and $Y^+$. If they are equal, replace $X \to A$ with $Y \to A$ in G, and set $X = Y$
- For each FD $X \to A$ in G (remove redundant FDs):
  - Compute $X^+$ with respect to $G - \{X \to A\}$; if it contains $A$, then the FD is redundant, so remove it from G
Example
Find the minimal cover for
Check
- Try remove
: - Compute
-
- Can’t add any more attributes
-
- Compute
, where -
- They are different, so you cannot remove
- Compute
- Try remove
: - Compute
-
- Compute
, where -
- They are the same, so
can be removed
-
-
-
-
Try remove
: (using the new dependency, not ) -
Compute
-
- There is no functional dependency that can be used
-
-
Compute
, where -
-
They are different, so
cannot be removed
Now check
- Try remove
: - Compute
: -
- Compute
, where : -
- They are different, so
cannot be removed
- Compute
- Try remove
: - Compute
-
- Compute
, where : -
- Compute
- They are different, so
cannot be removed
Now, check if all three FDs are necessary:
-
: -
; is not in the FD so it cannot be removed
-
-
: -
; is not in the FD so it cannot be removed
-
-
: -
; is not in the FD so it cannot be removed
-
Hence, the minimal cover is -
- Create a table
with and as keys - Create a table
with and as keys - Create a table
with as the key
Note that there are two minimal covers for this specific example; the other is
Lossless Join and FD-Preserving Decomposition into 3NF
Adds a new step to ensure all functional dependencies are kept: if none of the relations’ keys contains the key of the universal relation, create another table that contains the key.
5: Physical Layer
Internal Schema
Good database design usually requires multiple tables and hence, many select operations will require access to multiple tables, which lead to performance complexities.
DBMS designers must ensure the database, which will be stored in persistent memory, performs well. They must take into account factors such as:
- Secondary storage
- Buffering, caching
- Indexing strategy
- Query optimization
- Networks
- Concurrency
The database designers can affect performance through:
- Data types
- Degree of normalization
- Query formulation
- Overhead for constraints, triggers etc.
- Indexing strategies
- Media (never store media in a DB)
Memory Hierarchy
Primary memory - cache and main memory:
- Fast access
- Limited capacity
- Volatile
Secondary/tertiary memory - flash, HDDs, optical disks, tape:
- High capacity
- Low cost
- Non-volatile
Disk Storage Devices
- Data stored as magnetized areas
- A disk pack contains several magnetic disks connected to a rotating spindle
- Each surface of the disk is divided into concentric circular tracks, each storing 4-50 kB
- Tracks are divided into blocks/sectors - arcs of the track
- They are typically 512 B to 4 kB in size
- The entire block is read/written to at once
- A read-write head moves to the track containing the block to be transferred
- Reading/writing is time consuming due to seek times, s, and rotational delay, rd
- A physical disk block address consists of:
- A cylinder number: the disk containing the data
- The track number
- The block number
Record Management
- A file is a sequence of records
- Each record is a collection of fields; a row
- Each field contains values for a certain data type
- A file can have fixed- or variable-length records, and records can have fixed- or variable-length fields
- If variable-length, separator characters or length fields are required
Multiple records are stored in a block:
- The blocking factor, bfr is the number of records per block
- File records can be:
- unspanned: records are always in a single block. This is preferred as it means only a single read is required to get a particular record
- spanned: records can be stored in multiple blocks
- If unspanned and the blocking factor is not an integer, there will be empty space on a block
- If a file contains fixed-length records, unspanned blocking is usually used
File Organization
The DBMS:
- Uses the OS/file system APIs to read/write physical blocks (pages) between disks and memory
- May store DB data on multiple disks; if the block size is different and the records are unspanned, this may cause headaches
- May have its own file system
- An in-memory database relies on the main memory for data storage, leading to faster access but also the potential for data loss if the system crashes
The file descriptor/header contains metadata for the file, such as field names, data types and addresses of the file blocks on the disks. The physical disk blocks can be contiguous, linked or indexed.
Typical file operations:
- Open: ready the file
- Find: search for the first record satisfying a condition, and makes it the current record
- Find next: searches for the next record satisfying a condition, and makes it a current record
- Read: read the current record into a variable
- Insert: insert a new record into the file, and make it the current record
- Delete: remove the current file record (usually by marking it as invalid)
- Modify: change the value of fields in the current record
- Close: terminate access to a file
- Reorganize: e.g. remove records marked as invalid
- Read ordered: read file blocks ordered by a certain field in the file
Unordered files:
- Called a heap/pile file
- Efficient insertion: new records inserted at the end
- Linear search required to search for a record
- Reading ordered by a particular field requires sorting
Ordered files:
- Records sorted by an ordering field
- Update is expensive; must be inserted in the correct order
- An overflow/transaction file, which is periodically merged with the main ordered file, may be used to improve performance
- If it is near the start, all records after that may need to be moved
- Binary search can be used to search for the record on its ordering field value
- Reading the records in the order of the ordering field is efficient
Hashed Files (External Hashing)
- File blocks divided into M equal-size buckets (usually corresponding to one or a fixed number of disk blocks)
- The record with hash key $K$ is stored in bucket $i$, where $i = h(K)$ and $h$ is the hashing function (e.g. modulo)
- Search via the hash key is efficient
- When a bucket is full, an overflow bucket is used; the bucket has a pointer to the overflow bucket
Collision Resolution:
- Open addressing: put in the next available position
- Chaining: place in overflow bucket
- Multiple-hashing: if collision occurs, use a secondary hash function etc. Fall back to open addressing
Misc. notes:
- For performance, the hash file is kept 70-80% complete
- The hash function should distribute the records uniformly; otherwise, overflow records will increase search time
- Ordered access on the hash key is inefficient; requires sorting
- Fixed number of buckets; problem if the number of records grows or shrinks
Redundant Arrays of Independent Disks (RAID)
Uses parallelism across multiple disks to increase reads/write performance using data striping.
RAID levels:
- Level 0: striping, but no redundant data; best write performance
- Level 1: mirrored disks; slow as writes have to be performed on both disks
- Level 4: block-level data striping, with parity information stored on one or more disks
- Level 5: level 4, but parity information distributed across all disks
- Level 6: P + Q redundancy with Reed-Solomon codes. Two disks can fail with no data loss
- Level 10 (1 + 0): RAID 0, but each ‘disk’ is a set of RAID 1 mirrored disks
Indexing
Indexes provide an efficient way to access records in different orders, without expensive operations like sorting or duplicating data, by creating a data structure that maps between a key and an address.
Dense indexes have entries for every record, while sparse indexes contain entries for only some of the records (not a one-to-one relationship).
A non-clustered index stores the indexes in a structure separate from the actual data, containing the index key values and a pointer to the row.
A clustered index physically defines the order in which records are stored. Hence, there can only be one clustered index. This course limits clustering indexes to non-key fields, so a file has either a primary or a clustering index.
Under this definition, a clustered index will have an entry for each distinct value of the field and hence, it is a sparse index.
A primary index is indexed by the primary key and contains an entry for each block. Each entry contains a pointer to the block and the value of either the first or last record on the block. Hence, the primary index is a sparse index. To improve performance when inserting records, an overflow block may be used.
Secondary indexes are dense indexes that can be on either candidate keys or non-keys, and contain the value of the field and a pointer to either a block or record.
Example: Storage/Speed Tradeoffs
Case 1: Ordered File
The blocking factor $bfr = \lfloor B / R \rfloor$, where $B$ is the block size and $R$ is the record size.
Hence, the number of blocks required, $b = \lceil r / bfr \rceil$, where $r$ is the number of records.
For linear search:
- The best case is reading a single data block from the disk
- The worst case is reading all $b$ data blocks from the disk
- The average case is $b / 2$ blocks
For binary search, approximately $\log_2 b$ blocks are read (see the worked sketch below).
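A worked sketch with assumed values (the block size, record size and record count below are made up, not the original example's numbers):
$$
\begin{aligned}
B &= 4096\ \text{B}, \quad R = 100\ \text{B}, \quad r = 100{,}000 \\
bfr &= \lfloor 4096 / 100 \rfloor = 40 \text{ records/block}, \quad b = \lceil 100{,}000 / 40 \rceil = 2500 \text{ blocks} \\
\text{linear search (average)} &\approx b / 2 = 1250 \text{ block reads}, \quad \text{binary search} \approx \lceil \log_2 2500 \rceil = 12 \text{ block reads}
\end{aligned}
$$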
Case 2: Primary Index
The primary index will have an entry for each block. Each index entry has a size of $R_i$ (key size plus block pointer size), which is much smaller than the record size.
Hence, the blocking factor for the index, $bfr_i = \lfloor B / R_i \rfloor$, is larger than the file's, and the index needs $b_i = \lceil b / bfr_i \rceil$ blocks.
To find an individual record, binary search is done on the index, so the number of blocks read is approximately $\log_2 b_i + 1$ (the extra read fetches the data block itself).
The primary index must be ordered, so update/insertion/deletion operations are costly.
Case 3: Secondary Index
The secondary index is a dense index with an entry for each record.
Hence, the index requires $b_i = \lceil r / bfr_i \rceil$ blocks.
Searching the index requires approximately $\log_2 b_i + 1$ block reads.
Unlike the primary index, it does not need to be ordered so update/insertion/deletion operations are faster.
Multi-Level Indexes
The index itself is an ordered file, so a primary index to the index can be created; the original index is called the first-level index, the index to the index the second-level index etc.
This process can be repeated until all entries fit in a single block.
Assuming the first-level index is a primary index, each second-level index block can store references to $bfr_i$ first-level index blocks (the fan-out), so each level is a factor of $bfr_i$ smaller than the one below it and a search reads one block per level.
This has consequences for performance: as the indexes are ordered, update/insertion/deletion operations may require indexes at all levels to update. Hence, this is more useful in read-heavy scenarios.
B-Trees and B+-Trees
Due to the insertion/deletion problem, B or B+ tree structures are used.
In a B-tree, a node of order p has the structure $\langle P_1, \langle K_1, Pr_1 \rangle, P_2, \langle K_2, Pr_2 \rangle, \ldots, P_q \rangle$ where:
- $P_i$ is a tree pointer
- $K_i$ is a key value
- $Pr_i$ is a data pointer
A B-tree of order m has:
- At most m children (tree pointers)
- If a non-leaf node:
- Has at least m / 2 children (except the root node)
- If it has k children, it contains k - 1 keys
- All leaf nodes are at the same level
The keys divide the subtrees: all key values in the subtree pointed to by $P_i$ lie between $K_{i-1}$ and $K_i$.
On insertion, if a node becomes full then the node is split into two nodes, with the mid-value being used as the key for the parent. If the parent then becomes full, this may cause a cascade. This behavior ensures the tree is always balanced.
On deletion, if the node has less than m / 2 keys, it can be merged with a sibling.
In a B+-tree, pointers to data records only exist in leaf nodes:
- Internal nodes of the form $\langle P_1, K_1, P_2, K_2, \ldots, K_{q-1}, P_q \rangle$
  - Keys sandwiched between block pointers
- Leaf nodes of the form $\langle \langle K_1, Pr_1 \rangle, \ldots, \langle K_q, Pr_q \rangle, P_{next} \rangle$
  - The last pointer is a block pointer to the next leaf node
Example: Calculating Order
Find the order $p$ of a B+-tree given the block size $B$, the key size $V$ and the tree pointer size $P$.
Each internal node can have up to $p$ tree pointers and $p - 1$ keys, taking up a total of $p \cdot P + (p - 1) \cdot V$, which must fit within $B$; solve for the largest such $p$.
To store one million records requires
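A worked sketch with assumed sizes (the block, key and pointer sizes below are made up, not the original example's numbers):
$$
\begin{aligned}
B &= 512\ \text{B}, \quad V = 9\ \text{B (key)}, \quad P = 6\ \text{B (tree pointer)} \\
p \cdot 6 + (p - 1) \cdot 9 &\le 512 \implies 15p \le 521 \implies p = 34
\end{aligned}
$$
With a fan-out of roughly 34, $34^4 \approx 1.3 \times 10^6$, so around four levels would be enough to index one million records under these assumptions.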
6. Query Processing and Optimization
If this is done naively, it will get the Cartesian product of the three tables (if each has 100 rows, then this results in 1,000,000 rows) then gets the names of stars that performed in a particular movie - probably something like 10 rows. Needless to say, this produces a lot of garbage rows and is very inefficient. In a real database, query optimization is done.
Query Processing
- Scanning, parsing and validating -> immediate form of query
- Query optimizer -> execution plan
- Query code generator -> code to execute the query
- Runtime database processor -> result
Code can be executed directly or stored and executed later (compiled).
Translating SQL Queries into Relational Algebra
Query block: the basic unit that can be translated into algebraic operators and optimized. This is a single SELECT ... FROM ... WHERE [... GROUP BY ... HAVING] expression. Nested queries will have multiple query blocks.
The query SELECT name FROM employee WHERE salary > (SELECT MAX(salary) FROM employee WHERE dept = 5) can be split into:
- SELECT name FROM employee WHERE salary > C
- C: SELECT MAX(salary) FROM employee WHERE dept = 5
For the second query, an unoptimized process could be:
- Retrieve all blocks from employee, filter out those with the wrong department, and sort them - unlikely to be indexed or be sorted by salary
- If there are a lot of rows, sorting may need to be done using external sorting
- Find the top row that has the maximum salary
Side Note: Sort-Merge
- Read x rows, sort them (quicksort etc.) and write to disk
- Repeat n times
- Read the first x/(n+1) rows from each subfile and perform an n-way merge. Store the result in a buffer; when the buffer size reaches x, write it to disk
The process can be done recursively.
Algorithms for WHERE
Simple selection has only a single condition; complex selections have multiple conditions.
Selectivity: ratio of the number of tuples that satisfy the condition to the total number of tuples.
For simple selection:
- Linear search: no index, unstructured data
- Binary search: no index, sorted data
- Primary index/hash key to retrieve a single record
- Primary index to retrieve multiple records (inequalities)
- Clustering index to retrieve multiple records (equality)
- Secondary index
For complex selection:
- Conjunctive selection using an index
- Use the index for a simple condition, then check if the retrieved tuple satisfies the other conditions
- Conjunctive selection using a composite index (index exists for two or more attributes involved in the conditions)
- Conjunctive selection using the intersections of record pointers
  - Secondary indexes available for some of the attributes involved in the conditions, providing pointers to records (not blocks)
  - Get the record pointers from each index, and find the intersection between them
  - Then, check if the retrieved tuples satisfy the other conditions
- Disjunctive conditions (union)
- If there are no indexes, use brute force
Algorithms for JOIN
Cross join/Cartesian product is time consuming. Equi-join and natural joins are much more efficient. Join performance is affected by:
- Available buffer space
- Join selection factor - fraction of records in one file that will be joined with records in the other file
- Choice of inner/outer relation
If an index/hash key exists for one of the join attributes, a single-loop join can be performed: read all rows from the table without the index and use the index to directly retrieve matching records. If this is not available, a nested-loop join is required (brute force).
Algorithms for SELECT
If the attribute list contains candidate keys, all that is required is removing unwanted attributes.
Otherwise, duplicates must be removed. This can be done through either sorting or hashing.
Algorithms for Set Operations
Sort both relations on the same attributes, then scan and merge sorted files concurrently, deciding on if the tuple(s) should be kept by checking their equality.
Algorithms for Aggregate Functions
Min, max:
- If an index exists, this can be used. If want to find the maximum and an ascending index exists, the right-most pointer in each index node can be followed
- Otherwise, each row must be read
Sum, count, average:
- For a dense index, the computation can be done on the values in the index
- If an index does not exist or it is a non-dense index, each row must be read
Group by:
- Sorting or hashing is used to partition the files into appropriate groups
- The aggregate function can be done on tuples in each group
Query Representation
Query tree:
- Data structure corresponding to relational algebra expressions
- Relations are leaf nodes
- Relational algebra operations are internal nodes
When execution of an internal node occurs, the node and its children are replaced with the resultant relation.
Canonical tree: standard tree corresponding to an SQL query without optimization. From this tree optimized alternatives can be generated.
Query Optimization
Heuristic Query Optimization
The parser generates an initial internal representation - the canonical tree. After this, heuristics should be used to optimize the internal representation.
The main heuristic is to apply operations that reduce the size of intermediate results
- Apply SELECT or WHERE before JOINs or other binary operations
  - Apply the more selective condition first (e.g. an equals condition on a unique column)
- Use joins instead of Cartesian products
To determine its efficiency, we can do things such as:
- Use the data catalog to retrieve the number of tuples in the input tables and:
  - Calculate the number of tuples outputted by the cross join
  - Find the upper bound for the number of tuples returned by the join
- Estimate/calculate the number of tuples returned by each condition in the WHERE clause
Some optimizations that can be done on the first example:
- The condition title = 'Gone with the wind' involves only a single table, so the select operation can be done before the Cartesian product. If each table has 100 rows, then the resulting table will only have 100,000 rows
- Only the movie ID is needed from the movie table, so a project operation can be done to keep only the movie ID before executing the Cartesian product
- Use a join instead of a Cartesian product and select clause: $$ \pi_{lname, fname}(star \Join_{\text{snumber = star}} stars \Join_{\text{movie = mnumber}} \pi_{movie\_id}(\sigma_{\text{title = 'Gone with the wind'}}(movie))) $$
Query Execution Plan
Query tree + information about access methods to be used for each relation + methods used to compute relational operators stored in the tree.
Materialized evaluation: when a temporary relation (the result of an operation) is created. For example, SELECT * FROM employee WHERE salary = (SELECT MAX(salary) FROM employee) will store the result of the nested query rather than computing it for each row.
Pipelined evaluation: when the result of an operation is forwarded to the next operator in the sequence.
Query Optimization in Oracle
Earlier versions used rule- or heuristic- based query optimization where the optimizer chooses execution plans based on heuristically ranked operations.
With cost-based query optimization, the optimizer examines alternative access paths and operator algorithms, available indexes, past-access statistics etc. and chooses the plan with the lowest estimated cost (estimated IO, CPU, memory requirements etc.).
To view the cost of each operation in a query, use EXPLAIN PLAN: EXPLAIN PLAN SET STATEMENT_ID = 'some_identifier' FOR some_statement, and then SELECT * FROM plan_table to view the estimated CPU, IO, time, temp space etc.
7. Data Catalogs
Data catalogs store metadata for database objects such as relations, attributes, domains, constraints, authorization and security information, owners, indexes.
The catalog is stored as a set of read-only relations and views (system tables).
The catalog is used:
- By the DDL and SDL compilers
- By the query and DML parsers
- These convert high-level queries and DML statements into low-level file access commands
- They use the internal schema to determine an optimal way to execute the query/command
- By the DBA for authorization and security checking, who has privileges to update the catalog
- For external-to-conceptual mapping of queries/DML commands
The catalog consists of base tables and user-accessible views. The latter consists of:
- USER views (objects owned by the user)
  - e.g. USER_CATALOG, USER_TAB_COLUMNS, USER_CONSTRAINTS
- ALL views (objects the user has privileges to view)
  - e.g. ALL_CATALOG, ALL_OBJECTS
- DBA views (all objects)
Examples
Some interesting catalog tables:
ALL_TABLES
Contains information about all tables you have view permissions for. To view the number of rows for a specific table:
SELECT NUM_ROWS FROM ALL_TABLES WHERE OWNER = 'owner_name' AND TABLE_NAME = 'table_name'
Number of blocks used is also stored.
The catalog is updated periodically, so it may contain stale data. The LAST_ANALYZED column contains the time the statistics were last gathered. To force an update, use the command ANALYZE TABLE table_name COMPUTE STATISTICS.
ALL_USERS
Contains information about all users you have view permissions for. The password is not viewable.
USER_USERS contains information about your own user.
ALL_INDEXES
Contains information about indexes: metadata such as the number of distinct keys, if it is primary/secondary etc.
ALL_VIEWS
Contains the definition (SQL query) for the view and other metadata. The definition cannot be modified directly as the catalog is read-only.
ALL_CONS_COLUMNS
Columns referenced in constraint definitions for all accessible tables: constraint name, table name, owner, name of the column the constraint applies to etc.
ALL_CONSTRAINTS
Contains more information about constraints including the search condition (SQL constraint definition) and constraint type (stored as a char - can be check constraint, primary/unique key, referential integrity constraint etc.).
TABLE_PRIVILEGES
Grantee, owner, grantor, table name and privileges (select, insert etc.) for each grantee.
8. Transaction Processing
A logical unit of work consisting of one or more statements. The effects of the transaction only become permanent when the transaction completes successfully.
An application program may contain several transactions.
The basic data operations are read and write. A read operation will:
- Find the address of the disk block that contains the item
- Copy that block into a buffer in main memory
- Copy the item from the buffer to a program variable
A write operation will:
- Find the address of the disk block that contains the item
- Copy the block into a buffer in main memory
- Copy the item from the program variable into the correct location in the buffer
- Copy the updated buffer back to disk, either immediately or at some later point in time
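A minimal sketch of these steps, with a dict standing in for the disk and a simple buffer pool (the block size, addressing scheme and data are assumptions, not DBMS internals):
BLOCK_SIZE = 4                                              # items per block (assumed)
disk = {0: ['a', 'b', 'c', 'd'], 1: ['e', 'f', 'g', 'h']}   # block number -> block contents
buffer_pool = {}                                            # in-memory copies of disk blocks

def block_of(item_address):
    # Find the address of the disk block that contains the item
    return item_address // BLOCK_SIZE, item_address % BLOCK_SIZE

def read_item(item_address):
    block, offset = block_of(item_address)
    if block not in buffer_pool:
        buffer_pool[block] = list(disk[block])   # copy the block into a main-memory buffer
    return buffer_pool[block][offset]            # copy the item into a program variable

def write_item(item_address, value, flush=False):
    block, offset = block_of(item_address)
    if block not in buffer_pool:
        buffer_pool[block] = list(disk[block])
    buffer_pool[block][offset] = value           # update the item in the buffer
    if flush:
        disk[block] = list(buffer_pool[block])   # write the buffer back immediately (or later)

x = read_item(5)               # brings block 1 into the buffer and returns 'f'
write_item(5, 'F', flush=True)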
A transaction can be in the following states:
- Active: running read/write operations
- Partially committed: all writes to memory finished; DBMS performs housekeeping (logs changes etc.)
- Committed: all changes copied from memory to disk
The transaction can be aborted when in the active or partially committed states.
Concurrent Execution of Transactions
Multi-user systems:
- Single-user system: at most one user can use the system at a time
- Multi-user system: many users can access the system concurrently
Concurrency:
- Interleaved processing: transactions run interleaved in a single CPU
- Parallel processing: transactions concurrently executed in multiple CPUs
Lost Update Problem
When two transactions that update the same database items have their operations interleaved in a way that makes the value of some database item incorrect.
Example:
Transaction 1                     | Transaction 2
read_item(X)                      |
X := X - N                        |
                                  | read_item(X)
                                  | X := X + M
write_item(X) <-- update is lost! |
                                  | write_item(X)
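The same schedule can be reproduced outside a DBMS with two unsynchronised threads; the sleeps below only force the problematic interleaving (values and timings are illustrative):
import threading, time

X = 100                         # shared "database item"
N, M = 10, 20

def transaction_1():
    global X
    local = X                   # read_item(X)
    local = local - N           # X := X - N
    time.sleep(0.1)             # transaction 2 reads the old value in the meantime
    X = local                   # write_item(X)  <-- this update is lost

def transaction_2():
    global X
    local = X                   # read_item(X), still sees 100
    local = local + M           # X := X + M
    time.sleep(0.2)
    X = local                   # write_item(X) overwrites transaction 1's write

t1 = threading.Thread(target=transaction_1)
t2 = threading.Thread(target=transaction_2)
t1.start(); t2.start(); t1.join(); t2.join()
print(X)                        # 120, not the serial result 110: the subtraction is lost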
Temporary Update Problem
One transaction updates a database item but the transaction fails. Another transaction accesses the updated item before it is reverted to its original value.
Reading the uncommitted value is called a dirty read.
Example:
Transaction 1 | Transaction 2
read_item(X) |
X := X - N |
write_item(X) |
| read_item(X) <- Dirty Read
| X := X + M
| write_item(X)
throw_error() |
undo_change() |
Incorrect Summary Problem
One transaction computes an aggregate function while other transactions update some of the records, causing the aggregate function to read values both before and after they are updated.
ACID Properties
Atomic: a transaction is either performed entirely or not at all.
Consistent: the transaction will move the database from one consistent state to another.
Isolated: the transaction’s changes should not be visible to other transactions until it is committed - this can solve the temporary update problem.
Durable: once a transaction is committed, system failures will not cause the changes to be lost.
The concurrency system is the primary means by which the isolation property is preserved, and supports the atomicity and consistency properties.
The recovery system is the primary means by which the atomicity and durability properties are preserved, and supports the consistency and isolation properties.
Concurrency Control and Recovery
Not covered due to time limitations.
Test Notes
Intro
- Data Definition Language: CREATE etc.
- Data Manipulation Language: SELECT etc.
- View Definition Language: TODO
- Storage Definition Language: internal schema (DBMS-specific commands)
- Internal schema: storage structures etc.
- Conceptual schema: relationships, tables etc.
- External schema: subset of conceptual; relevant content only
ER Diagrams
Attributes:
- Multivalued: { attr_name }
- Composite: attr_name(prop_1, prop_2)
- Derived: dotted
- Keys: underlined. Must be single-valued
Relationships:
- Degree: num. entity types - recursive relationships unary
- No key attributes
- Relationship type: intension; relationship set: extension
- Cardinality ratio: {A} in {B cardinality} relationships with {B}; structural constraints read the opposite way around
Weak entities:
- One partial key, may be composite; unique for a given owner
- Can have multiple owners
EER:
- {subclass_1 \subset, subclass_2 \subset} o|d (?:=|-)+ superclass
- Subclasses can merge if they share a root superclass - the resulting structure is called a lattice
- category \subset u {implementor_1, implementor_2}
Relational Data Model
Domain constraint: value must be in their domain.
Key constraint:
- Superkey: set of attributes such that no two tuples have the same values for those attributes
- Key: minimal superkey
- Prime attributes: attributes in any candidate key
Referential integrity constraint: foreign key matches existing primary key of relation, or is null.
Semantic integrity constraint: domain-specific constraints.
EER To Relation
- Simple entity: new relation, pick any key as PK
- Weak entity: new relation, PK is owner PK + partial key
- 1:N relationship: FK on N side
- 1:1 relationship: FK on side with total participation
- M:N relationship: new relation with both FKs as PK
- Multivalued attributes: new relation, PK is owner PK + attribute value
- N-ary relationships: new relation with FKs as PKs
- Specialization/generalization
- Any specialization: superclass has its own relation; subclasses have their own relation with their attributes; PK is FK for the superclass
- Total specialization: subclasses have their own relation, duplicating superclass attributes
- Disjoint specialization: single relation with properties from every subclass, plus type attribute
- Overlapping specialization: single relation with properties from every subclass, plus Boolean for each subclass indicating if the entity is a member of the subclass
- Category: PK is artificial, auto-generated ID. All superclasses of the category have this as a FK
Normalization
Guidelines:
- Each tuple represents a single entity: use only FKs to refer to other entities
- No update anomalies should be possible (e.g. reduce data duplication as much as possible)
- Reduce the number of null attributes
- Lossless join condition: when doing a natural join, all tuples should be meaningful (no spurious tuples)
Functional Dependencies
- Closure of a set of FDs $F$, written $F^+$: all FDs that can be inferred from $F$
- Closure of an attribute set $X$ with respect to $F$, written $X^+$: all attributes functionally determined by $X$
  - The closure initially contains the attributes of $X$ (for an FD, start from its LHS)
  - While there are changes, loop through all FDs; if the LHS is a subset of the closure, add the elements of the RHS to the closure (worked example below)
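For example (hypothetical FDs): given $F = \{A \to B,\ B \to C,\ CD \to E\}$, the closure $\{A, D\}^+$ starts as $\{A, D\}$; $A \to B$ adds $B$, then $B \to C$ adds $C$, then $CD \to E$ adds $E$, giving $\{A, B, C, D, E\}$, so $AD \to E$ is in $F^+$.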
Armstrong’s Inference Rules
- Reflexivity: if $Y \subseteq X$, then $X \to Y$
- Augmentation: if $X \to Y$, then $XZ \to YZ$
- Transitivity: if $X \to Y$ and $Y \to Z$, then $X \to Z$
From these, the following can be generated:
- Decomposition: if $X \to YZ$, then $X \to Y$ and $X \to Z$
- Union: if $X \to Y$ and $X \to Z$, then $X \to YZ$
- Pseudo-transitivity: if $X \to Y$ and $WY \to Z$, then $WX \to Z$
Equivalence of Sets
To test if two sets of FDs $F$ and $G$ are equivalent, check that every FD in $F$ is in $G^+$ and every FD in $G$ is in $F^+$ (compute the closure of each FD's LHS under the other set).
Minimal Sets
Minimal if RHS of all FDs is a single element and:
- Removing any dependency OR
- Removing any element from the LHS
leads to a set that is not equivalent.
Normal Forms
If an attribute can be removed from the LHS of an FD and the dependency still holds, it is a partial FD. Otherwise, it is a full FD.
If $X \to Y$ and $Y \to Z$ hold, where $Y$ is neither a key nor a subset of a key, then $Z$ is transitively dependent on $X$.
- 1NF: every attribute single-valued and atomic
- 2NF: every non-prime attribute fully functionally dependent on every key
- 3NF: LHS superkey OR RHS prime attribute
- BCNF: LHS superkey (worked example below)
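As a quick worked check (hypothetical relation): for $R(A, B, C)$ with key $A$ and FDs $A \to B$ and $B \to C$, there are no partial dependencies (the key has a single attribute), so $R$ is in 2NF; but in $B \to C$ the LHS is not a superkey and $C$ is not prime, so $R$ is not in 3NF (and therefore not in BCNF).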
Minimal Cover Algorithm
# Runnable version (attribute sets as Python sets; an FD is an (lhs, rhs) pair)
def closure(attrs, fds):
    # Attribute closure of attrs under the list of FDs
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def minimal_cover(F):
    # 1. Decomposition: one attribute on every RHS
    G = [(set(lhs), {a}) for lhs, rhs in F for a in rhs]
    # 2. Remove extraneous attributes from each LHS
    for i, (lhs, rhs) in enumerate(G):
        for el in list(lhs):
            if rhs <= closure(lhs - {el}, G):  # still inferable without el
                lhs = lhs - {el}
                G[i] = (lhs, rhs)
    # 3. Remove redundant FDs
    for fd in list(G):
        rest = [g for g in G if g is not fd]
        if fd[1] <= closure(fd[0], rest):      # implied by the remaining FDs
            G.remove(fd)
    return G
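As a quick sanity check of the version above (hypothetical input): minimal_cover([({'A'}, {'B', 'C'}), ({'B'}, {'C'})]) returns [({'A'}, {'B'}), ({'B'}, {'C'})] - the FD $A \to C$ is dropped because it is implied by $A \to B$ and $B \to C$.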
Physical Layer
- Disk divided into tracks, tracks divided into blocks/sectors.
- Records: sequences of fields
- Unspanned/spanned
- File: sequence of records. May be fixed or variable length
- Blocking factor: records/block
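For example (assumed sizes): with block size $B = 512$ bytes and fixed record size $R = 50$ bytes, the blocking factor is $bfr = \lfloor B/R \rfloor = \lfloor 512/50 \rfloor = 10$ records per block, so a file of $r = 1000$ records needs $\lceil r/bfr \rceil = 100$ blocks (unspanned organization).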
Hashing
- Files divided into buckets; function on hash key determines which bucket a record goes in
- Bucket has pointer to overflow buckets
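A minimal sketch of static hashing with overflow chaining (the bucket count, bucket capacity and use of Python's built-in hash are assumptions):
NUM_BUCKETS = 4
BUCKET_CAPACITY = 2

class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None                    # pointer to an overflow bucket

buckets = [Bucket() for _ in range(NUM_BUCKETS)]

def insert(key, record):
    b = buckets[hash(key) % NUM_BUCKETS]        # hash function on the hash key picks the bucket
    while len(b.records) >= BUCKET_CAPACITY:
        if b.overflow is None:
            b.overflow = Bucket()               # chain a new overflow bucket
        b = b.overflow
    b.records.append((key, record))

def lookup(key):
    b = buckets[hash(key) % NUM_BUCKETS]
    while b is not None:                        # search the bucket, then its overflow chain
        for k, rec in b.records:
            if k == key:
                return rec
        b = b.overflow
    return None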
RAID 0: striping data; 1: mirroring.
Indexing
- Dense index: one entry per record (1:1 mapping); sparse index: entries for only some records
- Clustered index: determines order of records. One entry/distinct value of field
- Primary index: entry/block (pointer + value of first/last record in the block)
- Secondary index: dense indexes with pointers to each record
- Multi-level index: an index built on the index, repeated until the top level fits in a single block
B-Trees
Structure with pointers to subtrees and (key value + pointer) pairs that divide the subtrees (all key values in the left subtree are less than the dividing key value etc.).
B+ tree: internal nodes store only key values. Leaf nodes have key value + pointer pairs, plus pointer to next leaf node in the tree.
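A minimal sketch of the leaf level (hand-built nodes with illustrative keys; insertion and splitting are omitted), showing how internal nodes hold only key values while a range scan follows the leaf chain:
class Leaf:
    def __init__(self, entries, next_leaf=None):
        self.entries = entries                  # (key value, record pointer) pairs
        self.next = next_leaf                   # pointer to the next leaf node

class Internal:
    def __init__(self, keys, children):
        self.keys = keys                        # key values only, no record pointers
        self.children = children                # len(children) == len(keys) + 1

# Hand-built tree: keys 1..6 spread over three linked leaves
leaf3 = Leaf([(5, 'r5'), (6, 'r6')])
leaf2 = Leaf([(3, 'r3'), (4, 'r4')], leaf3)
leaf1 = Leaf([(1, 'r1'), (2, 'r2')], leaf2)
root = Internal([3, 5], [leaf1, leaf2, leaf3])

def find_leaf(node, key):
    # Descend from the root to the leaf that could contain the key
    while isinstance(node, Internal):
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    return node

def range_scan(root, lo, hi):
    # Locate the starting leaf, then follow the next-leaf pointers
    leaf, result = find_leaf(root, lo), []
    while leaf is not None:
        for key, ptr in leaf.entries:
            if lo <= key <= hi:
                result.append(ptr)
            elif key > hi:
                return result
        leaf = leaf.next
    return result

print(range_scan(root, 2, 5))                   # ['r2', 'r3', 'r4', 'r5']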
Transactions
- Lost update: multiple transactions updating the same items have operations interleaved
- Temporary update/dirty read: when transaction reads data being modified by another transaction before it is committed
- Incorrect summary: transaction running aggregate method while another transaction is updating records
ACID:
- Atomic: performed entirely or not at all. Recovery
- Consistent: move from one valid state to another. Recovery
- Isolated: changes not visible to others until committed. Concurrency
- Durable: once committed, system failures will not affect data
Misc.
Materialized evaluation: the result of an operation is stored as a temporary relation (e.g. comparing a value to the max value from a nested query); cf. pipelined evaluation.