5: Physical Layer

Internal Schema

Good database design usually requires multiple tables, so many select operations will need to access several tables, which leads to performance complexities.

DBMS designers must ensure the database, which will be stored in persistent memory, performs well. They must take into account factors such as:

The database designers can affect performance through:

Memory Hierarchy

Primary memory - cache and main memory:

Secondary/tertiary memory - flash, HDDs, optical disks, tape:

Disk Storage Devices

Record Management

Multiple records are stored in a block:

File Organization

The DBMS:

The file descriptor/header contains metadata for the file, such as field names, data types and addresses of the file blocks on the disks. The physical disk blocks can be contiguous, linked or indexed.

Typical file operations:

Unordered files:

Ordered files:

Hashed Files (External Hashing)

Collision Resolution:

Misc. notes:
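As a rough illustration of the idea behind external hashing: each record is hashed to one of M primary buckets (one bucket per block), and records that overflow a full bucket are chained into overflow storage. The class name, bucket capacity, and use of Python's built-in `hash` below are illustrative assumptions, not the course's notation.

```python
# Minimal sketch of external (static) hashing with chained overflow.
# bucket number = hash(key) mod M; names here are illustrative only.

class HashedFile:
    def __init__(self, m=8, bucket_capacity=4):
        self.m = m                              # number of primary buckets
        self.capacity = bucket_capacity         # records per bucket (block)
        self.buckets = [[] for _ in range(m)]   # primary bucket area
        self.overflow = {}                      # bucket no. -> overflow chain

    def insert(self, key, record):
        b = hash(key) % self.m                  # hash key to a primary bucket
        if len(self.buckets[b]) < self.capacity:
            self.buckets[b].append((key, record))
        else:                                   # bucket full: chain into overflow
            self.overflow.setdefault(b, []).append((key, record))

    def find(self, key):
        b = hash(key) % self.m                  # search the bucket, then its chain
        for k, rec in self.buckets[b] + self.overflow.get(b, []):
            if k == key:
                return rec
        return None
```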

Redundant Arrays of Independent Disks (RAID)

Uses parallelism across multiple disks to increase read/write performance using data striping.
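As a rough sketch of block-level striping (RAID 0-style), consecutive logical blocks can be mapped round-robin across the disks so that large sequential transfers proceed on all disks in parallel; the mapping function below is an illustrative assumption, not a description of any particular controller.

```python
def stripe_location(logical_block: int, num_disks: int) -> tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk)
    under simple round-robin (RAID 0-style) block striping."""
    return logical_block % num_disks, logical_block // num_disks

# With 4 disks, consecutive logical blocks land on different disks:
print([stripe_location(b, 4) for b in range(8)])
# [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]
```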

RAID levels:

Indexing

Indexes provide an efficient way to access records in different orders without expensive operations such as sorting or duplicating data, by creating a data structure that maps a key to an address.

Dense indexes have entries for every record, while sparse indexes contain entries for only some of the records (not a one-to-one relationship).

A non-clustered index stores the index entries in a structure separate from the actual data; each entry contains the index key value and a pointer to the row.

A clustered index physically defines the order in which records are stored. Hence, there can only be one clustered index. This course limits clustering indexes to non-key fields, so a file has either a primary index or a clustering index (not both).

Under this definition, a clustered index will have an entry for each distinct value of the field and hence, it is a sparse index.

A primary index is indexed by the primary key and contains an entry for each block. Each entry contains a pointer to the block and the key value of either the first or last record in that block. Hence, the primary index is a sparse index. To improve performance when inserting records, an overflow block may be used.

Secondary indexes are dense indexes that can be on either candidate keys or non-keys, and contain the value of the field and a pointer to either a block or record.
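To make the sparse/dense distinction concrete, a minimal sketch of a lookup through a sparse primary index, assuming one (first-key-in-block, block-number) entry per data block, might look like this; the entry layout and the use of `bisect` are illustrative assumptions.

```python
import bisect

# Sparse primary index: one (first_key_in_block, block_number) entry per block,
# kept in key order so it can be binary-searched.
index = [(100, 0), (210, 1), (355, 2), (480, 3)]   # illustrative entries
first_keys = [k for k, _ in index]

def find_block(search_key: int) -> int:
    """Return the block that could contain search_key:
    the last block whose first key is <= search_key."""
    pos = bisect.bisect_right(first_keys, search_key) - 1
    if pos < 0:
        raise KeyError(search_key)   # smaller than every key in the file
    return index[pos][1]

print(find_block(360))  # -> block 2; the record is then searched for inside that block
```

A dense secondary index would instead hold one such entry per record, so the same search runs over $r$ entries rather than $b$.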

Example: Storage/Speed Tradeoffs

$r = 30,000$ records, record size $R = 100$ B, block size $B = 1024$ B, index key size $V = 9$ B, block pointer size $P = 6$ B. Blocks are unspanned.

Case 1: Ordered File

The blocking factor $BFR = floor(\frac{block\_size}{record\_size}) = floor(\frac{B}{R}) = floor(\frac{1,024}{100}) = 10$. Hence, 10 records fit in a block.

Hence, the number of blocks required, $b = ceil(\frac{number\_of\_records}{blocking\_factor}) = ceil(\frac{30,000}{10}) = 3,000$.

For linear search:

For binary search, $log_{2}(b) \approx 12$ blocks must be read from the disk.
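These numbers can be checked with a few lines of arithmetic; the sketch below re-derives the blocking factor, block count, and search costs for Case 1 (the $b/2$ average for a successful linear search on a key field is the usual assumption).

```python
import math

r, R, B = 30_000, 100, 1024         # records, record size (B), block size (B)

bfr = B // R                        # blocking factor: records per block
b = math.ceil(r / bfr)              # blocks needed for the file

linear_avg = b / 2                  # linear search: b/2 blocks on average for a key field
binary = math.ceil(math.log2(b))    # binary search on the ordered file

print(bfr, b, linear_avg, binary)   # 10 3000 1500.0 12
```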

Case 2: Primary Index

The primary index will have an entry for each block. Each index entry has a size of $I = V + P = 9 B + 6 B = 15 B$.

Hence, the blocking factor for the index blocks, $BFRI = floor(\frac{block\_size}{index\_entry\_size}) = floor(\frac{B}{I}) = floor(\frac{1,024}{15}) = 68$, so the index requires $ceil(\frac{number\_of\_index\_entries}{index\_blocking\_factor}) = ceil(\frac{b}{BFRI}) = ceil(\frac{3,000}{68}) = 45$ blocks.

To find an individual record, binary search is done on the index, so the number of blocks read is $log_{2}(45) \approx 6$. There is an extra read required for the block the record is stored in, so it requires $7$ reads in total.

The primary index must be ordered, so update/insertion/deletion operations are costly.
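The same arithmetic for Case 2, re-deriving the index entry size, fan-out, index size, and total reads:

```python
import math

b, V, P, B = 3000, 9, 6, 1024        # data blocks, key size, block pointer size, block size

I = V + P                            # index entry size
bfri = B // I                        # index entries per index block (fan-out)
index_blocks = math.ceil(b / bfri)   # sparse index: one entry per data block

reads = math.ceil(math.log2(index_blocks)) + 1   # binary-search the index, then read the data block
print(I, bfri, index_blocks, reads)  # 15 68 45 7
```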

Case 3: Secondary Index

The secondary index is a dense index with an entry for each record.

Hence, the index requires $ceil(\frac{number\_of\_index\_entries}{index\_blocking\_factor}) = ceil(\frac{r}{BFRI}) = ceil(\frac{30,000}{68}) = 442$ blocks.

Searching the index requires $log_{2}(442) \approx 9$ block reads, plus an additional read for the block the record is stored in, so it requires $10$ reads in total.

Unlike with the primary index, the data file does not need to be ordered, so update/insertion/deletion operations are faster.
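And the same re-derivation for Case 3, the dense secondary index:

```python
import math

r, bfri = 30_000, 68                  # records; index entries per index block (as above)

index_blocks = math.ceil(r / bfri)    # dense index: one entry per record
reads = math.ceil(math.log2(index_blocks)) + 1   # binary-search the index, then read the data block

print(index_blocks, reads)            # 442 10
```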

Multi-Level Indexes

The index itself is an ordered file, so a primary index to the index can be created; the original index is called the first-level index, the index to the index the second-level index, and so on.

This process can be repeated until all entries fit in a single block.

Assuming the first-level index is a primary index, each block of the first-level index references $BFRI^1$ data blocks, each block of the second-level index $BFRI^2$ data blocks, and each block of the $n$th-level index $BFRI^n$ data blocks.

This has consequences for performance: because the indexes are ordered, update/insertion/deletion operations may require indexes at every level to be updated. Hence, this is more useful in read-heavy scenarios.
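Continuing the running example (fan-out $BFRI = 68$ and a first level of 3,000 entries, one per data block), the number of index levels and the reads per lookup can be estimated as below; one block read per level is the usual simplification.

```python
import math

fan_out = 68            # index entries per index block (BFRI)
entries = 3000          # first-level index entries (one per data block)

levels = math.ceil(math.log(entries, fan_out))   # levels until a single block remains
reads = levels + 1                               # one read per index level + the data block

print(levels, reads)    # 2 3
```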

B-Trees and B+-Trees

Due to this insertion/deletion problem, B-tree or B+-tree structures are used.

In a B-tree, a node of order $p$ has the structure $P_1, <K_1, Pr_1>, P_2, <K_2, Pr_2>, \dots, P_{q-1}, <K_{q-1}, Pr_{q-1}>, P_q$ where:

A B-tree of order m has:

The keys divide the subtrees: all values in the subtree pointed to by $P_i$ lie between $K_{i-1}$ and $K_i$; for example, all values in $P_2$ lie between $K_1$ and $K_2$.

On insertion, if a node becomes full then it is split into two nodes, with the middle value moved up to become a key in the parent. If the parent then becomes full, this may cascade upwards. This behavior ensures the tree is always balanced.

On deletion, if a node has fewer than m / 2 keys, it can be merged with a sibling.
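A minimal sketch of the split step described above, assuming the node is represented simply as a sorted list of keys (child pointers and the parent update are omitted for brevity):

```python
def split_node(keys: list, p: int):
    """Split an overfull B-tree node (more than p - 1 keys) into two nodes,
    returning (left_keys, middle_key, right_keys). The middle key is pushed
    up into the parent, which may itself overflow and split in turn."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Inserting 45 into a full order-5 node [10, 20, 30, 40]:
left, up, right = split_node(sorted([10, 20, 30, 40, 45]), p=5)
print(left, up, right)   # [10, 20] 30 [40, 45]
```

The returned middle key is what gets inserted into the parent; if that makes the parent overfull, the same split is applied there, which is how the cascade mentioned above arises.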

In a B+-tree, pointers to data records only exist in leaf nodes:

Example: Calculating Order

Find the order $p$ of a B+-tree if the block size $B = 512$ B, index key size $V = 9$ B, data pointer size $Pr = 7$ B, and tree pointer size $P = 6$ B:

Each internal node can have up to $p$ tree pointers and $p - 1$ keys, taking up a total of $6p + 9(p - 1)$ bytes. This must be at most the block size:

$15p - 9 \leq 512 \therefore p \leq 521/15 \approx 34.7$. Hence, the order and maximum value of $p$ is $p = 34$.

To store one million records requires $log_{34}(1,000,000) \approx 4$ levels (data pointers are stored in the leaf nodes, so one million leaf entries are required).
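The order calculation can be checked numerically; the snippet below searches for the largest $p$ satisfying $pP + (p - 1)V \le B$ and then estimates the number of levels for one million records under the same simplification as above.

```python
import math

B, V, P = 512, 9, 6                  # block size, key size, tree pointer size (bytes)

# Largest p such that p tree pointers and (p - 1) keys fit in one block.
p = max(q for q in range(2, B) if q * P + (q - 1) * V <= B)
levels = math.ceil(math.log(1_000_000, p))

print(p, levels)                     # 34 4
```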