If this is done naively, it will compute the Cartesian product of the three tables (if each has 100 rows, this results in 1,000,000 rows) and then get the names of stars that performed in a particular movie - probably something like 10 rows. Needless to say, this produces a lot of garbage rows and is very inefficient. In a real database, query optimization is done.
Query Processing
- Scanning, parsing and validating -> intermediate form of query
- Query optimizer -> execution plan
- Query code generator -> code to execute the query
- Runtime database processor -> result
Code can be executed directly or stored and executed later (compiled).
Translating SQL Queries into Relational Algebra
Query block: the basic unit that can be translated into algebraic operators and optimized. This is a single SELECT ... FROM ... WHERE [... GROUP BY ... HAVING] expression. Nested queries will have multiple query blocks.
The query SELECT name FROM employee WHERE salary > (SELECT MAX(salary) FROM employee WHERE dept = 5) can be split into:
SELECT name FROM employee WHERE salary > C
SELECT MAX(salary) FROM employee WHERE dept = 5
where C is the result of the second (inner) block.
For the second query, an unoptimized process could be:
- Retrieve all blocks from employee, filter out those with the wrong department, and sort them - unlikely to be indexed or be sorted by salary
- If there are a lot of rows, sorting may need to be done using external sorting
- Find the top row that has the maximum salary
Side Note: Sort-Merge
- Read x rows, sort them (quicksort etc.) and write to disk
- Repeat n times
- Read the first x/(n+1) rows from each subfile and perform an n-way merge. Store the result in a buffer; when the buffer size reaches x, write it to disk
The process can be done recursively.
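A minimal sketch of the sort-merge idea, assuming a relation is just a Python list of rows and that run_size rows fit in memory (the names and the in-memory "runs" are illustrative; a real DBMS writes each run to disk):

```python
import heapq

def external_sort(rows, run_size, key=lambda r: r):
    """Sort more rows than fit in memory: create sorted runs, then do an n-way merge."""
    # Sort phase: read run_size rows at a time and sort each chunk (a "run").
    runs = []
    for i in range(0, len(rows), run_size):
        runs.append(sorted(rows[i:i + run_size], key=key))

    # Merge phase: n-way merge of the sorted runs. heapq.merge pulls one element
    # from each run at a time, mimicking the per-run input buffers described above.
    return list(heapq.merge(*runs, key=key))

employees = [("eve", 900), ("bob", 1200), ("amy", 700), ("dan", 1500), ("cat", 1100)]
print(external_sort(employees, run_size=2, key=lambda r: r[1]))  # rows ordered by salary
```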
Algorithms for WHERE
Simple selection has only a single condition; complex selections have multiple conditions.
Selectivity: ratio of the number of tuples that satisfy the condition to the total number of tuples.
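In relational-algebra notation (a standard formulation, not stated explicitly above), the selectivity of a condition c on a relation R is

$$ sl(c) = \frac{|\sigma_{c}(R)|}{|R|} $$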
For simple selection:
- Linear search: no index, unstructured data
- Binary search: no index, sorted data
- Primary index/hash key to retrieve a single record
- Primary index to retrieve multiple records (inequalities)
- Clustering index to retrieve multiple records (equality)
- Secondary index
For complex selection:
- Conjunctive selection using an index
  - Use the index for a simple condition, then check if the retrieved tuple satisfies the other conditions
- Conjunctive selection using a composite index (index exists for two or more attributes involved in the conditions)
- Conjunctive selection using the intersections of record pointers (see the sketch after this list)
  - Secondary indexes available for some of the attributes involved in the conditions, providing pointers to records (not blocks)
  - Get the record pointers from each index, and find the intersection between them
  - Then, check if the retrieved tuples satisfy the other conditions
- Disjunctive conditions (union)
- If there are no indexes, use brute force
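A minimal sketch of conjunctive selection using the intersection of record pointers, assuming each secondary index is modeled as a dict from attribute value to a set of record ids (all names here are illustrative):

```python
def conjunctive_select(records, indexes, indexed_conds, other_conds):
    """records: {record_id: row}
    indexes: {attr: {value: set of record ids}}  (secondary indexes)
    indexed_conds: {attr: value} equality conditions that have an index
    other_conds: predicates checked on the fetched records"""
    # Get the record-pointer set from each usable index and intersect them.
    pointer_sets = [indexes[attr].get(value, set())
                    for attr, value in indexed_conds.items()]
    candidates = set.intersection(*pointer_sets) if pointer_sets else set(records)

    # Fetch only the candidate records and verify the remaining conditions.
    return [records[rid] for rid in candidates
            if all(pred(records[rid]) for pred in other_conds)]

records = {1: {"dept": 5, "salary": 900}, 2: {"dept": 5, "salary": 400},
           3: {"dept": 3, "salary": 900}}
indexes = {"dept": {5: {1, 2}, 3: {3}}, "salary": {900: {1, 3}, 400: {2}}}
print(conjunctive_select(records, indexes,
                         {"dept": 5, "salary": 900},       # indexed equality conditions
                         [lambda r: r["salary"] > 500]))   # remaining condition
# -> [{'dept': 5, 'salary': 900}]
```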
Algorithms for JOIN
Cross join/Cartesian product is time consuming. Equi-join and natural joins are much more efficient. Join performance is affected by:
- Available buffer space
- Join selection factor - fraction of records in one file that will be joined with records in the other file
- Choice of inner/outer relation
If an index/hash key exists for one of the join attributes, a single-loop join can be performed: read all rows from the table without the index and use the index to directly retrieve matching records. If this is not available, a nested-loop join is required (brute force).
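A sketch of the two approaches, assuming the relations are lists of dicts and the index is a plain dict from join-key value to matching rows (illustrative names, not from the notes):

```python
from collections import defaultdict

def nested_loop_join(outer, inner, outer_key, inner_key):
    """Brute force: compare every outer row with every inner row."""
    return [{**o, **i} for o in outer for i in inner if o[outer_key] == i[inner_key]]

def single_loop_join(outer, inner_index, outer_key):
    """Single loop: scan the relation without an index once, and probe the
    index on the other relation to fetch only the matching rows."""
    return [{**o, **i} for o in outer for i in inner_index.get(o[outer_key], [])]

stars = [{"snumber": 1, "lname": "Gable"}, {"snumber": 2, "lname": "Leigh"}]
roles = [{"star": 1, "movie": 10}, {"star": 2, "movie": 10}, {"star": 1, "movie": 11}]

# Build an index on roles.star (in a DBMS this would already exist as a hash/B-tree index).
roles_by_star = defaultdict(list)
for r in roles:
    roles_by_star[r["star"]].append(r)

print(nested_loop_join(stars, roles, "snumber", "star") ==
      single_loop_join(stars, roles_by_star, "snumber"))  # True - same result, fewer comparisons
```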
Algorithms for SELECT
If the attribute list contains a candidate key of the relation, all that is required is removing the unwanted attributes (no duplicates can arise).
Otherwise, duplicates must be removed. This can be done through either sorting or hashing.
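A sketch of the hashing approach to duplicate elimination after projecting out the unwanted attributes (illustrative, assuming rows are dicts):

```python
def project(rows, attrs):
    """Keep only attrs; drop duplicate result tuples using a hash set."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[a] for a in attrs)   # the projected tuple, usable as a hash key
        if key not in seen:                  # duplicate check is O(1) per row
            seen.add(key)
            result.append(dict(zip(attrs, key)))
    return result

employees = [{"name": "amy", "dept": 5}, {"name": "bob", "dept": 5}, {"name": "amy", "dept": 3}]
print(project(employees, ["name"]))          # [{'name': 'amy'}, {'name': 'bob'}] - duplicate removed
print(project(employees, ["name", "dept"]))  # all three tuples are distinct, nothing removed
```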
Algorithms for Set Operations
Sort both relations on the same attributes, then scan and merge the sorted files concurrently, deciding whether each tuple should be kept by checking tuple equality.
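A minimal sketch of sort-merge intersection; union and difference follow the same two-cursor pattern with a different keep rule (assuming tuples are directly comparable, e.g. plain Python tuples):

```python
def sorted_intersection(r, s):
    """INTERSECT via sort-merge: sort both inputs, then advance two cursors,
    keeping a tuple only when both sides have an equal copy."""
    r, s = sorted(r), sorted(s)
    i = j = 0
    out = []
    while i < len(r) and j < len(s):
        if r[i] == s[j]:
            if not out or out[-1] != r[i]:   # also eliminates duplicates
                out.append(r[i])
            i += 1
            j += 1
        elif r[i] < s[j]:
            i += 1
        else:
            j += 1
    return out

print(sorted_intersection([("amy", 5), ("bob", 3)], [("bob", 3), ("cat", 1)]))
# -> [('bob', 3)]
```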
Algorithms for Aggregate Functions
Min, max:
- If an index exists, it can be used. For example, to find the maximum with an ascending index, the right-most pointer in each index node can be followed
- Otherwise, each row must be read
Sum, count, average:
- For a dense index, the computation can be done on the values in the index
- If an index does not exist or it is a non-dense index, each row must be read
Group by:
- Sorting or hashing is used to partition the files into appropriate groups
- The aggregate function can be done on tuples in each group
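A sketch of hash-based grouping followed by an aggregate per group (illustrative; a DBMS would use the same idea with hash partitions on disk when the groups do not fit in memory):

```python
from collections import defaultdict

def group_avg(rows, group_attr, value_attr):
    """GROUP BY group_attr, computing AVG(value_attr) for each group."""
    groups = defaultdict(list)
    for row in rows:                       # partition rows into groups by hashing the grouping key
        groups[row[group_attr]].append(row[value_attr])
    return {k: sum(v) / len(v) for k, v in groups.items()}   # aggregate within each group

employees = [{"dept": 5, "salary": 900}, {"dept": 5, "salary": 1100}, {"dept": 3, "salary": 700}]
print(group_avg(employees, "dept", "salary"))   # {5: 1000.0, 3: 700.0}
```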
Query Representation
Query tree:
- Data structure corresponding to a relational algebra expression
- Relations are leaf nodes
- Relational algebra operations are internal nodes
When execution of an internal node occurs, the node and its children are replaced with the resultant relation.
Canonical tree: standard tree corresponding to an SQL query without optimization. From this tree optimized alternatives can be generated.
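A minimal sketch of a query tree in which leaves are relations and internal nodes are relational algebra operations; evaluating an internal node conceptually replaces that subtree with its result relation (the class and operator names below are illustrative):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Relation:                 # leaf node: a base table (here, a list of dicts)
    rows: list
    def evaluate(self):
        return self.rows

@dataclass
class Select:                   # internal node: sigma(condition)
    condition: Callable
    child: object
    def evaluate(self):
        return [r for r in self.child.evaluate() if self.condition(r)]

@dataclass
class Project:                  # internal node: pi(attrs)
    attrs: list
    child: object
    def evaluate(self):
        return [{a: r[a] for a in self.attrs} for r in self.child.evaluate()]

movie = Relation([{"mnumber": 10, "title": "Gone with the wind"},
                  {"mnumber": 11, "title": "Casablanca"}])
tree = Project(["mnumber"], Select(lambda r: r["title"] == "Gone with the wind", movie))
print(tree.evaluate())          # [{'mnumber': 10}] - the relation that replaces the subtree
```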
Query Optimization
Heuristic Query Optimization
The parser generates an initial internal representation - the canonical tree. After this, heuristics should be used to optimize the internal representation.
The main heuristic is to apply operations that reduce the size of intermediate results:
- Apply SELECT/WHERE before JOINs or other binary operations
- Apply the more selective condition first (e.g. an equality condition on a unique column)
- Use joins instead of Cartesian products
To determine a plan's efficiency, we can do things such as:
- Use the data catalog to retrieve the number of tuples in the input tables and:
  - Calculate the number of tuples outputted by the cross join
  - Find the upper bound for the number of tuples returned by the join
  - Estimate/calculate the number of tuples returned by each condition in the WHERE clause
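As a worked example, using the standard estimation formulas and the 100-row tables from the running example ($R_1, R_2, R_3$ are placeholder relation names, $c$ a WHERE condition with selectivity $sl(c)$ as defined earlier):

$$ |R_1 \times R_2 \times R_3| = 100 \cdot 100 \cdot 100 = 10^6, \qquad |R_1 \Join R_2| \le |R_1| \cdot |R_2|, \qquad |\sigma_{c}(R)| \approx sl(c) \cdot |R| $$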
Some optimizations that can be done on the first example:
- The condition title = 'Gone with the wind' involves only a single table, so the select operation can be done before the Cartesian product. If each table has 100 rows, then the resulting table will only have 100,000 rows
- Only the movie ID is needed from the movie table, so a project operation can be done to keep only the movie ID before executing the Cartesian product
- Use a join instead of a Cartesian product and select clause: $$ \pi_{\text{lname, fname}}(\text{star} \Join_{\text{snumber = star}} \text{stars} \Join_{\text{movie = mnumber}} \pi_{\text{movie\_id}}(\sigma_{\text{title = 'Gone with the wind'}}(\text{movie}))) $$
Query Execution Plan
Query tree + information about access methods to be used for each relation + methods used to compute relational operators stored in the tree.
Materialized evaluation: when a temporary relation (result of an operation) is made. For example, SELECT * FROM employee WHERE salary = (SELECT MAX(salary) FROM employee) will store the result of the nested query rather than recomputing it for each row.
Pipelined evaluation: when the result of an operation is forwarded to the next operator in the sequence.
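A sketch of the difference, using Python generators to mimic pipelining (names are illustrative):

```python
def scan(rows):
    for row in rows:              # produces one tuple at a time
        yield row

def select(rows, pred):
    for row in rows:              # consumes tuples as they arrive - no temporary relation
        if pred(row):
            yield row

employees = [{"name": "amy", "salary": 900}, {"name": "bob", "salary": 1500}]

# Materialized: the intermediate result is built in full before the next operator runs.
high_paid = list(select(scan(employees), lambda r: r["salary"] > 1000))   # temporary relation
names_materialized = [r["name"] for r in high_paid]

# Pipelined: each operator pulls tuples from the one below it, one at a time.
names_pipelined = [r["name"] for r in select(scan(employees), lambda r: r["salary"] > 1000)]

print(names_materialized == names_pipelined)   # True - same result, different evaluation
```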
Query Optimization in Oracle
Earlier versions used rule-based (heuristic) query optimization, where the optimizer chooses execution plans based on heuristically ranked operations.
With cost-based query optimization, the optimizer examines alternative access paths and operator algorithms, available indexes, past-access statistics etc. and chooses the plan with the lowest estimated cost (estimated IO, CPU, memory requirements etc.).
To view the cost of each operation in a query, use the EXPLAIN PLAN statement: EXPLAIN PLAN SET STATEMENT_ID = 'some_identifier' FOR some_statement, then SELECT * FROM plan_table to view the estimated CPU, IO, time, temp space etc.