6. Query Processing and Optimization

πlname,fname(σsnumber = star  movie = mnumber  title = ’Gone with the wind’(star×stars×movie)) \pi_{lname, fname}(\sigma_{\text{snumber = star } \land \text{ movie = mnumber } \land \text{ title = 'Gone with the wind'}}(star \times stars \times movie))

If this is done naively, it will get the Cartesian product of the three tables (if each has 100 rows, then this results in 1,000,000 rows) then gets the names of stars that performed in a particular movie - probably something like 10 rows. Needless to say, this produces a lot of garbage rows and is very inefficient. In a real database query optimization is done.

Query Processing

Code can be executed directly or stored and executer later (compiled).

Translating SQL Queries into Relational Algebra

Query block: the basic unit that can be translated into algebraic operators and optimized. This is a single SELECT ... FROM ... WHERE [... GROUP BY ... HAVING] expression. Nested queries will have multiple query blocks.

The query SELECT name FROM employee WHERE salary > (SELECT MAX(salary) FROM employee WHERE dept = 5) can be split into:

For the second query, an unoptimized process could be:

Side Note: Sort-Merge

The process can be done recursively.

Algorithms for WHERE

Simple selection has only as single condition; complex selections have multiple conditions.

Selectivity: ratio of the number of tuples that satisfy the condition to the total number of tuples.

For simple selection:

For complex selection:

Algorithms for JOIN

Cross join/Cartesian product is time consuming. Equi-join and natural joins are much more efficient. Join performance is affected by:

If an index/hash key exists for one of the join attributes, a single-loop join can be performed: read all rows from the table without the index and use the index to directly retrieve matching records. If this is not available, a nested-loop join is required (brute force).

Algorithms for SELECT

If the attribute list contains candidate keys, all that is required is removing unwanted attributes.

Otherwise, duplicates must be removed. This can be done through either sorting or hashing.

Algorithms for Set Operations

Sort both relations on the same attributes, then scan and merge sorted files concurrently, deciding on if the tuple(s) should be kept by checking their equality.

Algorithms for Aggregate Functions

Min, max:

Sum, count, average:

Group by:

Query Representation

Query tree:

When execution of an internal node occurs, the node and its children are replaced with the resultant relation.

Canonical tree: standard tree corresponding to an SQL query without optimization. From this tree optimized alternatives can be generated.

Query Optimization

Heuristic Query Optimization

The parser generates an initial internal representation - the canonical tree. After this, heuristics should be used to optimize the internal representation.

The main heuristic is to apply operations that reduce the size of intermediate results

To determine its efficiency, we can do things such as:

Some optimizations that can be done on the first example:

Query Execution Plan

Query tree + information about access methods to be used for each relation + methods used to compute relational operators stored in the tree.

Materialized evaluation: when a temporary relation (result of an operation) is made. For example, SELECT * FROM employee WHERE salary = (SELECT MAX(salary) FROM employee) will store the nested query rather than computing it for each row.

Pipelined evaluation: when the result of an operation is forwarded to the next operator in the sequence.

Query Optimization in Oracle

Earlier versions used rule- or heuristic- based query optimization where the optimizer chooses execution plans based on heuristically ranked operations.

With cost-based query optimization, the optimizer examines alternative access paths and operator algorithms, available indexes, past-access statistics etc. and chooses the plan with the lowest estimated cost (estimated IO, CPU, memory requirements etc.).

TO view the cost of each operation in a query, use the EXPLAIN query: EXPLAIN PLAN SET STATEMENT_ID = 'some_identifier' FOR some_statement and then SELECT * FROM plan_table to view the estimated CPU, IO, time, temp space etc.