If this is done naively, it will compute the Cartesian product of the three tables (if each has 100 rows, this results in 1,000,000 rows) and then get the names of stars that performed in a particular movie - probably something like 10 rows. Needless to say, this produces a lot of garbage rows and is very inefficient. In a real database, query optimization is done.
Query Processing
- Scanning, parsing and validating -> intermediate form of query
- Query optimizer -> execution plan
- Query code generator -> code to execute the query
- Runtime database processor -> result
Code can be executed directly or stored and executed later (compiled).
Translating SQL Queries into Relational Algebra
Query block: the basic unit that can be translated into algebraic operators and optimized. This is a single SELECT ... FROM ... WHERE [... GROUP BY ... HAVING] expression. Nested queries will have multiple query blocks.
The query SELECT name FROM employee WHERE salary > (SELECT MAX(salary) FROM employee WHERE dept = 5) can be split into:
SELECT name FROM employee WHERE salary > C
SELECT MAX(salary) FROM employee WHERE dept = 5
where C is the result of the second (inner) block.
For the second query, an unoptimized process could be:
- Retrieve all blocks from employee, filter out those with the wrong department, and sort them - unlikely to be indexed or be sorted by salary
- If there are a lot of rows, sorting may need to be done using external sorting
- Find the top row that has the maximum salary
Side Note: Sort-Merge
- Read x rows, sort them (quicksort etc.) and write to disk
- Repeat n times
- Read the first x/(n+1) rows from each subfile and perform an n-way merge. Store the result in a buffer; when the buffer size reaches x, write it to disk
The process can be done recursively.
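A minimal sketch of the sort-merge idea, assuming a relation is just a Python list of rows and that run_size rows fit in memory (the names and the in-memory "runs" are illustrative; a real DBMS writes each run to disk):

```python
import heapq

def external_sort(rows, run_size, key=lambda r: r):
    """Sort more rows than fit in memory: create sorted runs, then do an n-way merge."""
    # Sort phase: read run_size rows at a time and sort each chunk (a "run").
    runs = []
    for i in range(0, len(rows), run_size):
        runs.append(sorted(rows[i:i + run_size], key=key))

    # Merge phase: n-way merge of the sorted runs. heapq.merge pulls one element
    # from each run at a time, mimicking the per-run input buffers described above.
    return list(heapq.merge(*runs, key=key))

employees = [("eve", 900), ("bob", 1200), ("amy", 700), ("dan", 1500), ("cat", 1100)]
print(external_sort(employees, run_size=2, key=lambda r: r[1]))  # rows ordered by salary
```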
Algorithms for WHERE
Simple selection has only a single condition; complex selections have multiple conditions.
Selectivity: ratio of the number of tuples that satisfy the condition to the total number of tuples.
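In relational-algebra notation (a standard formulation, not stated explicitly above), the selectivity of a condition c on a relation R is

$$ sl(c) = \frac{|\sigma_{c}(R)|}{|R|} $$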
For simple selection:
- Linear search: no index, unstructured data
- Binary search: no index, sorted data
- Primary index/hash key to retrieve a single record
- Primary index to retrieve multiple records (inequalities)
- Clustering index to retrieve multiple records (equality)
- Secondary index
For complex selection:
- Conjunctive selection using an index
  - Use the index for a simple condition, then check if the retrieved tuple satisfies the other conditions
- Conjunctive selection using a composite index (index exists for two or more attributes involved in the conditions)
- Conjunctive selection using the intersections of record pointers (see the sketch after this list)
  - Secondary indexes available for some of the attributes involved in the conditions, providing pointers to records (not blocks)
  - Get the record pointers from each index, and find the intersection between them
  - Then, check if the retrieved tuples satisfy the other conditions
- Disjunctive conditions (union)
- If there are no indexes, use brute force
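A minimal sketch of conjunctive selection using the intersection of record pointers, assuming each secondary index is modeled as a dict from attribute value to a set of record ids (all names here are illustrative):

```python
def conjunctive_select(records, indexes, indexed_conds, other_conds):
    """records: {record_id: row}
    indexes: {attr: {value: set of record ids}}  (secondary indexes)
    indexed_conds: {attr: value} equality conditions that have an index
    other_conds: predicates checked on the fetched records"""
    # Get the record-pointer set from each usable index and intersect them.
    pointer_sets = [indexes[attr].get(value, set())
                    for attr, value in indexed_conds.items()]
    candidates = set.intersection(*pointer_sets) if pointer_sets else set(records)

    # Fetch only the candidate records and verify the remaining conditions.
    return [records[rid] for rid in candidates
            if all(pred(records[rid]) for pred in other_conds)]

records = {1: {"dept": 5, "salary": 900}, 2: {"dept": 5, "salary": 400},
           3: {"dept": 3, "salary": 900}}
indexes = {"dept": {5: {1, 2}, 3: {3}}, "salary": {900: {1, 3}, 400: {2}}}
print(conjunctive_select(records, indexes,
                         {"dept": 5, "salary": 900},       # indexed equality conditions
                         [lambda r: r["salary"] > 500]))   # remaining condition
# -> [{'dept': 5, 'salary': 900}]
```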
Algorithms for JOIN
Cross join/Cartesian product is time consuming. Equi-join and natural joins are much more efficient. Join performance is affected by:
- Available buffer space
- Join selection factor - fraction of records in one file that will be joined with records in the other file
- Choice of inner/outer relation
If an index/hash key exists for one of the join attributes, a single-loop join can be performed: read all rows from the table without the index and use the index to directly retrieve matching records. If this is not available, a nested-loop join is required (brute force).
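A sketch of the two approaches, assuming the relations are lists of dicts and the index is a plain dict from join-key value to matching rows (illustrative names, not from the notes):

```python
from collections import defaultdict

def nested_loop_join(outer, inner, outer_key, inner_key):
    """Brute force: compare every outer row with every inner row."""
    return [{**o, **i} for o in outer for i in inner if o[outer_key] == i[inner_key]]

def single_loop_join(outer, inner_index, outer_key):
    """Single loop: scan the relation without an index once, and probe the
    index on the other relation to fetch only the matching rows."""
    return [{**o, **i} for o in outer for i in inner_index.get(o[outer_key], [])]

stars = [{"snumber": 1, "lname": "Gable"}, {"snumber": 2, "lname": "Leigh"}]
roles = [{"star": 1, "movie": 10}, {"star": 2, "movie": 10}, {"star": 1, "movie": 11}]

# Build an index on roles.star (in a DBMS this would already exist as a hash/B-tree index).
roles_by_star = defaultdict(list)
for r in roles:
    roles_by_star[r["star"]].append(r)

print(nested_loop_join(stars, roles, "snumber", "star") ==
      single_loop_join(stars, roles_by_star, "snumber"))  # True - same result, fewer comparisons
```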
Algorithms for SELECT
If the attribute list contains a candidate key of the relation, all that is required is removing the unwanted attributes (no duplicates can arise).
Otherwise, duplicates must be removed. This can be done through either sorting or hashing.
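A sketch of the hashing approach to duplicate elimination after projecting out the unwanted attributes (illustrative, assuming rows are dicts):

```python
def project(rows, attrs):
    """Keep only attrs; drop duplicate result tuples using a hash set."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[a] for a in attrs)   # the projected tuple, usable as a hash key
        if key not in seen:                  # duplicate check is O(1) per row
            seen.add(key)
            result.append(dict(zip(attrs, key)))
    return result

employees = [{"name": "amy", "dept": 5}, {"name": "bob", "dept": 5}, {"name": "amy", "dept": 3}]
print(project(employees, ["name"]))          # [{'name': 'amy'}, {'name': 'bob'}] - duplicate removed
print(project(employees, ["name", "dept"]))  # all three tuples are distinct, nothing removed
```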
Algorithms for Set Operations
Sort both relations on the same attributes, then scan and merge the sorted files concurrently, deciding whether each tuple should be kept by checking tuple equality.
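A minimal sketch of sort-merge intersection; union and difference follow the same two-cursor pattern with a different keep rule (assuming tuples are directly comparable, e.g. plain Python tuples):

```python
def sorted_intersection(r, s):
    """INTERSECT via sort-merge: sort both inputs, then advance two cursors,
    keeping a tuple only when both sides have an equal copy."""
    r, s = sorted(r), sorted(s)
    i = j = 0
    out = []
    while i < len(r) and j < len(s):
        if r[i] == s[j]:
            if not out or out[-1] != r[i]:   # also eliminates duplicates
                out.append(r[i])
            i += 1
            j += 1
        elif r[i] < s[j]:
            i += 1
        else:
            j += 1
    return out

print(sorted_intersection([("amy", 5), ("bob", 3)], [("bob", 3), ("cat", 1)]))
# -> [('bob', 3)]
```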
Algorithms for Aggregate Functions
Min, max:
- If an index exists, it can be used. For example, to find the maximum with an ascending index, the right-most pointer in each index node can be followed
- Otherwise, each row must be read
Sum, count, average:
- For a dense index, the computation can be done on the values in the index
- If an index does not exist or it is a non-dense index, each row must be read
Group by:
- Sorting or hashing is used to partition the files into appropriate groups
- The aggregate function can be done on tuples in each group
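A sketch of hash-based grouping followed by an aggregate per group (illustrative; a DBMS would use the same idea with hash partitions on disk when the groups do not fit in memory):

```python
from collections import defaultdict

def group_avg(rows, group_attr, value_attr):
    """GROUP BY group_attr, computing AVG(value_attr) for each group."""
    groups = defaultdict(list)
    for row in rows:                       # partition rows into groups by hashing the grouping key
        groups[row[group_attr]].append(row[value_attr])
    return {k: sum(v) / len(v) for k, v in groups.items()}   # aggregate within each group

employees = [{"dept": 5, "salary": 900}, {"dept": 5, "salary": 1100}, {"dept": 3, "salary": 700}]
print(group_avg(employees, "dept", "salary"))   # {5: 1000.0, 3: 700.0}
```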
Query Representation
Query tree:
- Data structure corresponding to a relational algebra expression
- Relations are leaf nodes
- Relational algebra operations are internal nodes
When execution of an internal node occurs, the node and its children are replaced with the resultant relation.
Canonical tree: standard tree corresponding to an SQL query without optimization. From this tree optimized alternatives can be generated.
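A minimal sketch of a query tree in which leaves are relations and internal nodes are relational algebra operations; evaluating an internal node conceptually replaces that subtree with its result relation (the class and operator names below are illustrative):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Relation:                 # leaf node: a base table (here, a list of dicts)
    rows: list
    def evaluate(self):
        return self.rows

@dataclass
class Select:                   # internal node: sigma(condition)
    condition: Callable
    child: object
    def evaluate(self):
        return [r for r in self.child.evaluate() if self.condition(r)]

@dataclass
class Project:                  # internal node: pi(attrs)
    attrs: list
    child: object
    def evaluate(self):
        return [{a: r[a] for a in self.attrs} for r in self.child.evaluate()]

movie = Relation([{"mnumber": 10, "title": "Gone with the wind"},
                  {"mnumber": 11, "title": "Casablanca"}])
tree = Project(["mnumber"], Select(lambda r: r["title"] == "Gone with the wind", movie))
print(tree.evaluate())          # [{'mnumber': 10}] - the relation that replaces the subtree
```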
Query Optimization
Heuristic Query Optimization
The parser generates an initial internal representation - the canonical tree. After this, heuristics should be used to optimize the internal representation.
The main heuristic is to apply operations that reduce the size of intermediate results:
- Apply SELECT/WHERE before JOINs or other binary operations
- Apply the more selective condition first (e.g. an equality condition on a unique column)
- Use joins instead of Cartesian products
To determine a plan's efficiency, we can do things such as:
- Use the data catalog to retrieve the number of tuples in the input tables and:
  - Calculate the number of tuples outputted by the cross join
  - Find the upper bound for the number of tuples returned by the join
  - Estimate/calculate the number of tuples returned by each condition in the WHERE clause
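As a worked example, using the standard estimation formulas and the 100-row tables from the running example ($R_1, R_2, R_3$ are placeholder relation names, $c$ a WHERE condition with selectivity $sl(c)$ as defined earlier):

$$ |R_1 \times R_2 \times R_3| = 100 \cdot 100 \cdot 100 = 10^6, \qquad |R_1 \Join R_2| \le |R_1| \cdot |R_2|, \qquad |\sigma_{c}(R)| \approx sl(c) \cdot |R| $$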
Some optimizations that can be done on the first example:
- The condition title = 'Gone with the wind' involves only a single table, so the select operation can be done before the Cartesian product. If each table has 100 rows, then the resulting table will only have 100,000 rows
- Only the movie ID is needed from the movie table, so a project operation can be done to keep only the movie ID before executing the Cartesian product
- Use a join instead of a Cartesian product and select clause: $$ \pi_{\text{lname, fname}}(\text{star} \Join_{\text{snumber = star}} \text{stars} \Join_{\text{movie = mnumber}} \pi_{\text{movie\_id}}(\sigma_{\text{title = 'Gone with the wind'}}(\text{movie}))) $$
Query Execution Plan
Query tree + information about access methods to be used for each relation + methods used to compute relational operators stored in the tree.
Materialized evaluation: when a temporary relation (result of an operation) is made. For example, SELECT * FROM employee WHERE salary = (SELECT MAX(salary) FROM employee) will store the result of the nested query rather than recomputing it for each row.
Pipelined evaluation: when the result of an operation is forwarded to the next operator in the sequence.
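A sketch of the difference, using Python generators to mimic pipelining (names are illustrative):

```python
def scan(rows):
    for row in rows:              # produces one tuple at a time
        yield row

def select(rows, pred):
    for row in rows:              # consumes tuples as they arrive - no temporary relation
        if pred(row):
            yield row

employees = [{"name": "amy", "salary": 900}, {"name": "bob", "salary": 1500}]

# Materialized: the intermediate result is built in full before the next operator runs.
high_paid = list(select(scan(employees), lambda r: r["salary"] > 1000))   # temporary relation
names_materialized = [r["name"] for r in high_paid]

# Pipelined: each operator pulls tuples from the one below it, one at a time.
names_pipelined = [r["name"] for r in select(scan(employees), lambda r: r["salary"] > 1000)]

print(names_materialized == names_pipelined)   # True - same result, different evaluation
```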
Query Optimization in Oracle
Earlier versions used rule-based (heuristic) query optimization, where the optimizer chooses execution plans based on heuristically ranked operations.
With cost-based query optimization, the optimizer examines alternative access paths and operator algorithms, available indexes, past-access statistics etc. and chooses the plan with the lowest estimated cost (estimated IO, CPU, memory requirements etc.).
To view the cost of each operation in a query, use the EXPLAIN PLAN statement: EXPLAIN PLAN SET STATEMENT_ID = 'some_identifier' FOR some_statement, then SELECT * FROM plan_table to view the estimated CPU, IO, time, temp space etc.