Query Optimization
Last updated
Last updated
Query Plan is a tree, with relational algebra operators as nodes and access paths as leaves. Each operator labeled with a choice of algorithm.
Query optimization steps:
Query first broken into "blocks"
Each block converted to relational algebra
Then, for each block, several alternative query plans are considered
Cross product is costly. We can convert cross product into natural join. We'd better reduce the amount of tuples needed in the join.
We can also "push down" projection:
Plan with the lowest estimated cost is selected
For each plan considered, must estimate cost:
Must estimate size of result for each operation in tree
Must estimate cost of each operation in plan tree
Depends on input cardinalities
To decide on the cost, the optimizer needs information about the relations and indices involved. This information is stored in the system catalogs.
Catalogs typically contain at least:
and
Distinct key values for each index
Low/high values for each tree index
Index height for each tree index
Index pages for each index
Statistics in catalogs are updated periodically.
Reduction factor associated with each predicate reflects the impact of the predicate in reducing the result size. RF is also called selectivity.
Single table selection:
Joins over tables:
Col = value
Col > value
Col < value
Col_A=Col_B (for joins)
No information
Two main cases:
Single-relation plans
Multiple-relation plans (joins)
For each available access path (file scan/ index) is considered, and the one with the lowest estimated cost is chosen.
Heap scan is always one alternative
Each index can be another alternative (if matching selection predicates)
Sequential scan of data file: Cost=
Index selection over a primary key (just a single tuple):
Cost(B+ Tree) =
Cost(Hash Index) = ProbeCost(I) ~ 1.2
Clustered index matching one or more predicates:
Cost(B+ Tree) =
Cost(Hash Index) =
Non-clustered index matching one or more predicates
Cost(B+ Tree) =
Cost(Hash Index) =
Steps:
Select order of relations
Maximum possible orderings =
For each join, select join algorithm
For each input relation, select access method
As number of joins increases, number of alternative plans grows rapidly. We need to restrict search space.
Fundamental decision in System R: only left-deep join trees are considered
The outcome of a previous join is immediately fed into the next join without writing to the disk. Hence, the cost of reading outer pages for the next join is discarded.
Note that only the first reading cost is discarded. For hash join, however, it writes the partition to the disk and read the disk again. This part of reading cost cannot be pruned. The cost is
Cross products is instantly pruned.