3. Query processing and Optimization
Parsing checks the query syntax to determine whether it is
formulated according to the syntax rules (rules of grammar) of
the query language.
The scanner identifies the query tokens, such as SQL keywords,
attribute names, and relation names, that appear in the text of
the query.
Validation checks that all attribute and relation names are
valid and semantically meaningful names in the schema of the
particular database being queried.
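The scanning and validation steps can be sketched in a few lines of Python. This is only an illustration: the `SCHEMA` dictionary, the keyword list, and the function names `scan` and `validate` are assumptions made for the sketch, and the grammar check performed by a real parser is omitted.

```python
import re

# Hypothetical schema (relation -> attributes), assumed for this sketch.
SCHEMA = {"catalog": {"authorid", "price"}, "author": {"authorid", "country"}}
KEYWORDS = {"select", "from", "where", "and", "or"}

# One token per match: number, quoted string, word, or operator/punctuation.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|('[^']*')|([A-Za-z_]\w*)|(<=|>=|<>|[=<>*,.()]))")

def scan(query):
    """Scanner: split the query text into (kind, text) tokens."""
    tokens, pos = [], 0
    query = query.rstrip()
    while pos < len(query):
        m = TOKEN_RE.match(query, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        number, string, word, symbol = m.groups()
        if number is not None:
            tokens.append(("NUMBER", number))
        elif string is not None:
            tokens.append(("STRING", string))
        elif word is not None:
            kind = "KEYWORD" if word.lower() in KEYWORDS else "NAME"
            tokens.append((kind, word))
        else:
            tokens.append(("SYMBOL", symbol))
        pos = m.end()
    return tokens

def validate(tokens):
    """Validation: return names that are neither relations nor attributes in SCHEMA."""
    known = set(SCHEMA) | set().union(*SCHEMA.values())
    return [text for kind, text in tokens
            if kind == "NAME" and text.lower() not in known]

tokens = scan("SELECT price FROM catalog WHERE price > 200")
unknown = validate(scan("SELECT colour FROM catalog"))
```

Here `tokens` holds the classified tokens of a valid query, while `unknown` lists the name `colour`, which does not exist in the assumed schema and would cause the query to be rejected.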
4. Query processing
▪ What is Query Processing?
• The steps required to transform a high-level SQL query into a
correct and "efficient" strategy for execution and retrieval.
• Processing can be divided into: Decomposition,
Optimization, Code generation, and Execution
1. Query Decomposition
• It is the process of transforming a high-level query
into a relational algebra query and checking that
the query is syntactically and semantically correct. It
consists of parsing and validation.
5. Typical stages in query decomposition are:
i. Analysis: lexical and syntactic analysis of the query for
correctness, based on attributes, data types, and so on. A query
tree is built for the query, containing a leaf node for each base
relation, one or more non-leaf nodes for relations produced
by relational algebra operations, and a root node for the result of
the query. Operations are executed from the leaves to the
root.
(SELECT * FROM Catalog c, Author a WHERE a.authorid =
c.authorid AND c.price > 200 AND a.country = 'USA')
ii. Normalization: convert the query into a normalized form.
The WHERE predicate is converted to Conjunctive (a conjunction,
∧, of disjunctions) or Disjunctive (a disjunction, ∨, of
conjunctions) Normal form.
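As a small worked illustration of the two normal forms, consider a hypothetical predicate over the Catalog and Author relations. The value 'UK' and the predicate itself are assumptions made for this example, not taken from the slides.

```latex
% Hypothetical WHERE predicate p, already in conjunctive normal form:
% a conjunction of two clauses, the second of which is a disjunction.
p = (\mathit{price} > 200) \land
    \bigl(\mathit{country} = \text{'USA'} \lor \mathit{country} = \text{'UK'}\bigr)

% Distributing \land over \lor gives the equivalent disjunctive normal form:
p \equiv \bigl(\mathit{price} > 200 \land \mathit{country} = \text{'USA'}\bigr) \lor
         \bigl(\mathit{price} > 200 \land \mathit{country} = \text{'UK'}\bigr)
```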
6. iii. Semantic Analysis: reject normalized queries that
are incorrectly formulated or contradictory. A query is incorrectly
formulated if some of its components do not contribute to generating
the result. It is contradictory if its predicate cannot be satisfied
by any tuple. For example, (Catalog = 'BS' ∧ Catalog = 'CS') is
contradictory, since a given book can be classified in only one
category at a time.
iv. Simplification: detect redundant qualifications,
eliminate common sub-expressions, and transform the
query into a semantically equivalent but more easily and
efficiently computed form. Access restrictions are also considered
here: for example, if a user doesn't have the necessary access to
all of the objects of the query, the query should be rejected.
7. 2. Query Optimization
What is Query Optimization?
– The activity of choosing a single "efficient" execution
strategy (from hundreds) as determined by database
catalog statistics.
– Which relational algebra expression, equivalent to the
given query, will lead to the most efficient execution
plan?
– For each algebraic operator, what algorithm (of several
available) do we use to compute that operator?
– How do operations pass data (main memory buffer,
disk buffer, …)?
8. ▪ Everyone wants the performance of their database to be optimal. In particular,
there is often a requirement for a specific query, or an object based on a query, to
run faster.
▪ The problem of query optimization is to find the sequence of steps that produces
the answer to a user request in the most efficient manner, given the database
structure.
▪ The performance of a query is affected by the tables or queries that underlie
the query and by the complexity of the query.
▪ Given a request for data manipulation or retrieval, an optimizer will choose an
optimal plan for evaluating the request from among the many alternative
strategies, i.e. there are many ways (access paths) of accessing the desired
file or record.
▪ Hence, the DBMS is responsible for picking the best execution strategy based on
various considerations (for example, the least amount of I/O and CPU resources).
9. …continued
• A query typically has many possible execution
strategies, and the process of choosing a suitable
one for processing a query is known as query
optimization.
• The chosen plan is not necessarily the optimal (or absolute
best) strategy; it is just a reasonably efficient strategy for
executing the query.
10. …continued
• There are two main techniques that are employed
during query optimization.
• The first technique is based on heuristic rules for
ordering the operations in a query execution strategy. A
heuristic is a rule that works well in most cases but is
not guaranteed to work well in every case. The rules
typically reorder the operations in a query tree.
• The second technique involves systematically
estimating the cost of different execution strategies and
choosing the execution plan with the lowest cost
estimate. These techniques are usually combined in a
query optimizer.
11. …continued
▪ Example: Consider relations r(AB) and s(CD). We
require r × s.
▪ Method 1:
a. Load the next record of r into RAM.
b. Load all records of s, one at a time, and
concatenate each with the current record of r.
c. All records of r concatenated?
▪ NO: go to a.
▪ YES: exit (the result is in RAM or on disk).
▪ Performance: Too many accesses.
12. …continued
▪ Method 2: Improvement
a. Load as many blocks of r as possible, leaving
room for one block of s.
b. Run through the s file completely, one block
at a time.
▪ Performance: Reduces the number of times s blocks are
loaded by a factor equal to the number of r records that
can fit in main memory.
▪ Considerations during query optimization:
– Narrow down intermediate result sets
quickly: SELECT and PROJECT before
JOIN.
– Use access structures (indexes).
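The difference between Method 1 and Method 2 for r × s can be made concrete by counting how often a block of s is loaded. The sketch below is a simplified model, not a DBMS implementation: the function names, the toy data, and the memory capacity of three r records are all assumptions.

```python
def product_record_at_a_time(r, s_blocks):
    """Method 1: for every record of r, read the whole of s again."""
    result, s_block_reads = [], 0
    for r_rec in r:
        for block in s_blocks:
            s_block_reads += 1          # one simulated disk read of an s block
            for s_rec in block:
                result.append(r_rec + s_rec)
    return result, s_block_reads

def product_chunked(r, s_blocks, r_records_in_memory):
    """Method 2: keep a chunk of r in memory and scan s once per chunk."""
    result, s_block_reads = [], 0
    for start in range(0, len(r), r_records_in_memory):
        chunk = r[start:start + r_records_in_memory]
        for block in s_blocks:
            s_block_reads += 1
            for r_rec in chunk:
                for s_rec in block:
                    result.append(r_rec + s_rec)
    return result, s_block_reads

# Toy relations r(A, B) and s(C, D); s is stored in two blocks of two records.
r = [(i, f"b{i}") for i in range(6)]
s_blocks = [[(10, "d0"), (11, "d1")], [(12, "d2"), (13, "d3")]]

res1, reads1 = product_record_at_a_time(r, s_blocks)
res2, reads2 = product_chunked(r, s_blocks, r_records_in_memory=3)
```

With six records in r and room for three of them in memory, the number of s block loads drops from 12 to 4, a factor of three, while both methods produce the same 24 result tuples.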
13. Using Heuristics in Query Optimization
• In practice, SQL is the query language that is
used in most commercial RDBMSs. An SQL
query is first translated into an equivalent
extended relational algebra expression,
represented as a query tree data structure,
that is then optimized.
• Typically, SQL queries are decomposed into
query blocks, which form the basic units that
can be translated into the algebraic operators
and optimized.
15. Transformation rules for relational algebra with examples
1. Cascade of SELECTION
Rule: Multiple SELECTION operations can be combined into a single
SELECTION operation with a conjunctive condition, and vice versa.
Example:
• Initial Query: σ age>30 (σ salary>50000 (EMPLOYEE))
• Optimized Query: σ salary>50000 AND age>30 (EMPLOYEE)
Explanation: Instead of first selecting employees with a salary
greater than 50,000 and then selecting those older than 30, you can
combine these conditions into one SELECTION.
2. Commutativity of SELECTION
Rule: The order of SELECTION operations can be interchanged
without affecting the result.
Example:
• Initial Query: σ age>30 (σ department='HR' (EMPLOYEE))
• Equivalent Query: σ department='HR' (σ age>30 (EMPLOYEE))
Explanation: Whether you first select employees older than 30 or
those in the HR department, the final result will be the same.
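Rules 1 and 2 can be checked on a toy relation. The sketch below models a relation as a list of dictionaries; the sample employees, the attribute names, and the helper `select` are assumptions made for illustration.

```python
# Hypothetical sample relation; names and values are invented for this sketch.
EMPLOYEE = [
    {"name": "Ann", "age": 45, "salary": 72000, "department": "HR"},
    {"name": "Bob", "age": 28, "salary": 65000, "department": "HR"},
    {"name": "Cid", "age": 39, "salary": 41000, "department": "IT"},
    {"name": "Dee", "age": 52, "salary": 58000, "department": "IT"},
]

def select(predicate, relation):
    """SELECTION: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]

older = lambda t: t["age"] > 30
well_paid = lambda t: t["salary"] > 50000
in_hr = lambda t: t["department"] == "HR"

# Rule 1: a cascade of selections equals one selection with a conjunction.
cascaded = select(older, select(well_paid, EMPLOYEE))
combined = select(lambda t: well_paid(t) and older(t), EMPLOYEE)

# Rule 2: the order of two selections does not matter.
hr_then_older = select(older, select(in_hr, EMPLOYEE))
older_then_hr = select(in_hr, select(older, EMPLOYEE))
```

Both pairs of expressions produce the same tuples; the combined form of rule 1 scans the relation only once.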
16. Transformation rules for relational algebra with examples…
3. Cascade of PROJECTION
Rule: In a sequence of PROJECTION operations, only the
last one is necessary.
Example:
• Initial Query: π name, age (π name, age, salary (EMPLOYEE))
• Optimized Query: π name, age (EMPLOYEE)
Explanation: If you first project the attributes name, age, and
salary, and then project only name and age, you can directly
project name and age from the start.
4. Commutativity of SELECTION with PROJECTION
Rule: SELECTION and PROJECTION operations can be interchanged if the
SELECTION predicate involves only the attributes in the PROJECTION list.
Example:
• Initial Query: σ age>30 (π name, age (EMPLOYEE))
• Equivalent Query: π name, age (σ age>30 (EMPLOYEE))
Explanation: If you first project the attributes name and age and then
select employees older than 30, or if you first select employees older
than 30 and then project name and age, the result will be the same.
17. Transformation rules for relational algebra with examples…
5. Commutativity of THETA JOIN/Cartesian Product
Rule: The THETA JOIN (⋈) and Cartesian Product (×)
operations are commutative, meaning the order of
the relations can be swapped without affecting the
result.
Example:
• Initial Query:
R × S
• Equivalent Query:
S × R
Explanation: Whether you join R with S or S with R,
the result will be the same set of tuples (only the order
of the attributes differs).
18. Transformation rules for relational algebra with examples…
6. Commutativity of SELECTION with THETA JOIN
Rule: If the SELECTION predicate involves only attributes of one of the
relations being joined, the SELECTION and JOIN operations can be
interchanged.
Case a: SELECTION Predicate Involves Only Attributes of One Relation
Example:
• Initial Query: σ c1 (R ⋈ S)
• Equivalent Query: (σ c1 (R)) ⋈ S
Explanation: If the predicate c1 involves only attributes of R, you can
apply the SELECTION to R before performing the join.
Case b: SELECTION Predicate Involves Attributes of Both Relations
Example:
• Initial Query: σ c1 AND c2 (R ⋈ S)
• Equivalent Query: (σ c1 (R)) ⋈ (σ c2 (S))
Explanation: If c1 involves only attributes of R and c2 involves
only attributes of S, you can first select the tuples from R that
satisfy c1 and the tuples from S that satisfy c2, and then join the
results.
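Both cases of rule 6 can be checked on small relations R(a, b) and S(c, d). The data, the join condition a = c, and the helper functions below are assumptions chosen only to illustrate the equivalence.

```python
# Hypothetical relations R(a, b) and S(c, d).
R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]
S = [{"c": 1, "d": "x"}, {"c": 2, "d": "y"}, {"c": 2, "d": "z"}]

def select(predicate, relation):
    """SELECTION: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]

def theta_join(left, right, theta):
    """THETA JOIN: concatenate the pairs of tuples that satisfy theta."""
    return [{**l, **r} for l in left for r in right if theta(l, r)]

theta = lambda l, r: l["a"] == r["c"]   # join condition
c1 = lambda t: t["b"] >= 20             # involves only attributes of R
c2 = lambda t: t["d"] != "y"            # involves only attributes of S

# Case a: select after the join versus select on R before the join.
late_a = select(c1, theta_join(R, S, theta))
early_a = theta_join(select(c1, R), S, theta)

# Case b: a conjunctive predicate split between R and S.
late_b = select(lambda t: c1(t) and c2(t), theta_join(R, S, theta))
early_b = theta_join(select(c1, R), select(c2, S), theta)
```

In both cases the early and late variants return the same tuples, but the early variants feed fewer tuples into the join.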
19. Transformation rules for relational algebra with examples…
7. Commutativity of PROJECTION and THETA JOIN
Rule: If the projection list is of the form
L1, L2, where L1 involves only attributes of R and L2 involves
only attributes of S being joined, and the predicate θ involves
only attributes in the projection list, then the PROJECTION can be
applied to each relation before the join.
Example:
• Initial Query: π L1, L2 (R ⋈θ S)
• Optimized Query: (π L1 (R)) ⋈θ (π L2 (S))
Explanation: Instead of projecting the attributes after the join,
you can project the relevant attributes from each relation
before performing the join.
20. Transformation rules for relational algebra with examples…
8. Commutativity of the Set Operations: UNION and
INTERSECTION but not SET DIFFERENCE
Rule: UNION and INTERSECTION operations are commutative, but
SET DIFFERENCE is not.
Example:
• Initial Query: R ∪ S (or R ∩ S)
• Optimized Query: S ∪ R (or S ∩ R)
Explanation: The order of the operands of UNION and INTERSECTION
does not affect the result, whereas R − S is in general not equal
to S − R.
9. Associativity of the THETA JOIN, CARTESIAN PRODUCT, UNION, and
INTERSECTION
Rule: These operations are associative:
(R op S) op T ≡ R op (S op T), where op is the same operation throughout.
Explanation: The way in which the operands of a sequence of JOIN,
CARTESIAN PRODUCT, UNION, or INTERSECTION operations are grouped
does not affect the final result.
21. Transformation rules for relational algebra with examples…
10. Commuting SELECTION with SET OPERATIONS
Rule: SELECTION operations can commute with UNION and
INTERSECTION.
Example: σ c (R ∪ S) ≡ (σ c (R)) ∪ (σ c (S))
Explanation: Instead of applying the SELECTION after the UNION,
you can apply the SELECTION to each relation before performing
the UNION.
22. Transformation rules for relational algebra with examples…
11. Commuting PROJECTION with UNION
Rule: PROJECTION operations can commute with UNION.
Example: π L (R ∪ S) ≡ (π L (R)) ∪ (π L (S))
Explanation: Instead of projecting the attributes after the UNION,
you can project the relevant attributes from each relation before
performing the UNION.
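The set-operation rules (8, 10 and 11) can be checked with Python sets, which give the duplicate-free semantics of relational algebra. The relations, the shared `HEADER`, and the helper functions below are assumptions made for this sketch.

```python
# Two union-compatible relations over the same hypothetical header.
HEADER = ("name", "age", "salary")
R = {("Ann", 45, 72000), ("Bob", 28, 65000)}
S = {("Bob", 28, 65000), ("Cid", 39, 41000)}

def select(predicate, relation):
    """SELECTION on a set of rows."""
    return {row for row in relation if predicate(row)}

def project(attrs, relation):
    """PROJECTION with duplicate elimination (the result is a set)."""
    idx = [HEADER.index(a) for a in attrs]
    return {tuple(row[i] for i in idx) for row in relation}

older = lambda row: row[HEADER.index("age")] > 30

# Rule 8: UNION and INTERSECTION commute, SET DIFFERENCE does not.
union_commutes = (R | S) == (S | R)
intersection_commutes = (R & S) == (S & R)
difference_commutes = (R - S) == (S - R)

# Rule 10: SELECTION commutes with UNION.
select_after = select(older, R | S)
select_before = select(older, R) | select(older, S)

# Rule 11: PROJECTION commutes with UNION.
project_after = project(["name", "age"], R | S)
project_before = project(["name", "age"], R) | project(["name", "age"], S)
```

On this data `difference_commutes` is false, because R − S contains only Ann's row while S − R contains only Cid's row; the other comparisons hold.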
24. Using Heuristics
Heuristic optimization in query processing
involves using rule-based techniques to
transform a query into a more efficient form.
Here's a detailed explanation of the process:
Process for heuristics optimization
1. Initial Internal Representation:
▪ When a high-level query (like SQL) is
submitted, the parser translates it
into an initial internal representation,
often in the form of a relational
algebra tree. This tree represents the
logical steps needed to execute the
query.
25. Using Heuristics…
2. Applying Heuristic Rules:
o Heuristic rules are applied to this internal
representation to optimize it. These rules are
based on general principles that typically lead
to more efficient query execution. Some
common heuristic rules include:
▪ Selection Pushdown: Moving selection
operations as close to the base relations as
possible to reduce the size of intermediate
results.
▪ Projection Pushdown: Moving projection
operations down the query tree to
eliminate unnecessary columns early.
▪ Join Reordering: Reordering join
operations to minimize the size of
intermediate results.
26. Using Heuristics…
3. Generating a Query Execution Plan:
▪ After applying heuristic rules, the optimized
internal representation is used to generate
a query execution plan. This plan outlines
the specific steps and methods the DBMS
will use to execute the query.
▪ The execution plan considers the access
paths available, such as indexes and
sequential scans, to determine the most
efficient way to retrieve and process the
data.
▪ The plan may include operations like index
scans, nested loop joins, hash joins, and
sort-merge joins, depending on the
available access paths and the structure of …
27. Using Heuristics…
▪ The main heuristic is to apply first the operations that reduce
the size of intermediate results.
– E.g. apply SELECT and PROJECT operations
before applying the JOIN or other binary operations.
Intermediate results in the context of database
query processing are the temporary data sets
produced during the execution of a query before
arriving at the final result. Intermediate results are
not stored permanently in the database. They exist
only for the duration of the query execution and are
discarded once the final result is produced.
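The effect of this heuristic on intermediate result sizes can be illustrated with the Catalog/Author query from the decomposition example. The rows below are invented for the sketch, and the attribute names are prefixed (`c_authorid`, `a_authorid`) only to keep them distinct inside the Cartesian product.

```python
# Hypothetical sample data: 10 catalog rows and 5 author rows.
catalog = [{"title": f"Book {i}", "c_authorid": i % 5, "price": 100 + 30 * i}
           for i in range(10)]
author = [{"a_authorid": i, "country": "USA" if i % 2 == 0 else "UK"}
          for i in range(5)]

# Plan A: Cartesian product first, then apply the whole WHERE predicate.
product = [{**c, **a} for c in catalog for a in author]
plan_a = [t for t in product
          if t["c_authorid"] == t["a_authorid"]
          and t["price"] > 200 and t["country"] == "USA"]

# Plan B: apply the selections first, then join the reduced relations.
expensive = [c for c in catalog if c["price"] > 200]
usa = [a for a in author if a["country"] == "USA"]
plan_b = [{**c, **a} for c in expensive for a in usa
          if c["c_authorid"] == a["a_authorid"]]

intermediate_a = len(product)                     # tuples built before filtering
intermediate_b = max(len(expensive), len(usa))    # largest input to the join
```

Both plans return the same four tuples, but plan A materializes a 50-tuple intermediate result while plan B never handles more than 6 tuples on either side of the join.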
28. …continued
• The heuristic approach uses knowledge of the
characteristics of the relational algebra operations and
of the relationships between the operators to optimize the
query.
• Thus the heuristic approach to optimization makes
use of:
– Properties of individual operators
– Association between operators
– Query tree: a graphical representation of the operators,
relations, attributes, and predicates, and of the processing
sequence during query processing.
• It is composed of three main parts: leaves, internal nodes, and a root.
– The sequence of execution of operations in a query tree runs
from the leaves to the root.
29. …continued
▪ Query block: The basic unit that can be translated
into the algebraic operators and optimized.
▪ A query block contains a single SELECT-FROM-WHERE
expression, as well as GROUP BY and
HAVING clauses if these are part of the block.
▪ Nested queries within a query are identified as
separate query blocks.
▪ There are two types of nested queries:
30. Uncorrelated Nested Queries
Uncorrelated nested queries can be
executed separately, and their results are
then used in the outer query.
SELECT name
FROM employees
WHERE department_id IN (SELECT department_id
FROM departments WHERE location = 'New York');
In this example, the inner query (SELECT
department_id FROM departments WHERE location
= 'New York') is executed first, and its result is used
by the outer query to filter employees.
31. Correlated Nested Queries
• Correlated nested queries need
information (a tuple variable) from the outer
query in their execution.
SELECT name
FROM employees e
WHERE salary > (SELECT AVG(salary) FROM
employees WHERE department_id =
e.department_id);
In this example, the inner query (SELECT
AVG(salary) FROM employees WHERE
department_id = e.department_id) depends on the
department_id of each row in the outer query.
Therefore, the inner query is executed for each
employee to compare their salary with the average
salary of their department.
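Both kinds of nested query can be tried on an in-memory SQLite database using Python's standard `sqlite3` module. The table contents below are invented for the sketch; only the two SELECT statements come from the examples above (with an ORDER BY added to make the output order predictable).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data for the employees/departments example.
conn.executescript("""
CREATE TABLE departments (department_id INTEGER PRIMARY KEY, location TEXT);
CREATE TABLE employees (name TEXT, salary REAL, department_id INTEGER);
INSERT INTO departments VALUES (1, 'New York'), (2, 'Boston');
INSERT INTO employees VALUES
  ('Ann', 90, 1), ('Bob', 60, 1), ('Cid', 70, 2), ('Dee', 50, 2);
""")

# Uncorrelated: the inner query does not refer to the outer query,
# so its result can be computed once and reused.
uncorrelated = conn.execute("""
    SELECT name FROM employees
    WHERE department_id IN (SELECT department_id
                            FROM departments WHERE location = 'New York')
    ORDER BY name
""").fetchall()

# Correlated: the inner query refers to e.department_id of the outer row.
correlated = conn.execute("""
    SELECT name FROM employees e
    WHERE salary > (SELECT AVG(salary) FROM employees
                    WHERE department_id = e.department_id)
    ORDER BY name
""").fetchall()
conn.close()
```

With this data the uncorrelated query returns Ann and Bob (the New York department), and the correlated query returns Ann and Cid, the employees who earn more than their own department's average.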
32. • Query tree:
– A tree data structure that corresponds to a relational
algebra expression. It represents the input relations
of the query as leaf nodes of the tree, and represents
the relational algebra operations as internal nodes.
– Leaves: the base relations used for processing
the query and extracting the required information.
– Root: the final result (output relation) produced
by the operations on the relations used
for query processing.
– Nodes: intermediate results or relations
produced before reaching the final result.
• An execution of the query tree consists of executing an
internal node operation whenever its operands are
available and then replacing that internal node by the
relation that results from executing the operation.
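A query tree and its bottom-up execution can be sketched as a small recursive data structure. The `Node` class, the `execute` function, and the sample PROJECT and DEPARTMENT rows below are assumptions made for illustration, not part of any DBMS.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    """Query tree node: leaves hold base relations, internal nodes hold operations."""
    label: str
    operation: Optional[Callable] = None          # None marks a leaf
    children: List["Node"] = field(default_factory=list)
    relation: Optional[list] = None               # base relation of a leaf

def execute(node, trace):
    """Evaluate operands first, then the node itself (leaves to root)."""
    if node.operation is None:
        return node.relation
    operands = [execute(child, trace) for child in node.children]
    trace.append(node.label)                      # record the execution order
    return node.operation(*operands)

# Hypothetical sample rows.
PROJECT = [{"PNUMBER": 10, "PLOCATION": "Stafford", "DNUM": 4},
           {"PNUMBER": 20, "PLOCATION": "Houston", "DNUM": 5},
           {"PNUMBER": 30, "PLOCATION": "Stafford", "DNUM": 4}]
DEPARTMENT = [{"DNUMBER": 4, "MGRSSN": "987"}, {"DNUMBER": 5, "MGRSSN": "333"}]

tree = Node(
    "project PNUMBER, MGRSSN",
    lambda rel: [{"PNUMBER": t["PNUMBER"], "MGRSSN": t["MGRSSN"]} for t in rel],
    [Node(
        "join DNUM=DNUMBER",
        lambda l, r: [{**a, **b} for a in l for b in r if a["DNUM"] == b["DNUMBER"]],
        [Node("select PLOCATION='Stafford'",
              lambda rel: [t for t in rel if t["PLOCATION"] == "Stafford"],
              [Node("PROJECT", relation=PROJECT)]),
         Node("DEPARTMENT", relation=DEPARTMENT)])])

trace = []
result = execute(tree, trace)
```

After the call, `trace` lists the operations in the order they ran (selection, then join, then projection), and `result` is the relation at the root.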
33. Query graph
• A query graph is a visual representation used in
database theory to illustrate a relational calculus
expression. Here's a breakdown of the key points:
▪ Graph Data Structure: The query graph is a type of
graph that visually represents the relationships and
constraints of a query.
▪ Relational Calculus Expression: It corresponds to a
relational calculus expression; relational calculus is a
non-procedural query language used to specify what
data to retrieve rather than how to retrieve it.
▪ No Operation Order: The graph does not specify
the order in which operations should be performed.
It simply shows the relationships and constraints.
▪ Uniqueness: Each query has a unique
corresponding graph, meaning there is only one
graph for each query.
34. …continued
▪ Example:
• For every project located in 'Stafford', retrieve the project number, the
controlling department number, and the department manager's last
name, address, and birthdate.
▪ Relational algebra:
π PNUMBER, DNUM, LNAME, ADDRESS, BDATE (((σ PLOCATION='STAFFORD' (PROJECT))
⋈ DNUM=DNUMBER (DEPARTMENT)) ⋈ MGRSSN=SSN (EMPLOYEE))
▪ SQL query:
SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS,
E.BDATE FROM PROJECT AS P, DEPARTMENT AS D,
EMPLOYEE AS E WHERE P.DNUM=D.DNUMBER AND
D.MGRSSN=E.SSN AND P.PLOCATION='STAFFORD';
37. …cont
Step 1. Perform selection operations as early as
possible: Applying selection operations at early
stages reduces the number of unwanted records
that have to be transferred from the database to
primary memory. The optimizer uses transformation
rule 1 to divide selection operations with
conjunctive conditions into a cascade of selection
operations.
38. …cont
Step 2. Use the commutativity of selection
with other operations as early as possible: The optimizer
uses transformation rules 2, 4, 6, and 9 to move
selection operations as far down the tree as possible
and to keep selection predicates on the same relation
together. Keeping selection operations low in the
tree reduces unwanted data transfer, and
keeping selection predicates on the same
relation together reduces the number of times the
same database table has to be accessed to
retrieve records.
39. …cont
Step 3. Combine a Cartesian product with a subsequent
selection operation whose predicate represents a join
condition into a JOIN operation: The optimizer uses
transformation rule 13 to convert a selection and
Cartesian product sequence into a join. This reduces data
transfer. It is always better to transfer only the required data
from the database than to transfer all the data and
then refine it. (A Cartesian product combines all data of all
the tables mentioned in the query, while a join operation retrieves
only those records from the database that satisfy the join
condition.)
Step 4. Use commutativity and associativity of binary
operations: The optimizer uses transformation rules 5, 11, and
12 to execute the most restrictive selection operations
first.
40. Step 5. Perform projection operations as early as possible:
After performing selection operations, the optimizer uses
transformation rules 3, 4, 7 and 10 to reduce the number
of columns of a relation by moving projection operations
as far down the tree as possible and keeping projection
attributes on the same relation together.
Step 6. Compute common expressions only once: Identify
sub-trees that represent groups of operations
that can be executed by a single algorithm.
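Steps 1 to 3 can be sketched for the 'Stafford' query shown earlier. In this simplified model each conjunct of the WHERE clause is written by hand as a pair of its text and the relation aliases it references; the function name `plan_selections` and this representation are assumptions for illustration, not how an optimizer stores predicates.

```python
# Conjuncts of the WHERE clause of the 'Stafford' query, each with the
# set of relation aliases it references (written out by hand here).
conjuncts = [
    ("P.DNUM = D.DNUMBER", {"P", "D"}),
    ("D.MGRSSN = E.SSN", {"D", "E"}),
    ("P.PLOCATION = 'STAFFORD'", {"P"}),
]

def plan_selections(conjuncts):
    """Classify conjuncts after splitting the conjunctive predicate (step 1).

    Single-relation conjuncts become selections pushed down to their base
    relation (step 2); two-relation conjuncts become join conditions instead
    of selections over a Cartesian product (step 3).
    """
    pushed_down, join_conditions = {}, []
    for text, relations in conjuncts:
        if len(relations) == 1:
            (relation,) = relations
            pushed_down.setdefault(relation, []).append(text)
        else:
            join_conditions.append((tuple(sorted(relations)), text))
    return pushed_down, join_conditions

pushed_down, join_conditions = plan_selections(conjuncts)
```

For this query the selection on PLOCATION is attached to PROJECT, while the two remaining conjuncts are kept as join conditions between PROJECT and DEPARTMENT and between DEPARTMENT and EMPLOYEE.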
41. • Heuristic Optimization of Query Trees:
– The same query could correspond to many
different relational algebra expressions, and
hence many different query trees.
– The task of heuristic optimization of query trees
is to find a final query tree that is efficient to
execute.
• Example:
Q2: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = 'AQUARIUS' AND
PNUMBER=PNO AND ESSN=SSN AND BDATE
42. (a) Initial (canonical) query tree for SQL query Q2.
Executing this tree directly first creates a very large file
containing the CARTESIAN PRODUCT of the entire
EMPLOYEE, WORKS_ON, and PROJECT files.
(b) Moving SELECT operations down the query tree.
An improved query tree that first applies the SELECT
operations to reduce the number of tuples that appear in
the CARTESIAN PRODUCT.
(c) Applying the more restrictive SELECT operation first.
A further improvement is achieved by switching the positions
of the EMPLOYEE and PROJECT relations in the tree, as shown in
(c). This uses the information that Pnumber is a key attribute
of the PROJECT relation, and hence the SELECT operation on the
PROJECT relation will retrieve a single record only.
43. (d) Replacing CARTESIAN PRODUCT and SELECT with JOIN operations.
We can further improve the query tree by replacing any
CARTESIAN PRODUCT operation that is followed by a
join condition with a JOIN operation.
(e) Moving PROJECT operations down the query tree.
Another improvement is to keep only the attributes needed by
subsequent operations in the intermediate relations, by
including PROJECT (π) operations as early as possible in the
query tree, as shown in (e). This reduces the attributes
(columns) of the intermediate relations.
44. Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations that reduce the size
of intermediate results.
2. Perform select operations as early as possible to reduce the number of
tuples and perform project operations as early as possible to reduce the
number of attributes. (This is done by moving select and project
operations as far down the tree as possible.)
3. The select and join operations that are most restrictive should be
executed before other similar operations. (This is done by reordering
the leaf nodes of the tree among themselves and adjusting the rest of
the tree appropriately.)
45. B. Cost Estimation Approach to Query Optimization
• The main idea is to minimize the cost of processing a query. The cost
function is comprised of:
• I/O cost + CPU processing cost + communication cost + storage
cost
• These components might have different weights in different
processing environments.
• The DBMS will use information stored in the system catalogue for
the purpose of estimating cost.
• The main target of query optimization is to minimize the size of the
intermediate relations. The size affects the cost of:
• Disk access
• Data transportation
• Storage space in primary memory
• Writing to disk
46. • Cost-based query optimization:
• Estimate and compare the costs of executing a
query using different execution strategies and
choose the strategy with the lowest cost estimate.
(Compare to heuristic query optimization)
• Issues
• Cost function
• Number of execution strategies to be considered
• Cost Components for Query Execution
1. Access cost to secondary storage
2. Storage cost
3. Computation cost
4. Memory usage cost
5. Communication cost
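Choosing the strategy with the lowest cost estimate can be sketched as a weighted sum over cost components. Everything numeric below is hypothetical: the weights, the three candidate plans, and their per-component estimates are invented for the sketch, whereas a real optimizer derives such estimates from catalog statistics.

```python
# Hypothetical weights for one processing environment.
WEIGHTS = {"io": 1.0, "cpu": 0.2, "communication": 0.5, "storage": 0.1}

# Hypothetical per-plan estimates for each cost component.
PLANS = {
    "cartesian product, then select":
        {"io": 5000, "cpu": 900, "communication": 0, "storage": 4000},
    "select, then join":
        {"io": 320, "cpu": 150, "communication": 0, "storage": 200},
    "index scan, then join":
        {"io": 90, "cpu": 200, "communication": 0, "storage": 150},
}

def estimated_cost(estimates, weights=WEIGHTS):
    """Weighted sum of the cost components of one execution strategy."""
    return sum(weights[component] * estimates[component] for component in weights)

costs = {name: estimated_cost(estimates) for name, estimates in PLANS.items()}
best_plan = min(costs, key=costs.get)   # strategy with the lowest estimate
```

Changing the weights models a different processing environment; for example, a distributed system would assign a larger weight to the communication component.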
47. 1. Access Cost of Secondary Storage
• Data has to be accessed from secondary storage, since a query
needs some part of the data stored in the database. The disk
access cost can be analyzed in terms of:
– Searching,
– Reading, and
– Writing the data blocks used to store some portion of a
relation.
• Remark: The disk access cost will vary depending on
– the file organization used and the access method
implemented for the file organization;
– whether the data is stored contiguously or in a
scattered manner.
48. …continued
2. Storage Cost
• While processing a query, since any query is
composed of many database operations, there could
be one or more intermediate results before reaching
the final output. These intermediate results should be
stored in primary memory for further processing. The
bigger the intermediate relation, the larger the
memory requirement, which has an impact on the
limited available space. This will be considered as a
cost factor.
49. 3. Query Execution Plans
– An execution plan for a relational algebra
query consists of a combination of the
relational algebra query tree and
information about the access methods to be
used for each relation as well as the
methods to be used in computing the
relational operators stored in the tree.
50. 4. Computation Cost
• A query is composed of many operations. The operations could be database
operations like reading from and writing to a disk, or mathematical and other
operations like:
• Searching
• Sorting
• Merging
• Computation on field values
5. Communication Cost
• In most database systems the database resides at one
station and various queries originate from different
terminals. This has an impact on the performance
of the system, adding cost to query processing. Thus,
the cost of transporting data between the database site
and the terminal from which the query originates
should be analyzed.