Lect#2 DDBS (Characteristics and Layers of Query Processing)

This document discusses query processing in distributed database systems. It describes the key characteristics of query processors, including the languages they support, types of optimization, when optimization occurs, use of statistics, where decisions are made, how network topology and replicated fragments are exploited, and use of semi-joins. It then explains the four main layers involved in distributed query processing: query decomposition, data localization, global query optimization, and distributed query execution. Query decomposition transforms queries into relational algebra and normalizes, analyzes, simplifies, and restructures queries.

Uploaded by

ridagul

Distributed Database Systems
Week 5 and 6

Characteristics of query processing


Layers of Query Processing

by Razaullah Khan, The AUP.

Distributed Database Systems 1


Characterization of Query Processors
• Query processors can be compared on a set of important characteristics.
The first four hold for both centralized and distributed query processors,
while the next four apply only to distributed query processors.
• 1. Languages
• Relational DBMSs take input based on relational calculus
• Object DBMSs use object calculus (an extension of relational calculus)
• XML, used to store and transport data over the Internet, is
another data model; its languages are XQuery and XPath (Chap. 17)
• XQuery vs. XPath: XQuery is a query language for collections of
XML data; XQuery is to XML what SQL is to relational databases.
XPath is an XML path language that selects nodes (navigates
through elements) in an XML document using path expressions.
• The query processor must map the input language efficiently
to the output language
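
As a small illustration of XPath-style node selection, the sketch below uses Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath and no XQuery. The XML document and its element and attribute names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; element and attribute names are illustrative.
DOC = """
<company>
  <dept id="10" name="Sales">
    <emp name="Ann"/>
    <emp name="Bob"/>
  </dept>
  <dept id="20" name="Research">
    <emp name="Cy"/>
  </dept>
</company>
"""

root = ET.fromstring(DOC)

# Path expression: emp children of the dept element whose id attribute is 10.
sales_emps = [e.get("name") for e in root.findall("./dept[@id='10']/emp")]

# Path expression: every emp element at any depth below the root.
all_emps = [e.get("name") for e in root.findall(".//emp")]

print(sales_emps)
print(all_emps)
```

The path expressions play the role that the FROM and WHERE clauses play in SQL: they say which nodes to visit and which ones to keep.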



Characterization of Query Processors
2. Types of Optimization
•Query optimization aims at choosing the “best” point in the solution space of
all possible execution strategies.
•(i) Exhaustive search: the direct method is to search the solution space
exhaustively, predict the cost of each strategy, and select the strategy with
minimum cost.
•Although this method is effective in selecting the best strategy, it may incur a
significant processing cost for the optimization itself.
•The problem is that the solution space can be large; that is, there may be
many equivalent strategies, even with a small number of relations.
•(ii) Heuristics: restrict the solution space so that only a few strategies are
considered. Typical heuristics are to perform unary operators (selection,
projection) first, to order binary operators by increasing size of their
intermediate relations, and to replace a join with a semijoin to minimize data
communication cost (as discussed for fragments).
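
The difference between the two approaches can be sketched as follows. The relation names, cardinalities, join selectivities, and the cost model (sum of intermediate result sizes of a left-deep join order) are all assumptions made for illustration, and the greedy rule shown is only one possible heuristic.

```python
from itertools import permutations

# Hypothetical statistics: cardinalities and pairwise join selectivities.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 50, "PAY": 4}
SEL = {
    frozenset(("EMP", "ASG")): 1 / 400,
    frozenset(("ASG", "PROJ")): 1 / 50,
    frozenset(("EMP", "PAY")): 1 / 4,
}

def selectivity(a, b):
    # Pairs without a join predicate behave as a Cartesian product.
    return SEL.get(frozenset((a, b)), 1.0)

def plan_cost(order):
    """Assumed cost model: sum of intermediate result sizes (left-deep order)."""
    joined = [order[0]]
    size = CARD[order[0]]
    cost = 0.0
    for rel in order[1:]:
        factor = 1.0
        for prev in joined:
            factor *= selectivity(prev, rel)
        size = size * CARD[rel] * factor
        cost += size
        joined.append(rel)
    return cost

def exhaustive(relations):
    # Examines every permutation: n! strategies for n relations.
    return min(permutations(relations), key=plan_cost)

def greedy(relations):
    # Heuristic: start with the smallest relation, then always add the
    # relation that yields the smallest next intermediate result.
    remaining = sorted(relations)
    order = [min(remaining, key=CARD.get)]
    remaining.remove(order[0])
    while remaining:
        nxt = min(remaining, key=lambda r: plan_cost(order + [r]))
        order.append(nxt)
        remaining.remove(nxt)
    return tuple(order)

relations = list(CARD)
best = exhaustive(relations)
quick = greedy(relations)
print(len(list(permutations(relations))), "strategies examined exhaustively")
print(best, plan_cost(best))
print(quick, plan_cost(quick))
```

With these made-up numbers the greedy order is more expensive than the exhaustive optimum, which shows the trade-off: the heuristic evaluates far fewer strategies but may miss the best one.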



Characterization of Query Processors
3. Optimization Timing
Optimization can be done statically, before executing the query, or dynamically,
as the query is executed.
Static query optimization: done at query compilation time. Its cost can be
amortized over many executions, so it suits the exhaustive search method. The
sizes of intermediate relations are not known until run time, so they must be
estimated using database statistics, and estimation errors may lead to a
suboptimal strategy.
Dynamic query optimization: done as the query is executed.
At any point of execution, the choice of the best next operator can be based on
accurate knowledge of the results of the operators executed previously.
Main advantage: the actual sizes of intermediate relations are available to the
query processor.
Main disadvantage: optimization, an expensive task, must be repeated for each
execution of the query.
Hybrid query optimization: basically static, but dynamic optimization may take
place at run time when predicted and actual intermediate relation sizes differ
widely. It aims to combine the advantages of static and dynamic optimization.



Characterization of Query Processors
4. Statistics
•The effectiveness of query optimization relies on statistics on the database.
•Dynamic query optimization requires statistics in order to choose which
operators should be done first.
•Static query optimization is even more demanding, since the size of
intermediate relations must also be estimated from statistical information.
•In a DDB, statistics typically bear on fragments: the cardinality and size of
each fragment and the size and number of distinct values of each attribute. To
minimize the probability of error, histograms of attribute values (frequency
of occurrence of each attribute value) are sometimes created.
•Statistics are kept accurate by periodic updating; with static optimization, a
significant change in the statistics may result in query re-optimization.
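
Why a histogram reduces estimation error can be seen in a small sketch. The attribute values below are invented and deliberately skewed; the "uniform" estimate assumes every distinct value is equally frequent, which is a common simplifying assumption rather than something stated in these slides.

```python
from collections import Counter

# Hypothetical contents of one attribute (city) in a fragment, skewed on purpose.
values = ["Paris"] * 70 + ["Rome"] * 20 + ["Oslo"] * 10

card = len(values)           # fragment cardinality
distinct = len(set(values))  # number of distinct attribute values

def uniform_estimate(value):
    # Uses only cardinality and distinct-value count (uniformity assumption).
    return card / distinct

# A histogram with one bucket per value: frequency of each attribute value.
histogram = Counter(values)

def histogram_estimate(value):
    return histogram.get(value, 0)

for city in ("Paris", "Oslo"):
    print(city, round(uniform_estimate(city), 1), histogram_estimate(city))
```

For the selection city = 'Paris', the uniform estimate is about 33 tuples while the histogram gives the true 70; the gap is the kind of error that propagates into estimates of intermediate relation sizes.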



Characterization of Query Processors
5. Decision Sites
•In static optimization: a single site or several sites may participate in the
selection of the strategy to be applied for answering the query.
•Most systems use the centralized decision approach, in which a single site
generates the strategy.
•However, the decision process could be distributed among various sites
participating in the elaboration of the best strategy.
•The centralized approach is simpler but requires knowledge of the entire
DDB, while the distributed approach requires only local information.
•Hybrid approaches, in which one site makes the major decisions and other
sites make local decisions, are also possible.



Characterization of Query Processors
6. Exploitation of the Network Topology
•The distributed query processor generally exploits the network
topology.
•With a WAN, the cost function can be restricted to the data
communication cost, which is the dominant factor. Optimization can
then be divided into two separate problems: selection of the global
execution strategy, based on inter-site communication, and selection
of each local execution strategy, based on a centralized query
processing algorithm.
•With a LAN, communication costs are comparable to I/O costs.
•Therefore, it is reasonable for the distributed query processor to
increase parallel execution at the expense of communication cost.
•In a client-server environment, data shipping can also be used: data
are shipped to the client, so the query work is divided between server
and client and the client also takes part in executing the query.



Characterization of Query Processors
7. Exploitation of Replicated Fragments
•A distributed relation is usually divided into relation fragments.
•Distributed queries expressed on global relations are mapped into
queries on physical fragments of relations by translating relations
into fragments. This process is called localization because its main
function is to localize the data involved in the query.
•For higher reliability, it is useful to have fragments replicated at
different sites.
•Some optimization algorithms exploit the existence of replicated
fragments at run time in order to minimize communication time.



Characterization of Query Processors
8. Use of Semi-joins
•The basic idea of the semijoin is to reduce the communication cost
between different sites.
•It reduces the size of an operand relation before that relation is shipped.
•When the main cost component considered by the query processor is
communication, a semijoin is particularly useful for improving the
processing of distributed join operators as it reduces the size of data
exchanged between sites. For example:

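Oracle semijoin (q1):
    SELECT D.dept_id, D.dept_name
    FROM dept D
    WHERE EXISTS (SELECT 1 FROM emp E WHERE E.dept_id = D.dept_id)
    ORDER BY D.dept_id;

Oracle conventional join (q2):
    SELECT D.dept_id, D.dept_name
    FROM dept D, emp E
    WHERE E.dept_id = D.dept_id
    ORDER BY D.dept_id;

q1 returns each department that has at least one employee exactly once; q2
returns one row per matching employee, so a department with several employees
appears several times.

[Figures: q1 sample output, q2 sample output]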
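
The saving in communication can be sketched with a small simulation. The sketch assumes dept is stored at site 1 and emp at site 2, that the join result is needed at site 2, and that cost is counted as the number of attribute values shipped; the data and the cost measure are illustrative assumptions.

```python
# Hypothetical data: dept is stored at site 1, emp at site 2.
dept = [(d, f"dept{d}") for d in range(1, 101)]  # (dept_id, dept_name)
emp = [(e, 1 + e % 5) for e in range(1, 21)]     # (emp_id, dept_id)

def join(depts, emps):
    # Equijoin on dept_id, returned in a canonical order for comparison.
    by_id = {d_id: name for d_id, name in depts}
    return sorted((e_id, d_id, by_id[d_id]) for e_id, d_id in emps if d_id in by_id)

# Strategy A: ship all of dept to site 2 and join there.
shipped_a = sum(len(row) for row in dept)  # attribute values sent
result_a = join(dept, emp)

# Strategy B (semijoin program):
# 1. Site 2 sends the distinct join-attribute values of emp to site 1.
join_keys = {d_id for _, d_id in emp}
# 2. Site 1 keeps only the dept rows that match (dept semijoin emp).
dept_reduced = [row for row in dept if row[0] in join_keys]
# 3. Site 1 ships the reduced relation to site 2, which performs the join.
shipped_b = len(join_keys) + sum(len(row) for row in dept_reduced)
result_b = join(dept_reduced, emp)

print("values shipped without semijoin:", shipped_a)
print("values shipped with semijoin:   ", shipped_b)
print("same result:", result_a == result_b)
```

The semijoin pays off here because few departments have employees; when most rows of the operand relation participate in the join, the extra message carrying the join keys can outweigh the reduction.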



Layers of Query Processing

• The problem of query processing can be decomposed into
several sub-problems, corresponding to various layers.
• Each layer solves a well-defined sub-problem.
• The input is a query on global data expressed in relational
calculus.
• This query is posed on global (distributed) relations, meaning
that data distribution is hidden.



Layers of Query Processing
• Four main layers are involved in distributed query processing.
• Query decomposition
• Data localization
• Global query optimization, and
• Distributed query execution
• The first three layers map the input query into an optimized
distributed query execution plan.
• Query decomposition and data localization correspond to query
rewriting.
• The first three layers are performed by a central control site and
use schema information stored in the global directory (global query
optimizer → global conceptual schema). A schema is the skeleton, or
structure, of the entire database.
• The fourth layer performs distributed query execution by executing
the plan and returns the answer to the query.

Layers of Query Processing
1. Query Decomposition
• Query decomposition is the first phase of query processing that
transforms a relational calculus query into a relational algebra
query.
• The information needed for this transformation is found in the
global conceptual schema describing the global relations.
•Both input and output queries refer to global relations, without
knowledge of the distribution of data.
•Therefore, query decomposition is the same for centralized and
distributed systems.

•The successive steps of query decomposition are (1) normalization,
(2) analysis, (3) elimination of redundancy, and (4) rewriting.



Layers of Query Processing
• Query Decomposition
• Query decomposition can be viewed as four successive steps.
• First, the calculus query is rewritten in a normalized form by applying
logical operator priority.
• Second, the normalized query is analyzed semantically so that incorrect
queries are detected and rejected as early as possible.
• Third, the correct query (still expressed in relational calculus) is simplified.
One way to simplify a query is to eliminate redundant predicates.
• Fourth, the calculus query is restructured as an algebraic query.
• Several algebraic queries can be derived from the same calculus query, and
some algebraic queries are “better” than others.
• A relational algebra query is represented graphically by an operator
tree.
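
Elimination of redundant predicates (the third step) can be checked mechanically with a truth table. In the sketch below, p1, p2 and p3 stand for hypothetical simple predicates such as title = 'Programmer', title = 'Elect. Eng.' and name = 'J. Doe'; the qualification is an invented example, and the check treats the predicates as independent Boolean variables.

```python
from itertools import product

def original(p1, p2, p3):
    # (NOT p1 AND (p1 OR p2) AND NOT p2) OR p3
    return ((not p1) and (p1 or p2) and (not p2)) or p3

def simplified(p1, p2, p3):
    # The first disjunct is always false, so only p3 remains.
    return p3

def equivalent(f, g, arity):
    # Compare both qualifications on every combination of truth values.
    return all(
        bool(f(*vals)) == bool(g(*vals))
        for vals in product([False, True], repeat=arity)
    )

print(equivalent(original, simplified, 3))
```

Because NOT p1 AND (p1 OR p2) reduces to NOT p1 AND p2, and that conjoined with NOT p2 is false, the whole qualification collapses to p3, so the simplified query evaluates a single predicate.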



Query Decomposition: operator tree
• An operator tree is a tree in which a leaf node is a relation stored in
the database, and a non-leaf node is an intermediate relation
produced by a relational algebra operator. The sequence of
operations is directed from the leaves to the root, which represents
the answer to the query.
• The transformation of a tuple relational calculus query into an
operator tree can easily be achieved as follows.
• First, a leaf is created for each tuple variable (corresponding to a
relation). In SQL, the leaves are immediately available in the FROM clause.
• Second, the root node is created as a project operation involving the
result attributes. These are found in the SELECT clause in SQL.
• Third, the qualification (SQL WHERE clause) is translated into the
appropriate sequence of relational operations (select, join, union,
etc.) going from the leaves to the root.
• The sequence can be given directly by the order of appearance of the
predicates and operators.
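
The three steps above can be sketched as a small tree builder. The Node class, the build_tree and render helpers, and the example query (SELECT ename FROM emp, asg WHERE emp.eno = asg.eno AND dur > 37) are hypothetical; the WHERE clause is assumed to be already split into join and selection predicates in order of appearance.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    op: str   # "relation", "select", "join" or "project"
    arg: str  # relation name, predicate, or attribute list
    children: List["Node"] = field(default_factory=list)

def build_tree(select_attrs, from_relations, where_ops):
    # Step 1: one leaf per relation named in the FROM clause.
    leaves = {name: Node("relation", name) for name in from_relations}
    current = leaves[from_relations[0]]
    # Step 3: WHERE predicates become operators, in order of appearance,
    # going from the leaves toward the root.
    for op in where_ops:
        if op[0] == "join":
            _, rel, pred = op
            current = Node("join", pred, [current, leaves[rel]])
        else:
            _, pred = op
            current = Node("select", pred, [current])
    # Step 2: the root is a projection on the SELECT-clause attributes.
    return Node("project", ", ".join(select_attrs), [current])

def render(node, depth=0):
    # One line per node, children indented under their parent.
    lines = ["  " * depth + f"{node.op}({node.arg})"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines

tree = build_tree(
    ["ename"],
    ["emp", "asg"],
    [("join", "asg", "emp.eno = asg.eno"), ("select", "dur > 37")],
)
print("\n".join(render(tree)))
```

The tree produced this way is correct but not necessarily efficient; later restructuring may, for example, move the selection below the join.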



Example of Operator Tree



Layers of Query Processing
2. Data Localization
•The input to the second layer is an algebraic query on global
relations.
•The main role of the second layer is to localize the query’s data
using data distribution information in the fragment schema.
•In a DDB, relations are fragmented into disjoint subsets, called
fragments, each stored at a different site.
•This layer determines which fragments are involved in the query
and transforms the distributed query into a query on fragments.
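
A minimal sketch of localization for a horizontally fragmented relation follows. The fragment schema (EMP split into three ranges of eno, one per site) and the helper names are assumptions made for the example; fragments whose range cannot satisfy the query predicate are dropped, and the query becomes a union over the remaining fragments.

```python
# Hypothetical fragment schema: EMP is horizontally fragmented on eno.
# fragment name -> (site, low, high), meaning low <= eno < high.
FRAGMENTS = {
    "EMP1": ("site1", 0, 100),
    "EMP2": ("site2", 100, 200),
    "EMP3": ("site3", 200, 300),
}

def localize(low, high):
    """Fragments involved in: SELECT * FROM EMP WHERE eno >= low AND eno < high."""
    needed = []
    for name, (site, f_low, f_high) in FRAGMENTS.items():
        # Keep a fragment only if its range overlaps the query range.
        if f_low < high and low < f_high:
            needed.append((name, site))
    return needed

def localized_query(low, high):
    # The global relation is replaced by the union of the relevant fragments.
    return " UNION ".join(
        f"SELECT * FROM {name} WHERE eno >= {low} AND eno < {high}"
        for name, _ in localize(low, high)
    )

print(localize(150, 250))
print(localized_query(150, 250))
```

For the range 150 to 250 only EMP2 and EMP3 are involved, so site1 does not take part in the query at all.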



Layers of Query Processing
3. Global Query Optimization
•The input to the third layer is an algebraic query on fragments.
•The goal of query optimization is to find an execution strategy for the
query which is close to optimal.
•Query optimization consists of finding the “best” ordering of
operators in the query, including communication operators, that
minimizes a cost function (disk space, disk I/O, buffer space, CPU cost,
and communication cost, i.e. limited bandwidth).
•The execution cost of each candidate strategy is therefore predicted
from statistics (as in static optimization).
•An important aspect of query optimization is join ordering; a basic
technique for optimizing a sequence of distributed joins is the semijoin
operator.
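
One common way to combine these components is a weighted sum of CPU, I/O and communication terms. The sketch below assumes that form; the weights and workload figures are invented for illustration and are not measurements.

```python
from dataclasses import dataclass

@dataclass
class Weights:
    cpu: float  # time per CPU instruction
    io: float   # time per disk I/O
    msg: float  # fixed time to initiate and receive one message
    tr: float   # time to transmit one unit of data

def total_cost(w, insts, ios, msgs, units):
    # Assumed weighted-sum cost function over the four components.
    return w.cpu * insts + w.io * ios + w.msg * msgs + w.tr * units

def comm_share(w, insts, ios, msgs, units):
    # Fraction of the total cost that is due to communication.
    comm = w.msg * msgs + w.tr * units
    return comm / total_cost(w, insts, ios, msgs, units)

# Illustrative weights only: a slow wide area link versus a fast local network.
WAN = Weights(cpu=1e-7, io=1e-2, msg=1e-1, tr=1e-4)
LAN = Weights(cpu=1e-7, io=1e-2, msg=1e-3, tr=1e-6)

# Hypothetical strategy: 1e6 instructions, 100 I/Os, 10 messages, 1e5 data units.
plan = dict(insts=1_000_000, ios=100, msgs=10, units=100_000)

print("WAN communication share:", round(comm_share(WAN, **plan), 2))
print("LAN communication share:", round(comm_share(LAN, **plan), 2))
```

With the WAN weights communication dominates the total, which is why the cost function can be restricted to communication there, whereas with the LAN weights local I/O and CPU costs matter as well.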



Layers of Query Processing

4. Distributed Query Execution


•The last layer is performed by all the sites having fragments involved in
the query.
•Each subquery executing at one site, called a local query, is then
optimized using the local schema of the site and executed.
•At this time, the algorithms that perform the relational operators may be
chosen.
•Local optimization uses the algorithms of centralized systems.



The End
