Chapter 5 - Distributed Databases Roobera
Distributed Databases
Distributed Database Concepts
A distributed computing system is a system which consists of a
number of processing elements, not necessarily homogeneous, that
are interconnected by a computer network, and that cooperate in
performing certain assigned tasks.
A distributed database (DDB) is a collection of multiple logically
interrelated databases distributed over a computer network.
A distributed database management system (DDBMS) is a
software system that manages a distributed database while making
the distribution transparent to the user.
Advantages of Distributed Databases
1. Management of distributed data with different
levels of transparency:
Data Transparency: refers to the ability of a user to
access data as if it were stored in a single location,
regardless of its physical location within the network.
Location Transparency: refers to the ability of the
system to hide the physical location of data, and the
user can access data without knowing its exact
location.
Replication Transparency: refers to the ability of
the system to manage the replication of data across
multiple nodes automatically and transparently to
the user.
Distributed Databases
Failure Transparency: refers to the ability of the
system to transparently handle node failures, allowing
users to continue accessing the data, even if one or
more nodes have failed.
Scalability Transparency: refers to the ability of the
system to transparently handle increasing amounts of
data and user load, allowing the system to scale out as
needed.
These levels of transparency are important in
ensuring that a distributed database operates
seamlessly, effectively and transparently to the user.
Advantages of Distributed Databases
1. Management of distributed data with different levels of
transparency:
A DBMS should be distribution transparent in the sense of hiding the
details of where each file (table, relation) is physically stored within the
system.
The physical placement of data (files, relations, etc.) is not known to the user (distribution transparency).
Advantages…..
The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented horizontally and stored, with possible replication, at different sites.
Advantages …..
• Types of transparencies
• Distribution or network transparency: users do not have to worry about the operational details of the network.
Location transparency refers to the freedom of issuing commands from any site without affecting the way they work.
Naming transparency allows access to any named object (files, relations, etc.) from any location.
• Replication transparency:
Allows copies of the data to be stored at multiple sites.
This is done to minimize the access time to the required data.
Makes the user unaware of the existence of the copies.
• Fragmentation transparency:
Allows a relation to be fragmented horizontally (creating a subset of the tuples of a relation) or vertically (creating a subset of the columns of a relation).
Makes the user unaware of the existence of fragments.
Advantages …..
2. Increased reliability and availability:
Reliability is the probability that a system is running (not down) at a certain time point.
Availability is the probability that the system is continuously available (usable or accessible) during a time interval.
When data and DBMS software are distributed over several sites, one site may fail while other sites continue to operate, so a single failure does not make the whole system unusable.
3. Improved performance:
Data is kept closer to the sites where it is needed most. This reduces data management (access and modification) time significantly.
4. Easier expansion:
New sites can be added to the system without changing the entire configuration.
Functions of DDBMS
Keeping track of data: The ability to keep track of the data distribution,
fragmentation, and replication by expanding the DDBMS catalog.
Distributed query processing: The ability to access remote sites and
transmit queries and data among the various sites via a communication
network.
Distributed transaction management: The ability to devise execution
strategies for queries and transactions that access data from more than one
site and to synchronize the access to distributed data and maintain integrity
of the overall database.
Replicated data management: The ability to decide which copy of a
replicated data item to access and to maintain the consistency of copies of a
replicated data item.
Functions …..
Distributed database recovery: The ability to recover from individual site crashes and from new types of failures, such as the failure of communication links.
Security: Distributed transactions must be executed with the proper management of
the security of the data and the authorization/access privileges of users.
Distributed directory (catalog) management: A directory contains information
(metadata) about data in the database. The directory may be global for the entire
DDB, or local for each site. The placement and distribution of the directory are
design and policy issues.
At the physical hardware level, the following main factors distinguish a DDBMS
from a centralized system:
There are multiple computers, called sites or nodes.
These sites must be connected by some type of communication network to transmit data and commands among sites.
Disadvantages of Distributed Databases
• Complexity- Data replication, failure recovery, network management, etc. make the system more complex than a centralized DBMS.
• Cost- Since a DDBMS needs more people and more hardware, maintaining and running the system can be more expensive than a centralized system.
• Problem of connecting dissimilar machines- Additional layers of operating system software are needed to translate and coordinate the flow of data between machines.
• Data integrity and security problem- Because data maintained by distributed systems can be accessed at any location in the network, controlling the integrity of the database can be difficult.
Data Fragmentation
There are two approaches to storing relations in a distributed database: replication and fragmentation.
Data Fragmentation: is a technique used to break up the database into logically
related units called fragments.
A database can be fragmented as:
Horizontal Fragmentation
Vertical Fragmentation
Mixed (Hybrid) Fragmentation
Data Fragmentation …..
Horizontal Fragmentation: divides a relation "horizontally" by rows, using a selection condition on one or more attributes.
All tuples that satisfy the condition form a subset that is a horizontal fragment of the relation (for example, of the EMPLOYEE relation).
A selection condition may be composed of several conditions connected by AND or OR.
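For illustration (the department numbers are hypothetical), the EMPLOYEE relation could be split into horizontal fragments by the selection conditions Dno = 5 and Dno = 4:

EMP_D5 = σ Dno=5 (EMPLOYEE)
EMP_D4 = σ Dno=4 (EMPLOYEE)

Each EMPLOYEE tuple then belongs to the fragment whose selection condition it satisfies.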
Data Fragmentation …..
Vertical Fragmentation: divides a relation "vertically" by columns.
A vertical fragment is a subset of a relation created by keeping only a subset of its attributes.
Each fragment must include the primary key attribute of the parent relation, so that the full relation can be reconstructed from its fragments.
A vertical fragment can be represented by πLi(R), and a mixed (hybrid) fragment by πLi(σCi(R)).
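For example (the attribute names are assumed for illustration), EMPLOYEE could be split into two vertical fragments that both carry the primary key Ssn, and a mixed fragment combines a projection with a selection:

EMP_V1 = π Ssn, Fname, Lname, Address (EMPLOYEE)
EMP_V2 = π Ssn, Salary, Dno (EMPLOYEE)
EMP_MIX = π Ssn, Fname, Lname (σ Dno=5 (EMPLOYEE))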
Data Fragmentation …..
There are three rules that must be followed during fragmentation:
Completeness: if a relation r is decomposed into fragments r1, r2, …, rn, every data item that can be found in r must appear in at least one fragment.
Reconstruction: it must be possible to define a relational operation that will reconstruct the relation r from its fragments.
Disjointness: if a data item di appears in fragment ri, then it should not appear in any other fragment.
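Using the illustrative fragments above, the reconstruction rule works as follows: horizontal fragments are recombined by union, and vertical fragments by a join on the shared primary key:

EMPLOYEE = EMP_D5 ∪ EMP_D4 ∪ … (one fragment per department)
EMPLOYEE = EMP_V1 ⋈ Ssn EMP_V2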
Data Replication
In full replication, the entire database is replicated at all sites.
Data Replication…..
It also improves the performance of retrieval for global queries, because the result of such a query can be obtained locally from any one site.
The disadvantage of full replication is that it can slow
down update operations.
Each fragment (or each copy of a fragment) must be assigned to a particular site in the distributed system. This process is called data distribution (or data allocation).
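As a minimal sketch of data allocation, the DDBMS catalog can be thought of as a mapping from each fragment to the set of sites that store a copy of it. The fragment and site names in this Python sketch are illustrative:

# Minimal sketch: which sites hold a copy of each fragment (illustrative names).
allocation = {
    "EMP_D5": {"Site1", "Site3"},               # replicated at two sites
    "EMP_D4": {"Site2"},                        # stored at a single site
    "DEPARTMENT": {"Site1", "Site2", "Site3"},  # fully replicated
}

def sites_holding(fragment):
    """Return the set of sites that store a copy of the given fragment."""
    return allocation.get(fragment, set())

print(sites_holding("EMP_D5"))   # {'Site1', 'Site3'} (order may vary)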
Types of Distributed Systems
Homogeneous
All sites of the database system have identical setup, i.e., same database
system software.
The system may have little or no local autonomy (the sites are not standalone).
The underlying operating systems can be a mixture of Linux, Windows, Unix, etc.
[Figure: a homogeneous distributed system in which Sites 1 to 5 all run Oracle (on a mix of Windows, Unix, and Linux) and are connected by a communications network.]
Types…..
Heterogeneous
Federated: Each site may run a different database system, but data access is managed through a single conceptual schema.
This implies that the degree of local autonomy is minimum. Each site must
adhere to a centralized access policy. There may be a global schema.
Each server is an independent and autonomous centralized DBMS that has its own local users, local transactions, and DBA.
[Figure: a heterogeneous system in which sites running an object-oriented DBMS and relational DBMSs (on Linux) are connected by a network.]
Types…..
Federated Database Management Systems Issues
Differences in data models:
The sites may use different data models (e.g., relational, object-oriented).
Differences in constraints:
Each site may have its own data accessing and processing constraints.
Differences in query languages:
Even sites with the same data model may use different languages or versions; some sites may use SQL-89, some SQL-92, and so on.
Query Processing in Distributed Databases
• Issues
Cost of transferring data (files and results) over the network.
This cost is usually high, so some optimization is necessary.
Example: consider a query Q that, for each employee, retrieves the employee name and the name of the department the employee works for, with relation EMPLOYEE stored at Site 1, relation DEPARTMENT stored at Site 2, and the result required at Site 3.
EMPLOYEE at Site 1: 10,000 rows, row size = 100 bytes, table size = 1,000,000 bytes.
DEPARTMENT at Site 2: 100 rows, row size = 35 bytes, table size = 3,500 bytes.
Each tuple of the query result (Fname, Lname, Dname) is assumed to be 40 bytes long.
Query Processing…..
Strategies:
1. Transfer EMPLOYEE and DEPARTMENT to Site 3 and perform the join there.
• Total transfer = 1,000,000 + 3,500 = 1,003,500 bytes.
2. Transfer EMPLOYEE to Site 2, execute the join at Site 2, and send the result to Site 3.
• Query result size = 40 * 10,000 = 400,000 bytes. Total transfer = 400,000 + 1,000,000 = 1,400,000 bytes.
3. Transfer DEPARTMENT to Site 1, execute the join at Site 1, and send the result to Site 3.
• Total transfer = 400,000 + 3,500 = 403,500 bytes.
– Preferred strategy: strategy 3, since it transfers the least data.
Query Processing…..
Consider the query
– Q’: For each department, retrieve the department name and the name of the
department manager
Relational Algebra expression:
π Fname, Lname, Dname (EMPLOYEE ⋈ Mgrssn = SSN DEPARTMENT)
The result of this query will have 100 tuples, assuming that every department has a manager. The execution strategies are:
1. Transfer Employee and Department to the result site and perform the join at site 3.
• Total bytes transferred = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the result to site 3.
Query result size = 40 * 100 = 4000 bytes.
• Total transfer size = 4000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute join at site 1 and send the result to
site 3.
• Total transfer size = 4000 + 3500 = 7500 bytes.
– Preferred strategy: Choose strategy 3.
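The arithmetic above can be double-checked with a small Python sketch (the sizes are taken from the example; the helper function is illustrative, not part of any DBMS):

# Recompute the data-transfer costs (in bytes) for queries Q and Q'.
EMP_SIZE = 10_000 * 100      # EMPLOYEE: 10,000 rows x 100 bytes = 1,000,000
DEPT_SIZE = 100 * 35         # DEPARTMENT: 100 rows x 35 bytes = 3,500
RESULT_ROW = 40              # size of one result tuple (Fname, Lname, Dname)

def strategy_costs(result_rows):
    """Bytes transferred by each of the three execution strategies."""
    result_size = result_rows * RESULT_ROW
    return {
        1: EMP_SIZE + DEPT_SIZE,     # ship both relations to the result site
        2: EMP_SIZE + result_size,   # ship EMPLOYEE to Site 2, result to Site 3
        3: DEPT_SIZE + result_size,  # ship DEPARTMENT to Site 1, result to Site 3
    }

print(strategy_costs(10_000))  # Q : {1: 1003500, 2: 1400000, 3: 403500}
print(strategy_costs(100))     # Q': {1: 1003500, 2: 1004000, 3: 7500}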
Concurrency Control and Recovery
Distributed Databases encounter a number of concurrency
control and recovery problems which are not present in
centralized databases. Some of them are:
Dealing with multiple copies of data items
Distributed commit
Distributed deadlock
Concurrency Control …..
Dealing with multiple copies of data items:
The concurrency control must maintain global consistency. Likewise, the recovery
mechanism must recover all copies and maintain consistency after recovery.
Global consistency refers to the state in which all nodes in a distributed system
have the same view of the data. It is a property of a distributed system that ensures
that all operations and updates are performed in a consistent and coordinated
manner across all nodes, regardless of their location.
When a failed site recovers, the recovery scheme must recover the copies of data items stored there before they are made available for use.
Concurrency Control …..
Communication link failure:
Communication link failure refers to the inability of two devices to communicate with each
other due to a problem with the communication channel connecting them.
This failure may create network partition which would affect database availability even
though all database sites may be running.
Distributed commit:
Distributed commit is a mechanism for ensuring that a transaction in a distributed system is either fully committed or fully rolled back, preserving the consistency of data across the system.
A transaction may be fragmented, and its fragments may be executed at a number of sites. This requires a two-phase or three-phase commit approach for transaction commit.
Distributed deadlock:
Since transactions are processed at multiple sites, two or more sites may get
involved in deadlock. This must be resolved in a distributed manner.
Concurrency Control …..
The process of distributed commit typically involves the
following steps:
All nodes participating in the transaction begin by preparing to
commit.
The coordinator node sends a commit request to all participants.
Each participant acknowledges its readiness to commit.
If all participants have acknowledged, the coordinator sends a final
commit message to all participants.
Each participant then performs the necessary updates and sends a
final confirmation message to the coordinator.
The coordinator, upon receiving final confirmation messages from all
participants, sends a "commit done" message to all participants.
Each participant performs any necessary cleanup, and the transaction is considered complete.
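The steps above follow the general shape of a two-phase commit. A minimal single-process Python sketch of that idea is shown below; the class and method names are illustrative, not a real DDBMS API:

# Minimal two-phase commit sketch; all names are illustrative.
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit

    def prepare(self):
        # Phase 1: vote yes only if the local work can be made durable.
        return self.can_commit

    def commit(self):
        print(f"{self.name}: committed")

    def rollback(self):
        print(f"{self.name}: rolled back")

def two_phase_commit(participants):
    # Phase 1: the coordinator collects a vote from every participant.
    if all(p.prepare() for p in participants):
        # Phase 2: all voted yes, so every participant commits.
        for p in participants:
            p.commit()
        return True
    # At least one participant voted no: roll back everywhere.
    for p in participants:
        p.rollback()
    return False

two_phase_commit([Participant("Site1"), Participant("Site2")])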
Concurrency Control …..
There are several methods for detecting and resolving
distributed deadlocks, including:
Centralized Deadlock Detection: In this approach,
a central coordinator node periodically examines the
state of the system to detect deadlocks.
Distributed Deadlock Detection: In this approach,
each node in the system periodically exchanges
information with its neighbors to detect deadlocks.
Timeouts: In this approach, each process waits for a
specified amount of time before assuming that a
deadlock has occurred and taking appropriate action.
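As a rough sketch of the centralized approach above, the coordinator can merge the local wait-for information from all sites into one global wait-for graph and search it for a cycle. The Python sketch below uses a hypothetical graph; the transaction names are illustrative:

# Detect a cycle in a global wait-for graph (transaction -> transactions it waits for).
def has_deadlock(wait_for):
    visited, on_path = set(), set()

    def dfs(txn):
        if txn in on_path:       # back edge: the graph has a cycle, i.e. a deadlock
            return True
        if txn in visited:
            return False
        visited.add(txn)
        on_path.add(txn)
        if any(dfs(nxt) for nxt in wait_for.get(txn, set())):
            return True
        on_path.remove(txn)
        return False

    return any(dfs(txn) for txn in wait_for)

# Hypothetical graph: T1 waits for T2 (at one site), T2 waits for T1 (at another).
print(has_deadlock({"T1": {"T2"}, "T2": {"T1"}}))   # True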
Concurrency Control …..
Once a deadlock is detected, there are several
strategies for resolving it, including:
Abort one or more of the processes involved in the
deadlock.
Preempt one or more resources held by a process
involved in the deadlock and allocate them to another
process.
Rollback one or more of the transactions involved in
the deadlock.
Concurrency Control…..
Distributed Concurrency Control Based on a Distinguished Copy of a
Data Item
A distinguished copy of a data item is a specific instance of
a data item that is designated as the authoritative or
primary copy of that data item in a distributed system.
Concurrency Control…..
Primary site technique:
In a distributed system, multiple copies of a data item may exist across
different nodes. The primary site technique designates one node, the
primary site, as the authoritative source for the data item. All updates
to the data item are made to the primary site first, and then
propagated to all other nodes in the system that store a copy of the
data item.
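A minimal Python sketch of the locking side of the primary site technique, assuming a single lock table held at the primary site (all names are illustrative):

# All lock requests, from any site, are sent to the single primary site.
class PrimarySiteLockManager:
    def __init__(self):
        self.locks = {}    # data item -> transaction currently holding its lock

    def lock(self, item, txn):
        holder = self.locks.get(item)
        if holder is None or holder == txn:
            self.locks[item] = txn
            return True    # granted: txn may now access any copy of the item
        return False       # denied: txn must wait for the lock to be released

    def unlock(self, item, txn):
        if self.locks.get(item) == txn:
            del self.locks[item]

primary = PrimarySiteLockManager()
print(primary.lock("EMPLOYEE:123", "T1"))   # True
print(primary.lock("EMPLOYEE:123", "T2"))   # False, T1 still holds the lock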
Concurrency Control…..
Primary site technique:
The choice of the primary site can be based on various factors, such as
the location of the data item, the processing capabilities of the node
where the data item is stored, or the availability of the data item. The
primary site can be designated statically, for example by configuring
the system to always use a specific node as the primary site, or
dynamically, for example by using an algorithm to choose the most
appropriate node based on the current state of the system.
Concurrency Control…..
Primary site technique…..
Advantages:
It is an extension of centralized two-phase locking, so implementation and management are simple.
Data items are locked only at one site, but they can be accessed at any site.
Disadvantages:
All transaction management activities go to the primary site, which is likely to overload that site.
If the primary site fails, the entire system becomes inaccessible.
This can limit system reliability and availability.
To aid recovery, a backup site can be designated that behaves as a shadow of the primary site. In case of primary site failure, the backup site acts as the new primary site.
Concurrency Control…..
The Primary Copy Technique is a method used in
distributed systems to manage data consistency across
multiple nodes. The technique designates one node as the
primary node, and all updates to the data are made on the
primary node first. This node holds the authoritative copy
of the data, and all other nodes in the system have replicas
of the data.
Concurrency Control…..
In the primary copy technique, conflicts between
updates made by different transactions can be
resolved by the primary node, which has the complete
information about the state of the data. The primary
copy technique can be used in various types of
distributed systems, including database systems,
cloud computing systems, and file systems. It is an
important aspect of distributed data management and should be carefully designed and implemented to ensure the correct functioning of the system.
Concurrency Control…..
Primary Copy Technique:
In this approach, instead of designating a whole site, one copy of each data item is designated as its primary copy, and the primary copies of different data items can reside at different sites. To lock a data item, only the primary copy of that data item needs to be locked.
Advantages:
Since primary copies are distributed at various sites, a single site is not
overloaded with locking and unlocking requests.
Disadvantages:
If the site that holds the primary copy of a data item fails, that data item becomes inaccessible, even though other data items (whose primary copies reside at other sites) remain available. Identifying the primary copy of each item also requires a directory that must be maintained.
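As a small illustration of how the primary copy technique spreads lock coordination across sites (the item and site names in this Python sketch are hypothetical), each data item records which site holds its primary copy, and lock requests for that item are sent only to that site:

# Each data item has one designated primary copy; lock requests go to its site.
primary_copy_site = {
    "EMPLOYEE:123": "Site1",
    "DEPARTMENT:5": "Site2",
    "PROJECT:10": "Site3",
}

def lock_coordinator_for(item):
    """Return the site whose lock manager coordinates locks on this item."""
    return primary_copy_site[item]

print(lock_coordinator_for("DEPARTMENT:5"))   # Site2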
Concurrency Control…..
Recovery from a coordinator failure
Recovery from a coordinator failure in a distributed system
refers to the process of restoring normal functioning of the
system after a failure of the coordinator node. The coordinator
is a special node in a distributed system that is responsible for
coordinating the execution of transactions and ensuring data
consistency across the nodes.
The recovery process can vary depending on the design of the
system and the type of coordinator failure. In some cases, the
recovery process may involve the selection of a new coordinator
node, which takes over the responsibilities of the failed node. In
other cases, the recovery process may involve rolling back any
transactions that were executed by the failed coordinator but
not yet committed, and then resuming normal processing.
RECOVERY…..
The recovery process should be designed to minimize the impact of
the coordinator failure on the system, and to ensure that the data
remains consistent and available to users. This can involve
implementing robust data backup and recovery strategies, and
monitoring the system for potential failures and taking corrective
actions as needed.
RECOVERY…..
Primary site approach with no backup site:
Aborts and restarts all active transactions at all sites; a new coordinator is then elected and transaction execution is restarted.
RECOVERY…..
Primary site approach with backup site:
Suspends all active transactions, designates the backup site as the new primary site, and chooses a new backup site. The new primary site receives the transaction management information needed to resume processing.
Concurrency Control…..
Primary and backup sites fail or no backup site:
If both the primary and the backup sites fail, or if no backup site exists, an election process is used to select a new coordinator site. The election is triggered when a site (say, site Y) repeatedly fails to communicate with the existing coordinator and therefore assumes that the coordinator is down.
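A minimal sketch of such an election, assuming for illustration that every site has a numeric identifier and the highest-numbered responding site becomes the new coordinator (this particular rule is an assumption, not taken from the chapter):

# Site Y suspects the coordinator is down and starts an election among the
# sites that still respond; the highest-numbered responder becomes coordinator.
def elect_new_coordinator(responding_sites):
    if not responding_sites:
        return None               # no site responded; try again later
    return max(responding_sites)  # simple "highest identifier wins" rule (assumed)

print(elect_new_coordinator({1, 3, 4}))   # 4 becomes the new coordinator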
Concurrency Control…..
Distributed Concurrency control based on voting:
Distributed concurrency control based on voting is a
technique for managing concurrent access to shared
resources in a distributed system. In this technique,
the coordinator site collects voting information from
all participating sites before deciding to execute a
transaction.
The voting process ensures that all participating sites
agree on the order of transactions, and that
transactions are executed in a consistent manner
across all sites. This helps to ensure that the data
remains consistent and up-to-date, even in the
presence of concurrent access to the shared resources.
Concurrency Control…..
Distributed Concurrency control based on voting:
The coordinator site acts as the central authority,
collecting voting information from all sites, and making
the final decision on the execution of transactions. The
coordinator site also ensures that transactions are executed
in a serializable order, preventing inconsistencies that can
result from concurrent access to shared resources.
Each copy of a data item maintains its own lock and can grant or deny a lock request on it.
If a transaction wants to lock a data item, it sends a lock request to all the sites that hold a copy of that data item.
If a majority of the copies grant the lock, the requesting transaction holds the lock; otherwise the request is not granted.
To avoid an unacceptably long wait, a time-out period is defined. If the requesting transaction does not receive the vote information within that period, the transaction is aborted.
The locking information (granted or denied) is sent to all the sites that hold a copy.
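A minimal Python sketch of the voting rule described above (the time-out handling is omitted and all names are illustrative), in which a lock is granted only if a majority of the copies vote to grant it:

# Each site holding a copy of the data item votes to grant or deny the lock.
def request_lock(votes):
    """votes maps each site holding a copy to its vote: True = grant, False = deny."""
    granted = sum(1 for vote in votes.values() if vote)
    return granted > len(votes) / 2   # lock is held only with a strict majority

print(request_lock({"Site1": True, "Site2": True, "Site3": False}))   # True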
Distributed Recovery
There are two major problems with regard to distributed recovery.
1. It is difficult to determine whether a site is down without exchanging numerous messages with other sites.
• Suppose that site X sends a message to site Y and expects a response
from Y but does not receive it. There are several possible explanations
for this:
The message was not delivered to Y because of communication failure.
Site Y is down and could not respond.
Site Y is running and sent a response, but the response was not delivered.
2. Distributed commit.
• When a transaction is updating data at several sites, it cannot commit
until it is sure that the effect of the transaction on every site cannot be
lost.
Thank You
Any Questions???