Week 4_Lecture Notes

The document discusses the importance of time and clock synchronization in cloud data centers, highlighting various algorithms and challenges associated with distributed systems. It covers concepts such as clock skew, drift, and inaccuracies, as well as synchronization techniques like Cristian's Algorithm, NTP, the Berkeley Algorithm, and the Datacenter Time Protocol (DTP). The need for synchronization is emphasized for correctness and fairness in distributed computations.


Time and Clock Synchronization in Cloud Data Centers

Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:

In this lecture, we will discuss the fundamentals of clock synchronization in the cloud and its different algorithms.

We will also discuss causality and a general framework of logical clocks, and present two systems of logical time, namely Lamport and vector timestamps, to capture causality between events of a distributed computation.

Need of Synchronization
You want to catch a bus at 9:05 am, but your watch is off by 15 minutes.
What if your watch is late by 15 minutes?
• You'll miss the bus!
What if your watch is fast by 15 minutes?
• You'll end up unfairly waiting for longer than you intended.
Time synchronization is required for:
Correctness
Fairness

Time and Synchronization
("There is never enough time…")

Distributed Time

The notion of time is well defined (and measurable) at each single location.
But the relationship between time at different locations is unclear.
Time Synchronization is required for:
Correctness
Fairness
Synchronization in the cloud
Example: Cloud-based airline reservation system:
Server X receives a client request to purchase the last ticket on a flight, say PQR 123.
Server X timestamps the purchase using its local clock as 6h:25m:42.55s, logs it, and replies ok to the client.
Since that was the very last seat, Server X sends a message to Server Y saying the "flight is full".
Y logs "Flight PQR 123 is full" together with its own local clock value (which happens to read 6h:20m:20.21s).
Server Z queries X's and Y's logs, and is confused that a client purchased a ticket at X after the flight became full at Y.
This may lead to incorrect actions at Z.

Key Challenges
End-hosts in Internet-based systems (like clouds):
Each has its own clock,
unlike processors (CPUs) within one server or workstation, which share a system clock.
Processes in Internet-based systems follow an asynchronous model:
No bounds on
– Message delays
– Processing delays
Unlike multi-processor (or parallel) systems, which follow a synchronous system model.

Definitions
An asynchronous distributed system consists of a number of processes.
Each process has a state (values of variables).
Each process takes actions to change its state, which may be an instruction or a communication action (send, receive).
An event is the occurrence of an action.
Each process has a local clock – events within a process can be assigned timestamps, and thus ordered linearly.
But in a distributed system, we also need to know the time order of events across different processes.

Space-time diagram
Figure: The space-time diagram of a distributed execution, showing processes, internal events, message send events, and message receive events.


Clock Skew vs. Clock Drift
Each process (running at some end host) has its own clock.
When comparing two clocks at two processes:
Clock Skew = relative difference in clock values of two processes.
• Like the distance between two vehicles on a road.
Clock Drift = relative difference in clock frequencies (rates) of two processes.
• Like the difference in speeds of two vehicles on the road.
A non-zero clock skew implies the clocks are not synchronized.
A non-zero clock drift causes the skew to increase (eventually).
If the faster vehicle is ahead, it will drift away.
If the faster vehicle is behind, it will catch up and then drift away.

Clock Inaccuracies
Clocks that must not only be synchronized with each other but also have to adhere to physical time are termed physical clocks.

Physical clocks are synchronized to an accurate real-time standard like UTC (Universal Coordinated Time).

However, due to clock inaccuracy, a timer (clock) C is said to be working within its specification if

1 − ρ ≤ dC/dt ≤ 1 + ρ

where the constant ρ is the maximum drift rate specified by the manufacturer.

Figure: The behavior of fast, slow, and perfect clocks with respect to UTC.

How often to Synchronize

Maximum Drift Rate (MDR) of a clock:
Absolute MDR is defined relative to Coordinated Universal Time (UTC). UTC is the correct time at any point of time.
• The MDR of any process depends on the environment.
The maximum drift rate between two clocks with similar MDR is 2*MDR.
Given a maximum acceptable skew M between any pair of clocks, we need to synchronize at least once every
M / (2*MDR) time units.
• Since time = distance / speed.
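As a quick sanity check of this formula, here is a tiny Python sketch with illustrative values (not from the lecture):

# Maximum acceptable skew M and per-clock maximum drift rate MDR (assumed values).
M = 1e-3                     # 1 ms of acceptable skew
MDR = 10e-6                  # 10 ppm drift rate
sync_period = M / (2 * MDR)  # how often to synchronize, in seconds
print(sync_period)           # 50.0 -> resynchronize at least every 50 s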

External vs Internal Synchronization
Consider a group of processes.
External synchronization:
Each process i's clock C(i) is within a bound D of a well-known clock S external to the group:
|C(i) − S| < D at all times.
The external clock may be connected to UTC (Universal Coordinated Time) or an atomic clock.
Example: Cristian's algorithm, NTP.
Internal synchronization:
Every pair of processes in the group have clocks within bound D:
|C(i) − C(j)| < D at all times and for all processes i, j.
Example: Berkeley algorithm, DTP.

External vs Internal Synchronization
External synchronization with bound D => internal synchronization with bound 2*D.
Internal synchronization does not imply external synchronization.
• In fact, the entire system may drift away from the external clock S!

Basic Fundamentals
External time synchronization:
All processes P synchronize with a time server S.
Figure: P asks S "What's the time?"; S checks its local clock to find time t and replies "Here's the time t"; P sets its clock to t.
What's wrong:
By the time the message is received at P, time has moved on.
Setting P's time to t is inaccurate.
The inaccuracy is a function of message latencies.
Since latencies are unbounded in an asynchronous system, the inaccuracy cannot be bounded.

(i) Cristian's Algorithm
P measures the round-trip time RTT of the message exchange.
Suppose we know the minimum P → S latency min1
and the minimum S → P latency min2.
min1 and min2 depend on the OS overhead to buffer messages, TCP time to queue messages, etc.
The actual time at P when it receives the response is in the interval [t + min2, t + RTT − min1].
Figure: P sends "What's the time?" to S; S checks its local clock to find time t and replies "Here's the time t!"; P measures the RTT and sets its clock.
Cristian's Algorithm
The actual time at P when it receives the response is in the interval [t + min2, t + RTT − min1].
P sets its time to halfway through this interval,
i.e. to t + (RTT + min2 − min1)/2.
The error is at most (RTT − min2 − min1)/2.
Bounded!
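A minimal Python sketch of this estimate, assuming a hypothetical request_server_time() helper that performs the request/response exchange with the server S:

import time

def cristian_sync(request_server_time, min1=0.0, min2=0.0):
    # One round of Cristian's algorithm (sketch).
    # request_server_time() is assumed to return the server's clock reading t.
    # min1 / min2 are the assumed minimum P->S and S->P latencies.
    t0 = time.monotonic()            # local send time at P
    t = request_server_time()        # server's clock reading
    rtt = time.monotonic() - t0      # measured round-trip time

    # True time at P lies in [t + min2, t + rtt - min1]; take the midpoint.
    estimate = t + (rtt + min2 - min1) / 2
    max_error = (rtt - min1 - min2) / 2
    return estimate, max_error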

Error Bounds
A process is allowed to increase its clock value, but should never decrease its clock value
– that may violate the ordering of events within the same process.
A process is allowed to increase or decrease the speed of its clock.
If the error is too high, take multiple readings and average them.

Cristian's Algorithm: Example
Given: T0 = 5:08:15.100, T1 = 5:08:15.900, Tserver = 5:09:25.300, best-case (minimum) message time Tmin = 200 msec.
Send the request at 5:08:15.100 (T0).
Receive the response at 5:08:15.900 (T1).
– The response contains 5:09:25.300 (Tserver).
The elapsed time is T1 − T0:
5:08:15.900 − 5:08:15.100 = 800 msec.
Best guess: the timestamp was generated 400 msec ago (half the elapsed time).
Set the time to Tserver + (elapsed time)/2:
5:09:25.300 + 400 msec = 5:09:25.700.

(ii) NTP: Network Time Protocol
(1991, 1992) Internet Standard, version 3: RFC 1305.
NTP servers are organized in a tree.
Each client = a leaf of the tree.
Each node synchronizes with its tree parent.
Figure: tree of primary servers, secondary servers, and tertiary servers, with clients at the leaves.
NTP Protocol

Figure: The child starts the protocol by sending Message 1, which it sends at time ts1 (child's clock) and the parent receives at time tr1 (parent's clock). The parent replies with Message 2, which it sends at time ts2 (parent's clock) and the child receives at time tr2 (child's clock); Message 2 carries tr1 and ts2 back to the child.

Why o = (tr1 − tr2 + ts2 − ts1)/2 ?
Offset o = (tr1 − tr2 + ts2 − ts1)/2.
Let's calculate the error.
Suppose the real offset is oreal:
The child is ahead of the parent by oreal.
The parent is ahead of the child by −oreal.
Suppose the one-way latency of Message 1 is L1 (L2 for Message 2).
No one knows L1 or L2!
Then
tr1 = ts1 + L1 + oreal
tr2 = ts2 + L2 − oreal
Why o = (tr1 − tr2 + ts2 − ts1)/2 ?
Then
tr1 = ts1 + L1 + oreal
tr2 = ts2 + L2 − oreal
Subtracting the second equation from the first:
oreal = (tr1 − tr2 + ts2 − ts1)/2 + (L2 − L1)/2
=> oreal = o + (L2 − L1)/2
=> |oreal − o| = |(L2 − L1)/2| ≤ (L2 + L1)/2
• Thus the error is bounded by the round-trip time (RTT).
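A minimal Python sketch of the offset computation from one such exchange, assuming the four timestamps have already been collected (ts1, tr2 on the child's clock; tr1, ts2 on the parent's clock):

def ntp_offset_and_delay(ts1, tr1, ts2, tr2):
    # ts1: child's send time of Message 1 (child clock)
    # tr1: parent's receive time of Message 1 (parent clock)
    # ts2: parent's send time of Message 2 (parent clock)
    # tr2: child's receive time of Message 2 (child clock)
    offset = (tr1 - tr2 + ts2 - ts1) / 2        # o, as defined above
    rtt = (tr2 - ts1) - (ts2 - tr1)             # round trip minus parent processing
    return offset, rtt

# Example (illustrative): parent's clock ~5 units ahead, one-way latency ~2 units.
print(ntp_offset_and_delay(ts1=100, tr1=107, ts2=108, tr2=105))   # (5.0, 4)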

(iii) Berkeley's Algorithm
Gusella & Zatti, 1989.
The master polls each machine periodically:
Ask each machine for its time.
– Can use Cristian's algorithm to compensate for network latency.
When the results are in, compute the average,
including the master's time.
Hope: the average cancels out each individual clock's tendency to run fast or slow.
Send each slave the offset by which its clock needs adjustment.
• Sending an offset rather than a timestamp avoids problems with network delays.
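A minimal Python sketch of one Berkeley round at the master, assuming the per-machine clock readings have already been collected (e.g. via Cristian-style polling); the names used here are illustrative:

def berkeley_round(master_time, slave_times):
    # master_time: the master's own clock reading
    # slave_times: dict machine -> clock reading obtained by polling
    # Returns dict machine -> offset each clock should apply.
    all_times = [master_time] + list(slave_times.values())
    avg = sum(all_times) / len(all_times)        # average includes the master

    offsets = {"master": avg - master_time}
    for machine, t in slave_times.items():
        offsets[machine] = avg - t               # send an offset, not a timestamp
    return offsets

# Example: master reads 3:00:00, slaves read 3:05:00 and 2:55:00 (in seconds).
print(berkeley_round(10800, {"s1": 11100, "s2": 10500}))   # s1: -300, s2: +300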

Berkeley's Algorithm: Example

Figure: an example run of Berkeley's algorithm.

(iv) DTP: Datacenter Time Protocol

ACM SIGCOMM 2016.
DTP uses the physical layer of network devices to implement a decentralized clock synchronization protocol.
Highly scalable, with bounded precision!
– ~25 ns (4 clock ticks) between peers
– ~150 ns for a datacenter with six hops
– No network traffic
– Internal clock synchronization
End-to-end: ~200 ns precision!
DTP: Phases

Figure: the phases of DTP (one-way delay).

DTP: (i) Init Phase
INIT phase: The purpose of the INIT phase is to measure the one-way delay between two peers. The phase begins when two ports are physically connected and start communicating, i.e. when the link between them is established.
Each peer measures the one-way delay by measuring the time between sending an INIT message and receiving the associated INIT-ACK message, i.e. it measures the RTT, then divides the measured RTT by two.

DTP: (ii) Beacon Phase
BEACON phase: During the BEACON phase, two ports periodically exchange their local counters for resynchronization. Due to oscillator skew, the offset between two local counters will increase over time. A port adjusts its local counter by selecting the maximum of the local and remote counters upon receiving a BEACON message from its peer. Since BEACON messages are exchanged frequently, hundreds of thousands of times a second (every few microseconds), the offset can be kept to a minimum.
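A minimal Python sketch of this counter adjustment, modeling a single port's local counter (illustrative only; real DTP does this in NIC/switch hardware):

class DTPPort:
    def __init__(self):
        self.counter = 0              # local clock, in oscillator ticks

    def tick(self):
        self.counter += 1             # advanced by the local oscillator

    def on_beacon(self, remote_counter):
        # Select the maximum of local and remote counters, so the slower
        # clock jumps forward and no counter ever moves backwards.
        self.counter = max(self.counter, remote_counter)

# Example: the peer's BEACON reports a counter 3 ticks ahead of ours.
p = DTPPort()
for _ in range(100):
    p.tick()
p.on_beacon(103)
print(p.counter)                      # 103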

DTP Switch

Figure: DTP switch design.

DTP Property

DTP provides bounded precision and scalability.
Bounded precision in hardware:
– Bounded by 4T (= 25.6 ns, where T = oscillator tick = 6.4 ns)
– Network precision bounded by 4TD, where D is the network diameter in hops
Requires NIC and switch modifications.

But Yet…
We still have a non-zero error!
We just can't seem to get rid of the error –
and we can't, as long as message latencies are non-zero.
Can we avoid synchronizing clocks altogether, and still be able to order events?

Ordering events in a distributed system
To order events across processes, trying to synchronize clocks is one approach.
What if we instead assigned timestamps to events that were not absolute time?
As long as those timestamps obey causality, that would work:
If an event A causally happens before another event B, then timestamp(A) < timestamp(B).
Example: Humans use causality all the time:
• I enter the house only after I unlock it.
• You receive a letter only after I send it.

Logical (or Lamport) ordering
Proposed by Leslie Lamport in the 1970s.
Used in almost all distributed systems since then.
Almost all cloud computing systems use some form of logical ordering of events.

Leslie B. Lamport (born February 7, 1941) is an American computer scientist. Lamport is best known for his seminal work in distributed systems and as the initial developer of the document preparation system LaTeX. Leslie Lamport was the winner of the 2013 Turing Award for imposing clear, well-defined coherence on the seemingly chaotic behavior of distributed computing systems, in which several autonomous computers communicate with each other by passing messages.

Lamport’s research contributions
Lamport's research contributions have laid the foundations of the theory of distributed systems. Among his most notable papers are:
"Time, Clocks, and the Ordering of Events in a Distributed System", which received the PODC Influential Paper Award in 2000,
"How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs", which defined the notion of sequential consistency,
"The Byzantine Generals' Problem",
"Distributed Snapshots: Determining Global States of a Distributed System", and
"The Part-Time Parliament".
These papers relate to such concepts as logical clocks (and the happened-before relationship) and Byzantine failures. They are among the most cited papers in the field of computer science and describe algorithms to solve many fundamental problems in distributed systems, including:
the Paxos algorithm for consensus,
the bakery algorithm for mutual exclusion of multiple threads in a computer system that require the same resources at the same time,
the Chandy-Lamport algorithm for the determination of consistent global states (snapshots), and
the Lamport signature, one of the prototypes of the digital signature.

Logical (or Lamport) Ordering (2)

Define a logical relation Happens-Before among pairs of events.
Happens-Before is denoted as →.
Three rules:
1. On the same process: a → b, if time(a) < time(b) (using the local clock)
2. If p1 sends m to p2: send(m) → receive(m)
3. (Transitivity) If a → b and b → c then a → c
This creates a partial order among events.
Not all events are related to each other via →.

Example 1:

Figure: three processes P1 (events A, B, C, D, E), P2 (events E, F, G), and P3 (events H, I, J), with instructions/steps and messages exchanged between them over time.
Example 1: Happens-Before

Figure: the same execution as above.
• A → B
• B → F
• A → F
Example 2: Happens-Before

Figure: the same execution as above.
• H → G
• F → J
• H → J
• C → J
Lamport timestamps
Goal: Assign a logical (Lamport) timestamp to each event.
Timestamps obey causality.
Rules:
Each process uses a local counter (clock) which is an integer.
• The initial value of the counter is zero.
A process increments its counter when a send or an instruction (internal step) happens at it. The counter is assigned to the event as its timestamp.
A send (message) event carries its timestamp.
For a receive (message) event the counter is updated by max(local clock, message timestamp) + 1.
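A minimal Python sketch of these rules for a single process (illustrative; message transport is not shown):

class LamportClock:
    def __init__(self):
        self.time = 0                       # initial counter value is zero

    def instruction_or_send(self):
        self.time += 1                      # increment on instruction or send
        return self.time                    # a send carries this value

    def receive(self, msg_timestamp):
        # receive: max of local clock and message timestamp, plus one
        self.time = max(self.time, msg_timestamp) + 1
        return self.time

# Example: a process with clock 0 receives a message stamped 1.
p2 = LamportClock()
print(p2.receive(1))                        # 2, as in the walkthrough below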

Example

Figure: three processes P1, P2, P3 exchanging messages over time (the execution used in the following walkthrough).
Lamport Timestamps

Figure sequence (walkthrough of Lamport timestamps on the three-process execution):
• Initial counters (clocks): P1 = 0, P2 = 0, P3 = 0.
• P3 performs a send event: its counter becomes 1 and the message carries ts = 1.
• P2 receives that message: ts = max(local, msg) + 1 = max(0, 1) + 1 = 2.
• P1 performs events with timestamps 1 and 2; the event with ts = 2 is a send whose message carries ts = 2.
• P2 receives it: max(2, 2) + 1 = 3, then performs an event with timestamp 4, whose message is received at P1: max(3, 4) + 1 = 5.
• Final timestamps – P1: 1, 2, 3, 5, 6; P2: 2, 3, 4; P3: 1, 2, 7.
Obeying Causality
Figure: the same execution with Lamport timestamps – P1: A=1, B=2, C=3, D=5, E=6; P2: E=2, F=3, G=4; P3: H=1, I=2, J=7.
• A → B :: 1 < 2
• B → F :: 2 < 3
• A → F :: 1 < 3
Obeying Causality (2)
Figure: the same execution and timestamps as above.
• H → G :: 1 < 4
• F → J :: 3 < 7
• H → J :: 1 < 7
• C → J :: 3 < 7
Not always implying Causality
Figure: the same execution and timestamps as above.
• Is C → F? :: 3 = 3 (the timestamps are equal)
• Is H → C? :: 1 < 3, yet H does not happen-before C
• (C, F) and (H, C) are pairs of concurrent events
Concurrent Events
A pair of concurrent events doesn't have a causal path from one event to the other (either way, in the pair).
Lamport timestamps are not guaranteed to be ordered or unequal for concurrent events.
That is OK, since concurrent events are not causally related!
Remember:
E1 → E2 ⇒ timestamp(E1) < timestamp(E2), BUT
timestamp(E1) < timestamp(E2) ⇒ {E1 → E2} OR {E1 and E2 concurrent}
Vector Timestamps
Used in key-value stores like Riak.
Each process uses a vector of integer clocks.
Suppose there are N processes in the group 1…N.
Each vector has N elements.
Process i maintains vector Vi[1…N].
The jth element of the vector clock at process i, Vi[j], is i's knowledge of the latest events at process j.

Assigning Vector Timestamps
Incrementing vector clocks:
1. On an instruction or send event at process i, it increments only the ith element of its vector clock.
2. Each message carries the send-event's vector timestamp Vmessage[1…N].
3. On receiving a message at process i:
Vi[i] = Vi[i] + 1
Vi[j] = max(Vmessage[j], Vi[j]) for j ≠ i
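A minimal Python sketch of these rules (illustrative; indices are 0-based here rather than 1…N):

class VectorClock:
    def __init__(self, i, n):
        self.i = i                          # this process's index
        self.v = [0] * n                    # one entry per process

    def instruction_or_send(self):
        self.v[self.i] += 1                 # rule 1: bump only own entry
        return list(self.v)                 # rule 2: a copy travels with the message

    def receive(self, v_message):
        self.v[self.i] += 1                 # rule 3: bump own entry...
        for j, vj in enumerate(v_message):
            if j != self.i:
                self.v[j] = max(self.v[j], vj)   # ...and merge the others
        return list(self.v)

# Example (3 processes): P2 (index 1) receives P3's timestamp (0,0,1).
p2 = VectorClock(i=1, n=3)
print(p2.receive([0, 0, 1]))                # [0, 1, 1], as in the walkthrough below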

Example

Figure: the same three-process execution (P1: A–E, P2: E–G, P3: H–J) used for the vector timestamp walkthrough below.
Vector Timestamps

Figure sequence (walkthrough of vector timestamps on the three-process execution):
• Initial clocks: P1 = (0,0,0), P2 = (0,0,0), P3 = (0,0,0).
• P1's first event: (1,0,0). P3's first event is a send: (0,0,1); the message carries (0,0,1).
• P2 receives that message: it increments its own entry and merges the rest, giving (0,1,1).
• P1's next event is a send: (2,0,0); the message carries (2,0,0); P2 receives it: (2,2,1).
• Final vector timestamps – P1: (1,0,0), (2,0,0), (3,0,0), (4,3,1), (5,3,1); P2: (0,1,1), (2,2,1), (2,3,1); P3: (0,0,1), (0,0,2), (5,3,3).

Causally-Related
VT1 = VT2,
iff (if and only if) VT1[i] = VT2[i], for all i = 1, …, N
VT1 ≤ VT2,
iff VT1[i] ≤ VT2[i], for all i = 1, …, N
Two events are causally related iff VT1 < VT2, i.e.,
iff VT1 ≤ VT2 and there exists j such that 1 ≤ j ≤ N and VT1[j] < VT2[j].

… or Not Causally-Related
Two events VT1 and VT2 are concurrent
iff
NOT (VT1 ≤ VT2) AND NOT (VT2 ≤ VT1).
We'll denote this as VT2 ||| VT1.
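A minimal Python sketch of these comparisons between two vector timestamps (lists of equal length):

def vt_leq(vt1, vt2):
    # VT1 <= VT2 iff every component of VT1 is <= the matching one of VT2
    return all(a <= b for a, b in zip(vt1, vt2))

def causally_before(vt1, vt2):
    # VT1 < VT2: VT1 <= VT2 and strictly smaller in at least one component
    return vt_leq(vt1, vt2) and any(a < b for a, b in zip(vt1, vt2))

def concurrent(vt1, vt2):
    # neither VT1 <= VT2 nor VT2 <= VT1 (denoted VT1 ||| VT2)
    return not vt_leq(vt1, vt2) and not vt_leq(vt2, vt1)

# Examples from the execution below: F = (2,2,1), J = (5,3,3), C = (3,0,0).
print(causally_before([2, 2, 1], [5, 3, 3]))    # True:  F -> J
print(concurrent([3, 0, 0], [2, 2, 1]))         # True:  C ||| F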

Obeying Causality
Figure: the same execution with vector timestamps – P1: A=(1,0,0), B=(2,0,0), C=(3,0,0), D=(4,3,1), E=(5,3,1); P2: E=(0,1,1), F=(2,2,1), G=(2,3,1); P3: H=(0,0,1), I=(0,0,2), J=(5,3,3).
• A → B :: (1,0,0) < (2,0,0)
• B → F :: (2,0,0) < (2,2,1)
• A → F :: (1,0,0) < (2,2,1)
Obeying Causality (2)
Figure: the same execution and vector timestamps as above.
• H → G :: (0,0,1) < (2,3,1)
• F → J :: (2,2,1) < (5,3,3)
• H → J :: (0,0,1) < (5,3,3)
• C → J :: (3,0,0) < (5,3,3)
Identifying Concurrent Events
Figure: the same execution and vector timestamps as above.
• C & F :: (3,0,0) ||| (2,2,1)
• H & C :: (0,0,1) ||| (3,0,0)
• (C, F) and (H, C) are pairs of concurrent events
Summary : Logical Timestamps

Lamport timestamps
• Integer clocks assigned to events.
• Obey causality.
• Cannot distinguish concurrent events.
Vector timestamps
• Obey causality.
• By using more space, can also identify concurrent events.

Conclusion
Clocks are unsynchronized in an asynchronous distributed system,
but we need to order events across processes!
Time synchronization:
Cristian's algorithm
Berkeley algorithm
NTP
DTP
But the error is a function of RTT.
• We can avoid time synchronization altogether by instead assigning logical timestamps to events.

Global State and Snapshot
Recording Algorithms

Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:

In this lecture, we will discuss global states (i.e. consistent and inconsistent), models of communication, and a snapshot algorithm, i.e. the Chandy-Lamport algorithm, to record the global snapshot.

Snapshots
How do you calculate a "global snapshot" in a distributed system?
What does a "global snapshot" even mean?

In the Cloud: Global Snapshot
In a cloud, each application or service is running on multiple servers.
Servers handle concurrent events and interact with each other.
The ability to obtain a "global photograph" or "global snapshot" of the system is important.
Some uses of having a global picture of the system:
Checkpointing: can restart a distributed application on failure.
Garbage collection of objects: objects at servers that don't have any other objects (at any servers) with pointers to them.
Deadlock detection: useful in database transaction systems.
Termination of computation: useful in batch computing systems.

Global State: Introduction
Recording the global state of a distributed system on-the-fly is an important paradigm.
The lack of globally shared memory, the lack of a global clock, and the unpredictable message delays in a distributed system make this problem non-trivial.
This lecture first defines consistent global states and discusses the issues to be addressed to compute consistent distributed snapshots.
Then the algorithm to determine such snapshots on-the-fly is presented.

System Model
The system consists of a collection of n processes p1, p2, ..., pn that are connected by channels.
There is no globally shared memory and no physical global clock; processes communicate by passing messages through communication channels.
Cij denotes the channel from process pi to process pj, and its state is denoted by SCij.
The actions performed by a process are modeled as three types of events: internal events, message send events, and message receive events.
For a message mij that is sent by process pi to process pj, let send(mij) and rec(mij) denote its send and receive events.

System Model

At any instant, the state of process pi, denoted by LSi, is the result of the sequence of all the events executed by pi up to that instant.
For an event e and a process state LSi, e ∈ LSi iff e belongs to the sequence of events that have taken process pi to state LSi.
For an event e and a process state LSi, e ∉ LSi iff e does not belong to the sequence of events that have taken process pi to state LSi.
For a channel Cij, the following set of messages can be defined based on the local states of the processes pi and pj:
Transit: transit(LSi, LSj) = {mij | send(mij) ∈ LSi ∧ rec(mij) ∉ LSj}

Consistent Global State

The global state of a distributed system is a collection of the local states of the processes and the channels.
Notationally, the global state GS is defined as
GS = {∪i LSi, ∪i,j SCij}
A global state GS is a consistent global state iff it satisfies the following two conditions:
C1: send(mij) ∈ LSi ⇒ mij ∈ SCij ⊕ rec(mij) ∈ LSj (⊕ is the exclusive-OR operator)
C2: send(mij) ∉ LSi ⇒ mij ∉ SCij ∧ rec(mij) ∉ LSj

Global State of a Distributed System

In the distributed execution of Figure 6.2:
A global state GS1 consisting of local states {LS11, LS23, LS33, LS42} is inconsistent, because the state of p2 has recorded the receipt of message m12 but the state of p1 has not recorded its send.
On the contrary, a global state GS2 consisting of local states {LS12, LS24, LS34, LS42} is consistent; all the channels are empty except C21, which contains message m21.
Global State of a Distributed System

A global state GS = {∪i LSi^xi, ∪j,k SCjk^(yj,zk)} is transitless iff
∀j, ∀k : 1 ≤ j, k ≤ n :: SCjk^(yj,zk) = Ø
Thus, all channels are recorded as empty in a transitless global state.
A global state is strongly consistent iff it is transitless as well as consistent. Note that in Figure 6.2, the global state consisting of local states {LS12, LS23, LS34, LS42} is strongly consistent.
Recording the global state of a distributed system is an important paradigm when one is interested in analyzing, monitoring, testing, or verifying properties of distributed applications, systems, and algorithms.
The design of efficient methods for recording the global state of a distributed system is an important problem.

Example:

Figure 6.2: The space-time diagram of a distributed execution with four processes P1–P4 (events e11–e14 at P1, e21–e24 at P2, e31–e35 at P3, e41–e42 at P4) and messages m12 and m21.
GS1 = {LS11, LS23, LS33, LS42} is inconsistent.
GS2 = {LS12, LS24, LS34, LS42} is consistent.
GS3 = {LS12, LS23, LS34, LS42} is strongly consistent.


Issues in Recording a Global State
The following two issues need to be addressed:
I1: How to distinguish between the messages to be recorded in the snapshot and those not to be recorded.
– Any message that is sent by a process before recording its snapshot must be recorded in the global snapshot (from C1).
– Any message that is sent by a process after recording its snapshot must not be recorded in the global snapshot (from C2).
I2: How to determine the instant when a process takes its snapshot.
– A process pj must record its snapshot before processing a message mij that was sent by process pi after recording its own snapshot.

Example of Money Transfer
Let S1 and S2 be two distinct sites of a distributed system which maintain bank accounts A and B, respectively. A site refers to a process in this example. Let the communication channels from site S1 to site S2 and from site S2 to site S1 be denoted by C12 and C21, respectively.
Consider the following sequence of actions, which are also illustrated in the timing diagram of Figure 6.3:
Time t0: Initially, Account A = $600, Account B = $200, C12 = $0, C21 = $0.
Time t1: Site S1 initiates a transfer of $50 from Account A to Account B. Account A is decremented by $50 to $550 and a request for a $50 credit to Account B is sent on channel C12 to site S2. Account A = $550, Account B = $200, C12 = $50, C21 = $0.

Time t2: Site S2 initiates a transfer of $80 from Account B to Account A. Account B is decremented by $80 to $120 and a request for an $80 credit to Account A is sent on channel C21 to site S1. Account A = $550, Account B = $120, C12 = $50, C21 = $80.
Time t3: Site S1 receives the message for an $80 credit to Account A and updates Account A. Account A = $630, Account B = $120, C12 = $50, C21 = $0.
Time t4: Site S2 receives the message for a $50 credit to Account B and updates Account B. Account A = $630, Account B = $170, C12 = $0, C21 = $0.

Figure 6.3: Timing diagram of the money transfer example.

         t0     t1     t2     t3     t4
S1: A    $600   $550   $550   $630   $630
S2: B    $200   $200   $120   $120   $170
C12      $0     $50    $50    $50    $0
C21      $0     $0     $80    $0     $0

(At t3, site S1 receives the message for an $80 credit to Account A and updates Account A; at t4, site S2 receives the message for a $50 credit to Account B and updates Account B.)

Suppose the local state of Account A is recorded at time t0 to show $600, and the local states of Account B and channels C12 and C21 are recorded at time t2 to show $120, $50, and $80, respectively. Then the recorded global state shows $850 in the system. An extra $50 appears in the system.
The reason for the inconsistency is that Account A's state was recorded before the $50 transfer to Account B using channel C12 was initiated, whereas channel C12's state was recorded after the $50 transfer was initiated.
This simple example shows that recording a consistent global state of a distributed system is not a trivial task. Recording activities of individual components must be coordinated appropriately.

Model of Communication
Recall that there are three models of communication: FIFO, non-FIFO, and causal ordering (CO).
In the FIFO model, each channel acts as a first-in first-out message queue and thus message ordering is preserved by a channel.
In the non-FIFO model, a channel acts like a set in which the sender process adds messages and the receiver process removes messages from it in a random order.
A system that supports causal delivery of messages satisfies the following property: "For any two messages mij and mkj, if send(mij) → send(mkj), then rec(mij) → rec(mkj)".

Snapshot algorithm for FIFO channels
Chandy-Lamport algorithm:
The Chandy-Lamport algorithm uses a control message, called a marker, whose role in a FIFO system is to separate messages in the channels.
After a site has recorded its snapshot, it sends a marker along all of its outgoing channels before sending out any more messages.
A marker separates the messages in the channel into those to be included in the snapshot and those not to be recorded in the snapshot.
A process must record its snapshot no later than when it receives a marker on any of its incoming channels.

Chandy-Lamport Algorithm
The algorithm can be initiated by any process by executing the "Marker Sending Rule", by which it records its local state and sends a marker on each outgoing channel.
A process executes the "Marker Receiving Rule" on receiving a marker. If the process has not yet recorded its local state, it records the state of the channel on which the marker is received as empty and executes the "Marker Sending Rule" to record its local state.
The algorithm terminates after each process has received a marker on all of its incoming channels.
All the local snapshots get disseminated to all other processes, and all the processes can determine the global state.

Chandy-Lamport Algorithm
Marker Sending Rule for process i:
1) Process i records its state.
2) For each outgoing channel C on which a marker has not been sent, i sends a marker along C before i sends further messages along C.

Marker Receiving Rule for process j:
On receiving a marker along channel C:
if j has not recorded its state then
    Record the state of C as the empty set
    Follow the "Marker Sending Rule"
else
    Record the state of C as the set of messages received along C after j's state was recorded and before j received the marker along C
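A minimal Python sketch of these two rules at a single process, assuming hypothetical record_local_state() and send(channel, msg) helpers supplied by the application (illustrative only):

MARKER = "MARKER"

class SnapshotProcess:
    def __init__(self, in_channels, out_channels, record_local_state, send):
        self.in_channels = in_channels        # names of incoming channels
        self.out_channels = out_channels      # names of outgoing channels
        self.record_local_state = record_local_state
        self.send = send
        self.local_state = None               # recorded local snapshot
        self.channel_state = {}               # channel -> recorded messages
        self.recording = {}                   # channel -> still recording?

    def marker_sending_rule(self):
        self.local_state = self.record_local_state()
        for c in self.out_channels:
            self.send(c, MARKER)              # marker before any further messages
        self.recording = {c: True for c in self.in_channels}

    def on_message(self, channel, msg):
        if msg == MARKER:
            if self.local_state is None:      # first marker seen
                self.marker_sending_rule()
            # state of this channel = messages recorded so far (empty if none)
            self.channel_state.setdefault(channel, [])
            self.recording[channel] = False
        elif self.local_state is not None and self.recording.get(channel):
            # in-transit message: arrived after our snapshot, before the marker
            self.channel_state.setdefault(channel, []).append(msg)

    def snapshot_done(self):
        return self.local_state is not None and not any(self.recording.values())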

Properties of the recorded global state

The recorded global state may not correspond to any of the global states that occurred during the computation.
Consider two possible executions of the snapshot algorithm (shown in Figure 6.4) for the previous money transfer example.

Figure 6.4: Timing diagram of two possible executions of the banking example (the execution of Figure 6.3, with the markers of the two example runs of the snapshot algorithm overlaid).
Properties of the recorded global state
1. (Markers shown using red dashed-and-dotted arrows in Figure 6.4.)
Let site S1 initiate the algorithm just after t1. Site S1 records its local state (Account A = $550) and sends a marker to site S2. The marker is received by site S2 after t4. When site S2 receives the marker, it records its local state (Account B = $170), records the state of channel C12 as $0, and sends a marker along channel C21. When site S1 receives this marker, it records the state of channel C21 as $80. The $800 amount in the system is conserved in the recorded global state:
A = $550, B = $170, C12 = $0, C21 = $80

Figure 6.4 (first execution): the recorded global state is A = $550, B = $170, C12 = $0, C21 = $80. The $800 amount in the system is conserved in the recorded global state.
Properties of the recorded global state
2. (Markers shown using green dotted arrows in Figure 6.4.)
Let site S1 initiate the algorithm just after t0 and before sending the $50 for S2. Site S1 records its local state (Account A = $600) and sends a marker to site S2. The marker is received by site S2 between t2 and t3. When site S2 receives the marker, it records its local state (Account B = $120), records the state of channel C12 as $0, and sends a marker along channel C21. When site S1 receives this marker, it records the state of channel C21 as $80. The $800 amount in the system is conserved in the recorded global state:
A = $600, B = $120, C12 = $0, C21 = $80

Figure 6.4 (second execution): the recorded global state is A = $600, B = $120, C12 = $0, C21 = $80. The $800 amount in the system is conserved in the recorded global state.
Properties of the recorded global state
In both of these possible runs of the algorithm, the recorded global states never occurred in the execution.
This happens because a process can change its state asynchronously before the markers it sent are received by other sites and the other sites record their states.
But the system could have passed through the recorded global states in some equivalent executions.
The recorded global state is a valid state in an equivalent execution, and if a stable property (i.e., a property that persists) holds in the system before the snapshot algorithm begins, it holds in the recorded global snapshot.
Therefore, a recorded global state is useful in detecting stable properties.

Conclusion
Recording the global state of a distributed system is an important paradigm in the design of distributed systems, and the design of efficient methods for recording the global state is an important issue.
This lecture first discussed a formal definition of the global state of a distributed system and issues related to its capture; we then discussed the Chandy-Lamport algorithm to record a snapshot of a distributed system.

Distributed Mutual Exclusion

Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:

In this lecture, we will discuss the concepts of mutual exclusion, classical algorithms for distributed computing systems, and industry systems for mutual exclusion.

Need of Mutual Exclusion in Cloud
Bank's servers in the cloud: Two customers make simultaneous deposits of Rs. 10,000 into your bank account, each from a separate ATM.
Both ATMs read the initial amount of Rs. 1,000 concurrently from the bank's cloud server.
Both ATMs add Rs. 10,000 to this amount (locally at the ATM).
Both write the final amount to the server.
What's wrong? The final balance is Rs. 11,000 (instead of Rs. 21,000).

Need of Mutual Exclusion in Cloud
Bank's servers in the cloud: Two customers make simultaneous deposits of Rs. 10,000 into your bank account, each from a separate ATM.
Both ATMs read the initial amount of Rs. 1,000 concurrently from the bank's cloud server.
Both ATMs add Rs. 10,000 to this amount (locally at the ATM).
Both write the final amount to the server.
You lost Rs. 10,000!
The ATMs need mutually exclusive access to your account entry at the server,
or mutually exclusive access to executing the code that modifies the account entry.

Some other Mutual Exclusion uses
Distributed file systems
o Locking of files and directories
Accessing objects in a safe and consistent way
o Ensure at most one server has access to an object at any point of time
Server coordination
o Work partitioned across servers
o Servers coordinate using locks
In industry
o Chubby is Google's locking service
o Many cloud stacks use Apache ZooKeeper for coordination among servers

Problem Statement for Mutual Exclusion
• Critical Section Problem: a piece of code (at all processes) for which we need to ensure there is at most one process executing it at any point of time.
• Each process can call three functions:
o enter() to enter the critical section (CS)
o AccessResource() to run the critical section code
o exit() to exit the critical section

Bank Example

ATM1:
enter(S);
// AccessResource()
obtain bank amount;
add in deposit;
update bank amount;
// AccessResource() end
exit(S); // exit

ATM2:
enter(S);
// AccessResource()
obtain bank amount;
add in deposit;
update bank amount;
// AccessResource() end
exit(S); // exit
Approaches to Solve Mutual Exclusion
• Single OS:
• If all processes are running in one OS on a machine (or VM), then we can use
• semaphores, mutexes, condition variables, monitors, etc.
Approaches to Solve Mutual Exclusion (2)
• Distributed system:
• Processes communicate by passing messages.
We need to guarantee 3 properties:
o Safety (essential): At most one process executes in the CS (critical section) at any time.
o Liveness (essential): Every request for the CS is granted eventually.
o Fairness (desirable): Requests are granted in the order they were made.

Processes Sharing an OS: Semaphores
Semaphore == an integer that can only be accessed via two special functions.
Semaphore S = 1; // Max number of allowed accessors

1. wait(S) (or P(S) or down(S)) – used in enter():
while(1) {  // each execution of the while loop is atomic
    if (S > 0) {
        S--;
        break;
    }
}

2. signal(S) (or V(S) or up(S)) – used in exit():
S++;  // atomic

Each while loop execution and S++ are each atomic operations – supported via hardware instructions such as compare-and-swap, test-and-set, etc.

Bank Example Using Semaphores
Semaphore S = 1; // shared

ATM1:
wait(S);
// AccessResource()
obtain bank amount;
add in deposit;
update bank amount;
// AccessResource() end
signal(S); // exit

ATM2:
wait(S);
// AccessResource()
obtain bank amount;
add in deposit;
update bank amount;
// AccessResource() end
signal(S); // exit

Next
• In a distributed system, we cannot share variables like semaphores.
• So how do we support mutual exclusion in a distributed system?

System Model
Before solving any problem, specify its system model:
Each pair of processes is connected by reliable channels (such as TCP).
Messages are eventually delivered to the recipient, and in FIFO (First In First Out) order.
Processes do not fail.
• Fault-tolerant variants exist in the literature.

Central Solution
o Elect a central master (or leader)
o Use one of our election algorithms!
o The master keeps
o a queue of waiting requests from processes that wish to access the CS
o a special token which allows its holder to access the CS
o Actions of any process in the group:
o enter()
o Send a request to the master
o Wait for the token from the master
o exit()
o Send the token back to the master

Central Solution
o Master actions:
o On receiving a request from process Pi:
if (master has token)
    Send token to Pi
else
    Add Pi to queue
o On receiving the token back from process Pi:
if (queue is not empty)
    Dequeue the head of the queue (say Pj), and send that process the token
else
    Retain the token
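A minimal Python sketch of the master's logic, assuming a hypothetical send_token(pid) helper that delivers the token to a process (illustrative only):

from collections import deque

class CentralLockMaster:
    def __init__(self, send_token):
        self.send_token = send_token
        self.has_token = True                 # master initially holds the token
        self.queue = deque()                  # FIFO queue of waiting requests

    def on_request(self, pid):
        if self.has_token:
            self.has_token = False
            self.send_token(pid)              # grant immediately
        else:
            self.queue.append(pid)            # otherwise queue the request

    def on_token_returned(self):
        if self.queue:
            self.send_token(self.queue.popleft())   # grant to head of queue
        else:
            self.has_token = True             # retain the token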

Analysis of Central Algorithm
o Safety – at most one process in the CS
o Exactly one token
o Liveness – every request for the CS is granted eventually
o With N processes in the system, the queue has at most N processes
o If each process exits the CS eventually and there are no failures, liveness is guaranteed
o FIFO ordering is guaranteed, in the order of requests received at the master

Performance Analysis
Efficient mutual exclusion algorithms use fewer messages, and make processes wait for shorter durations to access resources.
Three metrics:
Bandwidth: the total number of messages sent in each enter and exit operation.
Client delay: the delay incurred by a process at each enter and exit operation (when no other process is in, or waiting for, the CS). (We will mostly consider the enter operation.)
Synchronization delay: the time interval between one process exiting the critical section and the next process entering it (when there is only one process waiting).

Analysis of Central Algorithm
o Bandwidth: the total number of messages sent in each enter and exit operation.
o 2 messages for enter
o 1 message for exit
o Client delay: the delay incurred by a process at each enter and exit operation (when no other process is in, or waiting for, the CS).
o 2 message latencies (request + grant)
o Synchronization delay: the time interval between one process exiting the critical section and the next process entering it (when there is only one process waiting).
o 2 message latencies (release + grant)

But…
The master is the performance bottleneck and a SPoF (single point of failure).

Ring-based Mutual Exclusion

Figure sequence (ring of processes N3, N5, N6, N12, N32, N80): a single token circulates around the virtual ring. In the first snapshot N3 currently holds the token and can access the CS; it then passes the token on ("Here's the token!") and cannot access the CS anymore; in a later snapshot N32 currently holds the token and can access the CS.

Ring-based Mutual Exclusion
N processes are organized in a virtual ring.
Each process can send a message to its successor in the ring.
Exactly 1 token.
enter()
    Wait until you get the token.
exit() // already have the token
    Pass the token on to the ring successor.
If you receive the token and are not currently in enter(), just pass the token on to the ring successor.
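A minimal Python sketch of one ring node's token handling, assuming a hypothetical send_to_successor(msg) helper (illustrative only):

import threading

class RingNode:
    def __init__(self, send_to_successor):
        self.send_to_successor = send_to_successor
        self.wants_cs = False
        self.token_arrived = threading.Event()

    def on_token(self):
        if self.wants_cs:
            self.token_arrived.set()            # hold the token; wake up enter()
        else:
            self.send_to_successor("TOKEN")     # not interested: forward it

    def enter(self):
        self.wants_cs = True
        self.token_arrived.wait()               # block until the token reaches us
        self.token_arrived.clear()

    def exit(self):                             # we already hold the token here
        self.wants_cs = False
        self.send_to_successor("TOKEN")         # pass it to the ring successor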

Analysis of Ring-based Mutual Exclusion
• Safety
• Exactly one token
• Liveness
• The token eventually loops around the ring and reaches the requesting process (assuming no failures)
• Bandwidth
• Per enter(), 1 message is sent by the requesting process, but up to N messages throughout the system
• 1 message sent per exit()

Analysis of Ring-Based Mutual Exclusion (2)

• Client delay: 0 to N message transmissions after entering enter()
• Best case: already have the token
• Worst case: just sent the token to the neighbor
• Synchronization delay between one process' exit() from the CS and the next process' enter():
• Between 1 and (N − 1) message transmissions.
• Best case: the process in enter() is the successor of the process in exit()
• Worst case: the process in enter() is the predecessor of the process in exit()

Next
Client/synchronization delay to access the CS is still O(N) in the ring-based approach.
Can we make this faster?

System Model
Before solving any problem, specify its system model:
Each pair of processes is connected by reliable channels (such as TCP).
Messages are eventually delivered to the recipient, and in FIFO (First In First Out) order.
Processes do not fail.

Lamport’s Algorithm

Requests for the CS are executed in the increasing order of timestamps, and time is determined by logical clocks.
Every site Si keeps a queue, request_queuei, which contains mutual exclusion requests ordered by their timestamps.
This algorithm requires communication channels to deliver messages in FIFO order. Three types of messages are used: REQUEST, REPLY and RELEASE. These messages, carrying timestamps, also update the logical clocks.

The Algorithm
Requesting the critical section:
When a site Si wants to enter the CS, it broadcasts a REQUEST(tsi, i) message to all other sites and places the request on request_queuei. ((tsi, i) denotes the timestamp of the request.)
When Sj receives the REQUEST(tsi, i) message from site Si, Sj places site Si's request on request_queuej and returns a timestamped REPLY message to Si.
Executing the critical section: Site Si enters the CS when the following two conditions hold:
L1: Si has received a message with timestamp larger than (tsi, i) from all other sites.
L2: Si's request is at the top of request_queuei.

The Algorithm

Releasing the critical section:
Site Si, upon exiting the CS, removes its request from the top of its request queue and broadcasts a timestamped RELEASE message to all other sites.
When a site Sj receives a RELEASE message from site Si, it removes Si's request from its request queue.
When a site removes a request from its request queue, its own request may come to the top of the queue, enabling it to enter the CS.
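A minimal Python sketch of one site's bookkeeping for these rules, assuming hypothetical broadcast(msg) and send(dest, msg) helpers and sites numbered 0..N−1 (illustrative only; delivery and failure handling are not shown):

import heapq

class LamportMutexSite:
    def __init__(self, i, n, broadcast, send):
        self.i, self.n = i, n
        self.broadcast, self.send = broadcast, send
        self.clock = 0
        self.queue = []                      # heap of (timestamp, site) requests
        self.latest_ts = {}                  # site -> timestamp of latest message
        self.my_request = None

    def request_cs(self):                    # broadcast REQUEST, queue it locally
        self.clock += 1
        self.my_request = (self.clock, self.i)
        heapq.heappush(self.queue, self.my_request)
        self.broadcast(("REQUEST", self.my_request))

    def on_request(self, ts, sender):        # queue it and return a REPLY
        self.clock = max(self.clock, ts[0]) + 1
        self.latest_ts[sender] = ts[0]
        heapq.heappush(self.queue, ts)
        self.send(sender, ("REPLY", (self.clock, self.i)))

    def on_reply(self, ts, sender):
        self.clock = max(self.clock, ts[0]) + 1
        self.latest_ts[sender] = ts[0]

    def can_enter_cs(self):
        # L1: a larger-timestamped message received from every other site;
        # L2: own request at the top of the request queue.
        l1 = all(self.latest_ts.get(j, 0) > self.my_request[0]
                 for j in range(self.n) if j != self.i)
        return l1 and self.queue and self.queue[0] == self.my_request

    def release_cs(self):                    # remove own request, broadcast RELEASE
        self.queue = [r for r in self.queue if r[1] != self.i]
        heapq.heapify(self.queue)
        self.clock += 1
        self.broadcast(("RELEASE", (self.clock, self.i)))

    def on_release(self, ts, sender):        # remove sender's request from queue
        self.clock = max(self.clock, ts[0]) + 1
        self.queue = [r for r in self.queue if r[1] != sender]
        heapq.heapify(self.queue)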

Correctness
Theorem: Lamport’s algorithm achieves mutual exclusion.

Proof:
The proof is by contradiction. Suppose two sites Si and Sj are executing the CS concurrently. For this to happen, conditions L1 and L2 must hold at both sites concurrently.
This implies that at some instant in time, say t, both Si and Sj have their own requests at the top of their request_queues and condition L1 holds at them.
Without loss of generality, assume that Si's request has a smaller timestamp than the request of Sj.
From condition L1 and the FIFO property of the communication channels, it is clear that at instant t the request of Si must be present in request_queuej when Sj was executing its CS. This implies that Sj's own request is at the top of its own request_queue when a smaller-timestamp request, Si's request, is present in request_queuej – a contradiction!

Correctness
Theorem: Lamport’s algorithm is fair.

Proof:
The proof is by contradiction. Suppose a site Si's request has a smaller timestamp than the request of another site Sj, and Sj is able to execute the CS before Si.
For Sj to execute the CS, it has to satisfy conditions L1 and L2. This implies that at some instant in time, say t, Sj has its own request at the top of its queue and it has also received a message with timestamp larger than the timestamp of its request from all other sites.
But request_queue at a site is ordered by timestamp, and according to our assumption Si has the lower timestamp. So Si's request must be placed ahead of Sj's request in request_queuej. This is a contradiction!

Lamport's Algorithm Example:

Figure: sites S1 and S2 make requests for the CS with timestamps (1,1) and (1,2) respectively; site S1 enters the CS.

Lamport's Algorithm Example (continued):

Figure: site S1 enters the CS; the request queues at S1, S2, and S3 hold (1,1), (1,2). Site S1 then exits the CS and sends RELEASE messages, after which site S2 enters the CS.

Performance

For each CS execution, Lamport's algorithm requires (N − 1) REQUEST messages, (N − 1) REPLY messages, and (N − 1) RELEASE messages.
Thus, Lamport's algorithm requires 3(N − 1) messages per CS invocation.
The synchronization delay in the algorithm is T (the average message delay).

An Optimization
In Lamport's algorithm, REPLY messages can be omitted in certain situations. For example, if site Sj receives a REQUEST message from site Si after it has sent its own REQUEST message with a timestamp higher than the timestamp of site Si's request, then site Sj need not send a REPLY message to site Si.
This is because when site Si receives site Sj's request with a timestamp higher than its own, it can conclude that site Sj does not have any smaller-timestamp request which is still pending.
With this optimization, Lamport's algorithm requires between 2(N − 1) and 3(N − 1) messages per CS execution.

Ricart-Agrawala’s Algorithm
Classical algorithm from 1981
Invented by Glenn Ricart (NIH) and Ashok Agrawala (U. Maryland)
No token
Uses the notion of causality and multicast
Has lower waiting time to enter the CS than the ring-based approach

Key Idea: Ricart-Agrawala Algorithm
• enter() at process Pi
  • multicast a request to all processes
  • Request: <T, Pi>, where T = current Lamport timestamp at Pi
  • Wait until all other processes have responded positively to the request
• Requests are granted in order of causality
  • <T, Pi> is used lexicographically: Pi in request <T, Pi> is used to break ties (since Lamport timestamps are not unique for concurrent events)

Ricart-Agrawala Algorithm
The Ricart-Agrawala algorithm assumes the communication channels are FIFO. The algorithm uses two types of messages: REQUEST and REPLY.
A process sends a REQUEST message to all other processes to request their permission to enter the critical section. A process sends a REPLY message to a process to give its permission to that process.
Processes use Lamport-style logical clocks to assign a timestamp to critical section requests, and timestamps are used to decide the priority of requests.
Each process pi maintains the Request-Deferred array, RDi, the size of which is the same as the number of processes in the system.
Initially, ∀i ∀j: RDi[j]=0. Whenever pi defers the request sent by pj, it sets RDi[j]=1, and after it has sent a REPLY message to pj, it sets RDi[j]=0.

Description of the Algorithm
Requesting the critical section:
(a) When a site Si wants to enter the CS, it broadcasts a timestamped REQUEST message to all other sites.
(b) When site Sj receives a REQUEST message from site Si, it sends a REPLY message to site Si if site Sj is neither requesting nor executing the CS, or if site Sj is requesting and Si's request's timestamp is smaller than site Sj's own request's timestamp. Otherwise, the reply is deferred and Sj sets RDj[i]=1.
Executing the critical section:
(c) Site Si enters the CS after it has received a REPLY message from every site it sent a REQUEST message to.

Contd…
Releasing the critical section:
(d) When site Si exits the CS, it sends all the deferred REPLY messages: ∀j, if RDi[j]=1, then send a REPLY message to Sj and set RDi[j]=0.
Notes:
When a site receives a message, it updates its clock using the timestamp in the message.
When a site takes up a request for the CS for processing, it updates its local clock and assigns a timestamp to the request.
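Putting steps (a)–(d) together, a compact sketch of one process is shown below (same caveats as the earlier Lamport sketch: the class name and the `network.send` transport are assumed helpers, and failure handling is omitted).

```python
class RicartAgrawalaSite:
    """Sketch of one process p_i in the Ricart-Agrawala algorithm."""

    def __init__(self, site_id, all_sites, network):
        self.i = site_id
        self.others = [s for s in all_sites if s != site_id]
        self.net = network
        self.clock = 0
        self.requesting = False
        self.in_cs = False
        self.my_ts = None
        self.replies_pending = set()
        self.RD = {j: 0 for j in all_sites}     # Request-Deferred array RD_i

    def tick(self, ts=0):
        self.clock = max(self.clock, ts) + 1

    # (a) broadcast a timestamped REQUEST
    def request_cs(self):
        self.tick()
        self.requesting = True
        self.my_ts = (self.clock, self.i)
        self.replies_pending = set(self.others)
        for j in self.others:
            self.net.send(j, ("REQUEST", self.my_ts))

    # (b) reply immediately, or defer and set RD_i[j] = 1
    def on_request(self, ts_pair, j):
        self.tick(ts_pair[0])
        defer = self.in_cs or (self.requesting and self.my_ts < ts_pair)
        if defer:
            self.RD[j] = 1                       # our own request has priority
        else:
            self.net.send(j, ("REPLY", self.i))

    # (c) enter the CS once every site has replied
    def on_reply(self, j):
        self.replies_pending.discard(j)
        if self.requesting and not self.replies_pending:
            self.in_cs = True

    # (d) on exit, send all deferred REPLY messages
    def release_cs(self):
        self.in_cs = False
        self.requesting = False
        for j, deferred in self.RD.items():
            if deferred:
                self.RD[j] = 0
                self.net.send(j, ("REPLY", self.i))
```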
Correctness

Theorem: Ricart-Agrawala algorithm achieves mutual exclusion.

Proof:

Proof is by contradiction. Suppose two sites Si and Sj are executing the CS concurrently and Si's request has higher priority than the request of Sj. Clearly, Si received Sj's request after it had made its own request.
Thus, Sj can concurrently execute the CS with Si only if Si returns a REPLY to Sj (in response to Sj's request) before Si exits the CS.
However, this is impossible because Sj's request has lower priority.
Therefore, the Ricart-Agrawala algorithm achieves mutual exclusion.

Ricart–Agrawala algorithm Example:
Sites S1 and S2 are making requests for the CS.
[Figure: space-time diagram of sites S1, S2, and S3 with REQUESTs timestamped (1,1) and (1,2).]
Ricart–Agrawala algorithm Example:

Site S1 enters the CS; S2's request is deferred at S1.
[Figure: space-time diagram of sites S1, S2, and S3.]
Ricart–Agrawala algorithm Example:

Site S1 exits the CS and sends a REPLY message to S2's deferred request.
[Figure: space-time diagram of sites S1, S2, and S3.]
Performance

For each CS execution, the Ricart-Agrawala algorithm requires (N − 1) REQUEST messages and (N − 1) REPLY messages.
Thus, it requires 2(N − 1) messages per CS execution.
Synchronization delay in the algorithm is T.

Comparison
Compared to the ring-based approach, in the Ricart-Agrawala approach:
Client/synchronization delay has now gone down to O(1)
But bandwidth has gone up to O(N)
Can we get both down?

Quorum-based approach

In the 'quorum-based approach', each site requests permission to execute the CS from a subset of sites (called a quorum).
The intersection property of quorums makes sure that only one request executes the CS at any time.

Quorum-Based Mutual Exclusion Algorithms
Quorum-based mutual exclusion algorithms are different in two ways:
1. A site does not request permission from all other sites, but only from a subset of the sites.
   The request sets of sites are chosen such that ∀i ∀j : 1 ≤ i, j ≤ N :: Ri ∩ Rj ≠ ∅.
   Consequently, every pair of sites has a site which mediates conflicts between that pair.
2. A site can send out only one REPLY message at any time.
   A site can send a REPLY message only after it has received a RELEASE message for the previous REPLY message.

Contd…

Notion of 'Coteries' and 'Quorums':
A coterie C is defined as a set of sets, where each set g ∈ C is called a quorum.
The following properties hold for quorums in a coterie:
Intersection property: For every pair of quorums g, h ∈ C, g ∩ h ≠ ∅.
For example, sets {1,2,3}, {2,5,7} and {5,7,9} cannot be quorums in a coterie, because the first and third sets do not have a common element.
Minimality property: There should be no quorums g, h in coterie C such that g ⊇ h, i.e., g is a superset of h.
For example, sets {1,2,3} and {1,3} cannot be quorums in a coterie because the first set is a superset of the second.
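These two properties are easy to check mechanically; the small helper below (illustrative, not part of any algorithm in these notes) tests both properties on the examples above.

```python
def is_coterie(C):
    """Check the intersection and minimality properties for a set of quorums C."""
    quorums = [set(g) for g in C]
    # Intersection property: every pair of quorums shares at least one site
    for a in quorums:
        for b in quorums:
            if not (a & b):
                return False
    # Minimality property: no quorum is a superset of another distinct quorum
    for a in quorums:
        for b in quorums:
            if a != b and a >= b:
                return False
    return True

print(is_coterie([{1, 2, 3}, {2, 5, 7}, {5, 7, 9}]))  # False: {1,2,3} and {5,7,9} are disjoint
print(is_coterie([{1, 2, 3}, {1, 3}]))                # False: {1,2,3} is a superset of {1,3}
print(is_coterie([{1, 2}, {2, 3}, {1, 3}]))           # True
```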

Maekawa’s Algorithm

Maekawa's algorithm was the first quorum-based mutual exclusion algorithm.
The request sets for sites (i.e., quorums) in Maekawa's algorithm are constructed to satisfy the following conditions:
M1: (∀i ∀j : i ≠ j, 1 ≤ i, j ≤ N :: Ri ∩ Rj ≠ ∅)
M2: (∀i : 1 ≤ i ≤ N :: Si ∈ Ri)
M3: (∀i : 1 ≤ i ≤ N :: |Ri| = K)
M4: Any site Sj is contained in K number of Ri's, 1 ≤ i, j ≤ N.
Maekawa used the theory of projective planes and showed that N = K(K − 1) + 1. This relation gives |Ri| = K ≈ √N.
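Maekawa's projective-plane construction is fairly involved; the sketch below uses a simpler grid construction (a common textbook substitute, not Maekawa's original one) that also yields pairwise-intersecting, equal-size quorums of O(√N) sites.

```python
import math

def grid_quorums(n):
    """Arrange sites 0..n-1 in a sqrt(n) x sqrt(n) grid; the quorum of a site is
    its row plus its column. Any two quorums intersect because any row crosses
    any column. The sketch assumes n is a perfect square for simplicity."""
    k = math.isqrt(n)
    assert k * k == n, "sketch assumes n is a perfect square"
    quorums = []
    for s in range(n):
        r, c = divmod(s, k)
        row = {r * k + j for j in range(k)}
        col = {i * k + c for i in range(k)}
        quorums.append(row | col)        # |R_s| = 2*sqrt(n) - 1
    return quorums

R = grid_quorums(9)
print(R[0])                              # quorum of site 0: {0, 1, 2, 3, 6}
print(all(R[i] & R[j] for i in range(9) for j in range(9)))  # True: pairwise intersection
```

For N = 9 this gives quorums of size 5 = 2√N − 1; Maekawa's projective-plane quorums are smaller, roughly √N.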

Maekawa’s Algorithm

Conditions M1 and M2 are necessary for correctness, whereas conditions M3 and M4 provide other desirable features to the algorithm.
Condition M3 states that the size of the request sets of all sites must be equal, implying that all sites have to do an equal amount of work to invoke mutual exclusion.
Condition M4 enforces that exactly the same number of sites should request permission from any site, which implies that all sites have "equal responsibility" in granting permission to other sites.

The Algorithm
A site Si executes the following steps to execute the CS.

Requesting the critical section
(a) A site Si requests access to the CS by sending REQUEST(i) messages to all sites in its request set Ri.
(b) When a site Sj receives the REQUEST(i) message, it sends a REPLY(j) message to Si provided it hasn't sent a REPLY message to a site since its receipt of the last RELEASE message. Otherwise, it queues up the REQUEST(i) for later consideration.
Executing the critical section
(c) Site Si executes the CS only after it has received a REPLY message from every site in Ri.

The Algorithm

Releasing the critical section
(d) After the execution of the CS is over, site Si sends a RELEASE(i) message to every site in Ri.
(e) When a site Sj receives a RELEASE(i) message from site Si, it sends a REPLY message to the next site waiting in the queue and deletes that entry from the queue.
If the queue is empty, then the site updates its state to reflect that it has not sent out any REPLY message since the receipt of the last RELEASE message.
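A minimal sketch of steps (a)–(e) is given below, with deadlock handling (discussed later) deliberately left out; as before, the class and the `network.send` transport are assumed helpers, not part of Maekawa's original description.

```python
from collections import deque

class MaekawaSite:
    """Sketch of one site in Maekawa's algorithm (without deadlock handling)."""

    def __init__(self, site_id, request_set, network):
        self.i = site_id
        self.R = request_set              # quorum R_i (includes self, condition M2)
        self.net = network
        self.locked_for = None            # site currently holding this site's REPLY
        self.waiting = deque()            # queued REQUESTs
        self.replies = set()

    # (a) request permission from every site in R_i
    def request_cs(self):
        self.replies = set()
        for j in self.R:
            self.net.send(j, ("REQUEST", self.i))

    # (b) grant if no REPLY is outstanding since the last RELEASE, else queue
    def on_request(self, j):
        if self.locked_for is None:
            self.locked_for = j
            self.net.send(j, ("REPLY", self.i))
        else:
            self.waiting.append(j)

    # (c) enter the CS after a REPLY from every site in R_i
    def on_reply(self, j):
        self.replies.add(j)
        if self.replies == set(self.R):
            self.enter_cs()

    def enter_cs(self):
        pass                              # critical section body goes here

    # (d) release every site in R_i after leaving the CS
    def release_cs(self):
        for j in self.R:
            self.net.send(j, ("RELEASE", self.i))

    # (e) pass the lock to the next queued request, if any
    def on_release(self, j):
        if self.waiting:
            nxt = self.waiting.popleft()
            self.locked_for = nxt
            self.net.send(nxt, ("REPLY", self.i))
        else:
            self.locked_for = None
```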

Correctness

Theorem: Maekawa’s algorithm achieves mutual exclusion.


Proof:

Proof is by contradiction. Suppose two sites Si and Sj are concurrently executing the CS.
This means site Si received a REPLY message from all sites in Ri and concurrently site Sj was able to receive a REPLY message from all sites in Rj.
If Ri ∩ Rj = {Sk}, then site Sk must have sent REPLY messages to both Si and Sj concurrently, which is a contradiction.

Performance

Since the size of a request set is √N, an execution of the CS requires √N REQUEST, √N REPLY, and √N RELEASE messages, resulting in 3√N messages per CS execution.
Synchronization delay in this algorithm is 2T. This is because after a site Si exits the CS, it first releases all the sites in Ri and then one of those sites sends a REPLY message to the next site that executes the CS.
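As a rough, back-of-the-envelope comparison of message counts (assuming the idealized quorum size K ≈ √N):

```python
import math

for n in (13, 100, 10000):
    k = round(math.sqrt(n))
    print(f"N={n:>6}: Lamport ~{3*(n-1):>6} msgs, "
          f"Ricart-Agrawala ~{2*(n-1):>6} msgs, Maekawa ~{3*k:>4} msgs")
```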
Problem of Deadlocks

Maekawa's algorithm can deadlock because a site is exclusively locked by other sites and requests are not prioritized by their timestamps.
Assume three sites Si, Sj, and Sk simultaneously invoke mutual exclusion. Suppose Ri ∩ Rj = {Sij}, Rj ∩ Rk = {Sjk}, and Rk ∩ Ri = {Ski}.
Consider the following scenario:
1. Sij has been locked by Si (forcing Sj to wait at Sij).
2. Sjk has been locked by Sj (forcing Sk to wait at Sjk).
3. Ski has been locked by Sk (forcing Si to wait at Ski).
This state represents a deadlock involving sites Si, Sj, and Sk.

Handling Deadlocks

Maekawa's algorithm handles deadlocks by requiring a site to yield a lock if the timestamp of its request is larger than the timestamp of some other request waiting for the same lock.
A site suspects a deadlock (and initiates message exchanges to resolve it) whenever a higher priority request arrives and waits at a site because the site has sent a REPLY message to a lower priority request.

Message types for Handling Deadlocks
Deadlock handling requires three types of messages:

FAILED: A FAILED message from site Si to site Sj indicates that Si cannot grant Sj's request because it has currently granted permission to a site with a higher priority request.
INQUIRE: An INQUIRE message from Si to Sj indicates that Si would like to find out from Sj if it has succeeded in locking all the sites in its request set.
YIELD: A YIELD message from site Si to Sj indicates that Si is returning the permission to Sj (to yield to a higher priority request at Sj).

Handling Deadlocks
Maekawa’s algorithm handles deadlocks as follows:
When a REQUEST(ts, i) from site Si blocks at site Sj because Sj has currently granted permission to site Sk, then Sj sends a FAILED(j) message to Si if Si's request has lower priority. Otherwise, Sj sends an INQUIRE(j) message to site Sk.
In response to an INQUIRE(j) message from site Sj, site Sk sends a YIELD(k) message to Sj provided Sk has received a FAILED message from a site in its request set and if it sent a YIELD to any of these sites, but has not received a new REPLY from it.
In response to a YIELD(k) message from site Sk, site Sj assumes as if it has been released by Sk, places the request of Sk at the appropriate location in the request queue, and sends a REPLY(j) to the top request's site in the queue.
Maximum number of messages required per CS execution in this case is 5√N.
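The arbiter-side decision between FAILED and INQUIRE can be sketched by extending the earlier MaekawaSite class. This is illustrative only: it reworks the request bookkeeping to carry (timestamp, site-id) priorities, and it omits the requester-side handling of FAILED and INQUIRE messages.

```python
class MaekawaSiteWithYield(MaekawaSite):
    """Adds the FAILED/INQUIRE decision and YIELD handling to the earlier sketch.
    Requests are (timestamp, site_id) pairs; a smaller pair = higher priority."""

    def __init__(self, site_id, request_set, network):
        super().__init__(site_id, request_set, network)
        self.granted = None        # (ts, site) currently holding our REPLY
        self.blocked = set()       # requests waiting at this site

    def on_request(self, ts, j):
        req = (ts, j)
        if self.granted is None:
            self.granted = req
            self.net.send(j, ("REPLY", self.i))
        elif req > self.granted:
            self.blocked.add(req)
            self.net.send(j, ("FAILED", self.i))               # lower priority: wait
        else:
            self.blocked.add(req)
            self.net.send(self.granted[1], ("INQUIRE", self.i))  # ask holder to yield

    def on_yield(self, k):
        # Treat the YIELD as a release from S_k: requeue its request and grant
        # the REPLY to the highest-priority blocked request.
        self.blocked.add(self.granted)
        self.granted = min(self.blocked)
        self.blocked.remove(self.granted)
        self.net.send(self.granted[1], ("REPLY", self.i))
```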

Handling Deadlocks: Case-I
When a REQUEST(ts , i ) from site Si blocks at site Sj because Sj has currently
granted permission to site Sk , then Sj sends a FAILED( j ) message to Si if Si ’s
request has lower priority. Otherwise, Sj sends an INQUIRE( j ) message to site Sk .

[Figure: message exchange among Si, Sj, and Sk. Si's REQUEST(ts,i) blocks at Sj, which has already sent REPLY(j) in response to Sk's REQUEST(ts,k); depending on the relative priority of (ts,i) and (ts,k), Sj sends either FAILED(j) to Si or INQUIRE(j) to Sk.]

Handling Deadlocks: Case-II
In response to an INQUIRE(j) message from site Sj , site Sk sends a YIELD(k ) message
to Sj provided Sk has received a FAILED message from a site in its request set and if
it sent a YIELD to any of these sites, but has not received a new REPLY from it.

[Figure: message exchange among Si, Sj, and Sk. Sj sends INQUIRE(j) to Sk, and Sk responds with YIELD(k) to Sj.]

Handling Deadlocks: Case-III
In response to a YIELD(k ) message from site Sk , site Sj assumes as if it has
been released by Sk , places the request of Sk at appropriate location in the
request queue, and sends a REPLY( j) to the top request’s site in the queue.

[Figure: message exchange among Si, Sj, and Sk. On receiving YIELD(k) from Sk, Sj assumes it has been released by Sk and sends REPLY(j) to the top request's site, i.e., Si.]
Failures?

Other ways to handle failures: use a Paxos-like (consensus) approach!

Industry Mutual Exclusion : Chubby
Google's system for locking
Used underneath Google's systems like BigTable, Megastore, etc.
Chubby provides advisory locks only
Doesn't guarantee mutual exclusion unless every client checks the lock before accessing the resource

Reference: http://research.google.com/archive/chubby.html
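Because the locks are advisory, mutual exclusion rests on client discipline; the sketch below illustrates that discipline with an invented client API (the acquire/release names and the lock path are hypothetical, not Chubby's actual interface).

```python
class AdvisoryLockClient:
    """Illustrative client of a Chubby-like advisory lock service.
    `service` is assumed to expose acquire(name) and release(handle);
    these names are invented for the sketch."""

    def __init__(self, service):
        self.service = service

    def update_resource(self, resource, new_value):
        # The service cannot stop a misbehaving client from touching the
        # resource directly; mutual exclusion holds only because every
        # well-behaved client acquires the lock before accessing it.
        handle = self.service.acquire("/ls/cell/resource-lock")
        try:
            resource.write(new_value)          # protected region
        finally:
            self.service.release(handle)
```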

Chubby
Can be used not only for locking but also for writing small configuration files
Relies on a Paxos-like (consensus) protocol
Group of servers with one elected as Master
All servers replicate the same information
Clients send read requests to the Master, which serves them locally
Clients send write requests to the Master, which sends them to all servers, gets a majority (quorum) among the servers, and then responds to the client
On master failure, run an election protocol
On replica failure, just replace it and have it catch up
[Figure: a Chubby cell of five servers (A–E), one of which acts as Master.]
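The read/write split through the Master can be modelled with a toy sketch like the one below; it only imitates the majority-acknowledgement step and does not capture the Paxos-based replicated log that the real system relies on (the replica `accept` method is an assumed stub).

```python
class ToyMaster:
    """Toy model of a Chubby-like master: reads are served locally, writes are
    acknowledged only after a majority of the cell has accepted them."""

    def __init__(self, replicas):
        self.replicas = replicas          # replica stubs, each with accept(key, value)
        self.store = {}                   # master's local copy of the replicated data

    def read(self, key):
        return self.store.get(key)        # served locally by the master

    def write(self, key, value):
        acks = 1                          # the master counts as one copy
        for r in self.replicas:
            if r.accept(key, value):      # replica persists the update
                acks += 1
        if acks * 2 > len(self.replicas) + 1:   # majority (quorum) reached
            self.store[key] = value
            return True                   # now safe to respond to the client
        return False
```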
Conclusion
Mutual exclusion is an important problem in cloud computing systems
Classical algorithms:
  Central
  Ring-based
  Lamport's algorithm
  Ricart-Agrawala
  Maekawa
Industry systems:
  Chubby: a coordination service
  Similarly, Apache ZooKeeper for coordination
