Week 4_Lecture Notes
Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Cloud Computing and Distributed Systems: Time and Clock Synchronization
Preface
Content of this Lecture:
We will discuss causality and a general framework of logical clocks, and present two systems of logical time, namely Lamport and vector timestamps, to capture causality between events of a distributed computation.
What if your watch is slow by 15 minutes?
• You'll miss the bus!
What if your watch is fast by 15 minutes?
• You'll end up unfairly waiting for a longer time than you intended.
Time synchronization is required for:
• Correctness
• Fairness
Distributed Time
The notion of time is well defined (and measurable) at each single location, but the relationship between time at different locations is unclear.
Time synchronization is required for:
• Correctness
• Fairness
Synchronization in the cloud
Example: Cloud based airline reservation system:
Server X timestamps the purchase using its local clock as 6h:25m:42.55s, logs it, and replies OK to the client.
Since that was the very last seat, Server X sends a message to Server Y saying the "flight is full".
Y enters "Flight PQR 123 is full" in its log together with its own local clock value (which happens to read 6h:20m:20.21s).
Server Z queries X's and Y's logs and is confused that a client purchased a ticket at X after the flight became full at Y.
This may lead to incorrect actions at Z.
Unlike processes on a single workstation, which share a system clock, processes in Internet-based systems follow an asynchronous model:
– No bounds on message delays
– No bounds on processing delays
This is unlike multi-processor (or parallel) systems, which follow a synchronous system model.
Each process takes actions to change its state; an action may be an instruction or a communication action (send, receive).
An event is the occurrence of an action.
Each process has a local clock, so events within a process can be assigned timestamps and thus ordered linearly.
But in a distributed system, we also need to know the time order of events across different processes.
Clock skew = relative difference in clock values of two processes.
• Like the distance between two vehicles on a road.
Clock drift = relative difference in clock frequencies (rates) of two processes.
• Like the difference in speeds of two vehicles on the road.
A non-zero clock skew implies the clocks are not synchronized.
A non-zero clock drift causes the skew to increase (eventually):
• If the faster vehicle is ahead, it will drift away.
• If the faster vehicle is behind, it will catch up and then drift away.
Physical clocks are synchronized to an accurate real-time standard like UTC (Coordinated Universal Time).
MDR = Maximum Drift Rate of a clock (relative to UTC).
• The MDR of any process depends on its environment.
The maximum drift rate between two clocks with similar MDR is 2*MDR.
Given a maximum acceptable skew M between any pair of clocks, we need to synchronize at least once every M / (2*MDR) time units.
• Since time = distance / speed.
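The resynchronization bound above is a one-line computation. A minimal sketch (the function name and arguments are illustrative, not from the lecture):

```python
def sync_interval(max_skew, mdr):
    """Longest allowed interval between synchronizations so that two
    clocks, each drifting at up to mdr (seconds of drift per second),
    never accumulate more than max_skew of relative skew."""
    return max_skew / (2 * mdr)

# e.g., tolerating 2 ms of skew with clocks drifting at 1 microsecond/second:
interval = sync_interval(2e-3, 1e-6)   # about 1000 seconds between syncs
```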
External Synchronization
Each process's clock C(i) stays within a bound D of an external reference clock S:
|C(i) - S| < D at all times.
The external clock may be connected to UTC (Coordinated Universal Time) or an atomic clock.
Examples: Cristian's algorithm, NTP
Internal Synchronization
Every pair of processes in the group have clocks within bound D:
|C(i) - C(j)| < D at all times and for all processes i, j.
Examples: Berkeley algorithm, DTP
Internal synchronization does not imply external synchronization.
• In fact, the entire system may drift away from the external clock S!
Basic approach: P asks S, "What's the time?"; S checks its local clock to find time t and replies, "Here's the time t!"; P sets its clock to t.
What's wrong:
By the time the message is received at P, time has moved on.
Setting P's time to t is inaccurate.
The inaccuracy is a function of the message latencies.
Since latencies are unbounded in an asynchronous system, the inaccuracy cannot be bounded.
Cristian's algorithm accounts for the message latencies. Let min1 be the minimum latency from P to S, and min2 the minimum latency from S to P (covering transmission, time to queue messages, etc.).
P measures the round-trip time RTT between sending its request and receiving the response.
The actual time at P when it receives the response is in the interval [t + min2, t + RTT - min1].
[Figure: P sends "What's the time?" to S; S checks its local clock to find time t and replies "Here's the time t!"; P measures RTT and sets its clock]
Cristian's Algorithm
The actual time at P when it receives the response is between [t + min2, t + RTT - min1].
P sets its time to halfway through this interval:
t + (RTT + min2 - min1)/2
The error is at most (RTT - min2 - min1)/2, i.e., bounded.
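As a sketch, Cristian's estimate and error bound can be coded as below. `request_server` stands in for the actual network call to S and is a hypothetical callable; `min1`/`min2` default to zero when no latency bounds are known.

```python
import time

def cristian_sync(request_server, min1=0.0, min2=0.0):
    """Cristian's algorithm (sketch): ask the server for its time t,
    measure the round-trip time, and set the clock to the midpoint of
    the interval [t + min2, t + RTT - min1]."""
    start = time.monotonic()
    t = request_server()                   # server replies with its time t
    rtt = time.monotonic() - start
    estimate = t + (rtt + min2 - min1) / 2.0
    max_error = (rtt - min1 - min2) / 2.0  # half of the uncertain window
    return estimate, max_error
```

The returned `max_error` shrinks with the RTT, which is why taking the reading over a fast, lightly loaded link gives a tighter bound.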
Error Bounds
The error of Cristian's algorithm is bounded by (RTT - min2 - min1)/2, i.e., by half the round-trip time.
A clock should not be set backward; instead, a process is allowed to increase or decrease the speed of its clock to converge gradually.
If the error is too high, take multiple readings and average them.
Example:
Tserver = 5:09:25.300, Tmin = 200 msec.
Elapsed time at the client is T1 - T0 = 5:08:15.900 - 5:08:15.100 = 800 msec.
Best guess: the timestamp was generated 400 msec ago (half the elapsed time).
Set the time to Tserver + the estimated one-way delay: 5:09:25.300 + 400 msec = 5:09:25.700.
NTP (Network Time Protocol) servers are organized in a tree.
Each node synchronizes with its tree parent:
• Primary servers (root)
• Secondary servers
• Tertiary servers
• Clients (leaves)
NTP Protocol
Between a child and its parent, two messages are exchanged:
• Message 1: child → parent; send time ts1 (child's clock), receive time tr1 (parent's clock).
• Message 2: parent → child; send time ts2 (parent's clock), receive time tr2 (child's clock). Message 2 carries tr1 and ts2.
Suppose the child is ahead of the parent by oreal (the parent is ahead of the child by -oreal).
Suppose the one-way latency of Message 1 is L1 (and L2 for Message 2). No one knows L1 or L2!
Then:
tr1 = ts1 + L1 + oreal
tr2 = ts2 + L2 - oreal
Why o = (tr1 - tr2 + ts2 - ts1)/2?
From:
tr1 = ts1 + L1 + oreal
tr2 = ts2 + L2 - oreal
Subtracting the second equation from the first:
tr1 - tr2 = ts1 - ts2 + (L1 - L2) + 2·oreal
=> oreal = (tr1 - tr2 + ts2 - ts1)/2 + (L2 - L1)/2
=> oreal = o + (L2 - L1)/2
=> |oreal - o| = |(L2 - L1)/2| ≤ (L2 + L1)/2
• Thus the error is bounded by the round-trip time (RTT).
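The offset computation itself is a one-liner; a minimal sketch (function name ours):

```python
def ntp_offset(ts1, tr1, ts2, tr2):
    """NTP-style offset estimate o of the child relative to its parent,
    computed from the four timestamps of the two-message exchange.
    Exact when the two one-way latencies L1 and L2 are equal."""
    return (tr1 - tr2 + ts2 - ts1) / 2.0
```

With symmetric latencies (L1 = L2), substituting the two equations above gives o = oreal exactly; otherwise the error is (L2 - L1)/2.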
Berkeley algorithm: a master process polls the other (slave) processes for their clock values.
– Can use Cristian's algorithm to compensate for network latency.
When the results are in, the master computes the average, including its own time.
Hope: the average cancels out the individual clocks' tendencies to run fast or slow.
The master sends each slave the offset by which its clock needs adjustment.
• Sending an offset (rather than a timestamp) avoids problems due to network delays.
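One averaging round of the Berkeley algorithm can be sketched as below (names illustrative); the master returns per-clock offsets rather than timestamps, and latency compensation is omitted for brevity.

```python
def berkeley_round(master_time, slave_times):
    """One round of the Berkeley algorithm (sketch): average all clock
    readings (including the master's) and return, for each participant,
    the offset it should apply to converge on the average."""
    clocks = [master_time] + list(slave_times)
    avg = sum(clocks) / len(clocks)
    return [avg - c for c in clocks]   # [master_offset, slave1_offset, ...]
```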
DTP (Datacenter Time Protocol) uses the physical layer of network devices to implement a decentralized clock synchronization protocol.
Highly scalable, with bounded precision!
– ~25 ns (4 clock ticks) between peers
– ~150 ns for a datacenter with six hops
– No network traffic
– Internal clock synchronization
End-to-end: ~200 ns precision!
DTP: Phases
Measure the one-way delay between peers by sending an INIT message and receiving the associated INIT-ACK message, i.e., measure the RTT, then divide the measured RTT by two.
Since synchronization messages are exchanged frequently, hundreds of thousands of times a second (every few microseconds), the offset can be kept to a minimum.
DTP achieves bounded precision in hardware:
– Bounded by 4T (= 25.6 ns, where T = oscillator tick of 6.4 ns)
– Network precision bounded by 4TD, where D is the network diameter in hops
DTP requires NIC and switch modifications.
Can clocks ever be synchronized perfectly? No, not as long as message latencies are non-zero.
Can we avoid synchronizing clocks altogether, and still be able to order events?
What if we used timestamps that were not absolute time?
As long as those timestamps obey causality, that would work:
if an event A causally happens before another event B, then timestamp(A) < timestamp(B).
Example: humans use causality all the time.
• I enter the house only if I unlock it.
• You receive a letter only after I send it.
Logical (Lamport) ordering is a form of logical ordering of events.
Leslie B. Lamport (born February 7, 1941) is an American computer scientist. Lamport is best known for his seminal work in distributed systems and as the initial developer of the document preparation system LaTeX. He won the 2013 Turing Award for imposing clear, well-defined coherence on the seemingly chaotic behavior of distributed computing systems, in which several autonomous computers communicate with each other by passing messages.
Lamport's famous papers defined the notion of sequential consistency, and include "The Byzantine Generals' Problem", "Distributed Snapshots: Determining Global States of a Distributed System", and "The Part-Time Parliament".
These papers relate to such concepts as logical clocks (and the happened-before relationship) and Byzantine failures. They are among the most cited papers in the field of computer science and describe algorithms to solve many fundamental problems in distributed systems, including:
• the Paxos algorithm for consensus,
• the bakery algorithm for mutual exclusion of multiple threads in a computer system that require the same resources at the same time,
• the Chandy-Lamport algorithm for the determination of consistent global states (snapshots), and
• the Lamport signature, one of the prototypes of the digital signature.
Happens-Before relation (denoted →):
1. On the same process: a → b, if time(a) < time(b) (using the local clock)
2. If p1 sends m to p2: send(m) → receive(m)
3. (Transitivity) If a → b and b → c, then a → c
Creates a partial order among events; not all events are related to each other via →.
[Figure: timelines of P1 (events A, B, C, D, E), P2 (events E, F, G), P3 (events H, I, J); instructions/steps on each line, messages between processes]
Example 1: Happens-Before
[Figure: the same three-process timeline]
Happens-before relations in the figure include:
• H → G
• F → J
• H → J
• C → J
Lamport timestamps
Goal: assign a logical (Lamport) timestamp to each event.
Timestamps obey causality.
Rules:
• Each process uses a local counter (clock), which is an integer; the initial value of the counter is zero.
• A process increments its counter when a send or an instruction event happens at it. The counter value is assigned to the event as its timestamp.
• A send (message) event carries its timestamp.
• For a receive (message) event, the counter is updated to max(local clock, message timestamp) + 1.
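The rules above map directly onto a tiny class; this is a sketch (class and method names are ours, not from the lecture):

```python
class LamportClock:
    """Lamport logical clock for a single process (sketch)."""
    def __init__(self):
        self.time = 0

    def local_event(self):
        """Instruction event: tick and timestamp the event."""
        self.time += 1
        return self.time

    def send(self):
        """Send event: tick; the returned timestamp travels on the message."""
        return self.local_event()

    def receive(self, msg_ts):
        """Receive event: counter becomes max(local, message) + 1."""
        self.time = max(self.time, msg_ts) + 1
        return self.time
```

For instance, a send at one process gets timestamp 1, and its receipt at a process whose local clock is 0 gets max(0, 1) + 1 = 2.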
Lamport Timestamps: Worked Example
• Initially, all counters (clocks) at P1, P2, P3 are 0.
• P3's first event is a message send: its counter becomes 1 and the message carries ts = 1.
• When P2 receives that message (local clock 0), it sets its clock to max(0, 1) + 1 = 2.
• P1's events A and B get timestamps 1 and 2; B's message carries ts = 2, and on receipt P2 updates to max(2, 2) + 1 = 3.
• P2's later message carries ts = 4; on receipt P1 updates to max(3, 4) + 1 = 5.
• Final timestamps: P1 events A-E get 1, 2, 3, 5, 6; P2 events E, F, G get 2, 3, 4; P3 events H, I, J get 1, 2, 7.
Obeying Causality
[Figure: P1 events A-E with timestamps 1, 2, 3, 5, 6; P2 events E, F, G with 2, 3, 4; P3 events H, I, J with 1, 2, 7]
• A → B :: 1 < 2
• B → F :: 2 < 3
• A → F :: 1 < 3
Obeying Causality (2)
[Figure: the same timestamped timelines]
• H → G :: 1 < 4
• F → J :: 3 < 7
• H → J :: 1 < 7
• C → J :: 3 < 7
Not always implying Causality
[Figure: the same timestamped timelines]
• C → F? :: 3 = 3
• H → C? :: 1 < 3
• (C, F) and (H, C) are pairs of concurrent events
Concurrent Events
A pair of concurrent events doesn't have a causal path from one event to the other (either way, in the pair).
Lamport timestamps are not guaranteed to be ordered or unequal for concurrent events.
That's OK, since concurrent events are not causally related!
Remember:
E1 → E2 implies timestamp(E1) < timestamp(E2), BUT
timestamp(E1) < timestamp(E2) implies {E1 → E2} OR {E1 and E2 concurrent}
Vector Timestamps
Used in key-value stores like Riak.
Each process uses a vector of integer clocks.
Suppose there are N processes in the group 1…N.
Each vector has N elements.
Process i maintains vector Vi[1…N].
The jth element of the vector clock at process i, Vi[j], is i's knowledge of the latest events at process j.
Rules:
1. On an instruction or send event at process i, increment only i's own element: Vi[i] = Vi[i] + 1
2. Each message carries the send event's vector timestamp Vmessage[1…N]
3. On receiving a message at process i:
Vi[i] = Vi[i] + 1
Vi[j] = max(Vmessage[j], Vi[j]) for j ≠ i
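A direct sketch of these rules (names ours):

```python
class VectorClock:
    """Vector clock for process i in a group of n processes (sketch)."""
    def __init__(self, n, i):
        self.v = [0] * n
        self.i = i

    def local_event(self):
        """Rule 1: an instruction or send bumps only our own element."""
        self.v[self.i] += 1
        return list(self.v)            # copy: this is what a message carries

    def receive(self, msg_v):
        """Rule 3: bump our element, element-wise max for the others."""
        self.v[self.i] += 1
        for j, mv in enumerate(msg_v):
            if j != self.i:
                self.v[j] = max(self.v[j], mv)
        return list(self.v)
```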
Vector Timestamps: Worked Example
• Initially, all vectors at P1, P2, P3 are (0,0,0).
• P3's first event is a send: its vector becomes (0,0,1), and the message carries (0,0,1).
• When P2 receives that message, it increments its own element and takes the element-wise max with the message: (0,1,1).
• P1's events A and B give (1,0,0) and (2,0,0); B's message carries (2,0,0), and on receipt P2 moves to (2,2,1).
• Final vector timestamps: P1: (1,0,0), (2,0,0), (3,0,0), (4,3,1), (5,3,1); P2: (0,1,1), (2,2,1), (2,3,1); P3: (0,0,1), (0,0,2), (5,3,3).
Comparing vector timestamps:
VT1 ≤ VT2 iff VT1[i] ≤ VT2[i], for all i = 1, …, N.
Two events are causally related iff VT1 < VT2, i.e.,
iff VT1 ≤ VT2 and there exists j such that 1 ≤ j ≤ N and VT1[j] < VT2[j].
If neither VT1 ≤ VT2 nor VT2 ≤ VT1, the two events are concurrent; we'll denote this as VT1 ||| VT2.
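The three relations translate directly into code; a sketch using tuples as vector timestamps (function names ours):

```python
def vt_leq(a, b):
    """VT a <= VT b: every element of a is <= the matching element of b."""
    return all(x <= y for x, y in zip(a, b))

def causally_before(a, b):
    """VT a < VT b: a <= b and strictly smaller in at least one element."""
    return vt_leq(a, b) and any(x < y for x, y in zip(a, b))

def concurrent(a, b):
    """a ||| b: neither a <= b nor b <= a."""
    return not vt_leq(a, b) and not vt_leq(b, a)
```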
Obeying Causality
[Figure: P1 events A-E with vector timestamps (1,0,0) through (5,3,1); P2 events E, F, G with (0,1,1), (2,2,1), (2,3,1); P3 events H, I, J with (0,0,1), (0,0,2), (5,3,3)]
• A → B :: (1,0,0) < (2,0,0)
• B → F :: (2,0,0) < (2,2,1)
• A → F :: (1,0,0) < (2,2,1)
Obeying Causality (2)
[Figure: the same vector-timestamped timelines]
• H → G :: (0,0,1) < (2,3,1)
• F → J :: (2,2,1) < (5,3,3)
• H → J :: (0,0,1) < (5,3,3)
• C → J :: (3,0,0) < (5,3,3)
Identifying Concurrent Events
[Figure: the same vector-timestamped timelines]
• C and F are concurrent: (3,0,0) ||| (2,2,1)
• H and C are concurrent: (0,0,1) ||| (3,0,0)
Summary:
Lamport timestamps
• Integer clocks assigned to events
• Obey causality
• Cannot distinguish concurrent events
Vector timestamps
• Obey causality
• By using more space, can also identify concurrent events
Time synchronization:
• Cristian's algorithm
• Berkeley algorithm
• NTP
• DTP
• But the error is a function of the RTT
Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Cloud Computing and Distributed Systems: Global State and Snapshot
Preface
Content of this Lecture:
We will discuss the global state of a distributed system, consistent and inconsistent states, models of communication, and a snapshot algorithm, i.e., the Chandy-Lamport algorithm, to record a global snapshot.
But what does a "global snapshot" even mean?
In a distributed system, processes run on different machines and interact with each other.
The ability to obtain a "global photograph" or "global snapshot" of the system is important.
Some uses of having a global picture of the system:
• Checkpointing: can restart the distributed application on failure.
• Garbage collection of objects: objects at servers that don't have any other objects (at any servers) with pointers to them.
• Deadlock detection: useful in database transaction systems.
• Termination of computation: useful in batch computing systems.
In the absence of a globally shared clock and memory, message delays in a distributed system make this problem non-trivial.
This lecture first defines consistent global states and discusses the issues to be addressed to compute consistent distributed snapshots. Then an algorithm to determine such snapshots on-the-fly is presented.
System model: the system consists of a collection of processes connected by communication channels; processes communicate by passing messages through the channels.
Cij denotes the channel from process pi to process pj, and its state is denoted by SCij.
The actions performed by a process are modeled as three types of events: internal events, message send events, and message receive events.
For a message mij that is sent by process pi to process pj, let send(mij) and rec(mij) denote its send and receive events.
For an event e and a process state LSi, e ∈ LSi iff e belongs to the sequence of events that have taken process pi to state LSi; e ∉ LSi iff it does not.
For a channel Cij, the following set of messages can be defined based on the local states of the processes pi and pj:
Transit: transit(LSi, LSj) = {mij | send(mij) ∈ LSi ∧ rec(mij) ∉ LSj}
The global state GS of a distributed system is a collection of the local states of its processes and channels:
GS = {∪i LSi, ∪i,j SCij}
A global state GS is consistent iff it satisfies the following two conditions:
C1: send(mij) ∈ LSi ⇒ mij ∈ SCij ⊕ rec(mij) ∈ LSj (⊕ is the exclusive-or operator)
C2: send(mij) ∉ LSi ⇒ mij ∉ SCij ∧ rec(mij) ∉ LSj
In Figure 6.2, a global state GS1 consisting of local states {LS11, LS23, LS33, LS42} is inconsistent because the state of p2 has recorded the receipt of message m12, but the state of p1 has not recorded its send.
On the contrary, a global state GS2 consisting of local states {LS12, LS24, LS34, LS42} is consistent; all the channels are empty except C21, which contains message m21.
A global state is transitless iff all of its channels are recorded as empty.
A global state is strongly consistent iff it is transitless as well as consistent. Note that in Figure 6.2, the global state of local states {LS12, LS23, LS34, LS42} is strongly consistent.
Recording the global state of a distributed system is an important paradigm when one is interested in analyzing, monitoring, testing, or verifying properties of distributed applications, systems, and algorithms.
The design of efficient methods for recording the global state of a distributed system is an important problem.
[Figure 6.2: space-time diagram with processes P1-P4, events e11…, e21-e24, e31-e35, e41-e42, and messages m12 and m21]
GS1 = {LS11, LS23, LS33, LS42} is inconsistent.
GS2 = {LS12, LS24, LS34, LS42} is consistent.
GS3 = {LS12, LS23, LS34, LS42} is strongly consistent.
I1: How to distinguish between the messages to be recorded in the snapshot and those not to be recorded.
– Any message that is sent by a process before recording its snapshot must be recorded in the global snapshot (from C1).
– Any message that is sent by a process after recording its snapshot must not be recorded in the global snapshot (from C2).
I2: How to determine the instant when a process takes its snapshot.
– A process pj must record its snapshot before processing a message mij that was sent by process pi after recording its snapshot.
Example: money transfer between two bank accounts. Consider the following sequence of actions, which are also illustrated in the timing diagram of Figure 6.3:
Time t0: Initially, Account A = $600, Account B = $200, C12 = $0, C21 = $0.
Time t1: Site S1 initiates a transfer of $50 from Account A to Account B. Account A is decremented by $50 to $550, and a request for a $50 credit to Account B is sent on channel C12 to site S2. Account A = $550, Account B = $200, C12 = $50, C21 = $0.
Time t2: Site S2 initiates a transfer of $80 from Account B to Account A. Account B is decremented by $80 to $120, and a request for an $80 credit to Account A is sent on channel C21 to site S1. Account A = $550, Account B = $120, C12 = $50, C21 = $80.
Time t3: Site S1 receives the message for an $80 credit to Account A and updates Account A. Account A = $630, Account B = $120, C12 = $50, C21 = $0.
Time t4: Site S2 receives the message for a $50 credit to Account B and updates Account B. Account A = $630, Account B = $170, C12 = $0, C21 = $0.
[Figure 6.3: timing diagram of the banking example, showing the balances of A and B and channel states C12, C21 at times t0-t4]
If the states of the accounts and channels are recorded at different times, an incorrect total amount may appear in the system.
The reason for the inconsistency is that Account A's state was recorded before the $50 transfer to Account B using channel C12 was initiated, whereas channel C12's state was recorded after the $50 transfer was initiated.
This simple example shows that recording a consistent global state of a distributed system is not a trivial task. Recording activities of individual components must be coordinated appropriately.
Models of communication:
In the FIFO model, each channel acts as a first-in first-out message queue, and thus message ordering is preserved by a channel.
In the non-FIFO model, a channel acts like a set in which the sender process adds messages and the receiver process removes messages from it in a random order.
A system that supports causal delivery of messages satisfies the following property: "for any two messages mij and mkj, if send(mij) → send(mkj), then rec(mij) → rec(mkj)".
The Chandy-Lamport algorithm uses a control message, called a marker, to separate the messages in the channels.
After a site has recorded its snapshot, it sends a marker along all of its outgoing channels before sending out any more messages.
A marker separates the messages in the channel into those to be included in the snapshot and those not to be recorded in the snapshot.
A process must record its snapshot no later than when it receives a marker on any of its incoming channels.
The algorithm can be initiated by any process by executing the "Marker Sending Rule", by which it records its local state and sends a marker on each outgoing channel.
A process executes the "Marker Receiving Rule" on receiving a marker. If the process has not yet recorded its local state, it records the state of the channel on which the marker was received as empty and executes the "Marker Sending Rule" to record its local state.
The algorithm terminates after each process has received a marker on all of its incoming channels.
All the local snapshots get disseminated to all other processes, and all the processes can determine the global state.
Marker Sending Rule for process i:
1. Process i records its state.
2. For each outgoing channel C on which a marker has not been sent, i sends a marker along C before sending any further messages along C.
Marker Receiving Rule for process j:
On receiving a marker along channel C:
if j has not recorded its state then
  Record the state of C as the empty set
  Follow the "Marker Sending Rule"
else
  Record the state of C as the set of messages received along C after j's state was recorded and before j received the marker along C
Consider two possible executions of the snapshot algorithm (shown in Figure 6.4) for the previous money transfer example.
[Figure 6.4: timing diagram showing A, B, C12, and C21 over t0-t4, with markers]
Execution 1:
Let site S1 initiate the algorithm just after t1. Site S1 records its local state (Account A = $550) and sends a marker to site S2. The marker is received by site S2 after t4. When site S2 receives the marker, it records its local state (Account B = $170), records the state of channel C12 as $0, and sends a marker along channel C21. When site S1 receives this marker, it records the state of channel C21 as $80. The $800 in the system is conserved in the recorded global state:
A = $550, B = $170, C12 = $0, C21 = $80
Figure 6.4: Timing diagram of two possible executions of the banking example
Properties of the recorded global state
2. (Markers shown using green dotted arrows.)
Let site S1 initiate the algorithm just after t0 and before sending the $50 to S2. Site S1 records its local state (Account A = $600) and sends a marker to site S2. The marker is received by site S2 between t2 and t3. When site S2 receives the marker, it records its local state (Account B = $120), records the state of channel C12 as $0, and sends a marker along channel C21. When site S1 receives this marker, it records the state of channel C21 as $80. The $800 in the system is conserved in the recorded global state:
A = $600, B = $120, C12 = $0, C21 = $80
Properties of the recorded global state
In both these possible runs of the algorithm, the recorded global states never occurred in the actual execution.
This happens because a process can change its state asynchronously before the markers it sent are received by other sites and those sites record their states.
But the system could have passed through the recorded global states in some equivalent executions.
The recorded global state is a valid state in an equivalent execution, and if a stable property (i.e., a property that persists) holds in the system before the snapshot algorithm begins, it holds in the recorded global snapshot.
Therefore, a recorded global state is useful in detecting stable properties.
Conclusion:
This lecture first discussed a formal definition of the global state of a distributed system and issues related to its capture; then we discussed the Chandy-Lamport algorithm to record a snapshot of a distributed system.
Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Cloud Computing and Distributed Systems: Distributed Mutual Exclusion
Preface
Content of this Lecture:
We will discuss the problem of "Distributed Mutual Exclusion", classical algorithms for distributed computing systems, and industry systems for mutual exclusion.
Example: bank's cloud servers. You and your partner each deposit Rs. 10,000 into your joint account at the same time, from two different ATMs:
Both ATMs read the initial amount of Rs. 1,000 concurrently from the bank's cloud server.
Both ATMs add Rs. 10,000 to this amount (locally at the ATM).
Both write the final amount to the server.
What's wrong? The server ends up with Rs. 11,000 instead of Rs. 21,000: you lost Rs. 10,000!
The ATMs need mutually exclusive access to your account entry at the server, or mutually exclusive access to executing the code that modifies the account entry.
Mutual exclusion is needed for:
o Accessing a shared resource by at most one process at any point of time.
o Server coordination:
  o Work partitioned across servers.
  o Servers coordinate using locks.
In industry:
o Chubby is Google's locking service.
o Many cloud stacks use Apache ZooKeeper for coordination among servers.
Problem statement: a critical section is a piece of code for which we need to ensure that there is at most one process executing it at any point of time.
• Each process can call three functions:
o enter() to enter the critical section (CS)
o AccessResource() to run the critical section code
o exit() to exit the critical section
ATM1 (and, symmetrically, ATM2):
enter(S);
// AccessResource()
obtain bank amount;
add in deposit;
update bank amount;
// AccessResource() end
exit(S); // exit
Approaches to Solve Mutual Exclusion
• Single OS: if all processes are running in one OS on a machine (or VM), then semaphores, mutexes, condition variables, monitors, etc. can be used.
• Distributed system: processes communicate only by passing messages.
Need to guarantee three properties:
o Safety (essential): at most one process executes in the CS (critical section) at any time.
o Liveness (essential): every request for a CS is granted eventually.
o Fairness (desirable): requests are granted in the order in which they were made.
1. wait(S) (or P(S) or down(S)):
while(1) { // each execution of the while loop is atomic
if (S > 0) {
enter() S--;
}
}
break;
PT
Each while loop execution and S++ are each atomic operations – supported
N
via hardware instructions such as compare-and-swap, test-and-set, etc.
exit() 2. signal(S) (or V(S) or up(s)):
S++; // atomic
ATM1 (and, symmetrically, ATM2):
wait(S); // enter
// AccessResource()
obtain bank amount;
add in deposit;
update bank amount;
// AccessResource() end
signal(S); // exit
• But these OS mechanisms rely on shared memory within one machine.
• So how do we support mutual exclusion in a distributed system?
System model:
Each pair of processes is connected by reliable channels (such as TCP).
Messages are eventually delivered to the recipient, and in FIFO (First In First Out) order.
Processes do not fail.
• Fault-tolerant variants exist in the literature.
Central solution: elect a central master (or leader). The master keeps:
o A queue of waiting requests from processes that wish to access the CS
o A special token, which allows its holder to access the CS
Actions of any process in the group:
o enter(): send a request to the master; wait for the token from the master.
o exit(): send the token back to the master.
Master actions:
o On receiving a request from process Pi:
  if (master has the token)
    Send the token to Pi
  else
    Add Pi to the queue
o On receiving the token back from process Pi:
  if (queue is not empty)
    Dequeue the head of the queue (say Pj) and send that process the token
  else
    Retain the token
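The master's two message handlers can be sketched as below (class and method names are ours):

```python
from collections import deque

class Master:
    """Central-server mutual exclusion: the master's state machine (sketch)."""
    def __init__(self):
        self.has_token = True
        self.queue = deque()           # waiting requesters, FIFO

    def on_request(self, pid):
        """Returns the pid granted the token now, or None if queued."""
        if self.has_token:
            self.has_token = False
            return pid
        self.queue.append(pid)
        return None

    def on_token_return(self):
        """Token came back: hand it to the next waiter, or keep it."""
        if self.queue:
            return self.queue.popleft()
        self.has_token = True
        return None
```

Because grants come off a FIFO queue, requests are served in the order they reached the master.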
Analysis of the central algorithm:
o Safety: at most one process holds the token at any time, so safety is guaranteed.
o With N processes in the system, the queue has at most N processes.
o If each process exits the CS eventually and there are no failures, liveness is guaranteed.
o FIFO ordering is guaranteed, in the order of requests received at the master.
Analyzing performance; the metrics are:
Bandwidth: the total number of messages sent in each enter and exit operation.
Client delay: the delay incurred by a process at each enter and exit operation (when no other process is in, or waiting). (We will mostly care about the enter operation.)
Synchronization delay: the time interval between one process exiting the critical section and the next process entering it (when there is only one process waiting).
For the central algorithm:
o Bandwidth: 2 messages for enter (request + grant), 1 message for exit (token release).
o Client delay: 2 message latencies (request + grant), when no other process is in, or waiting for, the CS.
o Synchronization delay: 2 message latencies (release + grant), when there is only one process waiting.
Ring-based mutual exclusion: the N processes (e.g., N3, N5, N6, N12, N32, N80) are organized into a logical ring. A single token circulates around the ring; the process currently holding the token (e.g., N3) can access the CS, and afterwards passes the token ("Here's the token!") to its ring successor.
enter():
Wait until you get the token.
exit(): // you already have the token
Pass the token on to your ring successor.
If you receive the token and are not currently in enter(), just pass the token on to your ring successor.
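Token circulation around the ring can be sketched as a single step function (names illustrative):

```python
def ring_token_step(ring, holder, wants_cs):
    """One hop of the token in ring-based mutual exclusion (sketch).
    ring: process ids in ring order; holder: index of the token holder;
    wants_cs: set of ids currently blocked in enter().
    Returns (new_holder_index, id_entering_cs_or_None)."""
    pid = ring[holder]
    if pid in wants_cs:
        return holder, pid                      # holder enters the CS
    return (holder + 1) % len(ring), None       # pass token to successor
```

Repeatedly applying the step shows the token reaching any requester within at most N hops.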
Analysis of the ring-based algorithm:
• Safety: there is only one token, so at most one process is in the CS.
• Liveness: the token eventually loops around the ring and reaches the requesting process (assuming no failures).
• Bandwidth:
  • Per enter(), 1 message by the requesting process, but up to N messages throughout the system.
  • 1 message sent per exit().
• Client delay: 0 to N message transmissions.
  • Best case: the process already holds the token.
  • Worst case: it just sent the token to its neighbor.
• Synchronization delay between one process's exit() from the CS and the next process's enter():
  • Between 1 and (N-1) message transmissions.
  • Best case: the process in enter() is the successor of the process in exit().
  • Worst case: the process in enter() is the predecessor of the process in exit().
System model (recap):
Each pair of processes is connected by reliable channels.
Messages are eventually delivered to the recipient, and in FIFO (First In First Out) order.
Processes do not fail.
Lamport's algorithm: requests for the CS are executed in increasing order of timestamps, where time is determined by logical clocks.
Every site Si keeps a queue, request_queuei, which contains mutual exclusion requests ordered by their timestamps.
The algorithm requires communication channels to deliver messages in FIFO order. Three types of messages are used: REQUEST, REPLY, and RELEASE. These messages carry timestamps and also update the logical clocks.
Requesting the critical section: when a site Si wants to enter the CS, it broadcasts a REQUEST(tsi, i) message to all other sites and places the request on request_queuei.
When a site Sj receives the REQUEST(tsi, i) message from site Si, it places Si's request on request_queuej and returns a timestamped REPLY message to Si.
Executing the critical section: site Si enters the CS when the following two conditions hold:
L1: Si has received a message with timestamp larger than (tsi, i) from all other sites.
L2: Si's request is at the top of request_queuei.
Releasing the critical section: when site Si exits the CS, it removes its own request from the top of its request queue and broadcasts a timestamped RELEASE message to all other sites.
When a site Sj receives a RELEASE message from site Si, it removes Si's request from its request queue.
When a site removes a request from its request queue, its own request may come to the top of the queue, enabling it to enter the CS.
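Conditions L1 and L2 can be sketched as a predicate (simplified: the L1 comparison below uses timestamps only, ignoring site-id tie-breaking; names ours):

```python
def can_enter_cs(my_req, request_queue, latest_ts_from):
    """Lamport's entry test at a site (sketch).
    my_req: (ts, site_id) of this site's pending request;
    request_queue: list of (ts, site_id) sorted by timestamp;
    latest_ts_from: other site id -> timestamp of its latest message."""
    l1 = all(ts > my_req[0] for ts in latest_ts_from.values())
    l2 = bool(request_queue) and request_queue[0] == my_req
    return l1 and l2
```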
Correctness (mutual exclusion):
Proof: the proof is by contradiction. Suppose two sites Si and Sj are executing the CS concurrently. For this to happen, conditions L1 and L2 must hold at both sites concurrently.
This implies that at some instant in time, say t, both Si and Sj have their own requests at the top of their request_queues, and condition L1 holds at them.
Without loss of generality, assume that Si's request has a smaller timestamp than the request of Sj. By condition L1 and the FIFO property of the channels, Si's request must be present in request_queuej at instant t. But request_queuej is ordered by timestamp, so Sj's own request cannot be at its top while the smaller-timestamp request of Si is present: a contradiction!
Fairness:
Proof: the proof is by contradiction. Suppose a site Si's request has a smaller timestamp than the request of another site Sj, and yet Sj is able to execute the CS before Si.
For Sj to execute the CS, it has to satisfy conditions L1 and L2. This implies that at some instant in time, say t, Sj has its own request at the top of its queue, and it has also received a message with timestamp larger than the timestamp of its own request from all other sites.
But the request_queue at a site is ordered by timestamp, and according to our assumption Si has the lower timestamp. So Si's request must be placed ahead of Sj's request in request_queuej. This is a contradiction!
Example:
[Figure: sites S1, S2, S3 exchanging REQUEST, REPLY, and RELEASE messages. Requests (1,1) from S1 and (1,2) from S2 are placed in each request queue in timestamp order as (1,1), (1,2); S1, whose request has the smaller timestamp, enters the CS first, and after its RELEASE messages are processed, site S2 enters the CS.]
EL
RELEASE messages.
Optimization: In Lamport's algorithm, REPLY messages can be omitted in certain situations. For example, if site Sj receives a REQUEST message from site Si after it has sent its own REQUEST message with timestamp higher than the timestamp of site Si's request, then site Sj need not send a REPLY message to site Si.
This is because when site Si receives site Sj's request with timestamp higher than its own, it can conclude that site Sj does not have any smaller-timestamp request which is still pending.
With this optimization, Lamport's algorithm requires between 2(N − 1) and 3(N − 1) messages per CS execution.
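As a quick sanity check of the message counts, here is a hypothetical helper (`lamport_messages` is an invented name, not from the lecture) that tallies REQUEST, REPLY, and RELEASE messages per CS execution:

```python
def lamport_messages(n, optimized=False):
    """Messages per CS execution in Lamport's algorithm with n sites:
    (n-1) REQUESTs and (n-1) RELEASEs always; up to (n-1) REPLYs,
    all of which the optimization can suppress in the best case."""
    base = 2 * (n - 1)
    return base if optimized else base + (n - 1)

print(lamport_messages(5))                  # 12 -> 3(N - 1)
print(lamport_messages(5, optimized=True))  # 8  -> 2(N - 1), best case
```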
Ricart-Agrawala's algorithm:
No token
Uses the notion of causality and multicast
Has lower waiting time to enter the CS than the ring-based approach
• To request entry, multicast a request <T, Pi>, where T is the current Lamport timestamp at Pi
• Wait until all other processes have responded positively to the request
• Requests are granted in order of causality
• <T, Pi> is compared lexicographically: Pi in request <T, Pi> is used to break ties (since Lamport timestamps are not unique for concurrent events)
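The lexicographic comparison of <T, Pi> pairs maps directly onto tuple comparison in a language like Python. This is an illustrative aside, not part of the algorithm's specification:

```python
# Lamport timestamps can collide for concurrent events, so requests are
# ordered by the pair (T, Pi); Python compares tuples lexicographically,
# so the process id breaks ties between equal timestamps.
requests = [(3, "P2"), (3, "P1"), (2, "P3")]
print(sorted(requests))  # [(2, 'P3'), (3, 'P1'), (3, 'P2')]
```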
The algorithm uses two types of messages: REQUEST and REPLY. A process sends a REQUEST message to all other processes to request their permission to enter the critical section, and sends a REPLY message to a process to give its permission to that process.
Processes use Lamport-style logical clocks to assign a timestamp to critical section requests, and timestamps are used to decide the priority of requests.
Each process pi maintains the Request-Deferred array, RDi, whose size is the same as the number of processes in the system.
Initially, ∀i ∀j: RDi[j] = 0. Whenever pi defers the request sent by pj, it sets RDi[j] = 1, and after it has sent a REPLY message to pj, it sets RDi[j] = 0.
Requesting the critical section:
(a) When a site Si wants to enter the CS, it broadcasts a timestamped REQUEST message to all other sites.
(b) When site Sj receives a REQUEST message from site Si, it sends a REPLY message to site Si if site Sj is neither requesting nor executing the CS, or if site Sj is requesting and Si's request's timestamp is smaller than site Sj's own request's timestamp. Otherwise, the reply is deferred and Sj sets RDj[i] = 1.
Executing the critical section:
(c) Site Si enters the CS after it has received a REPLY message from every site it sent a REQUEST message to.
Notes:
When a site receives a message, it updates its clock using the timestamp in the message.
When a site takes up a request for the CS for processing, it updates its local clock and assigns a timestamp to the request.
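The request, reply/defer, and deferred-reply logic can be sketched as a site class. This is an illustrative sketch, not the lecture's code: the names (`RASite`, `rd`, `outbox`) are invented, and message delivery is left to a surrounding harness.

```python
class RASite:
    """One site in a sketch of the Ricart-Agrawala algorithm."""

    def __init__(self, site_id, all_sites):
        self.id = site_id
        self.others = [s for s in all_sites if s != site_id]
        self.clock = 0
        self.my_request = None                 # (ts, id) while requesting
        self.in_cs = False
        self.replies = set()
        self.rd = {s: 0 for s in all_sites}    # Request-Deferred array RD_i
        self.outbox = []                       # (destination, message) pairs

    def request_cs(self):
        self.clock += 1
        self.my_request = (self.clock, self.id)
        self.replies = set()
        for s in self.others:
            self.outbox.append((s, ("REQUEST", self.my_request)))

    def on_message(self, sender, msg):
        kind, payload = msg
        self.clock = max(self.clock, payload[0]) + 1
        if kind == "REQUEST":
            # Defer if we are in the CS, or we are requesting with a
            # smaller (i.e. higher-priority) timestamp than the sender.
            defer = self.in_cs or (self.my_request is not None
                                   and self.my_request < payload)
            if defer:
                self.rd[sender] = 1
            else:
                self.outbox.append((sender, ("REPLY", (self.clock, self.id))))
        elif kind == "REPLY":
            self.replies.add(sender)
            if self.replies == set(self.others):
                self.in_cs = True              # everyone gave permission

    def release_cs(self):
        self.in_cs = False
        self.my_request = None
        for s in self.others:                  # answer deferred requests
            if self.rd[s]:
                self.rd[s] = 0
                self.clock += 1
                self.outbox.append((s, ("REPLY", (self.clock, self.id))))
```

For two concurrent requests, the lower-timestamped site enters first and the other waits on its deferred REPLY.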
Proof:
Proof is by contradiction. Suppose two sites Si and Sj are executing the CS concurrently and Si's request has higher priority than the request of Sj. Clearly, Si received Sj's request after it made its own request.
Thus, Sj can concurrently execute the CS with Si only if Si returns a REPLY to Sj (in response to Sj's request) before Si exits the CS.
However, this is impossible because Sj's request has lower priority. Therefore, the Ricart-Agrawala algorithm achieves mutual exclusion.
[Figure: example run of the Ricart-Agrawala algorithm with sites S1, S2, S3. S1 requests with timestamp (1,1) and S2 with (1,2); S1's request has priority, so S2's request is deferred. When S1 exits the CS, it sends a REPLY to S2's deferred request, allowing S2 to enter.]
Performance: For each CS execution, the Ricart-Agrawala algorithm requires (N − 1) REQUEST messages and (N − 1) REPLY messages. Thus, it requires 2(N − 1) messages per CS execution.
Synchronization delay in the algorithm is T (one message propagation time).
Client delay has come down to O(1), but bandwidth has gone up to O(N) messages per CS operation.
Can we get both down?
Maekawa's algorithm:
The intersection property of quorums makes sure that only one request executes the CS at any time.
1. A site does not request permission from all other sites, but only from a subset of the sites.
The request sets of the sites are chosen such that ∀i ∀j : 1 ≤ i, j ≤ N :: Ri ∩ Rj ≠ ∅.
Consequently, every pair of sites has a site which mediates conflicts between that pair.
2. A site can send out only one REPLY message at any time.
A site can send a REPLY message only after it has received a RELEASE message for the previous REPLY message.
A coterie C is a set of sets, where each set g ∈ C is called a quorum. The following properties hold for quorums in a coterie:
Intersection property: For every pair of quorums g, h ∈ C, g ∩ h ≠ ∅.
For example, sets {1,2,3}, {2,5,7} and {5,7,9} cannot all be quorums in a coterie, because the first and third sets do not have a common element.
Minimality property: There should be no quorums g, h in coterie C such that g ⊇ h, i.e., g is a superset of h.
For example, sets {1,2,3} and {1,3} cannot both be quorums in a coterie, because the first set is a superset of the second.
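Both properties are easy to check mechanically. The helper below (`is_coterie` is a hypothetical name, not part of the algorithm) verifies the intersection and minimality properties on the examples from the text:

```python
from itertools import combinations

def is_coterie(quorums):
    """Check the intersection and minimality properties of a candidate
    coterie. Illustrative helper, not part of any message protocol."""
    qs = [set(q) for q in quorums]
    # Intersection: every pair of quorums shares at least one site.
    intersection = all(g & h for g, h in combinations(qs, 2))
    # Minimality: no quorum is a proper superset of another.
    minimality = not any(g > h or h > g for g, h in combinations(qs, 2))
    return intersection and minimality

print(is_coterie([{1, 2, 3}, {2, 5, 7}, {5, 7, 9}]))  # False: {1,2,3} and {5,7,9} are disjoint
print(is_coterie([{1, 2}, {2, 3}, {1, 3}]))           # True
```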
The request sets for sites (i.e., quorums) in Maekawa's algorithm are constructed to satisfy the following conditions:
M1: (∀i ∀j : i ≠ j, 1 ≤ i, j ≤ N :: Ri ∩ Rj ≠ ∅)
M2: (∀i : 1 ≤ i ≤ N :: Si ∈ Ri)
M3: (∀i : 1 ≤ i ≤ N :: |Ri| = K)
M4: Any site Sj is contained in exactly K of the Ri's, 1 ≤ i, j ≤ N.
Maekawa used the theory of projective planes and showed that N = K(K − 1) + 1. This relation gives |Ri| = K ≈ √N.
Condition M3 states that the sizes of the request sets of all sites must be equal, implying that all sites have to do an equal amount of work to invoke mutual exclusion.
Condition M4 enforces that exactly the same number of sites should request permission from any site, which implies that all sites have "equal responsibility" in granting permission to other sites.
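Maekawa's projective-plane construction is intricate. A simpler construction that is often used to illustrate the intersection conditions (and is an assumption of this sketch, not the lecture's construction) arranges the N sites in a √N × √N grid and takes each site's full row plus full column as its request set. This gives |Ri| = 2√N − 1 rather than the optimal ≈ √N, so it satisfies M1 and M2 with uniform quorum sizes, but not the relation N = K(K − 1) + 1:

```python
import math

def grid_quorum(i, n):
    """Request set R_i for site i (0-based) in a k x k grid, n = k*k:
    the site's whole row plus whole column. Not Maekawa's optimal
    projective-plane construction, but any two such sets intersect."""
    k = math.isqrt(n)
    assert k * k == n, "grid construction needs a perfect-square n"
    row, col = divmod(i, k)
    return {row * k + c for c in range(k)} | {r * k + col for r in range(k)}

n = 9
quorums = [grid_quorum(i, n) for i in range(n)]
# M1: every pair of request sets intersects; M2: each site is in its own set.
assert all(quorums[i] & quorums[j] for i in range(n) for j in range(n))
assert all(i in quorums[i] for i in range(n))
print(sorted(quorums[4]))  # site 4 (centre): row {3,4,5} plus column {1,4,7}
```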
Requesting the critical section
(a) A site Si requests access to the CS by sending REQUEST(i) messages to all sites in its request set Ri.
(b) When a site Sj receives the REQUEST(i) message, it sends a REPLY(j) message to Si provided it hasn't sent a REPLY message to any site since its receipt of the last RELEASE message. Otherwise, it queues up the REQUEST(i) for later consideration.
Executing the critical section
(c) Site Si executes the CS only after it has received a REPLY message from every site in Ri.
Releasing the critical section
(d) After executing the CS, site Si sends a RELEASE(i) message to every site in Ri.
(e) When a site Sj receives a RELEASE(i) message from site Si, it sends a REPLY message to the next site waiting in the queue and deletes that entry from the queue.
If the queue is empty, then the site updates its state to reflect that it has not sent out any REPLY message since the receipt of the last RELEASE message.
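The "one outstanding REPLY at a time" rule of steps (b) and (e) amounts to a small lock manager at each site. Here is a sketch of just that role, with invented names (`MaekawaSite`, `locked_for`) and without the FAILED/INQUIRE/YIELD deadlock handling that the lecture discusses for prioritized requests:

```python
class MaekawaSite:
    """Sketch of the lock-manager role of one site in Maekawa's
    algorithm (basic version: requests are queued FIFO for brevity)."""

    def __init__(self, site_id):
        self.id = site_id
        self.locked_for = None   # site currently holding our REPLY
        self.waiting = []        # deferred REQUESTs
        self.outbox = []         # (destination, message) pairs

    def on_request(self, sender):
        if self.locked_for is None:
            # No REPLY outstanding since the last RELEASE: grant it.
            self.locked_for = sender
            self.outbox.append((sender, "REPLY"))
        else:
            self.waiting.append(sender)   # defer: only one REPLY at a time

    def on_release(self, sender):
        assert sender == self.locked_for
        if self.waiting:
            # Pass the single permission to the next waiting site.
            self.locked_for = self.waiting.pop(0)
            self.outbox.append((self.locked_for, "REPLY"))
        else:
            self.locked_for = None        # no REPLY outstanding any more
```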
Proof is by contradiction. Suppose two sites Si and Sj are concurrently executing the CS.
This means site Si received a REPLY message from all sites in Ri, and concurrently site Sj was able to receive a REPLY message from all sites in Rj.
If Ri ∩ Rj = {Sk}, then site Sk must have sent REPLY messages to both Si and Sj concurrently, which is a contradiction, since a site may have only one REPLY outstanding at a time.
Performance: Since the size of each request set is √N, an execution of the CS requires √N REQUEST, √N REPLY, and √N RELEASE messages, i.e., 3√N messages per CS execution.
Maekawa's algorithm can deadlock, because a site locks exclusively for one request at a time and requests are not prioritized by their timestamps. For example:
Suppose Ri ∩ Rj = {Sij}, Rj ∩ Rk = {Sjk}, and Rk ∩ Ri = {Ski}.
1. Sij has been locked by Si (forcing Sj to wait at Sij).
2. Sjk has been locked by Sj (forcing Sk to wait at Sjk).
3. Ski has been locked by Sk (forcing Si to wait at Ski).
Si, Sj, and Sk now wait on each other in a cycle, and none can proceed.
Maekawa's algorithm handles deadlocks by requiring a site to yield a lock if the timestamp of its request is larger than the timestamp of some other request waiting for the same lock.
A site suspects a deadlock (and initiates message exchanges to resolve it) whenever a higher-priority request arrives and waits at a site because the site has sent a REPLY message to a lower-priority request.
Deadlock handling requires three additional types of messages:
FAILED: A FAILED message from site Si to site Sj indicates that Si cannot grant Sj's request because it has currently granted permission to a site with a higher-priority request.
INQUIRE: An INQUIRE message from Si to Sj indicates that Si would like to find out from Sj if it has succeeded in locking all the sites in its request set.
YIELD: A YIELD message from site Si to Sj indicates that Si is returning the permission to Sj (to yield to a higher-priority request at Sj).
In response to an INQUIRE(j) message from site Sj, site Sk sends a YIELD(k) message to Sj provided Sk has received a FAILED message from a site in its request set, or if it has sent a YIELD to any of these sites but has not received a new REPLY from it.
[Figure: deadlock-resolution message flow among sites Si, Sj, Sk. Si sends REQUEST(ts, i) and receives FAILED(j). Sj sends INQUIRE(j) to Sk, which answers with YIELD(k). Sj then assumes the lock has been released by Sk and sends REPLY(j) to the top request site, i.e., Si.]
Chubby provides advisory locks only.
It doesn't guarantee mutual exclusion unless every client checks the lock before accessing the resource.
Reference: http://research.google.com/archive/chubby.html
[Figure: a Chubby cell of five replicas, the Master plus Servers B, C, D, and E.]
All servers replicate the same information.
Clients send read requests to the Master, which serves them locally.
Clients send write requests to the Master, which sends them to all servers, gets a majority (quorum) among the servers, and then responds to the client.
On master failure, run an election protocol.
On replica failure, just replace it and have it catch up.
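The read and write paths above can be modelled with a toy replicated cell. This is purely illustrative: the names (`ReplicatedCell`, `acks`) are invented, and real Chubby runs Paxos underneath rather than this simplistic ack-counting.

```python
class ReplicatedCell:
    """Toy model of the master/replica paths described above:
    quorum writes, reads served locally by the master."""

    def __init__(self, n_replicas=5):
        self.replicas = [dict() for _ in range(n_replicas)]
        self.master = 0   # index of the current master replica

    def write(self, key, value, acks):
        # The master forwards the write to all replicas; `acks` simulates
        # which replicas respond. Commit only on a majority quorum.
        majority = len(self.replicas) // 2 + 1
        if len(acks) < majority:
            return False
        for i in acks:
            self.replicas[i][key] = value
        return True

    def read(self, key):
        # Reads are served locally by the master.
        return self.replicas[self.master].get(key)

cell = ReplicatedCell()
assert cell.write("lock/owner", "clientA", acks={0, 1, 2})   # 3 of 5: commit
assert not cell.write("lock/owner", "clientB", acks={3, 4})  # 2 of 5: reject
print(cell.read("lock/owner"))  # clientA
```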
Conclusion
Mutual exclusion is an important problem in cloud computing systems.
Classical algorithms:
Central
Ring-based
Ricart-Agrawala
Maekawa
Lamport's algorithm
Industry systems:
Chubby: a coordination service
Similarly, Apache ZooKeeper for coordination