Distributed Algorithms
Wan Fokkink
Distributed Algorithms: An Intuitive Approach
MIT Press, 2013 (revised 1st edition, 2015)
1 / 329
Algorithms
A skilled programmer must have good insight into algorithms.
At bachelor level you were offered courses on basic algorithms:
searching, sorting, pattern recognition, graph problems, ...
You learned how to detect such subproblems within your programs,
and solve them effectively.
You’re trained in algorithmic thought for uniprocessor programs
(e.g. divide-and-conquer, greedy, memoization).
2 / 329
Distributed systems
A distributed system is an interconnected collection of
autonomous processes.
Motivation:
- information exchange
- resource sharing
- parallelization to increase performance
- replication to increase reliability
- multicore programming
3 / 329
Distributed versus uniprocessor
Distributed systems differ from uniprocessor systems in three aspects.
- Lack of knowledge of the global state: A process has no up-to-date knowledge of the local states of other processes.
  Example: termination and deadlock detection become an issue.
- Lack of a global time frame: There is no total order on events by their temporal occurrence.
  Example: mutual exclusion becomes an issue.
- Nondeterminism: Execution by processes is nondeterministic, so running a system twice can give different results.
  Example: race conditions.
4 / 329
Aim of this course
This course offers a bird’s-eye view on a wide range of algorithms
for basic and important challenges in distributed systems.
It aims to provide you with an algorithmic frame of mind for
solving fundamental problems in distributed computing.
- Handwaving correctness arguments.
- Back-of-the-envelope complexity calculations.
- Carefully developed exercises to acquaint you with intricacies of distributed algorithms.
5 / 329
Message passing
The two main paradigms to capture communication in
a distributed system are message passing and shared memory.
We’ll only consider message passing.
(The course Concurrency & Multithreading is dedicated to shared memory.)
Asynchronous communication means that sending and receiving
of a message are independent events.
In case of synchronous communication, sending and receiving
of a message are coordinated to form a single event; a message
is only allowed to be sent if its destination is ready to receive it.
We’ll mainly consider asynchronous communication.
6 / 329
Communication protocols
In a computer network, messages are transported through a medium,
which may lose, duplicate or garble these messages.
A communication protocol detects and corrects such flaws during
message passing.
Example: Sliding window protocols.
[Diagram: sender S and receiver R, each with a sliding window over sequence numbers 0–7.]
7 / 329
Assumptions
Unless stated otherwise, we assume:
- a strongly connected network
- each process knows only its neighbors
- message passing communication
- asynchronous communication
- channels are non-FIFO
- the delay of messages in channels is arbitrary but finite
- channels don't lose, duplicate or garble messages
- processes don't crash
- processes have unique id's
8 / 329
Directed versus undirected channels
Channels can be directed or undirected.
Question: What is more general, an algorithm for a directed
or for an undirected network ?
Remarks:
- Algorithms for undirected channels often include ack's.
- Acyclic networks must always be undirected (else the network wouldn't be strongly connected).
9 / 329
Complexity measures
Resource consumption of an execution of a distributed algorithm
can be considered in several ways.
Message complexity: Total number of messages exchanged.
Bit complexity: Total number of bits exchanged.
(Only interesting when messages can be very long.)
Time complexity: Amount of time consumed.
(We assume: (1) event processing takes no time, and
(2) a message is received at most one time unit after it is sent.)
Space complexity: Amount of memory needed for the processes.
Different executions require different consumption of resources.
We consider worst- and average-case complexity (the latter with
a probability distribution over all executions).
10 / 329
Big O notation
Complexity measures state how resource consumption
(messages, time, space) grows in relation to input size.
For example, if an algorithm has a worst-case message complexity of O(n²), then for an input of size n, the algorithm in the worst case takes in the order of n² messages.
Let f, g : N → R_{>0}.
f = O(g) if, for some C > 0, f(n) ≤ C·g(n) for all n ∈ N.
f = Θ(g) if f = O(g) and g = O(f).
11 / 329
Formal framework
Now follows a formal framework for describing distributed systems,
mainly to fix terminology.
In this course, correctness proofs and complexity estimations of
distributed algorithms are presented in an informal fashion.
(The course Protocol Validation treats algorithms and tools to prove correctness
of distributed algorithms and network protocols.)
12 / 329
Transition systems
The (global) state of a distributed system is called a configuration.
The configuration evolves in discrete steps, called transitions.
A transition system consists of:
- a set C of configurations;
- a binary transition relation → on C; and
- a set I ⊆ C of initial configurations.
γ ∈ C is terminal if γ → δ for no δ ∈ C.
13 / 329
Executions
An execution is a sequence γ0 γ1 γ2 · · · of configurations that
either is infinite or ends in a terminal configuration, such that:
- γ0 ∈ I, and
- γi → γi+1 for all i ≥ 0 (excluding, for finite executions, the terminal γi at the end).
A configuration δ is reachable if there is a γ0 ∈ I and
a sequence γ0 γ1 γ2 · · · γk = δ with γi → γi+1 for all 0 ≤ i < k.
14 / 329
States and events
A configuration of a distributed system is composed from
the states at its processes, and the messages in its channels.
A transition is associated to an event (or, in case of synchronous
communication, two events) at one (or two) of its processes.
A process can perform internal, send and receive events.
A process is an initiator if its first event is an internal or send event.
An algorithm is centralized if there is exactly one initiator.
A decentralized algorithm can have multiple initiators.
15 / 329
Assertions
An assertion is a predicate on the configurations of an algorithm.
An assertion is a safety property if it is true in each configuration
of each execution of the algorithm.
“something bad will never happen”
An assertion is a liveness property if it is true in some configuration
of each execution of the algorithm.
“something good will eventually happen”
16 / 329
Invariants
Assertion P on configurations is an invariant if:
- P(γ) for all γ ∈ I, and
- if γ → δ and P(γ), then P(δ).
Each invariant is a safety property.
Question: Give a transition system S and an assertion P
such that P is a safety property but not an invariant for S.
17 / 329
Causal order
In each configuration of an asynchronous system, applicable events
at different processes are independent.
The causal order ≺ on occurrences of events in an execution is
the smallest transitive relation such that:
- if a and b are events at the same process and a occurs before b, then a ≺ b; and
- if a is a send and b the corresponding receive event, then a ≺ b.
This relation is irreflexive.
a ⪯ b denotes a ≺ b ∨ a = b.
18 / 329
Computations
If neither a ⪯ b nor b ⪯ a, then a and b are called concurrent.
A permutation of concurrent events in an execution doesn’t affect
the result of the execution.
These permutations together form a computation.
All executions of a computation start in the same initial configuration,
and if they are finite, they all end in the same terminal configuration.
19 / 329
Question
Consider the finite execution abc.
Let a ≺ b be the only causal relationship.
Which executions are in the same computation ?
20 / 329
Lamport’s clock
A logical clock C maps occurrences of events in a computation
to a partially ordered set such that a ≺ b ⇒ C (a) < C (b).
Lamport's clock LC assigns to each event a the length k of a longest causality chain a_1 ≺ · · · ≺ a_k = a.
LC can be computed at run-time:
Let a be an event, and k the clock value of the previous event at the same process (k = 0 if there is no such previous event).
∗ If a is an internal or send event, then LC(a) = k + 1.
∗ If a is a receive event, and b the send event corresponding to a, then LC(a) = max{k, LC(b)} + 1.
21 / 329
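As a run-time illustration (not from the slides), here is a minimal Python sketch of Lamport's clock; the class and method names are invented for this sketch.

```python
# Minimal sketch of Lamport's clock; names are illustrative assumptions.

class LamportProcess:
    def __init__(self):
        self.clock = 0  # clock value of the previous event (0 if none)

    def internal_or_send(self):
        # LC(a) = k + 1 for an internal or send event a;
        # a send event piggybacks the returned value on its message
        self.clock += 1
        return self.clock

    def receive(self, sender_clock):
        # LC(a) = max{k, LC(b)} + 1, with b the corresponding send event
        self.clock = max(self.clock, sender_clock) + 1
        return self.clock
```

For instance, a receive at a process with clock value 2, of a message sent at clock value 7, yields clock value 8 (as for r3 in the question below).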
Question
Consider the following sequences of events at processes p0 , p1 , p2 :
p0 : a s1 r3 b
p1 : c r2 s3
p2 : r1 d s2 e
si and ri are corresponding send and receive events, for i = 1, 2, 3.
Provide all events with Lamport’s clock values.
Answer: p0 : 1 2 8 9
p1 : 1 6 7
p2 : 3 4 5 6
22 / 329
Vector clock
Given processes p_0, . . . , p_{N−1}.
We define a partial order on N^N by:
(k_0, . . . , k_{N−1}) ≤ (ℓ_0, . . . , ℓ_{N−1}) ⇔ k_i ≤ ℓ_i for all i = 0, . . . , N−1.
Vector clock VC maps each event in a computation to a unique value in N^N such that a ≺ b ⇔ VC(a) < VC(b).
VC(a) = (k_0, . . . , k_{N−1}) where each k_i is the length of a longest causality chain a^i_1 ≺ · · · ≺ a^i_{k_i} of events at process p_i with a^i_{k_i} ⪯ a.
VC can also be computed at run-time.
23 / 329
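A corresponding run-time sketch for the vector clock (again with invented names, and a fixed N for simplicity):

```python
# Minimal sketch of a vector clock for processes p_0 .. p_{N-1};
# the class name and the fixed N are illustrative assumptions.

N = 3

class VectorProcess:
    def __init__(self, i):
        self.i = i            # index of this process
        self.vc = [0] * N     # vector clock of the previous event

    def internal_or_send(self):
        self.vc[self.i] += 1  # one more event at p_i
        return list(self.vc)  # a send piggybacks a copy of the vector

    def receive(self, sender_vc):
        # componentwise maximum, then count the receive event itself
        self.vc = [max(k, l) for k, l in zip(self.vc, sender_vc)]
        self.vc[self.i] += 1
        return list(self.vc)
```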
Question
Consider the same sequences of events at processes p0 , p1 , p2 :
p0 : a s1 r3 b
p1 : c r2 s3
p2 : r1 d s2 e
Provide all events with vector clock values.
Answer: p0 : (1 0 0) (2 0 0) (3 3 3) (4 3 3)
p1 : (0 1 0) (2 2 3) (2 3 3)
p2 : (2 0 1) (2 0 2) (2 0 3) (2 0 4)
24 / 329
Vector clock - Correctness
Let a ≺ b.
Any causality chain for a is also one for b. So VC(a) ≤ VC(b).
At the process where b occurs, there is a longer causality chain for b than for a. So VC(a) < VC(b).
Let VC(a) < VC(b).
Consider the longest causality chain a^i_1 ≺ · · · ≺ a^i_k = a of events at the process p_i where a occurs.
VC(a) < VC(b) implies that the i-th coefficient of VC(b) is ≥ k, so there is a causality chain of at least k events at p_i whose last event is ⪯ b; its k-th event is a.
So a ⪯ b.
Since a and b are distinct, a ≺ b.
25 / 329
Snapshots
A snapshot of an execution of a distributed algorithm should return
a configuration of an execution in the same computation.
Snapshots can be used for:
- Restarting after a failure.
- Off-line determination of stable properties, which remain true as soon as they have become true.
  Examples: deadlock, garbage.
- Debugging.
Challenge: Take a snapshot without freezing the execution.
26 / 329
Snapshots
We distinguish basic messages of the underlying distributed algorithm
and control messages of the snapshot algorithm.
A snapshot of a (basic) execution consists of:
- a local snapshot of the (basic) state of each process, and
- the channel state of (basic) messages in transit for each channel.
A snapshot is meaningful if it is a configuration of an execution
in the same computation as the actual execution.
27 / 329
Snapshots
We need to avoid the following situations.
1. Process p takes a local snapshot, and then sends a message m
to process q, where:
• q takes a local snapshot after the receipt of m,
• or m is included in the channel state of pq.
2. p sends m to q, and then takes a local snapshot, where:
• q takes a local snapshot before the receipt of m,
• and m is not included in the channel state of pq.
28 / 329
Chandy-Lamport algorithm
Consider a directed network with FIFO channels.
Initiators take a local snapshot of their state, and send a control message ⟨marker⟩ to their neighbors.
When a process that hasn't yet taken a snapshot receives ⟨marker⟩, it
- takes a local snapshot of its state, and
- sends ⟨marker⟩ to all its neighbors.
Process q computes as channel state of pq the messages it receives via pq after taking its local snapshot and before receiving ⟨marker⟩ from p.
If channels are FIFO, this produces a meaningful snapshot.
Message complexity: Θ(E)
Worst-case time complexity: O(D)
29 / 329
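The marker handling at a single process can be sketched as follows; the class, the send primitive, and the message names are assumptions of this sketch, not part of the algorithm's original presentation.

```python
# Sketch of Chandy-Lamport marker handling at one process (FIFO channels).

def send(channel, msg):  # network primitive, stubbed for this sketch
    pass

class SnapshotProcess:
    def __init__(self, state, in_channels, out_channels):
        self.state = state
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.recorded_state = None
        self.channel_state = {}  # incoming channel -> recorded messages
        self.recording = set()   # channels whose state is still being recorded

    def take_snapshot(self):     # called spontaneously at an initiator
        self.recorded_state = self.state
        self.recording = set(self.in_channels)
        for c in self.out_channels:
            send(c, "marker")    # <marker> to all neighbors

    def on_marker(self, channel):
        if self.recorded_state is None:
            self.take_snapshot()  # first marker: take a local snapshot
        self.channel_state.setdefault(channel, [])
        self.recording.discard(channel)  # the state of this channel is complete

    def on_basic(self, channel, msg):
        if channel in self.recording:    # after the snapshot, before <marker>
            self.channel_state.setdefault(channel, []).append(msg)
        # ... followed by the normal processing of msg by the basic algorithm
```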
Chandy-Lamport algorithm - Example
[Diagrams: three processes exchange basic messages m1, m2 and ⟨marker⟩ messages; each process takes a local snapshot at its first ⟨marker⟩, and records per incoming channel the basic messages received before the ⟨marker⟩ on that channel. The computed channel states are ∅, ∅, ∅ and {m2}.]
The snapshot (with channel states ∅, ∅, ∅, {m2}) isn't a configuration in the actual execution.
The send of m1 isn't causally before the send of m2.
So the snapshot is a configuration of an execution that is in the same computation as the actual execution.
30 / 329
Chandy-Lamport algorithm - Correctness
Claim: If a post-snapshot event e is causally before an event f ,
then f is also post-snapshot.
This implies that the snapshot is a configuration of an execution
that is in the same computation as the actual execution.
Proof : The case that e and f occur at the same process is trivial.
Let e be a send and f the corresponding receive event.
Let e occur at p and f at q.
e is post-snapshot at p, so p sent ⟨marker⟩ to q before e.
Channels are FIFO, so q receives this ⟨marker⟩ before f .
Hence f is post-snapshot at q.
31 / 329
Lai-Yang algorithm
Suppose channels are non-FIFO. We use piggybacking.
Initiators take a local snapshot of their state.
When a process has taken its local snapshot, it appends true
to each outgoing basic message.
When a process that hasn’t yet taken a snapshot receives a message
with true or a control message (see next slide) for the first time,
it takes a local snapshot of its state before reception of this message.
Process q computes as channel state of pq the basic messages
without the tag true that it receives via pq after its local snapshot.
32 / 329
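A sketch of the piggybacking at one process; the control messages of the next slide are omitted here, and all names and the send primitive are assumptions of the sketch.

```python
# Sketch of Lai-Yang tagging at one process (non-FIFO channels).

def send(channel, msg):  # network primitive, stubbed for this sketch
    pass

class LaiYangProcess:
    def __init__(self, state, in_channels):
        self.state = state
        self.taken = False  # has the local snapshot been taken?
        self.channel_state = {c: [] for c in in_channels}

    def send_basic(self, channel, msg):
        send(channel, (msg, self.taken))  # append true after the snapshot

    def take_snapshot(self):
        self.recorded_state = self.state
        self.taken = True

    def on_basic(self, channel, msg, tag):
        if tag and not self.taken:
            self.take_snapshot()          # snapshot before this reception
        if self.taken and not tag:
            self.channel_state[channel].append(msg)  # msg was in transit
        # ... followed by the normal processing of msg by the basic algorithm
```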
Lai-Yang algorithm - Control messages
Question: How does q know when it can determine the channel state
of pq ?
p sends a control message to q, informing q how many basic messages
without the tag true p sent into pq.
These control messages also ensure that all processes eventually take
a local snapshot.
33 / 329
Lai-Yang algorithm - Multiple snapshots
Question: How can multiple subsequent snapshots be supported ?
Answer: Each snapshot is provided with a sequence number.
Basic messages carry the sequence number of the last snapshot at the sender (instead of true or false).
Control messages carry the sequence number of their snapshot.
34 / 329
What we need from last lecture
fully asynchronous message passing framework
channels are non-FIFO, and can be directed or undirected
configurations and transitions at the global level
states and events (internal/send/receive) at the local level
(non)initiator
(de)centralized algorithm
causal order ≺ on events in an execution
computation of executions, by reordering concurrent events
snapshot algorithm to compute a configuration of a computation
basic/control algorithm
35 / 329
Wave algorithms
A decide event is a special internal event.
In a wave algorithm, each computation (also called wave)
satisfies the following properties:
- termination: it is finite;
- decision: it contains one or more decide events; and
- dependence: for each decide event e and process p, f ≺ e for an event f at p.
36 / 329
Wave algorithms - Example
In the ring algorithm, the initiator sends a token, which is passed on
by all other processes.
The initiator decides after the token has returned.
Question: For each process, which event is causally before
the decide event ?
The ring algorithm is an example of a traversal algorithm.
37 / 329
Traversal algorithms
A traversal algorithm is a centralized wave algorithm;
i.e., there is one initiator, which sends around a token.
- In each computation, the token first visits all processes.
- Finally, the token returns to the initiator, which performs a decide event.
Traversal algorithms build a spanning tree:
- the initiator is the root; and
- each noninitiator has as parent the neighbor from which it received the token first.
38 / 329
Tarry’s algorithm (from 1895)
Consider an undirected network.
R1 A process never forwards the token through the same channel
twice.
R2 A process only forwards the token to its parent when there is
no other option.
The token travels through each channel both ways, and finally
ends up at the initiator.
Message complexity: 2E messages
Time complexity: ≤ 2E time units
[Photo: Gaston Tarry]
39 / 329
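Rules R1 and R2 at a single process can be sketched as follows; the data structures and the send primitive are assumptions of the sketch, and the non-parent channel is picked arbitrarily, as the algorithm allows.

```python
# Sketch of one process's token handling in Tarry's algorithm.

def send(channel, msg):  # network primitive, stubbed for this sketch
    pass

class TarryProcess:
    def __init__(self, channels, is_initiator):
        self.unused = set(channels)  # R1: never use a channel twice
        self.parent = None
        self.is_initiator = is_initiator

    def on_token(self, from_channel=None):
        if self.parent is None and not self.is_initiator:
            self.parent = from_channel  # first receipt fixes the parent
        options = self.unused - {self.parent}
        if options:
            c = options.pop()           # any unused non-parent channel
        elif self.parent in self.unused:
            c = self.parent             # R2: the parent only as last resort
        else:
            return                      # initiator: traversal complete, decide
        self.unused.discard(c)
        send(c, "token")
```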
Tarry’s algorithm - Example
p is the initiator.
[Diagram: the token's walk through the network from p, with arrows numbered 1–12 giving the order in which channels are traversed.]
The network is undirected and unweighted.
Arrows and numbers mark the path of the token.
Solid arrows establish a parent-child relation (in the opposite direction).
40 / 329
Tarry’s algorithm - Spanning tree
The parent-child relation is the reversal of the solid arrows.
p q r
s t
Tree edges, which are part of the spanning tree, are solid.
Frond edges, which aren’t part of the spanning tree, are dashed.
41 / 329
Tarry’s algorithm - Correctness
Claim: The token θ travels through each channel in either direction,
and ends up at the initiator.
Proof : A noninitiator holding θ has received θ once more than it has sent θ.
So by R1, this process can send θ into a channel.
Hence θ ends at the initiator, after traversing all its channels both ways.
Assume some channel isn’t traversed by θ both ways.
Let noninitiator q be the earliest visited process with such a channel.
q sends θ to its parent p. Namely, since θ visits p before q,
it traverses the channel pq both ways.
So by R2, q sends θ into all its channels.
Since q sends and receives θ an equal number of times, it also
receives θ through all its channels.
So θ travels through all channels of q both ways; contradiction.
42 / 329
Question
p q r
s t
Could this spanning tree have been produced by a depth-first search
starting at p ?
43 / 329
Depth-first search
Depth-first search is obtained by adding to Tarry’s algorithm:
R3 When a process receives the token, it immediately sends it
back through the same channel if this is allowed by R1,2.
Example: [Diagram: a depth-first traversal of the same network from p, with arrows numbered 1–12.]
In the spanning tree of a depth-first search, all frond edges connect
an ancestor with one of its descendants in the spanning tree.
44 / 329
Depth-first search with neighbor knowledge
To prevent transmission of the token through a frond edge,
visited processes are included in the token.
The token isn’t forwarded to processes in this list
(except when a process sends the token back to its parent).
Message complexity: 2N − 2 messages
Each tree edge carries 2 tokens.
Time complexity: ≤ 2N − 2 time units
Bit complexity: Up to kN bits per message
(where k bits are needed to represent one process).
45 / 329
Awerbuch’s algorithm
A process holding the token for the first time informs all neighbors
except its parent and the process to which it forwards the token.
The token is only forwarded when these neighbors have all
acknowledged reception.
The token is only forwarded to processes that weren’t yet visited
by the token (except when a process sends the token to its parent).
46 / 329
Awerbuch’s algorithm - Complexity
Message complexity: ≤ 4E messages
Each frond edge carries 2 info and 2 ack messages.
Each tree edge carries 2 tokens, and possibly 1 info/ack pair.
Time complexity: ≤ 4N − 2 time units
Each tree edge carries 2 tokens.
Each process waits at most 2 time units for ack’s to return.
47 / 329
Cidon’s algorithm
Abolish ack’s from Awerbuch’s algorithm.
The token is forwarded without delay.
Each process p records to which process fw_p it forwarded the token last.
Suppose process p receives the token from a process q ≠ fw_p.
Then p marks pq as a frond edge and dismisses the token.
Suppose process q receives an info message from fw_q.
Then q marks the channel to fw_q as a frond edge and continues forwarding the token.
48 / 329
Cidon’s algorithm - Complexity
Message complexity: ≤ 4E messages
Each channel carries at most 2 info messages and 2 tokens.
Time complexity: ≤ 2N − 2 time units
Each tree edge carries 2 tokens.
At least once per time unit, a token is forwarded through a tree edge.
49 / 329
Cidon’s algorithm - Example
[Diagram: a run on a network with nodes p, q, r, s, t; arrows numbered 1–9 show how tokens are forwarded without delay, some of which are dismissed at frond edges.]
50 / 329
Tree algorithm
The tree algorithm is a decentralized wave algorithm
for undirected, acyclic networks.
The local algorithm at a process p:
- p waits until it has received messages from all neighbors except one, which becomes its parent. Then it sends a message to its parent.
- If p receives a message from its parent, it decides. It sends the decision to all neighbors except its parent.
- If p receives a decision from its parent, it passes it on to all other neighbors.
Always two (neighboring) processes decide.
51 / 329
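A sketch of this local algorithm; the names and the send primitive are assumptions, and a driver would call start once at every process so that the leaves can send right away.

```python
# Sketch of the tree algorithm at one process of an undirected, acyclic network.

def send(channel, msg):  # network primitive, stubbed for this sketch
    pass

class TreeProcess:
    def __init__(self, channels):
        self.channels = set(channels)  # channels to all neighbors
        self.received = set()          # channels a wave message arrived on
        self.parent = None

    def start(self):
        # wait until messages arrived from all neighbors except one
        silent = self.channels - self.received
        if len(silent) == 1 and self.parent is None:
            self.parent = silent.pop()
            send(self.parent, "wave")

    def on_wave(self, channel):
        self.received.add(channel)
        if channel == self.parent:     # message from the parent: decide
            print("decide")
            for c in self.channels - {self.parent}:
                send(c, "decision")
        else:
            self.start()               # maybe only one silent neighbor is left

    def on_decision(self, channel):    # pass the decision on to all others
        for c in self.channels - {channel}:
            send(c, "decision")
```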
Tree algorithm - Example
[Diagram: messages travel from the leaves inward; the two neighboring processes in the middle both decide, and the decision is passed back outward.]
52 / 329
Questions
What happens if the tree algorithm is applied to a network
containing a cycle ?
Apply the tree algorithm to compute the size of an undirected,
acyclic network.
53 / 329
Tree algorithm - Correctness
Claim: If the tree algorithm is run on an acyclic network with N > 1,
then exactly two processes decide.
Proof : Suppose some process p never sends a message.
p doesn’t receive a message through two of its channels, qp and rp.
q doesn’t receive a message through two of its channels, pq and sq.
Continuing this argument, we get a cycle of processes that don’t
receive a message through two of their channels.
Since the network topology is a tree, there is no cycle; contradiction.
So each process eventually sends a message.
Clearly each channel carries at least one message.
There are N − 1 channels, so one channel carries two messages.
Only the two processes connected by this channel decide.
54 / 329
Echo algorithm
The echo algorithm is a centralized wave algorithm for undirected networks.
- The initiator sends a message to all neighbors.
- When a noninitiator receives a message for the first time, it makes the sender its parent. Then it sends a message to all neighbors except its parent.
- When a noninitiator has received a message from all neighbors, it sends a message to its parent.
- When the initiator has received a message from all neighbors, it decides.
Message complexity: 2E messages
55 / 329
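A sketch of the echo algorithm at one process; the names and the send primitive are assumptions of the sketch.

```python
# Sketch of the echo algorithm at one process of an undirected network.

def send(channel, msg):  # network primitive, stubbed for this sketch
    pass

class EchoProcess:
    def __init__(self, channels, is_initiator):
        self.channels = set(channels)
        self.is_initiator = is_initiator
        self.parent = None
        self.received = 0

    def start(self):                   # called at the initiator only
        for c in self.channels:
            send(c, "wave")

    def on_wave(self, channel):
        self.received += 1
        if not self.is_initiator and self.parent is None:
            self.parent = channel      # first message fixes the parent
            for c in self.channels - {channel}:
                send(c, "wave")
        if self.received == len(self.channels):
            if self.is_initiator:
                print("decide")        # echoes returned from all neighbors
            else:
                send(self.parent, "wave")
```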
Echo algorithm - Example
[Diagram: the wave floods outward from the initiator and is echoed back; when all echoes have reached the initiator, it decides.]
56 / 329
Questions
Use the echo algorithm to determine the largest process id.
Let each process initiate a run of the echo algorithm, tagged by its id.
Processes only participate in the “largest” wave they have seen so far.
Which of these concurrent waves complete ?
57 / 329
Communication and resource deadlock
A deadlock occurs if there is a cycle of processes waiting until:
- another process on the cycle sends some input (communication deadlock)
- or resources held by other processes on the cycle are released (resource deadlock)
Both types of deadlock are captured by the N-out-of-M model:
A process can wait for N grants out of M requests.
Examples:
- A process is waiting for one message from a group of processes: N = 1.
- A database transaction first needs to lock several files: N = M.
58 / 329
Wait-for graph
A (non-blocked) process can issue a request to M other processes,
and becomes blocked until N of these requests have been granted.
Then it informs the remaining M − N processes that the request
can be dismissed.
Only non-blocked processes can grant a request.
A (directed) wait-for graph captures dependencies between processes.
There is an edge from node p to node q if p sent a request to q
that wasn’t yet dismissed by p or granted by q.
59 / 329
Wait-for graph - Example 1
Suppose process p must wait for a message from process q.
In the wait-for graph, node p sends a request to node q.
Then edge pq is created in the wait-for graph, and p becomes
blocked.
When q sends a message to p, the request of p is granted.
Then edge pq is removed from the wait-for graph, and p becomes
unblocked.
60 / 329
Wait-for graph - Example 2
Suppose two processes p and q want to claim a resource.
In the wait-for graph, nodes u, v representing p, q send a request to
node w representing the resource. Edges uw and vw are created.
Since the resource is free, the resource is given to say p.
So w sends a grant to u. Edge uw is removed.
The basic (mutual exclusion) algorithm requires that the resource
must be released by p before q can claim it.
So w sends a request to u, creating edge wu in the wait-for graph.
After p releases the resource, u grants the request of w .
Edge wu is removed.
The resource is given to q. Hence w grants the request from v .
Edge vw is removed and edge wv is created.
61 / 329
Drawing wait-for graphs
[Diagrams: an AND (3-out-of-3) request and an OR (1-out-of-3) request in a wait-for graph.]
62 / 329
Questions
Draw the wait-for graph for the initial configuration of the tree algorithm,
applied to the following network.
63 / 329
Static analysis on a wait-for graph
A snapshot is taken of the wait-for graph.
A static analysis on the wait-for graph may reveal deadlocks:
- Non-blocked nodes can grant requests.
- When a request is granted, the corresponding edge is removed.
- When an N-out-of-M request has received N grants, the requester becomes unblocked. (The remaining M − N outgoing edges are dismissed.)
When no more grants are possible, nodes that remain blocked in the
wait-for graph are deadlocked in the snapshot of the basic algorithm.
64 / 329
Static analysis - Example 1
[Diagram: a wait-for graph with two blocked nodes b.]
Is there a deadlock ?
Answer: Deadlock
65 / 329
Static analysis - Example 2
[Diagram: a wait-for graph with two blocked nodes b; granting requests one by one eventually unblocks every node.]
No deadlock
66 / 329
Bracha-Toueg deadlock detection algorithm - Snapshot
Given an undirected network, and a basic algorithm.
A process that suspects it is deadlocked, initiates
a (Lai-Yang) snapshot to compute the wait-for graph.
Each node u takes a local snapshot of:
- requests it sent or received that weren't yet granted or dismissed;
- grant and dismiss messages in edges.
Then it computes:
Out_u : the nodes it sent a request to (not granted)
In_u : the nodes it received a request from (not dismissed)
67 / 329
Bracha-Toueg deadlock detection algorithm
requests_u is the number of grants u requires to become unblocked.
When u receives a grant message, requests_u ← requests_u − 1.
If requests_u becomes 0, u sends grant messages to all nodes in In_u.
If after termination of the deadlock detection run requests_u > 0 at the initiator u, then u is deadlocked (in the basic algorithm).
Challenge: The initiator must detect termination of deadlock detection.
68 / 329
Bracha-Toueg deadlock detection algorithm
Initially notified_u = false and free_u = false at all nodes u.
The initiator starts a deadlock detection run by executing Notify.
Notify_u : notified_u ← true
  for all w ∈ Out_u send NOTIFY to w
  if requests_u = 0 then Grant_u
  for all w ∈ Out_u await DONE from w
Grant_u : free_u ← true
  for all w ∈ In_u send GRANT to w
  for all w ∈ In_u await ACK from w
While a node is awaiting DONE or ACK messages,
it can process incoming NOTIFY and GRANT messages.
69 / 329
Bracha-Toueg deadlock detection algorithm
Let u receive NOTIFY.
If notified_u = false, then u executes Notify_u.
u sends back DONE.
Let u receive GRANT.
If requests_u > 0, then requests_u ← requests_u − 1;
if requests_u becomes 0, then u executes Grant_u.
u sends back ACK.
When the initiator has received DONE from all nodes in its Out set, it checks the value of its free field.
If it is still false, the initiator concludes it is deadlocked.
70 / 329
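A sketch of this logic at one node; the awaits of DONE and ACK messages (which hold back the node's own DONE/ACK) are omitted, and all names and the send primitive are assumptions of the sketch.

```python
# Sketch of the Bracha-Toueg Notify/Grant logic at one node.

def send(node, msg):  # network primitive, stubbed for this sketch
    pass

class BTNode:
    def __init__(self, out_nodes, in_nodes, requests):
        self.out_nodes = out_nodes  # Out_u: requests sent, not granted
        self.in_nodes = in_nodes    # In_u: requests received, not dismissed
        self.requests = requests    # grants still needed to become unblocked
        self.notified = False
        self.free = False

    def notify(self):               # Notify_u
        self.notified = True
        for w in self.out_nodes:
            send(w, "NOTIFY")       # each is answered later by DONE
        if self.requests == 0:
            self.grant()

    def grant(self):                # Grant_u
        self.free = True
        for w in self.in_nodes:
            send(w, "GRANT")        # each is answered later by ACK

    def on_notify(self, sender):
        if not self.notified:
            self.notify()
        send(sender, "DONE")

    def on_grant(self, sender):
        if self.requests > 0:
            self.requests -= 1
            if self.requests == 0:
                self.grant()
        send(sender, "ACK")
```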
Bracha-Toueg deadlock detection algorithm - Example
[Diagrams: a run on a wait-for graph with nodes u, v, w, x, where u is the initiator and initially requests_u = 2, requests_v = 1, requests_w = 1, requests_x = 0. NOTIFY messages from u build a tree and are answered by DONE messages; since requests_x = 0, x immediately sends GRANT messages, upon which w, v and finally u see their requests counter drop to 0 and grant in turn; ACK messages complete the grant trees.]
free_u = true, so u concludes that it isn't deadlocked.
71 / 329
Bracha-Toueg deadlock detection algorithm - Correctness
The Bracha-Toueg algorithm is deadlock-free:
The initiator eventually receives DONE’s from all nodes in its Out set.
At that moment the Bracha-Toueg algorithm has terminated.
Two types of trees are constructed, similar to the echo algorithm:
1. NOTIFY/DONE’s construct a tree T rooted in the initiator.
2. GRANT/ACK's construct disjoint trees T_v, rooted in a node v where from the start requests_v = 0.
The NOTIFY/DONE’s only complete when all GRANT/ACK’s
have completed.
72 / 329
Bracha-Toueg deadlock detection algorithm - Correctness
In a deadlock detection run, requests are granted as much as possible.
Therefore, if the initiator has received DONE’s from all nodes
in its Out set and its free field is still false, it is deadlocked.
Vice versa, if its free field is true, there is no deadlock (yet), provided that resource requests are granted nondeterministically.
73 / 329
Question
Could we apply the Bracha-Toueg algorithm to itself, to establish
that it is a deadlock-free algorithm ?
Answer: No.
The Bracha-Toueg algorithm can only establish whether a deadlock
is present in a snapshot of one computation of the basic algorithm.
74 / 329
Lecture in a nutshell
wave algorithm
traversal algorithm
- ring algorithm
- Tarry's algorithm
- depth-first search
tree algorithm
echo algorithm
communication and resource deadlock
wait-for graph
Bracha-Toueg deadlock detection algorithm
75 / 329
Termination detection
The basic algorithm is terminated if (1) each process is passive, and
(2) no basic messages are in transit.
[Diagram: an active process can send and receive; an internal event can make it passive; a receive makes a passive process active again.]
The control algorithm concerns termination detection and announcement.
Announcement is simple; we focus on detection.
Termination detection shouldn’t influence basic computations.
76 / 329
Dijkstra-Scholten algorithm
Requires a centralized basic algorithm, and an undirected network.
A tree T is maintained, which has the initiator p0 as the root, and
includes all active processes. Initially, T consists of p0 .
cc_p estimates (from above) the number of children of process p in T .
- When p sends a basic message, cc_p ← cc_p + 1.
- Let this message be received by q.
  - If q isn't yet in T , it joins T with parent p and cc_q ← 0.
  - If q is already in T , it sends a control message to p that it isn't a new child of p. Upon receipt of this message, cc_p ← cc_p − 1.
- When a noninitiator p is passive and cc_p = 0, it quits T and informs its parent that it is no longer a child.
- When the initiator p0 is passive and cc_p0 = 0, it calls Announce.
77 / 329
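A sketch of the bookkeeping at one process; the send primitive and all names are assumptions of the sketch.

```python
# Sketch of Dijkstra-Scholten child counting at one process.

def send(proc, msg):  # network primitive, stubbed for this sketch
    pass

class DSProcess:
    def __init__(self, is_initiator):
        self.in_tree = is_initiator    # initially T consists of p0 only
        self.parent = None
        self.cc = 0                    # estimate of the number of children
        self.passive = not is_initiator

    def send_basic(self, q, msg):
        self.cc += 1                   # optimistically count q as a new child
        send(q, ("basic", msg))

    def on_basic(self, msg, p):
        self.passive = False
        if not self.in_tree:
            self.in_tree, self.parent, self.cc = True, p, 0
        else:
            send(p, "not-a-child")     # p decrements cc_p on receipt

    def on_not_a_child(self):
        self.cc -= 1
        self.try_quit()

    def try_quit(self):                # also called on becoming passive
        if self.passive and self.cc == 0 and self.in_tree:
            if self.parent is None:    # the initiator detects termination
                print("Announce")
            else:
                send(self.parent, "not-a-child")
                self.in_tree = False
```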
Question
Let the initiator send a basic message and then become passive.
Why doesn’t it immediately detect termination ?
78 / 329
Shavit-Francez algorithm
Allows a decentralized basic algorithm; requires an undirected network.
A forest F of (disjoint) trees is maintained, rooted in initiators.
Initially, each initiator of the basic algorithm constitutes a tree in F .
- When a process p sends a basic message, cc_p ← cc_p + 1.
- Let this message be received by q.
  - If q isn't yet in a tree in F , it joins F with parent p and cc_q ← 0.
  - If q is already in a tree in F , it sends a control message to p that it isn't a new child of p. Upon receipt, cc_p ← cc_p − 1.
- When a noninitiator p is passive and cc_p = 0, it informs its parent that it is no longer a child.
A passive initiator p with cc_p = 0 starts a wave, tagged with its id.
Processes in a tree refuse to participate; a decide event calls Announce.
79 / 329
Rana’s algorithm
Allows a decentralized basic algorithm; requires an undirected network.
Each basic message is acknowledged.
A logical clock provides (basic and control) events with a time stamp.
The time stamp of a process is the highest time stamp of its events
so far (initially it is 0).
If at time t a process becomes quiet, i.e. (1) it has become passive,
and (2) all basic messages it sent have been acknowledged,
it starts a wave (of control messages), tagged with t (and its id).
Only processes that have been quiet from a time ≤ t on take part in
the wave.
If a wave completes, its initiator calls Announce.
80 / 329
Rana’s algorithm - Correctness
Suppose a wave, tagged with some t, doesn’t complete.
Then some process p doesn’t take part in this wave.
Due to this wave, p’s logical time becomes greater than t.
When p becomes quiet, it starts a new wave, tagged with some t′ > t.
81 / 329
Rana’s algorithm - Correctness
Suppose a quiet process q takes part in a wave,
and is later on made active by a basic message from a process p
that wasn’t yet visited by this wave.
Then this wave won’t complete.
Namely, let the wave be tagged with t.
When q takes part in the wave, its logical clock becomes > t.
By the ack from q to p, in response to the basic message from p,
the logical clock of p becomes > t.
So p won’t take part in the wave (because it is tagged with t).
82 / 329
Question
What is a drawback of the Dijkstra-Scholten as well as Rana’s algorithm ?
Answer: Requires one control message for every basic message.
83 / 329
Weight-throwing termination detection
Requires a centralized basic algorithm; allows a directed network.
The initiator has weight 1, all noninitiators have weight 0.
When a process sends a basic message, it transfers part of
its weight to this message.
When a process receives a basic message, it adds the weight of
this message to its own weight.
When a noninitiator becomes passive, it returns its weight to
the initiator.
When the initiator becomes passive, and has regained weight 1,
it calls Announce.
84 / 329
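A sketch with weights as plain floats (a real implementation must handle underflow, as discussed on the next slide, and would use exact fractions rather than floating point); all names and the send primitive are assumptions of the sketch.

```python
# Sketch of weight-throwing termination detection.

def send(proc, msg):  # network primitive, stubbed for this sketch
    pass

class WTProcess:
    def __init__(self, is_initiator, initiator=None):
        self.weight = 1.0 if is_initiator else 0.0
        self.is_initiator = is_initiator
        self.initiator = initiator   # where noninitiators return their weight
        self.passive = not is_initiator

    def send_basic(self, q, msg):
        self.weight /= 2             # transfer half of the weight
        send(q, ("basic", msg, self.weight))

    def on_basic(self, msg, w):
        self.passive = False         # receipt of a basic message activates
        self.weight += w

    def become_passive(self):
        self.passive = True
        if not self.is_initiator:
            send(self.initiator, ("return", self.weight))
            self.weight = 0.0
        self.check()

    def on_returned_weight(self, w):  # at the initiator
        self.weight += w
        self.check()

    def check(self):                  # a passive initiator with full weight
        if self.is_initiator and self.passive and self.weight == 1.0:
            print("Announce")
```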
Weight-throwing termination detection - Underflow
Underflow: The weight of a process can become too small
to be divided further.
Solution 1: The process gives itself extra weight, and informs
the initiator that there is additional weight in the system.
An ack from the initiator is needed before the extra weight
can be used, to avoid race conditions.
Solution 2: The process initiates a weight-throwing termination
detection sub-call, and only returns its weight to the initiator
when it has become passive and this sub-call has terminated.
85 / 329
Question
Why is the following termination detection algorithm not correct ?
- Each basic message is acknowledged.
- If a process becomes quiet, i.e. (1) it has become passive, and (2) all basic messages it sent have been acknowledged, then it starts a wave (tagged with its id).
- Only quiet processes take part in the wave.
- If the wave completes, its initiator calls Announce.
Answer: Let a process p that wasn’t yet visited by the wave
make a quiet process q that was already visited active again.
Next p becomes quiet before the wave arrives.
Now the wave can complete while q is active.
86 / 329
Token-based termination detection
The following centralized termination detection algorithm allows
a decentralized basic algorithm and a directed network.
A process p0 is initiator of a traversal algorithm to check whether
all processes are passive.
Complication 1: Due to the directed channels, reception of
basic messages can’t be acknowledged.
Complication 2: A traversal of only passive processes doesn’t guarantee
termination (even if there are no basic messages in the channels).
87 / 329
Complication 2 - Example
[Diagram: a directed network on p0, q, r, s.]
The token is at p0 ; only s is active.
The token travels to r .
s sends a basic message to q, making q active.
s becomes passive.
The token travels on to p0 , which falsely calls Announce.
88 / 329
Safra’s algorithm
Allows a decentralized basic algorithm and a directed network.
Each process maintains a counter of type Z; initially it is 0.
At each outgoing/incoming basic message, the counter is
increased/decreased.
At any time, the sum of all counters in the network is ≥ 0,
and it is 0 if and only if no basic messages are in transit.
At each round trip, the token carries the sum of the counters
of the processes it has traversed.
Complication: The token may end a round trip with a negative sum,
when a visited passive process becomes active by a basic message,
and sends basic messages that are received by an unvisited process.
89 / 329
Safra’s algorithm
Processes are colored white or black. Initially they are white,
and a process that receives a basic message becomes black.
- When p0 is passive, it sends a white token with counter 0.
- A noninitiator only forwards the token when it is passive.
- When a black process receives the token, the process becomes white and the token black. The token will stay black for the rest of the round trip.
- Eventually the token returns to p0, which waits until it is passive:
  - If the token is white and the sum of all counters is zero, p0 calls Announce.
  - Else, p0 sends a white token again.
90 / 329
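A sketch for a token making round trips on a directed ring; the ring form of the traversal, the forward primitive and all names are assumptions of the sketch.

```python
# Sketch of Safra's algorithm at one process on a directed ring.

def forward(token):  # passes the token to the next process; stubbed
    pass

class SafraProcess:
    def __init__(self, is_initiator):
        self.counter = 0    # basic messages sent minus received
        self.black = False
        self.is_initiator = is_initiator

    def on_basic_send(self):
        self.counter += 1

    def on_basic_receive(self):
        self.counter -= 1
        self.black = True   # a receiver of a basic message becomes black

    def handle_token(self, color, total):
        # a process only handles (and forwards) the token when it is passive
        if self.is_initiator:
            if color == "white" and not self.black and total + self.counter == 0:
                print("Announce")
            else:
                self.black = False
                forward(("white", 0))  # start a new round trip
        else:
            if self.black:
                color = "black"        # the token stays black from here on
                self.black = False     # ... and the process becomes white
            forward((color, total + self.counter))
```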
Safra’s algorithm - Example
The token is at p0 ; only s is active; no messages are in transit;
all processes are white with counter 0.
[Diagram: the same directed network on p0, q, r, s.]
s sends a basic message m to q, setting the counter of s to 1.
s becomes passive.
The token travels around the network, white with sum 1.
The token travels on to r , white with sum 0.
m travels to q and back to s, making them active, black, with counter 0.
s becomes passive.
The token travels from r to p0 , black with sum 0.
q becomes passive.
After two more round trips of the token, p0 calls Announce.
91 / 329
Safra’s algorithm - Correctness
When the system has terminated,
- the token will color all processes white, and
- the counters of the processes sum up to zero.
So the token eventually returns to the initiator white with counter 0.
Suppose a token returns to the initiator white with counter 0.
Since the token is white: if reception of a message is included in
the counter, then sending this message is included in the counter too.
So, since the counter is 0:
- no process was made active after the token's visit, and
- no messages are in transit.
92 / 329
Question
Any suggestions for an optimization of Safra’s algorithm ?
(Hint: Can we do away with black tokens ?)
Answer: When a black process gets the token, it dismisses the token
(and becomes white).
When the process becomes passive, it sends a fresh token,
tagged with its id.
93 / 329
Garbage collection
Processes are provided with memory.
Objects carry pointers to local objects and references to remote objects.
A root object can be created in memory; objects are always accessed
by navigating from a root object.
Aim of garbage collection: To reclaim inaccessible objects.
Three operations by processes to build or delete a reference:
- Creation: The object owner sends a pointer to another process.
- Duplication: A process that isn't the object owner sends a reference to another process.
- Deletion: The reference is deleted at its process.
94 / 329
Reference counting
Reference counting tracks the number of references to an object.
If it drops to zero, and there are no pointers, the object is garbage.
Advantage: Can be performed at run-time.
Drawback: Can’t reclaim cyclic garbage.
95 / 329
Indirect reference counting
A tree is maintained for each object, with the object at the root,
and the references to this object as the other nodes in the tree.
Each object maintains a counter how many references to it
have been created.
Each reference is supplied with a counter how many times
it has been duplicated.
References keep track of their parent in the tree,
where they were duplicated or created from.
96 / 329
Indirect reference counting
If a process receives a reference, but already holds a reference to
or owns this object, it sends back a decrement.
When a duplicated (or created) reference has been deleted,
and its counter is zero, a decrement is sent
to the process it was duplicated from (or to the object owner).
When the counter of the object becomes zero,
and there are no pointers to it, the object can be reclaimed.
97 / 329
Weighted reference counting
Each object carries a total weight (equal to the weights of
all references to the object), and a partial weight.
When a reference is created, the partial weight of the object
is divided over the object and the reference.
When a reference is duplicated, the weight of the reference
is divided over itself and the copy.
When a reference is deleted, the object owner is notified,
and the weight of the deleted reference is subtracted from
the total weight of the object.
If the total weight of the object becomes equal to its partial weight,
and there are no pointers to the object, it can be reclaimed.
98 / 329
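A sketch with weights as plain floats; the classes, and doing the owner notification as a direct call rather than a message, are assumptions of the sketch.

```python
# Sketch of weighted reference counting.

class Obj:
    def __init__(self):
        self.total = 1.0    # total weight (all references plus partial weight)
        self.partial = 1.0  # weight kept by the object itself

    def create_reference(self):
        self.partial /= 2   # split the partial weight with the new reference
        return Ref(self, self.partial)

    def on_delete(self, weight):
        self.total -= weight
        if self.total == self.partial:
            print("no references left: reclaim if there are no pointers")

class Ref:
    def __init__(self, obj, weight):
        self.obj, self.weight = obj, weight

    def duplicate(self):
        self.weight /= 2    # split this reference's weight with the copy
        return Ref(self.obj, self.weight)

    def delete(self):
        self.obj.on_delete(self.weight)  # in reality a message to the owner
```

For instance, after r = Obj().create_reference() and r2 = r.duplicate(), deleting both r2 and r returns the total weight to the partial weight, so the object can be reclaimed.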
Weighted reference counting - Underflow
When the weight of a reference (or object) becomes too small
to be divided further, no more duplication (or creation) is possible.
Solution 1: The reference increases its weight,
and tells the object owner to increase its total weight.
An ack from the object owner to the reference is needed before
the additional weight can be used, to avoid race conditions.
Solution 2: The process at which the underflow occurs
creates an artificial object with a new total weight,
and with a reference to the original object.
Duplicated references are then to the artificial object,
so that references to the original object become indirect.
99 / 329
Question
Why is it much more important to address underflow of weight
than overflow of a reference counter ?
Answer: At each reference creation and duplication, weight decreases
exponentially fast, while the reference counter increases linearly.
100 / 329
Garbage collection ⇒ termination detection
Garbage collection algorithms can be transformed into
(existing and new) termination detection algorithms.
Given a basic algorithm.
Let each process p host one artificial root object Op .
There is also a special non-root object Z .
Initially, only initiators p hold a reference from Op to Z .
Each basic message carries a duplication of the Z -reference.
When a process becomes passive, it deletes its Z -reference.
The basic algorithm is terminated if and only if Z is garbage.
101 / 329
Garbage collection ⇒ termination detection - Examples
Indirect reference counting ⇒ Dijkstra-Scholten termination detection.
Weighted reference counting ⇒ weight-throwing termination detection.
102 / 329
Mark-scan
Mark-scan garbage collection consists of two phases:
- A traversal of all accessible objects, which are marked.
- All unmarked objects are reclaimed.
Drawback: In a distributed setting, mark-scan usually requires
freezing the basic computation.
In mark-copy, the second phase consists of copying all marked objects
to contiguous empty memory space.
In mark-compact, the second phase compacts all marked objects
without requiring empty space.
Copying is significantly faster than compaction, but leads to
fragmentation of the memory space (and uses more memory).
103 / 329
Generational garbage collection
In practice, most objects either can be reclaimed shortly after
their creation, or stay accessible for a very long time.
Garbage collection in Java, which is based on mark-scan,
therefore divides objects into multiple generations.
- Garbage in the youngest generation is collected frequently using mark-copy.
- Garbage in the older generations is collected sporadically using mark-compact.
104 / 329
This lecture in a nutshell
termination detection
- Dijkstra-Scholten algorithm
- Shavit-Francez algorithm
- Rana's algorithm
- weight throwing
- Safra's algorithm
garbage collection ⇒ termination detection
- indirect reference counting
- weighted reference counting
- mark-scan
- generational garbage collection
105 / 329
Routing
Routing means guiding a packet in a network to its destination.
A routing table at node u stores for each v ≠ u a neighbor w of u:
Each packet with destination v that arrives at u is passed on to w.
Criteria for good routing algorithms:
- use of optimal paths
- robust with respect to topology changes in the network
- cope with very large, dynamic networks
- table adaptation to avoid busy edges
106 / 329
Chandy-Misra algorithm
Consider an undirected, weighted network, with weights ω_vw > 0.
A centralized algorithm to compute all shortest paths to initiator u0.
Initially, dist_u0(u0) = 0, dist_v(u0) = ∞ if v ≠ u0, and parent_v(u0) = ⊥.
u0 sends the message ⟨0⟩ to its neighbors.
Let node v receive ⟨d⟩ from neighbor w. If d + ω_vw < dist_v(u0), then:
- dist_v(u0) ← d + ω_vw and parent_v(u0) ← w
- v sends ⟨dist_v(u0)⟩ to its neighbors (except w)
Termination detection by e.g. the Dijkstra-Scholten algorithm.
107 / 329
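A sketch of the relaxation at one node; the graph representation and the send primitive are assumptions of the sketch.

```python
# Sketch of the Chandy-Misra relaxation at one node.

INF = float("inf")

def send(node, msg):  # network primitive, stubbed for this sketch
    pass

class CMNode:
    def __init__(self, weights, is_root):
        self.weights = weights         # neighbor -> edge weight, all > 0
        self.dist = 0 if is_root else INF
        self.parent = None
        if is_root:
            for w in self.weights:
                send(w, 0)             # the initiator u0 sends <0>

    def on_distance(self, d, w):       # <d> received from neighbor w
        if d + self.weights[w] < self.dist:
            self.dist = d + self.weights[w]
            self.parent = w
            for u in self.weights:
                if u != w:
                    send(u, self.dist) # propagate the improvement
```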
Question
Why is Rana’s algorithm not a good choice for detecting termination ?
Answer: Nodes tend to become quiet, and start a wave, often.
108 / 329
Chandy-Misra algorithm - Example
[Diagram: a weighted network on u0, v, w, x, annotated with the successive dist and parent updates at each node; intermediate estimates such as dist_w ← 6 (parent u0) and dist_v ← 4 (parent u0) are later improved, and the final values are dist_x = 1 (parent u0), dist_w = 2 (parent x) and dist_v = 2 (parent x).]
109 / 329
Chandy-Misra algorithm - Complexity
Worst-case message complexity: Exponential
Worst-case message complexity for minimum-hop: O(N²·E)
For each root, the algorithm requires at most O(N·E) messages.
110 / 329
Merlin-Segall algorithm
A centralized algorithm to compute all shortest paths to initiator u0 .
Initially, dist_u0(u0) = 0, dist_v(u0) = ∞ if v ≠ u0, and the parent_v(u0) values form a sink tree with root u0.
Each round, u0 sends ⟨0⟩ to its neighbors.
1. Let node v get ⟨d⟩ from neighbor w.
   If d + ω_vw < dist_v(u0), then dist_v(u0) ← d + ω_vw (and v stores w as future value for parent_v(u0)).
   If w = parent_v(u0), then v sends ⟨dist_v(u0)⟩ to its neighbors except parent_v(u0).
2. When a v ≠ u0 has received a message from all neighbors, it sends ⟨dist_v(u0)⟩ to parent_v(u0), and updates parent_v(u0).
u0 starts a new round after receiving a message from all neighbors.
111 / 329
Merlin-Segall algorithm - Termination + complexity
After i rounds, all shortest paths of ≤ i hops have been computed.
So the algorithm can terminate after N − 1 rounds.
Message complexity: Θ(N²·E)
For each root, the algorithm requires Θ(N·E) messages.
No separate termination detection is needed.
112 / 329
Merlin-Segall algorithm - Example (rounds 1–3)
[Diagrams: three rounds of the algorithm on a weighted network with root u0; in every round the nodes exchange their current distance estimates along the tree, the estimates shrink, and the parent pointers are updated, until after round 3 all shortest paths to u0 have been computed.]
115 / 329
Merlin-Segall algorithm - Topology changes
A number is attached to distance messages.
When an edge fails or becomes operational, adjacent nodes send
a message to u0 via the sink tree.
(If the message meets a failed tree link, it is discarded.)
When u0 receives such a message, it starts a new set of N − 1 rounds,
with a higher number.
If the failed edge is part of the sink tree, the sink tree is rebuilt.
Example: [Diagram: a network on u0, v, w, x, y, z with a sink tree toward u0.]
x informs u0 (via v) that an edge of the sink tree has failed.
116 / 329
Toueg’s algorithm
Computes for each pair u, v a shortest path from u to v .
d_S(u, v), with S a set of nodes, denotes the length of a shortest path from u to v with all intermediate nodes in S.
d_S(u, u) = 0
d_∅(u, v) = ω_uv if u ≠ v and uv ∈ E
d_∅(u, v) = ∞ if u ≠ v and uv ∉ E
d_{S∪{w}}(u, v) = min{ d_S(u, v), d_S(u, w) + d_S(w, v) } if w ∉ S
117 / 329
Floyd-Warshall algorithm
We first discuss a uniprocessor algorithm.
Initially, S = ∅; the first three equations define d_∅.
While S doesn't contain all nodes, a pivot w ∉ S is selected:
- d_{S∪{w}} is computed from d_S using the fourth equation.
- w is added to S.
When S contains all nodes, d_S is the standard distance function.
Time complexity: Θ(N³)
118 / 329
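The uniprocessor computation above fits in a few lines of Python; the graph representation is an assumption of this sketch, and the example weights are the ones that appear in Toueg's example further on.

```python
# The uniprocessor Floyd-Warshall algorithm sketched above.

INF = float("inf")

def floyd_warshall(nodes, weight):
    """weight maps a directed edge (u, v) to its weight."""
    # the first three equations define d_emptyset
    d = {(u, v): 0 if u == v else weight.get((u, v), INF)
         for u in nodes for v in nodes}
    for w in nodes:           # pick the pivots in some fixed order
        for u in nodes:
            for v in nodes:   # the fourth equation: try routing via pivot w
                d[u, v] = min(d[u, v], d[u, w] + d[w, v])
    return d                  # now S contains all nodes

edges = {("u", "v"): 4, ("u", "x"): 1, ("v", "w"): 1, ("w", "x"): 1}
edges.update({(b, a): c for (a, b), c in edges.items()})  # undirected
print(floyd_warshall(["u", "v", "w", "x"], edges)["u", "w"])  # prints 2
```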
Question
Which complications arise when the Floyd-Warshall algorithm
is turned into a distributed algorithm ?
- All nodes must pick the pivots in the same order.
- Each round, nodes need the distance values of the pivot to compute their own routing table.
119 / 329
Toueg’s algorithm
Assumption: Each node knows the id’s of all nodes.
(Because pivots must be picked uniformly at all nodes.)
Initially, at each node u:
- S_u = ∅;
- dist_u(u) = 0 and parent_u(u) = ⊥;
- for each v ≠ u, either dist_u(v) = ω_uv and parent_u(v) = v if there is an edge uv, or dist_u(v) = ∞ and parent_u(v) = ⊥ otherwise.
120 / 329
Toueg’s algorithm
At the w-pivot round, w broadcasts its values dist_w(v), for all nodes v.
If parent_u(w) = ⊥ for a node u ≠ w at the w-pivot round, then dist_u(w) = ∞, so dist_u(w) + dist_w(v) ≥ dist_u(v) for all nodes v.
Hence the sink tree toward w can be used to broadcast dist_w.
If u is in the sink tree toward w, it sends ⟨request, w⟩ to parent_u(w), to let it pass on dist_w.
If u isn't in the sink tree toward w, it proceeds to the next pivot round, with S_u ← S_u ∪ {w}.
121 / 329
Toueg’s algorithm
Consider any node u in the sink tree toward w.
If u ≠ w, it waits for the values dist_w(v) from parent_u(w).
u forwards the values dist_w(v) to neighbors that send ⟨request, w⟩ to u.
If u ≠ w, it checks for each node v whether dist_u(w) + dist_w(v) < dist_u(v).
If so, dist_u(v) ← dist_u(w) + dist_w(v) and parent_u(v) ← parent_u(w).
Finally, u proceeds to the next pivot round, with S_u ← S_u ∪ {w}.
122 / 329
Toueg’s algorithm - Example
[Diagram: a network on u, v, w, x with edge weights ω_uv = 4 and ω_ux = ω_vw = ω_wx = 1.]
pivot u: dist_x(v) ← 5, parent_x(v) ← u; dist_v(x) ← 5, parent_v(x) ← u
pivot v: dist_u(w) ← 5, parent_u(w) ← v; dist_w(u) ← 5, parent_w(u) ← v
pivot w: dist_x(v) ← 2, parent_x(v) ← w; dist_v(x) ← 2, parent_v(x) ← w
pivot x: dist_u(w) ← 2, parent_u(w) ← x; dist_w(u) ← 2, parent_w(u) ← x;
         dist_u(v) ← 3, parent_u(v) ← x; dist_v(u) ← 3, parent_v(u) ← w
123 / 329
Toueg’s algorithm - Complexity + drawbacks
Message complexity: O(N²)
There are N pivot rounds, each taking at most O(N) messages.
Drawbacks:
- Uniform selection of pivots requires that all nodes know the nodes in the network in advance.
- Global broadcast of dist_w at the w-pivot round causes a high bit complexity.
- Not robust with respect to topology changes.
124 / 329
Question
Which addition needs to be made to the algorithm to allow that
a node u can discard the routing table of the pivot w at some point ?
Answer: Next to ⟨request, w⟩, u informs all other neighbors that they do not need to forward w's routing table to u.
Then the message complexity increases to O(N·E).
125 / 329
Toueg’s algorithm - Optimization
Let parent_u(w) = x with x ≠ w at the start of the w-pivot round.
If dist_x(v) doesn't change in this round, then neither does dist_u(v) (for any v).
Upon reception of dist_w, x first updates dist_x, and only forwards values dist_w(v) for which dist_x(v) has changed.
126 / 329
Distance-vector routing
Consider a network in which nodes or links may fail or are added.
Such a change is eventually detected by its neighbors.
In distance-vector routing, at a change in the local topology or
routing table, a node sends its entire routing table to its neighbors.
Each node locally computes shortest paths
(e.g. with the Bellman-Ford algorithm, if links can have negative weights).
127 / 329
Link-state routing
In link-state routing, each node periodically sends a link-state packet, with
- the node's edges and their weights (based on latency, bandwidth)
- a sequence number (which increases with each broadcast)
Link-state packets are flooded through the network.
Nodes store link-state packets, to obtain a view of the entire network.
Sequence numbers avoid that new info is overwritten by old info.
Each node locally computes shortest paths (e.g. with Dijkstra’s alg.).
128 / 329
Question
Flooding entire routing tables (instead of only edges and weights)
tends to produce a less efficient algorithm.
Why is that ?
Answer: A routing table may be based on remote edges that have
recently crashed.
And, of course, the bit complexity increases dramatically.
129 / 329
Link-state routing - Time-to-live
When a node recovers from a crash, its sequence number starts at 0.
So its link-state packets may be ignored for a long time.
Therefore link-state packets carry a time-to-live (TTL) field.
After this time the information from the packet may be discarded.
To reduce flooding, each time a link-state packet is forwarded,
its TTL field decreases.
When it becomes 0, the packet is discarded.
130 / 329
Autonomous systems
The OSPF protocol for routing on the Internet uses link-state routing.
The RIP protocol employs distance-vector routing.
Link-state / distance-vector routing doesn’t scale to the Internet,
because it uses flooding / sends entire routing tables.
Therefore the Internet is divided into autonomous systems.
Each autonomous system uses the OSPF or RIP protocol.
131 / 329
Border gateway protocol
The Border Gateway Protocol routes between autonomous systems.
Routers broadcast updates of their routing tables (a la Chandy-Misra).
A router may update its routing table
- because it detects a topology change, or
- because of an update in the routing table of a neighbor.
132 / 329
Routing on the Internet
133 / 329
Lecture in a nutshell
routing tables to guide a packet to its destination
Chandy-Misra algorithm has exponential worst-case message complexity, but only O(N²·E) for minimum-hop paths
Merlin-Segall algorithm has message complexity Θ(N²·E)
Toueg's algorithm has message complexity O(N²) (but has a high bit complexity, and requires uniform selection of pivots)
link-state / distance-vector routing and the border gateway protocol
employ classical routing algorithms on the Internet
134 / 329
Breadth-first search
Consider an undirected, unweighted network.
A breadth-first search tree is a sink tree in which each tree path
to the root is minimum-hop.
The Chandy-Misra algorithm for minimum-hop paths computes a breadth-first search tree using O(N·E) messages (for each root).
The following centralized algorithm requires O(N·√E) messages (for each root).
135 / 329
Breadth-first search - A “simple” algorithm
Initially (after round 0), the initiator is at distance 0,
noninitiators are at distance ∞, and parents are undefined.
After round n ≥ 0, the tree has been constructed up to depth n.
Nodes at distance n know which neighbors are at distance n − 1.
In round n + 1:
[Diagram: forward messages travel down the tree to the nodes at distance n; explore messages probe the nodes at distance n + 1; reverse messages report back up the tree.]
136 / 329
Breadth-first search - A “simple” algorithm
- Messages ⟨forward, n⟩ travel down the tree, from the initiator to nodes at distance n.
- When a node at distance n gets ⟨forward, n⟩, it sends ⟨explore, n + 1⟩ to neighbors that aren't at distance n − 1.
Let a node v receive ⟨explore, n + 1⟩.
- If dist_v = ∞, then dist_v ← n + 1, the sender becomes v's parent, and v sends back ⟨reverse, true⟩.
- If dist_v = n + 1, then v stores that the sender is at distance n, and v sends back ⟨reverse, false⟩.
- If dist_v = n, then this is a negative ack for the ⟨explore, n + 1⟩ that v sent into this edge.
137 / 329
Breadth-first search - A “simple” algorithm
- A noninitiator at distance n (or < n) waits until all messages ⟨explore, n + 1⟩ (resp. ⟨forward, n⟩) have been answered. Then it sends ⟨reverse, b⟩ to its parent, where b = true if and only if new nodes were added to its subtree.
- The initiator waits until all messages ⟨forward, n⟩ (or, in round 1, ⟨explore, 1⟩) have been answered. If no new nodes were added in round n + 1, it terminates. Else, it continues with round n + 2.
In round n + 2, nodes only send a forward to children that reported
newly discovered nodes in round n + 1.
138 / 329
Breadth-first search - Complexity
Worst-case message complexity: O(N² + E) = O(N²)
There are at most N rounds.
Each round, tree edges carry at most 1 forward and 1 replying reverse.
In total, edges carry 1 explore and 1 replying reverse or explore.
Worst-case time complexity: O(N²)
Round n is completed in at most 2n time units, for n = 1, . . . , N.
139 / 329
Frederickson’s algorithm
Computes ℓ levels per round, with 1 ≤ ℓ < N.
Initially, the initiator is at distance 0, noninitiators are at distance ∞, and parents are undefined.
After round n, the tree has been constructed up to depth ℓn.
In round n + 1:
- ⟨forward, ℓn⟩ travels down the tree, from the initiator to nodes at distance ℓn.
- When a node at distance ℓn gets ⟨forward, ℓn⟩, it sends ⟨explore, ℓn + 1⟩ to neighbors that aren't at distance ℓn − 1.
140 / 329
Frederickson’s algorithm - Complications
Complication 1: In round n + 1, a node at a distance > ℓn may send
multiple explore’s into an edge.
How can this happen ?
Which (small) complication may arise as a result of this ?
How can this be resolved ?
Solution: reverse’s in reply to explore’s are supplied with a distance.
Complication 2: A node w may receive a forward from a non-parent v .
How can this happen ?
Solution: w can dismiss this forward.
In the previous round, w informed v that it is no longer a child.
141 / 329
Frederickson’s algorithm
Let a node v receive ⟨explore, k⟩. We distinguish two cases.
I k < dist v :
dist v ← k, and the sender becomes v ’s parent.
If ℓ doesn’t divide k, then v sends ⟨explore, k + 1⟩ to
its other neighbors.
If ℓ divides k, then v sends back ⟨reverse, k, true⟩.
I k ≥ dist v :
If k = dist v and ℓ divides k, then v sends back ⟨reverse, k, false⟩.
Else v doesn’t send a reply (because it sent ⟨explore, dist v + 1⟩
into this edge).
142 / 329
Frederickson’s algorithm
I Nodes at a distance ℓn < k < ℓ(n + 1) wait until a message
⟨reverse, k + 1, _⟩ or ⟨explore, j⟩ with j ∈ {k, k + 1, k + 2}
has been received from all neighbors.
Then they send ⟨reverse, k, true⟩ to their parent.
I Noninitiators at a distance ℓn (or < ℓn) wait until all messages
⟨explore, ℓn + 1⟩ (resp. ⟨forward, ℓn⟩) have been answered with
a reverse or explore (resp. reverse).
Then they send ⟨reverse, b⟩ to their parent, where b = true if
and only if they received ⟨reverse, _, true⟩ from a child.
143 / 329
Frederickson’s algorithm
I The initiator waits until all messages ⟨forward, ℓn⟩
(or, in round 1, ⟨explore, 1⟩) have been answered.
If it is certain that no unexplored nodes remain, it terminates.
Else, it continues with round n + 2.
In round n + 2, nodes only send a forward to children that reported
newly discovered nodes in round n + 1.
144 / 329
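A minimal sketch of the same simulation with ℓ levels explored per round
(again illustrative; bfs_l_rounds and the graph encoding are mine, and
message passing is collapsed into whole rounds):

from math import inf

def bfs_l_rounds(graph, root, l):
    dist = {v: inf for v in graph}
    parent = {v: None for v in graph}
    dist[root] = 0
    frontier = [root]                        # nodes at distance ℓn
    while frontier:
        for _ in range(l):                   # one round: ℓ further levels
            next_level = []
            for v in frontier:
                for w in graph[v]:
                    if dist[w] == inf:
                        dist[w] = dist[v] + 1
                        parent[w] = v
                        next_level.append(w)
            frontier = next_level
    return dist, parent

g = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
dist, _ = bfs_l_rounds(g, 0, 2)
assert dist == {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}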
Question
Apply Frederickson’s algorithm to the network below, with initiator u0
and ℓ = 2:
(figure: a network on the nodes u0 , v , w , x; u0 starts at distance 0,
and v , w , x start at distance ∞.)
Give a computation in which:
I w becomes the parent of x, and
I in round 2, v sends a spurious forward to w .
145 / 329
Frederickson’s algorithm - Complexity
Worst-case message complexity: O(N²/ℓ + ℓ·E)
There are at most ⌈(N−1)/ℓ⌉ + 1 rounds.
Each round, tree edges carry at most 1 forward and 1 replying reverse.
In total, edges carry at most 2ℓ explore’s and 2ℓ replying reverse’s.
(In total, frond edges carry at most 1 spurious forward.)
Worst-case time complexity: O(N²/ℓ)
Round n is completed in at most 2ℓn time units, for n = 1, . . . , ⌈(N−1)/ℓ⌉ + 1.
If ℓ = ⌈N/√E⌉, both message and time complexity are O(N·√E).
146 / 329
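A back-of-the-envelope check of this choice of ℓ (illustrative numbers):
the two terms N²/ℓ and ℓ·E of the message complexity balance exactly
at ℓ = N/√E, where both equal N·√E.

N, E = 1000, 4000
l = N / E ** 0.5              # ℓ = N/√E ≈ 15.81
print(N * N / l, l * E)       # both print 63245.55..., i.e. N·√E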
Questions
What is the optimal value of ℓ in case the network is:
I a complete graph
I acyclic
147 / 329
Question
Even with cycle-free routes, we can still get a deadlock. Why ?
Hint: Consider the (bounded) memory at the processes.
148 / 329
Store-and-forward deadlocks
A store-and-forward deadlock occurs when a group of packets are all
waiting for the use of a buffer slot occupied by a packet in the group.
A controller avoids such deadlocks.
It prescribes whether a packet can be generated or forwarded, and
in which buffer slot it is put next.
149 / 329
Deadlock-free packet switching
Consider an undirected network, supplied with routing tables.
Processes store data packets traveling to their destination in buffers.
Possible events:
I Generation: A new packet is placed in an empty buffer slot.
I Forwarding: A packet is forwarded to an empty buffer slot
of the next node on its route.
I Consumption: A packet at its destination node is removed
from the buffer.
At a node with an empty buffer, packet generation must be allowed.
150 / 329
Synchronous versus asynchronous networks
For simplicity we assume synchronous communication.
In an asynchronous setting, a node can only eliminate a packet
when it is sure that the packet will be accepted at the next node.
Question: How can this be achieved in an undirected network ?
Answer: A packet can only be eliminated by the sender when
its reception has been acknowledged.
151 / 329
Destination controller
The network consists of nodes u0 , . . . , uN−1 .
Ti denotes the sink tree (with respect to the routing tables)
with root ui for i = 0, . . . , N − 1.
In the destination controller, each node carries N buffer slots.
I When a packet with destination ui is generated at v ,
it is placed in the i th buffer slot of v .
I If vw is an edge in Ti , then the i th buffer slot of v is linked
to the i th buffer slot of w .
152 / 329
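A small sketch of the resulting buffer graph (illustrative; the function
name and the encoding of sink trees as next-hop maps are assumptions):

def destination_controller_links(sink_trees):
    # sink_trees[i] maps each node v != u_i to its next hop toward u_i.
    links = []                     # pairs ((v, i), (w, i)) of linked slots
    for i, next_hop in enumerate(sink_trees):
        for v, w in next_hop.items():   # vw is an edge of sink tree T_i
            links.append(((v, i), (w, i)))
    return links

# Triangle u0, u1, u2 with the obvious one-hop sink trees:
trees = [{1: 0, 2: 0}, {0: 1, 2: 1}, {0: 2, 1: 2}]
for (v, i), (w, _) in destination_controller_links(trees):
    print(f"slot {i} of u{v} -> slot {i} of u{w}")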
Destination controller - Correctness
Theorem: The destination controller is deadlock-free.
Proof : Consider a reachable configuration γ.
Make forwarding and consumption transitions to a configuration δ
where no forwarding or consumption is possible.
For each i, since Ti is acyclic, packets in an i th buffer slot can travel
to their destination, where they are consumed.
So in δ, all buffers are empty.
153 / 329
Hops-so-far controller
The network consists of nodes u0 , . . . , uN−1 .
Ti is the sink tree (with regard to the routing tables) with root ui
for i = 0, . . . , N − 1.
K is the length of a longest path in any Ti .
In the hops-so-far controller, each node carries K + 1 buffer slots,
numbered from 0 to K .
I A generated packet is placed in the 0th buffer slot.
I For each edge vw and any j < K , the j th buffer slot of v
is linked to the (j+1)th buffer slot of w , and vice versa.
154 / 329
Hops-so-far controller - Correctness
Theorem: The hops-so-far controller is deadlock-free.
Proof : Consider a reachable configuration γ.
Make forwarding and consumption transitions to a configuration δ
where no forwarding or consumption is possible.
Packets in a K th buffer slot are at their destination.
So in δ, K th buffer slots are all empty.
Suppose all i th buffer slots are empty in δ, for some 1 ≤ i ≤ K .
Then all (i−1)th buffer slots must also be empty in δ.
For else some packet in an (i−1)th buffer slot could be forwarded
or consumed.
Concluding, in δ all buffers are empty.
155 / 329
Acyclic orientation cover
Consider an undirected network.
An acyclic orientation is a directed, acyclic network obtained by
directing all edges.
Let P be a set of paths in the (undirected) network.
An acyclic orientation cover of P consists of acyclic orientations
G0 , . . . , Gn−1 such that each path in P is the concatenation of
paths P0 , . . . , Pn−1 in G0 , . . . , Gn−1 .
156 / 329
Acyclic orientation cover - Example
For each undirected ring there exists a cover, consisting of three
acyclic orientations, of the collection of minimum-hop paths.
For instance, in case of a ring of size six:
(figure: three acyclic orientations G0 , G1 , G2 of the six-node ring.)
157 / 329
Acyclic orientation cover controller
Let P be the set of paths in the network induced by the sink trees.
Let G0 , . . . , Gn−1 be an acyclic orientation cover of P.
In the acyclic orientation cover controller, nodes have n buffer slots,
numbered from 0 to n − 1.
I A generated packet is placed in the 0th buffer slot.
I Let vw be an edge in Gi .
The i th buffer slot of v is linked to the i th buffer slot of w .
Moreover, if i < n − 1, then the i th buffer slot of w is linked to
the (i+1)th buffer slot of v .
158 / 329
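A sketch of the resulting buffer links (illustrative; each orientation
is encoded as a list of directed edges (v, w)):

def cover_controller_links(orientations):
    n = len(orientations)
    links = []
    for i, edges in enumerate(orientations):
        for v, w in edges:                      # vw directed v -> w in G_i
            links.append(((v, i), (w, i)))      # slot i of v -> slot i of w
            if i < n - 1:                       # against the orientation:
                links.append(((w, i), (v, i + 1)))  # climb one slot
    return links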
Acyclic orientation cover controller - Intuition
Consider a packet; it is routed via the sink tree of its destination.
Its path is a concatenation of paths P0 , . . . , Pn−1 in G0 , . . . , Gn−1 .
While the packet is in an i th slot with i < n − 1, it can be forwarded.
If the packet ends up in the (n − 1)th buffer slot at a node,
then it is being routed via the last part Pn−1 of the path.
In that case the packet can be routed to its destination via
(n − 1)th buffer slots.
159 / 329
Acyclic orientation cover controller - Example
For each undirected ring there exists a deadlock-free controller that:
I uses three buffer slots per node, and
I allows packets to travel via minimum-hop paths.
160 / 329
Acyclic orientation cover controller - Correctness
Theorem: Let all packets be routed via paths in P.
Then the acyclic orientation cover controller is deadlock-free.
Proof : Consider a reachable configuration γ.
Make forwarding and consumption transitions to a configuration δ
where no forwarding or consumption is possible.
Since Gn−1 is acyclic, packets in an (n−1)th buffer slot can travel
to their destination. So in δ, all (n−1)th buffer slots are empty.
Suppose all i th buffer slots are empty in δ, for some i = 1, . . . , n−1.
Then all (i−1)th buffer slots must also be empty in δ.
For else, since Gi−1 is acyclic, some packet in an (i−1)th buffer slot
could be forwarded or consumed.
Concluding, in δ all buffers are empty.
161 / 329
Question
Consider an acyclic orientation cover for the minimum-hop paths
in a ring of four nodes.
Show how the resulting acyclic orientation cover controller links
buffer slots.
162 / 329
Slow-start algorithm in TCP
Back to the asynchronous, pragmatic world of the Internet.
To avoid congestion, in TCP, nodes maintain a congestion window
for each of their edges.
It is the maximum number of unacknowledged packets on this edge.
163 / 329
Congestion window
The congestion window grows linearly with each received ack,
up to some threshold.
Question: Explain why the congestion window may double with
every “round trip time”.
The congestion window is reset to the initial size (in TCP Tahoe)
or halved (in TCP Reno) with each lost data packet.
164 / 329
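A toy sketch of the two loss reactions (heavily simplified and
illustrative; real TCP also keeps a slow-start threshold and timers):

def tahoe_step(window, acked, lost, initial=1):
    return initial if lost else window + acked      # Tahoe: reset on loss

def reno_step(window, acked, lost):
    return max(1, window // 2) if lost else window + acked  # Reno: halve

# One ack per outstanding packet per round trip time: the window doubles.
w = 1
for _ in range(4):
    w = tahoe_step(w, acked=w, lost=False)
print(w)   # 16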
Take-home messages of the current lecture
Frederickson’s algorithm to compute a breadth-first search tree
iterative deepening (a la Frederickson’s alg.)
optimization of a parameter (ℓ) based on a complexity analysis
importance of deadlock-free packet switching
acyclic orientation cover controller
congestion window in TCP
165 / 329
Election algorithms
Often a leader process is needed to coordinate a distributed task.
In an election algorithm, each computation should terminate in
a configuration where one process is the leader.
Assumptions:
I All processes have the same local algorithm.
I The algorithm is decentralized:
The initiators can be any non-empty set of processes.
I Process id’s are unique, and from a totally ordered set.
166 / 329
Chang-Roberts algorithm
Consider a directed ring.
Initially only initiators are active, and send a message with their id.
Let an active process p receive a message carrying id q:
I If q < p, then p dismisses the message.
I If q > p, then p becomes passive, and passes on the message.
I If q = p, then p becomes the leader.
Passive processes (including all noninitiators) pass on messages.
Worst-case message complexity: O(N 2 )
Average-case message complexity: O(N log N)
167 / 329
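A round-trip simulation of the algorithm, for the case that every process
initiates (illustrative; channels are modeled by a single queue of
(destination, id) pairs):

def chang_roberts(ids):
    N = len(ids)
    active = [True] * N
    msgs = [((i + 1) % N, ids[i]) for i in range(N)]  # (destination, id)
    while msgs:
        dest, q = msgs.pop(0)
        p = ids[dest]
        if not active[dest]:              # passive processes pass messages on
            msgs.append(((dest + 1) % N, q))
        elif q > p:                       # larger id: become passive, pass on
            active[dest] = False
            msgs.append(((dest + 1) % N, q))
        elif q == p:                      # own id made a round trip: leader
            return p
        # q < p: dismiss the message

assert chang_roberts([3, 1, 4, 2, 5]) == 5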
Chang-Roberts algorithm - Example
All processes are initiators.
(figure: a directed ring with ids 0, 1, . . . , N−1 placed in clockwise order)
anti-clockwise orientation: N(N+1)/2 messages
clockwise orientation: 2N−1 messages
168 / 329
Franklin’s algorithm
Consider an undirected ring.
Each active process p repeatedly compares its own id with
the id’s of its nearest active neighbors on both sides.
If such a neighbor has a larger id, then p becomes passive.
Initially, initiators are active, and noninitiators are passive.
Each round, an active process p:
I sends its id to its neighbors on either side, and
I receives id’s q and r :
- if max{q, r } < p, then p starts another round
- if max{q, r } > p, then p becomes passive
- if max{q, r } = p, then p becomes the leader
Passive processes pass on incoming messages.
169 / 329
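A round-based sketch of Franklin’s algorithm (illustrative; the exchange
of id’s within a round is collapsed into direct lookups of the nearest
active neighbors):

def franklin(ids):
    active = sorted(range(len(ids)))        # all processes initiate
    while True:
        nxt = []
        n = len(active)
        for k, pos in enumerate(active):
            p = ids[pos]
            q = ids[active[(k - 1) % n]]    # nearest active neighbors
            r = ids[active[(k + 1) % n]]
            if max(q, r) < p:
                nxt.append(pos)             # p starts another round
            elif max(q, r) == p:
                return p                    # p's own id came back: leader
            # max(q, r) > p: p becomes passive
        active = nxt

assert franklin([3, 1, 4, 2, 5]) == 5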
Franklin’s algorithm - Complexity
Worst-case message complexity: O(N log N)
In each round, at least half of the active processes become passive.
So there are at most ⌊log2 N⌋ + 1 rounds.
Each round takes 2N messages.
Question: Give an example with N = 4 that takes three rounds.
Question: Show that for any N there is a ring that takes two rounds.
170 / 329
Franklin’s algorithm - Example
(figure: a ring with processes 0, 1, . . . , N−1 placed in order)
After 1 round, only node N−1 is active.
After 2 rounds, node N−1 is the leader.
Suppose this ring is directed with a clockwise orientation.
If a process would only compare its id with the one of its predecessor,
then it would take N rounds to complete.
171 / 329
Dolev-Klawe-Rodeh algorithm
Consider a directed ring.
The comparison of id’s of an active process p and
its nearest active neighbors q and r is performed at r .
(figure: ring segment s → q → p → r → t, with q, p, r active)
- If max{q, r } < p, then r changes its id to p, and sends out p.
- If max{q, r } > p, then r becomes passive.
- If max{q, r } = p, then r announces this id to all processes.
The process that originally had the id p becomes the leader.
Worst-case message complexity: O(N log N)
172 / 329
Dolev-Klawe-Rodeh algorithm - Example
Consider the following clockwise oriented ring.
(figure: a clockwise oriented ring with ids 0–5; active ids are
repeatedly compared and passed on until id 5 makes a round trip, and
the process that originally had id 5 becomes the leader.)
173 / 329
Tree election algorithm for acyclic networks
Question: How can the tree algorithm be used to make the process
with the largest id in an undirected, acyclic network the leader ?
(Be careful that a leaf may be a noninitiator.)
Start with a wake-up phase, driven by the initiators.
I Initially, initiators send a wake-up message to all neighbors.
I When a noninitiator receives a first wake-up message,
it sends a wake-up message to all neighbors.
I A process wakes up when it has received wake-up messages
from all neighbors.
174 / 329
Tree election algorithm
The local algorithm at an awake process p:
I p waits until it has received id’s from all neighbors except one,
which becomes its parent.
I p computes the largest id maxp among the received id’s
and its own id.
I p sends a parent request to its parent, tagged with maxp .
I If p receives a parent request from its parent, tagged with q,
it computes max′p , being the maximum of maxp and q.
I Next p sends an information message to all neighbors except
its parent, tagged with max′p .
I This information message is forwarded through the network.
I The process with id max′p becomes the leader.
Message complexity: 2N − 2 messages
175 / 329
Question
In case a process p receives a parent request from its parent,
why does it need to recompute maxp ?
176 / 329
Tree election algorithm - Example
The wake-up phase is omitted.
(figure: a run on a tree of six processes with ids 1–6, in several
snapshots; parent requests propagate the maximum id seen so far toward
the core, after which information messages tagged with 6 are flooded
and the process with id 6 becomes the leader.)
177 / 329
Echo algorithm with extinction
Each initiator starts a wave, tagged with its id.
Non-initiators join the first wave that hits them.
At any time, each process takes part in at most one wave.
Suppose a process p in wave q is hit by a wave r :
I if q < r , then p changes to wave r
(it abandons all earlier messages);
I if q > r , then p continues with wave q
(it dismisses the incoming message);
I if q = r , then the incoming message is treated according to
the echo algorithm of wave q.
If wave p executes a decide event (at p), p becomes the leader.
Worst-case message complexity: O(N·E )
178 / 329
Minimum spanning trees
Consider an undirected, weighted network.
We assume that different edges have different weights.
(Or weighted edges can be totally ordered by also taking into account
the id’s of endpoints of an edge, and using a lexicographical order.)
In a minimum spanning tree, the sum of the weights of the edges
in the spanning tree is minimal.
Example: (figure: a weighted network with edge weights 1, 3, 8, 9, 10,
together with its minimum spanning tree.)
179 / 329
Fragments
Lemma: Let F be a fragment
(i.e., a connected subgraph of the minimum spanning tree M).
Let e be the lowest-weight outgoing edge of F
(i.e., e has exactly one endpoint in F ).
Then e is in M.
Proof : Suppose not.
Then M ∪ {e} has a cycle,
containing e and another outgoing edge f of F .
Replacing f by e in M gives a spanning tree with a smaller sum of
weights of edges, contradicting that M is a minimum spanning tree.
180 / 329
Kruskal’s algorithm
A uniprocessor algorithm for computing minimum spanning trees.
I Initially, each node forms a separate fragment.
I In each step, a lowest-weight outgoing edge of a fragment
is added to the spanning tree, joining two fragments.
This algorithm also works when edges have the same weight.
Then the minimum spanning tree may not be unique.
181 / 329
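A standard uniprocessor sketch of Kruskal’s algorithm with union-find
(illustrative; not from the slides):

def kruskal(nodes, edges):            # edges: list of (weight, u, v)
    parent = {v: v for v in nodes}    # each node is its own fragment
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v
    tree = []
    for w, u, v in sorted(edges):     # lowest-weight edges first
        ru, rv = find(u), find(v)
        if ru != rv:                  # outgoing edge: joins two fragments
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

print(kruskal({1, 2, 3}, [(5, 1, 2), (3, 2, 3), (9, 1, 3)]))
# [(3, 2, 3), (5, 1, 2)]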
Gallager-Humblet-Spira algorithm
Consider an undirected, weighted network,
in which different edges have different weights.
Distributed computation of a minimum spanning tree:
I Initially, each process is a fragment.
I The processes in a fragment F together search for
the lowest-weight outgoing edge eF .
I When eF has been found, the fragment at the other end
is asked to collaborate in a merge.
Complications: Is an edge outgoing ? Is it lowest-weight ?
182 / 329
Level, name and core edge
Each fragment carries a (unique) name fn : R and a level ℓ : N.
Its level is the maximum number of joins any process in the fragment
has experienced.
Neighboring fragments F = (fn, ℓ) and F′ = (fn′, ℓ′) can be joined
as follows:
I ℓ < ℓ′, and eF leads from F to F′ : F ∪ F′ = (fn′, ℓ′)
I ℓ = ℓ′ and eF = eF′ : F ∪ F′ = (weight eF , ℓ + 1)
The core edge of a fragment is the last edge that connected two
sub-fragments at the same level. Its end points are the core nodes.
183 / 329
Parameters of a process
Its state :
I sleep (for noninitiators)
I find (looking for a lowest-weight outgoing edge)
I found (reported a lowest-weight outgoing edge to the core edge)
The status of its channels :
I basic edge (undecided)
I branch edge (in the spanning tree)
I rejected (not in the spanning tree)
The name and level of its fragment.
Its parent (toward the core edge).
184 / 329
Initialization
Non-initiators wake up when they receive a (connect or test) message.
Each initiator, and noninitiator after it has woken up:
I sets its level to 0
I sets its lowest-weight edge to branch
I sends ⟨connect, 0⟩ into this channel
I sets its other channels to basic
I sets its state to found
185 / 329
Joining two fragments
Let fragments F = (fn, ℓ) and F′ = (fn′, ℓ′) be joined via channel pq.
I If ℓ < ℓ′, then p sent ⟨connect, ℓ⟩ to q.
q sends ⟨initiate, fn′, ℓ′, find/found⟩ to p.
F ∪ F′ inherits the core edge of F′.
I If ℓ = ℓ′, then p and q sent ⟨connect, ℓ⟩ to each other.
They send ⟨initiate, weight pq, ℓ + 1, find⟩ to each other.
F ∪ F′ gets core edge pq.
At reception of ⟨initiate, fn, ℓ, find/found⟩, a process stores fn and ℓ,
sets its state to find or found, and adopts the sender as its parent.
It passes on the message through its other branch edges.
186 / 329
Computing the lowest-weight outgoing edge
In case of ⟨initiate, fn, ℓ, find⟩, p checks in increasing order of weight
if one of its basic edges pq is outgoing, by sending ⟨test, fn, ℓ⟩ to q.
While ℓ > level q , q postpones processing the incoming test message.
Let ℓ ≤ level q .
I If q is in fragment fn, then q replies reject.
In this case p and q will set pq to rejected.
I Else, q replies accept.
When a basic edge is accepted, or there are no basic edges left,
p stops the search.
187 / 329
Questions
Why does q postpone processing the incoming ⟨test, _, ℓ⟩ message
from p while ℓ > level q ?
Answer: p and q might be in the same fragment, in which case
⟨initiate, fn, ℓ, find⟩ is on its way to q.
Why does this postponement not lead to a deadlock ?
Answer: There is always a fragment with a smallest level.
188 / 329
Reporting to the core nodes
I p waits for all branch edges, except its parent, to report.
I p sets its state to found.
I p computes the minimum λ of (1) these reports, and
(2) the weight of its lowest-weight outgoing basic edge
(or ∞, if no such channel was found).
I If λ < ∞, p stores either the branch edge that sent λ,
or its basic edge of weight λ.
I p sends ⟨report, λ⟩ to its parent.
189 / 329
Termination or changeroot at the core nodes
A core node receives reports through all its branch edges, including
the core edge.
I If the minimum reported value µ = ∞, the core nodes terminate.
I If µ < ∞, the core node that received µ first sends changeroot
toward the lowest-weight outgoing basic edge.
(The core edge becomes a regular tree edge.)
Ultimately changeroot reaches the process p that reported
the lowest-weight outgoing basic edge.
p sets this channel to branch, and sends ⟨connect, level p ⟩ into it.
190 / 329
Starting the join of two fragments
When q receives ⟨connect, level p ⟩ from p, level q ≥ level p .
Namely, either level p = 0, or q earlier sent accept to p.
I If level q > level p , then q sets qp to branch
and sends ⟨initiate, name q , level q , find/found⟩ to p.
I As long as level q = level p and qp isn’t a branch edge,
q postpones processing the connect message.
I If level q = level p and qp is a branch edge
(meaning that q sent ⟨connect, level q ⟩ to p), then q sends
⟨initiate, weight qp, level q + 1, find⟩ to p (and vice versa).
In this case pq becomes the core edge.
191 / 329
Questions
If level q = level p and qp isn’t a branch edge, why does q postpone
processing the incoming connect message from p ?
Answer: The fragment of q might be in the process of joining
another fragment at a level ≥ level q .
Then the fragment of p should subsume the name and level of that
joint fragment, instead of joining q’s fragment at an equal level.
Why does this postponement not give rise to a deadlock ?
(I.e., why can’t there be a cycle of fragments waiting for a reply
to a postponed connect message ?)
Answer: Because different channels have different weights.
192 / 329
Question
Suppose a process reported a lowest-weight outgoing basic edge,
and in return receives ⟨initiate, fn, ℓ, find⟩.
Why must it test again whether this basic edge is outgoing ?
Answer: Its fragment may in the meantime have joined
the fragment at the other side of this basic edge.
193 / 329
Gallager-Humblet-Spira algorithm - Example
(figure and message trace: a weighted network on the nodes p, q, r , s, t
with edge weights pq = 5, qt = 15, qr = 7, ps = 9, pr = 11, rs = 3.
First p and q exchange ⟨connect, 0⟩ over pq and join into a level-1
fragment named 5, and r and s do the same over rs, forming a level-1
fragment named 3; t joins the fragment of p and q by sending
⟨connect, 0⟩ over qt, and reports ∞. Both level-1 fragments then find
qr as their lowest-weight outgoing edge (reports 7 and 9), exchange
⟨connect, 1⟩ over it, and join into a level-2 fragment named 7 with
core edge qr. The remaining tests over ps and pr are rejected, all
processes report ∞, and the algorithm terminates.)
194 / 329
Gallager-Humblet-Spira algorithm - Complexity
Worst-case message complexity: O(E + N log N)
I A rejected channel requires a test-reject or test-test pair.
Between two subsequent joins, a process:
I receives one initiate
I sends at most one test that triggers an accept
I sends one report
I sends at most one changeroot or connect
A fragment at level ℓ contains ≥ 2^ℓ processes.
So each process experiences at most ⌊log2 N⌋ joins.
195 / 329
Back to election
By two extra messages at the very end,
the core node with the largest id becomes the leader.
So Gallager-Humblet-Spira induces an election algorithm for
general undirected networks.
(We must impose an order on channels of equal weight.)
Lower bounds for the average-case message complexity
of election algorithms based on comparison of id’s:
Rings: Ω(N log N)
General networks: Ω(E + N log N)
196 / 329
Lecture in a nutshell
leader election
decentralized, uniform local algorithm, unique process id’s
Chang-Roberts and Dolev-Klawe-Rodeh algorithm on directed rings
tree election algorithm
echo algorithm with extinction
Gallager-Humblet-Spira minimum spanning tree algorithm
197 / 329
Election in anonymous networks
In an anonymous network, processes (and channels) have no unique id.
Processes may be anonymous for several reasons:
I Transmitting/storing id’s is too expensive (IEEE 1394 bus).
I Processes don’t want to reveal their id (security protocols).
I Absence of unique hardware id’s (LEGO Mindstorms).
Question: Suppose there is one leader.
How can each process be provided with a unique id ?
198 / 329
Impossibility of election in anonymous rings
Theorem: There is no election algorithm for anonymous rings
that always terminates.
Proof : Consider an anonymous ring of size N.
In a symmetric configuration, all processes are in the same state
and all channels carry the same messages.
I There is a symmetric initial configuration.
I If γ0 is symmetric and γ0 → γ1 , then there are transitions
γ1 → γ2 → · · · → γN with γN symmetric.
In a symmetric configuration there isn’t one leader.
So there is an infinite computation in which no leader is elected.
199 / 329
Fairness
An execution is fair if each event that is applicable in infinitely many
configurations, occurs infinitely often in the computation.
Each election algorithm for anonymous rings has a fair infinite execution.
(Basically because in the proof, γ0 → γ1 can be chosen freely.)
200 / 329
Probabilistic algorithms
In a probabilistic algorithm, a process may flip a coin, and
perform an event based on the outcome of this coin flip.
Probabilistic algorithms where all computations terminate in
a correct configuration aren’t interesting.
Because fixing the coin to e.g. always come up heads would yield
a correct non-probabilistic algorithm.
201 / 329
Las Vegas and Monte Carlo algorithms
A probabilistic algorithm is Las Vegas if:
I the probability that it terminates is greater than zero, and
I all terminal configurations are correct.
It is Monte Carlo if:
I it always terminates, and
I the probability that a terminal configuration is correct is
greater than zero.
202 / 329
Questions
Even if the probability that a Las Vegas algorithm terminates is 1,
this doesn’t always imply termination. Why is that ?
Assume a Monte Carlo algorithm, and a (deterministic) algorithm
to check whether a run of the Monte Carlo algorithm terminated
correctly.
Give a Las Vegas algorithm that terminates with probability 1.
203 / 329
Itai-Rodeh election algorithm
Given an anonymous, directed ring; all processes know the ring size N.
We adapt the Chang-Roberts algorithm: each initiator sends out an id,
and the largest id is the only one making a round trip.
Each initiator selects a random id from {1, . . . , N}.
Complication: Different processes may select the same id.
Solution: Each message is supplied with a hop count.
A message that arrives at its source has hop count N.
If several processes select the same largest id, then they start
a new election round, with a higher round number.
204 / 329
Itai-Rodeh election algorithm
Initially, initiators are active in round 0, and noninitiators are passive.
Let p be active. At the start of election round n, p randomly selects id p ,
sends (n, id p , 1, false), and waits for a message (n′, i, h, b).
The 3rd value is the hop count. The 4th value signals if another process
with the same id was encountered during the round trip.
I p gets (n′, i, h, b) with n′ > n, or n′ = n and i > id p :
it becomes passive and sends (n′, i, h + 1, b).
I p gets (n′, i, h, b) with n′ < n, or n′ = n and i < id p :
it dismisses the message.
I p gets (n, id p , h, b) with h < N: it sends (n, id p , h + 1, true).
I p gets (n, id p , N, true): it proceeds to round n + 1.
I p gets (n, id p , N, false): it becomes the leader.
Passive processes pass on messages, increasing their hop count by one.
205 / 329
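A round-level sketch of the algorithm (illustrative; hop counts and the
bit are collapsed: after a full round trip, exactly the processes holding
the round’s maximum id stay active, and they see bit true precisely if
the maximum was selected more than once):

import random

def itai_rodeh_election(N, rng=random):
    active = list(range(N))             # ring positions of active processes
    while True:
        ids = {p: rng.randint(1, N) for p in active}
        m = max(ids.values())
        winners = [p for p in active if ids[p] == m]
        if len(winners) == 1:
            return winners[0]           # (n, id, N, false): become leader
        active = winners                # (n, id, N, true): next round

random.seed(3)
print(itai_rodeh_election(5))   # ring position of the elected leader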
Itai-Rodeh election algorithm - Correctness
Question: How can an infinite computation occur ?
Correctness: The Itai-Rodeh election algorithm is Las Vegas.
Eventually one leader is elected, with probability 1.
Without rounds, the algorithm would be flawed.
(figure: two processes select the same id i, while the others select
k and `, with i > j and j > k, `; after detecting the tie both
re-select the id j, and without round numbers a message ⟨j, 1, false⟩
still in transit can then wrongly let a process become the leader.)
Average-case message complexity: O(N log N)
206 / 329
Election in arbitrary anonymous networks
The echo algorithm with extinction, with random selection of id’s,
can be used for election in anonymous undirected networks in which
all processes know the network size.
Initially, initiators are active in round 0, and noninitiators are passive.
Each active process selects a random id, and starts a wave,
tagged with its id and round number 0.
Let process p in wave i of round n be hit by wave j of round n′:
I If n′ > n, or n′ = n and j > i, then p adopts wave j of round n′,
and treats the message according to the echo algorithm.
I If n′ < n, or n′ = n and j < i, then p dismisses the message.
I If n0 = n and j = i, then p treats the message according to
the echo algorithm.
207 / 329
Election in arbitrary anonymous networks
Each message sent upwards in the constructed tree reports the size
of its subtree.
All other messages report 0.
When a process decides, it computes the size of the constructed tree.
If the constructed tree covers the network, it becomes the leader.
Else, it selects a new id, and initiates a new wave, in the next round.
208 / 329
Election in arbitrary anonymous networks - Example
i > j > k > ` > m. Only waves that complete are shown.
(figure: two rounds on a six-node network; nodes are labeled with their
(round, id) pair and messages with ⟨round, id, subtree size⟩. In round 0
the wave with id i completes, but its tree doesn’t cover the network;
in round 1 the wave with id ` completes.)
The process at the left computes size 6, and becomes the leader.
209 / 329
Question
Is there another scenario in which the right-hand side node
progresses to round 2 ?
210 / 329
Computing the size of a network
Theorem: There is no Las Vegas algorithm to compute the size of
an anonymous ring.
This implies that there is no Las Vegas algorithm for election in
an anonymous ring if processes don’t know the ring size.
Because when a leader is known, the network size can be computed
using a centralized wave algorithm with the leader as initiator.
211 / 329
Impossibility of computing anonymous network size
Theorem: There is no Las Vegas algorithm to compute the size of
an anonymous ring.
Proof : Consider an anonymous, directed ring p0 , . . . , pN−1 .
Suppose a computation C of a (probabilistic) ring size algorithm
terminates with the correct outcome N.
Consider the ring p0 , . . . , p2N−1 .
Let each event at a pi in C be executed concurrently at pi and pi+N .
This computation terminates with the incorrect outcome N.
212 / 329
Itai-Rodeh ring size algorithm
Each process p maintains an estimate est p of the ring size.
Initially est p = 2. (Always est p ≤ N.)
p initiates an estimate round (1) at the start of the algorithm, and
(2) at each update of est p .
Each round, p selects a random id p in {1, . . . , R}, sends (est p , id p , 1),
and waits for a message (est, id, h). (Always h ≤ est.)
I est < est p . Then p dismisses the message.
I est > est p .
- If h < est, then p sends (est, id, h + 1), and est p ← est.
- If h = est, then est p ← est + 1.
I est = est p .
- If h < est, then p sends (est, id, h + 1).
- If h = est and id ≠ id p , then est p ← est + 1.
- If h = est and id = id p , then p dismisses the message
(possibly its own message returned).
213 / 329
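An event-driven simulation under a literal reading of the rules above
(illustrative; channels are FIFO here, and a random nonempty channel is
served each step):

import random
from collections import deque

def ring_size(N, R, rng=random):
    est = [2] * N
    my_id = [0] * N
    chan = [deque() for _ in range(N)]          # chan[p]: p's inbox

    def start_round(p):                         # new id, new message
        my_id[p] = rng.randint(1, R)
        chan[(p + 1) % N].append((est[p], my_id[p], 1))

    for p in range(N):
        start_round(p)
    while any(chan):
        p = rng.choice([q for q in range(N) if chan[q]])
        e, i, h = chan[p].popleft()
        if e < est[p]:
            continue                            # dismiss
        if e > est[p]:
            est[p] = e if h < e else e + 1      # update the estimate
            if h < e:
                chan[(p + 1) % N].append((e, i, h + 1))
            start_round(p)                      # a round at each update
        else:                                   # e == est[p]
            if h < e:
                chan[(p + 1) % N].append((e, i, h + 1))
            elif i != my_id[p]:
                est[p] = e + 1
                start_round(p)
            # else: possibly p's own message returned; dismiss
    return est

random.seed(0)
print(ring_size(4, R=4))   # estimates, each at most the true ring size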
Itai-Rodeh ring size algorithm - Correctness
When the algorithm terminates, est p ≤ N for all p.
The Itai-Rodeh ring size algorithm is a Monte Carlo algorithm.
Possibly, in the end est p < N.
Example: on a ring of four processes, let opposite processes select
the same pair: (i, 2) and (i, 2) on one diagonal, (j, 2) and (j, 2) on
the other. Every message with hop count 2 then arrives at a process
with the same id, so all processes wrongly conclude that the ring size is 2.
214 / 329
Itai-Rodeh ring size algorithm - Example
(figure: a run on a ring of four processes; the processes’ (id, est)
pairs are shown as the estimates grow from 2 via 3 to the correct
ring size 4.)
215 / 329
Itai-Rodeh ring size algorithm - Termination
Question: Upon message-termination, is est p always the same
at all p ?
There is no Las Vegas algorithm for general termination detection
in anonymous rings.
216 / 329
Itai-Rodeh ring size algorithm - Complexity
The probability of computing an incorrect ring size tends to zero
when R tends to infinity.
Worst-case message complexity: O(N³)
The N processes start at most N − 1 estimate rounds.
Each round they send a message, which takes at most N steps.
217 / 329
Question
Give an (always correctly terminating) algorithm for computing
the network size of anonymous, acyclic networks.
Answer: Use the tree algorithm, whereby each process reports
the size of its subtree to its parent.
218 / 329
IEEE 1394 election algorithm
The IEEE 1394 standard is a serial multimedia bus.
It connects digital devices,
which can be added/removed dynamically.
Transmitting/storing id’s is too expensive,
so the network is anonymous.
The network size is unknown to the processes.
The tree algorithm for undirected, acyclic networks is used.
Networks that contain a cycle give a time-out.
219 / 329
IEEE 1394 election algorithm
When a process has one possible parent, it sends a parent request
to this neighbor. If the request is accepted, an ack is sent back.
The last two parentless processes can send parent requests
to each other simultaneously. This is called root contention.
Each of the two processes in root contention randomly decides
to either immediately send a parent request again, or
to wait some time for a parent request from the other process.
Question: Is it optimal for performance to give probability 0.5
to both sending immediately and waiting for some time ?
220 / 329
Lecture in a nutshell
anonymous network
impossibility of election in anonymous networks
Las Vegas / Monte Carlo algorithms
Itai-Rodeh election algorithm for directed rings (Las Vegas)
echo election algorithm for anonymous networks (Las Vegas)
no Las Vegas algorithm for computing anonymous network size
Itai-Rodeh ring size algorithm (Monte Carlo)
IEEE 1394 election algorithm
221 / 329
Fault tolerance
A process may (1) crash, i.e., execute no further events, or even
(2) be Byzantine, meaning that it can perform arbitrary events.
Assumption: The network is complete, i.e., there is
an undirected channel between each pair of different processes.
So failing processes never make the remaining network disconnected.
Assumption: Crashing of processes can’t be observed.
222 / 329
Consensus
Binary consensus: Initially, all processes randomly select 0 or 1.
Eventually, all correct processes must uniformly decide 0 or 1.
Consensus underlies many important problems in distributed computing:
termination detection, mutual exclusion, leader election, ...
223 / 329
Consensus - Assumptions
k-crash consensus: At most k processes may crash.
Validity: If all processes randomly select the same initial value b,
then all correct processes decide b.
This excludes trivial solutions where e.g. processes always decide 0.
By validity, each k-crash consensus algorithm with k ≥ 1 has
a bivalent initial configuration that can reach terminal configurations
with a decision 0 as well as with a decision 1.
224 / 329
Impossibility of 1-crash consensus
Theorem: No algorithm for 1-crash consensus always terminates.
Idea: A decision is determined by an event e at a process p.
Since p may crash, after e the other processes must be able to decide
without input from p.
225 / 329
b-potent set of processes
A set S of processes is called b-potent, in a configuration, if by only
executing events at processes in S, some process in S can decide b.
Question: Consider any k-crash consensus algorithm.
Why should each set of N − k processes be b-potent for some b ?
226 / 329
Impossibility of 1-crash consensus
Theorem: No algorithm for 1-crash consensus always terminates.
Proof : Consider a 1-crash consensus algorithm.
Let γ be a bivalent configuration: γ → γ0 and γ → γ1 , where
γ0 can lead to decision 0 and γ1 to decision 1.
I Let the transitions correspond to events at different processes.
Then γ0 → δ ← γ1 for some δ. So γ0 or γ1 is bivalent.
I Let the transitions correspond to events at one process p.
In γ, p can crash, so the other processes are b-potent for some b.
Likewise for γ0 and γ1 . It follows that γ1−b is bivalent.
So each bivalent configuration has a transition to a bivalent configuration.
Hence each bivalent initial configuration yields an infinite computation.
There exist fair infinite computations.
227 / 329
Impossibility of 1-crash consensus - Example
Let N = 4. At most one process can crash.
There are voting rounds, in which each process broadcasts its value.
Since one process may crash, in a round, processes can only wait
for three votes.
(figure: four processes, two with value 1 on the left and two with
value 0 on the right.)
The left (resp. right) processes might in every round receive
two 1-votes and one 0-vote (resp. two 0-votes and one 1-vote).
(Admittedly, this scheduling of messages is unfair.)
228 / 329
Impossibility of ⌈N/2⌉-crash consensus
Theorem: Let k ≥ N/2. There is no Las Vegas algorithm for
k-crash consensus.
Proof : Suppose, toward a contradiction, there is such an algorithm.
Divide the set of processes in S and T , with |S| = ⌊N/2⌋ and |T | = ⌈N/2⌉.
Suppose all processes in S select 0 and all processes in T select 1.
Suppose that messages between processes in S and in T are very slow.
Since k ≥ N/2, at some point the processes in S must assume the processes
in T all crashed, and decide 0.
Likewise, at some point the processes in T must assume the processes
in S all crashed, and decide 1.
229 / 329
Question
Give a Monte Carlo algorithm for k-crash consensus for any k.
Answer: Let any process decide for its initial (random) value.
With a (very small) positive probability all correct processes decide
for the same value.
230 / 329
Bracha-Toueg crash consensus algorithm
Let k < N/2. Initially, each correct process randomly selects 0 or 1,
with weight 1. In round n, at each correct, undecided p:
I p sends ⟨n, value p , weight p ⟩ to all processes (including itself).
I p waits until N − k messages ⟨n, b, w⟩ have arrived.
(p dismisses/stores messages from earlier/future rounds.)
If w > N/2 for an ⟨n, b, w⟩, then value p ← b. (This b is unique.)
Else, value p ← 0 if most messages voted 0, value p ← 1 otherwise.
weight p ← the number of incoming votes for value p in round n.
I If w > N/2 for > k incoming messages ⟨n, b, w⟩, then p decides b.
(Note that k < N − k.)
If p decides b, it broadcasts ⟨n + 1, b, N − k⟩ and ⟨n + 2, b, N − k⟩,
and terminates.
231 / 329
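A sketch of one round at a single correct, undecided process
(illustrative; the N − k received (value, weight) pairs are passed in
as a list):

def bracha_toueg_round(N, k, votes):
    heavy = [(b, w) for b, w in votes if w > N / 2]
    if heavy:
        value = heavy[0][0]                 # such a b is unique
    else:
        zeros = sum(1 for b, _ in votes if b == 0)
        value = 0 if zeros > len(votes) - zeros else 1
    weight = sum(1 for b, _ in votes if b == value)
    decided = len(heavy) > k                # > k heavy b-votes: decide b
    return value, weight, decided

# N = 3, k = 1: two 0-votes of weight 2 let a process decide 0.
print(bracha_toueg_round(3, 1, [(0, 2), (0, 2)]))   # (0, 2, True)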
Bracha-Toueg crash consensus algorithm - Example
N = 3 and k = 1. Each round a correct process requires
two incoming messages, and two b-votes with weight 2 to decide b.
(figure: a run in six snapshots. Initially two processes have value 0
and one has value 1, all with weight 1. After the round-0 votes
⟨0, 0, 1⟩ and ⟨0, 1, 1⟩, all correct processes carry value 0; the votes
⟨1, 0, 2⟩ then let a process decide 0, another process crashes, and
the votes ⟨2, 0, 1⟩ and ⟨3, 0, 2⟩ make the last process decide 0 too.)
(Messages of a process to itself aren’t depicted.)
232 / 329
Bracha-Toueg crash consensus algorithm - Correctness
Theorem: Let k < N/2. The Bracha-Toueg k-crash consensus algorithm
is a Las Vegas algorithm that terminates with probability 1.
Proof (part I): Suppose a process decides b in round n.
Then in round n, value q = b and weight q > N/2 for > k processes q.
So in round n, each correct process receives an ⟨n, b, w⟩ with w > N/2.
So in round n + 1, all correct processes vote b.
So in round n + 2, all correct processes vote b with weight N − k.
Hence, after round n + 2, all correct processes have decided b.
Concluding, all correct processes decide for the same value.
233 / 329
Bracha-Toueg crash consensus algorithm - Correctness
Proof (part II): Assumption: Scheduling of messages is fair.
Due to fair scheduling, there is a chance ρ > 0 that in a round n
all processes receive the first N − k messages from the same processes.
After round n, all correct processes have the same value b.
After round n + 1, all correct processes have value b with weight N − k.
After round n + 2, all correct processes have decided b.
Concluding, the algorithm terminates with probability 1.
234 / 329
Impossibility of ⌈N/3⌉-Byzantine consensus
Theorem: Let k ≥ N/3. There is no Las Vegas algorithm for
k-Byzantine consensus.
Proof : Suppose, toward a contradiction, there is such an algorithm.
Since k ≥ N/3, we can choose sets S and T of processes with
|S| = |T | = N − k and |S ∩ T | ≤ k.
Suppose all processes in S select 0 and all processes in T select 1.
Suppose that messages between processes in S and in T are very slow.
Suppose all processes that aren’t in S ∪ T are Byzantine.
The processes in S can then, with the help of the Byzantine processes,
decide 0.
Likewise the processes in T can decide 1.
235 / 329
Bracha-Toueg Byzantine consensus algorithm
Let k < N/3.
Again, in every round, each correct process:
I broadcasts its value,
I waits for N − k incoming messages, and
I changes its value to the majority of votes in the round.
(No weights are needed.)
A correct process decides b if it receives > (N + k)/2 b-votes in one round.
Then more than half of the correct processes voted b in this round.
(Note that (N + k)/2 < N − k.)
236 / 329
Echo mechanism
Complication: A Byzantine process may send different votes
to different processes.
Example: Let N = 4 and k = 1. Each round, a correct process
waits for three votes, and needs three b-votes to decide b.
(figure: a run without echos. The Byzantine process sends vote 1 to
one correct process and vote 0 to the others, so that one correct
process decides 1 while the other two decide 0.)
Solution: Each incoming vote is verified using an echo mechanism.
A vote is accepted after > (N + k)/2 confirming echos.
237 / 329
Bracha-Toueg Byzantine consensus algorithm
Initially, each correct process randomly selects 0 or 1.
In round n, at each correct, undecided p:
I p sends ⟨vote, n, value p ⟩ to all processes (including itself).
I If p receives ⟨vote, m, b⟩ from q, it sends ⟨echo, q, m, b⟩
to all processes (including itself).
I p counts incoming ⟨echo, q, n, b⟩ messages for each q, b.
When > (N + k)/2 such messages arrived, p accepts q’s b-vote.
I The round is completed when p has accepted N − k votes.
If most votes are for 0, then value p ← 0. Else, value p ← 1.
238 / 329
Bracha-Toueg Byzantine consensus algorithm
Processes dismiss/store messages from earlier/future rounds.
If multiple messages ⟨vote, m, _⟩ or ⟨echo, q, m, _⟩ arrive via
the same channel, only the first one is taken into account.
If > (N + k)/2 of the accepted votes are for b, then p decides b.
When p decides b, it broadcasts ⟨decide, b⟩ and terminates.
The other processes interpret ⟨decide, b⟩ as a b-vote by p,
and a b-echo by p for each q, for all rounds to come.
239 / 329
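A sketch of the vote-acceptance test (illustrative; echoes[q][b] counts
the received ⟨echo, q, n, b⟩ messages of the current round):

def accepted_votes(N, k, echoes):
    votes = {}
    for q, counts in echoes.items():
        for b, c in counts.items():
            if c > (N + k) / 2:          # enough confirming echoes
                votes[q] = b
    return votes

# N = 4, k = 1: three echoes are needed to accept a vote.
print(accepted_votes(4, 1, {"q": {0: 3}, "r": {1: 2}}))   # {'q': 0}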
Questions
If an undecided process receives ⟨decide, b⟩, why can it in general
not immediately decide b ?
Answer: The message may originate from a Byzantine process.
What happens if all k Byzantine processes keep silent ?
Answer: The N − k correct processes reach consensus in two rounds.
240 / 329
Bracha-Toueg Byzantine consensus alg. - Example
We study the previous example again, now with verification of votes.
N = 4 and k = 1, so each round a correct process needs:
I > (N + k)/2, i.e. three, confirmations to accept a vote;
I N − k, i.e. three, accepted votes to determine a value; and
I > (N + k)/2, i.e. three, accepted b-votes to decide b.
Only relevant vote messages are depicted (without their round number).
241 / 329
Bracha-Toueg Byzantine consensus alg. - Example
(figure: round zero. The Byzantine process sends vote 1 to the left
bottom process and vote 0 elsewhere; one correct process already
decides 0 after accepting three 0-votes.)
In round zero, the left bottom process doesn’t accept vote 1 by
the Byzantine process, since none of the other two correct processes
confirm this vote. So it waits for (and accepts) vote 0 by
the right bottom process, and thus doesn’t decide 1 in round zero.
(figure: round one. All correct processes vote 0, accept each other’s
0-votes, and decide 0; the ⟨decide, 0⟩ messages complete the run.)
242 / 329
Bracha-Toueg Byzantine consensus alg. - Correctness
Theorem: Let k < N/3. The Bracha-Toueg k-Byzantine consensus
algorithm is a Las Vegas algorithm that terminates with probability 1.
Proof : Each round, the correct processes eventually accept N − k votes,
since there are ≥ N − k correct processes. (Note that N − k > (N + k)/2.)
In round n, let correct processes p and q accept votes for b and b′,
respectively, from a process r .
Then they received > (N + k)/2 messages ⟨echo, r , n, b⟩ resp. ⟨echo, r , n, b′⟩.
> k processes, so at least one correct process, sent such messages to
both p and q.
So b = b′.
243 / 329
Bracha-Toueg Byzantine consensus alg. - Correctness
Suppose a correct process decides b in round n.
In this round it accepts > (N + k)/2 b-votes.
So in round n, correct processes accept > (N + k)/2 − k = (N − k)/2 b-votes.
Hence, after round n, value q = b for each correct q.
So correct processes will vote b in all rounds m > n.
Because they will accept ≥ N − 2k > (N − k)/2 b-votes.
244 / 329
Bracha-Toueg Byzantine consensus alg. - Correctness
Let S be a set of N − k correct processes.
Assuming fair scheduling, there is a chance ρ > 0 that in a round
each process in S accepts N − k votes from the processes in S.
With chance ρ² this happens in consecutive rounds n, n + 1.
After round n, all processes in S have the same value b.
After round n + 1, all processes in S have decided b.
245 / 329
Lecture in a nutshell
crashed / Byzantine processes
complete network / crashes can’t be observed
(binary) consensus
no algorithm for 1-crash consensus always terminates
if k ≥ N/2, there is no Las Vegas algorithm for k-crash consensus
Bracha-Toueg k-crash consensus algorithm for k < N/2
if k ≥ N/3, there is no Las Vegas algorithm for k-Byzantine consensus
Bracha-Toueg k-Byzantine consensus algorithm for k < N/3
246 / 329
Failure detection
A failure detector at a process keeps track which processes have
(or may have) crashed.
Given an upper bound on network latency, and heartbeat messages,
one can implement a failure detector.
With a failure detector, the proof of impossibility of 1-crash consensus
no longer applies.
For this setting, terminating crash consensus algorithms exist.
247 / 329
Failure detection
Aim: To detect crashed processes.
We assume a time domain, with a total order.
F (τ ) is the set of crashed processes at time τ .
τ1 ≤ τ2 ⇒ F (τ1 ) ⊆ F (τ2 ) (i.e., no restart)
Assumption: Processes can’t observe F (τ ).
H(p, τ ) is the set of processes that p suspects to be crashed at time τ .
Each computation is decorated with :
I a failure pattern F
I a failure detector history H
248 / 329
Complete failure detector
We require that failure detectors are complete :
From some time onward, each crashed process is suspected by
each correct process.
249 / 329
Strongly accurate failure detector
A failure detector is strongly accurate if only crashed processes are
ever suspected.
Assumptions:
I Each correct process broadcasts alive every ν time units.
I dmax is a known upper bound on network latency.
A process from which no message is received for ν + dmax time units
has crashed.
This failure detector is complete and strongly accurate.
250 / 329
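A minimal sketch of this failure detector (illustrative; last_heard[p]
is the local arrival time of the last message from process p):

def suspects(now, last_heard, nu, dmax):
    # Suspect p if nothing was heard from p for nu + dmax time units.
    return {p for p, t in last_heard.items() if now - t > nu + dmax}

print(suspects(now=10.0, last_heard={"p": 9.5, "q": 2.0}, nu=1.0, dmax=0.5))
# {'q'}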
Weakly accurate failure detector
A failure detector is weakly accurate if some (correct) process
is never suspected by any process.
Assume a complete and weakly accurate failure detector.
We give a rotating coordinator algorithm for (N − 1)-crash consensus.
251 / 329
Consensus with weakly accurate failure detection
Processes are numbered: p0 , . . . , pN−1 .
Initially, each process randomly selects 0 or 1. In round n :
I pn (if not crashed) broadcasts its value.
I Each process waits :
- either for an incoming message from pn , in which case it adopts
the value of pn ;
- or until it suspects that pn has crashed.
After round N − 1, each correct process decides for its value.
Correctness: Let pj never be suspected.
After round j, all correct processes have the same value b.
Hence, after round N − 1, all correct processes decide b.
252 / 329
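A round-based simulation of this algorithm under a clean crash model
(illustrative; here a crashed coordinator sends nothing and is suspected
by everyone, while in reality it might reach some processes before
crashing):

def rotating_coordinator(values, crashed):
    v = list(values)
    for n in range(len(v)):           # round n, coordinated by p_n
        if n in crashed:
            continue                  # everyone suspects p_n; round passes
        for q in range(len(v)):
            if q not in crashed:
                v[q] = v[n]           # adopt the coordinator's value
    return {q: v[q] for q in range(len(v)) if q not in crashed}

print(rotating_coordinator([1, 0, 1, 0], crashed={0, 2}))   # {1: 0, 3: 0}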
Eventually strongly accurate failure detector
A failure detector is eventually strongly accurate if
from some time onward, only crashed processes are suspected.
Assumptions:
I Each correct process broadcasts alive every ν time units.
I There is an unknown upper bound on network latency.
Each process q initially guesses as network latency dq = 1.
If q receives no message from p for ν + dq time units, then
q suspects that p has crashed.
If q receives a message from a suspected process p, then
p is no longer suspected and dq ← dq + 1.
This failure detector is complete and eventually strongly accurate.
253 / 329
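A sketch of this adaptive detector (illustrative class; the method names
and the tick-based polling are assumptions):

class AdaptiveDetector:
    def __init__(self, nu):
        self.nu = nu                 # heartbeat period
        self.d = {}                  # latency guess d_q per process
        self.last = {}               # arrival time of the last message
        self.suspected = set()

    def on_message(self, p, now):
        if p in self.suspected:      # a suspect turns out to be alive:
            self.suspected.discard(p)
            self.d[p] = self.d.get(p, 1) + 1   # the guess was too low
        self.last[p] = now

    def on_tick(self, now):
        for p, t in self.last.items():
            if now - t > self.nu + self.d.get(p, 1):
                self.suspected.add(p)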
Impossibility of ⌈N/2⌉-crash consensus
Theorem: Let k ≥ N/2. There is no Las Vegas algorithm for k-crash
consensus based on an eventually strongly accurate failure detector.
Proof : Suppose, toward a contradiction, there is such an algorithm.
Divide the set of processes in S and T , with |S| = ⌊N/2⌋ and |T | = ⌈N/2⌉.
Suppose all processes in S select 0 and all processes in T select 1.
Suppose that for a long time the processes in S suspect the processes
in T crashed, and the processes in T suspect the processes in S crashed.
The processes in S can then decide 0, while the processes in T can decide 1.
254 / 329
Chandra-Toueg k-crash consensus algorithm
A failure detector is eventually weakly accurate if
from some time onward some (correct) process is never suspected.
Let k < N/2. A complete and eventually weakly accurate failure detector
is used for k-crash consensus.
Each process q records the last round luq in which it updated value q .
Initially, value q ∈ {0, 1} and luq = −1.
Processes are numbered: p0 , . . . , pN−1 .
Round n is coordinated by pc with c = n mod N.
255 / 329
Chandra-Toueg k-crash consensus algorithm
I In round n, each correct q sends ⟨vote, n, value q , luq ⟩ to pc .
I pc (if not crashed) waits until N − k such messages arrived,
and selects one, say ⟨vote, n, b, ℓ⟩, with ℓ as large as possible.
value pc ← b, lupc ← n, and pc broadcasts ⟨value, n, b⟩.
I Each correct q waits:
- either until ⟨value, n, b⟩ arrives: then value q ← b, luq ← n,
and q sends ⟨ack, n⟩ to pc ;
- or until it suspects pc crashed: then q sends ⟨nack, n⟩ to pc .
I If pc receives > k messages ⟨ack, n⟩, then pc decides b, and
broadcasts ⟨decide, b⟩.
An undecided process that receives ⟨decide, b⟩, decides b.
256 / 329
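A sketch of the coordinator’s selection step (illustrative; votes is the
list of (value, lu) pairs from the N − k received vote messages):

def coordinator_pick(votes):
    value, lu = max(votes, key=lambda v: v[1])   # highest last-update round
    return value

print(coordinator_pick([(1, -1), (0, 3), (1, 2)]))   # 0: it has lu = 3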
Chandra-Toueg algorithm - Correctness
Theorem: Let k < N/2. The Chandra-Toueg algorithm is
an (always terminating) k-crash consensus algorithm.
Proof (part I): If the coordinator in some round n receives > k ack’s,
then (for some b ∈ {0, 1}):
(1) there are > k processes q with luq ≥ n, and
(2) luq ≥ n implies value q = b.
Properties (1) and (2) are preserved in all rounds m > n.
This follows by induction on m − n.
By (1), in round m the coordinator receives a vote with lu ≥ n.
Hence, by (2), the coordinator of round m sets its value to b,
and broadcasts hvalue, m, bi.
So from round n onward, processes can only decide b.
257 / 329
Chandra-Toueg algorithm - Correctness
Proof (part II):
Since the failure detector is eventually weakly accurate,
from some round onward, some process p will never be suspected.
So when p becomes the coordinator, it receives ≥ N − k ack’s.
Since N − k > k, it decides.
All correct processes eventually receive the decide message of p,
and also decide.
258 / 329
Chandra-Toueg algorithm - Example
(figure: a run with N = 3 and k = 1. In round 0 the coordinator gathers
votes ⟨vote, 0, 0, −1⟩, broadcasts ⟨value, 0, 0⟩, receives two ⟨ack, 0⟩’s
and decides 0. After crashes, the round-1 coordinator picks value 0
again, since the vote ⟨vote, 1, 1, −1⟩ loses to a vote with lu = 0;
it broadcasts ⟨value, 1, 0⟩, decides 0, and its ⟨decide, 0⟩ makes the
remaining process decide 0.)
Messages and ack’s that a process sends to itself and ‘irrelevant’ messages are omitted.
259 / 329
Question
Why is it difficult to devise a failure detector for Byzantine processes ?
Answer: Failure detectors are usually based on the absence of events.
260 / 329
Local clocks with bounded drift
Let’s forget about Byzantine processes for a moment.
The time domain is R≥0 .
Each process p has a local clock Cp (τ ), which returns a time value
at real time τ .
Local clocks have bounded drift, compared to real time :
If Cp isn’t adjusted between times τ1 and τ2 , then
(1/ρ)(τ2 − τ1 ) ≤ Cp (τ2 ) − Cp (τ1 ) ≤ ρ(τ2 − τ1 )
for some known ρ > 1.
261 / 329
Clock synchronization
At certain time intervals, the processes synchronize clocks :
They read each other’s clock values, and adjust their local clocks.
The aim is to achieve, for some δ, and all τ ,
|Cp (τ ) − Cq (τ )| ≤ δ
Due to drift, this precision may degrade over time, necessitating
repeated clock synchronizations.
We assume a known bound dmax on network latency.
For simplicity, let dmax be much smaller than δ, so that this latency
can be ignored in the clock synchronization.
262 / 329
Clock synchronization
Suppose that after each synchronization, at say real time τ ,
for all processes p, q:
|Cp (τ ) − Cq (τ )| ≤ δ0
for some δ0 < δ.
Due to ρ-bounded drift of local clocks, at real time τ + R,
|Cp (τ + R) − Cq (τ + R)| ≤ δ0 + (ρ − 1/ρ)R < δ0 + ρR
So synchronizing every (δ − δ0 )/ρ (real) time units suffices.
263 / 329
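Illustrative arithmetic with made-up numbers:

rho, delta, delta0 = 1.01, 10.0, 4.0      # delta, delta0 in milliseconds
R = (delta - delta0) / rho
print(round(R, 2))                        # 5.94: resynchronize every ~5.94 ms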
Impossibility of ⌈N/3⌉-Byzantine synchronizers
Theorem: Let k ≥ N/3. There is no k-Byzantine clock synchronizer.
Proof : Let N = 3, k = 1. Processes are p, q, r ; r is Byzantine.
(The construction below easily extends to general N and k ≥ N/3.)
Let the clock of p run faster than the clock of q.
Suppose a synchronization takes place at real time τ .
r sends Cp (τ ) + δ to p, and Cq (τ ) − δ to q.
p and q can’t recognize that r is Byzantine.
So they have to stay within range δ of the value reported by r .
Hence p can’t decrease, and q can’t increase its clock value.
By repeating this scenario at each synchronization round,
the clock values of p and q get further and further apart.
264 / 329
Mahaney-Schneider synchronizer
Consider a complete network of N processes.
Suppose at most k < N/3 processes are Byzantine.
Each correct process in a synchronization round:
1. Collects the clock values of all processes (waiting for 2dmax ).
2. Discards those reported values τ for which < N − k processes
report a value in the interval [τ − δ, τ + δ].
(They are from Byzantine processes.)
3. Replaces all discarded/non-received values by an accepted value.
4. Takes the average of these N values as its new clock value.
265 / 329
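A sketch of one synchronization round at a correct process (illustrative;
it assumes the ≥ N − k correct values lie within δ of each other, so the
filter accepts something):

def mahaney_schneider_round(reported, delta, N, k):
    # reported[p]: clock value received from p, or None if none arrived.
    vals = {p: t for p, t in reported.items() if t is not None}
    accepted = {p: t for p, t in vals.items()
                if sum(1 for s in vals.values() if abs(s - t) <= delta)
                >= N - k}                                    # step 2: filter
    fill = next(iter(accepted.values()))                     # step 3
    return sum(accepted.get(p, fill) for p in reported) / N  # step 4: average

print(mahaney_schneider_round(
    {"p": 100.0, "q": 101.0, "r": 250.0, "s": 100.5}, 2.0, N=4, k=1))
# 100.375: the Byzantine value 250.0 is filtered out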
Mahaney-Schneider synchronizer - Correctness
Lemma: Let k < N/3. If in some synchronization round values ap and aq
pass the filters of correct processes p and q, respectively, then
|ap − aq | ≤ 2δ
Proof : ≥ N − k processes reported a value in [ap − δ, ap + δ] to p.
And ≥ N − k processes reported a value in [aq − δ, aq + δ] to q.
Since N − 2k > k, at least one correct process r reported
a value in [ap − δ, ap + δ] to p, and in [aq − δ, aq + δ] to q.
Since r reports the same value to p and q, it follows that
|ap − aq | ≤ 2δ
266 / 329
Mahaney-Schneider synchronizer - Correctness
Theorem: Let k < N/3. The Mahaney-Schneider synchronizer is k-Byzantine.
Proof : apr (resp. aqr ) denotes the value that correct process p (resp. q)
accepts from or assigns to process r , in some synchronization round.
By the lemma, for all r , |apr − aqr | ≤ 2δ.
Moreover, apr = aqr for all correct r .
Hence, for all correct p and q,
|(1/N) Σr apr − (1/N) Σr aqr | ≤ (1/N)·k·2δ < (2/3)δ
So we can take δ0 = (2/3)δ.
There should be a synchronization every (δ − δ0 )/ρ = δ/(3ρ) time units.
267 / 329
Lecture in a nutshell (part I)
complete failure detector
strongly accurate failure detector
rotating coordinator crash consensus algorithm with
a weakly accurate failure detector
eventually strongly accurate failure detector
k-crash consensus for k ≥ N/2 remains impossible with
an eventually strongly accurate failure detector
Chandra-Toueg crash consensus algorithm with
an eventually weakly accurate failure detector
268 / 329
Lecture in a nutshell (part II)
local clocks with ρ-bounded drift (where ρ is known)
synchronize clocks so that they stay within δ of each other
for k ≥ N/3, there is no k-Byzantine synchronizer
Mahaney-Schneider k-Byzantine synchronizer
269 / 329
Synchronous networks
Let’s again forget about Byzantine processes for a moment.
A synchronous network proceeds in pulses. In one pulse, each process:
1. sends messages
2. receives messages
3. performs internal events
A message is sent and received in the same pulse.
Such synchrony is called lockstep.
270 / 329
Building a synchronous network
Assume ρ-bounded local clocks with precision δ.
For simplicity, we ignore the network latency.
When a process reads clock value (i − 1)ρ²δ, it starts pulse i.
Key question: Does a process p receive all messages for pulse i
before it starts pulse i + 1 ? That is, for all q,
Cq⁻¹((i − 1)ρ²δ) ≤ Cp⁻¹(iρ²δ)
Because then q starts pulse i no later than p starts pulse i + 1.
(Cr⁻¹(τ ) is the moment in time the clock of r returns τ .)
271 / 329
Building a synchronous network
Since the clock of q is ρ-bounded from below,
Cq⁻¹(τ ) ≤ Cq⁻¹(τ − δ) + ρδ
(figure: the clock Cq against real time; between clock values τ − δ
and τ , at most ρδ real time passes)
Since local clocks have precision δ,
Cq⁻¹(τ − δ) ≤ Cp⁻¹(τ )
Hence, for all τ ,
Cq⁻¹(τ ) ≤ Cp⁻¹(τ ) + ρδ
272 / 329
Building a synchronous network
Since the clock of p is ρ-bounded from above,
Cp⁻¹(τ ) + ρδ ≤ Cp⁻¹(τ + ρ²δ)
(figure: the clock Cp against real time; between clock values τ and
τ + ρ²δ, at least ρδ real time passes)
Hence,
Cq⁻¹((i − 1)ρ²δ) ≤ Cp⁻¹((i − 1)ρ²δ) + ρδ ≤ Cp⁻¹(iρ²δ)
273 / 329
Byzantine consensus for synchronous systems
In a synchronous system, the proof of impossibility of 1-crash consensus
no longer applies.
Because within a pulse, a process is guaranteed to receive a message
from all correct processes.
For this setting, terminating Byzantine consensus algorithms exist.
274 / 329
Byzantine broadcast
Consider a synchronous network with
at most k < N/3 Byzantine processes.
One process g , called the general, is given an input xg ∈ {0, 1}.
The other processes, called lieutenants, know who is the general.
Requirements for k-Byzantine broadcast:
I Termination: Every correct process decides 0 or 1.
I Agreement: All correct processes decide the same value.
I Dependence: If the general is correct, it decides xg .
275 / 329
Impossibility of ⌈N/3⌉-Byzantine broadcast
Theorem: Let k ≥ N/3. There is no k-Byzantine broadcast algorithm
for synchronous networks.
Proof : Divide the processes into three sets S, T and U, each with
≤ k elements. Let g ∈ S.
Scenario 1: xg = 0, and the processes in U are Byzantine:
toward T they behave as U does in scenario 3.
The processes in S and T decide 0.
Scenario 2: xg = 1, and the processes in T are Byzantine:
toward U they behave as T does in scenario 3.
The processes in S and U decide 1.
Scenario 3: the processes in S are Byzantine: toward T they behave
as in scenario 1, and toward U as in scenario 2.
T cannot distinguish scenario 3 from scenario 1, so its processes decide 0,
while U cannot distinguish scenario 3 from scenario 2, so its processes
decide 1, violating agreement.
276 / 329
Lamport-Shostak-Pease Byzantine broadcast algorithm
Broadcast g (N, k) terminates after k + 1 pulses.
Pulse 1: The general g broadcasts and decides xg .
If a lieutenant q receives b from g then xq ← b, else xq ← 0.
If k = 0: each lieutenant q decides xq .
If k > 0: for each lieutenant p, each (correct) lieutenant q takes part
in Broadcast p (N−1, k−1) in pulse 2 (g is excluded).
Pulse k + 1 (k > 0):
Lieutenant q has, for each lieutenant p, computed a value in
Broadcast p (N−1, k−1); it stores this value in Mq [p].
xq ← major (Mq ); lieutenant q decides xq .
(major maps each list over {0, 1} to 0 or 1, such that if more than
half of the elements in the list M are b, then major (M) = b.)
277 / 329
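A compact sketch of Broadcast g (N, k) in Python (an assumption-laden
simulation rather than a distributed implementation: the k + 1 pulses are
folded into recursion, and the hypothetical interface lie(g, q, depth)
fixes the 0/1 bit a Byzantine process sends to each receiver, the same
at every subcall of the same depth).

```python
def major(values):
    # b when a strict majority of the binary values is b; a tie yields 0
    return 1 if 2 * sum(values) > len(values) else 0

def broadcast(general, lieutenants, k, value, byzantine, lie):
    # Simulate Broadcast_general(N, k); returns {correct lieutenant: value}.
    # pulse 1: the general distributes its value
    x = {q: lie(general, q, k) if general in byzantine else value
         for q in lieutenants}
    if k == 0:
        return x                       # each lieutenant decides x[q]
    result = {}
    for q in lieutenants:
        if q in byzantine:
            continue
        M = []
        for p in lieutenants:          # the value q computes in Broadcast_p
            if p == q:
                M.append(x[q])         # q is itself the general of this subcall
            else:
                others = [r for r in lieutenants if r != p]
                M.append(broadcast(p, others, k - 1, x[p], byzantine, lie)[q])
        result[q] = major(M)           # pulse k+1: decide major(M_q)
    return result

# Example 1 on the next slide: N = 4, k = 1, correct general 0 with
# input 1, lieutenant 3 Byzantine; both correct lieutenants decide 1
print(broadcast(0, [1, 2, 3], 1, 1, {3}, lambda g, q, d: q % 2))  # {1: 1, 2: 1}
```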
Lamport-Shostak-Pease broadcast alg. - Example 1
N = 4 and k = 1; general correct.
(Diagram: general g with input 1; lieutenants p and q are correct,
lieutenant r is Byzantine; the situation initially and after pulse 1.)
After pulse 1, g has decided 1, and the correct lieutenants p and q
carry the value 1.
Consider the sub-network without g .
In Broadcast p (3, 0) and Broadcast q (3, 0), p and q both compute 1,
while in Broadcast r (3, 0) they may compute an arbitrary value.
So p and q both build a list [ 1, 1, ∗ ] (the last entry may be arbitrary,
and may even differ at p and q), and decide 1.
278 / 329
Lamport-Shostak-Pease broadcast alg. - Example 2
N = 7 and k = 2; general Byzantine. (Channels are omitted.)
(Diagram: after pulse 1, the Byzantine general g has distributed the
values 1, 1, 0, 0, 0 over the five correct lieutenants; a sixth
lieutenant is Byzantine.)
k − 1 < (N−1)/3, so by induction, the recursive calls Broadcast p (6, 1)
lead to the same list M = [ 1, 1, 0, 0, 0, b ], for some b ∈ {0, 1},
at all correct lieutenants.
So in Broadcast g (7, 2), they all decide major (M).
279 / 329
Lamport-Shostak-Pease broadcast alg. - Example 2
For instance, in Broadcast p (6, 1) with p the Byzantine lieutenant,
the following values could be distributed by p.
(Diagram: the Byzantine lieutenant p distributes the values 1, 0, 0, 1, 1
over the five correct lieutenants.)
Then the five subcalls Broadcast q (5, 0), for the correct lieutenants q,
would at each correct lieutenant lead to the list [ 1, 0, 0, 1, 1 ].
So in that case b = major ( [ 1, 0, 0, 1, 1 ] ) = 1.
280 / 329
Questions
Question: Draw a tree of all recursive subcalls of Broadcast g (7, 2).
Question: Consider Broadcast g (N, 1), with a correct general g .
Let fewer than (N−1)/2 lieutenants be Byzantine.
Argue that all correct processes decide xg .
281 / 329
Lamport-Shostak-Pease broadcast alg. - Correctness
Lemma: If the general g is correct, and fewer than (N−k)/2 lieutenants are Byzantine,
then in Broadcast g (N, k) all correct processes decide xg .
Proof : By induction on k. Case k = 0 is trivial, because g is correct.
Let k > 0.
Since g is correct, in pulse 1, at all correct lieutenants p, xp ← xg .
Since ((N−1)−(k−1))/2 = (N−k)/2, by induction, for all correct lieutenants p,
in Broadcast p (N−1, k−1) the value xp = xg is computed.
Since a majority of the lieutenants is correct (because k > 0),
in pulse k + 1, at each correct lieutenant p, xp ← major (Mp ) = xg .
282 / 329
Lamport-Shostak-Pease broadcast alg. - Correctness
Theorem: Let k < N/3. Broadcast g (N, k) is an (always terminating)
k-Byzantine broadcast algorithm for synchronous networks.
Proof : By induction on k.
If g is correct, the theorem follows from the lemma, because k < N/3
implies k < (N−k)/2.
Let g be Byzantine (so k > 0). Then ≤ k−1 lieutenants are Byzantine.
Since k−1 < (N−1)/3, by induction, for every lieutenant p, all correct
lieutenants compute in Broadcast p (N−1, k−1) the same value.
Hence, all correct lieutenants compute the same list M.
So in pulse k + 1, all correct lieutenants decide major (M).
283 / 329
Partial synchrony
A synchronous system can be obtained if local clocks have known
bounded drift, and there is a known upper bound on network latency.
In a partially synchronous system,
I the bounds on the inaccuracy of local clocks and network latency
are unknown, or
I these bounds are known, but only valid from some unknown point
in time.
Dwork, Lynch and Stockmeyer showed that, for k < N/3, there is
a k-Byzantine broadcast algorithm for partially synchronous systems.
These ideas are at the core of the Paxos consensus protocol.
284 / 329
Public-key cryptosystems
Given a large message domain M.
A public-key cryptosystem consists of functions Sq , Pq : M → M,
for each process q, with
Sq (Pq (m)) = Pq (Sq (m)) = m for all m ∈ M.
Sq is kept secret, Pq is made public.
Underlying assumption: Computing Sq from Pq is very expensive.
p sends a secret message m to q: Pq (m)
p sends a signed message m to q: ⟨m, Sp(m)⟩
Such signatures guarantee that Byzantine processes can’t lie about
the messages they have received.
285 / 329
Lamport-Shostak-Pease authentication algorithm
Pulse 1: The general broadcasts ⟨xg, (Sg(xg), g)⟩, and decides xg.
Pulse i: If a lieutenant q receives a message ⟨v, (σ1, p1) : · · · : (σi, pi)⟩
that is valid, i.e.:
I p1 = g
I p1 , . . . , pi , q are distinct
I Ppk (σk) = v for all k = 1, . . . , i
(each signature σk verifies as v under pk 's public key)
then q includes v in the set Wq .
If i ≤ k, then in pulse i + 1, q sends to all other lieutenants
⟨v, (σ1, p1) : · · · : (σi, pi) : (Sq(v), q)⟩
After pulse k + 1, each correct lieutenant p decides
v if Wp is a singleton {v }, or
0 otherwise (the general is Byzantine)
286 / 329
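The validity check and decision rule can be sketched as follows (a
sketch, not from the slides; the tuple-based sign is a toy stand-in
for a real signature scheme, so verifying Ppk (σk) = v degenerates
into an equality test).

```python
def sign(p, v):
    # toy stand-in for S_p(v); a real system would use public-key signatures
    return ("sig", p, v)

def valid(msg, general, receiver):
    # the check a lieutenant `receiver` applies to a message
    # <v, (s1,p1) : ... : (si,pi)>, represented as (v, [(s1, p1), ...])
    v, chain = msg
    signers = [p for (_, p) in chain]
    return (len(chain) > 0
            and signers[0] == general                            # p1 = g
            and len(set(signers + [receiver])) == len(chain) + 1  # all distinct
            and all(s == sign(p, v) for (s, p) in chain))        # signatures verify

def extend(msg, q):
    # the message q relays in the next pulse: append (S_q(v), q)
    v, chain = msg
    return (v, chain + [(sign(q, v), q)])

def decide(W):
    # after pulse k+1: the unique value if W is a singleton, else 0
    return next(iter(W)) if len(W) == 1 else 0

# pulse 1 of the example below: g sends <1, (S_g(1), g)> to q, who relays it
m = (1, [(sign("g", 1), "g")])
assert valid(m, "g", "q")
assert valid(extend(m, "q"), "g", "p")
print(decide({1}), decide({0, 1}))  # 1 0
```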
Lamport-Shostak-Pease authentication alg. - Example
N = 4 and k = 2.
(Diagram: the general g and lieutenant r are Byzantine;
lieutenants p and q are correct.)
pulse 1: g sends ⟨1, (Sg(1), g)⟩ to p and q
g sends ⟨0, (Sg(0), g)⟩ to r
Wp = Wq = {1}
pulse 2: p broadcasts ⟨1, (Sg(1), g) : (Sp(1), p)⟩
q broadcasts ⟨1, (Sg(1), g) : (Sq(1), q)⟩
r sends ⟨0, (Sg(0), g) : (Sr(0), r)⟩ to q
Wp = {1} and Wq = {0, 1}
pulse 3: q broadcasts ⟨0, (Sg(0), g) : (Sr(0), r) : (Sq(0), q)⟩
Wp = Wq = {0, 1}
p and q decide 0
287 / 329
Lamport-Shostak-Pease authentication alg. - Correctness
Theorem: The Lamport-Shostak-Pease authentication algorithm is
an (always terminating) k-Byzantine broadcast algorithm, for any k.
Proof : If the general is correct, then owing to authentication,
correct lieutenants q only add xg to Wq . So they all decide xg .
Let a correct lieutenant receive a valid message ⟨v, ℓ⟩ in a pulse ≤ k.
In the next pulse, it makes all correct lieutenants p add v to Wp .
Let a correct lieutenant receive a valid message ⟨v, ℓ⟩ in pulse k + 1.
Since ℓ has length k + 1, it contains a correct q.
Then q received a valid message ⟨v, ℓ′⟩, with ℓ′ a prefix of ℓ, in a pulse ≤ k.
In the next pulse, q made all correct lieutenants p add v to Wp .
So after pulse k + 1, Wp is the same for all correct lieutenants p.
288 / 329
Lamport-Shostak-Pease authentication alg. - Optimization
Dolev-Strong optimization: Each correct lieutenant broadcasts
at most two messages, with different values.
Because when it has broadcast two different values,
all correct lieutenants are certain to decide 0.
289 / 329
Lecture in a nutshell
ρ-bounded local clocks with precision δ
synchronous network
Byzantine broadcast
no k-Byzantine broadcast if k ≥ N/3
Lamport-Shostak-Pease broadcast algorithm if k < N/3
Lamport-Shostak-Pease authentication algorithm
290 / 329
Mutual exclusion
Processes contend to enter their critical section.
A process (allowed to be) in its critical section is called privileged.
For each computation we require:
Mutual exclusion: Always at most one process is privileged.
Starvation-freeness: If a process p tries to enter its critical section, and
no process stays privileged forever, then p eventually becomes privileged.
Applications: Distributed shared memory, replicated data, atomic commit.
291 / 329
Mutual exclusion with message passing
Mutual exclusion algorithms with message passing are generally
based on one of the following paradigms.
I Leader election: A process that wants to become privileged
sends a request to the leader.
I Token passing: The process holding the token is privileged.
I Logical clock: Requests to enter a critical section are
prioritized by means of logical time stamps.
I Quorum: To become privileged, a process needs permission
from a quorum of processes.
Each pair of quorums has a non-empty intersection.
292 / 329
Ricart-Agrawala algorithm
When a process pi wants to access its critical section, it sends
request(tsi , i) to all other processes, with tsi its logical time stamp.
When pj receives this request, it sends permission to pi as soon as :
I pj isn’t privileged, and
I pj doesn’t have a pending request with time stamp tsj where
(tsj , j) < (tsi , i) (lexicographical order).
pi enters its critical section when it has received permission from
all other processes.
When pi exits its critical section, it sends permission to all pending
requests.
293 / 329
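One process of the algorithm, sketched in Python (message transport,
the critical section itself and most Lamport-clock bookkeeping are
abstracted into the caller; send(dest, msg) is an assumed delivery hook,
and the class name is hypothetical).

```python
class RicartAgrawala:
    def __init__(self, pid, n, send):
        self.pid, self.n, self.send = pid, n, send   # send(dest, msg)
        self.clock = 0
        self.request = None        # (ts, pid) of our pending request
        self.privileged = False
        self.deferred = []         # requesters to answer after we exit
        self.permissions = set()

    def want_entry(self):
        self.clock += 1
        self.request = (self.clock, self.pid)
        for q in range(self.n):
            if q != self.pid:
                self.send(q, ("request", self.request))

    def on_request(self, q, ts):
        self.clock = max(self.clock, ts[0]) + 1
        # grant unless we are privileged or our own pending request is
        # smaller in the lexicographical order on (time stamp, process id)
        if self.privileged or (self.request is not None and self.request < ts):
            self.deferred.append(q)
        else:
            self.send(q, ("permission", self.pid))

    def on_permission(self, q):
        self.permissions.add(q)
        if len(self.permissions) == self.n - 1:
            self.privileged = True          # enter the critical section

    def exit_cs(self):
        self.privileged, self.request = False, None
        self.permissions = set()
        for q in self.deferred:             # answer all pending requests
            self.send(q, ("permission", self.pid))
        self.deferred = []
```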
Ricart-Agrawala algorithm - Example 1
N = 2, and p0 and p1 both are at logical time 0.
p1 sends request(1, 1) to p0 .
When p0 receives this message, it sends permission to p1 , setting
the time at p0 to 2.
p0 sends request(2, 0) to p1 .
When p1 receives this message, it doesn’t send permission to p0 ,
because (1, 1) < (2, 0).
p1 receives permission from p0 , and enters its critical section.
294 / 329
Ricart-Agrawala algorithm - Example 2
N = 2, and p0 and p1 both are at logical time 0.
p1 sends request(1, 1) to p0 , and p0 sends request(1, 0) to p1 .
When p0 receives the request from p1 ,
it doesn’t send permission to p1 , because (1, 0) < (1, 1).
When p1 receives the request from p0 ,
it sends permission to p0 , because (1, 0) < (1, 1).
p0 and p1 both set their logical time to 2.
p0 receives permission from p1 , and enters its critical section.
295 / 329
Ricart-Agrawala algorithm - Correctness
Mutual exclusion: When p sends permission to q:
I p isn’t privileged; and
I p won’t get permission from q to enter its critical section
until q has entered and left its critical section.
(Because p’s pending or future request is larger than
q’s current request.)
Starvation-freeness: Each request will eventually become
the smallest request in the network.
296 / 329
Ricart-Agrawala algorithm - Optimization
Drawback: High message overhead, because requests must be sent
to all other processes.
Carvalho-Roucairol optimization: After a process q has exited
its critical section, q only needs to send requests to the processes
that q has sent permission to since this exit.
Suppose q is waiting for permissions and didn’t send a request to p.
If p sends a request to q that is smaller than q’s request, then q
sends both permission and a request to p.
This optimization is correct since for each pair of distinct processes,
at least one must ask permission from the other.
297 / 329
Question
Let first p0 and then p1 become privileged.
Next they want to become privileged again.
Which scenarios are possible, if the Carvalho-Roucairol optimization
is employed ?
Answer: p0 needs permission from p1 , but not vice versa.
If p0 ’s request reaches p1 before it wants to become privileged again,
then p1 sends permission and later a request to p0 .
Else p1 enters its critical section, and answers p0 ’s request only
after exiting the critical section.
298 / 329
Raymond’s algorithm
Given an undirected network, with a sink tree.
At any time, the root, holding a token, is privileged.
Each process maintains a FIFO queue, which can contain
id’s of its children, and its own id. Initially, this queue is empty.
Queue maintenance:
I When a non-root wants to enter its critical section,
it adds its id to its own queue.
I When a non-root gets a new head at its (non-empty) queue,
it asks its parent for the token.
I When a process receives a request for the token from a child,
it adds this child to its queue.
299 / 329
Raymond’s algorithm
When the root exits its critical section (and its queue is non-empty),
I it sends the token to the process q at the head of its queue,
I makes q its parent, and
I removes q from the head of its queue.
Let p get the token from its parent, with q at the head of its queue:
I If q ≠ p, then p sends the token to q, and makes q its parent.
I If q = p, then p becomes the root
(i.e., it has no parent, and is privileged).
In both cases, p removes q from the head of its queue.
300 / 329
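A sketch of one node in Python (the send(dest, msg) hook is an
assumption, and two details the slides leave implicit are made explicit
here: an idle root serves an incoming request immediately, and a process
that passes the token on while its queue is still non-empty immediately
asks for it back).

```python
from collections import deque

class Raymond:
    def __init__(self, pid, parent, send):
        self.pid, self.parent, self.send = pid, parent, send
        self.queue = deque()
        self.has_token = parent is None    # the initial root holds the token
        self.in_cs = False

    def _on_new_head(self):
        if len(self.queue) != 1:
            return                         # head unchanged: nothing to do
        if self.has_token:
            if not self.in_cs:
                self._pass_token()         # idle root: serve immediately
        else:
            self.send(self.parent, ("request", self.pid))

    def want_entry(self):
        self.queue.append(self.pid)        # add own id to own queue
        self._on_new_head()

    def on_request(self, child):
        self.queue.append(child)           # add requesting child to queue
        self._on_new_head()

    def on_token(self):
        self.has_token = True
        self._pass_token()

    def exit_cs(self):
        self.in_cs = False
        if self.queue:
            self._pass_token()

    def _pass_token(self):
        q = self.queue.popleft()
        if q == self.pid:
            self.in_cs = True              # we are the root and privileged
        else:
            self.has_token = False
            self.parent = q                # the token receiver becomes our parent
            self.send(q, ("token",))
            if self.queue:                 # still pending requests: ask it back
                self.send(q, ("request", self.pid))
```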
Raymond’s algorithm - Example
2 3
4 5
301 / 329
Raymond’s algorithm - Example
1 3
2 3 5
4 5 5
301 / 329
Raymond’s algorithm - Example
1 3, 2
2 2 3 5
4 5 5
301 / 329
Raymond’s algorithm - Example
1 3, 2
2 2 3 5, 4
4 5 5
4
301 / 329
Raymond’s algorithm - Example
1 2
2 2 3 5, 4, 1
4 5 5
4
301 / 329
Raymond’s algorithm - Example
1 2
2 2 3 4, 1
4 5 3
4
301 / 329
Raymond’s algorithm - Example
1 2
2 2 3 4, 1
4 5
4
301 / 329
Raymond’s algorithm - Example
1 2
2 2 3 1
4 5
3
301 / 329
Raymond’s algorithm - Example
1 2
2 2 3 1
4 5
301 / 329
Raymond’s algorithm - Example
1 2
2 2 3
4 5
301 / 329
Raymond’s algorithm - Example
2 3
4 5
301 / 329
Raymond’s algorithm - Correctness
Raymond’s algorithm provides mutual exclusion, because
at all times there is at most one root.
Raymond’s algorithm is starvation-free, because eventually
each request in a queue moves to the head of this queue,
and a chain of requests never contains a cycle.
Drawback: Sensitive to failures.
302 / 329
Question
What is the Achilles’ heel of a mutual exclusion algorithm based on
a leader ?
Answer: The leader is a single point of failure.
303 / 329
Agrawal-El Abbadi algorithm
To enter a critical section, permission from a quorum is required.
For simplicity we assume that N = 2^k − 1, for some k > 1.
The processes are structured in a binary tree of depth k − 1.
A quorum consists of all processes on a path from the root to a leaf.
If a non-leaf p has crashed (or is unresponsive), permission is asked
from all processes on two paths instead: from each child of p to a leaf.
304 / 329
Agrawal-El Abbadi algorithm - Example
Example: Let N = 7, with the tree
        1
      2   3
     4 5 6 7
Possible quorums are:
I {1, 2, 4}, {1, 2, 5}, {1, 3, 6}, {1, 3, 7}
I if 1 crashed: {2, 4, 3, 6}, {2, 5, 3, 6}, {2, 4, 3, 7}, {2, 5, 3, 7}
I if 2 crashed: {1, 4, 5} (and {1, 3, 6}, {1, 3, 7})
I if 3 crashed: {1, 6, 7} (and {1, 2, 4}, {1, 2, 5})
Question: What are the quorums if 1,2 crashed? And if 1,2,3 crashed?
And if 1,2,4 crashed?
305 / 329
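The quorum structure can be computed as follows (a sketch, not from the
slides; the dict-based tree encoding is an assumption). It reproduces
the quorum lists of the example above.

```python
def quorums(tree, root, crashed):
    # All quorums of the Agrawal-El Abbadi tree: a quorum is a
    # root-to-leaf path, where a crashed non-leaf is replaced by one such
    # path through each of its children; a crashed leaf yields no quorum.
    if root in crashed:
        if root not in tree:               # crashed leaf: dead end
            return []
        left, right = tree[root]
        return [ql | qr for ql in quorums(tree, left, crashed)
                        for qr in quorums(tree, right, crashed)]
    if root not in tree:                   # responsive leaf
        return [{root}]
    left, right = tree[root]
    return ([{root} | q for q in quorums(tree, left, crashed)] +
            [{root} | q for q in quorums(tree, right, crashed)])

tree = {1: (2, 3), 2: (4, 5), 3: (6, 7)}     # the N = 7 example
print(quorums(tree, 1, set()))   # {1,2,4} {1,2,5} {1,3,6} {1,3,7}
print(quorums(tree, 1, {1}))     # {2,4,3,6} {2,4,3,7} {2,5,3,6} {2,5,3,7}
print(quorums(tree, 1, {2}))     # {1,4,5} {1,3,6} {1,3,7}
```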
Agrawal-El Abbadi algorithm
A process p that wants to enter its critical section places the root
of the tree in a queue.
p repeatedly tries to get permission from the head r of its queue.
If successful, r is removed from p’s queue.
If r is a non-leaf, one of r ’s children is appended to p’s queue.
If non-leaf r has crashed, it is removed from p’s queue,
and both of r ’s children are appended at the end of the queue
(in a fixed order, to avoid deadlocks).
If leaf r has crashed, p aborts its attempt to become privileged.
When p’s queue becomes empty, it enters its critical section.
After exiting its critical section, p informs all processes in the quorum
that their permission to p can be withdrawn.
306 / 329
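A sketch of this entry protocol (status(r) is a hypothetical hook that
blocks until r grants permission and then returns True, or returns False
once r is deemed crashed; always descending via the left child is one
arbitrary policy for "one of r's children").

```python
from collections import deque

def acquire(tree, root, status):
    # Assemble a quorum by working through a queue of processes,
    # starting from the root of the tree.
    queue, quorum = deque([root]), []
    while queue:
        r = queue.popleft()
        if status(r):                      # r granted permission
            quorum.append(r)
            if r in tree:
                queue.append(tree[r][0])   # descend toward a leaf (left child)
        elif r in tree:                    # crashed non-leaf: replace by both
            queue.extend(tree[r])          # children, in a fixed order
        else:
            return None                    # crashed leaf: abort the attempt
    return quorum                          # empty queue: enter critical section

tree = {1: (2, 3), 2: (4, 5), 3: (6, 7)}
print(acquire(tree, 1, lambda r: r != 1))  # 1 crashed: quorum [2, 3, 4, 6]
```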
Agrawal-El Abbadi algorithm - Example
        1
      2   3
     4 5 6 7
p and q concurrently want to enter their critical section.
p gets permission from 1, and wants permission from 3.
1 crashes, and q now wants permission from 2 and 3.
q gets permission from 2, and appends 4 to its queue.
q obtains permission from 3, and appends 7 to its queue.
3 crashes, and p now wants permission from 6 and 7.
q gets permission from 4, and now wants permission from 7.
p gets permission from both 6 and 7, and enters its critical section.
307 / 329
Agrawal-El Abbadi algorithm - Mutual exclusion
We prove, by induction on depth k, that each pair of quorums has
a non-empty intersection, and so mutual exclusion is guaranteed.
A quorum with 1 contains a quorum in one of the subtrees below 1,
while a quorum without 1 contains a quorum in both subtrees below 1.
I If two quorums both contain 1, we are done.
I If two quorums both don’t contain 1, then by induction they
have elements in common in the two subtrees below process 1.
I Suppose quorum Q contains 1, while quorum Q 0 doesn’t.
Then Q contains a quorum in one of the subtrees below 1,
and Q 0 also contains a quorum in this subtree.
By induction, they have an element in common in this subtree.
308 / 329
Agrawal-El Abbadi algorithm - Deadlock-freeness
In case of a crashed process, let its left child be put before its right child
in the queue of a process that wants to become privileged.
Let a process p at depth d be greater than any process
I at a depth > d in the binary tree, or
I at depth d and more to the right than p in the binary tree.
A process with permission from r never needs permission from a q < r.
This guarantees that, as long as some leaf is responsive, eventually
some process will become privileged.
Starvation can happen if a process waits infinitely long for a permission.
(This can be easily resolved.)
309 / 329
Self-stabilization
All configurations are initial configurations.
An algorithm is self-stabilizing if every computation eventually reaches
a correct configuration.
Advantages:
I fault tolerance
I straightforward initialization
Self-stabilizing operating systems and databases have been developed.
310 / 329
Self-stabilization - Shared memory
In a message-passing setting, processes might all be initialized
in a state where they are waiting for a message.
Then the self-stabilizing algorithm wouldn’t exhibit any behavior.
Therefore, in self-stabilizing algorithms, processes communicate
via variables in shared memory.
We assume that a process can read the variables of its neighbors.
311 / 329
Dijkstra’s self-stabilizing token ring
Processes p0 , . . . , pN−1 form a directed ring.
Each pi holds a value xi ∈ {0, . . . , K − 1} with K ≥ N.
I pi for each i = 1, . . . , N − 1 is privileged if xi ≠ xi−1 .
I p0 is privileged if x0 = xN−1 .
Each privileged process is allowed to change its value,
causing the loss of its privilege :
I xi ← xi−1 when xi ≠ xi−1 , for each i = 1, . . . , N − 1
I x0 ← (x0 + 1) mod K when x0 = xN−1
If K ≥ N, then Dijkstra’s token ring self-stabilizes.
That is, each computation eventually satisfies mutual exclusion.
Moreover, Dijkstra’s token ring is starvation-free.
312 / 329
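The ring and its stabilization can be simulated directly (a sketch, not
from the slides; the random scheduler stands in for an arbitrary fair
scheduling, and 100 steps comfortably exceed the convergence bound of
the proof below for N = 4).

```python
import random

def privileged(xs):
    # p0 is privileged iff x0 = x_{N-1}; pi (i > 0) iff xi != x_{i-1}
    return [i for i in range(len(xs))
            if (xs[0] == xs[-1] if i == 0 else xs[i] != xs[i - 1])]

def step(xs, K, i):
    # the privileged process i moves, losing its privilege
    xs[i] = (xs[0] + 1) % K if i == 0 else xs[i - 1]

random.seed(1)
N = K = 4
xs = [0, 3, 2, 1]          # the initial configuration of the example
for _ in range(100):       # any scheduling: pick some privileged process
    step(xs, K, random.choice(privileged(xs)))
assert len(privileged(xs)) == 1   # mutual exclusion has been achieved
print(xs, privileged(xs))
```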
Dijkstra’s token ring - Example
Let N = K = 4. Consider the initial configuration
x0 = 0, x1 = 3, x2 = 2, x3 = 1
It isn't hard to see that the ring self-stabilizes. For instance,
it can successively reach
x0 = 0, x1 = 0, x2 = 3, x3 = 2 and then x0 = 0, x1 = 0, x2 = 0, x3 = 3
313 / 329
Dijkstra’s token ring - Correctness
Theorem: If K ≥ N, then Dijkstra’s token ring self-stabilizes.
Proof : In each configuration at least one process is privileged.
An event never increases the number of privileged processes.
Consider an (infinite) computation. After at most (N − 1)N/2 events
at p1 , . . . , pN−1 , an event must happen at p0 .
So during the computation, x0 ranges over all values in {0, . . . , K − 1}.
Since p1 , . . . , pN−1 only copy values, they stick to their ≤ N − 1 values
as long as x0 equals xi for some i = 1, . . . , N − 1.
Since K ≥ N, at some point, x0 ≠ xi for all i = 1, . . . , N − 1.
The next time p0 becomes privileged, clearly xi = x0 for all i.
So then mutual exclusion has been achieved.
314 / 329
Question
Let N ≥ 3. Argue that Dijkstra’s token ring self-stabilizes if K = N − 1.
This lower bound for K is sharp ! (See the next slide.)
Answer: Consider any computation.
At some moment, pN−1 copies the value from pN−2 .
Then p1 , . . . , pN−1 hold ≤ N − 2 different values (because N ≥ 3).
Since p1 , . . . , pN−1 only copy values, they hold these ≤ N − 2 values
as long as x0 equals xi for some i = 1, . . . , N − 1.
Since K ≥ N − 1, at some point, x0 ≠ xi for all i = 1, . . . , N − 1.
315 / 329
Dijkstra’s token ring - Non-stabilization if K = N − 2
Example: Let N ≥ 4 and K = N − 2, and consider the following
initial configuration.
x0 = x1 = xN−1 = N − 3, and xi = N − 2 − i for i = 2, . . . , N − 2
(so x2 = N − 4, x3 = N − 5, . . . , xN−3 = 1, xN−2 = 0).
It doesn’t always self-stabilize.
316 / 329
Afek-Kutten-Yung self-stabilizing spanning tree algorithm
We compute a spanning tree in an undirected network.
As always, each process is supposed to have a unique id.
The process with the largest id becomes the root.
Each process p maintains the following variables :
parent p : its parent in the spanning tree
root p : the root of the spanning tree
dist p : its distance from the root via the spanning tree
317 / 329
Afek-Kutten-Yung spanning tree algorithm - Complications
Due to arbitrary initialization, there are three complications.
Complication 1: Multiple processes may consider themselves root.
Complication 2: There may be a cycle in the spanning tree.
Complication 3: root p may not be the id of any process in the network.
318 / 329
Afek-Kutten-Yung spanning tree algorithm
A non-root p declares itself root, i.e.
parent p ← ⊥ root p ← p dist p ← 0
if it detects an inconsistency in its root or parent value,
or with the root or dist value of its parent :
I root p ≤ p, or
I parent p = ⊥, or
I parent p ≠ ⊥, and parent p isn't a neighbor of p,
or root p ≠ root parent p or dist p ≠ dist parent p + 1.
319 / 329
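The reset condition translates almost literally into code (a sketch,
not from the slides; the state/neighbors encoding is an assumption).
It already reproduces the first steps of the looping example on the
coming slides.

```python
def inconsistent(p, state, neighbors):
    # reset condition for a process p that does not consider itself root
    # (state[r] is a dict with keys 'parent', 'root', 'dist')
    s = state[p]
    return (s['root'] <= p
            or s['parent'] is None
            or s['parent'] not in neighbors[p]
            or s['root'] != state[s['parent']]['root']
            or s['dist'] != state[s['parent']]['dist'] + 1)

# processes 0 and 1 both believe in a non-existent root 2
state = {0: dict(parent=1, root=2, dist=0),
         1: dict(parent=0, root=2, dist=1)}
neighbors = {0: {1}, 1: {0}}
assert inconsistent(0, state, neighbors)       # dist_0 != dist_1 + 1
state[0].update(parent=None, root=0, dist=0)   # 0 declares itself root
assert inconsistent(1, state, neighbors)       # now root_1 != root_0
```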
Question
Suppose that during an application of the Afek-Kutten-Yung algorithm,
the created directed network contains a cycle with a “false” root.
Why is such a cycle always broken ?
Answer: At some p on this cycle, dist p ≠ dist parent p + 1.
So p declares itself root.
320 / 329
Afek-Kutten-Yung spanning tree algorithm
A root p makes a neighbor q its parent if p < root q :
parent p ← q root p ← root q dist p ← dist q + 1
Complication: Processes can infinitely often rejoin a component
with a false root.
321 / 329
Afek-Kutten-Yung spanning tree alg. - Example
Given two processes 0 and 1.
parent 0 = 1 parent 1 = 0 root 0 = root 1 = 2 dist 0 = 0 dist 1 = 1
Since dist 0 ≠ dist 1 + 1, 0 declares itself root :
parent 0 ← ⊥ root 0 ← 0 dist 0 ← 0
Since root 0 < root 1 , 0 makes 1 its parent :
parent 0 ← 1 root 0 ← 2 dist 0 ← 2
Since dist 1 ≠ dist 0 + 1, 1 declares itself root :
parent 1 ← ⊥ root 1 ← 1 dist 1 ← 0
Since root 1 < root 0 , 1 makes 0 its parent :
parent 1 ← 0 root 1 ← 2 dist 1 ← 3 et cetera
322 / 329
Afek-Kutten-Yung spanning tree alg. - Join Requests
Before p makes q its parent, it must wait until q’s component
has a proper root. Therefore p first sends a join request to q.
This request is forwarded through q’s component, toward the root
of this component.
The root sends back an ack toward p, which retraces the path of
the request.
Only when p receives this ack does it make q its parent :
parent p ← q root p ← root q dist p ← dist q + 1
Join requests are only forwarded between “consistent” processes.
323 / 329
Afek-Kutten-Yung spanning tree alg. - Example
Given two processes 0 and 1.
parent 0 = 1 parent 1 = 0 root 0 = root 1 = 2 dist 0 = dist 1 = 0
Since dist 0 ≠ dist 1 + 1, 0 declares itself root :
parent 0 ← ⊥ root 0 ← 0 dist 0 ← 0
Since root 0 < root 1 , 0 sends a join request to 1.
This join request doesn’t immediately trigger an ack.
Since dist 1 ≠ dist 0 + 1, 1 declares itself root :
parent 1 ← ⊥ root 1 ← 1 dist 1 ← 0
Since 1 is now a proper root, it replies to the join request of 0 with
an ack, and 0 makes 1 its parent :
parent 0 ← 1 root 0 ← 1 dist 0 ← 1
324 / 329
Afek-Kutten-Yung spanning tree alg. - Shared memory
A process can be forwarding, and awaiting an ack for, at most
one join request at a time.
(That’s why in the previous example 1 can’t send 0’s join request on to 0.)
Communication is performed using shared memory, so join requests and
ack’s are encoded in shared variables.
The path of a join request is remembered in local variables.
For simplicity, join requests are here presented in a message passing
framework with synchronous communication.
325 / 329
Afek-Kutten-Yung spanning tree alg. - Consistency check
Given a ring with processes p, q, r , and s > p, q, r .
Initially, p and q consider themselves root; r has p as parent and
considers s the root.
Since root r > q, q sends a join request to r .
Without the consistency check, r would forward this join request to p.
Since p considers itself root, it would send back an ack to q (via r ),
and q would make r its parent and consider s the root.
Since root r ≠ root p , r makes itself root.
Now we would have a configuration symmetric to the initial one.
326 / 329
Afek-Kutten-Yung spanning tree alg. - Correctness
Each component in the network with a false root has an inconsistency,
so a process in this component will declare itself root.
Since processes can only be involved in one join request at a time,
each join request is eventually acknowledged.
Since join requests are only passed on between consistent processes,
processes can only finitely often join a component with a false root
(each time due to improper initial values of local variables).
These observations imply that eventually false roots will disappear,
the process with the largest id in the network will declare itself root,
and the network converges to a spanning tree with this process as root.
327 / 329
Lecture in a nutshell
mutual exclusion
Ricart-Agrawala algorithm with a logical clock
Raymond’s algorithm with token passing
Agrawal-El Abbadi algorithm with quorums
self-stabilization
Dijkstra’s self-stabilizing mutual exclusion algorithm
Afek-Kutten-Yung self-stabilizing spanning tree algorithm
328 / 329
Edsger W. Dijkstra prize in distributed computing
2000: Lamport, Time, clocks, and the ordering of events in a distributed system, 1978
2001: Fischer, Lynch, Paterson, Impossibility of distributed consensus with one faulty
process, 1985
2002: Dijkstra, Self-stabilizing systems in spite of distributed control, 1974
2004: Gallager, Humblet, Spira, A distributed algorithm for minimum-weight spanning
trees, 1983
2005: Pease, Shostak, Lamport, Reaching agreement in the presence of faults, 1980
2007: Dwork, Lynch, Stockmeyer, Consensus in the presence of partial synchrony, 1988
2010: Chandra, Toueg, Unreliable failure detectors for reliable distributed systems, 1996
2014: Chandy, Lamport, Distributed snapshots: determining global states of distributed
systems, 1985
2015: Ben-Or, Another advantage of free choice: completely asynchronous agreement
protocols, 1983
The 2003, 2006, 2012 award winners are treated in Concurrency & Multithreading.
329 / 329