
Logical Time

The document discusses logical time in distributed systems. It begins by introducing scalar (linear) time, which uses logical clocks to order events in a distributed system. Lamport clocks are described, which associate each process with a local clock that is incremented upon internal and message events. Vector time and matrix time are also mentioned as ways to extend scalar time. The document then discusses using logical time to solve problems like mutual exclusion through permission-based and token-based algorithms. It presents the principles of individual permissions and describes how timestamps can be used to establish priority and consistently grant permissions.


LOGICAL TIME

in DISTRIBUTED SYSTEMS

Michel RAYNAL
Institut Universitaire de France
Academia Europaea

IRISA, Université de Rennes, France


Polytechnic University (PolyU), Hong Kong

c M. Raynal
Logical time in distributed systems 1
Companion Book (1)

Distributed Algorithms for


Message-passing Systems
by Michel Raynal
Springer, 515 pages, 2013
ISBN: 978-3-642-38122-5

c M. Raynal
Logical time in distributed systems 2
Companion Book (2): content (six parts)

• Part 1: Distributed graph algorithms


• Part 2:
Logical time and global state in distributed systems
• Part 3: Mutual exclusion and resource allocation
• Part 4: High level communication abstractions
• Part 5: Detection of properties of distributed executions
• Part 6: Distributed shared memory

c M. Raynal
Logical time in distributed systems 3
Contents

• Scalar (linear) time


• Vector time
• Matrix time
• Using virtual time

c M. Raynal
Logical time in distributed systems 4
Part I

SCALAR/LINEAR TIME

- Lamport, L., Time, Clocks and the Ordering of Events in a Distributed System.
Communications of the ACM, 21(7):558-565, 1978

c M. Raynal
Logical time in distributed systems 5
Aim

• Build a logical time in order to be able to associate a consistent date with events, i.e.,

e −→ f ⇒ date(e) < date(f)

• Why logical time? Because there is no notion of physical time in a pure asynchronous system (no bound on process speed, and no bound on message transfer delays)

Even if physical time were available, it would be more difficult to use

c M. Raynal
Logical time in distributed systems 6
The fundamental constraint

Logical time has to increase along causal paths

• How logical time is used: when it produces a new event,


a process associates the current clock value with that
event
• Idea: consider the set of integers for the time domain

⋆ Each process pi has a local clock hi


⋆ From a local point of view:
hi has to measure the progress of pi
⋆ From a global point of view:
hi has to measure the progress of the whole computation

c M. Raynal
Logical time in distributed systems 7
Lamport clocks (1978)

Local progress rule:


before producing an internal event:
hi ← hi + 1 % date of the internal event %

Sending rule:
when sending a message m to pj :
hi ← hi + 1; % date of the send event %
send (m, hi) to pj

Receiving Rule:
when receiving a message (m, h) from pj :
hi ← max(hi, h);
hi ← hi + 1 % date of the receive event %

c M. Raynal
Logical time in distributed systems 8
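A minimal executable sketch of the three rules above (Python; not part of the slides, the class and method names are illustrative):

class LamportClock:
    # Scalar (Lamport) clock: one integer h per process.
    def __init__(self):
        self.h = 0

    def internal_event(self):
        self.h += 1                    # local progress rule
        return self.h                  # date of the internal event

    def send_event(self):
        self.h += 1                    # sending rule
        return self.h                  # value piggybacked on the message

    def receive_event(self, h_msg):
        self.h = max(self.h, h_msg)    # receiving rule
        self.h += 1
        return self.h                  # date of the receive event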
Illustration

[Space-time diagram: three processes p1, p2, p3 with their scalar clocks h1, h2, h3; the clock values increase along each process line and jump forward at message receptions.]

Observation: (date(e) = x)

There are x events on the longest causal path ending at e

c M. Raynal
Logical time in distributed systems 9
Build a total order on all the events

• Motivation: Resource allocation problems


• Observations

⋆ (date(e) < date(f )) ⇒ ¬(f −→ e)


⋆ (date(e) < date(f)) ∧ ¬(e −→ f) is possible
In that case, e and f are independent events, but this cannot be concluded from the dates alone
⋆ (date(e) = date(f )) ⇒ e||f

• Associate a timestamp (h, i) with each event e where:

⋆ h = local clock value when e is produced (date)


⋆ i = index of the process that produced e (location)

c M. Raynal
Logical time in distributed systems 10
Total order definition

• Let e and f be events timestamped (h, i) and (k, j)

(e −→TO f) =def (h < k) ∨ ((h = k) ∧ (i < j))

• This is the (well-known) lexicographical ordering

c M. Raynal
Logical time in distributed systems 11
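As a small illustration (Python, not from the slides), the total order reduces to lexicographic comparison of the pairs (h, i), which Python tuples already provide:

def to_before(ts_e, ts_f):
    # ts = (h, i): local date h, index i of the issuing process
    (h, i), (k, j) = ts_e, ts_f
    return h < k or (h == k and i < j)   # same result as the tuple test ts_e < ts_f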
Illustration

[Figure: a two-process execution where p1 produces e11, e21 (timestamps (2, 1), (3, 1)) and p2 produces e12, e22, e32 (timestamps (1, 2), (2, 2), (4, 2)), together with the lattice of global states it generates, from Σinit = [0, 0] to Σfinal = [2, 3].]

Lamport's timestamps capture one observation (among all possible observations), namely e12, e11, e22, e21, e32

c M. Raynal
Logical time in distributed systems 12
A theorem on the space of scalar clocks

• Let C be the set of all the scalar clock systems that are
consistent (with respect to the causality relation)
• Let e and f be any two events of a distributed execution
• ∀ C ∈ C: (e −→ev f) ⇒ dateC(e) < dateC(f) (Consistency)
• e||f ⇔ ∃ C ∈ C : dateC(e) = dateC(f)
• Or equivalently
e||f ⇔ ∃ C1, C2 ∈ C :
dateC1(e) ≤ dateC1(f) ∧ dateC2(e) ≥ dateC2(f)

c M. Raynal
Logical time in distributed systems 13
Part II

SCALAR CLOCKS in ACTION

c M. Raynal
Logical time in distributed systems 14
The Mutex problem

• Enrich the underlying system with new operations


• These operations define a service
• Here, two operations: acquire() and release()
• Process behavior: abstracted into a 3-state automaton
statei ∈ {out, asking, in} (all other details are irrelevant)

[Figure: the three-state automaton with states out, asking, and in.]

c M. Raynal
Logical time in distributed systems 15
The Mutex problem: definition

• Definition

⋆ Safety: no two processes are concurrently in the CS


⋆ Liveness: any request is eventually granted

• Algorithms

⋆ Permission-based (Individual vs arbiter permissions)


⋆ Token-based

- Raynal M., Algorithms for mutual exclusion. The MIT Press, 1986
- Anderson J., Kim Y.-J. and Herman T., Shared-memory mutual exclusion: major
research trends since 1986. Distributed Computing, 16(2-3): 75-110, 2003
- Taubenfeld G., Synchronization Algorithms and Concurrent Programming. Pearson/Prentice Hall, 2006.

c M. Raynal
Logical time in distributed systems 16
Individual permissions: principles

• When it wants to enter the CS, pi asks for permissions

• When it has received all the permissions, pi enters
• Ri = the set of processes from which pi needs the permission to enter the CS
• Individual permission: Ri = {1, . . . , n} \ {i}
• When pi gives its permission to a process pk, the meaning of the permission is "As far as I am concerned, you can enter" (a permission is consequently "individual")
• Core of the algorithm: the way permissions are granted
• The algorithm manages bilateral conflicts

c M. Raynal
Logical time in distributed systems 17
Granting a permission
[Figure: pi moves from statei = out to statei = asking and requests permissions; pj (statej = out) returns perm immediately, while pk (statek ≠ out) is in conflict with pi.]

• Solve the conflict between pi and pk

• A solution: timestamp the requests, and use the total order on timestamps to establish a system-wide consistent priority
⋆ pk does not have priority: it sends its permission to pi by return
⋆ pk has priority: it delays the permission, which will be sent when it exits the CS

c M. Raynal
Logical time in distributed systems 18
From mechanisms to properties

• Safety: ∀ i ≠ j : j ∈ Ri ∧ i ∈ Rj
• Liveness: based on a timestamping mechanism

c M. Raynal
Logical time in distributed systems 19
Ricart-Agrawala mutex algorithm: local variables

• statei ∈ {out, asking, in}, init out

• hi, lasti: integers, init 0
• prioi: boolean
• waiting_fromi, postponedi: sets

c M. Raynal
Logical time in distributed systems 20
Structure

[Figure: structure of the algorithm — the operations acquire() and release(), invoked by the application, and the handlers for the req(k, j) and perm(j) messages all operate on the local variables.]

c M. Raynal
Logical time in distributed systems 21
Ricart-Agrawala mutex algorithm (1)

operation acquire() issued by pi

statei ← asking; postponedi ← ∅;
hi ← hi + 1; lasti ← hi;
waiting_fromi ← Ri;
for each j ≠ i do send req(lasti, i) to pj end for;
wait (waiting_fromi = ∅);
statei ← in

when perm(j) is received

waiting_fromi ← waiting_fromi \ {j}

c M. Raynal
Logical time in distributed systems 22
Ricart-Agrawala mutex algorithm (2)

when req(k, j) is received

hi ← max(hi, k) + 1;
prioi ← (statei ≠ out) ∧ ((lasti, i) < (k, j));
if prioi then postponedi ← postponedi ∪ {j}
else send perm(i) to pj
end if

operation release()
for each j ∈ postponedi do send perm(i) to pj end for;
statei ← out

c M. Raynal
Logical time in distributed systems 23
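The rules of the two previous slides, gathered into one sketch (Python; the transport primitive send(msg, dest) and the way the caller blocks on the wait are assumptions, not part of the slides):

class RicartAgrawala:
    def __init__(self, i, n, send):
        self.i, self.n, self.send = i, n, send
        self.state = "out"
        self.h = 0                       # scalar clock h_i
        self.last = 0                    # timestamp of the current request
        self.postponed = set()           # postponed_i
        self.waiting_from = set()        # waiting_from_i
        self.R = set(range(n)) - {i}     # individual permissions: all other processes

    def acquire(self):
        self.state = "asking"
        self.postponed = set()
        self.h += 1
        self.last = self.h
        self.waiting_from = set(self.R)
        for j in self.R:
            self.send(("req", self.last, self.i), j)
        # the caller then waits until waiting_from is empty, and sets state = "in"

    def on_perm(self, j):
        self.waiting_from.discard(j)

    def on_req(self, k, j):
        self.h = max(self.h, k) + 1
        prio = self.state != "out" and (self.last, self.i) < (k, j)
        if prio:
            self.postponed.add(j)        # answer delayed until release()
        else:
            self.send(("perm", self.i), j)

    def release(self):
        for j in self.postponed:
            self.send(("perm", self.i), j)
        self.state = "out"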
Clock values

• hi can increase forever


• Aim: limit its increase
• As only requests have to be timestamped we can replace
hi ← max(hi, k) + 1 with hi ← max(hi, k)
• As we are about to see in the proof, it is possible to
further limit the increase in the acquire() operation:
The two statements [hi ← hi +1; lasti ← hi] are replaced
by lasti ← hi + 1, which does not increase hi!
• These two modifications allow obtaining an algorithm in which all variables are bounded: clock values can be implemented modulo 2n − 1

c M. Raynal
Logical time in distributed systems 24
Ricart-Agrawala mutex algorithm

operation acquire() issued by pi

statei ← asking; postponedi ← ∅;
lasti ← hi + 1; % replaces hi ← hi + 1; lasti ← hi %
waiting_fromi ← Ri;
for each j ≠ i do send req(lasti, i) to pj end for;
wait (waiting_fromi = ∅);
statei ← in

when req(k, j) is received

hi ← max(hi, k); % replaces hi ← max(hi, k) + 1 %
prioi ← (statei ≠ out) ∧ ((lasti, i) < (k, j));
if prioi then postponedi ← postponedi ∪ {j}
else send perm(i) to pj
end if

c M. Raynal
Logical time in distributed systems 25
Proof: on the safety side (1)

• By contradiction (assume the invariant is violated)


• Two cases:

[Figure: pi issues request(h, i) and pj issues request(k, j); the two requests are in conflict.]

c M. Raynal
Logical time in distributed systems 26
Proof: on the safety side (2)

[Figure: pi is in the CS with its request(h, i), so statei ≠ out; pj, which was out, receives this request before issuing its own request(k, j), so clockj ← max(clockj, h) and k = lastj = clockj + 1 > h.]

c M. Raynal
Logical time in distributed systems 27
Proof: on the liveness side (1)

• Two-step proof
• Deadlock-freedom:

consider the request with the smallest timestamp


• Starvation-freedom:
any pi that wants to enter the crit. section will do it

c M. Raynal
Logical time in distributed systems 28
Proof: on the liveness side (2)

[Figure: starvation-freedom — pi's request(h, i), issued with clocki = h − 1 < k − 1, has priority over pj's request(k, j) where k > h, so pj sends permission(j) by return; after receiving pj's request, clocki ≥ k, so pi's next request(h′, i) carries h′ > k and pj then has priority.]

c M. Raynal
Logical time in distributed systems 29
Cost

• Message cost: 2(n − 1) messages per CS use


• Improvement:

⋆ The algorithm can be improved in such a way that a


CS use costs between 0 and 2(n − 1) messages
⋆ Idea: every pair of processes manages a single per-
mission (token) to solve their conflicts

• Time: consider each message takes one time unit

⋆ Heavy load: one time unit


⋆ Light load: two time units

c M. Raynal
Logical time in distributed systems 30
Variants

• Ring structure

⋆ Forward a request = give its permission


⋆ Cost: n messages

• Assumption: an upper bound ∆ on message transfer delays

⋆ Giving one's permission = not answering
⋆ Not giving one's permission = sending back a negative ack by return, which is cancelled when exiting the CS

c M. Raynal
Logical time in distributed systems 31
On mutual exclusion

• Permission-based

⋆ Individual permission approach


⋆ Arbiter permission approach
Quorums and three-way handshake algorithms

• Token-based
• A continuous view

- Raynal M., Algorithms for mutual exclusion, The MIT Press, 1986

- Anderson J., Kim Y.-J. and Herman T., Shared-memory mutual exclusion: major research trends since 1986. Distributed Computing, 16(2-3): 75-110, 2003

c M. Raynal
Logical time in distributed systems 32
Part III

VECTOR TIME

- Fidge C., Timestamps in Message Passing Systems that Preserve the Partial Ordering. Proc. 11th Australian Computer Science Conference, pp. 56-66, 1988
- Mattern F., Virtual time and global states of distributed systems. Proc. Int’l work-
shop on Parallel and Distributed Systems, North-Holland, pp. 215-226, (Cosnard,
Quinton, Raynal and Robert Eds), 1988
- Baldoni R. and Raynal M. Fundamentals of Distributed Computing: A Practical
Tour of Vector-Clock Systems. IEEE Distributed Systems Online, 3(2):1-18, 2002

c M. Raynal
Logical time in distributed systems 33
Aim: capture the causality relation

• Scalar (linear) clock system

⋆ Respects causality
⋆ But does not capture it

• Find a dating system that captures causality exactly

(e −→ev f) ⇔ date(e) < date(f)

(e||f ) ⇔ date(e) and date(f ) cannot be compared

c M. Raynal
Logical time in distributed systems 34
Vector clock: intuition

• Observation: a process pi can always measure its progress


by counting the number of events it has produced since
the beginning
This number can be seen as its logical local clock
There is one such clock per process
• The time domain is consequently n-dimensional: there
is one dimension associated with each process
• Hence the idea of vector clocks: each process pi manages a vector VCi[1..n] that represents its view of the global time progress
VCi is a digest of the current causal past of pi

c M. Raynal
Logical time in distributed systems 35
Vector clock: definition

• VCi[i] = nb of events issued by pi

• VCi[j] = nb of events issued by pj, as known by pi

Formally, let e be the current event produced by pi

VCi[j] = |{f | f −→ e ∧ f has been issued by pj}|

• Notation: component-wise maximum/minimum

max(V1, V2) = [max(V1[1], V2[1]), · · · , max(V1[n], V2[n])]

min(V1, V2) = [min(V1[1], V2[1]), · · · , min(V1[n], V2[n])]

c M. Raynal
Logical time in distributed systems 36
Vector clock: algorithm

Local progress rule:

before producing an internal event:
VCi[i] ← VCi[i] + 1

Sending rule:
when sending a message m to pj:
VCi[i] ← VCi[i] + 1;
send (m, VCi) to pj

Receiving Rule:
when receiving a message (m, VC) from pj:
VCi[i] ← VCi[i] + 1;
VCi ← max(VCi, VC)

c M. Raynal
Logical time in distributed systems 37
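A minimal sketch of these three rules (Python, not from the slides; processes are 0-indexed):

class VectorClock:
    def __init__(self, i, n):
        self.i = i
        self.vc = [0] * n

    def internal_event(self):
        self.vc[self.i] += 1                 # local progress rule
        return list(self.vc)                 # date of the event

    def send_event(self):
        self.vc[self.i] += 1                 # sending rule
        return list(self.vc)                 # vector piggybacked on the message

    def receive_event(self, vc_msg):
        self.vc[self.i] += 1                 # receiving rule
        self.vc = [max(a, b) for a, b in zip(self.vc, vc_msg)]
        return list(self.vc)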
Illustration

[Space-time diagram: four processes p1, . . . , p4; each event is labelled with the current vector clock value, e.g. [1, 2, 0, 0] on p1, [0, 3, 0, 0] on p2, [0, 3, 2, 2] on p3, [0, 3, 0, 2] on p4.]

∀ i, k: VCi[k] is not decreasing, and VCi[k] ≤ VCk[k]

c M. Raynal
Logical time in distributed systems 38
A few simple definitions

• V1 ≤ V2 =def ∀ k : V1[k] ≤ V2[k]
• V1 < V2 =def (V1 ≤ V2) ∧ (V1 ≠ V2)
• V1 || V2 =def ¬(V1 ≤ V2) ∧ ¬(V2 ≤ V1)

c M. Raynal
Logical time in distributed systems 39
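The three relations, written directly on Python lists (a small sketch, not part of the slides):

def vec_leq(v1, v2):          # V1 <= V2
    return all(a <= b for a, b in zip(v1, v2))

def vec_lt(v1, v2):           # V1 < V2
    return vec_leq(v1, v2) and v1 != v2

def vec_concurrent(v1, v2):   # V1 || V2
    return not vec_leq(v1, v2) and not vec_leq(v2, v1)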
The vector clock properties

Let e with date(e) = Ve , and f with date(f ) = Vf

(e −→ev f) ⇔ (Ve < Vf)

(e || f) ⇔ (Ve || Vf)

These are the fundamental properties provided by vector clocks

c M. Raynal
Logical time in distributed systems 40
Proof (1)

• Theorem 1: Vector clocks increase along causal paths


• Theorem 2: (e −→ev f) ⇔ (Ve < Vf)

⋆ (e −→ev f) ⇒ (Ve < Vf): follows from Theorem 1.
⋆ (Ve < Vf) ⇒ (e −→ev f):
Let pi be the process that issued the event e. We have (Ve < Vf) ⇒ (Ve[i] ≤ Vf[i]). As only pi can entail an increase of V[i] (for any vector V), it follows that there is a causal path from e to f.

c M. Raynal
Logical time in distributed systems 41
Proof and cost

• Theorem 3: (e || f) ⇔ (Ve || Vf).
(e || f) =def ¬(e −→ev f) ∧ ¬(f −→ev e) (definition).

⋆ ¬(e −→ev f) ⇒ ¬(Ve < Vf).
⋆ ¬(f −→ev e) ⇒ ¬(Vf < Ve).

From which it follows that Ve and Vf cannot be compared.

• Theorem 4: The previous (causality/independence) pred-


icates require O(n) comparisons

c M. Raynal
Logical time in distributed systems 42
Refining the causality test

• Let us associate a timestamp (Ve, i) with each event e, where pi is the process that issued e
• Let e be timestamped (Ve, i) and f be timestamped (Vf, j)

• Refined causality test:

(e −→ev f) ⇔ (Ve[i] ≤ Vf[i])

• Refined independence test:

(e || f) ⇔ (Ve[i] > Vf[i]) ∧ (Vf[j] > Ve[j])

• Theorem 4: The previous (causality/independence) predicates require O(1) comparisons (scalability of the test)

c M. Raynal
Logical time in distributed systems 43
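A sketch of the refined O(1) tests (Python, not from the slides), for two distinct events e and f timestamped (Ve, i) and (Vf, j):

def causally_precedes(ve, i, vf, j):
    return ve[i] <= vf[i]                    # e -> f  iff  Ve[i] <= Vf[i]

def independent(ve, i, vf, j):
    return ve[i] > vf[i] and vf[j] > ve[j]   # e || f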
A process is a “local” observer

[Figure: the same two-process execution, now with vector dates [1, 1], [2, 1] on p1 and [0, 1], [0, 2], [2, 3] on p2, together with the lattice of global states from Σinit = [0, 0] to Σfinal = [2, 3].]

A process is an observer of the computation

c M. Raynal
Logical time in distributed systems 44
A vector clock denotes a global state

[Figure: the same execution and lattice of global states; the global states Σa = [2, 1] and Σb = [1, 2] combine into Σc = max(Σa, Σb), i.e. [2, 2] = max([2, 1], [1, 2]), on the way to Σfinal = [2, 3].]

c M. Raynal
Logical time in distributed systems 45
The development of logical time (1)

[Figure: pi sends m when Vi[i] = s; pj receives it when Vj[j] = r, hence Vj[i] ≥ s; a causal path e → f then extends to pk.]

• m: sent by pi at Vi[i] = s, received by pj at Vj [j] = r


• “Knowing” the receipt of m ⇒ “knowing” its sending
• I.e., for any event x: (Vx[j] ≥ r) ⇒ (Vx[i] ≥ s)
• Due to m it is impossible to have (Vx[j] ≥ r) ∧ (Vx[i] < s)

c M. Raynal
Logical time in distributed systems 46
The development of logical time (2)

[Figure: a two-process execution with messages m1, m2, m3, m4, with vector dates (1, 0), (2, 0), . . . , (6, 5) on p1 and (0, 1), . . . , (3, 6) on p2, together with the corresponding region of the two-dimensional time domain: m1 makes it impossible to have (V[1] < 2) ∧ (V[2] ≥ 6).]

c M. Raynal
Logical time in distributed systems 47
Part IV

VECTOR CLOCKS in ACTION (1)

CAUSAL ORDER ABSTRACTION

c M. Raynal
Logical time in distributed systems 48
Causal order abstraction

- Birman K., Schiper A. and Stephenson P., Lightweight causal and atomic group multicast. ACM Transactions on Computer Systems, 9(3):272-314, 1991
- Raynal M., Schiper A. and Toueg S., The causal ordering abstraction and a simple
way to implement it. Information Processing Letters, 39:343-351, 1991

• co broadcast(m): allows a process to send a message m to all the processes

• co deliver(): allows a process to deliver a message

c M. Raynal
Logical time in distributed systems 49
Causal delivery: definition

• Termination: If a message is co broadcast, it is even-


tually co delivered (No loss)

• Integrity: A process co delivers a message m at most


once (No duplication)

• Validity: If a process co delivers a message m, then m


has been co broadcast (No spurious message)

• Causal Order:
co broadcast(m1) → co broadcast(m2)
⇒ co del(m1) → co del(m2)

c M. Raynal
Logical time in distributed systems 50
Causal delivery: Why it is useful

• Capture causality

• Cooperative work

• Stronger than fifo channels

• But weaker than atomic broadcast

Atomic broadcast = total order delivery

c M. Raynal
Logical time in distributed systems 51
Causal order: Example 1

c M. Raynal
Logical time in distributed systems 52
Causal order: Example 2

c M. Raynal
Logical time in distributed systems 53
Causal broadcast

VCi[j] = nb of messages broadcast by Pj (to pi's knowledge)

[Figure: five causally related broadcast messages m1, m2, m3, m4, m5.]

• VCm2 = [1, 1, 0, 0], VCm3 = [1, 1, 1, 0]

• VCm4 = [1, 2, 0, 0], VCm5 = [1, 2, 2, 0]

c M. Raynal
Logical time in distributed systems 54
Illustration

[Figure: an execution of the causal broadcast over three processes p1, p2, p3; each message carries the sender's current vector of per-process broadcast counts.]
c M. Raynal
Logical time in distributed systems 55
RST algorithm

operation co broadcast(m)
for each j ≠ i do send (m, VCi) to pj end for;
VCi[i] ← VCi[i] + 1

when (m, m.VC) is received from pj:

wait until (∀ k : VCi[k] ≥ m.VC[k]);
co deliver m to the application;
VCi[j] ← VCi[j] + 1

c M. Raynal
Logical time in distributed systems 56
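A sketch of the RST rules (Python, not from the slides); the transport send and the application callback deliver are assumptions, and the "wait until" of the algorithm is realized with a pending buffer:

class CausalBroadcast:
    def __init__(self, i, n, send, deliver):
        self.i, self.n = i, n
        self.send, self.deliver = send, deliver
        self.vc = [0] * n        # vc[j] = nb of messages from p_j already co-delivered here
        self.pending = []        # received but not yet deliverable: (m, m_vc, sender)

    def co_broadcast(self, m):
        for j in range(self.n):
            if j != self.i:
                self.send((m, list(self.vc)), j)
        self.vc[self.i] += 1     # the local broadcast counts as co-delivered locally

    def on_receive(self, m, m_vc, j):
        self.pending.append((m, m_vc, j))
        self._try_deliver()

    def _try_deliver(self):
        progress = True
        while progress:          # deliver every message whose causal past is satisfied
            progress = False
            for entry in list(self.pending):
                m, m_vc, j = entry
                if all(self.vc[k] >= m_vc[k] for k in range(self.n)):
                    self.pending.remove(entry)
                    self.deliver(m)
                    self.vc[j] += 1
                    progress = True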
Part V

VECTOR CLOCKS in ACTION (2)

PREDICATE DETECTION

c M. Raynal
Logical time in distributed systems 57
Stable Local Predicate Detection (1)

• Local predicate LPi: on the local variables of a single


process pi
• Stable predicate: once true, remains true
• A consistent global state Σ = (σ1, · · · , σn) satisfies the global predicate LP1 ∧ LP2 ∧ · · · ∧ LPn if ∀ i : (σi |= LPi)

Σ |= (∧i LPi) ⇔ ∧i (σi |= LPi)

c M. Raynal
Logical time in distributed systems 58
Stable Local Predicate Detection (2)

• Problem: Design an algorithm that detects the first


consistent global state that satisfies a conjunction of
stable local predicates
• Constraints: Do not use additional control messages,
Detection must be done on the fly

c M. Raynal
Logical time in distributed systems 59
Stable Local Predicate Detection (3)

[Figure: a three-process execution with messages m1, . . . , m5; each process Pi goes through the local states σi0, . . . , σixi, . . . , σiyi.]

c M. Raynal
Logical time in distributed systems 60
Stable Local Predicate Detection (4)

[Figure: the global state (σ1y1, σ2y2, σ3y3) together with the messages m1, m2, m3.]

c M. Raynal
Logical time in distributed systems 61
Detection algorithm: local context of pi

• V Ci[1..n]: local vector clock


• SATi: set of process identities such that
j ∈ SATi ⇔ pj entered a local state σjx from which LPj is true

• FIRSTi: first global state (as known by pi) in which all the local predicates LPj such that j ∈ SATi are satisfied

c M. Raynal
Logical time in distributed systems 62
Detection algorithm (1)

procedure detected? is
if SATi = {1, 2, . . . , n} then
FIRSTi defines the first consistent global state Σ that satisfies ∧j LPj
fi

procedure check_LPi is

if (σix |= LPi) then SATi := SATi ∪ {i};
FIRSTi := VCi;
donei := true;
detected?
fi

c M. Raynal
Logical time in distributed systems 63
Detection algorithm (2)

(S1) when Pi produces an internal event (e)

VCi[i] := VCi[i] + 1;
execute e and move to σ;
if ¬donei then check_LPi fi

c M. Raynal
Logical time in distributed systems 64
Detection algorithm (3)

(S2) when Pi produces a send event (e = send m to Pj)

VCi[i] := VCi[i] + 1;
move to σ;
if ¬donei then check_LPi fi;
m.VC := VCi; m.SAT := SATi; m.FIRST := FIRSTi;
send (m) to Pj
% m carries m.VC, m.SAT and m.FIRST %

c M. Raynal
Logical time in distributed systems 65
Detection algorithm (4)

(S3) when Pi produces a receive event (e = receive(m))

VCi[i] := VCi[i] + 1; VCi := max(VCi, m.VC);
move to σ; % by delivering m to the process %
if ¬donei then check_LPi fi;
if ¬(m.SAT ⊆ SATi) then
SATi := SATi ∪ m.SAT;
FIRSTi := max(FIRSTi, m.FIRST);
detected?
fi

c M. Raynal
Logical time in distributed systems 66
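A sketch of the detection state kept by Pi and of the merge of rule (S3) (Python, not from the slides; lp_i and the surrounding event handling are assumptions):

class StableConjunctionDetector:
    def __init__(self, i, n, lp_i):
        self.i, self.n, self.lp_i = i, n, lp_i
        self.vc = [0] * n                 # VC_i
        self.sat = set()                  # SAT_i
        self.first = [0] * n              # FIRST_i
        self.done = False                 # done_i

    def check_lp(self, local_state):      # check_LP_i
        if not self.done and self.lp_i(local_state):
            self.sat.add(self.i)
            self.first = list(self.vc)
            self.done = True
        return self.detected()

    def merge(self, m_sat, m_first):      # second part of rule (S3)
        m_sat = set(m_sat)
        if not m_sat <= self.sat:
            self.sat |= m_sat
            self.first = [max(a, b) for a, b in zip(self.first, m_first)]
        return self.detected()

    def detected(self):                   # detected?
        # when true, FIRST_i is the first consistent global state satisfying the conjunction
        return self.sat == set(range(self.n))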
Part VI

LIMIT of VECTOR CLOCKS

DETECTION OF A SIMPLE
EVENT PATTERN

-Raynal M., Illustrating the Use of Vector Clocks in Property Detection: an Example
and a Counter-Example. Proc. 5th Int’l European Parallel Computing Conference
(EUROPAR’99), Springer LNCS 1685, pp. 806-814, 1999

c M. Raynal
Logical time in distributed systems 67
Pattern Recognition (1)

• Some internal events are tagged black, the others are


tagged white
• All communication events are tagged white
• The problem: Given two black events s and t, does there exist a black event u such that s → u ∧ u → t
• Formally, P(s, t) is the conjunction of:

⋆ (black(s) ∧ black(t))

⋆ (∃ u ≠ s, t : (black(u) ∧ (s → u ∧ u → t)))

c M. Raynal
Logical time in distributed systems 68
Pattern Recognition

[Figure: two executions with black events s, u, t; in the left one P(s, t) is false, in the right one P(s, t) is true.]

When counting white and black events: s.VC = (0, 0, 2) and t.VC = (3, 4, 2) in both cases

When counting only black events: s.VC = (0, 0, 1) and t.VC = (2, 1, 1) in both cases

c M. Raynal
Logical time in distributed systems 69
Non-Triviality of the Problem

[Figure: three processes — P1 produces the black events a, t1, t2; P2 produces b, u, c; P3 produces s, d.]

P(s, t2) is true while P(s, t1) is not

c M. Raynal
Logical time in distributed systems 70
Decomposing the Predicate

• P(s, t) ≡ (∃u : P1(s, u, t) ∧ P2(s, u, t))

⋆ P1(s, u, t) ≡ (black(s) ∧ black(u) ∧ black(t))

⋆ P2(s, u, t) ≡ (s → u ∧ u → t)

c M. Raynal
Logical time in distributed systems 71
Using Vector of Vector Clocks

• Only black events are relevant: count only them


• Event e:

⋆ e.VC: its vector timestamp (counting only black events)

⋆ e.MC[1..n]: an array of vector timestamps

e.MC[j] contains the vector timestamp of the last black event of Pj that causally precedes e
e.MC[j] can be considered as a "pointer" from e to the last event that precedes it on Pj

c M. Raynal
Logical time in distributed systems 72
Example (1)

[Figure: the execution of the previous slide — P1 produces a, t1, t2; P2 produces b, u, c; P3 produces s, d.]

t1.MC[1] = a.VC means that t1.MC[1] points to a

t1.MC[2] = b.VC means that t1.MC[2] points to b
t1.MC[3] = s.VC means that t1.MC[3] points to s

c M. Raynal
Logical time in distributed systems 73
Example (2)

[Figure: the same execution.]

t2.MC[1] = t1.VC means that t2.MC[1] points to t1

t2.MC[2] = u.VC means that t2.MC[2] points to u
t2.MC[3] = s.VC means that t2.MC[3] points to s

c M. Raynal
Logical time in distributed systems 74
Operational Predicate

• Event s: s.VC and s.MC; event t: t.VC and t.MC

• P1 is trivially satisfied by any triple of events
• (∃u : P2(s, u, t)) ≡ (∃u : s → u → t) can be restated as:
(∃u : s → u → t) ≡ (∃u : s.VC < u.VC < t.VC)
(∃u : s → u → t) ≡ (∃pk : s.VC < t.MC[k] < t.VC)
As ∀ k : t.MC[k] < t.VC, we get the operational predicate:

P(s, t) ≡ (∃ k : s.VC < t.MC[k])

c M. Raynal
Logical time in distributed systems 75
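A small sketch of this operational test (Python, not from the slides); s_vc is s.VC and t_mc is t.MC, both counting black events only:

def vec_strictly_less(v1, v2):
    return all(a <= b for a, b in zip(v1, v2)) and v1 != v2

def pattern_p(s_vc, t_mc):
    # P(s, t): does some black event u exist with s -> u and u -> t?
    return any(vec_strictly_less(s_vc, mc_k) for mc_k in t_mc)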
The Protocol (1)

(S1) when Pi produces a black event (e)

VCi[i] := VCi[i] + 1; % one more black event on Pi %
e.VC := VCi; e.MC := MCi;
MCi[i] := VCi
% vector timestamp of Pi's last black event %

c M. Raynal
Logical time in distributed systems 76
The Protocol (2)

(S2) when Pi executes a send event (e = send m to Pj)

m.VC := VCi; m.MC := MCi;
send (m) to Pj % m carries m.VC and m.MC %

c M. Raynal
Logical time in distributed systems 77
The Protocol (3)

(S3) when Pi executes a receive event (e = receive(m))

VCi := max(VCi, m.VC);
% update of the local vector clock %
∀ k : MCi[k] := max(MCi[k], m.MC[k])
% record vector timestamps of last black predecessors %

c M. Raynal
Logical time in distributed systems 78
What has been learnt

• Power of vector clocks: to track (counter-based) causality: "First Order" predecessor tracking
• Limitation of vector clocks: to solve problems where causality cannot be reduced to event counting: "Second Order" predecessor (or more) tracking

c M. Raynal
Logical time in distributed systems 79
Part VII

VECTOR CLOCKS in ACTION (3)

DETERMINING IMMEDIATE
PREDECESSORS

- Anceaume E. Helary J.-M. and Raynal M. A Note on the Determination of the


Immediate Predecessors in a Distributed Computation. Int. Journal of Foundations
of Computer Science (IJFCS), 13(6):865-972, 2002
- Helary J.-M., Raynal M., Melideo G., and Baldoni R., Efficient Causality-Tracking
Timestamping. IEEE Transactions on Knowledge and Data Engineering, 15(5):1239-
1250, 2003

c M. Raynal
Logical time in distributed systems 80
Relevant Events

• At some abstraction level only some events of a distributed computation are relevant
• Let R ⊆ H be the set of relevant events
• Let → be the relation on R defined in the following way:
∀ (e, f) ∈ R × R : (e → f) ⇔ (e −→ev f).

• The poset (R, →) constitutes an abstraction of the distributed computation

Without loss of generality we consider that the set of relevant events is a subset of the internal events (if a communication event has to be observed, a relevant internal event can be generated just after the corresponding communication event occurred)

c M. Raynal
Logical time in distributed systems 81
A Distributed Computation

[Figure: a three-process computation in which only some internal events (marked in black) are relevant.]

c M. Raynal
Logical time in distributed systems 82
Vector Clocks (2)

VC0 VCi[1..n] initialized to [0, . . . , 0]

VC1 Each time pi produces a relevant event e:

⋆ It increments its vector clock entry VCi[i] to indicate its progress: VCi[i] := VCi[i] + 1

⋆ It associates with e its timestamp e.VC = VCi

VC2 When a process pi sends a message m, it attaches to it the current value of VCi (let m.VC denote this value)
VC3 When pi receives a message m, it updates its vector clock: ∀ x : VCi[x] := max(VCi[x], m.VC[x])

c M. Raynal
Logical time in distributed systems 83
Vector Clocks (3)

• VCi = current knowledge of pi on the progress of each process Pk (measured by VCi[k])

• More precisely:

VCi[k] = number of relevant events produced by pk and known by pi

c M. Raynal
Logical time in distributed systems 84
Vector Clocks: Example

[Figure: the relevant events, labelled (i, x) for the x-th relevant event of Pi, with their vector timestamps: [1,0,0], [2,0,1], [3,2,1] on P1; [1,1,0], [2,2,1], [2,3,1] on P2; [0,0,1], [1,1,2] on P3.]

c M. Raynal
Logical time in distributed systems 85
Immediate Predecessor Tracking: the Problem

• Given two relevant events e and f, we say that e is an immediate predecessor of f if:

⋆ e → f, and

⋆ ∄ relevant event g such that e → g → f

• The Immediate Predecessor Tracking (IPT) problem consists in associating with each relevant event e the set of relevant events that are its immediate predecessors
Moreover, this has to be done on the fly and without additional control messages (i.e., without modifying the communication pattern of the computation)

c M. Raynal
Logical time in distributed systems 86
Immediate Predecessor Tracking: Why?

• Capture the very structure of the causal past


of each event
• Allow the analysis of distributed computations
(e.g., detection of global predicates, analysis
of control flows)

c M. Raynal
Logical time in distributed systems 87
Distributed Computation and its Reduction

[Figure: the same computation with its vector timestamps (left) and its reduction to the relevant events (1, 1), . . . , (3, 2) only (right).]

c M. Raynal
Logical time in distributed systems 88
Transitive Reduction (Hasse Diagram)

[Figure: the transitive reduction (Hasse diagram) of the causal relation on the relevant events (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2).]

c M. Raynal
Logical time in distributed systems 89
Basic IPT Protocol (1)

Each pi manages:

• A vector clock VCi
• A boolean array IPi whose meaning is:

(IPi[j] = 1) ⇔ the last relevant event produced by pj and known by pi is an immediate predecessor of pi's current event

c M. Raynal
Logical time in distributed systems 90
Basic IPT Protocol (2)

R0 Both VCi[1..n] and IPi[1..n] are initialized to [0, . . . , 0]

R1 Each time pi produces a relevant event e:

⋆ It increments its VC entry: VCi[i] := VCi[i] + 1

⋆ It associates with e the timestamp
e.TS = {(k, VCi[k]) | IPi[k] = 1}

⋆ It resets IPi: ∀ ℓ ≠ i : IPi[ℓ] := 0; IPi[i] := 1

R2 When pi sends a message m to pj, it attaches to m the current values of VCi (denoted m.VC) and the boolean array IPi (denoted m.IP)

c M. Raynal
Logical time in distributed systems 91
How to Manage the IPi Vectors? (1)

[Figure: an execution over P1, P2, P3 illustrating how the IPi vectors evolve.]

c M. Raynal
Logical time in distributed systems 92
How to Manage the IPi Vectors? (2)

[Figure: a second execution over P1, P2, P3 illustrating the management of the IPi vectors.]

c M. Raynal
Logical time in distributed systems 93
Basic IPT Protocol (3)

R3 When it receives a message m from pj, pi executes the following updates:

∀ k : case
VCi[k] < m.VC[k] then VCi[k] := m.VC[k];
                      IPi[k] := m.IP[k]
VCi[k] = m.VC[k] then IPi[k] := min(IPi[k], m.IP[k])
VCi[k] > m.VC[k] then skip
end case

c M. Raynal
Logical time in distributed systems 94
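A sketch of the merge rule R3 (Python, not from the slides); vc_i and ip_i are Pi's local vectors (updated in place), m_vc and m_ip are carried by the message:

def ipt_receive(vc_i, ip_i, m_vc, m_ip):
    for k in range(len(vc_i)):
        if vc_i[k] < m_vc[k]:                 # the sender knows a fresher event of p_k
            vc_i[k] = m_vc[k]
            ip_i[k] = m_ip[k]
        elif vc_i[k] == m_vc[k]:              # same last event of p_k: it stays an immediate
            ip_i[k] = min(ip_i[k], m_ip[k])   # predecessor only if both sides agree
        # else: the local information about p_k is fresher -> skip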
Efficient IPT? (1)

• Question: Is it possible to design an IPT protocol that does not require each message m to carry a vector clock m.VC and a boolean vector m.IP whose size is always n?
• Answer: Yes! ... but how???

c M. Raynal
Logical time in distributed systems 95
Efficient IPT (2): Towards a General Condition

Underlying intuition:

[Figure: Pi sends m while VCi[k] = x and IPi[k] = 1; when Pj receives m it already has VCj[k] ≥ x.]

c M. Raynal
Logical time in distributed systems 96
Efficient IPT (3): a General Condition

• Let e.Xi = value of the variable Xi of pi when it produces e

• Let K(m, k) be the following predicate:
(1) (send(m).VCi[k] = 0)
(2) ∨ (send(m).VCi[k] < pred(receive(m)).VCj[k])
(3) ∨ ((send(m).VCi[k] = pred(receive(m)).VCj[k]) ∧ (send(m).IPi[k] = 1))

c M. Raynal
Logical time in distributed systems 97
Efficient IPT (3): a General Condition

• Theorem 1: The condition K(m, k) is both necessary and sufficient to omit the transmission of VCi[k] and IPi[k] when m is sent by pi to pj

c M. Raynal
Logical time in distributed systems 98
Efficient IPT (4): Towards a Concrete Condition

• K(m, k) involves events on two processes (send(m) at


pi and receive(m) at pj ), and consequently cannot be
atomically evaluated by a single process
• Replace it by a “concrete” condition C(m, k) that:

⋆ Can be locally evaluated by a process just before it


sends a message, and

⋆ Is a correct approximation of K(m, k), i.e., C(m, k)


has to be such that ∀m, k: C(m, k) ⇒ K(m, k)

c M. Raynal
Logical time in distributed systems 99
Efficient IPT (5): Towards a Concrete Condition

• The “constant” condition ∀(m, k) : KC(m, k) = f alse


works
It is actually the trivially correct approximation of K
that corresponds to the basic IPT protocol d in which
each message m carries a whole vector clock m.V C and
a whole boolean vector m.IP
• Let us equip each process Pi with an additional matrix
Mi of 0/1 values such that
(Mi[j, k] = 1) ⇔ (to Pi’s knowledge: V Cj [k] ≥ V Ci[k])

c M. Raynal
Logical time in distributed systems 100
An Implementation of the Matrices Mi

M0 ∀ (j, k) : Mi[j, k] is initialized to 1

M1 Each time it produces a relevant event e, pi resets the i-th column of its boolean matrix: ∀ j ≠ i : Mi[j, i] := 0
M2 When pi sends a message: no update of Mi occurs.
M3 When it receives a message m from pj, pi executes the following updates (m.VC is carried by m):

∀ k : case VCi[k] < m.VC[k] then ∀ ℓ ≠ i, j, k : Mi[ℓ, k] := 0;
                                 Mi[j, k] := 1
           VCi[k] = m.VC[k] then Mi[j, k] := 1
           VCi[k] > m.VC[k] then skip
end case

c M. Raynal
Logical time in distributed systems 101
A Concrete Condition

• Let m be a message sent by pi to pj and C(m, k) =

((send(m).Mi[j, k] = 1) ∧ (send(m).IPi[k] = 1))
∨ (send(m).VCi[k] = 0)

• Theorem 2: ∀k : C(m, k) ⇒ K(m, k)

c M. Raynal
Logical time in distributed systems 102
An Efficient IPT Protocol (1)

RM0 Both VCi[1..n] and IPi[1..n] are set to [0, . . . , 0], and ∀ (j, k) : Mi[j, k] is set to 1
RM1 Each time pi produces a relevant event e:

⋆ It increments its VC entry: VCi[i] := VCi[i] + 1

⋆ It associates with e the timestamp
e.TS = {(k, VCi[k]) | IPi[k] = 1}

⋆ It resets IPi: ∀ ℓ ≠ i : IPi[ℓ] := 0; IPi[i] := 1

⋆ It resets the i-th column of Mi: ∀ j ≠ i : Mi[j, i] := 0

c M. Raynal
Logical time in distributed systems 103
An Efficient IPT Protocol (2)

RM2 When pi sends a message m to pj, it attaches to m the set of triples {(k, VCi[k], IPi[k])} where k is such that (Mi[j, k] = 0 ∨ IPi[k] = 0) ∧ (VCi[k] > 0)
RM3 When pi receives a message m from pj, it executes:

∀ (k, m.VC[k], m.IP[k]) carried by m:

case VCi[k] < m.VC[k] then VCi[k] := m.VC[k];
                           IPi[k] := m.IP[k];
                           ∀ ℓ ≠ i, j, k : Mi[ℓ, k] := 0;
                           Mi[j, k] := 1
     VCi[k] = m.VC[k] then IPi[k] := min(IPi[k], m.IP[k]);
                           Mi[j, k] := 1
     VCi[k] > m.VC[k] then skip
end case

c M. Raynal
Logical time in distributed systems 104
Properties of the IPT Protocol

• Improvement: Transmitting rows of Mi allows the processes to have more entries of their matrices equal to 1, and hence to transmit fewer triples
• If one is not interested in the IPT problem, s/he can
suppress the IPi arrays. Then, we obtain an efficient
implementation of vector clocks (that does not require
fifo channels)
• A simulation study has shown the gains are substantial

c M. Raynal
Logical time in distributed systems 105
Part VIII

MATRIX CLOCKS

- Wuu G.T. and Bernstein A.J., Efficient solutions to the replicated log and dictionary problems. Proc. 3rd Int'l ACM Symposium on Principles of Distributed Computing (PODC'84), ACM Press, pp. 233-242, 1984

c M. Raynal
Logical time in distributed systems 106
Matrix clock

• Matrix clocks capture a "second order" knowledge

• Each process manages a time matrix MCi[1..n, 1..n]
• MCi[i, i] = nb of events produced by pi
• MCi[i, k] = nb of events produced by pk, to pi's knowledge (this is nothing else than pi's vector clock)
• MCi[j, k] = pi's knowledge of the nb of events produced by pk as known by pj

MCi[j, k] = x means

pi knows that pj knows that pk has issued x events

c M. Raynal
Logical time in distributed systems 107
Matrix clock: algorithm

Local progress rule:

before producing an internal event:
MCi[i, i] ← MCi[i, i] + 1

Sending rule:
when sending a message m to pj:
MCi[i, i] ← MCi[i, i] + 1;
send (m, MCi) to pj

Receiving Rule:
when receiving a message (m, MC) from pj:
MCi[i, i] ← MCi[i, i] + 1;
MCi[i, ∗] ← max(MCi[i, ∗], MC[j, ∗]);
for each k do MCi[k, ∗] ← max(MCi[k, ∗], MC[k, ∗]) end for

c M. Raynal
Logical time in distributed systems 108
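A minimal sketch of the matrix-clock rules (Python, not from the slides; processes are 0-indexed):

class MatrixClock:
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.mc = [[0] * n for _ in range(n)]     # MC_i[1..n, 1..n]

    def internal_event(self):
        self.mc[self.i][self.i] += 1

    def send_event(self):
        self.mc[self.i][self.i] += 1
        return [row[:] for row in self.mc]        # matrix piggybacked on the message

    def receive_event(self, mc_msg, j):
        self.mc[self.i][self.i] += 1
        # pi's own row absorbs what the sender pj knew about every process
        self.mc[self.i] = [max(a, b) for a, b in zip(self.mc[self.i], mc_msg[j])]
        # and every row k (pi's view of pk's knowledge) absorbs the sender's row k
        for k in range(self.n):
            self.mc[k] = [max(a, b) for a, b in zip(self.mc[k], mc_msg[k])]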
Illustration

[Figure: the matrix clock entries attached to an event e of pi: Me[i, i] = MCi[i, i], Me[i, k] = MCi[i, k] = MCi[k, k], Me[j, j] = MCi[i, j] = MCi[j, j], plus the second-order entries Me[j, k] and Me[k, j].]

c M. Raynal
Logical time in distributed systems 109
Properties

• Let min(MCi[1, k], . . . , MCi[n, k]) = x

⋆ This means: to pi's knowledge, all the processes know the past of pk up to its x-th event
⋆ This can be used by pi to forget the events of pk that are older than x + 1

• Let min(MCi[1, i], . . . , MCi[n, i]) = x

⋆ This means: to pi's knowledge, all the processes know its past up to its x-th event
⋆ This can be used by pi to forget its past events that are older than x + 1

• Matrix clocks allow processes to forget about the past (garbage collection)

c M. Raynal
Logical time in distributed systems 110
Matrix clocks in action: Message stability tracking

• Buffer management problem


• Processes broadcast messages
• For fault-tolerance: each message m has to be saved
by any process pi in a local buffer until pi knows all the
processes have delivered m
• A buffer has two operations: deposit(m) and discard(m)
• For simplicity reasons: consider fifo channels

c M. Raynal
Logical time in distributed systems 111
Message stability tracking: structure

[Figure: broadcast(m) and receive(m) both deposit(m) into the local buffer; deliver(m) passes m to the upper layer; discard(m) removes it from the buffer; send(m)/receive(m) are the underlying network operations.]

c M. Raynal
Logical time in distributed systems 112
Message stability tracking: control variables

• Each process pi manages a matrix MCi[1..n, 1..n]

• Only broadcast operations are relevant for the problem
• MCi[j, k]: pi's knowledge of the number of messages delivered by pj and broadcast by pk

• MCi[i, i]: number of messages broadcast (and delivered) by pi
Remark: MCi[i, i] is a sequence number
• Let x = min(MCi[1, k], . . . , MCi[n, k]): sequence number of the last message broadcast by pk that is stable (delivered by all the processes). The messages from pk that are in the buffer and whose sequence numbers are ≤ x can be discarded

c M. Raynal
Logical time in distributed systems 113
Message stability tracking: algorithm (1)

operation broadcast(m) (issued by pi)

MCi[i, i] ← MCi[i, i] + 1;
m.sender ← i;
m.VC ← MCi[i, ∗]; % vector clock %
% pi informs about the msgs it has delivered %
for each j ≠ i do send (m) to pj end for;
deposit(m)

c M. Raynal
Logical time in distributed systems 114
Message stability tracking: algorithm (2)

when receiving a message m:

deposit(m);
let j = m.sender;
MCi[j, ∗] ← m.VC;
MCi[i, j] ← MCi[i, j] + 1; % fifo channel %
deliver m to the upper layer

c M. Raynal
Logical time in distributed systems 115
Message stability tracking: algorithm (3)

when ∃ m ∈ buffer :
k = m.sender ∧
m.VC[k] ≤ min(MCi[1, k], . . . , MCi[n, k]) : discard(m)

• Each process (site) manages a matrix

• Each message carries only a vector

c M. Raynal
Logical time in distributed systems 116
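A sketch of the discard rule (Python, not from the slides); each buffered message is assumed to be represented as a dict {'sender': k, 'vc': m.VC}:

def discard_stable(buffer, mc_i):
    # Keep only the messages not yet known to be delivered by all processes.
    n = len(mc_i)
    kept = []
    for m in buffer:
        k = m["sender"]
        stable_upto = min(mc_i[j][k] for j in range(n))   # all processes delivered p_k's msgs up to here
        if m["vc"][k] > stable_upto:                      # not yet stable: keep it
            kept.append(m)
    return kept                                           # everything else can be discarded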
THAT’s ALL, FOLKS!

c M. Raynal
Logical time in distributed systems 117
