Unit 2 Notes
Logical clocks are based on capturing chronological and causal relationships of processes and
ordering events based on these relationships.
In a system of logical clocks, every process has a logical clock that is advanced using a set
of rules. Every event is assigned a timestamp and the causality relation between events can
be generally inferred from their timestamps.
The timestamps assigned to events obey the fundamental monotonicity property; that is, if
an event a causally affects an event b, then the timestamp of a is smaller than the timestamp
of b.
A Framework for a system of logical clocks
A system of logical clocks consists of a time domain T and a logical clock C. Elements of T form a
partially ordered set over a relation <. This relation is usually called happened before or
causal precedence.
The logical clock C is a function that maps an event e in a distributed system to an element
in the time domain T, denoted as C(e), such that the following property holds for any two
events ei and ej:
ei → ej ⟹ C(ei) < C(ej)
This monotonicity property is called the clock consistency condition. When T and C
satisfy the following stronger condition, the system of clocks is said to be strongly consistent:
ei → ej ⟺ C(ei) < C(ej)
Data structures:
Each process pi maintains data structures with the given capabilities:
• A local logical clock (lci), that helps process pi measure its own progress.
• A logical global clock (gci), that is a representation of process pi’s local view of the
logical global time. It allows this process to assign consistent timestamps to its local events.
Protocol:
The protocol ensures that a process’s logical clock, and thus its view of the global time,
is managed consistently with the following rules:
Rule 1: Governs how a process updates its local logical clock when it executes an event
(send, receive, or internal).
Rule 2: Decides how a process updates its global logical clock to update its view of the
global time and global progress. It dictates what information about the logical time is
piggybacked in a message and how this information is used by the receiving process to
update its view of the global time.
Basic properties of scalar time:
1. Consistency property: Scalar clocks always satisfy the monotonicity property. A clock value
only increases and never moves backward, so if ei → ej then C(ei) < C(ej). Hence scalar clocks
are consistent.
2. Total ordering: Scalar clocks can be used to totally order the events in a distributed system.
However, two or more events at different processes may have identical timestamps, so a
tie-breaking mechanism is essential to order such events. The tie breaking is done as follows:
• Process identifiers are linearly ordered.
• A tie is broken in favor of the process with the lower identifier value.
The timestamp of an event is denoted by a tuple (t, i), where t is its time of occurrence and i is
the identity of the process where it occurred.
The total order relation ≺ over two events x and y with timestamps (h, i) and (k, j), respectively,
is given by:
x ≺ y ⟺ (h < k or (h = k and i < j))
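The tie-breaking rule can be sketched in a few lines of Python. This is only an illustration: the names Stamp and precedes are hypothetical, and the code assumes the relation orders events first by timestamp and then by process identifier, with the lower identifier winning a tie.

```python
from typing import NamedTuple


class Stamp(NamedTuple):
    """Scalar timestamp (t, i): t is the clock value, i the process identifier."""
    t: int
    i: int


def precedes(x: Stamp, y: Stamp) -> bool:
    """Total order: x comes first if its time is smaller, or on a tie
    if its process identifier is smaller."""
    return x.t < y.t or (x.t == y.t and x.i < y.i)


# Three events; two of them share the timestamp 3.
events = [Stamp(3, 2), Stamp(2, 5), Stamp(3, 1)]

# Tuples compare lexicographically, which is the same (t, i) order.
ordered = sorted(events)
```

Because a named tuple compares field by field, sorting a list of stamps directly yields the total order without a custom comparator.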
3. Event Counting
If the increment value is always 1 and event e has a timestamp h, then h − 1 represents the
minimum logical duration, counted in units of events, required before producing the event e.
This is called the height of the event e. In other words, h − 1 events have been produced
sequentially before the event e, regardless of the processes that produced these events.
4. No strong consistency
Scalar clocks are not strongly consistent; that is, for two events ei and ej, C(ei) < C(ej)
does not imply ei → ej. The reason is that the logical local clock and logical global clock
of a process are squashed into one, resulting in the loss of causal dependency
information among events at different processes.
• Each time a process sends a message, it includes a copy of its own (incremented)
vector in the message.
• Each time a process receives a message, it increments its own counter in the vector by
one and updates each element in its vector by taking the maximum of the value in its
own vector counter and the value in the vector in the received message.
The time domain is represented by a set of n-dimensional non-negative integer vectors in vector
time.
Rule 2: Each message m is piggybacked with the vector clock vt of the sender
process at sending time. On the receipt of such a message (m,vt), process
pi executes the following sequence of actions:
1. update its global logical time
2. execute R1
3. deliver the message m
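The vector clock rules can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name VectorClock and its method names are hypothetical, and the increment applied before each event is assumed to be 1.

```python
from typing import List


class VectorClock:
    """Vector clock of process i in a system of n processes (illustrative names)."""

    def __init__(self, n: int, i: int) -> None:
        self.i = i
        self.vt: List[int] = [0] * n

    def tick(self) -> None:
        # R1: advance the local component before executing an event.
        self.vt[self.i] += 1

    def send(self) -> List[int]:
        # A send is an event: apply R1, then piggyback a copy of the vector.
        self.tick()
        return list(self.vt)

    def receive(self, vt_msg: List[int]) -> None:
        # R2, step 1: update the global view with a component-wise maximum.
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_msg)]
        # R2, step 2: execute R1.
        self.tick()
        # R2, step 3: delivering the message is left to the caller.


p0 = VectorClock(3, 0)
p1 = VectorClock(3, 1)
m = p0.send()     # p0 becomes [1, 0, 0]; the message carries [1, 0, 0]
p1.tick()         # internal event at p1: [0, 1, 0]
p1.receive(m)     # max gives [1, 1, 0], then R1 gives [1, 2, 0]
```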
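• If the process at which an event occurred is known, the test to compare two
timestamps can be simplified. If events x and y occurred at processes pi and pj and are
assigned timestamps vh and vk, respectively, then:
x → y ⟺ vh[i] ≤ vk[i]
x ∥ y ⟺ vh[i] > vk[i] and vh[j] < vk[j]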
2. Strong consistency
The system of vector clocks is strongly consistent; thus, by examining the vector timestamp
of two events, we can determine if the events are causally related.
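Strong consistency means a causality test needs nothing but the two vector timestamps. The sketch below assumes the usual component-wise comparison of vectors; the function names are hypothetical.

```python
from typing import Sequence


def happened_before(vh: Sequence[int], vk: Sequence[int]) -> bool:
    """vh < vk: every component is <= and at least one is strictly smaller."""
    return (all(a <= b for a, b in zip(vh, vk))
            and any(a < b for a, b in zip(vh, vk)))


def concurrent(vh: Sequence[int], vk: Sequence[int]) -> bool:
    """Two distinct events are concurrent if neither precedes the other."""
    return not happened_before(vh, vk) and not happened_before(vk, vh)
```

For example, timestamps [1, 0, 0] and [1, 2, 0] are causally related, while [2, 0, 0] and [1, 2, 0] are concurrent. A scalar clock could not tell these two cases apart.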
3. Event counting
If an event e has timestamp vh, then vh[j] denotes the number of events executed by process
pj that causally precede e.
Clock synchronization is the process of ensuring that physically distributed processors have a
common notion of time.
Due to different clock rates, the clocks at various sites may diverge with time, and
clock synchronization must be performed periodically to correct this clock skew in
distributed systems. Clocks are synchronized to an accurate real-time standard like UTC
(Universal Coordinated Time). Clocks that must not only be synchronized with each other
but also have to adhere to physical time are termed physical clocks. This degree of
synchronization additionally makes it possible to coordinate and schedule actions between
multiple computers connected to a common network.
Basic terminologies:
If Ca and Cb are two different clocks, then:
• Time: The time of a clock in a machine p is given by the function Cp(t), where Cp(t) = t
for a perfect clock.
• Frequency: Frequency is the rate at which a clock progresses. The frequency at time t
of clock Ca is Ca’(t).
• Offset: Clock offset is the difference between the time reported by a clock and the
real time. The offset of the clock Ca is given by Ca(t) − t. The offset of clock Ca
relative to Cb at time t ≥ 0 is given by Ca(t) − Cb(t).
• Skew: The skew of a clock is the difference in the frequencies of the clock and
the perfect clock. The skew of a clock Ca relative to clock Cb at time t is Ca’(t) −
Cb’(t).
• Drift (rate): The drift of clock Ca is the second derivative of the clock value with
respect to time, namely Ca’’(t). The drift of clock Ca relative to clock Cb at time t is
Ca’’(t) − Cb’’(t).
Clocking Inaccuracies
Physical clocks are synchronized to an accurate real-time standard like UTC
(Universal Coordinated Time). Due to the clock inaccuracy discussed above, a timer (clock)
is said to be working within its specification if:
1 − ρ ≤ dC/dt ≤ 1 + ρ
where the constant ρ is the maximum skew rate specified by the manufacturer.
Fig: Behavior of clocks
Fig a) Offset and delay estimation between processes from same server
Fig b) Offset and delay estimation between processes from different servers
Let T1, T2, T3, T4 be the values of the four most recent timestamps. The clocks A and B
are stable and running at the same speed. Let a = T1 − T3 and b = T2 − T4. If the network
delay difference from A to B and from B to A, called differential delay, is
small, the clock offset θ and roundtrip delay δ of B relative to A at time T4 are
approximately given by the following:
θ = (a + b) / 2
δ = a − b
Each NTP message includes the latest three timestamps T1, T2, and T3, while
T4 is determined upon arrival.
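The offset and delay estimate can be worked through numerically. The sketch below assumes the standard NTP estimates, offset = (a + b) / 2 and roundtrip delay = a − b, with a = T1 − T3 and b = T2 − T4. It also assumes that T3 and T4 are read on clock A (send and receive) and that T1 and T2 are read on clock B (receive and send). The function name is hypothetical.

```python
from typing import Tuple


def ntp_offset_delay(t1: float, t2: float, t3: float, t4: float) -> Tuple[float, float]:
    """Estimate the offset and roundtrip delay of B relative to A.

    Assumed convention: A sends at T3 (A's clock), B receives at T1 and
    replies at T2 (B's clock), A receives the reply at T4 (A's clock).
    """
    a = t1 - t3            # forward delay plus offset
    b = t2 - t4            # offset minus return delay
    offset = (a + b) / 2
    delay = a - b
    return offset, delay


# B's clock runs 5 units ahead of A; each direction takes 2 units.
# A sends at 100, B receives at 107 (B time), B replies at 110 (B time),
# and A receives at 107 (A time).
offset, delay = ntp_offset_delay(t1=107, t2=110, t3=100, t4=107)
```

With symmetric delays the estimate is exact: the offset comes out as 5 and the roundtrip delay as 4. When the two directions differ, the error in the offset is half the differential delay.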
(i) non-FIFO
(ii) FIFO
(iii) causal order
(iv) synchronous order
There is always a trade-off between concurrency and ease of use and implementation.
Asynchronous Executions
An asynchronous execution (or A-execution) is an execution (E, ≺) for which the causality relation
is a partial order.
• There cannot be any causality cycles in an asynchronous execution.
• Messages can be delivered in any order, not necessarily FIFO.
• Although a physical link delivers the messages sent on it in FIFO order due
to the physical properties of the medium, a logical link may be formed as a composite of
physical links, and multiple paths may exist between the two end points of the logical
link.
Fig 2.1: a) FIFO executions b) non FIFO executions
FIFO executions
Fig: CO Execution
• If two send events s and s’ are related by causality ordering (not physical time
ordering), then a causally ordered execution requires that their corresponding receive
events r and r’ occur in the same order at all common destinations.
Applications of causal order:
Causal order is useful for applications that require updates to shared data, for implementing
distributed shared memory, and for fair resource allocation in distributed mutual exclusion.
If send(m1) ≺ send(m2), then for each common destination d of messages m1 and m2,
deliverd(m1) ≺ deliverd(m2) must be satisfied.
Other properties of causal ordering
1. Message Order (MO): A MO execution is an A-execution in which, for all
Synchronous Execution
• When all the communication between pairs of processes uses synchronous send and
receive primitives, the resulting order is the synchronous order.
• Synchronous communication always involves a handshake between the receiver
and the sender, so the handshake events may appear to occur instantaneously and
atomically.
Fig) Execution in an asynchronous system
Fig) Execution in a synchronous system
An execution (E, ≺) is synchronous if and only if there exists a mapping from E to T (scalar timestamps)
such that
• for any message M, T(s(M)) = T(r(M));
• for each process Pi, if ei ≺ ei′ then T(ei) < T(ei′).
results a deadlock.
• There is no semantic dependency between the send and the immediately following
receive at each of the processes. If the receive call at one of the processes can be
scheduled before the send call, then there is no deadlock.
Rendezvous
Rendezvous systems are a form of synchronous communication among an arbitrary
number of asynchronous processes. All the processes involved meet with each other, i.e.,
communicate synchronously with each other at one time. Two types of rendezvous systems
are possible:
• Binary rendezvous: When two processes agree to synchronize.
• Multi-way rendezvous: When more than two processes agree to synchronize.
• Scheduling involves pairing of matching send and receive commands that are both
enabled. The communication events for the control messages under the covers do not
alter the partial order of the execution.
The message types used are: M, ack(M), request(M), and permission(M). Execution
events in the synchronous execution are only the send of the message M and the receive of the
message M. The send and receive events for the other message types, ack(M), request(M),
and permission(M), which are control messages, occur under the covers and are not part of
the synchronous execution. The messages request(M), ack(M), and permission(M) use M’s
unique tag; the message M itself is not included in these messages.
Pi executes send(M) and blocks until it receives ack(M) from Pj. The send event SEND(M)
now completes.
Any message M’ (from a higher priority process) and any request(M’) for
synchronization (from a lower priority process) received during the blocking period are
queued.
send(request(M)).
(i) If a message M’ arrives from a higher priority process Pk, Pi accepts M’ by scheduling a
RECEIVE(M’) event and then executes send(ack(M’)) to Pk.
(ii) If a request(M’) arrives from a lower priority process Pk, Pi executes
send(permission(M’)) to Pk and blocks waiting for the message M’. When M’ arrives, the
RECEIVE(M’) event is executed.
(2c) When the permission(M) arrives, Pi knows partner Pj is synchronized and Pi executes
send(M). The SEND(M) now completes.
(4) Message M arrival at Pi from a higher priority process Pj:
When Pi is unblocked, it dequeues the next (if any) message from the queue and processes it
as a message arrival (as per rules 3 or 4).
• All algorithms aim to reduce this log overhead, and the space and time overhead of
maintaining the log information at the processes.
• To distribute this log information, broadcast and multicast communication is used.
• Hardware-assisted or network layer protocol assisted multicast cannot efficiently
provide the following features:
➢ Application-specific ordering semantics on the order of delivery of messages.
➢ Adapting groups to dynamically changing membership.
➢ Sending multicasts to an arbitrary set of processes at each send event.
➢ Providing various fault-tolerance semantics
Propagation Constraint II: it is not known that a message has been sent to d in the causal
future of Send(M), and hence it is not guaranteed using a reasoning based on transitivity that
the message M will be delivered to d in CO.
The Propagation Constraints also imply that if either (I) or (II) is false, the information
“d ∈ M.Dests” must not be stored or propagated, even to remember that (I) or (II) has been
falsified:
▪ not in the causal future of Deliverd(Mi,a)
▪ not in the causal future of ek,c, where d ∈ Mk,c.Dests and there is no
other message sent causally between Mi,a and Mk,c to the same
destination d.
Fig 2.7 a) Send algorithm by Kshemkalyani–Singhal to optimally implement causal
ordering
The data structures maintained are sorted row–major and then column–major:
1. Explicit tracking:
▪ Tracking of (source, timestamp, destination) information for messages (i) not known to be
delivered and (ii) not guaranteed to be delivered in CO, is done explicitly using the
l.Dests field of entries in local logs at nodes and the o.Dests field of entries in messages.
▪ Sets li,a.Dests and oi,a.Dests contain explicit information of destinations to which Mi,a is
not guaranteed to be delivered in CO and is not known to be delivered.
▪ The information about d ∈ Mi,a.Dests is propagated up to the earliest events on all causal
paths from (i, a) at which it is known that Mi,a is delivered to d or is guaranteed to be
delivered to d in CO.
2. Implicit tracking:
▪ Tracking of messages that are either (i) already delivered, or (ii) guaranteed to be
delivered in CO, is performed implicitly.
▪ The information about messages (i) already delivered or (ii) guaranteed to be
delivered in CO is deleted and not propagated because it is redundant as far as
enforcing CO is concerned.
▪ It is useful in determining what information that is being carried in other messages
and is being stored in logs at other nodes has become redundant and thus can be
purged.
▪ The semantics are implicitly stored and propagated. This information about messages
that are (i) already delivered or (ii) guaranteed to be delivered in CO is tracked
without explicitly storing it.
▪ The algorithm derives it from the existing explicit information about messages (i) not
known to be delivered and (ii) not guaranteed to be delivered in CO, by examining
only oi,a.Dests or li,a.Dests, which is a part of the explicit information.
Multicast M4,3
At event (4, 3), the information P6 ∈ M5,1.Dests in Log4 is propagated on multicast M4,3 only
to process P6 to ensure causal delivery using the Delivery Condition. The piggybacked
information on message M4,3 sent to process P3 must not contain this information because of
constraint II. As long as any future message sent to P6 is delivered in causal order w.r.t.
M4,3 sent to P6, it will also be delivered in causal order w.r.t. M5,1. And as M5,1 is already
delivered to P4, the information M5,1.Dests = ∅ is piggybacked on M4,3 sent to P3.
Similarly, the information P6 ∈ M5,1.Dests must be deleted from Log4 as it will no longer be
needed, because of constraint II. M5,1.Dests = ∅ is stored in Log4 to remember that M5,1 has
been delivered or is guaranteed to be delivered in causal order to all its destinations.
www.Poriyaan.in
Processing at P6
When message M5,1 is delivered to P6, only M5,1.Dests = {P4} is added to Log6. Further,
P6 propagates only M5,1.Dests = {P4} on message M6,2, and this conveys the current
implicit information that M5,1 has been delivered to P6 by its very absence in the explicit
information.
• When the information P6 ∈ M5,1.Dests arrives on M4,3, piggybacked as M5,1.Dests
= {P6}, it is used only to ensure causal delivery of M4,3 using the Delivery
Condition, and is not inserted in Log6 (constraint I). Further, the presence of
M5,1.Dests = {P4} in Log6 implies the implicit information that M5,1 has already been
delivered to P6. Also, the absence of P4 in M5,1.Dests in the explicit
piggybacked information implies the implicit information that M5,1 has been
delivered or is guaranteed to be delivered in causal order to P4, and, therefore,
M5,1.Dests is set to ∅ in Log6.
• When the information P6 ∈ M5,1.Dests arrives on M5,2, piggybacked as M5,1.Dests
= {P4, P6}, it is used only to ensure causal delivery of M4,3 using the Delivery
Condition, and is not inserted in Log6 because Log6 contains M5,1.Dests = ∅,
which gives the implicit information that M5,1 has been delivered or is
guaranteed to be delivered in causal order to both P4 and P6.
Processing at P1
• When M2,2 arrives carrying piggybacked information M5,1.Dests = {P6}, this
(new) information is inserted in Log1.
• When M6,2 arrives with piggybacked information M5,1.Dests = {P4}, P1 learns the
implicit information that M5,1 has been delivered to P6 by the very absence of explicit
information P6 ∈ M5,1.Dests in the piggybacked information, and hence marks the
information P6 ∈ M5,1.Dests for deletion from Log1.
• The information “P6 ∈ M5,1.Dests” piggybacked on M2,3, which arrives at P1, is
inferred to be outdated using the implicit knowledge derived from “M5,1.Dests = ∅”
in Log1.
For each pair of processes Pi and Pj and for each pair of messages Mx and My that are delivered to
both the processes, Pi is delivered Mx before My if and only if Pj is delivered Mx before My.
Each process sends the message it wants to broadcast to a centralized process, which
relays all the messages it receives to every other process over FIFO channels.
Complexity: Each message transmission takes two message hops and exactly n messages
in a system of n processes.
Drawbacks: A centralized algorithm has a single point of failure and congestion, and is
not an elegant solution.
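The centralized approach can be sketched with one FIFO queue per receiving process. This is an illustration under simplifying assumptions: the Coordinator class, the drain helper and the in-memory deques standing in for FIFO channels are all hypothetical, and failures are not modeled.

```python
from collections import deque
from typing import Deque, List, Tuple

Message = Tuple[int, str]   # (sender id, payload)


class Coordinator:
    """Central process that relays each broadcast over one FIFO channel per process."""

    def __init__(self, n: int) -> None:
        self.channels: List[Deque[Message]] = [deque() for _ in range(n)]

    def submit(self, sender: int, payload: str) -> None:
        # Second hop: relay in the order received, so every channel
        # carries the same sequence of messages.
        for channel in self.channels:
            channel.append((sender, payload))


def drain(channel: Deque[Message]) -> List[Message]:
    """Deliver everything waiting on a channel in FIFO order."""
    delivered = []
    while channel:
        delivered.append(channel.popleft())
    return delivered


coord = Coordinator(3)
coord.submit(0, "x")   # first hop: process 0 hands its broadcast to the coordinator
coord.submit(2, "y")
coord.submit(1, "z")
orders = [drain(channel) for channel in coord.channels]
```

Every process delivers the messages in the coordinator's arrival order, which is what makes the order total. The single coordinator is also the single point of failure noted above.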
Sender
Phase 1
• In the first phase, a process multicasts the message M with a locally unique tag and
the local timestamp to the group members.
Phase 2
• The sender process awaits a reply from all the group members who respond with a
tentative proposal for a revised timestamp for that message M.
• The await call is non-blocking.
Phase 3
• The process multicasts the final timestamp to the group.
Fig) Sender side of three phase distributed algorithm
Receiver Side
Phase 1
• The receiver receives the message with a tentative timestamp. It updates the variable
priority that tracks the highest proposed timestamp, then revises the proposed
timestamp to the priority, and places the message with its tag and the revised
timestamp at the tail of the queue temp_Q. In the queue, the entry is marked as
undeliverable.
Phase 2
• The receiver sends the revised timestamp back to the sender. The receiver then waits
in a non-blocking manner for the final timestamp.
Phase 3
• The final timestamp is received from the multicaster. The corresponding
message entry in temp_Q is identified using the tag, and is marked as deliverable
after the revised timestamp is overwritten by the final timestamp.
• The queue is then resorted using the timestamp field of the entries as the key. As the
queue is already sorted except for the modified entry for the message under
consideration, that message entry has to be placed in its sorted position in the queue.
• If the message entry is at the head of the temp_Q, that entry, and all consecutive
subsequent entries that are also marked as deliverable, are dequeued from temp_Q,
and enqueued in deliver_Q.
Complexity
This algorithm uses three phases and, to send a message to n − 1 processes, it uses 3(n − 1)
messages and incurs a delay of three message hops.
Example
An example execution to illustrate the algorithm is given in Figure 6.14. Here, A and B
multicast to a set of destinations, and C and D are the common destinations for both
multicasts.
Figure (a) The main sequence of steps is as follows:
1. A sends a REVISE_TS(7) message, having timestamp 7. B sends a REVISE_TS(9)
message, having timestamp 9.
2. C receives A’s REVISE_TS(7), enters the corresponding message in temp_Q, and marks
it as undeliverable; priority = 7. C then sends PROPOSED_TS(7) message to A.
3. D receives B’s REVISE_TS(9), enters the corresponding message in temp_Q, and marks
it as undeliverable; priority = 9. D then sends PROPOSED_TS(9) message to B.
4. C receives B’s REVISE_TS(9), enters the corresponding message in temp_Q, and marks
it as undeliverable; priority = 9. C then sends PROPOSED_TS(9) message to B.
5. D receives A’s REVISE_TS(7), enters the corresponding message in temp_Q, and marks
it as undeliverable; priority = 10. D assigns a tentative timestamp value of 10, which is
greater than all of the timestamps on REVISE_TSs seen so far, and then sends
PROPOSED_TS(10) message to A.
The state of the system is as shown in the figure
Fig) An example to illustrate the three-phase total ordering algorithm. (a) A snapshot for
PROPOSED_TS and REVISE_TS messages. The dashed lines show the further execution
after the snapshot. (b) The FINAL_TS messages in the example.
Figure (b) The continuing sequence of main steps is as follows:
6. When A receives PROPOSED_TS(7) from C and PROPOSED_TS(10) from D, it
computes the final timestamp as max(7, 10) = 10, and sends FINAL_TS(10) to C and D.
7. When B receives PROPOSED_TS(9) from C and PROPOSED_TS(9) from D, it
computes the final timestamp as max(9, 9) = 9, and sends FINAL_TS(9) to C and D.
8. C receives FINAL_TS(10) from A, updates the corresponding entry in temp_Q with the
timestamp, resorts the queue, and marks the message as deliverable. As the message is not
at the head of the queue, and some entry ahead of it is still undeliverable, the message is
not moved to delivery_Q.
9. D receives FINAL_TS(9) from B, updates the corresponding entry in temp_Q by
marking the corresponding message as deliverable, and resorts the queue. As the message
is at the head of the queue, it is moved to delivery_Q. This is the system snapshot shown in
Figure (b).
The following further steps will occur:
10. When C receives FINAL_TS(9) from B, it will update the corresponding entry in
temp_Q by marking the corresponding message as deliverable. As the message is at the
head of the queue, it is moved to the delivery_Q, and the next message (of A), which is
also deliverable, is also moved to the delivery_Q.
11. When D receives FINAL_TS(10) from A, it will update the corresponding entry in
temp_Q by marking the corresponding message as deliverable. As the message is at the
head of the queue, it is moved to the delivery_Q.
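The receiver side of the example can be replayed with a short sketch. It is illustrative only: the Receiver class and its method names are hypothetical, ties are broken by message tag, and the priority update max(priority + 1, timestamp) is an assumption chosen because it reproduces the proposals 7, 9, 9 and 10 in the steps above.

```python
from typing import List


class Receiver:
    """Receiver side of the three-phase algorithm (illustrative sketch)."""

    def __init__(self) -> None:
        self.priority = 0
        self.temp_q: List[list] = []      # entries: [timestamp, tag, deliverable]
        self.deliver_q: List[str] = []

    def on_revise_ts(self, tag: str, ts: int) -> int:
        # Phase 1: propose a timestamp above everything seen so far (assumed rule).
        self.priority = max(self.priority + 1, ts)
        self.temp_q.append([self.priority, tag, False])
        # Phase 2: the return value is sent back to the sender as PROPOSED_TS.
        return self.priority

    def on_final_ts(self, tag: str, ts: int) -> None:
        # Phase 3: overwrite the tentative timestamp and mark as deliverable.
        for entry in self.temp_q:
            if entry[1] == tag:
                entry[0], entry[2] = ts, True
        self.temp_q.sort(key=lambda e: (e[0], e[1]))
        # Move the deliverable prefix of temp_q to deliver_q.
        while self.temp_q and self.temp_q[0][2]:
            self.deliver_q.append(self.temp_q.pop(0)[1])


C, D = Receiver(), Receiver()
a_from_c = C.on_revise_ts("A", 7)      # step 2: C proposes 7
b_from_d = D.on_revise_ts("B", 9)      # step 3: D proposes 9
b_from_c = C.on_revise_ts("B", 9)      # step 4: C proposes 9
a_from_d = D.on_revise_ts("A", 7)      # step 5: D proposes 10
final_a = max(a_from_c, a_from_d)      # step 6: sender A picks 10
final_b = max(b_from_c, b_from_d)      # step 7: sender B picks 9
C.on_final_ts("A", final_a)            # step 8: B's entry is ahead and undeliverable
c_after_step_8 = list(C.deliver_q)
D.on_final_ts("B", final_b)            # step 9: D delivers B's message
C.on_final_ts("B", final_b)            # step 10: C delivers B's, then A's
D.on_final_ts("A", final_a)            # step 11: D delivers A's
```

Both C and D end up delivering B's message before A's, although they received the two multicasts in opposite orders.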
• The state of a process at any time is defined by the contents of processor registers,
stacks, local memory, etc., and may be highly dependent on the local context of
the distributed application.
• The state of channel Cij, denoted by SCij, is given by the set of messages in transit
in the channel.
• The events that may happen are: internal event, send (send (mij)) and receive
(rec(mij)) events.
• The occurrences of events cause changes in the process state.
• A channel is a distributed entity and its state depends on the local states of the
processes on which it is incident.
Law of conservation of messages: Every message mij that is recorded as sent in the local state of a
process pi must be captured in the state of the channel Cij or in the collected local state of the
receiver process pj.
➢ In a consistent global state, every message that is recorded as received is also recorded
as sent. Such a global state captures the notion of causality that a message cannot be
received if it was not sent.
➢ Consistent global states are meaningful global states and inconsistent global states are
not meaningful in the sense that a distributed system can never be in an inconsistent
state.
https://play.google.com/store/apps/details?id=com.poriyaan.poriyaan
• A consistent global state corresponds to a cut in which every message received in the
PAST of the cut has been sent in the PAST of that cut. Such a cut is known as a
consistent cut.
• In a consistent snapshot, all the recorded local states of processes are concurrent; that
is, the recorded local state of no process causally affects the recorded local state of
any other process.
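The consistency condition can be checked mechanically once each local state lists the messages it has recorded as sent and as received. The sketch below is a minimal illustration: the function name and the representation of local states as sets of message identifiers are assumptions.

```python
from typing import Dict, Set


def is_consistent(sent: Dict[str, Set[str]], received: Dict[str, Set[str]]) -> bool:
    """A global state is consistent if every message recorded as received
    by some process is also recorded as sent by some process."""
    all_sent = set().union(*sent.values())
    all_received = set().union(*received.values())
    return all_received <= all_sent


# m12 is recorded as sent by p1 and as received by p2: consistent.
ok = is_consistent(sent={"p1": {"m12"}, "p2": set()},
                   received={"p1": set(), "p2": {"m12"}})

# m12 is recorded as received by p2 but p1's state was taken before
# the send: this corresponds to an inconsistent cut.
bad = is_consistent(sent={"p1": set(), "p2": set()},
                    received={"p1": set(), "p2": {"m12"}})
```

A message recorded as sent but not yet received does not violate consistency; it belongs to the state of the channel.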
Issue 2:
How to determine the instant when a process takes its snapshot?
Answer:
A process pj must record its snapshot before processing a message mij that was sent by process pi after
recording its snapshot.
A snapshot captures the local states of each process along with the state of each communication channel.
Chandy–Lamport algorithm
• The algorithm records a global snapshot consisting of the local state of each process and
the state of each channel.
• The Chandy-Lamport algorithm uses a control message, called a marker.
• After a site has recorded its snapshot, it sends a marker along all of its outgoing
channels before sending out any more messages.
• Since channels are FIFO, a marker separates the messages in the channel into those to
be included in the snapshot from those not to be recorded in the snapshot.
• This addresses issue I1. The role of markers in a FIFO system is to act as delimiters
for the messages in the channels so that the channel state recorded by the process
at the receiving end of the channel satisfies the condition C2.
Initiating a snapshot
• Process Pi initiates the snapshot
• Pi records its own state and prepares a special marker message.
• Send the marker message to all other processes.
• Start recording all incoming messages from channels Cji for j not equal to i.
Propagating a snapshot
• For all processes Pjconsider a message on channel Ckj.
Terminating a snapshot
• All processes have received a marker.
• All processes have received a marker on all the N-1 incoming channels.
• A central server can gather the partial state to build a global snapshot.
• Due to FIFO property of channels, it follows that no message sent after the marker on that
channel is recorded in the channel state. Thus, condition C2 is satisfied.
• When a process pj receives message mij that precedes the marker on channel Cij, it acts
as follows: if process pj has not taken its snapshot yet, then it includes mij in its recorded
snapshot. Otherwise, it records mij in the state of the channel Cij. Thus, condition C1
is satisfied.
Complexity
The recording part of a single instance of the algorithm requires O(e) messages
and O(d) time, where e is the number of edges in the network and d is the diameter of
the network.
2. (Markers are shown using dotted arrows.) Let site S1 initiate the algorithm just after t0 and before
sending the $50 to S2. Site S1 records its local state (account A = $600) and sends a marker to
S2. The marker is received by site S2 between t2 and t3. When site S2 receives the marker, it
records its local state (account B = $120), the state of channel C12 as $0, and sends a marker
along channel C21. When site S1 receives this marker, it records the state of channel C21 as $80.
The $800 amount in the system is conserved in the recorded global state:
A = $600, B = $120, C12 = $0, C21 = $80
The recorded global state may not correspond to any of the global states that occurred
during the computation.
This happens because a process can change its state asynchronously before the markers it
sent are received by other sites and the other sites record their states.
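The bank example can be replayed with a small simulation of the marker rules. This is a sketch under stated assumptions: the Site class and its fields are hypothetical, channels are modeled as in-memory FIFO queues, and the initial balance of account B is taken to be $200, which is implied by the $800 total rather than stated above.

```python
from collections import deque

MARKER = "MARKER"


class Site:
    """One site running the marker rules (illustrative sketch)."""

    def __init__(self, name, balance):
        self.name, self.balance = name, balance
        self.out = {}        # destination name -> outgoing FIFO channel
        self.inc = []        # names of sites with a channel into this site
        self.state = None    # recorded local state
        self.chan = {}       # recorded state of each incoming channel
        self.open = set()    # incoming channels still being recorded

    def record(self):
        # Record the local state, then send a marker on every outgoing
        # channel before any further application message.
        self.state = self.balance
        self.chan = {src: [] for src in self.inc}
        self.open = set(self.inc)
        for channel in self.out.values():
            channel.append(MARKER)

    def send(self, dst, amount):
        self.balance -= amount
        self.out[dst].append(amount)

    def receive(self, src, channel):
        msg = channel.popleft()
        if msg == MARKER:
            if self.state is None:
                self.record()            # first marker: take the snapshot now
            self.open.discard(src)       # the state of this channel is complete
        else:
            self.balance += msg
            if src in self.open:         # in transit when the snapshot was taken
                self.chan[src].append(msg)


s1, s2 = Site("S1", 600), Site("S2", 200)   # B = $200 is an assumption
c12, c21 = deque(), deque()
s1.out["S2"], s2.out["S1"] = c12, c21
s1.inc, s2.inc = ["S2"], ["S1"]

s1.record()              # S1 initiates just after t0
s1.send("S2", 50)        # the $50 travels behind the marker on C12
s2.send("S1", 80)        # S2 sends $80 before the marker reaches it
s2.receive("S1", c12)    # marker: S2 records B = $120 and C12 as empty
s1.receive("S2", c21)    # the $80 is recorded as the state of C21
s2.receive("S1", c12)    # the $50 arrives after the marker, not recorded
s1.receive("S2", c21)    # marker from S2 closes the recording of C21

total = s1.state + s2.state + sum(s2.chan["S1"]) + sum(s1.chan["S2"])
```

The recorded states are A = $600, B = $120, C12 empty and C21 = $80, so the recorded total is $800. At the end of the run the actual balances are $630 and $170, which matches the remark that the recorded global state need not equal any state the system actually passed through.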