Unit 4 Part 3

The document discusses the concept of the domino effect in distributed systems, where a rollback in one process necessitates rollbacks in others to maintain consistency. It outlines methods to avoid this effect, including coordinated checkpointing, communication-induced checkpointing, and log-based rollback recovery. Key points emphasize the importance of maintaining consistent system states and the challenges posed by failures in distributed applications.

Uploaded by

Surya


CS 3551 DISTRIBUTED COMPUTING
Checkpoint

Example (ATM withdrawal): A/C balance = 20000 → ATM PIN entry → Amount = 10000 → Update balance = 10000 → Cash dispense
Checkpoint in Distributed System
What is Domino Effect?
● To see why rollback propagation occurs, consider the situation where the
sender of a message m rolls back to a state that precedes the sending of m.
● The receiver of m must also roll back to a state that precedes m’s receipt;
otherwise, the states of the two processes would be inconsistent because
they would show that message m was received without being sent, which is
impossible in any correct failure-free execution.
● This phenomenon of cascaded rollback is called the domino effect.
● In some situations, rollback propagation may extend back to the initial state of
the computation, losing all the work performed before the failure.
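The cascade described above can be sketched as a small fixed-point computation. This is only an illustrative model, not an algorithm from the slides: the function name `recovery_line`, the use of local event indices, and the rule that a checkpoint at index c covers all events up to c are assumptions made for the example. Non-failed processes start from their current state, and any receiver holding an orphan message is pushed back to its latest checkpoint before the receive.

```python
def recovery_line(checkpoints, messages, failed):
    """Return, per process, the checkpoint index it ends up restarting from.

    checkpoints: dict process -> sorted list of local event indices at which a
                 checkpoint was taken (0 is the initial state, assumed present).
    messages:    list of (sender, send_idx, receiver, recv_idx) tuples.
    failed:      the process that crashes and restarts from its last checkpoint.
    """
    # Processes that did not fail keep their current state (modelled as infinity).
    line = {p: float("inf") for p in checkpoints}
    line[failed] = checkpoints[failed][-1]

    changed = True
    while changed:
        changed = False
        for sender, s, receiver, r in messages:
            send_undone = s > line[sender]
            receive_kept = r <= line[receiver]
            if send_undone and receive_kept:
                # Orphan message: the receiver must roll back before the receive.
                line[receiver] = max(c for c in checkpoints[receiver] if c < r)
                changed = True
    return line


# Q sends m1 (Q event 1, received at P event 1); P checkpoints after event 2;
# P sends m2 (P event 3, received at Q event 2); Q checkpoints after event 3.
uncoordinated = {"P": [0, 2], "Q": [0, 3]}
msgs = [("Q", 1, "P", 1), ("P", 3, "Q", 2)]
domino = recovery_line(uncoordinated, msgs, failed="P")

# Same messages, but Q checkpoints after event 1 instead.
better_placed = {"P": [0, 2], "Q": [0, 1]}
contained = recovery_line(better_placed, msgs, failed="P")
```

In the first scenario both processes fall back to their initial state, which is the worst case mentioned above; in the second, the rollback stops at P's checkpoint 2 and Q's checkpoint 1.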
Domino effect continued…
● Independent (uncoordinated) checkpointing: if each participating process takes its
checkpoints independently, then the system is susceptible to the domino effect.
How to avoid domino effect?
● Coordinated checkpointing:
○ Processes coordinate their checkpoints to form a system-wide consistent state.
○ In case of a process failure, the system state can be restored to such a consistent set of checkpoints,
preventing rollback propagation.
● Communication-induced checkpointing:
○ Forces each process to take checkpoints based on information piggybacked on the application messages it
receives from other processes.
○ Checkpoints are taken such that a system-wide consistent state always exists on stable storage, thereby
avoiding the domino effect.
● Log-based rollback recovery:
○ Combines checkpointing with logging of non-deterministic events.
○ Log-based rollback recovery relies on the piecewise deterministic (PWD) assumption, which postulates that all
non-deterministic events that a process executes can be identified and that the information necessary to replay
each event during recovery can be logged in the event’s determinant.
○ By logging and replaying the non-deterministic events in their exact original order, a process can
deterministically recreate its pre-failure state even if this state has not been checkpointed.
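The log-and-replay idea can be illustrated with a toy process. This is a minimal sketch under stated assumptions, not a real protocol: the class `LoggedProcess`, its method names, and the state update rule are hypothetical, message delivery is treated as the only non-deterministic event, and the checkpoint and the log are assumed to live on stable storage (here they are simply attributes that survive `crash`).

```python
import copy


class LoggedProcess:
    def __init__(self):
        self.state = {"value": 0}
        self.checkpoint = copy.deepcopy(self.state)  # assumed to be on stable storage
        self.log = []  # determinants of events since the last checkpoint

    def _apply(self, payload):
        # Deterministic state transition; the result depends on delivery order.
        self.state["value"] = self.state["value"] * 2 + payload

    def deliver(self, payload):
        # Log the determinant of the non-deterministic event before applying it.
        self.log.append(payload)
        self._apply(payload)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()  # older determinants are no longer needed

    def crash(self):
        self.state = None  # volatile state is lost

    def recover(self):
        # Restore the checkpoint, then replay logged events in their original order.
        self.state = copy.deepcopy(self.checkpoint)
        for payload in self.log:
            self._apply(payload)


p = LoggedProcess()
p.deliver(3)
p.take_checkpoint()
p.deliver(5)
p.deliver(7)
before_failure = copy.deepcopy(p.state)
p.crash()
p.recover()
```

After `recover`, the process is back in a state that was never checkpointed, which is the point of the PWD assumption.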
Key Points
● Rollback recovery treats a distributed system application as a collection of
processes that communicate over a network.
● It achieves fault tolerance by periodically saving the state of a process during
the failure-free execution, enabling it to restart from a saved state upon a
failure to reduce the amount of lost work.
● The saved state is called a checkpoint, and the procedure of restarting from a
previously checkpointed state is called rollback recovery.
● A checkpoint can be saved on either the stable storage or the volatile storage
depending on the failure scenarios to be tolerated.
● Challenges for recovery:
○ Messages induce inter-process dependencies during failure-free operation. Upon a failure of one or
more processes in a system, these dependencies may force some of the processes that did not fail
to roll back, creating what is commonly called rollback propagation.
Background and Definitions
1. System Model
2. Local Checkpoint
3. Consistent system states
4. Interactions with the outside world
5. Different types of messages
1. System Model

● A distributed system consists of a fixed number of processes, P1, P2, ..., PN, which
communicate only through messages.
● Processes cooperate to execute a distributed application and interact with the
outside world by receiving and sending input and output messages, respectively.
● Some protocols assume that the communication subsystem delivers messages
reliably, in first-in-first-out (FIFO) order, while other protocols assume that the
communication subsystem can lose, duplicate, or reorder messages.
● A system recovers correctly if its internal state is consistent with the
observable behavior of the system before the failure.
2. Local Checkpoint (at each process)

1. A local checkpoint is a snapshot of the state of the process at a given
instant, and the event of recording the state of a process is called local
checkpointing.
2. The contents of a checkpoint depend upon the application context and the
checkpointing method being used.
3. Depending upon the checkpointing method used, a process may keep
several local checkpoints or just a single checkpoint at any time.
4. We assume that a process stores all local checkpoints on the stable storage so that
they are available even if the process crashes.
5. We also assume that a process is able to roll back to any of its existing
local checkpoints and thus restore to and restart from the corresponding
state.
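The assumptions in points 3 to 5 can be made concrete with a small sketch. It is only one possible way to model stable storage: the class `CheckpointStore`, the file naming scheme, and the choice of `pickle` files in a directory are assumptions for illustration. Each checkpoint is written to a temporary file, flushed to disk, and then renamed, so a crash in the middle of a save does not leave a partial checkpoint behind.

```python
import os
import pickle
import tempfile


class CheckpointStore:
    """Keeps several local checkpoints of one process in a directory."""

    def __init__(self, directory):
        self.directory = directory
        self.count = 0

    def _path(self, index):
        return os.path.join(self.directory, f"ckpt_{index}.pkl")

    def save(self, state):
        index = self.count
        tmp_path = self._path(index) + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_path, self._path(index))  # atomic rename
        self.count += 1
        return index

    def load(self, index):
        # Roll back to any existing local checkpoint.
        with open(self._path(index), "rb") as f:
            return pickle.load(f)


with tempfile.TemporaryDirectory() as directory:
    store = CheckpointStore(directory)
    first = store.save({"balance": 20000})
    second = store.save({"balance": 10000})
    restored_first = store.load(first)
    restored_second = store.load(second)
```

A method that keeps only a single checkpoint would overwrite the same file instead of incrementing the index.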
3. Consistent vs Inconsistent System States
4. Interactions with the Outside World

● A distributed application often interacts with the outside world to receive input
data or deliver the outcome of a computation. If a failure occurs, the outside
world cannot be expected to roll back.
● For example, a printer cannot roll back the effects of printing a character, and an
automatic teller machine cannot recover the money that it dispensed to a customer.
● It is therefore necessary that the outside world see a consistent behavior of the
system despite failures.
● For simplicity, the outside world is modelled as a special process, the outside world
process (OWP), that interacts with the rest of the system through message passing.
● Output commit: before sending output to the OWP, the system must ensure
that the state from which the output is sent will be recovered despite any
future failure.
● Input messages:
○ Messages received from the OWP may not be reproducible during recovery, because it may
not be possible for the outside world to regenerate them.
○ Thus, recovery protocols must arrange to save these input messages so that they can be
retrieved when needed for execution replay after a failure.
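A minimal sketch of output commit and input logging follows. The class `OutputCommitProcess` and its fields are hypothetical, the "checkpoint" and the input log are plain attributes standing in for stable storage, and the state is just a running total. Outputs are held back until the state that produced them has been saved, and inputs are logged before they are processed so they can be replayed.

```python
class OutputCommitProcess:
    def __init__(self):
        self.state = 0
        self.saved_state = 0   # stand-in for a checkpoint on stable storage
        self.input_log = []    # inputs from the outside world since the checkpoint
        self.pending = []      # outputs produced but not yet committed
        self.released = []     # outputs the outside world has actually seen

    def _process(self, value):
        self.state += value
        self.pending.append(f"total={self.state}")

    def on_input(self, value):
        self.input_log.append(value)  # save the input before using it
        self._process(value)

    def commit(self):
        # Make the current state recoverable, then release the buffered outputs.
        self.saved_state = self.state
        self.input_log.clear()
        self.released.extend(self.pending)
        self.pending.clear()

    def crash_and_recover(self):
        # Volatile state and uncommitted outputs are lost; replay the logged inputs.
        self.state = self.saved_state
        self.pending.clear()
        for value in self.input_log:
            self._process(value)


proc = OutputCommitProcess()
proc.on_input(5)
released_before_commit = list(proc.released)
proc.commit()
proc.on_input(2)
proc.crash_and_recover()
```

Because nothing is released before `commit`, the outside world never sees an output that a later rollback could contradict.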
Types of Messages
Key Points
1. In-transit messages (m1, m2)
a. Messages that have been sent but not yet received.
b. When in-transit messages are part of a global system state, these messages do not cause any
inconsistency.
c. For reliable communication channels, a consistent state must include in-transit messages
because they will always be delivered to their destinations in any legal execution of the
system.
d. On the other hand, if a system model assumes lossy communication channels, then in-transit
messages can be omitted from the system state.
2. Lost messages (m1)
a. Messages whose send is not undone but whose receive is undone due to rollback are called lost
messages.
b. This type of message occurs when the receiving process rolls back to a checkpoint prior to the
reception of the message while the sender does not roll back beyond the send operation of the message.
Key Points (continued)
3. Delayed messages (m2, m5)
a. Messages whose receive is not recorded because the receiving process was either down or
the message arrived after the rollback of the receiving process.
4. Orphan messages
a. Messages with the receive recorded but the send not recorded are called orphan messages.
b. For example, a rollback might have undone the send of such messages, leaving the receive
event intact at the receiving process.
c. Orphan messages do not arise if processes roll back to a consistent global state.
5. Duplicate messages (m4, m5)
a. Duplicate messages arise due to message logging and replaying during process recovery.
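The in-transit, lost, and orphan cases can be told apart from what the restored states record. The sketch below is an assumption-laden illustration: the function `classify`, the use of local event indices, and the label "undone" for a message whose send and receive were both rolled back are not from the slides. Delayed and duplicate messages depend on arrival timing and replay, so they are not modelled here.

```python
def classify(send_idx, recv_idx, sender_line, receiver_line):
    """Classify one message relative to a restored global state.

    send_idx:      local event index of the send at the sender.
    recv_idx:      local event index of the receive, or None if never received.
    sender_line:   last event index kept in the sender's restored state.
    receiver_line: last event index kept in the receiver's restored state.
    """
    sent = send_idx <= sender_line
    received = recv_idx is not None and recv_idx <= receiver_line
    if sent and received:
        return "normal"
    if sent and recv_idx is None:
        return "in-transit"  # sent, but not yet received by anyone
    if sent:
        return "lost"        # send kept, receive undone by the rollback
    if received:
        return "orphan"      # receive kept, send undone: inconsistent state
    return "undone"          # both send and receive were rolled back


labels = [
    classify(1, 1, 2, 2),
    classify(2, None, 2, 2),
    classify(1, 3, 2, 2),
    classify(3, 2, 2, 2),
    classify(3, 3, 2, 2),
]
```

A set of restored states is consistent exactly when no message is classified as an orphan.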
Issues in Failure Recovery

Key Points
