Unit 4 Part 3
COMPUTING
[Figure: checkpoint example at an ATM (Amount = 10000, cash dispense)]
Checkpoint in Distributed System
What is Domino Effect?
● To see why rollback propagation occurs, consider the situation where the
sender of a message m rolls back to a state that precedes the sending of m.
● The receiver of m must also roll back to a state that precedes m’s receipt;
otherwise, the states of the two processes would be inconsistent because
they would show that message m was received without being sent, which is
impossible in any correct failure-free execution.
● This phenomenon of cascaded rollback is called the domino effect.
● In some situations, rollback propagation may extend back to the initial state of
the computation, losing all the work performed before the failure.
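The cascade described above can be sketched in a few lines of Python. This is only an illustrative model, not a production recovery protocol: the `Message` record, the `recovery_line` function, and the integer per-process event times are all assumptions made for this example. A checkpoint taken at time t is assumed to capture every local event with time at most t, and time 0 stands for the initial state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_time: int   # local event time of the send at the sender
    receiver: str
    recv_time: int   # local event time of the receive at the receiver


def recovery_line(checkpoints, messages, restart):
    """Propagate rollbacks until no message is received-but-not-sent.

    checkpoints: process -> list of checkpoint times (0 = initial state)
    restart:     process -> time each process initially restarts from
    """
    line = dict(restart)
    changed = True
    while changed:
        changed = False
        for m in messages:
            send_undone = m.send_time > line[m.sender]
            recv_kept = m.recv_time <= line[m.receiver]
            if send_undone and recv_kept:
                # The receiver must roll back to a checkpoint before the receive.
                line[m.receiver] = max(
                    t for t in checkpoints[m.receiver] if t < m.recv_time
                )
                changed = True
    return line


# Hypothetical execution: uncoordinated checkpoints with messages crossing them.
checkpoints = {"P": [0, 5], "Q": [0, 3]}
messages = [
    Message("P", 1, "Q", 2),
    Message("Q", 4, "P", 4),
    Message("P", 6, "Q", 7),
]
# P fails and restarts from its checkpoint at 5; Q is still at local time 10.
line = recovery_line(checkpoints, messages, {"P": 5, "Q": 10})
print(line)  # {'P': 0, 'Q': 0}
```

In this made-up execution, a single failure of P forces Q back to its checkpoint at 3, which forces P back to 0, which in turn forces Q back to 0: both processes end up at the initial state.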
Domino effect continued…
● Independent (uncoordinated) checkpointing: if each participating process takes its
checkpoints independently, the system is susceptible to the domino effect.
How to avoid domino effect?
● Coordinated checkpointing:
○ Processes coordinate their checkpoints to form a system-wide consistent state.
○ In case of a process failure, the system state can be restored to such a consistent set of checkpoints,
preventing rollback propagation.
● Communication-induced checkpointing:
○ Each process is forced to take checkpoints based on information piggybacked on the application messages it
receives from other processes.
○ Checkpoints are taken such that a system-wide consistent state always exists on stable storage, thereby
avoiding the domino effect.
● Log-based rollback recovery:
○ Combines checkpointing with logging of non-deterministic events.
○ Log-based rollback recovery relies on the piecewise deterministic (PWD) assumption, which postulates that all
non-deterministic events that a process executes can be identified and that the information necessary to replay
each event during recovery can be logged in the event’s determinant.
○ By logging and replaying the non-deterministic events in their exact original order, a process can
deterministically recreate its pre-failure state even if this state has not been checkpointed.
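The replay idea behind log-based rollback recovery can be illustrated with a minimal Python sketch. The `LoggedProcess` class and its `("add", n)` / `("mul", n)` events are hypothetical, and the checkpoint and the log are assumed to survive the failure (in a real system they would live on stable storage). Each delivered event stands in for a non-deterministic event, and the logged tuple plays the role of its determinant.

```python
class LoggedProcess:
    """Toy process whose only non-deterministic events are delivered messages."""

    def __init__(self):
        self.value = 0
        self.checkpoint = 0   # last checkpointed state (assumed to survive failures)
        self.log = []         # determinants of events delivered since the checkpoint

    def _apply(self, event):
        kind, n = event
        if kind == "add":
            self.value += n
        elif kind == "mul":
            self.value *= n

    def deliver(self, event):
        self.log.append(event)   # log the determinant before applying the event
        self._apply(event)

    def take_checkpoint(self):
        self.checkpoint = self.value
        self.log.clear()         # events before the checkpoint need no replay

    def recover(self):
        # Restore the checkpoint, then replay logged events in their original order.
        self.value = self.checkpoint
        for event in self.log:
            self._apply(event)


p = LoggedProcess()
p.deliver(("add", 100))
p.take_checkpoint()
p.deliver(("mul", 3))
p.deliver(("add", 5))
before_failure = p.value   # state that was never checkpointed

p.value = None             # simulate losing the volatile state
p.recover()
```

Because `mul` and `add` do not commute, replaying the log in a different order would produce a different state, which is why the original order has to be preserved.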
Key Points
● Rollback recovery treats a distributed system application as a collection of
processes that communicate over a network.
● It achieves fault tolerance by periodically saving the state of a process during
the failure-free execution, enabling it to restart from a saved state upon a
failure to reduce the amount of lost work.
● The saved state is called a checkpoint, and the procedure of restarting from a
previously checkpointed state is called rollback recovery.
● A checkpoint can be saved on either stable storage or volatile storage,
depending on the failure scenarios to be tolerated.
● Challenges for recovery:
○ Messages exchanged between processes create inter-process dependencies. On a failure of one or more
processes, these dependencies may force some of the processes that did not fail to roll back, creating
what is commonly called rollback propagation.
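As a rough illustration of saving a checkpoint and restarting from it, the sketch below writes a process state to a file and reads it back. The function names, the JSON encoding, and the use of a local file as "stable storage" are assumptions made for this example. The state is written to a temporary file first and then renamed into place, so a crash during the write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically replace the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())     # push the data to disk before the rename
    os.replace(tmp_path, path)   # atomic rename on the same filesystem


def load_checkpoint(path, default):
    """Return the last saved state, or the default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

On restart, a process would call `load_checkpoint` and resume from the returned state; only the work done since the last `save_checkpoint` call is lost.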
Background and Definitions
1. System Model
2. Local Checkpoint
3. Consistent system states
4. Interactions with the outside world
5. Different types of messages
1. System Model
4. Interactions with the outside world
● A distributed application often interacts with the outside world to receive input
data or deliver the outcome of a computation. If a failure occurs, the outside
world cannot be expected to roll back.
● For example, a printer cannot roll back the effects of printing a character, and an automatic
teller machine cannot recover the money that it dispensed to a customer.
● The outside world must see consistent behavior of the system despite failures. It is commonly
modeled as a special process, the outside world process (OWP), that interacts with the rest of the
system through message passing.
● Output commit: before sending output to the OWP, the system must ensure
that the state from which the output is sent will be recovered despite any
future failure.
● Input messages:
○ Received messages from the OWP may not be reproducible during recovery, because it may
not be possible for the outside world to regenerate them.
○ Thus, recovery protocols must arrange to save these input messages so that they can be
retrieved when needed for execution replay after a failure.
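The two rules above (save inputs before using them, release outputs only from a recoverable state) can be sketched as follows. The `OutputCommitProcess` class is hypothetical, and its `checkpoint` and `input_log` fields are assumed to survive failures; the `released` list stands for what the outside world has actually seen.

```python
class OutputCommitProcess:
    """Toy process that sums its inputs and reports the running total."""

    def __init__(self):
        self.state = 0
        self.checkpoint = 0    # last recoverable state (assumed on stable storage)
        self.input_log = []    # OWP inputs since the checkpoint (assumed on stable storage)
        self.pending = []      # outputs produced but not yet visible outside
        self.released = []     # outputs the outside world has seen

    def _apply(self, value):
        self.state += value
        self.pending.append(f"total={self.state}")

    def on_input(self, value):
        self.input_log.append(value)   # save first: the OWP cannot regenerate it
        self._apply(value)

    def output_commit(self):
        # Make the current state recoverable, then release the held outputs.
        self.checkpoint = self.state
        self.input_log.clear()
        self.released.extend(self.pending)
        self.pending.clear()

    def recover(self):
        # Restore the checkpoint and replay saved inputs; unreleased outputs are rebuilt.
        self.state = self.checkpoint
        self.pending.clear()
        for value in self.input_log:
            self._apply(value)


p = OutputCommitProcess()
p.on_input(5)
p.output_commit()
p.on_input(7)     # output "total=12" is held back, not yet released
p.state = -1      # simulate losing the volatile state
p.recover()
```

After recovery the process is back at the state that produced the held output, and the outside world has seen nothing that the recovered execution contradicts.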
Types of Messages
Key Points
1. In-transit messages (m1, m2)
a. Messages that have been sent but not yet received.
b. When in-transit messages are part of a global system state, these messages do not cause any
inconsistency.
c. For reliable communication channels, a consistent state must include in-transit messages
because they will always be delivered to their destinations in any legal execution of the
system.
d. On the other hand, if a system model assumes lossy communication channels, then in-transit
messages can be omitted from system state.
2. Lost messages (m1)
a. Messages whose send is not undone but whose receive is undone due to rollback are called lost
messages.
b. This type of message occurs when the receiving process rolls back to a checkpoint prior to the reception of
the message, while the sender does not roll back beyond the send operation of the message.
Key Points continued…
3. Delayed messages (m2, m5)
a. Messages whose receive is not recorded because either the receiving process was down or
the message arrived after the receiving process rolled back.
4. Orphan Messages
a. Messages with receive recorded but message send not recorded are called orphan messages.
b. For example, a rollback might have undone the send of such messages, leaving the receive
event intact at the receiving process.
c. Orphan messages do not arise if processes roll back to a consistent global state.
5. Duplicate messages (m4, m5)
a. Duplicate messages arise due to message logging and replaying during process recovery.
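Several of these categories follow mechanically from which send and receive events survive a rollback, which the Python sketch below tries to show. The `Msg` record, the `classify` function, and the integer local event times are assumptions for illustration; `line` gives, for each process, the time of the state it was restored to. Delayed and duplicate messages depend on arrival timing and on replay, so this sketch does not model them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Msg:
    sender: str
    send_time: int                   # local event time of the send
    receiver: str
    recv_time: Optional[int] = None  # None if not received before the failure


def classify(msg, line):
    """Classify a message against the restored state of sender and receiver."""
    send_kept = msg.send_time <= line[msg.sender]
    recv_happened = msg.recv_time is not None
    recv_kept = recv_happened and msg.recv_time <= line[msg.receiver]
    if send_kept and recv_kept:
        return "delivered"
    if send_kept and recv_happened:
        return "lost"         # send stands, receive undone by the rollback
    if send_kept:
        return "in-transit"   # sent but not yet received
    if recv_kept:
        return "orphan"       # receive recorded, send not recorded
    return "undone"           # neither event is part of the restored state


def is_consistent(messages, line):
    """A restored global state is consistent if it contains no orphan message."""
    return all(classify(m, line) != "orphan" for m in messages)


line = {"P": 5, "Q": 3}
a = Msg("P", 2, "Q", 1)
b = Msg("P", 4, "Q", 6)
c = Msg("P", 5, "Q")
d = Msg("Q", 4, "P", 5)
kinds = [classify(m, line) for m in (a, b, c, d)]
```

With these hypothetical values, `kinds` covers one message of each modeled category, and the presence of `d` makes the restored state inconsistent.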
Issues in Failure Recovery
Key Points