Fault Tolerance FDCC
• Fault Tolerance
• Recovery
Fault Tolerance
Introduction
What are the various faults that a distributed system may
face?
a) Failure of a link
b) Failure of a site
c) Loss of message
d) Failure of power
• A DS should be fault-tolerant
– Should be able to continue functioning in the
presence of faults
Dependability Includes
• Availability
• Reliability
• Safety
• Maintainability
Availability & Reliability (1)
• Timing failure: a server's response lies outside the specified time interval
Process Resilience
• Mask process failures by replication
• Unordered multicast:
– [Figure: three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.]
• FIFO-ordered multicast:
– [Figure: four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting.]
– P1: m0, m1, m2
– P2: m3, m4, m5
– P3: m6, m7, m8
• FIFO? (m0, m3, m6, m1, m4, m7, m2, m5, m8)
• FIFO? (m0, m4, m6, m1, m3, m7, m2, m5, m8)
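The two FIFO questions above can be checked mechanically. The following is a minimal sketch, assuming each delivery order contains every message exactly once. The names `sent`, `is_fifo`, `order_a` and `order_b` are made up for this illustration and are not part of the slides.

```python
# Send order per process, as listed on the slide.
sent = {
    "P1": ["m0", "m1", "m2"],
    "P2": ["m3", "m4", "m5"],
    "P3": ["m6", "m7", "m8"],
}

def is_fifo(delivery, sent):
    """True if messages from each sender are delivered in the order sent.

    Assumption: every sent message appears exactly once in `delivery`.
    """
    position = {msg: i for i, msg in enumerate(delivery)}
    for msgs in sent.values():
        indices = [position[m] for m in msgs]
        if indices != sorted(indices):
            return False
    return True

order_a = ["m0", "m3", "m6", "m1", "m4", "m7", "m2", "m5", "m8"]
order_b = ["m0", "m4", "m6", "m1", "m3", "m7", "m2", "m5", "m8"]

print(is_fifo(order_a, sent))  # True
print(is_fifo(order_b, sent))  # False: m4 is delivered before m3
```

FIFO ordering only constrains messages from the same sender, so the check never compares messages across P1, P2 and P3.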
– P1: m0, m1, m2
– P2: m3, m4, m5
– P3: m6, m7, m8
– Cross-process happened-before: m0 → m4, m5 → m8
• Causal? (m0, m3, m6, m1, m4, m7, m2, m5, m8)
• Causal? (m0, m4, m1, m7, m3, m6, m2, m5, m8)
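The causal-order questions above can be checked the same way. This is a minimal sketch, assuming that the causal constraints are exactly the per-process send order plus the two cross-process happened-before pairs (m0 before m4, m5 before m8). The names `is_causal`, `order_a` and `order_b` are made up for this illustration.

```python
# Send order per process and cross-process happened-before pairs.
sent = {
    "P1": ["m0", "m1", "m2"],
    "P2": ["m3", "m4", "m5"],
    "P3": ["m6", "m7", "m8"],
}
happened_before = [("m0", "m4"), ("m5", "m8")]

def is_causal(delivery, sent, happened_before):
    """True if the delivery order respects every happened-before pair.

    Assumption: every message appears exactly once in `delivery`.
    """
    position = {msg: i for i, msg in enumerate(delivery)}
    pairs = list(happened_before)
    # Messages from the same sender are causally ordered by send order.
    for msgs in sent.values():
        pairs += zip(msgs, msgs[1:])
    return all(position[a] < position[b] for a, b in pairs)

order_a = ["m0", "m3", "m6", "m1", "m4", "m7", "m2", "m5", "m8"]
order_b = ["m0", "m4", "m1", "m7", "m3", "m6", "m2", "m5", "m8"]

print(is_causal(order_a, sent, happened_before))  # True
print(is_causal(order_b, sent, happened_before))  # False: m4 before m3, m7 before m6
```

Checking only the direct pairs is enough here: if a single linear order respects every direct pair, it also respects everything that follows from them by transitivity.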
– P1: m0, m1, m2
– P2: m3, m4, m5
– P3: m6, m7, m8
• Total?
– P1: m7, m1, m2, m4, m5, m3, m6, m0, m8
– P2: m7, m1, m2, m4, m5, m3, m6, m0, m8
– P3: m7, m1, m2, m4, m5, m3, m6, m0, m8
• Total?
– P1: m7, m1, m2, m4, m5, m3, m6, m0, m8
– P2: m7, m2, m1, m4, m5, m3, m6, m0, m8
– P3: m7, m1, m2, m4, m5, m3, m6, m8, m0
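The two total-order questions above reduce to asking whether every process delivers the messages in the same sequence. This is a minimal sketch, assuming all processes deliver the same set of messages. The names `is_total`, `case_1` and `case_2` are made up for this illustration.

```python
def is_total(deliveries):
    """True if every process delivers messages in the identical order."""
    orders = list(deliveries.values())
    return all(order == orders[0] for order in orders)

case_1 = {
    "P1": ["m7", "m1", "m2", "m4", "m5", "m3", "m6", "m0", "m8"],
    "P2": ["m7", "m1", "m2", "m4", "m5", "m3", "m6", "m0", "m8"],
    "P3": ["m7", "m1", "m2", "m4", "m5", "m3", "m6", "m0", "m8"],
}
case_2 = {
    "P1": ["m7", "m1", "m2", "m4", "m5", "m3", "m6", "m0", "m8"],
    "P2": ["m7", "m2", "m1", "m4", "m5", "m3", "m6", "m0", "m8"],
    "P3": ["m7", "m1", "m2", "m4", "m5", "m3", "m6", "m8", "m0"],
}

print(is_total(case_1))  # True
print(is_total(case_2))  # False: P2 swaps m1/m2, P3 swaps m0/m8
```

Total ordering on its own says nothing about send order: the first case is totally ordered even though m7 is delivered before m6.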
Does FIFO-ordered multicast imply causally ordered
multicast?
Recovery
Examples:
When a lost or damaged packet can be reconstructed
from other successfully delivered packets, this is
known as erasure correction. This is an example of
a forward recovery technique.
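One simple way to illustrate erasure correction is a single XOR parity packet. This is a minimal sketch, assuming equal-length packets and at most one lost packet. The slides do not name a specific code, so the XOR scheme and the names `xor_bytes`, `make_parity` and `recover` are assumptions made for this example.

```python
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets):
    """Sender side: the parity packet is the XOR of all data packets."""
    return reduce(xor_bytes, packets)

def recover(received, parity):
    """Receiver side: rebuild the single missing packet (marked None)."""
    present = [p for p in received if p is not None]
    return reduce(xor_bytes, present, parity)

packets = [b"data", b"more", b"text"]
parity = make_parity(packets)

# Packet 1 is lost in transit; the others and the parity packet arrive.
received = [packets[0], None, packets[2]]
recovered = recover(received, parity)
print(recovered)  # b'more'
```

The receiver moves forward to a correct state without asking for a retransmission, which is what makes this forward rather than backward recovery.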
More on Backward Recovery
How?
Global State
+ So simple!!
- Correct??
Example
Producer Consumer problem
[Figure: timelines of processes p and q, with message m]
Example
[Figure: timelines of processes p and q, with message m]
Result:
The global state records the receive event but not the
send event, violating the happened-before relation!
An orphan message is a message whose receiving event is
recorded in the checkpoint, but its sending event is lost.
Lost Messages
A lost message is a message whose sending event is recorded
in the checkpoint, but whose receiving event is not.
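The orphan and lost message definitions can be applied to a recorded global state as follows. This is a minimal sketch, assuming each process's checkpoint is described by how many of its local events it recorded, and each message is described by the local index of its send and receive events. The tuple layout and the name `classify` are made up for this illustration.

```python
def classify(messages, checkpoint):
    """Return (orphans, lost) for a set of per-process checkpoints.

    messages:   (name, sender, send_index, receiver, recv_index) tuples,
                where indices are 1-based positions in the local event
                sequence of the sender or receiver.
    checkpoint: maps each process to the number of local events recorded.
    """
    orphans, lost = [], []
    for name, sender, send_idx, receiver, recv_idx in messages:
        send_recorded = send_idx <= checkpoint[sender]
        recv_recorded = recv_idx <= checkpoint[receiver]
        if recv_recorded and not send_recorded:
            orphans.append(name)   # receive recorded, send not recorded
        elif send_recorded and not recv_recorded:
            lost.append(name)      # send recorded, receive not recorded
    return orphans, lost

# p sends m as its 2nd event; q receives m as its 1st event.
messages = [("m", "p", 2, "q", 1)]

print(classify(messages, {"p": 1, "q": 1}))  # (['m'], [])
print(classify(messages, {"p": 2, "q": 0}))  # ([], ['m'])
```

A global state with an orphan message is inconsistent, because it shows an effect without its cause. A lost message does not break consistency in that sense, but it has to be replayed or resent after rollback.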
Cut