Fault Tolerance: Introduction, Process Resilience, Distributed Commit, Recovery
Fault Tolerance Basic Concepts
Being fault tolerant is strongly related to what are called dependable systems
Dependability implies the following for any system:
Availability: ready to use immediately.
Reliability: can run continuously without failure.
Safety: nothing catastrophic happens if the system fails to operate correctly.
Maintainability: how easily a failed system can be repaired.
Types of faults: Transient, Intermittent, Permanent
Transient: occurs once and then disappears.
Intermittent: occurs, vanishes, then reappears.
Permanent: continues to exist until the faulty component is repaired.
Failure Models
Different types of failures.
Failure Masking by Redundancy
Information redundancy: adding extra bits (like in Hamming codes, see the book Coding and Information Theory) to allow recovery from garbled bits.
Time redundancy: repeat actions if need be.
Physical redundancy: extra equipment or processes are added to make the system tolerate the loss of some components.
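As a toy illustration of information redundancy, a Hamming(7,4) code adds three parity bits to four data bits so that any single flipped bit can be located and corrected (a minimal sketch, not a production codec):

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword.

    Layout (1-indexed): p1 p2 d1 p3 d2 d3 d4, where each parity bit
    covers the positions whose index has the corresponding bit set.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4      # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # recompute each parity group
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-indexed position of the error
    if syndrome:
        c[syndrome - 1] ^= 1         # flip the garbled bit back
    return [c[2], c[4], c[5], c[6]]
```

Flipping any single codeword bit leaves the syndrome pointing exactly at the damaged position, so the decoder recovers the original data bits.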
Failure Masking by Physical Redundancy
Figure 8-2. Triple modular redundancy.
In fig 8-2(a), the signal passes through devices A, B, C in sequence. If one of them is faulty, the final result will probably be incorrect. In fig 8-2(b), each device is replicated three times, and following each stage in the circuit is a triplicated voter. Each voter is a circuit that has three inputs and one output. If two or three of the inputs are the same, the output equals that input. If all three inputs are different, the output is undefined. This type of design is known as triple modular redundancy.
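The voter logic described above can be sketched as follows (a minimal illustration; `device` stands for any replicated circuit stage):

```python
def voter(a, b, c):
    """Three-input majority voter: if at least two inputs agree,
    output that value; otherwise the output is undefined (None)."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None  # all three inputs differ: undefined

def tmr_stage(device, x1, x2, x3):
    """One TMR stage: run three replicas of a device on the three
    (possibly corrupted) inputs, then vote on the outputs."""
    v = voter(device(x1), device(x2), device(x3))
    return v, v, v  # the triplicated voters feed the next stage
```

A single corrupted input (or one faulty replica) is outvoted by the two healthy ones, which is exactly why TMR masks single faults per stage.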
How fault tolerance can actually be achieved
Process Resilience
Achieved by replicating processes into groups. How to design fault-tolerant groups? How to reach an agreement within a group when some members cannot be trusted to give correct answers?
Design Issues: Flat Groups versus Hierarchical Groups
Figure 8-3. (a) Communication in a flat group.
(b) Communication in a simple hierarchical group.
Failure Masking and Replication
Primary-backup protocol (hierarchical group). A primary coordinates all write operations. If it fails, the others hold an election to replace the primary.
Replicated-write protocols (flat group). Active replication as well as quorum-based protocols. These solutions correspond to organizing a collection of identical processes into a flat group.
A system is said to be k fault tolerant if it can survive faults in k components and still meet its specifications. For fail-silent components, k+1 replicas are enough to be k fault tolerant; with Byzantine failures, at least 2k+1 are needed so that a majority vote still yields the correct result.
Agreement in Faulty Systems (1)
Agreement is needed in many cases among different processes. The general goal of agreement is to have all nonfaulty processes reach consensus on some issue, and to establish that consensus within a finite number of steps.
Synchronous versus asynchronous systems
Communication delay is bounded or not
Message delivery is ordered or not
Message transmission is done through unicasting or multicasting
Two Army Problem
Two nonfaulty generals (Bonaparte and Alexander) with unreliable communication.
Byzantine Generals problem
Red army in the valley, n blue generals each with their own army surrounding them. Communication is pairwise, instantaneous and perfect. However m of the blue generals are traitors (faulty processes) and are actively trying to prevent the loyal generals from reaching agreement. The generals know the value m.
Goal: The generals need to exchange their troop strengths. At the end of the algorithm, each general has a vector of length n. If the ith general is loyal, then the ith element contains that general's troop strength; otherwise it is undefined.
Conditions for a Solution
All loyal generals decide upon the same plan of action.
A small number of traitors cannot cause the loyal generals to adopt a bad plan.
Agreement in Faulty Systems (3)
Figure 8-5. The Byzantine agreement problem for three nonfaulty and one faulty process. (a) Each process sends its value to the others.
Byzantine Example
The Byzantine generals problem for 3 loyal generals and 1 traitor:
(a) The generals announce their troop strengths (in units of 1 kilo soldiers).
(b) The vectors that each general assembles based on the previous step.
(c) The vectors that each general receives.
(d) If a value has a majority, then we know it correctly; else it is unknown.
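The majority step in (d) can be sketched as follows (an illustrative sketch; the vectors used in the example are hypothetical troop strengths):

```python
from collections import Counter

def decide(vectors):
    """vectors[j][i] = troop strength that general j reports for
    general i. For each position i, a loyal general takes the
    majority over all received vectors; if no strict majority
    exists, the entry is marked UNKNOWN."""
    n_reports = len(vectors)
    n_generals = len(vectors[0])
    result = []
    for i in range(n_generals):
        counts = Counter(v[i] for v in vectors)
        value, freq = counts.most_common(1)[0]
        result.append(value if freq > n_reports // 2 else "UNKNOWN")
    return result
```

With one traitor among four generals, the three loyal reports agree on every loyal general's strength, so only the traitor's own entry comes out UNKNOWN.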
Byzantine Example (2)
The same as in the previous slide, except now with 2 loyal generals and 1 traitor.
For m faulty processes, we need a total of 3m+1 processes to reach agreement.
Recovery
Backward recovery. Roll the system back from an erroneous state to a previously correct state. This requires the system to take checkpoints, which has the following issues:
Checkpointing is relatively costly, so it is often combined with message logging for better performance. Messages are logged before sending or before receiving, and combined with checkpoints to make recovery possible. Checkpoints alone cannot solve the issue of replaying all messages in the right order.
Backward recovery may enter a recovery loop, so failure transparency cannot be guaranteed.
Some states can never be rolled back to...
Forward recovery. Bring the system to a correct new state from which it can continue execution. E.g., in an (n, k) block erasure code, a set of k source packets is encoded into a set of n encoded packets, such that any set of k encoded packets is enough to reconstruct the original k source packets.
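As a toy illustration of forward recovery, here is an (n, k) = (k+1, k) erasure code that appends a single XOR parity packet, so any one lost packet can be rebuilt from the survivors (real erasure codes such as Reed-Solomon tolerate more losses):

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(packets):
    """Toy (k+1, k) erasure code: append the XOR of the k
    equal-length source packets as a parity packet."""
    parity = packets[0]
    for p in packets[1:]:
        parity = xor_bytes(parity, p)
    return packets + [parity]

def reconstruct(survivors):
    """Rebuild the single missing packet: XOR of all k surviving
    encoded packets (parity included) equals the lost one."""
    out = survivors[0]
    for p in survivors[1:]:
        out = xor_bytes(out, p)
    return out
```

Because the parity is the XOR of all sources, XOR-ing any k of the k+1 encoded packets cancels everything except the missing packet.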
Stable Storage
We need fault-tolerant disk storage for the checkpoints and
message logs. Examples are various RAID (Redundant Array of Independent Disks) schemes (although they are used for both improved fault tolerance as well as improved performance). Some common schemes:
RAID-0 (block-level striping)
RAID-1 (mirroring)
RAID-5 (block-level striping with distributed parity)
RAID-6 (block-level striping with double distributed parity)
Recovery: Stable Storage
Figure 8-23. (a) Stable storage. (b) Crash after drive 1 is
updated. (c) Bad spot due to spontaneous decay can be dealt with.
Checkpointing
Backward error recovery schemes require that a distributed system regularly records a consistent global state to stable storage. This is known as a distributed snapshot.
In a distributed snapshot, if a process P has recorded the receipt of a message, then there is also a process Q that has recorded the sending of that message.
To recover after a process or system failure, it is best to recover to the most recent distributed snapshot, also known as the recovery line.
Approaches: independent checkpointing, coordinated checkpointing, and message logging (optimistic or pessimistic).
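The consistency condition above (every recorded receipt has a matching recorded send) can be checked mechanically; a sketch, assuming each process's snapshot carries sets of sent and received message ids:

```python
def is_consistent(snapshots):
    """snapshots: one dict per process, each with 'sent' and
    'received' sets of message ids. A global state is consistent
    if every message whose receipt was recorded also has its
    sending recorded: no message is received 'from the future'."""
    sent = set().union(*(s["sent"] for s in snapshots))
    received = set().union(*(s["received"] for s in snapshots))
    return received <= sent
```

A recovery line is precisely the most recent collection of local checkpoints for which this check succeeds.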
Checkpointing
Figure 8-24. A recovery line.
Independent Checkpointing
Figure 8-25. The domino effect.
Coordinated Checkpointing
All processes synchronize to jointly write their state to local stable storage, which implies that the saved state is automatically consistent.
Simple coordinated checkpointing. The coordinator multicasts a CHECKPOINT_REQUEST to all processes. When a process receives the request, it takes a local checkpoint, queues any subsequent messages handed to it by the application it is executing, and acknowledges to the coordinator. When the coordinator has received an acknowledgement from all processes, it multicasts a CHECKPOINT_DONE message to allow the blocked processes to continue.
Incremental snapshot. The coordinator multicasts a checkpoint request only to those processes it has sent a message to since it last took a checkpoint. When a process P receives such a request, it forwards it to all those processes to which P itself has sent a message since the last checkpoint, and so on. A process forwards the request only once. When all processes have been identified, a second message is multicast to trigger checkpointing and to allow the processes to continue where they had left off.
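The simple coordinated-checkpointing exchange can be sketched as an in-process simulation (no real network; class and method names are illustrative):

```python
class Participant:
    """Hypothetical participant in simple coordinated checkpointing."""
    def __init__(self, state):
        self.state = state
        self.checkpoint = None
        self.blocking = False     # whether app messages are queued

    def on_checkpoint_request(self):
        self.checkpoint = self.state   # take a local checkpoint
        self.blocking = True           # queue subsequent app messages
        return "ACK"

    def on_checkpoint_done(self):
        self.blocking = False          # flush queue, resume normally

def run_coordinated_checkpoint(participants):
    """Coordinator side: multicast CHECKPOINT_REQUEST, collect an
    ACK from every process, then multicast CHECKPOINT_DONE."""
    acks = [p.on_checkpoint_request() for p in participants]
    if all(a == "ACK" for a in acks):
        for p in participants:
            p.on_checkpoint_done()
```

The blocking window between the two multicasts is what guarantees that no message crosses the checkpoint and makes the saved global state inconsistent.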
Message Logging
If the transmission of messages can be replayed, we can still reach a globally consistent state by starting from a checkpointed state and retransmitting all messages sent since. This helps in reducing the number of checkpoints.
Message logging assumes a piecewise deterministic model, where deterministic intervals occur between the sending/receiving of messages.
An orphan process is a process that has survived the crash of another process, but whose state is inconsistent with the crashed process after its recovery.
Message Logging
Incorrect replay of messages after recovery, leading to an orphan process.
Message Logging Schemes
A message is said to be stable if it can no longer be lost, because it has been written to stable storage. Stable messages can be used for recovery by replaying their transmission.
DEP(m): A set of processes that depend upon the delivery of message m. COPY(m): A set of processes that have a copy of m but not yet in their local stable storage. A process Q is an orphan process if there is a message m such that Q is contained in DEP(m), while at the same time all processes in COPY(m) have crashed. We want to avoid this scenario.
Pessimistic logging protocol: for each non-stable message m, there is at most one process dependent upon m, which means that this process is in COPY(m). Basically, a process P is not allowed to send any messages after the delivery of m without first storing m in stable storage.
Optimistic logging protocol: after a crash, orphan processes are rolled back until they are no longer in DEP(m). Much more complicated than pessimistic logging.
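The pessimistic rule — never send after delivering m until m is stable — can be sketched as follows (an in-memory list stands in for stable storage; names are illustrative):

```python
class PessimisticLogger:
    """Sketch of pessimistic message logging: a process may not
    send any message after delivering m until m has been written
    to stable storage, so no non-stable message ever has a
    dependent process other than its receiver."""
    def __init__(self):
        self.stable_log = []           # stand-in for stable storage
        self.delivered_unstable = []   # delivered but not yet logged

    def deliver(self, m):
        self.delivered_unstable.append(m)

    def send(self, m):
        # Force-log every delivered-but-unstable message first.
        self.stable_log.extend(self.delivered_unstable)
        self.delivered_unstable.clear()
        return m                       # now safe to transmit
```

After a crash, replaying `stable_log` reproduces every delivery that could have influenced an outgoing message, so no orphan can arise.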
Distributed Commit
Given a process group and an operation
The operation might or might not be committable at all processes
Either everybody commits or everybody aborts
Consistency, validity, termination
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved. 0-13-239227-5
Distributed Commit
1. Coordinator multicasts a vote request
2. All processes respond to the request
3. Coordinator multicasts the vote result: COMMIT iff all vote COMMIT
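The two phases above can be sketched as a decision function (a simulation, not a networked implementation; each participant is modelled as a zero-argument callable returning its vote, and a crash or timeout during voting is treated as an ABORT vote):

```python
def two_phase_commit(participants):
    """Sketch of the 2PC decision rule: phase 1 collects votes,
    phase 2 multicasts GLOBAL_COMMIT iff every vote was COMMIT."""
    votes = []
    for vote in participants:            # phase 1: VOTE_REQUEST
        try:
            votes.append(vote())
        except Exception:                # crash/timeout counts as ABORT
            votes.append("VOTE_ABORT")
    # Phase 2: the decision is multicast to all participants.
    if all(v == "VOTE_COMMIT" for v in votes):
        return "GLOBAL_COMMIT"
    return "GLOBAL_ABORT"
```

A single dissenting (or silent) participant is enough to force a global abort, which is the "all or nothing" property stated above.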
This handles some error cases
But what if a participant B crashes between voting COMMIT and the broadcast of the COMMIT result, and then comes back to life?
Two-Phase Commit
Figure 8-18. (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant.
Two-Phase Commit
2PC detects crashes via timeouts.
2PC handles crashes by logging state to permanent storage, turning crash errors into reset errors.
Coordinator Perspective
Blocks in WAIT (timeout): a participant may have failed. That participant might vote ABORT, in which case a GLOBAL COMMIT would be wrong and irreversible. So the coordinator must do a GLOBAL ABORT.
Coordinator Perspective
Figure 8-20. Outline of the steps taken by the coordinator in a two-phase commit protocol.
Participant Perspective
Blocks in READY: the coordinator may have failed. What to do? Some participants may already have committed. Perhaps another participant knows what to do?
Participant Perspective
After a timeout allowing all messages in transit to arrive, participant P contacts another participant Q:
If Q is in COMMIT, we know that the coordinator managed to start the commit.
If Q is in ABORT, at least one participant aborted and the coordinator noticed.
If Q is in INIT, Q did not even receive the vote-request, so no one has committed yet.
But what if all are in READY?
Figure 8-19. Actions taken by a participant P when residing in state READY and having contacted another participant Q.
Two-Phase Commit
Figure 8-21. (a) The steps taken by a participant process in 2PC.
All READY (1/2)
Why do we block when all live participants are in the READY state?
All READY (2/2)
Same view, but different decisions, so Yellow needs to wait for Blue or Green to come up again and inspect their log files!
Two-Phase Commit
Two-Phase Commit has the problem that if the coordinator and one participant crash at a bad time, the entire system freezes until one of them is up again. Getting a server up and running again typically involves human (a.k.a. very slow) intervention.
Three-Phase Commit
Three-Phase Commit enhances Two-Phase Commit in that it is non-blocking in many more cases. As long as the live participants can make a majority decision, they can continue on their own. If there are many participants, this makes it very unlikely that 3PC blocks.
Three-Phase Commit
Figure 8-22. (a) The finite state machine for the coordinator in 3PC. (b) The finite state machine for a participant.
On timeout:
IF anyone is in ABORT: ABORT
ELIF anyone is in COMMIT: COMMIT
ELIF anyone is in INIT: ABORT
ELSE: elect a new coordinator among the live processes
New coordinator: go to WAIT and from there to ABORT or PRECOMMIT:
ABORT if a majority of participants are in READY
PRECOMMIT if a majority are in PRECOMMIT
If no majority, then block
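The timeout rules above can be written out as a decision function (a sketch; state names follow the slide, and `BLOCK` marks the no-true-majority case where the live processes must wait):

```python
def elect_decision(live_states, n_total):
    """Sketch of the 3PC timeout rule. `live_states` lists the
    states of the live processes; n_total counts all processes,
    dead and live, so majorities are 'true' majorities."""
    if "ABORT" in live_states:
        return "ABORT"
    if "COMMIT" in live_states:
        return "COMMIT"
    if "INIT" in live_states:
        return "ABORT"
    # Otherwise a new coordinator is elected among the live
    # processes and decides by true majority:
    majority = n_total // 2 + 1
    if live_states.count("READY") >= majority:
        return "ABORT"
    if live_states.count("PRECOMMIT") >= majority:
        return "PRECOMMIT"
    return "BLOCK"   # no true majority: must wait for recoveries
```

Requiring a majority over all processes (not just the live ones) is what makes the later correctness argument go through: READY and PRECOMMIT true majorities cannot coexist.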
If anyone is in PRECOMMIT, then the original coordinator's vote is taken to be PRECOMMIT, as the original coordinator must be in PRECOMMIT.
More Non-Blocking
It follows from the decision rules that the live processes can always make decisions on their own, unless no true majority for READY or PRECOMMIT can be found. True majority: a majority among all processes, both dead and live.
Correctness (1/4)
Let P and Q be any two processes which both acted as coordinator at some point.
THEOREM: It can never happen that P is in ABORT and Q is in COMMIT.
Proof:
1. When P went to ABORT, there was a true majority in READY.
2. When Q went to COMMIT, there was a true majority in PRECOMMIT.
3. These two configurations are mutually exclusive.
Correctness (2/4)
By construction: If there is a process in ABORT, then there is a coordinator in ABORT
Correctness (3/4)
By construction: If there is a process in COMMIT, then there is a coordinator in COMMIT
Correctness (4/4)
Let P and Q be any two processes.
COROLLARY: It can never happen that P is in ABORT and Q is in COMMIT.
Summary
Looked at distributed commit:
2PC is blocking and has a bad state.
3PC blocks less, but is not widely used in practice.