Fault Tolerance: Introduction, Process Resilience, Distributed Commit, Recovery
Fault Tolerance Basic Concepts
Being fault tolerant is strongly related to what are called dependable systems
Dependability implies the following for any system:
Availability: ready to use immediately.
Reliability: can run continuously without failure.
Safety: nothing catastrophic happens if the system fails to operate correctly.
Maintainability: how easily a failed system can be repaired.
Types of faults: Transient, Intermittent, Permanent
Transient: occurs once and then disappears.
Intermittent: occurs, vanishes, then reappears.
Permanent: continues to exist until the faulty component is repaired.
Failure Models
Different types of failures.
Failure Masking by Redundancy
Information redundancy: adding extra bits (like in Hamming codes, see the book Coding and Information Theory) to allow recovery from garbled bits.
Time redundancy: repeat actions if need be.
Physical redundancy: extra equipment or processes are added to make the system tolerate the loss of some components.
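As a toy illustration of information redundancy, a Hamming(7,4) code adds three parity bits to four data bits so that any single flipped bit can be located and corrected (a minimal sketch, not a production codec):

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword.

    Layout (1-indexed): p1 p2 d1 p3 d2 d3 d4, where each parity bit
    covers the positions whose index has the corresponding bit set.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4      # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # recompute each parity group
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-indexed position of the error
    if syndrome:
        c[syndrome - 1] ^= 1         # flip the garbled bit back
    return [c[2], c[4], c[5], c[6]]
```

Flipping any single codeword bit leaves the syndrome pointing exactly at the damaged position, so the decoder recovers the original data bits.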
Failure Masking by Physical Redundancy
Figure 8-2. Triple modular redundancy.
In fig 8-2(a), the signal passes through devices A, B, C in sequence. If one of them is faulty, the final result will probably be incorrect. In fig 8-2(b), each device is replicated three times, and following each stage in the circuit is a triplicated voter. Each voter is a circuit that has three inputs and one output. If two or three of the inputs are the same, the output equals that input. If all three inputs are different, the output is undefined. This type of design is known as triple modular redundancy.
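The voter logic described above can be sketched as follows (a minimal illustration; `device` stands for any replicated circuit stage):

```python
def voter(a, b, c):
    """Three-input majority voter: if at least two inputs agree,
    output that value; otherwise the output is undefined (None)."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None  # all three inputs differ: undefined

def tmr_stage(device, x1, x2, x3):
    """One TMR stage: run three replicas of a device on the three
    (possibly corrupted) inputs, then vote on the outputs."""
    v = voter(device(x1), device(x2), device(x3))
    return v, v, v  # the triplicated voters feed the next stage
```

A single corrupted input (or one faulty replica) is outvoted by the two healthy ones, which is exactly why TMR masks single faults per stage.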
How fault tolerance can actually be achieved
Process Resilience
Achieved by replicating processes into groups. How to design fault-tolerant groups? How to reach an agreement within a group when some members cannot be trusted to give correct answers?
Design Issues: Flat Groups versus Hierarchical Groups
Figure 8-3. (a) Communication in a flat group.
(b) Communication in a simple hierarchical group.
Failure Masking and Replication
Primary-backup protocol (hierarchical group). A primary coordinates all write operations. If it fails, the others hold an election to replace the primary.
Replicated-write protocols (flat group). Active replication as well as quorum-based protocols. These solutions correspond to organizing a collection of identical processes into a flat group.
A system is said to be k fault tolerant if it can survive faults in k components and still meet its specifications. For fail-silent components, k+1 replicas are enough to be k fault tolerant; with Byzantine failures, at least 2k+1 are needed so that a majority vote still yields the correct result.
Agreement in Faulty Systems (1)
Agreement is needed in many cases among different processes. The general goal of agreement is to have all nonfaulty processes reach consensus on some issue, and to establish that consensus within a finite number of steps.
Synchronous versus asynchronous systems
Communication delay is bounded or not
Message delivery is ordered or not
Message transmission is done through unicasting or multicasting
Two Army Problem
Two nonfaulty generals (Bonaparte and Alexander) with unreliable communication.
Byzantine Generals problem
Red army in the valley, n blue generals each with their own army surrounding them. Communication is pairwise, instantaneous and perfect. However m of the blue generals are traitors (faulty processes) and are actively trying to prevent the loyal generals from reaching agreement. The generals know the value m.
Goal: The generals need to exchange their troop strengths. At the end of the algorithm, each general has a vector of length n. If the ith general is loyal, then the ith element contains that general's troop strength; otherwise it is undefined.
Conditions for a Solution
All loyal generals decide upon the same plan of action.
A small number of traitors cannot cause the loyal generals to adopt a bad plan.
Agreement in Faulty Systems (3)
Figure 8-5. The Byzantine agreement problem for three nonfaulty and one faulty process. (a) Each process sends its value to the others.
Byzantine Example
The Byzantine generals problem for 3 loyal generals and 1 traitor:
(a) The generals announce their troop strengths (in units of 1 kilo soldiers).
(b) The vectors that each general assembles based on the previous step.
(c) The vectors that each general receives.
(d) If a value has a majority, then we know it correctly; else it is unknown.
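The majority step in (d) can be sketched as follows (an illustrative sketch; the vectors used in the example are hypothetical troop strengths):

```python
from collections import Counter

def decide(vectors):
    """vectors[j][i] = troop strength that general j reports for
    general i. For each position i, a loyal general takes the
    majority over all received vectors; if no strict majority
    exists, the entry is marked UNKNOWN."""
    n_reports = len(vectors)
    n_generals = len(vectors[0])
    result = []
    for i in range(n_generals):
        counts = Counter(v[i] for v in vectors)
        value, freq = counts.most_common(1)[0]
        result.append(value if freq > n_reports // 2 else "UNKNOWN")
    return result
```

With one traitor among four generals, the three loyal reports agree on every loyal general's strength, so only the traitor's own entry comes out UNKNOWN.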
Byzantine Example (2)
The same as in the previous slide, except now with 2 loyal generals and 1 traitor.
For m faulty processes, we need a total of 3m+1 processes to reach agreement.
Recovery
Backward recovery. Roll the system back from an erroneous state to a previously correct state. This requires the system to take checkpoints, which has the following issues:
Checkpointing is relatively costly, so it is often combined with message logging for better performance. Messages are logged before sending or before receiving, and combined with checkpoints to make recovery possible. Checkpoints alone cannot solve the issue of replaying all messages in the right order.
Backward recovery may enter a recovery loop, so failure transparency cannot be guaranteed.
Some states can never be rolled back to...
Forward recovery. Bring the system to a correct new state from which it can continue execution. E.g., in an (n, k) block erasure code, a set of k source packets is encoded into a set of n encoded packets, such that any set of k encoded packets is enough to reconstruct the original k source packets.
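As a toy illustration of forward recovery, here is an (n, k) = (k+1, k) erasure code that appends a single XOR parity packet, so any one lost packet can be rebuilt from the survivors (real erasure codes such as Reed-Solomon tolerate more losses):

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(packets):
    """Toy (k+1, k) erasure code: append the XOR of the k
    equal-length source packets as a parity packet."""
    parity = packets[0]
    for p in packets[1:]:
        parity = xor_bytes(parity, p)
    return packets + [parity]

def reconstruct(survivors):
    """Rebuild the single missing packet: XOR of all k surviving
    encoded packets (parity included) equals the lost one."""
    out = survivors[0]
    for p in survivors[1:]:
        out = xor_bytes(out, p)
    return out
```

Because the parity is the XOR of all sources, XOR-ing any k of the k+1 encoded packets cancels everything except the missing packet.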
Stable Storage
We need fault-tolerant disk storage for the checkpoints and
message logs. Examples are various RAID (Redundant Array of Independent Disks) schemes (although they are used for both improved fault tolerance as well as improved performance). Some common schemes:
RAID-0 (block-level striping)
RAID-1 (mirroring)
RAID-5 (block-level striping with distributed parity)
RAID-6 (block-level striping with double distributed parity)
Recovery: Stable Storage
Figure 8-23. (a) Stable storage. (b) Crash after drive 1 is
updated. (c) Bad spot due to spontaneous decay can be dealt with.
Checkpointing
Backward error recovery schemes require that a distributed system regularly records a consistent global state to stable storage. This is known as a distributed snapshot.
In a distributed snapshot, if a process P has recorded the receipt of a message, then there is also a process Q that has recorded the sending of that message.
To recover after a process or system failure, it is best to recover to the most recent distributed snapshot, also known as the recovery line.
Approaches: independent checkpointing, coordinated checkpointing, and message logging (optimistic or pessimistic).
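The consistency condition above (every recorded receipt has a matching recorded send) can be checked mechanically; a sketch, assuming each process's snapshot carries sets of sent and received message ids:

```python
def is_consistent(snapshots):
    """snapshots: one dict per process, each with 'sent' and
    'received' sets of message ids. A global state is consistent
    if every message whose receipt was recorded also has its
    sending recorded: no message is received 'from the future'."""
    sent = set().union(*(s["sent"] for s in snapshots))
    received = set().union(*(s["received"] for s in snapshots))
    return received <= sent
```

A recovery line is precisely the most recent collection of local checkpoints for which this check succeeds.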
Checkpointing
Figure 8-24. A recovery line.
Independent Checkpointing
Figure 8-25. The domino effect.
Coordinated Checkpointing
All processes synchronize to jointly write their state to local stable storage, which implies that the saved state is automatically consistent.
Simple coordinated checkpointing. The coordinator multicasts a CHECKPOINT_REQUEST to all processes. When a process receives the request, it takes a local checkpoint, queues any subsequent messages handed to it by the application it is executing, and acknowledges to the coordinator. When the coordinator has received an acknowledgement from all processes, it multicasts a CHECKPOINT_DONE message to allow the blocked processes to continue.
Incremental snapshot. The coordinator multicasts a checkpoint request only to those processes it has sent a message to since it last took a checkpoint. When a process P receives such a request, it forwards it to all those processes to which P itself has sent a message since the last checkpoint, and so on. A process forwards the request only once. When all processes have been identified, a second message is multicast to trigger checkpointing and to allow the processes to continue where they had left off.
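The simple coordinated-checkpointing exchange can be sketched as an in-process simulation (no real network; class and method names are illustrative):

```python
class Participant:
    """Hypothetical participant in simple coordinated checkpointing."""
    def __init__(self, state):
        self.state = state
        self.checkpoint = None
        self.blocking = False     # whether app messages are queued

    def on_checkpoint_request(self):
        self.checkpoint = self.state   # take a local checkpoint
        self.blocking = True           # queue subsequent app messages
        return "ACK"

    def on_checkpoint_done(self):
        self.blocking = False          # flush queue, resume normally

def run_coordinated_checkpoint(participants):
    """Coordinator side: multicast CHECKPOINT_REQUEST, collect an
    ACK from every process, then multicast CHECKPOINT_DONE."""
    acks = [p.on_checkpoint_request() for p in participants]
    if all(a == "ACK" for a in acks):
        for p in participants:
            p.on_checkpoint_done()
```

The blocking window between the two multicasts is what guarantees that no message crosses the checkpoint and makes the saved global state inconsistent.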
Message Logging
If the transmission of messages can be replayed, we can still reach a globally consistent state by starting from a checkpointed state and retransmitting all messages sent since. This helps in reducing the number of checkpoints.
Message logging assumes a piecewise deterministic model, where deterministic intervals occur between the sending/receiving of messages.
An orphan process is a process that has survived the crash of another process, but whose state is inconsistent with the crashed process after its recovery.
Message Logging
Incorrect replay of messages after recovery, leading to an orphan process.
Message Logging Schemes
A message is said to be stable if it can no longer be lost, because it has been written to stable storage. Stable messages can be used for recovery by replaying their transmission.
DEP(m): A set of processes that depend upon the delivery of message m. COPY(m): A set of processes that have a copy of m but not yet in their local stable storage. A process Q is an orphan process if there is a message m such that Q is contained in DEP(m), while at the same time all processes in COPY(m) have crashed. We want to avoid this scenario.
Pessimistic logging protocol: for each non-stable message m, there is at most one process dependent upon m, which means that this process is in COPY(m). Basically, a process P is not allowed to send any messages after the delivery of m without first storing m in stable storage.
Optimistic logging protocol: after a crash, orphan processes are rolled back until they are no longer in DEP(m). Much more complicated than pessimistic logging.
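The pessimistic rule — never send after delivering m until m is stable — can be sketched as follows (an in-memory list stands in for stable storage; names are illustrative):

```python
class PessimisticLogger:
    """Sketch of pessimistic message logging: a process may not
    send any message after delivering m until m has been written
    to stable storage, so no non-stable message ever has a
    dependent process other than its receiver."""
    def __init__(self):
        self.stable_log = []           # stand-in for stable storage
        self.delivered_unstable = []   # delivered but not yet logged

    def deliver(self, m):
        self.delivered_unstable.append(m)

    def send(self, m):
        # Force-log every delivered-but-unstable message first.
        self.stable_log.extend(self.delivered_unstable)
        self.delivered_unstable.clear()
        return m                       # now safe to transmit
```

After a crash, replaying `stable_log` reproduces every delivery that could have influenced an outgoing message, so no orphan can arise.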
Distributed Commit
Given a process group and an operation
The operation might or might not be committable at all processes
Either everybody commits or everybody aborts
Consistency, validity, termination
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved. 0-13-239227-5
Distributed Commit
1. Coordinator multicasts a vote request
2. All processes respond to the request
3. Coordinator multicasts the vote result: COMMIT iff all vote COMMIT
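The two phases above can be sketched as a decision function (a simulation, not a networked implementation; each participant is modelled as a zero-argument callable returning its vote, and a crash or timeout during voting is treated as an ABORT vote):

```python
def two_phase_commit(participants):
    """Sketch of the 2PC decision rule: phase 1 collects votes,
    phase 2 multicasts GLOBAL_COMMIT iff every vote was COMMIT."""
    votes = []
    for vote in participants:            # phase 1: VOTE_REQUEST
        try:
            votes.append(vote())
        except Exception:                # crash/timeout counts as ABORT
            votes.append("VOTE_ABORT")
    # Phase 2: the decision is multicast to all participants.
    if all(v == "VOTE_COMMIT" for v in votes):
        return "GLOBAL_COMMIT"
    return "GLOBAL_ABORT"
```

A single dissenting (or silent) participant is enough to force a global abort, which is the "all or nothing" property stated above.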
This handles some error cases
But what if a participant B crashes between voting COMMIT and the broadcast of the COMMIT result, and then comes back to life?
Two-Phase Commit
Figure 8-18. (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant.
Two-Phase Commit
2PC detects crashes via timeouts.
2PC handles crashes by logging state to permanent storage, turning crash errors into reset errors.
Coordinator Perspective
Blocks in WAIT (timeout): a participant may have failed. That participant might vote ABORT, in which case a GLOBAL COMMIT would be wrong and irreversible. So the coordinator must do a GLOBAL ABORT.
Coordinator Perspective
Figure 8-20. Outline of the steps taken by the coordinator in a two-phase commit protocol.
Participant Perspective
Blocks in READY: the coordinator may have failed. What to do? Some participants may already have committed. Perhaps another participant knows what to do?
Participant Perspective
After a timeout allowing all messages in transit to arrive, participant P contacts another participant Q:
If Q is in COMMIT, we know that the coordinator managed to start the commit.
If Q is in ABORT, at least one participant aborted and the coordinator noticed.
If Q is in INIT, Q did not even receive the vote-request, so no one has committed yet.
But what if all are in READY?
Figure 8-19. Actions taken by a participant P when residing in state READY and having contacted another participant Q.
Two-Phase Commit
Figure 8-21. (a) The steps taken by a participant process in 2PC.
All READY (1/2)
Why do we block when all live participants are in the READY state?
All READY (2/2)
Same view, but different decisions, so Yellow needs to wait for Blue or Green to come up again and inspect their log files!
Two-Phase Commit
Two-Phase Commit has the problem that if the coordinator and one participant crash at a bad time, the entire system freezes until one of them is up again. Getting a server up and running again typically involves human (a.k.a. very slow) intervention.
Three-Phase Commit
Three-Phase Commit enhances Two-Phase Commit in that it is non-blocking in many more cases. As long as the live participants can make a majority decision, they can continue on their own. If there are many participants, this makes it very unlikely that 3PC blocks.
Three-Phase Commit
Figure 8-22. (a) The finite state machine for the coordinator in 3PC. (b) The finite state machine for a participant.
On timeout:
IF anyone is in ABORT: ABORT
ELIF anyone is in COMMIT: COMMIT
ELIF anyone is in INIT: ABORT
ELSE: elect a new coordinator among the live processes
New coordinator: go to WAIT and from there to ABORT or PRECOMMIT:
ABORT if a majority of participants are in READY
PRECOMMIT if a majority are in PRECOMMIT
If no majority, then block
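The timeout rules above can be written out as a decision function (a sketch; state names follow the slide, and `BLOCK` marks the no-true-majority case where the live processes must wait):

```python
def elect_decision(live_states, n_total):
    """Sketch of the 3PC timeout rule. `live_states` lists the
    states of the live processes; n_total counts all processes,
    dead and live, so majorities are 'true' majorities."""
    if "ABORT" in live_states:
        return "ABORT"
    if "COMMIT" in live_states:
        return "COMMIT"
    if "INIT" in live_states:
        return "ABORT"
    # Otherwise a new coordinator is elected among the live
    # processes and decides by true majority:
    majority = n_total // 2 + 1
    if live_states.count("READY") >= majority:
        return "ABORT"
    if live_states.count("PRECOMMIT") >= majority:
        return "PRECOMMIT"
    return "BLOCK"   # no true majority: must wait for recoveries
```

Requiring a majority over all processes (not just the live ones) is what makes the later correctness argument go through: READY and PRECOMMIT true majorities cannot coexist.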
If anyone is in PRECOMMIT, then the original coordinator's vote is taken to be PRECOMMIT, as the original coordinator must be in PRECOMMIT.
More Non-Blocking
It follows from the decision rules that the live processes can always make decisions on their own, unless no true majority for READY or PRECOMMIT can be found. True majority: a majority among all processes, both dead and live.
Correctness (1/4)
Let P and Q be any two processes which both acted as coordinator at some point.
THEOREM: It can never happen that P is in ABORT and Q is in COMMIT.
Proof:
1. When P went to ABORT, there was a true majority in READY.
2. When Q went to COMMIT, there was a true majority in PRECOMMIT.
3. These two configurations are mutually exclusive.
Correctness (2/4)
By construction: If there is a process in ABORT, then there is a coordinator in ABORT
Correctness (3/4)
By construction: If there is a process in COMMIT, then there is a coordinator in COMMIT
Correctness (4/4)
Let P and Q be any two processes.
COROLLARY: It can never happen that P is in ABORT and Q is in COMMIT.
Summary
Looked at distributed commit:
2PC is blocking and has a bad state.
3PC blocks less, but is not widely used in practice.