Assignment 4 - 044
CS3551
ASSIGNMENT – 4
NAME: Saraniya P
REG.NO: 310822243044
DEPT: AI&DS
YEAR: III
1. Issues in Failure Recovery:
Failure recovery in distributed systems is complex due to several challenges:
• Unpredictable Failures: Hardware or software failures can occur at any node,
and their impact may cascade across the system.
• Concurrency and Non-Determinism: Processes in distributed systems
operate concurrently and may exhibit non-deterministic behaviour, so a
recovered execution may not repeat the events that preceded the failure.
• Global State Consistency: It is difficult to ensure a consistent global state for
recovery because distributed systems lack a single point of control.
• Partial Failures: Some nodes may fail while others remain operational,
complicating coordination and recovery efforts.
• Communication Failures: Loss or delay of messages between nodes can lead
to inconsistent states and make recovery harder.
• Cost of Recovery: Recovery mechanisms like checkpointing and logging add
computational and storage overhead.
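The global state consistency issue above can be made concrete with a small check. The sketch below is illustrative only: the `LocalState` class, the per-peer message counters, and the process names are assumptions, not part of any particular system. It treats a set of local states as inconsistent when some process records receiving a message that its sender has not recorded sending (an orphan message).

```python
from dataclasses import dataclass, field


@dataclass
class LocalState:
    """Hypothetical snapshot of one process: message counts per peer."""
    sent: dict = field(default_factory=dict)      # peer -> messages sent to that peer
    received: dict = field(default_factory=dict)  # peer -> messages received from that peer


def is_consistent(states):
    """Return True if no process has received more messages from a peer
    than that peer's snapshot records as sent (no orphan messages)."""
    for receiver, state in states.items():
        for sender, n_received in state.received.items():
            n_sent = states[sender].sent.get(receiver, 0)
            if n_received > n_sent:
                return False  # receive recorded without a matching send
    return True


# P1's snapshot records the send, so P2's receive is accounted for.
good = {"P1": LocalState(sent={"P2": 1}), "P2": LocalState(received={"P1": 1})}
# P1's snapshot predates the send, but P2's snapshot already holds the receive.
bad = {"P1": LocalState(), "P2": LocalState(received={"P1": 1})}

print(is_consistent(good))  # True
print(is_consistent(bad))   # False
```

Counting messages per peer is a simplification chosen for brevity; a fuller treatment would track individual message identifiers.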
Steps:
1. Checkpointing:
o Each process periodically saves its local state (checkpoint)
without waiting for other processes.
o Checkpoints include process state and metadata about
dependencies (e.g., sent/received messages).
2. Log Communication:
o Messages exchanged between checkpoints are logged to
ensure they can be replayed during recovery.
o Processes log messages sent and received during execution.
3. Failure Detection:
o Upon detecting a failure, the system identifies a set of consistent
checkpoints for recovery.
4. Recovery:
o Processes roll back to the checkpoints in that consistent set, which
may be older than their latest checkpoints.
o Messages lost after the checkpoints are replayed from the logs to
ensure consistency.
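The steps above can be sketched for a single process as follows. This is a minimal illustration under stated assumptions: the `Process` class, its running-sum state, and the in-memory `checkpoint` and `log` attributes (standing in for stable storage) are all hypothetical, and failure detection is reduced to an explicit `crash()` call.

```python
import copy


class Process:
    """Hypothetical process that checkpoints independently and logs received messages."""

    def __init__(self, name):
        self.name = name
        self.state = {"total": 0}                     # application state (a running sum)
        self.checkpoint = copy.deepcopy(self.state)   # assumed to live on stable storage
        self.log = []                                 # messages received since the last checkpoint

    def take_checkpoint(self):
        # Step 1: save local state without waiting for any other process.
        self.checkpoint = copy.deepcopy(self.state)
        self.log = []  # older messages are already reflected in the checkpoint

    def deliver(self, value):
        # Step 2: log the message before applying it.
        self.log.append(value)
        self.state["total"] += value

    def crash(self):
        # Volatile state is lost; checkpoint and log are assumed to survive.
        self.state = None

    def recover(self):
        # Step 4: roll back to the checkpoint, then replay the logged messages.
        self.state = copy.deepcopy(self.checkpoint)
        for value in self.log:
            self.state["total"] += value


p = Process("P1")
p.deliver(5)
p.take_checkpoint()
p.deliver(3)
p.deliver(4)
p.crash()
p.recover()
print(p.state["total"])  # 12
```

Replay reproduces the pre-failure state here only because applying a message is deterministic; with non-deterministic events, more than the message contents would need to be logged.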
Benefits:
• No need for global coordination, reducing latency.
• Suitable for systems with frequent communication or high failure
probabilities.
Drawbacks:
• Risk of cascading rollbacks if dependencies among checkpoints are not
carefully managed.
• Increased storage and communication overhead for logging.
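The cascading-rollback drawback can be illustrated with a small search for a consistent set of checkpoints. The sketch below is a simplification, not a definitive algorithm: the `recovery_line` function and the checkpoint layout (cumulative `sent` and `recv` counters per peer, oldest checkpoint first, index 0 being the initial state with no messages) are assumptions made for the example, and every process starts from its latest checkpoint.

```python
def recovery_line(checkpoints):
    """Return {process: checkpoint index} for a consistent set of checkpoints.

    A process is rolled back one checkpoint whenever its chosen checkpoint
    records more messages received from a peer than the peer's chosen
    checkpoint records as sent. Index 0 must be the initial (empty) state.
    """
    line = {proc: len(cps) - 1 for proc, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for receiver in checkpoints:
            idx = line[receiver]
            for sender, n_recv in checkpoints[receiver][idx]["recv"].items():
                n_sent = checkpoints[sender][line[sender]]["sent"].get(receiver, 0)
                if n_recv > n_sent:
                    line[receiver] = idx - 1  # orphan message: roll back further
                    changed = True
                    break
    return line


# P2 sends to P1, P1 checkpoints, P1 sends to P2, P2 checkpoints.
domino = {
    "P1": [{"sent": {}, "recv": {}}, {"sent": {}, "recv": {"P2": 1}}],
    "P2": [{"sent": {}, "recv": {}}, {"sent": {"P1": 1}, "recv": {"P1": 1}}],
}
print(recovery_line(domino))  # {'P1': 0, 'P2': 0}
```

In this example P2's checkpoint depends on a message P1 sent after its own checkpoint, so P2 rolls back; that in turn orphans the message P1 received from P2, and both processes end up at their initial state.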
Key Concepts:
1. Event Logging:
o Each process logs events such as message sends, receives, and state
changes.
o Logs are stored persistently to survive failures.
2. Optimistic Logging:
o Allows processes to proceed without waiting for logs to be committed,
reducing runtime overhead.
o Recovery may involve complex rollbacks and replays.
3. Causal Logging:
o Ensures logs respect causal dependencies between events.
o Balances runtime performance and recovery complexity.
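The trade-off behind optimistic logging can be sketched with a buffered event log. This is an illustration under assumptions: the `EventLog` class, its `flush_every` parameter, and the in-memory `stable` list (standing in for persistent storage) are hypothetical. A value of 1 models a process that waits for every entry to be committed, while a larger value models a process that proceeds and commits entries later.

```python
class EventLog:
    """Hypothetical event log with a volatile buffer and a stable store."""

    def __init__(self, flush_every):
        self.flush_every = flush_every
        self.volatile = []  # buffered entries, lost on a crash
        self.stable = []    # stands in for persistent storage
        self.flushes = 0    # number of writes to stable storage

    def record(self, event):
        self.volatile.append(event)
        if len(self.volatile) >= self.flush_every:
            self.flush()

    def flush(self):
        if self.volatile:
            self.stable.extend(self.volatile)
            self.volatile.clear()
            self.flushes += 1

    def crash(self):
        """Drop the volatile buffer and return the entries that were lost."""
        lost = list(self.volatile)
        self.volatile.clear()
        return lost


events = ["send m1", "recv m2", "send m3", "recv m4"]
synchronous = EventLog(flush_every=1)
optimistic = EventLog(flush_every=3)
for event in events:
    synchronous.record(event)
    optimistic.record(event)

print(synchronous.flushes, synchronous.crash())  # 4 []
print(optimistic.flushes, optimistic.crash())    # 1 ['recv m4']
```

The optimistic configuration writes to stable storage less often, but any entry still in the buffer at the time of a crash is unavailable for replay, which is why its recovery may involve additional rollbacks.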
Challenges:
• Managing and storing logs efficiently in large-scale systems.
• Ensuring that logs capture all necessary events for recovery without
excessive overhead.