Amazon EC2 Availability Tactics Explained

Software architecture aims to maximize system availability through various tactics: 1) Fault detection methods like pinging, heartbeats, and data checks monitor system health. 2) Recovery is enabled by redundancy, rollbacks, retries, and state resynchronization when failures occur. 3) Fault prevention focuses on removing potential causes of failure through software upgrades, monitoring, and exception handling.

Uploaded by

abhay
Copyright
© All Rights Reserved

Software Architecture

Availability
BITS Pilani
Availability

1) Call out
2) Give Response

SS ZG653
Availability
• The SLA that Amazon provides for its EC2 cloud service is:
• "AWS will use commercially reasonable efforts to make Amazon EC2 available with an Annual Uptime Percentage [defined elsewhere] of at least 99.95% during the Service Year. In the event Amazon EC2 does not meet the Annual Uptime Percentage commitment, you will be eligible to receive a Service Credit as described below."
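To make the 99.95% figure concrete, the sketch below converts an uptime percentage into a yearly downtime budget. It assumes a 365-day service year and treats any downtime as counting against the budget; the SLA's own definition of Annual Uptime Percentage is "defined elsewhere" and may differ. The function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap, 365-day service year


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Return the downtime budget (in hours) implied by an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (1 - uptime_percent / 100)


budget = allowed_downtime_hours(99.95)
print(f"{budget:.2f} hours of downtime per year")  # 4.38 hours of downtime per year
```

Under these assumptions, 99.95% uptime leaves a little under four and a half hours of downtime per year.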
Availability
System Availability Requirements

Amazon EC2's SLA commits to at least 99.95% availability (backed by service credits, not an absolute guarantee)


Availability

• Availability is the ability of a system to minimize system outages

• Availability is generally measured as % of up-time

• Ex. If a system is down for an average of 1 day in 100 days, its availability is 99%
Availability

Calculating availability

Uptime                       Downtime
(Time between failures)      (Time to repair)
(TBF)                        (TTR)

99 days                      1 day

Availability = Mean TBF / (Mean TBF + Mean TTR)
             = MTBF / (MTBF + MTTR)
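As a worked instance of the formula, the following sketch plugs in the slide's figures (99 days up, 1 day down). The function name and the validation are illustrative additions.

```python
def availability(mtbf, mttr):
    """Availability = MTBF / (MTBF + MTTR); both arguments use the same time unit."""
    if mtbf < 0 or mttr < 0 or mtbf + mttr == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf / (mtbf + mttr)


# The slide's example: 99 days between failures, 1 day to repair.
print(f"{availability(99, 1):.2%}")  # 99.00%
```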
How to recover from node failure in a network?
• Detect node failure by ‘pinging’ nodes

• If a node does not ‘Echo’ back, reroute packets via alternate nodes
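The ping/echo idea can be sketched as a small simulation. The topology, the node names, and the `ping` stand-in (which just checks a set of failed nodes) are all hypothetical; a real network would send actual echo requests and use a routing protocol rather than a breadth-first search.

```python
from collections import deque

# Hypothetical topology: each node lists its directly connected neighbours.
NETWORK = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}


def ping(node, failed):
    """Simulated ping: a node 'echoes' back only if it has not failed."""
    return node not in failed


def find_route(network, source, target, failed):
    """Breadth-first search that only forwards through nodes that echo back."""
    if not ping(source, failed) or not ping(target, failed):
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbour in network[path[-1]]:
            if neighbour not in visited and ping(neighbour, failed):
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no route through responsive nodes


print(find_route(NETWORK, "A", "D", failed=set()))  # ['A', 'B', 'D']
print(find_route(NETWORK, "A", "D", failed={"B"}))  # ['A', 'C', 'D']
```

When node B stops echoing, traffic from A to D is rerouted through C.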
Availability

Types of faults

• Omission: Failure to respond
• Crash: Repeated failure to respond
• Timing issue: Responding early or late (Ex. Late arrival of a message)
• Incorrect response: Response with an incorrect value
Availability

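• Replication: Identical copies of a component
• Functional redundancy: Components that take the same input but are diversely designed
• Analytic redundancy: Components that differ in both design and inputs (Ex. Determining altitude using barometric pressure, or using geometry)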
Availability

• Redundancy:
• Active (hot spare): All components process the same inputs in parallel. Ex. Disk mirroring, sending data via 2 routes
• Passive (warm spare): Spares receive periodic state updates from the active component. Ex. Server farms in eCommerce
• The application must store as much of its state as possible on non-volatile shared storage. Equally important is the ability to restart on another node from the last state before failure, using the state saved on the shared storage.
• Spare (cold spare): The spare stays out of service until a failure occurs, then a power-on procedure brings it into service
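The warm-spare point about saved state can be sketched as follows. This is a toy model, not a production pattern: a temporary file stands in for the non-volatile shared storage, the service's whole state is a single counter, and a checkpoint is written after every request rather than periodically. All names are hypothetical.

```python
import json
import os
import tempfile


class CheckpointedCounter:
    """Toy service whose entire state is one counter, checkpointed to a file."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.count = 0

    def handle_request(self):
        self.count += 1
        self.save_checkpoint()

    def save_checkpoint(self):
        # Write to a temporary file and rename it, so a crash in the middle
        # of a write never leaves a half-written checkpoint behind.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"count": self.count}, f)
        os.replace(tmp_path, self.checkpoint_path)

    @classmethod
    def restart_from(cls, checkpoint_path):
        """Start a replacement node from the last saved state, if any."""
        node = cls(checkpoint_path)
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path) as f:
                node.count = json.load(f)["count"]
        return node


# A temporary directory stands in for the shared storage.
shared_storage = os.path.join(tempfile.mkdtemp(), "state.json")
primary = CheckpointedCounter(shared_storage)
for _ in range(3):
    primary.handle_request()
del primary  # simulate failure of the primary node

standby = CheckpointedCounter.restart_from(shared_storage)
print(standby.count)  # 3
```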
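How to recover from failed online transactions?
Roll back incomplete transactions, using a log

Before: A=10,000, B=10,000

(time runs downward)
Begin Transaction
    Read A
    A = A - 1000
    Write A
    <-- Crash
    Read B
    B = B + 1000
    Write B
End Transaction

After a complete transaction: A=9,000, B=11,000

If the system crashes after Write A, A is already 9,000 but B is still 10,000. The log is used to undo the incomplete transaction and restore A to 10,000.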
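A minimal sketch of log-based rollback for the transfer above, assuming an in-memory undo log that records each account's old value before it is overwritten. Real databases keep the log on durable storage; the `CrashError` exception and the `crash_after_debit` flag exist only to simulate the crash.

```python
class CrashError(Exception):
    """Stands in for a crash in the middle of a transaction."""


def transfer(accounts, undo_log, amount, crash_after_debit=False):
    # Record the old value in the log before each write.
    undo_log.append(("A", accounts["A"]))
    accounts["A"] -= amount
    if crash_after_debit:
        raise CrashError("crash after Write A")
    undo_log.append(("B", accounts["B"]))
    accounts["B"] += amount
    undo_log.clear()  # End Transaction reached: nothing left to undo


def rollback(accounts, undo_log):
    """Undo the incomplete transaction by restoring old values, newest first."""
    while undo_log:
        name, old_value = undo_log.pop()
        accounts[name] = old_value


accounts = {"A": 10_000, "B": 10_000}
undo_log = []
try:
    transfer(accounts, undo_log, 1_000, crash_after_debit=True)
except CrashError:
    rollback(accounts, undo_log)
print(accounts)  # {'A': 10000, 'B': 10000}
```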
How to ensure integrity of data when 2 txns are trying to access the same data?
‘Lock’ concept in database

A = 10,000

(time runs downward)
User 1              User 2
Read A
                    Read A
A = A + 1000
                    A = A - 1000
Write A
                    Write A

A = 9,000

Interleaving of txns can lead to loss of data integrity: the +1000 update is lost, so A ends at 9,000 instead of 10,000.
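The lost update and its fix can be sketched as follows. The first function replays the interleaved schedule step by step, with no threads, so the outcome is deterministic. The `Account` class then uses `threading.Lock` as a stand-in for a database lock, so each read-modify-write runs as one unit. The class and function names are illustrative.

```python
import threading


def interleaved_without_lock(balance):
    """Replay the interleaved schedule: both users read before either writes."""
    user1_copy = balance   # User 1: Read A
    user2_copy = balance   # User 2: Read A
    user1_copy += 1000     # User 1: A = A + 1000
    user2_copy -= 1000     # User 2: A = A - 1000
    balance = user1_copy   # User 1: Write A
    balance = user2_copy   # User 2: Write A (overwrites User 1's update)
    return balance


class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

    def update(self, delta):
        # Holding the lock makes read-modify-write a single indivisible step.
        with self.lock:
            current = self.balance
            self.balance = current + delta


account = Account(10_000)
threads = [threading.Thread(target=account.update, args=(delta,))
           for delta in (1000, -1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(interleaved_without_lock(10_000))  # 9000
print(account.balance)                   # 10000
```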


Availability

Source: ‘Software Architecture in Practice’ by Len Bass & others
Availability Tactics

Fault detection
• Ping/echo
• Heartbeat
• Timestamp
• Data sanity check
• Condition monitoring
• Voting
• Exception detection
• Self-test

Error Masking
• Active redundancy (Hot)
• Passive redundancy (Warm)
• Spare (Cold)
• Exception handling
• Graceful degradation
• Ignore faulty behavior

Recover From Fault
• Rollback
• Retry
• Reconfiguration
• Shadow operation
• State resynchronization
• Escalating restart
• Nonstop forwarding

Fault prevention
• Removal of a component to prevent anticipated failure (auto/manual reboot)
• Create transaction
• Software upgrade
• Predictive model
• Process monitor that can detect, remove and restart faulty process
• Exception prevention

9/17/2023 SS ZG653
Availability

Detecting faults – other tactics

• Checksum: Used in networks and data storage
• Voting: Triple Modular Redundancy (TMR), used in satellite systems
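Both detection tactics can be sketched briefly. The checksum example uses CRC-32 from Python's standard `zlib` module as one possible checksum; the slides do not name a specific algorithm. The voting example assumes three redundant components whose outputs are compared, with the majority value taken as correct. The function names and payloads are hypothetical.

```python
import zlib
from collections import Counter


def with_checksum(payload):
    """Attach a CRC-32 checksum to a payload before sending or storing it."""
    return payload, zlib.crc32(payload)


def is_intact(payload, checksum):
    """Recompute the checksum on receipt and compare it with the stored one."""
    return zlib.crc32(payload) == checksum


def majority_vote(outputs):
    """Return the value most components agree on; fail if there is no majority."""
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise ValueError("no majority among component outputs")
    return value


payload, checksum = with_checksum(b"altitude=10000")
print(is_intact(payload, checksum))            # True
print(is_intact(b"altitude=90000", checksum))  # False (one byte corrupted)
print(majority_vote([42, 42, 17]))             # 42 (the faulty output is outvoted)
```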
Disk mirroring (Active redundancy)
To recover from disk failure

Once new disk is installed, copy database (resynchronization)
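A toy model of mirroring and resynchronization, with two in-memory dictionaries standing in for the disks. The class and its method names are hypothetical, and real mirroring happens at the block level in the storage layer.

```python
class MirroredStore:
    """Every write goes to both 'disks'; reads use any surviving disk."""

    def __init__(self):
        self.disks = [{}, {}]  # two dictionaries stand in for two disks

    def write(self, key, value):
        for disk in self.disks:
            if disk is not None:
                disk[key] = value

    def read(self, key):
        for disk in self.disks:
            if disk is not None:
                return disk[key]
        raise RuntimeError("all disks have failed")

    def fail_disk(self, index):
        self.disks[index] = None  # simulate a disk failure

    def replace_disk(self, index):
        """Install a new disk and copy the surviving disk's contents onto it."""
        survivor = next((d for d in self.disks if d is not None), None)
        if survivor is None:
            raise RuntimeError("no surviving disk to resynchronize from")
        self.disks[index] = dict(survivor)  # resynchronization


store = MirroredStore()
store.write("order-1", "paid")
store.fail_disk(0)
store.write("order-2", "shipped")        # service continues on the mirror
print(store.read("order-1"))             # paid
store.replace_disk(0)
print(store.disks[0] == store.disks[1])  # True
```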
High availability in Tomcat

Use of a temporary server while the main server is being upgraded during a new release

• Live traffic requests are redirected to a temporary server while the main server is upgraded
Load balancing
• Distribute load to different servers in a server farm
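One simple way to distribute load is round-robin selection, sketched below. The slides do not name a balancing policy, so round-robin is an assumption, as are the server names; skipping servers marked as down is added to show how load balancing also supports availability.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any that are marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Try each server at most once per call.
        for _ in range(len(self.servers)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
first_round = [balancer.next_server() for _ in range(3)]
balancer.mark_down("web-2")
after_failure = [balancer.next_server() for _ in range(4)]
print(first_round)    # ['web-1', 'web-2', 'web-3']
print(after_failure)  # ['web-1', 'web-3', 'web-1', 'web-3']
```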
Types of faults that impact availability
Hardware faults
• Disk failure
• Power failure
• Network failure

Software faults
• Memory leak
• Divide by zero
• Incorrect parameter passing
Server farms

A server farm or server cluster is a collection of computer servers, usually maintained by an organization to supply server functionality far beyond the capability of a single machine.

Server farms often consist of thousands of computers which require a large amount of power to run and to keep cool. At the optimum performance level, a server farm has enormous costs (both financial and environmental) associated with it.

Server farms often have backup servers, which can take over the function of primary servers in the event of a primary-server failure.

Server farms are typically collocated with the network switches and/or routers which enable communication between the different parts of the cluster and the users of the cluster.
Amazon Server Farm

Refer: https://www.youtube.com/watch?v=q6WlzHLxNKI&t=342s
Availability

Recovering from faults – other tactics

• Exception handling: Ex. Divide by zero, file I/O error
• Retry: Used in networks and server farms
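Both recovery tactics can be sketched in a few lines. The helper names, the default value returned on division by zero, and the `flaky_request` function that fails twice before succeeding are all hypothetical. The retry helper catches `OSError`, which in Python covers file I/O errors and connection errors.

```python
import time


def safe_divide(numerator, denominator, default=None):
    """Exception handling: catch divide-by-zero instead of crashing."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        return default


def retry(operation, attempts=3, delay_seconds=0.0):
    """Retry: repeat an operation that fails with a possibly transient error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except OSError as error:  # includes ConnectionError and file I/O errors
            last_error = error
            time.sleep(delay_seconds)
    raise last_error


calls = {"count": 0}


def flaky_request():
    """Hypothetical request that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient network fault")
    return "ok"


print(safe_divide(10, 0, default=0))  # 0
print(retry(flaky_request))           # ok
```

Retry only helps with transient faults; a request that keeps failing exhausts its attempts and the last error is raised to the caller.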
Summary of availability tactics

• Detect faults
• Recover from faults (Automated or manual)
• Prevent faults
