Software Architecture
Availability
BITS Pilani
Availability
1) Call out
2) Give Response
SS ZG653
Availability
• The SLA that Amazon provides for its EC2 cloud service states:
• AWS will use commercially reasonable efforts to make Amazon EC2
available with an Annual Uptime Percentage [defined elsewhere] of
at least 99.95% during the Service Year. In the event Amazon EC2
does not meet the Annual Uptime Percentage commitment, you will
be eligible to receive a Service Credit as described below.
System Availability Requirements
Amazon EC2 commits to an Annual Uptime Percentage of at least 99.95%, backed by service credits
Availability
• Availability is the ability of a system to minimize system
outages
• Availability is generally measured as % of up-time
• Ex. If a system is down for an average of 1 day in 100
days, its availability is 99%
Calculating availability
Uptime = time between failures (TBF), ex. 99 days
Downtime = time to repair (TTR), ex. 1 day
Availability = Mean TBF / (Mean TBF + Mean TTR)
             = MTBF / (MTBF + MTTR)
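The formula can be checked with a few lines of Python. This is a minimal sketch; the function names `availability` and `allowed_downtime_hours_per_year` are illustrative and not part of the course material, and a 365-day year is assumed.

```python
def availability(mtbf: float, mttr: float) -> float:
    """Availability = MTBF / (MTBF + MTTR); both arguments use the same time unit."""
    if mtbf < 0 or mttr < 0 or mtbf + mttr == 0:
        raise ValueError("mtbf and mttr must be non-negative and not both zero")
    return mtbf / (mtbf + mttr)


def allowed_downtime_hours_per_year(target: float, hours_per_year: float = 365 * 24) -> float:
    """Downtime budget implied by an availability target (assumes a 365-day year)."""
    return (1.0 - target) * hours_per_year


# Slide example: 99 days up, 1 day down.
print(availability(99, 1))  # 0.99

# Downtime budget for a 99.95% target, rounded for display.
print(round(allowed_downtime_hours_per_year(0.9995), 2))  # 4.38
```

Under these assumptions, a 99.95% target leaves roughly 4.4 hours of downtime per year.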
How to recover from node
failure in a network?
• Detect node failure by ‘pinging’ nodes
• If a node does not ‘echo’ back, reroute packets
via alternate nodes
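The ping/echo idea can be sketched without real network calls. In the example below the node names, the `node_up` table and the `simulated_ping` function are all hypothetical stand-ins for an actual ICMP ping; only the route-selection logic is the point.

```python
from typing import Callable, Dict, List, Optional


def first_healthy_route(
    routes: List[List[str]], ping: Callable[[str], bool]
) -> Optional[List[str]]:
    """Return the first route on which every node echoes back, or None."""
    for route in routes:
        if all(ping(node) for node in route):
            return route
    return None


# Hypothetical network state: node name -> does it answer a ping?
node_up: Dict[str, bool] = {"A": True, "B": False, "C": True, "D": True}


def simulated_ping(node: str) -> bool:
    """Stand-in for a real ping; unknown nodes count as down."""
    return node_up.get(node, False)


# Preferred route goes through B, which is down, so the alternate is chosen.
routes = [["A", "B", "D"], ["A", "C", "D"]]
chosen = first_healthy_route(routes, simulated_ping)
print(chosen)  # ['A', 'C', 'D']
```

A real router would ping periodically and with a timeout; here a single synchronous check is assumed for brevity.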
Types of faults
• Omission: failure to respond
• Crash: repeated failure to respond
• Timing: response arrives early or late (ex. late arrival of a
message)
• Incorrect response: response with an incorrect value
Availability
• Replication
• Functional redundancy (same input, but diversely
designed components)
• Analytic redundancy (ex. determining altitude both from
barometric pressure and from geometry)
Availability
• Redundancy:
• Active (hot spare): All components process in parallel. Ex. disk
mirroring, sending data via 2 routes
• Passive (warm spare): Periodic state updates. Ex. server farms
in eCommerce
• The application must store as much of its state on non-volatile
shared storage as possible. Equally important is the
ability to restart on another node at the last state before
failure, using the saved state from the shared storage.
• Spare (cold spare): The spare must go through a power-on
procedure before it can take over
How to recover from failed
online transactions?
Roll back incomplete transactions, using a log
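Before: A=10,000, B=10,000
Transaction steps (time runs downward):
Begin Transaction
Read A
A = A - 1000
Write A
Crash (A is already debited, B is not yet credited)
Read B
B = B + 1000
Write B
End Transaction
After a complete transaction: A=9,000, B=11,000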
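The rollback idea can be illustrated with an in-memory undo log. This is a simplified sketch, not how any particular database implements it: the `transfer` function, the `CrashError` exception and the `crash_after_debit` flag are hypothetical, and a real system would keep the log on durable storage and roll back during restart.

```python
from typing import Dict, List, Tuple


class CrashError(Exception):
    """Simulated crash in the middle of a transaction."""


def transfer(accounts: Dict[str, int], amount: int, crash_after_debit: bool) -> None:
    """Move `amount` from A to B; undo every write if the transaction fails."""
    undo_log: List[Tuple[str, int]] = []  # (account name, value before the write)
    try:
        undo_log.append(("A", accounts["A"]))  # log old value before Write A
        accounts["A"] -= amount
        if crash_after_debit:
            raise CrashError("crash after Write A")
        undo_log.append(("B", accounts["B"]))  # log old value before Write B
        accounts["B"] += amount
    except CrashError:
        # Roll back: restore logged values, most recent write first.
        for name, old_value in reversed(undo_log):
            accounts[name] = old_value
        raise


accounts = {"A": 10_000, "B": 10_000}
try:
    transfer(accounts, 1_000, crash_after_debit=True)
except CrashError:
    pass
print(accounts)  # {'A': 10000, 'B': 10000}

transfer(accounts, 1_000, crash_after_debit=False)
print(accounts)  # {'A': 9000, 'B': 11000}
```

After the simulated crash, both balances are back at their starting values, so the debit to A is not left without the matching credit to B.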
How to ensure integrity of data when 2
txns are trying to access the same data?
‘Lock’ concept in database
A = 10,000
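User 1          User 2          (time runs downward)
Read A
                Read A
A = A + 1000
                A = A - 1000
Write A
                Write A
A = 9,000 (User 1's update is lost; running the two txns one after the other would give A = 10,000)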
Interleaving of txns can lead to loss of data integrity
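The sketch below first replays the interleaving step by step, then shows the same two updates protected by a lock. It uses Python's `threading.Lock` as a stand-in for a database lock; the `Account` class and its `apply` method are illustrative names, not a database API.

```python
import threading

# Part 1: the interleaving from the slide, replayed deterministically.
balance = 10_000
user1_copy = balance          # User 1: Read A
user2_copy = balance          # User 2: Read A
user1_copy += 1_000           # User 1: A = A + 1000
user2_copy -= 1_000           # User 2: A = A - 1000
balance = user1_copy          # User 1: Write A
balance = user2_copy          # User 2: Write A (overwrites User 1's update)
lost_update_result = balance


# Part 2: the read-modify-write sequence is made atomic with a lock.
class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance
        self._lock = threading.Lock()

    def apply(self, delta: int) -> None:
        """Read, modify and write while holding the lock."""
        with self._lock:
            current = self.balance
            self.balance = current + delta


account = Account(10_000)
threads = [threading.Thread(target=account.apply, args=(d,)) for d in (1_000, -1_000)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(lost_update_result, account.balance)  # 9000 10000
```

With the lock, the second update always reads the value written by the first, whichever thread runs first.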
Source: ‘Software Architecture in Practice’ by Len Bass and others
Availability Tactics
Fault detection
• Ping/echo
• Heartbeat
• Timestamp
• Data sanity check
• Condition monitoring
• Voting
• Exception detection
• Self-test
Error masking
• Active redundancy (Hot)
• Passive redundancy (Warm)
• Spare (Cold)
• Exception handling
• Graceful degradation
• Ignore faulty behavior
Recover from fault
• Rollback
• Retry
• Reconfiguration
• Shadow operation
• State resynchronization
• Escalating restart
• Nonstop forwarding
Fault prevention
• Removal of a component to prevent anticipated failure (auto/manual reboot)
• Create transaction
• Software upgrade
• Predictive model
• Process monitor that can detect, remove and restart a faulty process
• Exception prevention
9/17/2023 SS ZG653
Detecting faults – other tactics
• Checksum: used in networks and data storage
• Voting (ex. Triple Modular Redundancy): used in satellite
systems
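Both tactics fit in a short sketch. The checksum part uses CRC-32 from Python's standard `zlib` module as one possible checksum; the voter takes the majority of three redundant outputs. The function names and the sample payloads are illustrative assumptions.

```python
import zlib
from collections import Counter
from typing import Hashable, Sequence, Tuple


def with_checksum(payload: bytes) -> Tuple[bytes, int]:
    """Attach a CRC-32 checksum to a payload before sending or storing it."""
    return payload, zlib.crc32(payload)


def is_intact(payload: bytes, checksum: int) -> bool:
    """Recompute the checksum on receipt and compare it with the stored one."""
    return zlib.crc32(payload) == checksum


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of redundant modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


payload, crc = with_checksum(b"altitude=1200")
print(is_intact(payload, crc))           # True
print(is_intact(b"altitude=1300", crc))  # False

# Triple Modular Redundancy: one of three modules returns a wrong value.
print(majority_vote([42, 42, 41]))       # 42
```

The voter masks a single faulty module; if all three outputs disagree it raises an error instead of guessing.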
Disk mirroring (Active
redundancy)
To recover from disk failure:
Once a new disk is installed, copy the database onto it from the surviving mirror
(resynchronization)
High availability in Tomcat
Use a temporary server while the main server is being upgraded
for a new release
• Live traffic requests are redirected to the temporary server until the
main server upgrade is complete
Load balancing
• Distribute load to different servers in a server farm
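One simple way to distribute load is round-robin selection that skips servers marked unhealthy. The sketch below is an illustration under that assumption; the `RoundRobinBalancer` class and the server names `s1`..`s3` are hypothetical, and production load balancers use richer policies.

```python
from typing import List, Set


class RoundRobinBalancer:
    """Hands out servers in turn, skipping any that are marked unhealthy."""

    def __init__(self, servers: List[str]) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._next = 0
        self.healthy: Set[str] = set(self._servers)

    def pick(self) -> str:
        """Return the next healthy server, trying each server at most once."""
        for _ in range(len(self._servers)):
            server = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy server available")


lb = RoundRobinBalancer(["s1", "s2", "s3"])
print([lb.pick() for _ in range(4)])  # ['s1', 's2', 's3', 's1']

# Mark s2 as failed (for example after a missed ping); it is now skipped.
lb.healthy.discard("s2")
print([lb.pick() for _ in range(3)])  # ['s3', 's1', 's3']
```

Combining the balancer with a fault-detection tactic such as ping/echo is what lets the farm keep serving requests when one server fails.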
Types of faults that impact
availability
Hardware faults
• Disk failure
• Power failure
• Network failure
Software faults
• Memory leak
• Divide by zero
• Incorrect parameter passing
Server farms
A server farm or server cluster is a collection of computer servers - usually
maintained by an organization to supply server functionality far beyond the
capability of a single machine.
Server farms often consist of thousands of computers which require a large amount
of power to run and to keep cool. At the optimum performance level, a server
farm has enormous costs (both financial and environmental) associated with it.[1]
Server farms often have backup servers, which can take over the function of primary
servers in the event of a primary-server failure.
Server farms are typically collocated with the network switches and/or routers which
enable communication between the different parts of the cluster and the users of
the cluster.
Amazon Server Farm
Refer: https://www.youtube.com/watch?v=q6WlzHLxNKI&t=342s
Recovering from faults – other tactics
• Exception handling: Ex. Divide by zero, File IO
error
• Retry: Used in networks and server farms
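Both tactics can be shown together in a small sketch. The `retry` helper, `safe_divide` and the simulated `flaky_read` operation are illustrative names; the retry policy (fixed attempt count, optional fixed delay, retry only on `OSError`) is an assumption chosen for brevity.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
) -> T:
    """Call `operation` until it succeeds or `attempts` tries are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller handle the fault
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")


def safe_divide(a: float, b: float, default: float = 0.0) -> float:
    """Exception handling: return a default instead of crashing on divide by zero."""
    try:
        return a / b
    except ZeroDivisionError:
        return default


# Simulated file read that fails twice, then succeeds.
calls = {"count": 0}


def flaky_read() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("simulated file I/O error")
    return "data"


result = retry(flaky_read, attempts=3)
print(result, calls["count"])  # data 3
print(safe_divide(1, 0))       # 0.0
```

Retry suits transient faults such as a dropped packet; a fault that persists past the last attempt is re-raised so a higher-level tactic can take over.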
Summary of availability tactics
• Detect faults
• Recover from faults (Automated or manual)
• Prevent faults