AVAILABILITY CONCEPT
Chapter 4
INTRODUCTION
• Everyone expects their infrastructure to be available all the time
• A 100% guaranteed availability of an infrastructure is impossible
UNDERSTANDING AVAILABILITY
There are different definitions:
• Enable access to authorized information or resources to those who need them
• Authorized parties should not be prevented from accessing objects to which they have legitimate access
• A requirement intended to assure that systems work promptly and that service is not denied to authorized users
UNDERSTANDING AVAILABILITY
• Information availability can be defined with the help of:
● Accessibility
• Information should be accessible to the right user when required
● Reliability
• Information should be reliable and correct in all aspects. It is “the same” as what was stored, and there is no alteration or corruption to the information
● Timeliness
• Defines the time window during which information must be accessible. For example, if online access to an application is required between 8:00 a.m. and 10:00 p.m. each day, any disruptions to data availability outside of this time slot are not considered to affect timeliness
IMPACT OF DOWNTIME
Lost Productivity
• Number of employees impacted x hours out x hourly rate
Lost Revenue
• Direct loss
• Compensatory payments
• Lost future revenue
• Billing losses
• Investment losses
Damaged Reputation
• Customers
• Suppliers
• Financial markets
• Banks
• Business partners
Other Expenses
• Temporary employees, equipment rental, overtime costs, extra shipping costs, travel expenses, and so on
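The lost-productivity formula above maps directly to a one-line calculation. A minimal sketch in Python, with all input figures hypothetical:

# Lost productivity = employees impacted x hours out x hourly rate
# All input values below are hypothetical, for illustration only.
employees_impacted = 200
hours_out = 4
hourly_rate = 45.0  # average cost per employee-hour

lost_productivity = employees_impacted * hours_out * hourly_rate
print(f"Lost productivity: ${lost_productivity:,.2f}")  # $36,000.00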
CALCULATING AVAILABILITY
• In this age of global, always-on, always-connected systems, disturbances in availability are noticed immediately
• A 100% guaranteed availability of an infrastructure, however, is impossible
• Over the years, much knowledge and experience has been gained on how to design highly available systems:
• Failover
• Redundancy
• Avoiding Single Points of Failure (SPOFs)
AVAILABILITY PERCENTAGE
• The availability of a system is usually expressed as a percentage of uptime in a given time period
• Usually, one year or one month
• Typical requirements used in service level agreements today are 99.8% or 99.9% availability per month for a full IT system
• Example for downtime expressed as a percentage per year:

Availability %          Downtime per year   Downtime per month   Downtime per week
99.8%                   17.5 hours          86.2 minutes         20.2 minutes
99.9% (three nines)     8.8 hours           43.2 minutes         10.1 minutes
99.99% (four nines)     52.6 minutes        4.3 minutes          1.0 minutes
99.999% (five nines)    5.3 minutes         25.9 seconds         6.1 seconds
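The downtime figures follow directly from the percentage. A minimal sketch (assuming a 30-day month, which matches how the table's per-month figures appear to be derived):

availability = 99.9  # percent

HOURS_PER_YEAR = 365 * 24   # 8,760 hours
HOURS_PER_MONTH = 30 * 24   # 720 hours (30-day month)
HOURS_PER_WEEK = 7 * 24     # 168 hours

down = 1 - availability / 100
print(f"per year:  {down * HOURS_PER_YEAR:.1f} hours")      # 8.8 hours
print(f"per month: {down * HOURS_PER_MONTH * 60:.1f} min")  # 43.2 minutes
print(f"per week:  {down * HOURS_PER_WEEK * 60:.1f} min")   # 10.1 minutes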
MTBF AND MTTR
• Factors involved in calculating availability:
• Mean Time Between Failures (MTBF)
• Mean Time To Repair (MTTR)
MTBF
• MTBF is the mean time between two consecutive failures of a component
• In other words, MTBF is a maintenance metric, represented in hours, showing how long a piece of equipment operates without interruption
• Example:
Component MTBF (hours)
Hard disk 750,000
Power supply 100,000
Fan 100,000
Ethernet Network Switch 350,000
RAM 1,000,000
It is important to understand how these numbers are calculated. No manufacturer can test if a hard disk will continue to work without failing for 750,000 hours (≈85 years). Instead, manufacturers run tests on large batches of components.
CALCULATE MTBF
• MTBF is calculated by taking the total time an asset is running (uptime) and dividing it by the number of
breakdowns that happened over that same period of time.
MTBF = Total uptime / # of Breakdowns
• The MTBF calculation might look like this:
• Find the total uptime: Imagine you have a warehouse full of hard disks, and 40 of them
were tested for 400 hours each. The total hours spent testing equal 16,000 hours (40 x 400 =
16,000).
• Figure out the number of failures: Identify the number of failures over the entire number of hard disks tested. For this example, consider there were 5 hard disk failures.
• Calculate MTBF: Now that we know testing was performed for 16,000 hours with 5 hard disk failures, we can calculate MTBF: 16,000 hours / 5 failures = 3,200 hours.
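The same worked example translates directly into code. A minimal sketch:

# MTBF = total uptime / number of breakdowns
disks_tested = 40
hours_per_disk = 400
failures = 5

total_uptime = disks_tested * hours_per_disk   # 16,000 hours
mtbf = total_uptime / failures                 # 3,200 hours
print(f"MTBF: {mtbf:,.0f} hours")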
CALCULATE MTBF
So, what does this tell us?
In this example, the MTBF isn't suggesting that each hard disk should last 3,200 hours. It's saying that if you run a group of hard disks, the average time between failures within the tested group is 3,200 hours.
In other words, MTBF isn't meant to predict the behavior of a single component; it predicts the behavior of a group of components.
MTTR
• MTTR can be kept low by:
• Having a service contract with the supplier
• Having spare parts on-site
• Automated redundancy and failover
MTTR
[Figure: incident timeline. “Time to repair” (downtime) runs from the incident through detection, diagnosis, repair, recovery, and restoration; detection and diagnosis make up the response time, repair through restoration the recovery time. “Time between failures” (uptime) then runs until the next incident.]
MTTR: Average time required to repair a failed component
MTTR = Total downtime/Number of failures
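The MTTR formula mirrors the MTBF one. A minimal sketch, with hypothetical inputs:

# MTTR = total downtime / number of failures
# Input values below are hypothetical, for illustration only.
total_downtime_hours = 40
failures = 5

mttr = total_downtime_hours / failures  # 8 hours per repair
print(f"MTTR: {mttr:.1f} hours")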
MTTR
• Steps to complete repairs:
• Notification of the fault (time before seeing an alarm message)
• Processing the alarm
• Finding the root cause of the error
• Looking up repair information
• Getting spare components from storage
• Having a technician come to the datacenter with the spare component
• Physically repairing the fault
• Restarting and testing the component
AVAILABILITY CALCULATION EXAMPLES
• Availability = MTBF / (MTBF + MTTR)

Component                            MTBF (h)    MTTR (h)   Availability   In %
Power supply                         100,000     8          0.9999200      99.99200
Fan                                  100,000     8          0.9999200      99.99200
System board                         300,000     8          0.9999733      99.99733
Memory                               1,000,000   8          0.9999920      99.99920
CPU                                  500,000     8          0.9999840      99.99840
Network Interface Controller (NIC)   250,000     8          0.9999680      99.99680
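A minimal sketch that reproduces the table rows from the formula above:

def availability(mtbf_hours, mttr_hours):
    # Availability = MTBF / (MTBF + MTTR)
    return mtbf_hours / (mtbf_hours + mttr_hours)

components = {
    "Power supply": 100_000,
    "Fan": 100_000,
    "System board": 300_000,
    "Memory": 1_000_000,
    "CPU": 500_000,
    "NIC": 250_000,
}
MTTR = 8  # hours, as in the table
for name, mtbf in components.items():
    a = availability(mtbf, MTTR)
    print(f"{name:15s} {a:.7f}  {a * 100:.5f}%")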
SOURCES OF UNAVAILABILITY - HUMAN ERRORS
• 80% of outages impacting mission-critical services are caused by people and process issues
• Examples:
• Performing a test in the production environment
• Switching off the wrong component for repair
• Swapping a good working disk in a RAID set instead of the defective one
• Restoring the wrong backup tape to production
• Accidentally removing files
• Mail folders, configuration files
• Accidentally removing database entries
• Drop table x instead of drop table y
• Example: Knight Capital Group
SOURCES OF UNAVAILABILITY - SOFTWARE BUGS
• Because of the complexity of most software, it is nearly impossible (and very costly) to
create bug-free software
• Application software bugs can stop an entire system
• Operating systems are software too
• Operating systems containing bugs can lead to corrupted file systems, network failures, or
other sources of unavailability
• Example: the New York blackout
SOURCES OF UNAVAILABILITY - PLANNED MAINTENANCE
• Sometimes needed to perform systems management tasks:
• Upgrading hardware or software
• Implementing software changes
• Migrating data
• Creation of backups
• Should only be performed on parts of the infrastructure where other parts keep serving clients
• During planned maintenance, the system is more vulnerable to downtime than under normal
circumstances
• A temporary SPOF could be introduced
• Systems managers could make mistakes
SOURCES OF UNAVAILABILITY - PHYSICAL DEFECTS
• Everything breaks down eventually
• Mechanical parts are most likely to break first
• Examples:
• Fans for cooling equipment usually break because of dust in the bearings
• Disk drives contain moving parts
• Tapes are very vulnerable to defects as the tape is spun on and off the reels all the time
• Tape drives contain very sensitive pieces of mechanics that can break easily
SOURCES OF UNAVAILABILITY - BATHTUB CURVE
• A component failure is most likely when the component is new
• Sometimes a component doesn't even work at all when unpacked for the first time. This is called a DOA (Dead On Arrival) component
• When a component still works after the first month, it is likely that it will continue working without failure until the end of its life
SOURCES OF UNAVAILABILITY - ENVIRONMENTAL ISSUES
• Environmental issues can cause downtime:
• Failing facilities
• Power
• Cooling
• Disasters
• Fire
• Earthquakes
• Flooding
BUSINESS CONTINUITY (BC)
• Business continuity is an organization's ability to ensure operations and core business functions are not severely impacted by a disaster or unplanned incident that takes critical systems offline
• It is the advance planning and preparation undertaken to ensure that an organization will have the capability to operate its critical business functions during emergency events
• It is important to remember that you should plan and prepare not only for events that will stop functions completely, but also for those that have the potential to adversely impact services or functions
BC TERMINOLOGY: REDUNDANCY, DISASTER RECOVERY
• Redundancy is the duplication of critical components in a single system, to avoid a
single point of failure (SPOF)
• Examples:
• A single component having two power supplies; if one fails, the other takes over
• Dual networking interfaces
• Redundant cabling
• Disaster recovery: the coordinated process of restoring systems, data, and infrastructure required to support ongoing business operations in the event of a disaster
BC TERMINOLOGY: FAILOVER
• Failover is the automatic switch-over to a standby or redundant system
or component
• Examples:
• Cluster: A group of servers and other necessary resources, coupled to operate as a single system. Clusters ensure high availability and load balancing. Typically, in failover clusters, one server runs an application and updates the data, and the other is kept as a standby to take over completely when required. An example of cluster-management software is Windows Server Failover Clustering, which uses “heartbeat” signals to detect node failures, as sketched below
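To make the heartbeat mechanism concrete, here is a minimal, simplified sketch; it is not Windows Server Failover Clustering itself, and the timings and names are illustrative assumptions. The idea: a standby node declares the active node failed once heartbeats stop arriving for several intervals.

import time

HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeats (illustrative)
MISSED_LIMIT = 3           # missed heartbeats before declaring failure

last_heartbeat = time.monotonic()

def on_heartbeat():
    # Called whenever a heartbeat message arrives from the active node.
    global last_heartbeat
    last_heartbeat = time.monotonic()

def active_node_failed():
    # The standby assumes the active node is down after several
    # missed heartbeats; failover can then start.
    silence = time.monotonic() - last_heartbeat
    return silence > MISSED_LIMIT * HEARTBEAT_INTERVAL

def monitor_loop():
    while True:
        if active_node_failed():
            print("Active node unresponsive: failing over to standby")
            break
        time.sleep(HEARTBEAT_INTERVAL)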
BC TERMINOLOGY: FALLBACK
• Fallback is the manual switchover to an identical standby computer system in a
different location
• Typically used for disaster recovery
• Three basic forms of fallback solutions:
• Hot site
• Cold site
• Warm site
FALLBACK – HOT SITE
• A hot site is
• A fully configured fallback datacenter
• Fully equipped with power and cooling
• Applications are installed on the servers
• Data is kept up-to-date to fully mirror the production system
• Requires constant maintenance of the hardware, software, data, and applications to
be sure the site accurately mirrors the state of the production site
FALLBACK - COLD SITE
• A site to which an enterprise's operations can be moved in the event of a disaster. It has minimal IT infrastructure and environmental facilities in place, but these are not activated
• Applications will need to be installed and current data fully restored from backups
• If an organization has very little budget for a fallback site, a cold site may be better than
nothing
FALLBACK - WARM SITE
• A computer facility readily available with power, cooling, and computers, but the
applications may not be installed or configured
• A mix between a hot site and a cold site
• Applications and data must be restored from backup media and tested
RECOVERY POINT OBJECTIVE (RPO):
• This is the point in time to which systems and data must be recovered after an outage. It defines
the amount of data loss that a business can endure.
• A large RPO signifies high tolerance to information loss in a business. Based on the RPO,
organizations plan for the minimum frequency with which a backup or replica must be made
RECOVERY POINT OBJECTIVE (RPO):
• An organization may plan for an appropriate BC technology solution based on the
RPO it sets.
• For example, if RPO is 24 hours, that means that backups are created on an offsite
tape drive every midnight. The corresponding recovery strategy is to restore data
from the set of last backup tapes. Similarly, for zero RPO, data is mirrored
synchronously to a remote site.
RECOVERY TIME OBJECTIVE (RTO):
• The time within which systems, applications, or functions must be recovered after
an outage. It defines the amount of downtime that a business can endure and
survive.
• Businesses can optimize disaster recovery plans after defining the RTO for a given
data center or network. For example, if the RTO is two hours, then use a disk
backup because it enables a faster restore than a tape backup
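The RPO and RTO examples on these slides amount to a simple decision rule: tighter objectives demand faster (and costlier) technology. A minimal sketch of that mapping; the thresholds and strategy names are illustrative assumptions, not prescriptions:

def choose_bc_strategy(rpo_hours, rto_hours):
    # Thresholds and strategy names are illustrative only.
    if rpo_hours == 0:
        data_protection = "synchronous mirroring to a remote site"
    elif rpo_hours <= 24:
        data_protection = "daily backup to offsite storage"
    else:
        data_protection = "less frequent backup cycles"

    if rto_hours <= 2:
        recovery = "disk backup (faster restore than tape)"
    else:
        recovery = "tape backup may be acceptable"

    return data_protection, recovery

print(choose_bc_strategy(rpo_hours=24, rto_hours=2))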
BUSINESS CONTINUITY PLAN
• Business continuity planning (BCP) is the process involved in creating a system of
prevention and recovery from potential threats to a company. The plan ensures that
personnel and assets are protected and are able to function quickly in the event of a
disaster
• It includes:
• Establish objectives
• Analysis
• Design and development
• Implement
• Train, test and maintain
[Figure: the five BCP stages arranged as a continuous cycle around “BCP”]
ESTABLISH OBJECTIVES
• Determine BC requirements
• Select BC team
• Estimate BC scope and budget
ANALYSIS
• Collect information on data profiles, business processes, infrastructure support,
dependencies, and frequency of using business infrastructure
• Perform risk assessment:
• Evaluation of the company’s risks and exposures
• Assessment of the potential impact of various business disruption scenarios
• Determination of the most likely threat scenarios
• Conduct business impact analysis: A systematic process to determine and evaluate the potential effects of an interruption to critical business operations as a result of a disaster, accident, or emergency. The analysis also determines RTO and RPO
DESIGN AND DEVELOP
• Assign individual roles and responsibilities. For example, different teams are formed
for activities, such as emergency response, damage assessment, and infrastructure and
application recovery.
• Design data protection strategies.
• Develop contingency solutions.
• Detail recovery and restart procedures
IMPLEMENT
• Implement risk management and mitigation procedures that include backup, replication, and
management of resources.
• Prepare the disaster recovery sites that can be utilized if a disaster affects the primary data center.
• Implement redundancy for critical resources in a data center to avoid single points of failure.
TRAIN, TEST AND MAINTAIN
• Train the employees who are responsible for backup and replication of business-critical data
• Train employees on emergency response procedures when disasters are declared.
• Train the recovery team on recovery procedures based on contingency scenarios
• Test the BC plan regularly to evaluate its performance and identify its limitations
• Update the BC plans and recovery/restart procedures to reflect regular changes within the data
center