AVAILABILITY CONCEPT
Chapter 4
INTRODUCTION
• Everyone expects their infrastructure to be available all the time
• A 100% guaranteed availability of an infrastructure is impossible
UNDERSTANDING AVAILABILITY
There are different definitions:
• Enable access to authorized information or resources to those who need them
• Authorized parties should not be prevented from accessing objects to which they have legitimate access
• A requirement intended to assure that systems work promptly and that service is not denied to authorized users
UNDERSTANDING AVAILABILITY
• Information availability can be defined with the help of:
● Accessibility
• Information should be accessible to the right user when required
● Reliability
• Information should be reliable and correct in all aspects. It is “the same” as what was stored, and there is no alteration or corruption to the information
● Timeliness
• Defines the time window during which information must be accessible. For example, if online access to an application is required between 8:00 a.m. and 10:00 p.m. each day, any disruptions to data availability outside of this time slot are not considered to affect timeliness
IMPACT OF DOWNTIME
Lost Productivity
• Number of employees impacted x hours out x hourly rate
Lost Revenue
• Direct loss
• Compensatory payments
• Lost future revenue
• Billing losses
• Investment losses
Damaged Reputation
• Customers
• Suppliers
• Financial markets
• Banks
• Business partners
Other Expenses
• Temporary employees, equipment rental, overtime costs, extra shipping costs, travel expenses, and so on
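The lost-productivity formula above maps directly to a one-line calculation. A minimal sketch in Python, with all input figures hypothetical:

# Lost productivity = employees impacted x hours out x hourly rate
# All input values below are hypothetical, for illustration only.
employees_impacted = 200
hours_out = 4
hourly_rate = 45.0  # average cost per employee-hour

lost_productivity = employees_impacted * hours_out * hourly_rate
print(f"Lost productivity: ${lost_productivity:,.2f}")  # $36,000.00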
CALCULATING AVAILABILITY
• In this age of global, always-on, always-connected systems, disturbances in availability are noticed immediately
• A 100% guaranteed availability of an infrastructure, however, is impossible
• Over the years, much knowledge and experience has been gained on how to design highly available systems:
• Failover
• Redundancy
• Avoiding Single Points of Failure (SPOFs)
AVAILABILITY PERCENTAGE
• The availability of a system is usually expressed as a percentage of uptime in a given time period
• Usually, one year or one month
• Typical requirements used in service level agreements today are 99.8% or 99.9% availability per month for a full IT system
• Example for downtime expressed as a percentage per year:

Availability %          Downtime per year   Downtime per month   Downtime per week
99.8%                   17.5 hours          86.2 minutes         20.2 minutes
99.9% (three nines)     8.8 hours           43.2 minutes         10.1 minutes
99.99% (four nines)     52.6 minutes        4.3 minutes          1.0 minutes
99.999% (five nines)    5.3 minutes         25.9 seconds         6.1 seconds
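The downtime figures follow directly from the percentage. A minimal sketch (assuming a 30-day month, which matches how the table's per-month figures appear to be derived):

availability = 99.9  # percent

HOURS_PER_YEAR = 365 * 24   # 8,760 hours
HOURS_PER_MONTH = 30 * 24   # 720 hours (30-day month)
HOURS_PER_WEEK = 7 * 24     # 168 hours

down = 1 - availability / 100
print(f"per year:  {down * HOURS_PER_YEAR:.1f} hours")      # 8.8 hours
print(f"per month: {down * HOURS_PER_MONTH * 60:.1f} min")  # 43.2 minutes
print(f"per week:  {down * HOURS_PER_WEEK * 60:.1f} min")   # 10.1 minutes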
MTBF AND MTTR
• Factors involved in calculating availability:
• Mean Time Between Failures (MTBF)
• Mean Time To Repair (MTTR)
MTBF
• MTBF is the mean time between two consecutive failures of a component
• In other words, MTBF is a maintenance metric, represented in hours, showing how long a piece of equipment operates without interruption
• Example:
Component MTBF (hours)
Hard disk 750,000
Power supply 100,000
Fan 100,000
Ethernet Network Switch 350,000
RAM 1,000,000
It is important to understand how these numbers are calculated. No manufacturer can test if a hard disk will continue to work without failing for 750,000 hours (≈85 years). Instead, manufacturers run tests on large batches of components.
CALCULATE MTBF
• MTBF is calculated by taking the total time an asset is running (uptime) and dividing it by the number of
breakdowns that happened over that same period of time.
MTBF = Total uptime / # of Breakdowns
• The MTBF calculation might look like this:
• Find the total uptime: Imagine you have a warehouse full of hard disks, and 40 of them
were tested for 400 hours each. The total hours spent testing equal 16,000 hours (40 x 400 =
16,000).
• Figure out the number of failures: Identify the number of failures over the entire number of hard disks tested. For this example, consider there were 5 hard disk failures.
• Calculate MTBF: Now that we know testing was performed for 16,000 hours with 5 hard disk failures, we can calculate MTBF: 16,000 hours / 5 failures = 3,200 hours.
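The same worked example translates directly into code. A minimal sketch:

# MTBF = total uptime / number of breakdowns
disks_tested = 40
hours_per_disk = 400
failures = 5

total_uptime = disks_tested * hours_per_disk   # 16,000 hours
mtbf = total_uptime / failures                 # 3,200 hours
print(f"MTBF: {mtbf:,.0f} hours")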
CALCULATE MTBF
So, what does this tell us?
In this example, the MTBF isn't suggesting that each hard disk should last 3,200 hours. It's saying that if you run a group of hard disks, the average time between failures within the tested group is 3,200 hours.
In other words, MTBF isn't meant to predict the behavior of a single component; it predicts the behavior of a group of components.
MTTR
• MTTR can be kept low by:
• Having a service contract with the supplier
• Having spare parts on-site
• Automated redundancy and failover
MTTR
[Figure: incident timeline. “Time to repair” (downtime) runs from the incident through detection, diagnosis, repair, recovery, and restoration; detection and diagnosis make up the response time, repair through restoration the recovery time. “Time between failures” (uptime) then runs until the next incident.]
MTTR: Average time required to repair a failed component
MTTR = Total downtime/Number of failures
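The MTTR formula mirrors the MTBF one. A minimal sketch, with hypothetical inputs:

# MTTR = total downtime / number of failures
# Input values below are hypothetical, for illustration only.
total_downtime_hours = 40
failures = 5

mttr = total_downtime_hours / failures  # 8 hours per repair
print(f"MTTR: {mttr:.1f} hours")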
MTTR
• Steps to complete repairs:
• Notification of the fault (time before seeing an alarm message)
• Processing the alarm
• Finding the root cause of the error
• Looking up repair information
• Getting spare components from storage
• Having a technician come to the datacenter with the spare component
• Physically repairing the fault
• Restarting and testing the component
AVAILABILITY CALCULATION EXAMPLES
• Availability = MTBF / (MTBF + MTTR)

Component                            MTBF (h)    MTTR (h)   Availability   In %
Power supply                         100,000     8          0.9999200      99.99200
Fan                                  100,000     8          0.9999200      99.99200
System board                         300,000     8          0.9999733      99.99733
Memory                               1,000,000   8          0.9999920      99.99920
CPU                                  500,000     8          0.9999840      99.99840
Network Interface Controller (NIC)   250,000     8          0.9999680      99.99680
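A minimal sketch that reproduces the table rows from the formula above:

def availability(mtbf_hours, mttr_hours):
    # Availability = MTBF / (MTBF + MTTR)
    return mtbf_hours / (mtbf_hours + mttr_hours)

components = {
    "Power supply": 100_000,
    "Fan": 100_000,
    "System board": 300_000,
    "Memory": 1_000_000,
    "CPU": 500_000,
    "NIC": 250_000,
}
MTTR = 8  # hours, as in the table
for name, mtbf in components.items():
    a = availability(mtbf, MTTR)
    print(f"{name:15s} {a:.7f}  {a * 100:.5f}%")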
SOURCES OF UNAVAILABILITY - HUMAN ERRORS
• 80% of outages impacting mission-critical services are caused by people and process issues
• Examples:
• Performing a test in the production environment
• Switching off the wrong component for repair
• Swapping a good working disk in a RAID set instead of the defective one
• Restoring the wrong backup tape to production
• Accidentally removing files
• Mail folders, configuration files
• Accidentally removing database entries
• Drop table x instead of drop table y
• Example: Knight Capital Group
SOURCES OF UNAVAILABILITY - SOFTWARE BUGS
• Because of the complexity of most software, it is nearly impossible (and very costly) to
create bug-free software
• Application software bugs can stop an entire system
• Operating systems are software too
• Operating systems containing bugs can lead to corrupted file systems, network failures, or
other sources of unavailability
• Example: the New York blackout
SOURCES OF UNAVAILABILITY - PLANNED MAINTENANCE
• Sometimes needed to perform systems management tasks:
• Upgrading hardware or software
• Implementing software changes
• Migrating data
• Creation of backups
• Should only be performed on parts of the infrastructure where other parts keep serving clients
• During planned maintenance, the system is more vulnerable to downtime than under normal
circumstances
• A temporary SPOF could be introduced
• Systems managers could make mistakes
SOURCES OF UNAVAILABILITY - PHYSICAL DEFECTS
• Everything breaks down eventually
• Mechanical parts are most likely to break first
• Examples:
• Fans for cooling equipment usually break because of dust in the bearings
• Disk drives contain moving parts
• Tapes are very vulnerable to defects as the tape is spun on and off the reels all the time
• Tape drives contain very sensitive pieces of mechanics that can break easily
SOURCES OF UNAVAILABILITY - BATHTUB CURVE
• A component failure is most likely when the component is new
• Sometimes a component doesn't even work at all when unpacked for the first time. This is called a DOA (Dead On Arrival) component
• When a component still works after the first month, it is likely that it will continue working without failure until the end of its life
SOURCES OF UNAVAILABILITY - ENVIRONMENTAL ISSUES
• Environmental issues can cause downtime:
• Failing facilities
• Power
• Cooling
• Disasters
• Fire
• Earthquakes
• Flooding
BUSINESS CONTINUITY (BC)
• Business continuity is an organization's ability to ensure operations and core business functions are not severely impacted by a disaster or unplanned incident that takes critical systems offline
• It is the advance planning and preparation undertaken to ensure that an organization will have the capability to operate its critical business functions during emergency events
• It is important to remember that you should plan and prepare not only for events that will stop functions completely, but also for those that have the potential to adversely impact services or functions
BC TERMINOLOGY: REDUNDANCY, DISASTER RECOVERY
• Redundancy is the duplication of critical components in a single system, to avoid a
single point of failure (SPOF)
• Examples:
• A single component having two power supplies; if one fails, the other takes over
• Dual networking interfaces
• Redundant cabling
• Disaster recovery: the coordinated process of restoring systems, data, and infrastructure required to support ongoing business operations in the event of a disaster
BC TERMINOLOGY: FAILOVER
• Failover is the automatic switch-over to a standby or redundant system
or component
• Examples:
• Cluster: A group of servers and other necessary resources, coupled to operate as a single system. Clusters ensure high availability and load balancing. Typically, in failover clusters, one server runs an application and updates the data, and the other is kept as a standby to take over completely when required. An example of cluster-management software is Windows Server Failover Clustering, which uses “heartbeat” signals to detect node failures, as sketched below
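To make the heartbeat mechanism concrete, here is a minimal, simplified sketch; it is not Windows Server Failover Clustering itself, and the timings and names are illustrative assumptions. The idea: a standby node declares the active node failed once heartbeats stop arriving for several intervals.

import time

HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeats (illustrative)
MISSED_LIMIT = 3           # missed heartbeats before declaring failure

last_heartbeat = time.monotonic()

def on_heartbeat():
    # Called whenever a heartbeat message arrives from the active node.
    global last_heartbeat
    last_heartbeat = time.monotonic()

def active_node_failed():
    # The standby assumes the active node is down after several
    # missed heartbeats; failover can then start.
    silence = time.monotonic() - last_heartbeat
    return silence > MISSED_LIMIT * HEARTBEAT_INTERVAL

def monitor_loop():
    while True:
        if active_node_failed():
            print("Active node unresponsive: failing over to standby")
            break
        time.sleep(HEARTBEAT_INTERVAL)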
BC TERMINOLOGY: FALLBACK
• Fallback is the manual switchover to an identical standby computer system in a
different location
• Typically used for disaster recovery
• Three basic forms of fallback solutions:
• Hot site
• Cold site
• Warm site
FALLBACK – HOT SITE
• A hot site is
• A fully configured fallback datacenter
• Fully equipped with power and cooling
• Applications are installed on the servers
• Data is kept up-to-date to fully mirror the production system
• Requires constant maintenance of the hardware, software, data, and applications to
be sure the site accurately mirrors the state of the production site
FALLBACK - COLD SITE
• A site to which an enterprise's operations can be moved in the event of a disaster. It has minimal IT infrastructure and environmental facilities in place, but these are not activated
• Applications will need to be installed and current data fully restored from backups
• If an organization has very little budget for a fallback site, a cold site may be better than
nothing
FALLBACK - WARM SITE
• A computer facility readily available with power, cooling, and computers, but the
applications may not be installed or configured
• A mix between a hot site and a cold site
• Applications and data must be restored from backup media and tested
RECOVERY POINT OBJECTIVE (RPO):
• This is the point in time to which systems and data must be recovered after an outage. It defines
the amount of data loss that a business can endure.
• A large RPO signifies high tolerance to information loss in a business. Based on the RPO,
organizations plan for the minimum frequency with which a backup or replica must be made
RECOVERY POINT OBJECTIVE (RPO):
• An organization may plan for an appropriate BC technology solution based on the
RPO it sets.
• For example, if RPO is 24 hours, that means that backups are created on an offsite
tape drive every midnight. The corresponding recovery strategy is to restore data
from the set of last backup tapes. Similarly, for zero RPO, data is mirrored
synchronously to a remote site.
RECOVERY TIME OBJECTIVE (RTO):
• The time within which systems, applications, or functions must be recovered after
an outage. It defines the amount of downtime that a business can endure and
survive.
• Businesses can optimize disaster recovery plans after defining the RTO for a given
data center or network. For example, if the RTO is two hours, then use a disk
backup because it enables a faster restore than a tape backup
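The RPO and RTO examples on these slides amount to a simple decision rule: tighter objectives demand faster (and costlier) technology. A minimal sketch of that mapping; the thresholds and strategy names are illustrative assumptions, not prescriptions:

def choose_bc_strategy(rpo_hours, rto_hours):
    # Thresholds and strategy names are illustrative only.
    if rpo_hours == 0:
        data_protection = "synchronous mirroring to a remote site"
    elif rpo_hours <= 24:
        data_protection = "daily backup to offsite storage"
    else:
        data_protection = "less frequent backup cycles"

    if rto_hours <= 2:
        recovery = "disk backup (faster restore than tape)"
    else:
        recovery = "tape backup may be acceptable"

    return data_protection, recovery

print(choose_bc_strategy(rpo_hours=24, rto_hours=2))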
BUSINESS CONTINUITY PLAN
• Business continuity planning (BCP) is the process involved in creating a system of
prevention and recovery from potential threats to a company. The plan ensures that
personnel and assets are protected and are able to function quickly in the event of a
disaster
• It includes:
• Establish objectives
• Analysis
• Design and development
• Implement
• Train, test and maintain
[Figure: the five BCP stages arranged as a continuous cycle around “BCP”]
ESTABLISH OBJECTIVES
• Determine BC requirements
• Select BC team
• Estimate BC scope and budget
ANALYSIS
• Collect information on data profiles, business processes, infrastructure support,
dependencies, and frequency of using business infrastructure
• Perform risk assessment:
• Evaluation of the company’s risks and exposures
• Assessment of the potential impact of various business disruption scenarios
• Determination of the most likely threat scenarios
• Conduct business impact analysis: A systematic process to determine and evaluate the potential effects of an interruption to critical business operations as a result of a disaster, accident, or emergency. The analysis also determines RTO and RPO
DESIGN AND DEVELOP
• Assign individual roles and responsibilities. For example, different teams are formed
for activities, such as emergency response, damage assessment, and infrastructure and
application recovery.
• Design data protection strategies.
• Develop contingency solutions.
• Detail recovery and restart procedures
IMPLEMENT
• Implement risk management and mitigation procedures that include backup, replication, and
management of resources.
• Prepare the disaster recovery sites that can be utilized if a disaster affects the primary data center.
• Implement redundancy for critical resources in a data center to avoid single points of failure.
TRAIN, TEST AND MAINTAIN
• Train the employees who are responsible for backup and replication of business-critical data
• Train employees on emergency response procedures when disasters are declared.
• Train the recovery team on recovery procedures based on contingency scenarios
• Test the BC plan regularly to evaluate its performance and identify its limitations
• Update the BC plans and recovery/restart procedures to reflect regular changes within the data
center