Business Continuity and Information Availability

The document provides an introduction to business continuity and discusses key aspects such as information availability, causes of information unavailability, consequences of downtime, measuring information availability, business continuity terminology, and the business continuity planning life cycle. It defines business continuity and explains that the goal is to ensure the information availability required to conduct vital business operations.

Uploaded by

ranvee01

Introduction to Business Continuity
In today’s world, continuous access to information is
a must for the smooth functioning of business
operations.
The cost of unavailability of information is greater
than ever, and outages in key industries cost millions
of dollars per hour.
Business continuity (BC) is an integrated and
enterprise-wide process that includes all activities
(internal and external to IT) that a business must
perform to mitigate the impact of planned and
unplanned downtime.
It involves proactive measures, such as business impact
analysis, risk assessments, BC technology solutions
deployment (backup and replication), and reactive
measures, such as disaster recovery and restart, to be
invoked in the event of a failure.
The goal of a BC solution is to ensure the “information
availability” required to conduct vital business operations.
This chapter describes the factors that affect information
availability and the consequences of information
unavailability.
It also explains the key parameters that govern any BC
strategy and the roadmap to develop an effective BC plan.
Information Availability
Information availability (IA) refers to the ability of an
IT infrastructure to function according to business
expectations during its specified time of operation.
IA ensures that people (employees, customers,
suppliers, and partners) can access information
whenever they need it.
IA can be defined in terms of accessibility, reliability,
and timeliness of information.
Accessibility: Information should be accessible at the
right place, to the right user.
Reliability: Information should be reliable and correct in
all aspects. It is “the same” as what was stored, and there is
no alteration or corruption to the information.
Timeliness: Defines the exact moment or the time
window (a particular time of the day, week, month, and
year as specified) during which information must be
accessible. For example, if online access to an application
is required between 8:00 a.m. and 10:00 p.m. each day, any
disruptions to data availability outside of this time slot are
not considered to affect timeliness.
Causes of Information Unavailability
Various planned and unplanned incidents result in
information unavailability.
Planned outages include installation, integration, and
maintenance of new hardware; software upgrades or patches;
backups; application and data restores; facility operations
(renovation and construction); and refresh or migration from
the testing environment to the production environment.
Unplanned outages include failure caused by human
errors, database corruption, and failure of physical and
virtual components.
Another type of incident that may cause data
unavailability is natural or manmade disasters, such
as flood, fire, earthquake, and contamination.
The majority of outages are planned.
Planned outages are expected and scheduled but still
cause data to be unavailable.
Consequences of Downtime
Information unavailability or downtime results in loss of
productivity, loss of revenue, poor financial performance,
and damage to reputation.
• Loss of productivity includes reduced output per unit of
labor, equipment, and capital.
• Loss of revenue includes direct loss, compensatory
payments, future revenue loss, billing loss, and investment
loss.
• Poor financial performance affects revenue recognition,
cash flow, discounts, payment guarantees, credit rating, and
stock price.
• Damage to reputation may result in a loss of confidence
or credibility with customers, suppliers, financial markets,
banks, and business partners.
The business impact of downtime is the sum of all losses
sustained as a result of a given disruption.
An important metric, average cost of downtime per hour,
provides a key estimate in determining the appropriate BC
solutions.
It is calculated as follows:
Average cost of downtime per hour = average productivity
loss per hour + average revenue loss per hour
Where:
Average productivity loss per hour = (total salaries and benefits
of all employees per week)/(average number of working hours per
week)
Average revenue loss per hour = (total revenue of an organization
per week)/(average number of hours per week that an
organization is open for business)
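The formula above can be sketched in a few lines of Python. This is only an illustration: the function name, parameter names, and weekly figures are hypothetical and are not taken from the source.

```python
def average_cost_of_downtime_per_hour(
    weekly_salaries_and_benefits: float,
    working_hours_per_week: float,
    weekly_revenue: float,
    business_hours_per_week: float,
) -> float:
    """Average productivity loss per hour plus average revenue loss per hour."""
    productivity_loss_per_hour = weekly_salaries_and_benefits / working_hours_per_week
    revenue_loss_per_hour = weekly_revenue / business_hours_per_week
    return productivity_loss_per_hour + revenue_loss_per_hour


# Hypothetical organization: 200,000 in weekly salaries and benefits over a
# 40-hour working week, 1,000,000 in weekly revenue over 50 open hours.
cost = average_cost_of_downtime_per_hour(200_000, 40, 1_000_000, 50)
print(cost)  # 5000.0 + 20000.0 = 25000.0
```

With these assumed inputs, each hour of downtime costs 25,000 in the organization's currency, which is the kind of estimate used to judge whether a BC solution is worth its price.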
Measuring Information Availability
Proactive risk analysis, performed as part of the BC
planning process, considers the component failure
rate and average repair time, which are measured by
mean time between failure (MTBF) and mean time to
repair (MTTR):
Mean Time Between Failure (MTBF): It is the
average time available for a system or component to
perform its normal operations between failures. It is the
measure of system or component reliability and is
usually expressed in hours.
Mean Time To Repair (MTTR):
• It is the average time required to repair a failed component.
• While calculating MTTR, it is assumed that the fault
responsible for the failure is correctly identified and the
required spares and personnel are available.
• A fault is a physical defect at the component level, which may
result in information unavailability.
• MTTR includes the total time required to do the following
activities: detect the fault, mobilize the maintenance team,
diagnose the fault, obtain the spare parts, repair, test, and
restore the data.
IA is the time period during which a system is in a condition
to perform its intended function upon demand.
It can be expressed in terms of system uptime and downtime
and measured as the amount or percentage of system
uptime:
IA = system uptime/(system uptime + system downtime)
Where system uptime is the period of time during which the
system is in an accessible state; when it is not accessible, it is
termed as system downtime.
In terms of MTBF and MTTR, IA could also be expressed as
IA = MTBF/(MTBF + MTTR)
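As a minimal sketch, the two expressions for IA can be written as Python functions. The function names and the sample figures (990 hours of uptime, 10 hours of downtime, 5 failures) are assumptions chosen for illustration only.

```python
def availability_from_uptime(uptime_hours: float, downtime_hours: float) -> float:
    """IA = system uptime / (system uptime + system downtime)."""
    return uptime_hours / (uptime_hours + downtime_hours)


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """IA = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system: 990 hours up, 10 hours down, spread over 5 failures.
uptime, downtime, failures = 990.0, 10.0, 5
mtbf = uptime / failures    # 198.0 hours of normal operation between failures
mttr = downtime / failures  # 2.0 hours to repair each failure

print(availability_from_uptime(uptime, downtime))  # 0.99
print(availability_from_mtbf(mtbf, mttr))          # 0.99
```

Both forms give the same result here because MTBF and MTTR are simply the total uptime and total downtime divided by the same number of failures.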
BC Terminology
Disaster recovery:
• This is the coordinated process of restoring systems, data,
and the infrastructure required to support ongoing business
operations after a disaster occurs.
• It is the process of restoring a previous copy of the data and
applying logs or other necessary processes to that copy to
bring it to a known point of consistency.
• After all recovery efforts are completed, the data is validated
to ensure that it is correct.
Disaster restart:
• This is the process of restarting business operations with
mirrored consistent copies of data and applications.
Recovery-Point Objective (RPO):
• This is the point in time to which systems and data must be
recovered after an outage.
• It defines the amount of data loss that a business can endure.
RPO of 24 hours:
• Backups are created at an offsite tape library every midnight.
• The corresponding recovery strategy is to restore data from the set
of last backup tapes.
RPO of 1 hour:
• Shipping database logs to the remote site every hour.
• The corresponding recovery strategy is to recover the database to
the point of the last log shipment.
RPO in the order of minutes:
• Mirroring data asynchronously to a remote site.
Near zero RPO:
• Mirroring data synchronously to a remote site.
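The RPO tiers above can be read as a lookup from the tolerable data loss to a protection strategy. The sketch below is illustrative only: the function name is hypothetical, and the exact cut-offs between tiers (24 hours, 1 hour, 1 minute) are assumptions, since the source gives example RPO values rather than boundaries.

```python
def strategy_for_rpo(rpo_minutes: float) -> str:
    """Pick the least demanding strategy whose data loss stays within the RPO.

    The tier boundaries are assumed for illustration, not defined by the source.
    """
    if rpo_minutes >= 24 * 60:
        return "nightly backup to an offsite tape library"
    if rpo_minutes >= 60:
        return "hourly shipping of database logs to the remote site"
    if rpo_minutes >= 1:
        return "asynchronous mirroring to a remote site"
    return "synchronous mirroring to a remote site"


for rpo in (24 * 60, 60, 5, 0):
    print(f"RPO {rpo} min -> {strategy_for_rpo(rpo)}")
```

A tighter RPO always selects a strategy that copies data more frequently, which is why the near-zero tier falls back to synchronous mirroring.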
Recovery-Time Objective (RTO):
• The time within which systems and applications must be
recovered after an outage.
• It defines the amount of downtime that a business can
endure and survive.
Data vault:
• A repository at a remote site where data can be periodically
or continuously copied (either to tape drives or disks) so that
there is always a copy at another site.
Hot site:
• A site where an enterprise’s operations can be moved in the
event of disaster. It is a site with the required hardware,
operating system, application, and network support to
perform business operations, where the equipment is
available and running at all times.
Cold site:
• A site where an enterprise’s operations can be moved in
the event of disaster, with minimum IT infrastructure
and environmental facilities in place, but not activated.
Server Clustering:
A group of servers and other necessary resources
coupled to operate as a single system.
Clusters can ensure high availability and load
balancing.
BC Planning Life Cycle
The BC planning life cycle includes five stages:
1. Establishing objectives
2. Analyzing
3. Designing and developing
4. Implementing
5. Training, testing, assessing, and maintaining
1. Establish objectives:
Determine BC requirements.
Estimate the scope and budget to achieve
requirements.
Select a BC team that includes subject matter experts
from all areas of the business, whether internal or
external.
Create BC policies.
2. Analysis:
Collect information on data profiles, business
processes, infrastructure support, dependencies, and
frequency of using business infrastructure.
Conduct a Business Impact Analysis (BIA).
Identify critical business processes and assign recovery
priorities.
Perform risk analysis for critical functions and create
mitigation strategies.
3. Design and develop:
Define the team structure and assign individual roles
and responsibilities.
For example, different teams are formed for activities,
such as emergency response, damage assessment, and
infrastructure and application recovery.
Design data protection strategies and develop
infrastructure.
Develop contingency solutions.
Develop emergency response procedures.
Detail recovery and restart procedures.
4. Implement:
Implement risk management and mitigation
procedures that include backup, replication, and
management of resources.
Prepare the disaster recovery sites that can be utilized if
a disaster affects the primary data center.
Implement redundancy for every resource in a data
center to avoid single points of failure.
5. Train, test, assess, and maintain:
• Train the employees who are responsible for backup and
replication of business-critical data on a regular basis or whenever
there is a modification in the BC plan.
• Train employees on emergency response procedures when
disasters are declared.
• Train the recovery team on recovery procedures based on
contingency scenarios.
• Perform damage-assessment processes and review recovery plans.
• Test the BC plan regularly to evaluate its performance and identify
its limitations.
• Assess the performance reports and identify limitations.
• Update the BC plans and recovery/restart procedures to reflect
regular changes within the data center.
Failure Analysis
Failure analysis involves analyzing both the physical and
virtual infrastructure components to identify systems that
are susceptible to a single point of failure and
implementing fault-tolerance mechanisms.
Single Point of Failure
• A single point of failure refers to the failure of a component
that can terminate the availability of the entire system or IT
service. The figure depicts a system setup in which an
application, running on a VM, provides an interface to the
client and performs I/O operations.
• The client is connected to the server through an IP network,
and the server is connected to the storage array through an
FC connection.
In a setup in which each component must function as
required to ensure data availability, the failure of a single
physical or virtual component causes the unavailability of
an application.
This failure results in disruption of business operations.
For example, failure of a hypervisor can affect all the
running VMs and the virtual network, which are hosted on
it.
In the setup shown in the figure, several single points of
failure can be identified.
A VM, a hypervisor, an HBA/NIC on the server, the
physical server, the IP network, the FC switch, the storage
array ports, or even the storage array could be a potential
single point of failure.
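One simple way to look for single points of failure is to list every component in the I/O path together with its number of instances and flag anything that exists only once. The sketch below assumes a hypothetical inventory based on the components named above; the instance counts are made up for illustration.

```python
from typing import Dict, List


def single_points_of_failure(instances: Dict[str, int]) -> List[str]:
    """Return the components that have fewer than two instances."""
    return [name for name, count in instances.items() if count < 2]


# Hypothetical inventory: only the HBAs and storage array ports are redundant.
setup = {
    "VM": 1,
    "hypervisor": 1,
    "physical server": 1,
    "HBA": 2,
    "NIC": 1,
    "IP network": 1,
    "FC switch": 1,
    "storage array port": 2,
    "storage array": 1,
}

print(single_points_of_failure(setup))
```

A count of two or more is treated here as "redundant"; a real failure analysis would also check that the redundant instances do not share a common dependency.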
Resolving Single Points of Failure
Configuration of redundant HBAs at a server to mitigate single
HBA failure
Configuration of NIC teaming at a server allows protection
against single physical NIC failure. It allows grouping of two or
more physical NICs and treating them as a single logical device.
Configuration of redundant switches to account for a switch
failure
Configuration of multiple storage array ports to mitigate a port
failure
RAID and hot spare configuration to ensure continuous
operation in the event of disk failure
Implementation of a redundant storage array at a remote site to
mitigate local site failure
Implementing server (or compute) clustering, a fault-
tolerance mechanism whereby two or more servers in a
cluster access the same set of data volumes. Clustered
servers exchange a heartbeat to inform each other about
their health. If one of the servers or hypervisors fails, the
other server or hypervisor can take up the workload.
Implementing a VM Fault Tolerance mechanism ensures
BC in the event of a server failure. This technique creates
duplicate copies of each VM on another server so that
when a VM failure is detected, the duplicate VM can be
used for failover. The two VMs are kept in synchronization
with each other in order to perform successful failover.
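The heartbeat-based failover described for server clustering can be sketched as follows. This is a simplified illustration, not a real cluster manager: the function names, the node and VM names, and the 5-second timeout are all assumptions.

```python
from typing import Dict, List


def overdue_nodes(last_heartbeat: Dict[str, float], now: float, timeout: float) -> List[str]:
    """Return nodes whose most recent heartbeat is older than `timeout` seconds."""
    return [node for node, seen in last_heartbeat.items() if now - seen > timeout]


def fail_over(placement: Dict[str, str], failed: List[str], survivors: List[str]) -> Dict[str, str]:
    """Move every workload on a failed node to a surviving node, round-robin."""
    if not survivors:
        raise RuntimeError("no surviving node to take over the workload")
    new_placement = dict(placement)
    moved = 0
    for workload, node in placement.items():
        if node in failed:
            new_placement[workload] = survivors[moved % len(survivors)]
            moved += 1
    return new_placement


# Hypothetical two-node cluster: server-b last sent a heartbeat 10 seconds ago.
last_heartbeat = {"server-a": 100.0, "server-b": 91.0}
placement = {"vm-1": "server-a", "vm-2": "server-b"}

failed = overdue_nodes(last_heartbeat, now=101.0, timeout=5.0)
survivors = [node for node in last_heartbeat if node not in failed]
print(fail_over(placement, failed, survivors))
# {'vm-1': 'server-a', 'vm-2': 'server-a'}
```

Because both servers in the cluster access the same set of data volumes, the surviving server can pick up the failed server's workload without copying data first.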
Multipathing Software
Configuration of multiple paths increases the data
availability through path failover.
If servers are configured with one I/O path to the
data, there will be no access to the data if that path
fails.
Redundant paths to the data eliminate the possibility
of the path becoming a single point of failure.
Multiple paths to data also improve I/O performance
through load balancing among the paths and
maximize server, storage, and data path utilization.
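The combination of load balancing and path failover can be illustrated with a small round-robin selector. This is a toy model, not the behavior of any specific multipathing product: the class name, method names, and path labels are hypothetical.

```python
from typing import List, Set


class MultipathDevice:
    """Toy model of a device reachable over several I/O paths."""

    def __init__(self, paths: List[str]) -> None:
        self.paths = list(paths)
        self.failed: Set[str] = set()
        self._next = 0  # index of the next path to try (round-robin)

    def mark_failed(self, path: str) -> None:
        self.failed.add(path)

    def mark_restored(self, path: str) -> None:
        self.failed.discard(path)

    def select_path(self) -> str:
        """Return the next healthy path, skipping failed ones."""
        for _ in range(len(self.paths)):
            path = self.paths[self._next]
            self._next = (self._next + 1) % len(self.paths)
            if path not in self.failed:
                return path
        raise IOError("all paths to the data have failed")


device = MultipathDevice(["hba0:port0", "hba1:port1"])
print([device.select_path() for _ in range(4)])  # I/O alternates between both paths

device.mark_failed("hba0:port0")
print([device.select_path() for _ in range(2)])  # I/O continues on the remaining path
```

With a single configured path, the first failure would raise the error immediately, which is the situation the redundant paths are meant to avoid.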
Business Impact Analysis
A business impact analysis (BIA) identifies which business
units, operations, and processes are essential to the
survival of the business.
It evaluates the financial, operational, and service impacts
of a disruption to essential business processes.
The BIA process leads to a report detailing the incidents
and their impact on business functions.
The impact may be specified in terms of money or in
terms of time.
Based on the potential impacts associated with downtime,
businesses can prioritize and implement countermeasures
to mitigate the likelihood of such disruptions.
These are detailed in the BC plan.
A BIA includes the following set of tasks:
• Determine the business areas.
• For each business area, identify the key business processes critical
to its operation.
• Determine the attributes of the business process in terms of
applications, databases, and hardware and software requirements.
• Estimate the costs of failure for each business process.
• Calculate the maximum tolerable outage and define RTO and
RPO for each business process.
• Establish the minimum resources required for the operation of
business processes.
• Determine recovery strategies and the cost for implementing
them.
• Optimize the backup and business recovery strategy based on
business priorities.
• Analyze the current state of BC readiness and optimize future BC
planning.
BC Technology Solutions
After analyzing the business impact of an outage,
designing the appropriate solutions to recover from a
failure is the next important activity.
One or more copies of the data are maintained using
any of the following strategies so that data can be
recovered or business operations can be restarted
using an alternative copy:
Backup: Data backup is a predominant method of
ensuring data availability. The frequency of backup is
determined based on RPO, RTO, and the frequency of
data changes.
Local replication: Data can be replicated to a separate
location within the same storage array. The replica is used
independently for other business operations. Replicas can
also be used for restoring operations if data corruption
occurs.
Remote replication: Data in a storage array can be
replicated to another storage array located at a remote
site. If the storage array is lost due to a disaster, business
operations can be started from the remote storage array.
Question 1
A system has three components and requires all three
components to be operational 24 hours, Monday
through Friday. Failure of component 1 occurs as
follows:
Monday = No failure.
Tuesday = 5 a.m. to 7 a.m.
Wednesday = No failure.
Thursday = 4 p.m. to 8 p.m.
Friday = 8 a.m. to 11 a.m.
Calculate the MTBF and MTTR of component 1.
Solution
The formula for MTBF is
(total uptime/number of failures).
Total downtime = 2 hours on Tuesday + 4 hours on
Thursday + 3 hours on Friday = 9 hours
Total uptime = (24 hours * 5 days) – 9 hours = 111 hours
Therefore,
MTBF = 111 hours/3 = 37 hours
The formula for MTTR is
(total downtime/number of failures).
So,
MTTR = (9 hours/3) = 3 hours
Question 2
A system has three components and requires all three
components to be operational during 8 a.m. through
5 p.m. business hours, Monday through Friday.
Failure of component 2 occurs as follows:
Monday = 8 a.m. to 11 a.m.
Tuesday = No failure.
Wednesday = 4 p.m. to 7 p.m.
Thursday = 5 p.m. to 8 p.m.
Friday = 1 p.m. to 2 p.m.
Calculate the availability of component 2.
Solution
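Availability (%) = system uptime/(system uptime +
system downtime)
Only downtime that falls within business hours (8 a.m. to
5 p.m.) counts. Wednesday's failure contributes 1 hour
(4 p.m. to 5 p.m.), and Thursday's failure lies entirely
outside business hours.
System downtime = 3 hours on Monday + 1 hour on
Wednesday + 1 hour on Friday = 5 hours
Total operational time = 9 hours * 5 days = 45 hours
System uptime = total operational time – system
downtime = 45 hours – 5 hours = 40 hours
Availability (%) = 40/45 = 88.9%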
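The same calculation can be checked with a short Python sketch that clips each failure to the business-hours window before summing the downtime. Hours are written on a 24-hour clock; the variable and function names are illustrative.

```python
WINDOW_START, WINDOW_END = 8, 17  # 8 a.m. to 5 p.m.
BUSINESS_DAYS = 5                 # Monday through Friday

# Failures of component 2 as (start hour, end hour).
failures = {
    "Monday": (8, 11),
    "Wednesday": (16, 19),
    "Thursday": (17, 20),
    "Friday": (13, 14),
}


def downtime_in_window(start: int, end: int) -> int:
    """Hours of a failure that overlap the business-hours window."""
    return max(0, min(end, WINDOW_END) - max(start, WINDOW_START))


downtime = sum(downtime_in_window(start, end) for start, end in failures.values())
operational_time = BUSINESS_DAYS * (WINDOW_END - WINDOW_START)
uptime = operational_time - downtime
availability = uptime / (uptime + downtime)

print(downtime, uptime)        # 5 40
print(f"{availability:.1%}")   # 88.9%
```

Thursday's failure starts exactly when the window closes, so it contributes zero hours, which matches the timeliness definition given earlier in the document.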
