MICROSOFT SQL SERVER
HIGH AVAILABILITY
AND DISASTER RECOVERY
Michael Poremba // October 2008
Database HA & DR Experience…
Work with business to determine HA or DR
requirements for applications and data?
Design HA or DR solutions?
Administer HA or DR process?
Still learning MS SQL Server HA & DR capabilities?
Scope of this Presentation
Presentation Focus
Data Availability
Data recovery
High availability
Disaster recovery
Technology Focus
MS SQL Server
Physical servers
SANs

Beyond Scope of Presentation
In-depth how-to (available elsewhere)
Partitioned views (federated)
Advanced DBA techniques
Custom application logic
3rd-party software solutions
Alternate DBMS engines (e.g. Oracle; DB2)
HA on virtual machines
Complex scenarios & solutions
Load balancing
Introduction to Data Availability
So, you need to make your
production database bulletproof…
Data Availability Continuum
Degrees of protection for information systems:
                     Business Risk                      Solution
Data Recovery        Data loss                          Redundant data
High Availability    Downtime of database service       Redundant system components
Disaster Recovery    Downtime of business operations    Redundant systems and facilities
Business Case for Availability
High Availability
Keep business-critical applications available
Secondary: Server maintenance

Disaster Recovery
Protect against loss of data center
Secondary: Application upgrades; infrastructure upgrades
Service Level Agreement (SLA)
Permitted downtime (planned vs. unplanned?)
Uptime SLA    Downtime per Year    Downtime per Month
99.9%         8.76 hours           43.8 minutes
99.99%        52.6 minutes         4.38 minutes
99.999%       5.26 minutes         0.438 minutes
Acceptable data/transaction loss
Application response times
Mean time to recovery
Note: Database uptime is not equivalent to application availability
Failures of other application services
Network outages
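
The downtime figures in the SLA table follow directly from the uptime percentage. A minimal sketch of that arithmetic, assuming a 365-day year and twelve equal months (the function name is illustrative, not from the presentation):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def permitted_downtime(uptime_percent):
    """Return (minutes per year, minutes per month) of downtime an uptime SLA allows."""
    unavailable_fraction = 1 - uptime_percent / 100
    minutes_per_year = unavailable_fraction * HOURS_PER_YEAR * 60
    return minutes_per_year, minutes_per_year / 12


for sla in (99.9, 99.99, 99.999):
    per_year, per_month = permitted_downtime(sla)
    print(f"{sla}%: {per_year:.2f} min/year, {per_month:.3f} min/month")
```

Each extra "nine" cuts the allowance by a factor of ten, which is why a planned maintenance window alone can consume a 99.99% budget.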
Protect What?
Application data stores
Databases
Files
Other data repositories
Database services
DBMS availability for applications
Application services
Application availability for users and external systems
Databases are the heart of most information systems;
they deserve the highest affordable protection.
Database Failure Scenarios
Physical Infrastructure Failures
Storage subsystem
Disk
Controller
Network
Server
Power

Logical Data Failures
Operator errors
DBMS interruption
Drops / deletes
Application defects
DBMS defects
Data corruption
Service Recovery Strategies
Standby Mode    Failover Behavior                               SQL Server Feature
Cold standby    Manual intervention required to restore         Backup and restore
                offline data copy
Warm standby    Data copy online and ready;                     Transaction log shipping;
                manual failover required                        database mirroring
Hot standby     Automatic failover                              Database mirroring;
                                                                failover clustering
Data Recovery—Terminology
Terminology varies for source vs. copy
High Availability Strategy    Data Source         Data Copy
Backup and Restore            Database            Backup
Log Shipping                  Primary             Secondary; Standby
Database Mirroring            Principal           Mirror
Failover Clustering           Primary; Active     Secondary; Passive; Standby; Inactive
Data Recovery
[Briefly…]
Database Backups
Traditional backup types
Full backup
Differential backup
Transaction log backup
Disk is better than tape
First backup to disk (separate physical disk volume)
Detect exceptions encountered during backup
Verify backup files
Copy backup files to tape or remote disk
Data retention policy for backup files
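
The three backup types and the verify step above map onto a small set of T-SQL statements. A hedged sketch that only builds the statement text, assuming trusted database and path names; the function name, file naming scheme, and the `INIT` and `CHECKSUM` choices are illustrative assumptions, not the presenter's procedure:

```python
def backup_statements(database, backup_dir, stamp):
    """Build T-SQL for full, differential, and log backups to disk, each followed by a verify."""
    kinds = {
        "full": ("BACKUP DATABASE", "bak", "WITH CHECKSUM, INIT"),
        "diff": ("BACKUP DATABASE", "bak", "WITH DIFFERENTIAL, CHECKSUM, INIT"),
        "log": ("BACKUP LOG", "trn", "WITH CHECKSUM, INIT"),
    }
    statements = {}
    for kind, (verb, extension, options) in kinds.items():
        # Hypothetical naming scheme: <database>_<kind>_<stamp>.<extension>
        path = f"{backup_dir}\\{database}_{kind}_{stamp}.{extension}"
        statements[kind] = [
            f"{verb} [{database}] TO DISK = N'{path}' {options};",
            # RESTORE VERIFYONLY checks that the backup file is readable and complete.
            f"RESTORE VERIFYONLY FROM DISK = N'{path}' WITH CHECKSUM;",
        ]
    return statements


for statement in backup_statements("Sales", "E:\\Backups", "20081001")["full"]:
    print(statement)
```

Copying the verified files to tape or a remote disk, and pruning them under the retention policy, would be separate steps outside this sketch.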
Database Backup Strategy
Backup of user databases not sufficient for recovery
System databases
Master database
MSDB database
Model database
External data stores…
Synch with External Data Stores
Synchronize recovered database with external data stores:
Identity column seeds
Full-text indexes (SQL Server 2000)
LDAP entries
File system objects
Other databases
Backup Retention Policy
Location of backup files
Duration of retention
Protection of sensitive data
Sarbanes-Oxley (SOX)
HIPAA
Internal policies for data management and protection
Access to backups from offsite data storage
Data Recovery Process
Backup file sets
Full baseline, differential, and transaction logs
Retrieving backup files
Offsite storage
Tape
Network copy
Dependency on multiple people to get access to backup files

Recovery strategy depends on failure scenario
Create comprehensive failure matrix
Devise recovery strategy for each scenario
Does worst-case recovery scenario fit within SLA parameters?
Recovery time; SLA
Include future data growth in recovery plan
Fully test recovery strategies; practice is essential
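
A backup file set is only useful if the right files are restored in the right order: the latest full baseline, the latest differential taken after it, then every later transaction log backup. A simplified sketch of that selection, assuming each backup is described by a hypothetical `(kind, finished_at)` pair; real restores are sequenced by log sequence numbers rather than timestamps:

```python
from datetime import datetime


def restore_chain(backups):
    """Return the backups to restore, in order, to recover as far as the file set allows."""
    fulls = [b for b in backups if b[0] == "full"]
    if not fulls:
        raise ValueError("no full backup available; recovery is impossible")
    chain = [max(fulls, key=lambda b: b[1])]  # latest full baseline

    # Only a differential taken after that baseline can be applied to it.
    diffs = [b for b in backups if b[0] == "diff" and b[1] > chain[0][1]]
    if diffs:
        chain.append(max(diffs, key=lambda b: b[1]))

    # Log backups that finished after the last restored backup, oldest first.
    start = chain[-1][1]
    logs = sorted((b for b in backups if b[0] == "log" and b[1] > start),
                  key=lambda b: b[1])
    return chain + logs


history = [
    ("full", datetime(2008, 10, 5, 1, 0)),
    ("log", datetime(2008, 10, 5, 12, 0)),
    ("diff", datetime(2008, 10, 6, 1, 0)),
    ("log", datetime(2008, 10, 6, 6, 0)),
    ("log", datetime(2008, 10, 6, 12, 0)),
]
for kind, finished_at in restore_chain(history):
    print(kind, finished_at.isoformat())
```

Counting the files in the chain, and timing how long each takes to retrieve and restore, gives a first estimate of whether the worst case fits the SLA.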
High Availability
Minimize or avoid service downtime
Whether planned or unplanned
When components fail, service interruption is brief or non-existent
Automatic failover
Eliminate single points of failure (as affordable)
Redundant components
Fault-tolerant servers
Redundant Components
Objective: Avoid single points of failure (where affordable)
Approach: Use redundant components for database service
Database server nodes
Server components
ECC RAM; failure-tolerant HW & OS
DBMS instance
User databases
Storage devices
Storage unit components
MPIO: Interfaces; paths; switches; controllers
RAID: Disks
Networking
MPIO: Interfaces; paths; switches
Data copies
E.g. Recovering torn page from mirror in SQL Server 2008
Transaction Log Shipping
Warm standby solution
Duplicate user database
Copy transaction logs to standby server & restore
Database available for read-only access
Users must disconnect for logs to be applied
Two database licenses required if querying standby
Manual application failover
Supported on standard hardware
Possible data loss (unapplied transactions)
Database Mirroring
Redundancy at user database level
Duplicate copy of user database
Independent storage devices
Multiple copies of instance databases
Mirrored over private network channel (optional)
Mirror always redoing transactions from principal
Negligible impact on transaction throughput
Multiple mirroring modes:
High-availability: commit @ log on mirror; automatic failover
High-protection: commit @ log on mirror; manual failover
High-performance: commit when logged on principal
Very fast automatic failover (seconds)
Requires witness server
Mirror-aware application client connection
Provided by client library
Database connection string must specify both servers
Mirror may be available for read-only access (snapshots)
Works with standard hardware
[Diagram: node A and node B, each with local storage holding its own system DBs; node A holds the source user DB and node B the mirror user DB; a witness sits alongside both nodes]
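
Naming both servers in the connection string is what lets the client library redirect to the mirror after a failover. A minimal sketch that only assembles an ODBC-style connection string; the driver name, the `Failover_Partner` keyword spelling (ADO.NET spells it `Failover Partner`), and the server names are assumptions to check against the client library in use:

```python
def mirrored_connection_string(principal, mirror, database):
    """Assemble an ODBC connection string that names both mirroring partners."""
    parts = {
        "Driver": "{SQL Server Native Client 10.0}",  # assumed driver name
        "Server": f"tcp:{principal}",                 # tcp: prefix requests TCP/IP
        "Failover_Partner": mirror,                   # tried if the principal is unreachable
        "Database": database,                         # mirroring is per database, so name it
        "Trusted_Connection": "yes",
    }
    return ";".join(f"{key}={value}" for key, value in parts.items()) + ";"


print(mirrored_connection_string("sqlnode-a", "sqlnode-b", "Sales"))
```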
Mirror Witness
With mirroring, more than one server is required to decide on failover
Witness automates failover from principal to mirror
Watches database availability
Reports observations back to principal and mirror
Runs in separate SQL Server instance (Express is OK)
Prevents “split brain” scenario
Very low resource consumption
Can be witness for multiple databases
Not a single point of failure
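
The witness prevents "split brain" because a partner may serve the database only while it can see at least one of the other two instances. A deliberately simplified model of that rule for a synchronous, witnessed session, assuming every instance that is up can reach every other instance that is up (the function and its return strings are illustrative):

```python
def mirroring_outcome(principal_up, mirror_up, witness_up, synchronized=True):
    """Describe what happens to the database for a given combination of failures."""
    if principal_up and (mirror_up or witness_up):
        # The principal still has quorum with at least one other instance.
        return "principal keeps serving"
    if principal_up:
        # Alone, the principal cannot rule out that the mirror has taken over.
        return "principal isolated: database goes offline (no quorum)"
    if mirror_up and witness_up and synchronized:
        # Mirror and witness agree the principal is gone.
        return "automatic failover: mirror becomes principal"
    return "database unavailable: manual intervention required"


for state in [(True, True, False), (False, True, True), (False, True, False)]:
    print(state, "->", mirroring_outcome(*state))
```

Losing only the witness leaves the database online, which matches the point above that the witness is not a single point of failure.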
SQL Server Failover Clustering
Two clustered nodes
Active/Passive config
MS SQL services
Running on virtual server
Shared storage device
User databases
System databases
Quorum drive
Redundant internal components
[Diagram: node A and node B attached to one shared storage unit holding system DBs, user DBs, and quorum]
Active/Passive Failover Clustering
Redundancy at database instance level
All databases fail over together
Shared copy of system databases
Single data copy on shared storage device
No I/O overhead reducing throughput
Storage unit is single point of failure for cluster
All database services are clustered
SQL Agent; Analysis Services; Full-Text engine; MS DTC
Automatic failover (up to minutes)
DBMS accessed over virtual IP
Database not available from inactive node for DB client connections
Storage is controlled by one cluster node at a time
Requires hardware certified by Microsoft for Microsoft Cluster Service
HA Comparison
Database Mirroring                        Failover Clustering
Scope: user DB                            Scope: DBMS instance
Standard hardware                         Certified hardware
One SQL license (unless querying          One SQL license (only one node
snapshots on mirror)                      can access database)
Very fast failover (seconds)              Automatic failover (up to minutes)
OS flexible (e.g. 32/64)                  Enterprise OS
Independent storage                       Shared storage
Independent services                      Clustered services
Reporting on mirror                       Standby not available
Geographic separation OK                  Servers are usually co-located
Considerations for HA
HA complements backup and recovery strategy
Does not replace data recovery plan
Application service availability is often determined by a network of interdependent services
Availability can be difficult to define (e.g. partial failures)
Failure probability difficult to measure or compute
Increased system complexity could lead to lower service availability!
Operator error a leading cause of availability issues
Increased number/types of system components
More complex to configure and administer
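
A rough way to see both effects: services that all must be up multiply their availabilities, while redundant copies multiply their failure probabilities. A sketch under the strong assumption that failures are independent, which operator error and shared components often violate (function names are illustrative):

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every service must be up."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of a component with several redundant copies, any one of which suffices."""
    return 1 - (1 - availability) ** copies


# Three interdependent services at 99.9% each fall short of 99.9% together.
print(f"{serial_availability([0.999, 0.999, 0.999]):.6f}")
# Two redundant copies of a 99% component.
print(f"{redundant_availability(0.99, 2):.4f}")
```

Every added component is one more factor in the serial product, so redundancy only pays off if the failover machinery itself is reliable and well administered.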
Data Recovery Requirements
Disaster Recovery
Minimize downtime of business operations
Redundant systems and facilities
SQL Server features:
Transaction log shipping
Database mirroring
Failover clustering
Other technologies
Storage-based mirroring
Disaster Recovery Planning
Data security requirements
Clarify SLA, data loss allowance
Evaluate system cost vs. data protection
Failure analysis
System redundancy
Process validation
Training for personnel
Prevention practices
Executing disaster recovery and business continuity
Practice, practice, practice
Business Continuity Facility
System redundancy
Systems: Web servers; app servers; database, etc.
Data: Databases; data files on OS; security info, etc.
Networking: Domain, routing, subnet, VIPs, etc.
Alternate facilities
Network bandwidth
Physical or network access by operations staff
Failover
Often a deliberate decision, using manual failover
Data Redundancy
Synchronous redundancy
Network bandwidth cost
Network latency and application performance
Network reliability
Asynchronous redundancy
Risk of data loss
More cost-effective
Resilient to network latency issues
Candidate Technologies
SQL Server database mirroring
Failover clustering with SAN-based mirroring
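
The trade-off between the two redundancy styles can be put in rough numbers: synchronous copies add the inter-site round trip to every commit, while asynchronous copies leave a window of committed log that has not yet reached the remote site. A back-of-envelope sketch; both formulas are simplifying assumptions, not measurements:

```python
def synchronous_commit_ms(local_commit_ms, round_trip_ms):
    """Approximate commit latency when each commit waits for the remote copy."""
    return local_commit_ms + round_trip_ms


def async_exposure_mb(log_rate_mb_per_s, lag_seconds):
    """Approximate volume of committed log not yet at the remote site (data at risk)."""
    return log_rate_mb_per_s * lag_seconds


# A 2 ms local commit over a 40 ms inter-site round trip.
print(synchronous_commit_ms(2.0, 40.0), "ms per commit")
# 5 MB/s of log generation with the remote copy 30 seconds behind.
print(async_exposure_mb(5.0, 30), "MB at risk")
```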
DR Using Database Mirroring
Two sites: Primary and DR location
Separate failover clusters at each site
SQL Server database mirroring between sites
[Diagram: failover cluster at site A (nodes A1 and A2; Shared Storage A holding local system DBs, local quorum, and source user DB) linked by database mirroring to failover cluster at site B (nodes B1 and B2; Shared Storage B holding local system DBs, local quorum, and mirror user DB); witness is optional]
DR Using SAN-Based Mirroring
Two sites: Primary and DR location
Four-node failover cluster; one virtual IP address
SAN-based mirroring between sites
Manual cluster failover
[Diagram: failover cluster nodes at site A (A1 and A2; Shared Storage A) and at site B (B1 and B2; Shared Storage B) linked by storage-based mirroring; each shared storage unit holds system DBs, quorum, and user DBs]
Complementary Technologies
[Skip if time is running short.]
SAN-Based Data Mirroring
Data blocks duplicated at storage level
Similar to transaction log shipping
Copy performed in sequence and coordinated with database checkpoint
Ensures consistency of mirrored data files
Synchronous or asynchronous mirroring
Co-located or geographically dispersed—both are OK
SAN link bandwidth must support database I/O rate
May require extra feature support from SAN vendor
Could rely on Failover Clustering for HA
SQL Server Database Snapshots
Read-only point-in-time database snapshot
No data is copied—instantaneous
Historical snapshot pages tracked separately from changing pages
Snapshots can be maintained indefinitely
Limited only by available storage
Snapshot copy can be used for reporting
Read-only, so no locking issues
SQL Server Replication
Transactional replication
High transaction volume
Low data latency required
Mixed technologies: Integrates with other DBMS
Merge replication
Bi-directional data changes
Typically server-to-client
Snapshot replication
Large, infrequent data changes
Data change latency OK
Best for smaller data sets

Subscriber databases available for reporting
Replicate data subsets
Some data loss is possible
Periodically validate replicated data
App Development and Admin
Considerations for App Developers
App services tolerant to database service interruptions
Application transactions must be handled in code—data consistency
Exception handling for transaction retry, connection recovery
Requires coding standards, code reviews, and testing
Bulk data operations
Transaction volume impacts rollback time during failover
Batch jobs must be run on alternate nodes
Don’t bypass transaction logging
Synchronization with external data sources?
Be aware of database recovery model
Mirroring uses FailoverPartner in connection string
Use TCP/IP as client protocol
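
Tolerating a failover in application code usually means rerunning the whole transaction on a fresh connection after a short pause. A minimal sketch of such a retry wrapper; `TransientDatabaseError` is a hypothetical stand-in for whatever connection-lost exception the database driver raises, and the attempt count and delays are arbitrary:

```python
import time


class TransientDatabaseError(Exception):
    """Hypothetical stand-in for a driver's connection-lost or failover error."""


def run_with_retry(transaction, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run `transaction`, retrying with exponential backoff on transient errors.

    `transaction` must open its own connection and be safe to rerun as a whole,
    because work interrupted by a failover is rolled back on the server.
    """
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except TransientDatabaseError:
            if attempt == attempts:
                raise  # give up and surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5 s, 1 s, 2 s, ...
```

Passing `sleep` as a parameter keeps the wrapper testable without real delays. Errors that are not transient, such as constraint violations, are deliberately left to propagate on the first attempt.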
Considerations for Admins
Use identical server hardware, when possible
Design network redundancies, when feasible
Consider network latency for geographic separation
Always manage through virtual cluster, not individual cluster nodes
Retest failover/failback after HA maintenance
Diagnose after failover
Repair alternate node
Resynchronize data, as necessary
Be aware of primary/secondary locations
Ensure application services are connected and functioning properly
Keep server node configurations synchronized:
Service pack and patch levels
Duplicate non-redundant resources
Jobs; logins and permissions; OS & sys objects
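
Keeping node configurations synchronized is easier to verify when the comparison is automated. A small sketch that reports differences between two nodes, assuming each node's settings have already been collected into a dictionary (the setting names and values are made up for illustration):

```python
def config_drift(primary, standby):
    """Return {setting: (primary_value, standby_value)} for every setting that differs."""
    keys = sorted(set(primary) | set(standby))
    return {
        key: (primary.get(key), standby.get(key))  # None marks a missing setting
        for key in keys
        if primary.get(key) != standby.get(key)
    }


node_a = {"service_pack": "SP2", "agent_jobs": 12, "login:app_user": "present"}
node_b = {"service_pack": "SP3", "agent_jobs": 12}
for setting, (on_a, on_b) in config_drift(node_a, node_b).items():
    print(f"{setting}: node A = {on_a}, node B = {on_b}")
```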
HA Risks
System performance degradation
HA system complexity leads to availability issues
Some system failures not planned for
Backup and recovery planning incomplete
Administrators not fully trained or informed
User databases not synchronized with other data sources
Common Admin Use Cases
Maintain HA nodes
Hardware maintenance
Rolling upgrades and software patches
Resynchronize the redundant copy
Re-synch mirror
Restart log shipping
Diagnose and repair
Diagnose cause of failover
Repair failed node and restore failover capabilities
Test failover and failback
Common Admin Actions
Train administrators, and have them practice, to:
Initiate a database mirror
Manually fail over a mirror database or cluster node
Add or remove a passive node in a mirror or cluster
Upgrade/patch server nodes
Restart or redirect application services
More Information
References—Books
High Availability
Microsoft SQL Server 2008 High Availability with Clustering & Database Mirroring, by Michael Otey, 2009.
Microsoft SQL Server High Availability, by Paul Bertucci, 2004.
Pro SQL Server 2005 High Availability, by Allan Hirt, 2007.

Related Topics
Pro SQL Server 2005 Replication, by Sujoy Paul, 2006.
Pro SQL Server 2005 Service Broker, by Klaus Aschenbrenner, 2007.
The Rational Guide to SQL Server 2005 Service Broker, by Roger Wolter, 2006.
References—Presentations
Microsoft Load Balancing and Clustering
http://ce.sharif.edu/courses/84-85/2/ce317/resources/root/lecture%20slides/14.%20Microsoft%20Load%20Balancing%20and%20Clustering.ppt
SQL Server 2005 High Availability
http://www.atlantamdf.com/Presentations/AtlantaMDF_111207HA.ppt
High Availability Technologies In SQL Server 2000 And SQL Server 2005
http://202.181.238.2/hk/teched2004/ppt/Day_2_Rm407/DAT431(1330-1445).ppt
Meeting the Availability Challenge
http://download.microsoft.com/download/E/D/C/EDCF54DB-19CD-4882-9FC4-4F7D46FCEAA6/HighAvailability.ppt
Disaster Recovery Mistakes
http://www.sqlsig.org/Oct%2011%20DASSUG%20-%20Jason%20Hall%2010-11-07%20MM.ppt
SQL Server 2005 High Availability
http://blogs.msdn.com/sql2005event/attachment/564303.ashx
Effective Usage of SQL Server 2005 Database Mirroring
http://www.sqlserver-qa.net/SSQA-Effective%20Usage%20of%20SQL%20Server%202005%20Database%20Mirroring_show.ppt
References—Articles
Achieve High Availability for SQL Server
http://technet.microsoft.com/en-us/magazine/cc162477.aspx
Geographically Dispersed Clusters in Windows Server 2003
http://www.microsoft.com/windowsserver2003/techinfo/overview/clustergeo.mspx
Restoring file and filegroup backups
http://support.microsoft.com/kb/281122/en-us
Restoring specific tables or rows from backups
http://support.microsoft.com/kb/321836/en-us
Maintaining Availability During Upgrades
http://msdn.microsoft.com/en-us/library/ms191449.aspx