Hyper-V R2 High-Availability DEEP DIVE! Greg Shields, MVP, vExpert Head Geek, Concentrated Technology www.ConcentratedTech.com
This slide deck was used in one of our many conference presentations. We hope you enjoy it, and invite you to use it within your own organization however you like. For more information on our company, including information on private classes and upcoming conference appearances, please visit our Web site, www.ConcentratedTech.com. For links to newly-posted decks, follow us on Twitter: @concentrateddon or @concentratdgreg. This work is copyright © Concentrated Technology, LLC
Agenda Part I Understanding Live Migration’s Role in Hyper-V HA Part II The Fundamentals of Windows Failover Clustering Part III Building a Two-Node Hyper-V Cluster with iSCSI Storage Part IV Walking through the Management of a Hyper-V Cluster Part V Adding Disaster Recovery with Multi-Site Clustering
Part I Understanding Live Migration’s Role in Hyper-V HA
Do You Really Need HA? High availability adds dramatically greater uptime for virtual machines. Protection against host failures Protection against resource overuse Protection against scheduled/unscheduled downtime High availability also adds much greater cost… Shared storage between hosts Connectivity Higher (and more expensive) software editions Not every environment needs HA!
What Really is Live Migration? Part 1:  Protection from Host Failures
What Really is Live Migration? Part 2:  Load Balancing of VM/host Resources
Comparing Quick w/ Live Migration Simply put: Migration speed is the difference. In Hyper-V’s original release, a Hyper-V virtual machine could be relocated with “a minimum” of downtime. This downtime was directly related to… …the amount of memory assigned to the virtual machine …the connection speed between virtual hosts and shared storage. Virtual machines with more assigned virtual memory, or on slower networks, took longer to complete a migration from one host to another; those with less completed it sooner. With QM, a VM with 2 GB of vRAM could take 32 seconds or longer to migrate! Downtime ensues…
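As a rough sanity check on that figure (an illustrative estimate, not a benchmark from this deck): saving 2 GB of memory state to shared storage and reading it back on the target means moving roughly 4 GB in total. Over a path that sustains about 125 MB/s (gigabit speed), that is 4,096 MB ÷ 125 MB/s ≈ 33 seconds, right in line with the 32-second figure above; a slower storage path only stretches it further.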
Comparing Quick w/ Live Migration Down/dirty details… During a Quick Migration, the virtual machine is immediately put into a “Saved” state. This state is not a power down, nor is it the same as the Paused state. In the saved state – and unlike pausing – the virtual machine releases its memory reservation on the host machine and stores the contents of its memory pages to disk. Once this has completed, the target host can take over ownership of the virtual machine and bring it back into operation.
Comparing Quick w/ Live Migration Down/dirty details… This saving of virtual machine state consumes most of the time involved in a Quick Migration. What was needed to reduce this delay was a mechanism to pre-copy the virtual machine’s memory from source to target host while the VM kept running, and at the same time log changes to memory pages that occur during the copy. These changes tend to be relatively small in quantity, making the delta copy significantly smaller and faster than the original copy. Once the initial copy has completed, Live Migration then… …pauses the virtual machine …copies the memory deltas …transfers ownership to the target host. Much faster. Effectively “zero” downtime.
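On an R2 cluster, that entire pre-copy/delta/handoff sequence is driven by a single operation. A minimal sketch using the FailoverClusters PowerShell module that ships with 2008 R2 (the VM and node names are hypothetical):

```powershell
Import-Module FailoverClusters

# Live-migrate the clustered VM "FILESRV1" to node HV02. The cluster
# performs the memory pre-copy, copies the page deltas, then transfers
# ownership -- the exact sequence described above.
Move-ClusterVirtualMachineRole -Name "FILESRV1" -Node "HV02"

# For comparison, a plain group move takes the save/move/restore
# (Quick Migration) path instead:
# Move-ClusterGroup -Name "FILESRV1" -Node "HV02"
```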
Part II The Fundamentals of Windows Failover Clustering
Why Clustering Fundamentals? Isn’t this, after all, a workshop on Hyper-V? It is, but the only way to do highly-available Hyper-V is atop Windows Failover Clustering. Many people have given clustering a pass due to early difficulties with its technologies. Microsoft did us all a disservice by making every previous version of Failover Clustering ridiculously painful to implement. Most IT pros have no experience with clustering. …but clustering doesn’t have to be hard. It just feels like it does! Doing clustering badly means doing HA Hyper-V badly!
Clustering’s Sordid History Windows NT 4.0 Microsoft Cluster Service (“wolfpack”) High-availability service that reduced availability “As the corporate expert in Windows clustering, I recommend you don’t use Windows clustering.” Windows 2000 Greater availability, scalability. Still painful. Windows 2003 Added iSCSI storage to traditional Fibre Channel. SCSI Resets still used as method of last resort (painful). Windows 2008 Eliminated use of SCSI Resets. Eliminated full-solution HCL requirement. Added Cluster Validation Wizard and pre-cluster tests. First version truly usable by IT generalists.
What’s New & Changed in 2008 x64 EE gets up to 16 nodes. Backups get VSS support. Disks can be brought online without taking dependencies offline. This allows disk extension without downtime. GPT disks are supported. Cluster self-healing. No longer reliant on disk signatures. Multiple paths for identifying “lost” or failed disks. IPv6 & DHCP support. Network Name resource now uses DNS instead of WINS. Network Name resource more resilient. Loss of an IP address need not bring the Network Name resource offline. Geo-clustering…! a.k.a. cross-subnet clustering. Cluster communications use TCP unicast and can span subnets.
So, What IS a Cluster?
So, What IS a Cluster? Quorum Drive & Storage for Hyper-V VMs
Cluster Quorum Models Ever been to a Kiwanis meeting…? A cluster “exists” because it has quorum between its members. That quorum is achieved through a voting process. Different Kiwanis clubs have different rules for quorum. Different clusters have different rules for quorum. If a cluster “loses quorum”, the entire cluster shuts down and ceases to exist. This lasts until quorum is regained. This is much different from a resource failover, which is the reason clusters are implemented. Multiple quorum models exist, for different reasons.
Node & Disk Majority Node and Disk Majority eliminates Win2003’s quorum disk as a single point of failure. Works on a “voting system”. A two-node cluster gets three votes: one for each node and one for the quorum disk. Two votes are needed for quorum. Because of this model, the loss of the quorum disk results in the loss of only one vote. Used when an even number of nodes is in the cluster. Most-deployed model in production.
Node Majority Only the cluster nodes get votes; storage gets none. Requires 3+ votes, so a minimum of three members is needed. Used when the number of cluster nodes is odd. Can use replicated storage instead of shared storage. Handy for stretch clusters.
File Share Witness Model Clustering without the nasty (expensive) shared storage! (Sort of… OK… not really…) One file server can serve as witness for multiple clusters. Can be used for non-production Hyper-V clusters. (eval/demo only) Most flexible model for stretch clusters. Eliminates issues of complete site outage.
Witness Disk Model Nodes get no votes. Only the quorum disk does. Cluster remains up as long as one node can talk to the witness disk. Effectively the same as the legacy model. Bad. SPOF. Don’t use.
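For reference, each of these models maps to one switch on a single cmdlet in the 2008 R2 FailoverClusters module. A hedged sketch (the cluster, disk, and share names are hypothetical):

```powershell
Import-Module FailoverClusters

# Node and Disk Majority -- the usual choice for an even node count:
Set-ClusterQuorum -Cluster HVCLUS -NodeAndDiskMajority "Cluster Disk 1"

# Node Majority -- odd node counts, or stretch clusters on replicated storage:
Set-ClusterQuorum -Cluster HVCLUS -NodeMajority

# Node and File Share Majority -- the witness lives on a file server:
Set-ClusterQuorum -Cluster HVCLUS -NodeAndFileShareMajority "\\WITNESS1\HVCLUS-FSW"

# Disk Only -- the legacy witness disk model above. Don't use:
# Set-ClusterQuorum -Cluster HVCLUS -DiskOnly "Cluster Disk 1"
```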
4 Steps to Cluster! Step 1:  Configure shared storage. Hardware SAN Software SAN a la StarWind iSCSI Target Software Step 2:  Attach Hyper-V Hosts to the iSCSI Target Step 3:  Configure Windows Failover Clustering Step 4:  Configure Hyper-V
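A condensed sketch of what those four steps look like from the command line (the portal address, IQN, names, and IPs are all hypothetical; the iSCSI attach uses iscsicli.exe, since 2008 R2 has no iSCSI PowerShell module, and Step 1 happens on the SAN itself):

```powershell
# Step 2: on each Hyper-V host, attach to the iSCSI target.
iscsicli QAddTargetPortal 192.168.10.50
iscsicli ListTargets
iscsicli QLoginTarget iqn.2008-08.com.starwindsoftware:san1-hvclus

# Step 3: validate the configuration, then create the cluster (run once).
Import-Module FailoverClusters
Test-Cluster -Node HV01, HV02
New-Cluster -Name HVCLUS -Node HV01, HV02 -StaticAddress 192.168.10.60

# Step 4: make an existing VM highly available.
Add-ClusterVirtualMachineRole -VirtualMachine "FILESRV1"
```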
Part III -VIDEO- Building a Two-Node Hyper-V Cluster with iSCSI Storage
Part IV Walking through the Management of a Hyper-V Cluster
Cluster Shared Volumes Hyper-V v.1 required a single VM per LUN. v.1’s clustering underpinnings weren’t aware of the files on a LUN. The “disk” was the cluster resource to fail over. Remember that only one node at a time can own a resource. v.2 adds cluster-awareness to individual volumes. This means that individual files on a LUN can be owned by different hosts. Hosts respect each other’s ownership.
Cluster Shared Volumes Because NTFS is still the file system, this means creating a meta-system of ownership information. Each cluster node checks for ownership, respects the ownership of others, and updates the info when it takes over ownership. Designed for use only by Hyper-V’s tiny number of files.
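Enabling CSV and adding a disk to it is a short scripted operation in 2008 R2; a minimal sketch, assuming a cluster named HVCLUS and a disk resource named “Cluster Disk 2” (both hypothetical):

```powershell
Import-Module FailoverClusters

# One-time: turn on Cluster Shared Volumes for the cluster.
(Get-Cluster HVCLUS).EnableSharedVolumes = "Enabled"

# Convert a clustered disk into a CSV; it then surfaces on every node
# under C:\ClusterStorage\VolumeN.
Add-ClusterSharedVolume -Name "Cluster Disk 2"
Get-ClusterSharedVolume
```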
Going Beyond Two Nodes Windows Failover Clustering gets non-linearly more complex as you add more hosts. Complexity arrives in failover options. Some critical best practices: Manage Preferred Owners & Persistent Mode options correctly. Consider carefully the effects of Failback. Resist creating hybrid clusters that support other services. Integrate SCVMM for dramatically improved management. Use disk “dependencies” as Affinity/Anti-Affinity rules. Add servers in pairs. Segregate traffic!!!
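Two of those practices are directly scriptable. A hedged sketch of setting preferred owners and an anti-affinity tag (the group and node names are hypothetical; the anti-affinity property is set here with cluster.exe, a common approach on 2008 R2):

```powershell
Import-Module FailoverClusters

# Preferred owners: keep FILESRV1 on HV01/HV02 unless both are down.
Set-ClusterOwnerNode -Group "FILESRV1" -Owners HV01, HV02

# Anti-affinity: tag both domain controller VMs so the cluster avoids
# placing them on the same node after a failover.
cluster.exe group "DC1-VM" /prop AntiAffinityClassNames="DomainControllers"
cluster.exe group "DC2-VM" /prop AntiAffinityClassNames="DomainControllers"
```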
Best Practices in Network Segregation
Best Practices in Network Segregation
-DEMO- Walking through the Management of a Hyper-V Cluster
Part V Adding Disaster Recovery with Multi-Site Clustering
What Makes a Disaster? Which of the following would you consider a disaster? A naturally-occurring event, such as a tornado, flood, or hurricane, impacts your datacenter and causes damage. That damage causes the entire processing of that datacenter to cease. A widespread incident, such as a water leakage or long-term power outage, that interrupts the functionality of your datacenter for an extended period of time. A problem with a virtual host creates a “blue screen of death”, immediately ceasing all processing on that server. An administrator installs a piece of code that causes problems with a service, shutting down that service and preventing some action from occurring on the server. An issue with power connections causes a server or an entire rack of servers to inadvertently and rapidly power down.
What Makes a Disaster? Which of the following would you consider a disaster? A naturally-occurring event, such as a tornado, flood, or hurricane, impacts your datacenter and causes damage. That damage causes the entire processing of that datacenter to cease. A widespread incident, such as a water leakage or long-term power outage, that interrupts the functionality of your datacenter for an extended period of time. A problem with a virtual host creates a “blue screen of death”, immediately ceasing all processing on that server. An administrator installs a piece of code that causes problems with a service, shutting down that service and preventing some action from occurring on the server. An issue with power connections causes a server or an entire rack of servers to inadvertently and rapidly power down. DISASTER! JUST A BAD DAY!
What Makes a Disaster? Your business’s decision to “declare a disaster” and move to “disaster operations” is a major one. The technologies used for disaster protection are different from those used for HA. More complex. More expensive. Failover and failback processes involve more thought.
What Makes a Disaster? At a very high level, disaster recovery for virtual environments is three things: A storage mechanism A replication mechanism A set of target servers to receive virtual machines and their data
What Makes a Disaster? Storage Device(s) Replication Mechanism Target Servers
Storage Device Typically, two SANs in two different locations Fibre Channel or iSCSI Usually a similar model or manufacturer. This is often necessary for the replication mechanism to function properly. The backup SAN doesn’t necessarily need to be the same size as the primary SAN. Replicated data isn’t always the full set of data.
Replication Mechanism Replication between SANs can occur… Synchronously Changes are made on one node at a time. Subsequent changes on the primary SAN must wait for an ACK from the backup SAN. Asynchronously Changes on the backup SAN will eventually be written. They are queued at the primary SAN to be transferred at intervals.
Replication Mechanism Synchronously Changes are made on one node at a time. Subsequent changes on the primary SAN must wait for an ACK from the backup SAN.
Replication Mechanism Asynchronously Changes on the backup SAN will eventually be written. They are queued at the primary SAN to be transferred at intervals.
Replication Mechanism Which Should You Choose…? Synchronous Assures no loss of data. Requires a high-bandwidth and low-latency connection. Write and acknowledgement latencies impact performance. Requires shorter distances between storage devices. Asynchronous Potential for loss of data during a failure. Leverages smaller-bandwidth connections, more tolerant of latency. No performance impact. Potential to stretch across longer distances. Your Recovery Point Objective makes this decision…
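A quick worked example of how the Recovery Point Objective drives the choice (illustrative numbers, not from this deck): if the backup SAN receives queued changes every 15 minutes, a failure just before a transfer loses up to 15 minutes of committed writes, plus whatever backlog had not finished sending. An RPO of zero therefore demands synchronous replication; an RPO of one hour comfortably permits the asynchronous interval above.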
Replication Mechanism Replication processing can occur… Storage Layer Replication processing is handled by the SAN itself. Often agents are installed on virtual hosts or machines to ensure crash consistency. Easier to set up, fewer moving parts. More scalable. Concerns about crash consistency. OS / Application Layer Replication processing is handled by software in the VM OS. This software also operates as the agent. More challenging to set up, more moving parts. More installations to manage/monitor. Scalability and cost are linear. Fewer concerns about crash consistency.
The Problem with Transactional Databases O/S crash consistency is easy to obtain. Just quiesce the file system before beginning the replication. Application crash consistency is much harder. Transactional databases like AD, Exchange, and SQL don’t quiesce when the file system does. You need to stop these databases before quiescence, or you need an agent in the VM that handles DB quiescence. Replication without crash consistency will lose data. The DB comes back in an “inconsistent” state.
Four-Step Process for VSS Step 1: A requestor, such as replication software, asks the server to invoke a shadow copy. Step 2: The VSS service accepts the request and calls an application-specific writer (SQL, Exchange, etc.) if necessary. Step 3: The application-specific writer coordinates the system shadow copy with app quiescence to ensure application consistency. Step 4: The shadow copy is created. …then the replication can start…
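You can exercise this four-step dance by hand with diskshadow.exe, the in-box VSS requestor on Server 2008/R2. A minimal sketch driven from PowerShell (the volume letter, alias, and paths are hypothetical):

```powershell
# Write a DiskShadow script that walks the four VSS steps, then run it.
@"
SET CONTEXT PERSISTENT
ADD VOLUME D: ALIAS VmStore
CREATE
EXPOSE %VmStore% S:
"@ | Set-Content C:\vss-snap.txt

# CREATE triggers the request/quiesce/snapshot sequence; EXPOSE mounts
# the shadow copy as S: so replication software can read a quiet copy.
diskshadow.exe /s C:\vss-snap.txt
```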
Target Servers & Cluster Finally, there is a set of target servers in the backup site. With Hyper-V, these servers are part of a multi-site Hyper-V cluster. A multi-site cluster is the same as a single-site cluster, except that it spans multiple sites. Some changes to management and configuration tactics are required.
Multi-Site Cluster Tactics Install servers to sites so that your primary site always contains more servers than backup sites. Eliminates some problems with quorum during site outage.
Multi-Site Cluster Tactics Leverage Node and File Share Quorum when possible. Prevents entire-site outage from impacting quorum. Enables creation of multiple clusters if necessary. Third Site for Witness Server
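In quorum terms, that means pointing the witness at a file share, ideally in a third site, so the loss of either main site cannot take quorum with it. A hedged one-liner (the names are hypothetical):

```powershell
Import-Module FailoverClusters
Set-ClusterQuorum -Cluster HVCLUS -NodeAndFileShareMajority "\\SITE3-FS\HVCLUS-FSW"
```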
Multi-Site Cluster Tactics Ensure that networking remains available when VMs migrate from the primary to the backup site. R2 clustering can now span subnets. This seems like a good thing, but only if you plan correctly for it. Remember that crossing subnets also means changing the IP address, subnet mask, gateway, etc., at the new site. This can be done automatically by using DHCP and dynamic DNS, or it must be updated manually. DNS replication is also a problem. Clients will require time to update their local cache. Consider reducing the DNS TTL or clearing the client cache.
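A hedged sketch of both mitigations (the server, zone, record name, and address are hypothetical; dnscmd ships with the Windows DNS Server tools):

```powershell
# Re-register the VM's A record with a short five-minute TTL so clients
# pick up the new-site address quickly after a cross-subnet failover.
dnscmd DC1 /RecordAdd contoso.com filesrv1 300 A 10.2.0.25

# On clients still holding the old address, flush the resolver cache.
ipconfig /flushdns
```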
 
This slide deck was used in one of our many conference presentations. We hope you enjoy it, and invite you to use it within your own organization however you like. For more information on our company, including information on private classes and upcoming conference appearances, please visit our Web site, www.ConcentratedTech.com. For links to newly-posted decks, follow us on Twitter: @concentrateddon or @concentratdgreg. This work is copyright © Concentrated Technology, LLC
Editor's Notes
  • #19: To show how to configure cluster quorum settings, create a single-node cluster. Then, right-click the cluster name and choose More Actions | Configure Cluster Quorum Settings.