Clarifying vSphere Fault Tolerance

I hear a lot of confusion about some of the new enhancements in vSphere 6.  One in particular is Fault Tolerance (FT).

In case you do not know what FT is, it is a feature that was (supposed to be) meant for the handful of your most critical VMs that High Availability (HA) didn’t protect well enough.  HA restarts a VM on another host if the physical ESXi host it was running on fails or, if you enable VM Monitoring, restarts a VM that blue screened or locked up.  Note that the VM is down while it restarts and the OS inside it boots.  FT effectively runs a second copy of the VM in lockstep on another host, so should the host the live VM runs on fail, the second copy immediately takes over on the other host, with no downtime.

Please note that Fault Tolerance, in vSphere 6 or any previous version of vSphere, does not protect against an application crash itself unless the application crashed due to a hardware failure.  It effectively only protects against failures pertaining to hardware, like a host failure.  There is no change there.  If you want protection from application failures, you should still look at application clustering and high availability solutions, like Exchange DAGs, network load balancing, SQL clustering, etc.  On the flip side, I have personally seen many environments actually have MORE downtime because of application clustering solutions, especially when customers don’t know how to manage them properly, whereas FT is a breeze to manage.

The problem with FT in the past was that it had so many limitations.  The VM’s disks had to be eager zeroed thick provisioned, you could not vMotion the VM or the second copy, and more, but the biggest limitation was that the VM could only have 1 vCPU.  If you’re wondering how many critical apps only need 1 vCPU, the answer is pretty much zero.  Almost all need more, so FT became the coolest VMware feature nobody used.

That changes in vSphere 6.  You can use FT to protect VMs with up to 4 vCPUs.  They can be thin or thick provisioned.

FT protected VMs can now be backed up with whole VM backup products that utilize the VMware backup APIs, which is all of the products that back up whole VMs: Veeam, vSphere Data Protection, etc.  This is a pretty big deal!

You can now hot configure FT for a VM on the fly, without disrupting the VM if it is currently running, which is also really cool.  Maybe you have a two node Microsoft cluster, and one node gets corrupted.  Enable FT on the remaining one to provide extra protection until the second node is rebuilt!

Also, the architecture changed.  This is good and bad.  In the past, FT required all of the VM’s disks to be on shared storage, and the active and passive VMs used the same virtual disk files, VM config files, etc.  This is no longer the case.  Now the storage is replicated as well, and it can be to the same datastore or different datastores.  Those datastores can be on completely different storage arrays if you want.  On the downside, FT protected VMs need twice the storage they did before, but the good news is a storage failure may not take out both data sets and kill the VM, too!
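To put a number on the "twice the storage" point, here is a rough Python sketch.  The function name and the disk sizes are made up for illustration, and it only counts the virtual disks themselves, so treat the result as a ballpark that ignores config files, swap, and any other overhead.

```python
def ft_storage_needed_gb(disk_sizes_gb, ft_enabled=True):
    """Approximate datastore space (GB) for a VM's virtual disks.

    Assumption for illustration: with FT enabled, the second copy keeps
    its own full set of disks, so the space needed simply doubles.
    """
    copies = 2 if ft_enabled else 1
    return copies * sum(disk_sizes_gb)


if __name__ == "__main__":
    disks = [100, 200]  # hypothetical VM with a 100 GB and a 200 GB disk
    print("Without FT:", ft_storage_needed_gb(disks, ft_enabled=False), "GB")
    print("With FT:   ", ft_storage_needed_gb(disks), "GB")
```

The two copies can land on the same datastore or on different ones, so the total is what matters for capacity planning, not where each copy lives.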

In my opinion, these changes have finally made FT something that should definitely be considered, and it will be implemented far more commonly.

So while a lot of the restrictions were lifted, there are still some left, notably:

  • Limit of 4 vCPUs and 64 GB of RAM for an FT protected VM.
  • Far more hardware is supported, but you still need hardware that is officially supported.
  • VM and the FT copy MUST be stored on VMFS volumes.  No NFS, VSAN, or VVOL stored VMs!
  • You cannot replicate the VM using vSphere Replication for a DR solution.
  • No Storage DRS support for FT protected VMs.
  • 10 Gb networking is highly recommended.  Network bandwidth is the first resource that runs out when protecting VMs with FT.  So if you were thinking FT with the storage replication would be a good DR solution across sites, uhh, no.
  • Only 4 FT active or passive copies per host.
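
If you want to sanity check a VM against the list above before turning FT on, here is a minimal Python sketch.  Everything in it is an assumption for illustration: `VmSpec` and `ft_blockers` are hypothetical names, not part of any VMware API, and the limits are simply the ones listed in this post.  It does not check hardware support or network bandwidth, since those can't be judged from a simple VM description.

```python
from dataclasses import dataclass
from typing import List

# Limits as listed in this post for vSphere 6 FT. Treat them as assumptions
# and confirm against VMware's documentation for your exact version.
MAX_VCPUS = 4
MAX_RAM_GB = 64
MAX_FT_COPIES_PER_HOST = 4
SUPPORTED_DATASTORE_TYPES = {"VMFS"}


@dataclass
class VmSpec:
    """Hypothetical description of a VM; not a VMware API object."""
    name: str
    vcpus: int
    ram_gb: int
    datastore_type: str  # e.g. "VMFS", "NFS", "VSAN", "VVOL"
    uses_vsphere_replication: bool = False
    uses_storage_drs: bool = False


def ft_blockers(vm: VmSpec, ft_copies_on_host: int = 0) -> List[str]:
    """Return the reasons this VM can't be FT protected (empty list if none)."""
    problems = []
    if vm.vcpus > MAX_VCPUS:
        problems.append(f"{vm.vcpus} vCPUs is over the {MAX_VCPUS} vCPU limit")
    if vm.ram_gb > MAX_RAM_GB:
        problems.append(f"{vm.ram_gb} GB RAM is over the {MAX_RAM_GB} GB limit")
    if vm.datastore_type.upper() not in SUPPORTED_DATASTORE_TYPES:
        problems.append(f"{vm.datastore_type} storage is not supported, VMFS only")
    if vm.uses_vsphere_replication:
        problems.append("vSphere Replication cannot be combined with FT")
    if vm.uses_storage_drs:
        problems.append("Storage DRS is not supported for FT protected VMs")
    if ft_copies_on_host >= MAX_FT_COPIES_PER_HOST:
        problems.append(
            f"host already runs {ft_copies_on_host} FT copies "
            f"(limit {MAX_FT_COPIES_PER_HOST})"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical VM that is too big and sits on NFS.
    vm = VmSpec("sql01", vcpus=8, ram_gb=32, datastore_type="NFS")
    for reason in ft_blockers(vm):
        print(f"{vm.name}: {reason}")
```

Returning a list of reasons rather than a plain yes/no makes it easier to see what would have to change (fewer vCPUs, a move to VMFS, and so on) before FT becomes an option.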

So, if you’re thinking about a vSphere solution for a customer, and you pretty much dismissed FT, consider it now.  And if you support environments with VMware, get ready to see more FT as vSphere 6 gets adopted!