I recently set up a VMware Storage Metro Cluster using EMC’s VPLEX with synchronous storage replication. I wanted to put out a description of the failover logic that’s hopefully easy to understand.
VPLEX has volumes that are synchronously replicated, presented to VMware ESXi hosts that are in a single cluster, half of which are in SiteA, the other half in SiteB, and there’s a VPLEX witness in SiteC.
There are a couple of concepts to get out of the way. First off, VPLEX systems in this configuration have to be able to connect to each other over two different subnets: management and storage replication. The Witness has to be able to connect to both VPLEXes via their management network. These network connections are EXTREMELY important, so do everything you reasonably can to keep the VPLEXes in particular from becoming network isolated from each other. Bad things will often happen if they do. Notice that EMC requires quite a bit of network redundancy.
VPLEX synchronously replicates storage within Consistency Groups, which contain one or more LUNs. For the rest of this explanation, I may just say LUN; assume we’re talking about a consistency group that has only one LUN.
VPLEX consistency groups also contain two settings. One dictates which is the preferred site. This basically means that under various failure scenarios, the identified site’s VPLEX will hold the sole active copy of the LUN. That value can be one of the two VPLEXes, or nothing at all. The other setting states whether or not the Witness should be used to determine the proper site for a LUN under various scenarios.
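As a rough mental model of those two settings, here is a small Python sketch. This is not VPLEX's actual API or CLI; the class, field names, and site labels are made up purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical site labels, matching the names used in this post.
SITES = ("SiteA", "SiteB")


@dataclass(frozen=True)
class ConsistencyGroup:
    """Illustrative model of the two settings described above."""
    name: str
    preferred_site: Optional[str]  # "SiteA", "SiteB", or None for no preference
    use_witness: bool              # should the witness factor into the decision?

    def __post_init__(self):
        # Reject anything other than a known site or "no preference".
        if self.preferred_site is not None and self.preferred_site not in SITES:
            raise ValueError(f"unknown site: {self.preferred_site}")


# Example: a group that prefers SiteA and lets the witness guide failover.
cg = ConsistencyGroup("cg_sql01", preferred_site="SiteA", use_witness=True)
```

Every scenario below boils down to a decision based on these two values plus what each site can still see on the network.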
For those of you who aren’t familiar with VPLEX, the witness is either a virtual appliance or physical server. It is optional but highly recommended. If you deploy it, it must go in a third site, it must have connectivity to both VPLEX management links, and those links must be completely independent of each other.
Failure scenarios work very similarly to majority node set clusters, such as MS Clustering. Most of this works in the manner you’d probably guess if you’ve ever dealt with high availability solutions, especially those that cross site links, such as an Exchange 2010 DAG. It’s pretty much majority node set, or majority node set plus witness, type logic. I want to focus on specific scenarios that have very significant design implications for vSphere in a Storage Metro Cluster, and on how VPLEX avoids split brain when network links go down.
The chief concept to remember in all this is that VPLEX must always always ALWAYS make sure a LUN doesn’t become active in both VPLEX sites simultaneously when they can’t talk to each other. If that happened, the two copies of a single LUN would diverge, each potentially holding data that can’t be lost, with no real way to sync them up anymore. Under normal operations, VPLEX allows both sites to actively write data to the LUN, but the minute a VPLEX in a site goes down or gets disconnected, ONLY one of them may keep an active copy of the LUNs. The absolute worst thing that could ever happen, even worse than prolonged downtime, is ending up with two disparate copies of the same LUN.
Scenario 1: What happens if the synchronous storage replication link between the two sites for VPLEX goes down? What if total connectivity only between the two VPLEX sites is lost?
The problem here is that VPLEX can’t synchronously write data to both copies of the LUN in each site anymore. LUNs therefore must become active in SiteA, in SiteB, or, worst case, in neither.
How does this work? It depends on what site preference is set on the consistency group. It doesn’t really matter whether the option to use the witness is set, or whether a witness is even present. If no site preference has been identified for the consistency group, the LUNs must go offline in both sites because there’s no way to determine the right site in this situation. If a site preference is defined, LUNs become active in their preferred sites only. The existence of the witness here is irrelevant because both VPLEXes can still talk to each other via their management link.
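To make that rule concrete, here is a tiny Python sketch of the decision as described in this post. The function name and site labels are hypothetical; this models the behavior, it does not call anything in VPLEX.

```python
from typing import Optional, Set


def active_sites_on_replication_link_loss(preferred_site: Optional[str]) -> Set[str]:
    """Return the sites where the LUN stays active when only the
    replication link between the VPLEXes is lost.

    The witness setting is deliberately not a parameter: per the
    description above, it plays no part in this scenario.
    """
    if preferred_site is None:
        # No preference: no way to pick a winner, so offline everywhere.
        return set()
    # Preference set: only the preferred site keeps the LUN active.
    return {preferred_site}


print(active_sites_on_replication_link_loss("SiteA"))  # {'SiteA'}
print(active_sites_on_replication_link_loss(None))     # set()
```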
There’s a VMware implication here: you should probably note the preferred failover site somewhere in the datastore name, and then create VM-to-host “should” rules so that VMs on datastores backed by LUNs that prefer the SiteA VPLEX run on SiteA ESXi hosts, and likewise for SiteB. This eliminates HA events caused by connectivity problems between the VPLEXes, specifically the synchronous storage link.
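Here is a hedged Python sketch of how a naming convention could drive those rules. The datastore naming pattern (a `_SiteA_` or `_SiteB_` tag in the name), the VM names, and the helper functions are all assumptions for illustration; the output is simply the VM membership you would put into each site's VM group before creating the should rules yourself.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed convention: the preferred site appears as a tag in the datastore name.
SITE_TAG = re.compile(r"_(SiteA|SiteB)_")


def preferred_site_of(datastore_name: str) -> str:
    """Extract the preferred site from a datastore name like 'VPLEX_SiteA_DS01'."""
    match = SITE_TAG.search(datastore_name)
    if match is None:
        raise ValueError(f"no site tag in datastore name: {datastore_name}")
    return match.group(1)


def group_vms_by_site(vm_to_datastore: Dict[str, str]) -> Dict[str, List[str]]:
    """Group VM names by the preferred site of the datastore they live on."""
    groups = defaultdict(list)
    for vm, datastore in sorted(vm_to_datastore.items()):
        groups[preferred_site_of(datastore)].append(vm)
    return dict(groups)


# Hypothetical inventory: VM name -> datastore name.
placement = group_vms_by_site({
    "sql01": "VPLEX_SiteA_DS01",
    "web01": "VPLEX_SiteB_DS01",
    "app01": "VPLEX_SiteA_DS02",
})
print(placement)  # {'SiteA': ['app01', 'sql01'], 'SiteB': ['web01']}
```

Each resulting list maps to one VM group, paired with the host group for the ESXi hosts in the same site.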
Scenario 2: What happens if the management connectivity between the VPLEXes goes down?
Everything continues to work because the VPLEXes can still communicate via the storage replication link. Both VPLEXes keep calm and carry on writing I/O in both sites, just like before. The presence and options concerning the witness are irrelevant.
Scenario 3: What happens if there’s a total loss of connectivity between the two VPLEX sites, both management and storage replication, but both sites can communicate to a witness if there is one?
In this scenario, basically, the outcome is one of two things: if the LUN has a preferred site identified, it becomes active on that site only. If it doesn’t, it goes offline in both sites. The witness, regardless of whether the consistency group option to use it in the decision is enabled, serves as a communication mechanism that lets both sites know this is the scenario. Otherwise, neither VPLEX system could tell this situation apart from the other one having actually failed.
Scenario 4: What happens if a VPLEX failed in one site?
It depends on whether the witness option on the VPLEX consistency group is enabled (and of course whether you deployed a witness). If it is enabled, the LUN fails over to the surviving site. If the option isn’t enabled, it depends on whether the preferred site is the one that failed. If it is, the LUN goes offline. If the non-preferred site failed, the LUN remains active in the preferred site. You should see the value of a witness now. Usually, having a witness and enabling this option is a good thing. But not always!
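The site-failure logic above can be sketched in Python as follows. This is a model of the behavior as described in this post, with hypothetical names. One assumption is flagged in the code: the post doesn't spell out what happens with no witness and no preferred site, so the sketch treats that as offline.

```python
from typing import Optional, Set

SITES = ("SiteA", "SiteB")  # hypothetical site labels


def active_sites_after_site_failure(
    failed_site: str,
    preferred_site: Optional[str],
    use_witness: bool,
) -> Set[str]:
    """Return where the LUN is active after the VPLEX in one site fails."""
    if failed_site not in SITES:
        raise ValueError(f"unknown site: {failed_site}")
    survivor = next(site for site in SITES if site != failed_site)

    if use_witness:
        # Witness confirms the other site is really down: survivor takes over.
        return {survivor}
    if preferred_site == survivor:
        # Non-preferred site failed: LUN stays active in the preferred site.
        return {survivor}
    # Preferred site failed (or, as an assumption, no preference was set):
    # the survivor cannot safely take over, so the LUN goes offline.
    return set()


print(active_sites_after_site_failure("SiteA", "SiteA", use_witness=True))   # {'SiteB'}
print(active_sites_after_site_failure("SiteA", "SiteA", use_witness=False))  # set()
print(active_sites_after_site_failure("SiteB", "SiteA", use_witness=False))  # {'SiteA'}
```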
Scenario 5: What happens if all sites stay up, but network connectivity fails completely between all of them?
It depends on whether the option to use the witness is turned on. If it’s off, the LUN becomes active in its preferred site and becomes inaccessible in the other. If the witness option is turned on in the consistency group, there’s no way for each site to know whether the other sites failed or whether it alone got isolated. Therefore, nobody knows if the LUN has become active anywhere else, so the only way to avoid a split brain is to make the LUN unavailable in ALL sites.
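Here is the same full-isolation logic as a small Python sketch, again a model with hypothetical names rather than anything VPLEX exposes. The case of witness off with no preferred site isn't covered explicitly above; the sketch assumes the LUN goes offline, since there is no site to prefer.

```python
from typing import Optional, Set


def active_sites_on_total_isolation(
    preferred_site: Optional[str],
    use_witness: bool,
) -> Set[str]:
    """Return where the LUN is active when both sites and the witness
    are all up but none of them can reach any other."""
    if use_witness:
        # Each site can't tell "I am isolated" from "everyone else failed",
        # so the only split-brain-safe answer is offline everywhere.
        return set()
    if preferred_site is None:
        # Assumption: no preference means no site may proceed on its own.
        return set()
    # Witness ignored: the preferred site carries on alone.
    return {preferred_site}


print(active_sites_on_total_isolation("SiteA", use_witness=False))  # {'SiteA'}
print(active_sites_on_total_isolation("SiteA", use_witness=True))   # set()
```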
There’s a design implication here: if a workload should stay up in a preferred site in any situation, even network isolation, at the cost of being down if that site itself goes down, you should place the VM on datastores with a preference for the correct site, and DO NOT enable the consistency group to use the witness.
One last design implication with VPLEX: I see limited use in not identifying a preferred site. I see even less use in having a consistency group set without a preferred site AND set not to use a witness. In both cases you’re just asking for more instances of a LUN being taken offline in every site. To be honest, I think that almost always a witness should be deployed, consistency groups should be set with a preferred site for failure scenarios, and the witness use option should be enabled.
There you have it!