Changed Block Tracking issues with SRM

As may be obvious, I’ve been doing quite a bit of work lately with VMware Site Recovery Manager and storage-based replication, specifically EMC’s MirrorView.  I ran into another issue while testing SRM 6 with ESXi 5.0 hosts.

During the project, we were upgrading vCenter from 5.0 to 6.0 and SRM from 5.0 to 6.0, verifying that everything worked, and then proceeding to upgrade the ESXi hosts.  We didn’t bother patching the ESXi 5.0 hosts, since they would be upgraded to 6.0 soon enough.  We wanted to make sure SRM worked through vCenter before upgrading ESXi, simply to ensure an easy rollback.

However, during failover testing, we ran into an issue where most VMs would not power on during isolated testing and failovers.  The error was as follows:

Error – Cannot open the disk ‘/vmfs/volumes/<VMFS GUID>/VMName/VMName.vmdk’ or one of the snapshot disks it depends on.

When you look into the events for an impacted VM, you find the following:

“Could not open/create change tracking file”

We cleared the CBT files for all the VMs, forced replication, and tried again, and it worked.  We figured CBT had gotten corrupted.  But then Veeam ran its backups, we ran another isolated test, and almost all the VMs failed to power on again.
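For context, CBT keeps its state in a `-ctk.vmdk` file next to each virtual disk, and those are the files that get cleared in a reset.  Here is a minimal sketch for listing them, assuming the VM's folder is reachable as an ordinary filesystem path.  The function name is my own for illustration, and this is not the exact procedure we used:

```python
from pathlib import Path


def find_ctk_files(vm_dir):
    """Return the sorted names of CBT tracking files (*-ctk.vmdk) in a VM folder.

    vm_dir is a hypothetical path to the VM's folder on a datastore.
    Only the top level of the folder is searched.
    """
    return sorted(p.name for p in Path(vm_dir).glob("*-ctk.vmdk"))
```

This only lists the files.  Actually removing them should be done with the VM powered off and CBT disabled, following the vendor's documented reset procedure.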

I know ESXi 6 has been in the news lately for corruption in Changed Block Tracking, but it’s far from the only version that has suffered from CBT issues.  ESXi 5.0, 5.1, and 5.5 have had their problems, too.  In this case, the customer was running a build that needed a patch to fix CBT.  We remediated the hosts to bring them to the current patch level, reset the CBT data yet again, allowed Veeam to back up the VMs, and ran an isolated test.  All VMs powered on successfully.

It’s important to note that Veeam really had nothing to do with this problem, and neither did MirrorView.  This was strictly an unpatched ESXi 5.0 issue.  So, if you run into this with any ESXi version using storage-based replication, I recommend patching the hosts to current, resetting the CBT data, running another backup, making sure the storage has replicated the LUN after that point, and then trying again.
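When resetting CBT data, it can help to confirm which VMs and disks actually have CBT turned on.  The sketch below scans the text of a `.vmx` file for the `ctkEnabled` and per-disk `scsiX:Y.ctkEnabled` settings.  The function name and the sample contents are made up for illustration, and it assumes you already have the `.vmx` text in hand:

```python
import re

# Matches lines such as:  ctkEnabled = "TRUE"   or   scsi0:0.ctkEnabled = "TRUE"
_CTK_LINE = re.compile(
    r'^\s*((?:\w+:\d+\.)?ctkEnabled)\s*=\s*"(\w+)"\s*$',
    re.IGNORECASE | re.MULTILINE,
)


def cbt_enabled_keys(vmx_text):
    """Return the ctkEnabled keys in a .vmx file's text whose value is TRUE."""
    return [key for key, value in _CTK_LINE.findall(vmx_text)
            if value.upper() == "TRUE"]


# Made-up sample .vmx contents, for illustration only.
sample = (
    'displayName = "vm1"\n'
    'ctkEnabled = "TRUE"\n'
    'scsi0:0.ctkEnabled = "TRUE"\n'
    'scsi0:1.ctkEnabled = "FALSE"\n'
)
print(cbt_enabled_keys(sample))
```

An empty result after a reset (and before the next backup turns CBT back on) suggests the VM is starting from clean tracking data.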