Tag Archives: SRM

EMC VSI RecoverPoint/SRM Integration

I've recently set a customer up with new VNX storage arrays and RecoverPoint, all to be integrated with VMware Site Recovery Manager.  Previously, the customer used SRM in conjunction with MirrorView/A.  Why RecoverPoint?

The really cool thing about RecoverPoint is that you can easily roll back to specific points in time; EMC likes to call it DVR functionality for disaster recovery.  MirrorView/A only allows you to roll back to specific snapshots taken at specific points in time.

EMC also provides its Virtual Storage Integrator (VSI) for VMware environments.  It integrates with many of EMC's storage products, including VNX and RecoverPoint, and it provides that DVR point-in-time selection ability within SRM if you integrate it as well!

Setup is pretty straightforward:

  1. Deploy the OVA for the VSI in each site.
  2. Log in to the VSI's web portal by hitting https://<ip>:8443/vsi_vum with username admin and password ChangeMe.  Change the password as prompted.
  3. Install the VSI's vCenter plugin by going to VSI Setup and providing the required info.  If you don't get "The Operation is successful.", do it again unless you're given an error to troubleshoot.  For me, that happened on one of the two vCenter servers I was deploying this on.  Also, be patient, as this can take quite some time; for me, the plugin took about 10-15 minutes to complete the installation.
  4. Log in to the vCenter Web Client and go to vCenter Inventory Lists.  At the end, you should see an EMC VSI section.
  5. Click on Storage Integration Service.  Under Actions, click Register Solutions Integration Service, and enter the VSI’s info for that vCenter.  Click Test to ensure there’s connectivity to the VSI, and click OK.
  6. Under Storage Systems, add the storage array for that site.  Again, click Test to ensure there’s connectivity to the storage array, and click OK.  VSI supports VMAX, VNX, VNXe, ViPR, and XtremIO, so this isn’t just limited to the VNX on this project.
  7. Under Data Protection Systems, add the RecoverPoint cluster info for that site using the RPA cluster IP address, and be sure to select RecoverPoint as the Protection System Type.  Click Test to ensure communication will work.  If successful, OK will no longer be grayed out.  Click OK.
  8. Repeat step 7, but select SRM this time for the Data Protection System type.  Here's where I ran into a gotcha: the FQDN/IP address and port fields were grayed out.  I went ahead and clicked Test, and got an error: "Could not communicate with the data protection system SRM at <IP of vCenter server>. Details: Cannot reach the target SRM server at <IP of vCenter server>:1"  Google didn't yield any results for a solution, so I began troubleshooting.  Thankfully, I knew my ports, and decided to click the check box for the FQDN or IP/Port line, then entered the FQDN of the SRM server and the port.  Be aware that SRM 6.X uses port 9086.  I provided that, clicked Test, got my green "OK to go" text, and clicked OK.  A quick port check like the sketch after this list is an easy way to confirm the SRM server is reachable before clicking Test.
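If you want to rule out basic connectivity before fighting the registration wizard, a simple TCP check against the SRM service port works.  This is just a minimal Python sketch; the SRM hostname is a placeholder, and 9086 is the SRM 6.x port mentioned above.

    import socket

    def port_open(host, port, timeout=5):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # Hypothetical SRM server name; SRM 6.x listens on 9086 by default.
    print(port_open("srm01.example.local", 9086))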

Note that this needs to be done for each vCenter/RPA cluster/storage array/SRM server in the environment.  Note also that only one VSI instance can be registered per vCenter server, so you'll need to deploy one VSI per vCenter.

After setting up each site, select a VM, go to its Manage tab, view the snapshots for its Consistency Group, click the one you want and apply it, then launch your Failover or Test action from SRM.

[Screenshot: selecting a snapshot for the VM's Consistency Group in the VSI plugin]

And there you have it!

SRM installation error – unexpected error -1

I assisted a colleague today with an issue reinstalling Site Recovery Manager 5.8 for a customer.  During installation, when asked to input the vCenter FQDN and administrative user, an "unexpected error: -1" pop-up would appear.  After a bit of research, I found articles pointing to various certificate problems in SRM 4.X and 5.X.  The odd thing was that when he pointed the installer at the second vCenter server, it would proceed, but this was the SRM server for one site, and that vCenter was in the other.

I've been cognizant of new vCenter releases disabling SSLv3, so we checked the build numbers of the two vCenter servers.  Sure enough, the one generating the error was 5.5 Update 3b, which disabled SSLv3 support, but the vCenter that didn't generate the error was 5.5 Update 2, which still supports SSLv3.  We then checked the build of the SRM 5.8 installation file, which was 5.8, NOT 5.8.1.  While this isn't the same error, the underlying cause is documented: SRM 5.8.1 is required to interoperate with vCenter 5.5 Update 3b, unless you want to re-enable SSLv3 in your vSphere environment, which isn't the best thing for security.  Even the interoperability guide shows 5.8.0 as unsupported with vCenter 5.5 Update 3.
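If you want to verify which side of the SSLv3 cutoff a given vCenter is on, you can attempt an SSLv3-only handshake against it.  This is a hedged sketch: the hostname is a placeholder, and it only works if the machine you run it from still has a Python/OpenSSL build that exposes SSLv3 at all (modern builds usually don't).

    import socket
    import ssl

    def accepts_sslv3(host, port=443, timeout=5):
        """Return True if the endpoint completes an SSLv3-only handshake."""
        if not hasattr(ssl, "PROTOCOL_SSLv3"):
            raise RuntimeError("local OpenSSL build cannot speak SSLv3, so it cannot run this test")
        ctx = ssl.SSLContext(ssl.PROTOCOL_SSLv3)
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                with ctx.wrap_socket(sock, server_hostname=host):
                    return True   # handshake completed, SSLv3 still enabled
        except ssl.SSLError:
            return False          # handshake rejected, SSLv3 is disabled

    # Hypothetical vCenter hostname
    print(accepts_sslv3("vcenter55.example.local"))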

Apparently, the customer had upgraded one vCenter very recently, but not the other, and also didn't check SRM interoperability before doing so, which caused the odd behavior.  They also didn't mention this to us.  We upgraded vCenter in the second site, installed SRM 5.8.1 instead of 5.8.0, and that resolved the issue.

So, if you see this error during SRM installation, it's likely a problem with certificates; start there, and be cognizant of any changes that might impact certificates or their use.  As always, check your builds, the interoperability matrix, and the upgrade order prior to updating any vSphere component.

Changed Block Tracking issues with SRM

As may be obvious, I've been doing quite a bit of work lately with VMware Site Recovery Manager and storage-based replication, specifically EMC's MirrorView.  I ran into another issue while testing with SRM 6 and ESXi 5.0 hosts.

During the project, we were updating vCenter from 5.0 to 6.0 and SRM from 5.0 to 6.0, verifying everything worked, and then proceeding with updating the ESXi hosts.  We didn't bother patching the ESXi 5.0 hosts, since they would be updated to 6.0 soon enough.  We wanted to make sure SRM worked through vCenter before updating ESXi, simply to ensure an easy rollback.

However, during failover testing, we ran into an issue where most VMs would not power on during isolated testing and failovers.  The error was as follows:

Error – Cannot open the disk '/vmfs/volumes/<VMFS GUID>/VMName/VMName.vmdk' or one of the snapshot disks it depends on.

When you look into the events for an impacted VM, you would find the following:

“Could not open/create change tracking file”

We cleared the CBT files for all the VMs, forced replication, and tried again, and it worked.  We figured CBT had gotten corrupted.  But then Veeam ran its backups, we tried another isolated test, and almost all the VMs couldn't power on again.

I know ESXi 6 has been in the news lately for corruption in Changed Block Tracking, but it's far from the only version that's suffered from a CBT issue.  ESXi 5.0, 5.1, and 5.5 have had their issues, too.  In this case, the customer was running a version that needed a patch to fix CBT.  We remediated the hosts to patch them to current, reset the CBT data yet again, allowed Veeam to back up the VMs, and tried an isolated test.  All VMs powered on successfully.

It's important to note that Veeam really had nothing to do with this problem, and neither did MirrorView.  This was strictly an unpatched ESXi 5.0 issue.  So, if you run into this with any ESXi version using storage-based replication, I recommend patching the hosts to current, resetting the CBT data, running another backup, making sure the storage replicated the LUN after that point, and trying again.  A sketch of the CBT reset follows.
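For reference, resetting CBT data generally means disabling Changed Block Tracking on each VM, flushing the stale *-ctk.vmdk files (a snapshot create/delete or a power cycle does this), and re-enabling it.  Here's a minimal pyVmomi sketch of the disable/re-enable toggle; the vCenter name, credentials, and VM name are placeholders, and the snapshot step is left as a comment.

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    def set_cbt(vm, enabled):
        """Reconfigure a VM's changeTrackingEnabled flag and wait for the task."""
        spec = vim.vm.ConfigSpec(changeTrackingEnabled=enabled)
        WaitForTask(vm.ReconfigVM_Task(spec=spec))

    ctx = ssl._create_unverified_context()            # lab only: skip cert validation
    si = SmartConnect(host="vcenter.example.local",   # hypothetical vCenter and creds
                      user="administrator@vsphere.local",
                      pwd="password", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            if vm.name == "TestVM":                   # hypothetical VM name
                set_cbt(vm, False)                    # disable CBT
                # ...create and remove a snapshot here so the ctk files are discarded...
                set_cbt(vm, True)                     # re-enable CBT
        view.DestroyView()
    finally:
        Disconnect(si)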

Adventures in SRM 6.0 and MirrorView

Recently, I set up SRM 6.0 with MirrorView storage-based replication.  It was quite the adventure.  The environment had been using SRM 5.0 and MirrorView, and we recently upgraded it to vSphere 6.0 and SRM 6.0.  I wanted to get my findings down in case they help others setting this up.  I found that when I ran into issues, it wasn't easy to find people doing this, as many VNX customers are now using RecoverPoint instead of MirrorView.

Version Support

First off, you might be wondering why I recently deployed SRM 6.0 instead of 6.1.  That’s an easy question to answer – currently, there is no support for MirrorView with SRM 6.1.  I’m posting this article in 11/2015, so that may change.  Until it does, you’ll need to go with SRM 6.0 if you want to use MirrorView.

Installation of Storage Replication Adapter

I'm assuming you have already installed SRM and configured the site pairings and whatnot.  At the very least, have SRM installed in both sites before you proceed.

Here's where things got a little goofy.  First off, downloading the SRA is confusing.  If you go to VMware's site to download SRAs, you'll see two listings for the SRA with different names, suggesting they work for different arrays, do something different, or are different components.

[Screenshot: the two MirrorView SRA listings on VMware's download site]

As far as I can tell, they're actually two slightly different versions of the SRA.  Why are they both on the site for download?  No idea.  So I went with the newer of the two.

You also need to download and install Navisphere CLI from EMC for the SRA to work.  There are a few gotchas on this install to be aware of.  Install it first.

During installation, you need to ensure you check the box “Include Navisphere CLI in the system environment path.”

[Screenshot: Navisphere CLI installer with "Include Navisphere CLI in the system environment path" checked]

That's listed in the release notes of the SRA, so that was easy to know.  You also need to select not to store credentials in a security file.

I originally told the installer to store credentials, thinking this could allow easier manual use of Navisphere CLI should the need arise, but that caused the SRA to have authentication issues against the arrays.  I uninstalled and reinstalled Navisphere CLI without that option, and the bad authentication messages went away.
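Since the SRA shells out to naviseccli, it's worth confirming on each SRM server that the CLI actually resolves from the system path before installing the SRA.  A trivial Python check (just a sketch, nothing assumed beyond the executable name):

    import shutil

    path = shutil.which("naviseccli")
    if path:
        print("naviseccli found at:", path)
    else:
        print("naviseccli is NOT on the PATH; re-run the installer and check "
              "'Include Navisphere CLI in the system environment path'")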

Next, install the SRA, which is straightforward.  After the installation of the SRA, you must reboot the SRM servers, or they will not detect that they have SRAs installed.  That takes care of the SRAs.

Configuring the SRAs

Once you have installed the SRAs, it's time to configure the array pairs.  First, go into Site Recovery within the vSphere Web Client and click Array Based Replication.

[Screenshot: Array Based Replication in the vSphere Web Client]

Next, click Add Array Manager.

[Screenshot: Add Array Manager]

Assuming you’re adding arrays from two sites, click “Add a pair of array managers”.

[Screenshot: Add a pair of array managers]

Select the SRM Site location pair for the two arrays.

[Screenshot: selecting the SRM site location pair]

Select the SRA type of EMC VNX SRA.

[Screenshot: selecting the EMC VNX SRA type]

Enter the Display name, the management IPs of the array, filters for the mirrors or consistency groups if you are using MirrorView for multiple applications, and the username and password info for the array for each site.  Be sure to enter the correct array info for the indicated site.

[Screenshot: array manager connection info]

I always create a dedicated SRM service account within the array, so it’s easy to audit when SRM initiates actions on the storage array.

You’ll need to fill the information out for each site’s array.
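If the array manager fails to connect, it helps to verify the same credentials outside of SRM.  Since Navisphere CLI is already on the SRM server for the SRA, a quick getagent call against each SP confirms both reachability and the account.  This is a sketch with a placeholder SP address and placeholder credentials.

    import subprocess

    # Placeholder SP address and credentials; getagent is a simple connectivity/auth test
    subprocess.check_call([
        "naviseccli", "-h", "10.0.0.50",
        "-user", "srm_svc", "-password", "password", "-scope", "0",
        "getagent",
    ])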

Keep the array pair checked and click Next.

[Screenshot: enabling the array pair]

Review the summary of actions and click Finish.

At this point, you can check the array in each site and see if it is aware of your mirrors being replicated.

[Screenshot: checking the replicated devices discovered for the array pair]

So far so good!  At this point, you should be able to create your protection groups and recovery plans, and start performing tests with a test VM, as well as actual recoveries.

Problems

I began testing with a test Consistency Group within MirrorView, which contained one LUN that stored a test VM.  Test mode worked immediately to DR.  Failover to the DR site failed, as a first attempt often does in my experience with most storage-based replication deployments.  No problem; normally I simply launch it again and it works, and it did in this case.

With the VM then in the DR site, I performed an isolated test back to production, which worked flawlessly.  It was when I tried to fail back to production that I encountered a serious problem.  SRM reported that the LUN could not be promoted.  Within SRM, I was given only the option to try the failover again; the options to do a cleanup or a test were grayed out.  Relaunching the failover produced the same result.  I tried rebooting both SRM servers and vCenter, running rediscovery of the SRAs, you name it.  I was stuck.

I decided to just manually clean everything up myself.  I promoted the mirror in the production site and had hosts in both sites rescan for storage.  The LUN became unavailable in the DR site, but in production, while the LUN itself was visible as an available device, the datastore wouldn't mount.  Rebooting the ESXi server didn't help.  I finally added it as a datastore, selecting not to resignature the datastore.  The datastore mounted, but I found that it wouldn't mount again after a host reboot.  Furthermore, SRM was reporting the MirrorView consistency group as stuck failing over, showing Failover in Progress.  I tried recreating the SRM protection group, re-adding the array pairs, and more, but nothing worked.

After messing with it for a while, checking MirrorView and the VNX, VMware, etc., I gave up and contacted EMC support, who promptly had me call VMware support, who referred me back to EMC again because it was clearly an SRA problem for EMC to handle.

With EMC's help, I was able to clean up the mess SRM/SRA made.

  1. The Failover in Progress status reported by the SRA was due to the description fields in MirrorView.  Clearing those and rescanning the SRAs fixed that problem.
  2. The test LUN not mounting was due to me not selecting to resignature the VMFS datastore when I added it back in; see the resignature sketch after this list.
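If you end up in the same spot, the replicated LUN shows up on the host as a VMFS snapshot/copy and should be resignatured rather than force-mounted.  Here's a hedged sketch of the relevant esxcli calls, wrapped in Python so it can be run from a shell session on the host; the datastore label is a placeholder.

    import subprocess

    # List VMFS volume copies (replica/snapshot LUNs) the host can see
    print(subprocess.check_output(
        ["esxcli", "storage", "vmfs", "snapshot", "list"]).decode())

    # Resignature a specific copy by its volume label so it mounts persistently
    subprocess.check_call(
        ["esxcli", "storage", "vmfs", "snapshot", "resignature", "-l", "MyDatastore"])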

At this point, we were back to square one, and I went through the gamut of tests.  I got errors because the SRM placeholders were reporting as invalid.  Going to the Protection Group within SRM and issuing the command to recreate the SRM placeholders fixed this issue.

We repeated the testing again.  This time, everything worked, even failback.  Why did it fail before?  Even EMC support had no answer.  I suspect it's because the first failover attempt in a given direction in an SRM environment always seems to fail for me.  Unfortunately, it was very difficult to fix this time.