Monthly Archives: March 2016

EMC VSI RecoverPoint/SRM Integration

I’ve recently set a customer up with new VNX storage arrays and RecoverPoint, all to be integrated with VMware Site Recovery Manager.  Previously, the customer used SRM in conjunction with MirrorView/A.  Why RecoverPoint?

The really cool thing about RecoverPoint is that you can easily roll back to specific points in time; EMC likes to call this DVR functionality for disaster recovery.  MirrorView/A only allows you to roll back to specific snapshots taken at specific points in time.

EMC also provides its Virtual Storage Integrator (VSI) for VMware environments.  It integrates with many EMC storage products, including the VNX and RecoverPoint, and if you integrate it with SRM as well, it provides the DVR selection ability within SRM!

Setup is pretty straightforward:

  1. Deploy the OVA for the VSI in each site.
  2. Log in to the VSI’s web portal at https://<ip>:8443/vsi_vum with the user name admin and password ChangeMe.  Change the password as prompted.
  3. Install the VSI’s plugin into vCenter by going to VSI Setup and providing the required info.  If you don’t get “The Operation is successful.”, try again unless you’re given an error to troubleshoot.  For me, that happened on one of the two vCenter servers I deployed this on.  Also, be patient, as this can take quite some time; the plugin took about 10-15 minutes to complete the installation for me.
  4. Log in to the vCenter Web Client and go to vCenter Inventory Lists.  At the end, you should see an EMC VSI section.
  5. Click on Storage Integration Service.  Under Actions, click Register Solutions Integration Service, and enter the VSI’s info for that vCenter.  Click Test to ensure there’s connectivity to the VSI, and click OK.
  6. Under Storage Systems, add the storage array for that site.  Again, click Test to ensure there’s connectivity to the storage array, and click OK.  VSI supports VMAX, VNX, VNXe, ViPR, and XtremIO, so this isn’t limited to just the VNX used on this project.
  7. Under Data Protection Systems, add the RecoverPoint cluster info for that site using the RPA cluster IP address, and be sure to select RecoverPoint as the Protection System Type.  Click Test to ensure communication will work.  If successful, OK will no longer be grayed out.  Click OK.
  8. Repeat step 7, but select SRM this time for the Data Protection System type.  Here’s where I ran into a gotcha: the FQDN/IP address and port fields were grayed out.  I went ahead and clicked Test, and got an error: “Could not communicate with the data protection system SRM at <IP of vCenter server>. Details: Cannot reach the target SRM server at <IP of vCenter server>:1”  Google didn’t yield any results for a solution, so I began troubleshooting.  Thankfully, I knew my ports, and decided to click the check box for the FQDN or IP/Port line and enter the FQDN of the SRM server and the port.  Be aware that SRM 6.x uses 9086.  I provided that, clicked Test, got my green “OK to go” text, and clicked OK.
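Before (or while) registering, it can save a troubleshooting cycle to confirm the relevant ports are actually reachable from wherever you’re working.  Here’s a minimal sketch in bash; the hostnames are placeholders, not names from this environment:

```shell
#!/usr/bin/env bash
# Quick TCP reachability check for the ports used during VSI registration.
# Hostnames below are placeholders -- substitute your own.
check_port() {
  local host=$1 port=$2
  # /dev/tcp is a bash built-in path; timeout guards against a silent drop
  if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} reachable"
    return 0
  else
    echo "${host}:${port} NOT reachable"
    return 1
  fi
}

check_port vsi.example.local 8443 || true   # VSI web portal / SIS
check_port srm.example.local 9086 || true   # SRM 6.x API port
```

If the check on 9086 fails here, fix the network or firewall first; no amount of clicking Test in the VSI dialog will help.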

Note that this needs to be done for each vCenter/RPA cluster/storage array/SRM server in the environment.  Also note that only one VSI instance can be registered per vCenter server, so you’ll need to deploy one VSI per vCenter.

After setting up each site, select a VM, go to Manage, view the snapshots for its Consistency Group, pick the one you want and apply it, then launch your Failover or Test action from SRM.


And there you have it!

Insufficient permission to run CLI command

‘Insufficient permission to run CLI command’ when performing an upgrade on VNX File OE.

Error message:  Insufficient permission to run CLI command

Ran into this today while attempting to update VNX File OE code for a customer using Unisphere Service Manager (USM).  While there were no major issues reported within Unisphere, I got the error above when attempting to start the process by running “Prepare for Installation (Step-1)”.

Google only yielded an article that basically said to ensure you’re running USM on the same subnet, which I was.  I began troubleshooting by running USM’s “Health Check”, which showed various errors indicating that a Control Station failover had occurred.  I failed back to CS0, reran the Health Check within USM (which passed), tried the upgrade again, and everything worked like a champ!
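For the failback check itself, the active Control Station can be confirmed from the CS shell with /nas/sbin/getreason.  Below is a small sketch that flags when CS0 isn’t primary; the output format in the comments is assumed from memory, so verify it against your own array before relying on it:

```shell
# Warn if slot_0 (CS0) is not the primary Control Station.
# Typical getreason output (format assumed -- verify on your array):
#   10 - slot_0 primary control station
#   11 - slot_1 secondary control station
cs0_is_primary() {
  grep -q 'slot_0 primary control station'
}

# On the Control Station you would run something like:
#   /nas/sbin/getreason | cs0_is_primary || echo "CS0 is NOT primary -- fail back first"
```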

Update on Air Console and get 10% off!

Several weeks ago, I wrote about a remote serial solution called Air Console, which provides an all-in-one solution for wired, LAN, Wi-Fi, and Bluetooth serial connectivity.  I’ve found Air Console extremely useful since then.  I just initialized two Data Domains and four RecoverPoint RPAs in a cramped, crowded server room with no comfortable place to work from my laptop.  No problem!  I simply walked in with my Air Console Mini and iPad, and initialized all six devices wirelessly.  It beats figuring out how to maneuver a serial cable to some place where I would have to sit cross-legged on the floor.  Full disclosure: I have the flexibility of a 2×4.  It worked once again like a champ.

Get-Console noticed my blog article and contacted me to offer my readers 10% off using coupon code JJGH667QS on their orders.  (I wish I got that deal, but it was still worth every penny!)

Also, Get-Console has solutions for connecting to multiple serial devices simultaneously.  This could be useful for initializing six devices like I just did, or as an out-of-band management solution for a rack full of routers and switches!

So, if you’re looking for a smarter serial solution, check them out!

Troubleshoot VSS errors in whole VM backups

I’ve dealt with many whole-VM backup products in my experience with virtualization, including Veeam, VMware Data Protection, Avamar, vRanger Pro, Backup Exec, and more.  With that experience came lots of troubleshooting of various issues.  Originally, this post was going to cover a recent specific issue I had, but I thought a better post would address an entire category of problems with these products, so someone could use it to fix any of many potential root causes, not just a single one.  Many of these troubleshooting steps also help keep your environment healthy and avoid lots of issues beyond backups.

This post will focus specifically on VSS quiescing problems; it’s not a definitive guide to all VM backup problems.

Revision Level of Your Backup Product

Oftentimes, the issue has to do with the revision level of your backup product itself.  Generally, it’s good to be on the latest patch level, but not always.  Here are a few things to think about:

  • Is your backup product patched to current?  If not, perhaps look into doing so.
  • Is your backup product compatible with your environment?  Check that it supports the current build of your hypervisor, your hypervisor management software (such as SCVMM or vCenter), and the guests you’re backing up, and take appropriate action.
  • Did you install an update to the backup product recently?  If so, perhaps there’s a bug in that update.

Revision Level of Guests That Are Backed Up

Backups that quiesce guest file systems depend upon OS components within those guests.  This is especially true of Windows guests, which rely on the Volume Shadow Copy Service (VSS).  VSS, just like any other software, can have bugs that need to be fixed, so there are patches for it.  Other OS components could also be the culprit.  Ensure your guests are patched to current.  Conversely, if you applied patches to your guests recently, perhaps there are problems with those updates, so you may try removing them.

As a side note, I recommend using multiple methods of checking your guest patch levels.  While not very common, I’ve seen cases of Windows Update saying all patches are installed when a second utility reported missing patches.  Use a second utility, such as the free Microsoft Baseline Security Analyzer for Windows guests, to ensure you’re not missing anything.

Also, don’t assume the guests are patched to current.  I recently ran into an issue where the customer hadn’t patched the server… ever.  Somehow it slipped through the cracks.

Hypervisor Revisions

Hypervisors can also cause issues with quiescing.  Some considerations here:

  • Does the build of the hypervisor support the guest having the issue?
  • Are the hypervisors patched to current?  If not, consider updating them.
  • Were the hypervisors recently patched?  If so, perhaps one of the installed patches has a problem, and removing it might resolve the issue.
  • Have the in-guest optimization components, such as VMware Tools, been updated within the guests?  If not, do so.  If this was done recently, perhaps try downgrading them to see if that resolves the issue.  These components are important, as they are typically the means by which the hypervisor issues the command to quiesce the file system within the guest.

Other Guest Considerations

There are other issues that can cause problems with backups.

  • Other backup agents installed within the guest can cause problems.  Remove any backup agents that are no longer needed.  I just ran into this with a customer that had an old Backup Exec agent left over from before they adopted their current backup product.
  • Applications such as SQL Server and Exchange have their own VSS writers.  Sometimes those need to be updated, too, and recent updates to them can also cause problems with quiescing.  Look for updates to those, or remove recent updates.
  • Antivirus software has also been known to cause VSS issues.  Try updating, disabling, configuring proper exclusions for, or uninstalling and reinstalling the AV agents.
  • Ensure there is adequate free space within the guests.
  • There are a finite number of shadow copies, and when that limit is reached, quiescing can fail.  Try removing all shadow copies within the guest using the command:  vssadmin delete shadows /all
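To see how many shadow copies a volume is actually carrying before you delete them, you can count the entries in vssadmin list shadows output.  vssadmin only runs on Windows, so the sketch below just parses captured output; the “Shadow Copy ID:” line format is assumed from typical listings:

```shell
# Count shadow copies in captured `vssadmin list shadows` output.
# Feed the command's output in on stdin (e.g. from a saved text file).
count_shadows() {
  grep -c 'Shadow Copy ID:' || true   # grep -c exits non-zero on zero matches
}

# Usage sketch:
#   count_shadows < vssadmin-list-shadows.txt
```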

Hopefully, this provides you with some ideas to try to resolve the issue you’re experiencing.

Do you have any other tips for resolving VSS issues with whole VM backups?

Unregister Cisco UCS from Cisco UCS Central if you’re not using it

Just banged my head against a wall for hours trying to get a new Service Profile associated with a new blade.  It kept retrying “Configure resolve identifiers”.  I finally stumbled on an obscure support forum thread just when I was about to give up and call Cisco TAC…

https://supportforums.cisco.com/discussion/12174866/cannot-create-service-profiles-template-anymore

TLDR version…

If the UCS domain (FIs) has been registered with UCS Central, you can’t associate service profiles from UCS Manager.  If it looks like this in UCS Manager, with Registration Status showing Lost Visibility…


Your UCS Domain is either having problems connecting to UCS Central, or UCS Central is burnt… or dead…  Troubleshoot why it lost connectivity, or if UCS Central is gone, click the red-circled Unregister From UCS Central and proceed.

If you see a nice green check mark, that means UCS Central is present, and you should use it to associate the Service Profile, not UCS Manager.

In this case, the customer deployed UCS Central to play with, decided they didn’t need it, and removed it from their environment, but kept it registered.  So I unregistered it, and the Service Profile associated nice and easy.

VCSA can’t enumerate AD accounts

Ran into an interesting issue after deploying greenfield vCenter 6 Server Appliances (VCSA) with an external PSC for a remote branch site.  Joining the PSC to the domain wasn’t a problem, nor was adding the AD domain as an identity source.  But when I tried to enumerate accounts for permissioning, it would fail with the error: “Cannot load the users for the selected domain”.

I found an excellent VMware KB article that gave lots of things to check when troubleshooting this.

I verified DNS was working.  No surprise there.  However, when I ran the command less /var/lib/likewise/krb5-affinity.conf, I noticed the DCs listed were not the ones the PSC should be using; they were DCs from a different remote branch office site.  When I checked AD Sites and Services, it was clear that a subnet object including the IP of the PSC was associated with the wrong branch office, so the PSC was attempting to use the DCs in that site.  Good to know that vCenter appliances are apparently AD-site aware.  Furthermore, the first of the two DCs in that site didn’t have a PTR record, because the Reverse Lookup Zone for that subnet didn’t exist.  Apparently, if the first domain controller can be contacted but doesn’t have a PTR record, the PSC won’t enumerate users and groups for permissioning.
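The missing-PTR condition is easy to check for: take each DC address the PSC settled on (from /var/lib/likewise/krb5-affinity.conf) and attempt a reverse lookup on each.  A quick sketch using getent, which consults the same resolver configuration the appliance would; the loop uses a placeholder address rather than real DC IPs:

```shell
# Check that each domain controller IP has a working reverse lookup.
# In practice the IPs would come from /var/lib/likewise/krb5-affinity.conf
# on the PSC; the list below is a placeholder.
has_ptr() {
  # getent prints "IP  name" only when a reverse mapping exists
  getent hosts "$1" >/dev/null
}

for dc in 127.0.0.1; do
  if has_ptr "$dc"; then
    echo "$dc: PTR OK"
  else
    echo "$dc: NO PTR record -- fix reverse DNS before blaming the PSC"
  fi
done
```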

Creating the Reverse Lookup Zone and forcing the PTR record creation along with some AD replication fixed the issue, and I kindly suggested to the customer it was time for some tender loving care with AD Sites and Services, along with DNS.

So, FYI, it’s not a bad idea to review your Active Directory Sites and Services and your DNS Forward and Reverse Lookup Zones before you deploy the VCSA.