Category Archives: vSphere

vCloud Air OnDemand and DR Overview

Recently, I worked on a project to deploy vCloud Air Disaster Recovery for a customer.  Along the way, as mentioned in a previous post, I picked up the applicable VMware certification for it.  In this post I want to discuss how vCloud Air works, starting with the two offerings and how they interact.

vCloud Air OnDemand

vCloud Air OnDemand is a public cloud Infrastructure as a Service (IaaS) solution.  You can run pretty much whatever workloads you like.  These are VMs or appliances you deploy manually yourself, deploy from the vCloud Air catalog, or upload as templates from your on-premises VMware environment.

vCloud Air Disaster Recovery

vCloud Air Disaster Recovery is a public cloud Infrastructure As a Service solution.  It’s identical to vCloud Air OnDemand except it’s geared specifically for failing over replicated virtual machines.  VMs you run in this cloud can only be replicated virtual machines from your vSphere environment.  These VMs are replicated with vSphere Replication.  In addition, you can designate isolated networks within vCloud Air DR for isolated testing.

Two Separate Clouds

One design concept to understand is that vCloud Air OnDemand and vCloud Air Disaster Recovery are two completely separate clouds.  While they have very similar features and management interfaces, they are completely independent.  So much so that for VMs in one to communicate with VMs in the other, you must set up site-to-site VPN connections between them.  Keep this in mind.

When Should I Use vCloud Air OnDemand vs Disaster Recovery?

This seems obvious, right?  If you want to replicate VMs using a whole VM replication technology and fail VMs over, use DR.  Use OnDemand when application or service data will replicate natively instead of whole VM replication.

But it’s not so simple as that.  What if a VM has the ability to replicate within its service or application?  Do some services and replication prohibit you from using whole VM replication?

The giant elephant in the room for DR solutions when asking these questions is Active Directory.  If you want your DR solution to be wholly independent, you likely need domain controllers in the cloud and on premise.  And it’s not supported to use whole VM replication with Active Directory Domain Controllers.  So, DCs are out for use with vCloud Air DR for production use cases.

Generally speaking, whole VM replication may be a bad idea for bread-and-butter infrastructure VMs.  For example, DHCP servers could be replicated, but vSphere Replication can only configure VMs with an RPO as low as 15 minutes, and going over a WAN link might make even that 15 minute RPO impossible to meet alongside all the other VMs you’re replicating.  Perhaps it would be better to build a VM within OnDemand and run a script to export the DHCP database regularly to the DR site.
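To put rough numbers on the WAN concern, here is a minimal back-of-the-envelope sketch in Python.  The change rate, link speeds, and the 70% usable-bandwidth figure are all hypothetical assumptions for illustration, not values from vSphere Replication documentation, and the math ignores compression and protocol overhead.

```python
def min_bandwidth_mbps(changed_gb_per_window: float, rpo_minutes: float) -> float:
    """Megabits per second needed to ship one RPO window's changes within that window."""
    megabits = changed_gb_per_window * 8 * 1000  # GB -> megabits, decimal units
    return megabits / (rpo_minutes * 60)


def rpo_is_feasible(changed_gb_per_window: float, rpo_minutes: float,
                    wan_mbps: float, usable_fraction: float = 0.7) -> bool:
    """True if the link can carry the changes, assuming only part of it is usable.

    usable_fraction is an assumed allowance for other traffic on the link.
    """
    needed = min_bandwidth_mbps(changed_gb_per_window, rpo_minutes)
    return needed <= wan_mbps * usable_fraction
```

For example, 5 GB of changed data per 15 minute window across all replicated VMs needs roughly 44 Mbps sustained, so under these assumptions a 50 Mbps link would fall short while a 100 Mbps link would cope.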

Bottom line though is this:  You only need OnDemand if you won’t replicate VMs using vSphere Replication to vCloud Air.  If you want to run replicated VMs within vCloud Air as a DR solution, AND you want to ensure that the vCloud DR site is not dependent at all on your on-premises infrastructure, you probably need both, if only to facilitate running Domain Controllers.  You may also have other services and applications whose data you’ll replicate through other means.

It would look something like this:

[Image: vcloudairdrcommondesign]

 

VMware ESXi 6.0 Express Patch 6 causing CBT issues

The always useful Veeam support digest is reporting that, at the very least, Veeam is seeing issues with Changed Block Tracking (CBT) caused by ESXi 6.0 Express Patch 6.  This build was released on May 12th of this year.  It is the current build according to VMware’s build number KB article.

Veeam reports seeing the issue on jobs that use VMware Tools quiescence rather than application-aware processing.

Other blog articles mention that other backup products are also impacted, including vSphere Data Protection and IBM TSM.  It’s safe to assume this will broadly impact all VMware-centered backup products.

If you’re using Express Patch 6, you currently have two options:

  1. Roll back to ESXi 6.0 Update 2.
  2. Don’t use quiesced snapshots.

Heads up!

Qlogic QConvergeConsole vCenter Plugin

I recently had a customer running into storage performance issues, and we determined that Qlogic CNA I/O card firmware and driver upgrades were in order.

Updating the drivers on multiple hosts is easy if you have VMware Update Manager. Simply import the VIB into the patch repository, set up a new baseline or swap the drivers into an existing one, and remediate your hosts. You can see similar instructions in my other blog post about updating Cisco UCS drivers within vSphere environments.

But what about firmware updates?

Qlogic QConvergeConsole

Qlogic has a pre-boot package you can use to update the firmware, but that’s not automated.  Plus, it can get pretty tedious having to boot off a USB stick or image remotely through an out of band management card.

Thankfully, Qlogic has a set of utilities to help you get more information about your Qlogic cards, and deploy new adapter firmware packages right from vCenter!

How do you take advantage of this?

You need to deploy the vCenter Qlogic plug-in. There are two packages available: one for the thick client, and one for the Web Client. I would recommend the Web Client one. First off, VMware has ceased development of the thick client, so using web plug-ins when given a choice is generally better. Secondly, I quite honestly had issues even getting the vSphere client plug-in to work properly, but the Web Client plug-in worked great. Both had installation issues. I neglected to write down the exact wording, but the installer (and this was on a Windows installable vCenter 5.5 server) said something to the effect of an invalid console mode. After some googling, I found that for other software from other manufacturers this can be overcome by setting the compatibility mode to Windows 7/2008 R2. After doing that, both installed easily.

You also need to deploy the CIM provider VIB to the ESXi hosts. This is another simple VIB deployment to the hosts, but it does require a reboot. To save time, you can install the updated drivers and the CIM provider together so each host only reboots once.

How to update Qlogic firmware

To update the firmware on Qlogic HBAs, navigate to the host you wish to update, and go to Manage > QConvergeConsole. Be patient as the plug-in gathers and displays the information. The plug-in isn’t exactly a speed demon. Once it displays information for the HBAs, click on the card in the left pane of the plug-in, and then click the card you wish to update. You’ll get current information about the card, including firmware information and various features. You can even toggle some modes on the card like Personality Type and SR-IOV.

[Image: qlogicfwupdate1]

Click “Update Adapter Flash Image”. In the next window, browse to the downloaded card firmware, select the bin image, and click OK. You’ll then be provided with a confirmation dialogue box that shows you the firmware you’re currently running, and the new firmware you’re about to install. Click OK to proceed.

[Image: qlogicfwupdate2]

Be patient as it deploys the firmware. This can take several minutes. Once completed, you’ll receive a dialogue box that says the firmware will not take effect until after you reboot.

Note also if you have multiple Qlogic cards in your server, you must update each card individually, even if they’re the same model card.

At this point, reboot your ESXi server normally, and verify firmware updated successfully after the server boots back up.

FYI, for this customer, updating the drivers and firmware increased synthetic benchmarks by about 250%.  It’s definitely worthwhile to check for these updates, especially if your environment isn’t performing up to snuff.

EMC VSI RecoverPoint/SRM Integration

I’ve recently set a customer up with new VNX storage arrays and RecoverPoint, all to be integrated with VMware Site Recovery Manager.  Previously, the customer used SRM in conjunction with MirrorView/A.  Why RecoverPoint?

The really cool thing about RecoverPoint is you can easily roll back to specific points in time, or as they like to call it, DVR functionality for disaster recovery.  MirrorView/A only allows you to roll back to specific snapshots taken at specific points in time.

EMC also provides their VSI for VMware environments.  This integrates with many of their storage products, including VNX, RecoverPoint, and it provides the DVR selection ability within SRM if you integrate it as well!

Setup is pretty straightforward:

  1. Deploy the OVA for the VSI in each site.
  2. Login to the VSI’s web portal by hitting https://<ip>:8443/vsi_vum with user name admin and password ChangeMe.  Change the password as prompted.
  3. Install the VSI’s plugin with vCenter by going to VSI Setup and providing the required info.  If you don’t get “The Operation is successful.”, do it again unless you’re provided an error to troubleshoot.  For me, that happened on one of the two vCenter servers I was deploying this on.  Also, be patient, as this can take quite some time.  For me, the plugin took about 10-15 minutes to complete the installation.
  4. Login to the vCenter Web Client, and go to vCenter Inventory Lists. At the end, you should see an EMC VSI section. [Image: emcvsisection]
  5. Click on Storage Integration Service.  Under Actions, click Register Solutions Integration Service, and enter the VSI’s info for that vCenter.  Click Test to ensure there’s connectivity to the VSI, and click OK.
  6. Under Storage Systems, add the storage array for that site.  Again, click Test to ensure there’s connectivity to the storage array, and click OK.  VSI supports VMAX, VNX, VNXe, ViPR, and XtremIO, so this isn’t just limited to the VNX on this project.
  7. Under Data Protection Systems, add the RecoverPoint cluster info for that site using the RPA cluster IP address, and be sure to select RecoverPoint as the Protection System Type.  [Image: rpprotectionsystemtype]  Click Test to ensure communication will work.  If successful, OK will no longer be grayed out.  Click OK.
  8. Repeat step 7, but select SRM this time for the Data Protection System type.  Here’s where I ran into a gotcha.  The FQDN/IP address and port fields were grayed out.  I went ahead and clicked Test, and got an error: “Could not communicate with the data protection system SRM at <IP of vCenter server>. Details: Cannot reach the target SRM server at <IP of vCenter server>:1”  [Image: vsisrmregerror]  Google didn’t yield any results for a solution, so I began troubleshooting.  Thankfully, I knew my ports, and decided to click the check box for the FQDN or IP/Port line, and entered the FQDN of the SRM server and the port.  Be aware that SRM 6.X uses 9086.  I provided that, clicked Test, got my green “OK to go” text, and clicked OK.

Note that this needs to be done for each vCenter/RPA cluster/storage array/SRM server in the environment.  Note also only one VSI instance can be registered per vCenter server, so you’ll need to deploy one VSI per vCenter.
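If a Test button fails like it did for me in step 8, it can help to rule out basic connectivity first.  Below is a minimal Python sketch that checks whether a TCP port answers.  It only proves something is listening on the port, not that the service behind it is healthy, and any host names you feed it are your own.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False
```

For instance, calling port_is_reachable with the SRM server’s FQDN and 9086, or the VSI appliance’s IP and 8443, from a machine on the same network as vCenter gives a quick yes/no before you start digging deeper.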

After setting up each site, go to a VM, click it, go to Manage, view the snapshots for its Consistency Group, click the one you want and apply, and launch your Failover or Test action from SRM.

[Image: vsiselectsnapshot]

And there you have it!

Troubleshoot VSS errors in whole VM backups

I’ve dealt with many whole VM backup products in my experience with virtualization, including Veeam, vSphere Data Protection, Avamar, vRanger Pro, Backup Exec, and more.  With that experience came lots of troubleshooting through various issues.  Originally, this post was going to deal with a recent specific issue I had, but I thought a better post would deal with an entire category of problems with these products, so someone could use it to fix one (or more) of lots of potential root causes, not just the singular one.  Many of these troubleshooting steps also help keep your environment healthy and avoid lots of issues, not just issues with backups.

This post focuses specifically on VSS quiescing problems.  It is not a definitive guide to all VM backup problems.

Revision Level of Your Backup Product

Often times, the issue has to do with the revision level of your backup product itself.   Generally, it’s good to be on the latest patch level, but not always.  Here are a few things to think about:

  • Is your backup product patched to current?  If not, perhaps look into doing so.
  • Is your backup product compatible with your environment?  Check to ensure it supports the current build of your hypervisor, your hypervisor management software such as SCVMM or vCenter, and the guests you’re backing up, and take appropriate action.
  • Did you install an update to the backup product recently?  If so, perhaps there’s a bug in that update.

Revision Level of Guests That Are Backed Up

Backups that quiesce the file systems of guests depend upon OS components within said guests, and this is especially true of Windows guests, which rely on the Volume Shadow Copy Service (VSS).  VSS, just like any other software, can have bugs in it that need to be fixed, so there are patches to VSS.  Other OS components could also be the culprit.  Ensure your guests are patched to current.  Conversely, if you recently applied patches to your guests, perhaps there are problems with those updates, so you may try removing them.

As a side note, I would recommend using multiple methods of checking your guest patch levels.  For example, while not very common, I’ve seen numerous cases of Windows Update saying all patches are installed, but a second utility reported missing patches.  Use a second utility to check, such as Microsoft Baseline Security Analyzer (which is free) if the guest is Windows based, to ensure you’re not missing anything.

Also, don’t assume the guests are patched to current.  I recently ran into an issue where the customer somehow hadn’t patched the server… ever.  Somehow it slipped through the cracks.

Hypervisor Revisions

Hypervisors also can cause issues with quiescing.  Some considerations here:

  • Does the build of the hypervisor support the guest having the issue?
  • Are the hypervisors patched to current?  If not, consider updating them.
  • Were the hypervisors recently patched?  If so, perhaps one of the installed patches has a problem, and removing it might resolve the issue.
  • Have the in-guest optimization components such as VMware Tools within the guests been updated?  If not, do so.  If this was done recently, perhaps try to downgrade them to see if that resolves the issue.  These are important, as this is typically the means by which the hypervisor issues the command to quiesce the file system within the guest.

Other Guest Considerations

There are other issues that can cause problems with backups.

  • Other backup agents installed within the guest can also cause problems.  Remove any backup agents that are no longer needed.  I personally just ran into this issue with a customer that had an old Backup Exec agent from before they used their current backup product.
  • Applications such as SQL and Exchange have their own VSS agents.  Sometimes those need to be updated, too.  Recent updates to them can also cause problems with quiescing.  Look for updates to those, or remove recent updates.
  • Antivirus software has also been known to cause VSS issues.  Try updating, disabling, configuring proper exclusions for, uninstalling, and/or reinstalling the AV agents.
  • Ensure there is adequate free space within the guests.
  • There are a finite number of shadow copies, and when that limit is reached, it can cause quiescing to fail.  Try removing all shadow copies within the guest using the command:  vssadmin delete shadows /all
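The free space check in particular is easy to script.  Here’s a minimal Python sketch you could run inside a guest.  The 10% threshold is purely my own assumption for illustration, not a documented VSS requirement, so adjust it to whatever your backup vendor recommends.

```python
import shutil


def free_fraction(path: str) -> float:
    """Fraction of the volume containing path that is currently free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def has_adequate_free_space(path: str, min_free_fraction: float = 0.10) -> bool:
    """True if the volume has at least min_free_fraction free (assumed threshold)."""
    return free_fraction(path) >= min_free_fraction
```

On a Windows guest you would call it once per volume, for example with "C:\\" and any data drives, and flag the ones that come back False before the backup window.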

Hopefully, this provides you with some ideas to try to resolve the issue you’re experiencing.

Do you have any other tips for resolving VSS issues with whole VM backups?

VCSA can’t enumerate AD accounts

Ran into an interesting issue after deploying greenfield vCenter 6 Server Appliances (VCSA) using an external PSC for a remote branch site, when I tried to assign permissions to AD accounts.  Joining the PSC to the domain wasn’t a problem, nor was adding the AD domain as an identity source.  But when I tried to enumerate accounts for permissioning, that would fail with the error: “Cannot load the users for the selected domain”.

I found an excellent VMware KB article that gave lots of things to check when troubleshooting this.

I verified DNS was working.  No surprise there.  However, when I ran the command less /var/lib/likewise/krb5-affinity.conf, I noticed the DCs being used were not the DCs it should have been using, but rather DCs from a different remote branch office site.  When I checked AD Sites and Services, it was clear that a subnet object that included the IP of the PSC was associated with the wrong branch office.  Therefore, the PSC was attempting to use the DCs in that site.  It’s good to know that vCenter Appliances are apparently AD Site aware.  Furthermore, the first of the two DCs in that remote branch site didn’t have a PTR record, because the Reverse Lookup Zone for that branch’s subnet didn’t exist.  Apparently, if the first domain controller to be used can be contacted but doesn’t have a PTR record, the PSC won’t enumerate users and groups for permissioning.

Creating the Reverse Lookup Zone and forcing the PTR record creation along with some AD replication fixed the issue, and I kindly suggested to the customer it was time for some tender loving care with AD Sites and Services, along with DNS.

So, FYI, it’s not a bad idea to review your Active Directory Sites and Services, and your DNS Forward and Reverse Lookup zones before you deploy the VCSA.
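A quick way to spot the missing PTR problem ahead of time is to reverse-resolve each DC’s IP address.  This is a minimal Python sketch of that idea.  The function name is my own, and it uses whatever DNS servers the machine running it is configured for, so run it from somewhere that shares DNS with the PSC.

```python
import socket


def find_missing_ptr(ip_addresses, resolver=socket.gethostbyaddr):
    """Return the IPs from ip_addresses that have no reverse (PTR) record.

    resolver is injectable so the check can be exercised without live DNS.
    """
    missing = []
    for ip in ip_addresses:
        try:
            resolver(ip)
        except (socket.herror, socket.gaierror):
            # No PTR record, or the reverse lookup zone does not exist
            missing.append(ip)
    return missing
```

Feed it the IPs of the DCs in the AD site the PSC actually lands in, and anything it returns deserves a look in your Reverse Lookup Zones.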

Configure Dump Collector with PowerCLI in vSphere 6

I had a script to configure Dump Collector settings that I used in previous versions of vSphere.  If you look around the web, you’ll find similar PowerCLI snippets to configure Dump Collector.

If you use that snippet in vSphere 6, it doesn’t work.  You’ll get the following error:

Message: Cannot set 2 server ip parameters.;
InnerText: Cannot set 2 server ip parameters.EsxCLI.CLIFault.summary
At line:4 char:1

This is because ESXCLI now has a parameter for whether to use IPv6, so when using Get-EsxCli, invoking the set method requires an additional value.  Remember, esxcli is not intuitive in that “enabled” properties are either true or null, so don’t use $false.

The revised code should now be:

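# Address of the Dump Collector target (the vCenter server in this example)
$vcenterip = '192.168.1.10'
foreach($vmhost in Get-VMHost){
	$esxcli = Get-EsxCli -VMHost $vmhost.Name
	# Configure the netdump target; vSphere 6 expects one more value than before, passed here as $null
	$esxcli.system.coredump.network.set($null,"vmk0",$null,$vcenterip,6500)
	# Enable network coredump (pass $true or $null, never $false)
	$esxcli.system.coredump.network.set($true)
	# Show the resulting configuration
	$esxcli.system.coredump.network.get()
}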

Also not something commonly found on the internet – can you test the ESXi netdump configuration?  Yep!

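foreach($vmhost in Get-VMHost){
	$esxcli = Get-EsxCli -VMHost $vmhost.Name
	# $() is needed so the Name property is expanded inside the string
	Write-Host "Checking dump collector on host $($vmhost.Name)"
	# Ask the host to verify it can reach the configured netdump server
	$esxcli.system.coredump.network.check()
}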

And there you have it!

VMware ending Enterprise licensing

In a surprising move, VMware is moving to end the Enterprise licensing level, leaving only Standard and Enterprise Plus licensing levels, along with the Essentials and Essentials Plus packs.

VMware in fact has already removed Enterprise licensing from their product page.

As a consolation, VMware apparently will be offering existing Enterprise licensed customers special promotion pricing to upgrade to Enterprise Plus.

Also, vCenter Standard licenses will be bundled with 25 OS instance licenses for vRealize Log Insight, which definitely adds value for vCenter customers.  Log Insight is actually a really good product seemingly few people are aware of that aggregates event logs from Windows OS’s and syslogging, and allows for analysis and monitoring for specific events.  I hope this encourages customers to take a good look at Log Insight, because I think it’s a really good product that deserves more attention than it gets.

But all in all, I think these changes are not good.  The elephant in the room is DRS.  I have many customers who thought the price jump from Standard to Enterprise to get DRS was hard enough to swallow.  Some did, and some didn’t.  But now there’s a gaping canyon between Standard and Enterprise Plus in both features and price.  For many customers, all they really wanted in additional features above Standard is DRS, and they’ll now be forced to pay more to get it when many already didn’t due to price.  I don’t see this working out for VMware or customers in the end, as more customers will opt to either stick to Standard instead of begrudgingly stepping up to Enterprise Plus, or consider other less expensive hypervisors such as Hyper-V or KVM.

It’s impossible to avoid having flashbacks to the vRAM licensing debacle.  If they truly wanted to simplify licensing, in my opinion DRS should be added to Standard, even with an increase in licensing costs for Standard. Most environments that are that cost conscious are typically small and often would have the ability to use Essentials packs.  Plus, EVERYONE can make use of DRS somehow, so at least customers would be paying more for something they could actually use.

SRM installation error – unexpected error -1

I assisted a colleague today with an issue with reinstalling Site Recovery Manager 5.8 for a customer.  During installation, when requested to input the vCenter FQDN and administrative user, an “unexpected error: -1” pop up would occur.  After a bit of research, I found articles pointing to various certificate problems in SRM 4.X and 5.X.  The odd thing was he attempted to point it to the second vCenter server, and it would proceed, but this was the SRM server for one site, and this vCenter was in another.

I’ve been cognizant of new vCenter releases disabling SSLv3, so we checked the build numbers of the two vCenter servers.  Sure enough, the one generating the error was 5.5 Update 3b, which disabled SSLv3 support, but the vCenter that didn’t generate the error was 5.5 Update 2, which still supports SSLv3.  We then checked the build of the SRM 5.8 installation file, which was 5.8, NOT 5.8.1.  While what I found doesn’t describe the same error, it does state that SRM 5.8.1 is required to interoperate well with vCenter 5.5 Update 3b, unless you want to enable SSLv3 in your vSphere environment, which isn’t the best thing for security. Even the interoperability guide shows 5.8.0 as unsupported with vCenter 5.5 Update 3.

Apparently, the customer upgraded one vCenter very recently, but not the other, and also didn’t check SRM interoperability before doing so, which caused the weird behavior.   They also didn’t mention this to us.  vCenter in the second site was upgraded, and SRM 5.8.1 was installed instead of 5.8.0, and this resolved the issue.

So, if you have this error during SRM installation, it’s likely a problem with certificates, so start there, and be cognizant of any changes that might impact certificates or their use.  As always, check your builds, the interoperability matrix, and the upgrade order prior to updating any vSphere component.

In-place upgrading Windows OS on vCenter 6?

I recently had a customer with two vCenter VMs running on Windows 2008 R2.  They were vCenter servers upgraded from 5.1 to vCenter 6.0 about six months ago.  They’re both using embedded PSCs, and have vSphere Replication and SRM plugged into them.  To simplify administration, they have embarked on a project to get all servers running Windows Server 2012 R2.

After researching, I found there really isn’t a great, documented way to transplant a vCenter 6 server from one OS instance to another.  Normally, I’m not a big fan of in-place upgrading server operating systems, but this was a special case to meet the customer’s objective.  Redeploying two vCenters and then likely having to redeploy/reconfigure SRM wasn’t something I’d want to do, plus there were any pitfalls with vSphere Replication to consider. But the question is: will vCenter 6, especially with an embedded Platform Services Controller and lots of things plugged into it, work after an in-place OS upgrade?

I definitely had my doubts.  The answer though in my lab is surprisingly yes!  I tried it both with an embedded PSC, and then tried it again with a once embedded PSC reconfigured to use an external PSC.  I didn’t encounter any problems whatsoever, although I should point out this was a lab environment with a clean fresh setup prior to the OS in place upgrades.

So I went ahead and did it for the customer’s environment (they aren’t  big enough to have a lab environment), and it worked like a champ as well!

Here are some things I would make sure of before proceeding:

  • You may want to back up the vCenter database.  Warning: the vPostgres Windows backup script said it ran successfully for me but generated an empty 0KB backup file.  (This was one of the reasons I didn’t attempt a transplant of vCenter to a new OS instance!)  Check to ensure this database file is valid before counting it as a backup to fall back to if there’s a problem.  This may be a future blog article once I get some answers for why this happened.
  • Verify what version of Windows is running, and ensure you have the required media and license keys.  In particular, if vCenter is running Windows Server 2008 R2 Datacenter, you can’t upgrade to Windows Server 2012 R2 Standard.
  • Verify what database vCenter and VUM are using if on the same box.  vPostgres is fine.  But if it’s Microsoft SQL running on the vCenter server itself, make sure SQL is running something that is supported on Windows Server 2012 R2.  Of specific note on some of these older vCenter VMs, SQL 2008 R2 needs to be SP2 or later.
  • I would recommend stopping all vCenter related services, VUM related services (if on the same OS instance), the database service (if it’s on the same OS instance),  and AV active protection prior to OS upgrade.
  • Make sure the C drive has at least about 15 GB of free space.
  • Reboot the OS prior to starting the upgrade to clear out any cobwebs.
  • Take a snapshot and/or backup of vCenter before proceeding.  (Kinda duh…)  What isn’t so duh: before you take that final snapshot, launch the upgrade installer and verify it doesn’t need you to do anything first, such as rebooting the OS.  If all you see is the warning to check that your applications are compatible, cancel the upgrade, take your snapshot, and then work through the other precautionary steps in this list.  If it asks you to do other things, do those first, THEN snapshot your VM.
  • Don’t forget to kill your snapshot once everything is done, and you’ve confirmed everything is working.
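Given the empty backup file I ran into, a couple of these checks are worth scripting.  Here’s a minimal Python sketch for the backup file and free space items.  The function names are my own, and a non-empty file only rules out the 0KB case, it doesn’t prove the backup is restorable.

```python
import os
import shutil


def backup_file_looks_valid(path: str, min_bytes: int = 1) -> bool:
    """True if the backup file exists and is at least min_bytes in size."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes


def free_gb(path: str) -> float:
    """Free space, in GiB, on the volume that contains path."""
    return shutil.disk_usage(path).free / 1024 ** 3
```

Before starting the OS upgrade you would point backup_file_looks_valid at the database backup you just took, and confirm free_gb("C:\\") comes back at 15 or more.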

It worked flawlessly using these precautions for both production vCenter servers!