Tag Archives: vsphere

VMware ESXi 6.0 Express Patch 6 causing CBT issues

The always useful Veeam support digest is reporting that, at the very least, Veeam is seeing issues with Changed Block Tracking (CBT) caused by vSphere 6.0 Express Patch 6.  This build was released on May 12th of this year, and it is the current build according to VMware’s build number KB article.

Veeam reports that the issue appears when a job uses VMware Tools quiescence rather than Veeam’s application-aware processing.

Other blog articles mention that other backup products are impacted as well, including vSphere Data Protection and IBM TSM.  It’s safe to assume this will broadly impact all VMware-centered backup products.

If you’re using Express Patch 6, you currently have two options:

  1. Roll back to ESXi 6.0 Update 2.
  2. Don’t use quiesced snapshots.

Heads up!

EMC VSI RecoverPoint/SRM Integration

I’ve recently set a customer up with new VNX storage arrays and RecoverPoint, all to be integrated with VMware Site Recovery Manager.  Previously, the customer used SRM in conjunction with MirrorView/A.  Why RecoverPoint?

The really cool thing about RecoverPoint is that you can easily roll back to specific points in time; EMC likes to call it DVR functionality for disaster recovery.  MirrorView/A only lets you roll back to the specific snapshots it took at fixed points in time.

EMC also provides their Virtual Storage Integrator (VSI) for VMware environments.  It integrates with many of their storage products, including VNX and RecoverPoint, and if you integrate it with SRM as well, it provides the DVR-style point-in-time selection within SRM!

Setup is pretty straightforward:

  1. Deploy the OVA for the VSI in each site.
  2. Log in to the VSI’s web portal at https://<ip>:8443/vsi_vum with user name admin and password ChangeMe.  Change the password as prompted.
  3. Register the VSI’s plugin with vCenter by going to VSI Setup and providing the required info.  If you don’t get “The Operation is successful.”, do it again unless you’re given an error to troubleshoot.  For me, a retry was needed on one of the two vCenter servers I was deploying this on.  Also, be patient, as this can take quite some time.  For me, the plugin took about 10-15 minutes to complete the installation.
  4. Log in to the vCenter Web Client, and go to vCenter Inventory Lists.  At the end, you should see an EMC VSI section.
  5. Click on Storage Integration Service.  Under Actions, click Register Solutions Integration Service, and enter the VSI’s info for that vCenter.  Click Test to ensure there’s connectivity to the VSI, and click OK.
  6. Under Storage Systems, add the storage array for that site.  Again, click Test to ensure there’s connectivity to the storage array, and click OK.  VSI supports VMAX, VNX, VNXe, ViPR, and XtremIO, so this isn’t just limited to the VNX on this project.
  7. Under Data Protection Systems, add the RecoverPoint cluster info for that site using the RPA cluster IP address, and be sure to select RecoverPoint as the Protection System Type.  Click Test to ensure communication will work.  If successful, OK will no longer be grayed out.  Click OK.
  8. Repeat step 7, but select SRM this time for the Data Protection System type.  Here’s where I ran into a gotcha.  The FQDN/IP address and port fields were grayed out.  I went ahead and clicked Test, and got an error: “Could not communicate with the data protection system SRM at <IP of vCenter server>. Details: Cannot reach the target SRM server at <IP of vCenter server>:1”  Google didn’t yield any results for a solution, so I began troubleshooting.  Thankfully, I knew my ports, so I clicked the check box for the FQDN or IP/Port line and entered the FQDN of the SRM server and the port.  Be aware that SRM 6.x uses 9086.  I provided that, clicked Test, got my green “OK to go” text, and clicked OK.

Note that this needs to be done for each vCenter/RPA cluster/storage array/SRM server in the environment.  Note also only one VSI instance can be registered per vCenter server, so you’ll need to deploy one VSI per vCenter.
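
Before registering SRM in step 8, it can save some head-scratching to confirm the SRM port is reachable from wherever you’re working.  Here’s a minimal Python sketch of that kind of check.  It isn’t anything the VSI does itself; the function name and the host name in the comment are made-up placeholders, and 9086 is simply the SRM 6.x port mentioned above.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call (hypothetical SRM server name, SRM 6.x port from the post):
# tcp_port_open("srm01.example.local", 9086)
```

If this comes back False for the SRM server’s FQDN on 9086, look at firewalls or the SRM service itself before blaming the VSI.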

After setting up each site, select a VM, go to Manage, view the snapshots for its Consistency Group, pick the one you want and apply it, then launch your Failover or Test action from SRM.

And there you have it!

VCSA can’t enumerate AD accounts

Ran into an interesting issue.  After deploying greenfield vCenter 6 Server Appliances (VCSA) using an external PSC for a remote branch site, I tried to do some permissioning with AD accounts.  Joining the PSC to the domain wasn’t a problem, nor was adding the AD domain as an identity source.  But when I tried to enumerate accounts for permissioning, it would fail with the error: “Cannot load the users for the selected domain”.

I found an excellent VMware KB article that gave lots of things to check when troubleshooting this.

I verified DNS was working.  No surprise there.  However, when I ran the command less /var/lib/likewise/krb5-affinity.conf, I noticed the DCs in use were not the DCs it should have been using, but rather DCs from a different remote branch office site.  When I checked AD Sites and Services, it was clear that a subnet object that included the IP of the PSC was associated with the wrong branch office, so the PSC was attempting to use the DCs in that site.  It’s good to know that vCenter appliances are apparently AD site aware.  Furthermore, the first of the two DCs in that wrong branch site didn’t have a PTR record, because the Reverse Lookup Zone for its subnet didn’t exist.  Apparently, if the first domain controller to be used can be contacted but doesn’t have a PTR record, the PSC won’t enumerate users and groups for permissioning.

Creating the Reverse Lookup Zone and forcing the PTR record creation along with some AD replication fixed the issue, and I kindly suggested to the customer it was time for some tender loving care with AD Sites and Services, along with DNS.

So, FYI, it’s not a bad idea to review your Active Directory Sites and Services, and your DNS Forward and Reverse Lookup zones before you deploy the VCSA.
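
One quick way to spot-check the PTR side of this is a small script that tries a reverse lookup for each domain controller IP.  This is only a sketch under my own assumptions: the function name and the sample IPs are hypothetical, and it uses whatever DNS servers the machine running it is configured with, which may not match what the PSC itself sees.

```python
import socket
from typing import Callable, Iterable, List


def missing_ptr_records(
    dc_ips: Iterable[str],
    reverse_lookup: Callable[[str], tuple] = socket.gethostbyaddr,
) -> List[str]:
    """Return the IPs from dc_ips for which a reverse (PTR) lookup fails."""
    missing = []
    for ip in dc_ips:
        try:
            reverse_lookup(ip)
        except OSError:
            # socket.herror and socket.gaierror are both OSError subclasses.
            missing.append(ip)
    return missing


# Example call (hypothetical domain controller IPs):
# missing_ptr_records(["10.20.30.11", "10.20.30.12"])
```

An empty list means every IP resolved back to a name.  Anything returned is worth checking against your Reverse Lookup Zones before deploying the VCSA.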

VMware ending Enterprise licensing

In a surprising move, VMware is moving to end the Enterprise licensing level, leaving only Standard and Enterprise Plus licensing levels, along with the Essentials and Essentials Plus packs.

VMware in fact has already removed Enterprise licensing from their product page.

As a consolation, VMware apparently will be offering existing Enterprise licensed customers special promotion pricing to upgrade to Enterprise Plus.

Also, vCenter Standard licenses will be bundled with 25 OS instance licenses for vRealize Log Insight, which definitely adds value for vCenter customers.  Log Insight is actually a really good product seemingly few people are aware of that aggregates event logs from Windows OS’s and syslogging, and allows for analysis and monitoring for specific events.  I hope this encourages customers to take a good look at Log Insight, because I think it’s a really good product that deserves more attention than it gets.

But all in all, I think these changes are not good.  The elephant in the room is DRS.  I have many customers who thought the price jump from Standard to Enterprise to get DRS was hard enough to swallow.  Some did, and some didn’t.  But now there’s a gaping canyon between Standard and Enterprise Plus in both features and price.  For many customers, DRS was the only feature they really wanted above Standard, and they’ll now be forced to pay even more to get it when many already passed on it due to price.  I don’t see this working out for VMware or its customers in the end, as more customers will opt to either stick with Standard instead of begrudgingly stepping up to Enterprise Plus, or consider less expensive hypervisors such as Hyper-V or KVM.

It’s impossible to avoid having flashbacks to the vRAM licensing debacle.  If they truly wanted to simplify licensing, in my opinion DRS should be added to Standard, even with an increase in licensing costs for Standard.  Most environments that are cost conscious are typically small and often would be able to use the Essentials packs.  Plus, EVERYONE can make use of DRS somehow, so at least customers would be paying more for something they could actually use.

SRM installation error – unexpected error -1

I assisted a colleague today with an issue with reinstalling Site Recovery Manager 5.8 for a customer.  During installation, when requested to input the vCenter FQDN and administrative user, an “unexpected error: -1” pop up would occur.  After a bit of research, I found articles pointing to various certificate problems in SRM 4.X and 5.X.  The odd thing was he attempted to point it to the second vCenter server, and it would proceed, but this was the SRM server for one site, and this vCenter was in another.

I’ve been cognizant of new vCenter releases disabling SSLv3, so we checked the build numbers of the two vCenter servers.  Sure enough, the one generating the error was 5.5 Update 3b, which disabled SSLv3 support, but the vCenter that didn’t generate the error was 5.5 Update 2, which still supports SSLv3.  We then checked the build of the SRM installation file, which was 5.8.0, NOT 5.8.1.  While the VMware guidance we found doesn’t describe this same error, it states that SRM 5.8.1 is required to interoperate well with vCenter 5.5 Update 3b, unless you want to enable SSLv3 in your vSphere environment, which isn’t the best thing for security.  Even the interoperability guide shows 5.8.0 as unsupported with vCenter 5.5 Update 3.

Apparently, the customer upgraded one vCenter very recently, but not the other, and also didn’t check SRM interoperability before doing so, which caused the weird behavior.   They also didn’t mention this to us.  vCenter in the second site was upgraded, and SRM 5.8.1 was installed instead of 5.8.0, and this resolved the issue.

So, if you have this error during SRM installation, it’s likely a problem with certificates, so start there, and be cognizant of any changes that might impact certificates or their use.  As always, check your builds, the interoperability matrix, and the upgrade order prior to updating any vSphere component.

In-place upgrading Windows OS on vCenter 6?

I recently had a customer with two vCenter VMs running on Windows 2008 R2.  They were vCenter servers upgraded from 5.1 to vCenter 6.0 about six months ago.  They’re both using embedded PSCs, and have vSphere Replication and SRM plugged into them.  To simplify administration, they have embarked on a project to get all servers running Windows Server 2012 R2.

After researching, I found there really isn’t a great, documented way to transplant a vCenter 6 server from one OS instance to another.  Normally, I’m not a big fan of in-place upgrades of server operating systems, but this was a special case to meet the customer’s objective.  Redeploying two vCenters and then likely having to redeploy/reconfigure SRM wasn’t something I wanted to do, never mind any pitfalls with vSphere Replication.  But the question is: will vCenter 6, especially with an embedded Platform Services Controller and lots of things plugged into it, work after an in-place OS upgrade?

I definitely had my doubts.  The answer though in my lab is surprisingly yes!  I tried it both with an embedded PSC, and then tried it again with a once embedded PSC reconfigured to use an external PSC.  I didn’t encounter any problems whatsoever, although I should point out this was a lab environment with a clean fresh setup prior to the OS in place upgrades.

So I went ahead and did it for the customer’s environment (they aren’t  big enough to have a lab environment), and it worked like a champ as well!

Here are some things I would make sure of before proceeding:

  • You may want to back up the vCenter database.  Warning: the vPostgres Windows backup script said it ran successfully for me but generated an empty 0 KB backup file.  (This was one of the reasons I didn’t attempt a transplant of vCenter to a new OS instance!)  Check that the backup file is valid before counting on it as a fallback if there’s a problem.  This may be a future blog article once I get some answers for why this happened.
  • Verify what version of Windows is running, and ensure you have the required media and license keys.  In particular, if vCenter is running Windows Server 2008 R2 Datacenter, you can’t upgrade to Windows Server 2012 R2 Standard.
  • Verify what database vCenter and VUM are using if on the same box.  vPostgres is fine.  But if it’s Microsoft SQL running on the vCenter server itself, make sure it’s a SQL version that is supported on Windows Server 2012 R2.  Of specific note on some of these older vCenter VMs, SQL 2008 R2 needs to be SP2 or later.
  • I would recommend stopping all vCenter related services, VUM related services (if on the same OS instance), the database service (if it’s on the same OS instance),  and AV active protection prior to OS upgrade.
  • Make sure the C drive has at least about 15 GB of free space.
  • Reboot the OS prior to starting the upgrade to clear out any cobwebs.
  • Take a snapshot and/or back up vCenter before proceeding.  (Kinda duh…)  What isn’t so duh: before you take that last snapshot, launch the upgrade and verify whether it needs you to do anything before it will install.  This is usually stuff like requiring a reboot of the OS prior to the upgrade.  If all you see is the warning to check that your applications are compatible, cancel the upgrade, take your snapshot, and continue with the other precautionary steps in this list.  If there are other things it asks you to do, do those first, THEN snapshot your VM.
  • Don’t forget to kill your snapshot once everything is done, and you’ve confirmed everything is working.

It worked flawlessly using these precautions for both production vCenter servers!

vCenter 6 Reconfigure from Embedded to External PSC

There have been some problems with embedded PSC configurations, so I’ve had requests to move away from the embedded PSC (PSC and vCenter in the same OS instance) to external configurations.  Thankfully, vCenter 6.0 Update 1 and above has a method to do just this!

Transitioning to External PSC

To accomplish this, I first built a new virtual machine running Server 2012 R2, patched it to current, joined it to the domain, and granted the appropriate rights for the vCenter service account.

It’s also important to note that the existing vCenter 6.0 must be running Update 1 or later for this to work, so patch your current vCenter up to Update 1 or higher first if needed.  You should also deploy the new PSC using the same build as the existing vCenter.

Also, make sure you have a good rollback plan, like a whole VM backup or snapshots as needed.

This process works just as well for the appliance.

You then install an external PSC joining the existing SSO domain and site.  Now there are two PSCs, but vCenter is still setup in an embedded configuration, so the external PSC isn’t used yet.

At this point, you need to use the cmsso-util utility with the reconfigure option.  It’s located in your vCenter installation folder, typically C:\Program Files\VMware\vCenter Server\bin.

cmsso-util reconfigure --repoint-psc destpsc.vs6lab.local --username administrator --domain-name "vsphere.local" --passwd "P@ssw0rd"

I immediately ran into my first issue…

“The provided Platform Services Controller(PSC) is not a replication partner of the localhost. Please make sure to provide the Primary Network Identifier (PNID) of the PSC.”

A little googling led me quickly to this community post that states the DNS name is apparently case sensitive, so check your DNS records to see exactly how the name is cased (maybe it’s all caps).  Use that exact casing, and you’re golden.  In my case, it was DESTPSC.vs6lab.local.
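
To make the failure mode concrete, here’s a tiny Python illustration of comparing the name you typed against the PNID.  It’s purely illustrative: cmsso-util does its own checking, the function name is made up, and it assumes you already know the PNID’s exact casing, for example from the DNS record as described above.

```python
def diagnose_psc_name(typed_name: str, pnid: str) -> str:
    """Classify how a typed PSC name relates to the PSC's actual PNID."""
    if typed_name == pnid:
        return "match"
    if typed_name.lower() == pnid.lower():
        # Same host name, but the casing differs, which is the situation
        # that produced the "not a replication partner" error in this post.
        return "case mismatch"
    return "different host"


# The lab names from this post:
print(diagnose_psc_name("destpsc.vs6lab.local", "DESTPSC.vs6lab.local"))
```

With the lab names from this post, the comparison lands in the case-mismatch branch, which is exactly why retyping the name with the DNS record’s casing fixed the command.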

Sit back and be patient.  Mine took probably a solid 10 minutes, but I am running it in a slower lab environment.

When it’s finished, verify the vSphere Web Client is functioning.  Also, verify the PSC has been repointed under your vCenter Server – Manage – Advanced Settings – config.vpxd.sso.admin.uri

PSC is done!

VMware network test commands

I recently ran into an issue with vSphere Replication that involved network connectivity (probably a future post), and I quickly realized that VMware network test commands are not consistent across all their products, so this could be confusing for many people.  I’ll update this post later as I get the commands for other products, but this may help someone looking for how to do VMware network testing and troubleshooting.

ESXi

ESXi has two helpful commands.  For basic connectivity tests, vmkping is awesome because it’s simple to use and lets you specify which VMkernel interface you want to test from.  Sure, you could use ping, but you can’t specify which vmk interface with it.

To ping 192.168.1.1 from your management VMkernel interface, assuming the default of vmk0, it’s simply:

vmkping 192.168.1.1 -I vmk0

Another good use is validating jumbo frames, as you can specify the packet size and disable packet fragmentation.  The -s value is the ICMP payload size, so for a 9000-byte MTU you need to subtract 28 bytes of IP and ICMP headers.  To conduct the same test with a payload of 8972 bytes and ensure the packet doesn’t get fragmented:

vmkping 192.168.1.1 -I vmk0 -s 8972 -d
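
The size passed to -s is only the ICMP payload, so the largest value that fits in a single unfragmented frame is the MTU minus the IPv4 and ICMP headers.  Here’s that arithmetic as a short Python sketch; the function names are just for illustration, and it assumes a plain 20-byte IPv4 header with no options.

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_command(target: str, vmk: str, mtu: int) -> str:
    """Build a don't-fragment vmkping command line sized for the given MTU."""
    return f"vmkping {target} -I {vmk} -s {max_ping_payload(mtu)} -d"


print(vmkping_command("192.168.1.1", "vmk0", 9000))
# prints: vmkping 192.168.1.1 -I vmk0 -s 8972 -d
```

The same math gives 1472 for a standard 1500-byte MTU, which is a handy sanity check before you start blaming the jumbo frame configuration.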

For testing specific port connectivity, ESXi does support the netcat, aka nc command.  To test port 80 on destination 192.168.1.1:

nc -z 192.168.1.1 80

You can specify UDP mode using -u as well.  Note that at least in my experience -s <source IP> does NOT work, so I don’t believe it’s possible to direct netcat through a specific vmkernel port.  When I tried it, for example by forcing it through an IP that shouldn’t work, the connection still succeeded when it shouldn’t have.

Any VMware Product Running on Windows 2012 or Higher (vCenter, SRM)

Everybody knows ping.  I’m not gonna go over that.  But did you know that PowerShell has a ping cmdlet?  This is useful for documentation of results, using export-csv, and scripting lots of ping tests.

To ping 192.168.1.1:

test-connection 192.168.1.1

Another handy trick is you can remotely have multiple Windows machines ping the same computer and/or specify multiple targets.  For example, if I want server1 and server2 to each ping 192.168.1.1 and 192.168.1.2:

test-connection -Source Server1,Server2 -ComputerName 192.168.1.1,192.168.1.2

PowerShell also has cmdlets to test network port connectivity as well.  To test if the local machine can connect to 192.168.1.1 on TCP port 80:

test-netconnection -computername 192.168.1.1 -InformationLevel detailed -port 80

Unfortunately, there isn’t a handy -source parameter, but you could use PowerShell remoting to run this command on multiple remote computers, too.

VMware vCenter Server Appliance

For pinging, there’s the ping command.  That’s easy enough.

If you try to use netcat for port testing, it isn’t there by default.  You have to run the following to temporarily install it on version 6:

/etc/vmware/gss-support/install.sh

Rebooting the VCSA removes it.

You can also use curl if that’s something you’d rather not do:

curl -v telnet://192.168.1.1:80

vSphere Replication Appliance

For pinging, there’s the ping command.  No surprises.

For network port testing, again, netcat isn’t installed, nor is there a supported way to install it to my knowledge.  Instead, use the curl command:

curl -v telnet://192.168.1.1:80

Keep checking back, as I add more.

Nutanix administration do’s and don’ts

As a virtualization consultant, I know there’s a wide variety of technologies at every level: hypervisor, storage, networking, and even server hardware, all of which are getting more complex in terms of what you need to know to manage them effectively.  Nobody can be an expert in every single storage technology, for example, and with more and more options that are radically different in their architecture, I wanted to make my own little contribution for consultants and admins alike on the basic things you should and shouldn’t do with one storage solution: Nutanix.  We consultants often find ourselves in environments with something we’re not totally familiar with, so some concise guidance can go a long way.  Admins, too, may have depended on a consultant or former colleagues for implementation and support, but now it’s on them, so I thought this would be helpful.

There are quite a few things everyone should know if they’re ever working on an environment with Nutanix that aren’t necessarily obvious.  I can see it being pretty darn easy to blow up a Nutanix environment if you’re not aware of some of these things.

Common stuff

  • Contact Nutanix Support before downgrading licensing or destroying a cluster to reclaim licenses (unnecessary if you’re using Starter licensing though).  This was repeated many times, so I’m guessing if this isn’t done, you’ll be hating life getting licensing straight.
  • Do NOT delete the Nutanix Controller VM on any Nutanix host (CVM names look like: NTNX-<blockid>-<position>-CVM)
  • Do NOT modify any settings of a Controller VM, all the way down to even the name of the VM.
  • Shutdown/Startup gotchas:
    • It’s probably best to never shutdown/reboot/etc. more than one Nutanix node in a cluster at a time. If you do more, you may cause all hosts in the Nutanix cluster to lose storage connectivity.
    • When shutting down a single host, or fewer hosts than the redundancy factor (the number of host failures the Nutanix cluster is configured to tolerate), migrate or shut down all VMs on the host EXCEPT the Controller VM, THEN shut down the Controller VM.
    • If you are shutting down a number of hosts that exceeds the redundancy factor, you need to shut down the Nutanix cluster.  There’s also a specialized procedure to start up the Nutanix cluster in this situation.  That’s beyond the scope of this post.
    • When booting up a host, do the following:
      • Start the Controller VM that resides on it first, and verify its services are working by connecting to it over SSH and running:
        • ncli cluster status | grep -A 15 <controllerVmIP>
      • Then have the host rescan its datastores.
      • Then verify the Nutanix cluster state from the same SSH session, using the following to ensure cluster services are all up:
        • cluster status
  • Hypervisor Patching
    • Patch one hypervisor node, and ensure its Controller VM comes back up with its services healthy, before proceeding to the next one.  Only do one node at a time in a Nutanix cluster (see above).
    • Follow shutdown host procedure above.

vSphere

  • NEVER use the “Reset System Configuration” command on an ESXi host in a Nutanix cluster.
  • If resource pools are created, Controller VM (CVM) must have the highest share.
  • Do NOT modify NFS settings.
  • VM swapfile location should be the same folder as the VM. Do NOT place it on a dedicated datastore.
  • Do NOT modify the Controller VM startup/shutdown order.
  • Do NOT modify iSCSI software adapter settings.
  • Do NOT modify vSwitchNutanix standard vSwitch.
  • Do NOT modify the vmk0 interface in port group “Management Network”.
  • Do NOT disable ESXi host SSH.
  • HA configuration recommended settings:
    • Enable admission control and use percentage based policy with value based on number of nodes in cluster
    • Set VM Restart Priority for CVMs to Disabled.
    • Set Host Isolation Response of cluster to Power Off
    • Set Host Isolation Response of CVMs to Leave Powered ON.
    • Disable VM Monitoring for all CVMs
    • Enable Datastore Heartbeating by clicking Select only from my preferred datastores and choosing Nutanix datastores. If cluster has only one datastore (which would be common potentially in Nutanix deployments), add advanced option das.ignoreInsufficientHbDatastore=true to avoid warnings about not having at least two heartbeat datastores.
  • DRS stuff:
    • Disable automation of all CVMs
    • Leave power management disabled (DPM)
  • Enable EVC for lowest processor class in cluster.
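
For the percentage-based admission control setting above, a common rule of thumb is to reserve enough capacity to cover the number of host failures you want to tolerate, spread across the node count.  The sketch below shows that arithmetic.  It’s my own illustration rather than an official Nutanix or VMware formula, the function name is made up, and rounding up is an assumption that errs on the conservative side.

```python
import math


def ha_reserved_percentage(nodes: int, host_failures_to_tolerate: int = 1) -> int:
    """Percentage of cluster CPU/memory to reserve for HA failover capacity."""
    if nodes < 2:
        raise ValueError("an HA cluster needs at least two nodes")
    if not 0 < host_failures_to_tolerate < nodes:
        raise ValueError("failures to tolerate must be between 1 and nodes - 1")
    # Reserve the share of the cluster represented by the hosts you can lose,
    # rounded up so the reservation is never slightly too small.
    return math.ceil(100 * host_failures_to_tolerate / nodes)


print(ha_reserved_percentage(4))  # prints: 25
```

So a four-node cluster tolerating one host failure would reserve 25%, and a three-node cluster would round up to 34%.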

Hyper-V

  • Do NOT use Validate Cluster within Failover Clustering nor SCVMM, as it is not supported. Not sure what would happen if you did, but I’m guessing it would be pretty awesome, and you probably should make sure you got popcorn ready if you’re gonna do that.
  • Do NOT modify the Nutanix or Hyper-V cluster name
  • Do NOT modify the external network adapter name
  • Do NOT modify the Nutanix specific virtual switch settings

KVM (the Hypervisor… also assuming this means if you’re using Acropolis Hypervisor from Nutanix since it’s KVM based…)

  • Do NOT modify the Hypervisor configuration, including installed packages
  • Do NOT modify iSCSI settings
  • Do NOT modify the Open vSwitch settings

I hope this proves helpful to people who unexpectedly find themselves working on Nutanix and need a quick primer to ensure they don’t break something!

Changed Block Tracking issues with SRM

As may be obvious, I’ve been doing quite a bit of work lately with VMware Site Recovery Manager and storage based replication, specifically EMC’s MirrorView.  I ran into another issue while testing with SRM 6 + ESXi 5.0 hosts.

During the project, we were updating vCenter from 5.0 to 6.0 and SRM from 5.0 to 6.0, verifying everything worked, and then proceeding with updating the ESXi hosts.  We didn’t bother patching the ESXi 5.0 hosts, since they would be updated to 6.0 soon enough.  We wanted to make sure SRM worked through vCenter before updating ESXi simply to ensure an easy rollback.

However, during failover testing, we ran into an issue where most VMs would not power on during isolated testing and failovers.  The error was as follows:

Error – Cannot open the disk ‘/vmfs/volumes/<VMFS GUID>/VMName/VMName.vmdk’ or one of the snapshot disks it depends on.

When you look into the events for an impacted VM, you would find the following:

“Could not open/create change tracking file”

We cleared CBT files for all the VMs, and tried again, forcing replication, and it worked.  We figured CBT got corrupted.  But then Veeam ran its backups, we tried an isolated test, and almost all the VMs couldn’t power on in an isolated test again.

I know ESXi 6 has been in the news lately for corruption in Changed Block Tracking, but it’s far from the only version that’s suffered from an issue with CBT.  ESXi 5.0, 5.1, and 5.5 have had their issues, too.  In this case, the customer was running a version that needed a patch to fix CBT.  We remediated the hosts to patch them to current, reset CBT data yet again, allowed Veeam to back up the VMs, and tried an isolated test.  All VMs powered on successfully.

It’s important to note that Veeam really had nothing to do with this problem, and neither did MirrorView.  This was strictly an unpatched ESXi 5.0 issue.  So, if you run into this with any ESXi version using storage based replication, I recommend patching the hosts to current, resetting CBT data, running another backup, making sure the storage replicated the LUN after that point, and trying again.