Tag Archives: veeam

Useful Utilities Are Useful

If you’re an IT pro, whether an admin, an engineer, a consultant, or a PC technician, you have a toolbox of useful utilities, scripts, and software that you use to fix problems.  As time goes by, some of those tools get used more and more.  Others are used less and less for various reasons.  But what surprises me is how many tools in my toolbox seem on the surface to have fewer and fewer use cases, yet I still come back to them even when it seemed I would never need them again.

Over the last few weeks, I’ve been working with a customer who has had significant turnover among the consultants they’ve used.  They are moving off a troubled, disparate datacenter environment that had developed numerous problems over time.  The destination is a more consolidated environment that various SyCom resources, including me, have built for them, one that functions properly, has updated software and firmware, etc.  Along the way, we’ve run into numerous challenges that you wouldn’t normally anticipate.  Troubleshooting them down to a root cause would often have taken too much time.  Finding a duct tape solution was more expedient.

I wanted to give a few examples just to illustrate that having a wide knowledge of utilities out there and experience with them can help you solve problems.

In this case, the task was seemingly simple – move VMs running on a legacy NetApp array and vSphere 5.1 servers to a new(er) cluster running vSphere 5.5.  The clusters were managed by two different vCenter servers.  These clusters were within the same physical datacenter.  They had network connectivity between them.  They did not have access to the same storage arrays.  The customer allowed downtime to move them.  Therefore, the easiest way was a shared nothing cold migration (we’re running 5.1 on the source side, remember).  Simple, right?

Doing It the Textbook Way

I approached this the way any vSphere resource would.  Get the two clusters into the same vCenter instance, shut the VMs down, and migrate them cold.  How many times have you seen that fail?  Me?  Pretty much never.  Well, this time it wouldn’t work.  I’ll spare you the troubleshooting details, but trust me, doing it the native way wouldn’t work.

At this point, the time had come to get creative and bust out some useful utilities I hadn’t used in a long time.  We had to get the job done.  Tick tick!

Useful Utilities #1 – Veeam FastSCP

The customer wasn’t a Veeam customer (yet).  While the customer could take some downtime off hours, there was a limit to that.  We had to move about 2TB of data, so we needed to move this data as quickly as possible without a ton of labor to reconfigure the networks to get both environments access to the storage.

Sure, I could use WinSCP to just bulk copy the VMs over, but Veeam FastSCP, built into the Veeam Backup & Replication trial, is free, and it moves data quicker because it disables encryption on the data transfer, which was acceptable to the customer.  I hadn’t had any reason to use FastSCP in probably five years because cold migration, exporting VMs to OVFs, and whatnot within vSphere made it unnecessary.  But here I was, using it yet again.

And sure enough, it worked like a champ.  We tested a quick procedure using it on a few development workloads.  We then proceeded moving all but the critical VMs, and it worked great… except for the last VM of course.  Come to find out, that was a critical SQL VM that the customer didn’t realize was using physical Raw Device Mappings.

Well, shoot, how do we do this one in a quick manner?

Useful Utilities #2 – VMware Converter

For numerous reasons, including perhaps the sheer circumstance of the projects I’ve worked on, I hadn’t needed VMware Converter in years until this project.  Virtualization is so prevalent now that P2V is one of those things for me that’s like, “Hey man, remember that time we had to convert like 100 physical machines to virtual back in the day?  Good times!”

Also, I’ve generally recommended to customers to avoid converting physical to virtual anyway.  It should generally be seen as a shortcut, but never optimal.  If you could just build a fresh new VM and get the data moved, the resulting VM would be cleaner.  It would probably perform better.  There’s less chance of instability from old drivers and what would inevitably be a significant change in hardware for the OS and application.  Obviously, if you’re dealing with a ton of machines, rebuilding them all isn’t practical.  In that case, you might have to turn to a P2V tool.

But if you have a VM with physical RDMs, you can’t clone the VM.  You can’t bulk copy the virtual machine files over.  You could create new VMDKs, copy everything out of the RDM disks to those, and reassign drive letters.  However, this SQL VM was nasty, with complex mount points and drive letters assigned, and we had to get it done the same weekend the RDMs were discovered.

Solution?  VMware Converter!  I first tried installing it on an admin server and setting up the job there.  That of course failed because of Murphy’s Law.  The Converter agent wouldn’t install due to insufficient permissions.  So I installed Converter directly on the SQL VM (with the same account I had used to try to push the agent, mind you), stopped the SQL services to ensure the data was static, and ran it.  Other than shuffling a few drive letters around on the converted VM, which a few mouse clicks fixed, it worked like a champ.

How about you?  Have you recently used any useful utilities you hadn’t touched in a while?

Resolving VM MAC Conflict alarm with Veeam Replicas

It’s been a while since I’ve deployed Veeam using replication with vSphere 6.0.  I recently implemented it for a customer who was replicating VMs to a secondary storage appliance in addition to backing the VMs up to a Data Domain.  Upon running the initial replication for a VM, a “VM MAC Conflict” alarm triggered on the replica VM.

[Screenshot: VM MAC Conflict alarm triggered]

Here’s a description of what’s going on and how to prevent the VM MAC Conflict alarm from triggering.

VM MAC Conflict Alarm

The VM MAC Conflict alarm is new to vCenter 6.0 Update 1a.  The intent of the alarm is to warn you if two vNICs on VMs within a vCenter instance have the same MAC address.  This can happen for a variety of reasons:

  1. vCenter malfunctioned and dynamically provided the same MAC address to two or more vNICs.
  2. Either intentionally or mistakenly, an admin or a third party product might have statically assigned a MAC address already in use within the environment.  In this case, Veeam created a copy of the VMX file with identical MAC addresses for the source and replica VM’s vNICs.

It’s a good alarm to have to notify you just in case.  But how do you keep this alarm while stopping it from triggering on replica VMs?

Stopping VM MAC Conflict Alarms from triggering for Veeam Replicas

The solution for preserving the VM MAC Conflict alarm while stopping it from triggering on Veeam replicas is quite simple.  You can modify the alarm itself by setting an exception to exclude VMs.  In the case of Veeam replicas, they have a “_replica” suffix within the VM name by default.  If you changed that suffix in the replica job, just adjust accordingly.

Go to the VM MAC Conflict alarm definition.  It’s in the vCenter inventory object under Manage > Alarm Definitions.  Click the alarm and on the right, click Edit.

Under the bottom box that reads, “The following conditions must be satisfied for the trigger to fire”, add a condition that says the VM name does not end with “_replica”.  Once applied, the alarm disappears for your replica VMs.
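If you want to sanity check what the modified alarm should and shouldn't catch, here's a rough PowerShell sketch of the same logic: group vNICs by MAC address, report any collisions, and ignore VMs whose name ends with the replica suffix.  Get-MacConflict and the sample data are made up for illustration; this is not how vCenter evaluates the alarm internally.

```powershell
# Hypothetical helper: report MAC addresses used by more than one vNIC,
# ignoring VMs whose name ends with the replica suffix (case-sensitive match).
function Get-MacConflict {
    param(
        [Parameter(Mandatory)] [object[]] $Adapter,   # objects with VMName and MacAddress
        [string] $ReplicaSuffix = '_replica'
    )
    $Adapter |
        Where-Object { -not $_.VMName.EndsWith($ReplicaSuffix) } |
        Group-Object -Property MacAddress |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object {
            [pscustomobject]@{
                MacAddress = $_.Name
                VMs        = @($_.Group.VMName)
            }
        }
}

# Sample data standing in for real inventory output (assumed values)
$sample = @(
    [pscustomobject]@{ VMName = 'sql01';         MacAddress = '00:50:56:aa:bb:01' }
    [pscustomobject]@{ VMName = 'sql01_replica'; MacAddress = '00:50:56:aa:bb:01' }
    [pscustomobject]@{ VMName = 'web01';         MacAddress = '00:50:56:aa:bb:02' }
    [pscustomobject]@{ VMName = 'web02';         MacAddress = '00:50:56:aa:bb:02' }
)
Get-MacConflict -Adapter $sample
```

With the sample data, the sql01 pair is ignored because one side is a replica, while the web01/web02 collision is still reported.  Against a live environment, you could probably feed it the output of PowerCLI's Get-VM | Get-NetworkAdapter, mapped to VMName and MacAddress properties, but I haven't wired that part up here.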

[Screenshot: VM MAC Conflict alarm modified]

That’s it!

Troubleshoot VSS errors in whole VM backups

I’ve dealt with many whole VM backup products in my experience with virtualization, including Veeam, VMware Data Protection, Avamar, vRanger Pro, Backup Exec, and more.  With that experience came lots of troubleshooting through various issues.  Originally, this post was going to deal with a recent specific issue I had.  I thought a better post would cover the entire category of problems with these products, so someone could use it to fix any one (or more) of lots of potential root causes, not just a single one.  Many of these troubleshooting steps also help keep your environment healthy and avoid lots of issues, not just issues with backups.

This post focuses specifically on VSS quiescing problems; it is not a definitive guide to all VM backup problems.

Revision Level of Your Backup Product

Oftentimes, the issue has to do with the revision level of your backup product itself.  Generally, it’s good to be on the latest patch level, but not always.  Here are a few things to think about:

  • Is your backup product patched to current?  If not, perhaps look into doing so.
  • Is your backup product compatible with your environment?  Check to ensure it supports the current build of your hypervisor, your hypervisor management software such as SCVMM or vCenter, and the guests you’re backing up, and take appropriate action.
  • Did you install an update to the backup product recently?  If so, perhaps there’s a bug in that update.

Revision Level of Guests That Are Backed Up

Backups that quiesce the file systems of guests depend upon OS components within said guests, and this is especially true of Windows guests, which rely on the Volume Shadow Copy Service (VSS).  VSS, just like any other software, can have bugs in it that need to be fixed, so there are patches to VSS.  Other OS components could also be the culprit.  Ensure your guests are patched to current.  Conversely, if you recently applied patches to your guests, perhaps there are problems with those updates, so you may try removing them.

As a side note, I would recommend using multiple methods of checking your guest patch levels.  For example, while it isn’t the norm, I’ve seen numerous cases of Windows Update saying all patches are installed when a second utility reported missing patches.  Use a second utility to check, such as Microsoft Baseline Security Analyzer (which is free) if the guest is Windows based, to ensure you’re not missing anything.

Also, don’t assume the guests are patched to current.  I recently ran into an issue where the customer somehow hadn’t patched the server… ever.  Somehow it slipped through the cracks.

Hypervisor Revisions

Hypervisors also can cause issues with quiescing.  Some considerations here:

  • Does the build of the hypervisor support the guest having the issue?
  • Are the hypervisors patched to current?  If not, consider updating them.
  • Were the hypervisors recently patched?  If so, perhaps one of the installed patches has a problem, and removing it might resolve the issue.
  • Have the in-guest optimization components, such as VMware Tools, been updated within the guests?  If not, do so.  If this was done recently, perhaps try to downgrade them to see if that resolves the issue.  These are important, as this is typically the means by which the hypervisor issues the command to quiesce the file system within the guest.

Other Guest Considerations

There are other issues that can cause problems with backups.

  • Other backup agents installed within the guest can also cause problems.  Remove any backup agents that are no longer needed.  I personally just ran into this issue with a customer that had an old Backup Exec agent from before they used their current backup product.
  • Applications have their own VSS agents, such as SQL and Exchange.  Sometimes those need to be updated, too.  It can also be that recent updates to them can also cause problems with quiescing.  Look for updates to those, or remove recent updates.
  • Antivirus software has also been known to cause VSS issues.  Try updating the AV agents, disabling them, configuring proper exclusions, or uninstalling and/or reinstalling them.
  • Ensure there is adequate free space within the guests.
  • There are a finite number of shadow copies, and when that limit is reached, it can cause quiescing to fail.  Try removing all shadow copies within the guest using the command:  vssadmin delete shadows /all
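Before deleting shadow copies or uninstalling agents, it can help to see which VSS writers are actually unhappy.  The command vssadmin list writers shows each writer's state and last error, and the sketch below parses that text to list only the writers that aren't Stable or that report an error.  Get-UnhealthyVssWriter is a hypothetical helper name, and the parsing assumes English-language vssadmin output.

```powershell
# Hypothetical helper: parse the text output of "vssadmin list writers" and
# return writers that are not Stable or whose last error is not "No error".
function Get-UnhealthyVssWriter {
    param([string[]] $Line)
    $writers = @()
    $current = $null
    foreach ($l in $Line) {
        if ($l -match "^Writer name:\s*'(.+)'") {
            # A new writer section starts here
            $current = [pscustomobject]@{ Name = $Matches[1]; State = ''; LastError = '' }
            $writers += $current
        }
        elseif ($current -and $l -match '^\s*State:\s*\[\d+\]\s*(.+?)\s*$') {
            $current.State = $Matches[1]
        }
        elseif ($current -and $l -match '^\s*Last error:\s*(.+?)\s*$') {
            $current.LastError = $Matches[1]
        }
    }
    $writers | Where-Object { $_.State -ne 'Stable' -or $_.LastError -ne 'No error' }
}

# Sample text in the shape vssadmin prints (values are illustrative)
$sample = @(
    "Writer name: 'System Writer'"
    "   State: [1] Stable"
    "   Last error: No error"
    ""
    "Writer name: 'SqlServerWriter'"
    "   State: [8] Failed"
    "   Last error: Non-retryable error"
)
Get-UnhealthyVssWriter -Line $sample
```

Inside a guest, you would run it from an elevated prompt as Get-UnhealthyVssWriter -Line (vssadmin list writers).  A writer that shows up here, such as the SQL one in the sample, points you at which application or agent to look at first.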

Hopefully, this provides you with some ideas to try to resolve the issue you’re experiencing.

Do you have any other tips for resolving VSS issues with whole VM backups?

Change Block Tracking issues with SRM

As it may be obvious, I’ve been doing quite a bit of work with VMware Site Recovery Manager with storage based replication lately, specifically EMC’s MirrorView.  I ran into another issue while testing with SRM 6 + ESXi 5.0 hosts.

During the project, the plan was to update vCenter from 5.0 to 6.0 and SRM from 5.0 to 6.0, verify everything worked, and then proceed with updating the ESXi hosts.  We didn’t bother patching the ESXi 5.0 hosts, since they would be updated to 6.0 soon enough.  We wanted to make sure SRM worked through vCenter before updating ESXi simply to ensure an easy rollback.

However, during failover testing, we ran into an issue where most VMs would not power on during isolated testing and failovers.  The error was as follows:

Error – Cannot open the disk ‘/vmfs/volumes/<VMFS GUID>/VMName/VMName.vmdk’ or one of the snapshot disks it depends on.

When you look into the events for an impacted VM, you would find the following:

“Could not open/create change tracking file”

We cleared CBT files for all the VMs, and tried again, forcing replication, and it worked.  We figured CBT got corrupted.  But then Veeam ran its backups, we tried an isolated test, and almost all the VMs couldn’t power on in an isolated test again.

I know ESXi 6 has been in the news lately for corruption in Change Block Tracking, but it’s far from the only version that’s suffered from an issue with CBT.  ESXi 5.0, 5.1, and 5.5 have had their issues, too.  In this case, the customer was running a version that needed a patch to fix CBT.  We remediated the hosts to patch them to current, reset CBT data yet again, allowed Veeam to back up the VMs, and tried an isolated test.  All VMs powered on successfully.

It’s important to note that Veeam really had nothing to do with this problem, and neither did MirrorView.  This was strictly an unpatched ESXi 5.0 issue.  So, if you run into this with any ESXi version using storage based replication, I recommend patching the hosts to current, resetting CBT data, running another backup, making sure the storage has replicated the LUN after that point, and trying again.

vSphere 6.0 Change Block Tracking Patch released

Just a heads up, but VMware dropped the public release of the patch to resolve the Change Block Tracking problem in ESXi 6.0.  You can apply the patch using VMware Update Manager, or install it manually.

Logically, remember that you can’t just apply the patch and all is well.  You need to reset CBT data to “start fresh” because all changed blocks reported prior to the patch are still suspect.  Most backup vendors detail how you do this in VMware, but I wanted to share a few tips in this regard.

  1. Change Block Tracking can easily be disabled/enabled on Powered Off VMs.  That’s not an issue.
  2. You can reset Change Block Tracking information on a VM by disabling CBT on the VM, taking a snapshot of the VM, deleting the snapshot, and then re-enabling CBT.  This makes automating the reset very easy.  Veeam has a PowerCLI script that can do this as an example, although it is clearly a use at your own risk affair.
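To make the reset sequence in the second tip concrete, here's a minimal PowerCLI sketch of it.  This is my own illustration, not Veeam's script.  It assumes VMware PowerCLI is installed and you've already connected with Connect-VIServer; Reset-VmCbt and the snapshot name are hypothetical.  Treat it as use at your own risk and try it on a non-critical VM first.

```powershell
# Sketch only: requires VMware PowerCLI and an existing Connect-VIServer session.
# Reset-VmCbt is a hypothetical name, not a built-in cmdlet.
function Reset-VmCbt {
    param([Parameter(Mandatory)] $VM)   # VM object as returned by Get-VM

    # 1. Disable CBT on the VM
    $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
    $spec.ChangeTrackingEnabled = $false
    $VM.ExtensionData.ReconfigVM($spec)

    # 2. Take a snapshot and delete it right away
    $snap = New-Snapshot -VM $VM -Name 'CBT reset'
    Remove-Snapshot -Snapshot $snap -Confirm:$false

    # 3. Re-enable CBT
    $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
    $spec.ChangeTrackingEnabled = $true
    $VM.ExtensionData.ReconfigVM($spec)
}

# Example usage (commented out so nothing runs by accident):
# Get-VM -Name 'app01' | ForEach-Object { Reset-VmCbt -VM $_ }
```

As far as I know, on a powered on VM the re-enabled setting only becomes active at the next snapshot, such as the one your backup job takes, so expect the first backup after the reset to read the entire VM.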

Finally, don’t forget to enable CBT on your backup jobs and/or VMs when you’re ready if that was disabled as a workaround.  You can do this using PowerShell if you’re using Veeam.

 

Disable CBT on Veeam jobs via PowerShell

If you haven’t heard the not so great news, VMware has discovered a bug in vSphere 6 with Change Block Tracking (CBT) that can cause your backups to be corrupt and therefore invalid.  Currently, they are recommending against using CBT with vSphere 6 when backing up your VMs.

I was looking for an easy way to disable this on all jobs in Veeam quickly via PowerShell, but it’s not obvious how to do that, so I took some time to figure it out.  Here it is, assuming the Veeam module is loaded in your PowerShell session.

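# Get every job of type "Backup"
$backupjobs = Get-VBRJob | Where-Object { $_.JobType -eq "Backup" }
foreach ($job in $backupjobs) {
    # Read the job's options, turn CBT off, and write the options back
    $joboptions = $job | Get-VBRJobOptions
    $joboptions.ViSourceOptions.UseChangeTracking = $false
    $job | Set-VBRJobOptions -Options $joboptions
}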

Here’s how to enable it again:

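# Get every job of type "Backup"
$backupjobs = Get-VBRJob | Where-Object { $_.JobType -eq "Backup" }
foreach ($job in $backupjobs) {
    # Read the job's options, turn CBT back on, and write the options back
    $joboptions = $job | Get-VBRJobOptions
    $joboptions.ViSourceOptions.UseChangeTracking = $true
    # Uncomment the next line only if that option was switched off in the job setup
    #$joboptions.ViSourceOptions.EnableChangeTracking = $true
    $job | Set-VBRJobOptions -Options $joboptions
}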

Sorry it’s not pretty on the page, but I wanted to get this out ASAP to help anyone needing to do this quickly and effectively.

One thing to note is that the enable script contains a commented-out line.  If you turned CBT off manually in the job setup, the companion option that enables CBT within VMware when it is turned off gets disabled too.  If you disabled CBT with my script, that option isn’t touched, so you don’t need to remove the # on that line.  If you want that option enabled again, take out the # before that line, and the script will enable it again.

Hope this helps!