
Evolution of storage – traditional to hyperconverged

These days, there’s been an explosion in the diversity of storage options, which often bleed into compute and/or networking when it comes to virtualized architecture.  It used to be that storage was storage, networking was networking, and compute was compute.  And when it came to storage, while architectures differed, whatever you stored your virtual machines on did storage and storage only.  EMC CLARiiON/VNX, NetApp filers, iSCSI targets like LeftHand, Compellent, EqualLogic, etc.  These were all storage and storage only.

Some of these added SSDs, whether as permanent storage or as an expanded caching tier.  We also saw the emergence of all-flash storage arrays that attempted to make the most of SSD, using technologies like compression and deduplication to overcome SSD’s inherent weakness: high cost per unit of storage.  These arrays are often architected from the ground up to work best with SSD, taking into account the garbage collection needed to reuse space on SSDs.

But these are also all storage-only devices.

Over time, that’s changed.  We now have converged infrastructure, such as VCE and FlexPod, but those typically still use devices dedicated to storage.  VCE Vblock and VxRack use EMC arrays.  FlexPod uses NetApp filers.  These are prepackaged validated designs built in the factory, but they still use traditional storage arrays.

Keep in mind, I don’t think there’s anything inherently wrong with this or any of these architectures.  I’m just laying down the framework to describe the different options available.

Now, we do have options that truly move away from the concept of buying a dedicated storage array, called hyperconverged.  They still provide shared storage in the sense that your VMs can be automatically restarted on a different host should the host they are running on go down.  There’s still (when architected and configured properly) no single point of failure.  But this category doesn’t use a dedicated storage device.  Instead, it effectively utilizes local storage/DAS connected to multiple compute units, pooled together with special sauce to turn that storage into highly available, scalable storage, usually for use with virtualization.  In fact, many options only work with virtualization.  These tend to use commodity hardware in terms of x86 processors, RAM, and disk types, although many companies sell their own hardware with these components in them, and/or work with server hardware partners to build their hardware for them.

The common thread between them, though, is that you’re not buying a storage array.  You’re buying compute + storage + special sauce software, which together comprise the total solution.

Examples of these options include Nutanix, VMware VSAN (or EVO:RAIL, which utilizes it), SimpliVity, and ScaleIO.  You will see more emerging, and there are plenty I didn’t mention, simply because this isn’t intended to be a definitive list.

While there are good solutions within each of these types of storage, none of the types is perfect.  None works best for everyone, despite what any technical marketing will try to tell you.

So while there are more good storage choices than there have ever been, it’s also harder to choose a storage product than it has ever been.  My goal in these posts is to lay a foundation for understanding these different options, which might help people sort through them better.

Consider donating to the Mozilla Foundation

‘Tis the season of charitable giving.  Have you ever donated money to the Mozilla Foundation, or considered doing so?

I know many people have a browser of choice.  I like Chrome, Firefox, and even Internet Explorer.  I use whatever works.  I find that while all three generally work on most sites, IE works best with Microsoft-based technologies within web pages, Chrome generally works best on anything related to Google, and Firefox often works well when the others don’t.

Funny how that works out, huh?  Of course Microsoft technologies and websites work best with Microsoft’s browser.  Of course Google-related sites work best with Chrome; that’s precisely why Google developed it in the first place.  Both have a vested interest in making sure their stuff gives the best experience in their respective browsers.

Honestly, I think all three would like everything to run best in their browsers, but that’s a tall order, and of course what’s important to their respective interests comes first.  And that’s why it naturally worked out that way.

But what are Mozilla’s interests?  Who funds them?  Royalties from in-browser internet searches, and charitable donations, some from people like you and me.

And is it any wonder, then, that their browser fills in the holes nicely where Chrome and IE fall flat on their faces?  As an engineer who deploys EMC VNXs frequently, I’ve found that Firefox has worked by far the best for me over the last year or two (thanks, evil Java!).

I know it’s perhaps not the noblest of charitable donations.  I’m not even here to try to persuade you to donate $1,000, $100, or even $50.

But how about $5, $10, or $20?  Has Firefox bailed you out when all other browsers failed you?  Was that worth at least $5?

If so, consider donating!  It’s tax deductible, too.

Hyper-V 2012 R2 not able to form cluster

Ran into an interesting problem with a colleague.  He was trying to form a basic Hyper-V cluster on Windows Server 2012 R2, but kept getting the following error:

Event ID: 1570
Source: Microsoft-Windows-FailoverClustering
Event Details:
Node 'Host1' failed to establish a communication session while joining the cluster.  This is due to an authentication failure.  Please verify that the nodes are running compatible versions of the cluster service software.

We verified DNS settings, disjoined and rejoined Active Directory, verified the host’s computer account was valid, confirmed time sync with the domain was good, confirmed his account had sufficient rights to form the cluster, validated the nodes for clustering, and more.

At that point, we began looking at GPO settings like “Access this computer from the network”, and noticed that Authenticated Users was not in there.  Simply adding Authenticated Users and refreshing the GPO on the cluster nodes resolved the issue.
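If you want to check where a node stands before changing anything, here’s a minimal sketch (the export path is just an example) that dumps the effective user rights assignments and looks for “Access this computer from the network”, which is stored as SeNetworkLogonRight.  Authenticated Users shows up as SID S-1-5-11:

# Export the effective user rights assignments to an INF file
secedit /export /areas USER_RIGHTS /cfg C:\Temp\rights.inf

# SeNetworkLogonRight = "Access this computer from the network";
# verify that Authenticated Users (S-1-5-11) appears in the list
Select-String -Path C:\Temp\rights.inf -Pattern 'SeNetworkLogonRight'

# After fixing the GPO, refresh policy on each node
gpupdate /force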

Be careful making changes to these types of settings.  While Authenticated Users may seem like a group you would want to remove from a policy like that, doing so will often cause problems down the road.

Change Block Tracking issues with SRM

As may be obvious, I’ve been doing quite a bit of work lately with VMware Site Recovery Manager and storage-based replication, specifically EMC’s MirrorView.  I ran into another issue while testing with SRM 6 + ESXi 5.0 hosts.

During the project, we were updating vCenter from 5.0 to 6.0 and SRM from 5.0 to 6.0, verifying everything worked, and then proceeding with updating the ESXi hosts.  We didn’t bother patching the ESXi 5.0 hosts, since they would be updated to 6.0 soon enough.  We wanted to make sure SRM worked through vCenter before updating ESXi, simply to ensure an easy rollback.

However, during failover testing, we ran into an issue where most VMs would not power on during isolated tests and failovers.  The error was as follows:

Error – Cannot open the disk ‘/vmfs/volumes/<VMFS GUID>/VMName/VMName.vmdk’ or one of the snapshot disks it depends on.

When you look into the events for an impacted VM, you find the following:

“Could not open/create change tracking file”

We cleared the CBT files for all the VMs, forced replication, and tried again, and it worked.  We figured CBT had gotten corrupted.  But then Veeam ran its backups, we tried an isolated test, and almost all the VMs couldn’t power on in the isolated test again.

I know ESXi 6 has been in the news lately for corruption in Change Block Tracking, but it’s far from the only version that’s suffered from a CBT issue.  ESXi 5.0, 5.1, and 5.5 have had their issues, too.  In this case, the customer was running a version that needed a patch to fix CBT.  We remediated the hosts to patch them to current, reset the CBT data yet again, allowed Veeam to back up the VMs, and tried an isolated test.  All VMs powered on successfully.

It’s important to note that Veeam really had nothing to do with this problem, and neither did MirrorView.  This was strictly an unpatched ESXi 5.0 issue.  So, if you run into this with any ESXi version using storage-based replication, I recommend patching the hosts to current, resetting the CBT data, running another backup, making sure the storage replicated the LUN after that point, and trying again.
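For the “resetting the CBT data” step, here’s a rough PowerCLI sketch of the idea (the vCenter and VM names are placeholders, and depending on the ESXi version the VM may need a power cycle instead of a snapshot cycle for the change to take):

# Minimal sketch: disable CBT on a VM, then cycle a snapshot so ESXi
# discards the stale -ctk.vmdk state.  Your backup software typically
# re-enables CBT on its next run.
Connect-VIServer -Server 'vcenter.example.local'

$vm = Get-VM -Name 'TestVM'

$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.ChangeTrackingEnabled = $false
$vm.ExtensionData.ReconfigVM($spec)

# Creating and removing a snapshot forces the CBT change to take effect
New-Snapshot -VM $vm -Name 'cbt-reset' | Remove-Snapshot -Confirm:$false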

Using PowerShell when there isn’t PowerShell support

I know many of us work on lots of different technologies, many of which don’t have native PowerShell cmdlets or anything like that.  Sometimes it’s DOS; sometimes it’s Telnetting/SSHing into a command line where you have to run individual command strings to fix a bunch of individual objects.  I know many of you end up hacking stuff together using Excel or other tools to assemble a repeated command to fix multiple objects, create rules, or whatever, like…

First part of command object1 second part of command

First part of command object2 second part of command

And you’ve got a list of all the objects you have to do this on.  This can be painful.

Let me give you an example…

I was working on an issue with an old version of EMC RecoverPoint, which has no PowerShell integration.

Basically, the customer masked some LUNs to VMAX front-end ports that aren’t hooked up, and RecoverPoint is barking because it can’t access those ports.  So the customer has to unmap the front-end ports and unmask the LUNs.  I know for many of you this is a garbledy gook of tech you don’t work with.  In the end, the specific technology doesn’t matter.

RecoverPoint reports all the volumes that are the problem, like this:

Devices: 2B3B,277F,83D8,2B34,2250,21DD,2774,102A,21E2,281E,102B,281F,83D5,83E1,12B7,83CB,83DC,83DF,2775,83DB,24BB,83CE,818D,83D9,2784,2776,83CD,83DA,12CF,281D,83E3,0FB4,83D0,2B50,83CC,0FA3,8037,0FB3,83D1,2772,8196,83D4,83CF,83E2,83D3,83D7,2773,277E,12CC,12C9,8038,83DE,8036,1518,83D6,83D2,83DD,83E0

The first thing I need is an array of these I can pump into a loop.

This is stupid simple in PowerShell.  Each device is separated by a comma, so I can just use the comma as the split character.

(I cut off the long string of devices; you get the idea.)

$devicelist = "2B3B,277F,83D8,2B34,2250,21DD"

$devices = $devicelist.split(',')

Now, if you type $devices, you get:

2B3B

277F

83D8

2B34

2250

21DD

Now we have our simple array.

Another helpful thing to know: if you have a sequence of numbers, you can use another PowerShell trick.  Say I need an array of objects named object1 through object10.  Also easy:

$objects = 1..10 | foreach-object {"object" + $_}

Type $objects and you get:

object1

object2

object3

object4

object5

object6

object7

object8

object9

object10

Yes, you can do this for IPs.  Say I want an array of all the host IPs in 192.168.0.0/24, so I can ping them or whatever.

$ips = 1..254 | foreach-object {'192.168.0.' + $_}
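And since the point was to ping them, here’s a quick follow-on that sweeps that list with Test-Connection (one echo each) and keeps only the IPs that respond:

# Keep only the IPs that answer a single ping
$alive = $ips | Where-Object { Test-Connection -ComputerName $_ -Count 1 -Quiet }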

Maybe port ranges with “TCP” in front for firewall rule statements.

$tcpports = 3000..4000 | foreach-object {"TCP" + $_}

Now, I need command string stuff added in front of and behind each of these.  Again, it doesn’t matter what tech you’re working on; just put in your garbledy gook that I wouldn’t understand.  $_ is the current element of the array.

$commands = $devices | foreach-object {'symconfigure -sid 1234 -cmd "unmap dev ' + $_ + ' from dir ALL:ALL;" commit'}

If I type $commands, I get:

symconfigure -sid 1234 -cmd "unmap dev 2B3B from dir ALL:ALL;" commit

symconfigure -sid 1234 -cmd "unmap dev 277F from dir ALL:ALL;" commit

symconfigure -sid 1234 -cmd "unmap dev 83D8 from dir ALL:ALL;" commit

symconfigure -sid 1234 -cmd "unmap dev 2B34 from dir ALL:ALL;" commit

symconfigure -sid 1234 -cmd "unmap dev 2250 from dir ALL:ALL;" commit

symconfigure -sid 1234 -cmd "unmap dev 21DD from dir ALL:ALL;" commit

BAM!  We’ve got our commands, and we’re rolling.  If I want to save the commands as a text file…

$commands | out-file c:\dir\ourcoolscript.txt

Now I can copy/paste them into a PuTTY/Telnet session, or upload the script file and launch it if that’s possible, whatever I want to do.
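If the device takes SSH, you can even push the commands straight from PowerShell instead of pasting.  A hedged sketch using PuTTY’s plink.exe (the host, user, and password are placeholders, and it assumes plink is in your PATH):

# Run each generated command over SSH via plink (PuTTY's CLI client);
# -batch prevents interactive prompts from hanging the loop
$commands | ForEach-Object {
    & plink.exe -batch -ssh admin@mgmt-host.example.local -pw 'P@ssw0rd' $_
}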

WAY faster, in my opinion, than duct-taping a solution together with Excel or other weird methods, and far more flexible.

So even if your technologies don’t have PowerShell, you can still use PowerShell!

Taking scripting too far?

I love scripting, and I am a huge advocate of PowerShell.  I seemingly talk all the time about how it can be leveraged to customers who don’t leverage it.  I constantly encourage customers to make use of it to become more efficient.

But…  is it possible to take scripting too far?  Of course.

I stumbled across this article about a sysadmin who automated his job to an arguably ridiculous degree.

Actually, I shouldn’t say he arguably went too far.  He definitely did.  To me, the worst example in the article is where he automated rolling back one of his users’ databases based on the contents of an email, if it came from a particular end user.

Scripting, or more specifically automation, is beneficial to virtually anyone in the IT field.  I applaud almost all efforts to do it.  However, scripting gets dicey when you begin to automate decision making, especially complex decision making.

Don’t get me wrong, decision making in scripts is possible and beneficial, but it shouldn’t always be used.  I’ve many times included conditional logic in a script, and it was absolutely essential to accomplishing the goal of the script.  However, sometimes decisions are just too complex to make based on limited information.

In this case, I have a lot of problems with what he set up.  First off, how on earth can you tell, just from some keywords in the contents of an email, that you should roll back the database, without the end user specifically asking for a rollback?  Even if the end user requested it, if the end user doesn’t know how to do it, there’s a pretty decent chance that it isn’t the best solution anyway.

Secondly, I seriously doubt the email was authenticated as being from this specific user.  That is, if this type of automation were widespread, given the general security posture of most email systems, it could be trivial to exploit it to cause a day’s worth of data loss.

With all this said, I generally see the opposite problem, customers not automating anything, rather than customers automating things they shouldn’t.  But this does demonstrate that it’s possible to go to the other extreme.

Adventures in SRM 6.0 and MirrorView

Recently, I set up SRM 6.0 with MirrorView storage-based replication.  It was quite the adventure.  The environment was using SRM 5.0 and MirrorView, and we upgraded it to vSphere 6.0 and SRM 6.0.  I wanted to get my findings down in case they help others setting this up.  I found that when I ran into issues, it wasn’t easy finding people who were doing this, as many who use VNXs are now using RecoverPoint instead of MirrorView.

Version Support

First off, you might be wondering why I recently deployed SRM 6.0 instead of 6.1.  That’s an easy question to answer: currently, there is no support for MirrorView with SRM 6.1.  I’m posting this article in 11/2015, so that may change.  Until it does, you’ll need to go with SRM 6.0 if you want to use MirrorView.

Installation of Storage Replication Adapter

I’m assuming you have already installed SRM and configured the pairings and whatnot.  At the very least, have SRM installed in both sites before you proceed.

Here’s where things got a little goofy.  First off, downloading the SRA is confusing.  If you go to VMware’s site to download SRAs, you’ll see two listings for the SRA with different names, suggesting they work for different arrays, do something different, or are different components.

[Screenshot: the two MirrorView SRA listings on VMware’s download site]

As far as I can tell, they’re actually two slightly different versions of the SRA.  Why are they both on the site for download?  No idea.  So I went with the newer of the two.

You also need to download and install Navisphere CLI from EMC for the SRA to work.  Install it first.  There are a few gotchas on this install to be aware of.

During installation, you need to ensure you check the box “Include Navisphere CLI in the system environment path.”

[Screenshot: the Navisphere CLI installer with “Include Navisphere CLI in the system environment path” checked]

That’s listed in the release notes of the SRA, so that was easy to know.  You also need to select not to store credentials in a security file.

I originally told it to store credentials, thinking this could allow easier manual use of Navisphere CLI should the need arise, but I ended up having issues with the SRA authenticating to the arrays.  So I uninstalled and reinstalled Navisphere CLI without that option, and the bad authentication messages went away.

Next, install the SRA, which is straightforward.  After the installation of the SRA, you must reboot the SRM servers, or they will not detect that they have SRAs installed.  That takes care of the SRAs.

Configuring the SRAs

Once you have installed the SRAs, it’s time to configure the array pairs.  First, go into Site Recovery within the vSphere Web Client, and click Array Based Replication.

[Screenshot: Array Based Replication in the vSphere Web Client]

Next, click Add Array Manager.

[Screenshot: the Add Array Manager button]

Assuming you’re adding arrays from two sites, click “Add a pair of array managers”.

[Screenshot: the “Add a pair of array managers” option]

Select the SRM Site location pair for the two arrays.

[Screenshot: selecting the SRM site location pair]

Select the SRA type of EMC VNX SRA.

[Screenshot: selecting the EMC VNX SRA type]

Enter the display name, the management IPs of the array, filters for the mirrors or consistency groups if you are using MirrorView for multiple applications, and the username and password for the array, for each site.  Be sure to enter the correct array info for the indicated site.

[Screenshot: the array manager connection details]

I always create a dedicated SRM service account within the array, so it’s easy to audit when SRM initiates actions on the storage array.

You’ll need to fill out the information for each site’s array.

Keep the array pair checked and click Next.

[Screenshot: enabling the array pair]

Review the summary of actions and click Finish.

At this point, you can check the array in each site and see if it is aware of your mirrors being replicated.

[Screenshot: the discovered replicated devices for the array pair]
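If you’d rather double-check from the command line too, here’s a hedged sketch using the Navisphere CLI you installed earlier (the SP address and credentials are placeholders; use the -sync or -async form depending on whether you run MirrorView/S or MirrorView/A):

# List MirrorView/S mirrors as the array sees them
& naviseccli -h 10.0.0.50 -User srm_svc -Password 'P@ssw0rd' -Scope 0 mirror -sync -list

# For MirrorView/A (asynchronous) mirrors
& naviseccli -h 10.0.0.50 -User srm_svc -Password 'P@ssw0rd' -Scope 0 mirror -async -list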

So far so good!  At this point, you should be able to create your protection groups and recovery plans, and start performing tests and recoveries with a test VM.

Problems

I began testing with a test consistency group within MirrorView, which contained one LUN, which stored a test VM.  Test mode worked immediately to DR.  Failover to the DR site failed, as the first attempt often does in my experience with most storage-based replication deployments.  Normally that’s no problem: I simply launch it again and it works, and that’s what happened in this case.

With the VM then in the DR site, I performed an isolated test back to production, which worked flawlessly.  It’s when I tried to fail back to production that I encountered a serious problem.  SRM reported that the LUN could not be promoted.  Within SRM, I was given only the option to try the failover again; the icons for cleanup and test were grayed out.  Relaunching the failover produced the same result.  I tried rebooting both SRM servers and vCenter, running rediscovery of the SRAs, you name it.  I was stuck.

I decided to just manually clean everything up myself.  I promoted the mirror in the production site and had hosts in both sites rescan for storage.  The LUN became unavailable in the DR site, but in production, while the LUN was visible as an available LUN, the datastore wouldn’t mount.  Rebooting the ESXi server didn’t help.  I finally added it as a datastore, selecting not to resignature it.  The datastore mounted, but I found that it wouldn’t mount again after a host reboot.  Furthermore, SRM was reporting the MirrorView consistency group as stuck failing over, showing “Failover in Progress.”  I tried recreating the SRM protection group, re-adding the array pairs, and more, but nothing worked.

After messing with it for a while, checking MirrorView and the VNX, VMware, etc., I gave up and contacted EMC support, who promptly had me call VMware support, who referred me back to EMC again because it was clearly a problem with EMC’s SRA.

With EMC’s help, I was able to clean up the mess SRM/SRA made.

  1. The “Failover in Progress” status reported by the SRA was due to the description fields on the MirrorView mirrors.  Clearing those and rescanning the SRAs fixed that problem.
  2. The test LUN not mounting was due to me not selecting to resignature the VMFS datastore when I added it back in.

At this point, we were back to square one, and I went through the gamut of tests.  I got errors because the SRM placeholders were reporting as invalid.  Going to the protection group within SRM and issuing the command to recreate the SRM placeholders fixed this issue.

We repeated the testing again.  This time, everything worked, even failback.  Why did it fail before?  Even EMC support had no answer.  I suspect it’s because the first failover attempt in any given direction in an SRM environment always seems to fail.  Unfortunately, this time it was very difficult to fix.

Matching $25 donation for Movember

’Tis the season for charitable giving, so I wanted to raise awareness of a promotion for Movember, a great organization that raises money for men’s health issues, including cancer and others.  You may be familiar with their annual “get guys to not shave” event during the month of November.

While I didn’t do that, I am donating to their cause, and you should, too!

The promotion: donate using the VISA Checkout system, and VISA will match up to $25 of your donation, up to $1,000,000 in total matching.  Let’s make them donate that entire million dollars to a good cause!

Be sure to make your donation by 12/6/2015 to get that match!