This Week in Ad-Hoc Scripting: Time-Randomized vMotion Testing (DRS Simulator?)

There is no TL;DR this week because I get to the code pretty quickly.

I got vMotion moves.

The release date of No Time to Die has been postponed again . . .

I KNOW, RIGHT? I mean it was bad enough that the Idris Elba rumors didn’t pan out. But now I have to wait again? FML.

This is like the worst thing that could possibly happen right now in October 2020. . . .

Anyway, this week I want to let you in on a little secret that more seasoned VMware Admins will only tell you in the back rooms of breweries:

vMotion isn’t flawless. It will occasionally break stuff.

We ran into this situation on some of our Red Hat machines. It appears that, depending on what’s running on the VM, time synchronization after a vMotion event can break applications. There is a fix for it here: https://kb.vmware.com/s/article/1189. The temporary workaround is to downgrade the Automation Level of the VM(s) or cluster to “Partially Automated”, which has its own consequences.

The additional challenge is that this problem is intermittent, so it’s difficult to pin down.

So, long story short, we need a way to test vMotion with the newly applied settings in a way that closely resembles Fully Automated DRS. So I meticulously coded out a well-planned solution . . . I mean, threw something together.

What the Script Does

The script will:

  1. Select the VM(s) of your choice using the Get-VM CmdLet. This would be the (NON-PRODUCTION!) VM(s) with the applied fix, as above.
  2. Randomly select a different host in the cluster as a destination for the vMotion of each VM in question.
  3. Let you choose a time frame with a randomized minimum/maximum for vMotions. For example, you can say that within a window of 10-30 minutes, the script should pick a time at random in that range and then move the VMs. You can be as aggressive or as lax with the frequency of movement as you want here, simply by tweaking the min/max values.
  4. Select a max number of times to move the VMs and once that max is reached, stop moving them.

I am always nervous about sharing these types of scripts. I always envision people looking over my scripts and reacting the way Art does in that scene in Christmas Vacation.

That’s two pop-culture/movie references in the same post, if you’re keeping track.

And, side note, the above . . . pseudo-code (YEAH! I feel like a real dev now!) . . . is a perfect example of by-the-book kinds of scripting concepts: random number generation, arrays, while loops, and so on. If you are honing your PowerCLI scripting skills, this is a good one to try on your own that has some subtle challenges.

Or, you can just use what I am providing. . . .

The Code Details

Before I link out to the GitHub repo for this, you should be forewarned that I am leveraging a previously mentioned codebase. This script recycles a function from some code that is part of what I call my “VMware Toolkit”. The details are here and the repo I am talking about is at https://github.com/bryansullins/myvmwarekit.

The script I made is called vMotionTest.ps1 in the “StandaloneScripts” directory here.

The assumption I make here is that you have already imported the MY.VMWAREKIT module. The method for doing this is in the Installation details of the repo’s README.

The function I am using from the Toolkit is called Move-MyVMOffCurrHost, which is a recent addition to the Toolkit. This does the heavy lifting of choosing a random host in the cluster and performing the vMotion. You can look that over if you like, but let’s focus on the script at hand.
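If you want to try the "pick a random other host" part on your own first, here is a minimal sketch of the idea. To be clear, this is not the code behind Move-MyVMOffCurrHost. Select-RandomOtherHost is a hypothetical helper I am making up for illustration, and it only works on host names so you can play with it without touching vCenter.

```powershell
# Hypothetical helper (NOT the toolkit's implementation): given the current host
# and the list of hosts in the cluster, return a random host that is not the current one.
function Select-RandomOtherHost {
    param(
        [Parameter(Mandatory)] [string]   $CurrentHost,
        [Parameter(Mandatory)] [string[]] $ClusterHosts
    )

    # Drop the host the VM is already on; @() keeps the result an array even with one match.
    $candidates = @($ClusterHosts | Where-Object { $_ -ne $CurrentHost })

    if ($candidates.Count -eq 0) {
        throw "No other host is available in the cluster."
    }

    # Get-Random with pipeline input returns one randomly chosen element.
    $candidates | Get-Random
}

# Possible use with PowerCLI, assuming an active Connect-VIServer session:
# $vm     = Get-VM -Name "VMNAMEHERE"
# $hosts  = Get-Cluster -VM $vm | Get-VMHost | Where-Object ConnectionState -eq 'Connected'
# $target = Select-RandomOtherHost -CurrentHost $vm.VMHost.Name -ClusterHosts $hosts.Name
# Move-VM -VM $vm -Destination (Get-VMHost -Name $target)
```

Filtering out the current host before calling Get-Random means you never "move" a VM to the host it is already on, which would make a test round a no-op.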

Keep in mind that this script is currently built to run interactively, hence the interactive logon for vCenter and the Write-Host lines, but you can certainly “Jenkins this up” if you like. I plan on doing a future post very soon about Jenkins (stay tuned!).

Anyway, let’s take a look:

## vMotion Test Script
# Connect to vCenter in Question and get desired VMs:
$vCenter = "your.vcenter.here"
Connect-VIServer -Server $vCenter
# Alter this line to get the VM(s) you want to test. Use Where-Object or Regex and go crazy! This is just a sample using 1 VM for testing:
$vMotionTestVMs = Get-VM -Name "VMNAMEHERE"

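# Run parameters for the test loop:
$NumTimestoMove = 1      # Loop counter: which round of vMotions we are on (starts at 1)
$TimetoMoveMax = 30      # Total number of vMotion rounds before the script stops
$MinimumSeconds = 600    # Shortest wait between rounds (10 minutes)
$MaximumSeconds = 1800   # Upper bound on the wait between rounds (30 minutes)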

While($NumTimestoMove -le $TimetoMoveMax) {
    Write-Host "This is $NumTimesToMove of $TimetoMoveMax vMotions."
    # Don't wait if it's the first time through the loop . . .
    If ($NumTimestoMove -eq 1){
        $Wait = 0
    } 
    Else {
        # Get-Random treats -Maximum as exclusive, so the longest wait is $MaximumSeconds - 1.
        $Wait = Get-Random -Minimum $MinimumSeconds -Maximum $MaximumSeconds
        Write-Host "Waiting $Wait Seconds . . ."
    }

    Start-Sleep -Seconds $Wait 
    
    ForEach ($v in $vMotionTestVMs) {
        Move-MyVMOffCurrHost -VMName $v
    }

    $NumTimestoMove += 1
}

Most of this is pretty self-explanatory and I made comments where I could – the main flexibility is in the setup parameter variables. These lines:

$MinimumSeconds = 600
$MaximumSeconds = 1800

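The above will wait anywhere from 10 minutes to just under 30 minutes (Get-Random excludes the maximum value) and then move the VMs. The very first round does not wait at all. You can make the moves more frequent by decreasing these numbers.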
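If you are wondering how long a full run will take with a given set of values, a quick back-of-the-napkin sketch like the one below can tell you before you kick it off. It reuses the variable names from the script, the $Waits, $ShortestRun, and $LongestRun names are mine, and it only counts the waiting time, not however long the vMotions themselves take.

```powershell
# Same values as in the script.
$TimetoMoveMax  = 30
$MinimumSeconds = 600
$MaximumSeconds = 1800

# The first round runs immediately, so only ($TimetoMoveMax - 1) waits ever happen.
$Waits = $TimetoMoveMax - 1

# Get-Random excludes -Maximum, so the longest single wait is $MaximumSeconds - 1.
$ShortestRun = [TimeSpan]::FromSeconds($Waits * $MinimumSeconds)
$LongestRun  = [TimeSpan]::FromSeconds($Waits * ($MaximumSeconds - 1))

Write-Host "Shortest possible wait time for the run: $ShortestRun"
Write-Host "Longest possible wait time for the run:  $LongestRun"
```

If the longest run is shorter than the window you want to test over, raise $TimetoMoveMax rather than stretching the min/max seconds, so the moves stay as frequent as you intended.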

You can also specify how many times per script-run the vMotions will occur by tweaking the $TimetoMoveMax variable.

How To Figure Out How Often the VMs are Moved?

That’s easy. The toolkit mentioned above includes a module function called Get-MYvMotionHistory.
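If you don't have the toolkit loaded, here is a rough sketch of one way to get a similar count straight from vCenter events with the stock Get-VIEvent cmdlet. This is not how Get-MYvMotionHistory works (go look at the repo for that). Get-VMotionCount is a hypothetical name of my own, and I am assuming that completed migrations show up as VmMigratedEvent or DrsVmMigratedEvent entries for the VM.

```powershell
# Hypothetical helper, not part of MY.VMWAREKIT: counts completed migration
# events for one VM since a given point in time.
function Get-VMotionCount {
    param(
        [Parameter(Mandatory)] [string]   $VMName,
        [Parameter(Mandatory)] [datetime] $Since
    )

    $vm = Get-VM -Name $VMName

    # Assumption: VmMigratedEvent marks a completed manual/scripted migration and
    # DrsVmMigratedEvent marks a DRS-initiated one. Matching on the type name
    # keeps the filter a plain string comparison.
    $migrations = @(
        Get-VIEvent -Entity $vm -Start $Since -MaxSamples ([int]::MaxValue) |
            Where-Object { $_.GetType().Name -in 'VmMigratedEvent', 'DrsVmMigratedEvent' }
    )

    $migrations.Count
}

# Possible use, assuming an active Connect-VIServer session:
# Get-VMotionCount -VMName "VMNAMEHERE" -Since (Get-Date).AddDays(-4)
```

Comparing this number against the count the test script printed is a cheap sanity check that every round actually resulted in a move.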

You’re welcome.

Some Takeaways

I think that about covers it, but before I close this one, notice that in the subtitle of this post I have (DRS Simulator?) with a question mark.

I tried to simulate the behavior of (Fully Automated) DRS as best I could, but whether or not I have succeeded, I will leave up to you.

I think the more relevant challenge is:

How long do you run this across these VMs before you declare this fixed? I mean, what we’re doing is dropping the vMotion hammer on the VM(s) and seeing if they scream. If they don’t, does that mean they’re fixed?

Not necessarily. But I am running this for a week with the time interval as above and we’ll see what happens. In 96 hours, I have done 120 vMotions each on two different VMs, and so far, so good.

Oh! We’re also using Splunk Alerting to monitor whether the vMotions break anything. YEAH!

Happy scripting!

Hit me up on twitter @RussianLitGuy or email me at bryansullins@thinkingoutcloud.org. I would love to hear from you.
