VM/Host/Cluster load in PowerCLI

Following up on our previous post about cluster load in PowerCLI, a chat with Alan Renouf convinced us to add support for other vCenter objects. The basic idea was to build a universal Get-Load function that displays the load of pipelined or explicit objects. Piece of cake!

Thinking about it, what we wanted to do is quite similar to the Get-View behavior, for instance the possibility to use it like:

Get-View -ViewType VirtualMachine
Get-VM | Get-View
Get-View -ViewType HostSystem
Get-VMHost | Get-View

Thanks to proxy commands, we could look at the Get-View cmdlet definition, for instance with the following command:

[System.Management.Automation.ProxyCommand]::create((New-Object System.Management.Automation.CommandMetaData(Get-Command Get-View)))

Our PowerCLI function displays an ASCII art load graph in the console for several object types (virtual machine, ESX host or vSphere cluster). It can be used in 2 ways. First, as a standalone cmdlet:

Get-Load -LoadType VirtualMachine
Get-Load -LoadType HostSystem
Get-Load -LoadType ClusterComputeResource

Second, you can pass objects to it through the pipeline:

Get-VM "vm*" | Get-Load
Get-VMHost | Get-Load
Get-Cluster -Location "folder01" | Get-Load

Right now, only 3 object types are supported (you’ll be reminded of it if you try to think outside the box :p )
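
The two calling styles above map quite naturally onto PowerShell parameter sets. Here is a minimal sketch of that idea, not the actual Get-Load code: the function name Get-LoadSketch, the parameter name InputObject and the returned strings are all hypothetical, and the real work (Get-View calls, ASCII drawing) is only hinted at in comments.

```powershell
# Hypothetical skeleton: Get-LoadSketch and its output strings are illustrative only.
function Get-LoadSketch {
    [CmdletBinding(DefaultParameterSetName = 'ByType')]
    param(
        # Standalone use: pick one of the three supported view types.
        [Parameter(ParameterSetName = 'ByType', Mandatory = $true)]
        [ValidateSet('VirtualMachine', 'HostSystem', 'ClusterComputeResource')]
        [string]$LoadType,

        # Pipeline use: objects coming from Get-VM, Get-VMHost or Get-Cluster.
        [Parameter(ParameterSetName = 'ByObject', Mandatory = $true, ValueFromPipeline = $true)]
        [object]$InputObject
    )
    process {
        if ($PSCmdlet.ParameterSetName -eq 'ByType') {
            # A real implementation would call Get-View -ViewType $LoadType here.
            "Would query every $LoadType"
        }
        else {
            # A real implementation would check the object type and draw its load.
            "Would draw load for $($InputObject.Name)"
        }
    }
}
```

With this shape, ValidateSet is what rejects anything other than the three supported types, and the process block runs once per pipelined object.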

Here are samples for ASCII art load for different objects:

The script is available for download here (rename the extension to .ps1) : Get-Load.ps1

Cluster load in PowerCLI

Edit 01.04.2013 : A few script modifications following a very nice observation by hypervisor (as usual :p)

We’ve wanted for a while to build a fancy PowerCLI display for vSphere cluster load, somewhat like the one the VI Client shows for server load:

So here is a PowerCLI script that will display your cluster load in CLI:

We had as much fun making the ASCII art display as getting the statistics data (especially with the quest for the perfect character to draw the filling bar ^^). This PowerCLI script is a Get-ClusterLoad function that takes some optional parameters:

.PARAMETER ClusterName
 The cluster name, or several names separated by commas
.PARAMETER Quickstat
 Switch. When true, stats are taken from quickstats through the summary child properties. When false, the function uses a PerfManager instance with the QueryPerf method in order to get non-computed stats. The default for this switch is $true.

So this function can be used to display the load of all your clusters (default behavior with no argument):

Get-ClusterLoad

You can also display the load of specific clusters (by giving cluster names separated by commas):

Get-ClusterLoad -ClusterName vCluster01,vCluster02

You can also force the use of real stats instead of quick stats:

Get-ClusterLoad -Quickstat:$false

The 2 ways of retrieving statistics differ considerably in execution time. The default method retrieves the quick stats available in the runtime.memory and runtime.cpu properties of the cluster root resource pool (named Resources, of ManagedObjectReference:ResourcePool type) :

This method is fast, as it only retrieves already computed values to get the cluster load (just a few seconds for almost 200 ESXi servers):
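
As a rough sketch of the quick-stats approach (not the code from Get-ClusterLoad.ps1), the load boils down to a usage/capacity ratio read from the root resource pool. The function names Get-UsagePercent and Get-ClusterQuickLoad are made up for this example, and the second one assumes an existing Connect-VIServer session, so it is only defined here and never called.

```powershell
# Pure helper: turn a usage/capacity pair into a rounded percentage.
function Get-UsagePercent([double]$Used, [double]$Max) {
    if ($Max -le 0) { return 0 }   # avoid dividing by zero on an empty cluster
    return [int][math]::Round(100 * $Used / $Max)
}

# Assumption: a vCenter session is already open. Defined only, not invoked here.
function Get-ClusterQuickLoad {
    foreach ($cluster in Get-View -ViewType ClusterComputeResource -Property Name, ResourcePool) {
        # ResourcePool references the cluster's root resource pool ("Resources")
        $pool = Get-View -Id $cluster.ResourcePool -Property Runtime
        [pscustomobject]@{
            Cluster    = $cluster.Name
            CpuPercent = Get-UsagePercent $pool.Runtime.Cpu.OverallUsage $pool.Runtime.Cpu.MaxUsage
            MemPercent = Get-UsagePercent $pool.Runtime.Memory.OverallUsage $pool.Runtime.Memory.MaxUsage
        }
    }
}
```

Restricting Get-View with -Property keeps the returned views small, which is probably a good part of why this path stays fast on large inventories.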

The second method (using the -Quickstat:$false switch) retrieves the mem.consumed.average and cpu.usagemhz.average statistic counters from each cluster. This load is an averaged one (so it can be slightly different from the load obtained with the first method) and takes longer to compute (a little more than 2 minutes for almost 200 ESXi servers):

This function is not extremely useful as-is (except for the fun part with ASCII art ^^), but we’re working on a plugin that could be added to vCheck in order to include this information in the emailed report. We’ll update this post when it’s finished.

The function that draws the bar graph from a percentage input is the sub-function Show-PercentageGraph (lines 37 to 50):

function Show-PercentageGraph([int]$percent, [int]$maxSize=20) {
	if ($percent -gt 100) { $percent = 100 }
	$warningThreshold = 60 # percent
	$alertThreshold = 80 # percent
	[string]$g = [char]9632 #this is the graph character, use [char]9608 for full square character
	if ($percent -lt 10) { write-host -nonewline "0$percent [ " } else { write-host -nonewline "$percent [ " }
	for ($i=1; $i -le ($barValue = ([math]::floor($percent * $maxSize / 100)));$i++) {
		if ($i -le ($warningThreshold * $maxSize / 100)) { write-host -nonewline -foregroundcolor darkgreen $g }
		elseif ($i -le ($alertThreshold * $maxSize / 100)) { write-host -nonewline -foregroundcolor yellow $g }
		else { write-host -nonewline -foregroundcolor red $g }
	}
	for ($i=1; $i -le ($traitValue = $maxSize - $barValue);$i++) { write-host -nonewline "-" }
	write-host -nonewline " ]"
}
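
If you would rather get the bar as a string (for a report, for instance) than have it written to the console, the same arithmetic can be reused. This is a hypothetical variant, not part of the published script: Get-PercentageBar is a name we made up, and the colors are dropped since a plain string cannot carry them.

```powershell
# Hypothetical helper (not in the original script): same bar, returned as a string.
function Get-PercentageBar([int]$Percent, [int]$MaxSize = 20) {
    # Clamp the input to the 0-100 range
    if ($Percent -gt 100) { $Percent = 100 }
    if ($Percent -lt 0) { $Percent = 0 }
    $fill = [string][char]9632                       # same graph character as above
    # Number of filled cells, rounded down exactly like Show-PercentageGraph
    $barValue = [int][math]::Floor($Percent * $MaxSize / 100.0)
    $label = '{0:00}' -f $Percent                    # pads single digits with a zero
    return "$label [ " + ($fill * $barValue) + ('-' * ($MaxSize - $barValue)) + ' ]'
}
```

For example, Get-PercentageBar 45 yields 9 filled cells followed by 11 dashes with the default width of 20.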

You can modify the graph width in the sub-function Show-PercentageGraph (line 37):

[int]$maxSize=20

You can modify the threshold for color change for warning/alert in the sub-function Show-PercentageGraph (line 39/40):

$warningThreshold = 60 # percent
$alertThreshold = 80 # percent

The script is available for download here (just rename the extension to .ps1) : Get-ClusterLoad.ps1

HP ESXi driver vs Dropped packets

As we were about to put into service some new BL465c G7 blades running ESXi installed from the HP custom ISO image (available here; this image is built with the drivers needed for G7/Gen8 hardware support, especially for the onboard FCoE FlexFabric card), we started to receive alert mails about dropped packets (thanks to the famous “croissants-baguette” alarm from hypervisor available here: Alarmes vCenter pour les network packets dropped) :

We received these alerts even though no VM was hosted on these servers, so obviously there was no network load. As every troubleshooting session should involve esxtop, we ran it and switched to the network view (key ‘n’):

What was weird is that we didn’t see any Transmit Dropped Packet (%DRPTX) or Received Dropped Packet (%DRPRX) while the vCenter performance graph displayed a lot of them.

Since the vCenter counter is a summation one (i.e. a computed counter based on values retrieved over a time range), we wanted to check that the esxtop counters really stayed at zero. So we ran esxtop in batch mode, redirected the results to a CSV file and, by importing it in perfmon, we were able to confirm that these counters stayed at 0…
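
The same check can be scripted instead of eyeballed in perfmon. The sketch below is an assumption-laden illustration: Get-NonZeroCounter is a hypothetical helper, results.csv is a placeholder file name, and the default 'Dropped' pattern assumes that the dropped-packet columns of the esxtop export contain that word in their header, which you should verify against your own file.

```powershell
# Returns the names of the columns matching $Pattern that hold at least one non-zero sample.
function Get-NonZeroCounter {
    param(
        [object[]]$Rows,
        [string]$Pattern = 'Dropped'   # assumption: dropped-packet headers contain this word
    )
    if (-not $Rows) { return }
    # Column names are the property names of the first imported row
    $columns = $Rows[0].PSObject.Properties.Name | Where-Object { $_ -match $Pattern }
    foreach ($column in $columns) {
        $nonZero = $Rows | Where-Object { [double]$_.$column -ne 0 }
        if ($nonZero) { $column }
    }
}

# Usage, assuming the batch-mode output was redirected to results.csv:
# Get-NonZeroCounter -Rows (Import-Csv .\results.csv)
```

An empty result means every matching counter stayed at 0 for the whole capture, which is what we observed.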

We’ll spare you all the failed tests aimed at explaining the discrepancy between the esxtop counters and the vCenter performance graph. In a last attempt to understand the origin of this issue, we captured some network traffic on an ESXi server (still in maintenance mode, with no VM running on it) using the tcpdump-uw command :

tcpdump-uw -i vmk0 -s 1514 -w traffic.pcap

This command captures packets on the vmk0 interface (-i vmk0), saving whole frames (-s 1514 for regular frames, or -s 9014 with Jumbo Frames) in pcap format (-w traffic.pcap) so that they can be analyzed in Wireshark.

Unfortunately, nothing looked strange or wrong, and no frame had a bad destination that could explain why packets were dropped by the vmnic…

Last but not least, as the HP ESXi image was quite new (just a few weeks old), we thought the embedded driver might be the troublemaker. OK, it was supported by HP, but we had some painful past experiences with the Flex10 driver resulting in PSODs or lost connections, so why not :p

Using the ethtool command on blades already in production and on the new ones (with the same hardware configuration), we did see differences in the driver version (the firmware was the same though). On older blades, the version was 2.102.518.0 :

Whereas on new blades, the version was 4.1.334.0 :

So we decided to downgrade the ESXi driver for these onboard FlexFabric cards to the same version as on the old blades, in order to be able to compare.

You can do it with the esxupdate command. First, you have to get the installed package name:

esxupdate query --vib-view | grep -i be2net

Then, you have to delete this package:

esxupdate -b cross_oem-vmware-esx-drivers-net-be2net_400.4.1.334.0-1vmw.2.17.249663 remove

Finally, you install the requested version (the bundle had been uploaded to a datastore for easy access from the ESXi server):

esxupdate --bundle=/vmfs/volumes/d3a29a6f-aa4be900/SVE-be2net-2.102.518.0-offline_bundle-329992.zip update

Note: You have to reboot ESXi server to load the new driver.

Once the server rebooted, we didn’t see dropped packets anymore, even under heavy load!

So we can never overemphasize the need for testing, testing, and more testing before putting production load on new hardware (all components need to be stressed: CPU, memory, hard drives, etc.)! And a good knowledge of your infrastructure is mandatory in order to know what’s going on (again, thanks to the “croissants-baguette” alarm from Super Hypervisor!)

vCenter HTTP503 Service Unavailable error

As we were working on a vCenter diff files plugin for *.vmx and *.vmdk (only for the descriptor, not the full -flat) in order to easily track changes made to these files (as we wanted to know precisely what’s going on under the “Reconfigure Virtual Machine” tasks), we encountered some issues regarding vCenter Web Services.

As an aside, just for fun, here is a little teaser of the unfinished plugin:

The issue started while we were running the script that gets the *.vmx and *.vmdk files. The script began retrieving the files just fine, but at some point (as we have a lot of VMs) some errors showed up, like this one:

At the same time, we had other problems, like the vCenter MOB being unavailable or being unable to open a VM console:

After a quick search in VMware KB, we found the KB 2033822 : vCenter Server returns “503 Service Unavailable” errors

Here is an excerpt of the KB that explains what was happening:

The vpxd log files contain entries that indicate that a socket connection attempt failed because it timed out. If you run netstat -an on the vCenter Server host machine immediately after the error, you will see many connections where one end is port 8085 on the loopback and the other end is another port on the loopback. Some of these connections will be in the TIME_WAIT state.

vCenter Server uses TCP connections on the loopback (localhost) for Remote Procedure Calls (RPC) to dispatch client requests and to communicate with vCenter Server companion services. As a result, under heavy loads, vCenter Server creates many local TCP connections, then closes them and opens new ones. Some of the closed connections remain open at the server side in the TIME_WAIT state for some time (four minutes with default Windows settings). Because the number of client-side ports is limited, if vCenter Server uses the connections fast enough, at some point the client side tries to reuse a port while the server side still has a connection for this client port in the TIME_WAIT state.

When we checked the vCenter server for TIME_WAIT connections on port 8085 (with the command netstat -an | findstr "8085.*TIME_WAIT"), we saw that there were a lot of them:
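
To put a number on it rather than scrolling through the output, the matching lines can simply be counted. A minimal sketch, where Measure-TimeWait is a hypothetical helper name and the sample addresses in the usage comment are made up:

```powershell
# Counts TIME_WAIT entries involving the given port in "netstat -an" output lines.
function Measure-TimeWait {
    param([string[]]$NetstatLine, [int]$Port = 8085)
    # ":8085" followed by whitespace matches the port as a local or remote endpoint,
    # without also matching longer port numbers such as 80851
    $pattern = ":$Port\s.*TIME_WAIT"
    @($NetstatLine | Where-Object { $_ -match $pattern }).Count
}

# On the vCenter Server itself:
# Measure-TimeWait -NetstatLine (netstat -an)
```

Running it every few seconds while a heavy script is executing should show the count climbing towards the ephemeral port limit.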

The KB 1030246 : Port 8085 in VMware vCenter Server gives some explanation about port 8085 in vCenter:

This means that 8085 is the port where all the SDK connections to vCenter Server are being made, which in turn means that vSphere Client and any scripts built on vCenter Server SDK use this port.

And we can find the same alert as in the previous KB about heavy load:

If there are a lot of scripts or applications making connections with vCenter Server, it is possible to see a large number of ports in a TIME_WAIT state. This is normal because Windows keeps a socket in TIME_WAIT state for certain period of time (Twice Maximum Segment Lifetime, so this wait could be 4 minutes) before recycling it back for use.

As we have around 2000 VMs in this vCenter, and as we automate everything we can, we had reached the limit for SDK connection ports. The KB gives a workaround to raise that limit:

By default, vCenter Server has 3976 ephemeral ports. If you are running out, you can increase the limit.

To allow more local ports to be available:

  • Open Registry Editor (Regedt32.exe).
  • Locate this key in the registry: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
  • Right-click Parameters and choose New DWORD Value.
  • Enter MaxUserPort in the Name data box, enter 65534 (Decimal) in the Value data box, then click OK.
    Note: The default setting for the MaxUserPort value is 5000 (Decimal). The maximum value for Windows Server 2003 is 65534 (Decimal).
  • Close Registry Editor.
  • Restart the machine for the new setting to take effect.
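
The 3976 figure follows from the default value: on Windows Server 2003 the ephemeral range starts at port 1025 and ends at MaxUserPort, so 5000 - 1025 + 1 = 3976 ports. The sketch below does that arithmetic and wraps the registry steps above in a function. Both function names are hypothetical, the starting port of 1025 is our assumption for this Windows generation, and Set-MaxUserPort is only defined here, not executed (it needs an elevated session and a reboot afterwards).

```powershell
# Ephemeral ports run from 1025 up to MaxUserPort inclusive (assumption: Windows Server 2003 defaults).
function Get-EphemeralPortCount([int]$MaxUserPort = 5000) {
    return $MaxUserPort - 1025 + 1
}

# Hypothetical wrapper around the registry steps. Defined only, not invoked here.
function Set-MaxUserPort([int]$Value = 65534) {
    $key = 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters'
    # -Force creates the DWORD value or overwrites an existing one
    New-ItemProperty -Path $key -Name MaxUserPort -PropertyType DWord -Value $Value -Force | Out-Null
}
```

With the maximum value of 65534, the same formula gives 64510 usable ports, roughly 16 times the default headroom.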

If you want to fix this HTTP503 error without changing this value, you can wait out the 4-minute delay to let Windows clean up the remaining TIME_WAIT connections, or you can restart the Windows service VMware VirtualCenter Management Webservices (for those in a hurry).

Finally, we were able to change this setting in order to keep using massive PowerCLI one-liner scripts ^^

VMworld 2012

Here is a late review of VMworld Europe 2012, which took place in Barcelona from October 9th to 11th. First, we want to thank John Troyer and the whole VMware community for letting us attend this VMworld!

This is the second VMworld we have attended, and this time we had the chance to go with our pal hypervisor.fr and to meet Timo there!

This year was radically different from the last one. Maybe because we were not alone this time, or because it wasn’t the first anymore :p In any case, we really enjoyed it a lot!

Thanks to last year’s experience, we were able to schedule this one better, and to focus on meetings rather than sessions. What we mean is that unless you have questions to ask during a session, you can catch up on it with the online session videos. Of course it’s not the same thing as being there, but there are so many things to see and do that you have to prioritize :p In the end, there were so many people we wanted to meet that we switched to “Where’s Waldo?” mode :p

We wanted to focus on Group Discussion and Meet The Expert sessions, basically because these are more “intimate” and everything is based on attendee participation (the Meet The Expert session was an awesome 2:1 chat with Duncan :p ).

One of the nice sessions we attended was the one given by Kit Colbert on Understanding Virtualized Memory Performance Management (INF-VSP1729). We took some pictures:

Then, there was the traditional Alan and Luc “Must-See” PowerCLI Best Practices: The Return! (INF-VSP1329) :

We were also able to attend the #NotSupported #BrownBag session by William :

We didn’t miss the occasion to take some pictures with the PowerCLI Gods Alan and Luc :

Finally, we had the chance to chat with a lot of awesome guys: Luc, Alan, Duncan, Cormak, William. We didn’t have the chance to see Franck; we’ll try to catch up at the next Belgium VMUG :p
