Kill Them All
During our adventures in the troubleshooting world, we ran into an interesting case that combined a bit of challenge with a lot of laziness (and that’s the kind of stuff we like :)).
To summarize, we had an ESXi server whose hostd process was unresponsive, so its connection state in vCenter was Not Responding (of course, after a few moments, the VMs hosted on it were restarted by HA on other ESX servers in the cluster, as usual; so far so good).
After logging in to the ESX server console through HP iLO (SSH connectivity was also down), we realized that there was an issue with the network card’s driver (a well-known problem with the ELXNET native driver on ESX 5.5), resulting in a vmnic connectivity issue (all links were down, which explains the connection state of the ESX server).
Due to this connectivity loss, the ESX server no longer had access to its NFS storage (as expected), yet it still had dummy VM processes running (even though no network or storage was available to them).
What motivated this post was our wish to put the ESX server into maintenance mode while it was in this state.
As VMM instances were still running on the ESX server, entering maintenance mode wasn’t possible. So we started to kill these instances with the esxtop ‘k’ option and the world-id. However, the iLO connection did not allow copy/paste, and we were on a VPN connection with 1980-ish latency, so performing this operation on 70 processes would have taken far too much time (especially at 3am during an on-call shift!).
As lazy as we are, we wanted to automate this process and minimize the number of commands to type, given the copy/paste issue (we warned you, the key word here is ‘lazy’).
So here is a small shell script which uses the esxcli vm process kill command. This script will kill all VMM instances on the ESX server (in hard mode, so be careful, as the processes are stopped instantly!!!):
for vmid in $(esxcli vm process list | grep "World" | cut -f2 -d:); do esxcli vm process kill --type hard --world-id $vmid; done
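If you have the luxury of copy/paste (we didn't), here is a slightly more careful sketch of the same idea. It is not what we typed that night: the function names world_ids and kill_all_vms and the DRY_RUN variable are made up for illustration, and it assumes that esxcli vm process list prints one "World ID: <number>" line per running VM. Matching on "World ID:" rather than just "World" avoids picking up a VM whose display name happens to contain that word, and the dry-run mode lets you check what would be killed before anything is actually stopped.

```shell
# Read `esxcli vm process list` output on stdin and print only the
# numeric world IDs (assumes lines of the form "   World ID: 12345").
world_ids() {
    sed -n 's/^ *World ID: *//p'
}

# Kill every running VM process with the given kill type (default: hard).
# When the hypothetical DRY_RUN variable is set to 1, the kill commands
# are printed instead of being executed.
kill_all_vms() {
    kill_type="${1:-hard}"
    for wid in $(esxcli vm process list | world_ids); do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "esxcli vm process kill --type $kill_type --world-id $wid"
        else
            esxcli vm process kill --type "$kill_type" --world-id "$wid"
        fi
    done
}
```

Running DRY_RUN=1 kill_all_vms first shows the list of commands; running kill_all_vms soft and then kill_all_vms hard gives the VMs a chance to stop cleanly before being stopped instantly.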
Once the command had run, we were able to enter maintenance mode and carry on with our troubleshooting (and go to sleep a few hours later) 🙂
Of course you should use it at your own risk, but it could be useful!