As we were about to put some new BL465c G7 blades into service, running ESXi installed from the HP custom ISO image (available here; this image bundles the drivers needed for G7/Gen8 hardware support, especially for the onboard FCoE FlexFabric card), we started to receive alert mails about dropped packets (thanks to the famous “croissants-baguette” alarm from hypervisor available here: Alarmes vCenter pour les network packets dropped).
We received these alerts even though no VM was hosted on these servers, so there was obviously no network load. As every troubleshooting session should involve esxtop, we ran it and switched to the network view (key ‘n’).
What was weird is that we didn’t see any Transmit Dropped Packets (%DRPTX) or Received Dropped Packets (%DRPRX), while the vCenter performance graph displayed a lot of them.
Since the vCenter counter is a summation counter (i.e. a value computed over a time range), we wanted to check that the esxtop counters really stayed at zero over time. So we ran esxtop in batch mode, redirected the output to a CSV file and imported it into perfmon, which confirmed that these counters stayed at 0…
We will spare you all the failed attempts at explaining the discrepancy between the esxtop counters and the vCenter performance graph. As a last resort to understand the origin of this issue, we captured some network traffic on an ESXi server (still in maintenance mode, with no VM running on it) using the tcpdump-uw command:
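For readers without perfmon at hand, the same check can be sketched in shell. This is only an illustration under assumptions: the capture is taken with something like `esxtop -b -d 5 -n 60 > esxtop.csv` (delay and iteration count chosen arbitrarily), the drop counters are recognised by the text "Packets Dropped" in the CSV header (an assumption about the column names, adjust the pattern to your build), and `check_drops` is a hypothetical helper name.

```shell
# Capture on the host (not run here), e.g.:
#   esxtop -b -d 5 -n 60 > esxtop.csv

# Print every sample in which a dropped-packet column is non-zero.
# Returns 0 when all drop counters stayed at zero, 1 otherwise.
# Assumption: fields are double-quoted and the drop columns contain
# the text "Packets Dropped" in the header line.
check_drops() {
    awk -F'","' '
        NR == 1 {
            # Remember the index and name of each drop column.
            for (i = 1; i <= NF; i++)
                if ($i ~ /Packets Dropped/) {
                    name = $i
                    gsub(/"/, "", name)
                    cols[i] = name
                }
            next
        }
        {
            for (i in cols) {
                value = $i
                gsub(/"/, "", value)
                if (value + 0 != 0) {
                    printf "sample %d: %s = %s\n", NR - 1, cols[i], value
                    found = 1
                }
            }
        }
        END { exit (found ? 1 : 0) }
    ' "$1"
}
```

Running `check_drops esxtop.csv` prints nothing and returns 0 when the counters stayed at zero for the whole capture, which is what we observed.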
tcpdump-uw -i vmk0 -s 1514 -w traffic.pcap
This command captures packets on the vmk0 interface (-i vmk0), saving whole frames (-s 1514 for regular frames, or -s 9014 with Jumbo Frames) in pcap format so that they can be analyzed in Wireshark.
Unfortunately, nothing looked strange or wrong, and no frame had a bad destination that could explain why it would be dropped by the vmnic…
Last but not least, as the HP ESXi image was quite new (just a few weeks old), we thought the embedded driver might be the troublemaker. OK, it was supported by HP, but we had had some painful past experiences with the Flex10 driver resulting in PSODs or lost connections, so why not :p
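The two -s values are simply the MTU plus the 14-byte Ethernet header (untagged frame, no FCS). A tiny sketch of that arithmetic, where `snaplen_for_mtu` is a hypothetical helper and not part of tcpdump-uw:

```shell
# Snapshot length needed to capture whole untagged Ethernet frames:
# MTU (payload) + 14 bytes of header (two 6-byte MAC addresses + 2-byte EtherType).
snaplen_for_mtu() {
    echo $(( $1 + 14 ))
}

snaplen_for_mtu 1500   # prints 1514 (standard MTU)
snaplen_for_mtu 9000   # prints 9014 (Jumbo Frames)

# Possible use on the host (not run here):
#   tcpdump-uw -i vmk0 -s "$(snaplen_for_mtu 9000)" -w traffic.pcap
```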
Using the ethtool command on blades already in production and on the new ones (with the same hardware configuration), we did see a difference in driver version (the firmware was the same, though). On the older blades, the version was 2.102.518.0.
Whereas on the new blades, the version was 4.1.334.0.
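To compare many blades without reading each output by eye, the version line can be pulled out of the `ethtool -i` output. The following is a minimal sketch, assuming the usual `key: value` layout of that output; `driver_version` is a hypothetical helper and `vmnic0` an assumed NIC name.

```shell
# Read `ethtool -i` output on stdin and print the value of its "version:" line.
# The exact match on "version" skips the "firmware-version:" line.
driver_version() {
    awk -F': *' '$1 == "version" { print $2 }'
}

# Possible use on each host (not run here):
#   ethtool -i vmnic0 | driver_version
```

Collecting this value from an old blade and a new one makes the driver mismatch visible at a glance.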
So we decided to downgrade the ESXi driver for these onboard FlexFabric cards to the same version as on the old blades, in order to be able to compare.
You can do it with the esxupdate command. First, you have to get the name of the installed package:
esxupdate query --vib-view | grep -i be2net
Then, you have to delete this package:
esxupdate -b cross_oem-vmware-esx-drivers-net-be2net_400.4.1.334.0-1vmw.2.17.249663 remove
Finally, you install the requested version (the bundle was uploaded to a datastore so that it is easily accessible from the ESXi host):
esxupdate --bundle=/vmfs/volumes/d3a29a6f-aa4be900/SVE-be2net-2.102.518.0-offline_bundle-329992.zip update
Note: You have to reboot the ESXi server to load the new driver.
Once the server rebooted, we didn’t see dropped packets anymore, even under heavy load!
So we can never overemphasize the need for testing, testing, and more testing before putting production load on a system (all components need to be stressed: CPU, memory, hard drives, etc.)! And a good knowledge of your infrastructure is mandatory in order to know what’s going on (again, thanks to the “croissants-baguette” alarm from Super Hypervisor!)