SolidFire vs ATS heartbeat
We had to deal with an interesting case on a new VDI platform hosted on SolidFire arrays. During VM boot storms (around 1000 VMs per cluster), we lost access to the ESXi servers: their vCenter status changed to ‘Not Responding’, and the hostd process became unresponsive as soon as we tried to run a local command (through the console and/or SSH).
Searching online, we found an IBM KB describing similar behavior, "Host Disconnects Using VMware vSphere 5.5.0 Update 2 and vSphere 6.0", which calls into question a feature introduced in vSphere 5.5U2: ATS heartbeat, i.e. offloading heartbeat validation to the array using the ATS VAAI primitive.
While digging deeper into this setting, we reread Cormac's great post explaining this feature, and more precisely the new behavior since 5.5U2: Heads Up! ATS Miscompare detected between test and set HB images.
There is now an official VMware KB (KB2113956) that explains the issue and the behavior of ATS heartbeat before and after 5.5U2:
A change in the VMFS heartbeat update method was introduced in VMware vSphere ESXi 5.5, Update 2 to help optimize the VMFS heartbeat process. Whereas the legacy method involves plain SCSI reads and writes with the VMware ESXi kernel handling validation, the new method offloads the validation step to the storage system. This is similar to other VAAI-related offloads. This optimization results in a significant increase in the volume of ATS commands the ESXi kernel issues to the storage system, and resulting increased load on the storage system. Under certain circumstances VMFS heartbeat using ATS may fail with false ATS miscompare which causes the ESXi kernel to reverify its access to VMFS datastores. This leads to the Lost access to datastore messages.
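To make the difference between the two heartbeat methods more concrete, here is a toy Python model. It is only a sketch: `ToyLun`, `heartbeat_image` and the "host:generation" record layout are invented for illustration and do not reflect the real VMFS on-disk format or any array firmware. The only thing it tries to capture is that, with ATS, the array itself compares the expected image before writing and reports a miscompare if they differ.

```python
class ToyLun:
    """Hypothetical in-memory LUN: one bytes value per 512-byte sector."""
    SECTOR = 512

    def __init__(self, sectors):
        self.data = [bytes(self.SECTOR) for _ in range(sectors)]

    def read(self, lba):
        return self.data[lba]

    def write(self, lba, image):
        self.data[lba] = image

    def compare_and_write(self, lba, expected, new):
        """ATS-style primitive: write `new` only if the sector equals `expected`."""
        if self.data[lba] != expected:
            return False  # the array reports a miscompare, nothing is written
        self.data[lba] = new
        return True


def heartbeat_image(host, generation):
    """Invented heartbeat record, padded to one sector (not the VMFS format)."""
    return f"{host}:{generation}".encode().ljust(ToyLun.SECTOR, b"\0")


def legacy_heartbeat(lun, lba, host, generation):
    """Legacy style: the host reads, validates by itself, then does a plain write."""
    if lun.read(lba) != heartbeat_image(host, generation):
        return False
    lun.write(lba, heartbeat_image(host, generation + 1))
    return True


def ats_heartbeat(lun, lba, host, generation):
    """ATS style: validation is offloaded to the array in a single command."""
    return lun.compare_and_write(
        lba,
        heartbeat_image(host, generation),
        heartbeat_image(host, generation + 1),
    )


if __name__ == "__main__":
    lun = ToyLun(8)
    lun.write(0, heartbeat_image("esx01", 0))
    print(ats_heartbeat(lun, 0, "esx01", 0))  # True: on-disk image matches
    print(ats_heartbeat(lun, 0, "esx01", 0))  # False: stale image, miscompare
```

In this model the second call fails because the host presents an outdated image. The "false miscompare" described in the KB is the case where the array reports such a failure even though the host's view was in fact correct.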
The support case we opened with SolidFire confirmed this behavior. Here is an excerpt of their answer:
SF is a 4K block size array internally but ESX only works with 512-byte blocks. ESX issues single 512-byte block ATS commands, which requires the SF array to perform an internal read-modify-write for every ATS command. If ESX could issue a 4K ATS command (eight 512-byte blocks) our ATS performance should be better. Essentially, every ATS operation is an unaligned I/O.
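The arithmetic behind that answer can be sketched in a few lines of Python. This is a deliberately simplified model, assuming that any write which does not cover whole 4K internal blocks forces the array to read each touched block, patch it and write it back. The function name `backend_io` and the cost model are our own assumptions, not SolidFire internals.

```python
SECTOR = 512        # block size ESX uses for an ATS command (per the quote)
ARRAY_BLOCK = 4096  # SolidFire internal block size (per the quote)


def backend_io(offset, length, block=ARRAY_BLOCK):
    """Estimate backend work for one host write in a simplified model.

    Returns a dict with the number of internal blocks touched, whether a
    read-modify-write is needed, and the bytes read/written on the backend.
    """
    first = offset // block
    last = (offset + length - 1) // block
    blocks = last - first + 1
    aligned = offset % block == 0 and length % block == 0
    return {
        "blocks": blocks,
        "rmw": not aligned,
        # Assumption: an unaligned write reads every block it touches.
        "bytes_read": 0 if aligned else blocks * block,
        "bytes_written": blocks * block,
    }


if __name__ == "__main__":
    # A single-sector ATS at LBA 5: one 4K block read and rewritten for 512 bytes.
    print(backend_io(5 * SECTOR, SECTOR))
    # A hypothetical 4K-aligned, 4K-long ATS: no read needed in this model.
    print(backend_io(0, 8 * SECTOR))
```

With these assumptions, a 512-byte ATS costs a 4096-byte read plus a 4096-byte write on the backend, whereas an aligned eight-sector command would only cost the write. Multiplied by the heartbeat traffic of every host on every datastore, this is consistent with the load increase mentioned in the VMware KB.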
SolidFire support asked us to disable ATS heartbeat in order to switch back to the legacy heartbeat method, i.e. plain SCSI reads and writes validated by the ESXi kernel (only for the heartbeat, of course).
As described in the VMware KB, you can do this with PowerCLI (replace VMHost-Name with the name of your host):
Get-AdvancedSetting -Entity VMHost-Name -Name VMFS3.UseATSForHBOnVMFS5 | Set-AdvancedSetting -Value 0 -Confirm:$false
In the next post, we will cover another setting change: DelayedAck.