Slow VM snapshot deletion on NFS volumes on ESXi hosts

Recently I had to troubleshoot a system time problem inside a VMware vSphere VM. During backup, the VM froze for more than 30 seconds. While it was frozen, the system time in the VM also stopped, and it jumped to the current time once the freeze ended. This behavior caused massive problems in the application layer. During troubleshooting we found that VM snapshot deletion was very slow on NFS volumes on ESXi hosts.

Environment

In this setup, we were running:

  • HPE SimpliVity hyper-converged infrastructure in a current version.
    • Note: SimpliVity uses NFS v3 to present volumes to its hosts.
  • VMware vSphere 6.7 in a fairly current version.
  • A current version of Veeam Backup & Replication.

Symptoms

  • Slow VM snapshot deletion on NFS volumes on ESXi hosts.
    • Snapshot removal for comparable VMs running on block storage takes about 1-2 seconds. Deleting a snapshot hosted on a SimpliVity volume (NFS v3) takes at least 40 seconds.
  • During the snapshot deletion, no additional IOPS can be observed.
  • System time problems within the VM.
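
To confirm the freeze from inside the guest, it can help to log the wall clock at a fixed interval and look for gaps. The following is only a minimal sketch, not a tool we used in this case: the function names `find_stalls` and `watch`, the one-second interval, and the tolerance are all assumptions chosen for illustration.

```python
import time
from typing import Callable, List, Tuple


def find_stalls(samples: List[float], interval: float,
                tolerance: float) -> List[Tuple[float, float]]:
    """Return (start, gap) for each pair of consecutive samples that lie
    further apart than interval + tolerance seconds."""
    stalls = []
    for prev, cur in zip(samples, samples[1:]):
        gap = cur - prev
        if gap > interval + tolerance:
            stalls.append((prev, gap))
    return stalls


def watch(interval: float = 1.0, tolerance: float = 0.5,
          iterations: int = 60,
          clock: Callable[[], float] = time.time,
          sleep: Callable[[float], None] = time.sleep
          ) -> List[Tuple[float, float]]:
    """Sample the wall clock every `interval` seconds and report stalls.

    The wall clock (time.time) is used on purpose: the symptom described
    above is a wall-clock jump after the freeze ends.
    """
    samples = []
    for _ in range(iterations):
        samples.append(clock())
        sleep(interval)
    return find_stalls(samples, interval, tolerance)
```

Calling `watch(iterations=600)` inside the VM while a backup job runs should return one entry per freeze, with the gap roughly matching the freeze duration. Short gaps can also come from ordinary scheduling delays, so the tolerance needs tuning.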

Root cause

The combination of backup transport mode and NFS version causes the problem: using NFS v3 (from any storage solution) together with hot-add transport mode (from any backup solution; it uses a virtual appliance to mount the VMDKs) leads to unresponsive VMs during the creation and removal of snapshots.
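
The condition can be written down as a small check, for example to flag affected jobs in an inventory script. This is a hedged sketch: the function name, the `"hotadd"` string, and the `proxy_on_same_host` parameter are hypothetical, and the same-host exception is my reading of the "appliance on every host" workaround listed further down.

```python
def snapshot_stun_risk(nfs_version: int, transport_mode: str,
                       proxy_on_same_host: bool = False) -> bool:
    """Return True when the combination described above applies:
    NFS v3 datastore plus hot-add transport, with the backup appliance
    running on a different host than the VM (assumption, see lead-in)."""
    uses_hot_add = transport_mode.strip().lower() == "hotadd"
    return nfs_version == 3 and uses_hot_add and not proxy_on_same_host
```

For the environment in this post, `snapshot_stun_risk(3, "hotadd")` is true, while switching to NBD or placing an appliance on the VM's own host makes it false.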

Workaround

There are a few workarounds available, but none of them is ideal:

  • Use NFS v4 instead of v3.
    Nice, but most often we cannot choose the protocol version. SimpliVity, for example, offers only v3.
  • Use another transport mode. Direct access and NBD (Network Block Device) are available.
    • For direct access, the backup server needs access to the NFS datastore. This is sometimes not possible (SimpliVity does not support it) or not desirable.
    • NBD is the slowest of all transport modes. The backup reads data through the management uplink of the host, and the ESXi host limits the throughput. It can perform reasonably on 10 Gbit links with jobs running in parallel.
  • Continue to use hot-add mode, but with an appliance on every single host. See KB article 2010953.
  • For VMs and applications that suffer from this, a workaround can be to disable all time synchronization between host and VM. A few operations, such as vMotion, creating or removing a snapshot, or expanding a VMDK, trigger a resynchronization of the VM’s system time. You can try to disable all these triggers and keep the VM’s time current with services like NTP or w32tm. See KB 1189 for how to disable them.
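
For the last workaround, the advanced settings end up as lines in the VM's .vmx file. The sketch below just renders those lines. The option names are listed as I recall them from KB 1189, so treat them as an assumption and check the KB for your vSphere version before applying anything; `render_vmx_lines` is a hypothetical helper name.

```python
# Option names as recalled from VMware KB 1189 (assumption: verify
# against the KB for your vSphere version before using them).
TIME_SYNC_OPTIONS = [
    "tools.syncTime",
    "time.synchronize.continue",
    "time.synchronize.restore",
    "time.synchronize.resume.disk",
    "time.synchronize.shrink",
    "time.synchronize.tools.startup",
]


def render_vmx_lines(options=TIME_SYNC_OPTIONS, value="0"):
    """Render each option as a .vmx style line, e.g. name = "0"."""
    return [f'{name} = "{value}"' for name in options]
```

Printing `"\n".join(render_vmx_lines())` gives a block that can be compared with the VM's current advanced configuration. Once these triggers are off, the guest depends entirely on NTP or w32tm for correct time.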

Notes

  • This issue is not tied to any particular backup or storage solution. In my opinion, the problem is VMware related. Please correct me if I am wrong.
  • The issue is a problem for time-sensitive applications. In most cases it will not matter.
  • Read VMware KB article 2010953 for more details.
  • For the issue description and workaround for Veeam B&R, see KB article 1681.
  • Some rather old, but for the most part still correct details about SimpliVity
