Snapshot consolidation (or commit) operations in VMware vSphere have always been a problem, especially for large and very active virtual machines. But vSphere 6 introduced some changes that will probably make commits a problem of the past!
A new consolidation method
Before vSphere 6.0, the consolidation and commit phases of any VM snapshot have always followed the same procedure: an additional helper snapshot was created to “freeze” not just the base virtual disk, but also the snapshot disk; once the changes stored in the snapshot disk had been merged into the base disk, the helper snapshot was committed as well, and at some point I/O was redirected away from the snapshot and back to the original disk.
This process sounds simple, but once you start looking at the details it isn't that easy, and in some production environments this behaviour has caused problems. If the virtual machine has significant write I/O activity, the helper snapshot itself can grow in size, and in order to commit the changed data back into the base disk an additional helper snapshot may be created. Cormac Hogan wrote a great post explaining the situation in vSphere 5.5 and how it has been improved in vSphere 6.0. You can read his full post, Snapshot Consolidation changes in vSphere 6.0; here I'm only quoting a part of it:
What we did previously is used a helper snapshot, and redirected all the new I/Os to this helper snapshot while we consolidated the original chain. Once the original chain is consolidated, we then did a calculation to see how long it would take to consolidate the helper snapshot. It could be that this helper snapshot has grown considerably during the consolidate operation. If the time to consolidate the helper is within a certain time-frame (12 seconds), we stunned the VM and consolidated the helper snapshot into the base disk. If it was outside the acceptable time-frame, then we repeated the process (new helper snapshot while we consolidated original helper snapshot) until the helper could be committed to the base disk within the acceptable time-frame.
We retried this process for a defined number of iterations, but we have had situations where, due to the amount of I/O in flight, we could never successfully consolidate the snapshot chain and helpers.
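To picture the old behaviour, here is how I'd sketch the loop Cormac describes, in plain Python. Keep in mind this is only an illustration of the logic: apart from the 12-second threshold, every name and number in it is my own assumption, not VMware code.

# Simplified model of the pre-6.0 consolidation loop described above.
# Everything except the 12-second threshold is an illustrative assumption.
STUN_THRESHOLD_S = 12   # a helper commit must fit in this window
MAX_ITERATIONS = 10     # assumed retry limit; the real value is not documented here

def consolidate_pre_60(chain, take_helper_snapshot, estimate_commit_time,
                       stun_vm, commit_into_base):
    """Keep creating helper snapshots until the last one is small enough
    to be committed into the base disk during a short stun."""
    helper = take_helper_snapshot()   # new guest I/O goes to the helper
    commit_into_base(chain)           # merge the original chain, VM keeps running

    for _ in range(MAX_ITERATIONS):
        if estimate_commit_time(helper) <= STUN_THRESHOLD_S:
            with stun_vm():           # short stun: commit the last helper
                commit_into_base(helper)
            return True
        # The helper grew too much while we were consolidating: create a new
        # helper, merge the previous one in the background, then re-evaluate.
        new_helper = take_helper_snapshot()
        commit_into_base(helper)
        helper = new_helper

    # With enough write I/O in flight, the loop may never converge:
    # this is exactly the failure mode mentioned in the quote above.
    return False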
In vSphere 6.0, this consolidation process has been dramatically changed:
In vSphere 6.0 the snapshot consolidation process also uses the mirror driver [as in Storage vMotion]. With the mirror driver mechanism, changes to the VM are written to the active VMDK and the base disk (while protecting write order) during consolidation. One should now hopefully see snapshot consolidations completing in 1 pass (with minimal or indeed no helper disks) and with hopefully a dramatically shorter stun time, and a much smaller chance of consolidation failure.
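The mirror-driver approach can be sketched in the same way: instead of chasing an ever-growing helper, every write issued by the guest during consolidation is sent, in order, to both the active VMDK and the base disk, so when the background merge of the existing delta finishes there is nothing left to catch up on. Again, the objects and methods below are invented for illustration; this is not the actual VMkernel code.

# Simplified model of the vSphere 6.0 mirror-driver consolidation.
# Only the idea (mirrored writes, single pass) comes from the quote above.

def consolidate_with_mirror(active_disk, base_disk, guest_io, merge_existing_delta):
    """Single-pass consolidation: mirror in-flight writes to both disks
    while the existing snapshot delta is merged into the base disk."""
    def mirrored_write(offset, data):
        # Write order is preserved: each write lands on both disks
        # before the next one is issued.
        active_disk.write(offset, data)
        base_disk.write(offset, data)

    guest_io.redirect(mirrored_write)             # from now on, new I/O is mirrored
    merge_existing_delta(active_disk, base_disk)  # one pass over the old delta

    # No helper snapshot to catch up on: only a very short stun is needed
    # to switch the VM from the delta disk back to the base disk.
    guest_io.switch_to(base_disk)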
Testing the new consolidation process
This sounded too good to be true, so my colleague Tom Sightler, Solutions Architect at Veeam, decided it was time to run some tests to verify this change. Tom shared his findings with me, so thanks Tom for the reports!
The test involved two servers, one with ESXi 5.5 (5.5 Update 3) and one with ESXi 6.0 (6.0 Update 1b), while the rest of the environment was exactly the same: the same server model and characteristics, the same network, and the same NFS datastore to run the test virtual machine. This VM is a CentOS 7 minimal install (text only). It has 12 virtual disks, 16GB each, all stored on the NFS datastore: 1 disk for the operating system, while the other 11 disks are just attached, not even formatted or mounted.

Why this setup? Because the biggest problems with snapshot consolidation are observed at customers with lots of VMDKs, since vSphere consolidates the multiple disks of a VM and their snapshots sequentially. Even with a completely empty VM (the 11 extra disks are not even formatted, so they are completely idle) you usually end up with a stun of around 1 second per disk, and since the VMkernel goes through each disk and consolidates its last helper, with 11 disks you will almost always see more than 10 seconds of stun.
And remember, this is a completely idle VM: there is absolutely no I/O activity. Think about a really active VM, like a database machine for example, where setups with multiple disks are pretty common: the problem here can be way worse. It's not the hardware in use, there's no backup software involved, it's all about the snapshot technology.
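By the way, the snapshot part of the test is easy to script. Something along these lines with pyvmomi should do it; the host name, credentials and VM name are placeholders, and the VM with its 12 disks has to exist already.

#!/usr/bin/env python
# Take and then remove a snapshot on the test VM, roughly reproducing the test.
# Host, credentials and VM name are placeholders for this example.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host='esxi60.lab.local', user='root', pwd='password',
                  sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == 'centos7-snaptest')

    # Take a snapshot (no memory, no quiescing: the test VM is idle anyway)
    WaitForTask(vm.CreateSnapshot_Task(name='stun-test',
                                       description='snapshot stun test',
                                       memory=False, quiesce=False))

    # Remove it again; the host consolidates the delta disks and the
    # stun times end up in the VM's vmware.log.
    WaitForTask(vm.RemoveAllSnapshots_Task())
finally:
    Disconnect(si)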
For each test, two parameters have been measured: a network ping executed every second, and the effective stun time read from the VMware logs. The first shows the behaviour as it is seen by another application or user connecting to the VM over the network, while the second gives the exact time spent by the hypervisor to manage snapshot creation and consolidation.
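The stun times come from the lines containing "Checkpoint_Unstun: vm stopped for N us" in the VM's vmware.log, like the ones reported below. A quick way to extract and sum them, assuming you have copied the log locally, is a small script like this:

# Extract and sum the stun times reported in a VM's vmware.log.
# Usage: python stun_times.py vmware.log
import re
import sys

PATTERN = re.compile(r'Checkpoint_Unstun: vm stopped for (\d+) us')

def read_stun_times_us(log_path):
    """Return the list of stun durations (in microseconds) found in the log."""
    stuns = []
    with open(log_path) as log:
        for line in log:
            match = PATTERN.search(line)
            if match:
                stuns.append(int(match.group(1)))
    return stuns

if __name__ == '__main__':
    stuns_us = read_stun_times_us(sys.argv[1])
    for us in stuns_us:
        print('stun: %.3f s' % (us / 1e6))
    print('total: %.3f s over %d stuns' % (sum(stuns_us) / 1e6, len(stuns_us)))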
ESXi 5.5 U3
In order to verify the effects of snapshot creation and removal, different tests were conducted. First, we took a snapshot, because even snapshot creation requires at some point an I/O redirection from the base disk to the snapshot disk, and even if this time is minimal, it's not zero.
2016-02-05T22:52:22.064Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1960925 us
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Request timed out.
Request timed out.
Reply from 192.168.60.159: bytes=32 time=3ms TTL=62
Reply from 192.168.60.159: bytes=32 time=22ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Then, we removed the snapshot:

2016-02-05T22:57:02.783Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 843578 us
2016-02-05T22:57:04.633Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1224454 us
2016-02-05T22:57:06.457Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1180217 us
2016-02-05T22:57:08.193Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1125963 us
2016-02-05T22:57:09.870Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1083360 us
2016-02-05T22:57:11.517Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1061894 us
2016-02-05T22:57:13.124Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1014733 us
2016-02-05T22:57:14.623Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 967550 us
2016-02-05T22:57:16.131Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1004541 us
2016-02-05T22:57:17.600Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 961571 us
2016-02-05T22:57:18.980Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 907337 us
2016-02-05T22:57:20.338Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 890165 us
2016-02-05T22:57:21.162Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 383551 us
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Request timed out.
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
ESXi 6.0 U1b
Again, we first took a snapshot:

2016-02-05T22:06:11.606Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 649684 us
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Then, we removed the snapshot:
2016-02-05T22:10:08.276Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 5326 us
2016-02-05T22:10:08.493Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 69898 us
2016-02-05T22:10:08.714Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 63095 us
2016-02-05T22:10:08.942Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 67560 us
2016-02-05T22:10:09.166Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 65347 us
2016-02-05T22:10:09.388Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 64597 us
2016-02-05T22:10:09.610Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 63449 us
2016-02-05T22:10:09.827Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 65439 us
2016-02-05T22:10:10.054Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 69483 us
2016-02-05T22:10:10.277Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 65957 us
2016-02-05T22:10:10.501Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 64563 us
2016-02-05T22:10:10.759Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 95898 us
2016-02-05T22:10:10.983Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 67014 us
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Request timed out.
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=3ms TTL=62
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
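Just to put the two removal runs side by side: summing the Checkpoint_Unstun values shown above gives roughly 12.6 seconds of cumulative stun on ESXi 5.5 U3 against roughly 0.83 seconds on ESXi 6.0 U1b, something you can verify with a couple of lines of Python:

# Totals of the Checkpoint_Unstun values shown above, in microseconds.
esxi55_removal_us = [843578, 1224454, 1180217, 1125963, 1083360, 1061894, 1014733,
                     967550, 1004541, 961571, 907337, 890165, 383551]
esxi60_removal_us = [5326, 69898, 63095, 67560, 65347, 64597, 63449,
                     65439, 69483, 65957, 64563, 95898, 67014]

print('ESXi 5.5 U3:  %.2f s of total stun' % (sum(esxi55_removal_us) / 1e6))  # ~12.65 s
print('ESXi 6.0 U1b: %.2f s of total stun' % (sum(esxi60_removal_us) / 1e6))  # ~0.83 s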