(note: this post also appears on the official Veeam blog)
Also available in this series:
Part 1: Linux certificate-based authentication
Part 2: Quick Rollback
Part 3: Quick Backup
Part 4: support for vSphere tags
Part 6: Tape Server
Part 7: Save as Default
Part 8: Hyper-V
Veeam Availability Suite v8 has been released, and customers and partners are already upgrading their environments and exploring the many new features added in this latest version of the software. Many enhancements are not part of the marketing activities, but they all contribute to making each release a great version of the software.
I’ve created a list of my 8 favorite new technical features, my “gems”. In this series of posts, I will show them to you and dive a little into their technical details. In this fifth episode, we will talk about Snapshot Hunter.
Snapshots, love them or hate them
Snapshots are so tightly integrated into any virtualized environment that you cannot really talk about one without the other. Ever since they first became available, they have been, together with vMotion, the best example of why virtualization would change the IT landscape.
They are so common that we do not even talk about them anymore; we simply use them. But stop for a minute and think about what you can do with them, or even better, what you would not be able to do without them: creating point-in-time copies of every virtual machine, and reverting to any of these copies at will, is something truly amazing, and makes you feel like you can really control time. You applied a patch on a production system just to find out something has broken? There’s nothing simpler than hitting a button and reverting the virtual machine to its previous state, as if nothing had happened. It’s like having a time machine for your IT infrastructure!
We all love snapshots, but the more we use them, the more we find out they also have a dark side. In VMware environments, snapshots are represented on the storage as additional disks where all the writes are committed, while the previous point in time is kept in a read-only state. This simple explanation has some consequences: the space consumed by snapshots is taken from valuable storage arrays, and because I/O now flows to and from at least two different virtual disks, the performance of the virtual machines can be impacted. As any VMware administrator knows, forgetting an open snapshot on a virtual machine for a long period of time is one of the worst things that can happen. That’s the reason why administrators monitor their environments for forgotten snapshots. Tellingly, the “Active Snapshots” report in Veeam ONE is probably the most used report among Veeam users.
So there are effective ways to keep snapshots under control and avoid the problems they could create. But is it enough?
Hidden snapshots
Actually, no.
For various reasons, snapshots are sometimes “lost” by vCenter: they are no longer reported in the interface, but they still exist on the underlying storage. Because of this, they are still used by the virtual machine, they can still impact performance, and, if not discovered, they can lead to serious problems such as uncontrolled storage space consumption.
This can also happen during Veeam Backup & Replication activities. Any data protection task starts with a virtual machine snapshot: with it, Veeam can guarantee proper quiescence of the data stored in the virtual disk, thus ensuring the content of the backup is consistent. For this reason, at the beginning of a backup or replication job, Veeam Backup & Replication first asks vCenter to create a snapshot of a given virtual machine. Once the snapshot is in place, the quiesced virtual disk (or part of it, during an incremental backup or replica) is copied, and at the end of the job Veeam instructs vCenter to commit the snapshot.
Here lies the problem: sometimes, even if vCenter reports a successful removal of the snapshot, in reality the snapshot is still there, even if there is no way from the vCenter interface to be aware of this state. The snapshot keeps growing, unobserved, until something bad happens.
Snapshot Hunter to the rescue!
For this reason, Veeam introduced in Veeam Backup & Replication v8 a new feature, specifically designed to identify stuck snapshots left over after backup and replication activities, and automatically remove them. There’s no better name for this than Snapshot Hunter.
How does it work?
Once a snapshot commit has been carried out by vCenter, or more precisely by the ESXi server running the virtual machine at that time, vCenter itself reports the commit as successful regardless of the actual result.
As soon as the commit is completed, Snapshot Hunter connects to the virtual infrastructure and reads the contents of the datastore hosting the virtual machine. If the snapshot file created during the backup operation is still there, this is first reported in the job statistics, and then the removal process begins.
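To make the idea of this check more concrete, here is a minimal sketch in Python. It is not Veeam’s actual detection logic, which is not described in this post; it only relies on the usual VMware naming convention for snapshot disks (a six-digit suffix such as `-000001.vmdk`, plus a `-delta` or `-sesparse` data file). The function name and its parameters are hypothetical, and the file list is assumed to have been read from the virtual machine folder on the datastore.

```python
import re

# VMware names snapshot disks "<base>-000001.vmdk", with the data stored in
# "<base>-000001-delta.vmdk" or "<base>-000001-sesparse.vmdk".
SNAPSHOT_DISK = re.compile(r"-\d{6}(-delta|-sesparse)?\.vmdk$", re.IGNORECASE)


def find_leftover_snapshot_disks(datastore_files, registered_snapshot_count):
    """Return snapshot disks found on the datastore while no snapshot is registered.

    datastore_files: file names in the VM folder (hypothetical input).
    registered_snapshot_count: snapshots that vCenter reports for the VM.
    """
    if registered_snapshot_count > 0:
        # Snapshots are visible in vCenter, so their disks are expected.
        return []
    return sorted(f for f in datastore_files if SNAPSHOT_DISK.search(f))


files = [
    "web01.vmx",
    "web01.vmdk",
    "web01-flat.vmdk",
    "web01-000001.vmdk",
    "web01-000001-delta.vmdk",
]
leftovers = find_leftover_snapshot_disks(files, registered_snapshot_count=0)
```

A name-based check like this can give false positives (for example, a base disk whose name happens to end in six digits), so treat it only as an illustration of the principle: compare what vCenter reports with what is really on the datastore.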
There is a specific schedule for Snapshot Hunter activities: the first attempt to remove the stuck snapshot is performed as soon as the processing of that virtual machine is finished. Chances are that the snapshot file, or another file involved, was simply locked at the time of the commit, and a consolidation can fix the issue immediately.
If the immediate attempt is not successful, Snapshot Hunter retries every 4 hours, up to 3 times, for a total of 12 hours. For each attempt, the consolidation algorithm has three steps:
1. “soft consolidation” (calling VMware Consolidate method)
2. “hard consolidation without quiesce” – creating and removing a snapshot
3. “hard consolidation with quiesce” – creating and removing a quiesced snapshot
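The schedule and the three-step escalation above can be sketched in a few lines of Python. This is only an illustration under my reading of the schedule (one immediate attempt, then retries at +4, +8 and +12 hours), not the product’s real code. The step functions are hypothetical placeholders standing in for the real consolidation calls.

```python
from datetime import datetime, timedelta

RETRY_INTERVAL = timedelta(hours=4)
MAX_RETRIES = 3


def attempt_times(processing_finished):
    """Immediate attempt plus three retries, 4 hours apart."""
    return [processing_finished + i * RETRY_INTERVAL for i in range(MAX_RETRIES + 1)]


def run_attempt(steps):
    """Try each step in order; return the name of the first one that succeeds.

    steps: list of (name, callable) pairs; each callable returns True on success.
    Returns None when every step fails, meaning the attempt must be retried later.
    """
    for name, step in steps:
        if step():
            return name
    return None


# Hypothetical stand-ins: here the soft consolidation fails and the first
# hard consolidation succeeds, so the quiesced variant is never tried.
steps = [
    ("soft consolidation", lambda: False),
    ("hard consolidation without quiesce", lambda: True),
    ("hard consolidation with quiesce", lambda: True),
]
result = run_attempt(steps)
schedule = attempt_times(datetime(2014, 11, 6, 22, 0))
```

The point of the ordering is that the least invasive option is always tried first, and creating a new snapshot (with or without quiescence) only happens when the plain consolidation call was not enough.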
If, after 12 hours, the snapshot still cannot be safely removed, Snapshot Hunter notifies the user about the stuck snapshot, because a permanent problem is preventing the removal and needs to be addressed manually. If you have configured notifications, you will receive an email like this:
“VM <virtual machine name> needs snapshot consolidation, but all automatic snapshot consolidation attempts have failed.
Most likely reason is a virtual disk being locked by some external process. Please troubleshoot the locking issue, and initiate snapshot consolidation manually in vSphere Client.”
Most of the time, you will never see this notification, and the snapshot will be silently and successfully removed. You can check the specific activities of Snapshot Hunter at any time by opening the History tab, looking at “system” activities and filtering them with the term “snapshot”.
Each “VM snapshot consolidation” job is Snapshot Hunter removing a stuck snapshot for you 😉
Snapshot Hunter is completely automated. You do not have to configure anything: “it just works”. It searches for and removes only the stuck snapshots left over after a backup or replication activity, so it doesn’t touch any other snapshot (unless hard consolidation is involved). That said, we strongly suggest removing all existing snapshots and using Quick Backup instead if you need safety copies of your virtual machines without the problems created by long-lived snapshots.
Snapshot Hunter is also completely integrated into Veeam Backup & Replication in terms of resource consumption: if, for example, the maximum number of snapshots per datastore has been reached, it waits for at least one snapshot to be removed before proceeding with consolidation. We don’t want Snapshot Hunter to put additional I/O load on your storage while removing the snapshots.
How safe is Snapshot Hunter? It follows exactly the same procedures developed and shared by VMware support for snapshot removal, so it does not harm your environment.
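As a rough illustration of this kind of throttling, the sketch below keeps a per-datastore count of open snapshots and refuses to start new work once a limit is reached. The class and method names are hypothetical and this is not how Veeam Backup & Replication implements its resource scheduling; it only shows the “wait until a slot is free” idea.

```python
from collections import Counter


class DatastoreSnapshotLimiter:
    """Track open snapshots per datastore and enforce a maximum (hypothetical helper)."""

    def __init__(self, max_per_datastore):
        self.max_per_datastore = max_per_datastore
        self.open_snapshots = Counter()

    def try_acquire(self, datastore):
        """Reserve a slot on the datastore; return False if the limit is reached."""
        if self.open_snapshots[datastore] >= self.max_per_datastore:
            return False
        self.open_snapshots[datastore] += 1
        return True

    def release(self, datastore):
        """Free a slot once a snapshot on the datastore has been removed."""
        if self.open_snapshots[datastore] > 0:
            self.open_snapshots[datastore] -= 1


limiter = DatastoreSnapshotLimiter(max_per_datastore=2)
first = limiter.try_acquire("datastore1")
second = limiter.try_acquire("datastore1")
third = limiter.try_acquire("datastore1")   # limit reached: must wait
limiter.release("datastore1")               # one snapshot has been removed
fourth = limiter.try_acquire("datastore1")  # now consolidation can proceed
```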
On the contrary, let it run and it will be the hero of your virtual infrastructure.