In one of my presentations for the VeeamON conference, titled The Quest for the Ultimate Backup Storage Architecture, I will explain why a tiered approach to data protection is the best way to have effective protection in place, and I will describe the different layers of data protection that can be applied to a production environment. The layer I spent the most time thinking about was storage snapshots. After some thought, I labeled it Tier 0, with specific pros and cons that should be carefully evaluated before using snapshots in a data protection scenario.
First of all: are storage snapshots a data protection solution? There have always been two camps arguing, sometimes quite heatedly, about whether storage snapshots can be considered one. For me, the answer is simple: yes, because they hold an additional copy of production data.
Then, as with every layer applied to the solution, the choice to use them comes after a proper evaluation of all their pros and cons. First of all, one should evaluate the available technologies. Especially in the SMB and mid-market segments, not every storage solution comes with storage snapshots. If your storage of choice doesn’t have this feature, then obviously this tier cannot be used. If snapshots are available as an additional license, it comes down instead to an ROI evaluation: are they going to be effective enough to justify the additional cost of the license? Luckily, many modern storage solutions offer snapshots at no additional cost.
Second, the technology itself. Some storage arrays are really great at taking snapshots, some are not. If your storage is not able to take thin snapshots, or the snapshots stop being thin when they are converted into independent cloned volumes, then again data protection done with storage snapshots may not be for you.
If your storage instead has a good snapshot technology (and most modern solutions do), you can move on to evaluating the pros and cons of this protection layer. Let’s see them.
Cons
– Failure domain. What people often do not realize, and I see it for example in many threads on the Veeam Forums, is the simple fact that snapshots are saved in the same array as the original production volumes. If for any reason access to the storage array is lost, you lose at the same time the production volumes and the snapshots you could have used to restore them. This is the main reason to apply other data protection layers together with storage snapshots.
– Performance. Another underestimated problem of storage snapshots is the performance impact they have. On modern storage arrays, blocks are mapped in a metadata database, and a snapshot is simply an additional set of metadata pointing to the same blocks. So when a snapshot is taken, no blocks are copied on the storage; a new pointer is simply added to the database. The original block is never modified: when a write operation comes in to change it, the new data is written to a new block, while the previous version remains intact in the snapshot copy. But if for any reason snapshot creation involves a copy activity (copy on write is still a common technology in many storage arrays; for further information read All Snapshots Are Not Created Equal by Howard Marks), then snapshots DO HAVE an impact on your production array. This is important to understand, because high snapshot activity reduces the I/O available to your production volumes: the bottleneck can be your disk pools (both volumes and snapshots use the same disks) or the storage controller. Even if vendors love to claim they can take an infinite number of snapshots, the reality is that an excessive number of snapshots can lower the performance of your storage.
– Granularity. If the snapshot is taken on a block volume, the copy involves all the virtual machines running in that volume. So, for every VM that needs to be protected, in reality you are also taking snapshots of other VMs that may not be so important to you. This means additional storage space for snapshots, and reduced performance for all the VMs hosted in that volume. A first and simple solution is to create different volumes, with different snapshot policies, and then distribute VMs across those volumes accordingly. A dev/test environment probably doesn’t need to be protected by taking snapshots every hour, while a production database can be saved every 10 minutes. If instead the underlying technology is file based (like NFS) and the storage is capable of identifying the single VM, then the snapshot is taken only for that specific VM, and this limit no longer applies. This level of granularity obviously affects restores too: if I need to recover only a single VM on a block volume, I need to clone an entire snapshot into a volume, and if the cloned volume is not space efficient, both performance (I need to consume enough I/O to clone the entire original volume) and space consumption (I need to create a full copy of the original volume) will be impacted.
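To make the pointer idea from the performance point a bit more concrete, here is a minimal sketch in Python. It is a toy model and not how any specific array works: the `Volume` class, its fields and the `overwrite_cost` helper are all names I made up for illustration, and the "3 I/Os" figure for copy on write is the usual simplification (read the old block, copy it to the snapshot area, write the new data).

```python
class Volume:
    """Toy model of a volume whose blocks are tracked in a metadata map.

    Hypothetical illustration only; real arrays are far more complex.
    """

    def __init__(self):
        self.store = {}      # physical block id -> data
        self.live = {}       # logical address -> physical block id
        self.snapshots = {}  # snapshot name -> frozen copy of the map
        self.next_id = 0
        self.physical_writes = 0

    def write(self, lba, data):
        # New data always lands in a fresh physical block, so any block
        # still referenced by a snapshot is left untouched.
        self.store[self.next_id] = data
        self.live[lba] = self.next_id
        self.next_id += 1
        self.physical_writes += 1

    def snapshot(self, name):
        # Only the metadata map is copied, no data block is read or written.
        self.snapshots[name] = dict(self.live)

    def read(self, lba, snapshot=None):
        mapping = self.live if snapshot is None else self.snapshots[snapshot]
        return self.store[mapping[lba]]


def overwrite_cost(copy_on_write, first_touch_since_snapshot):
    """Simplified back-end I/O count for a single overwrite."""
    if copy_on_write and first_touch_since_snapshot:
        return 3  # read old block, copy it aside, write new data
    return 1      # pointer-based design: just write the new block


vol = Volume()
vol.write(0, "v1")
vol.snapshot("hourly")
vol.write(0, "v2")
print(vol.read(0), vol.read(0, snapshot="hourly"), vol.physical_writes)
```

The snapshot still returns the old data while the live volume sees the new one, and taking the snapshot itself cost no data I/O. With copy on write, the first overwrite of each block after a snapshot is where the extra I/O on the production array comes from.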
Pros
Then, obviously, there are also pros:
– data path: the snapshot is already inside the storage array, and any operation is done internally. There is no need to move data in and out of the storage array: its controller completes any required operation without sending data out to a server. This means I/O performance is the highest possible, compared to data protection techniques involving an external server (like hypervisor snapshots or proper backup activities). This also leads to the other two advantages.
– frequency: because the data path is the shortest possible, the snapshot frequency can be really high before you see an impact on production VMs. Depending on the storage solution in use, it can be something like every 5 minutes, down to one every few seconds. A warning however: this is true if the snapshot is crash consistent, that is, the applications inside the VM have not been quiesced before taking the snapshot. Depending on the application, a crash-consistent copy may or may not be enough; for example, a SQL database requires an application-consistent copy to guarantee its consistency. If application consistency is needed, then the frequency cannot be as high, because certain activities need to be done inside the VM before taking the snapshot, and this takes time. And if you think about block volumes with multiple VMs, all the running VMs must be quiesced before taking a consistent snapshot of the entire volume.
– impact on the hypervisor: because storage snapshots are taken in a lower layer that is not seen by the hypervisor, the impact on the hypervisor itself and its running VMs is much lower. In the eyes of the hypervisor there is NO snapshot. This is a great way to work around, for example, the known limits of VMware snapshot technology: VMware snapshots are committed back into the original virtual disk using a redo log approach, that is, each block updated while the VM was running with an active snapshot must be written back into the original virtual disk. This amount of I/O activity is detrimental to VM performance, and can also lead to disconnections (the notorious stun problem). Storage snapshots usually do not involve the hypervisor, so VM performance is not impacted by this activity. Again, if application consistency is a concern, the hypervisor has to be involved.
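The frequency trade-off above can be put into rough numbers with a small Python sketch. Everything in it is an assumption for the sake of the example: the `min_snapshot_interval` function is hypothetical, the quiesce times are invented, and whether VMs can be quiesced in parallel or only one after another depends on your tooling.

```python
def min_snapshot_interval(quiesce_seconds, array_snapshot_seconds=1.0, parallel=True):
    """Rough lower bound, in seconds, between two snapshots of one volume.

    quiesce_seconds: per-VM time needed to quiesce applications
                     (an empty list means a crash-consistent snapshot).
    array_snapshot_seconds: assumed time the array needs for the snapshot itself.
    parallel: assume VMs are quiesced at the same time (True) or in sequence (False).
    """
    if not quiesce_seconds:
        return array_snapshot_seconds
    prep = max(quiesce_seconds) if parallel else sum(quiesce_seconds)
    return prep + array_snapshot_seconds


# Crash consistent: only the array-side operation counts.
print(min_snapshot_interval([]))

# Application consistent, three VMs sharing one block volume (made-up timings).
print(min_snapshot_interval([30, 45, 20]))
print(min_snapshot_interval([30, 45, 20], parallel=False))
```

Even in this simplified model, the slowest VM in a shared volume dictates how often the whole volume can be snapshotted consistently, which is one more reason to group VMs by protection policy.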
Final notes
Like every layer of data protection, storage snapshots are a great and effective solution, and in a Defense in Depth approach they could and should be implemented. People have to carefully evaluate their pros and cons, and think about storage snapshots not in isolation, but as part of a wider data protection design. When combined with other tiers, they are a great addition to the plan.
What about you? Do you use them? Do you think there are additional pros or cons to be considered?