“Oh gosh, Luca has gone mad…” I’m almost sure this is how you felt when you read the title of this post. It’s no secret that I work for Veeam, and our flagship product is Veeam Backup & Replication, a data protection solution for virtualized environments. So why did I choose that title? I like catchy titles, and above all this is the question I asked myself after reading another blog post. Read on, and you will understand why my final answer is “yes, damn yes!”.
Backups are boooooring
The article I read was complaining about the uselessness of backups. Basically, the whole idea was something like this:
“I hate backups, because they remind me that something has failed in my infrastructure design and operations: I wasn’t able to design my environment to cope with these problems by having proper redundancy in data management, adequate protection from user mistakes, and unlimited versioning of all data, so that I can revert any of it to any point in time I want.”
According to this view, backups are without doubt a band-aid for these issues. So I asked myself: “Is it true?” Are we insisting (too much?) on a tactical solution, and missing a more strategic one? If at some point our production environments become flawless, will we stop caring about data protection as an add-on solution, and maybe see it as something “embedded” in every environment?
I think this conclusion would be correct, but it rests on an initial assumption that is completely WRONG.
Do you back up your Dropbox content?
I’m almost sure you don’t. You assume the Dropbox infrastructure (or any other similar service…) can never fail. Yes, you can have problems connecting to the internet, or there may be an interruption in their uplinks, but you act as if your files are always going to be there. 99.999% of the time this is completely true. I’m probably one of those people you consider paranoid, but I also back up my Dropbox folders with my Apple Time Machine, and then I sync my backups with another “Cloud Storage” (I must say I hate this term, because it makes me feel like my precious files are happily fluttering about in the sky, like a balloon…). My files are precious to me, and that remaining 0.001% of risk is enough for me to want additional protection.
But that is not the point. I don’t make multiple backups because of that tiny percentage of risk, nor because I don’t trust Dropbox enough. After all, I’m pretty sure they make internal backups of all the data they host, or at least keep multiple copies of it. The reason is a different one.
If you design to never fail…
… you are definitely going to fail at some point and, I should add, BADLY.
Any modern infrastructure, be it Google, Facebook, Dropbox, or the next version of YOUR infrastructure, should be designed around this simple yet underestimated concept:
Design to fail
No matter what kind of solutions you are going to use, their inner resiliency, or the level of redundancy you are going to introduce (multiple servers, storage, datacenters, IT stuff…), at some point something is going to break. It’s not a matter of IF, but only of WHEN. I don’t want to have a monolith in my datacenter, something that is supposed to never fail. When it does fail, I’m sure there will be something in its design that prevents me from recovering its services in an acceptable timeframe, because its designers never thought about every possible problem and the countermeasure for it.
Instead, if you design to fail, you start from the beginning by thinking about every possible error or breakage in every component of your infrastructure. By using a scale-out approach, you will have multiple “nodes” for each component. When one or more of them fail, you will still have enough of them running, so the overall service they provide will remain available.
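To put a rough number on the scale-out idea, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the function name `service_availability` and its parameters are hypothetical, and the model assumes every node fails independently of the others with the same probability, which real incidents do not always respect.

```python
from math import comb


def service_availability(nodes: int, needed: int, node_availability: float) -> float:
    """Probability that at least `needed` out of `nodes` nodes are up.

    Assumption (hypothetical model): nodes fail independently, each one
    being up with probability `node_availability`.
    """
    if not 0 <= needed <= nodes:
        raise ValueError("needed must be between 0 and nodes")
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")

    node_failure = 1.0 - node_availability
    # Sum the binomial probabilities of having exactly `up` nodes running,
    # for every count that still keeps the service alive.
    return sum(
        comb(nodes, up) * node_availability**up * node_failure ** (nodes - up)
        for up in range(needed, nodes + 1)
    )


if __name__ == "__main__":
    # A single "monolith" versus three nodes where one is enough.
    print(service_availability(nodes=1, needed=1, node_availability=0.99))
    print(service_availability(nodes=3, needed=1, node_availability=0.99))
```

Under these assumptions, a single node that is up 99% of the time becomes a service that is up about 99.9999% of the time when any one of three nodes is enough. The independence assumption is exactly what a shared root cause breaks, so treat the result as an upper bound rather than a promise.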
And if you apply this kind of design, you also know that at some point your production infrastructure as a whole may fail. A vendor update applied to every server, with a bug that is discovered only many weeks later when there is no way to revert it, can immediately stop every production system. Or all your diesel generators may have a problem at the same time because you filled their tanks with dirty fuel, and the engines break down when they start burning it…
Whatever the next problem turns out to be, data protection is vital. It’s a separate system, with no connection points to your production environment other than the backup stream, so it is not affected by problems in production; and you are going to use it when there is a problem that cannot be fixed from inside the production environment.
Design to fail, and plan your data protection accordingly. One day it will save your skin.