Last week, another outage at a large cloud provider hit the news, and the many companies using its services were impacted. This time it was Amazon Web Services: their S3 service in the US-East region was down for almost 4 hours, impacting the many other cloud services that rely on this object storage technology. What struck me, however, were the reactions of other IT people around. Warning, this is a rant blog post…
The couch architects
I’ve learned from my years working with people from the USA that they use, exactly like here in Italy, the term “couch coach” to (sarcastically) identify those sports fans who, from the comfortable position of their home couch, watch sports events while commenting on coaches and players during the entire game, and on how they surely would have done better. As a sports fan myself, but also an ex-player at a decent level (basketball in my case), I really HATE those people. Many of them have no playing experience at all, or if they do, it’s usually at a level so low that it can hardly be called the same sport as the one they are watching on TV. Yet, since it’s really easy to do and carries no negative consequences, they throw negative comments at every decision by a coach or a player, totally ignoring what it means to play at a certain level in terms of stress, the speed at which a choice has to be made, and the insane amount of skill that even the last guy on the bench has compared to them.
Lately, especially since social media has become so popular, the same behavior has started to surface in our industry too. Bloggers, journalists, analysts, ordinary IT practitioners: everyone feels entitled to comment on a large cloud provider outage, obviously stating that such an outage is unacceptable and claiming they would have done so much better. I know I’m walking a risky path, as I’m also a blogger and I’m blogging about a cloud provider outage right now. The difference is, I’m not going to blame AWS, but rather these “couch architects”.
Well, dear couch lovers, AWS and many other cloud providers are top-notch companies where the skills of the technical employees are second to none. I work on a daily basis with many service providers, some of them really close in size to the so-called hyper clouds, and I can tell you those people are way smarter than me for sure, and probably smarter than many of you too. Simply, humans are not perfect, and errors can be made. AWS has been hit by one of those errors, as you can read in great detail in this post-incident report (by the way, how many companies publish such a detailed report when something bad happens?), where they also explained HOW they are going to improve their design in order to prevent this same error from happening again.
The piggybacking companies
I find it really hard to decide which is sadder: an individual complaining about something he may not even be able to architect, or all those companies that, as soon as the outage started, literally rushed to publish some content on their websites stuffed with SEO words like “AWS”, “Outage”, “S3”, to tell everyone how their solution would have helped, offering competitive deals to migrate out of AWS, or maybe not even having anything directly related to AWS and just dropping some clickbait on the Internet. I initially thought of writing another thousand words on this sub-topic, but I ultimately decided I can spend my time writing about better things. This is just sad.
Final notes
“The cloud” is often described as “someone else’s computer”, and it’s partially true in general, but still I don’t like it. It’s disrespectful of the complexity of the design and the efficiency of operations that these infrastructures have reached, and of the capabilities of the people working there. Your “mom and pop” service provider down the road may be compared to your own computers; these giants, absolutely not.
Just remember they are not perfect and their services may fail (like they did), but the outage is probably always going to be shorter and less impactful than one in a less sophisticated environment. I can imagine the rage of companies relying on S3 for their operations, and thus being offline for the entire duration of the outage, but it was less than 4 hours after all. If you are OK with the SLAs of AWS (or any other provider), just wait for the services to come back online; otherwise, go and design your services around those limits. In the specific case of the S3 outage, it would have been as easy as having S3 cross-region replication in place. It’s a native feature of S3, and it has been available for a few years. Just use it…
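To give an idea of how little is involved, here is a minimal sketch of what turning on cross-region replication could look like from Python. Treat it as an illustration under assumptions, not a production recipe: the bucket names, the IAM role ARN and the two helper functions are hypothetical, the rule uses the simplest form of the replication schema (an empty prefix, meaning "every object"), and versioning is assumed to be already enabled on both the source and the destination bucket, since S3 replication requires it.

```python
from typing import Any, Dict


def build_replication_config(role_arn: str, destination_bucket: str) -> Dict[str, Any]:
    """Build a replication configuration that covers every object key.

    role_arn: an IAM role S3 can assume to copy objects (hypothetical here).
    destination_bucket: name of a bucket living in another region.
    """
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-everything",  # hypothetical rule name
                "Prefix": "",                  # empty prefix matches all keys
                "Status": "Enabled",
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def enable_replication(s3_client: Any, source_bucket: str, config: Dict[str, Any]) -> None:
    """Apply the configuration to the source bucket.

    s3_client is expected to behave like boto3.client("s3").
    Assumes versioning is already enabled on both buckets.
    """
    s3_client.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=config,
    )


if __name__ == "__main__":
    # Hypothetical names, printed only to show the resulting structure.
    example = build_replication_config(
        "arn:aws:iam::123456789012:role/example-replication-role",
        "my-app-data-replica",
    )
    print(example)
```

With the real SDK, the object returned by `boto3.client("s3")` would be passed as `s3_client`. Keep in mind that replication alone only gets the data into a second region: the application still has to know how to read from the replica bucket when the primary region is unavailable.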