I’m working on some heavy lab tests these weeks, plus I’m travelling a bit more than usual, so my blogging activity has slowed down a bit. While catching up on the news, I found two different articles that together give a good sense of two facts about the public cloud, or the so-called “hyper-scalers”: they have insanely massive resources, but as insane as they are, they are not infinite.
How many CPUs do you need?
In one of my previous jobs, I worked close to a computing research center, and I had the pleasure of visiting that place a couple of times. They had (and still have today) some massive computation systems, mainly Cray machines, with the biggest one at that time having something like 20,000 Opteron CPUs. It was liquid-cooled, and was used by universities and researchers to run any possible kind of heavy computational workload. I remember the incredible noise in those server rooms, the massive size of the water cooling system, and the nice paintings on the front of those server racks.
As I learnt a bit more about those systems, I also found out that the deployment of such machines had to be planned well in advance, taking into account things like all the hoses for the cooling system, power distribution and so on.
So, you can imagine my surprise when I read this article: 220,000 cores and counting: MIT math professor breaks record for largest ever Compute Engine job. If you are fascinated by the numbers like I was, you can go to the page and read the entire article. I’m not going into the details, nor discussing what the price of this solution is, but let’s focus on the title itself: someone can connect to Google Cloud Platform, order that amount of CPU power, and pay for it by the hour! That is 11 times the supercomputer I visited some years ago, without any of the issues (from the user’s point of view) of building or maintaining such an infrastructure. Just a pure “as a service” solution. And I’m pretty sure Google didn’t consume all the resources they have, so the total amount of compute power at their disposal is even more impressive.
No more VMs for you
Another piece of news I’ve seen in recent months is this one: Capacity shortage hits AWS UK micro instances. I have no idea (does anyone?) why T2.micro instances were no longer available, while other types of virtual machines could still be deployed. We could speculate about many things, but that is not the point I wanted to highlight. The interesting fact is that, at some point, AWS ran out of resources. It was a temporary issue of course, but it was interesting nonetheless. Some people may have needed some burst capacity, for example for their web servers, and if their auto-scaling was configured to use that instance type, the end result was that scaling simply didn’t happen…
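To make the “Plan B” idea a bit more concrete, here is a minimal sketch of how one could guard against this kind of capacity shortage in code. It assumes Python with boto3; the instance types, AMI ID and region are hypothetical placeholders, and a real auto-scaling setup would of course handle the fallback in its own configuration rather than in a script like this.

```python
# Minimal sketch (assumptions: boto3 available, hypothetical AMI/region/types):
# try to launch an instance and fall back to a second instance type if AWS
# reports it has no capacity left for the preferred one.
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="eu-west-2")  # hypothetical region

# Preferred type first, "Plan B" types after it.
INSTANCE_TYPES = ["t2.micro", "t2.small", "t3.micro"]
AMI_ID = "ami-0123456789abcdef0"  # hypothetical AMI

def launch_with_fallback():
    for instance_type in INSTANCE_TYPES:
        try:
            response = ec2.run_instances(
                ImageId=AMI_ID,
                InstanceType=instance_type,
                MinCount=1,
                MaxCount=1,
            )
            return response["Instances"][0]["InstanceId"]
        except ClientError as error:
            code = error.response["Error"]["Code"]
            # On capacity-related errors, try the next type instead of failing.
            if code in ("InsufficientInstanceCapacity", "InstanceLimitExceeded"):
                continue
            raise
    raise RuntimeError("No capacity for any of the configured instance types")

if __name__ == "__main__":
    print(launch_with_fallback())
```

The point is simply that the fallback has to be decided before the shortage happens: if the only acceptable answer is “T2.micro or nothing”, then nothing is exactly what you may get.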
Some lessons to learn?
As much as 220,000 cores may sound like an incredible amount of resources that almost no one would ever buy as a service from a service provider, it happened. And as more and more people move their workloads to the cloud, we will always have to remember that we may at some point find a situation where the cloud cannot serve the requests it has received. We have this perception that the public cloud is infinite, but it’s not, just like it’s not always available. Have you already forgotten the AWS S3 outage?
So, whoever is planning to consume public cloud resources, from people needing just a tiny T2.micro virtual machine to someone needing hundreds of thousands of cores, remember to plan carefully and to always have a Plan B if your business really depends on the public cloud resources you are looking for.
If you use the cloud, and the cloud is down, you are down.