With the release of Ceph Luminous 12.2 and its new BlueStore storage backend finally declared stable and ready for production, it was time to learn more about this new version of the open-source distributed storage system and plan an upgrade of my Ceph cluster.
A bit of my personal history with Ceph
I’ve written a lot about Ceph in the past, and one of my most successful series of posts has been the 11-part “My adventures with Ceph storage“. Even though it is now more than two years old, many people still come to my blog to read it, for a single post or the entire series, and even though it was mainly based on Giant (v0.87), several readers still use it to deploy their clusters. In the last post of the series I also covered upgrading the cluster to Hammer (v0.94). Finally, in 2016 I wrote a new “quick start guide” to install Jewel (v10.2) from scratch, without any theory or explanation about Ceph (for that, you can still rely on the original series).
I moved on to other projects after that, so even though I kept an eye on Ceph, my cluster remained at that version until now. With the release of Luminous (v12.2), I decided it was time to go back to my cluster, learn more about the new features of Ceph, and plan another upgrade of my lab.
Ceph versions have always been named after species of cephalopods, so over its history we have had (version in parentheses, followed by the release date):
Argonaut (v0.48) – July 3, 2012
Bobtail (v0.56) – January 1, 2013
Cuttlefish (v0.61) – May 7, 2013
Dumpling (v0.67) – August 14, 2013
Emperor (v0.72) – November 9, 2013
Firefly (v0.80) – May 7, 2014
Giant (v0.87) – October 29, 2014
Hammer (v0.94) – April 7, 2015
Infernalis (v9.2.0) – November 6, 2015
Jewel (v10.2.0) – April 21, 2016
Kraken (v11.2.0) – January 20, 2017
Luminous (v12.2.0) – August 29, 2017
Mimic (we already know this will be the next major release name)
As you can see, I simply skipped Kraken in my lab. As always, each release brings many updates and new features, and Kraken is no different, but the biggest change by far in Luminous, and the one that pushed me to upgrade, is that BlueStore is finally stable and ready for production use. Time to talk about it a bit more.
Ceph BlueStore
There are many new features in Luminous, and some of them are really cool, like the new built-in web-based dashboard, a very welcome addition after all my troubles in the past with the Calamari components. However, I will leave those for some dedicated future posts (you can read about them all in the release notes); today I will focus exclusively on BlueStore.
Note: much of this information is taken from this article from the Ceph team. As I read it, I made my own version/digest of what is written there.
Before BlueStore, Ceph used a different format for its OSDs, the data storage containers, called FileStore. FileStore used, as the name implies, binary files on top of a filesystem (usually XFS) to store objects. Even if there was nothing wrong with the concept behind FileStore per se, this layout was simply not efficient for running an object store. It wasn’t just me thinking there were too many layers; the Ceph developers thought so too, and for this reason they developed BlueStore:
As you can see from this diagram, the filesystem layer has been removed, and Ceph objects are now written directly to the underlying storage medium, be it an HDD, an SSD, or something else. This obviously removes some complexity, and performance is expected to improve since there is one less abstraction layer. But it’s not just about performance: BlueStore also brings full data checksumming and built-in compression.
In general terms, BlueStore is about twice as fast as FileStore, and performance is more consistent, with lower latency. The reality is, as usual, more complicated, and the linked article has specific examples of different IO operations and how they are handled by both FileStore and BlueStore. The one I care most about in my daily job is this: “For large writes, we avoid a double-write that FileStore did, so we can be up to twice as fast”. Veeam backups will surely be much faster over BlueStore!
Also, unlike FileStore, BlueStore is copy-on-write: performance with RBD volumes (again, the way I usually mount Ceph volumes inside Veeam Linux repositories) or CephFS files that were recently snapshotted will be much better.
How does BlueStore work?
Since BlueStore consumes raw block devices, you’ll notice the data directory is now a tiny (100MB) partition with just a handful of files in it, and the rest of the device looks like a large unused partition, with a block symlink in the data directory pointing to it. This is where BlueStore is putting all of its data, and it is performing IO directly to the raw device (using the Linux asynchronous libaio infrastructure) from the ceph-osd process. (You can still see the per-OSD utilization via the standard ceph osd df command). Also, you can no longer see the underlying object files like you used to.
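For example, on an OSD node you can peek at the new layout and still check space utilization with a couple of standard commands (the OSD id 0 and the default /var/lib/ceph path below are only illustrative):

    # list the tiny BlueStore data directory; note the 'block' symlink pointing to the raw partition
    ls -l /var/lib/ceph/osd/ceph-0
    # per-OSD space utilization is still reported by the cluster itself
    ceph osd df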
BlueStore can run against a combination of slow and fast devices, just like FileStore, but it is better designed to leverage the characteristics of fast devices. The general recommendation is to take as much SSD space as you have available for the OSD and use it for the block.db device. By default, the block.db partition created on the fast device is 1% of the main device size.
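As a rough sketch of what provisioning looks like with ceph-disk (the device names are just examples: /dev/sdb as the data HDD, /dev/sdc as the SSD hosting block.db):

    # prepare a BlueStore OSD on the HDD, with its RocksDB metadata on the faster SSD
    ceph-disk prepare --bluestore /dev/sdb --block.db /dev/sdc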
Another aspect is memory. FileStore was a file-based solution that used a normal Linux file system, which meant the kernel was responsible for managing memory for caching data and metadata. Because BlueStore is implemented in userspace as part of the OSD, Ceph now has to manage its own cache: with BlueStore there is a bluestore_cache_size configuration option that controls how much memory each OSD will use for the BlueStore cache. By default this is 1 GB for HDD-backed OSDs and 3 GB for SSD-backed OSDs, but you can set it to whatever is appropriate for your environment. These values are bigger than before, so we can expect Ceph clusters to become a bit more memory hungry.
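For example, on memory-constrained lab nodes a ceph.conf snippet along these lines should shrink the cache (the values are only an illustration, not a recommendation; sizes are expressed in bytes):

    [osd]
    # lower the BlueStore cache for HDD- and SSD-backed OSDs
    bluestore cache size hdd = 536870912
    bluestore cache size ssd = 1073741824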
Checksums and Compression
As I said before, BlueStore is not just about performance. One important addition is checksumming, which makes data storage even more reliable: Ceph with BlueStore now calculates, stores, and verifies checksums for all data and metadata it stores. Any time data is read off of disk, a checksum is used to verify the data is correct before it is exposed.
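The checksum algorithm is crc32c by default, and it can be changed globally or per pool if needed; a minimal sketch, assuming a pool named mypool:

    # global default for BlueStore OSDs, in ceph.conf
    [osd]
    bluestore csum type = crc32c
    # or as a per-pool property, from the command line
    ceph osd pool set mypool csum_type crc32c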
Also, BlueStore can transparently compress data using zlib, snappy, or lz4. This is disabled by default, but it can be enabled globally, for specific pools, or be selectively used when RADOS clients hint that data is compressible.
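Enabling compression on a single pool, for instance, takes just a couple of commands (the pool name mypool and the choice of snappy are only examples):

    # compress data written to this pool whenever it is compressible enough
    ceph osd pool set mypool compression_algorithm snappy
    ceph osd pool set mypool compression_mode aggressive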
Converting existing clusters to use BlueStore
Obviously, using BlueStore in a newly created cluster is easy, but what about existing clusters that we may want to upgrade? The Ceph team designed this scenario in a smart way: a Ceph 12.2 cluster can run FileStore and BlueStore OSDs at the same time, since the choice of backend is made per OSD. An upgraded cluster will continue to operate as it did before, with the exception that new OSDs will (by default) be deployed with BlueStore.
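To check which backend a given OSD is actually running, its metadata reports it (the OSD id 0 here is just an example):

    # prints "osd_objectstore": "bluestore" or "filestore" for that OSD
    ceph osd metadata 0 | grep osd_objectstore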
I looked around for ways to convert my existing (FileStore) OSDs to the new backend. This is essentially a process of reprovisioning each OSD device with the new backend and letting the cluster use its existing healing capabilities to copy the data back. There is a migration document available, and I plan to use it when I start the upgrade of my cluster.
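The replace-and-heal loop for a single OSD looks roughly like the sketch below (the OSD id and device name are placeholders, and the exact commands may vary with your minor release, so follow the migration document for the real procedure):

    ID=0                 # the FileStore OSD to convert, one at a time
    DEVICE=/dev/sdb      # the device backing that OSD
    ceph osd out $ID
    # wait until the cluster has migrated all data off this OSD
    while ! ceph osd safe-to-destroy osd.$ID; do sleep 60; done
    systemctl stop ceph-osd@$ID
    ceph osd destroy $ID --yes-i-really-mean-it
    # reprovision the same device (and OSD id) as BlueStore, then let the cluster backfill it
    ceph-volume lvm zap $DEVICE
    ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID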