Switching to a new filesystem is never done lightly. We all have a trusted old filesystem that may be limited in features and performance, but has never let us down. New filesystems are available, and they promise wonderful things. But as fascinated as we are by them, the big question, “Should I trust it?”, comes to mind as soon as we start thinking about moving to a new filesystem. In Linux, this question arises every time BTRFS is involved.
A new filesystem that is not that new anymore
I loved this screenshot as soon as I saw it. The guy here is Chris Mason; he now works at Facebook (a heavy user of BTRFS, by the way), and he started developing BTRFS around 2007, back when he was working for Oracle. So it’s already 8 years old, and as the picture says, it’s not only “not new anymore”, it’s aging 🙂
There are many misconceptions around BTRFS on the Internet. Some come from real problems that the filesystem had initially, but others persist because people usually don’t check the date of the information they read. Yes, BTRFS was really unstable at the beginning, but if you read about huge data corruption problems in a blog post, and that post was written in 2010, well, maybe things have changed since then.
The most important part of a file system is its on-disk format, that is, the format used to store data on the underlying media. Well, the BTRFS on-disk format is no longer unstable, and it’s not expected to change unless there are strong reasons to do so. This alone should be enough to tell people that BTRFS is stable.
So, why is it considered unstable by many?
There are a few reasons. First, as I said, people are scared of change when it comes to filesystems. Why change from a trusted and known one to something new? And it’s not just Linux: the same is happening with Microsoft and its plan to move from NTFS to ReFS. But I’ve always seen a paradox here: fine for XFS, which has 20 years of stable development, but ext4, the “trusted” default file system, was started as a fork of ext3 in 2006. So, it’s just 1 year older than BTRFS!
The second reason is probably the fast development cycle. While the on-disk format is finalized, the code base is still under heavy development to keep improving performance and introducing new features. This, together with management tools that were stabilized only recently, made people think that the entire project wasn’t stable.
Confirmations come from the field
The screenshot I took comes from this video.
https://www.youtube.com/watch?v=W3QRWUfBua8
First important note: it’s not another ranting post from 8 years ago, it’s Chris Mason himself speaking at NYLUG in May 2015, so just a few months ago. And the examples he brings are the best proof about BTRFS: they are using the filesystem at Facebook, where they store production data. And the nice part is that, precisely because BTRFS is used in production at Facebook, the sheer size of the storage involved helps in testing and fixing the code at a pace that wouldn’t be possible in smaller installations.
And if you watch the video, you’ll see how some real heavyweights in the industry are supporting and working to improve BTRFS: Facebook, SuSE, RedHat, Oracle, Intel… And the results are showing up: starting from SuSE Linux Enterprise Server 12, released in October 2014, BTRFS has become the default file system of this distribution. Kudos to the guys at SuSE, because for sure the best way to push its adoption is to make a statement like this: “we are a for-profit company, not a group of Linux geeks, and we trust this filesystem to the point that it’s going to become our default one”.
Why is BTRFS awesome?
Ok, so BTRFS is stable enough to be trusted. Or at least I trust it, together with people whose judgement has way more value than mine, like the Linux experts at Facebook and SuSE. At this point, if you still don’t trust it, stop reading this post and keep using ext4 or xfs, no problem.
But if you are thinking “maybe I can use BTRFS on my next Linux deployment”, why should you consider it? Well, because it has some great features! The page linked at the beginning has the complete list; here I’m going to list the ones I like the most.
BTRFS has been designed from the beginning to deal with modern data sources, and in fact it is able to manage modern large hard disks and large disk groups, up to 2^64 bytes. That number means a maximum file size and file system size of 16 EiB, and yes, the E means Exa (strictly speaking an EiB is an exbibyte, 2^60 bytes). This is possible thanks to the way it consumes space: other file systems use disks in a contiguous manner, layering their structure in a single space from the beginning to the end of the disk. This makes the rebuild of a disk, especially a large one, extremely slow, and there’s also no internal protection mechanism, as one disk is seen as a single entity by the filesystem itself.
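To make the 2^64 figure concrete, here is a tiny Python check of the arithmetic. It is only a unit conversion; the constant names are mine and have nothing to do with BTRFS code.

```python
# Convert the 2^64-byte limit quoted above into binary (IEC) units.
LIMIT_BYTES = 2 ** 64
TIB = 2 ** 40  # one tebibyte in bytes
EIB = 2 ** 60  # one exbibyte in bytes

limit_eib = LIMIT_BYTES // EIB
limit_tib = LIMIT_BYTES // TIB

print(f"{LIMIT_BYTES} bytes = {limit_eib} EiB = {limit_tib} TiB")
```

In other words, the limit is 16 EiB, or a bit under 17 million TiB.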
BTRFS instead uses “chunks”. Each disk, regardless of its size, is divided into pieces (the chunks) that are either 1 GiB in size (for data) or 256 MiB (for metadata). Chunks are then grouped into block groups, with each chunk of a group stored on a different device. The number of chunks used in a block group depends on its RAID level. And here comes another awesome feature of BTRFS: the volume manager is integrated directly into the filesystem, so it doesn’t need anything like hardware or software RAID, or volume managers like LVM. Data protection and striping are done directly by the filesystem, so you can have different volumes that have inner redundancy:
For example, Block group 2 is configured for RAID1 redundancy. So, a chunk is consumed on Disk1, and its mirror is stored on another device, Disk2 in the picture. In this way, if we lose Disk1, another copy of the block is still available on Disk2, and a new copy can be immediately recreated, for example on Disk3, using a free chunk. You can configure BTRFS for File Striping, File Mirroring, File Striping+Mirroring, and Striping with Single and Dual Parity.
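The RAID1 example above can be sketched as a toy model in Python. To be clear, this is not the real BTRFS allocator: the `Disk` class, the function names and the "pick the disks with the most free space" policy are all assumptions made for illustration. It only shows the idea that a mirrored block group takes one chunk from each of two different devices, and that a lost mirror can be recreated on a third device with a free chunk.

```python
from dataclasses import dataclass

CHUNK_GIB = 1  # data chunk size quoted in the text


@dataclass
class Disk:
    name: str
    size_gib: int
    used_gib: int = 0

    @property
    def free_gib(self) -> int:
        return self.size_gib - self.used_gib


def pick_disks(disks, count, exclude=()):
    """Return `count` distinct disks with the most free space, skipping `exclude`."""
    candidates = [d for d in disks
                  if d.name not in exclude and d.free_gib >= CHUNK_GIB]
    if len(candidates) < count:
        raise RuntimeError("not enough devices with a free chunk")
    # sorted() is stable, so disks with equal free space keep their order.
    return sorted(candidates, key=lambda d: d.free_gib, reverse=True)[:count]


def allocate_raid1(disks):
    """Create a RAID1 block group: one chunk on each of two different disks."""
    members = pick_disks(disks, 2)
    for d in members:
        d.used_gib += CHUNK_GIB
    return [d.name for d in members]


def rebuild_after_loss(groups, disks, lost):
    """Re-mirror every block group that had a chunk on the lost disk."""
    survivors = [d for d in disks if d.name != lost]
    for group in groups:
        if lost in group:
            # The new mirror must not land on the disk holding the other copy.
            (target,) = pick_disks(survivors, 1, exclude=group)
            target.used_gib += CHUNK_GIB
            group[group.index(lost)] = target.name
    return groups


disks = [Disk("disk1", 4), Disk("disk2", 4), Disk("disk3", 4)]
groups = [allocate_raid1(disks) for _ in range(3)]
print("before loss:", groups)

rebuilt = rebuild_after_loss([list(g) for g in groups], disks, lost="disk1")
print("after losing disk1:", rebuilt)
```

Every block group always lives on two different disks, both before and after the simulated failure, which is the whole point of doing RAID at the chunk level instead of the disk level.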
Another aspect of BTRFS is its performance. Because of its modern design and the b-tree structure, BTRFS is damn fast. If you haven’t already, look at the video above starting from 30:30. They ran a test against the same storage, formatted in turn with XFS, EXT4 and BTRFS, and they wrote around 24 million files of different sizes and layouts. XFS took 430 seconds to complete the operations, and its performance was bound by its log system; EXT4 took 200 seconds to complete the test, and its limit comes from the fixed inode locations. Both limits are the result of their design, and overcoming those limits was one of the original goals of BTRFS. Did they succeed? The same test took 62 seconds to complete on BTRFS, and the limit was the CPU and memory of the test system, while both XFS and EXT4 were able to use only around 25% of the available CPU because they were quickly IO bound.
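To put those numbers in perspective, here is a quick Python calculation of how much longer the other two filesystems took compared to BTRFS. The three timings are the ones quoted above from the talk; everything else is plain arithmetic.

```python
# Elapsed times in seconds, as quoted above from the talk.
elapsed = {"XFS": 430, "EXT4": 200, "BTRFS": 62}

baseline = elapsed["BTRFS"]
# How many times the BTRFS run time each filesystem needed.
ratio = {fs: round(seconds / baseline, 1) for fs, seconds in elapsed.items()}

for fs, factor in ratio.items():
    print(f"{fs}: {elapsed[fs]} s, {factor}x the BTRFS time")
```

So in this specific test XFS needed roughly 7 times and EXT4 roughly 3 times as long as BTRFS.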
Other features are worth a mention:
– Writable and read-only snapshots
– Checksums on data and metadata (crc32c): this is great in my view, as every stored block is checked, so any data corruption is identified as soon as the block is read, and it can be corrected when a redundant copy is available
– Compression (zlib and LZO)
– SSD (Flash storage) awareness: another sign of a modern filesystem. BTRFS identifies SSD devices, and changes its behaviour automatically. First, it uses TRIM/Discard for reporting free blocks for reuse, and it also has some optimisations like skipping unnecessary seek optimisations and sending writes in clusters, even if they are from unrelated files. This results in larger write operations and faster write throughput.
– Background scrub process for finding and fixing errors on files with redundant copies
– Online filesystem defragmentation: being a COW (copy-on-write) filesystem, each time a block is updated the block itself is not overwritten but written to a different location on the device, leaving the old block in place. If the old block at some point is not needed anymore (for example if it’s not part of any snapshot), BTRFS marks its space as available and ready to be reused.
– In-place conversion of existing ext3/4 file systems
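The checksum and scrub items in the list above work together, and a small Python sketch may help to see how. This is only an illustration of the idea, not BTRFS code: the `checksum` and `scrub` functions are hypothetical, and `zlib.crc32` is plain CRC-32, used here as a stand-in because the crc32c variant is not in the Python standard library.

```python
import zlib


def checksum(block):
    # Stand-in only: zlib.crc32 is plain CRC-32, not the crc32c used by BTRFS.
    return zlib.crc32(bytes(block))


def scrub(copies, expected):
    """Verify each mirrored copy and overwrite the bad ones from a good one.

    Returns the indexes of the copies that were repaired.
    """
    good = next((bytes(c) for c in copies if checksum(c) == expected), None)
    if good is None:
        # Corruption is still detected, but there is nothing to repair from.
        raise IOError("no intact copy left")
    repaired = []
    for index, copy in enumerate(copies):
        if checksum(copy) != expected:
            copy[:] = good
            repaired.append(index)
    return repaired


data = b"important payload"
stored_sum = checksum(data)  # kept with the metadata when the block is written

mirrors = [bytearray(data), bytearray(data)]  # two copies, as in RAID1
mirrors[0][0] ^= 0xFF  # simulate silent corruption on the first device

repaired = scrub(mirrors, stored_sum)
print("repaired copies:", repaired)
```

Note the failure branch: with a single copy the checksum can still tell you that the data is bad, but fixing it needs a second, intact copy.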
Final notes
As with any technology, BTRFS is not perfect. For example, it suffers when there is heavy write activity in the middle of existing files, so it’s probably not the best candidate for virtualization (the virtual disks are updated in-place at each write). But as always, you have to decide if the features available in a given technology are worth the migration to it, and if the (few) limits are going to affect you.
For all these reasons, I’m surely going to use BTRFS more and more in my next Linux deployments.