I'm not a fan of BTRFS. This page serves as a reminder of why, but also a cheat sheet to figure out basic tasks in a BTRFS environment because those are not obvious to me, even after repeatedly having to deal with them.

Content warning: there might be mentions of ZFS.

  1. Stability concerns
  2. My BTRFS test setup
  3. No builtin encryption
  4. Subvolumes, filesystems, and devices
  5. Mounting BTRFS subvolumes
  6. No disk usage per volume
  7. Conclusion

Stability concerns

I'm worried about BTRFS stability, which has been historically ... changing. RAID-5 and RAID-6 are still marked unstable, for example. It's kind of a lucky guess whether your current kernel will behave properly with your planned workload. For example, in Linux 4.9, RAID-1 and RAID-10 were marked as "mostly OK" with a note that says:

Needs to be able to create two copies always. Can get stuck in irreversible read-only mode if only one copy can be made.

Even as of now, RAID-1 and RAID-10 has this note:

The simple redundancy RAID levels utilize different mirrors in a way that does not achieve the maximum performance. The logic can be improved so the reads will spread over the mirrors evenly or based on device congestion.

Granted, that's not a stability concern anymore, just performance. A reviewer of a draft of this article actually claimed that BTRFS only reads from one of the drives, which hopefully is inaccurate, but goes to show how confusing all this is.

There are other warnings in the Debian wiki that are quite scary. Even the legendary Arch wiki has a warning on top of their BTRFS page, still.

Even if those issues are now fixed, it can be hard to tell when they were fixed. There is a changelog by feature but it explicitly warns that it doesn't know "which kernel version it is considered mature enough for production use", so it's also useless for this.

It would have been much better if BTRFS was released into the world only when those bugs were being completely fixed. Or that, at least, features were announced when they were stable, not just "we merged to mainline, good luck". Even now, we get mixed messages even in the official BTRFS documentation which says "The Btrfs code base is stable" (main page) while at the same time clearly stating unstable parts in the status page (currently RAID56).

There are much harsher BTRFS critics than me out there so I will stop here, but let's just say that I feel a little uncomfortable trusting server data with full RAID arrays to BTRFS. But surely, for a workstation, things should just work smoothly... Right? Well, let's see the snags I hit.

My BTRFS test setup

Before I go any further, I should probably clarify how I am testing BTRFS in the first place.

The reason I tried BTRFS is that I was ... let's just say "strongly encouraged" by the LWN editors to install Fedora for the terminal emulators series. That, in turn, meant the setup was done with BTRFS, because that was somewhat the default in Fedora 27 (or did I want to experiment? I don't remember, it's been too long already).

So Fedora was setup on my 1TB HDD and, with encryption, the partition table looks like this:

NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                      8:0    0 931,5G  0 disk  
├─sda1                   8:1    0   200M  0 part  /boot/efi
├─sda2                   8:2    0     1G  0 part  /boot
├─sda3                   8:3    0   7,8G  0 part  
│ └─fedora_swap        253:5    0   7.8G  0 crypt [SWAP]
└─sda4                   8:4    0 922,5G  0 part  
  └─fedora_crypt       253:4    0 922,5G  0 crypt /

(This might not entirely be accurate: I rebuilt this from the Debian side of things.)

This is pretty straightforward, except for the swap partition: normally, I just treat swap like any other logical volume and create it in a logical volume. This is now just speculation, but I bet it was setup this way because "swap" support was only added in BTRFS 5.0.

I fully expect BTRFS experts to yell at me now because this is an old setup and BTRFS is so much better now, but that's exactly the point here. That setup is not that old (2018? old? really?), and migrating to a new partition scheme isn't exactly practical right now. But let's move on to more practical considerations.

No builtin encryption

BTRFS aims at replacing the entire mdadm, LVM, and ext4 stack with a single entity, and adding new features like deduplication, checksums and so on.

Yet there is one feature it is critically missing: encryption. See, my typical stack is actually mdadm, LUKS, and then LVM and ext4. This is convenient because I have only a single volume to decrypt.

If I were to use BTRFS on servers, I'd need to have one LUKS volume per-disk. For a simple RAID-1 array, that's not too bad: one extra key. But for large RAID-10 arrays, this gets really unwieldy.

The obvious BTRFS alternative, ZFS, supports encryption out of the box and mixes it above the disks so you only have one passphrase to enter. The main downside of ZFS encryption is that it happens above the "pool" level so you can typically see filesystem names (and possibly snapshots, depending on how it is built), which is not the case with a more traditional stack.

Subvolumes, filesystems, and devices

I find BTRFS's architecture to be utterly confusing. In the traditional LVM stack (which is itself kind of confusing if you're new to that stuff), you have those layers:

A typical server setup would look like this:

NAME                      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme0n1                   259:0    0   1.7T  0 disk  
├─nvme0n1p1               259:1    0     8M  0 part  
├─nvme0n1p2               259:2    0   512M  0 part  
│ └─md0                     9:0    0   511M  0 raid1 /boot
├─nvme0n1p3               259:3    0   1.7T  0 part  
│ └─md1                     9:1    0   1.7T  0 raid1 
│   └─crypt_dev_md1       253:0    0   1.7T  0 crypt 
│     ├─vg_tbbuild05-root 253:1    0    30G  0 lvm   /
│     ├─vg_tbbuild05-swap 253:2    0 125.7G  0 lvm   [SWAP]
│     └─vg_tbbuild05-srv  253:3    0   1.5T  0 lvm   /srv
└─nvme0n1p4               259:4    0     1M  0 part

I stripped the other nvme1n1 disk because it's basically the same.

Now, if we look at my BTRFS-enabled workstation, which doesn't even have RAID, we have the following:

It looks something like this to lsblk:

NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                      8:0    0 931,5G  0 disk  
├─sda1                   8:1    0   200M  0 part  /boot/efi
├─sda2                   8:2    0     1G  0 part  /boot
├─sda3                   8:3    0   7,8G  0 part  [SWAP]
└─sda4                   8:4    0 922,5G  0 part  
  └─fedora_crypt       253:4    0 922,5G  0 crypt /srv

Notice how we don't see all the BTRFS volumes here? Maybe it's because I'm mounting this from the Debian side, but lsblk definitely gets confused here. I frankly don't quite understand what's going on, even after repeatedly looking around the rather dismal documentation. But that's what I gather from the following commands:

root@curie:/home/anarcat# btrfs filesystem show
Label: 'fedora'  uuid: 5abb9def-c725-44ef-a45e-d72657803f37
    Total devices 1 FS bytes used 883.29GiB
    devid    1 size 922.47GiB used 916.47GiB path /dev/mapper/fedora_crypt

root@curie:/home/anarcat# btrfs subvolume list /srv
ID 257 gen 108092 top level 5 path home
ID 258 gen 108094 top level 5 path root
ID 263 gen 108020 top level 258 path root/var/lib/machines

I only got to that point through trial and error. Notice how I use an existing mountpoint to list the related subvolumes. If I try to use the filesystem path, the one that's listed in filesystem show, I fail:

root@curie:/home/anarcat# btrfs subvolume list /dev/mapper/fedora_crypt 
ERROR: not a btrfs filesystem: /dev/mapper/fedora_crypt
ERROR: can't access '/dev/mapper/fedora_crypt'

Maybe I just need to use the label? Nope:

root@curie:/home/anarcat# btrfs subvolume list fedora
ERROR: cannot access 'fedora': No such file or directory
ERROR: can't access 'fedora'

This is really confusing. I don't even know if I understand this right, and I've been staring at this all afternoon. Hopefully, the lazyweb will correct me eventually.

(As an aside, why are they called "subvolumes"? If something is a "sub" of "something else", that "something else" must exist right? But no, BTRFS doesn't have "volumes", it only has "subvolumes". Go figure. Presumably the filesystem still holds "files" though, at least empirically it doesn't seem like it lost anything so far.

In any case, at least I can refer to this section in the future, the next time I fumble around the btrfs commandline, as I surely will. I will possibly even update this section as I get better at it, or based on my reader's judicious feedback.

Mounting BTRFS subvolumes

So how did I even get to that point? I have this in my /etc/fstab, on the Debian side of things:

UUID=5abb9def-c725-44ef-a45e-d72657803f37   /srv    btrfs  defaults 0   2

This thankfully ignores all the subvolume nonsense because it relies on the UUID. mount tells me that's actually the "root" (? /?) subvolume:

root@curie:/home/anarcat# mount | grep /srv
/dev/mapper/fedora_crypt on /srv type btrfs (rw,relatime,space_cache,subvolid=5,subvol=/)

Let's see if I can mount the other volumes I have on there. Remember that subvolume list showed I had home, root, and var/lib/machines. Let's try root:

mount -o subvol=root /dev/mapper/fedora_crypt /mnt

Interestingly, root is not the same as /, it's a different subvolume! It seems to be the Fedora root (/, really) filesystem. No idea what is happening here. I also have a home subvolume, let's mount it too, for good measure:

mount -o subvol=home /dev/mapper/fedora_crypt /mnt/home

Note that lsblk doesn't notice those two new mountpoints, and that's normal: it only lists block devices and subvolumes (rather inconveniently, I'd say) do not show up as devices:

root@curie:/home/anarcat# lsblk 
NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                      8:0    0 931,5G  0 disk  
├─sda1                   8:1    0   200M  0 part  
├─sda2                   8:2    0     1G  0 part  
├─sda3                   8:3    0   7,8G  0 part  
└─sda4                   8:4    0 922,5G  0 part  
  └─fedora_crypt       253:4    0 922,5G  0 crypt /srv

This is really, really confusing. Maybe I did something wrong in the setup. Maybe it's because I'm mounting it from outside Fedora. Either way, it just doesn't feel right.

No disk usage per volume

If you want to see what's taking up space in one of those subvolumes, tough luck:

root@curie:/home/anarcat# df -h  /srv /mnt /mnt/home
Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/fedora_crypt  923G  886G   31G  97% /srv
/dev/mapper/fedora_crypt  923G  886G   31G  97% /mnt
/dev/mapper/fedora_crypt  923G  886G   31G  97% /mnt/home

(Notice, in passing, that it looks like the same filesystem is mounted in different places. In that sense, you'd expect /srv and /mnt (and /mnt/home?!) to be exactly the same, but no: they are entirely different directory structures, which I will not call "filesystems" here because everyone's head will explode in sparks of confusion.)

Yes, disk space is shared (that's the Size and Avail columns, makes sense). But nope, no cookie for you: they all have the same Used columns, so you need to actually walk the entire filesystem to figure out what each disk takes.

(For future reference, that's basically:

root@curie:/home/anarcat# time du -schx /mnt/home /mnt /srv
124M    /mnt/home
7.5G    /mnt
875G    /srv
883G    total

real    2m49.080s
user    0m3.664s
sys 0m19.013s

And yes, that was painfully slow.)

ZFS actually has some oddities in that regard, but at least it tells me how much disk each volume (and snapshot) takes:

root@tubman:~# time df -t zfs -h
Filesystem         Size  Used Avail Use% Mounted on
rpool/ROOT/debian  3.5T  1.4G  3.5T   1% /
rpool/var/tmp      3.5T  384K  3.5T   1% /var/tmp
rpool/var/spool    3.5T  256K  3.5T   1% /var/spool
rpool/var/log      3.5T  2.0G  3.5T   1% /var/log
rpool/home/root    3.5T  2.2G  3.5T   1% /root
rpool/home         3.5T  256K  3.5T   1% /home
rpool/srv          3.5T   80G  3.5T   3% /srv
rpool/var/cache    3.5T  114M  3.5T   1% /var/cache
bpool/BOOT/debian  571M   90M  481M  16% /boot

real    0m0.003s
user    0m0.002s
sys 0m0.000s

That's 56360 times faster, by the way.

But yes, that's not fair: those in the know will know there's a different command to do what df does with BTRFS filesystems, the btrfs filesystem usage command:

root@curie:/home/anarcat# time btrfs filesystem usage /srv
Overall:
    Device size:         922.47GiB
    Device allocated:        916.47GiB
    Device unallocated:        6.00GiB
    Device missing:          0.00B
    Used:            884.97GiB
    Free (estimated):         30.84GiB  (min: 27.84GiB)
    Free (statfs, df):        30.84GiB
    Data ratio:               1.00
    Metadata ratio:           2.00
    Global reserve:      512.00MiB  (used: 0.00B)
    Multiple profiles:              no

Data,single: Size:906.45GiB, Used:881.61GiB (97.26%)
   /dev/mapper/fedora_crypt  906.45GiB

Metadata,DUP: Size:5.00GiB, Used:1.68GiB (33.58%)
   /dev/mapper/fedora_crypt   10.00GiB

System,DUP: Size:8.00MiB, Used:128.00KiB (1.56%)
   /dev/mapper/fedora_crypt   16.00MiB

Unallocated:
   /dev/mapper/fedora_crypt    6.00GiB

real    0m0,004s
user    0m0,000s
sys 0m0,004s

Almost as fast as ZFS's df! Good job. But wait. That doesn't actually tell me usage per subvolume. Notice it's filesystem usage, not subvolume usage, which unhelpfully refuses to exist. That command only shows that one "filesystem" internal statistics that are pretty opaque.. You can also appreciate that it's wasting 6GB of "unallocated" disk space there: I probably did something Very Wrong and should be punished by Hacker News. I also wonder why it has 1.68GB of "metadata" used...

At this point, I just really want to throw that thing out of the window and restart from scratch. I don't really feel like learning the BTRFS internals, as they seem oblique and completely bizarre to me. It feels a little like the state of PHP now: it's actually pretty solid, but built upon so many layers of cruft that I still feel it corrupts my brain every time I have to deal with it (needle or haystack first? anyone?)...

Conclusion

I find BTRFS utterly confusing and I'm worried about its reliability. I think a lot of work is needed on usability and coherence before I even consider running this anywhere else than a lab, and that's really too bad, because there are really nice features in BTRFS that would greatly help my workflow. (I want to use filesystem snapshots as high-performance, high frequency backups.)

So now I'm experimenting with OpenZFS. It's so much simpler, just works, and it's rock solid. After this 8 minute read, I had a good understanding of how ZFS worked. Here's the 30 seconds overview:

There's also other special volumes like caches and logs that you can (really easily, compared to LVM caching) use to tweak your setup. You might also want to look at recordsize or ashift to tweak the filesystem to fit better your workload (or deal with drives lying about their sector size, I'm looking at you Samsung), but that's it.

Running ZFS on Linux currently involves building kernel modules from scratch on every host, which I think is pretty bad. But I was able to setup a ZFS-only server using this excellent documentation without too much problem.

I'm hoping some day the copyright issues are resolved and we can at least ship binary packages, but the politics (e.g. convincing Debian that is the right thing to do) and the logistics (e.g. DKMS auto-builders? is that even a thing? how about signed DKMS packages? fun-fun-fun!) seem really impractical. Who knows, maybe hell will freeze over (again) and Oracle will fix the CDDL. I personally think that we should just completely ignore this problem (which wasn't even supposed to be a problem) and ship binary packages directly, but I'm a pragmatic and do not always fit well with the free software fundamentalists.

All of this to say that, short term, we don't have a reliable, advanced filesystem/logical disk manager in Linux. And that's really too bad.

Some comments

Concerning stability: of course it is circumstantial, but facebook runs a few thousand servers with btrfs (AFAIR), and personally I am running it since many years without a single failure. Admittingly, raid5/6 is broken, don't touch it. raid1 works also rock solid afais (with some quirks, again, I admit - namely that healing is not automatic).

Concerning subvolumes - you don't get separate disk usage simply because it is part of the the main volume (volid=5), and within there just a subdirectory. So getting disk usage from there would mean reading the whole tree (du-like -- btw, I recommend gdu which is many many times faster than du!).

The root subvolume on fedora is not the same as volid=5 but just another subvolume. I also have root volumes for Debian/Arch/Fedora on my system (sharing /usr/local and /home volumes). That they called it root is indeed confusing.

One thing that I like a lot about btrfs is btrfs send/receive, it is a nice way to do incremental backups.

Comment by Norbert

of course it is circumstantial, but facebook runs a few thousand servers with btrfs (AFAIR)

i wouldn't call this "circumstantial", that's certainly a strong data point. But Facebook has a whole other scale and, if they're anything like other large SV shops, they have their own Linux kernel fork that might have improvements we don't.

and personally I am running it since many years without a single failure.

that, however, I would call "anecdotal" and something I hear a lot.... often followed with things like:

with some quirks, again, I admit - namely that healing is not automatic

... which, for me, is the entire problem here. "it kind of works for me, except sometimes not" is really not what I expect from a filesystem.

Concerning subvolumes - you don't get separate disk usage simply because it is part of the the main volume (volid=5), and within there just a subdirectory. So getting disk usage from there would mean reading the whole tree (du-like -- btw, I recommend gdu which is many many times faster than du!).

The root subvolume on fedora is not the same as volid=5 but just another subvolume. I also have root volumes for Debian/Arch/Fedora on my system (sharing /usr/local and /home volumes). That they called it root is indeed confusing.

Hacker News helpfully reminded me that:

the author intentionally gave up early on understanding and simply rants about everything that does not look or work as usual

I think it's framed as criticism of my work, but I take it as a compliment. I reread the two paragraphs a few times, and they still don't make much sense to me. It just begs more questions:

  1. can we have more than one main volumes?
  2. why was it setup with subvolumes instead of volumes?
  3. why isn't everything volumes?

I know I sound like a newbie meeting a complex topic and giving up. But here's the thing: I've encountered (and worked in production) with at least half a dozen filesystems in my lifetime (ext2/ext3/ext4, XFS, UFS, FAT16/FAT32, NTFS, HFS, ExFAT, ZFS), and for most of those, I could use them without having to go very deep into the internals.

But BTRFS gets obscure quick. Even going through official documentation (e.g. BTRFS Design), you start with C structs. And somewhere down there there's this confusing diagram about the internal mechanics of the btree and how you build subvolumes and snapshots on top of that.

If you want to hack on BTRFS, that's great. You can get up to speed pretty quick. But I'm not looking at BTRFS from an enthusiast, kernel developer look. I'm looking at it from a "OMG what is this" look, with very little time to deal with it. Every other filesystem architecture I've used like this so far has been able to somewhat be operational in a day or two. After spending multiple days banging my head on this problem, I felt I had to write this down, because everything seems so obtuse that I can't wrap my head around it.

Anyways, thanks for the constructive feedback, it certainly clarifies things a little, but really doesn't make me want to adopt BTRFS in any significant way.

Comment by anarcat
Created . Edited .