ZFS documentation
Installation
The installation itself is not documented here; see below for examples instead.
Example installations
Issues
Swap
Swap on ZFS volumes (AKA "swap on ZVOL") can trigger lockups, and that issue is still not fixed upstream. Ubuntu recommends using a separate partition for swap instead. cks would rather have no swap than swap on ZFS and compares it to swap over NFS...
curie was set up without a swap partition (the hope was to use a ZFS dataset as a swap backend), but this has proven to be generally a bad idea. Were we to set up a new ZFS system, we'd use LUKS encryption and a dedicated swap partition, as we had problems with ZFS encryption as well.
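For the record, a dedicated, randomly-keyed encrypted swap partition looks roughly like this (a sketch only; /dev/sdX2 is a made-up device name, adjust to the actual layout):
# hypothetical device name; the key is thrown away at every boot
cryptsetup open --type plain --key-file /dev/urandom /dev/sdX2 swap_crypt
mkswap /dev/mapper/swap_crypt
swapon /dev/mapper/swap_crypt
The same thing can be made persistent with a crypttab entry using the swap option, which re-creates the swap area at boot.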
Native encryption
ZFS supports native encryption, but there are serious caveats with it.
I've had trouble moving encrypted datasets between pools when trying
to move the tubman rpool
from HDDs to SSDs. This is a
problem many people are facing, without good solutions, see also this
TrueNAS discussion, reddit thread, HN thread, this
openzfs docs thread, and this other one.
Also, native encryption "will not encrypt metadata related to the pool structure, including dataset and snapshot names, dataset hierarchy, properties, file size, file holes, and deduplication tables (though the deduplicated data itself is encrypted)." So it will leak some metadata about the filesystem. Deduplication is limited to the dataset level.
Therefore, it might be better to use LUKS encryption underneath ZFS to configure fully encrypted systems, although I haven't tested this directly.
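For reference, creating and later unlocking a natively encrypted dataset looks roughly like this (rpool/secure is a made-up dataset name):
zfs create -o encryption=on -o keyformat=passphrase rpool/secure
# after a reboot or an export/import, the key must be loaded before mounting
zfs load-key rpool/secure
zfs mount rpool/secure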
Note that I use dropbear-initramfs
alongside zfs-initramfs
to
unlock the partitions remotely. This requires the key in
/etc/dropbear/authorized_keys
as normal.
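The setup itself is roughly this sketch, assuming the authorized_keys path above (newer dropbear-initramfs versions read /etc/dropbear/initramfs/authorized_keys instead):
apt install dropbear-initramfs
# add the SSH public key that will be allowed to unlock the machine
cat id_ed25519.pub >> /etc/dropbear/authorized_keys
update-initramfs -u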
TRIM
I enabled (a little late) TRIM on the SSD pools:
zfs set org.debian:periodic-trim=enable bpoolssd
zfs set org.debian:periodic-trim=enable rpoolssd
That will set up periodic TRIMs, but it's also possible to set the equivalent of "discard", which "looks for space which has been recently freed, and is no longer allocated by the pool, to be periodically trimmed, however it does not immediately reclaim blocks after a free, which makes it very effective at a cost of more likely of encountering tiny ranges."
zpool set autotrim=on bpoolssd
zpool set autotrim=on rpoolssd
You can do a manual trim with:
zpool trim bpoolssd
zpool trim rpoolssd
Here's an example run:
root@tubman:/etc# zpool status -t rpoolssd
  pool: rpoolssd
 state: ONLINE
  scan: scrub repaired 0B in 00:00:37 with 0 errors on Sun Nov 13 00:24:38 2022
config:

        NAME        STATE     READ WRITE CKSUM
        rpoolssd    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb4    ONLINE       0     0     0  (untrimmed)
            sdd4    ONLINE       0     0     0  (untrimmed)

errors: No known data errors
root@tubman:/etc# zpool trim rpoolssd
root@tubman:/etc# zpool status -t rpoolssd
  pool: rpoolssd
 state: ONLINE
  scan: scrub repaired 0B in 00:00:37 with 0 errors on Sun Nov 13 00:24:38 2022
config:

        NAME        STATE     READ WRITE CKSUM
        rpoolssd    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb4    ONLINE       0     0     0  (3% trimmed, started at Wed 16 Nov 2022 12:19:04 PM EST)
            sdd4    ONLINE       0     0     0  (3% trimmed, started at Wed 16 Nov 2022 12:19:04 PM EST)

errors: No known data errors
See also the TRIM documentation in the Debian wiki.
Information
Listing partitions and snapshots:
zfs list
IO statistics, every second:
zpool iostat 1
Mounts
Mounting
After a zfs list, you should see the datasets you can mount. You can mount one by name, for example with:
zfs mount bpool/ROOT/debian
Alternate mountpoints
Note that it will mount the device in its pre-defined mountpoint property. In the above, it was /boot. If you want to change its mountpoint, it can be done on the fly with:
zfs set mountpoint=/mnt/boot bpool/ROOT/debian
If the dataset is already mounted, it will be moved to that new location immediately. Note that the parent pool's altroot property affects this path, as it's prepended to the mountpoint. See zpoolprops(8) for details.
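A quick way to check whether an altroot is in effect on a given pool is something like:
zpool get altroot bpool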
If you are dealing with a new pool that's not yet known to ZFS (e.g. you just added a new drive), you will first need to import it. Typically, you'd also want to do that in an altroot, so that it doesn't override existing mounts, like this:
zpool import POOLNAME -R /mnt
This would import all pools ZFS can find:
zpool import -a -R /mnt
Encrypted datasets
If the dataset is encrypted, however, you first need to unlock it with:
zpool import -l -a
For rescue operations, that would be the right incantation:
zpool import -l -a -R /mnt
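If the pool is already imported but the keys are not loaded (e.g. after a reboot), this sketch should also work:
zfs load-key -a
zfs mount -a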
Deprecated: zfsutil
This is another way to use an alternate mountpoint, although I'm less certain it's a good way anymore:
mount -o zfsutil -t zfs bpool/BOOT/debian /mnt
Cool hack: moving data into ZFS easily
I used this procedure to move /srv/sbuild/qemu from a spinning rust drive (BTRFS, on curie) to a ZFS dataset running over NVMe. With other filesystems, this would have required either creating a new logical volume or hacking around with bind mounts. With ZFS, this was the procedure:
zfs create -o mountpoint=none -o canmount=off rpool/srv
zfs create -o mountpoint=/mnt/sbuild rpool/srv/sbuild
mv /srv/sbuild/* /mnt/sbuild/
zfs set mountpoint=/srv/sbuild rpool/srv/sbuild
That's it! You can graft mountpoints like this anywhere, which is powerful and scary!
Snapshots
Creating:
zfs snapshot pool/volume@LABEL
Listing:
zfs list -t snapshot
Listing with creation date:
zfs list -t snapshot -o name,creation
Rollback:
zfs rollback pool/volume@LABEL
Destroy:
zfs destroy pool/volume@LABEL
Limiting the number of snapshots:
zfs set snapshot_limit=2 rpool/var/cache
This is useful if you automate snapshot creation (like, say, with sanoid) and you have filesystems that have ridiculous disk usage because of old, useless snapshots.
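To find which datasets are eating the most space in snapshots, sorting by the usedbysnapshots property can help, for example:
zfs list -o name,used,usedbysnapshots -s usedbysnapshots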
Automated snapshots
Automatic snapshots were configured with sanoid, see the Puppet code and configuration file.
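For illustration, a sanoid.conf policy typically looks something like this (dataset names and retention values are made up):
[rpool/home]
        use_template = production
        recursive = yes

[template_production]
        hourly = 36
        daily = 30
        monthly = 3
        yearly = 0
        autosnap = yes
        autoprune = yes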
Sanoid/syncoid alternatives
TODO: we're considering alternatives to sanoid/syncoid.
After reading the code to implement a --dryrun argument on syncoid, I have found the code to have some issues. There are large functions, lots of system calls without arrays... It feels a little messy, and hard to audit, review, or work on.
zrepl
zrepl is an interesting alternative. It claims support for native encryption, bandwidth limiting, pull/push operation, and Prometheus monitoring with a provided Grafana dashboard. It's written in Golang and is not packaged in Debian.
There's an issue and discussion that gives a rough idea of how it differs from sanoid. There's this ticket open for a migration guide.
It has no dry run mode.
zfs-auto-snapshot
The zfs-auto-snapshot upstream is possibly dead, or at least looking for volunteers, so probably not an option.
simplesnap
Goerzen's simplesnap is another option. It's a pair of fairly short shell scripts (~600 lines total) that send snapshots to a backup host. It's unclear if it supports encryption any better than other tools; it's fairly minimalist.
Packaged in Debian.
znapzend
Znapzend stores the configuration inside dataset's metadata, can use local snapshots or (multiple) ssh remotes, with mbuffer support. It supports pre/post hooks to quiesce datasets, progressive thinning, and a built-in scheduler that can deal with long transfers. It has a daemon mode, a dry run, debugging output, can run as a normal user, and has a utility to analyze snapshot disk usage.
It has a setup command to initialize a configuration, example setup:
znapzendzetup create --recursive \
--pre-snap-command="/bin/sh /usr/local/bin/lock_flush_db.sh" \
--post-snap-command="/bin/sh /usr/local/bin/unlock_db.sh" \
SRC '7d=>1h,30d=>4h,90d=>1d' tank/home \
DST:a '7d=>1h,30d=>4h,90d=>1d,1y=>1w,10y=>1month' root@bserv:backup/home
There is no official Debian package, but upstream has a Debian source package. It is written in Perl.
zelta
zelta is written in Awk. Incomplete; run with zfsnap.
Other DIY solutions
twb (#debian-til) wrote cyber-zfs-backup. It's short (~300 SLOC of Python, 1600 with comments). There's a MySQL/MariaDB part with a "quiescence" hook (another 100 SLOC) that does the good ol' FLUSH TABLES WITH READ LOCK; trick, which, it turns out, is apparently better served by the BACKUP STAGE command now (see the upstream docs).
Another person from the Debian community wrote their own shell script, backup-zfs.
Caveats
Empty datasets
You can sometimes end up with odd situations when mounting datasets. In the tubman install, I ended up in a situation where /var was a valid dataset, but it had canmount=off so it wasn't actually used. This meant that the data in /var was actually in the rpool/ROOT/debian dataset, mounted on /. I mistakenly reset the canmount flag to on, which shadowed that mountpoint and basically emptied /var.
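To see which dataset (if any) actually backs a given path, and what its canmount setting is, something like this helps:
findmnt /var
zfs get -r canmount,mounted,mountpoint rpool | grep var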
There's also some evidence that having a mountpoint for a ZFS dataset will cause it to shadow the actual dataset, which is the reverse of what one would normally expect from a filesystem. According to this discussion:
Yes you need to delete the directory -- if it exists, it cannot be mounted there.
In other words, if you have a directory called /mnt/foo
and you have
a dataset pool/foo
configured to mount on /mnt/foo
:
zfs mount pool/foo
will show /mnt/foo
empty, because the /mnt/foo
directory will
shadow the dataset. The solution is to unmount the dataset, remove (or
rename, if not empty) the directory, and remount the dataset:
zfs umount pool/foo
rmdir /mnt/foo || mv /mnt/foo /mnt/foo.bak
zfs mount pool/foo
Not mainline
ZFS is still not mainline, and will likely never be.
It should be possible, however, to ship Debian binary packages for ZFS. It's apparently possible to directly build a package with this magic command:
dkms mkbmdeb zfs/2.0.3
See also this idea in grml and this packaging attempt.
Also note that Ubuntu actually ships binary packages for ZFS and questions the incompatibility claims.
Write amplification
When layering filesystems, you are always at risk of causing "write amplification" because of mismatched block sizes or alignment. For example, if you have a virtual machine with a filesystem with a 4kB block size over a host device with an 8kB block size, the host will have to read that 8kB block to get the other 4kB half before writing it back.
In ZFS, from what I understand, it's even worse because of the copy-on-write semantics. I'm not exactly clear on the details of this unfortunately, but it's something to keep in mind when deploying ZFS in complex setups.
In particular, this affects Proxmox, which uses zvols with an 8kB block size for virtual machines; that seems to cause performance problems.
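The relevant knob for zvols is the volblocksize property, which can only be set at creation time. A rough sketch (the zvol names are hypothetical):
# inspect the block size of an existing zvol
zfs get volblocksize rpool/data/vm-100-disk-0
# create a replacement zvol with a larger block size
zfs create -V 32G -o volblocksize=16k rpool/data/vm-100-disk-1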
Other documentation
TODO: document those debugging tools:
- tail -f /proc/spl/kstat/zfs/dbgmsg
- zpool iostat 1, with:
  - -l: latency statistics
  - -q: queue statistics
  - -r: request size histogram per vdev
  - -w: latency histogram
  - -v: verbose, include vdevs
ZFS documentation
- Debian wiki page: good introduction, basic commands, some advanced stuff
- Arch wiki page: much more stuff
- Gentoo wiki page: less stuff, similar to Arch
- FreeBSD handbook: FreeBSD-specific of course, but excellent as always
- OpenZFS FAQ
- OpenZFS: Debian bullseye root on ZFS: excellent documentation, basis for the install procedures in 2022-11-17-zfs-migration and tubman
- OpenZFS: Debian buster root on ZFS: same, for buster
- WIP PR for Bullseye root on ZFS instructions: where I contributed to turn the latter into the former
- another ZFS on Linux documentation
- Goerzen's ZFS series