  1. Installation
    1. Example installations
    2. New pool creation
    3. Issues
      1. Swap
      2. Native encryption
    4. TRIM
  2. Information
  3. Extending a pool
  4. Mounts
    1. Mounting
    2. Alternate mountpoints
    3. Encrypted datasets
    4. Deprecated: zfsutil
    5. Cool hack: moving data into ZFS easily
  5. Snapshots
    1. Automated snapshots
    2. Sanoid/syncoid alternatives
      1. zrepl
      2. zfs-auto-snapshot
      3. simplesnap
      4. znapzend
      5. zelta
      6. Other DIY solutions
  6. Caveats
    1. Empty datasets
    2. Not mainline
    3. Write amplification
  7. Other documentation
    1. ZFS documentation

Installation

The installation itself is not documented here; see the examples below instead.

Example installations

New pool creation

The above instructions are "full ZFS" setups, with even the root and boot partitions under ZFS, which is a little ... involved. A simpler setup is to use a normal install for the root and boot partitions, but ZFS for, say, /srv.

Here we're assuming you're setting up a simple two-disk array, or "pool" in ZFS parlance, made of /dev/sde and /dev/sdd, encrypted with standard LUKS instead of ZFS encryption:

  1. Install requirements

    apt install zfs-dkms zfsutils-linux
    modprobe zfs
    
  2. Partition the disks:

    for disk in /dev/sde /dev/sdd ; do
      parted -s $disk mklabel gpt &&
      parted -s $disk -a optimal mkpart primary 0% 100%
    done
    
  3. Set up full disk encryption:

    for disk in sde1 sdd1 ; do
        cryptsetup luksFormat /dev/$disk
        cryptsetup luksOpen /dev/$disk crypt_dev_$disk
        echo crypt_dev_$disk UUID=$(lsblk -n -o UUID /dev/$disk | head -1) none luks,discard | tee -a /etc/crypttab
    done
    

    Use a key file to avoid typing the passphrase on boot, while retaining it as a backup recovery passphrase:

    for disk in sde1 sdd1 ; do
        cryptsetup luksFormat /dev/$disk &&
        cryptsetup luksOpen /dev/$disk crypt_dev_$disk &&
        mkdir -p -m 0 /etc/luks &&
        ( umask 077 && dd if=/dev/random bs=64 count=128 of=/etc/luks/crypt_dev_$disk ) &&
        cryptsetup luksAddKey /dev/$disk /etc/luks/crypt_dev_$disk &&
        echo crypt_dev_$disk UUID=$(lsblk -n -o UUID /dev/$disk | head -1) /etc/luks/crypt_dev_$disk luks,discard | tee -a /etc/crypttab
    done
    

    The above will ask you for the encryption passphrase four times per disk, but will not require typing it on boot, while still allowing recovery with the passphrase if the key file is lost. You can verify that both keyslots were created with the check shown after the example output below.

  4. Create the pool:

    zpool create \
        -o ashift=12 \
        -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
        -O compression=zstd \
        -O relatime=on \
        -O canmount=off \
        -O mountpoint=none \
        -f \
        tank \
        mirror /dev/mapper/crypt_dev_sde1 /dev/mapper/crypt_dev_sdd1
    

    That creates a "mirror" pool with the two drives, which is essentially a RAID-1 mirror. You could also do a RAID-Z pool if you have three or more drives, which is sort of like a RAID-5 array, except the amount of parity is flexible:

    zpool create \
        -o ashift=12 \
        -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
        -O compression=zstd \
        -O relatime=on \
        -O canmount=off \
        -O mountpoint=none \
        -f \
        tank \
        raidz sda1 sdb1 sdc1
    

    To calculate the tradeoff, you can compute the usable size of the array with the formula (N-P)*X, where N is the number of drives, P is the number of parity drives, and X is the size of the smallest drive. For example, the 3-drive raidz pool above, with (say) 8TiB drives, yields (3-1)*8 = 16TiB of usable space.

    As a rule of thumb, with one parity drive (raidz1), it's like RAID-5. Note that a RAID-Z vdev cannot be resized, so in the above, you will be stuck with 3 drives in that array forever. That said, it can be grown in size by progressively replacing the drives with bigger ones.

    Jim Salter recommends mirrors instead of RAID-Z, but the rsync.net people recommend RAID-Z3 with 12-15 drives per vdev, joined in pools of 3-4 vdevs (which, with 8TiB drives, works out to roughly 216-384TiB of usable space per pool, by the way). Note that this means three parity drives per 12-15 drive vdev, or a 20-25% ratio.

    dRAID is similar, except resilvering is faster, as the spare is distributed among all the devices. The TrueNAS documentation doesn't recommend dRAID except in special circumstances.

    This guide talks more about the different RAID types and compares performance.

  5. Make an actual filesystem:

    zfs create -o mountpoint=/srv-zfs tank/srv
    

This should result in the following:

root@marcos:/etc/luks# zpool status
  pool: tank
 state: ONLINE
config:

        NAME                STATE     READ WRITE CKSUM
        tank                ONLINE       0     0     0
          mirror-0          ONLINE       0     0     0
            crypt_dev_sdb1  ONLINE       0     0     0
            crypt_dev_sdc1  ONLINE       0     0     0

errors: No known data errors
root@marcos:/etc/luks# zfs list
NAME       USED  AVAIL     REFER  MOUNTPOINT
tank       600K  7.14T       96K  none
tank/srv    96K  7.14T       96K  /srv-zfs
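
To double check that the key file step worked (i.e. that each LUKS device has two keyslots: the interactive passphrase and the key file), something like this should do, assuming the device names used above:

for disk in sde1 sdd1 ; do
    # the Keyslots section should list two slots: the interactive
    # passphrase and the key file added with luksAddKey
    cryptsetup luksDump /dev/$disk
    # the decrypted device should also exist under /dev/mapper
    ls -l /dev/mapper/crypt_dev_$disk
done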

Issues

Swap

Swap on ZFS volumes (AKA "swap on ZVOL") can trigger lockups and that issue is still not fixed upstream. Ubuntu recommends using a separate partition for swap instead. cks would rather have no swap at all than swap on ZFS and compares it to NFS...

curie was set up without a swap partition (or, at least, hoping to use a ZFS dataset as a swap backend), but this has proven to be generally a bad idea. Were we to set up a new ZFS system, we'd use LUKS encryption and set up a dedicated swap partition, as we had problems with ZFS encryption as well.
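
A minimal sketch of what such a dedicated, encrypted swap partition could look like, assuming a spare partition (the /dev/sdz3 path is hypothetical; a stable /dev/disk/by-partuuid/... path is safer) and a random key regenerated on every boot:

# encrypt the swap partition with a fresh random key at each boot
echo "swap_crypt /dev/sdz3 /dev/urandom swap,cipher=aes-xts-plain64,size=512" >> /etc/crypttab
# point fstab at the decrypted mapper device
echo "/dev/mapper/swap_crypt none swap sw 0 0" >> /etc/fstab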

Native encryption

ZFS supports native encryption, but there are serious caveats with it.
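
For reference, this is roughly what using it looks like (a sketch; tank/secret is a made-up dataset name):

# create a natively encrypted dataset, prompting for a passphrase
zfs create -o encryption=on -o keyformat=passphrase -o keylocation=prompt tank/secret
# after an export/import or reboot, the key must be loaded before mounting
zfs load-key tank/secret
zfs mount tank/secret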

I've had trouble moving encrypted datasets between pools when trying to move the tubman rpool from HDDs to SSDs. This is a problem many people are facing, without good solutions; see also this TrueNAS discussion, reddit thread, HN thread, this openzfs docs thread, and this other one.

Also, native encryption "will not encrypt metadata related to the pool structure, including dataset and snapshot names, dataset hierarchy, properties, file size, file holes, and deduplication tables (though the deduplicated data itself is encrypted)." So it will leak some metadata about the filesystem. Deduplication is limited to the dataset level.

Therefore, it might be better to use LUKS encryption underneath ZFS to configure fully encrypted systems, although I haven't tested this directly.

Note that I use dropbear-initramfs alongside zfs-initramfs to unlock the partitions remotely. This requires the SSH public key in /etc/dropbear/initramfs/authorized_keys (or /etc/dropbear-initramfs/authorized_keys on older Debian releases) as normal.
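
A rough sketch of that setup, assuming the authorized_keys path of a current Debian release and an existing SSH key pair:

apt install dropbear-initramfs
# authorize an SSH public key for the initramfs environment
cat /root/.ssh/id_ed25519.pub >> /etc/dropbear/initramfs/authorized_keys
# rebuild the initramfs so dropbear and the key are included
update-initramfs -u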

TRIM

I enabled (a little late) TRIM on the SSD pools:

zfs set org.debian:periodic-trim=enable bpoolssd
zfs set org.debian:periodic-trim=enable rpoolssd

That will set up periodic TRIMs. It's also possible to enable the equivalent of "discard": the autotrim pool property looks for space which has been recently freed and is no longer allocated by the pool, and trims it periodically; it does not immediately reclaim blocks after a free, which makes it effective at the cost of being more likely to leave tiny ranges untrimmed:

zpool set autotrim=on bpoolssd
zpool set autotrim=on rpoolssd

You can do a manual trim with:

zpool trim bpoolssd
zpool trim rpoolssd

Here's an example run:

root@tubman:/etc# zpool status -t rpoolssd
  pool: rpoolssd
 state: ONLINE
  scan: scrub repaired 0B in 00:00:37 with 0 errors on Sun Nov 13 00:24:38 2022
config:

    NAME        STATE     READ WRITE CKSUM
    rpoolssd    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdb4    ONLINE       0     0     0  (untrimmed)
        sdd4    ONLINE       0     0     0  (untrimmed)

errors: No known data errors
root@tubman:/etc# zpool trim rpoolssd
root@tubman:/etc# zpool status -t rpoolssd
  pool: rpoolssd
 state: ONLINE
  scan: scrub repaired 0B in 00:00:37 with 0 errors on Sun Nov 13 00:24:38 2022
config:

    NAME        STATE     READ WRITE CKSUM
    rpoolssd    ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdb4    ONLINE       0     0     0  (3% trimmed, started at Wed 16 Nov 2022 12:19:04 PM EST)
        sdd4    ONLINE       0     0     0  (3% trimmed, started at Wed 16 Nov 2022 12:19:04 PM EST)

errors: No known data errors

See also the TRIM documentation in the Debian wiki.

Information

Listing datasets (use -t all to also show snapshots):

zfs list

IO statistics, every second:

zpool iostat 1
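
A few other commands give a quick overview, with standard OpenZFS tooling (tank/srv is the example dataset from the installation above):

# per-vdev capacity and health
zpool list -v
# compression settings and results for a given dataset
zfs get compression,compressratio,used,available tank/srv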

Extending a pool

Say you have a pool that's mirrored between two encrypted drives:

root@marcos:/home/anarcat# zpool status 
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 1 days 12:16:25 with 0 errors on Mon Feb 10 12:41:24 2025
config:

        NAME                STATE     READ WRITE CKSUM
        tank                ONLINE       0     0     0
          mirror-0          ONLINE       0     0     0
            crypt_dev_sdb1  ONLINE       0     0     0
            crypt_dev_sdc1  ONLINE       0     0     0

You want to grow this array with two more mirrored drives.

  1. First, partition the drives:

    for disk in /dev/sde /dev/sdd ; do
      parted -s $disk mklabel gpt &&
      parted -s $disk -a optimal mkpart primary 0% 100%
    done
    
  2. Set up full disk encryption:

    for disk in sde1 sdd1 ; do
        cryptsetup luksFormat /dev/$disk
        cryptsetup luksOpen /dev/$disk crypt_dev_$disk
        echo crypt_dev_$disk UUID=$(lsblk -n -o UUID /dev/$disk | head -1) none luks,discard | tee -a /etc/crypttab
    done
    

    Use a key file to avoid typing the passphrase on boot, while retaining it as a backup recovery passphrase:

    for disk in sde1 sdd1 ; do
        cryptsetup luksFormat /dev/$disk &&
        cryptsetup luksOpen /dev/$disk crypt_dev_$disk &&
        mkdir -p -m 0 /etc/luks &&
        ( umask 077 && dd if=/dev/random bs=64 count=128 of=/etc/luks/crypt_dev_$disk ) &&
        cryptsetup luksAddKey /dev/$disk /etc/luks/crypt_dev_$disk &&
        echo crypt_dev_$disk UUID=$(lsblk -n -o UUID /dev/$disk | head -1) /etc/luks/crypt_dev_$disk luks,discard | tee -a /etc/crypttab
    done
    

    The above will ask you for the encryption passphrase four times per disk, but will not require typing it on boot, while still allowing recovery with the passphrase if the key file is lost.

  3. Add the drives as a mirror vdev to the pool:

    root@marcos:/home/anarcat# zpool add -n tank mirror /dev/mapper/crypt_dev_sde1 /dev/mapper/crypt_dev_sdd1
    would update 'tank' to the following configuration:
    
            tank
              mirror-0
                crypt_dev_sdb1
                crypt_dev_sdc1
              mirror
                crypt_dev_sde1
                crypt_dev_sdd1
    

    Notice how we use -n to simulate the result here. This adds another mirror vdev, essentially turning the pool into the equivalent of a RAID-10 array. See also the notes about RAID-Z and dRAID in the pool creation above.

    Note that this is likely not the right time to change the pool layout: if you have a mirror layout, keep a mirror layout. If you have a RAID-Z layout, keep that layout as well, just make a new RAID-Z vdev instead.

Note that this is zpool add, not zpool attach: attach adds another device to an existing vdev (e.g. a third leg on a mirror), rather than a new vdev to the pool.
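
For contrast, this (hypothetical, with a made-up crypt_dev_sdz1 device) would attach a third leg to the existing mirror-0 vdev instead of adding a new vdev:

# attach a new device alongside an existing member of mirror-0,
# turning the two-way mirror into a three-way mirror
zpool attach tank crypt_dev_sdb1 /dev/mapper/crypt_dev_sdz1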

Mounts

Mounting

After a zfs list, you should see the datasets you can mount. You can mount one by name, for example with:

zfs mount bpool/ROOT/debian

Alternate mountpoints

Note that it will mount the dataset at its pre-defined mountpoint property. In the above, that was /boot. If you want to change its mountpoint, it can be done on the fly with:

zfs set mountpoint=/mnt/boot bpool/ROOT/debian

If the dataset is already mounted, it will be moved to that new location immediately. Note that the parent pool's altroot property affects this path, as it's pre-pended to the mountpoint. See zpoolprops(8) for details.

If you are dealing with a new pool that's not yet known to ZFS (e.g. you just added a new drive), you will first need to import it. Typically, you'd also want to do that in an altroot, so that it doesn't override existing mounts, like this:

zpool import POOLNAME -R /mnt

This would import all pools ZFS can find:

zpool import -a -R /mnt

Encrypted datasets

If the dataset is encrypted, however, you first need to unlock it with:

zpool import -l -a

For rescue operations, that would be the right incantation:

zpool import -l -a -R /mnt

Deprecated: zfsutil

This is another way to use an alternate mountpoint, although I'm less certain it's a good way anymore:

mount -o zfsutil -t zfs bpool/BOOT/debian /mnt

Cool hack: moving data into ZFS easily

I used this procedure to move /srv/sbuild/qemu from a spinning rust drive (BTRFS, on curie) to a ZFS dataset running over NVMe. With other filesystems, this would have required either creating new logical volumes or hacking around with bind mounts. With ZFS, this was the procedure:

zfs create -o mountpoint=none -o canmount=off rpool/srv
zfs create -o mountpoint=/mnt/sbuild rpool/srv/sbuild
mv /srv/sbuild/* /mnt/sbuild/
zfs set mountpoint=/srv/sbuild rpool/srv/sbuild

That's it! You can graft mountpoints like this anywhere, which is powerful and scary!
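
To double check where everything ended up after this kind of grafting, the standard properties tell the story:

# show the mountpoint layout of the grafted datasets
zfs list -r -o name,mountpoint,canmount,mounted rpool/srv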

Snapshots

Creating:

zfs snapshot pool/volume@LABEL

Listing:

zfs list -t snapshot

Listing with creation date:

zfs list -t snapshot -o name,creation

Rollback:

zfs rollback pool/volume@LABEL

Destroy:

zfs destroy pool/volume@LABEL

Limiting the number of snapshots:

zfs set snapshot_limit=2 rpool/var/cache

This is useful if you automate snapshot creation (like, say, with sanoid) and you have filesystems that have ridiculous disk usage because of old, useless snapshots.

Automated snapshots

Automated snapshots are configured with sanoid (see the Puppet code and configuration file).
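
For reference, a sanoid policy is a small INI file; a minimal sketch (the dataset name and retention numbers here are made up) looks something like this:

[rpool/srv]
        use_template = production

[template_production]
        frequently = 0
        hourly = 36
        daily = 30
        monthly = 3
        yearly = 0
        autosnap = yes
        autoprune = yes

sanoid is then run periodically (from cron or a systemd timer) to take and prune the snapshots according to that policy.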

Sanoid/syncoid alternatives

TODO: we're considering alternatives to sanoid/syncoid.

After reading the code to implement a --dryrun argument on syncoid, I have found it to have some issues: there are large functions, lots of system calls without arrays... It feels a little messy, and hard to audit, review, or work on.

zrepl

zrepl is an interesting alternative. It claims support for native encryption, bandwidth limiting, pull/push, Prometheus monitoring with a provided Grafana dashboard. It's written in Golang, and is not packaged in Debian.

There's an issue and discussion that gives a rough idea of how it differs from sanoid. There's this ticket open for a migration guide.

It has no dry run mode.

zfs-auto-snapshot

The zfs-auto-snapshot upstream is possibly dead, or at least looking for volunteers, so probably not an option.

simplesnap

Goerzen's simplesnap is another option. It's a pair of fairly short shell scripts (~600 lines total) that send snapshots to a backup host. It's unclear whether it supports encryption any better than other tools; it's fairly minimalist.

Packaged in Debian.

znapzend

Znapzend stores the configuration inside dataset's metadata, can use local snapshots or (multiple) ssh remotes, with mbuffer support. It supports pre/post hooks to quiesce datasets, progressive thinning, and a built-in scheduler that can deal with long transfers. It has a daemon mode, a dry run, debugging output, can run as a normal user, and has a utility to analyze snapshot disk usage.

It has a setup command to initialize a configuration; an example setup:

znapzendzetup create --recursive \
   --pre-snap-command="/bin/sh /usr/local/bin/lock_flush_db.sh" \
   --post-snap-command="/bin/sh /usr/local/bin/unlock_db.sh" \
   SRC '7d=>1h,30d=>4h,90d=>1d' tank/home \
   DST:a '7d=>1h,30d=>4h,90d=>1d,1y=>1w,10y=>1month' root@bserv:backup/home

There is no official Debian package but upstream has a debian source package. It is written in Perl.

zelta

zelta is written in Awk. This evaluation is incomplete; it's meant to be run alongside zfsnap, which handles snapshot creation.

Other DIY solutions

twb (#debian-til) wrote cyber-zfs-backup. It's short (~300 SLOC of Python, 1600 lines with comments). There's a MySQL/MariaDB part with a "quiescence" hook (another 100 SLOC) that does the good ol' FLUSH TABLES WITH READ LOCK; trick which, it turns out, is apparently better served by the BACKUP STAGE command now (see the upstream docs).

Another person from the Debian community wrote their own shell script, backup-zfs.

Caveats

Empty datasets

You can sometimes end up with odd situations when mounting datasets. In the tubman install, I ended up in a situation where /var was a valid dataset, but it had canmount=off so it wasn't actually used.

This meant that the data in /var was actually in the rpool/ROOT/debian dataset, mounted on /. I mistakenly reset the canmount flag to on, which shadowed that mountpoint and basically emptied /var.
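
Before flipping flags like that, it's worth checking what is actually mounted where; a quick sanity check with standard tools (rpool/var being the dataset from this anecdote):

# is the dataset allowed to mount, and is it actually mounted?
zfs get canmount,mountpoint,mounted rpool/var
# what does the kernel think is mounted on /var?
findmnt /var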

There's also some evidence that a pre-existing directory at a ZFS dataset's mountpoint will shadow the dataset, which is the reverse of what one would normally expect from a filesystem. According to this discussion:

Yes you need to delete the directory -- if it exists, it cannot be mounted there.

In other words, if you have a directory called /mnt/foo and you have a dataset pool/foo configured to mount on /mnt/foo:

zfs mount pool/foo

will show /mnt/foo empty, because the /mnt/foo directory will shadow the dataset. The solution is to unmount the dataset, remove (or rename, if not empty) the directory, and remount the dataset:

zfs umount pool/foo
rmdir /mnt/foo || mv /mnt/foo /mnt/foo.bak
zfs mount pool/foo

Not mainline

ZFS is still not mainline, and will likely never be.

It should be possible, however, to ship Debian binary packages for ZFS. It's apparently possible to directly build a package with this magic command:

dkms mkbmdeb zfs/2.0.3

See also this idea in grml and this packaging attempt.

Also note that Ubuntu actually ships binary packages for ZFS and questions the incompatibility claims.

Write amplification

When layering filesystems, you are always at risk of causing "write amplification" because of mismatched block sizes or alignment. For example, if you have a virtual machine with a filesystem with a 4kB block size over a host device with an 8kB block size, the host will have to read the whole 8kB block to get the other 4kB half before writing it back.

In ZFS, it's even worse, from what I understand, because of the copy-on-write semantics. I'm not exactly clear on the details unfortunately, but it's something to keep in mind when deploying ZFS in complex setups.

In particular, this affects Proxmox, which uses zvols with an 8kB block size for virtual machines, and that seems to cause performance problems.
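
One thing that can be checked and tuned on that front is the zvol block size; a sketch (the vm-100-disk-0 name mimics Proxmox's naming and is made up):

# what block size an existing guest disk uses at the ZFS layer
zfs get volblocksize rpool/data/vm-100-disk-0
# volblocksize can only be set at creation time, so a new zvol
# with a larger block size would be created like this:
zfs create -V 32G -o volblocksize=16k rpool/data/vm-101-disk-0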

Other documentation

ZFS documentation
