Tubman is named after Harriet Tubman, an "American abolitionist and political activist. Born into slavery, Tubman escaped and subsequently made some 13 missions to rescue approximately 70 enslaved people, including family and friends, using the network of antislavery activists and safe houses known as the Underground Railroad. During the American Civil War, she served as an armed scout and spy for the Union Army. The first woman to lead an armed expedition in the war, she guided the raid at Combahee Ferry, which liberated more than 700 enslaved people. In her later years, Tubman was an activist in the movement for women's suffrage."

I was the conductor of the Underground Railroad for eight years, and I can say what most conductors can't say — I never ran my train off the track and I never lost a passenger.

  1. Specification
    1. 4TB disk health inspection
  2. Installation procedure
    1. SSD caching
      1. L2ARC
      2. SLOG caches
    2. Next steps
    3. Done
    4. Decisions taken during the procedure
    5. Changes from the original procedure
    6. Abandoned ideas
    7. To be improved
    8. Troubleshooting
  3. Conversion into a backup server
    1. Before
    2. After
    3. tank pool creation
    4. first rsync transfer
    5. resilvering
    6. move rpool and bpool to SSDs
      1. Moving encrypted pools is hard
    7. SSD TRIM
    8. extending the main tank
    9. remaining work
  4. Other documentation

Specification

tubman's install has since changed bodies and now lives in "toutatis"'s body, so the specs below are inaccurate.

Note that tubman was originally built with the old marcos hardware, but transplanted into what used to be known as toutatis; see v1 for the old spec. The toutatis install was kept intact, on a stack of 5 disks (3x ~2TB HDD, 2x 128GB SSD).

4TB disk health inspection

Before the migration from marcos' body to toutatis', a ZFS scrub triggered a warning. The drives were inspected for health; this is the report (copied from 2022-11-17-zfs-migration).

Here are some SMART stats:

root@tubman:~# smartctl -a -qnoserial /dev/sdb | grep -e  Head_Flying_Hours -e Power_On_Hours -e Total_LBA -e 'Sector Sizes'
Sector Sizes:     512 bytes logical, 4096 bytes physical
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       12464 (206 202 0)
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       10966h+55m+23.757s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       21107792664
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3201579750

That's over a year of power on, which shouldn't be so bad. It has written about 10TB of data (21107792664 LBAs * 512 bytes/LBA), which is about two full writes. According to its specification, this device is supposed to support 55 TB/year of writes, so we're far below spec. Note that we are also still far from the "non-recoverable read errors per bits read" spec (1 per 10E15): we've read only about 13E12 bits (3201579750 LBAs * 512 bytes/LBA * 8 bits/byte ≈ 13E12 bits).

It's likely this disk was made in 2018, so it is in its fourth year.
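Those back-of-the-envelope numbers can be reproduced with plain shell arithmetic (a quick sketch, using the SMART counters above):

```shell
# Total_LBAs_Written × 512 bytes/LBA, in decimal TB
echo "$(( 21107792664 * 512 / 1000000000000 )) TB written"
# Total_LBAs_Read × 512 bytes/LBA × 8 bits/byte, in terabits
echo "$(( 3201579750 * 512 * 8 / 1000000000000 )) Tbit read"
```

... which prints 10 TB written and 13 Tbit read, matching the estimates above.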

Interestingly, /dev/sdc is also a Seagate drive, but of a different series:

root@tubman:~# smartctl -qnoserial  -i /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-15-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5
Device Model:     ST4000DM004-2CV104
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Oct 11 11:21:35 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

It has seen many more reads than the other disk, which is also interesting:

root@tubman:~# smartctl -a -qnoserial /dev/sdc | grep -e  Head_Flying_Hours -e Power_On_Hours -e Total_LBA -e 'Sector Sizes'
Sector Sizes:     512 bytes logical, 4096 bytes physical
  9 Power_On_Hours          0x0032   059   059   000    Old_age   Always       -       36240
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       33994h+10m+52.118s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       30730174438
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       51894566538

That's 4 years of Head_Flying_Hours, and over 4 years (4 years and 48 days) of Power_On_Hours. The copyright date on that drive's specs goes back to 2016, so it's a much older drive.
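As a sanity check on that figure (a sketch, using the Power_On_Hours value above):

```shell
hours=36240
days=$(( hours / 24 ))
echo "$days days, about $(( days / 365 )) years"
```

... which prints 1510 days, about 4 years.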

Installation procedure

I would have used FAI's setup-storage, but it doesn't support ZFS, unfortunately. ZFS support is on the long-term roadmap, that said, and there's a howto for stretch, but it doesn't use setup-storage. I was hoping to reuse the installer I've been working on at work...

We have the following disk configuration:

We boot from a grml live image based on Debian testing (bullseye), and will follow this howto:

  1. install requirements:

    apt update
    apt install --yes debootstrap gdisk dkms dpkg-dev linux-headers-$(uname -r) zfs-dkms
    modprobe zfs
    apt install --yes zfsutils-linux
    

    Note that those instructions differ from the documentation (we don't use buster-backports) because we start from a bullseye live image.

  2. clear the partitions on the two HDDs, and set up BIOS, UEFI, boot pool, and natively encrypted partitions:

    for DISK in /dev/sdb /dev/sdc ; do
        sgdisk --zap-all $DISK
        sgdisk -a1 -n1:24K:+1000K -t1:EF02 $DISK
        sgdisk     -n2:1M:+512M   -t2:EF00 $DISK
        sgdisk     -n3:0:+1G      -t3:BF01 $DISK
        sgdisk     -n4:0:0        -t4:BF00 $DISK
    done
    

    resulting partition table:

    root@grml ~ # sgdisk -p /dev/sdb
    Disk /dev/sdb: 7814037168 sectors, 3.6 TiB
    Model: ST4000DM004-2CV1
    Sector size (logical/physical): 512/4096 bytes
    Disk identifier (GUID): 63B2F372-B4E9-45FF-8151-9706F9F158C9
    Partition table holds up to 128 entries
    Main partition table begins at sector 2 and ends at sector 33
    First usable sector is 34, last usable sector is 7814037134
    Partitions will be aligned on 16-sector boundaries
    Total free space is 14 sectors (7.0 KiB)
    
    Number  Start (sector)    End (sector)  Size       Code  Name
       1              48            2047   1000.0 KiB  EF02  
       2            2048         1050623   512.0 MiB   EF00  
       3         1050624         3147775   1024.0 MiB  BF01  
       4         3147776      7814037134   3.6 TiB     BF00
    
  3. create the boot pool called bpool and the root pool called rpool, the latter will prompt for a disk encryption key:

    zpool create \
        -o cachefile=/etc/zfs/zpool.cache \
        -o ashift=12 -d \
        -o feature@async_destroy=enabled \
        -o feature@bookmarks=enabled \
        -o feature@embedded_data=enabled \
        -o feature@empty_bpobj=enabled \
        -o feature@enabled_txg=enabled \
        -o feature@extensible_dataset=enabled \
        -o feature@filesystem_limits=enabled \
        -o feature@hole_birth=enabled \
        -o feature@large_blocks=enabled \
        -o feature@lz4_compress=enabled \
        -o feature@spacemap_histogram=enabled \
        -o feature@zpool_checkpoint=enabled \
        -O acltype=posixacl -O canmount=off -O compression=lz4 \
        -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
        -O mountpoint=/boot -R /mnt \
        bpool mirror /dev/sdb3 /dev/sdc3
    zpool create \
        -o ashift=12 \
        -O encryption=aes-256-gcm \
        -O keylocation=prompt -O keyformat=passphrase \
        -O acltype=posixacl -O canmount=off -O compression=lz4 \
        -O dnodesize=auto -O normalization=formD -O relatime=on \
        -O xattr=sa -O mountpoint=/ -R /mnt \
        rpool mirror /dev/sdb4 /dev/sdc4
    
  4. create filesystems and "datasets":

    • this creates two containers, for ROOT and BOOT

      zfs create -o canmount=off -o mountpoint=none rpool/ROOT
      zfs create -o canmount=off -o mountpoint=none bpool/BOOT

  5. this actually creates the boot and root filesystems:

    zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/debian
    zfs mount rpool/ROOT/debian
    zfs create -o mountpoint=/boot bpool/BOOT/debian
    
  6. then they use even more datasets, although I'm not sure they are all necessary:

    zfs create                                 rpool/home
    zfs create -o mountpoint=/root             rpool/home/root
    chmod 700 /mnt/root
    zfs create -o canmount=off                 rpool/var
    zfs create -o canmount=off                 rpool/var/lib
    zfs create                                 rpool/var/log
    zfs create                                 rpool/var/spool
    
  7. to exclude temporary files from snapshots, for example:

    zfs create -o com.sun:auto-snapshot=false  rpool/var/cache
    zfs create -o com.sun:auto-snapshot=false  rpool/var/tmp
    chmod 1777 /mnt/var/tmp
    
  8. and a /srv:

    zfs create                                 rpool/srv
    
  9. or for Docker (TODO):

    zfs create -o com.sun:auto-snapshot=false rpool/var/lib/docker
    
  10. make a tmpfs for /run:

    mkdir /mnt/run
    mount -t tmpfs tmpfs /mnt/run
    mkdir /mnt/run/lock
    
  11. install the base system and copy the ZFS config:

    debootstrap --components=main,contrib bullseye /mnt
    mkdir /mnt/etc/zfs
    cp /etc/zfs/zpool.cache /mnt/etc/zfs/
    
  12. base system configuration:

    echo HOSTNAME > /mnt/etc/hostname
    vi /mnt/etc/hosts
    apt install ca-certificates
    echo 'deb https://deb.debian.org/debian-security bullseye-security main contrib' > /mnt/etc/apt/sources.list.d/security.list
    
  13. bind mounts and chroot for more complex config:

    mount --rbind /dev  /mnt/dev
    mount --rbind /proc /mnt/proc
    mount --rbind /sys  /mnt/sys
    chroot /mnt /bin/bash
    
  14. more base system config:

    ln -s /proc/self/mounts /etc/mtab
    apt update
    apt install --yes console-setup locales
    dpkg-reconfigure locales tzdata
    
  15. ZFS boot configuration

    apt install --yes dpkg-dev linux-headers-amd64 linux-image-amd64
    apt install --yes zfs-initramfs
    echo REMAKE_INITRD=yes > /etc/dkms/zfs.conf
    apt install --yes grub-pc
    apt remove --purge os-prober
    
  16. pick a root password

    passwd
    
  17. bpool import hack (TODO: whyy)

    cat > /etc/systemd/system/zfs-import-bpool.service <<EOF
    [Unit]
    DefaultDependencies=no
    Before=zfs-import-scan.service
    Before=zfs-import-cache.service
    
    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/sbin/zpool import -N -o cachefile=none bpool
    # Work-around to preserve zpool cache:
    ExecStartPre=-/bin/mv /etc/zfs/zpool.cache /etc/zfs/preboot_zpool.cache
    ExecStartPost=-/bin/mv /etc/zfs/preboot_zpool.cache /etc/zfs/zpool.cache
    
    [Install]
    WantedBy=zfs-import.target
    EOF
    systemctl enable zfs-import-bpool.service
    
  18. enable tmpfs:

    ln -s /usr/share/systemd/tmp.mount /etc/systemd/system/ &&
    systemctl enable tmp.mount
    
  19. grub setup:

    root@grml:/# grub-probe /boot
    zfs
    root@grml:/# update-initramfs -c -k all
    update-initramfs: Generating /boot/initrd.img-5.10.0-6-amd64
    root@grml:/# sed -i 's,GRUB_CMDLINE_LINUX.*,GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/debian",' /etc/default/grub
    root@grml:/# update-grub
    Generating grub configuration file ...
    Found linux image: /boot/vmlinuz-5.10.0-6-amd64
    Found initrd image: /boot/initrd.img-5.10.0-6-amd64
    done
    root@grml:/# grub-install /dev/sdb 
    Installing for i386-pc platform.
    Installation finished. No error reported.
    root@grml:/# grub-install /dev/sdc 
    Installing for i386-pc platform.
    Installation finished. No error reported.
    

    make sure you check both disks in there:

     dpkg-reconfigure grub-pc
    
  20. filesystem mount ordering, rationale in the OpenZFS guide:

    mkdir /etc/zfs/zfs-list.cache
    touch /etc/zfs/zfs-list.cache/bpool
    touch /etc/zfs/zfs-list.cache/rpool
    zed -F &
    

    then verify the files have data:

    root@grml:/# cat /etc/zfs/zfs-list.cache/bpool
    bpool   /mnt/boot       off     on      on      off     on      off     on      off     -       none    -       -       -       -       -       -       -       -
    bpool/BOOT      none    off     on      on      off     on      off     on      off     -       none    -       -       -       -       -       -       -       -
    bpool/BOOT/debian       /mnt/boot       on      on      on      off     on      off     on      off     -       none    -       -       -       -       -       -       -       -
    root@grml:/# cat /etc/zfs/zfs-list.cache/rpool
    rpool   /mnt    off     on      on      on      on      off     on      off     rpool   prompt  -       -       -       -       -       -       -       -
    rpool/ROOT      none    off     on      on      on      on      off     on      off     rpool   none    -       -       -       -       -       -       -       -
    rpool/ROOT/debian       /mnt    noauto  on      on      on      on      off     on      off     rpool   none    -       -       -       -       -       -       -     -
    rpool/home      /mnt/home       on      on      on      on      on      off     on      off     rpool   none    -       -       -       -       -       -       -     -
    rpool/home/root /mnt/root       on      on      on      on      on      off     on      off     rpool   none    -       -       -       -       -       -       -     -
    rpool/srv       /mnt/srv        on      on      on      on      on      off     on      off     rpool   none    -       -       -       -       -       -       -     -
    rpool/var       /mnt/var        off     on      on      on      on      off     on      off     rpool   none    -       -       -       -       -       -       -     -
    rpool/var/cache /mnt/var/cache  on      on      on      on      on      off     on      off     rpool   none    -       -       -       -       -       -       -     -
    rpool/var/lib   /mnt/var/lib    off     on      on      on      on      off     on      off     rpool   none    -       -       -       -       -       -       -     -
    rpool/var/log   /mnt/var/log    on      on      on      on      on      off     on      off     rpool   none    -       -       -       -       -       -       -     -
    rpool/var/spool /mnt/var/spool  on      on      on      on      on      off     on      off     rpool   none    -       -       -       -       -       -       -     -
    rpool/var/tmp   /mnt/var/tmp    on      on      on      on      on      off     on      off     rpool   none    -       -       -       -       -       -       -     -
    root@grml:/# fg
    zed -F
    ^CExiting
    
  21. fix the paths to eliminate /mnt:

    sed -Ei "s|/mnt/?|/|" /etc/zfs/zfs-list.cache/*
    
  22. extra config, setup SSH with auth key:

    apt install --yes openssh-server
    mkdir /root/.ssh/
    cat > /root/.ssh/authorized_keys <<EOF
    ...
    EOF
    
  23. snapshot initial install:

    zfs snapshot bpool/BOOT/debian@install
    zfs snapshot rpool/ROOT/debian@install
    
  24. exit chroot:

    exit
    
  25. unmount filesystems:

    mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
        xargs -i{} umount -lf {}
    zpool export -a
    
  26. reboot:

    reboot
    

That procedure actually worked! The only problem was the interfaces(5) configuration, which was missing (regardless of what the above says). I want to switch to systemd-networkd anyway.

We performed steps 1 through 6; the remaining steps are optional and troubleshooting.

SSD caching

The machine has been installed on two HDDs: spinning rust! Those are typically slow, but they are redundant, which should ensure high availability. To boost performance, we're setting up an SSD cache.

ZFS has two types of caches:

L2ARC

The L2ARC is purely a performance cache: if it dies, no data is lost. A SLOG, however, can cause data loss (typically only a few seconds' worth, but still) if the drive dies. So we're going with an L2ARC, based on this source for the redundancy claim.

The L2ARC also has the advantage of being useful for read caching. A SLOG is mostly useful for write-heavy workloads, which is not the case for this server.

To configure the L2ARC cache, we simply did this:

zpool add rpool cache /dev/sda3

(Actually, -f was necessary because there already was a crypto_LUKS partition on there, which we didn't care about.)

The sda3 device is the third partition on the SSD drive. It's 465GB so it should provide a lot of space for the cache.

The status of the cache can be found with the zpool iostat command:

root@tubman:~# zpool iostat -v
              capacity     operations     bandwidth 
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
bpool       47.8M   912M      0      0      3     14
  mirror    47.8M   912M      0      0      3     14
    sdb3        -      -      0      0      1      7
    sdc3        -      -      0      0      1      7
----------  -----  -----  -----  -----  -----  -----
rpool       1.29G  3.62T      0     60    437   432K
  mirror    1.29G  3.62T      0     60    437   432K
    sdb4        -      -      0     30    199   216K
    sdc4        -      -      0     30    238   216K
cache           -      -      -      -      -      -
  sda3       326M   465G      0    183  4.96K  11.9M
----------  -----  -----  -----  -----  -----  -----

Note that this guide actually discourages the use of unstable device names like sda. Quoting that warning directly:

WARNING: Some motherboards will not present disks in a consistent manner to the Linux kernel across reboots. As such, a disk identified as /dev/sda on one boot might be /dev/sdb on the next. For the main pool where your data is stored, this is not a problem as ZFS can reconstruct the VDEVs based on the metadata geometry. For your L2ARC and SLOG devices, however, no such metadata exists. [...] If you don't heed this warning, your L2ARC device may not be added to your hybrid pool at all, and you will need to re-add it later. This could drastically affect the performance of the applications when pulling evicted pages off of disk.

TL;DR: the cache might disappear after a reboot if disk ordering is changed by the BIOS. This only affects caches like the L2ARC (above) and the SLOG.

Eventually, there were two SSD drives in this system, and both were added as caches (following the above warning), with:

zpool add tank cache \
  /dev/disk/by-id/ata-Samsung_SSD_850_EVO_500GB_S2RANX0J408632Y-part5 \
  /dev/disk/by-id/ata-Crucial_CT480M500SSD1_1311092ED40E-part5

... it makes the zpool status output quite large though:

root@tubman:~# zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
config:

        NAME                                                   STATE     READ WRITE CKSUM
        tank                                                   ONLINE       0     0     0
          mirror-0                                             ONLINE       0     0     0
            sdc4                                               ONLINE       0     0     0
            sde4                                               ONLINE       0     0     0
          mirror-1                                             ONLINE       0     0     0
            sda4                                               ONLINE       0     0     0
            sdf4                                               ONLINE       0     0     0
        cache
          ata-Samsung_SSD_850_EVO_500GB_S2RANX0J408632Y-part5  ONLINE       0     0     0
          ata-Crucial_CT480M500SSD1_1311092ED40E-part5         ONLINE       0     0     0

errors: No known data errors
root@tubman:~# zpool iostat -v
                                                         capacity     operations     bandwidth
pool                                                   alloc   free   read  write   read  write
-----------------------------------------------------  -----  -----  -----  -----  -----  -----
bpoolssd                                                280M   680M      0      0    867  5.83K
  mirror                                                280M   680M      0      0    867  5.83K
    sdb3                                                   -      -      0      0    500  2.91K
    sdd3                                                   -      -      0      0    366  2.91K
-----------------------------------------------------  -----  -----  -----  -----  -----  -----
rpoolssd                                               4.28G  95.2G      3     11  47.2K   160K
  mirror                                               4.28G  95.2G      3     11  47.2K   160K
    sdb4                                                   -      -      1      5  23.7K  79.9K
    sdd4                                                   -      -      1      5  23.5K  79.9K
-----------------------------------------------------  -----  -----  -----  -----  -----  -----
tank                                                   6.64T  4.25T      0    178  16.5K  21.1M
  mirror                                               6.62T   664G      0     49  16.0K  4.69M
    sdc4                                                   -      -      0     24  8.97K  2.35M
    sde4                                                   -      -      0     24  7.04K  2.35M
  mirror                                               25.1G  3.60T      0    128    546  16.4M
    sda4                                                   -      -      0     70    293  8.21M
    sdf4                                                   -      -      0     58    252  8.21M
cache                                                      -      -      -      -      -      -
  ata-Samsung_SSD_850_EVO_500GB_S2RANX0J408632Y-part5   479M   364G      0    162  5.61K  19.8M
  ata-Crucial_CT480M500SSD1_1311092ED40E-part5          444M   345G      0    152  5.61K  18.3M
-----------------------------------------------------  -----  -----  -----  -----  -----  -----

Note that the two cache devices are of different sizes. ZFS doesn't care: they are striped anyway, and having two provides no special redundancy, since the cache is disposable. That is different from the SLOG configuration, see below.

Also note that the L2ARC cache is indexed in memory, and that index itself takes memory away from the in-memory ARC cache, so too big a cache can actually be detrimental. The Arch wiki suggests this formula for that memory usage:

(L2ARC size) / (recordsize) * 70 bytes

... where recordsize is typically 128KiB. So in our case, it would mean:

70B×345GB/128KiB = ((70 × byte) × (345 × gigabyte))/(128 × kibibyte)
≈ 184.249 877 930 MB

... 200MB of RAM, not a problem, given this machine has 12GB of RAM:

root@tubman:~# free -h
               total        used        free      shared  buff/cache   available
Mem:            11Gi       6.4Gi       5.1Gi       0.0Ki       188Mi       5.1Gi
Swap:             0B          0B          0B
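The same estimate, redone with integer shell arithmetic (a sketch; 345GB is the size of the smaller cache device, 128KiB the assumed recordsize):

```shell
l2arc=345000000000             # L2ARC device size, in bytes
recordsize=$(( 128 * 1024 ))   # 128 KiB recordsize
echo "$(( l2arc / recordsize * 70 / 1000000 )) MB of ARC index"
```

... which prints 184 MB of ARC index, in line with the formula's result.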

SLOG caches

SLOG caches are more sensitive: they are where ZFS commits a write before confirming it to the caller, so they need a reliable storage medium. By default, that intent log (the ZIL) lives on the main pool itself; for a dedicated SLOG, you'd typically use NVMe or fast SSD storage. When using NVMe or SSDs, you want to make sure those are mirrored, so that a failure of one drive loses no data.

To create a SLOG, you should first choose its size. It doesn't have to be as big as the L2ARC cache, because it's only a write cache that gets regularly flushed to disk. This article from Klara Systems suggests:

Often, 16GB to 64GB is sufficient. For a busy server with a lot of writes, a general rule of thumb for calculating size is: max amount of write traffic per second x 15.

The TrueNAS docs also say:

The iXsystems current recommendation is a 16 GB SLOG device over-provisioned from larger SSDs to increase the write endurance and throughput of an individual SSD. This 16 GB size recommendation is based on performance characteristics of typical HDD pools with SSD SLOGs and capped by the value of the tunable vfs.zfs.dirty_data_max_max.

The parameter vfs.zfs.dirty_data_max_max defaults to 25% of physical RAM which, in my case, is 3GB:

root@tubman:~# cat /sys/module/zfs/parameters/zfs_dirty_data_max_max
3137032192
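Multiplying that value back by 4 should give roughly the machine's total RAM, which matches the 11Gi reported by free -h earlier (a quick cross-check):

```shell
# zfs_dirty_data_max_max defaults to 25% of physical RAM
echo "$(( 3137032192 * 4 / 1024 / 1024 / 1024 )) GiB RAM implied"
```

... which prints 11 GiB RAM implied.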

Considering that the machine's RAM might be upgraded in the future, it's worth padding that figure a little, so we're going to pick 16GB as suggested. The final partition table looks something like this:

Number  Start (sector)    End (sector)  Size       Code  Name
   1              48            2047   1000.0 KiB  EF02
   2            2048         1050623   512.0 MiB   EF00
   3         1050624         3147775   1024.0 MiB  BF01  bpool
   4         3147776       212862975   100.0 GiB   BF00  rpool
   5       212862976       246417407   16.0 GiB    BF00  SLOG
   6       246417408       937703054   329.6 GiB   BF00  L2ARC

To create the cache, we use the disks' stable /dev/disk/by-id names (as explained in the L2ARC section):

zpool add tank log mirror \
  /dev/disk/by-id/ata-Samsung_SSD_850_EVO_500GB_S2RANX0J408632Y-part5 \
  /dev/disk/by-id/ata-Crucial_CT480M500SSD1_1311092ED40E-part5

Also be careful to use the log keyword here. If you forget it, you will extend the pool with a new mirror device, striped with the other mirrors!

To see how effective the SLOG is, you can watch it with:

zpool iostat -v tank 1

... and you will see it fill and empty as the timeout (zfs_txg_timeout, set in /sys/module/zfs/parameters/zfs_txg_timeout, defaults to 5 seconds) expires. You can raise that timeout to make more use of the SLOG, if you are comfortable with a larger window of data loss should the SLOG fail.
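For example, to inspect and raise that timeout at runtime (a sketch; the echo takes effect immediately but does not persist across reboots, and the /etc/modprobe.d/zfs-txg.conf filename is arbitrary):

```shell
# current transaction group timeout, in seconds (defaults to 5)
cat /sys/module/zfs/parameters/zfs_txg_timeout
# raise it to 10 seconds on the running system
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout
# to persist across reboots, set a module option instead:
echo 'options zfs zfs_txg_timeout=10' > /etc/modprobe.d/zfs-txg.conf
```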

Removing a SLOG device is a little different from removing an L2ARC cache device: you need to remove the entire mirror, you can't remove individual devices. First, find the mirror that's under the log:

zpool status tank

... then remove that mirror. Be careful not to remove the actual data mirror!

zpool remove tank mirror-4

See also this documentation on the SLOG for more information.

Next steps

TODO:

Done

Decisions taken during the procedure

Changes from the original procedure

Abandoned ideas

To be improved

Troubleshooting

Conversion into a backup server

Originally, this server was meant to be a test server, a "lab" if you will, to do some tests on ZFS and generally just have another PoP for some of my services. The server was running, for example, https://radio.anarc.at. But it was still using the really old marcos v1 hardware, and was due for an upgrade.

It was therefore merged with the server I previously used for offsite backups (toutatis), by moving its disks in the new backup server's body. Tubman's install was kept, but the data was moved around disks quite a bit.

Before

This is how tubman's disks were laid out before the transfer:

And this was toutatis:

After

This is not ideal. Ideally, all drives would be the same size (e.g. 8TB) and use some RAID-Z layout to optimize available disk space (e.g. better than RAID-1). That could still be done, but by rebuilding a new vdev using 2x8TB drives, as a future expansion. Only one SATA connector is available on board right now, so this would be a tricky operation, probably involving degrading the tank pool.

It might have been possible to RAID-0 the 2x4TB drives to present an extra 8TB device to the ZFS pool, but this idea was rejected as too risky and clunky. ZFS itself doesn't support such a configuration.

tank pool creation

We create another pool, called tank, for the 2x8TB drives, fully encrypted. The point of this is to have a separate pool from the main system to alleviate any possible confusion. It will also make it possible to move the system (and only that) to SSD (it's currently on 2x4TB + 500GB SSD cache).

  1. first, partition the new disk (we reuse the disk formatting command used for curie, see 2022-11-17-zfs-migration):

    sgdisk --zap-all /dev/sdc
    sgdisk -a1 -n1:24K:+1000K -t1:EF02 /dev/sdc
    sgdisk     -n2:1M:+512M   -t2:EF00 /dev/sdc
    sgdisk     -n3:0:+1G      -t3:BF01 /dev/sdc
    sgdisk     -n4:0:0        -t4:BF00 /dev/sdc
    

    ... that opens the possibility of running a full system on that disk (because of 1GB for cleartext /boot and MBR/EFI partitions), at the cost of 1GB lost

  2. create a fake file to fool ZFS into thinking there is a second disk:

    truncate -s 8TB /tmp/8tb.raw
    
  3. create the pool with the fake disk (note the -f flag, to force):

    zpool create \
        -o ashift=12 \
        -O encryption=on -O keylocation=prompt -O keyformat=passphrase \
        -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
        -O compression=zstd \
        -O relatime=on \
        -O canmount=off \
        -O mountpoint=none \
        -f \
        tank \
        mirror /dev/sdb4 /tmp/8tb.raw
    
  4. immediately tell zpool to forget about the fake disk:

    zpool offline tank /tmp/8tb.raw
    
  5. cleanup:

    rm /tmp/8tb.raw
    
  6. make an actual filesystem:

    zfs create -o mountpoint=/srv tank/srv
    

It should look like this:

root@tubman:/# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
config:

        NAME              STATE     READ WRITE CKSUM
        tank              DEGRADED     0     0     0
          mirror-0        DEGRADED     0     0     0
            sdc4          ONLINE       0     0     0
            /tmp/8tb.raw  OFFLINE      0     0     0

errors: No known data errors

I actually also ran:

zpool detach tank /tmp/8tb.raw

... but I'm not sure that's a good idea, because now ZFS thinks this is not a mirror anymore.

root@tubman:~# zpool detach tank /tmp/8tb.raw
root@tubman:~# zpool status tank
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          sdc4      ONLINE       0     0     0

errors: No known data errors

That, fortunately, is easily fixed:

root@tubman:~# truncate -s 8T /tmp/8tb.raw
root@tubman:~# zpool attach tank /dev/sdc4 /tmp/8tb.raw
root@tubman:~# zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 1.98M in 00:00:00 with 0 errors on Fri Oct 14 15:23:50 2022
config:

        NAME              STATE     READ WRITE CKSUM
        tank              ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            sdc4          ONLINE       0     0     0
            /tmp/8tb.raw  ONLINE       0     0     0

errors: No known data errors

It might also have been possible to just create a pool normally, with a single disk, and reattach the second one when done.
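The fake-disk trick works because truncate creates a sparse file: the 8TB is only an apparent size, with no blocks actually allocated, so it fits comfortably in /tmp. A quick demonstration (a sketch):

```shell
truncate -s 8TB /tmp/8tb.raw
# apparent size is 8TB, but zero blocks are allocated on disk
stat -c 'apparent: %s bytes, allocated: %b blocks' /tmp/8tb.raw
rm /tmp/8tb.raw
```

ZFS only needs to write a few vdev labels to the file before it is taken offline, so no real 8TB of storage is ever required.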

first rsync transfer

The files were copied from ext4 to ZFS with this magic rsync command:

rsync -ASHaXx --info=progress2 /mnt/ /srv/

/dev/sde was mounted in /mnt and had all the old data:

root@tubman:~# df -h /mnt /srv
Filesystem      Size  Used Avail Use% Mounted on
/dev/sde1       7.2T  6.7T  142G  98% /mnt
tank/srv        7.1T  152M  7.1T   1% /srv

The ETA was:

6.7Tbyte/(60MB/s) = (6.7 × terabyte)/(60 × (megabyte/second))
= 1 d + 7 h + 1 min + 6.666… s

AKA about 31 hours.
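
That estimate can be reproduced with plain shell arithmetic (integer division, so it rounds down to whole hours):

```shell
# 6.7 TB at 60 MB/s, expressed in hours
eta_hours=$(( 6700000000000 / 60000000 / 3600 ))
echo "$eta_hours hours"
```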

The rsync command started at 2022-10-14T15:27-04:00, and finished some time before 2022-10-15T18:29-04:00. That is a little over 27 hours of run time, which is faster than the above estimate. The final rsync output was:

root@tubman:/# rsync -ASHaXx --info=progress2 /mnt/ /srv/
7,326,208,287,067  99%   71.87MB/s   27:00:08 (xfr#250467, to-chk=0/592532)
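
As a sanity check, that final output is self-consistent: 7,326,208,287,067 bytes over 27:00:08 works out to the reported 71.87MB/s, provided rsync's "MB" actually means 2^20 bytes (an assumption on my part, but the numbers only line up that way):

```shell
secs=$(( 27*3600 + 0*60 + 8 ))
rate=$(awk -v s="$secs" 'BEGIN { printf "%.2f", 7326208287067 / s / (1024*1024) }')
echo "$rate MB/s"
```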

resilvering

AKA rebuilding or adding back the old disk:

  1. partition sde:

    sgdisk --zap-all /dev/sde &&
    sgdisk -a1 -n1:24K:+1000K -t1:EF02 /dev/sde &&
    sgdisk     -n2:1M:+512M   -t2:EF00 /dev/sde &&
    sgdisk     -n3:0:+1G      -t3:BF01 /dev/sde &&
    sgdisk     -n4:0:0        -t4:BF00 /dev/sde
    
  2. resilver tank with sde4:

    date; time zpool replace tank /tmp/8tb.raw /dev/sde4; date
    

    in progress, started at 2022-10-15T21:40-04:00:

    root@tubman:~# zpool status tank
      pool: tank
     state: DEGRADED
    status: One or more devices is currently being resilvered.  The pool will
            continue to function, possibly in a degraded state.
    action: Wait for the resilver to complete.
      scan: resilver in progress since Sat Oct 15 21:40:16 2022
            151G scanned at 2.36G/s, 936K issued at 14.6K/s, 6.61T total
            0B resilvered, 0.00% done, no estimated completion time
    config:
    
            NAME                STATE     READ WRITE CKSUM
            tank                DEGRADED     0     0     0
              mirror-0          DEGRADED     0     0     0
                sdc4            ONLINE       0     0     0
                replacing-1     DEGRADED     0     0     0
                  /tmp/8tb.raw  OFFLINE      0     0     0
                  sde4          ONLINE       0     0     0
    
    errors: No known data errors
    

    Odd, that 0B resilvered. Eventually though, it did give me an estimate, about 5 minutes in:

    21.5G resilvered, 0.32% done, 1 days 02:27:10 to go
    

    ... which was suspiciously similar to the final rsync run time (~27 hours). After 10 minutes, we had this more encouraging estimate:

    80.7G resilvered, 1.19% done, 14:22:52 to go
    

    Interestingly, I have no idea if I'll get a notification when the thing is finished resilvering. Logging progress with:

    while sleep 600; do
        zpool status tank | grep -e scanned -e resilvered, \
        | sed 's/[\t ]*//' | logger -t resilver --id=$$
    done
    
  3. once the resilver finishes, you get an email notification from zed; in my case, it said:

    scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
    

Full status says:

root@tubman:~# zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc4    ONLINE       0     0     0
            sde4    ONLINE       0     0     0

errors: No known data errors

That's an average 146.931MB/s, or 1.175Gbit/s, pretty nice.
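
The arithmetic behind that figure, taking 6.63T as 6.63×10^12 bytes (which is how the number above was computed; ZFS actually reports tebibytes, so the real rate is slightly higher):

```shell
# 6.63T resilvered in 12:32:03
secs=$(( 12*3600 + 32*60 + 3 ))
rate=$(awk -v s="$secs" 'BEGIN { printf "%.1f MB/s, %.3f Gbit/s", 6.63e12 / s / 1e6, 6.63e12 / s * 8 / 1e9 }')
echo "$rate"
```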

move rpool and bpool to SSDs

  1. take SSD cache offline:

    zpool remove rpool /dev/sdd3
    
  2. partition SSD disks, keep most of the disk for caching (100G system, rest for caching):

    for device in /dev/sdb /dev/sdd ; do
      sgdisk --zap-all $device &&
      sgdisk -a8 -n1:24K:+1000K -t1:EF02 \
                 -n2:1M:+512M   -t2:EF00 \
                 -n3:0:+1G      -t3:BF01 \
                 -n4:0:+100G    -t4:BF00 \
                 -n5:0:0        -t5:BF00 \
                 $device
    done
    
  3. create new pools for the SSD drives (rpoolssd, bpoolssd?):

    zpool create \
        -o cachefile=/etc/zfs/zpool.cache \
        -o ashift=12 -d \
        -o feature@async_destroy=enabled \
        -o feature@bookmarks=enabled \
        -o feature@embedded_data=enabled \
        -o feature@empty_bpobj=enabled \
        -o feature@enabled_txg=enabled \
        -o feature@extensible_dataset=enabled \
        -o feature@filesystem_limits=enabled \
        -o feature@hole_birth=enabled \
        -o feature@large_blocks=enabled \
        -o feature@lz4_compress=enabled \
        -o feature@spacemap_histogram=enabled \
        -o feature@zpool_checkpoint=enabled \
        -O acltype=posixacl -O canmount=off \
        -O compression=lz4 \
        -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
        -O mountpoint=/boot -R /mnt \
        bpoolssd mirror /dev/sdb3 /dev/sdd3
    zpool create \
        -o ashift=12 \
        -O encryption=on -O keylocation=prompt -O keyformat=passphrase \
        -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
        -O compression=zstd \
        -O relatime=on \
        -O canmount=off \
        -O mountpoint=/ -R /mnt \
        rpoolssd mirror /dev/sdb4 /dev/sdd4
    
  4. copy the datasets over to the new pool:

    zfs snapshot -r bpool@shrink &&
    zfs send -vR bpool@shrink | zfs receive -vFd bpoolssd
    

    The above worked, and quickly. The same with rpool, however:

    zfs snapshot -r rpool@shrink &&
    zfs send -vR rpool@shrink | zfs receive -vFd rpoolssd
    

    ... failed with:

    cannot send rpool@shrink: encrypted dataset rpool may not be sent with properties without the raw flag
    

    [many more attempts later, see below for the full discussion]

    A workaround I found is to specify each dataset individually (inspired by this reddit discussion):

    for dataset in $(zfs list -H -o name  | grep -E 'rpool($|/)')
    do 
        zfs send $dataset@shrink | zfs receive -vd rpoolssd
    done
    zfs set mountpoint=none rpoolssd/ROOT
    zfs set mountpoint=/ rpoolssd/ROOT/debian
    

    The last set mountpoint is necessary because otherwise the mountpoints are wrong:

    root@tubman:~# zfs list | grep ROOT
    NAME                   USED  AVAIL     REFER  MOUNTPOINT
    rpoolssd/ROOT         1.67G  92.5G      200K  /mnt/ROOT
    rpoolssd/ROOT/debian  1.67G  92.5G     1.37G  /mnt/ROOT/debian
    

    ... and /mnt is basically empty

    root@tubman:~# ls /mnt/
    home  ROOT  var
    

    After the remount, things look more logical:

    root@tubman:~# zfs list | grep ROOT
    NAME                   USED  AVAIL     REFER  MOUNTPOINT
    rpool/ROOT            4.56G  3.50T      192K  none
    rpool/ROOT/debian     4.56G  3.50T     1.62G  /
    

    This shows the base datasets are the same:

    root@tubman:~# diff -u <(zfs list -o name | grep rpoolssd | sed 's/rpoolssd/rpool/') <(zfs list -o name | grep rpool | grep -v rpoolssd)
    root@tubman:~#
    

    ... but of course we're missing a lot of snapshots:

    root@tubman:~# zfs list -t snapshot| grep rpoolssd | wc -l
    10
    root@tubman:~# zfs list -t snapshot| grep rpool | grep -v rpoolssd | wc -l
    395
    

    And we tweaked the mountpoints a little, so that the root dataset doesn't mount on / anymore:

    root@tubman:~# diff -u <(zfs list -o name,mountpoint | grep rpoolssd | sed 's/rpoolssd/rpool/;s,/mnt/,/,;s,/mnt,/,;s/  */ /') <(zfs list -o name,mountpoint | grep rpool | grep -v rpoolssd | sed 's/  */ /')
    --- /dev/fd/63  2022-10-18 10:52:30.654369806 -0400
    +++ /dev/fd/62  2022-10-18 10:52:30.654369806 -0400
    @@ -1,4 +1,4 @@
    -rpool none
    +rpool /
     rpool/ROOT none
     rpool/ROOT/debian /
     rpool/home /home
    

    The downside of this approach is that it's clunky and it doesn't copy over snapshots. But for my use case (just move this shit over already!), it's going to be sufficient. We could actually go through each snapshot the same way again, of course, but at that point we're basically trying to reimplement -R by hand, and failing.

  5. reinstall grub:

    for fs in /run /sys /dev /proc; do 
        mount -o rbind $fs "/mnt${fs}"
    done
    zfs mount bpoolssd/BOOT/debian
    sed -i 's,ZFS=rpool,ZFS=rpoolssd,' /mnt/etc/default/grub
    sed -i  s/bpool/bpoolssd/ /mnt/etc/systemd/system/zfs-import-bpool.service
    rm /mnt/etc/zfs/zfs-list.cache/*
    touch /mnt/etc/zfs/zfs-list.cache/bpoolssd
    touch /mnt/etc/zfs/zfs-list.cache/rpoolssd
    chroot /mnt /usr/sbin/update-grub
    chroot /mnt /usr/sbin/grub-install /dev/sdb
    chroot /mnt /usr/sbin/grub-install /dev/sdd
    for fs in /run /sys /dev /proc; do 
        umount "/mnt${fs}"
    done
    
  6. reboot and make sure we boot from the SSD drives (e.g. the new pool)

    reboot
    
  7. stop using the old pools:

    zpool export bpool
    zpool export rpool
    
  8. reboot again:

    reboot
    

    Now make sure that the old pools really are not used: zpool status shouldn't show the old pools anymore.

Note that for this conversion, we cannot just attach and detach the SSD drives, because they are different sizes than the other disks in the pool. We could use add to create a second mirror in the pool and remove to drop the old mirror, moving the data over to the new SSD drives, but that may make the bpool unbootable, among other problems.

Also, the procedure the above is inspired by goes through the extra steps of recreating rpool and bpool, reattaching the drives there, and destroying rpoolssd and bpoolssd, essentially as a way to rename the pools back to their original names. It is possible to "just" rename a pool, but it must not be in use, so probably the simplest way to do this would be to boot a rescue image and use export/import to rename the pools.
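
That export/import rename would go something like this, from a rescue environment where the pools are not in use (an untested sketch; the encrypted rpool would also prompt for its passphrase on import):

```shell
zpool export rpoolssd
zpool import rpoolssd rpool    # import under the original name
zpool export bpoolssd
zpool import bpoolssd bpool
# then fix the bootloader references to the pool names and reboot
```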

Moving encrypted pools is hard

In step 4, above, we failed to just move the rpool datasets to the new pool:

zfs snapshot -r rpool@shrink &&
zfs send -vR rpool@shrink | zfs receive -vFd rpoolssd

... failed with:

cannot send rpool@shrink: encrypted dataset rpool may not be sent with properties without the raw flag

I have tried using the --raw flag to send the dataset, but then that fails with:

root@tubman:~# zfs send --raw -R rpool@shrink | zfs receive -vFd rpoolssd
cannot receive new filesystem stream: zfs receive -F cannot be used to destroy an encrypted filesystem or overwrite an unencrypted one with an encrypted one

Okay, then let's try to remove -F:

root@tubman:~# zfs send --raw -R rpool@shrink | zfs receive -vd rpoolssd
cannot receive new filesystem stream: destination 'rpoolssd' exists
must specify -F to overwrite it

AAARGHL. And of course, we can't destroy rpoolssd, that would destroy the entire pool.

One thing that does seem to work is to use -e instead of -d:

zfs send --raw -R rpool@shrink | zfs receive -v -e rpoolssd

I found out about the -e flag in this post. The description of -d and -e is actually rather confusing in the man page:

The -d and -e options cause the file system name of the target snapshot to be determined by appending a portion of the sent snapshot's name to the specified target filesystem. If the -d option is specified, all but the first element of the sent snapshot's file system path (usually the pool name) is used and any required intermediate file systems within the specified one are created. If the -e option is specified, then only the last element of the sent snapshot's file system name (i.e. the name of the source file system itself) is used as the target file system name.

I actually can't make heads or tails of this, but essentially, -e seems to do nothing at all here, which means we end up with an extra rpool component in the dataset path:

root@tubman:~# zfs list | grep rpool | grep -v rpoolssd | head -3
rpool                       7.16G  3.50T      192K  /
rpool/ROOT                  4.56G  3.50T      192K  none
rpool/ROOT/debian           4.56G  3.50T     1.62G  /
root@tubman:~# zfs list | grep rpoolssd | head -3
rpoolssd                    6.98G  89.4G      192K  /mnt
rpoolssd/rpool              6.95G  89.4G      168K  /mnt
rpoolssd/rpool/ROOT         4.55G  89.4G      168K  none

So that's also wrong. I eventually ended up with the procedure detailed in step 4, above, to individually copy over the datasets, one by one. This doesn't work as well; the snapshots are not copied over, for example. But it's better than nothing, which was the situation I was stuck with for days.

It's possible that this situation is specific to the Debian and Ubuntu install guides, which put most datasets directly on the root dataset (e.g. rpool/var instead of rpool/ROOT/var). Adding that layer of indirection might help in such situations, but the jury is still out on that; see this discussion.
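
For what it's worth, my reading of the -d/-e name mapping can be paraphrased in shell, using a hypothetical dataset name (this matches the behavior observed above, but it is just my interpretation of the man page):

```shell
src="rpool/ROOT/debian@shrink"      # hypothetical sent snapshot
fs="${src%@*}"                      # rpool/ROOT/debian
# -d: drop only the first element (the pool name), keep the rest of the path
d_target="rpoolssd/${fs#*/}"        # rpoolssd/ROOT/debian
# -e: keep only the last element of the file system name
e_target="rpoolssd/${fs##*/}"       # rpoolssd/debian
echo "$d_target"
echo "$e_target"
```

Which would also explain the extra rpool component: when the sent file system is the top-level rpool itself, its "last element" is rpool, so -e grafts the whole tree under rpoolssd/rpool.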

SSD TRIM

See zfs.

extending the main tank

Once we are confident we can boot without the old HDD pool, we can repartition the old drives and add them to tank. The drives are already partitioned, and it was probably done with something like this, from what I can tell:

for device in /dev/sda /dev/sdf ; do
  sgdisk --zap-all $device &&
  sgdisk -a8 -n1:24K:+1000K -t1:EF02 \
             -n2:1M:+512M   -t2:EF00 \
             -n3:0:+1G      -t3:BF01 \
             -n5:0:0        -t5:BF00 \
             $device
done

In fact, it wasn't quite like that; the exact procedure is step one in the main installation procedure here. The main difference is that here we call sgdisk only once, so the -a flag applies to all the partitions. In the original, we called it multiple times, which means things are not necessarily aligned as they should be. We'll just disregard this for now.

Adding the old drives to the pool is pretty simple. We follow this model where we basically have two RAID-1 mirrors striped together. Eventually, it might make more sense to replace the 2x4TB drives with one 8TB drive and use RAID-Z, but I already broke the bank to get a second 8TB drive to get the first part of this stripe, so this is what we have.

The actual command is:

root@tubman:~# zpool add -n tank mirror /dev/sda4 /dev/sdf4
would update 'tank' to the following configuration:

        tank
          mirror-0
            sdc4
            sde4
          mirror
            sda4
            sdf4

Note that -n makes this a dry run; the actual command doesn't output anything and takes very little time:

root@tubman:~# zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc4    ONLINE       0     0     0
            sde4    ONLINE       0     0     0

errors: No known data errors
root@tubman:~# zpool add tank mirror /dev/sda4 /dev/sdf4
root@tubman:~# zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc4    ONLINE       0     0     0
            sde4    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sda4    ONLINE       0     0     0
            sdf4    ONLINE       0     0     0

errors: No known data errors
root@tubman:~#

I heard the disks scratch for a few seconds and that was it.

I had the problem that the filesystem wasn't coming up on boot. Presumably, because it's not the root filesystem, it needs something special to get loaded. Furthermore, its encryption key is rather problematic to load, since there is no prompt for it in the initrd either. So it's better to switch to a keylocation that actually lives on disk.

umask 0777 &&
dd if=/dev/urandom of=/etc/zfs/tank.key bs=32 count=1 ;
umask 0022 &&
zfs change-key -l -o keylocation=file:///etc/zfs/tank.key tank

Then to make the pool automatically loaded at boot, use:

zpool set cachefile=/etc/zfs/zpool.cache tank

Then the systemd zfs-import-cache.service and zfs-import.service units will make sure the pool is imported. Normally, if zfs-mount.service and zfs.target are enabled, underlying datasets should also be automatically mounted. In our case, however, we need an extra shim to make sure the cryptographic key gets loaded. So we need this unit in /etc/systemd/system/zfs-load-keyfile@.service (a modified version of this service):

[Unit]
Description=Load %I encryption keys from disk
Before=systemd-user-sessions.service zfs-mount.service
After=zfs-import.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=zfs load-key %I

[Install]
WantedBy=zfs-mount.service

... which we enable with:

systemctl enable zfs-load-keyfile@tank.service

We can test this works with:

zfs umount tank/srv &&
zfs unload-key tank &&
systemctl start zfs-load-keyfile@tank.service &&
zfs mount tank/srv

And a reboot is probably in order to make sure systemd doesn't get stuck at a prompt:

reboot

remaining work

Other documentation

See zfs for more documentation on ZFS and 2022-11-17-zfs-migration for another installation and migration procedure.
