tubman
Tubman is named after Harriet Tubman, an "American abolitionist and political activist. Born into slavery, Tubman escaped and subsequently made some 13 missions to rescue approximately 70 enslaved people, including family and friends, using the network of antislavery activists and safe houses known as the Underground Railroad. During the American Civil War, she served as an armed scout and spy for the Union Army. The first woman to lead an armed expedition in the war, she guided the raid at Combahee Ferry, which liberated more than 700 enslaved people. In her later years, Tubman was an activist in the movement for women's suffrage."
I was the conductor of the Underground Railroad for eight years, and I can say what most conductors can't say — I never ran my train off the track and I never lost a passenger.
Specification
tubman's install changed bodies and now lives in toutatis's body, so the specs below are inaccurate.
- motherboard: MSI X58M (MS-7593)
- case: some alien atrocity
- CPU: Intel Core i7 CPU 960 (2009, Nehalem bloomfield, 45nm, 4/8 cores, 3.46GHz) not to be confused with the best-selling, embedded i960 (1984-2007, still in use)
- Memory: 12GiB (3x4GB) DIMM 1066 MHz 0.9ns
- Storage:
- SSD:
- 500GB Samsung SSD 850
- 480GB Crucial CT480M50
- HDD:
- 8TB Seagate IronWolf ST8000VN004-2M21
- 8TB Seagate IronWolf ST8000VN0022-2EL
- 4TB Seagate Barracuda ST4000DM000-1F21
- 4TB Seagate Barracuda ST4000DM004-2CV1
- Network: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
- Display: Oland XT [Radeon HD 8670 / R7 250/350]
- Audio: Oland/Hainan/Cape Verde/Pitcairn HDMI Audio [Radeon HD 7000 Series]
Note that tubman was originally built with the old marcos hardware, but was transplanted into what used to be known as toutatis; see v1 for the old spec. The toutatis install was kept intact, on a stack of 5 disks (3x~2TB HDD, 2x128GB SSD).
4TB disk health inspection
Before the migration from marcos' body to toutatis', a ZFS scrub triggered a warning. The drives were inspected for health; this is the report (copied from 2022-11-17-zfs-migration).
Here's some SMART stats:
root@tubman:~# smartctl -a -qnoserial /dev/sdb | grep -e Head_Flying_Hours -e Power_On_Hours -e Total_LBA -e 'Sector Sizes'
Sector Sizes: 512 bytes logical, 4096 bytes physical
9 Power_On_Hours 0x0032 086 086 000 Old_age Always - 12464 (206 202 0)
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 10966h+55m+23.757s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 21107792664
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 3201579750
That's over a year of power on, which shouldn't be so bad. It has written about 10TB of data (21107792664 LBAs * 512 byte/LBA), which is about two full writes. According to its specification, this device is supposed to support 55 TB/year of writes, so we're far below spec. Note that we are still far from the "non-recoverable read error per bits" spec (1 per 10E15), as we've basically read 13E12 bits (3201579750 LBAs * 512 byte/LBA * 8 bit/byte = 13E12 bits).
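Those conversions can be reproduced with plain shell arithmetic (the LBA counts are taken from the SMART output above):

```shell
# LBA counts from the SMART attributes; logical sectors are 512 bytes
lbas_written=21107792664
lbas_read=3201579750

echo "written: $(( lbas_written * 512 / 1000000000000 )) TB"  # about 10 TB
echo "read: $(( lbas_read * 512 * 8 )) bits"                  # about 13E12 bits
```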
It's likely this disk was made in 2018, so it is in its fourth year.
Interestingly, /dev/sdc is also a Seagate drive, but of a different series:
root@tubman:~# smartctl -qnoserial -i /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-15-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate BarraCuda 3.5
Device Model: ST4000DM004-2CV104
Firmware Version: 0001
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5425 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Tue Oct 11 11:21:35 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
It has seen many more reads than the other disk, which is also interesting:
root@tubman:~# smartctl -a -qnoserial /dev/sdc | grep -e Head_Flying_Hours -e Power_On_Hours -e Total_LBA -e 'Sector Sizes'
Sector Sizes: 512 bytes logical, 4096 bytes physical
9 Power_On_Hours 0x0032 059 059 000 Old_age Always - 36240
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 33994h+10m+52.118s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 30730174438
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 51894566538
That's 4 years of Head_Flying_Hours, and over 4 years (4 years and 48 days) of Power_On_Hours. The copyright date on that drive's specs goes back to 2016, so it's a much older drive.
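As a sanity check on those figures, taking a year as roughly 8766 hours (365.25 days):

```shell
hours=36240   # Power_On_Hours from the SMART output above
# whole years plus leftover days; the exact day count differs slightly
# from the "4 years and 48 days" above depending on the year length used
echo "$(( hours / 8766 )) years and $(( hours % 8766 / 24 )) days"
```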
Installation procedure
I would have used FAI's setup-storage, but it doesn't support ZFS, unfortunately. It is part of the long-term roadmap, that said, and there's a howto for stretch, but that doesn't use setup-storage. I was hoping I could reuse the installer I've been working on at work...
We have the following disk configuration:
- /dev/sda: SSD drive, 512MB used for caching
- /dev/sdb: HDD drive, 4TB, to be used in a ZFS pool with native encryption
- /dev/sdc: HDD drive, 4TB, same
We boot from a grml live image based on Debian testing (bullseye), and will follow this howto:
install requirements:
apt update
apt install --yes debootstrap gdisk dkms dpkg-dev linux-headers-$(uname -r) zfs-dkms
modprobe zfs
apt install --yes zfsutils-linux
Note that those instructions differ from the documentation (we don't use buster-backports) because we start from a bullseye live image.

clear the partitions on the two HDDs, and set up BIOS, UEFI, boot pool and natively encrypted partitions:
for DISK in /dev/sdb /dev/sdc ; do
    sgdisk --zap-all $DISK
    sgdisk -a1 -n1:24K:+1000K -t1:EF02 $DISK
    sgdisk -n2:1M:+512M -t2:EF00 $DISK
    sgdisk -n3:0:+1G -t3:BF01 $DISK
    sgdisk -n4:0:0 -t4:BF00 $DISK
done
resulting partition table:
root@grml ~ # sgdisk -p /dev/sdb
Disk /dev/sdb: 7814037168 sectors, 3.6 TiB
Model: ST4000DM004-2CV1
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): 63B2F372-B4E9-45FF-8151-9706F9F158C9
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 16-sector boundaries
Total free space is 14 sectors (7.0 KiB)

Number  Start (sector)    End (sector)  Size        Code  Name
   1              48            2047    1000.0 KiB  EF02
   2            2048         1050623    512.0 MiB   EF00
   3         1050624         3147775    1024.0 MiB  BF01
   4         3147776      7814037134    3.6 TiB     BF00
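The sizes in that table can be cross-checked from the sector numbers: GPT sector ranges are inclusive, so a partition spans (end - start + 1) 512-byte sectors. A quick check (the part_kib helper is just for illustration):

```shell
# size in KiB = (end - start + 1) sectors × 512 bytes / 1024
part_kib() { echo $(( ($2 - $1 + 1) * 512 / 1024 )); }

part_kib 48 2047          # partition 1: 1000 KiB
part_kib 2048 1050623     # partition 2: 524288 KiB = 512 MiB
part_kib 1050624 3147775  # partition 3: 1048576 KiB = 1024 MiB
```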
create the boot pool called bpool and the root pool called rpool; the latter will prompt for a disk encryption key:

zpool create \
    -o cachefile=/etc/zfs/zpool.cache \
    -o ashift=12 -d \
    -o feature@async_destroy=enabled \
    -o feature@bookmarks=enabled \
    -o feature@embedded_data=enabled \
    -o feature@empty_bpobj=enabled \
    -o feature@enabled_txg=enabled \
    -o feature@extensible_dataset=enabled \
    -o feature@filesystem_limits=enabled \
    -o feature@hole_birth=enabled \
    -o feature@large_blocks=enabled \
    -o feature@lz4_compress=enabled \
    -o feature@spacemap_histogram=enabled \
    -o feature@zpool_checkpoint=enabled \
    -O acltype=posixacl -O canmount=off -O compression=lz4 \
    -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
    -O mountpoint=/boot -R /mnt \
    bpool mirror /dev/sdb3 /dev/sdc3

zpool create \
    -o ashift=12 \
    -O encryption=aes-256-gcm \
    -O keylocation=prompt -O keyformat=passphrase \
    -O acltype=posixacl -O canmount=off -O compression=lz4 \
    -O dnodesize=auto -O normalization=formD -O relatime=on \
    -O xattr=sa -O mountpoint=/ -R /mnt \
    rpool mirror /dev/sdb4 /dev/sdc4
create filesystems and "datasets":
this creates two containers, for ROOT and BOOT:

zfs create -o canmount=off -o mountpoint=none rpool/ROOT
zfs create -o canmount=off -o mountpoint=none bpool/BOOT
this actually creates the boot and root filesystems:
zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/debian
zfs mount rpool/ROOT/debian
zfs create -o mountpoint=/boot bpool/BOOT/debian
then they use even more datasets, although I'm not sure they are all necessary:
zfs create rpool/home
zfs create -o mountpoint=/root rpool/home/root
chmod 700 /mnt/root
zfs create -o canmount=off rpool/var
zfs create -o canmount=off rpool/var/lib
zfs create rpool/var/log
zfs create rpool/var/spool
to exclude temporary files from snapshots, for example:
zfs create -o com.sun:auto-snapshot=false rpool/var/cache
zfs create -o com.sun:auto-snapshot=false rpool/var/tmp
chmod 1777 /mnt/var/tmp
and a /srv:

zfs create rpool/srv
or for Docker (TODO):
zfs create -o com.sun:auto-snapshot=false rpool/var/lib/docker
make a tmpfs for /run:

mkdir /mnt/run
mount -t tmpfs tmpfs /mnt/run
mkdir /mnt/run/lock
install the base system and copy the ZFS config:
debootstrap --components=main,contrib bullseye /mnt
mkdir /mnt/etc/zfs
cp /etc/zfs/zpool.cache /mnt/etc/zfs/
base system configuration:
echo HOSTNAME > /mnt/etc/hostname
vi /mnt/etc/hosts
apt install ca-certificates
echo 'deb https://deb.debian.org/debian-security bullseye-security main contrib' > /etc/apt/sources.list.d/security.list
bind mounts and chroot for more complex config:
mount --rbind /dev /mnt/dev
mount --rbind /proc /mnt/proc
mount --rbind /sys /mnt/sys
chroot /mnt /bin/bash
more base system config:
ln -s /proc/self/mounts /etc/mtab
apt update
apt install --yes console-setup locales
dpkg-reconfigure locales tzdata
ZFS boot configuration
apt install --yes dpkg-dev linux-headers-amd64 linux-image-amd64
apt install --yes zfs-initramfs
echo REMAKE_INITRD=yes > /etc/dkms/zfs.conf
apt install --yes grub-pc
apt remove --purge os-prober
pick a root password
passwd
bpool import hack (TODO: whyy)
cat > /etc/systemd/system/zfs-import-bpool.service <<EOF
[Unit]
DefaultDependencies=no
Before=zfs-import-scan.service
Before=zfs-import-cache.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/zpool import -N -o cachefile=none bpool
# Work-around to preserve zpool cache:
ExecStartPre=-/bin/mv /etc/zfs/zpool.cache /etc/zfs/preboot_zpool.cache
ExecStartPost=-/bin/mv /etc/zfs/preboot_zpool.cache /etc/zfs/zpool.cache

[Install]
WantedBy=zfs-import.target
EOF
systemctl enable zfs-import-bpool.service
enable tmpfs:
ln -s /usr/share/systemd/tmp.mount /etc/systemd/system/ && systemctl enable tmp.mount
grub setup:
root@grml:/# grub-probe /boot
zfs
root@grml:/# update-initramfs -c -k all
update-initramfs: Generating /boot/initrd.img-5.10.0-6-amd64
root@grml:/# sed -i 's,GRUB_CMDLINE_LINUX.*,GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/debian",' /etc/default/grub
root@grml:/# update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.10.0-6-amd64
Found initrd image: /boot/initrd.img-5.10.0-6-amd64
done
root@grml:/# grub-install /dev/sdb
Installing for i386-pc platform.
Installation finished. No error reported.
root@grml:/# grub-install /dev/sdc
Installing for i386-pc platform.
Installation finished. No error reported.
make sure you check both disks in there:
dpkg-reconfigure grub-pc
filesystem mount ordering, rationale in the OpenZFS guide:
mkdir /etc/zfs/zfs-list.cache
touch /etc/zfs/zfs-list.cache/bpool
touch /etc/zfs/zfs-list.cache/rpool
zed -F &
then verify the files have data:
root@grml:/# cat /etc/zfs/zfs-list.cache/bpool
bpool             /mnt/boot  off  on  on  off  on  off  on  off  -  none  -  -  -  -  -  -  -  -
bpool/BOOT        none       off  on  on  off  on  off  on  off  -  none  -  -  -  -  -  -  -  -
bpool/BOOT/debian /mnt/boot  on   on  on  off  on  off  on  off  -  none  -  -  -  -  -  -  -  -
root@grml:/# cat /etc/zfs/zfs-list.cache/rpool
rpool             /mnt            off     on  on  on  on  off  on  off  rpool  prompt  -  -  -  -  -  -  -  -
rpool/ROOT        none            off     on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/ROOT/debian /mnt            noauto  on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/home        /mnt/home       on      on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/home/root   /mnt/root       on      on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/srv         /mnt/srv        on      on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/var         /mnt/var        off     on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/var/cache   /mnt/var/cache  on      on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/var/lib     /mnt/var/lib    off     on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/var/log     /mnt/var/log    on      on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/var/spool   /mnt/var/spool  on      on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/var/tmp     /mnt/var/tmp    on      on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
root@grml:/# fg
zed -F
^C
Exiting
fix the paths to eliminate /mnt:

sed -Ei "s|/mnt/?|/|" /etc/zfs/zfs-list.cache/*
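That sed expression replaces only the first /mnt occurrence on each line, which is the mountpoint column. A dry run on two sample lines (hypothetical data, in the style of the cache files):

```shell
# first line: root dataset (no trailing slash); second: a child dataset
printf 'rpool\t/mnt\nrpool/home\t/mnt/home\n' | sed -E "s|/mnt/?|/|"
# rpool       /
# rpool/home  /home
```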
extra config, setup SSH with auth key:
apt install --yes openssh-server
mkdir /root/.ssh/
cat > /root/.ssh/authorized_keys <<EOF
...
EOF
snapshot initial install:
zfs snapshot bpool/BOOT/debian@install
zfs snapshot rpool/ROOT/debian@install
exit chroot:
exit
unmount filesystems:
mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
    xargs -i{} umount -lf {}
zpool export -a
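The tac in there matters: mount lists parent mounts before the mounts nested inside them, so reversing the list unmounts children first. A dry run on a mock mount table (hypothetical output, printing instead of unmounting):

```shell
# fake `mount` output; innermost mounts appear last, as mount(8) prints them
mock='udev on /mnt/dev type devtmpfs (rw)
proc on /mnt/proc type proc (rw)
tmpfs on /mnt/run type tmpfs (rw)'

# same pipeline as above, minus the destructive xargs step
echo "$mock" | grep -v zfs | tac | awk '/\/mnt/ {print $3}'
# /mnt/run
# /mnt/proc
# /mnt/dev
```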
reboot:
reboot
That procedure actually worked! The only problem was the interfaces(5) configuration, which was missing (regardless of what the above says). I want to do systemd-networkd anyways.
We performed steps 1 through 6; the remaining steps are optional or troubleshooting.
SSD caching
The machine has been installed on two HDDs: spinning rust! Those are typically slow, but they are redundant, which should ensure high availability. To boost performance, we're setting up an SSD cache.
ZFS has two types of caches:
- write intent log (external ZIL or SLOG)
- layer 2 adaptive replacement cache (L2ARC)
L2ARC
The L2ARC is purely a performance cache, and if it dies, no data is lost. The former, however, can cause data loss (typically a few seconds, but still) in case the drive dies. So we're going with L2ARC, based on this source for the redundancy claim.
We also use the L2ARC cache because it's useful for read caching as well. The SLOG cache is mostly useful for write-heavy workloads, which is not the case for this server.
To configure the L2ARC cache, we simply did this:
zpool add rpool cache /dev/sda3
(Actually, -f was necessary because there already was a crypto_LUKS partition on there, which we didn't care about.)
The sda3 device is the third partition on the SSD drive. It's 465GB, so it should provide a lot of space for the cache.

The status of the cache can be found with the zpool iostat command:
root@tubman:~# zpool iostat -v
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
bpool 47.8M 912M 0 0 3 14
mirror 47.8M 912M 0 0 3 14
sdb3 - - 0 0 1 7
sdc3 - - 0 0 1 7
---------- ----- ----- ----- ----- ----- -----
rpool 1.29G 3.62T 0 60 437 432K
mirror 1.29G 3.62T 0 60 437 432K
sdb4 - - 0 30 199 216K
sdc4 - - 0 30 238 216K
cache - - - - - -
sda3 326M 465G 0 183 4.96K 11.9M
---------- ----- ----- ----- ----- ----- -----
Note that this guide actually discourages the use of symbolic names like sda. Quoting that warning directly:
WARNING: Some motherboards will not present disks in a consistent manner to the Linux kernel across reboots. As such, a disk identified as /dev/sda on one boot might be /dev/sdb on the next. For the main pool where your data is stored, this is not a problem as ZFS can reconstruct the VDEVs based on the metadata geometry. For your L2ARC and SLOG devices, however, no such metadata exists. [...] If you don't heed this warning, your L2ARC device may not be added to your hybrid pool at all, and you will need to re-add it later. This could drastically affect the performance of the applications when pulling evicted pages off of disk.
TL;DR: the cache might disappear after a reboot if disk ordering is changed by the BIOS. This only affects caches like the L2ARC (above) and the SLOG.
Eventually, there were two SSD drives in this system, and both were added as caches (following the above warning), with:
zpool add tank cache \
/dev/disk/by-id/ata-Samsung_SSD_850_EVO_500GB_S2RANX0J408632Y-part5 \
/dev/disk/by-id/ata-Crucial_CT480M500SSD1_1311092ED40E-part5
... it makes the zpool status output quite large though:
root@tubman:~# zpool status tank
pool: tank
state: ONLINE
scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdc4 ONLINE 0 0 0
sde4 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
sda4 ONLINE 0 0 0
sdf4 ONLINE 0 0 0
cache
ata-Samsung_SSD_850_EVO_500GB_S2RANX0J408632Y-part5 ONLINE 0 0 0
ata-Crucial_CT480M500SSD1_1311092ED40E-part5 ONLINE 0 0 0
errors: No known data errors
root@tubman:~# zpool iostat -v
capacity operations bandwidth
pool alloc free read write read write
----------------------------------------------------- ----- ----- ----- ----- ----- -----
bpoolssd 280M 680M 0 0 867 5.83K
mirror 280M 680M 0 0 867 5.83K
sdb3 - - 0 0 500 2.91K
sdd3 - - 0 0 366 2.91K
----------------------------------------------------- ----- ----- ----- ----- ----- -----
rpoolssd 4.28G 95.2G 3 11 47.2K 160K
mirror 4.28G 95.2G 3 11 47.2K 160K
sdb4 - - 1 5 23.7K 79.9K
sdd4 - - 1 5 23.5K 79.9K
----------------------------------------------------- ----- ----- ----- ----- ----- -----
tank 6.64T 4.25T 0 178 16.5K 21.1M
mirror 6.62T 664G 0 49 16.0K 4.69M
sdc4 - - 0 24 8.97K 2.35M
sde4 - - 0 24 7.04K 2.35M
mirror 25.1G 3.60T 0 128 546 16.4M
sda4 - - 0 70 293 8.21M
sdf4 - - 0 58 252 8.21M
cache - - - - - -
ata-Samsung_SSD_850_EVO_500GB_S2RANX0J408632Y-part5 479M 364G 0 162 5.61K 19.8M
ata-Crucial_CT480M500SSD1_1311092ED40E-part5 444M 345G 0 152 5.61K 18.3M
----------------------------------------------------- ----- ----- ----- ----- ----- -----
Note that the two caches are different sizes. ZFS doesn't care: they are striped anyway. It also doesn't matter that there are two; that provides no special redundancy, as the cache is disposable. That is different from the SLOG configuration, see below.
Also note that the L2ARC cache is indexed in memory and that, in itself, takes memory from the in-memory ARC cache, so it might actually be detrimental to have too big of a cache. The arch wiki suggests the formula for that memory usage is:
(L2ARC size) / (recordsize) * 70 bytes
... where recordsize is typically 128KiB. So in our case, it would mean:
70B×345GB/128KiB = ((70 × byte) × (345 × gigabyte))/(128 × kibibyte)
≈ 184.249 877 930 MB
... 200MB of RAM, not a problem, given this machine has 12GB of RAM:
root@tubman:~# free -h
total used free shared buff/cache available
Mem: 11Gi 6.4Gi 5.1Gi 0.0Ki 188Mi 5.1Gi
Swap: 0B 0B 0B
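The estimate above checks out with integer shell arithmetic (taking 345GB as a decimal gigabyte figure):

```shell
l2arc=$(( 345 * 1000 * 1000 * 1000 ))  # 345 GB of L2ARC
recordsize=$(( 128 * 1024 ))           # 128 KiB records
# 70 bytes of in-memory header per cached record
echo "$(( l2arc / recordsize * 70 / 1000000 )) MB of ARC headers"  # → 184 MB
```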
SLOG caches
SLOG caches are more sensitive. They are actually where ZFS will commit a write before confirming it to the caller, so they need a reliable storage medium. By default, that intent log lives on the main pool itself; a dedicated SLOG would typically use NVMe or fast SSD storage. Using NVMe or SSDs, you want to make sure those are mirrored, so that if one drive fails, no data is lost.
To create a SLOG, you should first choose its size. It doesn't have to be as big as the L2ARC cache because it's only a write cache and gets regularly flushed to disk. This article from Klara systems suggests:
Often, 16GB to 64GB is sufficient. For a busy server with a lot of writes, a general rule of thumb for calculating size is: max amount of write traffic per second x 15.
The TrueNAS docs also say:
The iXsystems current recommendation is a 16 GB SLOG device over-provisioned from larger SSDs to increase the write endurance and throughput of an individual SSD. This 16 GB size recommendation is based on performance characteristics of typical HDD pools with SSD SLOGs and capped by the value of the tunable vfs.zfs.dirty_data_max_max.
The parameter vfs.zfs.dirty_data_max_max defaults to 25% of physical RAM which, in my case, is 3GB:
root@tubman:~# cat /sys/module/zfs/parameters/zfs_dirty_data_max_max
3137032192
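That value is exactly a quarter of the physical RAM the kernel sees (a bit less than a full 12GB, since some memory is reserved); back-computing from the number above:

```shell
dirty_max=3137032192      # zfs_dirty_data_max_max, as reported above
ram=$(( dirty_max * 4 ))  # implied physical RAM: 25% of this is the default
echo "$ram bytes"         # → 12548128768, about 11.7GiB
```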
Considering that the core memory might be boosted in the future, it's worth raising the size a little, so we're going to pick 16GB as suggested. The final partition table looks something like this:
Number Start (sector) End (sector) Size Code Name
1 48 2047 1000.0 KiB EF02
2 2048 1050623 512.0 MiB EF00
3 1050624 3147775 1024.0 MiB BF01 bpool
4 3147776 212862975 100.0 GiB BF00 rpool
5 212862976 246417407 16.0 GiB BF00 SLOG
6 246417408 937703054 329.6 GiB BF00 L2ARC
To create the cache, we use the disk's symbolic name (as explained in the L2ARC section):
zpool add tank log mirror \
/dev/disk/by-id/ata-Samsung_SSD_850_EVO_500GB_S2RANX0J408632Y-part5 \
/dev/disk/by-id/ata-Crucial_CT480M500SSD1_1311092ED40E-part5
Also be careful to use the log keyword here. If you forget it, you will extend the pool with a new mirror device, striped with the other mirrors!
To see how effective the SLOG is, you can watch its usage with:

zpool iostat -v tank 1
... and you will see it fill and empty as the timeout (zfs_txg_timeout, set in /sys/module/zfs/parameters/zfs_txg_timeout, defaults to 5 seconds) expires. You can raise that timeout to use the SLOG more, if you are comfortable with losing that many seconds of data if the SLOG fails.
Removing a SLOG device is a little different from removing an L2ARC cache, because you need to remove the entire mirror; you can't remove individual devices. First, find the mirror that's under the log:
zpool status tank
... then remove that mirror, being careful not to remove the actual data mirror!
zpool remove tank mirror-4
See also this documentation on the SLOG for more information.
Next steps
TODO:
- configure swap? (step 7, issues with memory pressure)
- disable log compression? (step 8.3)
- delete install snapshots? that would be something like:

zfs destroy bpool/BOOT/debian@install
zfs destroy rpool/ROOT/debian@install
setup services:
- radio (DONE)
- sonic
- paste
- photos (Nextcloud?)
- torrent
Done
- SSD caching
- static IP
- port forward SSH so that it doesn't land on curie
- report back on the procedure
- automatic snapshots (with sanoid, see the Puppet code and configuration file)
Decisions taken during the procedure
- use a tmpfs for /run
- use native ZFS encryption
- setup both BIOS and UEFI partitions, in case we switch to the latter later
Changes from the original procedure
- we install a bullseye system from a bullseye live image (instead of buster from buster)
- interfaces(5) file untouched, default is fine (allow-hotplug eth0 etc)
- we skip keyboard-configuration and console-setup config, defaults are fine
- this was skipped, as the target file already exists in bullseye:

ln -s /usr/lib/zfs-linux/zed.d/history_event-zfs-list-cacher.sh /etc/zfs/zed.d
Abandoned ideas
- using mmdebstrap: it complains that /mnt is "not empty" even though it only has related mountpoints (actually, that's supported with --skip=check/empty, but that wasn't in the buster manpage and I failed to look at the bullseye one)
To be improved
- the /var/log and /var/spool datasets create needless complexity in the boot process; we could do without them
Troubleshooting
- initrd documentation: booting from a snapshot, rollbacks, etc
- install troubleshooting
Conversion into a backup server
Originally, this server was meant to be a test server, a "lab" if you will, to do some tests on ZFS and generally just have another PoP for some of my services. The server was running, for example, https://radio.anarc.at. But it was still using the really old marcos v1 hardware, and was due for an upgrade.
It was therefore merged with the server I previously used for offsite backups (toutatis), by moving its disks into the new backup server's body. Tubman's install was kept, but the data was moved around disks quite a bit.
Before
This is how tubman's disks were laid out before the transfer:
- 1x500GB SSD cache
- 2x4TB HDD mirror pool (rpool and bpool, 4TB equivalent)
- all disposable data in rpool/srv
- base Debian install, fully managed by Puppet
And this was toutatis:
- 2x2TB + 1x2.5TB RAID-5 HDD array (4TB equivalent)
- 1x8TB HDD single drive (anarcat's "offsite")
- 2x128GB SSD (OS)
After
- 2x500GB SSD mirror pool for base system (rpool and bpool) and cache
- 2x4TB + 2x8TB HDD mirror pool (tank, 12TB equivalent)
- rpool/srv dataset destroyed
This is not ideal. Ideally, all drives would be the same size (e.g. 8TB) and use some RAID-Z layout to optimize available disk space (e.g. better than RAID-1). That could still be done, but by rebuilding a new vdev using 2x8TB drives, as a future expansion. Only one SATA connector is available on board right now, so this would be a tricky operation, probably involving degrading the tank pool.
It might have been possible to RAID-0 the 2x4TB drives to give an extra 8TB drive to the ZFS pool, but this idea was rejected as too risky and clunky. ZFS itself doesn't support such a configuration.
tank pool creation
We create another pool, called tank, for the 2x8TB drives, fully encrypted. The point of this is to have a separate pool from the main system to alleviate any possible confusion. It will also make it possible to move the system (and only that) to SSD (it's currently on 2x4TB + 500GB SSD cache).
first, partition the new disk (we reuse the disk formatting command used for curie, see 2022-11-17-zfs-migration):
sgdisk --zap-all /dev/sdc
sgdisk -a1 -n1:24K:+1000K -t1:EF02 /dev/sdc
sgdisk -n2:1M:+512M -t2:EF00 /dev/sdc
sgdisk -n3:0:+1G -t3:BF01 /dev/sdc
sgdisk -n4:0:0 -t4:BF00 /dev/sdc
... that opens the possibility of running a full system on that disk (because of the 1GB cleartext /boot and the MBR/EFI partitions), at the cost of 1GB lost.
create a fake file to fool ZFS into thinking there is a second disk:
truncate -s 8TB /tmp/8tb.raw
create the pool with the fake disk (notice the -f for force):

zpool create \
    -o ashift=12 \
    -O encryption=on -O keylocation=prompt -O keyformat=passphrase \
    -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
    -O compression=zstd \
    -O relatime=on \
    -O canmount=off \
    -O mountpoint=none \
    -f \
    tank \
    mirror /dev/sdb4 /tmp/8tb.raw
immediately tell zpool to forget about the fake disk:
zpool offline tank /tmp/8tb.raw
cleanup:
rm /tmp/8tb.raw
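The truncate trick works because the file is sparse: it advertises 8TB but allocates no blocks, so ZFS accepts it as a mirror member without the disk space actually existing. A smaller illustration (1GB, hypothetical path):

```shell
truncate -s 1G /tmp/sparse-demo.raw
stat -c %s /tmp/sparse-demo.raw  # apparent size: 1073741824 bytes
du -k /tmp/sparse-demo.raw       # allocated size: essentially zero
rm /tmp/sparse-demo.raw
```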
make an actual filesystem:
zfs create -o mountpoint=/srv tank/srv
It should look like this:
root@tubman:/# zpool status tank
pool: tank
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
sdc4 ONLINE 0 0 0
/tmp/8tb.raw OFFLINE 0 0 0
errors: No known data errors
I actually also ran:
zpool detach tank /tmp/8tb.raw
... but I'm not sure that's a good idea, because now ZFS thinks this is not a mirror anymore.
root@tubman:~# zpool detach tank /tmp/8tb.raw
root@tubman:~# zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
sdc4 ONLINE 0 0 0
errors: No known data errors
That, fortunately, is easily fixed:
root@tubman:~# truncate -s 8T /tmp/8tb.raw
root@tubman:~# zpool attach tank /dev/sdc4 /tmp/8tb.raw
root@tubman:~# zpool status tank
pool: tank
state: ONLINE
scan: resilvered 1.98M in 00:00:00 with 0 errors on Fri Oct 14 15:23:50 2022
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdc4 ONLINE 0 0 0
/tmp/8tb.raw ONLINE 0 0 0
errors: No known data errors
It might also have been possible to just create a pool normally, with a single disk, and reattach the second one when done.
first rsync transfer
The files were copied from ext4 to ZFS with this magic rsync command:
rsync -ASHaXx --info=progress2 /mnt/ /srv/
/dev/sde was mounted in /mnt and had all the old data:
root@tubman:~# df -h /mnt /srv
Filesystem Size Used Avail Use% Mounted on
/dev/sde1 7.2T 6.7T 142G 98% /mnt
tank/srv 7.1T 152M 7.1T 1% /srv
The ETA was:
6.7Tbyte/(60MB/s) = (6.7 × terabyte)/(60 × (megabyte/second))
                  = 1 d + 7 h + 1 min + 6.666… s
AKA about 31 hours.
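The same estimate in integer shell arithmetic (decimal units, matching the calculation above):

```shell
bytes=$(( 6700 * 1000 * 1000 * 1000 ))  # 6.7TB to copy
rate=$(( 60 * 1000 * 1000 ))            # 60MB/s observed throughput
echo "$(( bytes / rate / 3600 )) hours" # → 31 hours
```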
The rsync command started at 2022-10-14T15:27-04:00, and finished some time before 2022-10-15T18:29-04:00. That is a little over 27 hours of run time, which is faster than the above estimate. The final rsync output was:
root@tubman:/# rsync -ASHaXx --info=progress2 /mnt/ /srv/
7,326,208,287,067 99% 71.87MB/s 27:00:08 (xfr#250467, to-chk=0/592532)
resilvering
AKA rebuilding or adding back the old disk:
partition sde:

sgdisk --zap-all /dev/sde &&
    sgdisk -a1 -n1:24K:+1000K -t1:EF02 /dev/sde &&
    sgdisk -n2:1M:+512M -t2:EF00 /dev/sde &&
    sgdisk -n3:0:+1G -t3:BF01 /dev/sde &&
    sgdisk -n4:0:0 -t4:BF00 /dev/sde
resilver tank with sde4:

date; time zpool replace tank /tmp/8tb.raw /dev/sde4; date
in progress, started at 2022-10-15T21:40-04:00:
root@tubman:~# zpool status tank
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
  continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sat Oct 15 21:40:16 2022
  151G scanned at 2.36G/s, 936K issued at 14.6K/s, 6.61T total
  0B resilvered, 0.00% done, no estimated completion time
config:

NAME                STATE     READ WRITE CKSUM
tank                DEGRADED     0     0     0
  mirror-0          DEGRADED     0     0     0
    sdc4            ONLINE       0     0     0
    replacing-1     DEGRADED     0     0     0
      /tmp/8tb.raw  OFFLINE      0     0     0
      sde4          ONLINE       0     0     0

errors: No known data errors
Odd, that 0B resilvered. Eventually though, it did give me an estimate, about 5 minutes in:

21.5G resilvered, 0.32% done, 1 days 02:27:10 to go
... which was suspiciously similar to the final rsync run time (~27 hours). After 10 minutes, we had this more encouraging estimate:
80.7G resilvered, 1.19% done, 14:22:52 to go
Interestingly, I have no idea if I'll get a notification when the thing is finished resilvering. Logging progress with:
while sleep 600; do
    zpool status tank | grep -e scanned -e resilvered, \
        | sed 's/[\t ]*//' | logger -t resilver --id=$$
done
once the resilver finishes, you get an email notification from zed; in my case it said:

scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
Full status says:
root@tubman:~# zpool status tank
pool: tank
state: ONLINE
scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdc4 ONLINE 0 0 0
sde4 ONLINE 0 0 0
errors: No known data errors
That's an average of 146.931MB/s, or 1.175Gbit/s, pretty nice.
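That average falls straight out of the zed report (taking 6.63T as decimal terabytes):

```shell
bytes=6630000000000              # 6.63T resilvered
secs=$(( 12*3600 + 32*60 + 3 ))  # 12:32:03 of run time
echo "$(( bytes / secs / 1000000 )) MB/s"  # → 146 MB/s
```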
move rpool and bpool to SSDs
take SSD cache offline:
zpool remove rpool /dev/sdd3
partition SSD disks, keep most of the disk for caching (100G system, rest for caching):
for device in /dev/sdb /dev/sdd ; do
    sgdisk --zap-all $device &&
    sgdisk -a8 -n1:24K:+1000K -t1:EF02 \
        -n2:1M:+512M -t2:EF00 \
        -n3:0:+1G -t3:BF01 \
        -n4:0:+100G -t4:BF00 \
        -n5:0:0 -t5:BF00 \
        $device
done
create new pools for the SSD drives (rpoolssd, bpoolssd?):

zpool create \
    -o cachefile=/etc/zfs/zpool.cache \
    -o ashift=12 -d \
    -o feature@async_destroy=enabled \
    -o feature@bookmarks=enabled \
    -o feature@embedded_data=enabled \
    -o feature@empty_bpobj=enabled \
    -o feature@enabled_txg=enabled \
    -o feature@extensible_dataset=enabled \
    -o feature@filesystem_limits=enabled \
    -o feature@hole_birth=enabled \
    -o feature@large_blocks=enabled \
    -o feature@lz4_compress=enabled \
    -o feature@spacemap_histogram=enabled \
    -o feature@zpool_checkpoint=enabled \
    -O acltype=posixacl -O canmount=off \
    -O compression=lz4 \
    -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
    -O mountpoint=/boot -R /mnt \
    bpoolssd mirror /dev/sdb3 /dev/sdd3

zpool create \
    -o ashift=12 \
    -O encryption=on -O keylocation=prompt -O keyformat=passphrase \
    -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
    -O compression=zstd \
    -O relatime=on \
    -O canmount=off \
    -O mountpoint=/ -R /mnt \
    rpoolssd mirror /dev/sdb4 /dev/sdd4
copy the datasets over to the new pool:
zfs snapshot -r bpool@shrink && zfs send -vR bpool@shrink | zfs receive -vFd bpoolssd
The above worked, and quickly. The same with rpool, however:

zfs snapshot -r rpool@shrink && zfs send -vR rpool@shrink | zfs receive -vFd rpoolssd
... failed with:
cannot send rpool@shrink: encrypted dataset rpool may not be sent with properties without the raw flag
[many more attempts later, see below for the full discussion]
A workaround I found is to specify each dataset individually (inspired by this reddit discussion):

```
for dataset in $(zfs list -H -o name | grep -E 'rpool($|/)'); do
    zfs send $dataset@shrink | zfs receive -vd rpoolssd
done
zfs set mountpoint=none rpoolssd/ROOT
zfs set mountpoint=/ rpoolssd/ROOT/debian
```
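As an aside, the `grep -E 'rpool($|/)'` pattern matters here: it matches `rpool` itself and its children, but not `rpoolssd`, which a plain `grep rpool` would also catch. A quick check of the pattern against some of the dataset names involved:

```shell
# the pattern matches rpool and rpool/... but not rpoolssd...
printf '%s\n' rpool rpool/ROOT rpoolssd rpoolssd/ROOT \
    | grep -E 'rpool($|/)'
# prints:
# rpool
# rpool/ROOT
```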
The last `set mountpoint` is necessary because otherwise the mountpoints are wrong:

```
root@tubman:~# zfs list | grep ROOT
NAME                   USED  AVAIL  REFER  MOUNTPOINT
rpoolssd/ROOT         1.67G  92.5G   200K  /mnt/ROOT
rpoolssd/ROOT/debian  1.67G  92.5G  1.37G  /mnt/ROOT/debian
```
... and `/mnt` is basically empty:

```
root@tubman:~# ls /mnt/
home  ROOT  var
```
After the remount, things look more logical:

```
root@tubman:~# zfs list | grep ROOT
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT        4.56G  3.50T   192K  none
rpool/ROOT/debian 4.56G  3.50T  1.62G  /
```
This shows the base datasets are the same:

```
root@tubman:~# diff -u <(zfs list -o name | grep rpoolssd | sed 's/rpoolssd/rpool/') <(zfs list -o name | grep rpool | grep -v rpoolssd)
root@tubman:~#
```
... but of course we're missing a lot of snapshots:

```
root@tubman:~# zfs list -t snapshot | grep rpoolssd | wc -l
10
root@tubman:~# zfs list -t snapshot | grep rpool | grep -v rpoolssd | wc -l
395
```
And we tweaked the mountpoints a little, so that the root dataset doesn't mount on `/` anymore:

```
root@tubman:~# diff -u <(zfs list -o name,mountpoint | grep rpoolssd | sed 's/rpoolssd/rpool/;s,/mnt/,/,;s,/mnt,/,;s/ */ /') <(zfs list -o name,mountpoint | grep rpool | grep -v rpoolssd | sed 's/ */ /')
--- /dev/fd/63	2022-10-18 10:52:30.654369806 -0400
+++ /dev/fd/62	2022-10-18 10:52:30.654369806 -0400
@@ -1,4 +1,4 @@
-rpool none
+rpool /
 rpool/ROOT none
 rpool/ROOT/debian /
 rpool/home /home
```
The downside of this approach is that it's clunky and it doesn't copy over snapshots. But for my use case (just move this shit over already!), it's going to be sufficient. We could actually go through each snapshot the same way again, of course, but at that point we're basically reimplementing `-R` ourselves, and failing.

reinstall grub:
```
for fs in /run /sys /dev /proc; do
    mount -o rbind $fs "/mnt${fs}"
done
zfs mount bpoolssd/BOOT/debian
sed -i 's,ZFS=rpool,ZFS=rpoolssd,' /mnt/etc/default/grub
sed -i s/bpool/bpoolssd/ /mnt/etc/systemd/system/zfs-import-bpool.service
rm /mnt/etc/zfs/zfs-list.cache/*
touch /mnt/etc/zfs/zfs-list.cache/bpoolssd
touch /mnt/etc/zfs/zfs-list.cache/rpoolssd
chroot /mnt /usr/sbin/update-grub
chroot /mnt /usr/sbin/grub-install /dev/sdb
chroot /mnt /usr/sbin/grub-install /dev/sdd
for fs in /run /sys /dev /proc; do
    umount "/mnt${fs}"
done
```
reboot and make sure we boot from the SSD drives (i.e. the new pool):

```
reboot
```
stop using the old pools:

```
zpool export bpool
zpool export rpool
```
reboot again:

```
reboot
```
now make sure that the old pools really are not used: `zpool status` shouldn't show the old pools.
Note that for this conversion, we cannot just `attach` and `detach` the SSD drives, because they are different sizes than the other disks in the pool. We could use `add` to create a second mirror in the pool and `remove` to remove the old mirror, moving the data to the new SSD drives, but that may make the `bpool` unbootable, among other problems.
Also, the procedure the above is inspired from goes through the extra steps of recreating `rpool` and `bpool`, reattaching the drives there, and destroying `rpoolssd` and `bpoolssd`, basically as a way to rename the pools back to their original names. It's possible to "just" rename a pool, but it must not be in use, so possibly the simplest way to do this would be to boot a rescue image and use export/import to rename the pool.
Moving encrypted pools is hard
In the fourth step above (copying the datasets over), we failed to just move the `rpool` datasets to the new pool:

```
zfs snapshot -r rpool@shrink &&
zfs send -vR rpool@shrink | zfs receive -vFd rpoolssd
```
... failed with:

```
cannot send rpool@shrink: encrypted dataset rpool may not be sent with properties without the raw flag
```
I have tried using the `--raw` flag to send the dataset, but then that fails with:

```
root@tubman:~# zfs send --raw -R rpool@shrink | zfs receive -vFd rpoolssd
cannot receive new filesystem stream: zfs receive -F cannot be used to destroy an encrypted filesystem or overwrite an unencrypted one with an encrypted one
```
Okay, then let's try to remove `-F`:

```
root@tubman:~# zfs send --raw -R rpool@shrink | zfs receive -vd rpoolssd
cannot receive new filesystem stream: destination 'rpoolssd' exists
must specify -F to overwrite it
```
AAARGHL. And of course, we can't destroy `rpoolssd`: that would destroy the entire pool.
One thing that does seem to work is to use `-e` instead of `-d`:

```
zfs send --raw -R rpool@shrink | zfs receive -v -e rpoolssd
```
I found out about the `-e` flag in this post. The description of `-d` and `-e` is actually rather confusing in the man page:
> The -d and -e options cause the file system name of the target snapshot to be determined by appending a portion of the sent snapshot's name to the specified target filesystem. If the -d option is specified, all but the first element of the sent snapshot's file system path (usually the pool name) is used and any required intermediate file systems within the specified one are created. If the -e option is specified, then only the last element of the sent snapshot's file system name (i.e. the name of the source file system itself) is used as the target file system name.
I actually can't make heads or tails of this, but essentially, with `-e` the last element of the sent path is kept, and since we sent the pool root itself, that last element is `rpool`, which means we end up with an extra `rpool` component in the dataset path:
```
root@tubman:~# zfs list | grep rpool | grep -v rpoolssd | head -3
rpool                7.16G  3.50T  192K  /
rpool/ROOT           4.56G  3.50T  192K  none
rpool/ROOT/debian    4.56G  3.50T  1.62G /
root@tubman:~# zfs list | grep rpoolssd | head -3
rpoolssd             6.98G  89.4G  192K  /mnt
rpoolssd/rpool       6.95G  89.4G  168K  /mnt
rpoolssd/rpool/ROOT  4.55G  89.4G  168K  none
```
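The man page's naming rules can be sketched in plain shell (this is a rough model of the name mangling only, not of `zfs receive` itself): `-d` strips the first element (the pool name) from the sent path, while `-e` keeps only its last element. Because we sent the pool root `rpool`, its "last element" is `rpool` itself, hence the extra component:

```shell
target=rpoolssd

# -d: drop the first element (the pool name) of the sent path
src=rpool/ROOT/debian
echo "$target/${src#*/}"     # prints: rpoolssd/ROOT/debian

# -e: keep only the last element of the sent path
echo "$target/${src##*/}"    # prints: rpoolssd/debian

# sending the pool root itself: its last element is the pool name,
# which is why we get an extra rpool component in the result
src=rpool
echo "$target/${src##*/}"    # prints: rpoolssd/rpool
```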
So that's also wrong. I eventually ended up with the procedure detailed in step 4, above, to individually copy over the datasets, one by one. This doesn't work as well; the snapshots are not copied over, for example. But it's better than nothing, which was the situation I was stuck with for days.
It's possible that this situation is specific to the Debian and Ubuntu install guides, which put most datasets directly on the root dataset (e.g. `rpool/var` instead of `rpool/ROOT/var`). It's possible that adding that layer of indirection could help with such situations, but the jury is actually still out on that, see this discussion.
SSD TRIM
See zfs.
extending the main tank
Once we are confident we can boot without the old HDD pool, we can
repartition the old drives and add them to tank
. The drives are
already partitioned, and it was probably done with something like
this, from what I can tell:
```
for device in /dev/sda /dev/sdf ; do
    sgdisk --zap-all $device &&
    sgdisk -a8 -n1:24K:+1000K -t1:EF02 \
        -n2:1M:+512M -t2:EF00 \
        -n3:0:+1G -t3:BF01 \
        -n5:0:0 -t5:BF00 \
        $device
done
```
In fact, it wasn't quite like that; the exact procedure is step one in the main installation procedure here. The main difference is that here we call `sgdisk` only once, so the `-a` flag applies to all partitions. In the original, we called it multiple times, which means things are not necessarily aligned as they should be. We'll just disregard this for a moment.
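To actually see whether the existing partitions are misaligned, each start sector can be checked against the 8-sector (4KiB, with 512-byte sectors) alignment that `-a8` enforces. The start sectors below are made-up examples, not values read from these disks:

```shell
# check 8-sector (4KiB) alignment of partition start sectors;
# these start sectors are hypothetical examples
for start in 48 2048 1050624 1050625; do
    if [ $((start % 8)) -eq 0 ]; then
        echo "$start: aligned"
    else
        echo "$start: NOT aligned"
    fi
done
```

On a live system the start sectors can be read with e.g. `sgdisk -i N /dev/sda`.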
Adding the old drives to the pool is pretty simple. We follow this model where we basically have two RAID-1 mirrors striped together. Eventually, it might make more sense to replace the 2x4TB drives with one 8TB drive and use RAID-Z, but I already broke the bank to get a second 8TB drive to get the first part of this stripe, so this is what we have.
The actual command is:
```
root@tubman:~# zpool add -n tank mirror /dev/sda4 /dev/sdf4
would update 'tank' to the following configuration:

	tank
	  mirror-0
	    sdc4
	    sde4
	  mirror
	    sda4
	    sdf4
```
Note the `-n` is a dry run; the actual command doesn't return anything, and takes very little time:
```
root@tubman:~# zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sdc4    ONLINE       0     0     0
	    sde4    ONLINE       0     0     0

errors: No known data errors
root@tubman:~# zpool add tank mirror /dev/sda4 /dev/sdf4
root@tubman:~# zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sdc4    ONLINE       0     0     0
	    sde4    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    sda4    ONLINE       0     0     0
	    sdf4    ONLINE       0     0     0

errors: No known data errors
root@tubman:~#
```
I heard the disks scratch for a few seconds and that was it.
I had the problem that the filesystem wasn't coming up on boot. Because it's not the root filesystem, presumably, it needs something special to be loaded. Furthermore, its encryption key would be rather problematic to load, as it doesn't get prompted for in the initrd either. So it's better to shift to a `keylocation` that is actually on disk:
```
umask 0777 &&
dd if=/dev/urandom of=/etc/zfs/tank.key bs=32 count=1 ;
umask 0022 &&
zfs change-key -l -o keylocation=file:///etc/zfs/tank.key tank
```
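The `umask 0777` dance is presumably there so the key file is never created with permissive modes: the default file creation mode of 0666, masked by 0777, leaves mode 0000 (root can still read it thanks to CAP_DAC_OVERRIDE). A quick demonstration in a scratch directory:

```shell
# show that a file created under umask 0777 ends up with mode 0000
tmpdir=$(mktemp -d)
( umask 0777 && dd if=/dev/urandom of="$tmpdir/tank.key" bs=32 count=1 2>/dev/null )
stat -c '%a' "$tmpdir/tank.key"   # prints: 0
rm -rf "$tmpdir"
```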
Then to make the pool automatically loaded at boot, use:

```
zpool set cachefile=/etc/zfs/zpool.cache tank
```
Then the systemd `zfs-import-cache.service` and `zfs-import.service` units will make sure the pool is imported. Normally, if `zfs-mount.service` and `zfs.target` are enabled, underlying datasets should also be automatically mounted. In our case, however, we need an extra shim to make sure the cryptographic key gets loaded. So we need this unit in `/etc/systemd/system/zfs-load-keyfile@.service` (a modified version of this service):
```
[Unit]
Description=Load %I encryption keys from disk
Before=systemd-user-sessions.service zfs-mount.service
After=zfs-import.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=zfs load-key %I

[Install]
WantedBy=zfs-mount.service
```
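For template units like this, systemd derives `%I` from the part of the instance name between the `@` and the unit suffix (modulo unescaping, which doesn't matter for a simple name like `tank`). A rough shell model of that extraction:

```shell
# extract the instance name (what %I expands to) from a template unit;
# this ignores systemd's escaping rules, which is fine for simple names
unit="zfs-load-keyfile@tank.service"
instance="${unit#*@}"            # strip up to and including the @
instance="${instance%.service}"  # strip the .service suffix
echo "$instance"                 # prints: tank
```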
... which we enable with:

```
systemctl enable zfs-load-keyfile@tank.service
```
We can test this works with:

```
zfs umount tank/srv &&
zfs unload-key tank &&
systemctl start zfs-load-keyfile@tank.service &&
zfs mount tank/srv
```
And a reboot is probably in order to make sure systemd doesn't get stuck at a prompt:

```
reboot
```
remaining work
- TODO: re-sync backups
Other documentation
See zfs for more documentation on ZFS and 2022-11-17-zfs-migration for another installation and migration procedure.