tubman
Tubman is named after Harriet Tubman, an "American abolitionist and political activist. Born into slavery, Tubman escaped and subsequently made some 13 missions to rescue approximately 70 enslaved people, including family and friends, using the network of antislavery activists and safe houses known as the Underground Railroad. During the American Civil War, she served as an armed scout and spy for the Union Army. The first woman to lead an armed expedition in the war, she guided the raid at Combahee Ferry, which liberated more than 700 enslaved people. In her later years, Tubman was an activist in the movement for women's suffrage."
I was the conductor of the Underground Railroad for eight years, and I can say what most conductors can't say — I never ran my train off the track and I never lost a passenger.
Specification
tubman's install changed bodies and now lives in toutatis's body, so the specs below are inaccurate.
- motherboard: MSI X58M (MS-7593)
- case: some alien atrocity
- CPU: Intel Core i7 CPU 960 (2009, Nehalem bloomfield, 45nm, 4/8 cores, 3.46GHz) not to be confused with the best-selling, embedded i960 (1984-2007, still in use)
- Memory: 12GiB (3x4GB) DIMM 1066 MHz 0.9ns
- Storage:
- SSD:
- 500GB Samsung SSD 850
- 480GB Crucial CT480M50
- HDD:
- 8TB Seagate IronWolf ST8000VN004-2M21
- 8TB Seagate IronWolf ST8000VN0022-2EL
- 4TB Seagate Barracuda ST4000DM000-1F21
- 4TB Seagate Barracuda ST4000DM004-2CV1
- Network: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
- Display: Oland XT [Radeon HD 8670 / R7 250/350]
- Audio: Oland/Hainan/Cape Verde/Pitcairn HDMI Audio [Radeon HD 7000 Series]
Note that tubman was originally built with the old marcos hardware, but was transplanted into what used to be known as toutatis; see v1 for the old spec. The toutatis install was kept intact, on a stack of 5 disks (3x~2TB HDD, 2x128GB SSD).
4TB disk health inspection
Before the migration from marcos' body to toutatis', a ZFS scrub triggered a warning. The drives were inspected for health; this is the report (copied from 2022-11-17-zfs-migration).
Here's some SMART stats:
root@tubman:~# smartctl -a -qnoserial /dev/sdb | grep -e Head_Flying_Hours -e Power_On_Hours -e Total_LBA -e 'Sector Sizes'
Sector Sizes: 512 bytes logical, 4096 bytes physical
9 Power_On_Hours 0x0032 086 086 000 Old_age Always - 12464 (206 202 0)
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 10966h+55m+23.757s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 21107792664
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 3201579750
That's over a year of power on, which shouldn't be so bad. It has written about 10TB of data (21107792664 LBAs * 512 byte/LBA), which is about two full writes. According to its specification, this device is supposed to support 55 TB/year of writes, so we're far below spec. Note that we are still far from the "non-recoverable read error per bits" spec (1 per 10E15), as we've basically read 13E12 bits (3201579750 LBAs * 512 byte/LBA * 8 bit/byte = 13E12 bits).
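Those conversions can be reproduced with plain shell arithmetic (the LBA counts are taken from the SMART output above):

```shell
# LBA counts from the SMART attributes; logical sectors are 512 bytes
lbas_written=21107792664
lbas_read=3201579750

echo "written: $(( lbas_written * 512 / 1000000000000 )) TB"  # about 10 TB
echo "read: $(( lbas_read * 512 * 8 )) bits"                  # about 13E12 bits
```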
It's likely this disk was made in 2018, so it is in its fourth year.
Interestingly, /dev/sdc is also a Seagate drive, but of a different series:
root@tubman:~# smartctl -qnoserial -i /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-15-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate BarraCuda 3.5
Device Model: ST4000DM004-2CV104
Firmware Version: 0001
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5425 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Tue Oct 11 11:21:35 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
It has seen many more reads than the other disk, which is also interesting:
root@tubman:~# smartctl -a -qnoserial /dev/sdc | grep -e Head_Flying_Hours -e Power_On_Hours -e Total_LBA -e 'Sector Sizes'
Sector Sizes: 512 bytes logical, 4096 bytes physical
9 Power_On_Hours 0x0032 059 059 000 Old_age Always - 36240
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 33994h+10m+52.118s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 30730174438
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 51894566538
That's 4 years of Head_Flying_Hours, and over 4 years (4 years and 48 days) of Power_On_Hours. The copyright date on that drive's specs goes back to 2016, so it's a much older drive.
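As a sanity check on those figures, taking a year as roughly 8766 hours (365.25 days):

```shell
hours=36240   # Power_On_Hours from the SMART output above
# whole years plus leftover days; the exact day count differs slightly
# from the "4 years and 48 days" above depending on the year length used
echo "$(( hours / 8766 )) years and $(( hours % 8766 / 24 )) days"
```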
Installation procedure
I would have used FAI's setup-storage, but it doesn't support ZFS, unfortunately. It is part of the long-term roadmap, that said, and there's a howto for stretch, but that doesn't use setup-storage. I was hoping I could reuse the installer I've been working on at work...
We have the following disk configuration:
- /dev/sda: SSD drive, 512MB used for caching
- /dev/sdb: HDD drive, 4TB, to be used in a ZFS pool with native encryption
- /dev/sdc: HDD drive, 4TB, same
We boot from a grml live image based on Debian testing (bullseye), and will follow this howto:
install requirements:
apt update
apt install --yes debootstrap gdisk dkms dpkg-dev linux-headers-$(uname -r) zfs-dkms
modprobe zfs
apt install --yes zfsutils-linux
Note that those instructions differ from the documentation (we don't use buster-backports) because we start from a bullseye live image.

clear the partitions on the two HDDs, and set up BIOS, UEFI, boot pool and natively encrypted partitions:
for DISK in /dev/sdb /dev/sdc ; do
    sgdisk --zap-all $DISK
    sgdisk -a1 -n1:24K:+1000K -t1:EF02 $DISK
    sgdisk -n2:1M:+512M -t2:EF00 $DISK
    sgdisk -n3:0:+1G -t3:BF01 $DISK
    sgdisk -n4:0:0 -t4:BF00 $DISK
done
resulting partition table:
root@grml ~ # sgdisk -p /dev/sdb
Disk /dev/sdb: 7814037168 sectors, 3.6 TiB
Model: ST4000DM004-2CV1
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): 63B2F372-B4E9-45FF-8151-9706F9F158C9
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 16-sector boundaries
Total free space is 14 sectors (7.0 KiB)

Number  Start (sector)    End (sector)  Size        Code  Name
   1              48            2047    1000.0 KiB  EF02
   2            2048         1050623    512.0 MiB   EF00
   3         1050624         3147775    1024.0 MiB  BF01
   4         3147776      7814037134    3.6 TiB     BF00
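The sizes in that table can be cross-checked from the sector numbers: GPT sector ranges are inclusive, so a partition spans (end - start + 1) 512-byte sectors. A quick check (the part_kib helper is just for illustration):

```shell
# size in KiB = (end - start + 1) sectors × 512 bytes / 1024
part_kib() { echo $(( ($2 - $1 + 1) * 512 / 1024 )); }

part_kib 48 2047          # partition 1: 1000 KiB
part_kib 2048 1050623     # partition 2: 524288 KiB = 512 MiB
part_kib 1050624 3147775  # partition 3: 1048576 KiB = 1024 MiB
```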
create the boot pool called bpool and the root pool called rpool; the latter will prompt for a disk encryption key:

zpool create \
    -o cachefile=/etc/zfs/zpool.cache \
    -o ashift=12 -d \
    -o feature@async_destroy=enabled \
    -o feature@bookmarks=enabled \
    -o feature@embedded_data=enabled \
    -o feature@empty_bpobj=enabled \
    -o feature@enabled_txg=enabled \
    -o feature@extensible_dataset=enabled \
    -o feature@filesystem_limits=enabled \
    -o feature@hole_birth=enabled \
    -o feature@large_blocks=enabled \
    -o feature@lz4_compress=enabled \
    -o feature@spacemap_histogram=enabled \
    -o feature@zpool_checkpoint=enabled \
    -O acltype=posixacl -O canmount=off -O compression=lz4 \
    -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
    -O mountpoint=/boot -R /mnt \
    bpool mirror /dev/sdb3 /dev/sdc3

zpool create \
    -o ashift=12 \
    -O encryption=aes-256-gcm \
    -O keylocation=prompt -O keyformat=passphrase \
    -O acltype=posixacl -O canmount=off -O compression=lz4 \
    -O dnodesize=auto -O normalization=formD -O relatime=on \
    -O xattr=sa -O mountpoint=/ -R /mnt \
    rpool mirror /dev/sdb4 /dev/sdc4
create filesystems and "datasets":
this creates two containers, for ROOT and BOOT:

zfs create -o canmount=off -o mountpoint=none rpool/ROOT
zfs create -o canmount=off -o mountpoint=none bpool/BOOT
this actually creates the boot and root filesystems:
zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/debian
zfs mount rpool/ROOT/debian
zfs create -o mountpoint=/boot bpool/BOOT/debian
then they use even more datasets, although I'm not sure they are all necessary:
zfs create rpool/home
zfs create -o mountpoint=/root rpool/home/root
chmod 700 /mnt/root
zfs create -o canmount=off rpool/var
zfs create -o canmount=off rpool/var/lib
zfs create rpool/var/log
zfs create rpool/var/spool
to exclude temporary files from snapshots, for example:
zfs create -o com.sun:auto-snapshot=false rpool/var/cache
zfs create -o com.sun:auto-snapshot=false rpool/var/tmp
chmod 1777 /mnt/var/tmp
and a /srv:

zfs create rpool/srv
or for Docker (TODO):
zfs create -o com.sun:auto-snapshot=false rpool/var/lib/docker
make a tmpfs for /run:

mkdir /mnt/run
mount -t tmpfs tmpfs /mnt/run
mkdir /mnt/run/lock
install the base system and copy the ZFS config:
debootstrap --components=main,contrib bullseye /mnt
mkdir /mnt/etc/zfs
cp /etc/zfs/zpool.cache /mnt/etc/zfs/
base system configuration:
echo HOSTNAME > /mnt/etc/hostname
vi /mnt/etc/hosts
apt install ca-certificates
echo 'deb https://deb.debian.org/debian-security bullseye-security main contrib' > /etc/apt/sources.list.d/security.list
bind mounts and chroot for more complex config:
mount --rbind /dev /mnt/dev
mount --rbind /proc /mnt/proc
mount --rbind /sys /mnt/sys
chroot /mnt /bin/bash
more base system config:
ln -s /proc/self/mounts /etc/mtab
apt update
apt install --yes console-setup locales
dpkg-reconfigure locales tzdata
ZFS boot configuration
apt install --yes dpkg-dev linux-headers-amd64 linux-image-amd64
apt install --yes zfs-initramfs
echo REMAKE_INITRD=yes > /etc/dkms/zfs.conf
apt install --yes grub-pc
apt remove --purge os-prober
pick a root password
passwd
bpool import hack (TODO: whyy)
cat > /etc/systemd/system/zfs-import-bpool.service <<EOF
[Unit]
DefaultDependencies=no
Before=zfs-import-scan.service
Before=zfs-import-cache.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/zpool import -N -o cachefile=none bpool
# Work-around to preserve zpool cache:
ExecStartPre=-/bin/mv /etc/zfs/zpool.cache /etc/zfs/preboot_zpool.cache
ExecStartPost=-/bin/mv /etc/zfs/preboot_zpool.cache /etc/zfs/zpool.cache

[Install]
WantedBy=zfs-import.target
EOF
systemctl enable zfs-import-bpool.service
enable tmpfs:
ln -s /usr/share/systemd/tmp.mount /etc/systemd/system/ && systemctl enable tmp.mount
grub setup:
root@grml:/# grub-probe /boot
zfs
root@grml:/# update-initramfs -c -k all
update-initramfs: Generating /boot/initrd.img-5.10.0-6-amd64
root@grml:/# sed -i 's,GRUB_CMDLINE_LINUX.*,GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/debian",' /etc/default/grub
root@grml:/# update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.10.0-6-amd64
Found initrd image: /boot/initrd.img-5.10.0-6-amd64
done
root@grml:/# grub-install /dev/sdb
Installing for i386-pc platform.
Installation finished. No error reported.
root@grml:/# grub-install /dev/sdc
Installing for i386-pc platform.
Installation finished. No error reported.
make sure you check both disks in there:
dpkg-reconfigure grub-pc
filesystem mount ordering, rationale in the OpenZFS guide:
mkdir /etc/zfs/zfs-list.cache
touch /etc/zfs/zfs-list.cache/bpool
touch /etc/zfs/zfs-list.cache/rpool
zed -F &
then verify the files have data:
root@grml:/# cat /etc/zfs/zfs-list.cache/bpool
bpool             /mnt/boot  off  on  on  off  on  off  on  off  -  none  -  -  -  -  -  -  -  -
bpool/BOOT        none       off  on  on  off  on  off  on  off  -  none  -  -  -  -  -  -  -  -
bpool/BOOT/debian /mnt/boot  on   on  on  off  on  off  on  off  -  none  -  -  -  -  -  -  -  -
root@grml:/# cat /etc/zfs/zfs-list.cache/rpool
rpool             /mnt            off     on  on  on  on  off  on  off  rpool  prompt  -  -  -  -  -  -  -  -
rpool/ROOT        none            off     on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/ROOT/debian /mnt            noauto  on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/home        /mnt/home       on      on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/home/root   /mnt/root       on      on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/srv         /mnt/srv        on      on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/var         /mnt/var        off     on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/var/cache   /mnt/var/cache  on      on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/var/lib     /mnt/var/lib    off     on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/var/log     /mnt/var/log    on      on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/var/spool   /mnt/var/spool  on      on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
rpool/var/tmp     /mnt/var/tmp    on      on  on  on  on  off  on  off  rpool  none    -  -  -  -  -  -  -  -
root@grml:/# fg
zed -F
^C
Exiting
fix the paths to eliminate /mnt:

sed -Ei "s|/mnt/?|/|" /etc/zfs/zfs-list.cache/*
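That sed expression replaces only the first /mnt occurrence on each line, which is the mountpoint column. A dry run on two sample lines (hypothetical data, in the style of the cache files):

```shell
# first line: root dataset (no trailing slash); second: a child dataset
printf 'rpool\t/mnt\nrpool/home\t/mnt/home\n' | sed -E "s|/mnt/?|/|"
# rpool       /
# rpool/home  /home
```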
extra config, setup SSH with auth key:
apt install --yes openssh-server
mkdir /root/.ssh/
cat > /root/.ssh/authorized_keys <<EOF
...
EOF
snapshot initial install:
zfs snapshot bpool/BOOT/debian@install
zfs snapshot rpool/ROOT/debian@install
exit chroot:
exit
unmount filesystems:
mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | \
    xargs -i{} umount -lf {}
zpool export -a
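The tac in there matters: mount lists parent mounts before the mounts nested inside them, so reversing the list unmounts children first. A dry run on a mock mount table (hypothetical output, printing instead of unmounting):

```shell
# fake `mount` output; innermost mounts appear last, as mount(8) prints them
mock='udev on /mnt/dev type devtmpfs (rw)
proc on /mnt/proc type proc (rw)
tmpfs on /mnt/run type tmpfs (rw)'

# same pipeline as above, minus the destructive xargs step
echo "$mock" | grep -v zfs | tac | awk '/\/mnt/ {print $3}'
# /mnt/run
# /mnt/proc
# /mnt/dev
```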
reboot:
reboot
That procedure actually worked! The only problem was the interfaces(5) configuration, which was missing (regardless of what the above says). I want to do systemd-networkd anyways.
We performed steps 1 through 6; the remaining steps are optional or troubleshooting.
SSD caching
The machine has been installed on two HDDs: spinning rust! Those are typically slow, but they are redundant, which should ensure high availability. To boost performance, we're setting up an SSD cache.
ZFS has two types of caches:
- write intent log (external ZIL or SLOG)
- layer 2 adaptive replacement cache (L2ARC)
L2ARC
The L2ARC is purely a performance cache, and if it dies, no data is lost. The former, however, can cause data loss (typically a few seconds, but still) in case the drive dies. So we're going with L2ARC, based on this source for the redundancy claim.
We also use the L2ARC cache because it's useful for read caching as well. The SLOG cache is mostly useful for write-heavy workloads, which is not the case for this server.
To configure the L2ARC cache, we simply did this:
zpool add rpool cache /dev/sda3
(Actually, -f was necessary because there already was a crypto_LUKS partition on there, which we didn't care about.)
The sda3 device is the third partition on the SSD drive. It's 465GB, so it should provide a lot of space for the cache.

The status of the cache can be found with the zpool iostat command:
root@tubman:~# zpool iostat -v
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
bpool 47.8M 912M 0 0 3 14
mirror 47.8M 912M 0 0 3 14
sdb3 - - 0 0 1 7
sdc3 - - 0 0 1 7
---------- ----- ----- ----- ----- ----- -----
rpool 1.29G 3.62T 0 60 437 432K
mirror 1.29G 3.62T 0 60 437 432K
sdb4 - - 0 30 199 216K
sdc4 - - 0 30 238 216K
cache - - - - - -
sda3 326M 465G 0 183 4.96K 11.9M
---------- ----- ----- ----- ----- ----- -----
Note that this guide actually discourages the use of symbolic names like sda. Quoting that warning directly:
WARNING: Some motherboards will not present disks in a consistent manner to the Linux kernel across reboots. As such, a disk identified as /dev/sda on one boot might be /dev/sdb on the next. For the main pool where your data is stored, this is not a problem as ZFS can reconstruct the VDEVs based on the metadata geometry. For your L2ARC and SLOG devices, however, no such metadata exists. [...] If you don't heed this warning, your L2ARC device may not be added to your hybrid pool at all, and you will need to re-add it later. This could drastically affect the performance of the applications when pulling evicted pages off of disk.
TL;DR: the cache might disappear after a reboot if disk ordering is changed by the BIOS. This only affects caches like the L2ARC (above) and the SLOG.
Eventually, there were two SSD drives in this system, and both were added as caches (following the above warning), with:
zpool add tank cache \
/dev/disk/by-id/ata-Samsung_SSD_850_EVO_500GB_S2RANX0J408632Y-part5 \
/dev/disk/by-id/ata-Crucial_CT480M500SSD1_1311092ED40E-part5
... it makes the zpool status output quite large though:
root@tubman:~# zpool status tank
pool: tank
state: ONLINE
scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdc4 ONLINE 0 0 0
sde4 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
sda4 ONLINE 0 0 0
sdf4 ONLINE 0 0 0
cache
ata-Samsung_SSD_850_EVO_500GB_S2RANX0J408632Y-part5 ONLINE 0 0 0
ata-Crucial_CT480M500SSD1_1311092ED40E-part5 ONLINE 0 0 0
errors: No known data errors
root@tubman:~# zpool iostat -v
capacity operations bandwidth
pool alloc free read write read write
----------------------------------------------------- ----- ----- ----- ----- ----- -----
bpoolssd 280M 680M 0 0 867 5.83K
mirror 280M 680M 0 0 867 5.83K
sdb3 - - 0 0 500 2.91K
sdd3 - - 0 0 366 2.91K
----------------------------------------------------- ----- ----- ----- ----- ----- -----
rpoolssd 4.28G 95.2G 3 11 47.2K 160K
mirror 4.28G 95.2G 3 11 47.2K 160K
sdb4 - - 1 5 23.7K 79.9K
sdd4 - - 1 5 23.5K 79.9K
----------------------------------------------------- ----- ----- ----- ----- ----- -----
tank 6.64T 4.25T 0 178 16.5K 21.1M
mirror 6.62T 664G 0 49 16.0K 4.69M
sdc4 - - 0 24 8.97K 2.35M
sde4 - - 0 24 7.04K 2.35M
mirror 25.1G 3.60T 0 128 546 16.4M
sda4 - - 0 70 293 8.21M
sdf4 - - 0 58 252 8.21M
cache - - - - - -
ata-Samsung_SSD_850_EVO_500GB_S2RANX0J408632Y-part5 479M 364G 0 162 5.61K 19.8M
ata-Crucial_CT480M500SSD1_1311092ED40E-part5 444M 345G 0 152 5.61K 18.3M
----------------------------------------------------- ----- ----- ----- ----- ----- -----
Note that the two caches are different sizes. ZFS doesn't care: they are striped anyway. It also doesn't matter that there are two; that provides no special redundancy, as the cache is disposable. That is different from the SLOG configuration, see below.
Also note that the L2ARC cache is indexed in memory and that, in itself, takes memory from the in-memory ARC cache, so it might actually be detrimental to have too big of a cache. The arch wiki suggests the formula for that memory usage is:
(L2ARC size) / (recordsize) * 70 bytes
... where recordsize is typically 128KiB. So in our case, it would mean:
70B×345GB/128KiB = ((70 × byte) × (345 × gigabyte))/(128 × kibibyte)
≈ 184.249 877 930 MB
... 200MB of RAM, not a problem, given this machine has 12GB of RAM:
root@tubman:~# free -h
total used free shared buff/cache available
Mem: 11Gi 6.4Gi 5.1Gi 0.0Ki 188Mi 5.1Gi
Swap: 0B 0B 0B
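The estimate above checks out with integer shell arithmetic (taking 345GB as a decimal gigabyte figure):

```shell
l2arc=$(( 345 * 1000 * 1000 * 1000 ))  # 345 GB of L2ARC
recordsize=$(( 128 * 1024 ))           # 128 KiB records
# 70 bytes of in-memory header per cached record
echo "$(( l2arc / recordsize * 70 / 1000000 )) MB of ARC headers"  # → 184 MB
```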
SLOG caches
SLOG caches are more sensitive. They are actually where ZFS will commit a write before confirming it to the caller, so they need a reliable storage medium. By default, that intent log lives on the main pool itself; a dedicated SLOG would typically use NVMe or fast SSD storage. Using NVMe or SSDs, you want to make sure those are mirrored, so that if one drive fails, no data is lost.
To create a SLOG, you should first choose its size. It doesn't have to be as big as the L2ARC cache because it's only a write cache and gets regularly flushed to disk. This article from Klara systems suggests:
Often, 16GB to 64GB is sufficient. For a busy server with a lot of writes, a general rule of thumb for calculating size is: max amount of write traffic per second x 15.
The TrueNAS docs also say:
The iXsystems current recommendation is a 16 GB SLOG device over-provisioned from larger SSDs to increase the write endurance and throughput of an individual SSD. This 16 GB size recommendation is based on performance characteristics of typical HDD pools with SSD SLOGs and capped by the value of the tunable vfs.zfs.dirty_data_max_max.
The parameter vfs.zfs.dirty_data_max_max defaults to 25% of physical RAM which, in my case, is 3GB:
root@tubman:~# cat /sys/module/zfs/parameters/zfs_dirty_data_max_max
3137032192
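That value is exactly a quarter of the physical RAM the kernel sees (a bit less than a full 12GB, since some memory is reserved); back-computing from the number above:

```shell
dirty_max=3137032192      # zfs_dirty_data_max_max, as reported above
ram=$(( dirty_max * 4 ))  # implied physical RAM: 25% of this is the default
echo "$ram bytes"         # → 12548128768, about 11.7GiB
```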
Considering that the core memory might be boosted in the future, it's worth raising the size a little, so we're going to pick 16GB as suggested. The final partition table looks something like this:
Number Start (sector) End (sector) Size Code Name
1 48 2047 1000.0 KiB EF02
2 2048 1050623 512.0 MiB EF00
3 1050624 3147775 1024.0 MiB BF01 bpool
4 3147776 212862975 100.0 GiB BF00 rpool
5 212862976 246417407 16.0 GiB BF00 SLOG
6 246417408 937703054 329.6 GiB BF00 L2ARC
To create the cache, we use the disk's symbolic name (as explained in the L2ARC section):
zpool add tank log mirror \
/dev/disk/by-id/ata-Samsung_SSD_850_EVO_500GB_S2RANX0J408632Y-part5 \
/dev/disk/by-id/ata-Crucial_CT480M500SSD1_1311092ED40E-part5
Also be careful to use the log keyword here. If you forget it, you will extend the pool with a new mirror device, striped with the other mirrors!
To see how effective the SLOG is, you can watch its usage with:

zpool iostat -v tank 1
... and you will see it fill and empty as the timeout (zfs_txg_timeout, set in /sys/module/zfs/parameters/zfs_txg_timeout, defaults to 5 seconds) expires. You can raise that timeout to use the SLOG more, if you are comfortable with losing that many seconds of data if the SLOG fails.
Removing a SLOG device is a little different from removing an L2ARC cache, because you need to remove the entire mirror; you can't remove individual devices. First, find the mirror that's under the log:
zpool status tank
... then remove that mirror, being careful not to remove the actual data mirror!
zpool remove tank mirror-4
See also this documentation on the SLOG for more information.
Next steps
TODO:
- configure swap? (step 7, issues with memory pressure)
- disable log compression? (step 8.3)
- delete install snapshots? that would be something like:

zfs destroy bpool/BOOT/debian@install
zfs destroy rpool/ROOT/debian@install
setup services:
- radio (DONE)
- sonic
- paste
- photos (Nextcloud?)
- torrent
Done
- SSD caching
- static IP
- port forward SSH so that it doesn't land on curie
- report back on the procedure
- automatic snapshots (with sanoid, see the Puppet code and configuration file)
Decisions taken during the procedure
- use a tmpfs for /run
- use native ZFS encryption
- setup both BIOS and UEFI partitions, in case we switch to the latter later
Changes from the original procedure
- we install a bullseye system from a bullseye live image (instead of buster from buster)
- interfaces(5) file untouched, default is fine (allow-hotplug eth0 etc)
- we skip keyboard-configuration and console-setup config, defaults are fine
- this was skipped, as the target file already exists in bullseye:

ln -s /usr/lib/zfs-linux/zed.d/history_event-zfs-list-cacher.sh /etc/zfs/zed.d
Abandoned ideas
- using mmdebstrap: it complains that /mnt is "not empty" even though it only has related mountpoints (actually, that's supported with --skip=check/empty, but that wasn't in the buster manpage and I failed to look at the bullseye one)
To be improved
- the /var/log and /var/spool datasets create needless complexity in the boot process; we could do without them
Troubleshooting
- initrd documentation: booting from a snapshot, rollbacks, etc
- install troubleshooting
Conversion into a backup server
Originally, this server was meant to be a test server, a "lab" if you will, to do some tests on ZFS and generally just have another PoP for some of my services. The server was running, for example, https://radio.anarc.at. But it was still using the really old marcos v1 hardware, and was due for an upgrade.
It was therefore merged with the server I previously used for offsite backups (toutatis), by moving its disks into the new backup server's body. Tubman's install was kept, but the data was moved around disks quite a bit.
Before
This is how tubman's disks were laid out before the transfer:
- 1x500GB SSD cache
- 2x4TB HDD mirror pool (rpool and bpool, 4TB equivalent)
- all disposable data in rpool/srv
- base Debian install, fully managed by Puppet
And this was toutatis:
- 2x2TB + 1x2.5TB RAID-5 HDD array (4TB equivalent)
- 1x8TB HDD single drive (anarcat's "offsite")
- 2x128GB SSD (OS)
After
- 2x500GB SSD mirror pool for base system (rpool and bpool) and cache
- 2x4TB + 2x8TB HDD mirror pool (tank, 12TB equivalent)
- rpool/srv dataset destroyed
This is not ideal. Ideally, all drives would be the same size (e.g. 8TB) and use some RAID-Z layout to optimize available disk space (e.g. better than RAID-1). That could still be done, but by rebuilding a new vdev using 2x8TB drives, as a future expansion. Only one SATA connector is available on board right now, so this would be a tricky operation, probably involving degrading the tank pool.
It might have been possible to RAID-0 the 2x4TB drives to give an extra 8TB drive to the ZFS pool, but this idea was rejected as too risky and clunky. ZFS itself doesn't support such a configuration.
tank pool creation
We create another pool, called tank, for the 2x8TB drives, fully encrypted. The point of this is to have a separate pool from the main system to alleviate any possible confusion. It will also make it possible to move the system (and only that) to SSD (it's currently on 2x4TB + 500GB SSD cache).
first, partition the new disk (we reuse the disk formatting command used for curie, see 2022-11-17-zfs-migration):
sgdisk --zap-all /dev/sdc
sgdisk -a1 -n1:24K:+1000K -t1:EF02 /dev/sdc
sgdisk -n2:1M:+512M -t2:EF00 /dev/sdc
sgdisk -n3:0:+1G -t3:BF01 /dev/sdc
sgdisk -n4:0:0 -t4:BF00 /dev/sdc
... that opens the possibility of running a full system on that disk (because of the 1GB cleartext /boot and the MBR/EFI partitions), at the cost of 1GB lost.
create a fake file to fool ZFS into thinking there is a second disk:
truncate -s 8TB /tmp/8tb.raw
create the pool with the fake disk (notice the -f for force):

zpool create \
    -o ashift=12 \
    -O encryption=on -O keylocation=prompt -O keyformat=passphrase \
    -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
    -O compression=zstd \
    -O relatime=on \
    -O canmount=off \
    -O mountpoint=none \
    -f \
    tank \
    mirror /dev/sdb4 /tmp/8tb.raw
immediately tell zpool to forget about the fake disk:
zpool offline tank /tmp/8tb.raw
cleanup:
rm /tmp/8tb.raw
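The truncate trick works because the file is sparse: it advertises 8TB but allocates no blocks, so ZFS accepts it as a mirror member without the disk space actually existing. A smaller illustration (1GB, hypothetical path):

```shell
truncate -s 1G /tmp/sparse-demo.raw
stat -c %s /tmp/sparse-demo.raw  # apparent size: 1073741824 bytes
du -k /tmp/sparse-demo.raw       # allocated size: essentially zero
rm /tmp/sparse-demo.raw
```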
make an actual filesystem:
zfs create -o mountpoint=/srv tank/srv
It should look like this:
root@tubman:/# zpool status tank
pool: tank
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
sdc4 ONLINE 0 0 0
/tmp/8tb.raw OFFLINE 0 0 0
errors: No known data errors
I actually also ran:
zpool detach tank /tmp/8tb.raw
... but I'm not sure that's a good idea, because now ZFS thinks this is not a mirror anymore.
root@tubman:~# zpool detach tank /tmp/8tb.raw
root@tubman:~# zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
sdc4 ONLINE 0 0 0
errors: No known data errors
That, fortunately, is easily fixed:
root@tubman:~# truncate -s 8T /tmp/8tb.raw
root@tubman:~# zpool attach tank /dev/sdc4 /tmp/8tb.raw
root@tubman:~# zpool status tank
pool: tank
state: ONLINE
scan: resilvered 1.98M in 00:00:00 with 0 errors on Fri Oct 14 15:23:50 2022
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdc4 ONLINE 0 0 0
/tmp/8tb.raw ONLINE 0 0 0
errors: No known data errors
It might also have been possible to just create a pool normally, with a single disk, and reattach the second one when done.
first rsync transfer
The files were copied from ext4 to ZFS with this magic rsync command:
rsync -ASHaXx --info=progress2 /mnt/ /srv/
/dev/sde was mounted in /mnt and had all the old data:
root@tubman:~# df -h /mnt /srv
Filesystem Size Used Avail Use% Mounted on
/dev/sde1 7.2T 6.7T 142G 98% /mnt
tank/srv 7.1T 152M 7.1T 1% /srv
The ETA was:
6.7Tbyte/(60MB/s) = (6.7 × terabyte)/(60 × (megabyte/second))
                  = 1 d + 7 h + 1 min + 6.666… s
AKA about 31 hours.
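The same estimate in integer shell arithmetic (decimal units, matching the calculation above):

```shell
bytes=$(( 6700 * 1000 * 1000 * 1000 ))  # 6.7TB to copy
rate=$(( 60 * 1000 * 1000 ))            # 60MB/s observed throughput
echo "$(( bytes / rate / 3600 )) hours" # → 31 hours
```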
The rsync command started at 2022-10-14T15:27-04:00, and finished some time before 2022-10-15T18:29-04:00. That is a little over 27 hours of run time, which is faster than the above estimate. The final rsync output was:
root@tubman:/# rsync -ASHaXx --info=progress2 /mnt/ /srv/
7,326,208,287,067 99% 71.87MB/s 27:00:08 (xfr#250467, to-chk=0/592532)
resilvering
AKA rebuilding or adding back the old disk:
partition sde:

sgdisk --zap-all /dev/sde &&
    sgdisk -a1 -n1:24K:+1000K -t1:EF02 /dev/sde &&
    sgdisk -n2:1M:+512M -t2:EF00 /dev/sde &&
    sgdisk -n3:0:+1G -t3:BF01 /dev/sde &&
    sgdisk -n4:0:0 -t4:BF00 /dev/sde
resilver tank with sde4:

date; time zpool replace tank /tmp/8tb.raw /dev/sde4; date
in progress, started at 2022-10-15T21:40-04:00:
root@tubman:~# zpool status tank
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
  continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sat Oct 15 21:40:16 2022
  151G scanned at 2.36G/s, 936K issued at 14.6K/s, 6.61T total
  0B resilvered, 0.00% done, no estimated completion time
config:

NAME                STATE     READ WRITE CKSUM
tank                DEGRADED     0     0     0
  mirror-0          DEGRADED     0     0     0
    sdc4            ONLINE       0     0     0
    replacing-1     DEGRADED     0     0     0
      /tmp/8tb.raw  OFFLINE      0     0     0
      sde4          ONLINE       0     0     0

errors: No known data errors
Odd, that 0B resilvered. Eventually though, it did give me an estimate, about 5 minutes in:

21.5G resilvered, 0.32% done, 1 days 02:27:10 to go
... which was suspiciously similar to the final rsync run time (~27 hours). After 10 minutes, we had this more encouraging estimate:
80.7G resilvered, 1.19% done, 14:22:52 to go
Interestingly, I have no idea if I'll get a notification when the thing is finished resilvering. Logging progress with:
while sleep 600; do
    zpool status tank | grep -e scanned -e resilvered, \
        | sed 's/[\t ]*//' | logger -t resilver --id=$$
done
once the resilver finishes, you get an email notification from zed; in my case it said:

scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
Full status says:
root@tubman:~# zpool status tank
pool: tank
state: ONLINE
scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdc4 ONLINE 0 0 0
sde4 ONLINE 0 0 0
errors: No known data errors
That's an average of 146.931MB/s, or 1.175Gbit/s, pretty nice.
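That average falls straight out of the zed report (taking 6.63T as decimal terabytes):

```shell
bytes=6630000000000              # 6.63T resilvered
secs=$(( 12*3600 + 32*60 + 3 ))  # 12:32:03 of run time
echo "$(( bytes / secs / 1000000 )) MB/s"  # → 146 MB/s
```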
move rpool and bpool to SSDs
take SSD cache offline:
zpool remove rpool /dev/sdd3
partition SSD disks, keep most of the disk for caching (100G system, rest for caching):
for device in /dev/sdb /dev/sdd ; do
    sgdisk --zap-all $device &&
    sgdisk -a8 -n1:24K:+1000K -t1:EF02 \
        -n2:1M:+512M -t2:EF00 \
        -n3:0:+1G -t3:BF01 \
        -n4:0:+100G -t4:BF00 \
        -n5:0:0 -t5:BF00 \
        $device
done
create new pools for the SSD drives (rpoolssd, bpoolssd?):

zpool create \
    -o cachefile=/etc/zfs/zpool.cache \
    -o ashift=12 -d \
    -o feature@async_destroy=enabled \
    -o feature@bookmarks=enabled \
    -o feature@embedded_data=enabled \
    -o feature@empty_bpobj=enabled \
    -o feature@enabled_txg=enabled \
    -o feature@extensible_dataset=enabled \
    -o feature@filesystem_limits=enabled \
    -o feature@hole_birth=enabled \
    -o feature@large_blocks=enabled \
    -o feature@lz4_compress=enabled \
    -o feature@spacemap_histogram=enabled \
    -o feature@zpool_checkpoint=enabled \
    -O acltype=posixacl -O canmount=off \
    -O compression=lz4 \
    -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
    -O mountpoint=/boot -R /mnt \
    bpoolssd mirror /dev/sdb3 /dev/sdd3

zpool create \
    -o ashift=12 \
    -O encryption=on -O keylocation=prompt -O keyformat=passphrase \
    -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
    -O compression=zstd \
    -O relatime=on \
    -O canmount=off \
    -O mountpoint=/ -R /mnt \
    rpoolssd mirror /dev/sdb4 /dev/sdd4
copy the datasets over to the new pool:
zfs snapshot -r bpool@shrink && zfs send -vR bpool@shrink | zfs receive -vFd bpoolssd
The above worked, and quickly. The same with rpool, however:

zfs snapshot -r rpool@shrink && zfs send -vR rpool@shrink | zfs receive -vFd rpoolssd
... failed with:
cannot send rpool@shrink: encrypted dataset rpool may not be sent with properties without the raw flag
[many more attempts later, see below for the full discussion]
A workaround I found is to specify each dataset individually (inspired by this reddit discussion):

```
for dataset in $(zfs list -H -o name | grep -E 'rpool($|/)'); do
    zfs send $dataset@shrink | zfs receive -vd rpoolssd
done
zfs set mountpoint=none rpoolssd/ROOT
zfs set mountpoint=/ rpoolssd/ROOT/debian
```
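As an aside, the `grep -E 'rpool($|/)'` pattern matters here: it matches `rpool` itself and its children, but not `rpoolssd`, which a plain `grep rpool` would also catch. A quick check of the pattern against some of the dataset names involved:

```shell
# the pattern matches rpool and rpool/... but not rpoolssd...
printf '%s\n' rpool rpool/ROOT rpoolssd rpoolssd/ROOT \
    | grep -E 'rpool($|/)'
# prints:
# rpool
# rpool/ROOT
```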
The last `set mountpoint` is necessary because otherwise the mountpoints are wrong:

```
root@tubman:~# zfs list | grep ROOT
NAME                   USED  AVAIL  REFER  MOUNTPOINT
rpoolssd/ROOT         1.67G  92.5G   200K  /mnt/ROOT
rpoolssd/ROOT/debian  1.67G  92.5G  1.37G  /mnt/ROOT/debian
```
... and `/mnt` is basically empty:

```
root@tubman:~# ls /mnt/
home  ROOT  var
```
After the remount, things look more logical:

```
root@tubman:~# zfs list | grep ROOT
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT        4.56G  3.50T   192K  none
rpool/ROOT/debian 4.56G  3.50T  1.62G  /
```
This shows the base datasets are the same:

```
root@tubman:~# diff -u <(zfs list -o name | grep rpoolssd | sed 's/rpoolssd/rpool/') <(zfs list -o name | grep rpool | grep -v rpoolssd)
root@tubman:~#
```
... but of course we're missing a lot of snapshots:

```
root@tubman:~# zfs list -t snapshot | grep rpoolssd | wc -l
10
root@tubman:~# zfs list -t snapshot | grep rpool | grep -v rpoolssd | wc -l
395
```
And we tweaked the mountpoints a little, so that the root dataset doesn't mount on `/` anymore:

```
root@tubman:~# diff -u <(zfs list -o name,mountpoint | grep rpoolssd | sed 's/rpoolssd/rpool/;s,/mnt/,/,;s,/mnt,/,;s/ */ /') <(zfs list -o name,mountpoint | grep rpool | grep -v rpoolssd | sed 's/ */ /')
--- /dev/fd/63	2022-10-18 10:52:30.654369806 -0400
+++ /dev/fd/62	2022-10-18 10:52:30.654369806 -0400
@@ -1,4 +1,4 @@
-rpool none
+rpool /
 rpool/ROOT none
 rpool/ROOT/debian /
 rpool/home /home
```
The downside of this approach is that it's clunky and it doesn't copy over snapshots. But for my use case (just move this shit over already!), it's going to be sufficient. We could actually go through each snapshot the same way again, of course, but at that point we're basically reimplementing `-R` ourselves, and failing.

reinstall grub:
```
for fs in /run /sys /dev /proc; do
    mount -o rbind $fs "/mnt${fs}"
done
zfs mount bpoolssd/BOOT/debian
sed -i 's,ZFS=rpool,ZFS=rpoolssd,' /mnt/etc/default/grub
sed -i s/bpool/bpoolssd/ /mnt/etc/systemd/system/zfs-import-bpool.service
rm /mnt/etc/zfs/zfs-list.cache/*
touch /mnt/etc/zfs/zfs-list.cache/bpoolssd
touch /mnt/etc/zfs/zfs-list.cache/rpoolssd
chroot /mnt /usr/sbin/update-grub
chroot /mnt /usr/sbin/grub-install /dev/sdb
chroot /mnt /usr/sbin/grub-install /dev/sdd
for fs in /run /sys /dev /proc; do
    umount "/mnt${fs}"
done
```
reboot and make sure we boot from the SSD drives (i.e. the new pool):

```
reboot
```
stop using the old pools:

```
zpool export bpool
zpool export rpool
```
reboot again:

```
reboot
```
now make sure that the old pools really are not used: `zpool status` shouldn't show the old pools.
Note that for this conversion, we cannot just `attach` and `detach` the SSD drives, because they are different sizes than the other disks in the pool. We could use `add` to create a second mirror in the pool and `remove` to remove the old mirror, moving the data to the new SSD drives, but that may make the `bpool` unbootable, among other problems.
Also, the procedure the above is inspired from goes through the extra steps of recreating `rpool` and `bpool`, reattaching the drives there, and destroying `rpoolssd` and `bpoolssd`, basically as a way to rename the pools back to their original names. It's possible to "just" rename a pool, but it must not be in use, so possibly the simplest way to do this would be to boot a rescue image and use export/import to rename the pool.
Moving encrypted pools is hard
In the fourth step above (copying the datasets over), we failed to just move the `rpool` datasets to the new pool:

```
zfs snapshot -r rpool@shrink &&
zfs send -vR rpool@shrink | zfs receive -vFd rpoolssd
```
... failed with:

```
cannot send rpool@shrink: encrypted dataset rpool may not be sent with properties without the raw flag
```
I have tried using the `--raw` flag to send the dataset, but then that fails with:

```
root@tubman:~# zfs send --raw -R rpool@shrink | zfs receive -vFd rpoolssd
cannot receive new filesystem stream: zfs receive -F cannot be used to destroy an encrypted filesystem or overwrite an unencrypted one with an encrypted one
```
Okay, then let's try to remove `-F`:

```
root@tubman:~# zfs send --raw -R rpool@shrink | zfs receive -vd rpoolssd
cannot receive new filesystem stream: destination 'rpoolssd' exists
must specify -F to overwrite it
```
AAARGHL. And of course, we can't destroy `rpoolssd`: that would destroy the entire pool.
One thing that does seem to work is to use `-e` instead of `-d`:

```
zfs send --raw -R rpool@shrink | zfs receive -v -e rpoolssd
```
I found out about the `-e` flag in this post. The description of `-d` and `-e` is actually rather confusing in the man page:
> The -d and -e options cause the file system name of the target snapshot to be determined by appending a portion of the sent snapshot's name to the specified target filesystem. If the -d option is specified, all but the first element of the sent snapshot's file system path (usually the pool name) is used and any required intermediate file systems within the specified one are created. If the -e option is specified, then only the last element of the sent snapshot's file system name (i.e. the name of the source file system itself) is used as the target file system name.
I actually can't make heads or tails of this, but essentially, with `-e` the last element of the sent path is kept, and since we sent the pool root itself, that last element is `rpool`, which means we end up with an extra `rpool` component in the dataset path:
```
root@tubman:~# zfs list | grep rpool | grep -v rpoolssd | head -3
rpool                7.16G  3.50T  192K  /
rpool/ROOT           4.56G  3.50T  192K  none
rpool/ROOT/debian    4.56G  3.50T  1.62G /
root@tubman:~# zfs list | grep rpoolssd | head -3
rpoolssd             6.98G  89.4G  192K  /mnt
rpoolssd/rpool       6.95G  89.4G  168K  /mnt
rpoolssd/rpool/ROOT  4.55G  89.4G  168K  none
```
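The man page's naming rules can be sketched in plain shell (this is a rough model of the name mangling only, not of `zfs receive` itself): `-d` strips the first element (the pool name) from the sent path, while `-e` keeps only its last element. Because we sent the pool root `rpool`, its "last element" is `rpool` itself, hence the extra component:

```shell
target=rpoolssd

# -d: drop the first element (the pool name) of the sent path
src=rpool/ROOT/debian
echo "$target/${src#*/}"     # prints: rpoolssd/ROOT/debian

# -e: keep only the last element of the sent path
echo "$target/${src##*/}"    # prints: rpoolssd/debian

# sending the pool root itself: its last element is the pool name,
# which is why we get an extra rpool component in the result
src=rpool
echo "$target/${src##*/}"    # prints: rpoolssd/rpool
```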
So that's also wrong. I eventually ended up with the procedure detailed in step 4, above, to individually copy over the datasets, one by one. This doesn't work as well; the snapshots are not copied over, for example. But it's better than nothing, which was the situation I was stuck with for days.
It's possible that this situation is specific to the Debian and Ubuntu install guides, which put most datasets directly on the root dataset (e.g. `rpool/var` instead of `rpool/ROOT/var`). It's possible that adding that layer of indirection could help with such situations, but the jury is actually still out on that, see this discussion.
SSD TRIM
See zfs.
extending the main tank
Once we are confident we can boot without the old HDD pool, we can
repartition the old drives and add them to tank
. The drives are
already partitioned, and it was probably done with something like
this, from what I can tell:
```
for device in /dev/sda /dev/sdf ; do
    sgdisk --zap-all $device &&
    sgdisk -a8 -n1:24K:+1000K -t1:EF02 \
        -n2:1M:+512M -t2:EF00 \
        -n3:0:+1G -t3:BF01 \
        -n5:0:0 -t5:BF00 \
        $device
done
```
In fact, it wasn't quite like that; the exact procedure is step one in the main installation procedure here. The main difference is that here we call `sgdisk` only once, so the `-a` flag applies to all partitions. In the original, we called it multiple times, which means things are not necessarily aligned as they should be. We'll just disregard this for a moment.
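To actually see whether the existing partitions are misaligned, each start sector can be checked against the 8-sector (4KiB, with 512-byte sectors) alignment that `-a8` enforces. The start sectors below are made-up examples, not values read from these disks:

```shell
# check 8-sector (4KiB) alignment of partition start sectors;
# these start sectors are hypothetical examples
for start in 48 2048 1050624 1050625; do
    if [ $((start % 8)) -eq 0 ]; then
        echo "$start: aligned"
    else
        echo "$start: NOT aligned"
    fi
done
```

On a live system the start sectors can be read with e.g. `sgdisk -i N /dev/sda`.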
Adding the old drives to the pool is pretty simple. We follow this model where we basically have two RAID-1 mirrors striped together. Eventually, it might make more sense to replace the 2x4TB drives with one 8TB drive and use RAID-Z, but I already broke the bank to get a second 8TB drive to get the first part of this stripe, so this is what we have.
The actual command is:
```
root@tubman:~# zpool add -n tank mirror /dev/sda4 /dev/sdf4
would update 'tank' to the following configuration:

	tank
	  mirror-0
	    sdc4
	    sde4
	  mirror
	    sda4
	    sdf4
```
Note the `-n` is a dry run; the actual command doesn't return anything, and takes very little time:
```
root@tubman:~# zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sdc4    ONLINE       0     0     0
	    sde4    ONLINE       0     0     0

errors: No known data errors
root@tubman:~# zpool add tank mirror /dev/sda4 /dev/sdf4
root@tubman:~# zpool status tank
  pool: tank
 state: ONLINE
  scan: resilvered 6.63T in 12:32:03 with 0 errors on Sun Oct 16 10:12:19 2022
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sdc4    ONLINE       0     0     0
	    sde4    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    sda4    ONLINE       0     0     0
	    sdf4    ONLINE       0     0     0

errors: No known data errors
root@tubman:~#
```
I heard the disks scratch for a few seconds and that was it.
I had the problem that the filesystem wasn't coming up on boot. Because it's not the root filesystem, presumably, it needs something special to be loaded. Furthermore, its encryption key would be rather problematic to load, as it doesn't get prompted for in the initrd either. So it's better to shift to a `keylocation` that is actually on disk:
```
umask 0777 &&
dd if=/dev/urandom of=/etc/zfs/tank.key bs=32 count=1 ;
umask 0022 &&
zfs change-key -l -o keylocation=file:///etc/zfs/tank.key tank
```
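The `umask 0777` dance is presumably there so the key file is never created with permissive modes: the default file creation mode of 0666, masked by 0777, leaves mode 0000 (root can still read it thanks to CAP_DAC_OVERRIDE). A quick demonstration in a scratch directory:

```shell
# show that a file created under umask 0777 ends up with mode 0000
tmpdir=$(mktemp -d)
( umask 0777 && dd if=/dev/urandom of="$tmpdir/tank.key" bs=32 count=1 2>/dev/null )
stat -c '%a' "$tmpdir/tank.key"   # prints: 0
rm -rf "$tmpdir"
```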
Then to make the pool automatically loaded at boot, use:

```
zpool set cachefile=/etc/zfs/zpool.cache tank
```
Then the systemd `zfs-import-cache.service` and `zfs-import.service` units will make sure the pool is imported. Normally, if `zfs-mount.service` and `zfs.target` are enabled, underlying datasets should also be automatically mounted. In our case, however, we need an extra shim to make sure the cryptographic key gets loaded. So we need this unit in `/etc/systemd/system/zfs-load-keyfile@.service` (a modified version of this service):
```
[Unit]
Description=Load %I encryption keys from disk
Before=systemd-user-sessions.service zfs-mount.service
After=zfs-import.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=zfs load-key %I

[Install]
WantedBy=zfs-mount.service
```
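For template units like this, systemd derives `%I` from the part of the instance name between the `@` and the unit suffix (modulo unescaping, which doesn't matter for a simple name like `tank`). A rough shell model of that extraction:

```shell
# extract the instance name (what %I expands to) from a template unit;
# this ignores systemd's escaping rules, which is fine for simple names
unit="zfs-load-keyfile@tank.service"
instance="${unit#*@}"            # strip up to and including the @
instance="${instance%.service}"  # strip the .service suffix
echo "$instance"                 # prints: tank
```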
... which we enable with:

```
systemctl enable zfs-load-keyfile@tank.service
```
We can test this works with:

```
zfs umount tank/srv &&
zfs unload-key tank &&
systemctl start zfs-load-keyfile@tank.service &&
zfs mount tank/srv
```
And a reboot is probably in order to make sure systemd doesn't get stuck at a prompt:

```
reboot
```
remaining work
- TODO: re-sync backups
Other documentation
See zfs for more documentation on ZFS and 2022-11-17-zfs-migration for another installation and migration procedure.