Major email crash with syncmaildir
TL:DR; lost half my mail (150,000 messages, ~6GB) last night. Cause
uncertain, but possibly a combination of a dead CMOS battery, systemd
OnCalendar=daily
, a (locking?) bug in syncmaildir, and generally, a
system too exotic and complicated.
The crash
So I somehow lost half my mail:
anarcat@angela:~(main)$ du -sh Maildir/
7,9G Maildir/
anarcat@curie:~(main)$ du -sh Maildir
14G Maildir
anarcat@marcos:~$ du -sh Maildir
8,0G Maildir
Those are three different machines:
- angela: my laptop, not always on
- curie: my workstation, mostly always on
- marcos: my mail server, always on
Those mails are synchronized using a rather exotic system based on SSH, syncmaildir and rsendmail.
The anomaly started on curie:
-- Reboot --
mar 22 16:13:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:13:00 curie smd-pull[4801]: rm: impossible de supprimer '/home/anarcat/.smd/workarea/Maildir': Le dossier n'est pas vide
mar 22 16:13:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:13:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:13:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
mar 22 16:14:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:14:00 curie smd-pull[7025]: 4091 ? 00:00:00 smd-push
mar 22 16:14:00 curie smd-pull[7025]: Already running.
mar 22 16:14:00 curie smd-pull[7025]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:14:00 curie smd-pull[7025]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 4091) run(rm /home/anarcat/.smd/lock))
mar 22 16:14:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:14:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:14:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
Then it seems like smd-push (from curie) started destroying the universe for some reason:
mar 22 16:20:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:20:00 curie smd-pull[9319]: 4091 ? 00:00:00 smd-push
mar 22 16:20:00 curie smd-pull[9319]: Already running.
mar 22 16:20:00 curie smd-pull[9319]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:20:00 curie smd-pull[9319]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(ru
mar 22 16:20:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:20:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:20:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
mar 22 16:21:34 curie smd-push[4091]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(293920), bytes-received(0), xdelta-received(26995)
mar 22 16:21:35 curie smd-push[9374]: register: smd-client@smd-server-register: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(215)
mar 22 16:21:35 curie systemd[3199]: smd-push.service: Succeeded.
Notice the del-mails(293920)
there: it is actively trying to destroy
basically every email in my mail spool.
Then somehow push
and pull
started both at once:
mar 22 16:21:35 curie systemd[3199]: Started push emails with syncmaildir.
mar 22 16:21:35 curie systemd[3199]: Starting push emails with syncmaildir...
mar 22 16:22:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:22:00 curie smd-pull[10333]: 9455 ? 00:00:00 smd-push
mar 22 16:22:00 curie smd-pull[10333]: Already running.
mar 22 16:22:00 curie smd-pull[10333]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:22:00 curie smd-pull[10333]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(r
mar 22 16:22:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:22:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:22:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
mar 22 16:22:00 curie smd-push[9455]: smd-client: ERROR: Data transmission failed.
mar 22 16:22:00 curie smd-push[9455]: smd-client: ERROR: This problem is transient, please retry.
mar 22 16:22:00 curie smd-push[9455]: smd-client: ERROR: server sent ABORT or connection died
mar 22 16:22:00 curie smd-push[9455]: smd-server: ERROR: Unable to open Maildir/.kobo/cur/1498563708.M122624P22121.marcos,S=32234,W=32792:2,S: Maildir/.kobo/cur/1498563708.M122624P22121.marco
mar 22 16:22:00 curie smd-push[9455]: smd-server: ERROR: The problem should be transient, please retry.
mar 22 16:22:00 curie smd-push[9455]: smd-server: ERROR: Unable to open requested file.
mar 22 16:22:00 curie smd-push[9455]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(293920), bytes-received(0), xdelta-received(26995)
mar 22 16:22:00 curie smd-push[9455]: default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
mar 22 16:22:00 curie smd-push[9455]: default: smd-server@localhost: TAGS: error::context(transmit) probable-cause(simultaneous-mailbox-edit) human-intervention(avoidable) suggested-actions(r
mar 22 16:22:00 curie systemd[3199]: smd-push.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:22:00 curie systemd[3199]: smd-push.service: Failed with result 'exit-code'.
mar 22 16:22:00 curie systemd[3199]: Failed to start push emails with syncmaildir.
There it seems push
tried to destroy the universe again:
del-mails(293920)
.
Interestingly, the push started again in parallel with the pull, right that minute:
mar 22 16:22:00 curie systemd[3199]: Starting push emails with syncmaildir...
... but didn't complete for a while, here's pull
trying to start
again:
mar 22 16:24:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:24:00 curie smd-pull[12051]: 10466 ? 00:00:00 smd-push
mar 22 16:24:00 curie smd-pull[12051]: Already running.
mar 22 16:24:00 curie smd-pull[12051]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:24:00 curie smd-pull[12051]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 10466) run(rm /home/anarcat/.smd/lock))
mar 22 16:24:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:24:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:24:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
... and the long push finally resolving:
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: Data transmission failed.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: This problem is transient, please retry.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: server sent ABORT or connection died
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: Data transmission failed.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: This problem is transient, please retry.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: server sent ABORT or connection died
mar 22 16:24:00 curie smd-push[10466]: smd-server: ERROR: Unable to open Maildir/.kobo/cur/1498563708.M122624P22121.marcos,S=32234,W=32792:2,S: Maildir/.kobo/cur/1498563708.M122624P22121.marcos,S=32234,W=32792:2,S: No such file or directory
mar 22 16:24:00 curie smd-push[10466]: smd-server: ERROR: The problem should be transient, please retry.
mar 22 16:24:00 curie smd-push[10466]: smd-server: ERROR: Unable to open requested file.
mar 22 16:24:00 curie smd-push[10466]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(293920), bytes-received(0), xdelta-received(26995)
mar 22 16:24:00 curie smd-push[10466]: default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
mar 22 16:24:00 curie smd-push[10466]: default: smd-server@localhost: TAGS: error::context(transmit) probable-cause(simultaneous-mailbox-edit) human-intervention(avoidable) suggested-actions(retry)
mar 22 16:24:00 curie systemd[3199]: smd-push.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:24:00 curie systemd[3199]: smd-push.service: Failed with result 'exit-code'.
mar 22 16:24:00 curie systemd[3199]: Failed to start push emails with syncmaildir.
mar 22 16:24:00 curie systemd[3199]: Starting push emails with syncmaildir...
This pattern repeats until 16:35, when that locking issue silently recovered somehow:
mar 22 16:35:03 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:35:41 curie smd-pull[20788]: default: smd-client@localhost: TAGS: stats::new-mails(5), del-mails(1), bytes-received(21885), xdelta-received(6863398)
mar 22 16:35:42 curie smd-pull[21373]: register: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(215)
mar 22 16:35:42 curie systemd[3199]: smd-pull.service: Succeeded.
mar 22 16:35:42 curie systemd[3199]: Started pull emails with syncmaildir.
mar 22 16:36:35 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:36:36 curie smd-pull[21738]: default: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(214)
mar 22 16:36:37 curie smd-pull[21816]: register: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(215)
mar 22 16:36:37 curie systemd[3199]: smd-pull.service: Succeeded.
mar 22 16:36:37 curie systemd[3199]: Started pull emails with syncmaildir.
... notice that huge xdelta-received
there, that's 7GB right
there. Mysteriously, the curie mail spool survived this, possibly
because smd-pull
started failing again:
mar 22 16:38:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:38:00 curie smd-pull[23556]: 21887 ? 00:00:00 smd-push
mar 22 16:38:00 curie smd-pull[23556]: Already running.
mar 22 16:38:00 curie smd-pull[23556]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:38:00 curie smd-pull[23556]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 21887) run(rm /home/anarcat/.smd/lock))
mar 22 16:38:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:38:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:38:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
That could have been when i got on angela to check my mail, and it was busy doing the nasty removal stuff... although the times don't match. Here is when angela came back online:
anarcat@angela:~(main)$ last
anarcat :0 :0 Mon Mar 22 19:57 still logged in
reboot system boot 5.10.0-0.bpo.3-a Mon Mar 22 19:57 still running
anarcat :0 :0 Mon Mar 22 17:43 - 18:47 (01:03)
reboot system boot 5.10.0-0.bpo.3-a Mon Mar 22 17:39 still running
Then finally the sync on curie started failing with:
mar 22 16:46:35 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:46:42 curie smd-pull[27455]: smd-server: ERROR: Client aborted, removing /home/anarcat/.smd/curie-anarcat__Maildir.db.txt.new and /home/anarcat/.smd/curie-anarcat__Maildir.db.txt.mtime.new
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: Failed to copy Maildir/.debian/cur/1613401668.M901837P27073.marcos,S=3740,W=3815:2,S to Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: The destination already exists but its content differs.
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: To fix this problem you have two options:
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: - rename Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S by hand so that Maildir/.debian/cur/1613401668.M901837P27073.marcos,S=3740,W=3815:2,S
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: can be copied without replacing it.
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: Executing `cd; mv -n "Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S" "Maildir/.koumbit/cur/1616446002.1.localhost"` should work.
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: - run smd-push so that your changes to Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: are propagated to the other mailbox
mar 22 16:46:42 curie smd-pull[27455]: default: smd-client@localhost: TAGS: error::context(copy-message) probable-cause(concurrent-mailbox-edit) human-intervention(necessary) suggested-actions(run(mv -n "/home/anarcat/.smd/workarea/Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S" "/home/anarcat/.smd/workarea/Maildir/.koumbit/tmp/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S") run(smd-push default))
mar 22 16:46:42 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:46:42 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:46:42 curie systemd[3199]: Failed to start pull emails with syncmaildir.
It went on like this until I found the problem. This is, presumably, a good thing because those emails were not being destroyed.
On angela, things looked like this:
-- Reboot --
mar 22 17:39:29 angela systemd[1677]: Started run notmuch new at least once a day.
mar 22 17:39:29 angela systemd[1677]: Started run smd-pull regularly.
mar 22 17:40:46 angela systemd[1677]: Starting pull emails with syncmaildir...
mar 22 17:43:18 angela smd-pull[3916]: smd-server: ERROR: Unable to open Maildir/.tor/new/1616446842.M285912P26118.marcos,S=8860,W=8996: Maildir/.tor/new/1616446842.M285912P26118.marcos,S=886
0,W=8996: No such file or directory
mar 22 17:43:18 angela smd-pull[3916]: smd-server: ERROR: The problem should be transient, please retry.
mar 22 17:43:18 angela smd-pull[3916]: smd-server: ERROR: Unable to open requested file.
mar 22 17:43:18 angela smd-pull[3916]: smd-client: ERROR: Data transmission failed.
mar 22 17:43:18 angela smd-pull[3916]: smd-client: ERROR: This problem is transient, please retry.
mar 22 17:43:18 angela smd-pull[3916]: smd-client: ERROR: server sent ABORT or connection died
mar 22 17:43:18 angela smd-pull[3916]: default: smd-server@smd-server-anarcat: TAGS: error::context(transmit) probable-cause(simultaneous-mailbox-edit) human-intervention(avoidable) suggested
-actions(retry)
mar 22 17:43:18 angela smd-pull[3916]: default: smd-client@localhost: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
mar 22 17:43:18 angela systemd[1677]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 17:43:18 angela systemd[1677]: smd-pull.service: Failed with result 'exit-code'.
mar 22 17:43:18 angela systemd[1677]: Failed to start pull emails with syncmaildir.
mar 22 17:43:18 angela systemd[1677]: Starting pull emails with syncmaildir...
mar 22 17:43:29 angela smd-pull[4847]: default: smd-client@localhost: TAGS: stats::new-mails(29), del-mails(0), bytes-received(401519), xdelta-received(38914)
mar 22 17:43:29 angela smd-pull[5600]: register: smd-client@localhost: TAGS: stats::new-mails(2), del-mails(0), bytes-received(92150), xdelta-received(471)
mar 22 17:43:29 angela systemd[1677]: smd-pull.service: Succeeded.
mar 22 17:43:29 angela systemd[1677]: Started pull emails with syncmaildir.
mar 22 17:43:29 angela systemd[1677]: Starting push emails with syncmaildir...
mar 22 17:43:32 angela smd-push[5693]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(217)
mar 22 17:43:33 angela smd-push[6575]: register: smd-client@smd-server-register: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(219)
mar 22 17:43:33 angela systemd[1677]: smd-push.service: Succeeded.
mar 22 17:43:33 angela systemd[1677]: Started push emails with syncmaildir.
Notice how long it took to get the first error, in that first failure:
it failed after 3 minutes! Presumably that's when it started deleting
all that mail. And this is during pull
, not push
, so the error
didn't come from angela.
Affected data
It seems 2GB of mail from my main INBOX was destroyed. Another 2.4GB
of spam (kept for training purposes) was also destroyed, along with
700MB of Sent mail. The rest is hard to figure out, because the
folders are actually still there, just smaller. So I relied on ncdu
to figure out the size changes.
(Note that I don't really archive (or delete much of) my mail since I use notmuch, which is why the INBOX is so large...)
Concretely, according to the notmuch-new.service
which still runs
periodically on marcos, here are the changes that happened on the
server:
mar 22 16:17:12 marcos notmuch[10729]: Added 7 new messages to the database. Removed 57985 messages. Detected 1372 file renames.
mar 22 16:22:43 marcos notmuch[12826]: No new mail. Removed 143842 messages. Detected 6072 file renames.
mar 22 16:27:02 marcos notmuch[13969]: No new mail. Removed 82071 messages. Detected 1783 file renames.
mar 22 16:29:45 marcos notmuch[15079]: Added 22743 new messages to the database. Detected 1 file rename.
mar 22 16:31:48 marcos notmuch[16196]: Added 22779 new messages to the database. Removed 5 messages.
mar 22 16:33:11 marcos notmuch[17192]: Added 3711 new messages to the database.
mar 22 16:40:41 marcos notmuch[19122]: Added 74558 new messages to the database. Detected 1 file rename.
mar 22 16:43:21 marcos notmuch[20325]: Added 9061 new messages to the database. Detected 4 file renames.
mar 22 17:43:08 marcos notmuch[7420]: Added 1793 new messages to the database. Detected 6 file renames.
That is basically the entire mail spool destroyed at first (283 898 messages), and then bits and pieces of it progressively re-added (134 645 messages), somehow, so 149 253 mails were lost, presumably.
Recovery
I disabled the services all over the place:
systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer
(Well, technically, I did that only on angela, as I thought the problem was there. Luckily, curie kept going but it seems like it was harmless.)
I made a backup of the mail spool on curie:
tar cf - Maildir/ | pv -s 14G | gzip -c > Maildir.tgz
Then I crossed my fingers and ran smd-push -v -s
, as that was
suggested by smd
error codes themselves. That thankfully started
restoring mail. It failed a few times on weird cases of files being
duplicates, but I resolved this by following the instructions. Or
mostly: I actually deleted the files instead of moving them, which
made smd
even unhappier (if there ever was such a thing). I had to
recreate some of those files, so, lesson learned: do follow the advice
smd
gives you, even if it seems useless or strange.
But then smd-push
was humming along, uploading tens of thousands of
messages, saturating the upload in the office, refilling the mail
spool on the server... yaay!... ?
Except... well, of course that didn't quite work: the mail spool in
the office eventually started to grow beyond the size of the mail
spool on the workstation. That is what smd-push
eventually settled
on:
default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(151697), del-mails(0), bytes-received(7539147811), xdelta-received(10881198)
It recreated 151 697 emails, adding about 2000 emails to the pool, kind of from nowhere at all. On marcos, before:
ncdu 1.13 ~ Use the arrow keys to navigate, press ? for help
--- /home/anarcat/Maildir ------------------------------------
4,0 GiB [##########] /.notmuch
717,3 MiB [# ] /.Archives.2014
498,2 MiB [# ] /.feeds.debian-planet
453,1 MiB [# ] /.Archives.2012
414,5 MiB [# ] /.debian
408,2 MiB [# ] /.quoifaire
389,8 MiB [ ] /.rapports
356,6 MiB [ ] /.tor
182,6 MiB [ ] /.koumbit
179,8 MiB [ ] /tmp
56,8 MiB [ ] /.nn
43,0 MiB [ ] /.act-mtl
32,6 MiB [ ] /.feeds.sysadvent
31,7 MiB [ ] /.feeds.releases
31,4 MiB [ ] /.Sent.2005
26,3 MiB [ ] /.sage
25,5 MiB [ ] /.freedombox
24,0 MiB [ ] /.feeds.git-annex
21,1 MiB [ ] /.Archives.2011
19,1 MiB [ ] /.Sent.2003
16,7 MiB [ ] /.bugtraq
16,2 MiB [ ] /.mlug
Total disk usage: 8,0 GiB Apparent size: 7,6 GiB Items: 184426
After:
ncdu 1.13 ~ Use the arrow keys to navigate, press ? for help
--- /home/anarcat/Maildir ------------------------------------
4,7 GiB [##########] /.notmuch
2,7 GiB [##### ] /.junk
1,9 GiB [### ] /cur
717,3 MiB [# ] /.Archives.2014
659,3 MiB [# ] /.Sent
513,9 MiB [# ] /.Archives.2012
498,2 MiB [# ] /.feeds.debian-planet
449,6 MiB [ ] /.Archives.2015
414,5 MiB [ ] /.debian
408,2 MiB [ ] /.quoifaire
389,8 MiB [ ] /.rapports
380,8 MiB [ ] /.Archives.2013
356,6 MiB [ ] /.tor
261,1 MiB [ ] /.Archives.2011
240,9 MiB [ ] /.koumbit
183,6 MiB [ ] /.Archives.2010
179,8 MiB [ ] /tmp
128,4 MiB [ ] /.lists
106,1 MiB [ ] /.inso-interne
103,0 MiB [ ] /.github
75,0 MiB [ ] /.nanog
69,8 MiB [ ] /.full-disclosure
Total disk usage: 16,2 GiB Apparent size: 15,5 GiB Items: 341143
That is 156 717 files more.
On curie:
ncdu 1.13 ~ Use the arrow keys to navigate, press ? for help
--- /home/anarcat/Maildir ------------------------------------------------------------------
2,7 GiB [##########] /.junk
2,3 GiB [######## ] /.notmuch
1,9 GiB [###### ] /cur
661,2 MiB [## ] /.Archives.2014
655,3 MiB [## ] /.Sent
512,0 MiB [# ] /.Archives.2012
447,3 MiB [# ] /.Archives.2015
438,5 MiB [# ] /.feeds.debian-planet
406,5 MiB [# ] /.quoifaire
383,6 MiB [# ] /.debian
378,6 MiB [# ] /.Archives.2013
303,3 MiB [# ] /.tor
296,0 MiB [# ] /.rapports
237,6 MiB [ ] /.koumbit
233,2 MiB [ ] /.Archives.2011
182,1 MiB [ ] /.Archives.2010
127,0 MiB [ ] /.lists
104,8 MiB [ ] /.inso-interne
102,7 MiB [ ] /.register
89,6 MiB [ ] /.github
67,1 MiB [ ] /.full-disclosure
66,5 MiB [ ] /.nanog
Total disk usage: 13,3 GiB Apparent size: 12,6 GiB Items: 342465
Interestingly, there are more files, but less disk usage. It's possible the notmuch database there is more efficient. So maybe there's nothing to worry about.
Last night's marcos backup has:
root@marcos:/home/anarcat# find /mnt/home/anarcat/Maildir | pv -l | wc -l
341k 0:00:16 [20,4k/s] [ <=> ]
341040
... 341040 files, which seems about right, considering some mail was
delivered during the day. An audit can be performed with hashdeep
:
borg mount /media/sdb2/borg/::marcos-auto-2021-03-22 /mnt
hashdeep -c sha256 -r /mnt/home/anarcat/Maildir | pv -l -s 341k > Maildir-backup-manifest.txt
And then compared with:
hashdeep -c sha256 -k Maildir-backup-manifest.txt Maildir/
Some extra files should show up in the Maildir, and very few should actually be missing, because I shouldn't have deleted mail from the previous day the next day, or at least very few. The actual summary hashdeep gave me was:
hashdeep: Audit failed
Input files examined: 0
Known files expecting: 0
Files matched: 339080
Files partially matched: 0
Files moved: 782
New files found: 107
Known files not found: 106
So 106 files added, 107 deleted. Seems good enough for me...
Postfix was stopped at Mar 22 21:12:59
to try and stop external
events from confusing things even further. I reviewed the delivery log
to see if mail that came in during the problem window disappeared:
grep 'dovecot:.*stored mail into mailbox' /var/log/mail.log |
tail -20 |
sed 's/.*msgid=<//;s/>.*//' |
while read msgid; do
notmuch count --exclude=false id:$msgid |
grep 0 && echo $msgid missing;
done
And things looked okay. Now of course if we go further back, we find mail I actually deleted (because I do do that sometimes), so it's hard to use this log as an audit trail. We can only hope that the curie spool is sufficiently coherent to be relied on.
Worst case, we'll have to restore from last night's backup, but that's getting far away now: I get hundreds of mails a day in that mail spool, and reseting back to last night does not seem like a good idea.
A dry run of smd-pull
on angela seems to agree that it's missing
some files:
default: smd-client@localhost: TAGS: stats::new-mails(154914), del-mails(0), bytes-received(0), xdelta-received(0)
... a number of mails somewhere in between the other two, go figure. A
"wet" run of this was started, without deletion (-n
), which gave us:
default: smd-client@localhost: TAGS: stats::new-mails(154911), del-mails(0), bytes-received(7658160107), xdelta-received(10837609)
Strange that it sync'd three less emails, but that's still better than nothing, and we have a mail spool on angela again:
anarcat@angela:~(main)$ notmuch new
purging with prefix '.': spam moved (0), ham moved (0), deleted (0), done
Note: Ignoring non-mail file: /home/anarcat/Maildir//.uidvalidity
Processed 1779 total files in 26s (66 files/sec.).
Added 1190 new messages to the database. Removed 3 messages. Detected 593 file renames.
tagging with prefix '.': spam, sent, feeds, koumbit, tor, lists, rapports, folders, done.
Notice how only 1190 messages were re-added, that is because I killed notmuch before it had time to remove all those mails from its database.
Possible causes
I am totally at a loss as to why smd
started destroying everything
like it did. But a few things come to mind:
I rewired my office on that day.
This meant unplugging
curie
, the workstation.It has a bad CMOS battery (known problem), so it jumped around the time continuum a few times, sometimes by years.
The
smd
services are ran from a systemd unit withOnCalendar=*:0/2
. I have heard that it's possible that major time jumps "pile up" execution of jobs, and it seems this happened in this case.It's possible that locking in
smd
is not as great as it could be, and that it corrupted its internal data structures oncurie
, which led it to command a destruction of the remote mail spool.
It's also possible that there was a disk failure on the server,
marcos
. But since it's running on a (software) RAID-1 array, and no
errors have been found (according to dmesg
), I don't think that's a
plausible hypothesis.
Lessons learned
follow what
smd
says, even if it seems useless or strange.trust but verify: just backup everything before you do anything, especially the largest data set.
daily backups are not great for email, unless you're ready to lose a day of email (which I'm not).
hashdeep is great. I keep finding new use cases for it. Last time it was to audit my camera SD card to make sure I didn't forget anything, and now this. it's fast and powerful.
borg is great too. the FUSE mount was especially useful, and it was pretty fast to explore the backup, even through that overhead: checksumming 15GB of mail took about 35 minutes, which gives a respectable 8MB/s, probably bottlenecked by the crap external USB drive I use for backups (!).
I really need to finish my backup system so that I have automated offsite backups, although in this case that would actually have been much slower (certainly not 8MB/s!).
Workarounds and solutions
I setup fake-hwclock on curie, so that the next power failure will not upset my clock that badly.
I am thinking of switching to ZFS or BTRFS for most of my filesystems, so that I can use filesystem snapshots (including remotely!) as a backup strategy. This seems so much more powerful than crawling the filesystem for changes, and allows for truly offsite backups protected from an attacker (hopefully). But it's a long way there.
I'm also thinking of rebuilding my mail setup without smd
. It's not
the first time something like this happens with smd
. It's the first
time I am more confident it's the root cause of the problem, however,
and it makes me really nervous for the future.
I have used offlineimap in the past and it seems it was finally ported to Python 3 so that could be an option again. isync/mbsync is another option, which I tried before but do not remember why I didn't switch. A complete redesign with something like getmail and/or nncp could also be an option. But alas, I lack the time to go crazy with those experiments.
Somehow, doing like everyone else and just going with Google still doesn't seem to be an option for me. Screw big tech. But I am afraid they will win, eventually.
In any case, I'm just happy I got mail again, strangely.
Update: this happened again, believe it or not! See this next post.