TL:DR; lost half my mail (150,000 messages, ~6GB) last night. Cause uncertain, but possibly a combination of a dead CMOS battery, systemd OnCalendar=daily, a (locking?) bug in syncmaildir, and generally, a system too exotic and complicated.

  1. The crash
  2. Affected data
  3. Recovery
  4. Possible causes
  5. Lessons learned
  6. Workarounds and solutions

The crash

So I somehow lost half my mail:

anarcat@angela:~(main)$ du -sh Maildir/
7,9G    Maildir/

anarcat@curie:~(main)$ du -sh Maildir
14G     Maildir

anarcat@marcos:~$ du -sh Maildir
8,0G    Maildir

Those are three different machines:

Those mails are synchronized using a rather exotic system based on SSH, syncmaildir and rsendmail.

The anomaly started on curie:

-- Reboot --
mar 22 16:13:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:13:00 curie smd-pull[4801]: rm: impossible de supprimer '/home/anarcat/.smd/workarea/Maildir': Le dossier n'est pas vide
mar 22 16:13:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:13:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:13:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
mar 22 16:14:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:14:00 curie smd-pull[7025]:  4091 ?        00:00:00 smd-push
mar 22 16:14:00 curie smd-pull[7025]: Already running.
mar 22 16:14:00 curie smd-pull[7025]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:14:00 curie smd-pull[7025]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 4091) run(rm /home/anarcat/.smd/lock))
mar 22 16:14:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:14:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:14:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.

Then it seems like smd-push (from curie) started destroying the universe for some reason:

mar 22 16:20:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:20:00 curie smd-pull[9319]:  4091 ?        00:00:00 smd-push
mar 22 16:20:00 curie smd-pull[9319]: Already running.
mar 22 16:20:00 curie smd-pull[9319]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:20:00 curie smd-pull[9319]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(ru
mar 22 16:20:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:20:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:20:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
mar 22 16:21:34 curie smd-push[4091]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(293920), bytes-received(0), xdelta-received(26995)
mar 22 16:21:35 curie smd-push[9374]: register: smd-client@smd-server-register: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(215)
mar 22 16:21:35 curie systemd[3199]: smd-push.service: Succeeded.

Notice the del-mails(293920) there: it is actively trying to destroy basically every email in my mail spool.

Then somehow push and pull started both at once:

mar 22 16:21:35 curie systemd[3199]: Started push emails with syncmaildir.
mar 22 16:21:35 curie systemd[3199]: Starting push emails with syncmaildir...
mar 22 16:22:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:22:00 curie smd-pull[10333]:  9455 ?        00:00:00 smd-push
mar 22 16:22:00 curie smd-pull[10333]: Already running.
mar 22 16:22:00 curie smd-pull[10333]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:22:00 curie smd-pull[10333]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(r
mar 22 16:22:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:22:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:22:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
mar 22 16:22:00 curie smd-push[9455]: smd-client: ERROR: Data transmission failed.
mar 22 16:22:00 curie smd-push[9455]: smd-client: ERROR: This problem is transient, please retry.
mar 22 16:22:00 curie smd-push[9455]: smd-client: ERROR: server sent ABORT or connection died
mar 22 16:22:00 curie smd-push[9455]: smd-server: ERROR: Unable to open Maildir/.kobo/cur/1498563708.M122624P22121.marcos,S=32234,W=32792:2,S: Maildir/.kobo/cur/1498563708.M122624P22121.marco
mar 22 16:22:00 curie smd-push[9455]: smd-server: ERROR: The problem should be transient, please retry.
mar 22 16:22:00 curie smd-push[9455]: smd-server: ERROR: Unable to open requested file.
mar 22 16:22:00 curie smd-push[9455]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(293920), bytes-received(0), xdelta-received(26995)
mar 22 16:22:00 curie smd-push[9455]: default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
mar 22 16:22:00 curie smd-push[9455]: default: smd-server@localhost: TAGS: error::context(transmit) probable-cause(simultaneous-mailbox-edit) human-intervention(avoidable) suggested-actions(r
mar 22 16:22:00 curie systemd[3199]: smd-push.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:22:00 curie systemd[3199]: smd-push.service: Failed with result 'exit-code'.
mar 22 16:22:00 curie systemd[3199]: Failed to start push emails with syncmaildir.

There it seems push tried to destroy the universe again: del-mails(293920).

Interestingly, the push started again in parallel with the pull, right that minute:

mar 22 16:22:00 curie systemd[3199]: Starting push emails with syncmaildir...

... but didn't complete for a while, here's pull trying to start again:

mar 22 16:24:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:24:00 curie smd-pull[12051]: 10466 ?        00:00:00 smd-push
mar 22 16:24:00 curie smd-pull[12051]: Already running.
mar 22 16:24:00 curie smd-pull[12051]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:24:00 curie smd-pull[12051]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 10466) run(rm /home/anarcat/.smd/lock))
mar 22 16:24:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:24:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:24:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.

... and the long push finally resolving:

mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: Data transmission failed.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: This problem is transient, please retry.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: server sent ABORT or connection died
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: Data transmission failed.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: This problem is transient, please retry.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: server sent ABORT or connection died
mar 22 16:24:00 curie smd-push[10466]: smd-server: ERROR: Unable to open Maildir/.kobo/cur/1498563708.M122624P22121.marcos,S=32234,W=32792:2,S: Maildir/.kobo/cur/1498563708.M122624P22121.marcos,S=32234,W=32792:2,S: No such file or directory
mar 22 16:24:00 curie smd-push[10466]: smd-server: ERROR: The problem should be transient, please retry.
mar 22 16:24:00 curie smd-push[10466]: smd-server: ERROR: Unable to open requested file.
mar 22 16:24:00 curie smd-push[10466]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(293920), bytes-received(0), xdelta-received(26995)
mar 22 16:24:00 curie smd-push[10466]: default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
mar 22 16:24:00 curie smd-push[10466]: default: smd-server@localhost: TAGS: error::context(transmit) probable-cause(simultaneous-mailbox-edit) human-intervention(avoidable) suggested-actions(retry)
mar 22 16:24:00 curie systemd[3199]: smd-push.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:24:00 curie systemd[3199]: smd-push.service: Failed with result 'exit-code'.
mar 22 16:24:00 curie systemd[3199]: Failed to start push emails with syncmaildir.
mar 22 16:24:00 curie systemd[3199]: Starting push emails with syncmaildir...

This pattern repeats until 16:35, when that locking issue silently recovered somehow:

mar 22 16:35:03 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:35:41 curie smd-pull[20788]: default: smd-client@localhost: TAGS: stats::new-mails(5), del-mails(1), bytes-received(21885), xdelta-received(6863398)
mar 22 16:35:42 curie smd-pull[21373]: register: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(215)
mar 22 16:35:42 curie systemd[3199]: smd-pull.service: Succeeded.
mar 22 16:35:42 curie systemd[3199]: Started pull emails with syncmaildir.
mar 22 16:36:35 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:36:36 curie smd-pull[21738]: default: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(214)
mar 22 16:36:37 curie smd-pull[21816]: register: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(215)
mar 22 16:36:37 curie systemd[3199]: smd-pull.service: Succeeded.
mar 22 16:36:37 curie systemd[3199]: Started pull emails with syncmaildir.

... notice that huge xdelta-received there, that's 7GB right there. Mysteriously, the curie mail spool survived this, possibly because smd-pull started failing again:

mar 22 16:38:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:38:00 curie smd-pull[23556]: 21887 ?        00:00:00 smd-push
mar 22 16:38:00 curie smd-pull[23556]: Already running.
mar 22 16:38:00 curie smd-pull[23556]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:38:00 curie smd-pull[23556]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 21887) run(rm /home/anarcat/.smd/lock))
mar 22 16:38:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:38:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:38:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.

That could have been when i got on angela to check my mail, and it was busy doing the nasty removal stuff... although the times don't match. Here is when angela came back online:

anarcat@angela:~(main)$ last
anarcat  :0           :0               Mon Mar 22 19:57   still logged in
reboot   system boot  5.10.0-0.bpo.3-a Mon Mar 22 19:57   still running
anarcat  :0           :0               Mon Mar 22 17:43 - 18:47  (01:03)
reboot   system boot  5.10.0-0.bpo.3-a Mon Mar 22 17:39   still running

Then finally the sync on curie started failing with:

mar 22 16:46:35 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:46:42 curie smd-pull[27455]: smd-server: ERROR: Client aborted, removing /home/anarcat/.smd/curie-anarcat__Maildir.db.txt.new and /home/anarcat/.smd/curie-anarcat__Maildir.db.txt.mtime.new
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: Failed to copy Maildir/.debian/cur/1613401668.M901837P27073.marcos,S=3740,W=3815:2,S to Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: The destination already exists but its content differs.
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: To fix this problem you have two options:
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: - rename Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S by hand so that Maildir/.debian/cur/1613401668.M901837P27073.marcos,S=3740,W=3815:2,S
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR:   can be copied without replacing it.
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR:   Executing `cd; mv -n "Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S" "Maildir/.koumbit/cur/1616446002.1.localhost"` should work.
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: - run smd-push so that your changes to Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR:   are propagated to the other mailbox
mar 22 16:46:42 curie smd-pull[27455]: default: smd-client@localhost: TAGS: error::context(copy-message) probable-cause(concurrent-mailbox-edit) human-intervention(necessary) suggested-actions(run(mv -n "/home/anarcat/.smd/workarea/Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S" "/home/anarcat/.smd/workarea/Maildir/.koumbit/tmp/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S") run(smd-push default))
mar 22 16:46:42 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:46:42 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:46:42 curie systemd[3199]: Failed to start pull emails with syncmaildir.

It went on like this until I found the problem. This is, presumably, a good thing because those emails were not being destroyed.

On angela, things looked like this:

-- Reboot --
mar 22 17:39:29 angela systemd[1677]: Started run notmuch new at least once a day.
mar 22 17:39:29 angela systemd[1677]: Started run smd-pull regularly.
mar 22 17:40:46 angela systemd[1677]: Starting pull emails with syncmaildir...
mar 22 17:43:18 angela smd-pull[3916]: smd-server: ERROR: Unable to open Maildir/.tor/new/1616446842.M285912P26118.marcos,S=8860,W=8996: Maildir/.tor/new/1616446842.M285912P26118.marcos,S=886
0,W=8996: No such file or directory
mar 22 17:43:18 angela smd-pull[3916]: smd-server: ERROR: The problem should be transient, please retry.
mar 22 17:43:18 angela smd-pull[3916]: smd-server: ERROR: Unable to open requested file.
mar 22 17:43:18 angela smd-pull[3916]: smd-client: ERROR: Data transmission failed.
mar 22 17:43:18 angela smd-pull[3916]: smd-client: ERROR: This problem is transient, please retry.
mar 22 17:43:18 angela smd-pull[3916]: smd-client: ERROR: server sent ABORT or connection died
mar 22 17:43:18 angela smd-pull[3916]: default: smd-server@smd-server-anarcat: TAGS: error::context(transmit) probable-cause(simultaneous-mailbox-edit) human-intervention(avoidable) suggested
-actions(retry)
mar 22 17:43:18 angela smd-pull[3916]: default: smd-client@localhost: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
mar 22 17:43:18 angela systemd[1677]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 17:43:18 angela systemd[1677]: smd-pull.service: Failed with result 'exit-code'.
mar 22 17:43:18 angela systemd[1677]: Failed to start pull emails with syncmaildir.
mar 22 17:43:18 angela systemd[1677]: Starting pull emails with syncmaildir...
mar 22 17:43:29 angela smd-pull[4847]: default: smd-client@localhost: TAGS: stats::new-mails(29), del-mails(0), bytes-received(401519), xdelta-received(38914)
mar 22 17:43:29 angela smd-pull[5600]: register: smd-client@localhost: TAGS: stats::new-mails(2), del-mails(0), bytes-received(92150), xdelta-received(471)
mar 22 17:43:29 angela systemd[1677]: smd-pull.service: Succeeded.
mar 22 17:43:29 angela systemd[1677]: Started pull emails with syncmaildir.
mar 22 17:43:29 angela systemd[1677]: Starting push emails with syncmaildir...
mar 22 17:43:32 angela smd-push[5693]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(217)
mar 22 17:43:33 angela smd-push[6575]: register: smd-client@smd-server-register: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(219)
mar 22 17:43:33 angela systemd[1677]: smd-push.service: Succeeded.
mar 22 17:43:33 angela systemd[1677]: Started push emails with syncmaildir.

Notice how long it took to get the first error, in that first failure: it failed after 3 minutes! Presumably that's when it started deleting all that mail. And this is during pull, not push, so the error didn't come from angela.

Affected data

It seems 2GB of mail from my main INBOX was destroyed. Another 2.4GB of spam (kept for training purposes) was also destroyed, along with 700MB of Sent mail. The rest is hard to figure out, because the folders are actually still there, just smaller. So I relied on ncdu to figure out the size changes.

(Note that I don't really archive (or delete much of) my mail since I use notmuch, which is why the INBOX is so large...)

Concretely, according to the notmuch-new.service which still runs periodically on marcos, here are the changes that happened on the server:

mar 22 16:17:12 marcos notmuch[10729]: Added 7 new messages to the database. Removed 57985 messages. Detected 1372 file renames.
mar 22 16:22:43 marcos notmuch[12826]: No new mail. Removed 143842 messages. Detected 6072 file renames.
mar 22 16:27:02 marcos notmuch[13969]: No new mail. Removed 82071 messages. Detected 1783 file renames.
mar 22 16:29:45 marcos notmuch[15079]: Added 22743 new messages to the database. Detected 1 file rename.
mar 22 16:31:48 marcos notmuch[16196]: Added 22779 new messages to the database. Removed 5 messages.
mar 22 16:33:11 marcos notmuch[17192]: Added 3711 new messages to the database.
mar 22 16:40:41 marcos notmuch[19122]: Added 74558 new messages to the database. Detected 1 file rename.
mar 22 16:43:21 marcos notmuch[20325]: Added 9061 new messages to the database. Detected 4 file renames.
mar 22 17:43:08 marcos notmuch[7420]: Added 1793 new messages to the database. Detected 6 file renames.

That is basically the entire mail spool destroyed at first (283 898 messages), and then bits and pieces of it progressively re-added (134 645 messages), somehow, so 149 253 mails were lost, presumably.

Recovery

I disabled the services all over the place:

systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer

(Well, technically, I did that only on angela, as I thought the problem was there. Luckily, curie kept going but it seems like it was harmless.)

I made a backup of the mail spool on curie:

tar cf - Maildir/ | pv -s 14G | gzip -c > Maildir.tgz

Then I crossed my fingers and ran smd-push -v -s, as that was suggested by smd error codes themselves. That thankfully started restoring mail. It failed a few times on weird cases of files being duplicates, but I resolved this by following the instructions. Or mostly: I actually deleted the files instead of moving them, which made smd even unhappier (if there ever was such a thing). I had to recreate some of those files, so, lesson learned: do follow the advice smd gives you, even if it seems useless or strange.

But then smd-push was humming along, uploading tens of thousands of messages, saturating the upload in the office, refilling the mail spool on the server... yaay!... ?

Except... well, of course that didn't quite work: the mail spool in the office eventually started to grow beyond the size of the mail spool on the workstation. That is what smd-push eventually settled on:

default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(151697), del-mails(0), bytes-received(7539147811), xdelta-received(10881198)

It recreated 151 697 emails, adding about 2000 emails to the pool, kind of from nowhere at all. On marcos, before:

ncdu 1.13 ~ Use the arrow keys to navigate, press ? for help
--- /home/anarcat/Maildir ------------------------------------
    4,0 GiB [##########] /.notmuch
  717,3 MiB [#         ] /.Archives.2014
  498,2 MiB [#         ] /.feeds.debian-planet
  453,1 MiB [#         ] /.Archives.2012
  414,5 MiB [#         ] /.debian
  408,2 MiB [#         ] /.quoifaire
  389,8 MiB [          ] /.rapports
  356,6 MiB [          ] /.tor
  182,6 MiB [          ] /.koumbit
  179,8 MiB [          ] /tmp
   56,8 MiB [          ] /.nn
   43,0 MiB [          ] /.act-mtl
   32,6 MiB [          ] /.feeds.sysadvent
   31,7 MiB [          ] /.feeds.releases
   31,4 MiB [          ] /.Sent.2005
   26,3 MiB [          ] /.sage
   25,5 MiB [          ] /.freedombox
   24,0 MiB [          ] /.feeds.git-annex
   21,1 MiB [          ] /.Archives.2011
   19,1 MiB [          ] /.Sent.2003
   16,7 MiB [          ] /.bugtraq
   16,2 MiB [          ] /.mlug
 Total disk usage:   8,0 GiB  Apparent size:   7,6 GiB  Items: 184426

After:

ncdu 1.13 ~ Use the arrow keys to navigate, press ? for help
--- /home/anarcat/Maildir ------------------------------------
    4,7 GiB [##########] /.notmuch
    2,7 GiB [#####     ] /.junk
    1,9 GiB [###       ] /cur
  717,3 MiB [#         ] /.Archives.2014
  659,3 MiB [#         ] /.Sent
  513,9 MiB [#         ] /.Archives.2012
  498,2 MiB [#         ] /.feeds.debian-planet
  449,6 MiB [          ] /.Archives.2015
  414,5 MiB [          ] /.debian
  408,2 MiB [          ] /.quoifaire
  389,8 MiB [          ] /.rapports
  380,8 MiB [          ] /.Archives.2013
  356,6 MiB [          ] /.tor
  261,1 MiB [          ] /.Archives.2011
  240,9 MiB [          ] /.koumbit
  183,6 MiB [          ] /.Archives.2010
  179,8 MiB [          ] /tmp
  128,4 MiB [          ] /.lists
  106,1 MiB [          ] /.inso-interne
  103,0 MiB [          ] /.github
   75,0 MiB [          ] /.nanog
   69,8 MiB [          ] /.full-disclosure
 Total disk usage:  16,2 GiB  Apparent size:  15,5 GiB  Items: 341143

That is 156 717 files more.

On curie:

ncdu 1.13 ~ Use the arrow keys to navigate, press ? for help
--- /home/anarcat/Maildir ------------------------------------------------------------------
    2,7 GiB [##########] /.junk
    2,3 GiB [########  ] /.notmuch
    1,9 GiB [######    ] /cur
  661,2 MiB [##        ] /.Archives.2014
  655,3 MiB [##        ] /.Sent
  512,0 MiB [#         ] /.Archives.2012
  447,3 MiB [#         ] /.Archives.2015
  438,5 MiB [#         ] /.feeds.debian-planet
  406,5 MiB [#         ] /.quoifaire
  383,6 MiB [#         ] /.debian
  378,6 MiB [#         ] /.Archives.2013
  303,3 MiB [#         ] /.tor
  296,0 MiB [#         ] /.rapports
  237,6 MiB [          ] /.koumbit
  233,2 MiB [          ] /.Archives.2011
  182,1 MiB [          ] /.Archives.2010
  127,0 MiB [          ] /.lists
  104,8 MiB [          ] /.inso-interne
  102,7 MiB [          ] /.register
   89,6 MiB [          ] /.github
   67,1 MiB [          ] /.full-disclosure
   66,5 MiB [          ] /.nanog
 Total disk usage:  13,3 GiB  Apparent size:  12,6 GiB  Items: 342465

Interestingly, there are more files, but less disk usage. It's possible the notmuch database there is more efficient. So maybe there's nothing to worry about.

Last night's marcos backup has:

root@marcos:/home/anarcat# find /mnt/home/anarcat/Maildir | pv -l | wc -l
 341k 0:00:16 [20,4k/s] [                             <=>                                                                                                                                     ]
341040

... 341040 files, which seems about right, considering some mail was delivered during the day. An audit can be performed with hashdeep:

borg mount /media/sdb2/borg/::marcos-auto-2021-03-22 /mnt
hashdeep -c sha256 -r /mnt/home/anarcat/Maildir | pv -l -s 341k > Maildir-backup-manifest.txt

And then compared with:

hashdeep -c sha256 -k Maildir-backup-manifest.txt Maildir/

Some extra files should show up in the Maildir, and very few should actually be missing, because I shouldn't have deleted mail from the previous day the next day, or at least very few. The actual summary hashdeep gave me was:

hashdeep: Audit failed
   Input files examined: 0
  Known files expecting: 0
          Files matched: 339080
Files partially matched: 0
            Files moved: 782
        New files found: 107
  Known files not found: 106

So 106 files added, 107 deleted. Seems good enough for me...

Postfix was stopped at Mar 22 21:12:59 to try and stop external events from confusing things even further. I reviewed the delivery log to see if mail that came in during the problem window disappeared:

grep 'dovecot:.*stored mail into mailbox' /var/log/mail.log |
  tail -20 |
  sed 's/.*msgid=<//;s/>.*//' | 
  while read msgid; do 
    notmuch count --exclude=false id:$msgid |
      grep 0 && echo $msgid missing;
  done

And things looked okay. Now of course if we go further back, we find mail I actually deleted (because I do do that sometimes), so it's hard to use this log as an audit trail. We can only hope that the curie spool is sufficiently coherent to be relied on.

Worst case, we'll have to restore from last night's backup, but that's getting far away now: I get hundreds of mails a day in that mail spool, and reseting back to last night does not seem like a good idea.

A dry run of smd-pull on angela seems to agree that it's missing some files:

default: smd-client@localhost: TAGS: stats::new-mails(154914), del-mails(0), bytes-received(0), xdelta-received(0)

... a number of mails somewhere in between the other two, go figure. A "wet" run of this was started, without deletion (-n), which gave us:

default: smd-client@localhost: TAGS: stats::new-mails(154911), del-mails(0), bytes-received(7658160107), xdelta-received(10837609)

Strange that it sync'd three less emails, but that's still better than nothing, and we have a mail spool on angela again:

anarcat@angela:~(main)$ notmuch new
purging with prefix '.': spam moved (0), ham moved (0), deleted (0), done
Note: Ignoring non-mail file: /home/anarcat/Maildir//.uidvalidity
Processed 1779 total files in 26s (66 files/sec.).
Added 1190 new messages to the database. Removed 3 messages. Detected 593 file renames.
tagging with prefix '.': spam, sent, feeds, koumbit, tor, lists, rapports, folders, done.

Notice how only 1190 messages were re-added, that is because I killed notmuch before it had time to remove all those mails from its database.

Possible causes

I am totally at a loss as to why smd started destroying everything like it did. But a few things come to mind:

  1. I rewired my office on that day.

  2. This meant unplugging curie, the workstation.

  3. It has a bad CMOS battery (known problem), so it jumped around the time continuum a few times, sometimes by years.

  4. The smd services are ran from a systemd unit with OnCalendar=*:0/2. I have heard that it's possible that major time jumps "pile up" execution of jobs, and it seems this happened in this case.

  5. It's possible that locking in smd is not as great as it could be, and that it corrupted its internal data structures on curie, which led it to command a destruction of the remote mail spool.

It's also possible that there was a disk failure on the server, marcos. But since it's running on a (software) RAID-1 array, and no errors have been found (according to dmesg), I don't think that's a plausible hypothesis.

Lessons learned

  1. follow what smd says, even if it seems useless or strange.

  2. trust but verify: just backup everything before you do anything, especially the largest data set.

  3. daily backups are not great for email, unless you're ready to lose a day of email (which I'm not).

  4. hashdeep is great. I keep finding new use cases for it. Last time it was to audit my camera SD card to make sure I didn't forget anything, and now this. it's fast and powerful.

  5. borg is great too. the FUSE mount was especially useful, and it was pretty fast to explore the backup, even through that overhead: checksumming 15GB of mail took about 35 minutes, which gives a respectable 8MB/s, probably bottlenecked by the crap external USB drive I use for backups (!).

  6. I really need to finish my backup system so that I have automated offsite backups, although in this case that would actually have been much slower (certainly not 8MB/s!).

Workarounds and solutions

I setup fake-hwclock on curie, so that the next power failure will not upset my clock that badly.

I am thinking of switching to ZFS or BTRFS for most of my filesystems, so that I can use filesystem snapshots (including remotely!) as a backup strategy. This seems so much more powerful than crawling the filesystem for changes, and allows for truly offsite backups protected from an attacker (hopefully). But it's a long way there.

I'm also thinking of rebuilding my mail setup without smd. It's not the first time something like this happens with smd. It's the first time I am more confident it's the root cause of the problem, however, and it makes me really nervous for the future.

I have used offlineimap in the past and it seems it was finally ported to Python 3 so that could be an option again. isync/mbsync is another option, which I tried before but do not remember why I didn't switch. A complete redesign with something like getmail and/or nncp could also be an option. But alas, I lack the time to go crazy with those experiments.

Somehow, doing like everyone else and just going with Google still doesn't seem to be an option for me. Screw big tech. But I am afraid they will win, eventually.

In any case, I'm just happy I got mail again, strangely.

Update: this happened again, believe it or not! See this next post.

Comments on this page are closed.
Created . Edited .