So I had another major email crash with my syncmaildir setup. This time I was at least able to confirm the issue, and I still haven't lost mail thanks to backups and sheer luck (again).

The crash

It is not really worth going over the crash in detail: it's fairly similar to the last one. Something bad happened and smd started destroying everything. The hint is that it takes a long time to do what usually takes seconds. It helps that I now have a second monitor showing logs.

I still lost much more mail than the last time. I used to have "301 723 messages", according to notmuch. But then when I ran smd-pull by hand, it was telling me:

95K emails scanned

Oops. You can see notmuch happily noticing the destroyed files on the server:

jun 28 16:33:40 marcos notmuch[28532]: No new mail. Removed 65498 messages. Detected 1699 file renames.
jun 28 16:36:05 marcos notmuch[29746]: No new mail. Removed 68883 messages. Detected 2488 file renames.
jun 28 16:41:40 marcos notmuch[31972]: No new mail. Removed 118295 messages. Detected 3657 file renames.

The final count ended up being 81 042 messages, according to notmuch. A whopping 220 000 mails deleted.
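
The size of the loss follows directly from the two notmuch counts quoted above. As a quick check in shell arithmetic (the variable names are only for illustration):

```shell
before=301723   # messages notmuch reported before the crash
after=81042     # messages notmuch reported after the crash
lost=$((before - after))
echo "$lost messages lost"   # prints "220681 messages lost"
```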

The interesting bit, this time around, is that I caught smd in the act of running two processes in parallel:

jun 28 16:30:09 curie systemd[2845]: Finished pull emails with syncmaildir. 
jun 28 16:30:09 curie systemd[2845]: Starting push emails with syncmaildir... 
jun 28 16:30:09 curie systemd[2845]: Starting pull emails with syncmaildir... 

So clearly that is the source of the bug.

Recovery

Emergency stop on curie:

notmuch dump > notmuch.dump
systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer

On marcos (the server), guessed the number of messages delivered since the last backup to be 71, just looking at timestamps in the mail log. Made a list:

grep postfix/local /var/log/mail.log | tail -71 > lost-mail

Found postfix queue IDs:

sed 's/.*\]://;s/:.*//' lost-mail > qids
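
To see what that sed expression does, here is a small sketch run on a single log line. The line itself is made up (queue ID, PID and address are hypothetical), but it follows the usual postfix/local layout:

```shell
# Hypothetical postfix/local log line, for illustration only.
sample='Jun 28 16:20:01 marcos postfix/local[12345]: 4F1A2B3C4D: to=<user@example.com>, relay=local, status=sent'

# Same expression as above: drop everything up to "]:", then everything
# from the next colon onwards.
qid_raw=$(printf '%s\n' "$sample" | sed 's/.*\]://;s/:.*//')

# The result keeps a leading space; "read" strips it, which is why the
# later "while read qid" loops still get a clean queue ID.
read -r qid <<EOF
$qid_raw
EOF
echo "[$qid_raw] -> [$qid]"
```

The leading space is harmless here only because every later step reads the file with `read`.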

Turn those into message IDs, find those that are missing from the disk (had previously run notmuch new just to be sure it's up to date):

while read -r qid ; do
    grep "$qid: message-id" /var/log/mail.log
done < qids | sed 's/.*message-id=<//;s/>//' | while read -r msgid; do
    # -x matches the whole line, so a count such as 10 is not mistaken for 0
    sudo -u anarcat notmuch count --exclude=false "id:$msgid" | grep -qx 0 && echo "$msgid"
done
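
The two halves of that pipeline can be tried in isolation. This is a sketch on made-up data: the log line and message ID are hypothetical, and `is_zero` is a helper name invented here. A count for an `id:` query should normally be 0 or 1, so matching the whole line with `-x` is only a defensive choice, since a plain `grep -q 0` would also match a count like 10.

```shell
# Hypothetical postfix/cleanup log line, for illustration only.
sample='Jun 28 16:20:01 marcos postfix/cleanup[12345]: 4F1A2B3C4D: message-id=<abc123@example.com>'

# Same sed expression as in the loop: keep what sits between < and >.
msgid=$(printf '%s\n' "$sample" | sed 's/.*message-id=<//;s/>//')
echo "$msgid"

# is_zero reads a count on stdin and succeeds only if it is exactly "0".
is_zero() {
    grep -qx 0
}

printf '0\n'  | is_zero && echo "0 is treated as missing"
printf '10\n' | is_zero || echo "10 is not treated as missing"
```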

Copy this back on curie as missing-msgids and:

$ wc -l missing-msgids 
48 missing-msgids
$ while read msgid ; do notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid ; done < missing-msgids
mailman.189.1624881611.23397.nodes-reseaulibre.ca@reseaulibre.ca
AnwMy7rdSpK-N-vt4AiOag@ismtpd0148p1mdw1.sendgrid.net

only two mails missing! whoohoo!

Copy those back onto marcos as really-missing-msgids, and look at the full mail logs to see what they are:

~anarcat/src/koumbit-scripts/mail/postfix-trace --from-file really-missing-msgids2

I actually remembered deleting those, so no mail lost!

Rebuild the list of msgids that were lost, on marcos:

while read qid ; do grep "$qid: message-id" /var/log/mail.log; done < qids  | sed 's/.*message-id=<//;s/>//'

Copy that on curie as lost-mail-msgids, then copy the files over in a test dir:

while read msgid ; do
    notmuch search --output=files --exclude=false "id:$msgid"
done < lost-mail-msgids | sed 's#/home/anarcat/Maildir/##' | rsync -v  --files-from=- /home/anarcat/Maildir/ shell.anarc.at:restore/Maildir-angela/

If that looks about right, on marcos:

find restore/Maildir-angela/ -type f | wc -l

... should match the number of missing mails, roughly.
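
The sed step in that pipeline matters because rsync's --files-from expects paths relative to the source directory. A minimal sketch of the path rewriting, on a made-up file name (`to_relative` is a hypothetical helper, and unlike the original expression it anchors the match at the start of the line):

```shell
# Maildir path used in this post; the sample file name below is made up.
MAILDIR=/home/anarcat/Maildir/

# Strip the Maildir prefix so each path is relative to the rsync source.
to_relative() {
    sed "s#^$MAILDIR##"
}

rel=$(printf '%s\n' "${MAILDIR}INBOX/cur/1624881611.M1P2.marcos:2,S" | to_relative)
echo "$rel"

# A dry run (-n) would list what rsync intends to copy without writing:
# ... | to_relative | rsync -n -v --files-from=- "$MAILDIR" shell.anarc.at:restore/Maildir-angela/
```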

Copy it into the real spool:

while read msgid ; do
    notmuch search --output=files --exclude=false "id:$msgid"
done < lost-mail-msgids | sed 's#/home/anarcat/Maildir/##' | rsync -v  --files-from=- /home/anarcat/Maildir/ shell.anarc.at:Maildir/

Then on the server, notmuch new should find the new emails, and we shouldn't have any lost mail anymore:

while read qid ; do grep "$qid: message-id" /var/log/mail.log; done < qids  | sed 's/.*message-id=<//;s/>//' | while read msgid; do sudo -u anarcat notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid ; done

Then, crucial moment, try to pull the new mails from the backups on curie:

anarcat@curie:~(main)$ smd-pull  -n  --show-tags -v
Found lockfile of a dead instance. Ignored.
Phase 0: handshake
Phase 1: changes detection
    5K emails scanned
   10K emails scanned
   15K emails scanned
   20K emails scanned
   25K emails scanned
   30K emails scanned
   35K emails scanned
   40K emails scanned
   45K emails scanned
   50K emails scanned
Phase 2: synchronization
Phase 3: agreement
default: smd-client@localhost: TAGS: stats::new-mails(49687), del-mails(0), bytes-received(215752279), xdelta-received(3703852)
"smd-pull  -n  --show-tags -v" took 3 mins 39 secs

This brought me back to the state after the backup plus the mails delivered during the day, which meant I would have had to catch up with all the emails I had already read over the holiday (1440 mails!). Thankfully, I had made a dump of the notmuch database on curie at the start of the procedure, so this actually restored a sane state:

pv notmuch.dump | notmuch restore

Phew!

Workaround

I have filed this as a bug in upstream issue 18. Considering I filed 11 issues and only 3 of those were closed, I'm not holding my breath. I nevertheless filed PR 19 in the hope that this will fix my particular issue, but I'm not even sure this is the right fix...
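
Since the logs point at a pull and a push starting at the same time, one generic way to keep two jobs from overlapping is to run both under the same lock. The following is only a sketch using flock(1) from util-linux. It makes no claim about what PR 19 changes, and the lock path and the `run_locked` helper are hypothetical:

```shell
# Hypothetical lock path for this sketch.
LOCK="${TMPDIR:-/tmp}/smd-sketch.lock"

# Run a command while holding an exclusive lock on $LOCK.
# -n makes flock fail right away if another process holds the lock.
run_locked() {
    flock -n "$LOCK" "$@"
}

# Demo: while the outer command holds the lock, a second attempt on the
# same file is refused instead of running in parallel.
run_locked sh -c "flock -n '$LOCK' true || echo 'second job refused'"
```

With something like this, `run_locked smd-pull` and `run_locked smd-push` could not run at the same time. Whether that is enough to avoid the crash is untested.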

Fix

At this point, I'm really ready to give up on SMD. It's really, really nice to be able to sync mail over SSH because I don't need to store my IMAP password on disk. But surely there are more reliable syncing mechanisms. I do not remember ever losing that much mail before. At worst, offlineimap would duplicate emails like mad, but never destroy my entire mail spool that way.

As mentioned before, there are other programs that sync mail. I'm looking at:

offlineimap over ssh

I'm using offlineimap over ssh, with

preauthtunnel = ssh -q mymailserver /usr/lib/dovecot/imap

Thus my offlineimap doesn't need to know my IMAP password (which happens to match the Unix account login password, since I'm using dovecot with the default configuration).

Comment by Marius Gedminas
other alternative: interimap
  • doveadm-sync: requires dovecot on both ends, but supports using SSH to sync, will try this next

Went down that route as well some years ago, and IIRC that solution is more suitable for bidirectional synchronization of multiple IMAPd in a “ring” topology not a “star” topology. Also at the time I was not able to get incremental synchronization to work, thereby wasting much more bandwidth and CPU cycles than necessary (like OfflineIMAP).

For these reasons I ended up writing my own synchronization software shortly afterwards, which, shameless plug :-), you might want to try too¹: interimap. Like doveadm sync it requires an IMAPd on each end, and can use SSH for transport. It takes advantage of the QRESYNC IMAP extension for incremental synchronization, yielding much better performance than OfflineIMAP.

¹ If you'd like to give it a try, be sure to check the known bugs and limitations section of the manual. It has served me well for almost 6 years now but it's neither as flexible (the use-case is much narrower, although if you've decided to install Dovecot on each end you should be covered) nor as mature as OfflineIMAP.

Comment by guilhem
commenting on 2021-06-29-another-mail-crash

These are maildirs after all: directories and files, so why not use a generic file synchronisation tool? I'm thinking of Syncthing, which will synchronize those in almost real time, with a small footprint and the ability to do backups by itself… Unison should be able to do the job, or more "fun", something like glusterfs…

Comment by Landry
on tools

Answering a bunch of comments at once:

  • Marius: nice trick, noted, thanks!
  • guilhem: excellent story, great background, tool looks awesome, may try it out next, thanks!
  • Landry: i am not sure Syncthing would scale, and i wouldn't trust it with my mail spool. i'm almost 100% certain that Unison would not scale. syncing mail spools is hard: most backup software has a lot of trouble walking those large directory trees, for example, let alone "in real time"...

I've also added mail-sync to the list, recommended by helmut on IRC. Thanks!

Comment by anarcat