Another syncmaildir crash
So I had another major email crash with my syncmaildir setup. This time I was at least able to confirm the issue, and I still haven't lost mail thanks to backups and sheer luck (again).
The crash
It is not really worth going over the crash in details, it's fairly
similar to the last one: something bad happened and smd
started
destroying everything. The hint is that it takes a long time to do
what usually takes seconds. It helps that I now have a second monitor
showing logs.
I still lost much more mail than the last time. I used to have "301
723 messages", according to notmuch. But then when I ran smd-pull
by hand, it was telling me:
95K emails scanned
Oops. You can see notmuch
happily noticing the destroyed files on
the server:
jun 28 16:33:40 marcos notmuch[28532]: No new mail. Removed 65498 messages. Detected 1699 file renames.
jun 28 16:36:05 marcos notmuch[29746]: No new mail. Removed 68883 messages. Detected 2488 file renames.
jun 28 16:41:40 marcos notmuch[31972]: No new mail. Removed 118295 messages. Detected 3657 file renames.
The final count ended up being 81 042 messages, according to notmuch. A whopping 220 000 mails deleted.
The interesting bit, this time around, is that I caught smd in the act of running two processes in parallel:
jun 28 16:30:09 curie systemd[2845]: Finished pull emails with syncmaildir.
jun 28 16:30:09 curie systemd[2845]: Starting push emails with syncmaildir...
jun 28 16:30:09 curie systemd[2845]: Starting pull emails with syncmaildir...
So clearly that is the source of the bug.
Recovery
Emergency stop on curie:
notmuch dump > notmuch.dump
systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer
On marcos (the server), guessed the number of messages delivered since the last backup to be 71, just looking at timestamps in the mail log. Made a list:
grep postfix/local /var/log/mail.log | tail -71 > lost-mail
Found postfix queue IDs:
sed 's/.*\]://;s/:.*//' lost-mail > qids
Turn those into message IDs, find those that are missing from the disk
(had previously ran notmuch new
just to be sure it's up to date):
while read qid ; do
grep "$qid: message-id" /var/log/mail.log
done < qids | sed 's/.*message-id=<//;s/>//' | while read msgid; do
sudo -u anarcat notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid
done
Copy this back on curie as missing-msgids
and:
$ wc -l missing-msgids
48 missing-msgids
$ while read msgid ; do notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid ; done < missing-msgids
mailman.189.1624881611.23397.nodes-reseaulibre.ca@reseaulibre.ca
AnwMy7rdSpK-N-vt4AiOag@ismtpd0148p1mdw1.sendgrid.net
only two mails missing! whoohoo!
Copy those back onto marcos as really-missing-msgids
, and look at
the full mail logs to see what they are:
~anarcat/src/koumbit-scripts/mail/postfix-trace --from-file really-missing-msgids2
I actually remembered deleting those, so no mail lost!
Rebuild the list of msgids that were lost, on marcos:
while read qid ; do grep "$qid: message-id" /var/log/mail.log; done < qids | sed 's/.*message-id=<//;s/>//'
Copy that on curie as lost-mail-msgids
, then copy the files over in
a test dir:
while read msgid ; do
notmuch search --output=files --exclude=false "id:$msgid"
done < lost-mail-msgids | sed 's#/home/anarcat/Maildir/##' | rsync -v --files-from=- /home/anarcat/Maildir/ shell.anarc.at:restore/Maildir-angela/
If that looks about right, on marcos:
find restore/Maildir-angela/ -type f | wc -l
... should match the number of missing mails, roughly.
Copy if in the real spool:
while read msgid ; do
notmuch search --output=files --exclude=false "id:$msgid"
done < lost-mail-msgids | sed 's#/home/anarcat/Maildir/##' | rsync -v --files-from=- /home/anarcat/Maildir/ shell.anarc.at:Maildir/
Then on the server, notmuch new
should find the new emails, and we
shouldn't have any lost mail anymore:
while read qid ; do grep "$qid: message-id" /var/log/mail.log; done < qids | sed 's/.*message-id=<//;s/>//' | while read msgid; do sudo -u anarcat notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid ; done
Then, crucial moment, try to pull the new mails from the backups on curie:
anarcat@curie:~(main)$ smd-pull -n --show-tags -v
Found lockfile of a dead instance. Ignored.
Phase 0: handshake
Phase 1: changes detection
5K emails scanned
10K emails scanned
15K emails scanned
20K emails scanned
25K emails scanned
30K emails scanned
35K emails scanned
40K emails scanned
45K emails scanned
50K emails scanned
Phase 2: synchronization
Phase 3: agreement
default: smd-client@localhost: TAGS: stats::new-mails(49687), del-mails(0), bytes-received(215752279), xdelta-received(3703852)
"smd-pull -n --show-tags -v" took 3 mins 39 secs
This brought me back to the state after the backup plus the mails delivered during the day, which means I had to catchup with all my holiday's read emails (1440 mails!) but thankfully I made a dump of the notmuch database on curie at the start of the procedure, so this actually restored a sane state:
pv notmuch.dump | notmuch restore
Phew!
Workaround
I have filed this as a bug in upstream issue 18. Considering I filed 11 issues and only 3 of those were closed, I'm not holding my breath. I nevertheless filed PR 19 in the hope that this will fix my particular issue, but I'm not even sure this is the right fix...
Fix
At this point, I'm really ready to give up on SMD. It's really, really nice to be able to sync mail over SSH because I don't need to store my IMAP password on disk. But surely there are more reliable syncing mechanisms. I do not remember ever losing that much mail before. At worst, offlineimap would duplicate emails like mad, but never destroy my entire mail spool that way.
Update: I have migrated to another tool after one more failure from SMD, see the crash log and recovery.
As mentioned before, there are other programs that sync mail. See the mbsync vs OfflineIMAP review the list of alternatives.
I'm using offlineimap over ssh, with
Thus my offlineimap doesn't need to know my IMAP password (which happens to match the Unix account login password, since I'm using dovecot with the default configuration).
Went down that route as well some years ago, and IIRC that solution is more suitable for bidirectional synchronization of multiple IMAPd in a “ring” topology not a “star” topology. Also at the time I was not able to get incremental synchronization to work, thereby wasting much more bandwidth and CPU cycles than necessary (like OfflineIMAP).
For these reasons I ending up writing my own synchronization software shortly afterwards, which — shameless plug — you might want to try too¹: interimap. Like
doveadm sync
it requires an IMAPd on each end, and can use SSH for transport. It takes advantage of the QRESYNC IMAP extension for incremental synchronization, yielding much better performance than OfflineIMAP.¹ If you'd like to give a try be sure to check the known bugs and limitations section of the manual. It has served me well for almost 6 years now but it's neither as flexible (the use-case is much narrower, although if you've decided to install Dovecot on each end you should be covered) nor as mature as OfflineIMAP.
These are maildirs after all: directories and files, why not using a generic file synchronisation tool? I think at Syncthing which will synchronize those in almost real time, a little footprint and the ability to do backups by itself… Unison should be able to do the job, or more "fun", something like glusterfs…
Answering a bunch of comments at once:
I've also added
mail-sync
to the list, recommended by helmut on IRC. Thanks!