The last syncmaildir crash
My syncmaildir (SMD) setup failed me one too many times (previously, previously). In an attempt to migrate to an alternative mail synchronization tool, I looked into using my IMAP server again, and found out my mail spool was in a pretty bad shape. I'm comparing mbsync and offlineimap in the next post but this post talks about how I recovered the mail spool so that tools like those could correctly synchronise the mail spool again.
The latest crash
On Monday, SMD just started failing with this error:
nov 15 16:12:19 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:12:22 angela systemd[2305]: smd-pull.service: Succeeded.
nov 15 16:12:22 angela systemd[2305]: Finished pull emails with syncmaildir.
nov 15 16:14:08 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:14:11 angela systemd[2305]: Failed to start pull emails with syncmaildir.
nov 15 16:16:14 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Unable to get any data from the other endpoint.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: This problem may be transient, please retry.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Hint: did you correctly setup the SERVERNAME variable
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: on your client? Did you add an entry for it in your ssh
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: configuration file?
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error
nov 15 16:16:17 angela smd-pull[27188]: register: smd-client@localhost: TAGS: error::context(handshake) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:16:17 angela systemd[2305]: Failed to start pull emails with syncmaildir.
What is frustrating is that there's actually no network error here. Running the command by hand I did see a different message, but now I have lost it in my backlog. It had something to do with a filename being too long, and I gave up debugging after a while. This happened suddenly too, which added to the confusion.
In a fit of rage I started this blog post and experimenting with alternatives, which led me down a lot of rabbit holes.
Reviewing my previous mail crash
documentation, it seems most
solutions involve talking to an IMAP server, so I figured I would just
do that. Wanting to try something new, i gave isync (AKA
mbsync
) a try. Oh dear, I did not expect how much trouble just
talking to my IMAP server would be, which wasn't not isync's fault,
for what that's worth. It was the primary tool I used to debug things,
and served me well in that regard.
Mailbox corruption
The first thing I found out is that certain messages in the IMAP spool
were corrupted. mbsync
would stop on a FETCH
command and Dovecot
would give me those errors on the server side.
"wrong W value"
nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Maildir filename has wrong W value, renamed the file from /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S to /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495:2,S
nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Deleting corrupted cache record uid=1582: UID 1582: Broken virtual size in mailbox junk: read(/home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S): FETCH BODY[] got too little data: 2540 vs 2578
At least this first error was automatically healed by Dovecot (by
renaming the file without the W=
flag). The problem is that the
FETCH
command fails and mbsync
exits noisily. So you need to
constantly restart mbsync
with a silly command like:
while ! mbsync -a; do sleep 1; done
"cached message size larger than expected"
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=mail stream)
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: Deleting corrupted cache record uid=19288: UID 19288: Broken physical size in mailbox Sent: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288)
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=)
nov 16 13:53:08 marcos dovecot[3520770]: imap-login: Panic: epoll_ctl(del, 7) failed: Bad file descriptor
This second problem is much harder to fix, because dovecot does not
recover automatically. This is Dovecot complaining that the cached
size (the S=
field, but also present in Dovecot's metadata files)
doesn't match the file size.
I wonder if at least some of those messages were corrupted in the OfflineIMAP to syncmaildir migration because part of that procedure is to run the strip_header script to remove content from the emails. That could easily have broken things since the files do not also get renamed.
Workaround
So I read a lot of the Dovecot documentation on the maildir
format, and wrote an extensive fix script for those two
errors. The script worked and mbsync
was able to sync the entire mail
spool.
And no, rebuilding the index files didn't work. Also tried
doveadm force-resync -u anarcat
which didn't do anything.
In the end I also had to do this, because the wrong cache values were also stored elsewhere.
service dovecot stop ; find -name 'dovecot*' -delete; service dovecot start
This would have totally broken any existing clients, but thankfully I'm starting from scratch (except maybe webmail, but I'm hoping it will self-heal as well, assuming it only has a cache and not a full replica of the mail spool).
Incoherence between Maildir and IMAP
Unfortunately, the first mbsync
was incomplete as it was missing about
15,000 mails:
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
384836
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
369221
As it turns out, mbsync
was not at fault here either: this was yet
more mail spool corruption.
It's actually 26 folders (out of 205) with inconsistent sizes, which can be found with:
for folder in * .[^.]* ; do
printf "%s\t%d\n" $folder $(find "$folder" -type f -a \! -name '.*' | wc -l );
done
The special \! -name '.*'
bit is to ignore the mbsync
metadata,
which creates .uidvalidity
and .mbsyncstate
in every folder. That
ignores about 200 files but since they are spread around all folders,
which was making it impossible to review where the problem was.
Here is what the diff looks like:
--- Maildir-list 2021-11-17 20:42:36.504246752 -0500
+++ Maildir-mbsync-list 2021-11-17 20:18:07.731806601 -0500
@@ -6,16 +6,15 @@
[...]
.Archives 1
.Archives.2010 3553
-.Archives.2011 3583
-.Archives.2012 12593
+.Archives.2011 3582
+.Archives.2012 620
.Archives.2013 8576
.Archives.2014 11057
-.Archives.2015 8173
+.Archives.2015 8165
.Archives.2016 54
.band 34
.bitbuck 1
@@ -38,13 +37,12 @@
.couchsurfers 2
-cur 11285
+cur 11280
.current 130
.cv 2
.debbug 262
-.debian 37544
-drafts 1
-.Drafts 4
+.debian 37533
+.Drafts 2
.drone 241
.drupal 188
.drupal-devel 303
[...]
Misfiled messages
It's a bit all over the place, but we can already notice some huge
differences between mailboxes, for example in the Archives
folders. As it turns out, at least 12,000 of those missing mails were
actually misfiled: instead of being in the
Maildir/.Archives.2012/cur/
folder, they were directly in
Maildir/.Archives.2012/
. This is something that doesn't matter for
SMD (and possibly for notmuch? it does matter, notmuch suddenly
found 12,000 new mails) but that definitely matters to Dovecot and
therefore mbsync
...
After moving those files around, we still have 4,000 message missing:
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
381196
anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l
385053
The problem is that those 4,000 missing mails are harder to
track. Take, for example, .Archives.2011
, which has a single message
missing, out of 3,582. And the files are not identical: the checksums
don't match after going through the IMAP transport, so we can't use a
tool like hashdeep to compare the trees and find why any single
file is missing.
"register" folder
One big chunk of the 4,000, however, is a special folder called
register
in my spool, which I am syncing separately (see Securing
registration email for details on that setup). That actually
covers 3,700 of those messages, so I actually have a more modest 300
messages to figure out, after (easily!) configuring mbsync
to sync
that folder separately:
@@ -30,9 +33,29 @@ Slave :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
-Patterns *
+#Patterns *
+Patterns * !register !.register
# Automatically create missing mailboxes, both locally and on the server
#Create Both
Create slave
# Sync the movement of messages between folders and deletions, add after making sure the sync works
#Expunge Both
+
+IMAPAccount anarcat-register
+Host imap.anarc.at
+User register
+PassCmd "pass imap.anarc.at-register"
+SSLType IMAPS
+CertificateFile /etc/ssl/certs/ca-certificates.crt
+
+IMAPStore anarcat-register-remote
+Account anarcat-register
+
+MaildirStore anarcat-register-local
+SubFolders Maildir++
+Inbox ~/Maildir-mbsync/.register/
+
+Channel anarcat-register
+Master :anarcat-register-remote:
+Slave :anarcat-register-local:
+Create slave
"tmp" folders and empty messages
After syncing the "register" messages, I end up with the measly little 160 emails out of sync:
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
384900
anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l
385059
Argh. After more digging, I have found 131 mails in the tmp/
directories of the client's mail spool. Mysterious! On the server
side, it's even more files, and not the same ones. Possible that
those were mails that were left there during a failed delivery of some
sort, during a power failure or some sort of crash? Who knows. It
could be another race condition in SMD if it runs while mail is being
delivered in tmp/
...
The first thing to do with those is to cleanup a bunch of empty files (21 on angela):
find .[^.]*/tmp -type f -empty -delete
As it turns out, they are all duplicates, in the sense that notmuch can easily find a copy of files with the same message ID in its database. In other words, this hairy command returns nothing
find .[^.]*/tmp -type f | while read path; do
msgid=$(grep -m 1 -i ^message-id "$path" | sed 's/Message-ID: //i;s/[<>]//g');
if notmuch count --exclude=false "id:$msgid" | grep -q 0; then
echo "$path <$msgid> not in notmuch" ;
fi;
done
... which is good. Or, to put it another way, this is safe:
find .[^.]*/tmp -type f -delete
Poof! 314 mails cleaned on the server side. Interestingly, SMD doesn't
pick up on those changes at all and still sees files in tmp/
directories on the client side, so we need to operate the same twisted
logic there.
notmuch to the rescue again
After cleaning that on the client, we get:
anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l
384928
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
384901
Ha! 27 mails difference. Those are the really sticky, unclear ones. I was hoping a full sync might clear that up, but after deleting the entire directory and starting from scratch, I end up with:
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
385034
anarcat@angela:~(main)$ find Maildir-mbsync -type f -type f -a \! -name '.*' | wc -l
384993
That is: even more messages missing (now 37). Sigh.
Thankfully, this is something notmuch can help with:
it can index all files by Message-ID
(which I learned is
case-insensitive, yay) and tell us which messages don't make it
through.
Considering the corruption I found in the mail spool, I wouldn't be the least surprised those messages are just skipped by the IMAP server. Unfortunately, there's nothing on the Dovecot server logs that would explain the discrepancy.
Here again, notmuch comes to the rescue. We can list all message IDs to figure out that discrepancy:
notmuch search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-msgids
notmuch --config=.notmuch-config-mbsync search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-mbsync-msgids
And then we can see how many messages notmuch thinks are missing:
$ wc -l *msgids
372723 Maildir-mbsync-msgids
372752 Maildir-msgids
That's 29 messages. Oddly, it doesn't exactly match the find
output:
anarcat@angela:~(main)$ find Maildir-mbsync -type f -type f -a \! -name '.*' | wc -l
385204
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
385241
That is 10 more messages. Ugh. But actually, I know what those are:
more misfiled messages (in a .folder/draft/
directory, bizarrely, so
the totals actually match.
In the notmuch output, there's a lot of stuff like this:
id:notmuch-sha1-fb880d673e24f5dae71b6b4d825d4a0d5d01cde4
Those are messages without a valid Message-ID. Notmuch (presumably) constructs one based on the file's checksum. Because the files differ between the IMAP server and the local mail spool (which is unfortunate, but possibly inevitable), those do not match. There are exactly the same number of those on both sides, so I'll go ahead and assume those are all accounted for.
What remains is:
anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids | grep '^\-[^-]' | grep -v sha1 | wc -l
2
anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids | grep '^\+[^+]' | grep -v sha1 | wc -l
21
anarcat@angela:~(main)$
ie. 21 missing from mbsync
, and, surprisingly, 2 missing from the
original mail spool.
Further inspection also showed they were all messages with some sort of "corruption": no body and only headers. I am not sure that is a legal email format in the first place. Since they were mostly spam or administrative emails ("You have been unsubscribed from mailing list..."), it seems fairly harmless to ignore those.
Update: this actually seems to be a bug in mbsync, in the end. Indeed, RFC5322 clearly states that:
3.5. Overall Message Syntax
A message consists of header fields, optionally followed by a message body.
In other words, it's perfectly fine to have a message that is made of
only headers and no body. Mailman unsubscribe confirmations and mail
< foo
will produce such messages, so I have reported this on the
isync-devel mailing list and the maintainer agreed that had to be
fixed.
Conclusion
As we'll see in the next article, SMD has stellar performance. But that comes at a huge cost: it accesses the mail storage directly. This can (and has) created significant problems on the mail server. It's unclear exactly why those things happen, but Dovecot expects a particular storage format on its file, and it seems unwise to bypass that.
In the future, I'll try to remember to avoid that, especially since mechanisms like SMD require special server access (SSH) which, in the long term, I am not sure I want to maintain or expect.
In other words, just talking with an IMAP server opens up a lot more possibilities of hosting than setting up a custom synchronisation protocol over SSH. It's also safer and more reliable, as we have seen. Thankfully, I've been able to recover from all the errors I could find, but it could have gone differently and it would have been possible for SMD to permanently corrupt significant part of my mail archives.
In the end, however, the last drop was just another weird bug which, ironically, SMD mysteriously recovered from on its own while I was writing this documentation and migrating away from it.
In any case, I recommend SMD users start looking for alternatives. The project has been archived upstream, and the Debian package has been orphaned. I have seen significant mail box corruption, including entire mail spool destruction, mostly due to incorrect locking code. I have filed a release-critical bug in Debian to make sure it doesn't ship with Debian bookworm.
Alternatives like mbsync
provide fast and reliable transport,
including over SSH. See the next
article for further discussion of
the alternatives.