My syncmaildir (SMD) setup failed me one too many times (previously, previously). In an attempt to migrate to an alternative mail synchronization tool, I looked into using my IMAP server again, and found out my mail spool was in a pretty bad shape. I'm comparing mbsync and offlineimap in the next post but this post talks about how I recovered the mail spool so that tools like those could correctly synchronise the mail spool again.

  1. The latest crash
  2. Mailbox corruption
    1. "wrong W value"
    2. "cached message size larger than expected"
    3. Workaround
  3. Incoherence between Maildir and IMAP
    1. Misfiled messages
    2. "register" folder
    3. "tmp" folders and empty messages
    4. notmuch to the rescue again
  4. Conclusion

The latest crash

On Monday, SMD just started failing with this error:

nov 15 16:12:19 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:12:22 angela systemd[2305]: smd-pull.service: Succeeded.
nov 15 16:12:22 angela systemd[2305]: Finished pull emails with syncmaildir.
nov 15 16:14:08 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:14:11 angela systemd[2305]: Failed to start pull emails with syncmaildir.
nov 15 16:16:14 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Unable to get any data from the other endpoint.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: This problem may be transient, please retry.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Hint: did you correctly setup the SERVERNAME variable
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: on your client? Did you add an entry for it in your ssh
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: configuration file?
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error
nov 15 16:16:17 angela smd-pull[27188]: register: smd-client@localhost: TAGS: error::context(handshake) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:16:17 angela systemd[2305]: Failed to start pull emails with syncmaildir.

What is frustrating is that there's actually no network error here. Running the command by hand I did see a different message, but now I have lost it in my backlog. It had something to do with a filename being too long, and I gave up debugging after a while. This happened suddenly too, which added to the confusion.

In a fit of rage I started this blog post and experimenting with alternatives, which led me down a lot of rabbit holes.

Reviewing my previous mail crash documentation, it seems most solutions involve talking to an IMAP server, so I figured I would just do that. Wanting to try something new, i gave isync (AKA mbsync) a try. Oh dear, I did not expect how much trouble just talking to my IMAP server would be, which wasn't not isync's fault, for what that's worth. It was the primary tool I used to debug things, and served me well in that regard.

Mailbox corruption

The first thing I found out is that certain messages in the IMAP spool were corrupted. mbsync would stop on a FETCH command and Dovecot would give me those errors on the server side.

"wrong W value"

nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Maildir filename has wrong W value, renamed the file from /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S to /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495:2,S
nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Deleting corrupted cache record uid=1582: UID 1582: Broken virtual size in mailbox junk: read(/home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S): FETCH BODY[] got too little data: 2540 vs 2578

At least this first error was automatically healed by Dovecot (by renaming the file without the W= flag). The problem is that the FETCH command fails and mbsync exits noisily. So you need to constantly restart mbsync with a silly command like:

while ! mbsync -a; do sleep 1; done

"cached message size larger than expected"

nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=mail stream)
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: Deleting corrupted cache record uid=19288: UID 19288: Broken physical size in mailbox Sent: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288)
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=)
nov 16 13:53:08 marcos dovecot[3520770]: imap-login: Panic: epoll_ctl(del, 7) failed: Bad file descriptor

This second problem is much harder to fix, because dovecot does not recover automatically. This is Dovecot complaining that the cached size (the S= field, but also present in Dovecot's metadata files) doesn't match the file size.

I wonder if at least some of those messages were corrupted in the OfflineIMAP to syncmaildir migration because part of that procedure is to run the strip_header script to remove content from the emails. That could easily have broken things since the files do not also get renamed.

Workaround

So I read a lot of the Dovecot documentation on the maildir format, and wrote an extensive fix script for those two errors. The script worked and mbsync was able to sync the entire mail spool.

And no, rebuilding the index files didn't work. Also tried doveadm force-resync -u anarcat which didn't do anything.

In the end I also had to do this, because the wrong cache values were also stored elsewhere.

service dovecot stop ; find -name 'dovecot*' -delete; service dovecot start

This would have totally broken any existing clients, but thankfully I'm starting from scratch (except maybe webmail, but I'm hoping it will self-heal as well, assuming it only has a cache and not a full replica of the mail spool).

Incoherence between Maildir and IMAP

Unfortunately, the first mbsync was incomplete as it was missing about 15,000 mails:

anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l 
384836
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l 
369221

As it turns out, mbsync was not at fault here either: this was yet more mail spool corruption.

It's actually 26 folders (out of 205) with inconsistent sizes, which can be found with:

for folder in * .[^.]* ; do 
  printf "%s\t%d\n" $folder $(find "$folder" -type f -a \! -name '.*' | wc -l );
done

The special \! -name '.*' bit is to ignore the mbsync metadata, which creates .uidvalidity and .mbsyncstate in every folder. That ignores about 200 files but since they are spread around all folders, which was making it impossible to review where the problem was.

Here is what the diff looks like:

--- Maildir-list    2021-11-17 20:42:36.504246752 -0500
+++ Maildir-mbsync-list 2021-11-17 20:18:07.731806601 -0500
@@ -6,16 +6,15 @@
[...]
 .Archives  1
 .Archives.2010 3553
-.Archives.2011 3583
-.Archives.2012 12593
+.Archives.2011 3582
+.Archives.2012 620
 .Archives.2013 8576
 .Archives.2014 11057
-.Archives.2015 8173
+.Archives.2015 8165
 .Archives.2016 54
 .band  34
 .bitbuck   1
@@ -38,13 +37,12 @@
 .couchsurfers  2
-cur    11285
+cur    11280
 .current   130
 .cv    2
 .debbug    262
-.debian    37544
-drafts 1
-.Drafts    4
+.debian    37533
+.Drafts    2
 .drone 241
 .drupal    188
 .drupal-devel  303
[...]

Misfiled messages

It's a bit all over the place, but we can already notice some huge differences between mailboxes, for example in the Archives folders. As it turns out, at least 12,000 of those missing mails were actually misfiled: instead of being in the Maildir/.Archives.2012/cur/ folder, they were directly in Maildir/.Archives.2012/. This is something that doesn't matter for SMD (and possibly for notmuch? it does matter, notmuch suddenly found 12,000 new mails) but that definitely matters to Dovecot and therefore mbsync...

After moving those files around, we still have 4,000 message missing:

anarcat@angela:~(main)$ find Maildir-mbsync/  -type f -a \! -name '.*' | wc -l 
381196
anarcat@angela:~(main)$ find Maildir/  -type f -a \! -name '.*' | wc -l 
385053

The problem is that those 4,000 missing mails are harder to track. Take, for example, .Archives.2011, which has a single message missing, out of 3,582. And the files are not identical: the checksums don't match after going through the IMAP transport, so we can't use a tool like hashdeep to compare the trees and find why any single file is missing.

"register" folder

One big chunk of the 4,000, however, is a special folder called register in my spool, which I am syncing separately (see Securing registration email for details on that setup). That actually covers 3,700 of those messages, so I actually have a more modest 300 messages to figure out, after (easily!) configuring mbsync to sync that folder separately:

 @@ -30,9 +33,29 @@ Slave :anarcat-local:
  # Exclude everything under the internal [Gmail] folder, except the interesting folders
  #Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
  # Or include everything
 -Patterns *
 +#Patterns *
 +Patterns * !register  !.register
  # Automatically create missing mailboxes, both locally and on the server
  #Create Both
  Create slave
  # Sync the movement of messages between folders and deletions, add after making sure the sync works
  #Expunge Both
 +
 +IMAPAccount anarcat-register
 +Host imap.anarc.at
 +User register
 +PassCmd "pass imap.anarc.at-register"
 +SSLType IMAPS
 +CertificateFile /etc/ssl/certs/ca-certificates.crt
 +
 +IMAPStore anarcat-register-remote
 +Account anarcat-register
 +
 +MaildirStore anarcat-register-local
 +SubFolders Maildir++
 +Inbox ~/Maildir-mbsync/.register/
 +
 +Channel anarcat-register
 +Master :anarcat-register-remote:
 +Slave :anarcat-register-local:
 +Create slave

"tmp" folders and empty messages

After syncing the "register" messages, I end up with the measly little 160 emails out of sync:

anarcat@angela:~(main)$ find Maildir-mbsync/  -type f -a \! -name '.*' | wc -l 
384900
anarcat@angela:~(main)$ find Maildir/  -type f -a \! -name '.*' | wc -l 
385059

Argh. After more digging, I have found 131 mails in the tmp/ directories of the client's mail spool. Mysterious! On the server side, it's even more files, and not the same ones. Possible that those were mails that were left there during a failed delivery of some sort, during a power failure or some sort of crash? Who knows. It could be another race condition in SMD if it runs while mail is being delivered in tmp/...

The first thing to do with those is to cleanup a bunch of empty files (21 on angela):

find .[^.]*/tmp -type f -empty -delete

As it turns out, they are all duplicates, in the sense that notmuch can easily find a copy of files with the same message ID in its database. In other words, this hairy command returns nothing

find .[^.]*/tmp -type f | while read path; do
  msgid=$(grep -m 1  -i ^message-id "$path" | sed 's/Message-ID: //i;s/[<>]//g');
  if notmuch count --exclude=false  "id:$msgid" | grep -q 0; then
    echo "$path <$msgid> not in notmuch" ;
  fi;
done

... which is good. Or, to put it another way, this is safe:

find .[^.]*/tmp -type f -delete

Poof! 314 mails cleaned on the server side. Interestingly, SMD doesn't pick up on those changes at all and still sees files in tmp/ directories on the client side, so we need to operate the same twisted logic there.

notmuch to the rescue again

After cleaning that on the client, we get:

anarcat@angela:~(main)$ find Maildir/  -type f -a \! -name '.*' | wc -l 
384928
anarcat@angela:~(main)$ find Maildir-mbsync/  -type f -a \! -name '.*' | wc -l 
384901

Ha! 27 mails difference. Those are the really sticky, unclear ones. I was hoping a full sync might clear that up, but after deleting the entire directory and starting from scratch, I end up with:

anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l 
385034
anarcat@angela:~(main)$ find Maildir-mbsync -type f -type f -a \! -name '.*' | wc -l 
384993

That is: even more messages missing (now 37). Sigh.

Thankfully, this is something notmuch can help with: it can index all files by Message-ID (which I learned is case-insensitive, yay) and tell us which messages don't make it through.

Considering the corruption I found in the mail spool, I wouldn't be the least surprised those messages are just skipped by the IMAP server. Unfortunately, there's nothing on the Dovecot server logs that would explain the discrepancy.

Here again, notmuch comes to the rescue. We can list all message IDs to figure out that discrepancy:

notmuch search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-msgids
notmuch --config=.notmuch-config-mbsync search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-mbsync-msgids

And then we can see how many messages notmuch thinks are missing:

$ wc -l *msgids
372723 Maildir-mbsync-msgids
372752 Maildir-msgids

That's 29 messages. Oddly, it doesn't exactly match the find output:

anarcat@angela:~(main)$ find Maildir-mbsync -type f -type f -a \! -name '.*' | wc -l 
385204
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l 
385241

That is 10 more messages. Ugh. But actually, I know what those are: more misfiled messages (in a .folder/draft/ directory, bizarrely, so the totals actually match.

In the notmuch output, there's a lot of stuff like this:

id:notmuch-sha1-fb880d673e24f5dae71b6b4d825d4a0d5d01cde4

Those are messages without a valid Message-ID. Notmuch (presumably) constructs one based on the file's checksum. Because the files differ between the IMAP server and the local mail spool (which is unfortunate, but possibly inevitable), those do not match. There are exactly the same number of those on both sides, so I'll go ahead and assume those are all accounted for.

What remains is:

anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids  | grep '^\-[^-]' | grep -v sha1 | wc -l 
2
anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids  | grep '^\+[^+]' | grep -v sha1 | wc -l 
21
anarcat@angela:~(main)$ 

ie. 21 missing from mbsync, and, surprisingly, 2 missing from the original mail spool.

Further inspection also showed they were all messages with some sort of "corruption": no body and only headers. I am not sure that is a legal email format in the first place. Since they were mostly spam or administrative emails ("You have been unsubscribed from mailing list..."), it seems fairly harmless to ignore those.

Conclusion

As we'll see in the next article, SMD has stellar performance. But that comes at a huge cost: it accesses the mail storage directly. This can (and has) created significant problems on the mail server. It's unclear exactly why those things happen, but Dovecot expects a particular storage format on its file, and it seems unwise to bypass that.

In the future, I'll try to remember to avoid that, especially since mechanisms like SMD require special server access (SSH) which, in the long term, I am not sure I want to maintain or expect.

In other words, just talking with an IMAP server opens up a lot more possibilities of hosting than setting up a custom synchronisation protocol over SSH. It's also safer and more reliable, as we have seen. Thankfully, I've been able to recover from all the errors I could find, but it could have gone differently and it would have been possible for SMD to permanently corrupt significant part of my mail archives.

In the end, however, the last drop was just another weird bug which, ironically, SMD mysteriously recovered from on its own while I was writing this documentation and migrating away from it.

In any case, I recommend SMD users start looking for alternatives. The project has been archived upstream, and the Debian package has been orphaned. I have seen significant mail box corruption, including entire mail spool destruction, mostly due to incorrect locking code. I have filed a release-critical bug in Debian to make sure it doesn't ship with Debian bookworm.

Alternatives like mbsync provide fast and reliable transport, including over SSH. See the next article for further discussion of the alternatives.

Created . Edited .