mbsync vs OfflineIMAP
After recovering from my latest email crash (previously, previously), I had to figure out which tool I should be using. I had many options but I figured I would start with a popular one (mbsync).
But I also evaluated OfflineIMAP, both because it was resurrected from the Python 2 apocalypse and because I had used it before, for a long time.
Read on for the details.
Benchmark setup
All programs were tested against a Dovecot 1:2.3.13+dfsg1-2 server, running Debian bullseye.
The client is a Purism 13v4 laptop with a Samsung SSD 970 EVO 1TB NVMe drive.
The server is a custom build with an AMD Ryzen 5 2600 CPU, and a RAID-1 array made of two NVMe drives (Intel SSDPEKNW010T8 and WDC WDS100T2B0C).
The mail spool I am testing against has almost 400k messages and takes 13GB of disk space:
$ notmuch count --exclude=false
372758
$ du -sh --exclude xapian Maildir
13G Maildir
The baseline we are comparing against is SMD (syncmaildir) which performs the sync in about 7-8 seconds locally (3.5 seconds for each push/pull command) and about 10-12 seconds remotely.
Anything close to that or better is good enough. I do not have recent numbers for a SMD full sync baseline, but the setup documentation mentions 20 minutes for a full sync. That was a few years ago, and the spool has obviously grown since then, so that is not a reliable baseline.
A baseline for a full sync might also be set with rsync, which copies files at nearly 40MB/s, or 317Mb/s!
anarcat@angela:tmp(main)$ time rsync -a --info=progress2 --exclude xapian shell.anarc.at:Maildir/ Maildir/
12,647,814,731 100% 37.85MB/s 0:05:18 (xfr#394981, to-chk=0/395815)
72.38user 106.10system 5:19.59elapsed 55%CPU (0avgtext+0avgdata 15988maxresident)k
8816inputs+26305112outputs (0major+50953minor)pagefaults 0swaps
That is 5 minutes to transfer the entire spool. Incremental syncs are obviously pretty fast too:
anarcat@angela:tmp(main)$ time rsync -a --info=progress2 --exclude xapian shell.anarc.at:Maildir/ Maildir/
0 0% 0.00kB/s 0:00:00 (xfr#0, to-chk=0/395815)
1.42user 0.81system 0:03.31elapsed 67%CPU (0avgtext+0avgdata 14100maxresident)k
120inputs+0outputs (3major+12709minor)pagefaults 0swaps
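The throughput quoted for the full rsync run above can be double-checked from the byte count and the elapsed time in its output. This is a minimal sketch, assuming decimal megabytes; the helper name `throughput` is made up for this example:

```python
def throughput(num_bytes: int, seconds: float) -> tuple[float, float]:
    """Return (megabytes per second, megabits per second), decimal units."""
    mb_per_s = num_bytes / seconds / 1e6
    return mb_per_s, mb_per_s * 8


# Figures copied from the full rsync run above.
total_bytes = 12_647_814_731
elapsed = 5 * 60 + 19.59  # "5:19.59elapsed"

mb, mbit = throughput(total_bytes, elapsed)
print(f"{mb:.1f} MB/s, {mbit:.0f} Mb/s")
```

Dividing by the wall-clock time gives a slightly higher figure than the 37.85MB/s that rsync reports itself, which is where the "nearly 40MB/s" comes from.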
As an extra curiosity, here's the performance with tar, pretty similar to rsync, minus incremental which I cannot be bothered to figure out right now:
anarcat@angela:tmp(main)$ time ssh shell.anarc.at tar --exclude xapian -cf - Maildir/ | pv -s 13G | tar xf -
56.68user 58.86system 5:17.08elapsed 36%CPU (0avgtext+0avgdata 8764maxresident)k
0inputs+0outputs (0major+7266minor)pagefaults 0swaps
12,1GiO 0:05:17 [39,0MiB/s] [===================================================================> ] 92%
Interesting that rsync manages to almost beat a plain tar on file transfer, I'm actually surprised by how well it performs here, considering there are many little files to transfer.
(But then again, this maybe is exactly where rsync shines: while tar needs to glue all those little files together, rsync can just directly talk to the other side and tell it to do live changes. Something to look at in another article maybe?)
Since both ends are NVMe drives, those should easily saturate a gigabit link. And in fact, a backup of the server mail spool achieves much faster transfer rate on disks:
anarcat@marcos:~$ tar fc - Maildir | pv -s 13G > Maildir.tar
15,0GiO 0:01:57 [ 131MiB/s] [===================================] 115%
That's 131 MiB per second (roughly 1.1 Gbit/s), more than the gigabit link can carry. The client has similar performance:
anarcat@angela:~(main)$ tar fc - Maildir | pv -s 17G > Maildir.tar
16,2GiO 0:02:22 [ 116MiB/s] [==================================] 95%
So those disks should be able to saturate a gigabit link, and they are not the bottleneck on fast links. Which begs the question of what is blocking performance of a similar transfer over the gigabit link, but that's another question altogether, because no sync program ever reaches the above performance anyways.
Finally, note that when I migrated to SMD, I wrote a small performance comparison that could be interesting here. It shows SMD to be faster than OfflineIMAP, but not by as much as we see here. In fact, it looks like OfflineIMAP slowed down significantly since then (May 2018), but this could be due to my larger mail spool as well.
mbsync
The isync (AKA mbsync) project is written in C and supports syncing Maildir and IMAP folders, with possibly multiple replicas. I haven't tested this but I suspect it might be possible to sync between two IMAP servers as well. It supports partial mirrors, message flags, full folder support, and "trash" functionality.
Complex configuration file
I started with this .mbsyncrc configuration file:
SyncState *
Sync New ReNew Flags
IMAPAccount anarcat
Host imap.anarc.at
User anarcat
PassCmd "pass imap.anarc.at"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt
IMAPStore anarcat-remote
Account anarcat
MaildirStore anarcat-local
# Maildir/top/sub/sub
#SubFolders Verbatim
# Maildir/.top.sub.sub
SubFolders Maildir++
# Maildir/top/.sub/.sub
# SubFolders legacy
# The trailing "/" is important
#Path ~/Maildir-mbsync/
Inbox ~/Maildir-mbsync/
Channel anarcat
# AKA Far, convert when all clients are 1.4+
Master :anarcat-remote:
# AKA Near
Slave :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
Patterns *
# Automatically create missing mailboxes, both locally and on the server
#Create Both
Create slave
# Sync the movement of messages between folders and deletions, add after making sure the sync works
#Expunge Both
Long gone are the days where I would spend a long time reading a manual page to figure out the meaning of every option. If that's your thing, you might like this one. But I'm more of a "EXAMPLES section" kind of person now, and I somehow couldn't find a sample file on the website. I started from the Arch wiki one but it's actually not great because it's made for Gmail (which is not a usual Dovecot server). So a sample config file in the manpage would be a great addition. Thankfully, the Debian package ships one in /usr/share/doc/isync/examples/mbsyncrc.sample but I only found that after I wrote my configuration. It was still useful and I recommend people take a look if they want to understand the syntax.
Also, that syntax is a little overly complicated. For example, Far needs colons, like:
Far :anarcat-remote:
Why? That seems just too complicated. I also found that sections are not clearly identified: IMAPAccount and Channel mark section beginnings, for example, which is not at all obvious until you learn about mbsync's internals. There are also weird ordering issues: the SyncState option needs to be before IMAPAccount, presumably because it's global.
Using a more standard format like .INI or TOML could improve that situation.
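To make that point concrete, here is a sketch of what a hypothetical INI rendition of part of the configuration could look like, parsed with Python's standard `configparser`. To be clear: mbsync does not read this format, and the section naming (`[IMAPAccount anarcat]`, `[global]`) is purely an assumption made for illustration.

```python
import configparser

# Hypothetical INI rendition of part of the configuration above.
# mbsync does NOT read this format; the section names are invented.
HYPOTHETICAL_INI = """
[global]
SyncState = *

[IMAPAccount anarcat]
Host = imap.anarc.at
User = anarcat
SSLType = IMAPS

[Channel anarcat]
Far = anarcat-remote
Near = anarcat-local
Patterns = *
"""

parser = configparser.ConfigParser()
parser.optionxform = str  # keep option names case-sensitive
parser.read_string(HYPOTHETICAL_INI)

for section in parser.sections():
    print(section, dict(parser[section]))
```

In such a layout, section boundaries are explicit, global options live in their own section instead of depending on ordering, and store references would not need the surrounding colons.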
Stellar performance
A transfer of the entire mail spool takes 56 minutes and 6 seconds, which is impressive.
It's not quite "line rate": the resulting mail spool was 12GB (which is a problem, see below), which turns out to be about 29Mbit/s and therefore not maxing the gigabit link, and an order of magnitude slower than rsync.
The incremental runs are roughly 2 seconds, which is even more impressive, as that's actually faster than rsync:
===> multitime results
1: mbsync -a
Mean Std.Dev. Min Median Max
real 2.015 0.052 1.930 2.029 2.105
user 0.660 0.040 0.592 0.661 0.722
sys 0.338 0.033 0.268 0.341 0.387
Those tests were performed with isync 1.3.0-2.2 on Debian bullseye. Tests with a newer isync release originally failed because of a corrupted message that triggered bug 999804 (see below). Running 1.4.3 under valgrind works around the bug, but adds a 50% performance cost, the full sync running in 1h35m.
Once the upstream patch is applied, performance with 1.4.3 is fairly similar, considering that the new sync included the register folder with 4000 messages:
120.74user 213.19system 59:47.69elapsed 9%CPU (0avgtext+0avgdata 105420maxresident)k
29128inputs+28284376outputs (0major+45711minor)pagefaults 0swaps
That is ~13GB in ~60 minutes, which gives us 28.3Mbps. Incrementals are also pretty similar to 1.3.x, again considering the double-connect cost:
===> multitime results
1: mbsync -a
Mean Std.Dev. Min Median Max
real 2.500 0.087 2.340 2.491 2.629
user 0.718 0.037 0.679 0.711 0.793
sys 0.322 0.024 0.284 0.320 0.365
Those tests were all done on a Gigabit link, but what happens on a slower link? My server uplink is slow: 25 Mbps down, 6 Mbps up. There mbsync is worse than the SMD baseline:
===> multitime results
1: mbsync -a
Mean Std.Dev. Min Median Max
real 31.531 0.724 30.764 31.271 33.100
user 1.858 0.125 1.721 1.818 2.131
sys 0.610 0.063 0.506 0.600 0.695
That's 30 seconds for a sync, roughly three times slower than the 10-12 seconds SMD takes remotely.
Great user interface
Compared to OfflineIMAP and (ahem) SMD, the mbsync UI is kind of neat:
anarcat@angela:~(main)$ mbsync -a
Notice: Master/Slave are deprecated; use Far/Near instead.
C: 1/2 B: 204/205 F: +0/0 *0/0 #0/0 N: +1/200 *0/0 #0/0
(Note that nice switch away from slavery-related terms too.)
The display is minimal, and yet informative. It's not obvious what it means at first glance, but the manpage is useful at least for clarifying that:
This represents the cumulative progress over channels, boxes, and messages affected on the far and near side, respectively. The message counts represent added messages, messages with updated flags, and trashed messages, respectively. No attempt is made to calculate the totals in advance, so they grow over time as more information is gathered. (Emphasis mine).
In other words:
C 2/2: channels done/total (2 done out of 2)
B 204/205: mailboxes done/total (204 out of 205)
F: changes on the far side
N: +10/200 *0/0 #0/0: changes on the "near" side:
    +10/200: 10 out of 200 messages downloaded
    *0/0: no flag changed
    #0/0: no message deleted
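The breakdown above can also be expressed as a small parser. This is only a sketch: the regular expression is modelled on the sample progress line shown earlier, which is an assumption about the format and may not hold across mbsync versions, and the field names are made up for readability.

```python
import re

# One "done/total" counter pair such as "1/2" or "204/205".
PAIR = r"(\d+)/(\d+)"
PROGRESS_RE = re.compile(
    rf"C:\s*{PAIR}\s+B:\s*{PAIR}\s+"
    rf"F:\s*\+{PAIR}\s+\*{PAIR}\s+#{PAIR}\s+"
    rf"N:\s*\+{PAIR}\s+\*{PAIR}\s+#{PAIR}"
)

# Hypothetical field names, in the order the counters appear.
FIELDS = [
    "channels", "boxes",
    "far_added", "far_flags", "far_trashed",
    "near_added", "near_flags", "near_trashed",
]


def parse_progress(line: str) -> dict[str, tuple[int, int]]:
    """Map each counter to a (done, total) pair."""
    match = PROGRESS_RE.search(line)
    if match is None:
        raise ValueError(f"not a progress line: {line!r}")
    numbers = [int(n) for n in match.groups()]
    pairs = zip(numbers[0::2], numbers[1::2])
    return dict(zip(FIELDS, pairs))


line = "C: 1/2 B: 204/205 F: +0/0 *0/0 #0/0 N: +1/200 *0/0 #0/0"
print(parse_progress(line))
```

Something like this could feed a one-line summary into the logs, since the progress display itself does not show up under systemd.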
You get used to it, in a good way. It does not, unfortunately, show up when you run it in systemd, which is a bit annoying as I like to see a summary of mail traffic in the logs.
Interoperability issue
In my notmuch setup, I have bound key S to "mark spam", which basically assigns the tag spam to the message and removes a bunch of others. Then I have a notmuch-purge script which moves that message to the spam folder, for training purposes. It basically does this:
notmuch search --output=files --format=text0 "$search_spam" \
| xargs -r -0 mv -t "$HOME/Maildir/${PREFIX}junk/cur/"
This method, which worked fine in SMD (and also OfflineIMAP), created this error on sync:
Maildir error: duplicate UID 37578.
And indeed, there are now two messages with that UID in the mailbox:
anarcat@angela:~(main)$ find Maildir/.junk/ -name '*U=37578*'
Maildir/.junk/cur/1637427889.134334_2.angela,U=37578:2,S
Maildir/.junk/cur/1637348602.2492889_221804.angela,U=37578:2,S
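The `find` command above checks a single UID. A more general check for any duplicated UID in a folder could look like the following sketch, which assumes the UID always appears in the file name as `,U=<number>`, as it does in the two paths above; the function name is hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: the UID is embedded in the file name as ",U=<number>".
UID_RE = re.compile(r",U=(\d+)")


def duplicate_uids(folder: Path) -> dict[int, list[str]]:
    """Return {uid: [file names]} for UIDs carried by more than one file."""
    by_uid = defaultdict(list)
    for sub in ("cur", "new"):
        directory = folder / sub
        if not directory.is_dir():
            continue
        for entry in directory.iterdir():
            match = UID_RE.search(entry.name)
            if match:
                by_uid[int(match.group(1))].append(entry.name)
    return {uid: sorted(names) for uid, names in by_uid.items() if len(names) > 1}
```

For the folder above, this would be called as `duplicate_uids(Path.home() / "Maildir/.junk")`.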
This is actually a known limitation or, as mbsync(1) calls it, a "RECOMMENDATION":
When using the more efficient default UID mapping scheme, it is important that the MUA renames files when moving them between Maildir folders. Mutt always does that, while mu4e needs to be configured to do it:
(setq mu4e-change-filenames-when-moving t)
So it seems I would need to fix my script. It's unclear how the paths should be renamed, which is unfortunate, because I would need to change my script to adapt to mbsync, but I can't tell how just from reading the above.
(A manual fix is actually to rename the file to remove the U= field: mbsync will generate a new one and then sync correctly.)
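The manual fix described above could be folded into the move itself. The following is a sketch under that assumption only: it drops the `,U=<number>` field from the file name while moving, and is not how afew or any other tool implements it. The function names are hypothetical.

```python
import re
import shutil
from pathlib import Path

# Assumption: the UID is embedded in the file name as ",U=<number>".
UID_FIELD = re.compile(r",U=\d+")


def strip_uid(name: str) -> str:
    """Drop the ,U=<n> field so a fresh UID can be assigned on the next sync."""
    return UID_FIELD.sub("", name, count=1)


def move_without_uid(message: Path, dest_cur: Path) -> Path:
    """Move a message file into another folder's cur/ without its UID field."""
    target = dest_cur / strip_uid(message.name)
    shutil.move(str(message), str(target))
    return target


print(strip_uid("1637427889.134334_2.angela,U=37578:2,S"))
```

The message flags after the colon are left untouched, so only the UID mapping is reset.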
Fortunately, someone else already fixed that issue: afew, a notmuch tagging script (much puns, such hurt), has a move mode that can rename files correctly, specifically designed to deal with mbsync. I had already been told about afew, but it's one more reason to standardize my notmuch hooks on that project, it looks like.
Update: I have tried to use afew and found it has significant performance issues. It also has a completely different paradigm to what I am used to: it assumes all incoming mail has a new tag and lays its own tags on top of that (inbox, sent, etc). It can only move files from one folder at a time (see this bug) which breaks my spam training workflow. In general, I sync my tags into folders (e.g. ham, spam, sent) and message flags (e.g. inbox is F, unread is "not S", etc), and afew is not well suited for this (although there are hacks that try to fix this). I have worked hard to make my tagging scripts idempotent, and it's something afew doesn't currently have. Still, it would be better to have that code in Python than bash, so maybe I should consider my options here. For now, I'm still using those pre-new and post-new scripts which work around that problem.
Stability issues
The newer release in Debian bookworm (currently at 1.4.3) has stability issues on full sync. I filed bug 999804 in Debian about this, which led to a thread on the upstream mailing list. I have found at least three distinct crashes that could be double-free bugs "which might be exploitable in the worst case", not a reassuring prospect.
The thing is: mbsync is really fast, but the downside of that is that it's written in C, and with that comes a whole set of security issues. The Debian security tracker has only three CVEs on isync, but the above issues show there could be many more.
Reading the source code certainly did not make me very comfortable with trusting it with untrusted data. I considered sandboxing it with systemd (below) but having systemd run as a --user process makes that difficult. I also considered using an apparmor profile but that is not trivial because we need to allow SSH and only some parts of it...
Thankfully, upstream has been diligent at addressing the issues I have found. They provided a patch within a few days which did fix the sync issues.
Update: upstream actually took the issue very seriously. They not only got CVE-2021-44143 assigned for my bug report, they also audited the code and found several more issues collectively identified as CVE-2021-3657, which actually also affect 1.3 (i.e. Debian 11/bullseye/stable). Somehow my corpus doesn't trigger that issue, but it was still considered serious enough to warrant a CVE. So on the one hand: excellent response from upstream; but on the other hand: how many more of those could there be in there?
Automation with systemd
The Arch wiki has instructions on how to set up mbsync as a systemd service. It suggests using the --verbose (-V) flag which is a little intense here, as it outputs 1444 lines of messages.
I have used the following .service file:
[Unit]
Description=Mailbox synchronization service
ConditionHost=!marcos
Wants=network-online.target
After=network-online.target
Before=notmuch-new.service
[Service]
Type=oneshot
ExecStart=/usr/bin/mbsync -a
Nice=10
IOSchedulingClass=idle
NoNewPrivileges=true
[Install]
WantedBy=default.target
And the following .timer:
[Unit]
Description=Mailbox synchronization timer
ConditionHost=!marcos
[Timer]
OnBootSec=2m
OnUnitActiveSec=5m
Unit=mbsync.service
[Install]
WantedBy=timers.target
Note that we trigger notmuch through systemd, with the Before and also by adding mbsync.service to the notmuch-new.service file:
[Unit]
Description=notmuch new
After=mbsync.service
[Service]
Type=oneshot
Nice=10
ExecStart=/usr/bin/notmuch new
[Install]
WantedBy=mbsync.service
An improvement over polling repeatedly with a .timer would be to wake up only on IMAP notify, but neither imapnotify nor goimapnotify seem to be packaged in Debian. It would also not cover for the "sent folder" use case, where we need to wake up on local changes.
Password-less setup
The sample file suggests this should work:
IMAPStore remote
Tunnel "ssh -q host.remote.com /usr/sbin/imapd"
Add BatchMode, restrict to IdentitiesOnly, provide a password-less key just for this, add compression (-C), find the Dovecot imap binary, and you get this:
IMAPAccount anarcat-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C anarcat@imap.anarc.at /usr/lib/dovecot/imap"
And it actually seems to work:
$ mbsync -a
Notice: Master/Slave are deprecated; use Far/Near instead.
C: 0/2 B: 0/1 F: +0/0 *0/0 #0/0 N: +0/0 *0/0 #0/0imap(anarcat): Error: net_connect_unix(/run/dovecot/stats-writer) failed: Permission denied
C: 2/2 B: 205/205 F: +0/0 *0/0 #0/0 N: +1/1 *3/3 #0/0imap(anarcat)<1611280><90uUOuyElmEQlhgAFjQyWQ>: Info: Logged out in=10808 out=15396642 deleted=0 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=1 body_bytes=8087
It's a bit noisy, however. dovecot/imap doesn't have a "usage" to speak of, but even the source code doesn't hint at a way to disable that Error message, so that's unfortunate. That socket is owned by root:dovecot so presumably Dovecot runs the imap process as $user:dovecot, which we can't do here. Oh well?
Interestingly, the SSH setup is not faster than IMAP.
With IMAP:
===> multitime results
1: mbsync -a
Mean Std.Dev. Min Median Max
real 2.367 0.065 2.220 2.376 2.458
user 0.793 0.047 0.731 0.776 0.871
sys 0.426 0.040 0.364 0.434 0.476
With SSH:
===> multitime results
1: mbsync -a
Mean Std.Dev. Min Median Max
real 2.515 0.088 2.274 2.532 2.594
user 0.753 0.043 0.645 0.766 0.804
sys 0.328 0.045 0.212 0.340 0.393
Basically: about 150ms slower. Tolerable.
Migrating from SMD
The above was how I migrated to mbsync on my first workstation. The work on the second one was more streamlined, especially since the corruption on mailboxes was fixed:
install isync, with the patch:
dpkg -i isync_1.4.3-1.1~_amd64.deb
copy all files over from previous workstation to speed up the transfer (optional):
rsync -a --info=progress2 angela:Maildir/ Maildir-mbsync/
rename all files to match new hostname (optional):
find Maildir-mbsync/ -type f -name '*.angela,*' -print0 | rename -0 's/\.angela,/\.curie,/'
trash the notmuch database (optional):
rm -rf Maildir-mbsync/.notmuch/xapian/
disable all smd and notmuch services:
systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer
do one last sync with smd:
smd-pull --show-tags ; smd-push --show-tags ; notmuch new ; notmuch-sync-flagged -v
backup notmuch on the client and server:
notmuch dump | pv > notmuch.dump
backup the maildir on the client and server:
cp -al Maildir Maildir-bak
create the SSH key:
ssh-keygen -t ed25519 -f .ssh/id_ed25519_mbsync
cat .ssh/id_ed25519_mbsync.pub
add to .ssh/authorized_keys on the server, like this:
command="/usr/lib/dovecot/imap",restrict ssh-ed25519 AAAAC...
move old files aside, if present:
mv Maildir Maildir-smd
move new files in place (CRITICAL SECTION BEGINS!):
mv Maildir-mbsync Maildir
run a test sync, only pulling changes:
mbsync --create-near --remove-none --expunge-none --noop anarcat-register
if that works well, try with all mailboxes:
mbsync --create-near --remove-none --expunge-none --noop -a
if that works well, try again with a full sync:
mbsync anarcat-register
mbsync -a
reindex and restore the notmuch database, this should take ~25 minutes:
notmuch new
pv notmuch.dump | notmuch restore
enable the systemd services and retire the smd-* services:
systemctl --user enable mbsync.timer notmuch-new.service
systemctl --user start mbsync.timer
rm ~/.config/systemd/user/smd*
systemctl --user daemon-reload
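The optional hostname rename step in the procedure above uses `find` and `rename`. For reference, a rough Python equivalent could look like this sketch. It assumes, like the shell version, that the hostname appears in file names as `.<host>,`; unlike the shell version it only looks at the file name, not the whole path. The function name is hypothetical.

```python
from pathlib import Path


def rename_host(root: Path, old: str, new: str) -> int:
    """Rename files containing ".<old>," to ".<new>,"; return how many changed."""
    needle, replacement = f".{old},", f".{new},"
    renamed = 0
    # Materialize the listing first so renames do not disturb the walk.
    for path in list(root.rglob("*")):
        if path.is_file() and needle in path.name:
            path.rename(path.with_name(path.name.replace(needle, replacement, 1)))
            renamed += 1
    return renamed
```

With the host names from the procedure, that would be `rename_host(Path("Maildir-mbsync"), "angela", "curie")`.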
During the migration, notmuch helpfully told me the full list of those lost messages:
[...]
Warning: cannot apply tags to missing message: CAN6gO7_QgCaiDFvpG3AXHi6fW12qaN286+2a7ERQ2CQtzjSEPw@mail.gmail.com
Warning: cannot apply tags to missing message: CAPTU9Wmp0yAmaxO+qo8CegzRQZhCP853TWQ_Ne-YF94MDUZ+Dw@mail.gmail.com
Warning: cannot apply tags to missing message: F5086003-2917-4659-B7D2-66C62FCD4128@gmail.com
[...]
Warning: cannot apply tags to missing message: mailman.2.1316793601.53477.sage-members@mailman.sage.org
Warning: cannot apply tags to missing message: mailman.7.1317646801.26891.outages-discussion@outages.org
Warning: cannot apply tags to missing message: notmuch-sha1-000458df6e48d4857187a000d643ac971deeef47
Warning: cannot apply tags to missing message: notmuch-sha1-0079d8e0c3340e6f88c66f4c49fca758ea71d06d
Warning: cannot apply tags to missing message: notmuch-sha1-0194baa4cfb6d39bc9e4d8c049adaccaa777467d
Warning: cannot apply tags to missing message: notmuch-sha1-02aede494fc3f9e9f060cfd7c044d6d724ad287c
Warning: cannot apply tags to missing message: notmuch-sha1-06606c625d3b3445420e737afd9a245ae66e5562
Warning: cannot apply tags to missing message: notmuch-sha1-0747b020f7551415b9bf5059c58e0a637ba53b13
[...]
As detailed in the crash report, all of those were actually innocuous and could be ignored.
Also note that we completely trash the notmuch database because it's actually faster to reindex from scratch than let notmuch slowly figure out that all mails are new and all the old mails are gone. The fresh indexing took:
nov 19 15:08:54 angela notmuch[2521117]: Processed 384679 total files in 23m 41s (270 files/sec.).
nov 19 15:08:54 angela notmuch[2521117]: Added 372610 new messages to the database.
By comparison, a reindexing on top of an existing database ran about twice as slow, at about 120 files/sec.
Current config file
Putting it all together, I ended up with the following configuration file:
SyncState *
Sync All
# IMAP side, AKA "Far"
IMAPAccount anarcat-imap
Host imap.anarc.at
User anarcat
PassCmd "pass imap.anarc.at"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt
IMAPAccount anarcat-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C anarcat@imap.anarc.at /usr/lib/dovecot/imap"
IMAPStore anarcat-remote
Account anarcat-tunnel
# Maildir side, AKA "Near"
MaildirStore anarcat-local
# Maildir/top/sub/sub
#SubFolders Verbatim
# Maildir/.top.sub.sub
SubFolders Maildir++
# Maildir/top/.sub/.sub
# SubFolders legacy
# The trailing "/" is important
#Path ~/Maildir-mbsync/
Inbox ~/Maildir/
# what binds Maildir and IMAP
Channel anarcat
Far :anarcat-remote:
Near :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
#Patterns *
Patterns * !register !.register
# Automatically create missing mailboxes, both locally and on the server
Create Both
#Create Near
# Sync the movement of messages between folders and deletions, add after making sure the sync works
Expunge Both
# Propagate mailbox deletion
Remove both
IMAPAccount anarcat-register-imap
Host imap.anarc.at
User register
PassCmd "pass imap.anarc.at-register"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt
IMAPAccount anarcat-register-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C register@imap.anarc.at /usr/lib/dovecot/imap"
IMAPStore anarcat-register-remote
Account anarcat-register-tunnel
MaildirStore anarcat-register-local
SubFolders Maildir++
Inbox ~/Maildir/.register/
Channel anarcat-register
Far :anarcat-register-remote:
Near :anarcat-register-local:
Create Both
Expunge Both
Remove both
Note that it may be out of sync with my live (and private) configuration file, as I do not publish my "dotfiles" repository publicly for security reasons.
New client setup
The above describes the migration from SMD, but a slightly simpler procedure can be used to set up new clients. In this example, we transfer files from the server (marcos) to a new client (emma):
disable systemd services to keep them from hijacking this procedure:
systemctl --user disable mbsync.timer notmuch-new.service notmuch-new.timer
systemctl --user stop mbsync.service mbsync.timer notmuch-new.service notmuch-new.timer
install isync:
apt install isync
create the SSH key:
ssh-keygen -t ed25519 -f .ssh/id_ed25519_mbsync
cat .ssh/id_ed25519_mbsync.pub
add to .ssh/authorized_keys on the server, like this:
command="/usr/lib/dovecot/imap",restrict ssh-ed25519 AAAAC...
backup Maildir and notmuch database on the server:
cp -al Maildir Maildir-bak
notmuch dump | pv > notmuch.dump
move new files in place (CRITICAL SECTION BEGINS!):
mv Maildir-mbsync Maildir
run a test sync, only pulling changes:
sed -i 's/Remove .*/Remove None/;s/Expunge .*/Expunge None/' .mbsyncrc
mbsync --create-near --noop anarcat-register
if that works well, try with all mailboxes:
mbsync --create-near --noop -a
That will yield a LOT of warnings like:
Maildir notice: no UIDVALIDITY, creating new.
Specifically: one per mailbox. That is normal.
if that works well, try again with a full sync:
mbsync anarcat-register
Note that the above will actually sync the full register mailbox, as that mailbox was not covered by the first rsync. That is normal.
mbsync -a
This is without Expunge and Remove. To remove those safeguards, you can revert with:
sed -i 's/Remove .*/Remove Both/;s/Expunge .*/Expunge Both/' .mbsyncrc
And then rerun:
mbsync -a
... which should be a noop.
reindex and restore the notmuch database, this should take ~25 minutes:
notmuch new
ssh -tt marcos.anarc.at pv notmuch.dump > notmuch.dump
pv notmuch.dump | notmuch restore
rm notmuch.dump
test systemd unattended run:
systemctl --user start mbsync.service
enable the systemd services:
systemctl --user enable mbsync.timer notmuch-new.service
systemctl --user start mbsync.timer
You are done.
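The procedure above toggles the Remove and Expunge safeguards with `sed`. The same toggle can be sketched in Python as follows; the function name is hypothetical. One difference worth noting: the `sed` expressions match anywhere in a line, while this sketch only rewrites uncommented directives at the start of a line.

```python
import re

# Matches an uncommented "Remove ..." or "Expunge ..." directive line.
DIRECTIVE = re.compile(r"^([ \t]*)(Remove|Expunge)[ \t]+.*$", re.MULTILINE)


def set_safeguards(config: str, value: str) -> str:
    """Set every uncommented Remove/Expunge directive to `value`."""
    return DIRECTIVE.sub(lambda m: f"{m.group(1)}{m.group(2)} {value}", config)


sample = """\
Create Both
Expunge Both
# Propagate mailbox deletion
Remove both
"""

print(set_safeguards(sample, "None"))
```

Calling it again with `"Both"` restores the original behaviour, which mirrors the two `sed` invocations in the procedure.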
Note: a previous version of this procedure suggested using rsync to copy files and running mbsync to complete the sync. In testing that procedure on emma, it seemed that all messages were copied all over again, so this part of the procedure was removed in favor of the simpler (but slower) mbsync-only procedure. (Slower: it should take 60 minutes instead of 5 minutes with rsync. But since it's a one-time thing, it's a tolerable delay.)
The actual last run of a full mbsync on emma took about an hour:
98.58user 261.84system 1:02:56elapsed 9%CPU (0avgtext+0avgdata 109736maxresident)k
27400inputs+31871760outputs (59major+32660minor)pagefaults 0swaps
And the first notmuch new was another half hour:
anarcat@emma:~$ time notmuch new
Found 437261 total files (that's not much mail).
Processed 437261 total files in 34m 52s (208 files/sec.).
Added 424360 new messages to the database.
1601.60user 171.89system 35:00.93elapsed 84%CPU (0avgtext+0avgdata 552004maxresident)k
31125848inputs+87570304outputs (305major+231166minor)pagefaults 0swaps
Restoring the tags is faster but not by much, about 20 minutes.
OfflineIMAP
I used OfflineIMAP for a long time before switching to SMD. I don't exactly remember why or when I started using it, but I do remember it became painfully slow as I started using notmuch, and would sometimes crash mysteriously. It's been a while, so my memory is hazy on that.
It also kind of died in a fire when Python 2 stopped being maintained. The main author moved on to a different project, imapfw, which could serve as a framework to build IMAP clients, but never seemed to implement all of the OfflineIMAP features and certainly not configuration file compatibility. Thankfully, a new team of volunteers ported OfflineIMAP to Python 3 and we can now test that new version to see if it is an improvement over mbsync.
Crash on full sync
The first thing that happened on a full sync is this crash:
Copy message from RemoteAnarcat:junk:
ERROR: Copying message 30624 [acc: Anarcat]
decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Thread 'Copy message from RemoteAnarcat:junk' terminated with exception:
Traceback (most recent call last):
File "/usr/share/offlineimap3/offlineimap/imaputil.py", line 406, in utf7m_decode
for c in binary.decode():
AttributeError: 'memoryview' object has no attribute 'decode'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/share/offlineimap3/offlineimap/threadutil.py", line 146, in run
Thread.run(self)
File "/usr/lib/python3.9/threading.py", line 892, in run
self._target(*self._args, **self._kwargs)
File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 802, in copymessageto
message = self.getmessage(uid)
File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 342, in getmessage
data = self._fetch_from_imap(str(uid), self.retrycount)
File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 908, in _fetch_from_imap
ndata1 = self.parser['8bit-RFC'].parsebytes(data[0][1])
File "/usr/lib/python3.9/email/parser.py", line 123, in parsebytes
return self.parser.parsestr(text, headersonly)
File "/usr/lib/python3.9/email/parser.py", line 67, in parsestr
return self.parse(StringIO(text), headersonly=headersonly)
File "/usr/lib/python3.9/email/parser.py", line 56, in parse
feedparser.feed(data)
File "/usr/lib/python3.9/email/feedparser.py", line 176, in feed
self._call_parse()
File "/usr/lib/python3.9/email/feedparser.py", line 180, in _call_parse
self._parse()
File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
for retval in self._parsegen():
File "/usr/lib/python3.9/email/feedparser.py", line 298, in _parsegen
for retval in self._parsegen():
File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
for retval in self._parsegen():
File "/usr/lib/python3.9/email/feedparser.py", line 256, in _parsegen
if self._cur.get_content_type() == 'message/delivery-status':
File "/usr/lib/python3.9/email/message.py", line 578, in get_content_type
value = self.get('content-type', missing)
File "/usr/lib/python3.9/email/message.py", line 471, in get
return self.policy.header_fetch_parse(k, v)
File "/usr/lib/python3.9/email/policy.py", line 163, in header_fetch_parse
return self.header_factory(name, value)
File "/usr/lib/python3.9/email/headerregistry.py", line 601, in __call__
return self[name](name, value)
File "/usr/lib/python3.9/email/headerregistry.py", line 196, in __new__
cls.parse(value, kwds)
File "/usr/lib/python3.9/email/headerregistry.py", line 445, in parse
kwds['parse_tree'] = parse_tree = cls.value_parser(value)
File "/usr/lib/python3.9/email/_header_value_parser.py", line 2675, in parse_content_type_header
ctype.append(parse_mime_parameters(value[1:]))
File "/usr/lib/python3.9/email/_header_value_parser.py", line 2569, in parse_mime_parameters
token, value = get_parameter(value)
File "/usr/lib/python3.9/email/_header_value_parser.py", line 2492, in get_parameter
token, value = get_value(value)
File "/usr/lib/python3.9/email/_header_value_parser.py", line 2403, in get_value
token, value = get_quoted_string(value)
File "/usr/lib/python3.9/email/_header_value_parser.py", line 1294, in get_quoted_string
token, value = get_bare_quoted_string(value)
File "/usr/lib/python3.9/email/_header_value_parser.py", line 1223, in get_bare_quoted_string
token, value = get_encoded_word(value)
File "/usr/lib/python3.9/email/_header_value_parser.py", line 1064, in get_encoded_word
text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
File "/usr/lib/python3.9/email/_encoded_words.py", line 181, in decode
string = bstring.decode(charset)
AttributeError: decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Last 1 debug messages logged for Copy message from RemoteAnarcat:junk prior to exception:
thread: Register new thread 'Copy message from RemoteAnarcat:junk' (account 'Anarcat')
ERROR: Exceptions occurred during the run!
ERROR: Copying message 30624 [acc: Anarcat]
decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Traceback:
File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 802, in copymessageto
message = self.getmessage(uid)
File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 342, in getmessage
data = self._fetch_from_imap(str(uid), self.retrycount)
File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 908, in _fetch_from_imap
ndata1 = self.parser['8bit-RFC'].parsebytes(data[0][1])
File "/usr/lib/python3.9/email/parser.py", line 123, in parsebytes
return self.parser.parsestr(text, headersonly)
File "/usr/lib/python3.9/email/parser.py", line 67, in parsestr
return self.parse(StringIO(text), headersonly=headersonly)
File "/usr/lib/python3.9/email/parser.py", line 56, in parse
feedparser.feed(data)
File "/usr/lib/python3.9/email/feedparser.py", line 176, in feed
self._call_parse()
File "/usr/lib/python3.9/email/feedparser.py", line 180, in _call_parse
self._parse()
File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
for retval in self._parsegen():
File "/usr/lib/python3.9/email/feedparser.py", line 298, in _parsegen
for retval in self._parsegen():
File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
for retval in self._parsegen():
File "/usr/lib/python3.9/email/feedparser.py", line 256, in _parsegen
if self._cur.get_content_type() == 'message/delivery-status':
File "/usr/lib/python3.9/email/message.py", line 578, in get_content_type
value = self.get('content-type', missing)
File "/usr/lib/python3.9/email/message.py", line 471, in get
return self.policy.header_fetch_parse(k, v)
File "/usr/lib/python3.9/email/policy.py", line 163, in header_fetch_parse
return self.header_factory(name, value)
File "/usr/lib/python3.9/email/headerregistry.py", line 601, in __call__
return self[name](name, value)
File "/usr/lib/python3.9/email/headerregistry.py", line 196, in __new__
cls.parse(value, kwds)
File "/usr/lib/python3.9/email/headerregistry.py", line 445, in parse
kwds['parse_tree'] = parse_tree = cls.value_parser(value)
File "/usr/lib/python3.9/email/_header_value_parser.py", line 2675, in parse_content_type_header
ctype.append(parse_mime_parameters(value[1:]))
File "/usr/lib/python3.9/email/_header_value_parser.py", line 2569, in parse_mime_parameters
token, value = get_parameter(value)
File "/usr/lib/python3.9/email/_header_value_parser.py", line 2492, in get_parameter
token, value = get_value(value)
File "/usr/lib/python3.9/email/_header_value_parser.py", line 2403, in get_value
token, value = get_quoted_string(value)
File "/usr/lib/python3.9/email/_header_value_parser.py", line 1294, in get_quoted_string
token, value = get_bare_quoted_string(value)
File "/usr/lib/python3.9/email/_header_value_parser.py", line 1223, in get_bare_quoted_string
token, value = get_encoded_word(value)
File "/usr/lib/python3.9/email/_header_value_parser.py", line 1064, in get_encoded_word
text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
File "/usr/lib/python3.9/email/_encoded_words.py", line 181, in decode
string = bstring.decode(charset)
Folder junk [acc: Anarcat]:
Copy message UID 30626 (29008/49310) RemoteAnarcat:junk -> LocalAnarcat:junk
Command exited with non-zero status 100
5252.91user 535.86system 3:21:00elapsed 47%CPU (0avgtext+0avgdata 846304maxresident)k
96344inputs+26563792outputs (1189major+2155815minor)pagefaults 0swaps
That only transferred about 8GB of mail, which gives us a transfer
rate of 5.3Mbit/s, more than 5 times slower than mbsync. This bug is
possibly limited to the bullseye version of offlineimap3 (the lovely
0.0~git20210225.1e7ef9e+dfsg-4), while the current sid version (the
equally gorgeous 0.0~git20211018.e64c254+dfsg-1) seems unaffected.
Tolerable performance
The new release still crashes, except it does so at the very end, which is an improvement, since the mails do get transferred:
*** Finished account 'Anarcat' in 511:12
ERROR: Exceptions occurred during the run!
ERROR: Exception parsing message with ID (<20190619152034.BFB8810E07A@marcos.anarc.at>) from imaplib (response type: bytes).
AttributeError: decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Traceback:
File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
message = self.getmessage(uid)
File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
data = self._fetch_from_imap(str(uid), self.retrycount)
File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 910, in _fetch_from_imap
raise OfflineImapError(
ERROR: Exception parsing message with ID (<40A270DB.9090609@alternatives.ca>) from imaplib (response type: bytes).
AttributeError: decoding with 'x-mac-roman' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Traceback:
File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
message = self.getmessage(uid)
File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
data = self._fetch_from_imap(str(uid), self.retrycount)
File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 910, in _fetch_from_imap
raise OfflineImapError(
ERROR: IMAP server 'RemoteAnarcat' does not have a message with UID '32686'
Traceback:
File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
message = self.getmessage(uid)
File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
data = self._fetch_from_imap(str(uid), self.retrycount)
File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 889, in _fetch_from_imap
raise OfflineImapError(reason, severity)
Command exited with non-zero status 1
8273.52user 983.80system 8:31:12elapsed 30%CPU (0avgtext+0avgdata 841936maxresident)k
56376inputs+43247608outputs (811major+4972914minor)pagefaults 0swaps
"offlineimap -o " took 8 hours 31 mins 15 secs
This is 8h31m for transferring 12G, which is around 3.1Mbit/s. That is
nine times slower than mbsync, almost an order of magnitude!
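As a sanity check on the two OfflineIMAP transfer rates quoted above, here is a minimal Python sketch of the arithmetic. The function name mbit_per_s is made up for illustration, and it assumes decimal units (1GB = 10^9 bytes) and the elapsed times reported by time(1):

```python
def mbit_per_s(size_bytes: float, hours: int, minutes: int, seconds: int = 0) -> float:
    """Average throughput in megabits per second (decimal units assumed)."""
    elapsed = hours * 3600 + minutes * 60 + seconds
    return size_bytes * 8 / elapsed / 1e6


# First (crashed) run: about 8GB in 3h21m
print(round(mbit_per_s(8e9, 3, 21), 1))
# Second (complete) run: about 12GB in 8h31m12s
print(round(mbit_per_s(12e9, 8, 31, 12), 1))
```

Both sizes are rough estimates, so the results are only good to about one significant digit.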
Now that we have a full sync, we can test incremental synchronization. That is also much slower:
===> multitime results
1: sh -c "offlineimap -o || true"
Mean Std.Dev. Min Median Max
real 24.639 0.513 23.946 24.526 25.708
user 23.912 0.473 23.404 23.795 24.947
sys 1.743 0.105 1.607 1.729 2.002
That is also an order of magnitude slower than mbsync, and
significantly slower than what you'd expect from a sync process. ~30
seconds is long enough to make me impatient and distracted; 3 seconds,
less so: I can wait and see the results almost immediately.
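For readers without multitime, a comparable measurement can be sketched in Python. This is only an approximation under stated assumptions: time_command and summarize are hypothetical helpers, only wall-clock ("real") time is recorded, and the standard deviation is the sample one from the statistics module, which may differ from what multitime reports.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Run cmd several times and return the wall-clock duration of each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=False ignores a non-zero exit status, like "|| true" above
        subprocess.run(cmd, check=False,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Summary statistics in the same columns as the multitime table."""
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


# Example (not run here): summarize(time_command(["offlineimap", "-o"]))
```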
Integrity check
That said: this is still on a gigabit link. It's technically
possible that OfflineIMAP performs better than mbsync over a slow
link, but I haven't tested that theory.
The OfflineIMAP mail spool is missing quite a few messages as well:
anarcat@angela:~(main)$ find Maildir-offlineimap -type f -type f -a \! -name '.*' | wc -l
381463
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
385247
... although that's probably all either new messages or the register
folder, so OfflineIMAP might actually be in a better position
there. But digging in more, it seems like the actual per-folder diff
is fairly similar to mbsync: a few messages missing here and
there. Considering OfflineIMAP's instability and poor performance, I
have not looked any deeper into those discrepancies.
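A per-folder comparison like the one described above could be sketched as follows. This is a hypothetical helper, not the method actually used here: it assumes both trees are Maildirs whose messages live in cur/ and new/ subdirectories, skips dotfiles like the find commands above, and only compares counts, so it cannot tell which messages are missing. It also assumes both tools use the same folder naming, which may not hold.

```python
import os
from collections import Counter


def count_messages(maildir):
    """Count message files per folder, looking only at cur/ and new/."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(maildir):
        if os.path.basename(dirpath) not in ("cur", "new"):
            continue
        # The folder is the parent of cur/ or new/, relative to the spool root
        folder = os.path.relpath(os.path.dirname(dirpath), maildir)
        counts[folder] += sum(1 for name in filenames if not name.startswith("."))
    return counts


def folder_diff(left, right):
    """Return {folder: (left_count, right_count)} for folders whose counts differ."""
    a, b = count_messages(left), count_messages(right)
    return {f: (a[f], b[f]) for f in sorted(set(a) | set(b)) if a[f] != b[f]}


# Example (not run here): folder_diff("Maildir", "Maildir-offlineimap")
```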
Other projects to evaluate
Those are all the options I have considered, in alphabetical order:
- doveadm-sync: requires dovecot on both ends, can tunnel over SSH, may have performance issues in incremental sync, written in C
- fdm: fetchmail replacement, IMAP/POP3/stdin/Maildir/mbox/NNTP support, SOCKS support (for Tor), complex rules for delivering to specific mailboxes, adding headers, piping to commands, etc.; discarded because it has no (real) support for keeping mail on the server, written in C
- getmail: fetchmail replacement, IMAP/POP3 support, supports incremental runs, classification rules, Python
- imapnotify (also goimapnotify, python-imapnotify): run a script when "IMAP IDLE" pings, not a puller itself
- imapsync: one-way only, has another list of alternatives
- interimap: syncs two IMAP servers, apparently faster than doveadm and offlineimap, but requires running an IMAP server locally, Perl
- isync/mbsync: TLS client certs and SSH tunnels, fast, incremental, IMAP/POP/Maildir support, multiple mailbox, trash and recursion support, and generally has good words from multiple Debian and notmuch people (Arch tutorial), supports push notifications through imapnotify (see above), written in C, review above
- mail-sync: notify support, happens over any piped transport (e.g. ssh), diff/patch system, requires binary on both ends, mentions UUCP in the manpage, mentions rsmtp which is a nice name for rsendmail; not evaluated because it seems awfully complex to set up, Haskell
- neverest: rust, IMAP/Maildir/Notmuch sync, filters, lacks client TLS support, see comparison, layout incompatible with mbsync, unclear if it supports IDLE/notify
- nncp: treat the local spool as another mail server, not really compatible with my "multiple clients" setup, Golang
- offlineimap3: requires IMAP, used the py2 version in the past, might just still work, first sync painful (IIRC), ways to tunnel over SSH, review above, Python
- runt: IMAP-to-maildir, rust, IDLE and QRESYNC support, can run as a daemon to monitor the filesystem (and server) for changes
Most projects were not evaluated due to lack of time.
Conclusion
I'm now using mbsync to sync my mail. I'm a little disappointed by
the synchronisation times over the slow link, but I guess that's par
for the course if we use IMAP. We are bound by the network speed
much more than with custom protocols. I'm also worried about the C
implementation and the crashes I have witnessed, but I am encouraged
by the fast upstream response.
Time will tell if I will stick with that setup. I'm certainly curious about the promises of interimap and mail-sync, but I have run out of time on this project.