This morning, the internet was down at home. The last time I had such an issue was in February 2023, when my provider was Oricom. Now I'm on a business service with Teksavvy Internet (TSI), for which I pay $100 per month for a 250/50 Mbps package with a static IP address, on which I run, well, everything: email services, this website, etc.

Mitigation

Email

The main problem when the service goes down for a prolonged period like this is email. Mail is pretty resilient to failures like this, but after some delay (which varies according to the other end), mail starts to drop. I am actually not sure what the various settings are among different providers, but I would assume mail is typically kept for about 24h, so that's our mark.
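To make that mark concrete, here is a minimal sketch of the arithmetic. It assumes senders really do retry for about 24 hours, which is only my guess from above and not a measured value (many mail servers retry for longer). The function names and the date are made up for illustration; only the 6:52 and 12:08 times of day come from the timeline at the end of this note.

```python
from datetime import datetime, timedelta, timezone

# Assumption: remote servers keep retrying for roughly this long before
# giving up. This is a guess taken from the text, not a measured value.
ASSUMED_RETRY_WINDOW = timedelta(hours=24)


def bounce_deadline(outage_start, retry_window=ASSUMED_RETRY_WINDOW):
    """Return the time after which queued mail may start to drop."""
    return outage_start + retry_window


def time_left(outage_start, now, retry_window=ASSUMED_RETRY_WINDOW):
    """Return how much of the retry window remains (never negative)."""
    remaining = bounce_deadline(outage_start, retry_window) - now
    return max(remaining, timedelta(0))


if __name__ == "__main__":
    tz = timezone(timedelta(hours=-4))  # times in this note are UTC-4
    # Placeholder date; only the times of day come from the timeline.
    start = datetime(2025, 1, 1, 6, 52, tzinfo=tz)
    now = datetime(2025, 1, 1, 12, 8, tzinfo=tz)
    print("deadline:", bounce_deadline(start).isoformat())
    print("time left:", time_left(start, now))
```

Under that assumption, a backup mail exchanger only has to be up before the deadline, which leaves most of a day to deploy one.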

Last time, I set up VMs at Linode and Digital Ocean to deal better with this. I have actually kept those VMs running as DNS servers until now, so that part is already done.

I had fantasized about Puppetizing the mail server configuration so that I could quickly spin up mail exchangers on those machines. But now I am realizing that my Puppet server is one of the services that are down, so this would not work, at least not unless the manifests can be applied without a Puppet server (say with puppet apply).

Thankfully, my colleague groente did amazing work to refactor our Postfix configuration in Puppet at Tor, and that gave me the motivation to reproduce the setup in the lab. So I have finally Puppetized part of my mail setup at home. That used to be hand-crafted experimental stuff documented in a couple of pages in this wiki, but is now being deployed by Puppet.

It's not complete yet: spam filtering (including DKIM checks and graylisting) is not implemented, but that's the next step, presumably to do during the next outage. The setup should be deployable with puppet apply, however, and I have refined that mechanism a little bit with the run script.
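For illustration only, this is roughly what a serverless run boils down to. It is not the actual run script: the helper name and the manifests/site.pp and modules paths are hypothetical, and the sketch only builds the command line instead of executing it. The --modulepath and --noop options are standard puppet apply settings.

```python
import shlex


def puppet_apply_command(manifest, modulepath, noop=True):
    """Build the argv for a `puppet apply` run that needs no Puppet server.

    With noop=True, Puppet only reports what it would change.
    """
    argv = ["puppet", "apply", f"--modulepath={modulepath}"]
    if noop:
        argv.append("--noop")
    argv.append(manifest)
    return argv


if __name__ == "__main__":
    # Hypothetical paths, relative to a checkout of the Puppet code.
    cmd = puppet_apply_command("manifests/site.pp", "modules")
    print(shlex.join(cmd))
```

To actually apply the manifests, the returned list can be handed to subprocess.run(cmd, check=True) on the target machine, first with noop left on to review the changes.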

Heck, it's not even deployed yet. But the hard part / grunt work is done.

Other

The outage was "short" enough (5 hours) that I didn't take time to deploy the other mitigations I had used in the previous incident.

But I'm starting to seriously consider deploying a web (and caching) reverse proxy so that I can endure such problems more gracefully.

Side note on proper services

Well, that was dumb. I wrote this clever piece on what a properly run service is, and originally shoved it deep inside this service note instead of making a blog article.

That is now fixed: see 2025-09-30-proper-services instead.

Resolution

In the end, I didn't need any mitigation and the problem fixed itself. I did do quite a bit of cleanup so that feels somewhat good, although I despaired quite a bit at the amount of technical debt I've accumulated in the lab.

Timeline

Times are in UTC-4.

6:52: IRC bouncer goes offline
9:20: called TSI support, waited on the line 15 minutes then was told I'd get a call back
9:54: outage apparently detected by TSI
11:00: no response, called support again
11:10: confirmed bonding router outage, no official ETA but "today", source of the 9:54 timestamp above
12:08: TPA monitoring notices service restored
12:34: call back from TSI; service restored, problem was with the "bonder" configuration on their end, which was "fighting between Montréal and Toronto"
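As a sanity check on the "5 hours" figure, the durations can be worked out from the timestamps above. This is a small sketch; the elapsed helper is made up for illustration and assumes both times fall on the same day.

```python
from datetime import datetime


def elapsed(start, end):
    """Return the duration between two same-day H:MM timestamps."""
    fmt = "%H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


if __name__ == "__main__":
    # First sign of trouble (IRC bouncer) until monitoring saw recovery.
    print("outage:", elapsed("6:52", "12:08"))  # 5:16:00
    # First sign of trouble until TSI apparently detected the outage.
    print("detection:", elapsed("6:52", "9:54"))  # 3:02:00
    # First support call until the call back.
    print("call back:", elapsed("9:20", "12:34"))  # 3:14:00
```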


Created 2025-03-22 00:25. Edited 2025-09-30 10:59.