Minor outage at Teksavvy business
This morning, internet was down at home. The last time I had such an issue was in February 2023, when my provider was Oricom. Now I'm on a business service from Teksavvy Internet (TSI), for which I pay 100$ per month for a 250/50 Mbps package with a static IP address, on which I run, well, everything: email services, this website, etc.
Mitigation
The main problem when the service goes down for a prolonged outage like this is email. Mail is pretty resilient to failures like this, but after some delay (which varies according to the other end), mail starts to drop. I am actually not sure what the various settings are among different providers, but I would assume mail is typically kept for about 24 hours, so that's our mark.
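For what it's worth, on a Postfix sender that retry window is controlled by maximal_queue_lifetime, which defaults to five days, so the actual grace period can be longer than 24 hours depending on the other end. A quick way to see the relevant knobs on a Postfix box:

```
# on a Postfix sender, show how long mail is queued and retried
# before it is returned to the sender (both default to 5 days)
postconf maximal_queue_lifetime bounce_queue_lifetime
```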
Last time, I set up VMs at Linode and Digital Ocean to deal better with this. I have actually kept those VMs running as DNS servers until now, so that part is already done.
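As a sanity check, something like this confirms the off-site servers still answer authoritatively for the zone (server and zone names here are placeholders, not my actual hosts):

```
# ask each off-site name server for the zone's SOA record
# (server and zone names are placeholders)
for ns in ns1.example.net ns2.example.net; do
    dig +short example.com SOA @"$ns"
done
```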
I had fantasized about Puppetizing the mail server configuration so that I could quickly spin up mail exchangers on those machines. But now I am realizing that my Puppet server is one of the services that's down, so this would not work, at least not unless the manifests can be applied without a Puppet server (say with puppet apply).
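To give an idea, a serverless run would look roughly like this, with entirely hypothetical paths for the local checkout of the manifests and modules:

```
# dry-run the local manifests without contacting a Puppet server
# (all paths are hypothetical; adjust to the actual checkout)
sudo puppet apply --noop \
    --modulepath=/srv/puppet/modules \
    --hiera_config=/srv/puppet/hiera.yaml \
    /srv/puppet/manifests/site.pp
# drop --noop to actually apply the catalog
```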
Thankfully, my colleague groente did amazing work to refactor our Postfix configuration in Puppet at Tor, and that gave me the motivation to reproduce the setup in the lab. So I have finally Puppetized part of my mail setup at home. That used to be hand-crafted experimental stuff documented in a couple of pages in this wiki, but is now being deployed by Puppet.
It's not complete yet: spam filtering (including DKIM checks and graylisting) is not implemented, but that's the next step, presumably to do during the next outage. The setup should be deployable with puppet apply, however, and I have refined that mechanism a little bit, with the run script.
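I won't reproduce that script here; the general shape of such a wrapper, as a purely hypothetical sketch (repository path and layout invented for illustration), would be something like:

```
#!/bin/sh
# hypothetical wrapper: refresh the local Puppet checkout and apply it,
# so a machine can be (re)configured even while the Puppet server is down
# (repository path and layout are invented for illustration)
set -eu
repo=/srv/puppet
git -C "$repo" pull --ff-only
exec sudo puppet apply \
    --modulepath="$repo/modules" \
    "$repo/manifests/site.pp" "$@"
```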
Heck, it's not even deployed yet. But the hard part / grunt work is done.
Other
The outage was "short" enough (5 hours) that I didn't take the time to set up the other mitigations I had deployed during the previous incident.
But I'm starting to seriously consider deploying a web (and caching) reverse proxy so that I can endure such problems more gracefully.
Side note on proper services
Well, that was dumb. I wrote this clever piece on what a properly run service looks like and originally shoved it deep inside this service note instead of making it a blog article.
That is now fixed, see 2025-09-30-proper-services instead.
Resolution
In the end, I didn't need any mitigation and the problem fixed itself. I did do quite a bit of cleanup, which feels somewhat good, although I despaired at the amount of technical debt I've accumulated in the lab.
Timeline
Times are in UTC-4.
- 6:52: IRC bouncer goes offline
- 9:20: called TSI support, waited on the line 15 minutes then was told I'd get a call back
- 9:54: outage apparently detected by TSI
- 11:00: no response, tried calling back support again
- 11:10: confirmed bonding router outage, no official ETA but "today", source of the 9:54 timestamp above
- 12:08: TPA monitoring notices service restored
- 12:34: call back from TSI; service restored, problem was with the "bonder" configuration on their end, which was "fighting between Montréal and Toronto"