This morning, internet was down at home. The last time I had such an issue was in February 2023, when my provider was Oricom. Now I'm with a business service at Teksavvy Internet (TSI), in which I pay 100$ per month for a 250/50 mbps business package, with a static IP address, on which I run, well, everything: email services, this website, etc.

Mitigation

Email

The main problem when the service goes down like this for prolonged outages is email. Mail is pretty resilient to failures like this but after some delay (which varies according to the other end), mail starts to drop. I am actually not sure what the various settings are among different providers, but I would assume mail is typically kept for about 24h, so that's our mark.

Last time, I setup VMs at Linode and Digital Ocean to deal better with this. I have actually kept those VMs running as DNS servers until now, so that part is already done.

I had fantasized about Puppetizing the mail server configuration so that I could quickly spin up mail exchangers on those machines. But now I am realizing that my Puppet server is one of the service that's down, so this would not work, at least not unless the manifests can be applied without a Puppet server (say with puppet apply).

Thankfully, my colleague groente did amazing work to refactor our Postfix configuration in Puppet at Tor, and that gave me the motivation to reproduce the setup in the lab. So I have finally Puppetized part of my mail setup at home. That used to be hand-crafted experimental stuff documented in a couple of pages in this wiki, but is now being deployed by Puppet.

It's not complete yet: spam filtering (including DKIM checks and graylisting) are not implemented yet, but that's the next step, presumably to do during the next outage. The setup should be deployable with puppet apply, however, and I have refined that mechanism a little bit, with the run script.

Heck, it's not even deployed yet. But the hard part / grunt work is done.

Other

The outage was "short" enough (5 hours) that I didn't take time to deploy the other mitigations I had deployed in the previous incident.

But I'm starting to seriously consider deploying a web (and caching) reverse proxy so that I endure such problems more gracefully.

Side note on proper servics

Typically, I tend to think of a properly functioning service as having four things:

  1. backups
  2. documentation
  3. monitoring
  4. automation
  5. high availability

Yes, I miscounted. This is why you have high availability.

Backups

Duh. If data is maliciously or accidentally destroyed, you need a copy somewhere. Preferably in a way that malicious joe can't get to.

This is harder than you think.

Documentation

I have an entire template for this. Essentially, it boils down to using https://diataxis.fr/ and this "audit" guide. For me, the most important parts are:

You probably know this is hard, and this is why you're not doing it. Do it anyways, you'll think it sucks, but you'll be really grateful for whatever scraps you wrote when you're in trouble.

Monitoring

If you don't have monitoring, you'll know it fails too late, and you won't know it recovers. Consider high availability, work hard to reduce noise, and don't have machine wake people up, that's literally torture and is against the Geneva convention.

Consider predictive algorithm to prevent failures, like "add storage within 2 weeks before this disk fills up".

This is harder than you think.

Automation

Make it easy to redeploy the service elsewhere.

Yes, I know you have backups. That is not enough: that typically restores data and while it can also include configuration, you're going to need to change things when you restore, which is what automation (or call it "configuration management" if you will) will do for you anyways.

This also means you can do unit tests on your configuration, otherwise you're building legacy.

This is probably as hard as you think.

High availability

Make it not fail when one part goes down.

Eliminate single points of failures.

This is easier than you think, except for storage and DNS (which, I guess, means it's harder than you think too).

Assessment

In the above 5 items, I check two:

  1. backups
  2. documentation

And barely: I'm not happy about the offsite backups, and my documentation is much better at work than at home (and even there, I have a 15 year backlog to catchup on).

I barely have monitoring: Prometheus is scraping parts of the infra, but I don't have any sort of alerting -- by which I don't mean "electrocute myself when something goes wrong", I mean "there's a set of thresholds and conditions that define an outage and I can look at it".

Automation is wildly incomplete. My home server is a random collection of old experiments and technologies, ranging from Apache with Perl and CGI scripts to Docker containers running Golang applications. Most of it is not Puppetized (but the ratio is growing). Puppet itself introduces a huge attack vector with kind of catastrophic lateral movement if the Puppet server gets compromised.

And, fundamentally, I am not sure I can provide high availability in the lab. I'm just this one guy running my home network, and I'm growing older. I'm thinking more about winding things down than building things now, and that's just really sad, because I feel we're losing (well that escalated quickly).

Resolution

In the end, I didn't need any mitigation and the problem fixed itself. I did do quite a bit of cleanup so that feels somewhat good, although I despaired quite a bit at the amount of technical debt I've accumulated in the lab.

Timeline

Times are in UTC-4.

You can use your Mastodon account to reply to this post.

Created . Edited .