Proper services
During 2025-03-21-another-home-outage, I reflected on what a properly run service looks like and blurted out something that turned out to be important enough to outline in more detail. So here it is again, on its own, for my own future reference.
Typically, I think of a properly functioning service as having four things:
- backups
- documentation
- monitoring
- automation
- high availability (HA)
Yes, I miscounted. This is why you need high availability.
A service doesn't properly exist if it doesn't have at least the first three of those. It will be harder to maintain without automation, and will inevitably suffer prolonged outages without HA.
The five components of a proper service
Backups
Duh. If data is maliciously or accidentally destroyed, you need a copy somewhere. Preferably in a way that malicious Joe can't get at it.
This is harder than you think.
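As a minimal illustration of the "copy somewhere" part, here is a sketch in Python that keeps rotated, timestamped archives of a data directory. The paths and retention count are invented for the example, and this alone does nothing for the "malicious Joe" problem: for that, the copies need to live on storage the origin host cannot delete from (pull-based or append-only backups).

```python
#!/usr/bin/env python3
"""Minimal backup sketch: keep rotated, timestamped copies of a directory.

This only covers "have a copy somewhere"; it does not protect the copies
from a compromised origin host. Paths and retention are hypothetical.
"""
import tarfile
import time
from pathlib import Path

SOURCE = Path("/var/lib/myservice")   # hypothetical data directory
DEST = Path("/backups/myservice")     # hypothetical backup destination
KEEP = 14                             # number of archives to retain


def backup() -> Path:
    """Write a gzip'd tarball of SOURCE with a timestamp in its name."""
    DEST.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = DEST / f"myservice-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(str(SOURCE), arcname=SOURCE.name)
    return archive


def rotate() -> None:
    """Delete the oldest archives beyond the retention count."""
    archives = sorted(DEST.glob("myservice-*.tar.gz"))
    for old in archives[:-KEEP]:
        old.unlink()


if __name__ == "__main__":
    print(f"wrote {backup()}")
    rotate()
```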
Documentation
I have an entire template for this. Essentially, it boils down to using https://diataxis.fr/ and this "audit" guide. For me, the most important parts are:
- disaster recovery (includes backups, probably)
- playbook
- install/upgrade procedures (see automation)
You probably know this is hard, and that's why you're not doing it. Do it anyways: you'll think it sucks, it will grow out of sync with reality, but you'll be really grateful for whatever scraps you wrote when you're in trouble.
Any docs, in other words, are better than no docs, but that's no excuse for not doing the work correctly.
Monitoring
If you don't have monitoring, you'll find out it failed too late, and you won't know when it recovers. Consider high availability, work hard to reduce noise, and don't have machines wake people up: that's literally torture and is against the Geneva Conventions.
Consider predictive algorithms to prevent failures, like "add storage within 2 weeks, before this disk fills up".
This is also harder than you think.
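As a rough sketch of that predictive idea, here is a standalone Python check that samples free space twice, extrapolates linearly, and warns if the filesystem is projected to fill within two weeks. The mountpoint, interval and threshold are invented; in practice you'd use something like Prometheus' predict_linear() over hours of real samples rather than two points a minute apart.

```python
#!/usr/bin/env python3
"""Sketch of a "disk will fill within two weeks" predictive check."""
import shutil
import time

MOUNTPOINT = "/"                 # hypothetical filesystem to watch
THRESHOLD = 14 * 24 * 3600       # warn if projected full within 14 days
INTERVAL = 60                    # seconds between the two samples


def free_bytes(path: str) -> int:
    return shutil.disk_usage(path).free


def main() -> None:
    first = free_bytes(MOUNTPOINT)
    time.sleep(INTERVAL)
    second = free_bytes(MOUNTPOINT)

    rate = (first - second) / INTERVAL   # bytes consumed per second
    if rate <= 0:
        print("usage is flat or shrinking, nothing to predict")
        return

    seconds_left = second / rate
    days_left = seconds_left / 86400
    if seconds_left < THRESHOLD:
        print(f"WARNING: {MOUNTPOINT} projected full in {days_left:.1f} days")
    else:
        print(f"OK: {MOUNTPOINT} projected full in {days_left:.1f} days")


if __name__ == "__main__":
    main()
```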
Automation
Make it easy to redeploy the service elsewhere.
Yes, I know you have backups. That is not enough: backups typically restore data, and while they can also include configuration, you're going to need to change things when you restore, which is what automation (call it "configuration management" if you will) does for you anyways.
This also means you can run unit tests on your configuration; otherwise you're building legacy.
This is probably as hard as you think.
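To make the "unit tests on your configuration" point concrete, here is a sketch using pytest against a hypothetical rendered config file. The file name, sections and keys are all invented; the point is only that whatever your configuration management produces can be checked by a test suite before it ships.

```python
"""Sketch of unit tests for a rendered, INI-style config file (pytest)."""
import configparser

import pytest

CONFIG_PATH = "service.conf"  # hypothetical file rendered by Puppet/Ansible/...


@pytest.fixture
def config() -> configparser.ConfigParser:
    parser = configparser.ConfigParser()
    # ConfigParser.read() returns the list of files it managed to parse.
    assert parser.read(CONFIG_PATH), f"{CONFIG_PATH} missing or unreadable"
    return parser


def test_required_sections_present(config):
    for section in ("server", "backups"):
        assert config.has_section(section)


def test_listen_port_is_sane(config):
    port = config.getint("server", "port")
    assert 1 <= port <= 65535


def test_backups_are_enabled(config):
    # A config that silently disables backups should never reach production.
    assert config.getboolean("backups", "enabled")
```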
High availability
Make it not fail when one part goes down.
Eliminate single points of failure.
This is easier than you think, except for storage and DNS ("naming things", not "HA DNS", which is easy), which, I guess, means it's harder than you think too.
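As a toy illustration of "not failing when one part goes down", here is a Python sketch that tries replicas of a service in turn until one answers. The endpoints are invented, and real high availability usually means a load balancer, VRRP/keepalived or replicated storage rather than client-side retries, but the principle is the same: no single endpoint should be able to take the whole service down.

```python
#!/usr/bin/env python3
"""Sketch of client-side failover across service replicas."""
from urllib.error import URLError
from urllib.request import urlopen

# Hypothetical replicas of the same service, in preference order.
REPLICAS = [
    "http://node1.example.org:8080/health",
    "http://node2.example.org:8080/health",
]


def fetch_from_any(urls: list[str], timeout: float = 2.0) -> bytes:
    """Return the response from the first replica that answers."""
    errors = []
    for url in urls:
        try:
            with urlopen(url, timeout=timeout) as response:
                return response.read()
        except (URLError, OSError) as exc:
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all replicas failed: " + "; ".join(errors))


if __name__ == "__main__":
    print(fetch_from_any(REPLICAS))
```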
Assessment
Of the above 5 items, I currently check off two in my lab:
- backups
- documentation
And barely: I'm not happy about the offsite backups, and my documentation is much better at work than at home (and even there, I have a 15-year backlog to catch up on).
I barely have monitoring: Prometheus is scraping parts of the infra, but I don't have any sort of alerting -- by which I don't mean "electrocute myself when something goes wrong", I mean "there's a set of thresholds and conditions that define an outage and I can look at it".
Automation is wildly incomplete. My home server is a random collection of old experiments and technologies, ranging from Apache with Perl and CGI scripts to Docker containers running Golang applications. Most of it is not Puppetized (but the ratio is growing). Puppet itself introduces a huge attack vector with kind of catastrophic lateral movement if the Puppet server gets compromised.
And, fundamentally, I am not sure I can provide high availability in the lab. I'm just this one guy running my home network, and I'm growing older. I'm thinking more about winding things down than building things now, and that's just really sad, because I feel we're losing (well that escalated quickly).
Side note about Tor
The above applies to my personal home lab, not work!
At work, of course, it's another (much better) story:
- all services have backups
- lots of services are well documented, but not all
- most services have at least basic monitoring
- most services are Puppetized, but not crucial parts (DNS, LDAP, Puppet itself), and there are important chunks of legacy coupling between various services that make the whole system brittle
- most websites, DNS and large parts of email are highly available, but key services like the Forum, GitLab and similar applications are not HA, although most services run under replicated VMs that can trivially survive a total, single-node hardware failure (through Ganeti and DRBD)