It's major upgrade time again! The Debian project just published the Debian 11 "bullseye" release, and it's pretty awesome! This makes me realize that I have never written here about my peculiar upgrade process, and figured it was worth bringing that up to a wider audience.

My upgrade process also has a notable changes section which includes major version changes (e.g. Inkscape 1.0!), new packages (e.g. podman!) and important behavior changes (e.g. driverless scanning and printing!).

I'm particularly interested to hear about any significant change I might have missed. If you know of a cool new package that shipped with bullseye and that I forgot, do let me know!

But that's for the cool new stuff. We need to talk about the problems with Debian major upgrades.

Background
Limitations of the official procedure
How to automate major upgrades
Trade-offs
The other option: never upgrade
What is your process?
My upgrades so far

Background

I have been maintaining detailed upgrade guides, on my wiki, starting with the jessie release, but I have actually written such guides for Koumbit.org as far back as Debian squeeze in 2011 (another worker wrote the older Debian lenny upgrade guide in 2009). Koumbit, since then, has kept maintaining those guides all the way to the latest bullseye upgrade, through 7 major releases!

Over the years, those guides evolved from a quick "cheat-sheet" format copied from the release notes into a more or less "scripted" form that I currently use.

Each guide has a procedure made of a few steps that can be basically copy-pasted to batch-upgrade a host (or multiple hosts in parallel) as quickly as possible. There is also the predict-os script which allows you to keep track of progress of the upgrades in a Puppet cluster.

Limitations of the official procedure

In comparison with my procedure, the official upgrade guide is mostly designed to upgrade a single machine, typically a workstation, with a rather slow and exhaustive process. The PDF version of the upgrade guide is 14 pages long! This, obviously, does not work when you have tens or hundreds of machines to upgrade.

Debian upgrades are notorious for being extremely reliable, but we have a lot of packages, and there are always corner cases where the upgrade will just fail because of a bug specific to your environment. Those will only be fixed after some back and forth in the community (and that's assuming users report those bugs, which is not always the case). There's no obvious way to deploy "hot fixes" in this context, at least not without fixing the package and publishing it on an unofficial Debian archive while the official ones catch up. This is slow and difficult.

Or some packages require manual labor. Examples of this are the PostgreSQL or Ganeti packages which require you to upgrade your clusters by hand, while the old and new packages live side by side. Debian packages bring you far in the upgrade process, but sometimes not all the way.

Which means every Debian install needs to be manually upgraded and inspected when a new release comes out. That's slow and error prone and we can do better.

How to automate major upgrades

I have a proposal to automate this. It's been mostly dormant in the Debian wiki, for 5 years now. Fundamentally, this is a hard problem: Debian gets installed in so many different environments, from workstations to physical servers to virtual machines, embedded systems and so on, that it's extremely hard to come up with a "one size fits all" system.

The (manual) procedure I'm using is mostly targeting servers, but I'm also using it on workstations. And I'll note that it's specific to my home setup: I have a different procedure at work, although it has a lot of common code.

To automate this, I would factor out that common code with hooks where you could easily inject special code like "you need to upgrade ferm first", "you need an extra reboot here", or "this is how you finish the PostgreSQL upgrade".

With Debian getting closer to a 2 year release cycle, with the previous release being supported basically only one year after the new stable comes out, I feel more and more strongly that this needs better automation.

So I'm thinking that I should write a prototype for this. Ubuntu has do-release-upgrade that is too Ubuntu-specific to be reused. An attempt at collaborating on this has been mostly met with silence from Ubuntu's side as well.

I'm thinking that using something like Fabric, Mitogen, or Transilience: anything that will allow me to write simple, portable Python code that can run transparently on a local machine (for single systems upgrades, possibly with a GUI frontend) to remote servers (for large clusters of servers, maybe with canaries and grouping using Cumin). I'll note that Koumbit started experimenting with Puppet Bolt in the bullseye upgrade process, but that feels too site-specific to be useful more broadly.

Trade-offs

I am not sure where this stands in the XKCD time trade-off evaluation, because the table doesn't actually cover the time frequency of Debian release (which is basically "biennial") and the amount of time the upgrade would take across a cluster (which varies a lot, but that I estimate to be between one to 6 hours per machine).

Assuming I have 80 machines to upgrade, that is 80 to 480 hours (between ~3 to 20 days) of work! It's unclear how much work such an automated system would shave off, however. Assuming things are an order of magnitude faster (say I upgrade 10 machines at a time), I would shave off between 3 and 18 days of work, which implies I might allow myself to spend a minimum of 5 days working on such a project.

The other option: never upgrade

Before people mention those: I am aware of containers, Kubernetes, and other deployment mechanisms. Indeed, those may be a long-term solution, we currently can't afford to migrate everything over to containers right now: that is a huge migration and a total paradigm shift. At that point, whatever is left might not even be Debian in the first place. And besides, if you run Kubernetes, you still need to run some OS underneath and upgrade that, so that problem never completely disappears.

Still, maybe that's the final answer: never upgrade.

For some stateless machines like DNS replicas or load balancers, that might make a lot of sense as there's no or little data to carry to the new host. But this implies a seamless and fast provisioning process, and we don't have that either: at my work, installing a machine takes about as long as upgrading it, and that's after a significant amount of work automating that process, partly writing my own Debian installer with Fabric (!).

What is your process?

I'm curious to hear what people think of those ideas. It strikes me as really odd that no one has really tackled that problem yet, considering how many clusters of Debian machines are out there. Surely people are upgrading those, and not following that slow step by step guide, right?

I suspect everyone is doing the same thing: we all have our little copy-paste script we batch onto multiple machines, sometimes in parallel. That is what the Debian.org sysadmins are doing as well.

There must be a better way. What is yours?

My upgrades so far

So far, I have upgraded 2 out of my 3 home machines running buster -- others have been installed directly in bullseye -- with only my main, old, messy server left. Upgrades have been pretty painless so far (see another report, for example), much better than the previous buster upgrade. Obviously, for me personal use, automating this is pointless.

Work-side, however, is another story: we have over 80 boxes to upgrade there and that will take a while. The last stretch to buster cycle took about two years to complete, so we might be done by the time the next release (12, "bookworm") is released, but that's actually a full year after "buster" becomes EOL, so it's actually too late...

At least I fixed the installers so that new the machines we create all ship with bullseye, so we stopped accumulating new buster hosts...

Thanks to lelutin and pabs for reviewing a draft of this post.

RSS

clusterssh

I group together virtual machines, servers and containers that are similar.

They tend to have configurations close enough to send the upgrade commands/ scripts in parallel, simultaneously.

So for example I have a cluster of 3 Proxmox servers, I will access them with:

clusterssh pve01 pve02 pve03

... and issue the commands there.

Then for all the IS/IT servers (VMs):

clusterssh support syspass guacamole ssh-bastion

And for web servers:

clusterssh web1 web2 web3 web4

etc.

It remains a semi-manual process but much faster than going into each and every system simultaneously.

I don't manage anything bigger than 15-20 servers at a time so it's not the same problem and probably wouldn't scale but it works for me.

Comment by Fabian Rodriguez — 2021-09-20 22:22

cssh and others

Thanks Fabian, that's a great point! clusterssh (AKA cssh) is how I did a lot of Debian upgrades at Koumbit back in the days. I don't know why exactly I haven't used it yet here, but there's something to be said about heterogeneous systems here.

If you have a lot of identical machines, then, yes, just batch-entering all those commands on all those machines at once works like a charm. I would actually use clustershell for this nowadays, since you can feed it a bunch of scripts more easily and you don't have to deal with tens of windows or Perl or TCL/TK like cssh (maybe I remember wrong).

But still: that doesn't deal with the situation where you have that special snowflake server (e.g. PostgreSQL) that needs to be upgraded just so. Sure, you can do the upgrade at once there as well: you enter the magic commands on all the servers for that as well. But then if you mess it up, you destroy all your (e.g. PostgreSQL!) servers all at once.

The idea of scripting this into something that can be batched by something like cumin or others is that you can start the upgrade on a part of the cluster and then stop if something goes wrong. With cssh and friends, it's kind of all-or-nothing.

Of course, you can absolutely select part of the cluster by hand, but that is still kind of slow and manual.

I want a silver bullet, and one that we can collaborate on too. If you or I can do cssh to deal with those things, that's great, but that doesn't help Roger over there who's just getting started with Debian sysadmin and has to now figure out how to upgrade PostgreSQL interactively...

Comment by anarcat — 2021-09-21 10:57

Comments on this page are closed.

Created 2021-09-05 14:40. Edited 2021-11-24 09:14.