Yesterday, my workstation (curie) was hung when I came into the office. After a "skinny elephant", the box rebooted, but it couldn't find the primary disk (in the BIOS). Instead, it booted off the secondary HDD, still running an old Fedora 27 install which somehow survived to this day, possibly because BTRFS is incomprehensible.

Somehow, I blindly accepted the Fedora prompt asking me to upgrade to Fedora 28, not realizing that:

  1. Fedora is now at release 36, not 28
  2. major upgrades take about an hour...
  3. ... and happen at boot time, blocking the entire machine (I'll remember this next time I laugh at Windows and Mac OS users stuck on updates on boot)
  4. you can't skip more than one major upgrade

Which means that upgrading to latest would take over 4 hours. Thankfully, it's mostly automated and seems to work pretty well (which is not exactly the case for Debian). It still seems like a lot of wasted time -- it would probably be better to just reinstall the machine at this point -- and not what I had planned to do that morning at all.

In any case, after waiting all that time, the machine booted (into Fedora) again, and this time it could detect the SSD. The BIOS could find the disk too, so after I reinstalled GRUB (from Fedora) and fixed the boot order, it rebooted; then Secure Boot failed, so I turned that off (!?), and I was back in Debian.
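For the record, reinstalling the bootloader from the Fedora side boils down to something like this (a sketch, assuming a BIOS/GRUB2 setup; the device name is illustrative, check `lsblk` first):

```shell
# find the actual disk name before doing anything
lsblk

# reinstall GRUB to the disk's MBR (hypothetical device: /dev/sda)
sudo grub2-install /dev/sda

# regenerate the GRUB configuration (Fedora's BIOS-boot path)
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```

On a UEFI machine the steps differ (the EFI partition and `shim` come into play), which is also where Secure Boot complications start.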

I did an emergency backup with ddrescue, from the running system, which probably doesn't really work as a backup (the filesystem is likely to be inconsistent), but it was fast enough (20 minutes) and gave me some peace of mind. My offsite backups have been down for a while, and since I treat my workstations as "cattle" (not "pets"), I don't have a solid recovery scenario for those situations other than "just reinstall and run Puppet", which takes a while.
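The ddrescue run looked roughly like this (device names and the map file path are illustrative, not what I actually typed):

```shell
# clone the failing disk to a spare one; -f is required to force
# writing to a block device. The map file records which sectors
# were read successfully, so the run can be resumed or retried
# later with more aggressive retry options (e.g. -r3).
sudo ddrescue -f /dev/nvme0n1 /dev/sdb curie.map
```

The map file is the whole point of ddrescue over plain dd: on a flaky disk you can do a quick first pass that skips bad areas, then come back for the damaged sectors without re-reading what already worked.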

Now I'm wondering what the next step is: probably replace the disk anyway (the new one is bigger: 1TB instead of 500GB), or keep the new one as a hot backup somehow. Too bad I don't have a snapshotting filesystem on there... (Technically, I have LVM, but LVM snapshots are heavy and slow, and can't atomically cover the entire machine.)
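To illustrate why classic LVM snapshots feel heavy (VG/LV names here are hypothetical): the snapshot needs its own preallocated copy-on-write area, and every write to the origin volume first copies the old block there, roughly doubling write traffic while the snapshot exists.

```shell
# create a snapshot of the root LV with a 10G CoW area;
# if that area fills up, the snapshot is invalidated
sudo lvcreate --size 10G --snapshot --name root-snap /dev/vg0/root

# ... back up the frozen view (mount it read-only, tar it, etc.) ...

# drop the snapshot to stop paying the CoW write penalty
sudo lvremove /dev/vg0/root-snap
```

And since each snapshot covers a single logical volume, there's no way to get one atomic, machine-wide point in time if the system spans several LVs.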

It's kind of scary how this thing failed: totally dropped off the bus, just not in the BIOS at all. I prefer the way spinning rust fails: clickety sounds, tons of warnings beforehand, partial recovery possible. With this new flashy junk, you just lose everything all at once. Not fun.

BTRFS raid?
I have seen similar things happening with some SSDs, too. Since I run btrfs-raid (over 7 disks or so) it happens now and then, and usually is fixed by unplugging, plugging in a new disk, and rebalancing. Seems worth considering on your side, too, makes it a bit easier to deal with this (and with btrfs volumes you can have multiple distros booting)
Comment by Norbert
Dying SSDs
I also have experienced a few SATA and NVMe drives just disappearing on reboot or even during normal usage. In my experience SSDs just stop working without any warnings. Working, up to date backups are a must have.
Comment by Arti
comment 3

Seems worth considering on your side, too, makes it a bit easier to deal with this (and with btrfs volumes you can have multiple distros booting)

I'm certainly considering some sort of RAID or snapshotting for my workstation now. The problem is it's a NUC, so it really can't fit more disks.

Considering my ... unfruitful experience with BTRFS, I'll probably stay the heck away from it, though. But thanks for the advice.

Working, up to date backups are a must have.

That's the understatement of the day. :p

Thankfully, as I said, this machine is mostly throw-away. But because our installers are still kind of crap, it takes a while to recover it, so I am thinking RAID or offline snapshots could be useful to speed up recovery...

Comment by anarcat