NVMe/SSD disk failure
Yesterday, my workstation (curie) had hung by the time I came into the office. After a "skinny elephant" (the magic SysRq reboot sequence), the box rebooted, but the BIOS couldn't find the primary disk. Instead, it booted from the secondary HDD, still running an old Fedora 27 install which somehow survived to this day, possibly because BTRFS is incomprehensible.
Somehow, I blindly accepted the Fedora prompt asking me to upgrade to Fedora 28, not realizing that:
- Fedora is now at release 36, not 28
- major upgrades take about an hour...
- ... and happen at boot time, blocking the entire machine (I'll remember this next time I laugh at Windows and Mac OS users stuck on updates on boot)
- you can't skip more than one major release per upgrade
Which means that upgrading to the latest release would take over 4 hours. Thankfully, it's mostly automated and seems to work pretty well (which is not exactly the case for Debian). It still seems like a lot of wasted time -- it would probably be better to just reinstall the machine at this point -- and not what I had planned to do that morning at all.
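A quick back-of-the-envelope sketch of that math, assuming each upgrade can jump at most two releases and takes about an hour (both figures come from the list above; the function name is made up for illustration):

```python
def upgrade_hops(current, target, max_jump=2):
    """Return the releases visited when always taking the biggest allowed jump."""
    hops = []
    while current < target:
        # Jump as far as allowed, without overshooting the target release.
        current = min(current + max_jump, target)
        hops.append(current)
    return hops


hops = upgrade_hops(27, 36)
print(hops)       # [29, 31, 33, 35, 36]
print(len(hops))  # 5 upgrades, so roughly 5 hours at an hour each
```

Starting with the 27 to 28 upgrade I had already accepted gives the same total: one hop to 28, then four more to reach 36.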
In any case, after waiting all that time, the machine booted (in Fedora) again, and now it could detect the SSD. The BIOS could find the disk too, so I reinstalled grub (from Fedora) and fixed the boot order. It rebooted, but Secure Boot failed, so I turned that off (!?), and I was back in Debian.
I did an emergency backup with ddrescue, from the running system, which probably doesn't really work as a backup (because the filesystem is likely to be corrupt) but it was fast enough (20 minutes) and gave me some peace of mind. My offsite backups have been down for a while and since I treat my workstations as "cattle" (not "pets"), I don't have a solid recovery scenario for those situations other than "just reinstall and run Puppet", which takes a while.
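For reference, a minimal sketch of what such a ddrescue invocation can look like, wrapped in Python so it only prints the command instead of running it. The device and file paths are hypothetical, not the ones I actually used, and the `-n` flag is just one reasonable choice for a fast first pass:

```python
import shlex


def ddrescue_command(source, image, mapfile):
    """Build a GNU ddrescue command line for a fast first pass."""
    # -n (--no-scrape) skips the slow scraping phase.
    # The mapfile records progress so a later run can resume and retry bad areas.
    return ["ddrescue", "-n", source, image, mapfile]


# Hypothetical paths: adjust the source device and destination before use.
cmd = ddrescue_command("/dev/nvme0n1", "/mnt/backup/curie.img", "/mnt/backup/curie.map")
print(shlex.join(cmd))
```

The image and mapfile need to live on a different disk than the one being copied, and reading the raw device requires root.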
Now I'm wondering what the next step is: probably replace the disk anyways (the new one is bigger: 1TB instead of 500GB), or keep the new one as a hot backup somehow. Too bad I don't have a snapshotting filesystem on there... (Technically, I have LVM, but LVM snapshots are heavy and slow, and can't atomically cover the entire machine.)
It's kind of scary how this thing failed: totally dropped off the bus, just not in the BIOS at all. I prefer the way spinning rust fails: clickety sounds, tons of warnings beforehand, partial recovery possible. With this new flashy junk, you just lose everything all at once. Not fun.
I'm certainly considering some sort of RAID or snapshotting for my workstation now. Problem is it's a NUC so it really can't fit more disks.
Considering my ... unfruitful experience with BTRFS, I probably will stay the heck away from it though, but thanks for the advice.
That's the understatement of the day. :p
Thankfully, as I said, this machine is mostly throw-away. But because our installers are still kind of crap, it takes a while to recover it, so I am thinking RAID or offline snapshots could be useful to speed up recovery...