Today SuSE has a big update for aarch64, moving lots of stuff from /bin /lib to /usr/bin /usr/lib. This was done for x86_64 last week and went smoothly. But not on Holly. The result is that either the executor (the dynamic loader) or the shared libraries have gone missing, so if you do "ls filename" it says "/usr/bin/ls: no such file or directory" (even though it's there, per "echo /usr/bin/ls*"). Specific error message:
Installation of filesystem-15.5-40.2.aarch64 failed:
Error: Subprocess failed. Error: RPM failed: Make a copy of `/bin'.
(It seems to have successfully made a copy of /bin.)
This upgrade was not exactly atomic: last week the firmware migrated from /lib/firmware to /usr/lib/firmware. But the drivers, specifically the out-of-tree, user-compiled driver for (RTL) 88x1bu.ko, are looking for the firmware in /lib/firmware. A simple fix was to make a symlink from /lib/firmware to /usr/lib/firmware. Problem solved... yeah, sure. I suspect, without real proof, that the posttrans script for the filesystem package was not smart enough to recognize that it should just remove the symlink, not try to copy it to /usr/lib where ./firmware already exists. In any case, only part of the essential infrastructure formerly in /lib got copied over.
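A check along these lines, run before a big path migration, would list hand-made symlinks sitting directly in /lib, which are the entries such a migration might trip over. This is only a sketch: the function name list_lib_symlinks is invented for illustration, and it looks just one level deep.

```shell
#!/bin/sh
# list_lib_symlinks [ROOT]
# Hypothetical helper: print the symlinks that sit directly in ROOT/lib.
# A hand-made /lib/firmware -> /usr/lib/firmware link would show up here.
list_lib_symlinks() {
    root=${1:-/}
    find "${root%/}/lib" -mindepth 1 -maxdepth 1 -type l 2>/dev/null | sort
}

# Usage: list_lib_symlinks        (live system)
#        list_lib_symlinks /mnt   (a mounted card)
```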
So I'm going to have to do something drastic to bring Holly back to life.
The intervention will obviously involve restoring things from backup…
and now is a very good opportunity to test if my backup system is actually
saving everything important, and if access to the backup copy is actually
feasible. (Other shops have had shortcomings
in both these areas.)
Fortunately all hosts including Holly were freshly backed up, according to
the standard procedure, just before the system update that went awry.
Here are some goals for the recovery campaign:
Of course the prime goal is to recover a working Holly identical
to the one trashed in the update.
Holly is a Raspberry Pi (RPi), which uses an SD card in the role of a solid state disc. I would prefer to have the restored system on the same SD card: saves a lot of hassle in unscrambling /etc/fstab, updating records, etc. (But I've got to get the correct labels onto the partitions.)
There is a small but significant amount of customization on Holly, which is freshly backed up. I want to restore it from backup, as a test of the backup procedure.
There is probably stuff that isn't backed up. I want to restore that too, from the original content of the card. And to improve the backup procedure accordingly.
This is on card 03, 32GiB. Copy everything to xena:/s1/holly using rsync (or the equivalent with tar).
I'm using OpenSuSE Tumbleweed. Locate the current XFCE image for aarch64 and copy it (raw) onto the card.
Restore minimal configuration files from backup, so I can use SSH remote execution privilege to get on the new Holly.
Some packages such as unbound (DNS client/server) have their own user and group, but the numeric UID/GID used on the installed files and directories will not be the same as what will be restored from backup. This will have to be straightened out later.
Boot the RPi. (Minus mt76x2u wireless NIC.)
Check if filesystem-15.5-40.2.aarch64 was installed on the image. Yes it was.
I have a script called post_jump for setting up a new machine. It installs all my standard configuration files, then installs normal and extra packages, removes unwanted packages, and does a dist-upgrade. It enables wanted services and disables unwanted ones, and does final detail work like extracting (or creating) the Kerberos host keys.
jacinth:~reports/public_html/update.d/holly has a list of packages which, before the failed update, were on Holly or were expected to be installed. Compare this list with what's actually on Holly. Update the package lists that post_jump reads: add packages that were missed, and remove cruft, particularly wanted but discontinued packages.
Compare Holly's backup with what's on Holly. Bypass items that were legitimately changed in today's update. Install wanted items that are missing or that have my customization. Identify cruft in the backup and remove from backup.pln.
Compare the saved content of card 03 with what's on Holly. Install wanted items and add them to backup.pln. Decide about unwanted items; why were they on Holly and why weren't they deleted?
Final checkout. I have a collection of test scripts for important (and testable) services.
Card slot issues: On Xena the card will make contact, but won't stay in the slot. On Jacinth's monitor, it doesn't seem to recognize that there's a card in the slot. Finally: it works in Iris' slot. Just stick in a full-size card (adapter), no locking. raw device is /dev/mmcblk0 , ROOT-03 is on /dev/mmcblk0p2 .
Various issues made trouble on the first attempt to restore the SD
card contents and I finally decided to abort and re-do it from the
beginning. Where there were differences they are described below together,
designated as try #1 or try #2.
This is on card 03, 32GiB. Copy (filesystem level) to iris:/s1/holly . The partition tables on the SD card and on the image are not the same: when first booted, the image's initrd will create a swap partition and will expand the root to fill available space.
It's an MSDOS table (not GPT). 1: 32MB fat16 (EFI),
2: 31.5GB ext4 (root), 3: 0.51GB swap.
Command lines:
parted /dev/mmcblk0 print
fsck -f -C /dev/disk/by-label/ROOT-03
mount -r /dev/disk/by-label/ROOT-03 /mnt
rsync -a /mnt/ /s1/holly/ROOT/
umount /mnt
168021 (1.68e5) files copied, taking about 10min; 9.89GB.
Locate the current XFCE image for aarch64 and copy it (raw) onto the card. https://en.opensuse.org/HCL:Raspberry_Pi3 has a link to http://download.opensuse.org/ports/aarch64/tumbleweed/appliances/openSUSE-Tumbleweed-ARM-XFCE-raspberrypi.aarch64.raw.xz; if the link is broken, which isn't too rare, dig around for the latest version in the containing directory. Size 1.12e9 bytes (1.12 GB) compressed, took 283sec, 4.2MB/s.
To copy onto the card: xzcat $image.xz | dd bs=4M of=/dev/mmcblk0 iflag=fullblock oflag=direct status=progress ; sync
Rate ~17MB/sec. Total (uncompressed): 5.91GB, 373sec, 15.8MB/sec.
Resulting partition table: 1: 67.1MB fat16, 2: 524MB swap, 3: 5.3GB ext4,
5.9GB total. At first boot, partition 3 will be expanded to fill the card.
The partitions are labeled EFI, SWAP, ROOT (hope that your own disc has
different labels, as mine do). To label these partitions:
fatlabel /dev/disk/by-label/EFI EFI-03 (from dosfstools)
tune2fs -L ROOT-03 /dev/disk/by-label/ROOT
mkswap -L SWAP-03 /dev/disk/by-label/SWAP (reinits swap space)
I should have copied critical items (/etc/ssh, /root/.ssh)
at this point, but didn't. Later I ended up shutting down Holly and moving the card
back to Iris, several times. On try #2 I pre-copied the files listed in
the plan summary. Plus I checked /etc/fstab carefully, then copied from
backup. Simple way to do the copy: first make a file containing filenames
to be copied with the future paths on holly e.g. /etc/ssh. Then use this
command line, without trailing '/' on the source and destination:
rsync -a -r -K -O --files-from=./filelist /home/backup/holly /mnt
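Before running that rsync it can be worth confirming that every entry in the file list really exists in the backup, so a typo is caught while the card is still in the reader. A minimal sketch, assuming the list holds one path per line as described above; the function name check_filelist is made up for this example.

```shell
#!/bin/sh
# check_filelist BACKUP_ROOT FILELIST
# Print every entry of FILELIST that is missing under BACKUP_ROOT.
# Entries are future paths on the target, e.g. /etc/ssh.
check_filelist() {
    bkroot=${1%/}
    while IFS= read -r path; do
        [ -n "$path" ] || continue                  # skip blank lines
        p="$bkroot/${path#/}"
        [ -e "$p" ] || [ -L "$p" ] || printf '%s\n' "$path"
    done < "$2"
}

# Usage: check_filelist /home/backup/holly ./filelist
# No output means every listed path is present in the backup.
```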
Boot the RPi. (Minus mt76x2u NIC.)
Default login on the image: root, password linux.
Logging in (try #1). root/linux got me in at the console, but the XFCE desktop didn't start (I waited about 1min). So I flipped to VT1; logged in successfully.
Logging in (try #2). A lot nicer with backed-up critical files preloaded.
yast2 lan.
SSH setup (try #1). It has some random host key which CouchNet hosts are not going to touch. I should have done the following steps when I was monkeying with the SD card. Shutting down Holly and doing that now. Copied /etc/ssh and /root/.ssh . OK, now it lets me on with RSA authentication. Try #2: these files were pre-copied and SSH worked upon boot.
Guess what, the RPi has no realtime clock. Set the time. Here's a simple way but not too accurate; do it as quickly as possible:
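One possibility, offered as a sketch rather than the exact command used: push the clock of a host that already has the right time to the Pi over SSH. This assumes GNU date on the Pi and root SSH access to it; the function name push_time and the SSH override variable are inventions for this example.

```shell
#!/bin/sh
# push_time HOST
# Set HOST's clock to this machine's current time. Accuracy is about one
# second plus SSH latency, good enough until a time daemon takes over.
# Set SSH=echo for a dry run that only prints the remote command.
push_time() {
    ${SSH:-ssh} "root@$1" "date -s @$(date +%s)"
}

# Usage: push_time holly
```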
Do a dist-upgrade (try #1 only) and make sure that
filesystem-15.5-40.2.aarch64 gets installed.
The image is very recently updated; filesystem-15.5-40.2.aarch64 is
already installed. Doing the dist upgrade anyway.
zypper refresh
zypper dist-upgrade --auto-agree-with-licenses --no-recommends --download-as-needed
2 new pkgs to install 1 to remove:
Installing libopenssl3 openssl-3 ; remove openssl-1_1
For try #2 I'm letting post_jump do the dist-upgrade.
Copy from backup files that post_jump needs. Try #1 only; in try #2 I already did this when preparing the card. List of files: /m1/custom/extra.sel /m1/custom/scripts.extra /m1/custom/conffiles /etc/sysconfig/network
Try #1: Run post_jump, should install all the desired packages. Issues:
krb-maint test failed. Re-done by hand, now
krb-maint test succeeds. post_jump had a wrong option to krb-maint [fixed].
Fix the current date for Kerberos client initialization: it says
inactive??
post_jump (try #2): Last time around there was a botched digest verification during installation.
Issues while installing wanted packages:
It looks like there may have been a key rollover. I use Squid as a
proxy so all my hosts can share one download; however, if RPM or
metadata files get updated and Squid thinks they're still fresh,
bad things can happen, as I just experienced. I suppressed the proxy in
the repo definitions, refreshed the repo, and re-did the installation
phase (same command line as before). 5 packages to upgrade, 631 new, 5
to change vendor. Download: 519.4 MiB; decompressed 1.6 GiB.
Resuming post_jump -p 8 (delete unwanted packages).
Compare jacinth:~reports/public_html/update.d/holly with the list of actually installed packages. Install wanted ones (and update extra.sel). Identify cruft and don't install.
Intermediate reboot and checkout.
Compare Holly's backup with what's on Holly. Install wanted items.
Identify cruft in the backup and remove from backup.pln.
rsync -a -n -O -x --force --log-format="%o %f" /home/backup/holly/ holly:/
Command line interpretations:
-a Preserve all file attributes and also recurse into directories.
-n No-op, report what would have been transferred but don't
transfer anything.
-O For directories, if the only discrepancy is times, skip it.
-x One file system, send nothing outside the filesystem that the
toplevel argument is in. In case of mounted other filesystems,
unlikely in this case. /proc/$PIDs/mem is particularly bad.
--force OK to replace a destination directory with a source
non-directory. Specifically the /etc/unbound symlink.
--log-format Lists the files being transferred (normally errors only).
1186 files are discrepant. Likely many should not be restored.
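With over a thousand discrepant files, a per-directory tally of the dry-run log makes it easier to see where they cluster. The sketch below reads "operation path" lines, which is what the "%o %f" format produces, and counts them by top-level directory. The function name summarize_log is hypothetical, and the optional prefix argument is there because the logged paths may carry the source directory in front; check a few lines of the real log first.

```shell
#!/bin/sh
# summarize_log [PREFIX] < rsync-log
# Count logged paths per top-level directory, after stripping PREFIX.
summarize_log() {
    awk -v pre="${1:-}" '
        {
            op = $1
            path = substr($0, length(op) + 2)   # keeps spaces in file names
            if (pre != "" && index(path, pre) == 1)
                path = substr(path, length(pre) + 1)
            sub(/^\/+/, "", path)
            split(path, part, "/")
            count[part[1]]++
        }
        END { for (d in count) printf "%d %s\n", count[d], d }
    ' | LC_ALL=C sort -k2
}

# Usage: rsync -a -n ... --log-format="%o %f" SRC/ DEST/ | summarize_log
```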
Total size: 38MB, 3011 files. Copy it onto Holly: $j/holly.bku
Compare the saved content of card 03 with what's on Holly. Install wanted items and add them to backup.pln. Decide about unwanted items; why were they on Holly and why weren't they deleted?
EFI partition: 253 files, 37 were different, most were deletions. The only old one that might be kept is README.jim, but it refers to problems from 2018, so toss. I'm not going to touch EFI.
Root partition (groan): Number of files: backup 126197, new 115087. How many are actually identical? 88346 identical, 37851 only in bku, 26741 only in new ("only in" meaning having a different sum OR being present on only one side).
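Counts like these can be produced from two checksum listings with comm. This is a sketch under assumptions: each input file holds "checksum  path" lines with paths relative to the root of its tree, and the name compare_sums is made up here. A line counts as identical only if both sum and path match, so a changed file is counted once on each "only" side, the same definition as above.

```shell
#!/bin/sh
# compare_sums OLD.sums NEW.sums
# Inputs can be made with, for example:
#   (cd /s1/holly/ROOT && find . -xdev -type f -exec sha256sum {} +) > old.sums
compare_sums() {
    a=$(mktemp) && b=$(mktemp) || return 1
    LC_ALL=C sort "$1" > "$a"
    LC_ALL=C sort "$2" > "$b"
    printf 'identical %d\n' "$(( $(LC_ALL=C comm -12 "$a" "$b" | wc -l) ))"
    printf 'only_old %d\n'  "$(( $(LC_ALL=C comm -23 "$a" "$b" | wc -l) ))"
    printf 'only_new %d\n'  "$(( $(LC_ALL=C comm -13 "$a" "$b" | wc -l) ))"
    rm -f "$a" "$b"
}
```

As a sanity check, identical plus only_old should equal the old total, and identical plus only_new the new total: here 88346 + 37851 = 126197 and 88346 + 26741 = 115087.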
Edit, removing large blocks of files that should be for the distro only. Directories to consider for restoration:
2544 files were considered. 175 files missing. 55 different.
All the different ones have already been dealt with in the
official backup and are not going to be restored from the
unofficial one.
These few missing files are going to be restored:
Conclusion: Everything that should be backed up, is backed up, except /var/lib/chrony/chrony.drift . Jimc's homedir on hosts other than Jacinth and Xena: I should review how it's being kept in sync.
Final checkout. restarter: didn't need to restart anything. checkout.sh:
This disaster recovery exercise has been surprisingly successful.
I successfully returned Holly to the status quo ante.
The SD card format of Holly's disc made restoration more
convenient than if I had to deal with a rotating disc. There was no
problem reading the backup storage area.
I took the time to clean up cruft in my package selection list: 16 packages tossed or replaced.
Customized and app-specific files were restored from the official backup. Quite a lot of files were legitimately updated and should not be overwritten from the backup. Few or no files are being backed up that shouldn't be.
Due to an unofficial backup of the whole SD card after it got
trashed, I was able to positively identify files that should have been
backed up and weren't: only two of them, now fixed in the backup plan.
In the update of 2022-03-22, all hosts got all the way through the update with no problems, except Claude was the last to finish and got chomped. The mirror that I was using (provo-mirror.opensuse.org) had a weird event in which all the content vanished: the root (htdocs) directory got served but it was empty. I aborted the update, but some cache file was left in a state where libzypp complained that the baseurl of something unspecified lacked a host part, so it could not refresh from a different mirror. Removing /var/cache/zypp did not help. After much struggle I ran out of ideas and declared Claude to be a total loss. I'll need to reinstall the OS on Claude, then restore configuration and web content from backups.
I followed basically the same procedure as for Holly. But Claude is a virtual machine with its disc storage on a file on the host system (Jacinth). Unless I want to mess with the install disc, I'm going to be doing most of the process on the host with a --root option directing packages to the mounted guest filesystem.
Preparing to reinstall the OS on Claude.
Make a backup copy of Claude's disc:
Need to identify Claude's partitions.
Mount it readonly on /mnt
Back up the whole partition in /s1/claude-hosed/backups/220323a (which is the effective root). Let's call this bkroot.
Now trying to reinstall packages on Claude.
Filesystem check:
Remount /mnt read-write. The first time I did this I was using
/dev/loop0 with an explicit offset, and apparently that precludes
mount -o remount,rw /mnt . So I unmounted and repeated the mount
command without -r.
Make a list of all installed packages. It will be needed to know what to re-install.
Remove all files and symlinks but keep directories:
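One way to do this with find, as a sketch and not necessarily the command originally used. It stays on one filesystem, leaves directories (and any device nodes or sockets) in place, and refuses to run on an empty argument or on / itself; the function name strip_files is invented for this example.

```shell
#!/bin/sh
# strip_files DIR
# Delete regular files and symlinks below DIR, keeping the directory tree.
strip_files() {
    case "$1" in
        ''|/) echo "strip_files: refusing to run on '$1'" >&2; return 1 ;;
    esac
    [ -d "$1" ] || { echo "strip_files: $1 is not a directory" >&2; return 1; }
    find "$1" -xdev \( -type f -o -type l \) -delete
}

# Usage: strip_files /mnt   (only after the backup copy has been verified)
```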
Restore /etc/zypp from backup. (Try to) refresh repos. On my net
a CNAME backup represents the host with the backup storage.
ssync is a script that adds to the rsync parameters
--log-format="%o %f" to show a list of files
transferred; normally rsync prints nothing.
Install aaa_base and filesystem packages. If I had wiped the disc completely, clearly these would have to go on first. Good thing that I didn't.
From the list of all installed packages on Claude, extract the package basenames (minus version and architecture) into $j/pkglist . 4262 packages. I'm leaving the already installed packages in the list; let zypper noisily skip them.
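Stripping version, release and architecture from names like filesystem-15.5-40.2.aarch64 can be done with one sed expression, because neither the version nor the release of an RPM may contain a hyphen. A sketch follows; the function name pkg_basenames is hypothetical. Lines that don't fit the name-version-release.arch pattern pass through unchanged.

```shell
#!/bin/sh
# pkg_basenames < full-package-names > bare-names
# Removes the trailing "-VERSION-RELEASE.ARCH" and de-duplicates.
# If the guest's RPM database is readable, an alternative that skips the
# text munging entirely is:
#   rpm --root /mnt -qa --queryformat '%{NAME}\n'
pkg_basenames() {
    sed -E 's/-[^-]+-[^-]+\.[^.-]+$//' | LC_ALL=C sort -u
}

# Usage: pkg_basenames < all-installed.txt > "$j/pkglist"
# (all-installed.txt is a placeholder for the saved package list.)
```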
Install all the packages. It's taking a long time to do "Reading
installed packages" because there are so many packages.
Unmount /mnt, then make another backup in /s1/kvm/claude/claude-disc1.bk3.gz
Restoring from the official backup: mount the root partition r/w on /mnt (see prior section), then:
Everything. The only ones actually restored were /var/lib/chrony ntp unbound; the rest were not backed up.
Making yet another backup in /s1/kvm/claude/claude-disc1.bk4.gz
So how are we going to install Grub?
It created /boot/grub2/i386-pc/core.img OK. Aargh! I had a big problem the first time I tried this: ext2 doesn't support embedding, could only use blocklists, deprecated, will not proceed with blocklists. The issue is that /dev/loop0 (with an explicit offset) contains only partition 2, and you can't install grub in an individual partition (or that's a stupid way to do it); rather you should install in the whole disc, i.e. the MBR. I followed the procedure shown above with losetup --partscan, and mounted /dev/loop0p2 . Then grub was installed with no errors.
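The two ways of attaching the image differ only in what the loop device covers. As a sketch: the byte offset for the explicit-offset method is the partition's start sector times the sector size. The start sector 2048 and the name IMAGE below are placeholders, not Claude's real values; take the start sector from parted or fdisk output. The function name part_offset is made up for this example.

```shell
#!/bin/sh
# part_offset START_SECTOR [SECTOR_SIZE]
# Byte offset of a partition inside a raw disc image (default 512-byte sectors).
part_offset() {
    echo $(( $1 * ${2:-512} ))
}

# Explicit offset: the loop device exposes only that one partition, so a
# boot loader has no whole disc (MBR) to install into.
#   losetup --find --show --offset "$(part_offset 2048)" IMAGE
#
# Whole disc with partition scan: the kernel creates loopNp1, loopNp2, ...
# and the loop device itself stands for the whole disc.
#   losetup --find --show --partscan IMAGE
```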
Booting Claude for the first time. The KVM host is Jacinth.
Likely relevant module loaded on Oso (another VM): virtio_blk. After checking some forum posts that didn't lead to a solution, and after reading the entire man page for Dracut, I came up with this command line:
Now it boots. But some non-restored features caused trouble.
Next day, Claude passes all tests in checkout.sh. It is now operational, except undoubtedly a few missed items remain to be discovered.