Holly Hosed, Tested Backup

James F. Carter <jimc@jfcarter.net>, 2021-06-08

Today SuSE has a big update for aarch64, moving lots of stuff from /bin and /lib to /usr/bin and /usr/lib. This was done for x86_64 last week and went smoothly. But not on Holly. The result is that either the executor or the shared libraries have gone missing, so if you do "ls filename" it says "/usr/bin/ls: no such file or directory" (even though it's there, per "echo /usr/bin/ls*"). Specific error message:

Installation of filesystem-15.5-40.2.aarch64 failed:
Error: Subprocess failed. Error: RPM failed: Make a copy of `/bin'.
(It seems to have successfully made a copy of /bin.)

This upgrade was not exactly atomic: last week the firmware migrated from /lib/firmware to /usr/lib/firmware. But the drivers, specifically the out-of-kernel, user-compiled driver for (RTL) 88x1bu.ko, are looking for the firmware in /lib/firmware. A simple fix was to make a symlink from /lib/firmware to /usr/lib/firmware. Problem solved. Yeah, sure. I suspect, without real proof, that the posttrans script for the filesystem package was not smart enough to recognize that it should just remove the symlink, not try to copy it to /usr/lib where ./firmware already exists. In any case, only part of the essential infrastructure formerly in /lib got copied over.
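A read-only survey of the affected paths shows at a glance which ones are symlinks and which are real directories. This is a minimal sketch; the helper name classify_path is hypothetical and not part of the original procedure.

```shell
# Report whether a path is a symlink, a real directory, or missing.
# classify_path is a hypothetical helper name, not from the original notes.
classify_path() {
    if [ -L "$1" ]; then
        printf '%s: symlink -> %s\n' "$1" "$(readlink "$1")"
    elif [ -d "$1" ]; then
        printf '%s: real directory\n' "$1"
    elif [ -e "$1" ]; then
        printf '%s: other file\n' "$1"
    else
        printf '%s: missing\n' "$1"
    fi
}

# Read-only survey of the paths involved in the move to /usr.
for p in /bin /sbin /lib /lib64 /lib/firmware /usr/lib/firmware; do
    classify_path "$p"
done
```

Running this before a big update would have flagged /lib/firmware as a hand-made symlink that the package scripts might not expect.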

So I'm going to have to do something drastic to bring Holly back to life. The intervention will obviously involve restoring things from backup… and now is a very good opportunity to test if my backup system is actually saving everything important, and if access to the backup copy is actually feasible. (Other shops have had shortcomings in both these areas.) Fortunately all hosts including Holly were freshly backed up, according to the standard procedure, just before the system update that went awry.

Goals

Here are some goals for the recovery campaign:

Of course the prime goal is to recover a working Holly identical to the one trashed in the update.
Holly is a Raspberry Pi (RPi), which uses an SD card in the role of a solid state disc. I would prefer to have the restored system on the same SD card: saves a lot of hassle in unscrambling /etc/fstab, updating records, etc. (But I've got to get the correct labels onto the partitions.)
There is a small but significant amount of customization on Holly, which is freshly backed up. I want to restore it from backup, as a test of the backup procedure.
There is probably stuff that isn't backed up. I want to restore that too, from the original content of the card. And to improve the backup procedure accordingly.

Outline of the Plan

This is on card 03, 32GiB. Copy everything to xena:/s1/holly using rsync (or the equivalent with tar).
I'm using OpenSuSE Tumbleweed. Locate the current XFCE image for aarch64 and copy it (raw) onto the card.
Restore minimal configuration files from backup, so I can use SSH remote execution privilege to get on the new Holly.
- /etc/hostname
- /etc/passwd
- /etc/shadow
- /etc/group
- /etc/sysconfig/network
- /etc/ssh
- /root/.ssh
- /m1/custom/extra.sel (special packages wanted)
- /m1/custom/scripts.extra (special services to start)
- /m1/custom/conffiles (local backups of a few configuration files)
Some packages such as unbound (DNS client/server) have their own user and group, but the numeric UID/GID used on the installed files and directories will not be the same as what will be restored from backup. This will have to be straightened out later.
Boot the RPi. (Minus mt76x2u wireless NIC.)
Check if filesystem-15.5-40.2.aarch64 was installed on the image. Yes it was.
I have a script called post_jump for setting up a new machine. It installs all my standard configuration files, then installs normal and extra packages, removes unwanted packages, and does a dist-upgrade. It enables wanted services and disables unwanted ones, and does final detail work like extracting (or creating) the Kerberos host keys.
jacinth:~reports/public_html/update.d/holly has a list of packages which, before the failed update, were on Holly or were expected to be installed. Compare this list with what's actually on Holly. Update the package lists that post_jump reads: add packages that were missed, and remove cruft, particularly wanted but discontinued packages.
Compare Holly's backup with what's on Holly. Bypass items that were legitimately changed in today's update. Install wanted items that are missing or that have my customization. Identify cruft in the backup and remove from backup.pln.
Compare the saved content of card 03 with what's on Holly. Install wanted items and add them to backup.pln. Decide about unwanted items; why were they on Holly and why weren't they deleted?
Final checkout. I have a collection of test scripts for important (and testable) services.

Executing the Plan

Stuffing the SD Card

Card slot issues: On Xena the card will make contact, but won't stay in the slot. On Jacinth's monitor, it doesn't seem to recognize that there's a card in the slot. Finally: it works in Iris' slot. Just stick in a full-size card (adapter), no locking. The raw device is /dev/mmcblk0; ROOT-03 is on /dev/mmcblk0p2.
Various issues made trouble on the first attempt to restore the SD card contents and I finally decided to abort and re-do it from the beginning. Where there were differences they are described below together, designated as try #1 or try #2.
This is on card 03, 32GiB. Copy (filesystem level) to iris:/s1/holly . The partition tables on the SD card and on the image are not the same: when first booted, the image's initrd will create a swap partition and will expand the root to fill available space.
It's an MSDOS table (not GPT). 1: 32MB fat16 (EFI), 2: 31.5GB ext4 (root), 3: 0.51GB swap. Command lines:
parted /dev/mmcblk0 print
fsck -f -C /dev/disk/by-label/ROOT-03
mount -r /dev/disk/by-label/ROOT-03 /mnt
rsync -a /mnt/ /s1/holly/ROOT/
umount /mnt
168021 (1.68e5) files copied taking about 10min; 9.89GB.
Locate the current XFCE image for aarch64 and copy it (raw) onto the card. https://en.opensuse.org/HCL:Raspberry_Pi3 has a link to http://download.opensuse.org/ports/aarch64/tumbleweed/appliances/openSUSE-Tumbleweed-ARM-XFCE-raspberrypi.aarch64.raw.xz; if the link is broken, which isn't too rare, dig around for the latest version in the containing directory. Size 1.12e9 bytes (1.12GB) compressed, took 283sec, 4.2MB/s.
To copy onto the card: xzcat $image.xz | dd bs=4M of=/dev/mmcblk0 iflag=fullblock oflag=direct status=progress ; sync
Rate ~17MB/sec. Total (uncompressed): 5.91GB, 373sec, 15.8MB/sec.
Resulting partition table: 1: 67.1MB fat16, 2: 524MB swap, 3: 5.3GB ext4, 5.9GB total. At first boot, partition 3 will be expanded to fill the card. The partitions are labeled EFI, SWAP, ROOT (hope that your own disc has different labels, as mine do). To label these partitions:
fatlabel /dev/disk/by-label/EFI EFI-03 (from dosfstools)
tune2fs -L ROOT-03 /dev/disk/by-label/ROOT
mkswap -L SWAP-03 /dev/disk/by-label/SWAP (reinits swap space)
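After relabeling, it is worth confirming that every label /etc/fstab refers to actually exists on the card. The sketch below assumes the fstab uses the LABEL=... form in its first column (an assumption; an fstab that names /dev/disk/by-label/... paths directly would need a different pattern). The function names are hypothetical.

```shell
# Print every LABEL= value used as a device in an fstab-format file.
# Assumes the LABEL=name form in column 1; comment lines are skipped
# automatically because they do not start with LABEL=.
fstab_labels() {
    awk '$1 ~ /^LABEL=/ { sub(/^LABEL=/, "", $1); print $1 }' "$1"
}

# Report which of those labels currently exist under /dev/disk/by-label.
check_labels() {
    for label in $(fstab_labels "$1"); do
        if [ -e "/dev/disk/by-label/$label" ]; then
            echo "ok      $label"
        else
            echo "MISSING $label"
        fi
    done
}

# Example, with the card's root mounted on /mnt:
#   check_labels /mnt/etc/fstab
```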
I should have copied critical items (/etc/ssh, /root/.ssh) at this point, but didn't. Later I ended up shutting down Holly and moving the card back to Iris, several times. On try #2 I pre-copied the files listed in the plan summary. Plus I checked /etc/fstab carefully, then copied from backup. Simple way to do the copy: first make a file containing the filenames to be copied, with their future paths on Holly, e.g. /etc/ssh. Then use this command line, without a trailing '/' on the source and destination:
rsync -a -r -K -O --files-from=./filelist /home/backup/holly /mnt
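The file list itself can be generated from the items in the plan. This is a sketch, with make_filelist as a hypothetical helper name. Two rsync details matter here: with --files-from the names are taken relative to the source directory (leading slashes are dropped), and -a no longer implies recursion, which is why -r is given explicitly.

```shell
# Write the list of paths to restore before first boot (taken from the plan).
# make_filelist is a hypothetical helper name.
make_filelist() {
    cat > "$1" <<'EOF'
/etc/hostname
/etc/passwd
/etc/shadow
/etc/group
/etc/sysconfig/network
/etc/ssh
/root/.ssh
/m1/custom/extra.sel
/m1/custom/scripts.extra
/m1/custom/conffiles
EOF
}

# Example:
#   make_filelist ./filelist
# Dry run first (-n -v lists what would be copied, writes nothing):
#   rsync -a -r -K -O -n -v --files-from=./filelist /home/backup/holly /mnt
```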

Initial Boot and Login

Boot the RPi. (Minus mt76x2u NIC.)
- It booted, no problem. It created a swap partition and expanded the root partition to fill available space.
- Due to issues with the KVM switch, the keyboard was not recognized on the splash screen.
- LightDM greeter is up. Authenticate as root, password = linux.
Logging in (try #1). root/linux got me in at the console, but the XFCE desktop didn't start (I waited about 1min). So I flipped to VT1; logged in successfully.
- It's supposed to have a DHCP IPv4 address. It does, aleatory (192.9.200.240). IPv6 link local only. Managed by NetworkManager.
- Pre interventions:
  passwd root (to my normal, much more secure root password)
  echo "holly" > /etc/hostname
  hostname "holly"
  ip addr add 192.9.200.199/26 dev en0
  ip addr del 192.9.200.240/26 dev en0
  I did this over with "yast2 lan" and switched to Wicked.
  It's on the net.
Logging in (try #2). A lot nicer with backed-up critical files preloaded.
- lightdm greeter did not start up (likely a permission problem; yes confirmed). Ctrl-Alt-F1 worked; getty is listening.
- root/linux got me in on the console.
- It has an aleatory DHCP IPv4 address, and IPv6 link local only. I manually gave it 192.9.200.199 (Holly), same as in try #1. It's on the net, and other hosts can do "ssh holly …" and it's accepted since /root/.ssh/authorized_keys is preloaded. As is the host key in /etc/ssh.
- Switching over to Wicked using yast2 lan.
SSH setup (try #1). It has some random host key which CouchNet hosts are not going to touch. I should have done the following steps when I was monkeying with the SD card. Shutting down Holly and doing that now. Copied /etc/ssh and /root/.ssh . OK, now it lets me on with RSA authentication. Try #2: these files were pre-copied and SSH worked upon boot.
Guess what, the RPi has no realtime clock. Set the time. Here's a simple way but not too accurate; do as quick as possible:
- On Holly, type: date -s @ (don't press return)
- Elsewhere: date +%s (select the integer, i.e. double click)
- Paste the integer date on Holly after the @ and hit return.
- Actually post_jump syncs the time, automating this procedure, but it's better to get it right at the beginning.
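Once SSH works (as it did from first boot on try #2), the copy-and-paste step can be collapsed into one command run from another host. A minimal sketch, assuming root SSH access to holly; the helper name is hypothetical and this is not how post_jump does it.

```shell
# Build the command that sets a remote clock to this machine's current time.
# Epoch seconds are independent of timezone, so no TZ handling is needed.
# remote_date_cmd is a hypothetical helper name.
remote_date_cmd() {
    printf 'date -s @%s\n' "$(date +%s)"
}

# Example (assumes root SSH access to holly already works):
#   ssh root@holly "$(remote_date_cmd)"
```

The result is only as accurate as the SSH round trip, which is good enough to get Kerberos and package signature checks working until a real time sync runs.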
Do a dist-upgrade (try #1 only) and make sure that filesystem-15.5-40.2.aarch64 gets installed. The image is very recently updated; filesystem-15.5-40.2.aarch64 is already installed. Doing the dist upgrade anyway.
zypper refresh
zypper dist-upgrade --auto-agree-with-licenses --no-recommends --download-as-needed
2 new pkgs to install 1 to remove:
Installing libopenssl3 openssl-3 ; remove openssl-1_1
For try #2 I'm letting post_jump do the dist-upgrade.
Copy from backup files that post_jump needs. Try #1 only; in try #2 I already did this when preparing the card. List of files: /m1/custom/extra.sel /m1/custom/scripts.extra /m1/custom/conffiles /etc/sysconfig/network

Running Post_jump

Try #1: Run post_jump, should install all the desired packages. Issues:
- diff seems to be missing??? Package diffutils, needed on Jacinth by rcs dasher rpm-build. But not on Holly. Added explicitly to couchnet.sel.
- /etc/krb5/krb5.keytab was(?) installed but krb-maint test failed. Re-done by hand, now krb-maint test succeeds. post_jump had a wrong option to krb-maint [fixed].
- audit-scripts is still complaining about missing /etc/init.d [Fixed post_jump].
- rsync: [generator] failed to set permissions on "/lib": Operation not supported (95)
  Because they're symlinks to /usr/lib{,64} and you can't set their permissions. Ignore it.
- Installed 675 wanted packages… Oh, crap! digest verification failed… Too late, got to fix tomorrow.
- Removed 590 unwanted packages. Later look for rpm{new,save,orig}
- module /lib64/security/pam_pwquality.so may be a new requirement.
- dist-upgrade of 51 packages (so why weren't they done earlier?) Because some of them are switching to PackMan.
- (Re)installation of these configuration files, check if still correct:
  send /home/post_jump/byhg/aarch64 lib
  send /home/post_jump/byhg/aarch64 lib64
  send /home/post_jump/byhg/aarch64 usr/lib64
- Fix the current date for Kerberos client initialization — it says inactive ??
- checksum.J 3800 files added to (empty) inventory.
- /var/spool/atjobs does not exist, prob was not installed. Better re-run post_jump tomorrow to get all the nitpicky details.
- diffie.J -- Some kind of weird error suggesting non-installed pkg. Better force rebuilding the moduli. Even better plan: tomorrow I should do the whole installation from the beginning.
post_jump (try #2): Last time around there was a botched digest verification during installation.
- post_jump phase 7 is installing wanted packages. I'll use the -P 6 option to stop at the end of phase 6.
- Run package installation with manual intervention:
  audit-pkgs -v -i -c -I
- Then resume post_jump -p 8 (starting with phase 8, delete unwanted packages).
Issues while installing wanted packages:
- No provider of 'usb-modeswitch' found. (Should be usb_modeswitch, fixed in couchnet.sel)
- Problem: the installed systemd-logger-246.13-2.1.aarch64 conflicts with 'namespace:otherproviders(syslog)' provided by the to be installed rsyslog-8.2104.0-1.2.aarch64 (install rsyslog, toss systemd-logger)
- Problem: the installed u-boot-rpiarm64-2021.04-4.2.aarch64 conflicts with 'u-boot-loader' provided by the to be installed u-boot-rpi3-2021.04-4.2.aarch64 (keep u-boot-rpiarm64, don't install u-boot-rpi3)
- Problem: the to be installed gstreamer-plugins-libav-1.18.4-60.1.aarch64 requires 'libavcodec58_134(unrestricted)', but this requirement cannot be provided (allow vendor change to PackMan)
- Problem: the to be installed ffmpeg-4-4.4-4.3.aarch64 requires 'libavformat58_76 = 4.4-4.3', but this requirement cannot be provided (allow vendor change to PackMan)
- 5 packages to upgrade, 664 new, 1 to remove, 5 to change vendor. Download 560.6 MiB, decompressed: 1.7 GiB
- PackageKit-branding-openSUSE-42.1-2.16.noarch digest bad. Skip.
- apache2-manual-2.4.48-1.1.noarch wrong size. Skip.
- autoconf-2.69-17.20.noarch.rpm 403 not found. Skip.
- gcc11-info-11.1.1+git121-2.1.noarch.rpm wrong size. Skip.
- gccmakedep-1.0.3-4.1.noarch botched digest. This is getting excessive. Abort.
It looks like there may have been a key rollover. I use Squid as a proxy so all my hosts can share one download; however, if RPM or metadata files get updated and Squid thinks they're still fresh, bad things can happen as I just experienced. I suppressed the proxy in the repo definitions, refreshed the repo, and re-did the installation phase (same command line as before). 5 packages to upgrade, 631 new, 5 to change vendor. Download: 519.4 MiB; decompressed 1.6 GiB.
- Most or all of the packages botched previously were installed OK.
- No further errors in installation.
- These are the 16 packages wanted but not installable. Probably they are in couchnet.sel and have disappeared from the distro or been renamed (I see at least one of those).
  - ConsoleKit (installed on x86_64, unavail. on ARM)
  - ConsoleKit-x11 (ditto)
  - GeoIP (gone)
  - ffmpeg-4 (vendor change to PackMan; installed)
  - giggle (gone)
  - gstreamer-plugins-libav (vendor change to PackMan; installed)
  - hddtemp (gone)
  - ifplugd (gone)
  - orage (gone)
  - python-gstreamer (gone)
  - u-boot-rpi3 (conflicts with new version)
  - usb-modeswitch (correct spelling is usb_modeswitch)
  - xfce4-mixer (gone)
  - xfce4-mixer-plugin (gone)
  - yast2-trans-en (gone)
  - yast2-trans-en_US (gone)
  - Also remove quagga from couchnet.sel and /etc/
  - Also add u-boot-rpiarm64-2021.04-4.2 to couchnet.sel
- Resuming post_jump -p 8 (delete unwanted packages).
  - Tossed 516 unwanted packages.
  - Dist upgrade: 53 pkgs to update, 16 to change vendor (to PackMan)
  - audit-scripts probably didn't run due to nonexistent /etc/init.d [fixed script, re-ran it, and empty /etc/init.d was removed on all hosts.]
  - sync_jump /lib /lib64 -- can't set attributes (because it's a symlink?)
  - /etc/krb5/krb5.keytab didn't get installed. [done]
  - quagga is bogusly in /etc/passwd group shadow. Need to overwrite. It's also bogusly installed! Get it out of there. [done]
  - Remove: /home/post_jump/byhg/v99.8 /etc/pam.d/quagga /home/post_jump/byhg/v99.8 /etc/quagga [done]

Restore from Backup

Compare jacinth:~reports/public_html/update.d/holly with the list of actually installed packages. Install wanted ones (and update extra.sel). Identify cruft and don't install.
- The only packages I needed to add to extra.sel were hostapd + nmap.
- Other package changes were made globally, i.e. in couchnet.sel; see above for the list.
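The comparison of the saved package list with the installed set can be sketched with comm. This assumes both lists have been reduced to one package name per line (the actual format of the file on Jacinth is not shown here); missing_pkgs is a hypothetical helper name.

```shell
# Print packages that appear in the wanted list but are not installed.
# Both input files: one package name per line, in any order.
# missing_pkgs is a hypothetical helper name.
missing_pkgs() {
    _w=$(mktemp) _i=$(mktemp)
    sort -u "$1" > "$_w"
    sort -u "$2" > "$_i"
    comm -23 "$_w" "$_i"      # lines only in the wanted list
    rm -f "$_w" "$_i"
}

# Example, on the rebuilt host:
#   rpm -qa --qf '%{NAME}\n' > installed.txt
#   missing_pkgs wanted.txt installed.txt
```

Swapping the two arguments gives the opposite view: packages that are installed but no longer on the wanted list.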
Intermediate reboot and checkout.
- pam_cracklib is not installed
- lightdm would not start (why?). The X-server looks like it started OK. Got everything started up, then 1.5sec later, shut everything down. Opened greeter session, but it immediately logged out and lightdm died. [Discovered and fixed the issue: /var/lib/lightdm and /var/lib/lightdm/.Xauthority were not owned by lightdm.]
- checkout.sh results: these failed:
  - bluetooth-fw.J.path is disabled. (Why? Because audit-scripts not run yet.)
  - apache2 is disabled (ditto)
  - display-manager is hosed, see above
  - hostapd.J is hosed, no config file (not yet restored from backup)
  - unbound is not running (and screwed up sshd)
  - vsftpd won't start, no host cert (not yet restored from backup)
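Ownership problems like the lightdm one come from restoring /etc/passwd and /etc/group over an image whose packages were installed with different numeric IDs. A quick check for a service's state directory might look like this sketch (not_owned_by is a hypothetical helper name):

```shell
# List everything under a directory that is NOT owned by the given user.
# USER may be a name or a numeric UID.
not_owned_by() {   # usage: not_owned_by USER DIR
    find "$2" ! -user "$1" -print
}

# Example:
#   not_owned_by lightdm /var/lib/lightdm
# An empty result means ownership is consistent. Otherwise something like
#   chown -R lightdm:lightdm /var/lib/lightdm
# repairs it (assuming the group is also named lightdm).
```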
Compare Holly's backup with what's on Holly. Install wanted items. Identify cruft in the backup and remove from backup.pln.
rsync -a -n -O -x --force --log-format="%o %f" /home/backup/holly/ holly:/
Command line interpretations:
- -a = Preserve all file attributes and also recurse into directories.
- -n = No-op, report what would have been transferred but don't transfer anything.
- -O = For directories, if the only discrepancy is times, skip it.
- -x = One file system, send nothing outside the filesystem that the toplevel argument is in. In case other filesystems are mounted, unlikely in this case. /proc/$PIDs/mem is particularly bad.
- --force = OK to replace a destination directory with a source non-directory. Specifically the /etc/unbound symlink.
- --log-format = Lists the files being transferred (normally errors only).
1186 files are discrepant. Likely many should not be restored. Total size: 38MB, 3011 files. Copy it onto Holly: $j/holly.bku ($j is my all-purpose temporary directory, /var/tmp/root.jimc).
- Of the files in the backup that differ from the extant instance, many are the result of legitimate updates or screwup fixes that were installed on the new version of Holly and should not be overwritten. Action: remove from the copy of the backup that is going to be restored. 53 differing files were thus removed.
- Backup files that ought to overwrite extant ones:
  - /root/.ssh/known_hosts (should be empty due to SSHFP)
  - /etc/systemd/resolved.conf (jimc hacks)
  - /etc/systemd/network/50-rad0.link (jimc hacks)
  - /etc/systemd/system/remote-time.service (bug fixed)
  - /etc/sysconfig/postfix (official CouchNet)
  - /etc/X11/xdm/xdm-config and numerous friends (major jimc hacks)
  - /etc/postfix/transport.lmdb and friends (binary) differ, restoring, probably makes no difference.
- 622 missing files, i.e. not on Holly but present in the backup. All missing files were restored, plus the differing files listed in the previous section that were supposed to overwrite the extant instances.
- Jimc's homedir is a special case: the laptop's homedir is frequently synced with jimc's homedir on Jacinth, and that gets backed up. Syncing restored the homedir from Jacinth to Holly.
- Conclusion: There's no major stuff that is being backed up and should not be, but on a rebuild like this, it's inevitable that a lot of updated files need to be kept, not overwritten from the backup.
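Splitting the discrepant files into "missing on Holly" and "present but different" can be done mechanically. The sketch below assumes the rsync dry-run listing has been reduced to one relative path per line; the helper name is hypothetical. Since every listed path is one rsync wanted to transfer, a path that exists on the destination must differ from the backup copy.

```shell
# Classify paths from a dry-run listing against a destination root.
# Input on stdin: one relative path per line.
# classify_paths is a hypothetical helper name.
classify_paths() {   # usage: classify_paths DESTROOT < pathlist
    while IFS= read -r path; do
        if [ -e "$1/$path" ] || [ -L "$1/$path" ]; then
            echo "differs $path"
        else
            echo "missing $path"
        fi
    done
}

# Example, run on Holly itself:
#   classify_paths / < pathlist | sort > classified.txt
```

The "missing" entries can generally be restored wholesale; the "differs" entries are the ones that need a human decision.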
Compare the saved content of card 03 with what's on Holly. Install wanted items and add them to backup.pln. Decide about unwanted items; why were they on Holly and why weren't they deleted?
- EFI partition: 253 files, 37 were different, most were deletions. The only old one that might be kept is README.jim, but it refers to problems from 2018, so toss. I'm not going to touch EFI.
- Root partition (groan): Number of files: backup 126197, new 115087. How many are actually identical? 88346 identical, 37851 only in bku, 26741 only in new ("only" meaning either a different checksum or present on only one side).
- Edit, removing large blocks of files that should be for the distro only. Directories to consider for restoration:
  - /etc
  - /home
  - /m1
  - /root
  - /srv
  - /usr/etc
  - /usr/share -- Probably should not be considered. This is the biggest single dir, around half of the files.
  - /var/lib/chrony
  - /var/lib/unbound -- was this properly backed up?
  - /var/lib/xdm
  - /var/lib/wwwrun
  2544 files were considered. 175 files missing. 55 different. All the different ones have already been dealt with in the official backup and are not going to be restored from the unofficial one. These few missing files are going to be restored:
  - /m1/custom/conffiles/etc/systemd/network/50-rad0.link (/m1/custom/conffiles is backed up; was 50-rad0.link mistakenly excluded when restoring the backup?)
  - /var/lib/chrony/chrony.drift (added to /var/lib/backup.pln)
- Conclusion: Everything that should be backed up, is backed up, except /var/lib/chrony/chrony.drift . Jimc's homedir on hosts other than Jacinth and Xena: I should review how it's being kept in sync.
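The identical / only-in-backup / only-in-new counts for the root partition come from comparing checksum manifests of the two trees. A sketch of that technique, assuming sha256sum is available (the checksum tool actually used is not stated); the function names are hypothetical. A file counts as "only" on one side if it is absent from the other or has a different checksum, so identical + only-old equals the old total, as in 88346 + 37851 = 126197.

```shell
# Print "checksum  ./relative/path" for every regular file under a directory.
manifest() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort )
}

# Count identical files and files unique to (or different on) each side.
# compare_trees is a hypothetical helper name.
compare_trees() {   # usage: compare_trees OLDROOT NEWROOT
    _old=$(mktemp) _new=$(mktemp)
    manifest "$1" > "$_old"
    manifest "$2" > "$_new"
    echo "identical $(comm -12 "$_old" "$_new" | wc -l | tr -d ' ')"
    echo "only-old $(comm -23 "$_old" "$_new" | wc -l | tr -d ' ')"
    echo "only-new $(comm -13 "$_old" "$_new" | wc -l | tr -d ' ')"
    rm -f "$_old" "$_new"
}

# Example:
#   compare_trees /s1/holly/ROOT /mnt
```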
Final checkout. restarter: didn't need to restart anything. checkout.sh:
- router-sol: saw no RS's but did see a RA. Re-test: OK.
- hostapd: is not running, because I'm in the middle of setting it up on Holly.
- unbound: not listening on port 53 /var/lib/unbound/etc/unbound/root.key is in the wrong format. unbound-anchor updated /var/lib/unbound/root.key but not the other. Got to figure this out. Running now.
- firewall.J: Lacks port 53 and 8953 from Unbound. Re-test; firewall passes the test now that Unbound is alive. The firewall tester knows that particular ports should be accessible from some hosts and not from others (particularly the wild side). If a port is unexpectedly unreachable, is the firewall being overactive? Not in this case.
- net-geom.J: skipped functional test; investigate. Net::Ping.pm is horked, a known problem. Bug fix was re-applied.
- check-net.S: /usr/lib/perl5/5.32.1/Net/Ping.pm horkage, re-apply bug fix. Now check-net.S passes.

Overall Conclusion

This disaster recovery exercise has been surprisingly successful.

I successfully returned Holly to the status quo ante.
The SD card format of Holly's disc made restoration more convenient than if I had to deal with a rotating disc. There was no problem reading the backup storage area.
I took the time to clean up cruft in my package selection list: 16 packages tossed or replaced.
Customized and app-specific files were restored from the official backup. Quite a lot of files were legitimately updated and should not be overwritten from the backup. Few or no files are being backed up that shouldn't be.
Due to an unofficial backup of the whole SD card after it got trashed, I was able to positively identify files that should have been backed up and weren't: only two of them, now fixed in the backup plan.

Appendix: Same Issue on Claude

In the update of 2022-03-22, all hosts got all the way through the update with no problems, except Claude was the last to finish and got chomped. The mirror that I was using (provo-mirror.opensuse.org) had a weird event in which all the content vanished: the root (htdocs) directory got served but it was empty. I aborted the update, but some cache file was left in a state where libzypp complained that the baseurl of something unspecified lacked a host part, so it could not refresh from a different mirror. Removing /var/cache/zypp did not help. After much struggle I ran out of ideas and declared Claude to be a total loss. I'll need to reinstall the OS on Claude, then restore configuration and web content from backups.

I followed basically the same procedure as for Holly. But Claude is a virtual machine with its disc storage on a file on the host system (Jacinth). Unless I want to mess with the install disc, I'm going to be doing most of the process on the host with a --root option directing packages to the mounted guest filesystem.

Preparing to reinstall the OS on Claude.

Make a backup copy of Claude's disc:
- nice -19 pigz -c claude-disc1.raw > claude-disc1.bk2.gz
- (or gzip, but pigz is multithreaded, for compression only)
Need to identify Claude's partitions.
- parted claude-disc1.raw
- print (shows partition boundaries in MB)
- The root is on partition 2, starts at 535MB, size 9951MB. Partitions are not labelled on this disc. Probably they should be.
- unit B (for bytes)
- print
- Starts at 534773760 , size: 9950986240 (bytes; remove the ending B)
- quit
- It turns out that I was able to get the kernel to make per partition block devices, so I didn't have to specify offsets explicitly.
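If explicit offsets are needed after all, parted's machine-readable output (-m) avoids the interactive session and the manual removal of the trailing B. A sketch, with part_geometry as a hypothetical helper name; it relies on the colon-separated fields number:start:end:size:... that parted -m prints for each partition.

```shell
# Extract the start offset and size (both in bytes) of partition N from
#   parted -s -m IMAGE unit B print
# part_geometry is a hypothetical helper name.
part_geometry() {   # usage: parted ... | part_geometry N
    awk -F: -v n="$1" '$1 == n {
        gsub(/B/, "", $2); gsub(/B/, "", $4)   # strip the unit suffix
        print $2, $4
    }'
}

# Example:
#   parted -s -m claude-disc1.raw unit B print | part_geometry 2
# With the values found above, an offset-based loop device would be:
#   losetup --find --show --offset 534773760 --sizelimit 9950986240 claude-disc1.raw
```

The --partscan route is still the better choice, since it exposes the whole disc and its partitions together.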
Mount it readonly on /mnt
- losetup -f (find an unused loop device; prints /dev/loop0)
- losetup --partscan /dev/loop0 claude-disc1.raw
- mount -r /dev/loop0p2 /mnt (this is the root partition; p1 is swap)
- After unmounting, losetup -d /dev/loop0 to disconnect it.
Back up the whole partition in /s1/claude-hosed/backups/220323a (which is the effective root). Let's call this bkroot.
- rsync -a /mnt/ $bkroot/

Now trying to reinstall packages on Claude.

Filesystem check:
- grep /mnt /proc/mounts (and note the loop device that was used)
- e2fsck -f -n -C 0 /dev/loop0p2
- (-n = don't attempt to write on this readonly mounted filesystem)
Remount /mnt read-write. The first time I did this I was using /dev/loop0 with an explicit offset, and apparently that precludes mount -o remount,rw /mnt. So I unmounted and repeated the mount command without -r.
Make a list of all installed packages. It will be needed to know what to re-install.
- $j is my all-purpose temporary directory: j=/var/tmp/root.jimc
- rpm --root /mnt -q -a | sort -o $j/rpm.all
  (But I had a copy saved from before.)
Remove all files and symlinks but keep directories:
- find /mnt ! -type d -delete
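Since this command is irreversible, a wrapper with a list-only mode makes it easy to check the target first. A minimal sketch; strip_files is a hypothetical helper name.

```shell
# Delete every non-directory under ROOT while leaving the directory tree.
# Pass -n as the second argument to only list what would be removed.
# strip_files is a hypothetical helper name.
strip_files() {   # usage: strip_files ROOT [-n]
    if [ "${2:-}" = "-n" ]; then
        find "$1" ! -type d -print
    else
        find "$1" ! -type d -delete
    fi
}

# Example:
#   strip_files /mnt -n | wc -l    # how many files and symlinks would go
#   strip_files /mnt               # actually remove them
```

Symlinks to directories count as non-directories here, because find does not follow them by default, so they are removed along with the regular files.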
Restore /etc/zypp from backup. (Try to) refresh repos. On my net a CNAME backup represents the host with the backup storage. ssync is a script that adds to the rsync parameters --log-format="%o %f" to show a list of files transferred; normally rsync prints nothing.
- ssync -a -n backup:/home/backup/claude/etc/zypp/ /mnt/etc/zypp/
- zypper --root /mnt refresh
  /etc/products.d/baseproduct is missing (duh, I just deleted everything), can't refresh.
- Restore /etc/products.d from backup.
  ssync -a -n backup:/home/backup/claude/etc/products.d/ /mnt/etc/products.d/
- Also restore passwd group shadow hosts (to get UIDs right)
  ssync -a "backup:/home/backup/claude/etc/{passwd,shadow,group,hosts}" /mnt/etc/
- Now the repos could be refreshed.
Install aaa_base and filesystem packages. If I had wiped the disc completely, clearly these would have to go on first. Good thing that I didn't.
- Cryptic dependency of mailx-12.5-33.1.x86_64.rpm: the %prein script failed because /bin/sh did not exist, nor did /usr/bin/bash. Got to install these in a separate initial step.
- zypper --root /mnt install --no-recommends --download-as-needed bash bash-sh
- zypper --root /mnt install --no-recommends --download-as-needed aaa_base filesystem
From the list of all installed packages on Claude, extract the package basenames (minus version and architecture) into $j/pkglist . 4262 packages. I'm leaving the already installed packages in the list; let zypper noisily skip them.
- sed -e 's/-[^-]*-[^-]*$//' /mnt$j/rpm.all > $j/pkglist
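The sed expression removes the last two dash-separated fields (version and release.arch). It also turns gpg-pubkey-KEYID-DATE entries into a bare gpg-pubkey, which is not an installable package. A sketch that filters those out at the same time; the function names and the sample version strings in the comments are illustrative only.

```shell
# Strip "-version-release.arch" from rpm -qa style names read on stdin.
strip_version() {
    sed -e 's/-[^-]*-[^-]*$//'
}

# Same, but drop the gpg-pubkey pseudo-packages and duplicates.
# pkg_names is a hypothetical helper name.
pkg_names() {
    strip_version | grep -v '^gpg-pubkey$' | sort -u
}

# Example (illustrative version strings):
#   bash-5.1.16-2.1.x86_64            becomes  bash
#   libz-ng-compat1-2.0.6-1.1.x86_64  becomes  libz-ng-compat1
#   gpg-pubkey-3dbdc284-53674dd4      is dropped
```

An alternative that avoids the text surgery, when the rpm database is still readable, is rpm --root /mnt -qa --qf '%{NAME}\n'.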
Install all the packages. It's taking a long time to do Reading installed packages because there are so many packages.
- zypper --root /mnt install --no-recommends --download-as-needed --auto-agree-with-licenses $(< $j/pkglist)
- The following issues had to be dealt with by hand:
- gpg-pubkey not found: keyID got chopped off due to the way the package list was created, should have been excluded entirely. Do not install.
- libGLEW1_13 not found (for mesa-demos which is useless on a VM). Probably a moldy version. Omit mesa-demos.
- X11_ABI_VIDEODRV not found for xf86-video-intel, useless on a VM. Omit xf86-video-intel.
- libz-ng-compat1 conflicts with libz1. Toss libz-ng-compat1.
- product:MicroOS-20220321-0.x86_64 conflicts with product:openSUSE-20220321-0.x86_64. Toss MicroOS.
- busybox-sed conflicts with sed and a gazillion miscellaneous packages (1608 pkgs). Toss busybox-sed.
- busybox-xz conflicts with xz. Toss busybox-xz.
- product:openSUSE-20220321-0.x86_64 requires product(openSUSE) = 20220321-0. Toss MicroOS (again).
- libopenh264-5 not found, probably back version. Skipped.
- libx264-(6 versions) not found, probably back version. Skipped.
- 4207 new packages to install, 4 to remove. It took almost exactly 3 hours to install them (faster than expected). Likely almost all were in my big Squid cache because of the previous botched install plus other maintenance activities.
Unmount /mnt, then make another backup in /s1/kvm/claude/claude-disc1.bk3.gz
Restoring from the official backup: mount the root partition r/w on /mnt (see prior section), then:
- ssync -a -n -O -x --force backup:/home/backup/claude/ /mnt/ >& $j/bku2claude.ls
- Option interpretation:
  - -n = List but don't transfer the files. When you see that they're satisfactory, repeat without -n.
  - -a = Preserve almost all file attributes. Also recurse into directories.
  - -O = Omit setting directory times if that's the only discrepancy.
  - -x = One filesystem, do not recurse into mount points, which is very nasty if you recurse into /proc.
  - --force = OK to replace a destination directory with a non-directory, e.g. a symlink, which is specifically needed for /etc/unbound -> /var/lib/unbound.
- Notable directories to be transferred:
  - /etc: Everything. Need to delete /etc/*.rpmsave in the backup (312 files on various machines) [done]. /etc/letsencrypt/keys/0000_key-certbot.pem has 32 moldy keys; delete various cruft in /etc/letsencrypt .
  - /home: Everything
  - /home/diklo has to come from elsewhere [done]
  - /root: Everything
  - /var: Everything. The only ones actually restored were /var/lib/chrony ntp unbound; the rest were not backed up.
- Comparison: diff -r /net/backup/home/backup/claude /mnt
  Unlike with Holly, I didn't spot any files that shouldn't be restored from backup. Go back and re-do the rsync without -n.
Making yet another backup in /s1/kvm/claude/claude-disc1.bk4.gz
So how are we going to install Grub?
- losetup -f (find an unused loop device; prints /dev/loop0)
- losetup --partscan /dev/loop0 claude-disc1.raw
- mount /dev/loop0p2 /mnt (this is the root partition; p1 is swap)
- This image does not have EFI, so don't try to mount it.
- Bind-mount /proc, /sys and /dev on their mount points in /mnt/ (mount -o bind).
- chroot /mnt
- grub2-install -v /dev/loop0 (with the default --target)
  Takes at least a minute, 1093 lines of oinkage printed. It copied a bunch of files into /boot/grub2, created /boot/grub2/i386-pc/core.img, wrote 108 sectors somewhere (probably the unallocated space after the MBR). No error reported.
- mkinitrd (has to be done before grub2-mkconfig). This was cryptically botched; see below for the right intervention.
- grub2-mkconfig -o /boot/grub2/grub.cfg
- Exit from chroot shell; umount -R -v /mnt; losetup -d /dev/loop0
It created /boot/grub2/i386-pc/core.img OK. Aargh! I had a big problem the first time I tried this: ext2 doesn't support embedding, could only use blocklists, deprecated, will not proceed with blocklists. The issue is that /dev/loop0 (with an explicit offset) contains only partition 2, and you can't install grub in an individual partition (or that's a stupid way to do it); rather you should install in the whole disc, i.e. the MBR. I followed the procedure shown above with losetup --partscan, and mounted /dev/loop0p2 . Then grub was installed with no errors.
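The bind mounts and teardown around the chroot are easy to get wrong by hand, so it can help to print the command sequence for review before running it. This sketch only prints commands and changes nothing; chroot_plan is a hypothetical helper name.

```shell
# Print (do not run) the commands to enter and leave a chroot on ROOT.
# chroot_plan is a hypothetical helper name.
chroot_plan() {   # usage: chroot_plan ROOT
    for d in proc sys dev; do
        echo "mount -o bind /$d $1/$d"
    done
    echo "chroot $1"
    echo "umount -R -v $1"
}

# Example:
#   chroot_plan /mnt
```

The final umount -R takes down the bind mounts and the root mount in one step, matching the procedure above; the loop device still has to be detached separately with losetup -d.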
Booting Claude for the first time. The KVM host is Jacinth.
- virt-viewer -c qemu+ssh://root@jacinth/system -w -r claude &
- virsh start claude (on the host, jacinth)
- Grub menu was shown.
- Dracut stuff in the initrd executed.
- Timeout waiting for disc UUID ending in 5b89c. This is the root.
- It dropped into Dracut Emergency Shell (give root password). /run/initramfs/rdsosreport.txt was useful this time but I couldn't exfiltrate it because of the current issues.
- /dev/disk does not exist.
- Modules loaded: uhci_hcd ehci_hcd usbcore serio_raw sg, nothing related to discs beyond sg (useless).
- systemctl poweroff
Likely relevant module loaded on Oso (another VM): virtio_blk. After checking some forum posts that didn't lead to a solution, and after reading the entire man page for Dracut, I came up with this command line:
- dracut --force-drivers virtio_blk --force --regenerate-all
- Repeat the grub installation procedure (2 sections back). You don't have to actually reinstall grub though. Replace mkinitrd with the above command line, then re-do grub2-mkconfig (might actually not be needed).
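To keep later initrd rebuilds (for example after a kernel update) from dropping the driver again, the same setting can be made persistent in a dracut drop-in file. A sketch, assuming the standard /etc/dracut.conf.d directory; the file name 10-virtio.conf and the helper name are hypothetical, and this was not part of the recovery as performed.

```shell
# Write a dracut drop-in that always forces virtio_blk into the initrd.
# The spaces inside the quotes are required by dracut's list syntax.
# write_virtio_conf and 10-virtio.conf are hypothetical names.
write_virtio_conf() {   # usage: write_virtio_conf CONFDIR
    mkdir -p "$1" &&
    printf 'force_drivers+=" virtio_blk "\n' > "$1/10-virtio.conf"
}

# Example, inside the chroot or on the running guest:
#   write_virtio_conf /etc/dracut.conf.d
#   dracut --force --regenerate-all
```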
Now it boots. But some non-restored features caused trouble.
- Failed to start firewall. Because Hostgroup.pm is missing. It's on the disc. But /usr/lib/perl5/site_perl/ is the native dir, not the correct symlink. I removed the dir and installed the symlink from Xena. Now firewall can start. I'm sure a bunch of other stuff is hosed too. I'll need to reboot.
- lightdm starts up on the console.
- SSH lets root on.
- Rebooting to get the perl stuff.
- Running the steps from my normal procedure doing a dist-upgrade.
- /usr/diklo/lib/daily/permissions.J -- several perms were wrong explaining daemons that refused to start.
- /home/post_jump/sync_jump -p -C -a claude -- Major discrepancies! 170 files. All the hacked stuff in /usr/{etc,lib,local} did not get restored. Reinstalled it now.
- rpmcclean -N -- 2 rpmnew files to delete.
- audit-scripts -v -c -k -- 15 unwanted services got disabled.
- Rebooting again to give effect to all of these.
- restarter ; checkout.sh -- discrepancies:
  - The usual crap with avahi-daemon
  - apache2 cert mismatch (?? self healed ??)
  - /home/httpd/htdocs/wp-login.php missing, not on Xena+Iris either. I've forgotten what this is and can't tell why it was logged in /var/log/apache2/error_log.
  - But Claude's apache2 is serving jimc's homedir on Claude.
  - It turned out that Jacinth's wild side IPv4 address (DHCP) was changed; it's not a fixed IP; and the tester was using the old IP. Once caches timed out, this problem self-healed.
Next day, Claude passes all tests in checkout.sh. It is now operational, except undoubtedly a few missed items remain to be discovered.