Check the box contents and record serial numbers. Try to get the Ethernet MAC address so it can be registered with the firewall. Recent (2017) NUCs have the MAC address and the serial number on a sticker on the bottom of the machine and also on the box.
What's in the box:
Intel Core i5.
Make a special backup of Diamond, in case of mistakes. [Done]
The new machine initially will be called Orion. Add it to hostdata.db and install relevant files everywhere (/etc/hosts, /etc/ethers, hostgroup.db, trusted-adr.fw). Reload the firewall, to accept the new MAC address. [Done.]
The machine is pre-assembled: no assembly steps. But it needs a name sticker. [Done]
Its temporary home will be atop Jacinth's cabinet. Steal Jacinth's monitor, keyboard and mouse. Connect Ethernet to the spare hub port. [Done]
Unplug the Ethernet (for the next several steps) and let Orion boot into Windows. Once it tries to start setup, shut it down. If it were dead on arrival, it would fail in this step. [It successfully did limited Windows Setup. It did a Wi-Fi scan and found the CouchNet SSID. Not dead on arrival.]
Update the BIOS using the flasher in the BIOS. [Done] Details here.
Set up options in the BIOS. [Done] Details here.
Download a recent Tumbleweed rescue disc, onto USB flash memory. Boot up Orion on this disc. [Done]
Execute /etc/kea/mkstatic.pl which replaces Orion's MAC address, so it will get the correct address when I plug in Ethernet. [Done, and Orion got its correct address.]
Partition the disc. [Done] Details here.
Copy partition contents from the old Diamond to Orion. [Done] Details here.
Boot up Orion on its own disc. Check out all services (checkout.sh). Fix any problems found. Details here.
Subsystem checkout and testing. I went through subsystem.shtml and did the listed tests; q.v. for the results. Discrepancies and fixups are listed here.
Do the power and speed measurements on the new machine. [Done] Details here.
I need to make Diamond become Orion, and Orion become Diamond,
without making errors. To that end I created /s1/etc-orion and
/s1/etc-diamond on both hosts, each with a directory structure starting
at the root and containing the host-specific files. I can swap one or
the other into place by:
rsync -n -a -O /s1/etc-diamond/ /
(-n for testing, and it will help to squeeze in
--log-format="%o %f" to make it report what it's copying.)
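The preview-then-apply habit described above can be wrapped in a small helper. This is only a sketch: the function name swap_overlay and the SHOW_ONLY variable are made up for illustration, and it assumes the overlay directories are named /s1/etc-HOST as in the text.

```shell
# Hypothetical helper: preview (default) or apply a host-specific overlay.
# usage: swap_overlay HOST [apply]
swap_overlay() (
    src="/s1/etc-$1/"
    dry="-n"                                  # dry run unless "apply" is given
    [ "${2:-preview}" = "apply" ] && dry=""
    if [ -n "${SHOW_ONLY:-}" ]; then
        # SHOW_ONLY=1 just prints the command line instead of running it.
        echo rsync $dry -a -O '--log-format="%o %f"' "$src" /
    else
        rsync $dry -a -O --log-format="%o %f" "$src" /
    fi
)
```

With SHOW_ONLY=1, "swap_overlay diamond" prints the same dry-run command line shown above, and "swap_overlay diamond apply" prints it without -n.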
The files involved are:
For each directory, once it's populated, install it on the other host. Test that everything in the host's own dir is already in place (rsync -n selects nothing for copying) and that all the files in the other host's dir would be copied (again per rsync -n).
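As a cross-check that does not depend on rsync, the same test can be approximated by comparing file contents directly. This sketch swaps in cmp for rsync -n, so it looks only at content, not owner, mode or timestamps; the function name overlay_diff is hypothetical.

```shell
# List overlay files whose content differs from (or is missing in) ROOT.
# usage: overlay_diff OVERLAY_DIR ROOT_DIR
overlay_diff() (
    overlay="${1%/}"
    root="${2%/}"          # "/" becomes "", so paths come out as /etc/...
    find "$overlay" -type f | sort | while IFS= read -r f; do
        rel="${f#"$overlay"/}"
        cmp -s "$f" "$root/$rel" || printf '%s\n' "$rel"
    done
)
```

On Orion, "overlay_diff /s1/etc-orion /" should print nothing, while "overlay_diff /s1/etc-diamond /" should list every host-specific file.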
Burn bridges: interchange Diamond's and Orion's name and host keys; see the previous paragraph for which files those are and how to make the exchange. Details:
audit-scripts, then
restarter, and
checkout.sh. Check that LDAP, Kerberos, SSH and Apache (TLS) function with the new name. Repeat with the old machine (now Orion); formerly working LDAP and Kerberos KDC should be out of action.
hostdata.db: Put Orion in the down
hostgroup. Interchange
the MAC address for orionen0 and diamonden0. Don't swap the MACs on
br0, which by local policy are derived from the IPv4 address. Rebuild
/etc/ethers and /usr/diklo/lib/hostgroup.db and install on all hosts.
[Done]
Shut down the old machine (Orion). Print up a note about the old machine's status. Put the old machine in its box in the storage area. [Done]
Move the new machine to Alice's desktop. Re-pair the Bluetooth mouse. Check that Cups can print. [Done]
As delivered, the machine has BIOS version 0040 dated 2021-04-14. Not too old, but I should check for recent updates. How to obtain the latest BIOS version:
OS Independent. Download type = BIOS.
BIOS Flash Update. This page is also accessible from the regular setup on F2. It shows the current BIOS version; we have version 0040 dated 2021-04-14.
Unknown Device. Scroll down to this line and hit Enter.
Intel NUC), hold down F2 to get into BIOS setup. Verify that the new BIOS version got installed.
This should be done after the BIOS update because that resets at least some options to factory defaults.
Press F2 during booting to get into BIOS setup. It shows "Intel NUC"
for about one second, and likely you have to hit F2 while that screen
is showing. Alternatively, just hold down F2 when it starts the boot
process.
Main page: System Information.
Advanced (Devices): I changed only one of these.
Cooling: By default the Fan Control Mode is Balanced, but I changed it
to Quiet. It has temp sensors on the CPU, the PCH (whatever that is),
the memory, and the motherboard. But in Quiet mode its policy is to
leave the fan off until the CPU temperature reaches 80 C, then turn the
fan on full blast, and it leaves a bit set so that at the next boot it
shows a message that a thermal emergency was responded to; check for
blocked airflow; press any key to continue. I reverted to Balanced.
Performance: Processor: I turned off hyperthreading.
Security: Totally unlocked. I left these alone. In a public
computer lab you will want to set a BIOS password. These "Security
Features" are enabled: virtualization; virtualization with direct
I/O; Platform Trust Technology.
Power:
Boot: Both Legacy Boot and UEFI were enabled, so it says. Under Secure Boot, it's enabled, and this is supposed to disable Legacy Boot. But my kernel had an invalid signature so I disabled Secure Boot. (With a new kernel, is the signature once again valid?) Under Boot Priority, Windows Boot Manager came first. I switched to USB flash drive first, and I disabled PXE over IPv4+6. The USB drive persists even if not plugged in, unlike in some older BIOS versions.
To save and reboot, hit F10. It booted Windows anyway (shut down). To boot the rescue system, when booting hold down F10 and it will give you a menu on which you can choose Windows or the USB drive. That worked (whew!)
At this point there are three ways to proceed:
Install OpenSuSE Tumbleweed, then restore the special backup onto
Orion. This is the most work but has fewest dicey steps. See
Holly Hosed, Tested Backup for details of the procedure.
Partition Orion's disc by hand including creating filesystems and labels. Mount each partition in the rescue system and use rsync, or the equivalent with tar and ssh, to copy Diamond's content onto it. This is the procedure I will try first.
Partition Orion's disc by hand (no filesystems). For each
partition, copy Diamond's raw device onto Orion's (across the net).
This method may not work: the partitions probably won't be exactly
the same size, even though close, so the filesystems will appear to
be corrupt. Repairs probably will work, but this isn't assured.
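For the second method, the "equivalent with tar and ssh" could look roughly like this. It is a sketch, not what was actually run: the function name tar_copy and the "local" pseudo-host (used only for trying it out without a network) are assumptions, and it relies on tar's --one-file-system and --numeric-owner options.

```shell
# Copy one filesystem tree by piping tar through ssh.
# usage: tar_copy HOST SRC_DIR DST_DIR   (HOST "local" skips ssh, for testing)
tar_copy() (
    host="$1"; src="$2"; dst="$3"
    # Pack on the source: stay on one filesystem, keep numeric owner IDs.
    pack="tar -C '$src' --one-file-system --numeric-owner --exclude=lost+found -cpf - ."
    if [ "$host" = "local" ]; then
        sh -c "$pack"
    else
        ssh "$host" "$pack"
    fi | tar -C "$dst" --numeric-owner -xpf -
)
```

For example, "tar_copy diamond / /mnt" would pull Diamond's root onto the partition mounted at /mnt.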
Should I take out Diamond's disc, connect it to Orion with my SATA to USB adapter, and not have to deal with and wait for the network? It's probably better to use the network connection even if slower — the USB adapter isn't that fast anyway.
Partition table on old Diamond:
Nbr | Size | Role | Fsys | Label |
---|---|---|---|---|
1 | 1049kB | BIOS boot | -- | |
2 | 38.8MB | EFI | fat16 | |
3 | 23.1GB | Root | ext4 | |
4 | 23.1GB | Home | ext4 | |
5 | 8587MB | Swap | linux-swap(v1) | |
6 | 52.0GB | Old VM #1 | -- | |
7 | 55.4GB | Old VM #2 | -- | |
8 | 338GB | Extra | ext4 | |
Orion's disc: /dev/nvme0n1, Silicon Motion SM2263EN/SM2263XT SSD Controller, 238.47GiB, 250.0GB. I can't tell the actual vendor's name, but it's in Guangdong, CN.
Partition table on Orion:
Nbr | Size | Role | Fsys | Label |
---|---|---|---|---|
1 | 1 MiB | BIOS boot | None | None |
2 | 40 MiB | EFI | FAT16 | EFI-11 |
3 | 20 GiB | Root | Ext4 | ROOT-11 |
4 | 32 GiB | Home | Ext4 | HOME-11 |
5 | 16 GiB | Swap | linux-swap(v1) | SWAP-11 |
6 | 170 GiB | Extra | Ext4 | S1-11 |
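A quick sanity check that this layout fits on the 238.47GiB disc, done in integer MiB. The sizes come from the table above; the function names are only for illustration, and partition alignment overhead is ignored.

```shell
# Total of the six partitions, in MiB (1 + 40 MiB, plus 20+32+16+170 GiB).
used_mib() {
    echo $(( 1 + 40 + (20 + 32 + 16 + 170) * 1024 ))
}
# 238.47 GiB expressed in MiB, truncated to an integer.
disk_mib() {
    echo $(( 23847 * 1024 / 100 ))
}
# What is left over after the last partition.
free_mib() {
    echo $(( $(disk_mib) - $(used_mib) ))
}
```

By this count the partitions use 243753 MiB of 244193 MiB, leaving about 440 MiB unallocated.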
The Yast2 partitioner populates /etc/fstab (on the rescue system) with lines for the mountable partitions created. It persists across reboots; there's a copy-on-write overlay area. Mount /dev/nvme0n1p3 (use /dev/disk/by-label/ROOT-11) on /mnt. Copy /etc/fstab to /mnt/etc/fstab-orion. Unmount the partition.
The goal at this point is to copy stuff from Diamond to Orion's disc. A few preliminary steps were needed.
It got into a weird state. For reasons unknown, the live CD user
"linux" (no password) could not log in; logs suggest that it did log
in but exited immediately. I'm continuing as root on tty1.
It has a correct /etc/resolv.conf, generated by netconfig, probably
from DHCP options that it was given. "ssh jimc@diamond uname -a"
works (password required).
For easier copying I will give root@orion a key agent with jimc's secret
key. It will get this key by:
rsync -a jimc@diamond:~/.ssh/ ~/.ssh/ (which already exists).
eval $(ssh-agent)
ssh-add /root/.ssh/id_rsa (give passphrase)
ssh diamond uname -a (works, no password needed)
Command lines to suck a filesystem: /boot/EFI, home and s1 are mounted
on their proper mount points, but not root, of course, so put it on
/mnt.
But I'll first do EFI because it's very small
Guess what, Diamond, Jacinth, Iris don't have EFI booting,
only Xena does. Get it from Xena.
Another gotcha:
rsync will set the owner and group to the numeric values in the
rescue disc's /etc/passwd and /etc/group for their names,
which differ from what CouchNet is enforcing, leading to things not
working. Head this off with the --numeric-ids option.
mount /dev/disk/by-label/ROOT-11 /mnt
rsync -a --one-file-system --numeric-ids --exclude lost+found xena:/boot/efi/ /boot/efi/
rsync -a --one-file-system --numeric-ids --exclude lost+found diamond:/ /mnt/
rsync -a --one-file-system --numeric-ids --exclude lost+found diamond:/home/ /home/
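To see in advance which accounts would be hit by this gotcha, the two passwd files can be compared by name. A sketch only: uid_mismatch is a made-up name, and it assumes you have a copy of the CouchNet /etc/passwd to compare against the rescue disc's.

```shell
# Print "name uid_in_A uid_in_B" for every account present in both passwd
# files but with a different numeric UID.
# usage: uid_mismatch PASSWD_A PASSWD_B
uid_mismatch() {
    awk -F: 'NR == FNR { uid[$1] = $3; next }
             ($1 in uid) && uid[$1] != $3 { print $1, uid[$1], $3 }' "$1" "$2"
}
```

Any name it prints is one whose files would end up with the wrong owner unless --numeric-ids is used. The same comparison works for /etc/group, since the GID is also the third field.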
Seems to have arrived in good order, cross fingers. Root had close to 10.0GB. If it saturated MOCA (100Mbit/sec) it would take 800sec (about 13min). Actual 19min, no error messages. /home has similar size and speed. I'm not copying /s1; it's all ancient special backups and ancient virtual machine images. But remember to re-create /s1/scr [done].
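The transfer-time estimate is simple enough to work through in shell arithmetic. This sketch assumes decimal units (1 GB = 8000 Mbit) and ignores protocol overhead; the function names are hypothetical.

```shell
# Seconds needed to move SIZE_GB at RATE_MBIT (Mbit/sec).
xfer_secs() {
    echo $(( $1 * 8000 / $2 ))
}
# Effective rate in Mbit/sec, given SIZE_GB moved in SECONDS (truncated).
eff_mbit() {
    echo $(( $1 * 8000 / $2 ))
}
```

"xfer_secs 10 100" gives 800, and "eff_mbit 10 1140" (10GB in the actual 19 minutes) gives 70, so the copy ran at roughly 70% of the nominal link speed.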
After copying content I need to change some items to stop being Diamond and start being Orion. This is still on the rescue disc.
grub2-install /dev/disk/by-label/ROOT-11. The safest way is to chroot into /mnt (root). You need to mount /proc, /sys and /dev on the new root's mount points. It seems to be OK to have them mounted in multiple places. Also
mount /dev/disk/by-label/EFI-11 /boot/efi but not if it's already in /proc/mounts; check first. [Done, no errors reported.]
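The "check /proc/mounts first" step and the chroot setup can be sketched as below. The function names are made up, mounting the EFI partition under the new root is an assumption about the layout, and prepare_chroot needs root, so it is defined here but not run.

```shell
# Succeed if DIR appears as a mount point in MOUNTS (default /proc/mounts).
# usage: is_mounted DIR [MOUNTS_FILE]
is_mounted() {
    awk -v d="$1" '$2 == d { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Bind-mount /proc, /sys and /dev into the new root, mount EFI if needed,
# then run grub2-install inside the chroot, as in the step above.
prepare_chroot() (
    root="${1:-/mnt}"
    for fs in proc sys dev; do
        is_mounted "$root/$fs" || mount --bind "/$fs" "$root/$fs"
    done
    is_mounted "$root/boot/efi" ||
        mount /dev/disk/by-label/EFI-11 "$root/boot/efi"
    chroot "$root" grub2-install /dev/disk/by-label/ROOT-11
)
```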
Initial problems with booting:
Secure Boot Violation, invalid signature detected, check secure
boot policy in setup.
I didn't try to debug this; I just turned
off Secure Boot. In Setup-Boot-Secure Boot-Secure Boot (change
to Disabled). [Now Grub shows its menu.]
No such device: UUID ending in e973 . Checking on Diamond, this
is Diamond's root partition (which is on Diamond, not Orion).
Back to the rescue system and chroot to Orion's root.
View /etc/default/grub and run
grub2-mkconfig -o /boot/grub2/grub.cfg.
grub2-mkconfig populated grub.cfg with the UUID ending in 3dd0, which
is /dev/nvme0n1p3 (Orion's root). Rebooting again.
It boots. And hangs after enumerating USB
devices. After about 60 secs it announces "dracut-initqueue:
starting timeout scripts", still waiting for…
(it gives lines from
/run/systemd/generator/systemd-cryptsetup@*.service,
grep for After=remote-fs-pre.target).
After reporting this once per second for 30-60 secs, it recommends
saving /run/initramfs/rdsosreport.txt on a USB flash drive.
It drops into emergency mode. Give root password for maintenance.
The report looks useful but only usbcore is loaded and I can't
mount USB memory. Back to chroot jail. My hypothesis is that there
are host-dependent items in the initrd which are botched on the other
host. Mkinitrd should fix that [done]. Rebooting.
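The stale-UUID problem could be caught before rebooting by checking every filesystem UUID mentioned in grub.cfg against what the machine actually has. A sketch under assumptions: missing_uuids is a made-up name, and the pattern only matches the long ext4-style UUIDs, not the short FAT ones.

```shell
# Print each long-form UUID found in CFG that has no entry in DIR.
# usage: missing_uuids CFG [DIR]     (DIR defaults to /dev/disk/by-uuid)
missing_uuids() (
    cfg="$1"
    dir="${2:-/dev/disk/by-uuid}"
    grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' "$cfg" |
        sort -u | while IFS= read -r u; do
            [ -e "$dir/$u" ] || printf '%s\n' "$u"
        done
)
```

Run inside the chroot as "missing_uuids /boot/grub2/grub.cfg"; anything it prints belongs to some other machine's disc, and grub2-mkconfig needs to be re-run.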
It boots much more normally. It's on the net, so it says. The only failed service seen is slapd (LDAP), shouldn't have been attempted on Orion. Looking for lightdm greeter, but it displays nothing. Isn't that cute, it has Diamond's IP. Which was hardwired in /etc/sysconfig/network/ifcfg-br0 . Fixed, now both Orion and Diamond are happy (until exchanged).
Orion is using Diamond's SSH host keys and clients complain. I made separate dirs /etc/ssh-orion and /etc/ssh-diamond, with a symlink to the first one, and I added Orion's SSHFP records to DNS. Now it works. I did the same thing on Diamond. Later I reverted this set of dirs, and created equivalent directories with a lot more host-specific files.
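The symlink arrangement can be sketched like this. It assumes the per-host dirs are named ssh-HOST next to the ssh symlink, as described; the function name select_ssh_dir is hypothetical, and it refuses to touch a real /etc/ssh directory.

```shell
# Point ETC/ssh at ETC/ssh-HOST.
# usage: select_ssh_dir ETC_DIR HOST
select_ssh_dir() (
    etc="$1"; host="$2"
    [ -d "$etc/ssh-$host" ] || { echo "no $etc/ssh-$host" >&2; exit 1; }
    if [ -e "$etc/ssh" ] && [ ! -L "$etc/ssh" ]; then
        echo "$etc/ssh is a real directory; move it aside first" >&2
        exit 1
    fi
    ln -sfn "ssh-$host" "$etc/ssh"    # -n: replace the link, don't follow it
)
```

After "select_ssh_dir /etc orion", sshd needs a restart to pick up the other keys. The SSHFP records for DNS can be printed with "ssh-keygen -r orion".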
checkout.sh discrepancies:
Wicked: If you do ping -4 -w 3 jacinth (192.9.200.193) it answers, but $ft/wicked reports failure on this test. [Self-healed, restarter may or may not have helped.]
display-manager is still hosed. Failed to create IPv4 VNC
socket on [::]:5900 : not an IPv4 address
(paraphrased).
Diamond has the same message but still starts. /var/log/lightdm
contains logfiles from 2017, needs a cleaner.
seat0-greeter.log: /var/lib/lightdm/.Xauthority permission denied.
Owned by the wrong luser, why wasn't it copied over right?
(For the answer, and a fix, search for --numeric-ids .)
[Re-owned, now it starts.]
Wonder of wonders, Avahi passed its test. Normally, if you sneeze it will fail.
cups is not wanted on Orion, but it's running and passes its test. Apparently the printer doesn't have to be connected for it to pass. Similar for kpropd krb5kdc postgresql unbound{2,3}. [Fixed by running audit-scripts which disabled all of them.]
krb-client.J failed because it has no host key for Orion. [Installed it, now test passes.]
bluetooth-hci.J is not wanted (should be wanted), is dead, but
even so the HCI is powered up (good). Reason: Orion was not in
hostgroup "blue". [Added, problem solved.]
ldap is not wanted, enabled, failed. [Disabled it on Orion.]
apache2 botched TLS to orion.cft.ca.us because it had the wrong host key. [Installed correct host key, now it works.]
alsa-restore: need to regenerate /var/lib/alsa/asound.state for the new sound card. [rm /var/lib/alsa/asound.state and retest.]
firewall.J: botched vnc-server, http, https because the daemons are hosed (skipped). Remaining ports passed.
check-net.S: /etc/ethers needs a pro-forma host orionen0 with the hardware address, and the correct fake MAC on orion (br0). [Done, now passes test.]
Daemons that are enabled but should not be on Orion:
Rebooted, re-ran the test.
The daily report had a bunch of permission and ownership issues.
This was already seen for /var/lib/lightdm/.Xauthority (preventing the
display-manager from starting, see above). I'm beginning to suspect
this scenario: rsync was used to import Diamond's files to Orion. For
owners and groups it sends the alphabetic names from Diamond (and
numeric), and Orion looks up the name in /etc/passwd or /etc/group,
and uses the resulting numbers in the file's inode. This is
/etc/passwd or /etc/group on the rescue disc, not the CouchNet
values, producing these error reports. So I'm going to re-do the
transfers.
rsync -a -O --one-file-system --numeric-ids --exclude lost+found diamond:/home/ /home/
And similarly for / (root) and /boot/efi . But I'm going to have
to be cautious, to avoid overwriting files whose content is supposed
to be different on Orion.
These are discrepancies encountered when doing the tests listed in subsystem.shtml.
I tried to boot the kernel and initrd copied from Diamond. They needed these steps before I could boot them:
Secure Boot Violation, invalid signature detected. I turned off Secure Boot in BIOS. Now that I've gotten it booting, I wonder if I can turn it back on?
Speed test output in kbytes/sec: SHA512 on 1 core 345663; on 4 cores 1382652; disc reading 87859; overall 346384. Comparison to the old Diamond: CPU about 2.5x, disc reading 9x.
Wake on LAN does not wake Orion from S5. It should be enabled in BIOS Power-Secondary. See BIOS Setup power section. [Done, now it wakes.]
After hibernation, Orion does a full reboot, not resuming.
This happened on Xena too, but other hosts resume with no special
configuration. To fix, I created
/etc/dracut.conf.d/99-resume.J.conf containing
add_dracutmodules+=" resume "
(a space is required before and after each module name).
#Comments OK. Then execute mkinitrd --or-- dracut -vf .
[Fixed, now it resumes from hibernation.]
To see if an initrd has the resume module, do
"lsinitrd /boot/initrd-… | less"
and look for "resume" in the modules section.
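Writing the one-line Dracut config file by hand is easy to get wrong (the spaces inside the quotes), so here is a sketch of a helper. The function name dracut_add_module is made up; the file naming follows the 99-NAME.J.conf pattern used above.

```shell
# Write CONFDIR/99-MODULE.J.conf so that Dracut includes MODULE in the initrd.
# usage: dracut_add_module MODULE [CONFDIR]   (default /etc/dracut.conf.d)
dracut_add_module() (
    module="$1"
    confdir="${2:-/etc/dracut.conf.d}"
    file="$confdir/99-$module.J.conf"
    # The space before and after the module name is required.
    printf '# Include the %s module in the initrd\nadd_dracutmodules+=" %s "\n' \
        "$module" "$module" > "$file"
    echo "$file"
)
```

After "dracut_add_module resume", rebuild with mkinitrd or dracut -vf as described above.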
Now when it goes into S4, 2 secs later it wakes up, i.e. starts the boot process and diverts into restoring the saved image. There was no net traffic (pinger or ssh) to Orion but broadcast packets (ARP, IPv6 neighbor discovery, etc) are likely. Per /proc/acpi/wakeup these devices could wake the machine from S4: (and many more were disabled)
There are forum discussions of Bluetooth causing this problem, supposedly fixed in kernel 5.3.x. I stopped bluetooth and hibernated: didn't help, it woke up again.
Someone mentioned that if you
echo PXSX > /proc/acpi/wakeup
it will toggle the enabled status of that device. I tried
disabling all of them, then hibernating: it disabled all but PXSX
(Ethernet NIC). Success: it stayed off for 60 sec. Wake on LAN
woke it up.
Turning them on one at a time, most suspicious first: XHCI: wakes immediately. I put everything on except XHCI. Stayed asleep for 60sec. In setup, USB S4/S5 power was off; suppose I turn it on (and wake on USB from S5). After reboot, XHCI wakeup is enabled. Wonder of wonders, it stayed off for 60 sec. And it woke on USB. I'll want to keep an eye on power consumption, and maybe suppress waking on USB.
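The one-at-a-time experiment is less tedious with a couple of helpers. This sketch assumes the usual /proc/acpi/wakeup layout (a header line, then device, S-state and a status of *enabled or *disabled); the function names are made up, and disable_all_but needs root.

```shell
# List the devices currently allowed to wake the machine.
# usage: enabled_wakers [FILE]      (FILE defaults to /proc/acpi/wakeup)
enabled_wakers() {
    awk 'NR > 1 && $3 == "*enabled" { print $1 }' "${1:-/proc/acpi/wakeup}"
}

# Toggle off every enabled device except KEEP (e.g. PXSX for Wake on LAN).
disable_all_but() (
    keep="$1"
    for dev in $(enabled_wakers); do
        # Writing a device name toggles its enabled status.
        [ "$dev" = "$keep" ] || echo "$dev" > /proc/acpi/wakeup
    done
)
```

These settings do not survive a reboot, so a permanent fix would have to be reapplied at boot time or, as happened here, come from the BIOS settings.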
Intel docs claim the i5-1135G7 chipset includes a watchdog
timer, but the iTCO_wdt.ko driver was not loaded. It would appear
that this happens in the initrd, if Dracut configuration includes
the watchdog module. Most machines have this by default, but on
Orion you have to configure it. Create
/etc/dracut.conf.d/99-watchdog.J.conf containing
add_dracutmodules+=" watchdog "
(spaces required before and after the module name; #comments
OK), and run mkinitrd or dracut -vf. [Done]
Speed test on Jimc's benchmark. Columns in the output:
The test was run 3 times and the last one is reported. Actually the test is designed to be reasonably immune to buffer cache effects and scores vary only about 3% between repetitions. Numbers are in kbytes/sec.
SHA512 | SHA512*cores | Disc read | Composite | Machine |
---|---|---|---|---|
120848 | 241696 | 12003 | 81531 | NUC5i5RYH |
200479 | 400958 | 4964 | 127781 | NUC7i5BNH |
346151 | 1384604 | 84795 | 345279 | NUC11PAHi5 |
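Speedup ratios can be read off the table with a one-line helper. The function name ratio is hypothetical; the figures in the usage note are taken from the NUC11PAHi5 and NUC5i5RYH rows.

```shell
# Print A/B to one decimal place.
# usage: ratio A B
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}
```

Against the NUC5i5RYH row, "ratio 346151 120848" gives 2.9 (one core), "ratio 1384604 241696" gives 5.7 (all cores), and "ratio 84795 12003" gives 7.1 (disc read).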