Planning | Testing | Setup | Collapse | Gentoo | Top |
2019-03-20: We had a power glitch, and discovered that the batteries in the two UPSes are close to end of life. But worse, something nasty happened and Iris won't boot; it gets to starting Wicked (the network manager), then gets three kernel OOPSes in succession, and freezes. They scroll off the screen way too fast to capture useful information. The last one usually happens in one of the netfilter (firewall) modules, frequently in ip6_do_table (I use IPv6 and IPv4, dual stack), and the specific complaint is an unrecoverable page fault while handling an interrupt.
This kind of crash has been going on for months, with different details; sometimes it happens at boot (with less than 100% probability), and sometimes it happens when the machine is trying to shut down or reboot. Also from one kernel update to the next, the display may or may not show images. This kind of behavior is very discouraging.
I extracted Iris' SD card 06, ran fsck, and mounted it (on /mnt).
find . -xdev -type f -exec md5sum {} + | sort -k 2,2 -o $j/iris.md5
Same on Orion, starting at /. I also checked the EFI partition
separately. Comparing the checksums with this command:
comm -3 orion.md5.b iris.md5.b > diff.b
Results:
My next goal is to find files present on both machines but with different
checksums. Sort by file. Exchange fields, $1 = file, $2 = MD5sum. Command:
awk '$1 == prev && $2 != md5 {print $1, $2, md5} {prev = $1 ; md5 = $2}' diff.b
Results:
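The checksum comparison above can be sketched end to end on two throwaway trees. This is only an illustration: the directory names, file names and contents are invented, and it assumes no whitespace in file names (the awk field logic above has the same limitation). Both the sort and comm run under LC_ALL=C so that they agree on ordering.

```shell
#!/bin/sh
# Sketch: find files present in two trees whose MD5 sums differ.
# The trees, file names and contents below are made up for the demo.
work=$(mktemp -d)
mkdir -p "$work/orion/etc" "$work/iris/etc"
echo same > "$work/orion/etc/hosts"
echo same > "$work/iris/etc/hosts"
echo old  > "$work/orion/etc/fstab"
echo new  > "$work/iris/etc/fstab"
echo only > "$work/iris/etc/asound.state"

# manifest DIR OUT: write "path md5" lines, sorted by path.
# Assumes no whitespace in file names.
manifest() {
    ( cd "$1" && find . -xdev -type f -exec md5sum {} + ) |
        awk '{ print $2, $1 }' | LC_ALL=C sort > "$2"
}

manifest "$work/orion" "$work/orion.md5.b"
manifest "$work/iris"  "$work/iris.md5.b"

# Drop lines common to both files. Lines unique to the second file get a
# leading tab, which awk's default field splitting ignores.
LC_ALL=C comm -3 "$work/orion.md5.b" "$work/iris.md5.b" > "$work/diff.b"

# Same path on two adjacent lines with different sums: the file exists on
# both machines but its content differs.
awk '$1 == prev && $2 != md5 { print $1, $2, md5 }
     { prev = $1; md5 = $2 }' "$work/diff.b" > "$work/changed"

cat "$work/changed"
# rm -rf "$work"   # clean up when done
```

In this demo only ./etc/fstab is reported: hosts is identical on both sides and asound.state exists on one side only, so neither produces a matching pair of lines.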
What now? Let's try rebooting again. Still fails.
I'm not sure if card 06 is permanently destroyed, or had a "recoverable" error which overwriting would fix, or if there's a bad block in there which would submerge in the free pool if overwritten, to surface later at the most inopportune moment.
I have card 04 which is supposed to be from 2018-10-24 for Iris. Partition names are without suffixes, matching card 06. Let's try to boot it and see what it really is. Last login: 2018-12-21; Kernel 4.19.7-1-default; at least it boots. Display is even more hosed than on card 06. Power-off succeeded (blinked 10 times).
Plan: Run post_jump on it. Then copy (rsync) /home, /root and other dirs. First make backup copies of the SD cards on Diamond: /scr/iris-1903/card04/ROOT/(contents), and similarly for EFI and card06.
Run parted and check that card 04 partition table is reasonable. (Units: Gb, power of 10)
Size 32.0Gb, MBR
1  16.8Mb  fat16       EFI   LBA, type=0c
2  31.5Gb  ext4        ROOT  type=83
3  510Mb   linux.swap  SWAP  type=83, should be 82
Change the labels on card 04 to have suffixes.
Also edit the lock file /etc/zypp/locks to unlock RPi firmware updates. (No such file on card-04)
Copy /m1/custom/ (particularly conffiles) from 06 to 04. Only 4 conffiles. Same for /home .
Boot card 04 -- Yes it booted, but the display is not improved, is still more hosed than card 06. Run post_jump $TARGET . Problems…
After doing phase 7 (install missing packages) which did not involve upgrading openssh, it would not accept ssh from diamond to iris, auth worked but exit code was 254. All attempts to diagnose revealed nothing. Downgrade to openssh-7.2p2-13.1.aarch64 did not help. Syslog is not running, no log messages. Starting nslcd didn't help.
Running dist-upgrade by hand. Planning to reboot after that. 1319 pkgs to upgrade, 1 to downgrade, 25 new, 13 to remove, 2 chg arch. Pre-install these packages due to file size screwup: jansi jansi-native jline1 fusesource-pom hawtjni-runtime. The only one that (currently) has a size error is fusesource-pom (noarch): http://download.opensuse.org/tumbleweed/repo/oss/noarch/fusesource-pom-1.11-1.1.noarch.rpm These will be tossed due to not being keystone or dependent pkgs.
DNS is also hosed, can't resolve off-site hostnames. Because systemd-resolved is not running. And won't start. Edited resolv.conf by hand. That fixed it.
Now that everything is updated, you can do ssh iris again!
post_jump -p 8 iris (start by removing unwanted packages.)
We've spawned another user: srvGeoClue 494:472 . Need to update globally and reinstall /etc/group because as usual it leaves postfix out of group hostcert.
$da/sorthosts.J needs to deal better with missing /etc/ldap.conf and /etc/openldap/ldap.conf .
Other than these issues, post_jump seems to have been successful.
Rebooting Iris, cross fingers. Not only did Iris boot, but the X-Server and display manager are running!
checkout.sh discreps:
apcupsd is not running. Serial port not connected. Wrong conf file, this device is USB! Copied config from Jacinth. Now it works.
icecast is not running. Using default conf file. Restored from backup. Now it works.
alsa-restore failed because no /etc/asound.state. Copied from /var/lib/alsa/asound.state, now it works.
Restore /home (selectively) from saved content of card 06. Content is *IDENTICAL*. Because I already copied /home from card 06.
Try out sound.
Go back and do loose ends: /etc/fstab, srvGeoIP user, etc.
Very soon I want to compare /boot/efi with card 06 and try to identify why Iris' graphics works and Orion is hosed. And revive Orion.
Physically interchanged memory cards: 03 in Iris chassis, 04 in Orion chassis. The one with Iris' IP has card 04, Orion's IP has card 03. Not using DHCP. And similarly, the Orion chassis display is OK, Iris chassis display hosed.
Behaviors on aplay $sound:
I switched them back, card 04 in Iris chassis. Growl, doesn't play: aplay -D hw:0,0 $sound ; also 0,1 Fixed the swap device in /etc/fstab, rebooted, now it plays! Weird.
Replacement batteries were delivered. I swapped the battery in Iris' and Jacinth's UPS. Piece of cake :-)
Upon booting the RPI, it got a kernel oops involved with one of the Wicked processes, fatal exception (kernel paging req) while doing an interrupt. In one of the IPv6 modules. I power cycled 4 times, 10 sec wait after turning off power. 5th time, I waited 20 sec, and it booted.
On Iris, sound plays. Display manager shows image. Performing streaming media from off-site.
!@#$%^ After ~10 mins, sound stops. meow/gstreamer is still running (killed). aplay doesn't play. This is possibly coincident with DPMS sleep. Yes it was.
DISPLAY=:0 XAUTHORITY=/run/lightdm/root/:0 xset dpms force on
And the sound plays over the TV speakers, but if it goes into DPMS off, of course the sound goes off too.
That's not a bug, that's a feature! The RPi helpfully diverts onboard
audio to HDMI if available. Now find out (again) how to turn that off.
https://www.raspberrypi.org/documentation/configuration/audio-config.md
In /var/lib/alsa/asound.state look for state.ALSA (the first group),
control.3, name PCM Playback Route. Default value is 0 meaning to
prefer HDMI if available; alternatives are 1 = always analog, 2 = always
HDMI. To set control #3 to analog:
amixer cset numid=3 1
Growl, amixer has to connect to PulseAudio, connection refused. Why?? This is what strace shows:
Investigating the failure to connect.
To install from Iris: /usr/diklo/lib/functest/rpi-sound-route.J /usr/diklo/sbin/rpi-sound-route /etc/systemd/system/rpi-sound-route.J.service
Edit /var/lib/alsa/asound.state , change control 3 value to 1 by hand.
alsactl -f /var/lib/alsa/asound.state restore
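The hand edit of control 3 could be scripted along these lines. This is a sketch only: the sample state file is a trimmed, assumed layout (name line before value line inside each control block), not a copy of Iris' real file, and set_route is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: set the 'PCM Playback Route' control in an alsactl state file.
# The sample file is a trimmed, assumed layout, not a copy of Iris' file.
state=$(mktemp)
cat > "$state" <<'EOF'
state.ALSA {
    control.1 {
        iface MIXER
        name 'PCM Playback Volume'
        value -1725
    }
    control.3 {
        iface MIXER
        name 'PCM Playback Route'
        value 0
    }
}
EOF

# set_route FILE VALUE: 0 = auto (prefer HDMI), 1 = analog, 2 = HDMI.
set_route() {
    awk -v want="$2" '
        $1 ~ /^control\./           { in_route = 0 }   # a new control block starts
        /name .PCM Playback Route./ { in_route = 1 }   # dots stand for the quotes
        in_route && $1 == "value"   { sub(/value .*/, "value " want); in_route = 0 }
        { print }
    ' "$1" > "$1.new" && mv "$1.new" "$1"
}

set_route "$state" 1
grep -n 'value' "$state"

# On the real machine the edited file would then be loaded with:
#   alsactl -f /var/lib/alsa/asound.state restore
```

Matching on the control's name rather than on "control.3" keeps the edit from touching the wrong block if the numbering ever differs; only the first value line after the name is rewritten.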
Growl, the route was obeyed, but output is silent. Yes the amp is
turned on and the physical volume is turned up. This is really
bad. Something has gotten so tangled up and it's not going to work.
This whole mess is unacceptable and is a time sink. Unless I can get Iris working rather promptly, I'm going to junk it and revert to an Intel box such as a recent NUC. Further, about the kernel OOPSes, updating userspace config files, none of which appear threatening, absolutely should not make kernel code corrupt. My working hypothesis is that some driver -- the vc4 display driver heads the list of suspects -- is storing through a pointer that has been trashed, and results are variable but show patterns like often involving ip6_do_table. When a non-used instruction is overwritten the machine boots without a problem, but when a used instruction is hit, it crashes. I have neither the skills nor the inclination to deal with driver development, and I do not want to continually deal with weevils in the raspberries.
It's proven that the sound behavior follows the SD card, not the chassis; in other words, the power glitch (probably) did not damage the RPi hardware.
These symptoms are the current nemeses, in order of being addressed:
The onboard PCM no longer produces any sound even when the sound route (ALSA control #3) is set to 1 (always analog). This prevents music playback, the machine's primary role at present.
The HDMI diversion produces gravelly sound, nearly unlistenable, when the route is set to 0 or 2. This would prevent video performance. But this proves that the userspace audio performance software is not defective.
When Iris boots (with a previous 20 sec pause powered off), with around 80% probability it gets a sequence of kernel errors ending in an OOPS, specifically an unrecoverable page fault while handling an interrupt, which is always in a Netfilter module, often with ipv6 in its name, but the details are different each time. This happens just when Wicked brings up its managed network interface (eth0), and it should have sent a Router Solicitation and configured its IPv6 and IPv4 addresses. Orion has similar crashes on boot but at a lower probability, probably about 50%.
Obviously if it does boot, the memory corruption has occurred elsewhere and is going to have a baleful effect that is less visible.
dtoverlay=vc4-fkms-v3d (or real kms) is commented out (no GPU acceleration), preventing video performance. If it were active, the machine would freeze up due to a leak of CMA memory. Iris currently can show the lightdm greeter, whereas Orion can run the X-Server but the screen shows two lines of the text framebuffer.
Nonfunctional and radically unstable machines are not providing value and entertainment, and extreme measures are going to be used to either bring them back to life, or to replace them.
My current working hypothesis is that there's nothing wrong with the hardware, and I'll give software interventions a small number of additional chances.
Where the various files and directories are:
Initial steps:
System View.
Finish, and supposedly it happened.
For once, it executes. It has HDMI (vc4-hdmi) but analog output is missing.
This is with the default audio route, i.e. 0.
aplay -D hw:0,0 $sound
produces "audio open error: No such device", similar to behavior on the
failing Iris. Evidently the HDMI
channel has no ALSA driver. Next step will be to reboot with onboard
audio enabled: dtparam=audio=on in /boot/efi/extraconfig.txt .
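Adding the dtparam line can be made repeatable with a small helper, sketched here against a temp file standing in for /boot/efi/extraconfig.txt. ensure_line is a hypothetical name, and the check is deliberately simple: it looks only for an identical line, not for a conflicting or commented-out setting.

```shell
#!/bin/sh
# Sketch: add a firmware config line only if it is not already there.
# A temp file stands in for /boot/efi/extraconfig.txt.
cfg=$(mktemp)
printf '%s\n' '# local overrides' > "$cfg"

# ensure_line FILE LINE: append LINE unless an identical line exists.
# It does not look for a conflicting setting such as dtparam=audio=off.
ensure_line() {
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

ensure_line "$cfg" 'dtparam=audio=on'
ensure_line "$cfg" 'dtparam=audio=on'   # second call is a no-op
cat "$cfg"
```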
DISPLAY=:0 XAUTHORITY=/run/lightdm/root/:0 ico
produces a very active bouncing icosahedron. We have 307Mb of CMA memory,
273Mb free. The vc4 driver is loaded, plus infrastructure including:
vchiq snd_soc snd_pcm drm_kms_helper drm. I'm going to evaluate sound
first, and then turn on acceleration and see if it blows up.
/var/log/Xorg.0.log reveals: modesetting (KMS) driver, using /dev/dri/card0, but Option AccelMethod none , from /etc/X11/xorg.conf.d/20-kms.conf . It's using 1920x1080px on output HDMI-1. Using swrast.
Rebooting 10 times. It takes about 3 minutes each. No crashes, no anomalies.
This is the ALSA device (bcm2835).
Per alsamixer, it is unmuted and the volume is at 40%. With the sound
route set to 0, aplay -D hw:ALSA,0 $sound
is played on HDMI (TV
speakers) with good quality. Setting to 1, it is played on the onboard
audio, for the first time in 5 days.
De-disabling acceleration and restarting the X-Server. It's trying glamoregl but giving up on it. AIGLX: Screen 0 is not DRI2 capable (still); it's falling back to swrast. This line of investigation will have to be finished later.
We have a configuration with working sound and working non-accelerated video. The next job will be to identify where it goes bad in the transition from the pristine image to the failing Iris.
I'm going to change the configuration step by step, and I'll try to identify what step makes it stop working. Key points to monitor are:
Saved in xena:/s1/kvm/iris/inven:
zypper install rsync gets a segfault while retrieving the repo metadata. Orion's base URL is http://download.opensuse.org/ports/aarch64/tumbleweed/repo/oss/ Their base URL is identical (there's also a debug repo which is not enabled). I saved a copy of the repo file, and removed the repo (zypper removerepo openSUSE-Ports-Tumbleweed-repo-oss). Then restored the copy, and again did zypper refresh. This time the refresh worked, and so did installing rsync, and retrieving /boot/efi.
Installing CouchNet administration scripts, particularly the selftest scripts.
Running checkout.sh. A lot of packages are not installed; a lot of wanted daemons are not enabled; a lot of unwanted daemons are enabled and running. The only critical item is that the firewall is not installed.
These are packages to particularly watch out for in the next section where they are updated or deleted. I'll lock them all in advance, then unlock after updating.
Let's go through post_jump section by section. At each step, I'll reboot, check the display, and check sound. The sections are:
Phase 6 and earlier -- Check versions and consistency, install modified config files, install /m1/local. Most of this has already been done. It reboots (no crash or weird behavior), gets an IPv4+6 address, junks the IPv6 address, display is black. Lightdm could not get a list of users from dbus. Don't worry. Sound plays.
Phase 7 -- Install missing packages but with interactive mode.
audit-pkgs -v -i -c -I
276 keystone packages, Key gpg-pubkey-bbac6b14-5c755908 is obsolete,
remove it if not needed. 16 packages could not be found, 260 remain.
3 JTD items: gvim-data, pulseaudio-module-gsettings are needed but
not installable; systemd-logger conflicts with rsyslog. Solutions:
Retain older vim-data. Omit pulseaudio-module-gconf.
Toss systemd-logger. 10 packages to upgrade, 602 new, 1 to toss.
It reboots successfully, no crash or weird behavior. It obtains and
keeps its IPv6 address (and IPv4). The display manager starts up, with
CouchNet branding. Sound plays over
HDMI. alsamixer is unable to connect to pulseaudio. I used the trick
of editing /var/lib/alsa/asound.state and restoring it with alsactl.
Now the sound plays from the headphone jack (to the amp to the
speakers).
Phase 8 -- Remove unwanted packages.
audit-pkgs -v -e -c -I ;
Removing 461 packages. Rebooting: /boot/grub2/themes/openSUSE/(various
fonts) are missing, aborted, press any key to exit. The USB keyboard
is not recognized for "press any key". (grub2-branding-openSUSE
was one of the packages that was removed.) I edited
/boot/grub2/grub.cfg removing the theme, and it booted normally.
Display is normal. Sound plays from HDMI. alsa-restore.service
did not get run because /etc/asound.state is missing. So let's provide
it, with the sound path correctly set (cp -p /var/lib/alsa/asound.state
/etc/asound.state). Sound now plays from the speakers.
Phase 9 -- Update everything to the latest version (dist-upgrade).
audit-pkgs -v -U -c -I
Locks: it's honoring locks on unbound openssh and u-boot-rpi3.
Other packages are also locked but are not mentioned.
We're getting kernel-default-4.20.13-1.1 kernel-firmware
raspberrypi-firmware raspberrypi-firmware-config . All of these are
supposed to be locked. I'm aborting and checking the lock situation.
The lock file got overwritten with the conf file update. Re-doing it.
Now the desired packages are locked. 117 packages to upgrade, 4 new,
3 to remove. Upgrade finished OK. Rebooting: Came up, no weird
behavior (except USB keyboard was ignored in the grub menu, as always).
Display is functional. Sound plays correctly the first time.
Phase 10 -- Miscellaneous setup and cleanup. I have several packages locked, where it's very likely that the problem is hiding. I'm going to do the miscellaneous setup first, then upgrade the locked packages one at a time. On Diamond, post_jump -p 10 iris . Miscellaneous steps ran normally. Rebooting: It rebooted. Display is normal. Sound plays from speakers. checkout.sh: passed all tests (after restoring some conf files from backup).
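In Phase 9 the lock file was silently overwritten by the conf file update, so a check before the dist-upgrade that the wanted locks are still present might be worthwhile. A minimal sketch follows. The sample mimics what I assume the layout of /etc/zypp/locks to be (one solvable_name line per lock), and missing_locks is a hypothetical helper. On the real machine, zypper locks lists the active locks.

```shell
#!/bin/sh
# Sketch: report wanted locks that are missing from a zypper lock file.
# The sample mimics the assumed layout of /etc/zypp/locks.
locks=$(mktemp)
cat > "$locks" <<'EOF'

type: package
match_type: glob
case_sensitive: on
solvable_name: openssh

type: package
match_type: glob
case_sensitive: on
solvable_name: u-boot-rpi3
EOF

# missing_locks FILE PKG...: print each PKG with no exact solvable_name entry.
# Locks written as glob patterns would not be recognized by this check.
missing_locks() {
    file=$1
    shift
    for pkg in "$@"; do
        grep -qxF -- "solvable_name: $pkg" "$file" || printf '%s\n' "$pkg"
    done
}

missing=$(missing_locks "$locks" openssh u-boot-rpi3 kernel-default raspberrypi-firmware)
if [ -n "$missing" ]; then
    echo "not locked:" $missing
fi
```

With the sample file, kernel-default and raspberrypi-firmware are reported as not locked, which is the situation that would have caused the upgrade to be aborted.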
To restore Iris' specific configuration
and user files, first I did:
cd /home/backup/iris ; ssync -a -n -O --exclude inventory.dat . iris:/
(Capturing in a file the list of files that it proposes to transfer.)
Then I went through the list of 819 files in detail and decided which ones
I wanted, and which should not be transferred, to be overwritten or deleted
in the next backup. 498 files will be transferred. Then:
ssync -a -n -O -R --files-from iris-rstr . iris:/
Rebooting after this step and re-running checkout.sh:
It died with the 3 successive OOPSes ending in ip6_do_table.
I put the XFCE image on card 06, installed keys.iris, with 2 new files: /etc/asound.state and /etc/zypp/locks (should be non-threatening). I ran post_jump (133 mins). Rebooted. It died, where at this step previously it was running OK.
Repeating the same procedure but omitting the dist-upgrade (-p 8). Rebooting, it died. This machine is sooooo hosed, and its chances for redemption have run out. Continued in Replacing Iris.