Raspberry Pi® Logo

Raspberry Pi 3B
Collapse of the Dream

Jim Carter, 2018-02-06

2019-03-20: We had a power glitch, and discovered that the batteries in the two UPSes are close to end of life. But worse, something nasty happened and Iris won't boot; it gets to starting Wicked (the network manager), then gets three kernel OOPSes in succession, and freezes. They scroll off the screen way too fast to capture useful information. The last one usually happens in one of the netfilter (firewall) modules, frequently in ip6_do_table (I use IPv6 and IPv4, dual stack), and the specific complaint is an unrecoverable page fault while handling an interrupt.

This kind of crash has been going on for months, with different details; sometimes it happens at boot (with less than 100% probability), and sometimes it happens when the machine is trying to shut down or reboot. Also from one kernel update to the next, the display may or may not show images. This kind of behavior is very discouraging.

I extracted Iris' SD card 06, ran fsck, and mounted it (on /mnt).
find . -xdev -type f -exec md5sum {} + | sort -k 2,2 -o $j/iris.md5
Same on Orion, starting at /. I also checked the EFI partition separately. Comparing the checksums with this command:
comm -3 orion.md5.b iris.md5.b > diff.b
Results:

My next goal is to find files present on both machines but with different checksums. Sort by file. Exchange fields, $1 = file, $2 = MD5sum. Command:
awk '$1 == prev && $2 != md5 {print $1, $2, md5} {prev = $1 ; md5 = $2}' diff.b
Results:
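For illustration only (this is not the real result list), the sketch below rebuilds the same pipeline on two made-up directory trees. The directory names, file contents and temporary work area are assumptions, not the actual Iris and Orion data, and it assumes file names without spaces, since awk splits fields on whitespace.

```shell
#!/bin/sh
# Hypothetical miniature: two fake "machines" with a few files each.
work=$(mktemp -d)
mkdir -p "$work/orion/etc" "$work/iris/etc"
echo same > "$work/orion/etc/hosts";  echo same > "$work/iris/etc/hosts"
echo old  > "$work/orion/etc/fstab";  echo new  > "$work/iris/etc/fstab"
echo only > "$work/orion/etc/orion-only"

for host in orion iris; do
    # Checksum every file, exchange fields so $1 = file, $2 = MD5sum,
    # and sort by file name (comm needs sorted input).
    (cd "$work/$host" && find . -xdev -type f -exec md5sum {} +) |
        awk '{print $2, $1}' | LC_ALL=C sort > "$work/$host.md5.b"
done

# Keep only lines unique to one list; identical files drop out.
LC_ALL=C comm -3 "$work/orion.md5.b" "$work/iris.md5.b" > "$work/diff.b"

# The same file name on two consecutive lines with different sums means
# the file exists on both machines but its content differs.
changed=$(awk '$1 == prev && $2 != md5 {print $1} {prev = $1 ; md5 = $2}' "$work/diff.b")
echo "$changed"
rm -rf "$work"
```

Here only ./etc/fstab is reported: hosts is identical on both sides and is suppressed by comm, while orion-only appears in just one list and so never forms a pair.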

What now? Let's try rebooting again. Still fails.

I'm not sure if card 06 is permanently destroyed, or had a "recoverable" error which overwriting would fix, or if there's a bad block in there which would submerge in the free pool if overwritten, to surface later at the most inopportune moment.

I have card 04 which is supposed to be from 2018-10-24 for Iris. Partition names are without suffixes, matching card 06. Let's try to boot it and see what it really is. Last login: 2018-12-21; Kernel 4.19.7-1-default; at least it boots. Display is even more hosed than on card 06. Power-off succeeded (blinked 10 times).

Plan: Run post_jump on it. Then copy (rsync) /home, /root and other dirs. Backup copies of the SD cards are kept on Diamond: /scr/iris-1903/card04/ROOT/(contents), and similarly for EFI and card 06.

Run parted and check that card 04 partition table is reasonable. (Units: Gb, power of 10)

    Size 32.0Gb, MBR
    No.  Size    Filesystem  Label  Flags
    1    16.8Mb  fat16       EFI    LBA, type=0c
    2    31.5Gb  ext4        ROOT   type=83
    3    510Mb   linux.swap  SWAP   type=83, should be 82

Change the labels on card 04 to have suffixes.

Also edit the lock file /etc/zypp/locks to unlock RPi firmware updates. (No such file on card-04)

Copy /m1/custom/ (particularly conffiles) from 06 to 04. Only 4 conffiles. Same for /home .

Boot card 04 -- Yes it booted, but the display is not improved, is still more hosed than card 06. Run post_jump $TARGET . Problems…

Rebooting Iris, cross fingers. Not only did Iris boot, but the X-Server and display manager are running!

checkout.sh discreps:

Restore /home (selectively) from saved content of card 06. Content is *IDENTICAL*, because I already copied /home from card 06.

Try out sound.

Go back and do loose ends: /etc/fstab, srvGeoIP user, etc.

Very soon I want to compare /boot/efi with card 06 and try to identify why Iris' graphics works and Orion is hosed. And revive Orion.

Physically interchanged memory cards: 03 in Iris chassis, 04 in Orion chassis. The one with Iris' IP has card 04, Orion's IP has card 03. Not using DHCP. And similarly, the Orion chassis display is OK, Iris chassis display hosed.

Behaviors on aplay $sound:

I switched them back, card 04 in Iris chassis. Growl, doesn't play: aplay -D hw:0,0 $sound ; also 0,1. Fixed the swap device in /etc/fstab, rebooted, now it plays! Weird.

Replacement batteries were delivered. I swapped the battery in Iris' and Jacinth's UPS. Piece of cake :-)

Upon booting the RPi, it got a kernel oops involving one of the Wicked processes: fatal exception (kernel paging request) while handling an interrupt, in one of the IPv6 modules. I power cycled 4 times, with a 10 sec wait after turning off power. The 5th time, I waited 20 sec, and it booted.

On Iris, sound plays. Display manager shows image. Performing streaming media from off-site.

!@#$%^ After ~10 mins, sound stops. meow/gstreamer is still running (killed). aplay doesn't play. This is possibly coincident with DPMS sleep. Yes it was.
DISPLAY=:0 XAUTHORITY=/run/lightdm/root/:0 xset dpms force on
And the sound plays over the TV speakers, but if it goes into DPMS off, of course the sound goes off too.

That's not a bug, that's a feature! The RPi helpfully diverts onboard audio to HDMI if available. Now find out (again) how to turn that off.
https://www.raspberrypi.org/documentation/configuration/audio-config.md
In /var/lib/alsa/asound.state look for state.ALSA (the first group), control.3, name PCM Playback Route. Default value is 0 meaning to prefer HDMI if available; alternatives are 1 = always analog, 2 = always HDMI.
amixer cset numid=3 1 #Set control #3 to analog.

Growl, amixer has to connect to PulseAudio, connection refused. Why?? This is what strace shows:

Investigating the failure to connect.

To install from Iris:

  • /usr/diklo/lib/functest/rpi-sound-route.J
  • /usr/diklo/sbin/rpi-sound-route
  • /etc/systemd/system/rpi-sound-route.J.service

Edit /var/lib/alsa/asound.state , change control 3 value to 1 by hand.
alsactl -f /var/lib/alsa/asound.state restore
Growl, the route was obeyed, but output is silent. Yes the amp is turned on and the physical volume is turned up. This is really bad. Something has gotten so tangled up that it's not going to work.
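The hand edit of control 3 could also be scripted. This is only a sketch: set_route is a hypothetical helper (it is not the rpi-sound-route script mentioned above), and the sample state file is an abbreviated assumption about the layout, with the name line preceding the value line inside each control block.

```shell
#!/bin/sh
# set_route FILE VALUE: print FILE with the value of the control named
# 'PCM Playback Route' replaced by VALUE (0 = auto, 1 = analog, 2 = HDMI).
set_route() {
    awk -v route="$2" '
        /name .PCM Playback Route./ { in_route = 1 }
        in_route && $1 == "value"   { sub(/value .*/, "value " route); in_route = 0 }
        { print }
    ' "$1"
}

# Demonstration on a made-up, abbreviated state file.
demo=$(mktemp)
cat > "$demo" <<'EOF'
state.ALSA {
	control.2 {
		iface MIXER
		name 'PCM Playback Switch'
		value true
	}
	control.3 {
		iface MIXER
		name 'PCM Playback Route'
		value 0
	}
}
EOF
routed=$(set_route "$demo" 1)
echo "$routed"
rm -f "$demo"
```

On the real machine the output would be written to a scratch file and loaded with alsactl -f, as in the command above; only the route control is touched, so the mute switch and other controls keep their saved values.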

Plan B: Scorched Earth Strategy

This whole mess is unacceptable and is a time sink. Unless I can get Iris working rather promptly, I'm going to junk it and revert to an Intel box such as a recent NUC. Further, about the kernel OOPSes: updating userspace config files, none of which appear threatening, absolutely should not corrupt kernel code. My working hypothesis is that some driver -- the vc4 display driver heads the list of suspects -- is storing through a pointer that has been trashed; the results are variable but show patterns, such as often involving ip6_do_table. When an unused instruction is overwritten the machine boots without a problem, but when a used instruction is hit, it crashes. I have neither the skills nor the inclination to deal with driver development, and I do not want to continually deal with weevils in the raspberries.

It's proven that the sound behavior follows the SD card, not the chassis; in other words, the power glitch (probably) did not damage the RPi hardware.

These symptoms are the current nemeses, in order of being addressed:

Installing an Image (Again)

Where the various files and directories are:

Initial steps:

  • Install keys.iris on card 06.
  • A. I've decided that this isn't the best procedure: Boot the card. After the first boot, shut down and bring back to Xena. It booted and resized the root partition. Partition sizes: EFI 16.8Mb, root 31.5Gb, swap 510Mb. It's using its own background for the greeter. sshd is working, with its correct host key.
  • B. On Xena, resize the root partition and create a swap partition.
  • Change the partition labels and edit /etc/fstab accordingly.
  • Execute: fsck -f -C /dev/disk/by-label/ROOT-06
    It has 7675392 blocks, 4096 bytes each, 31.44Gb, which is 29.28GiB, the targeted size of the partition. I was wondering if I would have to run resize2fs, but YaST Partitioner already did it.
  • (Leaving off at this point.)
  • Boot this card and evaluate it.
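The size figures quoted for the root partition can be checked with a little arithmetic. A minimal sketch, using only the block count and block size reported by fsck above; the variable names are arbitrary.

```shell
#!/bin/sh
blocks=7675392      # block count reported by fsck
block_size=4096     # bytes per block
bytes=$((blocks * block_size))
# 1 Gb here means 10^9 bytes; 1 GiB means 2^30 bytes.
sizes=$(awk -v b="$bytes" 'BEGIN { printf "%.2f Gb, %.2f GiB", b / 1e9, b / (1024 * 1024 * 1024) }')
echo "$bytes bytes = $sizes"
# prints: 31438405632 bytes = 31.44 Gb, 29.28 GiB
```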

    Evaluation of the Pristine Image

    ALSA Mixer

    For once, it executes. It has HDMI (vc4-hdmi) but analog output is missing.

    HDMI Audio

    This is with the default audio route, i.e. 0. aplay -D hw:0,0 $sound produces audio open error: No such device, similar to behavior on the failing Iris. Evidently the HDMI channel has no ALSA driver. Next step will be to reboot with onboard audio enabled: dtparam=audio=on in /boot/efi/extraconfig.txt .

    Video

    DISPLAY=:0 XAUTHORITY=/run/lightdm/root/:0 ico produces a very active bouncing icosahedron. We have 307Mb of CMA memory, 273Mb free. The vc4 driver is loaded, plus infrastructure including: vchiq snd_soc snd_pcm drm_kms_helper drm. I'm going to evaluate sound first, and then turn on acceleration and see if it blows up.

    /var/log/Xorg.0.log reveals: modesetting (KMS) driver, using /dev/dri/card0, but Option AccelMethod none , from /etc/X11/xorg.conf.d/20-kms.conf . It's using 1920x1080px on output HDMI-1. Using swrast.

    Rebooting 10 times. It takes about 3 minutes each. No crashes, no anomalies.

    Onboard Audio

    This is the ALSA device (bcm2835). Per alsamixer, it is unmuted and the volume is at 40%. With the sound route set to 0, aplay -D hw:ALSA,0 $sound is played on HDMI (TV speakers) with good quality. Setting to 1, it is played on the onboard audio, for the first time in 5 days.

    Accelerated Video

    De-disabling acceleration and restarting the X-Server. It's trying glamoregl but giving up on it. AIGLX: Screen 0 is not DRI2 capable (still); it's falling back to swrast. This line of investigation will have to be finished later.

    Interim Conclusion

    We have a configuration with working sound and working non-accelerated video. The next job will be to identify where it goes bad in the transition from the pristine image to the failing Iris.

    Configuring Iris for CouchNet

    I'm going to change the configuration step by step, and I'll try to identify what step makes it stop working. Key points to monitor are:

    Inventory

    Saved in xena:/s1/kvm/iris/inven:

    • List of installed packages, rpm -qa
    • Exact copy of /boot/efi, specifically the DTB. Isn't that cute, rsync isn't on the image. zypper install rsync gets a segfault while retrieving the repo metadata. Orion's base URL is http://download.opensuse.org/ports/aarch64/tumbleweed/repo/oss/ and the image's base URL is identical (there's also a debug repo which is not enabled). I saved a copy of the repo file, and removed the repo (zypper removerepo openSUSE-Ports-Tumbleweed-repo-oss). Then I restored the copy, and again did zypper refresh. This time the refresh worked, and so did installing rsync, and retrieving /boot/efi.
    • For reference, Iris has kernel 4.20.12-1-default while Orion has 4.20.13-1-default, so Iris' kernel will be upgraded when I eventually do dist-upgrade.

    Admin Scripts

    Installing CouchNet administration scripts, particularly the selftest scripts.

    Checkout

    Running checkout.sh. A lot of packages are not installed; a lot of wanted daemons are not enabled; a lot of unwanted daemons are enabled and running. The only critical item is that the firewall is not installed.

    Lock Packages

    These are packages to particularly watch out for in the next section where they are updated or deleted. I'll lock them all in advance, then unlock after updating.

    • alsa alsa-oss alsa-plugins alsa-utils
    • kernel-default-4.20.12-1.1.aarch64
    • kernel-firmware-20190212-1.1.noarch
    • raspberrypi-firmware raspberrypi-firmware-config raspberrypi-firmware-dt u-boot-rpi3
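Locking and later unlocking this set can be scripted. The following is a sketch under assumptions: it locks by plain package name (the version suffixes in the list are dropped), lock_cmds is a hypothetical helper, and it only prints the zypper addlock / removelock commands instead of running them, so the list can be reviewed first.

```shell
#!/bin/sh
# Package names from the list above, version suffixes dropped
# (assumption: locking by plain package name is what is wanted).
LOCK_PKGS="alsa alsa-oss alsa-plugins alsa-utils
kernel-default kernel-firmware
raspberrypi-firmware raspberrypi-firmware-config raspberrypi-firmware-dt
u-boot-rpi3"

# lock_cmds VERB: print one zypper command per package.
# VERB is addlock before the update and removelock afterwards.
lock_cmds() {
    for pkg in $LOCK_PKGS; do
        echo "zypper $1 $pkg"
    done
}

lock_cmds addlock
```

After checking the printed commands they can be piped to sh as root; running lock_cmds removelock later gives the matching unlock commands.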

    post_jump

    Let's go through post_jump section by section. At each step, I'll reboot, check the display, and check sound. The sections are:

    • Phase 6 and earlier -- Check versions and consistency, install modified config files, install /m1/local. Most of this has already been done. It reboots (no crash or weird behavior), gets an IPv4+6 address, junks the IPv6 address, display is black. Lightdm could not get a list of users from dbus. Don't worry. Sound plays.

    • Phase 7 -- Install missing packages but with interactive mode.
      audit-pkgs -v -i -c -I
      276 keystone packages, Key gpg-pubkey-bbac6b14-5c755908 is obsolete, remove it if not needed. 16 packages could not be found, 260 remain. 3 JTD items: gvim-data, pulseaudio-module-gsettings are needed but not installable; systemd-logger conflicts with rsyslog. Solutions: Retain older vim-data. Omit pulseaudio-module-gconf. Toss systemd-logger. 10 packages to upgrade, 602 new, 1 to toss. It reboots successfully, no crash or weird behavior. It obtains and keeps its IPv6 address (and IPv4). The display manager starts up, with CouchNet branding. Sound plays over HDMI. alsamixer is unable to connect to pulseaudio. I used the trick of editing /var/lib/alsa/asound.state and restoring it with alsactl. Now the sound plays from the headphone jack (to the amp to the speakers).

    • Phase 8 -- Remove unwanted packages.
      audit-pkgs -v -e -c -I
      Removing 461 packages. Rebooting: /boot/grub2/themes/openSUSE/(various fonts) are missing, aborted, press any key to exit. The USB keyboard is not recognized for press any key. (grub2-branding-openSUSE was one of the packages that was removed.) I edited /boot/grub2/grub.cfg removing the theme, and it booted normally. Display is normal. Sound plays from HDMI. alsa-restore.service did not get run because /etc/asound.state is missing. So let's provide it, with the sound path correctly set (cp -p /var/lib/alsa/asound.state /etc/asound.state). Sound now plays from the speakers.

    • Phase 9 -- Update everything to the latest version (dist-upgrade).
      audit-pkgs -v -U -c -I
      Locks: it's honoring locks on unbound openssh and u-boot-rpi3. Other packages are also locked but are not mentioned. We're getting kernel-default-4.20.13-1.1 kernel-firmware raspberrypi-firmware raspberrypi-firmware-config . All of these are supposed to be locked. I'm aborting and checking the lock situation. The lock file got overwritten with the conf file update. Re-doing it. Now the desired packages are locked. 117 packages to upgrade, 4 new, 3 to remove. Upgrade finished OK. Rebooting: Came up, no weird behavior (except USB keyboard was ignored in the grub menu, as always). Display is functional. Sound plays correctly the first time.

    • Phase 10 -- Miscellaneous setup and cleanup. I have several packages locked, where it's very likely that the problem is hiding. I'm going to do the miscellaneous setup first, then upgrade the locked packages one at a time. On Diamond, post_jump -p 10 iris . Miscellaneous steps ran normally. Rebooting: It rebooted. Display is normal. Sound plays from speakers. checkout.sh: passed all tests (after restoring some conf files from backup).

    Restore from Backup

    To restore Iris' specific configuration and user files, first I did:
    cd /home/backup/iris ; ssync -a -n -O --exclude inventory.dat . iris:/
    (Capturing in a file the list of files that it proposes to transfer.) Then I went through the list of 819 files in detail and decided which ones I wanted, and which should not be transferred, to be overwritten or deleted in the next backup. 498 files will be transferred. Then:
    ssync -a -n -O -R --files-from iris-rstr . iris:/
    Rebooting after this step and re-running checkout.sh: It died with the 3 successive OOPSes ending in ip6_do_table.
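The selection of which of the 819 proposed files to restore was done by hand. If part of that review were mechanized, one possible sketch is a pattern-based pre-filter; filter_list, the sample paths and the rejection patterns below are all hypothetical and serve only to show the idea.

```shell
#!/bin/sh
# filter_list LIST PATTERNS: print the entries of LIST that match none of
# the extended regular expressions in PATTERNS (one pattern per line).
filter_list() {
    grep -E -v -f "$2" "$1"
}

work=$(mktemp -d)
# Made-up proposed-transfer list and rejection patterns.
printf '%s\n' etc/fstab etc/machine-id home/user/.bashrc var/cache/junk > "$work/proposed"
printf '%s\n' '^etc/machine-id$' '^var/cache/' > "$work/reject"

filter_list "$work/proposed" "$work/reject" > "$work/iris-rstr"
kept=$(wc -l < "$work/iris-rstr" | tr -d ' ')
echo "$kept files kept:"
cat "$work/iris-rstr"
rm -rf "$work"
```

The resulting list would still need a manual pass before being handed to --files-from, since patterns cannot judge whether an individual config file is safe to overwrite.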

    Re-Image Again #2

    I put the XFCE image on card 06, installed keys.iris, with 2 new files: /etc/asound.state and /etc/zypp/locks (should be non-threatening). I ran post_jump (133 mins). Rebooted. It died, where at this step previously it was running OK.

    Re-Image Again #3

    Repeating the same procedure but omitting the dist-upgrade (-p 8). Rebooting, it died. This machine is sooooo hosed, and its chances for redemption have run out. Continued in Replacing Iris.
