In the last weekly update, I went from kernel 5.0.11 to 5.0.13. Networking became hosed, fortunately only on Jacinth, the main router. Its eth1 connects to the ISP (Frontier FIOS, formerly Verizon, IPv4 only), eth0 connects to the local LAN, wlan1 is controlled by hostapd for Wi-Fi service (802.11n), vnet0 connects to Jacinth's virtual machine Claude, and several tunnels are used by OpenVPN instances with various jobs. All but eth1 and the tunnels are in a bridge, which has Jacinth's own local IP addresses (IPv4 and IPv6).
The symptom was, Jacinth sent DHCP4 and ICMP6 solicitations, ARP, etc. on eth1, which were never answered, as seen by tcpdump. (The ICMP6 router solicitations have not been answered since 2006, but Jacinth sends them anyway.) ICMP4+6 pings going out on eth0 were also not answered. Other hosts on the LAN could ping each other but got no answers from Jacinth. It is likely, but not assured, that Jacinth never received any unicast packets on eth0. Jacinth could ping hosts on vnet0 and wlan1.
The kernel command line includes net.ifnames=0, turning off predictable device names and using whatever the kernel picks. There are udev rules keying on the MAC address that rename eth-type NICs to the appropriate eth0 and eth1. With kernel 5.0.11 the NICs were reliably renamed (or by luck consistently got the right names). With kernel 5.0.13 there was an error message in syslog saying:
eth1: Failed to rename network interface 3 from 'eth1' to 'eth0': File exists
and similar for eth0 to eth1. The rule for eth0 says (eth1 is similar; wrapping is not in the original):
SUBSYSTEM=="net", DRIVERS=="?*", ATTR{address}=="f4:4d:30:69:e2:1c", \
    KERNEL=="eth*", NAME="eth0"
What would be the consequence of failing to swap eth0 with eth1? Jacinth's wickedd-dhcp4 and wickedd-dhcp6 would send their DHCP solicitations to the local LAN on eth1 (should be eth0), where the ISP's DHCP server isn't. Local LAN hosts trying to connect to Jacinth could not do so because Jacinth's eth1 has no IP address. Jacinth would send ARP and ICMP6 neighbor discovery on the wild side eth0 (should be eth1), where the local LAN hosts aren't. Nothing would be answered, as actually seen.
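One way to confirm this diagnosis is to check which name each MAC address actually ended up with. The following is a minimal sketch, assuming the usual sysfs layout under /sys/class/net; the function names list_nics and name_for_mac are made up for illustration, and the optional directory argument exists only so the functions can be pointed at a test tree.

```shell
#!/bin/sh
# Print "name mac" for every network interface found in sysfs.
# $1 (optional, hypothetical): alternate directory instead of /sys/class/net.
list_nics() {
    sysnet="${1:-/sys/class/net}"
    for dev in "$sysnet"/*; do
        # Skip entries that are not interfaces (no readable address file).
        [ -r "$dev/address" ] || continue
        printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/address")"
    done
}

# Print the interface name(s) currently holding MAC address $1.
# $2 (optional): alternate sysfs directory, passed through to list_nics.
name_for_mac() {
    list_nics "${2:-/sys/class/net}" | awk -v mac="$1" '$2 == mac { print $1 }'
}

# Example: the udev rule above wants this MAC to be eth0.
# name_for_mac f4:4d:30:69:e2:1c
```

If the rename failed as described, the example call would print eth1 rather than eth0. A bridge may report the same address as one of its ports, so more than one name can appear.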
So what can I do about this?
In the transition to kernel 5.0.13 there were several seemingly
irrelevant changes to IPv4 and IPv6 networking. But I definitely don't
want to get tangled up fixing a difficult issue which the developers of
the kernel, systemd and udev have already declared to be deprecated.
Here's the official explanation of the device name issue.
Here's a brief summary of the issue.
In the good old days (1985), you would assign by hand the major and minor device numbers for each device, edit a table (devtab), and compile the monolithic kernel. Then you would use the makedevs script to create inodes with these device numbers. When the number of inodes in /dev got up over 10,000, this procedure was improved.
With devfs, which you could mount on /dev, the drivers would signal it with the major and minor device number of available devices. Naturally for a first attempt, improvements were needed, and in particular, heavyweight logic for conditionally naming the devices needed to be evicted from the kernel.
Now, udev in userspace receives the notices of created devices, and it is free to be as baroque as necessary in creating inodes for them. The kernel may provide a hint for the device name, e.g. eth for wired Ethernet, sd for a SCSI disc, or ttyUSB for a USB serial line, but udev is not required to use the hint. (In reality, there is no inode in /dev for a NIC.)
The big problem is that device drivers are initialized in an arbitrary order, even on different cores and therefore simultaneously, so the major device numbers are not predictable, and minor devices can also be randomized. In my case, one NIC is on PCI while the other is USB. It's a fact of life that one kernel's assignments may be reproducible, while at the most inopportune moment a seemingly irrelevant change will have a baleful effect on the numbering of the devices. I once had to maintain a super workstation with four discs whose BIOS enumerated them in a different order on every boot.
Formerly (kernel 5.0.11), if you renamed a NIC to an already occupied name, e.g. eth1 to eth0 when eth0 already existed, the target would be renamed to something unoccupied. Now (kernel 5.0.13) this is not happening. This may be an alternative fact: I can no longer find documentation for this behavior, and it is possible in my case that the PCI NIC formerly ended up by good luck as eth0 consistently, and the USB NIC was eth1, so no renaming was needed. But now the order is switched and renaming doesn't happen.
What NICs are on each host? Hardware devices only, excluding lo, bridges, VM host interfaces, and tunnels.
To summarize, every host has eth0 (to local LAN), some have wlan0 (wireless), and only Jacinth has more than this.
So what action should I take?
I'll remove net.ifnames=0 from the kernel command line. Should this be just for Jacinth? Or all hosts? I'm going to try to be as similar as possible on all hosts.
The kernel creates the devices with the traditional names ethN, wlanN, etc. Then udev can rename them. Renaming eth0 to eth0 is fine, but if I have to swap eth0 with eth1 it's going to fail, as I found out the hard way. Therefore I need to rename to a namespace other than ethN. wlanN has a similar issue.
Wired Ethernet, kernel eth*, will be renamed to enN. udev schemes that generate enN natively will not be enabled. I was born after horseless carriages and wireless telegraphy had gone the way of button shoes, so I'm going to call the Wi-Fi NICs radN for the radio.
I will scrap the existing udev rules and rely on systemd link files. Overriding 99-default.link I will provide every host with a file that matches eth0 and renames it to en0, and wlan0 to rad0. The files will be called 50-en0.link and 50-rad0.link (in /etc/systemd/network).
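A generic link file of this kind might look like the output of the sketch below. It assumes that matching on the kernel-assigned name with OriginalName= is acceptable on single-NIC hosts, and that a multi-NIC host such as Jacinth would match on MACAddress= instead. The helper write_link and its argument order are hypothetical; the [Match] and [Link] keys are the ones documented in systemd.link(5).

```shell
#!/bin/sh
# write_link DIR NEWNAME MATCHKEY MATCHVALUE
# Writes DIR/50-NEWNAME.link renaming the matched NIC to NEWNAME.
# (write_link is an illustrative helper, not part of systemd.)
write_link() {
    dir="$1"; new="$2"; key="$3"; val="$4"
    mkdir -p "$dir" || return 1
    cat > "$dir/50-$new.link" <<EOF
[Match]
$key=$val

[Link]
Name=$new
EOF
}

# Generic hosts: match the kernel name.
# write_link /etc/systemd/network en0  OriginalName eth0
# write_link /etc/systemd/network rad0 OriginalName wlan0
# Jacinth (assumption): match the MAC from the old udev rule instead.
# write_link /etc/systemd/network en0  MACAddress f4:4d:30:69:e2:1c
```

Because the target names are outside the kernel's ethN and wlanN namespace, the rename cannot collide with a name the kernel has already handed out, which is the failure seen with the old udev rules.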
Jacinth will get specific link files for all NICs. Match contingencies will go like this:
Off-topic tidbit: I'm having trouble with getting Intel NUCs to wake on LAN. I'm going to try putting WakeOnLan=magic in the eth0 link file. The default is off and I wonder if the BIOS is enabling it and then systemd-udevd is turning it off. [Update: power to the NIC is shut off in S3-S4-S5 despite being turned on in BIOS and despite this keyword.]
The firewall rules and /etc/sysconfig/network config files will have to be edited and/or renamed to use the new names.
Special backup: Put it in /tmp/netbak and Xena:/s1/netbak/$HOST/ with subdirs for network (for /etc/sysconfig/network) and rules.d (for /etc/udev/rules.d). Firewall rules also get backed up on Xena, same on all hosts, carefully curated.
Jacinth has the most configuration files and administrative scripts. I searched for anything that mentions ethN or wlanN, and edited it switching to enN or radN. Also I took the opportunity to map wlan1 to rad0. 31 files needed to be edited; about 50 more only had the NIC names in comments or were otherwise ignorable. I then selectively propagated the files to other hosts. Naturally, backup copies were saved in a separate directory. When possible I changed the files so they worked with either eth0 or en0, but this was not feasible for many of the files.
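The search for old names can be sketched roughly as follows. This assumes GNU grep (for -r and -w) and files that use # for whole-line comments; the function names are invented here, and the directories to scan are whatever holds the host's configuration and scripts.

```shell
#!/bin/sh
# List files under the given paths that mention ethN or wlanN as a whole word.
find_nic_refs() {
    grep -rlwE '(eth|wlan)[0-9]+' "$@" 2>/dev/null
}

# Narrow that list to files where the name appears outside a whole-line
# "#" comment. Trailing comments and other comment syntaxes are not handled.
find_live_nic_refs() {
    find_nic_refs "$@" | while IFS= read -r f; do
        if grep -vE '^[[:space:]]*#' "$f" | grep -qwE '(eth|wlan)[0-9]+'; then
            printf '%s\n' "$f"
        fi
    done
}

# Example:
# find_live_nic_refs /etc/sysconfig/network
```

The first list corresponds to everything that mentions a NIC name; the second is a first cut at the files that actually need editing, and still deserves a look by eye.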
The following steps will be done first on Petra, the development VM on my laptop, then on Jacinth, and finally on all other hosts in parallel.
In the firewall, the interface names of eth0 and wlan0 only occur in two files, nat-bit400-B5.wild and nat-bit2-B5.all. Related names like eth1 never appear. I will give parallel rules for en0 and rad0, and will remove eth0 and wlan0 after the project is confirmed working.
Edit /etc/default/grub.m4 to remove net.ifnames=0. Run /usr/diklo/lib/daily/grubdflt.J (which does grub2-mkconfig), or similarly edit /boot/grub2/grub.cfg (chicken!).
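For a host without the m4 wrapper, the same edit could be previewed with sed before touching anything. This is only a sketch: strip_ifnames is a made-up name, it prints the result instead of editing in place, and the file to feed it (/etc/default/grub on a stock install) is an assumption.

```shell
#!/bin/sh
# Print file $1 with the net.ifnames=0 parameter removed from any line.
# The parameter must be preceded by start of line, whitespace, or a quote.
strip_ifnames() {
    sed -E 's/(^|[[:space:]"])net\.ifnames=0[[:space:]]*/\1/' "$1"
}

# Preview the change, then apply it by hand if it looks right:
# strip_ifnames /etc/default/grub | diff /etc/default/grub -
# After regenerating grub.cfg and rebooting, confirm with:
# cat /proc/cmdline
```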
Copy /etc/sysconfig/network/ifcfg-eth0 to ifcfg-en0, and similarly for whatever other NICs the host has. Remember that Jacinth's wlan0 and wlan1 are being swapped into rad1 and rad0. The machine can be booted whether or not NICs are getting renamed (I hope).
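The copying step is simple enough to script. A minimal sketch, with copy_ifcfg as a hypothetical helper that leaves the old file in place so the host still comes up if the rename does not happen:

```shell
#!/bin/sh
# copy_ifcfg DIR OLD NEW
# Copy DIR/ifcfg-OLD to DIR/ifcfg-NEW, keeping mode and timestamps.
# Silently does nothing if the host has no ifcfg-OLD.
copy_ifcfg() {
    dir="$1"; old="$2"; new="$3"
    [ -f "$dir/ifcfg-$old" ] || return 0
    cp -p "$dir/ifcfg-$old" "$dir/ifcfg-$new"
}

# Every host:
# copy_ifcfg /etc/sysconfig/network eth0  en0
# copy_ifcfg /etc/sysconfig/network wlan0 rad0
# Jacinth only, including the wlan0/wlan1 swap:
# copy_ifcfg /etc/sysconfig/network eth1  en1
# copy_ifcfg /etc/sysconfig/network wlan1 rad0
# copy_ifcfg /etc/sysconfig/network wlan0 rad1
```

On Jacinth the last two calls must not be run in a way that overwrites a freshly copied file, which is why each source is an old wlanN name and each destination a new radN name. Any file that names other interfaces in its contents (the bridge's port list, presumably) still needs a manual edit.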
Remove /etc/udev/rules.d/70-persistent-net.rules.
Install the link file(s) /etc/systemd/network/50-en0.link and /etc/systemd/network/50-rad0.link. Jacinth needs files for en1 and rad1, and all need to be handcrafted.
Host-specific files to be edited. Copy the special versions to /m1/custom/conffiles .
Reboot the machine and see what happens.
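"See what happens" can be made a little more systematic with a check like the one below. It is a sketch under the assumption that success means every expected new name exists and no ethN or wlanN name survives; check_names and its arguments are invented for this example.

```shell
#!/bin/sh
# check_names SYSNET NAME...
# Report whether each expected interface NAME exists under SYSNET
# (normally /sys/class/net) and whether any old-style name is left over.
# Returns nonzero if anything is missing or not renamed.
check_names() {
    sysnet="$1"; shift
    rc=0
    for want in "$@"; do
        if [ -e "$sysnet/$want" ]; then
            echo "ok: $want"
        else
            echo "MISSING: $want"; rc=1
        fi
    done
    for old in "$sysnet"/eth[0-9]* "$sysnet"/wlan[0-9]*; do
        # An unmatched glob stays literal and fails the existence test.
        if [ -e "$old" ]; then
            echo "not renamed: $(basename "$old")"; rc=1
        fi
    done
    return $rc
}

# Typical host:  check_names /sys/class/net en0 rad0
# Jacinth:       check_names /sys/class/net en0 en1 rad0 rad1
# To see which link file udev applied to a NIC:
# udevadm info /sys/class/net/en0 | grep ID_NET_LINK_FILE
```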
Per machine outcomes:
Once I'm confident that it won't have to be reverted, remove /etc/sysconfig/network/ifcfg-eth0 and friends. Also edit the two firewall files to remove eth0 and wlan0. [Done.]
Files to be copied into the post_jump storage area (so possibly unintended changes can be reported, or these files can be installed on a new machine):
Run conffiles.J and see if anything is in /m1/custom/conffiles that was monkeyed with. Copy the new version there. [Done.]