The Paddy's Day bug

Last Sunday, I got a message from a coworker and good friend, Dhaval, that his Fedora laptop was stuck during the boot process. His work laptop, also running Fedora, also failed to boot. I checked my work laptop and personal laptop, and both of them rebooted just fine, so we started going through the normal troubleshooting process on his work laptop. There were no error messages on the screen, just a hang after “Update mlocate database every day”. We booted a live environment, mounted the laptop’s filesystem there, and checked the journal, but, again, there was nothing to see. Google didn’t turn anything up either. There wasn’t much installed on his personal laptop, so Dhaval suggested re-installing Fedora on it. Booting into the live USB worked perfectly and the reinstall happened without a hitch. We then rebooted into the newly installed system and… it hung. Again.

What? This was a completely clean install! Was it possible that Dhaval’s laptops both had some kind of boot sector virus? But his work laptop had secure boot on, which is supposed to protect against that kind of attack. I decided to compare the startup systemd services on my laptop with those on his. After going through multiple services, I noticed that raid-check.timer was set to start on Dhaval’s laptop, but wasn’t set up on mine. I started the timer on my laptop… and the system immediately became unresponsive! Using the live environment, I then disabled the timer on Dhaval’s laptop. One reboot later… and his system booted perfectly!
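
In case it helps anyone doing the same kind of detective work, a quick way to do that comparison is to dump the list of enabled units on each machine and diff the results. This is just a sketch: the file names are arbitrary, and on a machine that won’t boot you’d run the second command from a live environment with the broken system’s root mounted somewhere like /mnt, using systemctl’s --root option.

# systemctl list-unit-files --state=enabled --no-legend | sort > enabled-mine.txt
# systemctl --root=/mnt list-unit-files --state=enabled --no-legend | sort > enabled-dhaval.txt
# diff enabled-mine.txt enabled-dhaval.txt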

Neither of our laptops has RAID, and manually starting raid-check.service didn’t kill the system, so the problem seemed to be in systemd itself rather than in the RAID service. What really concerned me, though, was that the problem occurred on a fresh install of Fedora. It turns out that, on F33, raid-check.timer is enabled by default. My laptops had both been upgraded from previous Fedora releases, where this wasn’t the case, but Dhaval had performed fresh installs on his systems. Further testing confirmed that the bug only affected systems running raid-check.timer after 1:00AM on March 21st.
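
If you want to check whether one of your own systems is in the same situation, the timer’s state and its next scheduled trigger are easy enough to query (assuming, of course, that the system is in a state where it will answer you):

# systemctl is-enabled raid-check.timer
# systemctl list-timers raid-check.timer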

As far as I could see, this was going to affect everyone when they booted Fedora, and that had me worried. I figured the best place to start getting the word out was to open a bug report, and I then left messages in both #fedora-devel on IRC and the Fedora development mailing list. tomhughes was the first to respond on IRC, with a handy kernel command-line option I’d never seen before that temporarily masks the timer in systemd (systemd.mask=raid-check.timer). cmurf and nirik then started trying to work out where the bug was coming from. nirik was the first to realize that daylight saving time might be the problem, since after 1:00AM on the 21st, the next trigger would come after the clocks change here in Ireland. But it was chrisawi who gave us the first ray of hope when they pointed out that the bug was only triggered if you were in the “Europe/Dublin” time zone.

This was the first indication I had that the bug wasn’t affecting everyone, and that was a huge relief. I had feared that everyone who had installed Fedora 33 was going to have to work around this bug, but this dropped the number of affected people down to the Fedora users in Ireland. tomhughes pointed out that Ireland is unique in the world in that summer time is the “normal” time and the change back during the winter is the “saving” one, which is expressed as a negative offset. Apparently systemd was having problems with that negative time offset, though it’s unclear to me why this hadn’t been triggered in previous years.
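
If you’re curious about just how unusual Ireland’s definition is, zdump will show you the transitions for the zone; filtering for the year in question (2021 here) and comparing against a more conventional zone like Europe/London makes the difference easy to spot in the isdst flag on each transition:

# zdump -v Europe/Dublin | grep 2021
# zdump -v Europe/London | grep 2021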

Once it was clear that the bug was related to DST in Ireland, zbyszek was able to figure out that the problem involved an infinite loop, and he created a fix. After an initial test update that caused major networking problems, we now have systemd-246.13 pushed to stable, which fully resolves the problem.

One of the things I realized while doing the initial troubleshooting is how long it’s been since I’ve seen this kind of bug in Fedora. I think the last time I saw a bug of this severity was somewhere around ten years ago, though before that these kinds of bugs seemed to show up annually. It’s a tribute to the QA team, and to the processes that have been established around creating and pushing updates, that show-stopping bugs like this are so rare. This bug is a reminder, though, that no matter how good your testing is, there will always be some bugs that fall through the cracks.

I would like to say a huge thank you again to zbyszek for fixing the bug and adamw for stopping the broken systemd update before it made it to stable.

If you’re in Ireland and doing a clean install of Fedora 33, you’ll need to work around the bug as follows:

  • In grub, type e to edit the current boot entry
  • Move the cursor to the line that starts with linux or linuxefi and type systemd.mask=raid-check.timer at the end of the line
  • Press Ctrl+X to boot
  • Once the system has booted, update systemd to 246.13 (or later)
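
For completeness, the last step is just a normal package update once you’re booted, and if you can’t update straight away for some reason, masking the timer persistently will keep the machine bootable without having to re-type the kernel argument each time:

# dnf upgrade --refresh systemd
# systemctl mask raid-check.timer

The mask only needs to stay in place until you’re on the fixed systemd; systemctl unmask raid-check.timer undoes it.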

Escape from the dock(er)

I’ve been using Docker (packaged as moby-engine in Fedora) on my home server for quite a while now to run Nextcloud, Home Assistant, and a few other services.

When I upgraded to Fedora 31, I got quite a surprise when I realized that, because of its inability to handle cgroups v2, Docker was no longer supported out of the box. The fix is easy enough, but I took this as the kick in the pants I needed to switch over to podman.
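
For anyone who wants to stick with Docker, the fix I’m referring to is the commonly documented one of flipping the kernel back to the legacy cgroup hierarchy and rebooting; something like this should do it:

# grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"
# reboot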

The process was fairly straightforward, but there were a couple of gotchas that I wanted to document and a couple of podman features that I wanted to take advantage of.

No more daemon

This is both a feature and a gotcha when switching from Docker to podman. The Docker daemon, which has traditionally run as root, is an obvious attack vector, so its removal is a pretty compelling feature, but without a daemon, containers no longer start automatically on boot.

The (maybe not-so) obvious workaround is to treat each container as a service and use an obscure tool called systemd to manage the container lifecycle. Podman will even go to the trouble of generating a systemd service for you, if that’s what you want.
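
For reference, the generation is done with podman generate systemd, pointed at an existing container; something like the following (the output path is entirely up to you) gives you a unit that you can tweak and drop into /etc/systemd/system:

# podman generate systemd --name nextcloud > /etc/systemd/system/nextcloud-generated.service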

Unfortunately, there were a couple of things I was looking for that podman’s auto-generated services just didn’t cover. The first was container creation: I wanted a service that would create the container if it didn’t exist. The second was auto-updates: I wanted my containers to automatically update to the latest version on boot.

Just yesterday, Valentin Rothberg published a post on how to do the first, but, unfortunately, his post didn’t exist when I was trying to do this a month ago, so I had to wing it. I have shamelessly stolen a few of his ideas, though, to simplify my services.

Rootless

The other major feature I wanted was rootless containers: specifically, containers that would be started by root but would immediately drop privileges, so that root in the container is not the same as root on the host.

The systemd file I came up with looks something like this:

[Unit]
Description=Nextcloud
Wants=mariadb.service network-online.target
After=mariadb.service network-online.target

[Service]
Restart=on-failure
ExecStartPre=-/usr/bin/podman pull docker.io/library/nextcloud:stable
ExecStartPre=-/usr/bin/podman rm -f nextcloud
ExecStart=/usr/bin/podman run \
    --name nextcloud \
    --uidmap 0:110000:4999 \
    --gidmap 0:110000:4999 \
    --uidmap 65534:114999:1 \
    --gidmap 65534:114999:1 \
    --add-host mariadb:10.88.1.2 \
    --hostname nextcloud \
    --conmon-pidfile=/run/nextcloud.pid \
    --tty \
    -p 127.0.0.1:8888:80 \
    -v /var/lib/nextcloud/data:/var/www/html:Z \
    docker.io/library/nextcloud:stable
ExecStop=/usr/bin/podman rm -f nextcloud
KillMode=none
PIDFile=/run/nextcloud.pid

[Install]
WantedBy=multi-user.target

Most of this is pretty similar to what Valentin posted, but I want to highlight a few changes that are specific to my goals:

  • I’m pulling the image before starting the service. The - at the beginning of the ExecStartPre lines means that, if the pull fails for whatever reason, we will still start the service.

  • If there’s a container called nextcloud running before the service starts, we stop and remove it. There can be only one.

  • When we actually run podman run, we don’t use the -d (detached) flag, so this is a simple service rather than a forking one. The reason for this is that I want my container logs to be in the journal, tied to their service, and I haven’t worked out how to do that with a forking service.

  • The --uidmap and --gidmap flags map uids and gids 0-4998 in the container to 110000-114998 on the host. Because a number of containers have nobody mapped to uid/gid 65534, I then specially map that uid/gid to 114999 on the host. These flags allow my containers to think they’re running as root when they’re not, and should hopefully help protect my system on the off chance that an attacker manages to break out of a container.

  • The --tty flag is used because, without it, we get read/write problems with /dev/stdin, /dev/stdout, and /dev/stderr when using --uidmap 0.
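
Once a unit like the one above is saved (I’m assuming /etc/systemd/system/nextcloud.service here), it behaves like any other system service, and the container’s logs end up in the journal under the unit’s name:

# systemctl daemon-reload
# systemctl enable --now nextcloud.service
# journalctl -u nextcloud.service -f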

Runtime path bug

After running the above setup for a few weeks, I noticed that I kept losing the container state. I found a related bug report, investigated further, and realized that the container state for a system service should be in /run/crun rather than /run/user/0/crun, and that the latter directory was getting wiped every time I logged out after logging into my server as root (root being the account I use there).

I submitted a fix and it’s been merged upstream, so now we just need to wait for the fix to make it back downstream. In the meantime, I’ve made a scratch build with the fix applied.
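
If you want to see which behaviour a given system is exhibiting, the two locations mentioned above are easy enough to eyeball from a root shell once the service is running:

# ls /run/crun
# ls /run/user/0/crun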

Conclusion

With the podman fix described in the last section, my containers are now working to my satisfaction.

The real joy is when I run the following:

# podman exec nextcloud ps ax -o user,pid,stat,start,time,command
USER         PID STAT  STARTED     TIME COMMAND
root           1 Ss+  11:57:10 00:00:00 apache2 -DFOREGROUND
www-data      23 S+   11:57:11 00:00:06 apache2 -DFOREGROUND
www-data      24 S+   11:57:11 00:00:04 apache2 -DFOREGROUND
www-data      25 S+   11:57:11 00:00:03 apache2 -DFOREGROUND
www-data      26 S+   11:57:11 00:00:04 apache2 -DFOREGROUND
www-data      27 S+   11:57:11 00:00:02 apache2 -DFOREGROUND
www-data      28 S+   11:57:16 00:00:03 apache2 -DFOREGROUND
www-data      29 S+   11:58:18 00:00:02 apache2 -DFOREGROUND
www-data      34 S+   12:01:07 00:00:05 apache2 -DFOREGROUND
# ps ax -o user,pid,stat,start,time,command
USER         PID STAT  STARTED     TIME COMMAND
...
110000     64235 Ss+  11:57:10 00:00:00 apache2 -DFOREGROUND
110033     64336 S+   11:57:11 00:00:06 apache2 -DFOREGROUND
110033     64337 S+   11:57:11 00:00:04 apache2 -DFOREGROUND
110033     64338 S+   11:57:11 00:00:03 apache2 -DFOREGROUND
110033     64339 S+   11:57:11 00:00:04 apache2 -DFOREGROUND
110033     64340 S+   11:57:11 00:00:02 apache2 -DFOREGROUND
110033     64343 S+   11:57:16 00:00:03 apache2 -DFOREGROUND
110033     64359 S+   11:58:18 00:00:02 apache2 -DFOREGROUND
110033     64402 S+   12:01:07 00:00:05 apache2 -DFOREGROUND
...

Seeing both the root and www-data uids mapped to something with more restricted access makes me a very happy sysadmin.