Escape from the dock(er)

I’ve been using Docker (packaged as moby-engine in Fedora) on my home server for quite a while now to run Nextcloud, Home Assistant, and a few other services.

When I upgraded to Fedora 31, I got quite the surprise when I realized that Docker is no longer supported out of the box because it can't yet handle cgroups v2. The fix is easy enough, but I took this as the kick in the pants I needed to switch over to podman.
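
For the record, the usual fix (not the route I took) is to flip the kernel back to cgroups v1 and reboot, something along these lines:

# grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"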

The process was fairly straightforward, but there were a couple of gotchas that I wanted to document and a couple of podman features that I wanted to take advantage of.

No more daemon

This is both a feature and a gotcha when switching from Docker to podman. The Docker daemon, which has traditionally run as root, is an obvious attack vector, so its removal is a pretty compelling feature. But without a daemon, containers will no longer automatically start on boot.

The (maybe not-so) obvious workaround is to treat each container as a service and use an obscure tool called systemd to manage the container lifecycle. Podman will even go to the trouble of generating a systemd service for you, if that’s what you want.
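
Generating a unit for an existing container is a one-liner that prints the service file to stdout (nextcloud here being my container):

# podman generate systemd --name nextcloud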

Unfortunately, there were a couple of things I was looking for that podman's auto-generated services just didn't cover. The first was container creation: I wanted a service that would create the container if it didn't already exist. The second was auto-updates: I wanted my containers to automatically update to the latest version on boot.

Just yesterday, Valentin Rothberg published a post on how to do the first, but, unfortunately, his post didn’t exist when I was trying to do this a month ago, so I had to wing it. I have shamelessly stolen a few of his ideas, though, to simplify my services.

Rootless

The one other major feature I wanted was rootless containers, specifically containers that would be started by root, but would immediately drop privileges so root in the container is not the same as root on the host.

The systemd file I came up with looks something like this:

[Unit]
Description=Nextcloud
Wants=mariadb.service network-online.target
After=mariadb.service network-online.target

[Service]
Restart=on-failure
ExecStartPre=-/usr/bin/podman pull docker.io/library/nextcloud:stable
ExecStartPre=-/usr/bin/podman rm -f nextcloud
ExecStart=/usr/bin/podman run \
    --name nextcloud \
    --uidmap 0:110000:4999 \
    --gidmap 0:110000:4999 \
    --uidmap 65534:114999:1 \
    --gidmap 65534:114999:1 \
    --add-host mariadb:10.88.1.2 \
    --hostname nextcloud \
    --conmon-pidfile=/run/nextcloud.pid \
    --tty \
    -p 127.0.0.1:8888:80 \
    -v /var/lib/nextcloud/data:/var/www/html:Z \
    docker.io/library/nextcloud:stable
ExecStop=/usr/bin/podman rm -f nextcloud
KillMode=none
PIDFile=/run/nextcloud.pid

[Install]
WantedBy=multi-user.target
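
A unit like this goes in /etc/systemd/system/ (as nextcloud.service, say) and gets enabled the usual way:

# systemctl daemon-reload
# systemctl enable --now nextcloud.service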

Most of this is pretty similar to what Valentin posted, but I want to highlight a few changes that are specific to my goals:

  • I’m pulling the image before starting the service. The - at the beginning of the ExecStartPre lines means that, if they fail (say, the pull fails because the registry is unreachable, or there’s no old container to remove), we still start the service.

  • If there’s already a container called nextcloud, running or not, we remove it before starting the new one. There can be only one.

  • When we actually run podman run, we don’t use the -d (detached) flag, so this is a simple service rather than a forking one. The reason is that I want my container logs to be in the journal, tied to their service (see the quick checks after this list), and I haven’t worked out how to do that with a forking service.

  • The --uidmap and --gidmap flags map the uids and gids from 0-4998 in the container to 110000-114998 on the host. Because a number of containers have nobody mapped to uid/gid 65534, I then specially map that uid/gid to 114999 on the host (see the check after this list). Using these flags allows my containers to think they’re running as root when they’re not, and should hopefully help protect my system on the off chance that an attacker were able to break out of the container.

  • The --tty flag is used because, without it, we get read/write problems with /dev/stdin, /dev/stdout, and /dev/stderr when using --uidmap 0.
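
Two quick sanity checks. Because the container runs in the foreground under a simple service, its logs show up in the journal for that service:

# journalctl -u nextcloud.service

And the uid mapping can be confirmed from inside the container (output illustrative; the second column is the host-side start of each range):

# podman exec nextcloud cat /proc/1/uid_map
         0     110000       4999
     65534     114999          1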

Runtime path bug

After running the above setup for a few weeks, I noticed that I kept losing the container state. I found a related bug report, investigated further, and realized that the container state for a system service should be in /run/crun rather than /run/user/0/crun, and that the latter directory was getting wiped whenever I logged out after logging into my server as root (because root is my own account).

I submitted a fix and it’s been merged upstream, so now we just need to wait for the fix to make it back downstream. In the meantime, I’ve made a scratch build with the fix applied.

Conclusion

With the podman fix described in the last section, my containers are now working to my satisfaction.

The real joy is when I run the following:

# podman exec nextcloud ps ax -o user,pid,stat,start,time,command
USER         PID STAT  STARTED     TIME COMMAND
root           1 Ss+  11:57:10 00:00:00 apache2 -DFOREGROUND
www-data      23 S+   11:57:11 00:00:06 apache2 -DFOREGROUND
www-data      24 S+   11:57:11 00:00:04 apache2 -DFOREGROUND
www-data      25 S+   11:57:11 00:00:03 apache2 -DFOREGROUND
www-data      26 S+   11:57:11 00:00:04 apache2 -DFOREGROUND
www-data      27 S+   11:57:11 00:00:02 apache2 -DFOREGROUND
www-data      28 S+   11:57:16 00:00:03 apache2 -DFOREGROUND
www-data      29 S+   11:58:18 00:00:02 apache2 -DFOREGROUND
www-data      34 S+   12:01:07 00:00:05 apache2 -DFOREGROUND
# ps ax -o user,pid,stat,start,time,command
USER         PID STAT  STARTED     TIME COMMAND
...
110000     64235 Ss+  11:57:10 00:00:00 apache2 -DFOREGROUND
110033     64336 S+   11:57:11 00:00:06 apache2 -DFOREGROUND
110033     64337 S+   11:57:11 00:00:04 apache2 -DFOREGROUND
110033     64338 S+   11:57:11 00:00:03 apache2 -DFOREGROUND
110033     64339 S+   11:57:11 00:00:04 apache2 -DFOREGROUND
110033     64340 S+   11:57:11 00:00:02 apache2 -DFOREGROUND
110033     64343 S+   11:57:16 00:00:03 apache2 -DFOREGROUND
110033     64359 S+   11:58:18 00:00:02 apache2 -DFOREGROUND
110033     64402 S+   12:01:07 00:00:05 apache2 -DFOREGROUND
...

Seeing both the root and www-data uids mapped to something with more restricted access makes me a very happy sysadmin.

Anatomy of a bug

DNF not downloading zchunk metadata in Fedora

I’m not sure how many have noticed, but DNF hasn’t been downloading zchunk metadata in Fedora since the beginning of this month due to a bug in… well, it’s complicated.

Let’s start with the good news. The fact that almost nobody has noticed means that the fallback to non-zchunk metadata is working perfectly. Fedora is still generating zchunk metadata, and when we get the fix built for Fedora, DNF will automatically go back to downloading it. And, when that happens, almost nobody will notice (but their metadata downloads will be greatly reduced in size again).

So, what happened?

Back in July, a bug was found in RHEL 8 that caused a crash whenever zchunk-enabled repositories were encountered. The bug turned out to be in librepo, where LRO_SUPPORTS_CACHEDIR was defined in all recent versions. This define is used by libdnf to detect whether to ask libsolv and librepo to attempt to use zchunk (libdnf doesn’t actually use zchunk itself). The problem was that RHEL 8’s librepo had the define, but didn’t actually have zchunk enabled, which caused Bad Things To Happen™.

The DNF developers quickly settled on an easy fix, wrapping the define in an #ifdef WITH_ZCHUNK so it would only be defined if zchunk was enabled in librepo. At least, that’s how it was supposed to work.
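
In sketch form, the intended result looks something like this (the idea, not the exact librepo source):

/* librepo header: only advertise cache-dir support when librepo
 * itself was built with zchunk enabled */
#ifdef WITH_ZCHUNK
#define LRO_SUPPORTS_CACHEDIR
#endif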

This fix made it into Fedora and had no noticeable effect until libdnf was rebuilt early this month. It turns out that the header file with the define is included as-is in libdnf, and libdnf wasn’t being built with WITH_ZCHUNK enabled. So LRO_SUPPORTS_CACHEDIR wasn’t defined, so libdnf never passed librepo the base cache directory, so librepo couldn’t find the old zchunk files, so librepo downloaded the full non-zchunk metadata instead. Oops.
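
A sketch of the consumer side makes the failure mode clearer. Both the #ifdef WITH_ZCHUNK inside the header and the check below are evaluated when libdnf itself is compiled, using libdnf's build flags, so librepo having zchunk enabled doesn't help:

#include <librepo/librepo.h>

#ifdef LRO_SUPPORTS_CACHEDIR
/* ask libsolv/librepo to use zchunk and hand librepo the base cache dir */
#else
/* define not visible at libdnf build time: no cache directory is passed,
 * and librepo falls back to downloading the full, non-zchunk metadata */
#endif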

How do we fix it?

There are a couple of ways to fix this. The method we’ve gone with is to define WITH_ZCHUNK in libdnf so LRO_SUPPORTS_CACHEDIR is defined once more. This fix is currently in review, and once it’s pushed upstream, we’ll try to get libdnf builds done in Fedora with it included.

Once a libdnf build is pushed out with LRO_SUPPORTS_CACHEDIR defined, DNF will once more start downloading zchunk metadata.