The Paddy's Day bug

Last Sunday, I got a message from a coworker and a good friend, Dhaval, that his Fedora laptop was stuck during the boot process. His work laptop, also running Fedora, failed to boot as well. I checked my work laptop and personal laptop, and both of them rebooted just fine, so we started going through the normal troubleshooting process on his work laptop. There were no error messages on the screen, just a hang after “Update mlocate database every day”. We booted a live environment, mounted the laptop’s filesystem there, and checked the journal, and, again, there was nothing to see. Google didn’t turn anything up either. There wasn’t much installed on his personal laptop, so Dhaval suggested re-installing Fedora on it. Booting into the live USB worked perfectly and the reinstall happened without a hitch. We then rebooted into the newly installed system and… it hung. Again.

What? This was a completely clean install! Was it possible that Dhaval’s laptops both had some kind of boot sector virus? But his work laptop had secure boot on, which is supposed to protect against that kind of attack. I decided to compare the systemd units started at boot on my laptop with those on his. After going through multiple services, I noticed that raid-check.timer was set to start on Dhaval’s laptop, but wasn’t set up on mine. I started the timer on my laptop… and the system immediately became unresponsive! Using the live environment, I then disabled the timer on Dhaval’s laptop. One reboot later… and his system booted perfectly!

Neither of our laptops has RAID, and manually starting raid-check.service didn’t kill the system, so the problem seemed to be in systemd itself rather than the RAID service. What really concerned me, though, was that the problem occurred on a fresh install of Fedora. It turns out that, on F33, raid-check.timer is enabled by default. My laptops had both been upgraded from previous Fedora releases where this wasn’t the case, but Dhaval had performed fresh installs on his systems. Further testing confirmed that the bug only affected systems running raid-check.timer after 1:00AM on Mar 21st.

As far as I could see, this was going to affect everyone when they booted Fedora, and this had me worried. I figured the best place to start getting the word out was a bug report, so I opened one and then left messages in both #fedora-devel on IRC and the Fedora development mailing list. tomhughes was the first to respond on IRC with a handy kernel command line option I’d never seen before to temporarily mask the timer in systemd (systemd.mask=raid-check.timer). cmurf and nirik then started trying to work out where the bug was coming from. nirik was the first one to realize that daylight saving time might be the problem, since after 1:00AM on the 21st, the next trigger would be after the clocks change here in Ireland. But it was chrisawi who gave us the first ray of hope when they pointed out that the bug was only triggered if you were in the “Europe/Dublin” time zone.
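To make nirik’s reasoning concrete, here is a minimal sketch of the date arithmetic in Python. It assumes the timer fires weekly on Sundays at 01:00 local time (my assumption for illustration; the schedule isn’t quoted above) and uses the EU rule that clocks go forward at 01:00 UTC on the last Sunday of March. The function names are made up for this example and have nothing to do with systemd’s own code.

```python
from datetime import datetime, timedelta

SUNDAY = 6  # datetime.weekday(): Monday is 0, Sunday is 6


def next_weekly_trigger(now, weekday=SUNDAY, hour=1):
    """Next naive local time strictly after `now` that falls on
    `weekday` at `hour`:00 (assumed schedule: Sundays at 01:00)."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - candidate.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


def last_sunday_of_march(year):
    """Date on which EU clocks (including Ireland's) go forward."""
    last_day = datetime(year, 3, 31)
    return last_day - timedelta(days=(last_day.weekday() - SUNDAY) % 7)


# Just after 1:00AM on Sunday the 21st, the next trigger is a week later...
trigger = next_weekly_trigger(datetime(2021, 3, 21, 1, 0, 1))
# ...which is the same day the clocks change.
change_day = last_sunday_of_march(2021)
print(trigger, change_day.date(), trigger.date() == change_day.date())
```

Under that assumed schedule, the next trigger lands at 01:00 local time on the day of the change, which is exactly the hour that gets skipped when Irish clocks jump from 01:00 to 02:00.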

This was the first indication I had that the bug wasn’t affecting everyone, and that was a huge relief. I had feared that everyone who had installed Fedora 33 was going to have to work around this bug, but this dropped the number of affected people down to the Fedora users in Ireland. tomhughes pointed out that Ireland is unique in the world in that summer time is “normal” and going back during the winter is the “savings” time. Apparently systemd was having problems with the negative time offset, though it’s unclear to me why this hadn’t been triggered in previous years.
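If you want to see the negative offset for yourself, the following is a small sketch using Python’s zoneinfo module (Python 3.9 or later). It only inspects the time zone database; it says nothing about how systemd handles it. What dst() reports for winter depends on how your system’s tzdata was built, so treat the result as something to check rather than a guarantee. The function name is made up for this example.

```python
from datetime import datetime
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def dublin_offsets(year=2021):
    """Return (utcoffset, dst) pairs for mid-winter and mid-summer in
    Europe/Dublin, or None if no time zone database is available."""
    try:
        tz = ZoneInfo("Europe/Dublin")
    except ZoneInfoNotFoundError:
        return None
    winter = datetime(year, 1, 15, 12, tzinfo=tz)
    summer = datetime(year, 7, 15, 12, tzinfo=tz)
    return {
        "winter": (winter.utcoffset(), winter.dst()),
        "summer": (summer.utcoffset(), summer.dst()),
    }


offsets = dublin_offsets()
if offsets is None:
    print("No tz database found for Europe/Dublin")
else:
    for season, (utcoffset, dst) in offsets.items():
        print(season, "utcoffset:", utcoffset, "dst:", dst)
```

With a tzdata build that keeps Ireland’s negative DST, I’d expect winter to show a dst of minus one hour and summer a dst of zero, while the UTC offsets themselves are the usual 0 and +1 hours either way.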

Once it was clear that the bug was related to DST in Ireland, zbyszek was able to figure out that the problem involved an infinite loop and he created a fix. After an initial test update that caused major networking problems, we now have systemd-246.13 pushed to stable that fully resolves the problem.

One of the things I realized when I was doing the initial troubleshooting is how long it’s been since I’ve seen this kind of bug in Fedora. I think the last time I saw a bug of this severity was somewhere around ten years ago, though before that these kinds of bugs seemed to show up annually. It’s a tribute to the QA team and to the processes that have been established around creating and pushing updates that these kinds of show-stopping bugs are so rare. This bug is a reminder, though, that no matter how good your testing is, there will always be some bugs that fall through the cracks.

I would like to say a huge thank you again to zbyszek for fixing the bug and adamw for stopping the broken systemd update before it made it to stable.

If you’re in Ireland and doing a clean install of Fedora 33, you’ll need to work around the bug as follows:

  • In grub, type e to edit the current boot entry
  • Move the cursor to the line that starts with linux or linuxefi and type systemd.mask=raid-check.timer at the end of the line
  • Press Ctrl+X to boot
  • Once the system has booted, update systemd to 246.13 (or later)

Multiseat in Fedora 19

This year in our main computer room, we switched from single-seat systems to multiseat systems. Our old single-seat systems cost us roughly $300 a system, and we would generally buy 20 a year. The goal with our multiseat systems was to see if we could do better than $300/seat. I also had a number of requirements, some of which would raise the cost, while others couldn’t be met the last time I looked into multiseat systems.

My first requirement was 3D acceleration on all seats. I know someone’s been working on separating OpenGL processing from the display server, which would theoretically allow us to use Plugable devices, but until that’s done, we need a separate video card for each seat. We also need motherboards that can support more than one PCIE video card (as well as preferably supporting the built-in GPU). This is the main extra expense for our multiseat systems.

My second requirement was plug-and-play USB. The last time I looked into multiseat, that wasn’t supported under Linux; USB devices would only be detected if they were plugged in when the X server started. But, thanks to some relatively new code in systemd which is now controlling logins using logind, USB ports can be directed to specific seats, with the devices plugged into them appearing in the correct seat when they’re plugged in.

In June, we bought a test system that came to just under $600. To our normal order we added a gaming motherboard, three of the cheapest PCIe AMD Radeon 5xxx/6xxx series cards we could find, extra RAM, and four USB hubs. The idea with the USB hubs was to place one next to each monitor and create our own wannabe-Plugable devices. I then wrote a small program that would deterministically assign each USB hub to a different monitor on bootup. An extra bonus of this program is that we can daisy-chain the USB hubs. Once the program was working, I let the students play with the test system… and it worked!
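As a rough illustration of the idea (this is not the program described above, just a sketch of one way to do it), the Python below sorts a list of USB hub sysfs paths so the order is the same on every boot, then pairs each hub with a seat and builds the matching `loginctl attach` command. The sysfs paths and seat names are hypothetical, and the commands are only printed, not run.

```python
def plan_seat_assignments(hub_paths, seats):
    """Pair each USB hub with a seat in a stable order and return the
    `loginctl attach SEAT DEVICE` commands that would assign them."""
    ordered = sorted(hub_paths)  # sorting makes the mapping deterministic
    if len(ordered) > len(seats):
        raise ValueError("more hubs than seats")
    return [["loginctl", "attach", seat, path]
            for seat, path in zip(seats, ordered)]


# Hypothetical sysfs paths for four hubs, in whatever order they were found.
hubs = [
    "/sys/devices/pci0000:00/0000:00:14.0/usb1/1-4",
    "/sys/devices/pci0000:00/0000:00:14.0/usb1/1-2",
    "/sys/devices/pci0000:00/0000:00:14.0/usb1/1-3",
    "/sys/devices/pci0000:00/0000:00:14.0/usb1/1-1",
]
# Hypothetical seat names; seat0 is logind's default seat.
seats = ["seat0", "seat1", "seat2", "seat3"]

for command in plan_seat_assignments(hubs, seats):
    print(" ".join(command))
```

Because the pairing depends only on the sorted port paths, a hub plugged into the same physical port ends up on the same seat every time, regardless of the order in which the kernel detects the devices.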

So, during the summer, we bought ten more systems and put them in our main computer room. At four seats per system, each seat costs us roughly $150 instead of $300, a saving of 50%, so we were able to replace all forty computers in the main room in one year (and add four more seats as a bonus).
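The arithmetic behind those numbers can be sketched in a few lines of Python. The prices are the rounded figures from this post, and I’m assuming here that the yearly budget is simply what twenty single-seat systems would have cost; the function name is made up for this example.

```python
def multiseat_summary(old_cost_per_seat=300, systems_per_year=20,
                      multiseat_cost=600, seats_per_system=4,
                      test_systems=1):
    """Compare per-seat cost and yearly seat count for both setups."""
    cost_per_seat = multiseat_cost / seats_per_system
    saving = 1 - cost_per_seat / old_cost_per_seat
    # Assumption: the budget equals the usual single-seat spend per year.
    budget = old_cost_per_seat * systems_per_year
    new_systems = budget // multiseat_cost
    seats = new_systems * seats_per_system
    total_seats = seats + test_systems * seats_per_system
    return cost_per_seat, saving, new_systems, seats, total_seats


cost_per_seat, saving, new_systems, seats, total_seats = multiseat_summary()
print(f"${cost_per_seat:.0f} per seat, {saving:.0%} saved")
print(f"{new_systems} systems -> {seats} seats, {total_seats} with the test system")
```

In other words, the same money that used to buy twenty seats a year now buys forty, and the test system from June supplies the four bonus seats.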

The main annoyance we’re still dealing with is that the USB hubs we got aren’t that great, and we’ve had a few fail on us. But they’re easy (and cheap) to replace. I also had to make some changes to X, like re-enabling Ctrl+Alt+Backspace as a solution for a stuck seat, which is better than rebooting the whole computer. And we do have the occasional hang where all four seats stop working, which I think is tied to the number of open files, but I haven’t tracked it down yet.

I’ve been very happy with our multiseat systems and would like to extend a huge thank you to the systemd developers for their work on logind.

Edit: More details are available in this post.