Fedora 18 – A Sysadmin’s view

The road less traveled

At our school we have around 100 desktops, a vast majority of which run Fedora, and somewhere around 900 users. We switched from Windows to Fedora shortly after Fedora 8 was released and we’ve hit 8, 10, 13, 16, and 17 (deploying a local koji instance has made it easier to upgrade).

As I finished putting together our new Fedora 18 image, there were a few things I wanted to mention.

The Good

  1. Offline updates: Traditionally, our systems automatically updated on shutdown. In the 16-17 releases, that became very fragile as any systemctl scriptlets in the updates would block because systemd was in the process of shutting down. Now, with systemd’s support for offline updates, we can download the updates on shutdown, reboot the computer, and install the updates in a minimal system environment. I’ve packaged my offline updater here.
  2. btrfs snapshots: This isn’t new in Fedora 18, but, with the availability of offline updates, we’ve finally been able to take proper advantage of it. One problem we have is that we have impatient students who think the reset button is the best way to get access to a computer that’s in the middle of a large update. Now, if some genius reboots the computer while it’s updating, it reverts to its pre-update state, and then attempts the update again. If, on the other hand, the update fails due to a software fault, the computer reverts to its pre-update state and boots normally. Either way, the system won’t be the half-updated zombie that so many of my Fedora 17 desktops are.
  3. dconf mandatory settings: Over the years we’ve moved from gconf to dconf, and I love the easy way that dconf allows us to set mandatory settings for Gnome. This continued working with only a small modification from Fedora 17 to Fedora 18, available here and here.
  4. Javascript config for polkit: I love how flexible this is. We push out the same Fedora image to our school laptops, but the primary difference compared to the desktop is that we allow our laptop users to suspend, hibernate and shutdown their laptops, while our desktop users can’t do any of the above. What I would really like to do is have the JS config check for the existence of a file (say /etc/sysconfig/laptop), and do different things based on that, but I haven’t managed to work out how to do that yet. My first attempt is here.
  5. systemd: This isn’t a new feature in 18, but systemd deserves a shout-out anyway. It does a great job of making my workstations boot quickly and has greatly simplified my initscripts. It’s so nice to be able to easily prevent the display manager from starting before we have mounted our network directories.
  6. Gnome Shell: We actually started experimenting with Gnome Shell when it was first included in Fedora, and I switched to it as the default desktop in Fedora 13. As we’ve moved from 13 to 16, then 17, and now 18, it’s been a nice clean evolution for our users. When I first enabled Gnome Shell in our Fedora 13 test environment, the feedback from our students was very positive. “It doesn’t look like Windows 98 any more!” was the most common comment. As we’ve upgraded, our users have only become more happy with it.

The Bad

The bad in Fedora 18 mainly comes down to the one area where Linux in general, and Fedora specifically, is weak – being backwards-compatible. This was noticeable in two very specific places:

  1. Javascript config for polkit: While I was impressed with the new javascript config’s flexibility, I was most definitely not impressed that my old pkla files were completely ignored. As a system administrator, I find it frustrating when I have to completely rewrite my configuration files because “now we have a better way”. I’ve read the blog post explaining the reasoning behind the switch to the JS config, but how hard would it have been to either keep the old pkla interpreter, or, if it was really desired, rewrite the pkla interpreter in javascript? The ironic part of this is that the “old” pkla configuration was itself a non-backwards-compatible change from the even older PolicyKit configuration a little less than four years ago.
  2. dconf mandatory settings: With the version of dconf in Fedora 18, we now have the ability to have multiple user dconf databases. This is a great feature, but it requires a change in the format of the database profile files, which meant my database profile files from Fedora 17 no longer worked correctly. In fact, they caused gnome-settings-daemon to crash, which crashed Gnome and left users unable to log in. Oops. To be fair, this was a far less annoying change because I only had to change a couple of lines, but I’m still not impressed that dconf couldn’t just read my old db profile files.

As a developer, I totally understand the “I have a better way” mindset, but I think backwards compatibility is still vital. That’s why I love rsync and systemd, but have very little time for unison (three different versions in the Fedora repositories because newer versions don’t speak the same language as older versions).

I know some people will say, “If you want stability, just use RHEL.” That’s fine, but I’m not necessarily looking for stability. I like the rate of change in Fedora. What I dislike is when things break because someone wanted to do something different.

All in all, I’ve been really happy with Fedora as our school’s primary OS, and each new release’s features only make me happier. Now I need to go fix a regression in yum-presto that popped up because of some changes we made because we wanted to do something different.

Under the hood

Two years ago, as mentioned in btrfs on the server, we set up btrfs as our primary filesystem on our data servers. After we started running into high load as our network expanded (and a brief experiment with GlusterFS as mentioned in GlusterFS Madness), in March we switched over to ext4 with the journal on an SSD.

So, as of March, we had three data servers. datastore01 was the primary server for usershare, our shared data. datastore03 was the primary server for users, which, surprisingly, held our users’ home directories. datastore02 was secondary for both usershare and users which were synced using DRBD.

One of the things I had originally envisioned when I set up our system was a self-correcting system. I played around with both the Red Hat Cluster suite and heartbeat and found that they were a bit much for what we were trying to achieve, but I wanted a system where, if a single data server went down, the only notice I would have would be a Nagios alert, and not a line of people outside my office asking me what the problem is.

While I never achieved that level of self correction, I could switch usershare from datastore01 to datastore02 with a less than 30-second delay, and the same applied with switching users from datastore03 to datastore02. NFS clients would connect to an aliased IP that switched when the filesystem switched, so they would only freeze for about 30 seconds, and then come back.

This made updating the systems pretty painless. I would update datastore02 first, reboot into the new kernel and verify that everything was working correctly. Then, I would migrate usershare over to datastore02 and update datastore01. After datastore01 came back up, I would migrate usershare back, and then repeat the process with users and datastore03.

We also had nightly rsync backups to backup01 which was running btrfs and which would create a snapshot after the backup finished. We implemented nightly backups after a ham-fisted idiot of a system administrator (who happens to sleep next to my wife every night) managed to corrupt our filesystem (and, coincidentally, come within a hair’s breadth of losing all of our data) back when we were still using btrfs. The problem with DRBD is that it writes stuff to the secondary drives immediately, which is great when you want network RAID, but bad when the corruption that you just did on the primary is immediately sent to the secondary. Oops. Anyhow, after we managed to recover from that disaster (with lots of prayer and a very timely patch from Josef Bacik), we decided that a nightly backup to a totally separate filesystem wouldn’t be a bad idea.

We also had two virtual hosts, virtserver01 and virtserver02. Our virtual machines’ hard drives were synced between the two using DRBD. We could stop a virtual machine on one host and start it on the other, but live migration didn’t work and backups were a nightly rsync to backup01.

So, after the switchover, our network looked something like this (click for full size):

I was pretty happy with our setup, but our load problem popped up again. While it was better than it was before the switch, it would still sometimes peak during breaks and immediately after school.

As I was asking myself what other system administrators do, it hit me that one of my problems was my obsession with self-correcting systems. More specifically, my obsession with automatic correction of a misbehaving server, rather than the more common issue of automatically “correcting” misbehaving hard drives. Because of that, I had been ignoring NAS’s as none of them seemed to have something that worked along the same lines as DRBD.

I started looking at FOSS NAS solutions, and found NAS4Free, a FreeBSD appliance that comes with the latest open-source version of ZFS. The beauty of ZFS when it comes to speed is that, unlike btrfs, it allows you to set up a SSD as a read cache or as the data log.

After running some tests over the summer, I found that ZFS with a SSD cache partition and SSD log partition was quite a bit faster than our ext4 partitions with SSD log, especially with multiple systems hitting the server at the same time with multiple small writes.

So we switched our data servers over to NAS4Free, reduced them to two, and added another backup server. The data servers are configured with RAIDZ1 plus SSD caches and logs. The backups are configured with RAIDZ1, no cache, no SSD log.

A nice feature of ZFS (which I believe btrfs also recently got) is the ability to send a diff between two snapshots from one server to another. Using this feature (which isn’t exposed in the NAS4Free web interface, but accessible using a bash script that runs at 2:00 every morning), I’m able to send my backups to the backup servers in far less time than it used to take to run rsync.

One other nice feature of NAS4Free is the ability of ZFS to create a “volume” which is basically a disk device as part of the data pool, and then export it using iSCSI. I switched our virtual machines’ hard drives from DRBD to iSCSI, which now allows us to live migrate from one virtual host to the other. We also get the bonus of automatic backups of the ZFS volumes as part of the snapshot diffs.

Now our network looks something like this (click for full size):

There is one major annoyance and one major regression in our system, though. First the annoyance. ZFS has no way of removing a drive. You can swap out a drive in a RAIDZ or mirror set, but once you’ve added a RAIDZ set, a mirror or even a single drive, you cannot remove them without destroying the pool. Apparently enterprise users never want to shrink their storage. More on this in my next post.

The major regression is that if either of our data servers goes down, the whole network goes down until I get the server back up. I can switch us over to the backups, but we’ll be using yesterday’s data if I do, so that’s very much a last resort. This basically means that I need to be ready to swap the drives into a new system if one of our data servers does go down. And there will be downtime if (when?) that happens. Joy.

So now we have a system that gives us the speed we need, but not the redundancy I’d like. What I’d really like would be a filesystem that is fully distributed, has no single point of failure and allows you to store volumes on it. GlusterFS fits the bill (mostly), but I’m gunshy at the moment. Ceph looks like it may fit the bill even better with RBD as well as CephFS, but the filesystem part isn’t considered production-ready yet.

So where does that leave us? As we begin the 2012-2013 school year, file access and writing is faster than ever. We’d need simultaneous failure of four hard drives before we start losing data, and, once I deploy our third backup server for high-priority data, it will take even more to lose the data. We do have a higher risk of downtime in the event of a server failure, but we’re not at the point where that downtime would keep us from our primary job, teaching.