How do you rank a sysadmin?

Sysadmin at work

When I heard about the 4th Linux Showdown, sponsored by TrueAbility, I was pretty excited. I’m a pretty competitive guy, so the idea of competing in a sysadmin’s challenge sounded like fun.

In the Linux Showdown, you get 30 minutes to complete a certain number of sysadmin tasks. Some of the tasks are pretty simple, while some of the others become more difficult. I entered the first day and managed to get 9th place with a score of 100% and a time of just under 17 minutes.

The second day I ran into trouble. One of the tasks was to reset the mysql root password, and, though I followed the directions here, twice, I was never able to log into mysql as root. The commands seemed to be running correctly, but I was locked out.

In my day-job as the system administrator for a school, I would keep bashing away at the problem until I figured out what I was doing wrong. In the competition, I ran out of time after fifteen minutes of debugging and ended up with a lousy 40%. Ouch!

I was frustrated, but figured the third day’s competition should fit a bit better. The hint said that it was a scripting competition, and my python foo is pretty decent. Sure enough, day three involved finding files with modification times between two dates, adding them to a database, and then tarring them up.

I came up with a python script that found the necessary files and added them to the database. Except my clever ‘INSERT’ statement didn’t actually work. If I manually copied and pasted it into mysql, it worked perfectly, but it didn’t run from the script. Grrr. I spent ten minutes debugging… and my time was up!

Well, that sucked. This time I got an impressive 20%. Double ouch!

After finishing the test, I went to bed and spent fifteen minutes ranting to my poor wife. The next day, after cooling off, I decided I was done. The hint for the last competition said that it had something to do with security, and I wouldn’t call myself an expert on that. If I’m getting 20% in the areas that I’m relatively good at, then what should I expect in areas that I’m less comfortable with.

Then it hit me. If I’m not comfortable with it, why not just do it for fun? If I know I’m probably going to get a zero, who cares? I checked the leaderboard, and the highest score at the time was 67%, so my zero wouldn’t be so bad. I went ahead and started the last competition.

Step one, secure the mail server. We don’t run our own mail servers here at the school and I know nothing about postfix, so I spent ten minutes or so Googling for some kind of solution, typed in what I thought was a partial fix, and then decided to give up.

Step two, secure a page on the webserver. This is something I have to do quite often, so I was able to get it done in five minutes or so.

Finally, step three, secure an FTP server. Who still uses FTP? We don’t! I wasn’t even sure what the ftp daemon’s name was, so I ran a ‘ps aux | grep ftp’. This was the only reason that I noticed that the ftp daemon wasn’t using the config file in /etc, but rather some config file in someone’s home directory. I did what I thought would secure the ftp server in both config files, and saw that I had a little over two minutes left.

Ok, I could have spent some more time on postfix, but I knew nothing about it, so I decided that I was finished. Worst case, I’d get 33% for the webserver (which was the only fix I’d actually tested). Best case, 67% for the ftp server, which I was pretty sure I’d fixed. If so, I might actually get in the top twenty. So, I logged in to the leaderboard, checked my ranking… First!??!? With 100%? What?

Apparently the random lines from Google that I put into my postfix config had secured it. Pure luck. As I followed the leaderboard for the rest of the day, it became obvious that many people with a lot of experience with apache, postfix and ftp were whipping right through the contest, missing the ftp config file in the home directory, and getting 67%, while I kept sitting on top with the lone 100%. I felt like such a fraud.

Finally, in the last hour before the contest ending, someone else found the solution five minutes faster than I did and got first place. Praise God! I still felt like a fraud, but at least first place was going to someone who knew what they were doing.

So, in four days of competitions, I got the highest score in the areas I was weakest in and the lowest score in the areas I was strongest in. That seems to indicate either that I don’t know what my strengths and weaknesses are, or that the competition needs some tweaking. Well, I think I’m at least reasonably aware of my strengths and weaknesses, and I’m very aware of how much of a role chance played in all four days of competition. So how can this competition be tweaked?

The strengths of the competition are pretty obvious. The whole point of TrueAbility is to winnow out people who talk the talk, but can’t walk the walk. When you get a résumé, you don’t know whether the applicant can actually do all the things they claim to be able to do, so, with TrueAbility, you give someone a VM and a list of tasks, and see whether or not they can do them. TrueAbility doesn’t care how they do the tasks, they just check that the tasks are completed. Brilliant!

The biggest weakness in the competition is the time limit. A vast majority of the problems we face as sysadmins need to be fixed quickly, but rarely does a complex problem need to be solved within 30 minutes. This time limit in the competition introduces a bias against those who work methodically. While hiring fast workers is always nice, basing hiring decisions based on how fast someone can code rather than how well they code is not wise.

In addition, the marking (especially for the last few days) was extremely coarse, so ranking was heavily dependent on how quickly you finished. This was especially noticeable in the first day, where the only difference between 1st place and 28th place was whether you took 10 minutes to finish the job or 30 minutes. As was obvious in the last day’s competition, this emphasis on time caused people to rush so much that they made mistakes. Time makes a lousy basis for ranking.

So what’s the solution? I see two complementary things that could be done to improve the competition. The first is to break down the grading even more, and assign different values to the different tasks. I’d even add in some standard tasks (with a total score of a maximum of 20%) along the lines of “Make sure that you close any ports not needed for your task”, “Disallow password logins over ssh and set up the server to trust your ssh key”, and “Replace your Ubuntu install with the real sysadmin’s OS: Fedora”. Ok, I’m half joking on that last one, but you get the idea. The key thing is that it should be almost impossible to get 100%, but a mediocre sysadmin should be able to hit 70% with only minor difficulty, and a talented sysadmin shouldn’t have much trouble reaching 90%.

The other thing that would help would be a removal of the hard deadline. Instead, allow candidates to continue working beyond the time limit, with a deduction of 1-2% for every minute. This introduces a cost to breaking the deadline without causing the candidate to completely fail because they needed ten more minutes.

With these two adjustments, time should become secondary to doing the job right. If I spend 10 minutes getting 90%, I’ll still get a lower score than someone who takes their time to do it right in 30 minutes. And, if I spend 40 minutes reaching 90%, I’ll only lose 20% for going over and end with a score of 70%, rather than sitting at zero because I just couldn’t finish my script within the deadline.

TrueAbility, thank you for the time and effort you’ve put into developing the problems for this competition, and thank you for the creative idea of a sysadmin’s competition in the first place.

And I really want to congratulate those who were able to consistently get high scores under the tough time limits.

Now I’m off to get some sleep before our first day of school.

Messy wires credit – Cisco Spaghetti by CHRISTOPHER MACSURAK. Used under the CC-BY 2.0 license.

Fedora 18 – A Sysadmin’s view

The road less traveled

At our school we have around 100 desktops, a vast majority of which run Fedora, and somewhere around 900 users. We switched from Windows to Fedora shortly after Fedora 8 was released and we’ve hit 8, 10, 13, 16, and 17 (deploying a local koji instance has made it easier to upgrade).

As I finished putting together our new Fedora 18 image, there were a few things I wanted to mention.

The Good

  1. Offline updates: Traditionally, our systems automatically updated on shutdown. In the 16-17 releases, that became very fragile as any systemctl scriptlets in the updates would block because systemd was in the process of shutting down. Now, with systemd’s support for offline updates, we can download the updates on shutdown, reboot the computer, and install the updates in a minimal system environment. I’ve packaged my offline updater here.
  2. btrfs snapshots: This isn’t new in Fedora 18, but, with the availability of offline updates, we’ve finally been able to take proper advantage of it. One problem we have is that we have impatient students who think the reset button is the best way to get access to a computer that’s in the middle of a large update. Now, if some genius reboots the computer while it’s updating, it reverts to its pre-update state, and then attempts the update again. If, on the other hand, the update fails due to a software fault, the computer reverts to its pre-update state and boots normally. Either way, the system won’t be the half-updated zombie that so many of my Fedora 17 desktops are.
  3. dconf mandatory settings: Over the years we’ve moved from gconf to dconf, and I love the easy way that dconf allows us to set mandatory settings for Gnome. This continued working with only a small modification from Fedora 17 to Fedora 18, available here and here.
  4. Javascript config for polkit: I love how flexible this is. We push out the same Fedora image to our school laptops, but the primary difference compared to the desktop is that we allow our laptop users to suspend, hibernate and shutdown their laptops, while our desktop users can’t do any of the above. What I would really like to do is have the JS config check for the existence of a file (say /etc/sysconfig/laptop), and do different things based on that, but I haven’t managed to work out how to do that yet. My first attempt is here.
  5. systemd: This isn’t a new feature in 18, but systemd deserves a shout-out anyway. It does a great job of making my workstations boot quickly and has greatly simplified my initscripts. It’s so nice to be able to easily prevent the display manager from starting before we have mounted our network directories.
  6. Gnome Shell: We actually started experimenting with Gnome Shell when it was first included in Fedora, and I switched to it as the default desktop in Fedora 13. As we’ve moved from 13 to 16, then 17, and now 18, it’s been a nice clean evolution for our users. When I first enabled Gnome Shell in our Fedora 13 test environment, the feedback from our students was very positive. “It doesn’t look like Windows 98 any more!” was the most common comment. As we’ve upgraded, our users have only become more happy with it.

The Bad

The bad in Fedora 18 mainly comes down to the one area where Linux in general, and Fedora specifically, is weak – being backwards-compatible. This was noticeable in two very specific places:

  1. Javascript config for polkit: While I was impressed with the new javascript config’s flexibility, I was most definitely not impressed that my old pkla files were completely ignored. As a system administrator, I find it frustrating when I have to completely rewrite my configuration files because “now we have a better way”. I’ve read the blog post explaining the reasoning behind the switch to the JS config, but how hard would it have been to either keep the old pkla interpreter, or, if it was really desired, rewrite the pkla interpreter in javascript? The ironic part of this is that the “old” pkla configuration was itself a non-backwards-compatible change from the even older PolicyKit configuration a little less than four years ago.
  2. dconf mandatory settings: With the version of dconf in Fedora 18, we now have the ability to have multiple user dconf databases. This is a great feature, but it requires a change in the format of the database profile files, which meant my database profile files from Fedora 17 no longer worked correctly. In fact, they caused gnome-settings-daemon to crash, which crashed Gnome and left users unable to log in. Oops. To be fair, this was a far less annoying change because I only had to change a couple of lines, but I’m still not impressed that dconf couldn’t just read my old db profile files.

As a developer, I totally understand the “I have a better way” mindset, but I think backwards compatibility is still vital. That’s why I love rsync and systemd, but have very little time for unison (three different versions in the Fedora repositories because newer versions don’t speak the same language as older versions).

I know some people will say, “If you want stability, just use RHEL.” That’s fine, but I’m not necessarily looking for stability. I like the rate of change in Fedora. What I dislike is when things break because someone wanted to do something different.

All in all, I’ve been really happy with Fedora as our school’s primary OS, and each new release’s features only make me happier. Now I need to go fix a regression in yum-presto that popped up because of some changes we made because we wanted to do something different.