Under the hood

Two years ago, as mentioned in Btrfs on the server, we set up btrfs as our primary filesystem on our data servers. After we started running into high load as our network expanded (and after a brief experiment with GlusterFS, described in GlusterFS Madness), in March we switched over to ext4 with the journal on an SSD.
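For anyone curious, putting an ext4 journal on an SSD is only a couple of commands. This is just a sketch; the device names below are placeholders, not our actual layout:

    # Turn a small SSD partition into an external journal device (placeholder names).
    mke2fs -b 4096 -O journal_dev /dev/ssd1

    # Create the ext4 filesystem on the data disk, pointing it at that journal.
    mkfs.ext4 -b 4096 -J device=/dev/ssd1 /dev/sdb1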

So, as of March, we had three data servers. datastore01 was the primary server for usershare, our shared data. datastore03 was the primary server for users, which, surprisingly, held our users’ home directories. datastore02 was secondary for both usershare and users, both of which were synced using DRBD.
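A DRBD resource definition for something like usershare is only a few lines. This is an illustrative sketch, with made-up devices and addresses rather than our actual config:

    # /etc/drbd.d/usershare.res (illustrative only)
    resource usershare {
      protocol C;               # synchronous replication to the secondary
      device    /dev/drbd0;
      disk      /dev/sdb1;      # backing partition (placeholder)
      meta-disk internal;
      on datastore01 { address 10.0.0.11:7788; }
      on datastore02 { address 10.0.0.12:7788; }
    }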

One of the things I had originally envisioned when I set up our system was a self-correcting system. I played around with both the Red Hat Cluster Suite and Heartbeat and found that they were a bit much for what we were trying to achieve, but I wanted a system where, if a single data server went down, the only notice I would get would be a Nagios alert, not a line of people outside my office asking me what the problem was.

While I never achieved that level of self-correction, I could switch usershare from datastore01 to datastore02 with less than a 30-second delay, and the same applied to switching users from datastore03 to datastore02. NFS clients would connect to an aliased IP that moved when the filesystem moved, so they would only freeze for about 30 seconds and then come back.
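The switchover itself was just a handful of commands, roughly like the following; the service IP, interface, and mount point here are made up for illustration:

    # On datastore01: stop exporting, unmount, demote, drop the service IP.
    exportfs -ua
    umount /srv/usershare
    drbdadm secondary usershare
    ip addr del 10.0.0.20/24 dev eth0

    # On datastore02: promote, mount, bring up the service IP, re-export.
    drbdadm primary usershare
    mount /dev/drbd0 /srv/usershare
    ip addr add 10.0.0.20/24 dev eth0
    exportfs -ra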

This made updating the systems pretty painless. I would update datastore02 first, reboot into the new kernel and verify that everything was working correctly. Then, I would migrate usershare over to datastore02 and update datastore01. After datastore01 came back up, I would migrate usershare back, and then repeat the process with users and datastore03.

We also had nightly rsync backups to backup01, which was running btrfs and which would create a snapshot after the backup finished. We implemented nightly backups after a ham-fisted idiot of a system administrator (who happens to sleep next to my wife every night) managed to corrupt our filesystem (and, coincidentally, come within a hair’s breadth of losing all of our data) back when we were still using btrfs. The problem with DRBD is that it replicates every write to the secondary drives immediately, which is great when you want network RAID, but bad when the corruption that you just did on the primary is immediately sent to the secondary. Oops. Anyhow, after we managed to recover from that disaster (with lots of prayer and a very timely patch from Josef Bacik), we decided that a nightly backup to a totally separate filesystem wouldn’t be a bad idea.
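The backup job itself is nothing fancy; conceptually it is an rsync followed by a read-only snapshot, something like this (paths are illustrative):

    # On backup01: pull the night's changes, then freeze them in a snapshot.
    rsync -aHAX --delete datastore01:/srv/usershare/ /backup/usershare/
    btrfs subvolume snapshot -r /backup/usershare \
        "/backup/snapshots/usershare-$(date +%F)"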

We also had two virtual hosts, virtserver01 and virtserver02. Our virtual machines’ hard drives were synced between the two using DRBD. We could stop a virtual machine on one host and start it on the other, but live migration didn’t work and backups were a nightly rsync to backup01.

So, after the switchover, our network looked something like this:

I was pretty happy with our setup, but our load problem popped up again. While things were better than before the switch, the load would still sometimes spike during breaks and immediately after school.

As I was asking myself what other system administrators do, it hit me that one of my problems was my obsession with self-correcting systems. More specifically, my obsession with automatic correction of a misbehaving server, rather than the more common case of automatically “correcting” misbehaving hard drives. Because of that, I had been ignoring NAS appliances, since none of them seemed to have anything that worked along the same lines as DRBD.

I started looking at FOSS NAS solutions and found NAS4Free, a FreeBSD-based appliance that ships with the latest open-source version of ZFS. The beauty of ZFS when it comes to speed is that, unlike btrfs, it allows you to set up an SSD as a read cache (L2ARC) or as the intent log (ZIL).

After running some tests over the summer, I found that ZFS with an SSD cache partition and an SSD log partition was quite a bit faster than our ext4 partitions with an SSD journal, especially with multiple systems hitting the server at the same time with lots of small writes.

So we switched our data servers over to NAS4Free, reduced them to two, and added another backup server. The data servers are configured with RAIDZ1 plus SSD caches and logs. The backup servers are configured with RAIDZ1, no cache, no SSD log.
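Building a pool like that is a one-liner. This is a rough sketch assuming four data disks and two partitions on the SSD; the FreeBSD device names are placeholders, not our real layout:

    # Data servers: RAIDZ1 over the spinning disks, SSD partitions for log and cache.
    zpool create tank raidz1 ada1 ada2 ada3 ada4 \
        log ada0p1 \
        cache ada0p2

    # Backup servers: the same RAIDZ1, just without the SSD devices.
    zpool create backup raidz1 ada1 ada2 ada3 ada4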

A nice feature of ZFS (which I believe btrfs also recently got) is the ability to send a diff between two snapshots from one server to another. Using this feature (which isn’t exposed in the NAS4Free web interface, so I drive it from a bash script that runs at 2:00 every morning), I’m able to send my backups to the backup servers in far less time than it used to take to run rsync.
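The heart of that script is just zfs send/receive with an incremental snapshot. Stripped down, it looks something like this; the pool and dataset names are placeholders:

    #!/bin/sh
    # Nightly incremental replication of tank/users to backup01 (illustrative names).
    TODAY=$(date +%F)
    YESTERDAY=$(date -v-1d +%F)    # FreeBSD-style date arithmetic

    zfs snapshot "tank/users@${TODAY}"
    # The very first run needs a full send (no -i); after that, incrementals only.
    zfs send -i "tank/users@${YESTERDAY}" "tank/users@${TODAY}" | \
        ssh backup01 zfs receive -F tank/users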

One other nice feature of NAS4Free is ZFS’s ability to create a “volume” (a zvol), which is basically a block device carved out of the data pool, and then export it over iSCSI. I switched our virtual machines’ hard drives from DRBD to iSCSI, which now allows us to live-migrate from one virtual host to the other. We also get the bonus of automatic backups of the ZFS volumes as part of the snapshot diffs.
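Creating one of those volumes is a single command; exporting it over iSCSI is then a matter of pointing the iSCSI target at the resulting device (in NAS4Free, that part happens in the web interface). The name and size below are made up:

    # Create a 40 GB zvol to act as a virtual machine's disk (illustrative names).
    zfs create -V 40G tank/vm/server01-disk0
    # The block device then appears under /dev/zvol/tank/vm/server01-disk0
    # and can be configured as an iSCSI extent.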

Now our network looks something like this:

There is one major annoyance and one major regression in our system, though. First, the annoyance. ZFS has no way of removing a device from a pool. You can swap out a drive in a RAIDZ or mirror set, but once you’ve added a vdev (a RAIDZ set, a mirror, or even a single drive), you cannot remove it without destroying the pool. Apparently enterprise users never want to shrink their storage. More on this in my next post.

The major regression is that if either of our data servers goes down, the whole network goes down until I get the server back up. I can switch us over to the backups, but we’ll be using yesterday’s data if I do, so that’s very much a last resort. This basically means that I need to be ready to swap the drives into a new system if one of our data servers does go down. And there will be downtime if (when?) that happens. Joy.

So now we have a system that gives us the speed we need, but not the redundancy I’d like. What I’d really like would be a filesystem that is fully distributed, has no single point of failure, and lets you store volumes on it. GlusterFS fits the bill (mostly), but I’m gun-shy at the moment. Ceph looks like it may fit the bill even better, with RBD as well as CephFS, but the filesystem part isn’t considered production-ready yet.

So where does that leave us? As we begin the 2012-2013 school year, file reads and writes are faster than ever. We’d need simultaneous failure of four hard drives before we start losing data, and, once I deploy our third backup server for high-priority data, it will take even more failures to lose anything. We do have a higher risk of downtime in the event of a server failure, but we’re not at the point where that downtime would keep us from our primary job, teaching.

GlusterFS Madness

Background

As mentioned in Btrfs on the server, we have been using btrfs as our primary filesystem for our servers for the last year and a half or so, and, for the most part, it’s been great. There have only been a few times that we’ve needed the snapshots that btrfs gives us for free, but when we did, we really needed them.

At the end of the last school year, we had a bit of a problem with the servers and came close to losing most of our shared data, despite using DRBD as a network mirror. In response to that, we set up a backup server which has the sole job of rsyncing the data from our primary servers nightly. The backup server is also using btrfs and doing nightly snapshots, so one of the major use-cases behind putting btrfs on our file servers has become redundant.

The one major problem we’ve had with our file servers is that, as the number of systems on the network has increased, our user data server hasn’t been able to handle the load. The configuration caching filesystem (CCFS) I wrote has helped, but even with CCFS, the server was regularly hitting a load of 10 during breaks and occasionally getting as high as 20.

Switching to GlusterFS

With all this in mind, I decided to do some experimenting with GlusterFS. While we had high load on our user data server, our local mirror and shared data servers both had consistently low loads, and I was hoping that GlusterFS would help me spread the load across the three servers.

The initial testing was very promising. When using GlusterFS on top of ext4 partitions with SSD journaling on just one server, the speed was just a bit below NFS over btrfs over DRBD. Given the distributed nature of GlusterFS, I expected that adding more servers would increase the speed roughly linearly.

So I went ahead and broke the DRBD mirroring for our eight 2TB drives and used the four secondary DRBD drives to set up a production GlusterFS volume. Our data was migrated over, and we used GlusterFS for a week without any problems. Last Friday, we declared the transition to GlusterFS a success, wiped the four remaining DRBD drives, and added them to the GlusterFS volume.
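For the record, the GlusterFS side of that was roughly the following; the hostnames and brick paths are made up, and the real bricks were spread across our three servers:

    # Create a distributed volume on the four freed DRBD drives (placeholder names).
    gluster volume create userdata \
        fs01:/bricks/d1 fs02:/bricks/d2 fs03:/bricks/d3 fs01:/bricks/d4
    gluster volume start userdata

    # A week later: add the four wiped drives as new bricks.
    gluster volume add-brick userdata \
        fs02:/bricks/d5 fs03:/bricks/d6 fs01:/bricks/d7 fs02:/bricks/d8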

I started the rebalance process for our GlusterFS volume Friday after school, and it continued to rebalance over the weekend and through Monday. On Monday night, one of the servers crashed. I went over to the school to power-cycle the server and, when it came back up, resumed the rebalance.
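Rebalancing after an add-brick is a separate step, and checking on it is what I was doing each day. Roughly:

    # Spread existing files onto the new bricks, then keep an eye on progress.
    gluster volume rebalance userdata start
    gluster volume rebalance userdata status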

Disaster!

Tuesday morning, when I checked on the server, I realized that, as a result of the crash, the rebalance wasn’t working the way it should. Files were being removed from the original drives but not being moved to the new drives, so we were losing files all over the place.

After an emergency meeting with the principal (who used to be the school’s sysadmin before becoming principal), we decided to ditch GlusterFS and go back to NFS over ext4 over DRBD. We copied over the files from the GlusterFS partitions and then filled in the gaps from our backup server. Twenty-four sleepless hours later, the user data was back up, and the shared data followed twenty-four sleepless hours after that.

Lessons learned

  1. Keep good backups. Our backups allowed us to restore almost all of the files that the GlusterFS rebalance had deleted. The only files lost were the ones created on Monday.
  2. Be conservative about what you put into production. I’m really not good at this. I like to try new things and to experiment with new ideas. The problem is that I can sometimes put things into production without enough testing, and this is one result.
  3. Have a fallback plan. In this case, our fallback was to wipe the servers and restore all the data from backup. It didn’t quite come to that, as we were able to recover most of the data off of GlusterFS, but we had a plan in case it did.
  4. Avoid GlusterFS. Okay, maybe this isn’t what I should have learned, but I’ve already had one bad experience with GlusterFS a couple of years ago where its performance just wasn’t up to scratch. For software that’s supposedly at a 3.x.x release, it still seems very beta-quality.

The irony of this whole experience is that by switching the server filesystems from btrfs to ext4 with SSD journals, the load on our user data server has dropped to below 1.0. If I’d just made that switch, I could have avoided two days of downtime and a few sleepless nights.

Nuclear explosion credit – Licorne by Pierre J. Used under the CC-BY-NC 2.0 license.