I hate NFS

On our network we have about 100 client computers, most of which are running Fedora 11.  We have two real servers running CentOS 5.4, using DRBD to keep the virtual machine data on the two real machines in sync and Red Hat’s cluster tools for starting and stopping the virtual machines.

We have five virtual machines running on the two real machines, but only one of them matters for this post: our fileserver.

Under our old configuration, /networld was mounted on one of the real servers, and then shared to our clients using NFS. Our virtual machine, fileserver, then mounted /networld over NFS and shared it using Samba for our few remaining Windows machines (obviously, a non-optimal solution).

Figure: Old configuration

There were a couple of drawbacks to this configuration:

  1. I had to turn a number of services on and off whenever the clustered storage service moved from storage-server01 to storage-server02.
  2. Samba refused to share an NFSv4-mounted /networld, and, when it was mounted using NFSv3, the locking daemon would crash at random intervals (I suspect a race condition, as it mainly happened when storage-server0x was under high load).

My solution was to pass the DRBD disks containing /networld to fileserver, and allow fileserver to share /networld using both NFS and Samba, which seemed a far less hacky solution.

Figure: Current configuration

I knew there would be a slight hit in performance, but since I’m using virtio to pass the hard drives to the virtual machine, I expected a maximum of 10-15% degradation.

Or not. I don’t have any hard numbers, but once we have a full class logging in, the system slows to a crawl. My guess would be that our Linux clients are running at one-half to one-third of the speed they had under our old configuration.
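
If I wanted hard numbers instead of a guess, something along these lines would be a start. This is only a rough sketch, not what I actually ran: the function name `time_write_read` and the file sizes are made up for illustration, and it would need to be pointed at a directory on the NFS mount (the example below just uses the local temp directory so it runs anywhere).

```python
import os
import tempfile
import time


def time_write_read(directory, size_mb=16, block_kb=64):
    """Write, then read back, a scratch file in `directory`.

    Returns (write_seconds, read_seconds). Hypothetical helper for
    comparing one mount against another; not a rigorous benchmark.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = (size_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # push the data out rather than leaving it buffered
        write_s = time.perf_counter() - start

        # Note: the read may be served from the client's cache, so the
        # write time is the more telling of the two numbers.
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(len(block)):
                pass
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)
    return write_s, read_s


if __name__ == "__main__":
    # Assumption: replace the temp directory with a directory on the mount under test.
    w, r = time_write_read(tempfile.gettempdir())
    print("write: %.3fs  read: %.3fs" % (w, r))
```

Running it once during an idle period and once while a class is logging in would at least turn "slows to a crawl" into two numbers that can be compared.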

The load values on fileserver sit at about 1 during idle times and get pumped all the way up to 20-40 during breaks and computer lessons.
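
Those load values came from watching the machine by eye. A small logger like the one below would give a record to compare before and after any tuning. Again, this is a sketch under my own assumptions: `sample_load` and `summarize` are names I made up, and the sample count and interval are arbitrary.

```python
import os
import time


def sample_load(samples=3, interval_s=1.0):
    """Return a list of (unix_time, 1-minute load average) pairs."""
    readings = []
    for i in range(samples):
        one_min, _five_min, _fifteen_min = os.getloadavg()
        readings.append((time.time(), one_min))
        if i < samples - 1:
            time.sleep(interval_s)
    return readings


def summarize(readings):
    """Return (min, mean, max) of the sampled load values."""
    loads = [load for _timestamp, load in readings]
    return min(loads), sum(loads) / len(loads), max(loads)


if __name__ == "__main__":
    low, mean, high = summarize(sample_load())
    print("load min=%.2f mean=%.2f max=%.2f" % (low, mean, high))
```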

So now I’m stuck. I really don’t want to go back to the old configuration, but I can’t leave the system as slow as it is. I’ve done some NFS tuning based on miscellaneous sites found via Google, and tomorrow will be the big test, but, to be honest, I’m not real hopeful.

(To top it off, I spent three hours Friday after school tracking down this bug after updating fileserver to CentOS 5.4 from 5.3. I’m almost ready to switch fileserver over to Fedora.)

Tyre Computers

I’ve just spent the last three days setting up LESSON (our web-based marking system) at our sister school in Tyre. In the process, I’ve learned several things about web applications:

  1. Make sure to get your database right at the beginning.  I’ve been working on LESSON on and off over the last five years, and there are some things that have been added onto it.  Most of the time, I’ve extended the database in a nice, clean way, but there were a few times where I didn’t.  I decided to fix these problems before doing a new deployment, and there have been several things I’ve had to fix as a result.
  2. Make sure your frontend is modular.  I seem to have a real problem with writing modular code.  I don’t know why it is, but my normal tendency is to write One Big Page.  In writing LESSON’s frontend, I do include things like the header and footer, but I have liberally copied and pasted similar pages rather than abstracting out common functions.  When I changed the database, it took 30 minutes to make the changes to the necessary tables and convert the data.  It took 5 hours to get the frontend back into a semi-working state.  I spent another few hours over the weekend on it.  And I will need to spend even more time over the coming weeks, mainly doing search and replace (with enough issues that much of it is manual).
  3. Don’t take shortcuts.  So, yeah, at some point our school needed a way of checking past days’ attendance.  I was obviously in a hurry to implement it and somehow decided that the fastest way was to create a “calendar” table.  This table contained every day from January 1, 2005 (shortly before LESSON was deployed in our school) to January 1, 2065.  When I saw this over the weekend, I almost threw myself over the balcony (five stories up).  I ended up fixing the one(!) page that used this table and running a DROP TABLE.
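
For what it’s worth, the calendar table in the third point was never needed, because a range of days can be generated on the fly. The sketch below shows the idea in Python; it is not LESSON’s actual code, and the helper name `days_between` is invented for the example.

```python
from datetime import date, timedelta


def days_between(start, end):
    """Yield every date from `start` to `end`, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


if __name__ == "__main__":
    # Example: the seven days ending today, as one might list for past attendance.
    today = date.today()
    for day in days_between(today - timedelta(days=6), today):
        print(day.isoformat())
```

Generating January 1, 2005 through January 1, 2065 this way gives 21,916 dates, which is how many rows that table was holding for no good reason.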

The good news is that everything seems to have come together with few glitches, and, along with their new Fedora 11 desktop roll-out, Tyre has a great new system.