One of the problems we’ve had to deal with on our servers is high load on the fileserver that holds the user directories. I haven’t worked out if it’s because we’re using standard workstation hardware for our servers, or if it’s a btrfs problem.
The strange thing is that the load will shoot up at random times when the network shouldn’t be that taxed, and then be fine when every computer in the school has someone logged into it.
Anyhow, we hit a point where the load on the server hit something like 60 and the workstations would lock up for sixty seconds (or more) while waiting for the NFS server to respond again. This seemed to happen most often when all of the students in the computer room opened Firefox at the same time.
In a fit of desperation, I threw together a Python FUSE filesystem that I have cunningly called the Config Caching Filesystem (or ccfs for short). The concept is simple: a user’s home directory at /netshare/users/[username] is essentially bind-mounted to /home/[username] using ccfs.
The thing that separates ccfs from a simple FUSE bind-mount is that every time a configuration file (one that starts with a “.”) is opened for writing, it is copied to a per-user cache directory in /tmp and opened there instead. When the user logs out, /home/[username] is unmounted, and all of the files in the cache are copied back to /netshare/users/[username] using rsync. Normal files are written directly to /netshare/users/[username], bypassing the cache.
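To give a flavor of the idea, here’s a stripped-down sketch of the path-resolution logic, written against the fusepy bindings. This is not the actual ccfs source; the cache layout and helper names are my own invention:

    # Sketch only: the real ccfs may differ; paths and helpers are illustrative.
    import os
    import shutil
    from fuse import FUSE, Operations

    class ConfigCachingFS(Operations):
        def __init__(self, backing, cache):
            self.backing = backing  # the NFS home, e.g. /netshare/users/jonathan
            self.cache = cache      # the local cache, e.g. /tmp/ccfs-jonathan

        def _is_config(self, path):
            # Dotfiles, or anything living under a dot-directory.
            return (os.path.basename(path).startswith('.')
                    or '/.' in os.path.dirname(path))

        def _real(self, path):
            rel = path.lstrip('/')
            if not self._is_config(path):
                # Normal files go straight to the NFS-backed directory.
                return os.path.join(self.backing, rel)
            cached = os.path.join(self.cache, rel)
            if not os.path.exists(cached):
                # First touch: pull the file over from the server.
                os.makedirs(os.path.dirname(cached), exist_ok=True)
                src = os.path.join(self.backing, rel)
                if os.path.exists(src):
                    shutil.copy2(src, cached)
            return cached

        # Every operation resolves through _real(); open and create are shown
        # here, and getattr, read, write, etc. would delegate the same way.
        def open(self, path, flags):
            return os.open(self._real(path), flags)

        def create(self, path, mode, fi=None):
            return os.open(self._real(path),
                           os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)

    if __name__ == '__main__':
        import sys
        user = sys.argv[1]
        FUSE(ConfigCachingFS('/netshare/users/' + user, '/tmp/ccfs-' + user),
             '/home/' + user, foreground=True)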
Now the only time the server is being written to is when someone actually saves a file or when they log out. The load on the server rarely goes above five, and even then it’s only when everyone is logging out simultaneously, and the server recovers quickly.
A few bugs have cropped up, but I think I’ve got the main ones. The biggest was that some students were resetting their desktops when the system didn’t log out quickly enough and were getting corrupted configuration directories, mainly for Firefox. I fixed that by using --delay-updates with rsync, so you either get the fully updated configuration files or you’re left with the ones that were there when you logged in.
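The logout sync itself boils down to a single rsync call, something like this (paths illustrative):

    rsync -a --delay-updates /tmp/ccfs-jonathan/ /netshare/users/jonathan/

With --delay-updates, rsync stages every transferred file in a temporary location on the receiving side and only moves the whole batch into place at the end of the transfer, so a sync that gets killed partway through leaves the old files intact instead of a half-updated profile.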
I do think this solution is a bit hacky, but it’s had a great effect on the responsiveness of our workstations, so I’ll just have to live with it.
Ccfs is available here for those interested, but if it breaks, you get to keep both pieces.
One of my fears when I set up the network in Tyre last year was that I would be called out for emergency repair trips. It’s an hour and a quarter each way on a good day, double that if you hit the traffic wrong. And, for those who don’t know Lebanese traffic, hitting it wrong often involves an unhealthy rise in blood pressure.
Anyhow, I had mentally prepared for, at worst, one callout a month. Twelve months later, not one single callout. No emergencies. No “we need you here now” phone calls. The few times there were problems, I’d talk Dave (their resident computer expert) through them over the phone or get him to set up a reverse ssh tunnel so I could fix them from here.
Last week, that twelve-month streak was finally broken. It started off with a phone call.
“Jonathan, none of our computers can get on the web. I can ssh with no problems, IMAP and POP3 work fine, but web pages only load sporadically, if at all.”
I talked Dave through checking the school’s squid proxy and then checked what happened when they bypassed their proxy. Still nothing.
“Ok, Dave, it’s obviously a problem with your ISP. Call them up and get them to fix it.”
The next day, Dave calls me again.
“The guy from the ISP was just here. He had no problems at all until he put his laptop behind the proxy. So he says it’s the proxy.”
Ok, that’s reasonable enough. Just to test, I have Dave bypass the proxy with his laptop (running Ubuntu), and, sure enough, the web works fine. For a couple of minutes. And then, again, nothing.
“Dave, if we’re bypassing the proxy, and you’re still not getting any web pages, it must be the ISP. Here’s what we’re going to do. We’re going to completely shut the proxy down and bypass it for everyone. That’s not going to fix the problem, but at least they can’t blame the proxy.”
The next day, I get a call again. “Jonathan, the technician came, and it’s definitely not them. He connected his laptop straight to the ISP using PPPoE, bypassing the router, and everything worked. He then went through the router, and, again, everything worked. He browsed for 15 minutes, with no problems at all. And here’s the crazy thing. All of the Macs and Windows machines are working fine. It’s only the Linux machines that aren’t working.”
Well, that sucks. The school runs Fedora on all of its desktops, the servers run CentOS, and Dave runs Ubuntu on his computer. And none of them can access the web.
At this point, I’m out of ideas, so I get in my car and head on down to Tyre. Of course, Dave has a meeting up here in Beirut, but he clears everything with the school secretary, and I’m given access to the router.
The first thing I do is plug my laptop into the network and start browsing the web. Five minutes later, when Google has still failed to load, I finally accept that, yes, there is actually a problem browsing the web.
My next step is to try swapping in another router. Even after setting the username, password, and MAC address, the new router just won’t connect. I remember what Dave said about the technician plugging straight into the Internet ethernet cable and making the connection using PPPoE. So I plug my laptop straight into the cable, set up PPPoE in NetworkManager (which is insanely easy), and, boom, I’m in, bypassing the router.
I check my emails (using Evolution, connecting over IMAP). Looks great. I open Google. Not so great. I then test a Windows computer that’s sitting on the desk. Instant web access.
At this point, a bulb finally lights in my brain. Most of the ISPs in this country use transparent caching proxies, as bandwidth is expensive for them too. Could this have to do with their ISP’s proxy?
I set up my computer to use our server in the States as a proxy. All of a sudden, my web access is working perfectly. It’s the ISP’s proxy. There’s obviously something wrong with how it’s parsing any requests that come from Linux computers.
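(If you ever need to run the same test, the quickest way on a Linux box is usually the proxy environment variables; the hostname and port below are placeholders, not our actual server:)

    export http_proxy=http://server-in-the-states.example.com:3128
    wget -O /dev/null http://www.google.com/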
I then realize that the Mac and Windows computers started working after we shut down the school’s proxy… which was running under Linux. Ouch.
When Dave returns from Beirut, we sit down and talk through the problem. The first step is for me to turn the school proxy back on, and set it to use the US server as a parent proxy. Now, all web traffic is getting routed through the US server, which may not be efficient, but at least works. The next step is for the school to switch ISPs, and we’re still waiting on that process to finish.
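For the record, pointing squid at a parent proxy takes two lines of squid.conf; the hostname and port here are placeholders for the real US server:

    cache_peer proxy-us.example.com parent 3128 0 no-query default
    never_direct allow all

The never_direct rule forces squid to route every request through the parent rather than fetching pages itself, which is the whole point while the ISP’s transparent proxy is mangling Linux traffic.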
As for me, I’m still a bit shell-shocked. It’s 2010, and an ISP is running a transparent proxy solution that doesn’t work with Linux? My best guess is that we’re looking at some weirdness in how it’s parsing TCP packets… but how?
If anyone ever works out what the explanation is, I’d sure love to hear it.
Update (10/02/2010): A big thank you to all who offered suggestions in the comments. We went down to Tyre for a visit today, and while we were down there, I switched the school’s proxy back to a direct connection to the web so I could test some of the suggestions. Of course, the web started working correctly immediately. Obviously the ISP fixed whatever it was that they broke (which is good), but they haven’t explained what went wrong to the school (which isn’t so good).
Anyhow, if I come up against this again, I’ll at least have some things to try. Thanks again.