Solving the mystery of the disappearing Bluetooth device

This is a true[1] story.

One of the features my laptop comes with is Bluetooth, which I’ve found to be quite handy considering all the highly important uses I have for it (using Bluetooth tethering on my phone when traveling, controlling my presentations with my phone, using a Wii-mote for playing SuperTuxKart, er, I mean, using a portable Bluetooth controller with a built-in accelerometer to analyze the consistency of the matrices used when rendering three-dimensional objects onto a two-dimensional field).

About three months ago, I started to run into problems. Not the easy kind of problem where “BUG: unable to handle kernel paging request at 0000ffffd15ea5e” brings the laptop to an abrupt stop, but instead the kind of problem that causes real trouble.

My Bluetooth module starts to randomly reset itself. I’ll be working merrily, trying to connect my phone or the… portable Bluetooth controller… and, halfway through the process, it will hang. Kernel logs show that the Bluetooth module has been unplugged from the USB bus and then reconnected. Which, when you think about it, makes a whole lot of sense, given that the Bluetooth module is built into the WiFi card which is screwed onto the motherboard.

When faced with kernel logs that boggle the mind, the most logical thing to do is downgrade the kernel. I know that I was able to successfully… analyze the matrices used for, oh, whatever it was… back at the beginning of June, which means I had working Bluetooth on June 1. Let’s see what kernel was latest then, download and install it, boot from it, and…

kernel: usb 8-4: USB disconnect, device number 3
kernel: usb 8-4: new full-speed USB device number 4 using ohci-pci

#$@&%*!

Ok, the hardware must be dying.  Stupid Atheros card.  No idea why it’s just the Bluetooth and not the WiFi as well, but we’re in Ireland and I’m on eBay, so I’ll just order another one.  Made by a different company.  A week later, a slightly used Ralink combo card shows up. I plug it in, fire her up, and…

kernel: usb 8-4: USB disconnect, device number 3
kernel: ohci-pci 0000:00:13.0: HC died; cleaning up
kernel: ohci-pci 0000:00:13.0: frame counter not updating; disabled

Double #$@&%*! Now the Bluetooth module is completely gone and the only way to get it back is to reboot. Grrrrr.

At this point I’ve got a hammer in my hand, my laptop in front of me, and the only thing keeping me from submitting a video for a new OnePlus One is my wife warning me that we’re not going to be buying me a new laptop any time this decade.

So I take a deep breath, calmly return the hammer to the toolbox (no, dear, I have no idea how that dent got on the toolbox), and decide to instead go down the road less traveled. I open up Fedora’s bugzilla and start preparing my bug report, taking special care to only use words that I’d be willing to say in front of my children. “…so the Bluetooth module keeps getting disconnected. It’s almost like the USB bus is cutting its power for some stupid…”

Wait a minute! Just before we traveled to Ireland, I remember experimenting with PowerTOP. And PowerTOP has this cool feature that allows you to automatically enable all of its power-saving options on boot. And I might have enabled it. So I check, and, yes, I have turned on autosuspend for my Bluetooth module. I turn it off, try to connect my… portable Bluetooth controller… and it works, first time. I do some… matrix analysis… with it and everything continues to work perfectly.
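For anyone curious what that setting actually looks like under the hood, here’s a minimal sketch of the check, assuming the standard Linux sysfs layout where each USB device exposes power/control and “auto” means autosuspend is enabled while “on” disables it. The --fix flag and the script itself are purely illustrative (I just used PowerTOP), and writing to sysfs needs root:

#!/usr/bin/env python3
"""Sketch: list USB devices with runtime autosuspend enabled ('auto') and,
with --fix, switch them back to 'on' (autosuspend disabled).

Assumes the standard Linux sysfs layout under /sys/bus/usb/devices/;
writing to power/control requires root."""
import sys
from pathlib import Path

USB_DEVICES = Path("/sys/bus/usb/devices")

def scan(fix=False):
    for dev in sorted(USB_DEVICES.iterdir()):
        control = dev / "power" / "control"
        if not control.is_file():
            continue
        product = dev / "product"
        name = product.read_text().strip() if product.is_file() else "(unknown)"
        setting = control.read_text().strip()
        print(f"{dev.name:12s} {setting:5s} {name}")
        if fix and setting == "auto":
            # 'on' tells the kernel never to autosuspend this device
            control.write_text("on")
            print(f"  -> autosuspend disabled for {dev.name}")

if __name__ == "__main__":
    scan(fix="--fix" in sys.argv)

The catch is that the sysfs setting doesn’t survive a reboot, so the real fix is to stop whatever flipped it to “auto” in the first place (in my case, PowerTOP’s enable-everything-on-boot option).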

So I am an idiot. I close the page with the half-finished bug report and go to admit to my wife that I just wasted €20 on a WiFi card that I didn’t really need.  And, uh, if any Atheros or Ralink people read this, well, I’m sorry for any negative thoughts I may have had about your WiFi cards.

[1] Well, mostly true, anyway. Some of the details might be mildly exaggerated.

Managing the unmanageable

Uphill battle

When I first started working at LES, sometime in the last century, the computers were networked together using some high-tech gizmos called “hubs”. These hubs would reach a maximum speed of 10Mbps on a good day, if there were only two devices connected and the solar flares were at a minimum.

Time marched on and we upgraded to 10/100Mbps hubs, then 10/100Mbps switches, and then, finally, in the last few years, to unmanaged gigabit switches. One of the biggest problems with unmanaged switches is that the network can be brought to a standstill with nothing more than a patch cable plugged into two network sockets. I’ve become pretty adept at recognizing the signs of a network switching loop (the lights on the switches are flickering like the last few seconds on the timer in Mission Impossible, the servers are inaccessible, the teachers are waiting outside my office with baseball bats). One of our network loop disasters, er, hiccups even managed to anonymously make it onto a site dedicated to technology-related problems.

Over the last month, though, I had lots of small problems that never quite reached the level of crashing the network. Our Fedora systems, connected to the server via NFS, would occasionally freeze for a few seconds and then start working again. Our accountants, who run Windows, complained that their connection to the server was dropping a couple of times each day, causing their accounting software to crash. And pinging any server would show a loss of ten to fifteen packets every ten minutes or so.
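Putting numbers on that kind of intermittent loss is easier than staring at ping output scrolling by. What follows is only a sketch of the sort of monitoring I mean, not the exact thing I ran: it shells out to the ordinary iputils ping (-c 1 sends a single probe, -W 1 waits at most a second for a reply), and the target address is a placeholder.

#!/usr/bin/env python3
"""Sketch: log per-minute packet loss to a server so intermittent drops
(say, ten to fifteen lost packets every ten minutes) show up as a pattern.

The target address is a placeholder; point it at a host on your own network."""
import subprocess
import time
from datetime import datetime

TARGET = "192.168.1.10"   # placeholder: one of our servers

def probe():
    """Send one ping and report whether we got a reply."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", TARGET],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def main():
    sent = lost = 0
    minute = datetime.now().strftime("%H:%M")
    while True:
        now = datetime.now().strftime("%H:%M")
        if now != minute:
            # one summary line per minute makes the bad intervals obvious
            print(f"{minute}  sent={sent}  lost={lost}", flush=True)
            minute, sent, lost = now, 0, 0
        sent += 1
        if not probe():
            lost += 1
        time.sleep(1)

if __name__ == "__main__":
    main()

Left running on a couple of machines in different parts of the building, the per-minute counts make it easy to see whether the drops line up across the whole network or only on one leg of it.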

I checked our switches for the flicker of death and came up dry. I tried pushing packets from a server on one side of the school to a server on the other side and consistently hit 1Gbps. In desperation, I re-terminated the Cat6 cable connecting the switches that form the backbone of our network. All to no avail. I decided to wait until evening and then unplug the switches one at a time until I found the culprit. By the time evening came, the problem had disappeared on its own.

The next morning it was back. I had two options. Disconnect the switches one port at a time in the middle of the school day, while teachers, students and accountants are all trying to use the system. Or put in a request for some managed switches and see if they could help us figure out what the heck was going on. Hundreds of irritated users outside my door… or new kit. It was a hard call, but I went for the new kit.

We started with an eight-port MikroTik switch/router, and, after I tested it for a day, we quickly grabbed a couple more 24-port MikroTik switches (most of our backbone locations have nine or ten ports that need to be connected, and MikroTik switches come in either 8-port or 24-port models).

Once the three core locations were outfitted with managed switches, I quickly started getting messages pointing to a potential network loop on a link to one of our unmanaged leaf switches in the computer room. That switch was in turn connected to another unmanaged five-port switch which had apparently had a bad day and decided to start forwarding packets back through itself.

I replaced the five-port switch with a TP-Link five-port router running OpenWRT and, just like that, everything was back to normal.

I am never going back to unmanaged switches again. Having managed switches as our network’s backbone reduced the time to find the problem by a factor of 10 to 20, and, if we’d had managed switches all the way through the network from the beginning, we could have zeroed in directly on the bad switch rather than spending weeks trying to work out what the problem was.

So now we’re back to a nice quiet network where packet storms are but a distant nightmare. Knock on wood.

First Work: Myth of Sisyphus detail #1 by AbominableDante, used under a CC BY-NC-ND license