Duct tape in the datacenter

When I was growing up, on Thursday nights we would watch the Red Green Show, a comedy show set in very rural Canada. One of the regular segments on the show was the Handyman Corner, where Red Green would build something impressive out of the spare parts and garbage he had sitting around the shop, along with copious amounts of “the Handyman’s secret weapon,” duct tape, to hold everything together. To this day, whenever I see duct tape I think about the Red Green Show.

Lately I’ve had cause to think about duct tape when looking into IT infrastructure issues, some that I’ve had to handle at work, and some that we’ve all gotten to see from the outside. I don’t think I’ve ever actually used physical duct tape on the job, but there’s more than one kind of duct tape.

The thing about duct tape is that it is an incredibly useful tool for holding things together that perhaps were never intended to be connected (the Apollo 13 duct tape and cardboard scene comes to mind here). The problem is when duct tape is used as a core part of the design (as per most of Red Green’s builds).

When working with infrastructure, we may not normally be using physical duct tape, but there are plenty of things we do that correspond with duct tape. At my day job, our infrastructure is built using a GitOps model where server deployments are managed using Ansible playbooks and, more recently, Terraform configuration, all stored in git repositories. We also have development environments where we can test infrastructure and code changes before pushing to production.

The primary advantages of using the GitOps model are documentation and repeatability. Documentation in git comes pretty easily because when changes are made, it’s easy to see what was changed, and (assuming we’ve been disciplined when writing commit messages) why it was changed. Repeatability is there because, if we deploy one server with a specific configuration and need to re-deploy it at a later point, the second deployment will be identical to the first. We don’t need to worry about missing steps because it’s all automatically done when deploying the playbooks.

So what happens when we bypass our processes? When we manually deploy a server? When we manually make a change? Well, that’s our duct tape. It’s sometimes very necessary. If an important service has crashed, the first priority is to get it back up and running, not have a committee discussion. The problem is when duct tape becomes permanent.

We’ve seen far too much of this in the news recently, the main case in point being Twitter. It’s unclear whether the latest outage is due to bypassed processes or whether the infrastructure was brittle to begin with (or, most likely, some combination of both), but, either way, the systems there are breaking down, and it just seems to be getting worse.

So how do you deal with duct tape in the datacenter? Here are some thoughts:

  1. Design your infrastructure to be extensible and your services to be highly available. One of the biggest causes of duct tape maintenance is brittle infrastructure. When there’s an outage, repairs are of the highest urgency, and fixing it now is more important than fixing it correctly. Unfortunately, emergency fixes are like layers of duct tape applied on top of each other, weakening the overall structure.

    Far better to have an infrastructure design where failures have no negative effect beyond letting your team know that something needs to be fixed. Without the urgency, fixes can be planned and implemented correctly. Upper management is happy because customers aren’t affected. You’re happy because the problem is being fixed properly and permanently.

  2. Be disciplined about your processes. When there are urgent tasks or problems, there’s the temptation to skip steps in the process and just apply a bit of duct tape to fix the problem. Sometimes you won’t have a choice, but, if you can avoid it, don’t take the shortcut.

  3. When you have to come up with quick and dirty hacks, ensure that they are done with an eye towards the long-term solution. While the ideal is to always design things properly from the beginning, in the real world, we sometimes have to get something done right now, without worrying about how correct it is.

    The common mistake is to build a quick solution without thinking about how you’ll replace it later. Once the urgency is over, you quickly realize that a proper fix will involve completely changing how your solution works.

    Far better to take a few minutes (or more if possible) to think through what you would want the long term solution to look like, and then assess how best to build the quick solution in a way that the duct tape can be easily replaced later.

  4. Document your duct tape. If you absolutely must make manual changes or break processes, document it. You may not think it’s necessary, but documentation ensures that the duct tape doesn’t get forgotten. There’s nothing worse than trying to fix a massive disaster and realizing that there’s some duct tape hanging around and you don’t know what it’s supposed to be attached to.

    I have been in a couple of situations recently where processes were bypassed (or never created in the first place) when creating a service, and the person responsible for the service has no idea how to manage it. We’ve ended up needing to spend tens of hours reverse engineering the service configuration, compared to a couple of minutes of documentation.
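As a rough illustration of how cheap "a couple of minutes of documentation" can be, here is a minimal sketch of a shell helper that appends a one-line record of a manual change to a log file. The function name log_duct_tape, the DUCT_TAPE_LOG variable and the default file name are all hypothetical; this is not a description of any tooling we actually use, and a shared wiki page or a ticket would do the same job.

```shell
#!/usr/bin/env bash
# Hypothetical helper: record a manual change ("duct tape") so it is not forgotten.
# DUCT_TAPE_LOG is an assumed location; point it at wherever your team keeps notes.
DUCT_TAPE_LOG="${DUCT_TAPE_LOG:-./duct-tape.log}"

log_duct_tape() {
  # Require both a description of the change and the reason for it.
  if [ "$#" -lt 2 ]; then
    echo "usage: log_duct_tape <what-was-changed> <why>" >&2
    return 1
  fi
  # One tab-separated line: UTC timestamp, user@host, what, why.
  printf '%s\t%s@%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$(id -un)" "$(uname -n)" "$1" "$2" >> "$DUCT_TAPE_LOG"
}

# Example:
#   log_duct_tape "edited the service config by hand" "playbook run was blocked during the outage"
```

The point of keeping it to one line per change is that it is quick enough to do even in the middle of an outage, and the log can later be used as a checklist of things to fold back into the playbooks.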

Like duct tape, the ability to make manual infrastructure changes is a useful, even vital, tool. Don’t throw it away just because it can be abused. The key is to ensure that it’s only used when absolutely necessary and that you know where your duct tape is.

Photo Polyken Duct Tape by Markbritton used under the CC BY-SA 3.0 license

PinePhone call quality

I bought a PinePhone a couple of months ago, mainly for the opportunity to play with it and see what it could do. Since I work for a company that does voice call quality testing, the first thing I wanted to do was figure out how to play back our audio quality test files and then do some call audio quality tests.

When doing call audio quality testing, it’s very important that the audio you’re playing back not be transcoded or even resampled, so using PulseAudio or PipeWire was completely out of the question. Figuring this process out turned out to be far harder than I expected, so I thought I should document how I made it work. My background is in IT infrastructure, and I am not an audio engineer or a telephony expert, so I wouldn’t be surprised if there are better ways of doing things. Suggestions are very welcome on Twitter or via email.

How to play audio over a call

  1. Insert your SIM into the phone. Then, set up Fedora on a microSD card and boot it
  2. Connect to the network either using wifi or a dock
  3. Make sure to enable sshd.service if it’s not already enabled. Then SSH into the PinePhone (default username is pine and default password is 123456), and become root (sudo su)
  4. Install asterisk: dnf -y install asterisk
  5. Build and install the Quectel asterisk module. To build the channel module, you’ll need to download the source RPM for the asterisk package installed in step 4, build it (at least to the point where it’s generated the include files), and then pass the build directory to ./configure --with-asterisk=...
  6. Copy quectel.conf to /etc/asterisk. The defaults should suffice
  7. Stop phosh.service and ModemManager.service
  8. Set up an extension in /etc/asterisk/extensions.conf with the name [incoming-mobile]. This is what any incoming calls will be directed to. For testing, I suggest renaming the [demo] section to [incoming-mobile]
  9. Add the asterisk user to the audio and dialout groups
  10. Set the audio mode to Voice call in ALSA by running alsaucm -c PinePhone set _verb 'Voice Call'. (Rather embarrassingly, this step took me days to figure out!)
  11. In the playback tab of alsamixer, make sure that Line Out is unmuted (even if the volume is at 0%), along with AIF1 Slot 0 Digital DAC and AIF2 Digital DAC
  12. In the recording tab of alsamixer, disable recording for AIF1 Data Digital ADC and AIF2 ADC Mixer ADC. Then, enable recording for AIF2 ADC Mixer AIF1 DA0
  13. Start asterisk.service
  14. Run asterisk -vvvcgr
  15. Call your PinePhone’s number from another phone. On the other phone, you should hear the Asterisk demo (or whatever other audio you configured to play), and the Asterisk console on the PinePhone should show the incoming call
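The command-line parts of steps 4 to 14 can be collected into a small script. This is only a sketch under a few assumptions: it is run as root on the PinePhone, quectel.conf is in the current directory (the source path is my assumption), and the helper names run and prepare_call_audio are made up for this example. Building the channel module (step 5), editing extensions.conf (step 8) and the alsamixer switches (steps 11 and 12) are left as manual steps. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
# Sketch of the scriptable steps from the list above.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

prepare_call_audio() {
  run dnf -y install asterisk                            # step 4
  # Step 5 (building the Quectel channel module) is not scripted here.
  run cp quectel.conf /etc/asterisk/                     # step 6 (source path assumed)
  run systemctl stop phosh.service ModemManager.service  # step 7
  # Step 8: edit /etc/asterisk/extensions.conf by hand.
  run usermod -aG audio,dialout asterisk                 # step 9
  run alsaucm -c PinePhone set _verb 'Voice Call'        # step 10
  # Steps 11-12: set the switches in alsamixer before placing a call.
  run systemctl start asterisk.service                   # step 13
  run asterisk -vvvcgr                                   # step 14
}

prepare_call_audio
```

Note that the dry-run output drops the quotes around 'Voice Call'; when the commands are really executed, the argument is passed through as a single word.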

How the PinePhone modem routes audio

(short answer: Magic!)

Figuring out steps 10-12 above was a great example of 10% of the work taking 90% of the time. This section explains what I think is going on here, but a lot of it is guesswork. Unless you’re interested in the inner workings of the PinePhone and its modem, feel free to skip this section.

It didn’t take long for me to figure out how to build Asterisk and the Quectel channel module, and when I placed my first call after starting up Asterisk, I was plenty excited when the call went through and Asterisk picked up… and then silence! Roughly 30 seconds of silence (oddly enough, the length of the file I was playing), and then Asterisk hung up.

It became clear to me that Asterisk thought it was playing audio, even though the audio wasn’t actually making it to my other phone. Thus started days of experimentation, trying to figure out the problem.

The PinePhone has a Quectel EG25-G LTE modem in it that, by default, provides a number of USB serial devices, one of which (supposedly) is where we’re supposed to send and receive audio.

After reading through loads of scattered documentation around the internet (fun fact: the Quectel EC25 modem has some very different features, so you can’t just assume its manual applies to the EG25-G), it became more and more clear that, for voice calls, the PinePhone’s modem is tied into the PinePhone’s DSP, which is exposed in Linux through ALSA. AIF2 is the modem’s voice channel while AIF1 is the PinePhone’s normal audio DAC.

It took lots of experimentation to work out that, to get audio from the USB serial device to actually get sent through the modem, you need to route AIF1’s output audio into AIF2 (which is what we’re doing in step 12). You also need to enable the AIF1 and AIF2 DACs (step 11), and make sure that Line Out is unmuted.
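I set these switches interactively in alsamixer, but the same routing could probably be scripted with amixer. The sketch below is untested guesswork in that respect: it assumes the ALSA card id is PinePhone (the same name alsaucm uses), and that each control named in steps 11 and 12 accepts amixer's unmute, cap and nocap keywords. The names CARD, run and route_call_audio are made up for this example. It prints the commands by default; set DRY_RUN=0 to run them.

```shell
#!/usr/bin/env bash
# Sketch: non-interactive version of the alsamixer switches from steps 11 and 12.
CARD="${CARD:-PinePhone}"   # assumption: ALSA card id matches the UCM card name
DRY_RUN="${DRY_RUN:-1}"     # print commands by default instead of running them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

route_call_audio() {
  # Step 11: unmute Line Out and the AIF1/AIF2 DACs (volume level is irrelevant).
  run amixer -c "$CARD" sset 'Line Out' unmute
  run amixer -c "$CARD" sset 'AIF1 Slot 0 Digital DAC' unmute
  run amixer -c "$CARD" sset 'AIF2 Digital DAC' unmute
  # Step 12: stop capturing from the ADC paths, then route AIF1's output into AIF2.
  run amixer -c "$CARD" sset 'AIF1 Data Digital ADC' nocap
  run amixer -c "$CARD" sset 'AIF2 ADC Mixer ADC' nocap
  run amixer -c "$CARD" sset 'AIF2 ADC Mixer AIF1 DA0' cap
}

route_call_audio
```

If a control name is rejected, running amixer -c PinePhone scontrols lists the simple control names the card actually exposes.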

The strange thing is that, even though we’re using ALSA to enable a number of different switches, none of the volume levels have any effect on the volume sent through the modem. I suspect that this is because Asterisk is already sending the exact audio data without any processing.

One limitation of our current system is that it only transmits audio at 8 kHz, which means we haven’t yet been able to test VoLTE. From what I can see, the EG25-G supports VoLTE, but I’ll need to do more research to figure out how to turn it on.

Call quality

So, now that we can play audio through Asterisk, what does the actual call quality look like?

At Spearline, we commonly use PESQ to test a call’s audio quality, and we’ve got a great summary of what PESQ is and what the scores actually mean here. We ran two sets of tests on the PinePhone, one when the phone was in Spearline’s headquarters, and the other when it was in my home (a couple of kilometers away from the office).

This first chart shows the PESQ scores when the phone was at Spearline’s headquarters.

The scores vary from 3.51 to 4.01 with an average of 3.78, which puts the average at the top end of the second-highest band of PESQ scores (3.30-3.79: “Attention necessary; no appreciable effort required”). In other words, a call placed to the PinePhone while in Spearline’s headquarters would sound good, but you would have to put in a very small amount of effort to pay attention.

This second chart shows the PESQ score when the phone was moved to my home.

Here the scores were higher and more consistent, ranging from 3.88 to 4.02 with an average of 3.96, which puts them in the highest band of PESQ scores (3.80-4.50: “Complete relaxation possible; no effort required”). In other words, a call to the PinePhone at my home would be one where the conversation could be completely relaxed, with no strain to understand what the other person is saying.

This shows that, as expected, call quality will depend on the tower you’re connected to and the distance to the tower. However, both sets of scores are high enough to make it clear that the PinePhone is fully capable of making high quality audio calls.
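The band lookups above are simple enough to sketch as a shell function. It only knows the two bands quoted in this post, so anything outside 3.30-4.50 is reported as not covered rather than guessed at. The function name pesq_band is made up for this example, and treating scores between 3.79 and 3.80 as part of the lower band is my assumption.

```shell
#!/usr/bin/env bash
# Map a PESQ score to one of the two bands quoted in this post.
# Assumption: a score between 3.79 and 3.80 counts as the lower band.
pesq_band() {
  awk -v s="$1" 'BEGIN {
    s = s + 0   # force numeric comparison
    if (s >= 3.80 && s <= 4.50) {
      print "Complete relaxation possible; no effort required"
    } else if (s >= 3.30 && s < 3.80) {
      print "Attention necessary; no appreciable effort required"
    } else {
      print "band not quoted in this post"
    }
  }'
}

pesq_band 3.78   # average at Spearline's headquarters
pesq_band 3.96   # average at my home
```

Run as-is, the first call prints the "Attention necessary" band and the second prints the "Complete relaxation possible" band, matching the interpretation of the two averages.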

Photo Pinephone betaedition by ICCCC used under the CC0 license