System Administrators Might Save Your Life

There’s a lot of responsibility on the shoulders of system administrators. Unless you’re one, it’s probably hard to grasp the full weight of it. This week has been full of reminders for me.

This afternoon, shortly after lunch, I got word that people were having trouble with phones. A few minutes of testing showed that calls within our city were working fine, but it was completely impossible to place or receive long-distance calls. A little while later, a local newspaper’s website indicated that numerous cities over a multi-county region, all served by the CenturyLink local telephone company, were out of service in this manner. I figure that represents hundreds of square miles, or a patch of rural Kansas roughly the size of Los Angeles.

Adding to the problem was the fact that emergency 911 services are accessed via these long-distance lines that were down. For roughly three hours, 911 was completely down across this entire area. What’s more, many cellphone towers and Internet access options were also taken down, since they feed from these same lines.

This adds up to a situation that could very easily cost lives due to delayed response of emergency medical, fire, or police services.

In the end, the problem was traced to “a bad controller card on a Titan 5500 owned by AT&T.”

Now, here’s the system administration angle: If you worked for the phone company and had to troubleshoot a problem that you knew had taken down emergency services for thousands of people, what kind of pressure would you be feeling? Would you be able to keep your cool? I’m glad that I don’t have that kind of job.

I also wouldn’t like to be the engineer (or, more likely, accountant) who decided that they didn’t need any more redundancy to provide good service to the area, especially considering this is the second time in the last year or two this has happened.

But, ironically, yesterday I signed a purchase order for a new Asterisk PBX (corporate phone) server. When selecting a machine for that task, I am always completely conscious of the responsibility on my shoulders: several hundred employees rely on the machine that is ultimately my responsibility to select. Our own access to 911 would be cut if the machine were to go down. I never forget that the correct operation of the systems that our team sets up and deploys could help save someone’s life, and that a malfunction could cost the company dearly in terms of revenue, productivity, image — or worse.

Nearly four years ago, we switched from an analog PBX, with outsourced support, to a digital VOIP system running Asterisk. Note that we use VOIP in-house, but do not use it externally. Anyhow, I cannot say that the Asterisk PBX has been 100% perfect; I doubt that this could honestly be said of any PBX of any complexity.

I can say, though, that it saved us well over $100,000 AND has proven far more reliable than the system it replaced. Outages are exceptionally rare and brief now. Plus we have internal expertise to fix it, rather than having to wait for a technician to be dispatched from a city 2 hours away to fix anything. I know I don’t have the resources to build a perfect PBX that will never go down (if such a thing is even possible), but I take my responsibility regarding a reliable PBX extremely seriously.

We used to have a frequent problem: someone would call 911, then hang up. We suspected this was often by accident; maybe people hit 9 for an outside line, then misdialed their number. In any case, 911 dispatch would then call our main office, saying they got a hangup. A person or team would then go across our entire campus making sure nobody was in distress: that nobody had managed to dial and then passed out, for instance.

With Asterisk, I was able to help this situation. Whenever somebody calls 911 now, two emails are generated. The first contains details about the call, such as the extension number that called and the duration of the call. This goes to everyone who is likely to receive a callback from 911. It may not always pinpoint the source (as with somebody using a wireless phone), but it almost always gives us a very good idea where the call came from. The second email contains a recording of the call and serves as an additional clue, but goes to fewer people.
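To make the two-email idea concrete, here is a minimal sketch in Python of how such messages might be assembled. It is not the actual setup described in this post: the addresses, the function names, and the WAV format for the recording are all assumptions for illustration, and how the PBX would invoke this when 911 is dialed is left out entirely.

```python
from email.message import EmailMessage

# Hypothetical addresses; the post does not name the real recipients.
SENDER = "pbx@example.com"
CALLBACK_STAFF = ["frontdesk@example.com", "security@example.com"]
RECORDING_STAFF = ["security@example.com"]  # deliberately a smaller group


def build_detail_email(extension: str, duration_seconds: int) -> EmailMessage:
    """First email: which extension dialed 911 and for how long."""
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = ", ".join(CALLBACK_STAFF)
    msg["Subject"] = f"911 call from extension {extension}"
    msg.set_content(
        f"Extension: {extension}\n"
        f"Duration: {duration_seconds} seconds\n"
    )
    return msg


def build_recording_email(extension: str, recording: bytes) -> EmailMessage:
    """Second email: the call recording, sent to fewer people."""
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = ", ".join(RECORDING_STAFF)
    msg["Subject"] = f"911 call recording, extension {extension}"
    msg.set_content(f"Recording of the 911 call from extension {extension} is attached.\n")
    # Assumes the recording is WAV audio; adjust the subtype for other formats.
    msg.add_attachment(
        recording,
        maintype="audio",
        subtype="wav",
        filename=f"911-{extension}.wav",
    )
    return msg
```

Each returned message could then be handed to `smtplib.SMTP.send_message` on whatever mail relay is available. Keeping the two builders separate mirrors the split above: the details go wide, the recording goes to a shorter list.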

I am aware that email isn’t a perfect medium, but it let us build a dramatic (albeit imperfect) solution to a problem that very few institutions our size are able to address nearly so well.

There’s a lot of weight on our shoulders: keeping the accounting system up, the Internet links up, the web store or the sales phone lines, the shipping systems or the document archives up. These things going down can spell deep trouble in many ways.

And sometimes the systems we maintain might save a life. Take this morning: someone feeling symptoms of a heart attack used our phones to call a colleague for help, that colleague called 911, and an ambulance was dispatched. I know the system worked exactly as it should, because I had two familiar emails from Asterisk in my mailbox this morning when I got to work.

7 thoughts on “System Administrators Might Save Your Life”

  1. So if it is an AT&T Titan 5500, why do people blame CenturyLink? As an administrator who knows 5500s, I think somebody at AT&T needs to be looking at their issues and the service they provide to the LEC, CenturyLink. Sounds like AT&T blocked “ALL” calls. Bad controller? I think it is more than just a bad controller. Somebody is massaging this FCC-reportable outage. Nice article.

    1. Interesting to hear from someone that works on a Titan 5500.

      I agree that, if it was truly AT&T’s fault, AT&T deserves blame here. I know only what CenturyLink told me. But CenturyLink deserves more than I’ve heaped upon them as well.

      This isn’t the first time it happened. Last time it was because they (CenturyLink) had cut through their sole cable carrying traffic into and out of the town. They have a design problem in that there are so many single points of failure for such a large area, and they seem to be failing on a somewhat regular basis.

      These towns have been mismanaged for years and years under Sprint, then Embarq, and who knows if CenturyLink will change things as a result of the CenturyTel merger.

      I’m sure that CenturyLink and AT&T could figure out how to provision redundant service to the area if they wished.

  2. Nice to read that you and your company are enjoying the benefits of Asterisk. You’re making brilliant use of its possibilities by sending out an e-mail on calls to 911. Keep up the good work!

  3. Aren’t you supposed to register the location of various phones or gateways for VoIP? The place I use Asterisk is small enough that the location is all the same (small office), but if you are talking about a campus, I thought you were supposed to register certain locations, like this set of extensions is from here, this set is from over here, etc.

    1. We do VOIP inside our facility, but do not communicate with the outside world using VOIP. We also do not present individual extensions in the caller ID data transmitted on our PRI; this is fairly standard practice. (Very few places do that because many older PBXs can’t, smaller companies don’t have DIDs for all their extensions, and even if the technical capability exists, they may prefer that people returning calls do so via their central IVR or operator system.)

      As a practical matter, it would not be terribly helpful in any case, as the physical location of an extension can change without IT ever knowing it, even several times in a day if somebody carries their phone with them or uses a wireless one. (That is one difference from an analog PBX.)
