Daily Archives: January 13, 2010

System Administrators Might Save Your Life

There’s a lot of responsibility on the shoulders of system administrators. Unless you’re one, it’s probably hard to grasp the full weight of it. This week has been full of reminders for me.

This afternoon, shortly after lunch, I got word that people were having trouble with phones. A few minutes of testing showed that calls within our city were working fine, but it was completely impossible to place or receive long-distance calls. A little while later, a local newspaper’s website indicated that numerous cities over a multi-county region, all served by the CenturyLink local telephone company, were out of service in this manner. I figure that represents hundreds of square miles, or a patch of rural Kansas roughly the size of Los Angeles.

Adding to the problem was the fact that emergency 911 services are accessed via these long-distance lines that were down. For roughly three hours, 911 was completely down across this entire area. What’s more, many cellphone towers and Internet access options were also taken down, since they feed from these same lines.

This adds up to a situation that could very easily cost lives due to delayed response of emergency medical, fire, or police services.

In the end, the problem was traced to “a bad controller card on a Titan 5500 owned by AT&T.”

Now, here’s the system administration angle: If you worked for the phone company and had to troubleshoot a problem that you knew had taken down emergency services for thousands of people, what kind of pressure would you be feeling? Would you be able to keep your cool? I’m glad that I don’t have that kind of job.

I also wouldn’t like to be the engineer (or, more likely, accountant) who decided that they didn’t need any more redundancy to provide good service to the area, especially considering this is the second time in the last year or two that this has happened.

But, ironically, yesterday I signed a purchase order for a new Asterisk PBX (corporate phone) server. When selecting a machine for that task, I am always completely conscious of the responsibility on my shoulders: several hundred employees rely on the machine that is ultimately my responsibility to select. Our own access to 911 would be cut if the machine were to go down. I never forget that the correct operation of the systems that our team sets up and deploys could help save someone’s life, and that a malfunction could cost the company dearly in terms of revenue, productivity, image — or worse.

Nearly four years ago, we switched from an analog PBX, with outsourced support, to a digital VOIP system running Asterisk. Note that we use VOIP in-house, but do not use it externally. Anyhow, I cannot say that the Asterisk PBX has been 100% perfect; I doubt that this could be honestly said of any PBX of any complexity.

I can say, though, that it saved us well over $100,000 AND has proven far more reliable than the system it replaced. Outages are exceptionally rare and brief now. Plus we have internal expertise to fix it, rather than having to wait for a technician to be dispatched from a city 2 hours away to fix anything. I know I don’t have the resources to build a perfect PBX that will never go down (if such a thing is even possible), but I take my responsibility regarding a reliable PBX extremely seriously.

We used to have a frequent problem: someone would call 911, then hang up. We suspected this was often by accident; maybe people hit 9 for an outside line, then misdialed their number. In any case, 911 dispatch would then call our main office, saying they got a hangup. A person or team would then go across our entire campus making sure nobody was in distress: that nobody had managed to dial and then passed out, for instance.

With Asterisk, I was able to help this situation. Whenever somebody calls 911 now, two emails are generated. The first contains details about the call, such as the extension number that called and the duration of the call. This goes to everyone who is likely to receive a callback from 911. It may not always pinpoint the source (as with somebody using a wireless phone), but it almost always gives us a very good idea of where the call came from. The second email contains a recording of the call and serves as an additional clue, but goes to fewer people.
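To make the two-email idea concrete, here is a minimal Python sketch of how such messages could be assembled. It is not the Asterisk configuration described above; the names `EmergencyCall` and `build_911_alerts`, the sender address, and the WAV recording format are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from email.message import EmailMessage


@dataclass
class EmergencyCall:
    """Hypothetical record of one 911 call, as a PBX might log it."""
    extension: str          # internal extension that dialed 911
    started_at: datetime    # when the call began
    duration_seconds: int   # how long the call lasted


def build_911_alerts(call, recording, callback_contacts, recording_contacts,
                     sender="pbx@example.com"):
    """Return (details_email, recording_email) for one 911 call.

    `recording` is the raw audio as bytes; WAV is assumed here.
    """
    stamp = call.started_at.strftime("%Y-%m-%d %H:%M:%S")

    # First email: call details, for everyone likely to get a callback.
    details = EmailMessage()
    details["From"] = sender
    details["To"] = ", ".join(callback_contacts)
    details["Subject"] = f"911 dialed from extension {call.extension}"
    details.set_content(
        f"Extension: {call.extension}\n"
        f"Started: {stamp}\n"
        f"Duration: {call.duration_seconds} seconds\n"
    )

    # Second email: the recording itself, for a smaller group.
    audio = EmailMessage()
    audio["From"] = sender
    audio["To"] = ", ".join(recording_contacts)
    audio["Subject"] = f"911 call recording, extension {call.extension}"
    audio.set_content(f"The recording of the 911 call placed at {stamp} is attached.")
    audio.add_attachment(recording, maintype="audio", subtype="wav",
                         filename=f"911-{call.extension}.wav")
    return details, audio
```

The sketch only builds the messages. Delivery could then go through something like `smtplib.SMTP.send_message`, and keeping the recording in a separate message is what lets it go to a shorter recipient list than the call details.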

I am aware that email isn’t a perfect medium, but it gave us a dramatic (albeit imperfect) improvement on a problem that very few institutions our size are able to address nearly so well.

There’s a lot of weight on our shoulders: keeping the accounting system, the Internet links, the web store, the sales phone lines, the shipping systems, and the document archives up. Any of these going down can spell deep trouble in many ways.

And sometimes the systems we maintain might save a life. This morning, someone feeling symptoms of a heart attack used our phones to call a colleague for help; that colleague called 911, and an ambulance was dispatched. I know the system worked exactly as it should, because I had two familiar emails from Asterisk in my mailbox when I got to work.