Disasters happen

It’s been one of those days.

It started with me forgetting my laptop bag at home. Then I had to work with an attorney on some things. He’s a nice guy, and does a good job, but you know, it’s not what I’d really like to spend my day doing.

Then in the middle of the afternoon, I hung up the phone and checked my email. About 50 messages from Samba telling me that processes with various PIDs had crashed unexpectedly. Uh-oh. I think we still have some 80ish people using that.

About a minute later, a coworker says, “John, I’ve been bad.” “Is that why I just got 50 emails from Samba?” “No. Well, yes. Well, maybe. I don’t know.”

It turned out he was working on a restore from tape that, out of necessity, grabbed more data than he needed. He meant to type rm -r ./var but typed rm -r /var instead. Oops. He hit Ctrl-C halfway through, so enough of /var survived to send email, but not enough for Samba (or, apparently, NFS) to work.
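For the curious, the whole difference is one leading dot: ./var is resolved against the current directory, while /var starts at the filesystem root. Here is a minimal sketch of a guard that would have refused the second form. The names is_relative_target and careful_rm are made up for illustration; this is not something we actually had installed, and it is no substitute for backups.

```shell
#!/bin/sh
# Hypothetical helper (not from our actual setup): succeed only when the
# target is a relative path, so "/var" is refused but "./var" is allowed.
is_relative_target() {
    case "$1" in
        /*) return 1 ;;   # absolute path: starts at the filesystem root
        *)  return 0 ;;   # relative path: resolved against the current directory
    esac
}

# Hypothetical wrapper: check every target before handing off to rm -r.
careful_rm() {
    for target in "$@"; do
        if ! is_relative_target "$target"; then
            echo "careful_rm: refusing absolute path: $target" >&2
            return 1
        fi
    done
    # "--" stops option parsing, so a target that begins with "-" is not
    # mistaken for a flag.
    rm -r -- "$@"
}
```

With this in place, careful_rm ./var removes the var directory under the current directory, while careful_rm /var prints a refusal and deletes nothing. It only catches this one class of typo; a relative path like ../../var can still point somewhere you did not intend.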

As he dashed off to pull yesterday’s tapes from offsite storage, I prepared the restore and made a plan. We hadn’t installed any software since yesterday, so I restored /var to a temporary location, took the server down into single-user mode, overwrote what was left of the existing /var, and rebooted the Xen instance in question. Everything was back to normal. Except, that is, for the potentially dozens of users who will need help running SCANPST.EXE because their Outlook PST, being the fragile heap of garbage that it is, will have somehow been corrupted by this little incident.

So, what did we learn from this?

  • Deleting /var was probably the least annoying outage I’ve had to deal with yet. Certainly less nerve-wracking than the time I was working on a live, powered-up server and my wedding ring shorted out something on a circuit board. I didn’t know if that thing would just reboot or if we’d be down for hours waiting for parts…
  • It was really nice knowing what had gone wrong from the start, rather than having to hunt for the cause.
  • One coworker commented, “if he had to delete part of an operating system, at least it wasn’t Windows. We wouldn’t have recovered in 15 minutes if it was.” True.
  • Bacula is great.
  • Backups are great, even if you don’t use Bacula to make them.
  • I dislike programs that take server load from 0.3 to 9.5 just telling you that there’s something wrong with the server.
