March 6th, 2009
It’s been quite the week.
Last Friday, my stomach was just starting to feel a little odd. I didn’t think much off it — a little food that didn’t go over well or stress, I thought.
Saturday I got out of bed and almost immediately felt like throwing up. Ugh. I probably caught some sort of stomach flu. I was nauseous all day and had some terrible diarrhea to boot. I spent parts of Saturday, Saturday night, Sunday, and Sunday night “supervising some emergency downloads” as the BOFH would say. By Sunday afternoon, I thought I was doing good enough to attend a practice of the Kansas Mennonite Men’s Choir. I made it through but it wasn’t quite as up to it as I thought.
Monday morning I woke up and thought the worst was behind me, so I went to work. By evening, the worst clearly was not behind me. I was extremely cold, and then got very hot a few hours later. Tuesday I left work a little early because of not feeling well.
Wednesday a colleague called me at home before I left to say that the ERP database had a major hiccup. That’s never good. The database is this creaky old dinosaur thing that has a habit of inventing novel ways to fail (favorite pastime: exceeding some arbitrary limit to the size of files that no OS has cared about for 5 years, then hanging without telling anybody why). My coworkers had been working on it since 5.
I went into the office and did what I could to help out, though they had mostly taken care of it. Then we went to reboot the server. It didn’t come back. I/O error on sda just after init started, and it hung. Puzzled, as it just used that disk to boot from. Try rebooting again.
This time, I/O error as the fibre channel controller driver loads. Again, puzzled as it just used that controller to load grub. Power cycle this time.
And now the server doesn’t see the fibre channel link at all. Eep. Check our fiber optic cables, and power cycle again.
And THIS time, the server doesn’t power back up. Fans whir for about a second, then an ominous red light I never knew was there shows up. Eeep!
So I call HP. They want me to remove one CPU. Yes, remove one CPU. I tried, and long story short, they dispatch a local guy with a replacement motherboard. “Can you send along a FC controller, in case it’s dead too?” “Nope, not until we diagnose a problem with it.”
Local guy comes out. He’s a sharp guy and I really like him. But the motherboard wasn’t in stock at the local HP warehouse, so he had to have it driven in from Oklahoma City. He gets here with it by about 4:30. At this point the single most important server to the company’s business has been down almost 12 hours.
He replaces the motherboard. The server now powers up — yay! And it POSTs, and it…. doesn’t see the disks. !#$!#$
He orders the FC controller, which is so very much not in stock that they can’t get it to us until 8:30AM the next morning (keep in mind this thing is on a 4-hour 24/7 contract).
Next morning rolls around. Outage now more than 24 hours. He pops the FC controller in, we tweak the SAN settings appropriately, we power up the machine, and….
still doesn’t see any disks, and the SAN switch still doesn’t see any link. EEP!
Even the BIOS firmware tool built into the controller doesn’t see a link, so we KNOW it’s not a software issue. We try plugging and unplugging cables, trying different ports, everything. Nothing makes a difference.
At this point, while he ponders what else he can replace while we start migrating the server to a different blade. We get ERP back up on its temporary home an hour later, and he basically orders us every part he can think of while we’ve bought him some room.
Several additional trips later, he’s replaced just about everything at least once, some things 2 or 3 times, and still no FC link. Meanwhile, I’ve asked my colleague to submit a new ticket to HP’s SAN team so we can try checking of the switch has an issue. They take their sweet time answering until he informs them this morning that it’s been *48 HOURS* since we first reported the outage. All of a sudden half a dozen people at HP take a keen interest in our case. As if they could smell this blog post coming…
So they advise us to upgrade the firmware in the SAN switch, but they also say “we really should send this to the blade group; the problem can’t be with the SAN” — and of course the blade people are saying “the problem’s GOT to be with the SAN”. We try to plan the firmware upgrade. In theory, we can lose a switch and nobody ever notices due to multipathing redundancy. In practice, we haven’t tested that in 2 years. None of this equipment had even been rebooted in 390 days.
While investigating this, we discovered that one of the blade servers could only see one path to its disks, not two. Strange. Fortunately, THAT blade wasn’t mission-critical on a Friday, so I power cycled it.
And it powered back up. And it promptly lost connection to its disks entirely, causing the SAN switches to display the same mysterious error they did with the first blade — the one that nobody at HP had heard of, could find in their documentation, or even on Google. Yes, that’s right. Apparently power cycling a server means it loses access to its disks.
Faced with the prospect of our network coming to a halt if anything else rebooted (or worse, if the problem started happening without a reboot), we decided we’d power cycle one switch now and see what would happen. If it worked out, our problems would be fixed. If not, at least things would go down in our and HP’s presence.
And that… worked? What? Yes. Power cycling the switch fixed every problem over the course of about 2 minutes, without us having to do anything.
Meanwhile, HP calls back to say, “Uhm, that firmware upgrade we told you to do? DON’T DO IT!” We power cycle the other switch, and have a normal SAN life again.
I let out a “WOOHOO!” My colleague, however, had the opposite reaction. “Now we’ll never be able to reproduce this problem to get it fixed!” Fair point, I suppose.
Then began the fairly quick job of migrating ERP back to its rightful home — it’s all on Xen already, designed to be nimble for just these circumstances. Full speed restored 4:55PM today.
So, to cap it all off, within the space of four hours, we had fail:
- One ERP database
- ERP server’s motherboard
- Two fiber optic switches — but only regarding their ability to talk to machines recently rebooted
- And possibly one FC controller
Murphy, I hate you.
The one fun moment out of this was this conversation:
Me to HP guy: “So yeah, that machine you’ve got open wasn’t rebooted in 392 days until today.”
HP guy: “WOW! That’s INCRED — oh wait, are you running Linux on it?”
HP: “Figures. No WAY you’d get that kind of uptime from Windows.”
And here he was going to be all impressed.