Please be aware that I haven’t had a chance to double-check the technical details of this post before the deadline, so if you see an error please feel free to comment about it. The events happened years ago, but it’s funny how history has a way of repeating itself.
Because I have too few hours in each day and I still suck at this blogging thing, I am combining my MacGyver Moment and TSQL2sDay posts into this one, so here we go.
The Beginning: Once upon a time in a conference room on a very high floor sat over a dozen technical people trying to quickly fix a server that had begun crashing the day before. This company was pretty big. Their data, it was massive. Because of this we only worked with summary rows, and yet it was still in numbers that make you stop and shake your head every once in a while. For instance, one time the daily import process failed just a few percent in, and it had already cleansed and loaded over 400,000 rows.
To set the scene a little: we had developers, data folks, server experts, network experts, a SAN expert, the SAN expert’s boss, application SMEs, business liaisons, directors, and the company’s TAM from Microsoft. Oh, and another SAN expert who also doubled as a Microsoft SQL Server MVP who just happened to specialize in High Availability (specifically clustering). This conference room had been turned into a War Room and was going to stay that way for as long as it took to fix the problem.
The Problem: A problem child of an application that already had a reputation for crashing every once in a while had now started crashing every couple of hours. I don’t know the exact specs of the hardware anymore, but it was pretty close to the top of the line: something like 16 logical processors and 16 GB of RAM, backed by nearly two dozen disks on a SAN that held over 200 disks. This server had a twin, and based on the error logs that was part of the problem. It was one node of a two-node failover cluster, and all of a sudden it had started logging 15,000-millisecond timeout errors while trying to check on its disk resources. This timeout in turn was causing whichever node was active to freak out and think it had lost contact with the SAN altogether.
The Situation: While the overall SAN supporting this server was large by a lot of companies’ standards, it was actually just a run-of-the-mill small SAN at this place. In fact, we later ordered up another one that looked just like it (only two years faster in every aspect), but at the time we kept on being told how solid this particular SAN was. There were a lot of moving parts, and we were a fresh set of eyes being brought in. One of the moving parts we had taken note of was that another server, on different disks on the same SAN, had recently been rebuilt as a 64-bit server. We took a look and said, “I’m sorry, but your SAN just can’t support the needs of this SQL Server.” You would have thought I had just told the SAN guys that I had caught them cross-dressing or something like that. They assured me that “no [little] Windows box” could ever take down their mighty SAN. Regardless, we were stuck with that particular SAN for at least the next couple of weeks and needed to come up with something to help.
The Other Problem: We asked for all the IO stats they could give us, but unfortunately there was something about a new version that needed to be installed, an outage window that needed to be scheduled, and all that. Bottom line: we were stuck with some total IO numbers over a time frame wide enough that they weren’t all that useful. However, being a glorified mathematician, I took all the numbers I had and tried to figure out the size of the ‘unknown’ portion by subtracting out the three servers I could get stats for myself. As it turned out, in one particular time interval that was just enough information for me to notice a trend. It wasn’t much, but it was more solid than anything else being offered up (or at least I felt so), and the situation was, to say the least, a little tense.
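The math here was nothing fancier than subtraction. Here’s a minimal sketch of the idea; all the numbers below are invented stand-ins (the real inputs were the SAN’s total IO counters plus per-server perfmon data I collected myself):

```python
# Total IOPS the SAN reported for each sample interval (made-up numbers).
san_total = {"09:00": 8200, "11:00": 8400, "13:00": 14900, "15:00": 8100}

# IOPS I could measure myself on the three servers I had access to.
known_servers = {
    "09:00": [3100, 2600, 2300],
    "11:00": [3200, 2700, 2300],
    "13:00": [3150, 2650, 2350],
    "15:00": [3050, 2600, 2250],
}

# Whatever is left over has to belong to the servers I couldn't see.
for interval, total in san_total.items():
    unknown = total - sum(known_servers[interval])
    print(f"{interval}: ~{unknown} IOPS unaccounted for")
```

With these fake numbers, the leftover portion sits around 200 IOPS in every interval except one, where it jumps by an order of magnitude; that kind of spike in the ‘unknown’ slice is exactly the trend that stood out.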
The Lame-Brained Idea: One of the server experts mentioned that he could “borrow” some RAM from a server that hadn’t gone into production yet, and we could stick that in the servers (plural, because they were a failover cluster) that were crashing to see if that would fix it. Given the error message mentioned above, I just didn’t think this would work. I had come up with something that just didn’t sound right even saying it, but par for the course, I said it anyways. I asked them to humor me and stick the “borrowed” RAM into a box that wasn’t crashing: that other box in the situation that had recently been rebuilt as 64-bit. They laughed at my idea, and I don’t blame them. Unless you stepped through the situation in the exact same order as I did, this was like saying let’s just reboot and see if that fixes it (oh wait, it does reboot every time it crashes 🙂 ). It’s easy to decide to stick the only spare RAM you have into a server that’s crashing; it’s a little bit tougher to make the call to bring down a completely different system and have them stick the RAM in that one. If you’re wrong, you’ve just caused a second outage for no reason.
Stupid Windows Guys: That’s what they must have thought when they agreed to give my idea a try. We had all noticed the crashes had been happening almost like clockwork. Almost, but no one could tie them back to anything else happening on that machine. As we all know, almost only counts in horseshoes and hand grenades. And occasionally, wild-ass guesses. A funny thing happened, though. By this point we were able to guesstimate the next crash to within 5 to 10 minutes; only it didn’t happen this time.
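The “clockwork” guesstimate was nothing more than averaging the gaps between crash times pulled from the event log. A quick sketch, with hypothetical timestamps standing in for the real entries:

```python
from datetime import datetime, timedelta

# Hypothetical crash times; the real ones were "almost like clockwork"
# but never lined up with anything visible on the machine itself.
crashes = ["08:05", "10:12", "12:08", "14:10", "16:07"]
times = [datetime.strptime(t, "%H:%M") for t in crashes]

# Average the gaps between consecutive crashes...
gaps = [b - a for a, b in zip(times, times[1:])]
avg_gap = sum(gaps, timedelta()) / len(gaps)

# ...and project the next one.
next_crash = times[-1] + avg_gap
print(f"average interval: {avg_gap}, next crash expected ~{next_crash:%H:%M}")
```

Crude, but when the intervals only vary by a few minutes, crude is plenty good enough to know when to start watching the clock.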
The Hidden Pattern: Now, I’ve always enjoyed learning about the storage engine internals, but I sure ain’t no Paul Randal. I am a quick study and love a good disaster, though. What I had found was that there was a cube being built every couple of hours on the newly minted 64-bit server, and while the system did a good job of still letting you query the old cube while the new cube was being built, there would inevitably come a point where something was happening that caused the SSAS engine to flush ooh-gobs of data back down to disk. More data than the total size of the cubes, even. The SAN ended up getting saturated with gigs and gigs of non-stop writes (and I hear with RAID 5 that’s like a bad thing, but maybe it’s just a rumor).
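For the record, the RAID 5 thing is no rumor: each small random write typically costs about four back-end IOs (read old data, read old parity, write new data, write new parity), so a write-heavy flush multiplies the load on the spindles. A rough illustration; the disk count and per-disk IOPS figures below are ballpark assumptions, not the actual array’s specs:

```python
def backend_iops(reads: float, writes: float, write_penalty: int = 4) -> float:
    """Translate a front-end workload into back-end disk IOs.

    write_penalty=4 models RAID 5 small random writes:
    read data + read parity + write data + write parity.
    """
    return reads + writes * write_penalty

disks = 200                       # spindles in the shared SAN
iops_per_disk = 150               # ballpark for a 15k FC drive
capacity = disks * iops_per_disk  # ~30,000 back-end IOPS in total

# A cube rebuild flushing gigs of data is almost all writes.
cube_flush = backend_iops(reads=1_000, writes=10_000)
print(f"capacity: {capacity}, cube flush demands: {cube_flush:.0f}")
```

With these illustrative numbers, 10,000 front-end writes balloon into 41,000 back-end IOs, comfortably past what the whole shelf can deliver, and every other server sharing those controllers gets to wait in line.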
The Takeaway: Make sure you know what tools your SAN admin has available, and more specifically, find out what they can collect on a continuous basis for weeks on end; you might be surprised by the answer. The disks your server relies on may be fine while the controller cards accessing them are getting saturated by something else. You may need to purchase some additional software or something. Look for solutions in places where they might not make sense unless you hold your head just right. The error log is your friend. And above all, just because someone says it’s not possible, don’t completely believe them until they prove it to you.