Ah Melbourne, you’re quite the town. After spending the weekend visiting you and soaking myself deep in your culture I’ve come to miss your delicious cuisine and exquisite coffee now that I’m back at my Canberran cubicle, but the memories of the trip still burn vividly in my mind. From the various pubs I frequented with my closest friends to perusing the wares of the Queen Victoria markets I just can’t get enough of your charm and, university admissions willing, I’ll be making you my home sometime next year. The trip was not without its dramas however, and none was more far-reaching than my attempt to depart the city of Melbourne via my airline of choice: Virgin Blue.
Whilst indulging in a few good pizzas and countless pints of Bimbo Blonde we discovered that Virgin Blue was having problems checking people in, resulting in them having to resort to manual check-ins. At the time I didn’t think it was such a big deal since initial reports hadn’t yet mentioned any flights actually being cancelled and my flight wasn’t scheduled to leave until 9:30PM that night. So we continued to indulge ourselves in the Melbourne life as was our wont, cheerfully throwing our cares to the wind and ordering another round.
Things started to go all pear-shaped when I thought I’d better check up on the situation and put a call in to the customer care hotline to see what the deal was. My first attempt was stonewalled by an automatic response stating that they weren’t taking any calls due to a large volume of people trying to get through. I managed to get into a queue about 30 minutes later and even then I was still on the phone for almost an hour before getting through. My attempts to get solid information out of them were all met with the same response: “You have to go to the airport and then work it out from there”. Luckily for me and my travelling compatriots it was a public holiday on Monday so a delay, whilst annoying, wouldn’t be too devastating. We decided to proceed to the airport and what I saw there was chaos on a new level.
The Virgin check-in terminals were swamped with hundreds of passengers, all of them in varying states of disarray and anger. Attempts to get information out of the staff wandering around were usually met with reassurance and directions to keep checking the information board whilst listening for announcements. On the way over I’d managed to work out that our flight wasn’t on the cancelled list so we were in with a chance, but seeing the sea of people hovering around the terminal didn’t give us much hope. After grabbing some quick dinner and sitting around for a while our flight number was called for manual check-ins and we lined up to get ourselves on the flight. You could see why so many flights had to be cancelled, as boarding that one flight manually took them well over an hour, and that wasn’t even a full flight of passengers. Four hours after arriving at the airport we were safe and sound in Canberra, which I unfortunately can’t say for the majority of people who chose Virgin as their carrier that day.
Throughout the whole experience all the blame was being aimed squarely at a failure in the IT system that took out their client-facing check-in and online booking systems. Knowing a bit about mission-critical infrastructure I remarked at how a single failure could take out a system like this, one that when it goes down costs them millions in lost business and compensation. Going through it logically I came to the conclusion that it had to be some kind of human failure that managed to wipe some critical shared infrastructure, probably a SAN that was live replicating to its disaster recovery site. I mean anything that has the potential to cause that much drama must have a recovery time of less than a couple of hours or so, and it had been almost 12 hours since we first heard the reports of it being down.
As it turns out I was pretty far off the mark. Virgin just recently released an initial report of what happened and although it’s scant on the details what we’ve got to go on is quite interesting:
At 0800 (AEST) yesterday the solid state disk server infrastructure used to host Virgin Blue failed resulting in the outage of our guest facing service technology systems.
We are advised by Navitaire that while they were able to isolate the point of failure to the device in question relatively quickly, an initial decision to seek to repair the device proved less than fruitful and also contributed to the delay in initiating a cutover to a contingency hardware platform.
The service agreement Virgin Blue has with Navitaire requires any mission critical system outages to be remedied within a short period of time. This did not happen in this instance. We did get our check-in and online booking systems operational again by just after 0500 (AEST) today.
Navitaire are a subsidiary of Accenture, one of the largest suppliers of IT outsourcing in the world with over 177,000 employees worldwide and almost $22 billion in revenue. Having worked for one of their competitors (Unisys) for a while I know no large contract like this goes through without some kind of Service Level Agreement (SLA) in place which dictates certain metrics and their penalties should they not be met. Virgin has said that they will be seeking compensation for the blunder but to their credit they were more focused on getting their passengers sorted first before playing the blame game with Navitaire.
Still, as a veteran IT administrator I can’t help but look at this disaster and wonder how it could have been avoided. A disk failure in a server is common enough that your servers are usually built around the idea of at least one of them failing. Additionally, if this was based on shared storage there would have been several spare disks ready to take over in the event that one or more failed. Taking this all into consideration it appears that Navitaire had a single point of failure in the client-facing parts of the system they ran for Virgin and a disaster recovery process that hadn’t been tested prior to this event. All of these coalesced into an outage that lasted 21 hours when most mission-critical systems like that wouldn’t tolerate anything more than 4.
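For anyone who wants to check the arithmetic, here is a quick sketch of how that 21 hour figure falls out of the times in Virgin’s statement. The calendar dates are placeholders of my own choosing (the statement only says “yesterday” and “today”), and the 4 hour recovery target is my rule of thumb rather than anything from the actual contract with Navitaire.

```python
from datetime import datetime, timedelta

# Placeholder calendar dates (assumption); only the times of day
# come from Virgin's statement.
outage_start = datetime(2010, 9, 26, 8, 0)      # 0800 AEST, day of the failure
service_restored = datetime(2010, 9, 27, 5, 0)  # 0500 AEST, the following day

# Hypothetical recovery time objective; the real SLA figure isn't public.
assumed_rto = timedelta(hours=4)

outage = service_restored - outage_start
outage_hours = outage.total_seconds() / 3600
overrun_hours = (outage - assumed_rto).total_seconds() / 3600

print(f"Outage lasted {outage_hours:.0f} hours")
print(f"Overran the assumed recovery target by {overrun_hours:.0f} hours")
```

Under those assumptions the outage ran roughly five times longer than the kind of recovery window I’d expect for a system like this.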
Originally I had thought that Virgin had all their IT systems internal and this kind of outage seemed like pure incompetence. However upon learning about their outsourced arrangement I know exactly why this happened: profit. In an outsourced arrangement you’re always pressured to deliver exactly to the client’s SLAs whilst keeping your costs to a minimum, thereby maximising profit. Navitaire is no different and their cost-saving measures meant that a failure in one place and a lack of verification testing in another led to a massive outage for one of their big clients. Their other clients weren’t affected because they likely have independent systems for each client but I’d hazard a guess that all of them are at least partially vulnerable to the same kind of outage that affected Virgin on the weekend.
In the end Virgin did handle the situation well all things considered, opting first to take care of their customers rather than pointing fingers right from the start. To their credit all the airport staff and plane crew stayed calm and collected throughout the ordeal and apart from the delayed check-in there was little difference between my flight down and the one back up. Hopefully this will trigger a review of their disaster recovery processes and end up with a more robust system for not only Virgin but all of Navitaire’s customers. It won’t mean much to us as customers, since if that does happen we won’t notice anything, but it does mean that future outages of this kind shouldn’t have as big an impact as the one last weekend.