
Why IT Outsourcing Sucks for Australia’s Government.

Back in 1996 one of the incoming Howard government’s core promises was to reduce its expenditure dramatically, particularly on IT. The resulting policy was dubbed the IT Initiative and promised to find some $1 billion in savings over the following years, primarily by outsourcing many functions to the private sector. The thinking was that the private sector, well versed in projects of the government’s scale and beyond, would be able to perform the same functions at a far lower cost than permanent public servants. The next decade saw many companies rush in to acquire these lucrative IT outsourcing arrangements, but the results, both in terms of services delivered and apparent savings, never matched what was promised.

For many the reasons behind the apparent failure were a mystery. Many of the organisations providing IT services to the government weren’t fly-by-night operations; indeed many of them were large multinational companies with proven track records, but they just didn’t achieve the same outcomes when it came to government contracts. After nearly a decade of attempting to make outsourcing work, many departments began insourcing their IT again and relied on a large contractor workforce to bring in the skills required to keep their projects functioning. Of course costs were still above what many had expected them to be, resulting in the Gershon Report, which recommended heavy cuts to said contractor workforce.

This all stems from one glaring failure that the government has yet to realise: it can’t negotiate contracts.

I used to work for a large outsourcer in the Canberra region, swept up while I was still fresh out of university into a job that paid me a salary many took years to attain. The outsourcer had won this contract away from the incumbent to provide desktop and infrastructure services whilst the numerous other outsourcers involved in the contract retained ownership of their respective systems. After spending about 6 months as a system admin my boss approached me about moving into the project management space, something I had mentioned that I was keen on pursuing. It was in this position that I found out just how horrible the Australian government was at contract negotiation and how these service providers were the only winners in their arrangements.

My section was dedicated to “new business”, essentially work that we’d be responsible for implementing that wasn’t in scope as part of the broader outsourcing contract. Typically these would be small engagements, most not requiring tender-level documentation, and in all honesty most of them would have been considered by any reasonable person to fall under the original contract. Of course, many of the users I came back to with a bill detailing how much their work would cost responded with surprise, and often they would simply drop the request rather than try to seek approval for the cost.

The issue still exists today primarily because many of the positions that handle contract negotiations don’t require specific skills or training. This means that whilst the regulations in place stop most government agencies from entering into catastrophically bad arrangements, the more subtle ones often slip through the cracks, and it’s only after everything is said and done that the oversights are found. All of the large outsourcers in Canberra know this, and it’s why there’s been no force working to correct the problem for the better part of two decades. It’s also why Canberra exists as a strange microcosm of IT expertise, with salaries you won’t see anywhere else in Australia.

The solution is to simply start hiring contract negotiators away from the private sector and get them working for the Australian government. Get contract law experts to review large IT outsourcing arrangements and start putting the screws to those outsourcers to deliver more for the same amount of money. It’s not an easy road to tread and it won’t likely win the government any friends, but unless they start doing something outsourcing is always going to be seen as a boondoggle, only for those with too much cash and not enough sense.

 

From The Outside: An Analysis of the Virgin Blue IT Disaster.

Ah Melbourne, you’re quite the town. After spending a weekend visiting you and soaking myself deep in your culture, I’ve come to miss your delicious cuisine and exquisite coffee now that I’m back at my Canberran cubicle, but the memories of the trip still burn vividly in my mind. From the various pubs I frequented with my closest friends to perusing the wares of the Queen Victoria markets, I just can’t get enough of your charm and, university admissions willing, I’ll be making you my home sometime next year. The trip was not without its dramas, however, and none was more far reaching than my attempt to depart the city via my airline of choice: Virgin Blue.

Whilst indulging in a few good pizzas and countless pints of Bimbo Blonde we discovered that Virgin Blue was having problems checking people in, forcing them to resort to manual check-ins. At the time I didn’t think it was such a big deal, since initial reports hadn’t yet mentioned any flights actually being cancelled and my flight wasn’t scheduled to leave until 9:30PM that night. So we continued to indulge ourselves in the Melbourne life as was our wont, cheerfully throwing our cares to the wind and ordering another round.

Things started to go all pear-shaped when I thought I’d better check up on the situation and put a call in to the customer care hotline to see what the deal was. My first attempt was stonewalled by an automated response stating that they weren’t taking any calls due to the large volume of people trying to get through. I managed to get into a queue about 30 minutes later, and even then I was still on the phone for almost an hour before getting through. My attempts to get solid information out of them were met with the same response: “You have to go to the airport and work it out from there”. Luckily for me and my travelling compatriots it was a public holiday on Monday, so a delay, whilst annoying, wouldn’t be too devastating. We decided to proceed to the airport, and what I saw there was chaos on a new level.

The Virgin check-in terminals were swamped with hundreds of passengers, all in varying states of disarray and anger. Attempts to get information out of the staff wandering around were usually met with reassurances and directions to keep checking the information board whilst listening for announcements. On the way over I’d managed to work out that our flight wasn’t on the cancelled list, so we were in with a chance, but seeing the sea of people hovering around the terminal didn’t give us much hope. After grabbing a quick dinner and sitting around for a while our flight number was called for manual check-in, and we lined up to get ourselves on the flight. You could see why so many flights had to be cancelled, as boarding that one flight manually took well over an hour, and that wasn’t even a full flight of passengers. Four hours after arriving at the airport we were safe and sound in Canberra, which I unfortunately can’t say for the majority of people who chose Virgin as their carrier that day.

Throughout the whole experience all the blame was being squarely aimed at a failure in the IT system that took out their client-facing check-in and online booking systems. Knowing a bit about mission critical infrastructure, I remarked at how a single failure could take out a system like this, one that when it goes down costs them millions in lost business and compensation. Going through it logically I came to the conclusion that it had to be some kind of human failure that managed to wipe some critical shared infrastructure, probably a SAN that was live replicating to its disaster recovery site. I mean, anything that has the potential to cause that much drama should have a recovery time of no more than a couple of hours, and it had been almost 12 hours since we first heard the reports of it being down.

As it turns out I was pretty far off the mark. Virgin recently released an initial report of what happened and, although it’s scant on details, what we’ve got to go on is quite interesting:

At 0800 (AEST) yesterday the solid state disk server infrastructure used to host Virgin Blue failed resulting in the outage of our guest facing service technology systems.

We are advised by Navitaire that while they were able to isolate the point of failure to the device in question relatively quickly, an initial decision to seek to repair the device proved less than fruitful and also contributed to the delay in initiating a cutover to a contingency hardware platform.

The service agreement Virgin Blue has with Navitaire requires any mission critical system outages to be remedied within a short period of time. This did not happen in this instance. We did get our check-in and online booking systems operational again by just after 0500 (AEST) today.

Navitaire is a subsidiary of Accenture, one of the largest suppliers of IT outsourcing in the world with over 177,000 employees worldwide and almost $22 billion in revenue. Having worked for one of their competitors (Unisys) for a while, I know that no large contract like this goes through without some kind of Service Level Agreement (SLA) in place, one which dictates certain metrics and the penalties should they not be met. Virgin has said that they will be seeking compensation for the blunder but, to their credit, they were more focused on getting their passengers sorted first before playing the blame game with Navitaire.
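To give a sense of how little room an availability SLA leaves, here’s a rough back-of-the-envelope sketch in Python. The 99.9% target and the other figures are assumptions for illustration only; the actual terms between Virgin Blue and Navitaire aren’t public.

# Rough sketch of how an availability target turns into a downtime budget.
# The targets below are assumptions; the real Virgin/Navitaire SLA isn't public.

HOURS_PER_MONTH = 30 * 24  # roughly 720 hours in a month

def allowed_downtime_hours(availability_target: float) -> float:
    """Monthly downtime budget implied by an availability percentage."""
    return HOURS_PER_MONTH * (1 - availability_target)

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} availability -> {allowed_downtime_hours(target):.2f} hours/month")

outage_hours = 21  # approximate length of the check-in outage
print("SLA breached" if outage_hours > allowed_downtime_hours(0.999) else "Within SLA")

Even at a relatively forgiving 99% target the monthly budget is a bit over seven hours, so a 21-hour outage blows through any plausible set of numbers.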

Still, as a veteran IT administrator I can’t help but look at this disaster and wonder how it could have been avoided. A disk failure is common enough that servers are usually built around the assumption that at least one disk will fail. Additionally, if this was based on shared storage there would have been several spare disks ready to take over in the event that one or more failed. Taking all of this into consideration, it appears that Navitaire had a single point of failure in the client-facing parts of the system they ran for Virgin, and a disaster recovery process that hadn’t been tested prior to this event. All of this coalesced into an outage that lasted 21 hours, when most mission critical systems of that kind wouldn’t tolerate anything more than 4.
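To put some rough numbers on why that single point of failure matters, here’s a quick Python sketch of how a redundant standby changes overall availability. The 99.5% per-component figure is purely an assumption for illustration, not anything from Navitaire.

# Back-of-the-envelope availability maths: components in series vs. adding
# a redundant standby for the storage layer. All figures are assumptions.

def series(*availabilities: float) -> float:
    """All components must be up, so availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def redundant(a: float, copies: int = 2) -> float:
    """At least one of several independent copies must be up."""
    return 1 - (1 - a) ** copies

storage, app, network = 0.995, 0.995, 0.995  # assumed per-component availability

single_storage = series(storage, app, network)
with_standby = series(redundant(storage), app, network)

print(f"Single storage device:  {single_storage:.4%} available")
print(f"Tested storage standby: {with_standby:.4%} available")

On those assumed numbers the standby takes the weakest link out of the equation, but only if the cutover actually works, which is exactly the verification testing that seems to have been missing here.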

Originally I had thought that Virgin ran all their IT systems internally, and this kind of outage seemed like pure incompetence. However, upon learning about their outsourced arrangement I know exactly why this happened: profit. In an outsourced arrangement you’re always pressured to deliver exactly to the client’s SLAs whilst keeping your costs to a minimum, thereby maximising profit. Navitaire is no different, and their cost saving measures meant that a failure in one place and a lack of verification testing in another led to a massive outage for one of their big clients. Their other clients weren’t affected because they likely have independent systems for each client, but I’d hazard a guess that all of them are at least partially vulnerable to the same kind of outage that affected Virgin on the weekend.

In the end Virgin handled the situation well, all things considered, opting first to take care of their customers rather than pointing fingers right from the start. To their credit, all the airport staff and plane crew stayed calm and collected throughout the ordeal, and apart from the delayed check-in there was little difference between my flight down and the one back up. Hopefully this will trigger a review of their disaster recovery processes and end up delivering a more robust system, not only for Virgin but for all of Navitaire’s customers. It won’t mean much to us as customers, since if that does happen we won’t notice anything, but it does mean that future outages shouldn’t have as big an impact as the one over the weekend just gone.