I heap a lot of praise on Windows Azure here, enough for me to start thinking about how that’s making me sound like a Microsoft shill, but honestly I think it’s well deserved. As someone who’s spent the better part of a decade setting up infrastructure for applications to run on and then began developing said applications in its spare time I really do appreciate not having to maintain another set of infrastructure. Couple that with the fact that I’m a full Microsoft stack kind of guy it’s really hard to beat the tight integration between all of the products in the cloud stack, from the development tools to the back end infrastructure. So like many of my weekends recently I spent the previous coding away on the Azure platform and it was filled with some interesting highs and rather devastating lows.
For the uninitiated Azure Web Sites are essentially a cut down version of the Azure Web Role allowing you to run pretty much full scale web apps for a fraction of the cost. Of course this comes with limitations and unless you’re running on at the Reserved tier you’re essentially sharing a server with a bunch of people (I.E. a common multi-tenant scenario). For this site, which isn’t going to receive a lot of traffic, it’s perfect and I wanted to deploy the first run app onto this platform. Like any good admin I simply dove in head first without reading any documentation on the process and to my surprise I was up and running in a matter of minutes. It was pretty much create web site, download publish profile, click Publish in Visual Studio, import profile and wait for the upload to finish.
Deploying a web site on my own infrastructure would be a lot more complicated as I can’t tell you how many times I’ve had to chase down dependency issues or missing libraries that I have installed on my PC but not on the end server. The publishing profile coupled with the smarts in Visual Studio was able to resolve everything (the deployment console shows the whole process, it was actually quite cool to watch) and have it up and running at my chosen URL in about 10 minutes total. It’s very impressive considering this is still considered preview level technology, although I’m more inclined to classify it as a release candidate.
Other Azure users can probably guess what I’m going to write about next. Yep, the horrific storage problems that Azure had for about 24 hours.
I noticed some issues on Friday afternoon when my current migration (yes that one, it’s still going as I write this) started behaving…weird. The migration is in its last throws and I expected the CPU usage to start ramping down as the multitude of threads finished their work and this lined up with what I was seeing. However I noticed the number of records migrated wasn’t climbing up at the rate it was previously (usually indicative of some error happening that I suppressed in order for the migration to run faster) but the logs showed that it was still going, just at a snail’s pace. Figuring it was just the instance dying I reimaged it and then the errors started flooding in.
Essentially I was disconnected from my NOSQL storage so whilst I could browse my migrated database I couldn’t keep pulling records out. This also had the horrible side effect of not allowing me to deploy anything as it would come back with SSL/TLS connection issues. Googling this led to all sorts of random posts as the error is also shared by the libraries that power the WebClient in .NET so it wasn’t until I stumbled across the ZDNet article that I knew I wasn’t in the wrong. Unfortunately you were really up the proverbial creek without a paddle if your Azure application was based on this as the temporary fixes for this issue, either disabling SSL for storage connections or usurping the certificate handler, left your application rather vulnerable to all sorts of nasty attacks. I’m one of the lucky few who could simply do without until it was fixed but it certainly highlighted the issues that can occur with PAAS architectures.
Honestly though that’s the only issue (that’s not been directly my fault) I’ve had with Azure since I started using it at the end of last year and comparing it to other cloud services it doesn’t fair too badly. It has made me think about what contingency strategy I’ll need to implement should any parts of the Azure infrastructure go away for a extended period of time though. For the moment I don’t think I’ll worry too much as I’m not going to be earning any income from the things I build on it but it will definitely be a consideration as I begin to unleash my products onto the world.
Working with Microsoft’s cloud over the past couple months has been a real eye opener. Whilst I used to scoff at all these people eschewing the norms that have (and continue to) serve us well in favor of the latest technology du’jour I’m starting to see the benefits of their ways, especially with the wealth of resources that Microsoft has on the subjects. Indeed the cloud aspects of my latest side project, whilst consuming a good chunk of time at the start, have required almost no tweaking whatsoever even after I change my data model or fundamental part of how the service works. There is one architectural issue that continues to bug me however and recent events have highlighted why it troubles me so.
The events I’m referring to are the recent outage to Amazon’s Elastic Block storage service that affected a great number of web services. In essence part of the cloud services that Amazon provides, in this case a cloud disk service that for all intents and purposes is the same as a hard drive in your computer, suffered a major outage in one of their availability zones. This meant for most users in that particular zone that relied on this service to store data their services would begin to fail just as your computer would if I ripped its hard drive out whilst you were using it. The cause of the events can be traced back to human error but it was significantly compounded by the high level of automation in the system, which would be needed for any kind of cloud service at this scale.
For a lot of the bigger users of Amazon’s cloud this wasn’t so much of an issue since they usually have replicas of their service on the geographically independent mirror of the cloud that Amazon runs. For a lot of users however their entire service is hosted in the single location, usually because they can’t afford the additional investment to geographically disperse their services. Additionally you have absolutely no control over any of the infrastructure so you leave yourself at the mercy of technicians of the cloud. Granted they’ve done a pretty good so far but you’re still outsourcing risk that you can’t mitigate, or at least not affordably.
My preferred way of doing the cloud, which I’ve liked ever since I started talking to VMware about their cloud offerings back in 2008, was to combine the ideas of both self hosted services with the extensibility of the cloud. For many services they don’t need all the power (nor the cost) of running multiple cloud instances and could function quite happily on a few co-hosted servers. Of course there are times when they’d require the extra power to service peak requests and that’s where the on-demand nature of cloud services would really shine. However apart from vCloud Express (which is barely getting trialled by the looks of things) none of the cloud operators give you the ability to host a small private cloud yourself and then offload to their big cloud as you see fit. Which is a shame since I think it could be one way cloud providers could wriggle their way into some of the big enterprise markets that have shunned them thus far.
Of course there are other ways of getting around the problems that cloud providers might suffer, most of which involve not using them for certain parts of your application. You could also build your app on multiple cloud platforms (there are some that even have compatible APIs now!) but that would add an inordinate amount of complexity to your solution, not to mention doubling the costs of developing it. The hybrid cloud solution feels like the best of both worlds however it’s highly unlikely that it’ll become a mainstream solution anytime soon. I’ve heard rumors of Microsoft providing something along those lines and their new VM Role offering certainly shows the beginnings of that becoming a reality but I’m not holding my breath. Instead I’ll code a working solution first and worry about scale problems when I get to scale, otherwise I’m just engaging in fancy procrastination.
Ah Melbourne, you’re quite the town. After spending a weekend visiting you for the weekend and soaking myself deep in your culture I’ve come to miss your delicious cuisine and exquisite coffee now that I’m back at my Canberran cubicle, but the memories of the trip still burn vividly in my mind. From the various pubs I frequented with my closest friends to perusing the wares of the Queen Victoria markets I just can’t get enough of your charm and, university admissions willing, I’ll be making you my home sometime next year. The trip was not without its dramas however and none was more far reaching than that of my attempt to depart the city of Melbourne via my airline of choice: Virgin Blue.
Whilst indulging in a few good pizzas and countless pints of Bimbo Blonde we discovered that Virgin Blue was having problems checking people in, resulting in them having to resort to manual check-ins. At the time I didn’t think it was such a big deal since initial reports hadn’t yet mentioned any flights actually being cancelled and my flight wasn’t scheduled to leave until 9:30PM that night. So we continued to indulge ourselves in the Melbourne life as was our want, cheerfully throwing our cares to the wind and ordering another round.
Things started to go all pear shaped when I thought I’d better check up on the situation and put a call into customer care hotline to see what the deal was. My first attempted was stonewalled by an automatic response stating that they weren’t taking any calls due to a large volume of people trying to get through. I managed to get into a queue about 30 minutes later and even then I was still on the phone for almost an hour before getting through. My attempts to get solid information out of them were met with the same response: “You have to go to the airport and then work it out from there”. Luckily for me and my travelling compatriots it was a public holiday on Monday so a delay, whilst annoying, wouldn’t be too devastating. We decided to proceed to the airport and what I saw there was chaos on a new level.
The Virgin check-in terminals were swamped with hundreds of passengers, all of them in varying levels of disarray and anger. Attempts to get information out of the staff wandering around were usually met with reassurance and directions to keep checking the information board whilst listening for announcements. On the way over I’d managed to work out that our flight wasn’t on the cancelled list so we were in with a chance, but seeing the sea of people hovering around the terminal didn’t give us much hope. After grabbing some quick dinner and sitting around for a while our flight number was called for manual check-ins and we lined up to get ourselves on the flight. You could see why so many flights had to be cancelled as boarding that one flight manually took them well over an hour, and that wasn’t even a full flight of passengers. 4 hours after arriving at the airport we were safe and sound in Canberra, which I unfortunately can’t say for the majority of people who chose Virgin as their carrier that day.
Throughout the whole experience all the blame was being squarely aimed at a failure in the IT system that took our their client facing check-in and online booking systems. Knowing a bit about mission critical infrastructure I remarked at how a single failure could take out a system like this, one that when it goes down costs them millions in lost business and compensation. Going through it logically I came to the conclusion that it had to be some kind of human failure that managed to wipe some critical shared infrastructure, probably a SAN that was live replicating to its disaster recovery site. I mean anything that has the potential to cause that much drama must have a recovery time less than a couple hours or so and it had been almost 12 hours since we first heard the reports of it being down.
As it turns out I was pretty far off the mark. Virgin just recently released an initial report of what happened and although it’s scant on the details what we’ve got to go on is quite interesting:
At 0800 (AEST) yesterday the solid state disk server infrastructure used to host Virgin Blue failed resulting in the outage of our guest facing service technology systems.
We are advised by Navitaire that while they were able to isolate the point of failure to the device in question relatively quickly, an initial decision to seek to repair the device proved less than fruitful and also contributed to the delay in initiating a cutover to a contingency hardware platform.
The service agreement Virgin Blue has with Navitaire requires any mission critical system outages to be remedied within a short period of time. This did not happen in this instance. We did get our check-in and online booking systems operational again by just after 0500 (AEST) today.
Navitaire are a subsidiary of Accenture, one of the largest suppliers of IT outsourcing in the world with over 177,000 employees worldwide and almost $22 billion in revenue. Having worked for one of their competitors (Unisys) for a while I know no large contract like this goes through without some kind of Service Level Agreement (SLA) in place which dictates certain metrics and their penalties should they not be met. Virgin has said that they will be seeking compensation for the blunder but to their credit they were more focused on getting their passengers sorted first before playing the blame game with Navitaire.
Still as a veteran IT administrator I can’t help but look at this disaster and wonder how it could have been avoided. A disk failure in a server is common enough that your servers are usually built around the idea of at least one of them failing. Additionally if this was based on shared storage there would have been several spare disks ready to take over in the event that one or more failed. Taking this all into consideration it appears that Navitaire had a single point of failure in the client facing parts of the system they had for Virgin and a disaster recovery process that hadn’t been tested prior to this event. All of these coalesced into an outage that lasted 21 hours when most mission critical systems like that wouldn’t tolerate anything more than 4.
Originally I had thought that Virgin had all their IT systems internal and this kind of outage seemed like pure incompetence. However upon learning about their outsourced arrangement I know exactly why this happened: profit. In an outsourced arrangement you’re always pressured to deliver exactly to the client’s SLAs whilst keeping your costs to a minimum, thereby maximising profit. Navitaire is no different and their cost saving measures meant that a failure in one place and a lack of verification testing in another lead to a massive outage to one of their big clients. Their other clients weren’t affected because they likely have independent systems for each client but I’d hazard a guess that all of them are at least partially vulnerable to the same outage that affected Virgin on the weekend.
In the end Virgin did handle the situation well all things considered, opting first to take care of their customers rather than pointing fingers right from the start. To their credit all the airport staff and plane crew stayed calm and collected throughout the ordeal and apart from the delayed check-in there was little difference between my flight down and the one back up. Hopefully this will trigger a review of their disaster recovery processes and end up with a more robust system for not only Virgin but all of Navitaire’s customers. It won’t mean much to us as customers as if that does happen we won’t notice anything, but it does mean that in the future such outages shouldn’t have such a big impact as the one of the weekend that just went by.