I heap a lot of praise on Windows Azure here, enough for me to start thinking about how that’s making me sound like a Microsoft shill, but honestly I think it’s well deserved. As someone who’s spent the better part of a decade setting up infrastructure for applications to run on and then began developing said applications in its spare time I really do appreciate not having to maintain another set of infrastructure. Couple that with the fact that I’m a full Microsoft stack kind of guy it’s really hard to beat the tight integration between all of the products in the cloud stack, from the development tools to the back end infrastructure. So like many of my weekends recently I spent the previous coding away on the Azure platform and it was filled with some interesting highs and rather devastating lows.
For the uninitiated Azure Web Sites are essentially a cut down version of the Azure Web Role allowing you to run pretty much full scale web apps for a fraction of the cost. Of course this comes with limitations and unless you’re running on at the Reserved tier you’re essentially sharing a server with a bunch of people (I.E. a common multi-tenant scenario). For this site, which isn’t going to receive a lot of traffic, it’s perfect and I wanted to deploy the first run app onto this platform. Like any good admin I simply dove in head first without reading any documentation on the process and to my surprise I was up and running in a matter of minutes. It was pretty much create web site, download publish profile, click Publish in Visual Studio, import profile and wait for the upload to finish.
Deploying a web site on my own infrastructure would be a lot more complicated as I can’t tell you how many times I’ve had to chase down dependency issues or missing libraries that I have installed on my PC but not on the end server. The publishing profile coupled with the smarts in Visual Studio was able to resolve everything (the deployment console shows the whole process, it was actually quite cool to watch) and have it up and running at my chosen URL in about 10 minutes total. It’s very impressive considering this is still considered preview level technology, although I’m more inclined to classify it as a release candidate.
Other Azure users can probably guess what I’m going to write about next. Yep, the horrific storage problems that Azure had for about 24 hours.
I noticed some issues on Friday afternoon when my current migration (yes that one, it’s still going as I write this) started behaving…weird. The migration is in its last throws and I expected the CPU usage to start ramping down as the multitude of threads finished their work and this lined up with what I was seeing. However I noticed the number of records migrated wasn’t climbing up at the rate it was previously (usually indicative of some error happening that I suppressed in order for the migration to run faster) but the logs showed that it was still going, just at a snail’s pace. Figuring it was just the instance dying I reimaged it and then the errors started flooding in.
Essentially I was disconnected from my NOSQL storage so whilst I could browse my migrated database I couldn’t keep pulling records out. This also had the horrible side effect of not allowing me to deploy anything as it would come back with SSL/TLS connection issues. Googling this led to all sorts of random posts as the error is also shared by the libraries that power the WebClient in .NET so it wasn’t until I stumbled across the ZDNet article that I knew I wasn’t in the wrong. Unfortunately you were really up the proverbial creek without a paddle if your Azure application was based on this as the temporary fixes for this issue, either disabling SSL for storage connections or usurping the certificate handler, left your application rather vulnerable to all sorts of nasty attacks. I’m one of the lucky few who could simply do without until it was fixed but it certainly highlighted the issues that can occur with PAAS architectures.
Honestly though that’s the only issue (that’s not been directly my fault) I’ve had with Azure since I started using it at the end of last year and comparing it to other cloud services it doesn’t fair too badly. It has made me think about what contingency strategy I’ll need to implement should any parts of the Azure infrastructure go away for a extended period of time though. For the moment I don’t think I’ll worry too much as I’m not going to be earning any income from the things I build on it but it will definitely be a consideration as I begin to unleash my products onto the world.
Windows Azure Tables are one of those newfangled NoSQL type databases that excels in storing giant swaths of structured data. For what they are they’re quite good as you can store very large amounts of data in there without having to pay through the nose like you would for a traditional SQL server or an Azure instance of SQL. However that advantage comes at a cost: querying the data on anything but the partition key (think of it as a partition of the data within a table) and the row key (the unique identifier within that partition) results in queries that take quite a while to run, especially when compared to its SQL counter parts. There are ways to get around this however no matter how well you structure your data eventually you’ll run up against this limitation and that’s where things start to get interesting.
By default whenever you do a large query against an Azure Table you’ll only get back 1000 records, even if the query will return more. However if your query did have more results than that you’ll be able to access them via a continuation token that you can add to your original query, telling Azure that you want the records past that point. For those of us coding on the native .NET platform we get the lovely benefit of having all of this handled for us directly by simply adding .AsTableServiceQuery() to the end of our LINQ statements (if that’s what you’re using) which will handle the continuation tokens for us. For most applications this is great as it means you don’t have to fiddle around with the rather annoying way of extracting those tokens out of the response headers.
Of course that leads you down the somewhat lazy path of not thinking about the kinds of queries you’re running against your Tables and this can lead to problems down the line. Since Azure is a shared service there are upper limits on how long queries can run and how much data they can return to you. These limits aren’t exactly set in stone and depending on how busy the particular server you’re querying is or the current network utilization at the time your query could either take an incredibly long time to return or could simply end up getting closed off. Anyone who’s developed for Azure in the past will know that this is pretty common, even for the more robust things like Azure SQL, but there’s one thing that I’ve noticed over the past couple weeks that I haven’t seen mentioned anywhere else.
As the above paragraphs might indicate I have a lot of queries that try and grab big chunks of data from Azure Tables and have, of course, coded in RetryPolicies so they’ll keep at it if they should fail. There’s one thing that all the policies in the world won’t protect you from however and that’s connections that are forcibly closed. I’ve had quite a few of these recently and I noticed that they appear to come in waves, rippling through all my threads causing unhandled exceptions and forcing them to restart themselves. I’ve done my best to optimize the queries since then and the errors have mostly subsided but it appears that should one long running query trigger Azure to force the connection closed all connections from that instance to the same Table storage will also be closed.
Depending on how your application is coded this might not be an issue however for mine, where the worker role has about 8 concurrent threads running at any one time all attempting to access the same Table Storage account, it means one long running query that gets terminated triggers a cascade of failures across the rest of threads. For the most part this was avoided by querying directly on row and partition keys however the larger queries had to be broken up using the continuation tokens and then the results concatenated in memory. This introduces another limit on particular queries (as storing large lists in memory isn’t particularly great) which you’ll have to architect your code around. It’s by no means an unsolvable problem however it was one that has forced me to rethink certain parts of my application which will probably need to be on Azure SQL rather than Azure Tables.
Like any cloud platform Azure is a great service which requires you to understand what its various services are good for and what they’re not. I initially set out to use Azure Tables for everything and have since found that it’s simply not appropriate for that, especially if you need to query on parameters that aren’t the row or partition keys. If you have connections being closed on you inexplicably be sure to check for any potentially long running queries on the same role as this post can attest they could very well be the source of what ales you.
Today I was trying to do some fancy spreadsheet work using Excel. Now I’m usually pretty decent at this and I’ll be able to figure out if something can be done within a couple minutes or not. If I can do it great, off I go, if not I’ll usually have to code something myself. What really irks me is when something is supposed to work that doesn’t as was the case today. I was trying to do a simple VLOOKUP and for the life of me I couldn’t figure out why the data wasn’t coming out as I expected. That was until I looked carefully at the first couple lines….
There was a leading space on each line, which in Excel’s eyes make it completely different.
The engineers in the room will recognise this as a classic example (well extrapolation) of an off-by-one error. In essence everything about my system was perfect but a simple mistake of not sanitising the data completely ruined all my hard work, and left me chasing my tail for about an hour. This wasn’t the first time I’ve encountered these problems either, and even the most adept programmer/engineer can fall to these problems. In fact I once even stumped my university professors with one.
Rewind back about 7 years to when I was a fresh faced teenager blundering his way through the first year of university. Our first ever programming assignment was to develop, test and implement a simple cipher program. From memory I think it was a Caesar Cipher which is a trivial yet excellent way to show people the ropes of programming. All of the testing was done automatically so you had to be spot on, there was no wiggle room here.
So I coded and I coded and eventually I came up with what appeared to be a working cipher. I hand encoded some text to make sure I had some data I could compare to, and the initial results were good. Everything seemed to be working ok until I tried some longer and more unusual words. The really strange thing was that part of the text would cipher properly and other parts wouldn’t. After showing my code to the university tutor and lecturer I was met with complete disbelief, and they told me to try recoding it from scratch. Since I had a bit of time between classes (4 hour breaks in the middle of the day, yay! :() I attempted it once more, only to hit the same road blocks.
My code was absolutely bullet proof by the end. All bounds were checked and I thought everything was in order, but still I couldn’t shake the nagging half working cipher. It all became clear on the next assignment in which we had to build on our previous work. After not looking at the code for 2 weeks my eyes were fresh and I happened upon one line of code that looked something like this:
string chars = ‘ABCDEFGHIJKLMONPQRSTUVXYZ’;
Go on, figure out why that one line caused the cipher to break. It’s staring right at you!
Alright give up? The W is missing. This is why it would work on some words (anything without a W or any letter after it in the alphabet in it was fine) and not on others. The reason why no one picked it up is due to the way the UV looks very similar to W, well at least on the consoles we were programming on. Adding that one letter in fixed everything, and my next assignment worked flawlessly.
It’s a testament to the fine line we walk sometimes in the engineering field. One small mistake in the right spot can cause the whole system to fall in on itself, and that’s why testing and retesting something is so important. That day taught me that sometimes it’s best to leave a problem for a day in order to look upon it again with fresh eyes, something that I hadn’t really thought of before. It also reminds me of an old quote from the father of the modern computer, Charles Babbage:
On two occasions I have been asked,—”Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
Computers are always right, even when they’re wrong