If you’re a developer like me you’ve likely got a set of expectations about the way you handle data. Most likely they all have their roots in the object-oriented/relational paradigm, meaning you’d expect to be able to get some insight into your data by running a few queries against it or by simply looking at the table, possibly sorting it to find something out. The day you decide to try out something like Azure Table storage, however, you’ll find that these tools simply aren’t available to you any more due to the nature of the service. It’s at this point that, if you’re like me, you’ll get a little nervous as your data can end up feeling like something of a black box.
A while back I posted about how I was over-thinking the scalability of my Azure application and how I was about to make the move to Azure SQL. That’s been my task for the past 3 weeks or so, and what started out as a relatively simple job of moving data from one storage mechanism to another has turned into a herculean task that has seen me dive deeper into both Azure Tables and SQL than I ever have before. Along the way I’ve found out a few things that, whilst not changing my mind about the migration away from Azure Tables, certainly would have made my life a whole bunch easier had I known about them.
1. If you need to query all the records in an Azure table, do it partition by partition.
The not-so-fun thing about Azure Tables is that unless you’re keeping track of your data in your application there are no real metrics you can dredge up to give you some idea of what you’ve actually got. For me this meant that I had one table that I knew the count of (due to some background processing I do using that table), but there were 2 others for which I had absolutely no idea how much data they contained. Estimates using my development database led me to believe there was an order of magnitude more data in there than I had thought, which in turn led me to the conclusion that using .AsTableServiceQuery() to return the whole table was doomed from the start.
However Azure Tables isn’t too bad at returning an entire partition’s worth of data, even if the records number in the tens or hundreds of thousands. Sure, the query time goes up linearly with the number of records you’ve got (as Azure Tables will only return a max of 1000 records at a time), but if they’re all within the same partition you avoid the troublesome table scan, which dramatically affects the performance of the query, sometimes to the point of it getting cancelled, something the default RetryPolicy framework doesn’t handle. If you need all the data in the entire table you can query each partition in turn, dump the results into a list inside your application and then run the rest of your query against that list.
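To make the partition-by-partition idea a bit more concrete, here’s a rough sketch of the loop. It’s written in Python against a made-up in-memory FakeTable rather than the real Azure client, so every name in it (FakeTable, query_partition, read_partition, read_table) is hypothetical and only the 1000-record page cap comes from the behaviour described above. It also assumes you already know your partition keys.

```python
from collections import defaultdict

PAGE_SIZE = 1000  # per-response cap described in the text


class FakeTable:
    """In-memory stand-in for a table service. Hypothetical, not the Azure SDK."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, partition_key, row_key, value):
        self._partitions[partition_key].append(
            {"PartitionKey": partition_key, "RowKey": row_key, "Value": value}
        )

    def query_partition(self, partition_key, continuation=0):
        """Return one page of a partition plus a token for the next page (or None)."""
        rows = self._partitions.get(partition_key, [])
        page = rows[continuation:continuation + PAGE_SIZE]
        next_token = continuation + PAGE_SIZE
        if next_token >= len(rows):
            next_token = None  # no more pages in this partition
        return page, next_token


def read_partition(table, partition_key):
    """Follow continuation tokens until the whole partition has been read."""
    rows = []
    token = 0
    while token is not None:
        page, token = table.query_partition(partition_key, token)
        rows.extend(page)
    return rows


def read_table(table, partition_keys):
    """Read every known partition in turn and collect the rows in one list."""
    all_rows = []
    for key in partition_keys:
        all_rows.extend(read_partition(table, key))
    return all_rows
```

Once read_table has returned you can sort, count or filter the list in memory, which is the "continue your query" step from above.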
2. Optimize your context for querying or updating/inserting records.
Unbeknownst to me the TableServiceContext class has quite a few configuration options available that will allow you to change the way the context behaves. The vast majority of errors I was experiencing came from my background processor which primarily dealt with reading data without making any modifications to the records. If you have applications where this is the case then it’s best to set the Context.MergeOption to MergeOption.NoTracking as this means the context won’t attempt to track the entities.
If you have multiple threads running or queries that return large numbers of records this can lead to a rather large improvement in performance, as the context doesn’t have to track any changes to them and the garbage collector can free up these objects even if you use the context for another query. Of course this means that if you do need to make any changes you’ll have to change the context’s setting and then attach the entity in question to it, but you’re probably doing that already. Or at least you should be.
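If it helps to see why tracking matters for memory, here’s a toy model of the idea. It’s Python rather than .NET and has nothing to do with the real TableServiceContext internals: ToyContext, Entity and survivors_after_query are names I’ve invented, and the only assumption is that a tracking context keeps a reference to every entity it hands back.

```python
import gc
import weakref


class Entity:
    """Trivial record type used only for this illustration."""

    def __init__(self, row_key):
        self.row_key = row_key


class ToyContext:
    """Hypothetical context: when tracking is on it remembers every entity."""

    def __init__(self, tracking=True):
        self.tracking = tracking
        self._tracked = []

    def query(self, count):
        entities = [Entity(str(i)) for i in range(count)]
        if self.tracking:
            # The context holds on to the results so it can detect changes later.
            self._tracked.extend(entities)
        return entities


def survivors_after_query(tracking, count=1000):
    """Count entities still alive after the caller drops its own references."""
    context = ToyContext(tracking=tracking)
    results = context.query(count)
    refs = list(map(weakref.ref, results))  # weak refs do not keep entities alive
    del results                             # the caller is finished with the data
    gc.collect()
    return sum(1 for ref in refs if ref() is not None)
```

With tracking on, every entity outlives the caller because the context still references it; with tracking off, they can all be collected even though the context itself is still around.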
3. Modify your web.config or app.config file to dramatically improve performance and reliability.
For some unknown reason the default number of HTTP connections that a Windows Azure application can make (although I get the feeling this affects all applications making use of the .NET Framework) is set to 2. Yes, just 2. This manifests itself as all sorts of crazy errors that don’t make a whole bunch of sense, like “the underlying connection was closed”, when you try to make more than 2 requests at any one time (which includes queries to Azure Tables). The max number of connections you can specify depends on the size of the instance you’re using, but Microsoft has a helpful guide on how to set this and other settings in order to make the most of it.
Additionally some of the guys at Microsoft have collected a bunch of tips for improving the performance of Azure Tables in various circumstances. I’ve cherry-picked the best ones, which I’ve confirmed have worked wonders for me, however there are a fair few more in there that might be of use to you, especially if you’re looking to get every performance edge you can. Many of them are circumstantial and some require you to plan out your storage architecture in advance (so they can’t be easily retrofitted into an existing app), but since the others have worked I’d hazard a guess that they would too.
I might not be making use of some of these tips now that my application is going to be SQL and TOPAZ, but if I can save anyone the trouble I went through trying to sort through all those esoteric errors I can at least say it was worth it. Some of these tips are just good to know regardless of the platform you’re on (like the default HTTP connection limit) and should be incorporated into your application as soon as it’s feasible. I’ve yet to get all my data into production as it’s still migrating, but I get the feeling I might go on another path of discovery with Azure SQL in the not too distant future and I’ll be sure to share my tips for it then.
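For reference, the connection limit normally lives under system.net/connectionManagement in your web.config or app.config. A minimal fragment might look like the one below. The value of 48 is purely a placeholder of mine, not a figure from Microsoft’s guide, so pick a number that suits your instance size.

```xml
<configuration>
  <system.net>
    <connectionManagement>
      <!-- address="*" applies the limit to every remote host; 48 is an illustrative value only -->
      <add address="*" maxconnection="48" />
    </connectionManagement>
  </system.net>
</configuration>
```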
I’m a really big fan of Microsoft’s development tools. No other IDE that I’ve used to date can hold a candle to the mighty Visual Studio, especially when you couple it with things like ReSharper and the massive online communities dedicated to overcoming any of the shortcomings that you might encounter along the way. The same communities are also responsible for developing many additional frameworks in order to extend the Microsoft platforms even further, with many of them making their way into official SDKs. There have only been a few times when I’ve found myself treading new ground with Microsoft tools which no one has before, but every time I have I’ve discovered so much more than I initially set out to.
I’ve come to call these encounters “black magic moments”.
You see, with the ease of developing with a large range of solutions already laid out for you, it becomes quite tempting to slip into the habit of seeking out a completed solution rather than building one of your own. Indeed there were a few design decisions in my previous applications that were driven by this, mostly because I didn’t want to dive under the hood of those solutions to develop the fix for my particular problem. It’s quite surprising how far you can get into developing something by doing this, but eventually the decisions you make will corner you into a place where you have to make a choice between doing some real development or scrapping a ton of work. Microsoft’s development ideals seem to encourage the latter (in favor of using one of their tried and true solutions) but stubborn engineers like me hate having to do rework.
This of course means diving beneath the surface of Microsoft’s black boxes and poking around to get an idea of what the hell is going on. My first real attempt at this was back in the early days of the Lobaco code base when I had decided that everything should be done via JSON. Everything was working out quite well until I started trying to POST a JSON object to my web service, whereupon it would throw out all sorts of errors about not being able to deserialize the object. I spent the better part of 2 days trying to figure that problem out and got precisely nowhere, eventually posting my frustrations to the Silverlight forums. Whilst I didn’t get the actual answer from there, they did eventually lead me down a path that got me there, but the solution is not documented anywhere, nor does it seem that anyone else has attempted such a feat before (or after, for that matter).
I hit another Microsoft black magic moment when I was working on my latest project that I had decided would be entirely cloud based. After learning my way around the ins and outs of the Windows Azure platform I took it upon myself to migrate the default authentication system built into ASP.NET MVC 3 onto Microsoft’s cloud. Thanks to a couple of handy tutorials the process of doing so seemed fairly easy so I set about my task, converting everything into the cloud. However upon attempting to use the thing I had just created I was greeted with all sorts of random errors and no amount of massaging the code would set it straight. After the longest time I found that it came down to a nuance of the Azure Tables storage part of Windows Azure, namely the way it structures data.
In essence Azure Tables is one of them newfangled NoSQL type databases and as such it relies on a couple of properties in your object class to uniquely identify a row and provide scalability. These two properties are called PartitionKey and RowKey, and whilst you can leave them alone and your app will still work, it won’t be able to leverage any of the cloud goodness. So in my implementation I had overridden these properties in order to get the scalability that I wanted but had neglected to include any setters for them. This didn’t seem to be a problem when storing objects in Azure Tables, but when querying them it seems that Azure requires the setters to be there, even if they do nothing at all. Adding one in fixed nearly every problem I was encountering and brought me back to another problem I had faced in the past (more on that when I finally fix it!).
Like any mature framework that does a lot of the heavy lifting for you, Microsoft’s solutions suffer when you start to tread unknown territory. Realistically though this should be expected, and I’ve found I spend the vast majority of my time on less than 20% of the code that ends up making the final solution. The upshot is of course that once these barriers are down progress accelerates at an extremely rapid pace, as I saw with both the Silverlight and iPhone clients for Lobaco. My cloud authentication services are nearly ready for prime time and since I struggled so much with this I’ll be open sourcing my solution so that others can benefit from the numerous hours I spent on this problem. It will be my first ever attempt at open sourcing something that I created and the prospect both thrills and scares me, but I’m looking forward to giving back a little to the communities that have given me so much.
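The setter problem is easier to see with a small analogy. The sketch below is Python, not the .NET storage client, and the hydrate function is a made-up stand-in for whatever the client does when it turns a returned row back into an object. The assumption is simply that it assigns every returned field onto the object by name, including PartitionKey and RowKey.

```python
class ReadOnlyKeysEntity:
    """Keys are computed from other fields and have no setters."""

    def __init__(self):
        self.category = ""
        self.item_id = ""

    @property
    def PartitionKey(self):
        return self.category

    @property
    def RowKey(self):
        return self.item_id


class SettableKeysEntity:
    """Same computed keys, but with setters that deliberately do nothing."""

    def __init__(self):
        self.category = ""
        self.item_id = ""

    @property
    def PartitionKey(self):
        return self.category

    @PartitionKey.setter
    def PartitionKey(self, value):
        pass  # ignored: the key is always derived from category

    @property
    def RowKey(self):
        return self.item_id

    @RowKey.setter
    def RowKey(self, value):
        pass  # ignored: the key is always derived from item_id


def hydrate(entity_type, row):
    """Hypothetical materializer: copy every returned field onto a new object."""
    entity = entity_type()
    for name, value in row.items():
        setattr(entity, name, value)  # raises AttributeError if there is no setter
    return entity
```

Saving works with either class because only the getters are read, but hydrating a ReadOnlyKeysEntity fails as soon as the key fields are assigned, whereas the do-nothing setters let SettableKeysEntity through.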