Microsoft’s message last year was pretty clear: we’re betting big that you’ll be using Azure as part of your environment and we’ve got a bunch of tools to make that happen. For someone who has cloudy aspirations this was incredibly exciting even though I was pretty sure that my main client, the Australian government, would likely abstain from using any of them for a long time. This year’s TechEd seemed a little more subdued than last year’s (the lack of a Bond-style entrance with its accompanying Aston Martin was the first indicator of that) with the heavy focus on cloud remaining, albeit with a bent towards the mobile world.
Probably the biggest new feature to come to Azure is ExpressRoute, a service which allows you to connect directly to the Azure cloud without having to go over the Internet. For companies that have regulations around their data and the networks it can traverse this gives them the opportunity to use cloud services whilst still maintaining their obligations. For someone like me who primarily works with government this is a godsend and once the Azure instance comes online in Australia I’ll finally be able to sell it as a viable solution for many of their services. It will still take them some time to warm to the idea but with a heavy focus on finding savings, something Azure can definitely provide, I’m sure the adoption rate will be a lot faster than it has been with previous innovations of this nature.
The benefits of Azure Files on the other hand are less clear as whilst I can understand the marketing proposition it’s not that hard to set up a file server within Azure. This is made somewhat more pertinent by the fact that it uses SMB 2.1 rather than Server 2012’s SMB 3.0 so whilst you get some good features in the form of a REST API and all the backing behind Azure’s other forms of storage it lacks many of the new base capabilities that a traditional file server has. Still Microsoft isn’t one to develop a feature unless they know there’s a market for it, so I’d have to guess that this is a feature that many customers have been begging for.
In a similar vein the improvements to Microsoft’s BYOD offerings appear to be incremental more than anything, with InTune receiving some updates and the introduction of Azure RemoteApps. Of the two Azure RemoteApps is the more interesting as it allows you to deliver apps from the Azure cloud to your end points, wherever they may be. For large, disparate organisations this will be great as you can leverage Azure to deploy to any of your offices, negating the need for heavy infrastructure in order to provide a good user experience. There’s also the opportunity for Microsoft to offer pre-packaged applications (which they’re currently doing with Office 2013) although that’s somewhat at odds with their latest push for Office365.
Notably absent from any of the announcements was Windows 8.2 or Server 2012 R3, something which I think many of us had expected to hear rumblings about. There’s still the chance it will get announced at TechEd Australia this year especially considering the leaked builds that have been doing the rounds. If they don’t it’d be a slight departure from the tempo they set last year, something which I’m not entirely sure is a good or bad move from them.
Overall this feels like incremental improvements to the strategy Microsoft was championing last year more than revolutionary change. That’s not a bad thing really as the enterprise market is still catching up with Microsoft’s newfound rapid pace and likely won’t be on par with them for a few years yet. Still it raises the question of whether or not Microsoft is really committed to the rapid refresh program they kicked off not too long ago. TechEd Australia has played host to some big launches in the past so seeing Windows 8.2 for the first time there isn’t out of the question. As for us IT folk the message seems to remain the same: get on the cloud soon, and make sure it’s Azure.
The public cloud is a great solution to a wide selection of problems however there are times when its use is simply not appropriate. This is typical of organisations who have specific requirements around how their data is handled, usually due to data sovereignty or regulatory compliance. However whilst the public cloud is a great way to bolster your infrastructure on the cheap (although that’s debatable when you start ramping up your VM size) it doesn’t take advantage of the current investments in infrastructure that you’ve already made. For large, established organisations this is not insignificant and is why many of them were reluctant to transition fully to public cloud based services. This is why I believe the future of the cloud will be paved with hybrid solutions, something I’ve been saying for years now.
Microsoft has finally shown that they’ve understood this with the release of Windows Azure Pack for Server 2012 R2. Sure there were the beginnings of it with SCVMM 2012 allowing you to add in your Azure account and move VMs up there but that kind of thing has been available for ages through hosting partners. The Azure Pack on the other hand brings features that were hidden behind the public cloud wall down to the private level, allowing you to make full use of them without having to rely on Azure. If I’m honest I thought that Microsoft would probably be the only ones to try this given their presence in both the cloud and enterprise space but it seems other companies have begun to notice the hybrid trend.
Google has been working with the engineers at Red Hat to produce the Test Compatibility Kit for Google App Engine. Essentially this kit provides the framework for verifying the API level functionality of a private Google App Engine implementation, something which is achievable through an application called CapeDwarf. The vast majority of the App Engine functionality is contained within that application, enough so that current developers on the platform could conceivably run their code on on-premises infrastructure if they so wished. There doesn’t appear to be a bridge between the two currently, like there is with Azure, as CapeDwarf utilizes its own administrative console.
They’ve done the right thing by partnering with Red Hat as otherwise they’d lack the penetration in the enterprise market to make this a worthwhile endeavour. I don’t know how much presence JBoss/OpenShift has though so it might be less about using current infrastructure and more about getting Google’s platform into more places than it currently is. I can’t seem to find any solid¹ market share figures to see how Google currently rates compared to the other primary providers but I’d hazard a guess they’re similar to Azure, i.e. far behind Rackspace and Amazon. The argument could be made that such software would hurt their public cloud product but I feel these kinds of solutions are the foot in the door needed to get organisations thinking about using these services.
Whilst my preferred cloud is still Azure I’m a firm believer that the more options we have to realise the hybrid dream the better. We’re still a long way from having truly portable applications that can move freely between private and public platforms but the roots are starting to take hold. Given the rapid pace of IT innovation I’m confident that the next couple of years will see the hybrid dream fully realised and then I’ll finally be able to stop pining for it.
¹This article suggests that Microsoft has 20% of the market which, since Microsoft has raked in $1 billion, would peg the total market at some $5 billion total which is way out of line with what Gartner says. If you know of some cloud platform figures I’d like to see them as apart from AWS being number 1 I can’t find much else.
After spending a week deep in the bowels of Microsoft’s premier tech conference and writing about it breathlessly for Lifehacker Australia you’d be forgiven for thinking I’m something of a Microsoft shill. It’s true that I think the direction they’re going in for their infrastructure products is pretty spectacular and the excitement for those developments is genuine. However if you’ve been here for a while you’ll know that I’m also among their harshest critics, especially when they do something that’s drastically out of line with my expectations as one of their consumers. Still I believe in giving credit where it’s due and a recent PA Report article has brought Microsoft’s credentials in one area into question when they honestly shouldn’t be.
The article I’m referring to is this one:
I’m worried that there are going to be a few million consoles trying to dial into the home servers on Christmas morning, about the time when a mass of people begin to download new games through Microsoft’s servers. Remember, every game will be available digitally day and date of the retail version, so you’re going to see a spike in the number of people who buy their Xbox One games online.
I’m worried about what happens when that new Halo or Call of Duty is released and the system is stressed well above normal operating conditions. If their system falls, no matter how good our Internet connections, we won’t be able to play games.
Taken at face value this appears to be a fair comment. We can all remember times when the Xbox Live service came down in a screaming heap, usually around Christmas time or when a large release happened. Indeed even a quick Google search reveals there have been a couple of outages in recent memory, although digging deeper into them reveals that they were usually part of routine maintenance and only affected small groups of people at a time. With all the other criticism that’s being levelled at Microsoft of late (most of which I believe is completely valid) it’s not unreasonable to question their ability to keep a service of this scale running.
However as the title of this post alludes to I don’t think that’s going to be an issue.
The picture shown above is from the Windows Azure Internals session by Mark Russinovich which I attended last week at TechEd North America. It details the current infrastructure that underpins the Windows Azure platform which powers all of Microsoft’s sites including the Xbox Live service. If you have a look at the rest of the slides from the presentation you’ll see how far that architecture has come since they first introduced it 5 years ago when the over-subscription rates were much, much higher for the entire Azure stack. What this meant was that when something big happened the network simply couldn’t handle it and caved under the pressure. With this current generation of the Azure infrastructure however it’s far less oversubscribed and has several orders of magnitude more servers behind it. With that in mind it’s far less likely that Microsoft will struggle to service large spikes like they have done in the past as the capacity they have on tap is just phenomenal.
Of course this doesn’t alleviate the issues with the always/often on DRM or the myriad of other issues that people are criticizing the Xbox One for but it should show you that Microsoft’s ability to run a reliable service shouldn’t be one of your worries. Granted I’m just approaching this from an infrastructure point of view and it’s entirely possible for the Xbox Live system to have some systemic issue that will cause it to fail no matter how much hardware they throw at it. I’m not too concerned about that however as Microsoft isn’t your run of the mill startup that’s just learning how to scale.
I guess we’ll just have to wait and see how right or wrong I am.
Since my side projects (including this blog) don’t really have any kind of revenue generation potential I tend to shy away from spending a lot on them, if I can avoid it. This blog is probably the most extravagant of the lot, getting its own dedicated server which, I’ll admit, is overkill but I’d had such bad experiences with shared providers before that I’m willing to bear the cost. Cloud hosting on the other hand can get nightmarishly expensive if you don’t keep an eye on it and that was the exact reason I shied away from it for any of my side projects. That was until I got accepted into the Microsoft BizSpark program which came with a decent amount of free usage, enough for me to consider it for my next application.
The Azure benefits for BizSpark are quite decent with a smattering of all their offerings chucked in which would easily be enough to power a nascent start up’s site through the initial idea verification stage. That’s exactly what I’ve been using it for and, as longtime readers will tell you, my experiences have been fairly positive with most of the issues arising from my misappropriation of different technologies. The limits, as I found out recently, are hard and running up against them causes all sorts of undesirable behaviour, especially if you run up against your compute or storage limit. I managed to run up against the former due to a misunderstanding of how a preview technology was billed but I hadn’t hit the latter until last week.
So the BizSpark benefits are pretty generous for SQL storage, giving you access to a couple 5GB databases (or a larger number of smaller 1GB ones) gratis. That sounds like a lot, and indeed it should be sufficient for pretty much any burgeoning application, however mine is based around gathering data from another site and then performing some analytics on it so the amount of data I have is actually quite large. In the beginning this wasn’t much of a problem as I had a lot of headroom however after I made a lot of performance improvements I started gathering data at a much faster rate and the 5GB limit loomed over me. In the space of a couple weeks I managed to fill it completely and had to shut it down lest my inbox get filled with “Database has reached its quota” errors.
Looking over the database in the Azure management studio (strangely one of the few parts of the Azure that still uses Silverlight) showed that one particular table was consuming the majority of the database. Taking a quick look at the rows it was pretty obvious as to why this was the case, I had a couple columns that had lengthy URLs in them and over the 6 million or so records I had this amounted to a huge amount of space being used. No worries I thought, SQL has to have some kind of built in compression to deal with this and so off I went looking for an easy solution.
As it turns out SQL Server does and its implementation would’ve provided the benefits I was looking for without much work on my end. However Azure SQL doesn’t support it and the current solution to this is to implement row based compression inside your application. If you’re straight up dumping large XML files or giant wads of text into SQL rows then this might be of use to you, however if you’re trying to compress data at a page level then you’re out of luck, unless you want to code an extravagant solution (like creating a compression dictionary table in the same database, but that’s borderline psychotic if you ask me).
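To make the application-side option a little more concrete, here is a minimal sketch of compressing a large text value before it goes into a binary column. It uses Python’s standard zlib module purely for illustration; the function names and the sample payload are assumptions of mine, not anything Azure SQL provides.

```python
import zlib


def compress_text(value: str) -> bytes:
    """Compress a large text field so it can be stored in a binary column."""
    return zlib.compress(value.encode("utf-8"), level=9)


def decompress_text(blob: bytes) -> str:
    """Reverse compress_text when the row is read back."""
    return zlib.decompress(blob).decode("utf-8")


# Hypothetical payload: a repetitive XML document, the kind of data
# where per-row compression pays off.
sample = "<record>" + "<url>http://example.com/items?page=1</url>" * 50 + "</record>"
packed = compress_text(sample)

print(len(sample.encode("utf-8")), "bytes before,", len(packed), "bytes after")
assert decompress_text(packed) == sample
```

The catch is the one described above: each row is compressed in isolation, so short values such as individual URLs gain little or nothing, and repetition across rows (which page compression would exploit) is not captured at all.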
The solution for me was to move said problem table into its own database and, during the migration, trim out all the fat contained within the data. There were multiple columns I never ended up using, the URL fields were all very similar and the largest column, the one most likely causing me to chew through so much space, was no longer needed now that I was able to query that data properly rather than having to work around Azure Table Storage’s limitations. Page compression would’ve been an easy quick fix but it would’ve only been a matter of time before I found myself in the same situation, struggling to find space where I could get it.
For me this experience aptly demonstrated why it’s good to work within strict constraints as left unchecked these issues would’ve hit me much harder later on. Sure it can feel like I’m spinning my wheels when hitting issues like this is a monthly occurrence but I’m still in the learning stage of this whole thing and lessons learned now are far better than ones I learn when I finally move this thing into production.
As longtime readers will know I’m quite keen on Microsoft’s Azure platform and whilst I haven’t released anything on it I have got a couple projects running on it right now. For the most part it’s been great as previously I’d have to spend a lot of time getting my development environment right and then translate that onto another server in order to make sure everything worked as expected. Whilst this wasn’t beyond my capability it was more time burnt in activities that weren’t pushing the project forward and was often the cause behind me not wanting to bother with them anymore.
Of course as I continue down the Azure path I’ve run into the many different limitations, gotchas and ideology clashes that have caused me several headaches over the past couple of years. I think most of them can be traced back to my decision to use Azure Table Storage: my first post on Azure development was about how I ran up against some limitations I wasn’t completely aware of, and this continued with several more posts dedicated to overcoming the shortcomings of Microsoft’s NoSQL storage backend. Since then I’ve delved into other aspects of the Azure platform but today I’m not going to talk about any of the technology per se. No, today I’m going to tell you about what happens when you hit your subscription/spending limit, something which can happen with only a couple of mouse clicks.
I’m currently on a program called Microsoft BizSpark, a kind of partner program whereby Microsoft and several other companies provide resources to people looking to build their own start ups. Among the many awesome benefits I get from this (including an MSDN subscription that gives me access to most of the Microsoft catalogue of software, all for free) Microsoft also provides me with an Azure subscription that gives me access to a certain amount of resources. Probably the best part of this offer is the 1500 hours of free compute time which allows me to run 2 small instances 24/7. Additionally I’ve also got access to the upcoming Azure Websites functionality which I used for a website I developed for a friend’s wedding. However just before the wedding was about to go ahead the website suddenly became unavailable and I went to investigate why.
As it turned out I had somehow hit my compute hours limit for that month, which results in all your services being suspended until the rollover period. It appears this was due to me switching the website from the free tier to the shared tier, which then counts as consuming compute hours whenever someone hits the site. Removing the no-spend block on it did not immediately resolve the issue, however a support query to Microsoft saw the website back online within an hour. However my other project, the one that would be chewing up the lion’s share of those compute hours, seemed to have up and disappeared even though the environment was still largely intact.
This is in fact expected behaviour for when you hit either your subscription or spending limit for a particular month. Suspended VMs on Windows Azure don’t count as being inactive and will thus continue to cost you money even whilst they’re not in use. To get around this, should you hit your spending limits those VMs will be deleted, saving you money but also causing some potential data loss. Now this might not be an issue for most people, for me all it entailed was republishing them from Visual Studio, but should you be storing anything critical on the local storage of an Azure role it will be gone forever. Whilst the nature of the cloud should make you wary of storing anything outside of permanent storage (like Azure Tables, SQL or blob storage) it’s still a gotcha that you probably wouldn’t be aware of until you ran into a situation similar to mine.
Like any platform there are certain aspects of Windows Azure that you have to plan for and chief among them is your spending limits. It’s pretty easy to simply put in your credit card details and then go crazy by provisioning as many VMs as you want but sooner or later you’ll be looking to put limits on it and it’s then that you have the potential to run into these kinds of issues.
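As a back-of-the-envelope check on how little headroom the 1500 hour allowance leaves once 2 small instances are running around the clock, here is a small sketch. It assumes each instance is billed one compute hour for every hour it is deployed; how the shared website tier was actually metered is not modelled, and the function names are my own.

```python
HOURS_PER_DAY = 24
ALLOWANCE = 1500  # free compute hours per month, as quoted for BizSpark above


def monthly_instance_hours(instances: int, days_in_month: int) -> int:
    """Compute hours used by instances that run 24/7 for the whole month."""
    return instances * HOURS_PER_DAY * days_in_month


def hours_remaining(allowance: int, instances: int, days_in_month: int) -> int:
    """Headroom left in the allowance after the always-on instances."""
    return allowance - monthly_instance_hours(instances, days_in_month)


for days in (30, 31):
    used = monthly_instance_hours(2, days)
    left = hours_remaining(ALLOWANCE, 2, days)
    print(f"{days}-day month: {used} hours used, {left} hours left")
```

Under that assumption a 31 day month leaves only 12 spare hours, so anything else that starts consuming compute time can tip the subscription over its limit well before the rollover.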
If you’ve ever worked in a multi-tenant environment with shared resources you’ll know of the many pains that can come along with it. Resource sharing always ends up leading to contention and some of the time this will mean that you won’t be able to get access to the resources you want. For cloud services this is par for the course since you’re always accessing shared services, so any application you build on these kinds of platforms has to take this into consideration lest it spend an eternity crashing from random connection drop outs. Thankfully Microsoft has provided a few frameworks which will handle these situations for you, especially in the case of Azure SQL.
The Transient Fault Handling Application Block (or Topaz, which is a lot better in my view) gives you access to a number of classes which take out a lot of the pain when dealing with the transient errors you get when using Azure services. Of those the most useful one I’ve found is the RetryPolicy which, when instantiated with the SqlAzureTransientErrorDetectionStrategy, allows you to simply wrap your database transactions with a little bit of code in order to make them resistant to the pitfalls of Microsoft’s cloud SQL service. For the most part it works well as prior to using it I’d get literally hundreds of unhandled exception messages per day. It doesn’t catch everything however so you will still need to handle some connection errors but it does a good job of eliminating the majority of them.
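For anyone who hasn’t seen the pattern before, the idea behind a retry policy can be sketched in a few lines. This is not Topaz’s API: TransientError, execute_with_retry and the flaky_query stand-in are all hypothetical names, and the real block decides what counts as transient by inspecting SQL error codes rather than an exception type.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a dropped connection or throttling error."""


def execute_with_retry(action: Callable[[], T], retries: int = 3, delay: float = 0.1) -> T:
    """Run action, retrying only when it fails with a transient error."""
    attempt = 0
    while True:
        try:
            return action()
        except TransientError:
            if attempt >= retries:
                raise  # out of retries, let the caller deal with it
            attempt += 1
            time.sleep(delay * attempt)  # wait a little longer each time


# Simulated query that drops its connection twice before succeeding.
attempts = {"count": 0}


def flaky_query() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("connection dropped")
    return "42 rows"


print(execute_with_retry(flaky_query, delay=0.01), "after", attempts["count"], "attempts")
```

Anything that isn’t recognised as transient propagates immediately, which matches the experience above: the wrapper eliminates most of the noise but the remaining errors still need handling of their own.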
Currently however there’s no native support for it in Entity Framework (Microsoft’s data persistence framework) and this means you have to do a little wrangling in order to get it to work. This StackOverflow question outlines the problem and there are a couple of solutions on there which all work, however I went for the simple route of instantiating a RetryPolicy and then just wrapping all my queries with ExecuteAction. As far as I could tell this all works fine and is the supported way of using EF with Topaz, at least until EF 6 comes out which will have in built support for connection resiliency.
However when using Topaz in this way it seems that it mucks with entity tracking, causing returned objects to not be tracked in the normal way. I discovered this after I noticed many records not getting updated even though manually working through the data showed that they should be showing different values. As far as I can tell if you wrap an EF query with a RetryPolicy the entity ends up not being tracked and you will need to .Attach() to it prior to making any changes. If you’ve used EF before then you’ll see why this is strange as you usually don’t have to do that unless you’ve deliberately detached the entity or recreated the context. So as far as I can see there must be something in Topaz that causes it to become detached requiring you to reattach it if you want to persist your changes using Context.SaveChanges().
I haven’t tested any of the other methods of using Topaz with EF so it’s entirely possible there’s a way to get the entity tracked properly without having to attach to it after performing the query. Whether they work or not will be an exercise left for the reader as I’m not particularly interested in testing it, at least not just after I got it all working again. By the looks of it though a RC version of EF 6 might not be too far away, so this issue probably won’t remain one for long.
I heap a lot of praise on Windows Azure here, enough for me to start thinking about how that’s making me sound like a Microsoft shill, but honestly I think it’s well deserved. As someone who spent the better part of a decade setting up infrastructure for applications to run on and then began developing said applications in my spare time I really do appreciate not having to maintain another set of infrastructure. Couple that with the fact that I’m a full Microsoft stack kind of guy and it’s really hard to beat the tight integration between all of the products in the cloud stack, from the development tools to the back end infrastructure. So like many of my weekends recently I spent the previous one coding away on the Azure platform and it was filled with some interesting highs and rather devastating lows.
For the uninitiated Azure Web Sites are essentially a cut down version of the Azure Web Role, allowing you to run pretty much full scale web apps for a fraction of the cost. Of course this comes with limitations and unless you’re running at the Reserved tier you’re essentially sharing a server with a bunch of people (i.e. a common multi-tenant scenario). For this site, which isn’t going to receive a lot of traffic, it’s perfect and I wanted to deploy the first run app onto this platform. Like any good admin I simply dove in head first without reading any documentation on the process and to my surprise I was up and running in a matter of minutes. It was pretty much create web site, download publish profile, click Publish in Visual Studio, import profile and wait for the upload to finish.
Deploying a web site on my own infrastructure would be a lot more complicated as I can’t tell you how many times I’ve had to chase down dependency issues or missing libraries that I have installed on my PC but not on the end server. The publishing profile coupled with the smarts in Visual Studio was able to resolve everything (the deployment console shows the whole process, it was actually quite cool to watch) and have it up and running at my chosen URL in about 10 minutes total. It’s very impressive considering this is still considered preview level technology, although I’m more inclined to classify it as a release candidate.
Other Azure users can probably guess what I’m going to write about next. Yep, the horrific storage problems that Azure had for about 24 hours.
I noticed some issues on Friday afternoon when my current migration (yes that one, it’s still going as I write this) started behaving…weird. The migration is in its last throes and I expected the CPU usage to start ramping down as the multitude of threads finished their work, which lined up with what I was seeing. However I noticed the number of records migrated wasn’t climbing at the rate it was previously (usually indicative of some error happening that I suppressed in order for the migration to run faster) but the logs showed that it was still going, just at a snail’s pace. Figuring it was just the instance dying I reimaged it and then the errors started flooding in.
Essentially I was disconnected from my NoSQL storage so whilst I could browse my migrated database I couldn’t keep pulling records out. This also had the horrible side effect of not allowing me to deploy anything as it would come back with SSL/TLS connection issues. Googling this led to all sorts of random posts as the error is also shared by the libraries that power the WebClient in .NET so it wasn’t until I stumbled across the ZDNet article that I knew I wasn’t in the wrong. Unfortunately you were really up the proverbial creek without a paddle if your Azure application was based on this as the temporary fixes for this issue, either disabling SSL for storage connections or usurping the certificate handler, left your application rather vulnerable to all sorts of nasty attacks. I’m one of the lucky few who could simply do without until it was fixed but it certainly highlighted the issues that can occur with PaaS architectures.
Honestly though that’s the only issue (that’s not been directly my fault) I’ve had with Azure since I started using it at the end of last year and compared to other cloud services it doesn’t fare too badly. It has made me think about what contingency strategy I’ll need to implement should any parts of the Azure infrastructure go away for an extended period of time though. For the moment I don’t think I’ll worry too much as I’m not going to be earning any income from the things I build on it but it will definitely be a consideration as I begin to unleash my products onto the world.
If you’re a developer like me you’ve likely got a set of expectations about the way you handle data. Most likely they all have their roots in the object-oriented/relational paradigm meaning that you’d expect to be able to get some insight into your data by simply running a few queries against it or simply looking at the table, possibly sorting it to find something out. The day you decide to try out something like Azure Table storage however you’ll find that these tools simply aren’t available to you any more due to the nature of the service. It’s at this point where, if you’re like me, you’ll get a little nervous as your data can end up feeling like something of a black box.
A while back I posted about how I was over-thinking the scalability of my Azure application and how I was about to make the move to Azure SQL. That’s been my task for the past 3 weeks or so and what started out as a relatively simple task of simply moving data from one storage mechanism to another has turned into this herculean task that has seen me dive deeper into both Azure Tables and SQL than I have ever done previously. Along the way I’ve found out a few things that, whilst not changing my mind about the migration away from Azure tables, certainly would have made my life a whole bunch easier had I known about them.
1. If you need to query all the records in an Azure table, do it partition by partition.
The not-so-fun thing about Azure Tables is that unless you’re keeping track of your data in your application there’s no real metrics you can dredge up in order to give you some idea of what you’ve actually got. For me this meant that I had one table that I knew the count of (due to some background processing I do using that table) however there are 2 others which I have absolutely 0 idea about how much data is actually contained in there. Estimates using my development database led me to believe there was an order of magnitude more data in there than I thought there was which in turn led me to the conclusion that using .AsTableServiceQuery() to return the whole table was doomed from the start.
However Azure Tables isn’t too bad at returning an entire partition’s worth of data, even if the records number in the 10s or 100s of thousands. Sure the query time goes up linearly depending on how many records you’ve got (as Azure Tables will only return a max of 1000 records at a time) but if they’re all within the same partition you avoid the troublesome table scan which dramatically affects the performance of the query, sometimes to the point of it getting cancelled which isn’t handled by the default RetryPolicy framework. If you need all the data in the entire table you can then do queries on each partition and then dump them all in a list inside your application and then continue to do your query.
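The partition by partition approach can be sketched as follows. FakeTable is a hypothetical in-memory stand-in rather than the Azure storage client, and the integer continuation token is an assumption made for illustration; the only detail taken from the post is the 1000 record cap per response.

```python
from typing import Dict, Iterator, List, Optional, Tuple

PAGE_SIZE = 1000  # Azure Tables returns at most 1000 records per response


class FakeTable:
    """In-memory stand-in for a table, grouped by partition key (hypothetical)."""

    def __init__(self, partitions: Dict[str, List[dict]]) -> None:
        self._partitions = partitions

    def query_partition(
        self, key: str, continuation: Optional[int] = None
    ) -> Tuple[List[dict], Optional[int]]:
        """Return one page of a partition plus a token for the next page, if any."""
        start = continuation or 0
        rows = self._partitions[key]
        page = rows[start:start + PAGE_SIZE]
        more = start + PAGE_SIZE < len(rows)
        return page, (start + PAGE_SIZE if more else None)


def read_partition(table: FakeTable, key: str) -> Iterator[dict]:
    """Follow continuation tokens until a single partition is exhausted."""
    token: Optional[int] = None
    while True:
        page, token = table.query_partition(key, token)
        yield from page
        if token is None:
            break


def read_all(table: FakeTable, keys: List[str]) -> List[dict]:
    """Query each known partition in turn instead of scanning the whole table."""
    rows: List[dict] = []
    for key in keys:
        rows.extend(read_partition(table, key))
    return rows


table = FakeTable({
    "2013-01": [{"id": i} for i in range(2500)],
    "2013-02": [{"id": i} for i in range(10)],
})
print(len(read_all(table, ["2013-01", "2013-02"])), "records read")
```

The sketch assumes your application already knows its partition keys. If it doesn’t, you’re back to needing some record of them elsewhere, which is the same bookkeeping problem described in the previous paragraph.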
2. Optimize your context for querying or updating/inserting records.
Unbeknownst to me the TableServiceContext class has quite a few configuration options available that will allow you to change the way the context behaves. The vast majority of errors I was experiencing came from my background processor which primarily dealt with reading data without making any modifications to the records. If you have applications where this is the case then it’s best to set the Context.MergeOption to MergeOption.NoTracking as this means the context won’t attempt to track the entities.
If you have multiple threads running or queries that return large amounts of records this can lead to a rather large improvement in performance as the context doesn’t have to track any changes to them and the garbage collector can free up these objects even if you use the context for another query. Of course this means that if you do need to make any changes you’ll have to change the context and then attach to the entity in question but you’re probably doing that already. Or at least you should be.
3. Modify your web.config or app.config file to dramatically improve performance and reliability.
For some unknown reason the default number of concurrent HTTP connections that a Windows Azure application can make to any one endpoint (although I get the feeling this affects all applications making use of the .NET Framework) is set to 2. Yes just 2. This then manifests itself as all sorts of crazy errors that don’t make a whole bunch of sense like “the underlying connection was closed” when you try to make more than 2 requests at any one time (which includes queries to Azure Tables). The max number of connections you can specify depends on the size of the instance you’re using but Microsoft has a helpful guide on how to set this and other settings in order to make the most out of it.
Additionally some of the guys at Microsoft have collected a bunch of tips for improving the performance of Azure Tables in various circumstances. I’ve cherry picked the best ones, which I’ve confirmed have worked wonders for me, however there’s a fair few more in there that might be of use to you, especially if you’re looking to get every performance edge you can. Many of them are circumstantial and some require you to plan out your storage architecture in advance (so something that can’t be easily retrofitted into an existing app) but since the others have worked I’d hazard a guess they would too.
I might not be making use of some of these tips now that my application is going to be SQL and TOPAZ but if I can save anyone the trouble I went through trying to sort through all those esoteric errors I can at least say it was worth it. Some of these tips are just good to know regardless of the platform you’re on (like the default HTTP connection limit) and should be incorporated into your application as soon as it’s feasible. I’ve yet to get all my data into production as it’s still migrating but I get the feeling I might go on another path of discovery with Azure SQL in the not too distant future and I’ll be sure to share my tips for it then.
As always I’m not-so-secretly working on a side project of mine (although I’ve kept its true nature a secret from most) which utilizes Windows Azure as the underlying platform. I’ve been working on it for the past 3 months or so and whilst it isn’t my first Azure application it is the first one that I’ve actually put into production. That means I’ve had to deal with all the issues associated with doing that, from building an error reporting framework to making code changes that have no effect in development but fix critical issues when the application is deployed. I’ve also come to the realisation that some of the architectural decisions I made, ones done with an eye cast towards future scalability, aren’t as sound as I first thought they were.
I’ve touched on some of the issues and considerations that Azure Tables has previously but what I haven’t dug into is the reasons you would choose to use it. On the surface it looks like a stripped down version of a relational database, missing some features but making up for it by being an extremely cheap way of storing a whole lot of data. Figuring that my application was going to be huge some day (as all of us developers do) I made the decision to use Azure Tables for everything. Sure querying the data was a little cumbersome but there were ways to code around that, and code around I did. The end solution does work as intended when deployed into production but there are some quirks which don’t sit well with me.
For starters querying data from Azure Tables on anything but the partition key and row key will force a table scan. Those familiar with NoSQL style databases will tell me that that’s the point: storage services like these are optimized for this situation and outside of that you’re better off using an old fashioned SQL database. I realised this when I was developing it, however the situations I had in mind fit in well with the partition/row key paradigm as often I’d need to get a whole partition, a single record or (and this is the killer) the entire table itself. Whilst Azure Tables might be great at the first 2 things it’s absolutely rubbish at the latter and this has caused me no end of issues.
In the beginning I, like most developers, simply developed something that worked. This included a couple calls along the lines of “get all the records in this table then do something with each of them”. This worked well up until I started getting hundreds of thousands of rows needing to be returned which often ended with the query being killed long before it could complete. Frustrated I implemented a solution that attempted to iterate over all records in the table by requesting all of the records and then following the continuation tokens as they were given to me. This kind of worked although anyone who’s worked with Azure and LINQ will tell you that I reinvented the wheel by forgoing the .AsTableServiceQuery() method which does that all for you. Indeed the end result was essentially the same and the only way around it was to put in some manual retry logic (in addition to the regular RetryPolicy). This works but retrieving/iterating over 800,000 records takes some 5 hours to complete, unacceptable when I can do the same thing on my home PC in a minute or two.
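To give a rough idea of what manual retry logic on top of continuation tokens can look like, here is a minimal Python sketch. It is not the code from the application itself: fetch_all_with_resume and flaky_fetch are hypothetical names, the tokens are plain integers, and the back-off delay defaults to zero purely so the example runs instantly. The point is that a failed page gets retried from the last good token instead of restarting the whole query.

```python
import time
from typing import Callable, List, Optional, Tuple

Page = Tuple[List[int], Optional[int]]


def fetch_all_with_resume(
    fetch_page: Callable[[Optional[int]], Page],
    max_attempts: int = 5,
    base_delay: float = 0.0,
) -> List[int]:
    """Walk every page, retrying a failed page from the last good token."""
    results: List[int] = []
    token: Optional[int] = None
    while True:
        for attempt in range(1, max_attempts + 1):
            try:
                page, next_token = fetch_page(token)
                break
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # give up after the final attempt
                time.sleep(base_delay * attempt)  # simple linear back-off
        results.extend(page)
        if next_token is None:
            return results
        token = next_token


# Hypothetical flaky source: 3,000 records, 1,000 per page, and every
# third call fails as if the connection had been forcibly closed.
calls = {"count": 0}


def flaky_fetch(token: Optional[int]) -> Page:
    calls["count"] += 1
    if calls["count"] % 3 == 0:
        raise ConnectionError("connection forcibly closed")
    start = token or 0
    end = min(start + 1000, 3000)
    return list(range(start, end)), (end if end < 3000 else None)


records = fetch_all_with_resume(flaky_fetch)
```

Keeping the token outside the retry loop is the design choice that matters here: pages already fetched are never requested again, so one dropped connection costs a single page rather than the whole run.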
It’s not a limitation of the instances I’m using either, as the part of the application that uses Azure SQL works on a subset of the data, but still the same number of records, and is able to return in a fraction of the time. Indeed the issue seems to come from the fact that Azure Tables lacks the ability to iterate and re-runs the giant query every time I request the next 1000 records. This often runs into the execution time limit which terminates all connections from my instance to the storage, causing a flurry of errors to occur. The solution seems clear though, I need to move off Azure Tables and onto Azure SQL.
Realistically I should’ve realised this a lot sooner as there are numerous queries I make on things other than the partition and row keys which are critical to the way my application functions. Moving comes with its own challenges as scaling out the application becomes a lot harder, but honestly I’m kidding myself by thinking I’ll need that level of scalability any time soon. For now I can simply move database tables around on Azure instances to get the required performance, and once that’s not enough I’ll finally try to understand SQL Federations properly and that will sort it for good.
Windows Azure Tables are one of those newfangled NoSQL type databases that excel at storing giant swaths of structured data. For what they are they’re quite good as you can store very large amounts of data in there without having to pay through the nose like you would for a traditional SQL server or an Azure instance of SQL. However that advantage comes at a cost: querying the data on anything but the partition key (think of it as a partition of the data within a table) and the row key (the unique identifier within that partition) results in queries that take quite a while to run, especially when compared to their SQL counterparts. There are ways to get around this, however no matter how well you structure your data eventually you’ll run up against this limitation and that’s where things start to get interesting.
By default whenever you do a large query against an Azure Table you’ll only get back 1000 records, even if the query will return more. However if your query did have more results than that you’ll be able to access them via a continuation token that you can add to your original query, telling Azure that you want the records past that point. For those of us coding on the native .NET platform we get the lovely benefit of having all of this handled for us directly by simply adding .AsTableServiceQuery() to the end of our LINQ statements (if that’s what you’re using) which will handle the continuation tokens for us. For most applications this is great as it means you don’t have to fiddle around with the rather annoying way of extracting those tokens out of the response headers.
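For anyone who hasn’t seen the continuation token dance before, it looks roughly like the Python sketch below. This is a simulation with an in-memory list, not the real storage API: FAKE_TABLE, query_page and query_all are made-up names, and the token is just an integer offset, whereas the real service hands back opaque values in the response headers.

```python
from typing import Iterator, List, Optional, Tuple

PAGE_SIZE = 1000  # Azure Tables returns at most 1000 records per response

# Hypothetical in-memory stand-in for a table: 2,500 fake records.
FAKE_TABLE: List[dict] = [{"RowKey": f"{i:06d}"} for i in range(2500)]


def query_page(token: Optional[int]) -> Tuple[List[dict], Optional[int]]:
    """Return one page of results plus the token for the next page, if any."""
    start = token or 0
    page = FAKE_TABLE[start:start + PAGE_SIZE]
    next_start = start + len(page)
    next_token = next_start if next_start < len(FAKE_TABLE) else None
    return page, next_token


def query_all() -> Iterator[dict]:
    """Follow continuation tokens until the service stops handing them out."""
    token: Optional[int] = None
    while True:
        page, token = query_page(token)
        yield from page
        if token is None:
            break


records = list(query_all())
print(len(records))
```

Because query_all is a generator the caller can process records as each page arrives instead of waiting for the whole result set.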
Of course that leads you down the somewhat lazy path of not thinking about the kinds of queries you’re running against your Tables and this can lead to problems down the line. Since Azure is a shared service there are upper limits on how long queries can run and how much data they can return to you. These limits aren’t exactly set in stone and depending on how busy the particular server you’re querying is or the current network utilization at the time your query could either take an incredibly long time to return or could simply end up getting closed off. Anyone who’s developed for Azure in the past will know that this is pretty common, even for the more robust things like Azure SQL, but there’s one thing that I’ve noticed over the past couple weeks that I haven’t seen mentioned anywhere else.
As the above paragraphs might indicate I have a lot of queries that try and grab big chunks of data from Azure Tables and have, of course, coded in RetryPolicies so they’ll keep at it if they should fail. There’s one thing that all the policies in the world won’t protect you from however and that’s connections that are forcibly closed. I’ve had quite a few of these recently and I noticed that they appear to come in waves, rippling through all my threads causing unhandled exceptions and forcing them to restart themselves. I’ve done my best to optimize the queries since then and the errors have mostly subsided but it appears that should one long running query trigger Azure to force the connection closed all connections from that instance to the same Table storage will also be closed.
Depending on how your application is coded this might not be an issue however for mine, where the worker role has about 8 concurrent threads running at any one time all attempting to access the same Table Storage account, it means one long running query that gets terminated triggers a cascade of failures across the rest of the threads. For the most part this was avoided by querying directly on row and partition keys however the larger queries had to be broken up using the continuation tokens and then the results concatenated in memory. This introduces another limit on particular queries (as storing large lists in memory isn’t particularly great) which you’ll have to architect your code around. It’s by no means an unsolvable problem however it was one that has forced me to rethink certain parts of my application which will probably need to be on Azure SQL rather than Azure Tables.
Like any cloud platform Azure is a great service which requires you to understand what its various services are good for and what they’re not. I initially set out to use Azure Tables for everything and have since found that it’s simply not appropriate for that, especially if you need to query on parameters that aren’t the row or partition keys. If you have connections being closed on you inexplicably be sure to check for any potentially long running queries on the same role; as this post can attest, they could very well be the source of what ails you.