Posts Tagged ‘performance’

Capturing a Before and After Performance Report for Windows Servers.

The current project I’m on has a requirement for being able to determine a server’s overall performance before and after a migration, mostly to make sure that it still functions the same or better once it’s on the new platform. Whilst it’s easy enough to get raw statistics out of perfmon, getting an at-a-glance view of how a server is performing before and after a migration is a far more nuanced problem, one that’s not easily accomplished with a bit of Excel wizardry. With that in mind I thought I’d share my idea for creating such a view, as well as outline the challenges I’ve hit when attempting to collate the data.

Perfmon Data

At a high level I’ve focused on the 4 core resources that all operating systems consume: CPU, RAM, disk and network. For the most part these metrics are easily captured by perfmon’s counters, however I wanted to go a bit further to make sure that the final comparisons represented a more “true” picture of before and after performance. To do this I included some additional qualifying metrics which would show whether increased resource usage was negatively impacting performance, or whether the server was simply consuming more resources because it could since the new platform has much more capacity. With that in mind these are the metrics I settled on using, each listed with its unit and category:

  • Average of CPU usage (24 hours), Percentage, Quantitative
  • CPU idle time on virtual host of VM (24 hours), Percentage, Qualifying
  • Top 5 services by CPU usage, List, Qualitative
  • Average of Memory usage (24 hours), Percentage, Quantitative
  • Average balloon driver memory usage (24 hours), MB consumed, Qualifying
  • Top 5 services by Memory usage, List, Qualitative
  • Average of Network usage (24 hours), Percentage, Quantitative
  • Average TCP retransmissions (24 hours), Total, Qualifying
  • Top 5 services by Network bandwidth utilized, List, Qualitative
  • Average of Disk usage (24 hours), Percentage, Quantitative
  • Average queue depth (24 hours), Total, Qualifying
  • Top 5 services by Storage IOPS/Bandwidth utilized, List, Qualitative

Essentially these metrics can be broken down into 3 categories: quantitative, qualitative and qualifying. Quantitative metrics are the base metrics which form the main part of the before and after analysis. Qualitative metrics are mostly informational (being the Top 5 consumers of a given resource) however they provide some useful insight into what might be causing an issue. For example, if an SQL box isn’t showing the SQL process as a top consumer then it’s likely something is causing that process to take a dive before it can actually use any resources. Finally, the qualifying metrics indicate whether or not increased usage of a certain metric signals an impact to the server’s performance; if memory usage is high and the memory balloon size is also high, for instance, it’s quite likely the system isn’t performing very well.

The vast majority of these metrics are provided by perfmon, however there were a couple that I couldn’t seem to get through the counters even though I could see them in Resource Monitor. As it turns out, Resource Monitor makes use of the Event Tracing for Windows (ETW) framework, which gives you an incredibly granular view of all events happening on your machine. What I was looking for was a breakdown of disk and network usage per process (in order to generate the Top 5 users lists), which is unfortunately bundled up in the IO counters available in perfmon. In order to split these out you have to run a kernel trace through ETW and then parse the resulting file to get the metrics you want. It’s a little messy but unfortunately there’s no good way to get those metrics separated. The resulting perfmon profile I created can be downloaded here.

The next issue I’ve run into is getting the data into a readily digestible format. You see, not all servers are built the same and not all of them run the same set of software. This means that when you open up the resulting CSV files from different servers the column headers won’t line up, so you’ve got to either do some tricky Excel work (which is often prone to failure) or get freaky with some PowerShell (which is messy and complicated). I decided to go for the latter as at least I can maintain and extend a script somewhat easily, whereas an Excel spreadsheet has a tendency to get out of control faster than anyone expects. That part is still a work in progress, but I’ll endeavour to update this post with the completed script once I’ve got it working.

After that point it’s a relatively simple task of displaying everything in a nicely formatted Excel spreadsheet and doing comparisons based on the metrics you’ve generated. If I had more time on my hands I probably would’ve tried to integrate it into something like a SharePoint BI site so we could do some groovy tracking and intelligence on it, but due to tight time constraints I probably won’t get that far. Still, a well laid out spreadsheet isn’t a bad format for presenting such information, especially when you can colour everything green when things are going right.

I’d be keen to hear other people’s thoughts on how you’d approach a problem like this, as trying to quantify the nebulous idea of “server performance” has proven to be far more challenging than I first thought it would be. Part of this is due to the data manipulation required, but it’s also about ensuring that all aspects of a server’s performance are covered and converted down to readily digestible metrics. I think I’ve gotten close to a workable solution with this but I’m always looking for ways to improve it, or for a magical tool out there that will do this all for me 😉

10,000 Hours of Deliberate Practice: A Necessary but not Sufficient Condition for Mastery.

It’s been almost 6 years since I first began writing this blog. If you dare to trawl through the early archives there’s no doubt that the writing in there is of lower quality, much of it to do with me still trying to find my voice in this medium. Now, some 1300+ posts later, the hours I’ve invested in developing this blog have seen my writing improve dramatically and every day I feel far more confident in my ability to churn out a blog post that meets a certain quality threshold. I attribute much of that to my dedication to writing at least once a day, an activity which has seen me invest thousands of hours into improving my craft. Indeed I felt that this was something of an embodiment of the 10,000 hour rule at work, something that newly released research says isn’t the main factor at play.

Variance Due to Deliberate Practice

The study, conducted by researchers at Princeton University (full text available here), attempted to discern just how much of an impact deliberate practice has on performance. They conducted a meta-analysis of 150 studies that investigated the relationship between these two variables and classified them along major domains as well as the methodology used to gather performance data. The results show that whilst deliberate practice can improve your performance within a certain domain (and which domain it’s in has a huge effect on how great the improvement is) it’s not the major contributor in any case. Indeed the vast majority of improvement is due to factors that reside outside of deliberate practice, which seemingly throws out the idea of 10,000 hours worth of practice being the key component to mastering something.

To be clear though, the research doesn’t mean that practice is worthless; indeed in pretty much every study conducted there’s a strong correlation between increased performance and deliberate practice. What this study does show is that there are factors outside of deliberate practice which have a greater influence on whether or not your performance improves. Unfortunately determining what those factors are was out of the scope of the study (it’s only addressed in passing in the closing statements of the report) but there are still some interesting conclusions to be drawn about how one can go about improving themselves.

Where deliberate practice does seem to help is with activities that have a predictable outcome. Indeed performance in routine activities shows a drastic improvement when deliberate practice is undertaken, whilst unpredictable things, like aviation emergencies, show less improvement. We also seem to overestimate our own improvement due to practice alone, as studies that relied on people remembering past performances showed a much larger improvement than studies that logged performances over time. Additionally, for the areas which showed the least improvement due to deliberate practice it’s likely that there’s no good definition of “practice” within those domains, meaning it’s much harder to quantify what needs to be practiced.

So where does this leave us? Are we all doomed to be good at only the things which our nature defines for us, never able to improve on anything? As far as the research shows, no: deliberate practice might not be the magic cure-all for improving but it is a great place to start. What we need to know now is what other factors play into improving performance within specific domains. For some areas this is already well defined (I can think of many examples in games) but for other domains that are slightly more nebulous in nature it’s entirely possible that we’ll never figure out the magic formula. Still, at least now you don’t need to worry so much about the hours you put in, as long as you still, in fact, put them in.


Samsung’s V-NAND Has Arrived, and It’s Awesome.

When people ask me which one component of their PC they should upgrade my answer is always the same: get yourself an SSD. It’s not so much the raw performance characteristics that make the upgrade worth it, more that all those things many people hate about computers seem to melt away when there’s an SSD behind them. All your applications load near instantly, your operating system feels more responsive and those random long lock ups where your hard drive seems to churn over for ages simply disappear. The one drawback, however, is their capacity and cost, being an order of magnitude away from the good old spinning rust. Last year Samsung announced their plans to change that with V-NAND and today they deliver on that promise.

Samsung 850 Pro V-NAND SSD

The Samsung 850 Pro is the first consumer drive to be released with V-NAND technology and is available in sizes up to 1TB. The initial promise of 128Gbit per chip has unfortunately fallen a little short of the mark, with this current production version only delivering around 86Gbit per chip. This is probably due to economic reasons, as the new chips under the hood of this SSD are smaller than the first prototypes, which helps to increase the yield per wafer. Interestingly enough these chips are being produced on an older lithography process, 30nm instead of the current standard 20nm for most NAND chips. That might sound like a step back, and indeed it would be for most hardware, however the performance of the drive is pretty phenomenal, meaning that V-NAND is going to get even better with time.

Looking at the performance reviews the Samsung 850 Pro seems to be a top contender, if not the best, in pretty much all of the categories. In the world of SSDs having consistently high performance like this across a lot of categories is very unusual as typically a drive manufacturer will tune performance to a certain profile. Some favour random reads, others sustained write performance, but the Samsung 850 Pro seems to do pretty much all of them without breaking a sweat. However what really impressed me about the drive wasn’t so much the raw numbers, it was how the drive performed over time, even without the use of TRIM.

Samsung 850 Pro 512GB - HD Tach

SSDs naturally degrade in performance over time, not due to the components wearing out but due to the nature of how they read and write data. Essentially it comes down to blocks needing to be checked to see whether they’re free or not before they can be written to, a rather costly process. A new drive is all blank space, which means these checks don’t need to be done, but over time blocks get into unknown states due to all the writing and rewriting. The TRIM command tells SSDs that certain blocks have been freed up, allowing the drive to flag them as unused, recovering some of the performance. The graph above shows what happens when the new Samsung 850 Pro reaches that performance degradation point, even without the use of TRIM. If you compare that to other SSDs this kind of consistent performance almost looks like witchcraft, but it’s just the V-NAND technology showing one of its many benefits.

Indeed Samsung is so confident in these new drives that it’s giving all of them a 10 year warranty, something you can’t find even on good old spinning rust drives anymore. I’ll be honest: when I first read about V-NAND I had a feeling that the first drives would likely be failure-ridden write-offs, like most new technologies are. However this new drive from Samsung appears to be the evolutionary step that all SSDs need to take, as this first iteration device is just walking all over the competition. I was already sold on a Samsung SSD for my next PC build but I think an 850 Pro just made the top of my list.

Now if only those G-SYNC monitors could come out already, then I’d be set to build my next gen gaming PC.

The Longevity of Next Gen Consoles.

There’s an expectation upon purchasing a console that it will remain current for a decent length of time, ostensibly long enough so that you feel you got your money’s worth whilst also not so long that the hardware starts to look dated in comparison to everything else that’s available. Violating either of these two constraints usually leads to some form of consumer backlash, like it did when the Xbox 360 debuted rather shortly after the original Xbox. With the next generation bearing down on us the question of how long this generation of consoles will last, and more importantly stay relevant, is at the forefront of many people’s minds.

Durango Block Diagram


Certainly from a pure specifications perspective the next generation of high performance consoles aren’t going to be among the fastest systems available for long. Both of them are sporting current gen CPUs and GPUs, however it’s quite likely that their hardware will be superseded before they ever hit the retail shelves. AMD is currently gearing up to release their 8000 series GPUs sometime in the second quarter of this year. The CPUs are both based off AMD’s Jaguar micro-architecture and should be current for at least a year or so after their initial release, at least in terms of the AMD line, although the release of Intel’s Haswell, scheduled for some time in the middle of this year, means that even the CPUs will be somewhat outdated upon release. This is par for the course for any kind of IT hardware however, so it shouldn’t come as much of a surprise that more powerful options will be available even before the consoles’ initial release.

Indeed consoles have always had a hard time keeping up with PCs in terms of raw computing power, although the consistent, highly optimizable platform a console provides is what keeps them in the game long after their hardware has become ancient. There does come a time however when the optimizations just aren’t sufficient and the games start to stagnate, which is what led to the more noticeable forms of consolization that made their way into PC games. It’s interesting to note this as whilst the current generation of consoles has been wildly popular since its inception, the problem of consolization wasn’t really apparent until many years afterwards, ostensibly when PC power started to heavily outstrip the current gen consoles’ abilities.

Crytek head honcho Cevat Yerli has gone on record saying that even the next gen consoles won’t be able to keep up with PCs when it comes to raw power. Now this isn’t a particularly novel observation in itself, any PC gamer would be able to tell you this, but bringing the notion of price into it is an intriguing angle. As far as we can tell the next generation of consoles will come out at around $600, maybe $800 if Sony/Microsoft don’t want to use them as loss leaders any more. Whilst they’re going to be theoretically outmatched by $2000 gaming beasts from day 1, it gets a lot more interesting if we start making comparisons to a similarly priced PC and the capabilities it will have. In that regard consoles actually offer a good value proposition for quite a while to come.

So out of curiosity I specced up a PC that was comparable to the next gen consoles and it came out at around $950. At this end of the spectrum prices aren’t affected as much by Moore’s Law since the parts are so cheap already, and the only part that would likely see major depreciation would be the graphics card, which came in at about $300. Still, taking the optimizations that can be made on consoles into account, the next gen consoles do represent pretty good value for the performance they will deliver on release and will continue to do so for at least 2~3 years (1~2 iterations of Moore’s Law) afterwards thanks to their low price point. Past that point the then-current generation of CPUs and GPUs will perform well enough at the same price point to beat them in a performance per dollar scenario.

In all honesty I hadn’t really thought of making a direct comparison at the same price point before and the results were quite surprising. The comparison is even more apt now thanks to the next generation coming with an x86 architecture underneath, which essentially makes them cheap PCs. Sure they may never match up to the latest and greatest but they sure do provide some pretty good value. Whilst I didn’t think they’d have trouble selling these things, this kind of comparison will make the decision to buy one that much easier, at least for people like me who are all about extracting the maximum value for their dollars spent.

3 Tips on Improving Azure Table Storage Performance and Reliability.

If you’re a developer like me you’ve likely got a set of expectations about the way you handle data. Most likely they all have their roots in the object-oriented/relational paradigm, meaning that you’d expect to be able to get some insight into your data by running a few queries against it or just looking at the table, possibly sorting it to find something out. The day you decide to try out something like Azure Table storage, however, you’ll find that these tools simply aren’t available to you any more due to the nature of the service. It’s at this point where, if you’re like me, you’ll get a little nervous as your data can end up feeling like something of a black box.

A while back I posted about how I was over-thinking the scalability of my Azure application and how I was about to make the move to Azure SQL. That’s been my task for the past 3 weeks or so, and what started out as a relatively simple job of moving data from one storage mechanism to another has turned into a herculean task that has seen me dive deeper into both Azure Tables and SQL than I have ever done previously. Along the way I’ve found out a few things that, whilst not changing my mind about the migration away from Azure Tables, certainly would have made my life a whole bunch easier had I known about them.

1. If you need to query all the records in an Azure table, do it partition by partition.

The not-so-fun thing about Azure Tables is that unless you’re keeping track of your data in your application there are no real metrics you can dredge up in order to give you some idea of what you’ve actually got. For me this meant that I had one table that I knew the count of (due to some background processing I do using that table) however there were 2 others which I had absolutely no idea about in terms of how much data was actually contained in there. Estimates using my development database led me to believe there was an order of magnitude more data in there than I thought, which in turn led me to the conclusion that using .AsTableServiceQuery() to return the whole table was doomed from the start.

However Azure Tables isn’t too bad at returning an entire partition’s worth of data, even if the records number in the 10s or 100s of thousands. Sure, the query time goes up linearly depending on how many records you’ve got (as Azure Tables will only return a max of 1000 records at a time) but if they’re all within the same partition you avoid the troublesome table scan which dramatically affects the performance of the query, sometimes to the point of it getting cancelled, something that isn’t handled by the default RetryPolicy framework. If you need all the data in the entire table you can then query each partition in turn, dump the results into a list inside your application and run whatever query you wanted over that.
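
As a rough sketch of that partition-by-partition approach, here’s what it might look like with the classic StorageClient library that .AsTableServiceQuery() comes from; the entity class, table name and list of partition keys are all placeholders you’d swap for your own.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// Placeholder entity; swap in whatever TableServiceEntity-derived class you actually store.
public class MyEntity : TableServiceEntity
{
    public string Payload { get; set; }
}

public static class PartitionReader
{
    // Pulls back the whole table one partition at a time, avoiding the full table scan.
    public static List<MyEntity> ReadAllByPartition(CloudStorageAccount account, IEnumerable<string> partitionKeys)
    {
        var results = new List<MyEntity>();
        var context = account.CreateCloudTableClient().GetDataServiceContext();

        foreach (var partitionKey in partitionKeys)
        {
            var query = (from e in context.CreateQuery<MyEntity>("MyTable")
                         where e.PartitionKey == partitionKey
                         select e).AsTableServiceQuery();

            // Execute() follows the continuation tokens for you (1000 records per round trip).
            results.AddRange(query.Execute());
        }

        return results;
    }
}
```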

2. Optimize your context for querying or updating/inserting records.

Unbeknownst to me the TableServiceContext class has quite a few configuration options available that will allow you to change the way the context behaves. The vast majority of errors I was experiencing came from my background processor which primarily dealt with reading data without making any modifications to the records. If you have applications where this is the case then it’s best to set the Context.MergeOption to MergeOption.NoTracking as this means the context won’t attempt to track the entities.

If you have multiple threads running or queries that return large numbers of records this can lead to a rather large improvement in performance, as the context doesn’t have to track any changes to them and the garbage collector can free up these objects even if you use the context for another query. Of course this means that if you do need to make any changes you’ll have to switch the merge option back (or use a separate context) and then attach the entity in question, but you’re probably doing that already. Or at least you should be.
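
Here’s a minimal sketch of the two modes, again assuming the classic StorageClient API the post refers to (the table name and entity are whatever your application uses): one context tuned for read-only querying, and a separate write path that attaches the entity before updating it.

```csharp
using System.Data.Services.Client;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public static class ContextTuning
{
    // Read-heavy path: NoTracking stops the context holding on to every entity it returns.
    public static TableServiceContext CreateReadOnlyContext(CloudStorageAccount account)
    {
        var context = account.CreateCloudTableClient().GetDataServiceContext();
        context.MergeOption = MergeOption.NoTracking;
        return context;
    }

    // Write path: with tracking disabled you need to attach the entity before updating it.
    public static void UpdateEntity(CloudStorageAccount account, TableServiceEntity entity, string tableName)
    {
        var context = account.CreateCloudTableClient().GetDataServiceContext();
        context.AttachTo(tableName, entity, "*"); // "*" etag means an unconditional update
        context.UpdateObject(entity);
        context.SaveChangesWithRetries();
    }
}
```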

3. Modify your web.config or app.config file to dramatically improve performance and reliability.

For some unknown reason the default number of concurrent HTTP connections that a Windows Azure application can make (although I get the feeling this affects all applications making use of the .NET Framework) is set to 2. Yes, just 2. This then manifests itself as all sorts of crazy errors that don’t make a whole bunch of sense, like “the underlying connection was closed”, when you try to make more than 2 requests at any one time (which includes queries to Azure Tables). The max number of connections you can specify depends on the size of the instance you’re using, but Microsoft has a helpful guide on how to set this and other settings in order to make the most out of it.
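
For reference, this is the standard connectionManagement element the post is talking about; the limit of 48 below is just an illustrative value, as the right number depends on your instance size and workload.

```xml
<configuration>
  <system.net>
    <connectionManagement>
      <!-- Raise the per-endpoint limit from the default of 2 concurrent connections -->
      <add address="*" maxconnection="48" />
    </connectionManagement>
  </system.net>
</configuration>
```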

Additionally some of the guys at Microsoft have collected a bunch of tips for improving the performance of Azure Tables in various circumstances. I’ve cherry picked out the best ones, which I’ve confirmed have worked wonders for me, however there are a fair few more in there that might be of use to you, especially if you’re looking to get every performance edge you can. Many of them are circumstantial and some require you to plan out your storage architecture in advance (so something that can’t be easily retrofitted into an existing app) but since the others have worked I hazard a guess they would too.

I might not be making use of some of these tips now that my application is going to be SQL and TOPAZ, but if I can save anyone the trouble I went through trying to sort through all those esoteric errors I can at least say it was worth it. Some of these tips are just good to know regardless of the platform you’re on (like the default HTTP connection limit) and should be incorporated into your application as soon as it’s feasible. I’ve yet to get all my data into production as it’s still migrating, but I get the feeling I might go on another path of discovery with Azure SQL in the not too distant future and I’ll be sure to share my tips for it then.

Fusion-IO’s ioDrive Comparison: Sizing up Enterprise Level SSDs.

Of all the PC upgrades that I’ve ever done in the past the one that’s most notably improved the performance of my rig is, by a wide margin, installing an SSD. Whilst good old fashioned spinning rust disks have come a long way in recent years in terms of performance they’re still far and away the slowest component in any modern system. This is what chokes most PCs’ performance, as the disk is a huge bottleneck that slows everything down to its pace. The problem can be mitigated somewhat by using several disks in a RAID 0 or RAID 10 set but all of those pale in comparison to even a single SSD.

The problem doesn’t go away in the server environment either; in fact most of the server performance problems I’ve diagnosed have had their roots in poor disk performance. Over the years I’ve discovered quite a few tricks to get around the problems presented by traditional disk drives but there are just some limitations you can’t overcome. Recently at work the issue of disk performance came to a head again as we investigated the possibility of using blade servers in our environment. I casually made mention of a company that I had heard of a while back, Fusion-IO, which specialises in making enterprise class SSDs. The possibility of using one of the Fusion-IO cards as a massive cache for the slower SAN disk was a tantalizing prospect and, to my surprise, I was able to snag an evaluation unit in order to put it through its paces.

The card we were sent was one of the 640GB ioDrives. It’s surprisingly heavy for its size, sporting gobs of NAND flash and a massive heat sink that hides the proprietary controller. What intrigued me about the card initially was that the NAND didn’t sport any branding I recognised (usually it’s something recognisable like Samsung) but as it turns out each chip is a 128GB Micron NAND flash chip. If all that storage was presented raw it would total some 3.1TB, and this is telling of the underlying infrastructure of the Fusion-IO devices.

The total storage available to the operating system once this card is installed is around 640GB (600GB usable). Now to get that kind of storage out of the Micron NAND chips you’d only need 5 of them, but the ioDrive comes with a grand total of 25 dotting the board, and no traditional RAID scheme can account for the amount of storage presented. So based on the fact that there are 25 chips and only 5 chips’ worth of capacity available it follows that the Fusion-IO card uses quintuplet sets of chips to provide the high level of performance that they claim. That’s an incredible amount of parallelism and, if I’m honest, I expected these chips to all be 256MB chips that were RAID 1’d together to make one big drive.

Funnily enough I did actually find some Samsung chips on this card, two 1GB DDR2 chips. These are most likely used for the CPU on the ioDrive which has a front side bus of either 333 or 400MHz based on the RAM speed.

But enough of the techno geekery, what’s really important is how well this thing performs in comparison to traditional disks and whether or not it’s worth the $16,000 price tag that comes along with it. Now I had done some extensive testing of various systems in the past in order to ascertain whether the new Dell servers we were looking at were going to perform as well as their HP counterparts. All of this testing was purely disk based using IOMeter, a disk load simulator that tests and reports on nearly every statistic you’d want to know about your disk subsystem. If you’re interested in replicating the results I’ve got then I’ve uploaded a copy of my configuration file here. The servers included in the test are a Dell M610x, Dell M710HD, Dell M910, Dell R710 and a HP DL380G7. For all the tests (bar the two labelled local install) the servers are running a base install of ESXi 5 with a Windows 2008 R2 virtual machine on top of it. The specs of the virtual machine are 4 vCPUs, 4GB RAM and a 40GB disk.

As you can see the ioDrive really is in a class all of its own. The only server that comes close in terms of IOPS is the M910 and that’s because it’s sporting 2 Samsung SSDs in RAID 0. What impresses me most about the ioDrive though is its random performance, which manages to stay quite high even as the block size starts to get bigger. Although it’s not shown in these tests, the one area where the traditional disks actually equal the Fusion-IO is throughput when you get up to really large write sizes, on the order of 1MB or so. I put this down to the fact that the servers in question, the R710s and DL380G7s, have 8 disks in them that can pump out some serious bandwidth when they need to. If I had 2 Fusion-IO cards though I’m sure I could easily double that performance figure.

What interested me next was to see how close I could get to the spec sheet performance. The numbers I just showed you are already incredible, but Fusion-IO claims that this particular drive is capable of something on the order of 140,000 IOPS if I played my cards correctly. Using the local install of Windows 2008 I had on there I fired up IOMeter again and set up some 512B tests to see if I could get close to those numbers. The results, as shown in the Dell IO controller software, are below:

Ignoring the small blip in the centre where I had to restart the test, you can see that whilst the ioDrive is capable of some pretty incredible IO the advertised maximums are more theoretical than practical. I tried several different tests and while a few averaged higher than this (approximately 80K IOPS was my best) it was still a far cry from the figures they have quoted. Had they gotten within 10~20% I would’ve given it to them, but whilst the ioDrive’s performance is incredible it’s not quite as incredible as the marketing department would have you believe.

As a piece of hardware the Fusion-IO ioDrive really is the next step up in terms of performance. The virtual machines I had running directly on the card were considerably faster than their spinning rust counterparts and if you were in need of some really crazy performance you couldn’t go past one of these cards. For the purpose we had in mind for it however (putting it inside a M610x blade) I can’t really recommend it, as that’s a full height blade that only has the power of a half height. The M910 represents much better value with its crazy CPU and RAM count, and its SSDs, whilst being far from Fusion-IO level, do a pretty good job of bridging the disk performance gap. I didn’t have enough time to see how it would improve some real world applications (it takes me longer than 10 days to get something like this into our production environment) but based on these figures I have no doubt it would improve the performance of whatever I put it into considerably.

Virtual Machine CPU Over-provisioning: Results From The Real World.

Back when virtualization was just starting to make headway into the corporate IT market the main aim of the game was consolidation. Vast quantities of CPU, memory and disk resources were being squandered as servers sat idle for the vast majority of their lives, barely ever using the capacity that was assigned to them. Virtualization allowed IT shops to run many low resource servers on the one box, significantly reducing the hardware cost whilst providing a whole host of other features. It followed then that administrators looked towards over-provisioning their hosts, i.e. creating more virtual machines than the host was technically capable of handling.

The reason this works is because of a feature of virtualization platforms called scheduling. In essence, when you put a virtual machine on an over-provisioned host it is not guaranteed to get resources when it needs them; instead it’s scheduled on and off the physical cores in order to keep it and all the other virtual machines running properly. Surprisingly this works quite well as for the most part virtual machines spend a good part of their life idle, and the virtualization platform uses this information to schedule busy machines ahead of idle ones. Recently I was approached to find out what the limits were of a new piece of hardware that we had procured and I’ve discovered some rather interesting results.

The piece of kit in question is a Dell M610x blade server with the accompanying chassis and interconnects. The specifications we got were pretty good, being a dual processor arrangement (2 x Intel Xeon X5660) with 96GB of memory. What we were trying to find out was what kind of guidelines we should have around how many virtual machines could comfortably run on such hardware before performance started to degrade. There was no such testing done with previous hardware so I was working in the dark on this one, and I devised my own test methodology in order to figure out the upper limits of over-provisioning in a virtual world.

The primary performance bottleneck for any virtual environment is the disk subsystem. You can have the fastest CPUs and oodles of RAM and still get torn down by slow disk. However most virtual hosts will use some form of shared storage, so testing that was out of the equation. The two primary resources we’re left with then are CPU and memory, and the latter is already a well known problem space. However I wasn’t able to find any good articles on CPU over-provisioning, so I devised some simple tests to see how the system would perform under a load that was well above its capabilities.

The first test was a simple baseline: since the server has 12 physical cores available (HyperThreading might tell you there are more, but that’s a pipe dream) I created 12 virtual machines, each with a single core, and then fully loaded their CPUs. Shown below is a stacked graph of each virtual machine’s ready time, which is a representation of how long the virtual machine was ready¹ to execute some instruction but was not able to get scheduled onto the CPU.

The initial part of this graph shows the machines all at idle. Now you’d think at that stage that their ready times would be zero since there’s no load on the server. However, since VMware’s hypervisor knows when a virtual machine is idle, it won’t schedule it on as often as the idle loops are simply wasted CPU cycles. The jumpy period after that is when I was starting up a couple of virtual machines at a time and, as you can see, those virtual machines’ ready times drop to 0. The very last part of the graph shows the ready time rocketing down to nothing for all the virtual machines, with the top grey part of the graph being the ready time of the hypervisor itself.

This test doesn’t show anything revolutionary as this is pretty much the expected behaviour of a virtualized system. It does however provide us with a solid baseline against which we can compare further tests. The next test I performed was to see what would happen when I doubled the work load on the server, increasing the virtual core count from 12 to a whopping 24.

For comparison’s sake the first graph’s peak is equivalent to the first peak of the second graph. What this shows is that when the CPU is oversubscribed by 100% the CPU wait times rocket through the roof, with the virtual machines waiting up to 10 seconds in some cases to get scheduled back onto the CPU. The average was somewhere around half a second, which for most applications is an unacceptable amount of time. Just imagine trying to use your desktop and having it freeze for half a second every 20 seconds or so; you’d say it was unusable. Taking this into consideration we now know that there must be some kind of happy medium between the two extremes. The next test then aimed right bang in the middle of them, putting 18 virtual CPUs on a 12 core host.

Here’s where it gets interesting. The graph depicts the same test running over the entire time but as you can see there are very distinct sections depicting what I call different modes of operation. The lower end of the graph shows a time when the scheduler is hitting its marks and the wait times are overall quite low. The second is when the scheduler gives much more priority to the virtual machines that are thrashing their cores and the machines that aren’t doing anything get pushed to the side. However in both instances the 18 cores running are able to get serviced in a maximum of 20 milliseconds or so, well within the acceptable range of most programs and user experience guidelines.

Taking this all into consideration it’s then reasonable to say that the maximum you can oversubscribe a virtual host in regards to CPU is 1.5 times the number of physical cores. You can extrapolate that further by taking the average load into consideration: if it’s constantly below 100% then you can divide that vCPU count by the load percentage. For example, if the average load of these virtual machines was 50% then theoretically you could support 36 single core virtual machines on this particular host. Of course once you get up to very high CPU counts things like overhead start to come into consideration, but as a hard and fast rule it works quite well.
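
Written out as a quick back-of-the-envelope calculation, this is just a sketch of the rule of thumb above, with the 1.5 multiplier taken from these tests:

```csharp
using System;

static class Oversubscription
{
    // vCPU ceiling ≈ physical cores x 1.5, scaled up further if the VMs aren't pegged at 100%.
    static int MaxSingleCoreVms(int physicalCores, double averageLoadFraction)
    {
        return (int)Math.Floor(physicalCores * 1.5 / averageLoadFraction);
    }

    static void Main()
    {
        Console.WriteLine(MaxSingleCoreVms(12, 1.0)); // 18 vCPUs flat out on a 12 core host
        Console.WriteLine(MaxSingleCoreVms(12, 0.5)); // 36 single core VMs averaging 50% load
    }
}
```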

If I’m honest I was quite surprised with these results as I thought that once I put a single extra thrashing virtual machine on the server it’d fall over in a screaming heap under the additional load. It seems though that VMware’s scheduler is smart enough to be able to service a load much higher than what the server should be capable of without affecting the other virtual machines too adversely. This is especially good news for virtual desktop deployments as typically the limiting factor there was the number of CPU cores available. If you’re an administrator of a virtual deployment I hope you found this informative and that it helps you when planning future virtual deployments.

¹CPU ready time was chosen as the metric as it most aptly showcases a server’s ability to serve a virtual machine’s request of the CPU when in a heavy scheduling scenario. Usage wouldn’t be an accurate metric to use since for all these tests the blade was 100% utilized no matter the number of virtual machines running.

Website Performance (or People are Impatient).

Way back when I used to host this server myself on the end of my tenuous ADSL connection, loading up the web site always felt like something of a gamble. There were any number of things that could stop me (and the wider world) from getting to it: the connection going down, my server box overheating or even the power going out at my house (which happened more often than I realised). About a year ago I made the move onto my virtual private server and instantly all those worries evaporated, and the blog has been mostly stable ever since. I no longer have to hold my breath every time I type my URL into the address bar, nor do I worry about posting media rich articles anymore, something I avoided when my upstream was a mere 100KB/s.

What really impressed me though was the almost instant traffic boost that I got from the move. At the time I just put it down to more people reading my writing, as I had been at it for well over a year and a half at that point. At the same time I had also made a slight blunder with my DNS settings which redirected all traffic from my subdomains to the main site, so I figured that the burst in traffic was temporary and would drop off as people’s DNS caches expired. The strangest thing though was that the traffic never went away and continued to grow steadily. Not wanting to question my new found popularity I just kept doing what I was always doing, until I stumbled across something that showed me what was happening.

April last year saw Google mix a new metric into their ranking algorithm: page load speed. This was right around the same time that I experienced the traffic boost from moving off my crappy self hosting and onto the VPS. The move had made a significant improvement in the usability of the site, mostly due to the giant pipe it has, and it appeared that Google was now picking up on that and sending more people my way. However the percentage of traffic coming here from search engines remained the same, and since overall traffic was still growing I didn’t care to investigate much further.

I started to notice some curious trends though when aggregating data from a couple of different sources. I use 2 different kinds of analytics here on The Refined Geek: WordPress.com Stats (just because it’s real-time) and Google Analytics for long term tracking and pretty graphs. Now both of them agree with each other pretty well, however the one thing they can’t track is how many people come to my site but leave before the page is fully loaded. In fact I don’t think there’s any particular service that can do this (I would love to be corrected on this) but if you’re using Google’s Webmaster Tools you can get a rough idea of the number of people that come from their search engine but get fed up waiting for your site to load. You can do this by checking the number of clicks you get from search queries and comparing that to the number of people visiting your site from Google according to Google Analytics. This will give you a good impression of how many people abandon your site because it’s running too slow.
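
A trivial sketch of that comparison, with placeholder numbers standing in for what you’d pull out of the Webmaster Tools search query report and Google Analytics:

```csharp
using System;

static class Abandonment
{
    // Searchers who clicked through in Google but never registered as a visit on the site.
    static double AbandonmentRate(int webmasterToolsClicks, int analyticsVisitsFromGoogle)
    {
        return (double)(webmasterToolsClicks - analyticsVisitsFromGoogle) / webmasterToolsClicks;
    }

    static void Main()
    {
        // Placeholder numbers: 1000 clicks reported, 800 visits tracked = 20% lost before the page loads.
        Console.WriteLine("{0:P0}", AbandonmentRate(1000, 800));
    }
}
```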

For this site the results are quite surprising. On average I lose about 20% of my visitors between them clicking on the link in Google and actually loading a page¹. I shudder to think how many I was losing back in the days when a page would take 10+ seconds to load, but I’d hazard a guess it was roughly double that if I take into account the traffic boost I got after moving to a dedicated provider. Getting your site running fast then is probably one of the most important things you can do if you’re looking to get anywhere on the Internets, at least that’s what my data is telling me.

Since I realised this I’ve been on a bit of a performance binge, trying anything and everything to get the site running better. I’m still in the process of doing so, however, and many of the tricks that people talk about for WordPress don’t translate well into the Windows world, so I’m basically hacking my way through it. I’ve dedicated part of my weekend to this and I’ll hopefully write up the results next week so that you other crazy Windows based WordPressers can benefit from my tinkering.

¹If people are interested in finding out this kind of data from their Google Analytics/Webmaster Tools account let me know and I might run up a script to do the comparison for you.


A SSD By Any Other Synthetic Benchmark Would Be As Fast.

Like any technology geek, real world performance is the most important aspect for me when I’m looking to purchase new hardware. Everyone knows manufacturers can’t be trusted with ratings, especially when they come up with their own systems that provide big numbers that mean absolutely nothing, so I primarily base my purchasing decisions on aggregating reviews from various sources around the Internet in order to get a clear picture of which brand/revision I should get. After that point I usually go for the best performance per dollar as, whilst it’s always nice to have the best components, the price differential is usually not worth the leap, mostly because you won’t notice the incremental increase. There are of course notable exceptions to this hard and fast rule and realistically my decision in the end wasn’t driven by rational thought so much as it was pure geeky lust after the highest theoretical performance.

Solid State Drives present quite an interesting value proposition for us consumers. They are leaps and bounds faster than their magnetic predecessors thanks to their ability to access data near instantaneously and their extremely high throughput rates. Indeed, with the hard drive being the performance bottleneck for nearly every computer in the world, the most effective upgrade you can get is an SSD. Of course nothing can beat magnetic hard drives for their cost, durability and capacity so it’s very unlikely that we’ll be seeing the end of them anytime soon. Still, the enormous gap that separates SSDs from any other storage medium brings about some interesting issues of its own: benchmarks, especially synthetic ones, are almost meaningless for end users.

I’ll admit I was struck by geek lust when I saw the performance specs for the OCZ Vertex 3, they were just simply amazing. Indeed the drive has matched up to my sky high expectations, with me being able to boot, login and open up all my applications in the time it took my previous PC just to get to the login screen. Since then I’ve been recommending the Vertex 3 to anyone who was looking to get a new drive, but just recently OCZ announced their new budget line of SSDs, the Agility 3. Being almost $100 cheaper and sporting very similar performance specs to those of the Vertex it’s a hard thing to argue against, especially when you consider just how fast these SSDs are in the first place.

Looking at the raw figures it would seem like the Agility series is around 10% slower than its Vertex counterpart on average, which isn’t bad for a budget line. However when you consider that the 10% performance gap is the difference between Windows loading in 6.3 seconds rather than 7, and your applications launching in 0.9 seconds instead of 1, then the gap doesn’t seem all that big. Indeed I’d challenge anyone to be able to spot the difference between two identical systems configured with different SSDs as these kinds of performance differences will only matter to benchmarkers and people building high traffic systems.

Indeed one of my mates had been running an SSD for well over a year and a half before I got mine and, from what he tells me, the performance of units back then was enough for him to not notice any slow down despite not formatting for that entire time. Likely then if you’re considering getting an SSD but are turned off by the high price of current models you’ll be quite happy with the previous generation, as the perceived performance will be identical. Although with the Agility 3 120GB version going for a mere $250 the price difference between generations isn’t really that much anymore.

Realistically SSDs are just the most prominent example of why synthetic benchmarks aren’t a good indicator of real world performance. There’s almost always an option that will provide similar performance for a drastically reduced price and for the end user the difference will likely be unnoticeable. SSDs are just so far away from their predecessors that the differentials between the low and high end are usually not worth mentioning, especially if you’re upgrading from good old spinning rust. Of course there will always be geeks like me whose lust will overcome their sensibility and reach for the ultimate in performance, which is why those high end products still exist today.

Why I Dropped CloudFlare.

I’m always looking out for ways to improve my blog behind the scenes, mostly because I’ve noticed that a lot more people visit when the page doesn’t take more than 10 seconds to load. Over the course of its life I’ve tried a myriad of things with the blog, from changing operating systems to trying nearly every plugin under the sun that said it could boost my site’s performance. In the end the best move I ever made was to put it on a Windows virtual private server in the USA backed by a massive pipe, and nothing I’ve tried since has come close.

However I was intrigued by the services offered by CloudFlare, a new web start up that offers to speed up basically any web site. I’d read about them a while back when they were participating in TechCrunch Disrupt and the idea of being able to back my blog with a CDN for free was something few would pass up. At the time however my blog was on a Linux server with all the caching plugins functioning fine, so my site was performing pretty much as fast as it could. After the migration to my new Windows server I had to disable my caching plugins as they assumed a Linux host in order to function properly. I didn’t really think about CloudFlare again until they came up in my feed reader just recently, so I decided to give them a go.

They’re not wrong when they say their setup is painless (at least for an IT geek like myself). After signing up with them and entering my site details all I needed to do was update my name servers to point to theirs and I was fully integrated with their service. At first I was a bit confused since it didn’t seem to be doing anything but proxying the connections to my site, but it would seem that it does cache static content. How it goes about this doesn’t seem to be public knowledge however, so I got the feeling it only does it per request. Still, after getting it all set up I decided I’d leave it over the weekend to see how it performed, and come this morning I wasn’t terribly impressed with the results.

Whilst the main site suffered absolutely no downtime, my 2 dozen subdomains seemed to have dropped off the face of the earth. Initially I had thought that this was because of the wildcard DNS entry that I had used to redirect all subdomain requests (CloudFlare says they won’t proxy them if you do this, which was fine for me in this instance). However after manually entering the subdomains and waiting 24 hours to see the results they were still not accessible. Additionally the site load times didn’t improve noticeably, leaving me wondering if this was worth all the time I had put into it. After changing my name servers back to their previous locations all my sites came back up immediately, which soured me on the whole CloudFlare idea.

It could be that it was all a massive configuration goof on my part but, since I was able to restore my sites by reverting the change, I’m leaning towards it being a problem with CloudFlare. For single site websites it’s probably a good tool and I’d be lying if I said I wasn’t interested in their DDoS protection (I was on edge after doing that LulzSec piece) but it seems my particular configuration doesn’t gel with their service. Don’t let me talk you out of trying them however, since so many people seem to be benefiting from what they offer; it’s just that there might be potential problems if you’re running dozens of subdomains like me.