Back when virtualization was just starting to make headway into the corporate IT market the main aim of the game was consolidation. Vast quantities of CPU, memory and disk resources were being squandered as servers sat idle for the vast majority of their lives, barely ever using the capacity that was assigned to them. Virtualization allowed IT shops the ability to run many low resource servers on the one box, significantly reducing the hardware requirement cost whilst providing a whole host of other features. It followed then that administrators looked towards over-provisioning their hosts, I.E. creating more virtual machines than the host was technically capable of handling.
The reason this works is because of a feature of virtualization platforms called scheduling. In essence when you put a virtual machine on an over-provisioned host it will not be guaranteed to get resources when it needs them, instead it’s scheduled on and in order to keep it and all the other virtual machines running properly. Surprisingly this works quite well as for the most part virtual machines spend a good part of their life idle and the virtualization platform uses this information to schedule busy machines ahead of idle ones. Recently I was approached to find out what the limits were of a new piece of hardware that we had procured and I’ve discovered some rather interesting results.
The piece of kit in question is a Dell M610x blade server with the accompanying chassis and interconnects. The specifications we got were pretty good being a dual processor arrangement (2 x Intel Xeon X5660) with 96GB of memory. What we were trying to find out was what kind of guidelines should we have around how many virtual machines could comfortably run on such hardware before performance started to degrade. There was no such testing done with previous hardware so I was working in the dark on this one, so I’ve devised my own test methodology in order to figure out the upper limits of over-provisioning in a virtual world.
The primary performance bottleneck for any virtual environment is the disk subsystem. You can have the fastest CPUs and oodles of RAM and still get torn down by slow disk. However most virtual hosts will use some form of shared storage so testing that is out of the equation. The two primary resources we’re left with then are CPU and memory and the latter is already a well known problem space. However I wasn’t able to find any good articles on CPU over-provisioning so I devised some simple tests to see how the systems would perform when under a load that was well above its capabilities.
The first test was a simple baseline, since the server has 12 available physical cores (HyperThreading might say you get another core, but that’s a pipe dream) I created 12 virtual machines each with a single core. I then fully loaded the CPUs to max capacity. Shown below is a stacked graph of each virtual machine’s ready time which is a representation of how long the virtual machine was ready¹ to execute some instruction but was not able to get scheduled onto the CPU.
The initial part of this graph shows the machines all at idle. Now you’d think at that stage that their ready times would be zero since there’s no load on the server. However since VMware’s hypervisor knows when a virtual machine is idle it won’t schedule it on as often as the idle loops are simply wasted CPU cycles. The jumpy period after that is when I was starting up a couple virtual machines at a time and as you can see those virtual machine’s ready times drop to 0. The very last part of the graph shows the ready time rocketing down to nothing for all the virtual machines with the top grey part of the graph being the ready time of the hypervisor itself.
This test doesn’t show anything revolutionary as this is pretty much the expected behaviour of a virtualized system. It does however provide us with a solid baseline from which we can draw some conclusions from further tests. The next test I performed was to see what would happen when I doubled the work load on the server, increasing the virtual core count from 12 to a whopping 24.
For comparison’s sake the first graph’s peak is equivalent to the first peak of the second graph. What this shows is that when the CPU is oversubscribed by 100% the CPU wait times rocket through the roof with the virtual machines waiting up to 10 seconds in some cases to get scheduled back onto the CPU. The average was somewhere around half a second which for most applications is an unacceptable amount of time. Just imagine trying to use your desktop and having it freeze for half a second every 20 seconds or so, you’d say it was unusable. Taking this into consideration we now know that there must be some level of happy medium in the centre. The next test then aimed right bang in the middle of these two extremes, putting 18 CPUs on a 12 core host.
Here’s where it gets interesting. The graph depicts the same test running over the entire time but as you can see there are very distinct sections depicting what I call different modes of operation. The lower end of the graph shows a time when the scheduler is hitting bang on its scheduling and the wait times are overall quite low. The second is when the scheduler gives much more priority to the virtual machines that are thrashing their cores and the machines that aren’t doing anything get pushed to the side. However in both instances the 18 cores running are able to get the serviced in a maximum of 20 milliseconds or so, well within the acceptable range of most programs and user experience guidelines.
Taking this all into consideration it’s then reasonable to say that the maximum you can oversubscribe a virtual host in regards to CPU is 1.5 times the number of physical cores. You can extrapolate that further by taking into consideration the average load and if it’s below 100% constantly then you can divide the number of CPUs by that percentage. For example if the average load of these virtual machines was 50% then theoretically you could support 36 single core virtual machines on this particular host. Of course once you get into the very high CPU count things like overhead start to come into consideration, but as a hard and fast rule it works quite well.
If I’m honest I was quite surprised with these results as I thought once I put a single extra thrashing virtual machine on the server it’d fall over in a screaming heap with the additional load. It seems though that VMware’s scheduler is smart enough to be able to service a load much higher than what the server should be capable of without affecting the other virtual machines that adversely. This is especially good news for virtual desktop deployments as typically the limiting factor there was the number of CPU cores available. If you’re an administrator of a virtual deployment I hope you found this informative and it will help you when planning future virtual deployments.
¹CPU ready time was chosen as the metric as it most aptly showcases a server’s ability to serve a virtual machine’s request of the CPU when in a heavy scheduling scenario. Usage wouldn’t be an accurate metric to use since for all these tests the blade was 100% utilized no matter the number of virtual machines running.
I make no secret of the fact that I’ve pretty much built my career around a single line of products, specifically those from VMware. Initially I simply used their workstation line of products to help me through university projects that required Linux to complete but after one of my bosses caught wind of my “experience” with VMware’s products I was put on the fast line to become an expert in their technology. The timing couldn’t have been more perfect as virtualization then became a staple of every IT department I’ve had the pleasure of working with and my experience with VMware ensured that my resume always floated around near the top when it came time to find a new position.
In this time I’ve had a fair bit of experience with their flagship product now called vSphere. In essence it’s an operating system you can install on a server that lets you run multiple, distinct operating system instances on top of it. Since IT departments always bought servers with more capacity than they needed systems like vSphere meant they could use that excess capacity to run other, not so power hungry systems along side them. It really was a game changer and from then on servers were usually bought with virtualization being the key purpose in mind rather than them being for a specific system. VMware is still the leader in this sector holding an estimated 80% of the market and has arguably the most feature rich product suite available.
Yesterday saw the announcement of their latest product offering vSphere 5. From a technological standpoint it’s very interesting with many innovations that will put VMware even further ahead of their competition, at least technologically. Amongst the usual fanfare of bigger and better virtual machines and improvements to their current technologies vSphere 5 brings with it a whole bunch of new features aimed squarely at making vSphere the cloud platform for the future. Primarily these innovations are centred around automating certain tasks within the data centre, such as provisioning new servers and managing server loads including down to the disk level which wasn’t available previously. Considering that I believe the future of cloud computing (at least for government organisations and large scale in house IT departments) is a hybrid public/private model these improvements are a welcome change , even if I won’t be using them immediately.
The one place that VMware falls down and is (rightly) heavily criticized for is the price. With the most basic licenses costing around $1000 per core it’s not a cheap solution by any stretch of the imagination, especially if you want to take advantage of any of the advanced features. Still since the licencing was per processor it meant that you could buy a dual processor server (each with say, 6 cores) with oodles of RAM and still come out ahead of other virtualization solutions. However with vSphere 5 they’ve changed the way they do pricing significantly, to the point of destroying such a strategy (and those potential savings) along with it.
Licensing is still charged on a per-processor basis but instead of having an upper limit on the amount of memory (256GB for most licenses, Enterprise Plus gives you unlimited) you are now given a vRAM allocation per licence purchased. Depending on your licensing level you’ll get 24GB, 32GB or 48GB worth of vRAM which you’re allowed to allocate to virtual machines. Now for typical smaller servers this won’t pose much of a problem as a dual proc, 48GB RAM server (which is very typical) would be covered easily by the cheapest licensing. However should you exceed even 96GB of RAM, which is very easy to do, that same server will then require additional licenses to be purchased in order to be able to full utilize the hardware. For smaller environments this has the potential to make VMware’s virtualization solution untenable, especially when you put it beside the almost free competitor of Hyper-V from Microsoft.
The VMware user community has, of course, not reacted positively to this announcement. Whilst for many larger environments the problems won’t be so bad as the vRAM allocation is done at the data center level and not the server level (allowing over-allocated smaller servers to help out their beefier brethren) it does have the potential to hurt smaller environments especially those who heavily invested in RAM heavy, processor poor servers. It’s also compounded by the fact that you’ll only have a short time to choose to upgrade for free, thus risking having to buy more licenses, or abstain and then later have to pay an upgrade fee. It’s enough for some to start looking into moving to the competition which could cut into VMware’s market share drastically.
The reasoning behind these changes is simple: such pricing is much more favourable to a ubiquitous cloud environment than it is to the current industry norm for VMware deployments. VMware might be slightly ahead of the curve on this one however as most customers are not ready to deploy their own internal clouds with the vast majority of current cloud users being hosted solutions. Additionally many common enterprise applications aren’t compatible with VMware’s cloud and thus lock end users out of realising the benefits of a private cloud. VMware might be choosing to bite the bullet now rather than later in the hopes it will spur movement onto their cloud platform at a later stage. Whether this strategy works or not remains to be seen, but current industry trends are pushing very hard towards a cloud based future.
I’m definitely looking forward to working with vSphere 5 and there are several features that will definitely provide an immense amount of value to my current environment. The licensing issue, whilst I feel won’t be much of an issue, is cause for concern and whilst I don’t believe VMware will budge on it any time soon I do know that the VMware community is an innovative lot and it won’t be long before they work out how to make the best of this licensing situation. Still it’s definitely an in for the competition and whilst they might not have the technological edge they’re more than suitable for many environments.