Back when virtualization was just starting to make headway into the corporate IT market the main aim of the game was consolidation. Vast quantities of CPU, memory and disk resources were being squandered as servers sat idle for the vast majority of their lives, barely ever using the capacity that was assigned to them. Virtualization allowed IT shops to run many low resource servers on the one box, significantly reducing hardware costs whilst providing a whole host of other features. It followed then that administrators looked towards over-provisioning their hosts, i.e. creating more virtual machines than the host was technically capable of handling.
This works because of a feature of virtualization platforms called scheduling. In essence when you put a virtual machine on an over-provisioned host it is not guaranteed to get resources when it needs them; instead it’s scheduled on and off the CPU in order to keep it and all the other virtual machines running properly. Surprisingly this works quite well, as for the most part virtual machines spend a good part of their life idle and the virtualization platform uses this information to schedule busy machines ahead of idle ones. Recently I was approached to find out what the limits were of a new piece of hardware that we had procured and I’ve discovered some rather interesting results.
The piece of kit in question is a Dell M610x blade server with the accompanying chassis and interconnects. The specifications we got were pretty good, being a dual processor arrangement (2 x Intel Xeon X5660) with 96GB of memory. What we were trying to find out was what kind of guidelines we should have around how many virtual machines could comfortably run on such hardware before performance started to degrade. There was no such testing done with previous hardware so I was working in the dark on this one, and I devised my own test methodology in order to figure out the upper limits of over-provisioning in a virtual world.
The primary performance bottleneck for any virtual environment is the disk subsystem. You can have the fastest CPUs and oodles of RAM and still get torn down by slow disk. However most virtual hosts will use some form of shared storage so testing that is out of the equation. The two primary resources we’re left with then are CPU and memory and the latter is already a well known problem space. However I wasn’t able to find any good articles on CPU over-provisioning so I devised some simple tests to see how the systems would perform when under a load that was well above its capabilities.
The first test was a simple baseline. Since the server has 12 available physical cores (HyperThreading might say you get another core, but that’s a pipe dream) I created 12 virtual machines each with a single core. I then fully loaded the CPUs to max capacity. Shown below is a stacked graph of each virtual machine’s ready time, which is a representation of how long the virtual machine was ready¹ to execute some instruction but was not able to get scheduled onto the CPU.
The initial part of this graph shows the machines all at idle. Now you’d think at that stage that their ready times would be zero since there’s no load on the server. However since VMware’s hypervisor knows when a virtual machine is idle it won’t schedule it on as often, as the idle loops are simply wasted CPU cycles. The jumpy period after that is when I was starting up a couple of virtual machines at a time and as you can see those virtual machines’ ready times drop to 0. The very last part of the graph shows the ready time rocketing down to nothing for all the virtual machines, with the top grey part of the graph being the ready time of the hypervisor itself.
This test doesn’t show anything revolutionary as this is pretty much the expected behaviour of a virtualized system. It does however provide us with a solid baseline from which we can draw some conclusions from further tests. The next test I performed was to see what would happen when I doubled the work load on the server, increasing the virtual core count from 12 to a whopping 24.
For comparison’s sake the first graph’s peak is equivalent to the first peak of the second graph. What this shows is that when the CPU is oversubscribed by 100% the CPU wait times rocket through the roof with the virtual machines waiting up to 10 seconds in some cases to get scheduled back onto the CPU. The average was somewhere around half a second which for most applications is an unacceptable amount of time. Just imagine trying to use your desktop and having it freeze for half a second every 20 seconds or so, you’d say it was unusable. Taking this into consideration we now know that there must be some level of happy medium in the centre. The next test then aimed right bang in the middle of these two extremes, putting 18 CPUs on a 12 core host.
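To put those wait times in perspective, ready time in milliseconds can be turned into a percentage of the sample interval. The sketch below is only illustrative: the function name is my own, and it assumes each point on the graph covers a 20 second sample (which is what the "half a second every 20 seconds" comparison above implies).

```python
def ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU ready summation (ms) into a percentage of the sample interval.

    interval_s defaults to 20 seconds, an assumption about the chart's
    sample length rather than something measured in these tests.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ready_ms / (interval_s * 1000.0) * 100.0


if __name__ == "__main__":
    # Figures quoted for the 24 vCPU test: ~500 ms average, ~10 s worst case.
    samples = [("average at 24 vCPUs", 500), ("worst case at 24 vCPUs", 10_000)]
    for label, ms in samples:
        print(f"{label}: {ms} ms ready is {ready_percent(ms):.1f}% of a 20 s sample")
```

Seen this way the half second average means a virtual machine spent a few percent of each sample waiting, whilst the worst case machines spent half of it waiting.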
Here’s where it gets interesting. The graph depicts the same test running over the entire time but as you can see there are very distinct sections depicting what I call different modes of operation. The first, at the lower end of the graph, is when the scheduler is hitting bang on its scheduling and the wait times are overall quite low. The second is when the scheduler gives much more priority to the virtual machines that are thrashing their cores and the machines that aren’t doing anything get pushed to the side. However in both instances the 18 cores running are able to get serviced in a maximum of 20 milliseconds or so, well within the acceptable range of most programs and user experience guidelines.
Taking this all into consideration it’s then reasonable to say that the maximum you can oversubscribe a virtual host in regards to CPU is 1.5 times the number of physical cores. You can extrapolate that further by taking into consideration the average load and if it’s below 100% constantly then you can divide the number of CPUs by that percentage. For example if the average load of these virtual machines was 50% then theoretically you could support 36 single core virtual machines on this particular host. Of course once you get into the very high CPU count things like overhead start to come into consideration, but as a hard and fast rule it works quite well.
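The arithmetic behind that rule of thumb can be written down as a short sketch. The function and parameter names are my own, the 1.5 ratio is simply the figure from the tests above, and it ignores the overhead that creeps in at high CPU counts.

```python
import math


def max_single_core_vms(physical_cores: int, avg_load: float = 1.0,
                        ratio: float = 1.5) -> int:
    """Estimate how many single core VMs a host can carry.

    ratio is the 1.5x oversubscription figure from the tests above;
    avg_load is the average CPU load of the VMs as a fraction (0 to 1).
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    if not 0 < avg_load <= 1:
        raise ValueError("avg_load must be in the range (0, 1]")
    # Round before flooring so floating point noise can't drop a VM.
    return math.floor(round(physical_cores * ratio / avg_load, 9))


if __name__ == "__main__":
    print(max_single_core_vms(12))                # fully loaded VMs
    print(max_single_core_vms(12, avg_load=0.5))  # VMs averaging 50% load
```

For the 12 core blade this gives 18 fully loaded virtual machines, or 36 if they average 50% load, matching the figures above.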
If I’m honest I was quite surprised with these results as I thought once I put a single extra thrashing virtual machine on the server it’d fall over in a screaming heap with the additional load. It seems though that VMware’s scheduler is smart enough to be able to service a load much higher than what the server should be capable of without affecting the other virtual machines that adversely. This is especially good news for virtual desktop deployments as typically the limiting factor there was the number of CPU cores available. If you’re an administrator of a virtual deployment I hope you found this informative and it will help you when planning future virtual deployments.
¹CPU ready time was chosen as the metric as it most aptly showcases a server’s ability to serve a virtual machine’s request of the CPU when in a heavy scheduling scenario. Usage wouldn’t be an accurate metric to use since for all these tests the blade was 100% utilized no matter the number of virtual machines running.
It’s no secret that I owe a large part of my IT career to virtualization. It was a combination of luck, timing and willingness to jump into the unknown that led me down the VMware path: my first workplace used VMware’s products, which set the stage for every job thereafter seeing my experience and latching on to it with a crack-junkie like desire. Over the years I’ve become intimately familiar with many virtualization solutions but inevitably I find myself coming back to VMware because, simply put, they’re the market leaders and pretty much everyone who can afford to use them does so. So you can imagine I was somewhat excited when I saw the release of vSphere 5, and I’ve been putting it through its paces over the past couple of weeks.
On the surface ESXi 5 and vSphere 5 look almost identical to their predecessors. ESXi 5 is really only distinguishable from 4 thanks to the slightly different layout and changed font, whilst vSphere 5 is exactly the same save for some new icons and additional links to new features. I guess with any new product version I’ve just come to expect a UI revamp even if it adds nothing to the end product, so the fact that VMware decided to stick with their current UI came as somewhat of a surprise but I can’t really fault them for doing so. The real meat of vSphere 5 is under the hood and there have been some major improvements from my initial testing.
vSphere 5 brings with it Virtual Machine Version 8 which, amongst the usual more CPUs/more memory upgrades, brings support for 3D accelerated graphics, UEFI for the BIOS (which technically means it can run OSX Lion although that will never happen¹) and USB 3.0 support. There are also a few new options available when creating a new virtual machine like the ability to add virtual sockets (not just virtual cores) and the choice between eager and lazy zeroed disks.
The one overall impression that vSphere 5 has left on me though is that it’s fast, like really fast. The UI is much more responsive, operations that used to take minutes are now done in seconds and in the few performance tests we’ve done ESXi 5 seems to be consistently faster than its 4.1 Update 1 counterpart. According to my sources close to the matter this is because ESXi 5 is all new code from the ground up, enabling them to enhance performance significantly. From my first impressions with it I’d say that they’ve succeeded in doing this and I’m looking forward to seeing how it handles real production loads in the very near future.
What really amazed me was that a lot of the code that I had developed for vSphere 4 was 100% compatible with vSphere 5. I had been dreading having to rewrite the near 2000 lines of code that I had developed for the build system in order to get ESXi 5 into our environment but every command worked without a hitch, showing that VMware’s dedication to backwards compatibility is extremely good, approaching the king of compatibility Microsoft. Indeed those looking to migrate to vSphere 5 don’t have much to worry about as pretty much every feature of the previous version is supported, and migrating to the newer platform is quite painless.
I’ve yet to have a chance to fiddle with some of the new features (like the storage appliance, which looks incredibly cool) but overall my first impressions of vSphere 5 are quite good, along the lines of what I’ve come to expect from VMware. I haven’t run into any major gotchas yet but I’ve only had a couple of VMs running in an isolated vSphere instance so my sample size is rather limited. I’m sure once I start throwing some real applications at it I’ll start running into some more interesting problems but suffice to say that VMware has done well with this release and I can see vSphere 5 making its home in all IT departments where VMware is already deployed.
¹The stipulation for all Apple products is that they run on Apple hardware, including virtualized instances. Since the only things you can buy with OSX Server installed on them are Mac Mini Servers or Mac Pros, neither of which is on the Hardware Compatibility List, running your own virtualized copies of OSX Server (legitimately) simply can’t happen. Yet I still get looks of amazement when I tell people Apple is a hardware company, figures.
It’s a sad truth that once a company reaches a certain level of success they tend to stop listening to their users/customers, since by that point they have enough validation to continue down whatever path suits them. It’s a double edged sword for the company as whilst they now have much more freedom to experiment since they don’t have to fight for every customer they also have enough rope to hang themselves should they be too ambitious. This happens more in traditional business rather than say Web 2.0 companies since the latter’s bread and butter is their users and the community that surrounds them, leaving them a lot less wiggle room when it comes to going against the grain of their wishes.
I recently blogged about VMware’s upcoming release of vSphere 5 which whilst technologically awesome did have the rather unfortunate aspect of screwing over the small to medium size enterprises that had heavily invested in the platform. At the time I didn’t believe that VMware would change their mind on the issue, mostly because their largest customers would most likely be unaffected by it (especially the cloud providers) but just under three weeks later VMware has announced that they are changing the licensing model, and boy is it generous:
We are a company built on customer goodwill and we take customer feedback to heart. Our primary objective is to do right by our customers, and we are announcing three changes to the vSphere 5 licensing model that address the three most recurring areas of customer feedback:
We’ve increased vRAM entitlements for all vSphere editions, including the doubling of the entitlements for vSphere Enterprise and Enterprise Plus.
We’ve capped the amount of vRAM we count in any given VM, so that no VM, not even the “monster” 1TB vRAM VM, would cost more than one vSphere Enterprise Plus license.
We adjusted our model to be much more flexible around transient workloads, and short-term spikes that are typical in test & dev environments for example.
The first 2 points are the ones that will matter to most people with the bottom end licenses getting a 33% boost to 32GB of vRAM allocation and every other licensing level getting their allocations doubled. Now for the lower end that doesn’t mean a whole bunch but the standard configuration just gained another 16GB of vRAM which is nothing to sneeze at. At the higher end however these massive increases start to really pile on, especially for a typical configuration that has 4 physical CPUs which now sports a healthy 384GB vRAM allocation with default licensing. The additional caveat of virtual machines not using more than 96GB of vRAM means that licensing costs won’t get out of hand for mega VMs but in all honesty if you’re running virtual machines that large I’d have to question your use of virtualization in the first place. Additionally the change from a monthly average to a 12 month average for the licensing check does go some way to alleviating the pain that some users will feel, even though they could’ve worked around it by asking VMware nicely for one of those unlimited evaluation licenses.
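As a rough illustration of how the revised numbers fit together, here is a small sketch. The tier keys, function names and the per-tier mapping of the 64GB figure are my own assumptions pieced together from the figures in this post, and the averaging check is a simplification rather than VMware’s actual compliance logic.

```python
# Revised vRAM entitlement per CPU license in GB. The tier keys are
# hypothetical labels; the values are the figures quoted in this post
# (bottom tier 32GB, the higher tiers doubled from 32GB and 48GB).
REVISED_VRAM_GB = {"entry": 32, "enterprise": 64, "enterprise_plus": 96}
PER_VM_CAP_GB = 96  # no single VM counts for more than this


def pooled_entitlement_gb(edition: str, cpu_licenses: int) -> int:
    """Total vRAM that a number of CPU licenses entitles you to allocate."""
    return REVISED_VRAM_GB[edition] * cpu_licenses


def counted_vram_gb(vm_sizes_gb: list[int]) -> int:
    """Sum the vRAM of running VMs, applying the per-VM cap."""
    return sum(min(size, PER_VM_CAP_GB) for size in vm_sizes_gb)


def within_entitlement(monthly_counted_gb: list[float],
                       entitlement_gb: float, window: int = 12) -> bool:
    """Check the average of the last `window` monthly figures against the pool."""
    recent = monthly_counted_gb[-window:]
    if not recent:
        raise ValueError("need at least one monthly figure")
    return sum(recent) / len(recent) <= entitlement_gb


if __name__ == "__main__":
    pool = pooled_entitlement_gb("enterprise_plus", 4)
    print(pool)                          # the 4 CPU configuration above
    print(counted_vram_gb([1024, 8]))    # a 1TB "monster" VM plus an 8GB VM
    # Eleven quiet months and one spike above the pool still average out.
    print(within_entitlement([300] * 11 + [500], pool))
```

The last check shows why the longer averaging window helps with test and dev spikes: one month at 500GB against a 384GB pool is absorbed by the eleven quieter months around it.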
What these changes do is make vSphere 5 a lot more feasible for users who have already invested heavily in VMware’s platform. Whilst it’s nowhere near the current 2 processors + gobs of RAM deal that many have been used to, it does now make the smaller end of the scale much more palatable, even if the cheapest option will leave you with a meagre 64GB of RAM to allocate. That’s still enough for many environments to get decent consolidation ratios of say 8 to 1 with 8GB VMs, even if that’s slightly below the desired industry average of 10 to 1. The higher end, whilst being a lot more feasible for a small number of ridiculously large VMs, still suffers somewhat as higher end servers will still need additional licenses to fully utilize their capacity. Of course not many places will need 4 processor, 512GB beasts in their environments but it’s still going to be a factor to count against VMware.
The licensing changes from VMware are very welcome and will go a long way for people like me who are trying to sell vSphere 5 to their higher ups. Whilst licensing was never an issue for me I do know that it was a big factor for the majority and these improvements will allow them to stay on the VMware platform without having to struggle with licensing concerns. I have to then give some major kudos to VMware for listening to their community and making these changes that will ultimately benefit both them and their customers as this kind of interaction is becoming increasingly rare as time goes on.
I make no secret of the fact that I’ve pretty much built my career around a single line of products, specifically those from VMware. Initially I simply used their workstation line of products to help me through university projects that required Linux to complete but after one of my bosses caught wind of my “experience” with VMware’s products I was put on the fast line to become an expert in their technology. The timing couldn’t have been more perfect as virtualization then became a staple of every IT department I’ve had the pleasure of working with and my experience with VMware ensured that my resume always floated around near the top when it came time to find a new position.
In this time I’ve had a fair bit of experience with their flagship product now called vSphere. In essence it’s an operating system you can install on a server that lets you run multiple, distinct operating system instances on top of it. Since IT departments always bought servers with more capacity than they needed, systems like vSphere meant they could use that excess capacity to run other, not so power hungry systems alongside them. It really was a game changer and from then on servers were usually bought with virtualization being the key purpose in mind rather than them being for a specific system. VMware is still the leader in this sector holding an estimated 80% of the market and has arguably the most feature rich product suite available.
Yesterday saw the announcement of their latest product offering vSphere 5. From a technological standpoint it’s very interesting with many innovations that will put VMware even further ahead of their competition, at least technologically. Amongst the usual fanfare of bigger and better virtual machines and improvements to their current technologies vSphere 5 brings with it a whole bunch of new features aimed squarely at making vSphere the cloud platform for the future. Primarily these innovations are centred around automating certain tasks within the data centre, such as provisioning new servers and managing server loads, including down to the disk level which wasn’t available previously. Considering that I believe the future of cloud computing (at least for government organisations and large scale in house IT departments) is a hybrid public/private model these improvements are a welcome change, even if I won’t be using them immediately.
The one place that VMware falls down and is (rightly) heavily criticized for is the price. With the most basic licenses costing around $1000 per processor it’s not a cheap solution by any stretch of the imagination, especially if you want to take advantage of any of the advanced features. Still since the licensing was per processor it meant that you could buy a dual processor server (each with say, 6 cores) with oodles of RAM and still come out ahead of other virtualization solutions. However with vSphere 5 they’ve changed the way they do pricing significantly, to the point of destroying such a strategy (and those potential savings) along with it.
Licensing is still charged on a per-processor basis but instead of having an upper limit on the amount of memory (256GB for most licenses, Enterprise Plus gives you unlimited) you are now given a vRAM allocation per license purchased. Depending on your licensing level you’ll get 24GB, 32GB or 48GB worth of vRAM which you’re allowed to allocate to virtual machines. Now for typical smaller servers this won’t pose much of a problem as a dual proc, 48GB RAM server (which is very typical) would be covered easily by the cheapest licensing. However should you exceed even 96GB of RAM, which is very easy to do, that same server will then require additional licenses to be purchased in order to be able to fully utilize the hardware. For smaller environments this has the potential to make VMware’s virtualization solution untenable, especially when you put it beside the almost free competitor of Hyper-V from Microsoft.
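A quick sketch of the license arithmetic makes the problem clearer. The function name is my own, and it makes two simplifying assumptions: the host is licensed in isolation rather than as part of a larger pool, and all of its RAM is allocated to virtual machines.

```python
import math


def licenses_needed(physical_cpus: int, allocated_vram_gb: float,
                    vram_per_license_gb: float) -> int:
    """Licenses required for one host under the per-processor vRAM model.

    You need at least one license per physical processor, plus enough
    licenses to cover the vRAM allocated to virtual machines.
    """
    if physical_cpus < 1 or vram_per_license_gb <= 0:
        raise ValueError("need at least one CPU and a positive entitlement")
    for_vram = math.ceil(allocated_vram_gb / vram_per_license_gb)
    return max(physical_cpus, for_vram)


if __name__ == "__main__":
    # Dual proc, 48GB server on the cheapest (24GB) licensing.
    print(licenses_needed(2, 48, 24))
    # Dual proc servers on the top (48GB) licensing, at and above 96GB.
    print(licenses_needed(2, 96, 48))
    print(licenses_needed(2, 128, 48))
```

The first two cases need only the two per-processor licenses, whilst the 128GB server needs a third license just to make use of its memory.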
The VMware user community has, of course, not reacted positively to this announcement. Whilst for many larger environments the problems won’t be so bad as the vRAM allocation is done at the data center level and not the server level (allowing over-allocated smaller servers to help out their beefier brethren) it does have the potential to hurt smaller environments especially those who heavily invested in RAM heavy, processor poor servers. It’s also compounded by the fact that you’ll only have a short time to choose to upgrade for free, thus risking having to buy more licenses, or abstain and then later have to pay an upgrade fee. It’s enough for some to start looking into moving to the competition which could cut into VMware’s market share drastically.
The reasoning behind these changes is simple: such pricing is much more favourable to a ubiquitous cloud environment than it is to the current industry norm for VMware deployments. VMware might be slightly ahead of the curve on this one however as most customers are not ready to deploy their own internal clouds with the vast majority of current cloud users being hosted solutions. Additionally many common enterprise applications aren’t compatible with VMware’s cloud and thus lock end users out of realising the benefits of a private cloud. VMware might be choosing to bite the bullet now rather than later in the hopes it will spur movement onto their cloud platform at a later stage. Whether this strategy works or not remains to be seen, but current industry trends are pushing very hard towards a cloud based future.
I’m definitely looking forward to working with vSphere 5 and there are several features that will definitely provide an immense amount of value to my current environment. The licensing issue, whilst I feel it won’t be much of an issue for me, is cause for concern and whilst I don’t believe VMware will budge on it any time soon I do know that the VMware community is an innovative lot and it won’t be long before they work out how to make the best of this licensing situation. Still it’s definitely an in for the competition and whilst they might not have the technological edge they’re more than suitable for many environments.