The current project I’m on has a requirement for being able to determine a server’s overall performance before and after a migration, mostly to make sure that it still functions the same or better once its on the new platform. Whilst it’s easy enough to get raw statistics from perfmon getting an at-a-glance view of how a server is performing before and after a migration is a far more nuanced concept, one that’s not easily accomplished with some Excel wizardry. With that in mind I thought I’d share with you my idea for creating such a view as well as outlining the challenges I’ve hit when attempting to collate the data.
At a high level I’ve focused on the 4 core resources that all operating systems consume: CPU, RAM, disk and network. For the most part these metrics are easily captured by the counters that perfmon has however I wanted to go a bit further to make sure that the final comparisons represented a more “true” picture of before and after performance. To do this I included some additional qualifying metrics which would show if increased resource usage was negatively impacting on performance or if it was just the server consuming more resources because it could since the new platform had much more capacity. With that in mind these are the metrics I settled on using:
Essentially these metrics can be broken down into 3 categories: quantitative, qualitative and qualifying. Quantitative metrics are the base metrics which will form the main part of the before and after analysis. Qualitative metrics are mostly just informational (being the Top 5 consumers of X resource) however they’ll provide some useful insight into what might be causing an issue. For example if an SQL box isn’t showing the SQL process as a top consumer then it’s likely something is causing that process to take a dive before it can actually use any resources. Finally the qualifying metrics are used to indicate whether or not increased usage of a certain metric signals an impact to the server’s performance like say if the memory usage is high and the memory balloon size is high it’s quite likely the system isn’t performing very well.
The vast majority of these metrics are provided in perfmon however there were a couple that I couldn’t seem to get through the counters, even though I could see them in Resource Monitor. As it turns out Resource Monitor makes use of the Event Tracing for Windows (ETW) framework which gives you an incredibly granular view of all events that are happening on your machine. What I was looking for was a breakdown of disk and network usage per process (in order to generate the Top 5 users list) which is unfortunately bundled up in the IO counters available in perfmon. In order to split these out you have to run a Kernel Trace through ETW and then parse the resulting file to get the metrics you want. It’s a little messy but unfortunately there’s no good way to get those metrics separated. The resulting perfmon profile I created can be downloaded here.
The next issue I’ve run into is getting the data into a readily digestible format. You see not all servers are built the same and not all of them run the same amount of software. This means that when you open up the resulting CSV file from different servers the column headers won’t line up so you’ve got to either do some tricky Excel work (which is often prone to failure) or get freaky with some PowerShell (which is messy and complicated). I decided to go for the latter as at least I could maintain and extend the script somewhat easily whereas an Excel spreadsheet has a tendency to get out of control faster than anyone expects. That part is still a work in progress however but I’ll endeavour to update this post with the completed script once I’ve got it working.
After that point it’s a relatively simple task of displaying everything in a nicely formatted Excel spreadsheet and doing comparisons based on the metrics you’ve generated. If I had more time on my hands I probably would’ve tried to integrate it into something like a SharePoint BI site so we could do some groovy tracking and intelligence on it but due to tight time constraints I probably won’t get that far. Still a well laid out spreadsheet isn’t a bad format for presenting such information, especially when you can colour everything green when things are going right.
I’d be keen to hear other people’s thoughts on how you’d approach a problem like this as trying to quantify the nebulous idea of “server performance” has proven to be far more challenging than I first thought it would be. Part of this is due to the data manipulation required but it was also ensuring that all aspects of a server’s performance were covered and converted down to readily digestible metrics. I think I’ve gotten close to a workable solution with this but I’m always looking for ways to improve it or if there’s a magical tool out there that will do this all for me 😉
I have the pleasure of configuring some Dell kit without the use of a pre-execution environment. This presents quite a challenge as many of the management tools are designed to run within such an environment or an installed operating system which means that my options for configuring these serves is somewhat limited. Thankfully for most of the critical stuff Dell’s RACADM tool is more than capable of managing the server remotely however it unfortunately doesn’t have any access to the system BIOS where some critical changes need to be made. Thus I was in need of finding a solution to this problem and it seems that my saviour comes in the form of a protocol called Web Services Management (WSMAN).
WSMAN is an open protocol for server management which provides a rather feature rich interface to your hardware for getting, setting and enumerating the various features and settings on your hardware. Of course since its so powerful it’s also rather complex in nature and you won’t really be able to stumble your way through it without the help of a vendor specific guide. For Dell servers the appropriate guide is the Lifecycle Controller Web Services Interface Guide (there’s an equivalent available for Linux) which gives you a breakdown of the commands that are available and what they can accomplish.
They’re not fully documented however so I thought I’d show you a couple commands I’ve used in order to configure some BIOS settings on one of the M910 blades I’m currently working on. The first requirement was to disable all the on board NICs as we want to use the Qlogic QME8262-k 10GB NICs instead. In order to do this however we first need to get some information out of the WSMAN interface in order to know which variables to change. The first command you’ll want to run is the following:
winrm e http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_BIOSEnumeration -u:root -p:calvin -r:https://[iDRACIP]/wsman -SkipCNcheck -SkipCAcheck -encoding:utf-8 -a:basic
Which will give you a whole bunch of output along these lines:
DCIM_BIOSEnumeration AttributeName = EmbNic1Nic2 Caption CurrentValue = Enabled DefaultValue Description ElementName FQDD = BIOS.Setup.1-1 InstanceID = BIOS.Setup.1-1:EmbNic1Nic2 IsOrderedList IsReadOnly = FALSE PendingValue PossibleValues = Disabled, Enabled DCIM_BIOSEnumeration AttributeName = EmbNic1 Caption CurrentValue = EnabledPxe DefaultValue Description ElementName FQDD = BIOS.Setup.1-1 InstanceID = BIOS.Setup.1-1:EmbNic1 IsOrderedList IsReadOnly = FALSE PendingValue PossibleValues = Disabled, EnablediScsi, EnabledPxe, Enabled
Of note in the output are the AttributeName and PossibleValues variables. In essence these represent the current and possible states of the BIOS variables and all of them can be modified through the appropriate WSMAN command. The Dell guide I referenced earlier though doesn’t exactly tell you how to do this and the only example that appears to be close is one for modifying the BIOS boot mode setting. However as it turns out this same command can be used to modify any variable that is output by the previous command so long as you create the appropriate XML file. Shown below is the command and XML file to disable the first 2 embedded NICs:
Code: winrm i SetAttribute http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_BIOSService?SystemCreationClassName=DCIM_ComputerSystem +CreationClassName=DCIM_BIOSService +SystemName=DCIM:ComputerSystem+Name=DCIM:BIOSService -u:root -p:calvin -r:https://[iDRACIP]/wsman -SkipCNcheck -SkipCAcheck -encoding:utf-8 -a:basic -file:SetAttribute_BIOS.xml SetAttribute_BIOS.xml: <p:SetAttribute_INPUT xmlns:p="http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_BIOSService"> <p:Target>BIOS.Setup.1-1</p:Target> <p:AttributeName>EmbNic3Nic4</p:AttributeName> <p:AttributeValue>Disabled</p:AttributeValue> </p:SetAttribute_INPUT>
This appears to work quite well for individual attributes but I’ve encountered errors when trying to set more than one BIOS variable at a time. This could easily be due to me fat fingering the input file (I didn’t really check it before troubleshooting it further) but it could also be a limitation of the WSMAN implementation on the Dell servers. Either way once you’ve run that command you’ll notice the response from the server states that the values are pending and the server requires a reboot. Now I’m not 100% sure if you can get away with just rebooting it through the iDRAC or physically rebooting it but there is a WSMAN command which I can guarantee will apply the setting whilst also rebooting the server for you. Again this one relies on an XML file for it to succeed:
Code: winrm i CreateTargetedConfigJob http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_BIOSService?SystemCreationClassName=DCIM_ComputerSystem +CreationClassName=DCIM_BIOSService +SystemName=DCIM:ComputerSystem +Name=DCIM:BIOSService -u:root -p:calvin -r:https://[iDRACIP]/wsman -SkipCNcheck -SkipCAcheck -encoding:utf-8 -a:basic -file:CreateTargetedConfigJob_BIOS.xml CreateTargetedConfigJob_BIOS.xml: <p:CreateTargetedConfigJob_INPUT xmlns:p="http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_BIOSService"> <p:Target>BIOS.Setup.1-1</p:Target> <p:RebootJobType>2</p:RebootJobType> <p:ScheduledStartTime>TIME_NOW</p:ScheduledStartTime> <p:UntilTime>20131111111111</p:UntilTime> </p:CreateTargetedConfigJob_INPUT>
Upon executing this command the server will reboot and then load into the Lifecycle Controller where it will apply the desired settings. After which it will reboot again and you’ll be able to view the settings inside the BIOS proper. It appears that this command can be used for any variable that appears within the initial BIOS enumeration so using this it is quite possible to fully configure the BIOS remotely. You can also access quite a lot of things within the iDRAC itself however I’ve found that RACADM is a much easier way to go about this, especially if you simply dump the entire config, edit it, then reupload it. Still the option is there if you want to use the single tool but unless you’re something of a masochist I wouldn’t recommend doing everything through WSMAN.
All that being said however the WSMAN API appears to cover pretty much everything in the server so if you need to do something remotely to it (hardware wise) and you don’t have the luxury of a PXE or installed operating system than its definitely something to look into. Hopefully the above commands will get you started and then the rest of the Dell integration guide will make a little more sense. If you’ve got any questions about a particular command hit me up in the comments, on Twitter or on my Facebook fan page and I’ll help you out as much as I can.