No matter what you do, you’ve got to have a bit of pride in what you’re doing. I’d love to tell everyone that my sense of pride in my work comes from my long line of successful projects, which I will admit do give me a warm and fuzzy feeling, but more and more I think it comes down to this: give me any IT system known to man, be it a personal computer or corporate infrastructure, and I guarantee I’ll find a problem that no one has ever seen before and that no one will even try to explain.
This came up recently with the blade implementation I mentioned a while ago. Everything had been going great, with our whole environment able to run comfortably on a single blade. Whilst I was migrating everything across, something happened that knocked one of our 2 blades offline. No worries, I thought to myself; I had enabled HA on the farm, so all the virtual machines would magically reappear. Not 2 minutes later our other blade server dropped off the network as well, taking all the (non-production, thank heavens) servers offline. After spending a lot of time getting this up and running, I was more than a little irked that it had developed a problem like this, but I endeavoured to find the cause.
That was about 2 weeks ago, and I thought I had nipped it in the bud once I had found the machines responsible and modified their configuration so they’d behave. Then I was reconfiguring some network properties on one machine when I suddenly lost connection again. Knowing that this could happen, I had made sure to move most of the servers off before attempting this, so we didn’t lose our entire environment this time around. What troubled me, however, wasn’t the blade dropping off the network; it was how I managed to trigger it (a bit of shop talk follows).
VMware’s hypervisor is supposed to abstract the physical hardware away from the guest operating system so that you can easily divvy it up and get more use out of a server. As such, it’s pretty rare for a change made within a guest to affect the physical hardware. However, when I changed one network adapter within a guest from a static address (it was on a different subnet prior to migration) to DHCP, I completely lost network connectivity to both the guest and the host. It seems that some funny combination of VMware, HP Blades and the Windows TCP/IP stack means that when you do what I did, the network stack on the VMware host gets corrupted (I’ve confirmed it’s not the VirtualConnect module or anything else, since I had virtual machines running perfectly well in the same chassis on a different blade).
I’ve struggled with similar things on my own personal computer for years. My current machine suffers from random BSODs that I’m sure are due to the motherboard, which is unfortunately the only component I can’t easily replace. Every phone I’ve had over the past 3 years has suffered from one problem or another that would render it useless for extended periods of time. I’ve come to the conclusion that because I’m supposed to be an expert with technology, I will inherently get the worst problems.
It’s not all bad though. With problems like these comes experience. Just as with my initial projects, which ultimately failed to deliver (granted, one of those was a project at University and the other was woefully under-resourced), I learnt what can go wrong and where, and had to develop troubleshooting skills to cope with it. I don’t think I’d know nearly as much about technology today if I hadn’t had so many things break on me. It was this quote that summed it up so well for me:
I’ve missed more than 9,000 shots in my career. I’ve lost almost 300 games. 26 times I’ve been trusted to take the game winning shot and missed. I’ve failed over and over and over again in my life and that is why I succeed.
That quote was from Michael Jordan. A man who is constantly associated with success attributes it to his failures, something which I can attest to. It also speaks to the engineer in me: with any engineering project, the first implementation should never be the one delivered, because revising each implementation lets you learn where you made mistakes and correct them. There’s only so much you can learn from getting it right.
This still doesn’t stop me from wanting to thrash my computer for its dissent against me, however 🙂