
Comparing hardware options when refreshing a virtualization platform

Earlier this year I found our organization in a presumably not uncommon situation: we have a VMware ESX 3.5[i] environment consisting of three clusters (approximately 15 hosts and 200 guests), and the hardware is end of life. We're far behind in preparing hardware and software platform roadmaps, but we needed to determine a way to refresh the hardware while providing significant expansion opportunities with the eventual introduction of vSphere in the next six months or so.

To start, here were some of our driving factors and considerations:

  • While we have multiple hardware manufacturers today, we now have standards to which we should adhere. We don't want to maintain more makes or models than is absolutely necessary.
  • Our current environment is over-subscribed (peak avg utilization per farm is about 60% CPU and memory) with our expected N+N fault tolerance
  • We're in the middle of a network upgrade which will take our core through access layers to 10GbE.
  • We have invested in blade infrastructure that remains viable and exudes operational efficiency
  • We operate at a 300:1 logical server to admin ratio and believe it can be much higher

I set out to answer three things:

  1. What capacity do we demand today, blending average clusters and guests?
  2. What hard or soft limits or best-practices should be respected in engineering a replacement?
  3. Given our needs and possible hardware spend to satisfy a refresh, what is the average cost of a virtual machine? Cost must include infrastructure, server hardware, hypervisor, storage, and ancillary virtualization support products such as monitoring.

Answering the first question was relatively easy: I built a quick custom performance report in MOM that blended average and peak CPU and memory utilization of our existing VMs.

I realize this alone would be an oversimplification of virtualization candidacy or demand; however, blending over the variety of guests we have and being comfortable with both our network and storage subsystems, I'm not particularly concerned with IO. Also, I am similarly confident that our high performance and demand systems are not yet virtual. The vast majority of our systems fit the mold of a single vCPU and 512-1024MB memory. I would never advocate using this methodology for the virtualization of specific workloads, but to maintain our existing systems, this will be adequate.

Answering #2 was out of our depth, so we called on VMware and our local engineer to fill in the gaps. The two primary soft restrictions we placed on our calculations were to limit each physical host to 50% CPU at peak and no more than 30 guests per host. Admittedly, these numbers should be revised, but for the sake of time, I'll allow you, the trusted reader, to adjust to what is currently supported or recommended.
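As a rough sketch of how those two soft limits translate into a host count, assuming peak CPU demand and host capacity are both expressed in GHz: the function name, parameters, and sample figures below are hypothetical and are not taken from the spreadsheet. It takes whichever limit demands more hosts and ignores memory and fault-tolerance headroom.

```powershell
# Hypothetical helper: hosts needed under the 50% peak CPU and 30 guests/host soft limits.
function Get-RequiredHostCount {
    param(
        [int]$GuestCount,                  # guests to be hosted
        [double]$PeakCpuDemandGHz,         # aggregate peak CPU demand of all guests (assumed unit)
        [double]$HostCpuCapacityGHz,       # usable CPU capacity of one host (assumed unit)
        [double]$MaxCpuUtilization = 0.5,  # soft limit: 50% CPU at peak
        [int]$MaxGuestsPerHost = 30        # soft limit: 30 guests per host
    )
    # Hosts needed to respect the guests-per-host limit
    $byGuests = [math]::Ceiling([double]$GuestCount / $MaxGuestsPerHost)
    # Hosts needed to keep each host at or below the CPU limit at peak
    $byCpu = [math]::Ceiling($PeakCpuDemandGHz / ($HostCpuCapacityGHz * $MaxCpuUtilization))
    # The tighter constraint wins
    [int][math]::Max($byGuests, $byCpu)
}

# Illustrative figures only: 200 guests, 180 GHz aggregate peak demand, 20 GHz per host
Get-RequiredHostCount -GuestCount 200 -PeakCpuDemandGHz 180 -HostCpuCapacityGHz 20
```

With these made-up numbers the CPU limit is the binding one; lowering the demand figure makes the guests-per-host limit take over. Any spare hosts for fault tolerance would be added on top.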

The final question can be assisted using this spreadsheet as a starting point. It was built for our environment, but modifying a few sheets or formulas should allow it to be tailored to anyone's environment, hardware configuration & prices, and compute capacity limits. System specifics were stripped and other elements made a bit more generic. Some quick reference notes:

  • Blade RU were calculated as bladeRU=(Chassis RU/slots) * (slots per blade)
  • Chassis column represents the per-slot cost of chassis infrastructure: calculate the total cost of a chassis with no servers, divide it by the number of slots in the chassis, and multiply that by the number of slots the particular model consumes.
  • Qty column is free to help you gauge total capacity achieved and potential waste given soft/hard limits
  • Current VM usage sheet identifies model farm configurations and utilization
  • For whatever reason, the stoplight conditional formatting never saves properly.... I'm no Excel guru
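The blade RU and chassis column calculations above can be sketched in a few lines. The first two functions follow the notes directly. The per-VM cost function is an assumption about how the pieces combine (it lumps hypervisor, storage, and monitoring into a single per-host figure) and is not the spreadsheet's actual formula. All names and sample prices are hypothetical.

```powershell
# bladeRU = (chassis RU / slots) * (slots per blade)
function Get-BladeRackUnits {
    param([double]$ChassisRU, [int]$ChassisSlots, [int]$SlotsPerBlade)
    ($ChassisRU / $ChassisSlots) * $SlotsPerBlade
}

# Chassis column: cost of an empty chassis spread per slot,
# multiplied by the number of slots the blade model occupies
function Get-ChassisCostShare {
    param([double]$EmptyChassisCost, [int]$ChassisSlots, [int]$SlotsPerBlade)
    ($EmptyChassisCost / $ChassisSlots) * $SlotsPerBlade
}

# Assumed roll-up: everything attributable to one host divided by its guest count
function Get-CostPerVM {
    param(
        [double]$ServerCost,        # blade or server hardware
        [double]$ChassisCostShare,  # from Get-ChassisCostShare
        [double]$OtherCostPerHost,  # hypervisor, storage, monitoring (assumed lump sum)
        [int]$GuestsPerHost
    )
    ($ServerCost + $ChassisCostShare + $OtherCostPerHost) / $GuestsPerHost
}

# Illustrative figures only: 10U chassis, 16 slots, single-slot blade
$ru    = Get-BladeRackUnits -ChassisRU 10 -ChassisSlots 16 -SlotsPerBlade 1
$share = Get-ChassisCostShare -EmptyChassisCost 16000 -ChassisSlots 16 -SlotsPerBlade 1
$perVm = Get-CostPerVM -ServerCost 8000 -ChassisCostShare $share -OtherCostPerHost 3000 -GuestsPerHost 30
"{0} RU per blade, {1} chassis share, {2} per VM" -f $ru, $share, $perVm
```

Rack-mount models can reuse the same per-VM function by passing a chassis share of zero.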

The financials included in this document were based on HP's online pricing data many moons ago and should only be used as an illustration. I suggest consulting your select hardware vendors or VARs to price specific configurations.

I have to admit that I was VERY surprised at the results of the exercise (and how frustrated I became with my beloved Excel). I would never have guessed the BL490s with solid state drives and 10GbE to the host would have provided us with the best per-VM cost. Did we go with those? No, but never should price alone dictate direction.

For reasons I'll not divulge in this post, we went a very different direction due to some specific requirements that I did not include above for confidentiality reasons, but that does not negate the results illustrated in the spreadsheet. The document is not an exact blueprint, but if you have some intermediate Excel skills and know your way around virtualization requirements, this may provide you with a launching point to tailor a document that will help you with a similar refresh or new deployment.

Up next: fixing our chargeback model, working to develop a subscription based utility compute offering, self-service, and a monitoring overhaul.

Attachment: VMBestFit.xlsx

Are Symantec AntiVirus and SMS killing WMI and your Microsoft SQL Service Pack installation?

They were killing ours!

Our DBAs spent a good amount of time troubleshooting the installation of Microsoft SQL Service Pack 2 on an instance of our 8-node Windows Server 2003 cluster before escalating to me. While I've encountered this problem with SQL installations in the past, they were not aware of the situation and no Microsoft or Google searches helped... it was particularly disgusting to hear an MVP recommended they try with a domain admin account... as if. Anyway, the situation plays out like this:

The symptoms are consistent. You try to install SQL, or a SQL service pack, on an isolated or clustered system and it hangs... no timeout, error, feedback whatsoever. The SQL installation logs help deduce the problem, but we couldn't find much from the service pack installation that was helpful although the symptoms and resolution were the same. As an independent qualifier, you can also open the Computer Management console and try opening WMI properties. In each case we've had this issue, opening WMI properties either hangs or returns critical errors for some of the classes.

You could simply reboot the system and all would be well; however, if you're running high availability systems, this is not always the best option. I tend to opt for system surgery to help identify, and resolve, root cause. A quick check of the system's running tasks raised some immediate red flags: we had eight instances of wmiprvse.exe, plus two instances each of scanwrapper.exe and SmsWusHandler.exe, neither of which I had encountered previously.

What I have encountered, however, is some inexplicable conflict of operations with SMS and SAV which cause WMI to fail. My first run-in exposed the problem by killing all SAV and SMS processes but still finding handles to each within a running WMI image. In today's case, discerning that the scanwrapper and smswushandler were components of our SMS patching process was logical. The resolution in both cases was as follows:

1. Stop the service SMS Agent Host (ccmexec.exe)
2. Stop the AntiVirus service
2a. Symantec AntiVirus (rtvscan.exe)
2b. Symantec AntiVirus Definition Watcher (defwatch.exe)
2c. Symantec Event Manager (ccEvtMgr)
2d. Symantec Settings Manager (ccSetMgr)
3. Kill the aforementioned processes if they don't stop gracefully
4. Kill any orphaned SMS processes (such as scanwrapper and smswushandler)
5. Kill all instances of wmiprvse.exe (using pskill is convenient)
6. WMI restarts automatically; start the remaining services manually

After that, retry the install/service pack.
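The steps above could be scripted roughly as follows. This is a minimal sketch, not what we actually ran: the function name is hypothetical, the service display names and image names are the ones listed in the steps, and it assumes those display names match what is installed on your system. By default it only prints what it would do; pass -Execute to act.

```powershell
# Service display names and process image names taken from the steps above
$ServiceDisplayNames = @(
    'SMS Agent Host',
    'Symantec AntiVirus',
    'Symantec AntiVirus Definition Watcher',
    'Symantec Event Manager',
    'Symantec Settings Manager'
)
$ProcessNames = @('ccmexec', 'rtvscan', 'defwatch', 'ccEvtMgr', 'ccSetMgr',
                  'scanwrapper', 'SmsWusHandler', 'wmiprvse')

# Hypothetical helper: dry run unless -Execute is supplied
function Reset-WmiProviderHosts {
    param([switch]$Execute)

    # Steps 1-2: ask the SMS and SAV services to stop gracefully
    foreach ($name in $ServiceDisplayNames) {
        if ($Execute) { Stop-Service -DisplayName $name -Force -ErrorAction SilentlyContinue }
        else { "Would stop service: $name" }
    }
    # Steps 3-5: kill anything left over, orphaned SMS processes, and all wmiprvse instances
    foreach ($proc in $ProcessNames) {
        if ($Execute) { Stop-Process -Name $proc -Force -ErrorAction SilentlyContinue }
        else { "Would kill process: $proc" }
    }
    # Step 6: WMI provider hosts come back on demand; restart the services we stopped
    foreach ($name in $ServiceDisplayNames) {
        if ($Execute) { Start-Service -DisplayName $name -ErrorAction SilentlyContinue }
        else { "Would start service: $name" }
    }
}

# Dry run: list the planned actions without touching the system
Reset-WmiProviderHosts
```

The dry-run default is deliberate: on a clustered production node you want to read the plan before anything gets killed.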

Validating clustered disk signatures with PowerShell

I did not intend on blogging about technical operations material, but since the few posts I've been polishing have morphed into something resembling more a whitepaper than blog/article, I wanted to get something fresh online.


If you're fortunate enough to have a good sized MSCS cluster hosting the likes of SQL or Exchange and have the added benefit of high change rates on your clustered disks (read: disk swaps, adds, expansions), you've likely encountered some of the quirkiness provided by the systems' handling of disk signatures. Try as we might, there has been limited success in producing consistent results without requiring time consuming system restarts which may result in momentary service outages due to failover.

The symptoms experienced are simple enough: a clustered group won't fail over to node X because a disk cannot be brought online. The disk is visible in Disk Manager, but will not come online on one or many nodes. In cases where the disk in question is assigned a drive letter, we can more easily troubleshoot the situation because the disk can be moved from cluster group to cluster group, but in the overwhelming majority of cases we've encountered, it's a mount point.

When it is a vital disk to a highly available service, some additional assurances that failover will function is more appreciated than waiting for the next opportunity to force a failover or facing an unplanned failure. The cluster logs do a good job of helping you identify the nature of the problem, but there are only a few searches that even Google can return with appropriate steps to resolve.

So, what is one to do? Seek out your resolution in the registry, of course! In our situation, the root cause can be attributed to mishandling of disk signatures by the ClusDisk service, which is controlled by Cluster services. I failed to find much that covers the relationship between the services and the registry keys, but from what I can tell, ClusDisk contains a key of known disk signatures that are associated with disks that are configured for cluster management and are available to the node you are viewing. The key (named for the signature) contains a default value of the path to the disk as would be defined by the logical disk manager. For some strange reason, we have disks that are configured properly (according to cluadmin) in that they are available to all nodes, but they show up not in the Signatures key of ClusDisk, but in the AvailableDisks key. This has happened particularly often when swapping disks (drive or mount point) with a new, larger LUN.

So what to do? Well, first, we found this MSKB article http://support.microsoft.com/kb/932465 that covers our situation perfectly; however, the Method 1 solution is specific to disks larger than 2 TB, which ours are not. Also, we have confirmed this situation presents itself on Windows Server 2003 Enterprise 64-bit, while the KB only mentions x86. So we pressed on and found that the solution works on x64 as well, although we have not tested the hotfix.

So now that we have a fix for when we encounter a single-disk issue, how are we to go about preventing the problem from manifesting itself? On one of our clusters we have 57 shared disks between seven nodes. Manually checking each of these would be a major pain and time killer. So here's where I find a good opportunity to start learning PowerShell.

The code below is my first PowerShell script. The idea is simple

  1. Attach to a cluster to identify all nodes
  2. Obtain a list of all configured disk signatures for all nodes
  3. Check each node against the deduplicated/normalized list of signatures
  4. Report any discrepancies

The only prerequisites for execution of this script are as follows:

  1. Execute script with a single argument: the name of the cluster
  2. You must execute the script/PowerShell under a context with administrative access to the cluster.
  3. All cluster nodes must have access to all clustered disks (I recommend scanning the bus on each node prior to running this)
  4. All cluster nodes must be configured as possible owners for each disk
  5. You must have PowerShell installed on the system running the script (PowerShell does not need to be on the cluster)

That's all there is to it. Nothing fancy yet. I may go back and rewrite this as a function that returns a container of system/missing signature pairs, but for now this is all I need. We'll likely perform any remediation manually.



# Get cluster to review from command-line arguments
$Cluster = $args[0]

# Declare variables and paths to the necessary registry keys
$key  = "SYSTEM\CurrentControlSet\Services\ClusDisk\Parameters\Signatures"
$key2 = "SYSTEM\CurrentControlSet\Services\ClusDisk\Parameters\AvailableDisks"
$diskset    = @()
$AvailDisks = @()
$missing    = @()

# Attach to cluster to retrieve node list
$Nodes = Get-WmiObject -Namespace root\mscluster -ComputerName $Cluster -Class MSCluster_Node

# Gather aggregate list of all clustered disks from all nodes
# Also identify any unassigned disks per node
foreach ($node in $Nodes) {
    $hklm    = [Microsoft.Win32.RegistryKey]::OpenRemoteBaseKey('LocalMachine', $node.Name)
    $regKey  = $hklm.OpenSubKey($key)
    $regKey2 = $hklm.OpenSubKey($key2)
    # OpenSubKey returns $null when a key does not exist, so guard both lookups
    if ($regKey) {
        foreach ($sub in $regKey.GetSubKeyNames()) {
            if ($diskset -notcontains $sub) { $diskset += $sub }
        }
    }
    if ($regKey2) {
        foreach ($sub2 in $regKey2.GetSubKeyNames()) {
            $AvailDisks += ($node.Name + " has disk " + $sub2 + " available")
        }
    }
}

# Report number of unique clustered disks
Write-Host $diskset.Length disks assigned on cluster $Cluster

# Check to see if each node identifies the signature of each clustered disk
# Report any missing disk
foreach ($node in $Nodes) {
    $hklm   = [Microsoft.Win32.RegistryKey]::OpenRemoteBaseKey('LocalMachine', $node.Name)
    $regKey = $hklm.OpenSubKey($key)
    $checkset = @()
    if ($regKey) {
        foreach ($sub in $regKey.GetSubKeyNames()) { $checkset += $sub }
    }
    foreach ($sub in $diskset) {
        if ($checkset -notcontains $sub) {
            Write-Host $node.Name is missing disk $sub
            $missing += ($node.Name, $sub)
        }
    }
}

# Report available (not clustered) disk signatures
$AvailDisks
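
The function rewrite mentioned earlier might look something like the sketch below. It isolates only the comparison step: given a hashtable of node name to signature list (however you collected it), it returns one object per node/missing-signature pair. The function name, the input shape, and the sample signatures are all hypothetical.

```powershell
# Hypothetical helper: returns Node/Signature pairs for every signature a node lacks
function Get-MissingSignature {
    param([hashtable]$SignaturesByNode)

    # Build the deduplicated list of every signature seen on any node
    $all = @()
    foreach ($node in $SignaturesByNode.Keys) {
        foreach ($sig in $SignaturesByNode[$node]) {
            if ($all -notcontains $sig) { $all += $sig }
        }
    }

    # Emit one object per (node, missing signature) pair, nodes in sorted order
    foreach ($node in ($SignaturesByNode.Keys | Sort-Object)) {
        foreach ($sig in $all) {
            if ($SignaturesByNode[$node] -notcontains $sig) {
                New-Object PSObject -Property @{ Node = $node; Signature = $sig }
            }
        }
    }
}

# Made-up sample data: NODE2 has not registered the second signature
$sample = @{
    'NODE1' = @('1A2B3C4D', '5E6F7A8B')
    'NODE2' = @('1A2B3C4D')
}
Get-MissingSignature $sample
```

Returning objects instead of writing to the host means the result can be piped to Where-Object, Export-Csv, or whatever remediation step comes next.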


My next self-inspired PowerShell challenge will be to cycle through a cluster reboot with each group incurring only one failover. We've needed this to speed up system updates and reduce both the duration and the number of outages.

Email can be a significant distraction


Where would we be today without email? An impossible question to answer. What would we do without it, however, is not too difficult to imagine. I, for one, would be happy to get rid of it.

With the rise of social networking and unified instant messaging, I find email to be little more than a distraction. I receive, on average, 240 emails per day in my Inbox. Those messages are sent directly to me from individuals; automated messages from monitoring systems, distribution lists, and the like are redirected to alternate locations. With each message received, the little new message balloon pops to exploit attention deficit. I made the mistake, so to say, of creating rules to pop an alert window when I receive a message from a VIP (typically senior or executive management). Each of these distractions may pull me away from my work. If I estimate an average of 45 seconds to process (read, scan thread for context, reply if necessary) each message, that sucks up three hours of my day just for email! I barely have 90 minutes a day not booked for meetings, so this does not jibe with my desire to maintain some quality of life outside of the workplace. Granted, I do face a situation where staffing numbers are disproportional to demand, but c'est la vie.

Possibly more of a pet peeve turned distraction is my distaste for how many use their mobile devices. These gadgets are great for when you're not able to be in the office, but sitting around a conference room table during a meeting is NOT the time to be checking your email. If you're checking your email, you're not invested in the meeting. Excuse yourself. Worse yet is the notification of new email received. Some go as far as to have a ringer or bell for each message they receive as though they need something to advertise their mail volume. Some are kind enough to hush their phone to vibrate... but place it on the hard table where it dances and shakes with a whir like an electric razor with each message.

I'm sure some of my few readers will contend, "but I am often awaiting an important message and I need to read it when it gets here." Oh really? It's SO important that you're relying on email and your wireless device and requisite service? I sincerely doubt it.

To help solidify my thought on the subject, I've taken to checking my work email only during scheduled times or when prompted to review something that has been escalated. For example, I check once in the morning, after lunch, and before I head home. I now spend maybe 45min processing email. I've found that most messages to which I feel compelled to reply don't require my involvement after they sit. Also, those around me have become quickly trained to the fact that if there is a legitimate emergency, they have my desk and mobile numbers and I'll be glad to take their issue over the phone.

This paradigm that everyone is always connected to email, for individuals as opposed to roles like a support mailbox, is flawed. Email rarely contains information required to cause action and produce results. Email can't act as a work queue because it does a poor job of understanding context. Granted, solutions can be built on email platforms - say, a ticketing system built on Exchange - but that is not the norm. The "but I sent you an email" excuse is tired and carries no water.

Assuming those on my IM contact list respect my active status, instant messaging is far more convenient than email for a quick response at either party's convenience. I rely on ticketing systems for any work requests due to their ability to prioritize, escalate, report, provide workflow, and correlate like items.

If you're sending me a document to read at my leisure, a general thanks for the work my teams have completed, or contact info... email is great. If you need something from me, anything other than picking up the phone has a chance to leave you sorely disappointed if you have preset expectations.

On the other side of that coin, I do see some tremendous value in email for automated alerting (although MMS is arguably more reliable), for operations and support related functions, and I'm sure there are more that I haven't considered. That's fine for them, but in my role email should be relegated to little more than material review.

Platform Commodification

Internalizing IT as a service-based profit center as opposed to a traditional cost center better positions an organization to remain nimble as compute requirements shift. The abstraction of resources, including both capital investment and operational expense, by virtue of a utility subscription model based on demand not only serves to better meet customer expectations, but to remove two traditional, and inappropriate, paradigms: that projects/programs or business units own specific equipment, and that projects/programs or business units are confined only to specific equipment.

The culture shift from an entity that owns, and in turn is restricted to, specific equipment to one that subscribes to compute capacity offers data services opportunities to stretch dollars and improve efficiencies while providing additional visibility into environmental total cost of ownership. It does not, however, happen overnight.

A well developed commodity based service subscription framework is both easily maintained and accompanied by a level of fiscal transparency that is unrivaled. Successful implementation provides executive leadership with greater understanding of the "cost of doing business" as opposed to the "cost of IT" without cyclical sticker shock that is often associated with environmental refresh.

  1. Select appropriate hardware, software, and other platform refresh cycles
  2. Negotiate capital and maintenance cost with vendors
  3. Understand your SLAs and OLAs
  4. Include cost for business continuity requirements
  5. Identify dependencies and their cost (including maintenance and operations). Footprint (real estate), power, and cooling are significant considerations per platform. From real estate up to the application layers of the service offering, establish a unit cost
  6. Monitoring costs should be included, based on SLA
  7. Make everything a la carte. The more options, the more opportunity
  8. Don't over-estimate your base capacity surplus. Establish, or grow into, the environments first. Do not add services to the menu if you risk depleting your surplus before recovering enough funds to expand capacity.
  9. Know your gaps. Publish them, know your enemy. Overcommitting is too easy and risky. Don't mortgage your business.
  10. Most importantly: surround yourself with people who "get it" and those that can "do it." Without a mix of both, and some that dip into each quality, you'll be so caught up in this particular discipline that you'll lose track of your other responsibilities.
  11. Lastly, no need to gold plate. Rarely do high-cost benefits provide enough return to justify the expense. Stick with the requirements and don't concern yourself with glitter.
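
To make the "establish unit cost" step concrete, here is one simple way it might be worked out, assuming straight-line amortization of capital over the refresh cycle and an even spread across subscribed units. The function name, cost categories, and dollar figures are hypothetical; your finance group may well prefer a different model.

```powershell
# Hypothetical helper: monthly cost per subscribed unit of capacity
function Get-MonthlyUnitCost {
    param(
        [double]$CapitalCost,        # negotiated hardware/software purchase price
        [int]$RefreshCycleYears,     # selected refresh cycle
        [double]$AnnualMaintenance,  # vendor maintenance and operations
        [double]$AnnualFacilities,   # footprint, power, cooling
        [double]$AnnualMonitoring,   # monitoring sized to the SLA
        [double]$AnnualContinuity,   # business continuity requirements
        [int]$Units                  # subscribed units (e.g. VMs) sharing the platform
    )
    # Straight-line amortization of capital over the refresh cycle (assumption)
    $annualCapital = $CapitalCost / $RefreshCycleYears
    $annualTotal = $annualCapital + $AnnualMaintenance + $AnnualFacilities +
                   $AnnualMonitoring + $AnnualContinuity
    # Spread evenly over twelve months and all subscribed units
    ($annualTotal / 12) / $Units
}

# Illustrative figures only
Get-MonthlyUnitCost -CapitalCost 360000 -RefreshCycleYears 3 -AnnualMaintenance 36000 `
    -AnnualFacilities 24000 -AnnualMonitoring 12000 -AnnualContinuity 48000 -Units 200
```

Dividing by subscribed units rather than total capacity is a choice: it recovers cost faster but makes the price sensitive to how full the platform is, which ties back to the warning about over-estimating surplus.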

Seeding a cloud, as I like to put it, is not only fulfilling personally, but something that will shave hundreds of thousands of dollars (for most large companies) off expense budgets within the first year. The right combination of hardware, software, storage, and network virtualization can provide you with agility that the business neither expected nor anticipated.

IT Governance

While some argue it is nothing more than stifling control, I am a firm believer in the importance of IT Governance in financially responsible organizations. Transparency vs cost-effectiveness, while often a difficult balance, can be as key to a lean IT organization's success as appropriate change management controls. Lack of familiarity with financial modeling, basic accounting, complete total cost of ownership (TCO) estimation, and accountability helps contribute to reluctant acceptance of such practices by those who have "grown up in IT" without the benefit of exposure to demand for measured return on investment.

This is not to say that all organizations require the same level, or depth, of governance, but any looking to treat IT as anything greater than a sunk cost and use IT to grow, improve, or transform their business should employ some formal methodology. Numerous frameworks exist to help each organization best adopt a methodology that suits their business needs, but they all share significant common threading.

For many organizations, maturation of an adopted methodology can take enough time to render the significance of framework nuances moot due to changes within the business or simple IT evolution. The core elements remain the same. I believe governance exposes IT value when the following gates are drawn:

  1. Value statement & functional requirements: if we do X by accomplishing a,b,c, we will make/save Y
  2. Non-functional Requirements: standards, SLA, OLA, etc adherence
  3. Integration/architecture, business case (cost/benefit), plan: how it works in our environment; what it will cost to integrate and satisfy all requirements; TCO/ROI; project plan
  4. Review: accountability, did it accomplish 1 & 2 and without deviation from 3?

Additional gates to accommodate standardization processes, exceptions, and general lifecycle review help keep these aforementioned gates fresh.

Businesses with which I have had the luxury to work, and that have remained successful through myriad economic dynamics, even 100+ years, have a history of investing carefully. While the IT landscape is constantly changing, the same general discipline of well calculated and informed risk analysis serves to avoid significant waste, prioritize corporate spend, and maintain organizational accountability.

Don't touch that url! Why, you ask?

It's difficult to summarize the intent of this site without ending up with a post the length of an encyclopedia volume. I suppose I'll start with an explanation of what I walked into.

I started working for my current employer in August '06 as a Systems Administrator focused on the Microsoft Windows platform (Wintel Team). The position is responsible for the following, in order of importance:

  1. Ensure availability of deployed IT services
  2. Provide execution of change management requests
  3. Design and implementation of new IT services and architecture

My first few weeks were spent revising monitoring standards and reconstructing a formal engagement methodology for new service deployment. One month after coming on board I was promoted to Team Lead. The team name was changed to the Microsoft and Virtualization Platforms (MVP) Team to better reflect our service offering. I added a preliminary server stack intending to draft an abstraction of layers that help define lines of administrative delineation within the organization. Here is a copy of the original (yes, there's a duplicate vertical band):


The MVP Team is generally responsible for the architecture selection, procurement, provisioning, and support of hardware, virtualization (VMware being the only approved standard), operating systems (Windows 2003 Server, VMware), and OS services. This has since evolved, but we'll cover that later.

My first priority to further team development was to identify operational pain points. Monitoring was already being addressed. We quickly began work on further enriching our engagement methodology, improving customer turnaround time and quality, consolidation in the data center, and expansion of our virtualization environment.

In the course of the subsequent 10 months, the team has made tremendous strides. There is strong focus on normalizing our operations and introducing turnkey automation where appropriate. We have dropped system procurement and deployment time down from as high as 3 weeks to as low as 1 hour for both physical and virtual systems. The rate of false-positive alerts received by our on-call has declined dramatically. We freed over 27% of our data center rack space. Our virtualization environment has grown from four physical systems to nearly 20 hosts running just under 150 guests. We standardized on blade servers (HP primarily, with some IBM) which has helped cut down our deployment and general operational time.

A little over a month ago I moved into the position of IT Manager over the MVP, UNIX, and iSeries teams. The majority of the time since this shift has been spent on 07Q4 budget drain, high profile operational issues, business continuity planning, and a long list of general operations refinement.

I'm fortunate to be surrounded by some exceptional people. My technical teams have some of the most qualified individuals I've met during my career. Additionally, the management team assembled at our facility is phenomenal. My only wish is that everyone would be less willing to sacrifice their personal life for work. We have a lot of people that have missed a lot of their family prior to the holidays. That has to end or we'll risk losing people.

So where does the site come in? Well, I want to document some of what we're doing. Obviously I can't get into the specifics, but the general what, how, and results should prove interesting. We have a lot of things to improve upon and sharing some of the ways we go about it could result in some valuable feedback.