Do you see things a little differently?

Tuesday, December 14, 2010

10GB, ESXi, and Strange behavior

I was sitting at my desk, minding my own business, when one of the Technical Systems IT folk pointed out that their server wasn't responding. Imagine my surprise when the vCenter console showed the guest OS as disconnected and the host it was associated with as being in Standby mode. Something was clearly not right. I went to the data center and verified the host had power. I noticed that the 10Gb NIC did not show any lights, and the Nexus 5k did not show a connection. Hmmmm. I went back to my desk, used the Remote Access Card to get to the ESXi console, and took a look at some logs.

At this point, I needed to get what turned out to be three guest servers running again. The easiest way to do that was to reboot the host. Sadly, ESXi keeps all of its logs in memory, so a reboot wipes them. To prevent losing logs like this in the future, I downloaded Kiwi Syslog (http://www.kiwisyslog.com/) from SolarWinds. The free version should be okay for now, but I may go back and get the commercial version to add things like log rotation. In the vSphere Client, go to the host's Configuration tab and select the Advanced Settings link. You should see a Syslog item in the window that pops up. Just enter the name of the server hosting the logger, and off it goes.
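Before pointing a host at the new logger, it can be handy to confirm the logger is actually listening. The following is only a rough sketch: it assumes the syslog server accepts plain BSD-style syslog over UDP port 514 (the usual default), and the host name "loghost.example.com" and the tag "esxi-test" are made up for illustration.

```python
import socket


def build_syslog_message(facility: int, severity: int, tag: str, text: str) -> bytes:
    """Build a minimal BSD-style syslog datagram (priority, tag, text).

    The priority value is facility * 8 + severity; the timestamp and
    host name are left out, so the receiver fills them in itself.
    """
    pri = facility * 8 + severity
    return f"<{pri}>{tag}: {text}".encode("ascii", "replace")


def send_test_message(host: str, port: int = 514, text: str = "syslog test") -> None:
    """Send one test message as local0.info over UDP (assumed transport)."""
    msg = build_syslog_message(16, 6, "esxi-test", text)  # 16 = local0, 6 = info
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(msg, (host, port))


# Hypothetical usage; replace the host name with your own log server:
# send_test_message("loghost.example.com")
```

If the test line shows up in the logger's display, the path from the network to the logger is good, and anything missing later is more likely a host-side setting.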
I opened a case with VMware support and told them what had happened. Thankfully, I had the screenshot above. The errors listed point to an issue with the Broadcom drivers. Following is the response from VMware support.
_______________________
 Your Broadcom or bnx2x driver crashed / dumped under load. It's a physical NIC issue typically caused by a dated driver. Please verify you are on the latest driver. You can do this by running the following command and comparing it to the VMware HCL and downloading the correct driver.

Command:
1) esxcfg-nics -l  --> look for the card with the bnx2x driver, might be all or a few it's ok, just pick one all the same driver
2) once you have found It will have a vmnic# (ex vmnic6)
3) run ethtool -i vmnic# --> this will tell you the driver version. Now go to VMware HCL (see below) and select "IO devices" put in the model card based on the output of the esxcfg-nics -l command (example bcm500 or 1000base-t)
http://www.vmware.com/resources/compatibility/search.php
4) if the driver says "asynic" it's a newer driver, make sure you only download the driver for your build (ex esx 4.0 or 4.0u1 and so on)
_______________________

Okay, now I may have a problem. The driver bundle I'm using is "BCM-bnx2x-1.52.12.v40.3-offline_bundle-223054", which, if I'm reading the chart correctly, is the latest compatible version: ESX / ESXi 4.0 U2, bnx2x version 1.52.12.v40.3.
Time to verify that what I installed is what the system has. Since I'm running ESXi, I need to use the vSphere CLI from another system or, if I had vMA installed, I could use that. The command from the remote system looks like this: "esxcfg-nics.pl --server x.x.x.x --username root --password rootpassword -l". The server, user, and password have been changed to protect the innocent. Following is what was returned.
Name    PCI      Driver  Link  Speed      Duplex  MAC Address        MTU   Description
vmnic0  05:00.0  bnx2    Up    1000Mbps   Full    00:18:8b:4f:1a:79  1500  Intel Corporation Broadcom NetXtreme II BCM5708 1000Base-T
vmnic1  09:00.0  bnx2    Up    1000Mbps   Full    00:18:8b:4f:1a:7b        Intel Corporation Broadcom NetXtreme II BCM5708 1000Base-T
vmnic4  0c:00.0  bnx2x   Up    10000Mbps  Full    00:10:18:65:91:00  1500  Intel Corporation Broadcom 57711
vmnic5  0c:00.1  bnx2x   Up    10000Mbps  Full    00:10:18:65:91:02        Intel Corporation Broadcom 57711
It would seem that vmnic4 and vmnic5 are the ones I need. Of course, I could have gotten this information from vCenter. Next I need to see what driver version is being used. Hmmm. I wonder how I'm supposed to run "ethtool" when this is ESXi? I don't see it in the vSphere CLI Perl scripts. Okay, sending off the question to support. Let's see what they recommend.
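With only four NICs it is easy to pick out the bnx2x cards by eye, but across several hosts it could be scripted. Here is a minimal sketch that scans the text returned by "esxcfg-nics -l" and collects the vmnic names for a given driver. It assumes the column order shown above (name, PCI address, driver); the function name is my own, and the sample text is the listing from this post, wrapped lines included.

```python
def nics_using_driver(listing: str, driver: str) -> list[str]:
    """Return the vmnic names whose Driver column equals `driver`."""
    names = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a vmnic name, then PCI address, then driver.
        # Header rows and wrapped description lines are skipped.
        if len(fields) >= 3 and fields[0].startswith("vmnic") and fields[2] == driver:
            names.append(fields[0])
    return names


# The listing from this post, wrapped the way the terminal printed it.
SAMPLE = """\
Name    PCI     Driver     Link Speed    Duplex MAC Address        MTU    Descri
ption
vmnic0  05:00.0 bnx2       Up   1000Mbps Full   00:18:8b:4f:1a:79  1500   Intel
Corporation Broadcom NetXtreme II BCM5708 1000Base-T
vmnic1  09:00.0 bnx2       Up   1000Mbps Full   00:18:8b:4f:1a:7b         Intel
Corporation Broadcom NetXtreme II BCM5708 1000Base-T
vmnic4  0c:00.0 bnx2x      Up   10000Mbps Full   00:10:18:65:91:00  1500   Intel
 Corporation Broadcom 57711
vmnic5  0c:00.1 bnx2x      Up   10000Mbps Full   00:10:18:65:91:02         Intel
 Corporation Broadcom 57711
"""

print(nics_using_driver(SAMPLE, "bnx2x"))
```

The match is on the exact driver name, so "bnx2" does not accidentally pick up the "bnx2x" cards.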

Before I sign off, I wanted to mention that I am seeing over 4500 entries in the logger for 4 ESXi hosts. If you don't have log rotation, you had better have some free drive space.


Small update: The way to attempt to run the requested commands is to enable SSH and run the ESXi instance in "unsupported" mode.
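Once "ethtool -i vmnic#" can be run, the check support asked for comes down to comparing the reported version with the one in the bundle name. The sketch below assumes the usual "key: value" lines that ethtool -i prints (driver, version, and so on); I have not captured real output from this host, so the function names are my own and any sample input is illustrative only.

```python
EXPECTED_VERSION = "1.52.12.v40.3"  # taken from the driver bundle name in this post


def parse_ethtool_info(output: str) -> dict[str, str]:
    """Turn 'key: value' lines into a dict, splitting on the first colon only."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that have no colon at all
            info[key.strip()] = value.strip()
    return info


def driver_matches(output: str, driver: str = "bnx2x",
                   version: str = EXPECTED_VERSION) -> bool:
    """True when the output reports the expected driver name and version."""
    info = parse_ethtool_info(output)
    return info.get("driver") == driver and info.get("version") == version
```

Splitting on the first colon only matters for fields like bus-info, whose value contains colons of its own.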