Do you see things a little differently?

Thursday, October 18, 2012

When VMware Snapshots go bad....

It started with a warning indicator in vSphere. When I investigated, I found an error indicating that snapshots were too deeply nested. This seemed odd to me because no one should be taking snapshots on these systems. I opened the snapshot manager and found at least 32 nested snapshots with the name "Consolidate Helper - 0". Some quick research told me that these are the result of a VM backup gone awry. The first step to addressing this was to select "Delete All" from the Snapshot manager window. VMware told me that it was deleting the snapshots, and I figured all was well. Unfortunately, all was not well. The snapshots no longer appeared in the manager window, but the snapshot files were still there. I was alerted to this problem by a message on the summary tab.
Another Google search and a VMware KB article told me that the log files still required consolidation and that, to accomplish this, I could right-click on the server, go into the snapshot manager, and choose "Consolidate".
This would have been great, except that when the process finished I received an error.
Clearly some days you're the fire hydrant.
This was not going well. I enabled the shell and SSH so I could run some debug commands on the host. Per the VMware KB, I can run the command
vmkfstools -D /vmfs/volumes///
to determine which files are locked, and by what. I already had a good idea because of some lines in the vmware.log file.

DISK: Failed to open disk for consolidate '/vmfs/volumes/4c5880d7-5dcd2226-7980-0024e8679cae/MYHOST01/MYHOST01_1-000066.vmdk
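If the vmware.log is long, a few lines of Python can pull out every disk named in a line like this one. This is only a rough sketch: the pattern is based on the single log line quoted above, the function name is my own, and other ESXi versions may word the message differently.

```python
import re

# Pattern derived from the one log line quoted in this post (an assumption,
# not a documented format). The closing quote is optional because the
# quoted line is cut off before it.
FAILED_OPEN = re.compile(r"Failed to open disk for consolidate '([^'\n]+)'?")


def failed_consolidate_disks(log_text):
    """Return the VMDK paths named in 'Failed to open disk for consolidate' lines."""
    return [match.group(1).strip() for match in FAILED_OPEN.finditer(log_text)]


sample = (
    "DISK: Failed to open disk for consolidate "
    "'/vmfs/volumes/4c5880d7-5dcd2226-7980-0024e8679cae/MYHOST01/MYHOST01_1-000066.vmdk"
)

for path in failed_consolidate_disks(sample):
    print(path)
```

To run it against a real log, read the file into a string first and pass that string to the function in place of the sample.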

Running vmkfstools yields the following...

Lock [type 10c00001 offset 71018496 v 24740, hb offset 3964928
gen 385, mode 0, owner 00000000-00000000-0000-000000000000 mtime 535877 nHld 0 nOvf 0]
Addr <4, 111, 93>, gen 24718, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 406, nb 0 tbz 0, cow 0, newSinceEpoch 0, zla 4305, bs 65536


where no owner is identified. Not good. At this point, I was left with rebooting the guest OS and hoping that would take care of it.

Nope, that didn't work.
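A side note on reading that output: as far as I understand the VMware KB, the last segment of the owner field is the MAC address of the host holding the lock, so an all-zero owner means no host is claiming it. Here is a minimal sketch that checks the field for you. The function name is my own, and the MAC interpretation is my reading of the KB rather than something I verified on this host.

```python
import re

# Owner field as it appears in the vmkfstools -D output above:
# four hex groups of 8, 8, 4 and 12 digits.
OWNER = re.compile(
    r"owner\s+([0-9a-fA-F]{8}-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-([0-9a-fA-F]{12}))"
)


def lock_owner_mac(vmkfstools_output):
    """Return the owner's last segment formatted as a MAC, or None if the owner is all zeros."""
    match = OWNER.search(vmkfstools_output)
    if match is None:
        raise ValueError("no owner field found in the output")
    if set(match.group(1)) <= {"0", "-"}:
        return None  # all-zero owner: no host identified
    mac_hex = match.group(2).lower()
    return ":".join(mac_hex[i:i + 2] for i in range(0, 12, 2))


output = (
    "Lock [type 10c00001 offset 71018496 v 24740, hb offset 3964928\n"
    "gen 385, mode 0, owner 00000000-00000000-0000-000000000000 "
    "mtime 535877 nHld 0 nOvf 0]"
)

print(lock_owner_mac(output))
```

For the output in this post it prints None, which matches what I saw by eye. When the owner is not all zeros, the returned value can be compared against the MAC addresses of the hosts' management NICs.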

I did what I should have done to start with, which was to evaluate what was doing the backups. It turned out we had an Avamar proxy server which must have been performing VMDK backups of the systems. I edited the proxy configuration and determined that two disks from the system having issues were mounted to the Avamar proxy server as local disks. Once I removed the disks from the Avamar system, I was able to complete the consolidation without issue. Another crisis averted, time to head home.