Last week, one of my hosts purple-screened. This seems like a bad thing, but it’s really not, and it does happen sometimes. It’s good practice to determine the root cause in case it’s something likely to happen again.
Believe it or not, a purple screen is really a good thing: your system is trying to save you from something much worse. What is generally happening when you get a “Purple Screen of Death” is that some piece of hardware or software is misbehaving badly enough that data corruption is imminent, so the entire system halts to protect itself.
In my case, my PSOD came from a non-maskable interrupt (NMI). This is a special interrupt that the system is not allowed to ignore. It’s a signal that something critical just happened and Bad Things will ensue unless immediate action is taken. In the short term, it means your system just crashed. In the long term, it just saved you from potentially major data corruption. Bad, and yet good. On my HP server, NMI errors are generated by the hpnmi driver (which you should have installed on all of your HP ESXi hosts as part of the VMware HP driver package). This driver keeps an eye on your HP hardware and generates an NMI in the event of a catastrophic failure.
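If you want to confirm the driver is actually present on a host, you can list the installed VIBs over SSH and filter for it. A minimal sketch; “hpnmi” is how the VIB appeared in HP’s ESXi 5.x bundles, so adjust the pattern if your bundle names it differently:

```shell
# List installed VIBs and filter for the HP NMI driver.
# The name "hpnmi" matches HP's ESXi 5.x driver bundles; your
# bundle's VIB name may differ, so tweak the grep pattern if needed.
esxcli software vib list | grep -i hpnmi
```

No output means the driver isn’t installed, in which case the host will simply hang or reboot on a hardware NMI instead of leaving you a useful purple screen.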
To begin tracking down the cause of my crash, I first took a complete set of log files via vCenter. My generated bundle was about 800MB. To do this in the vSphere Client, click File, Export System Logs, then browse through your tree and select your host. Make sure you include the vCenter Server and Client logs as well for a complete picture.
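If vCenter is down or unreachable, you can generate the same kind of support bundle from the host itself with the vm-support utility over SSH. A sketch, assuming a datastore path that is only an example:

```shell
# Generate a host support bundle locally. -w sets the output
# directory; the datastore path below is only an example, so point
# it at any datastore with enough free space for the bundle.
OUTDIR=/vmfs/volumes/datastore1
vm-support -w "$OUTDIR"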
Now that I have a full set of logs, it’s time to start digging into them. If you’re not an experienced log reader, this process can be extremely daunting. You may find tools like the VMware Log Insight tool to be helpful. Unfortunately, this is not a free tool, but it does work very well.
Since I never got to see the actual PSOD screen before my fellow support engineer restarted the host, I wasn’t really sure what I was looking for. However, I had the e-mail alert from when the host went down so I had a target time and date. My PSOD was somewhere around 23:49 UTC. My first stop was /var/log/vmkernel.log, which should tell me everything the vmkernel was up to.
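With a target time in hand, you can narrow the search rather than reading the whole file. A quick grep for the minute range keeps the output manageable; the timestamp below is from my incident, so substitute your own:

```shell
# Show vmkernel entries from the minutes leading up to the crash
# (23:40-23:49 UTC on my host; adjust the pattern to your window).
grep '2014-05-30T23:4' /var/log/vmkernel.log | less
```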
2014-05-30T23:37:30.314Z cpu8:9295)FS3Misc: 1465: Long VMFS rsv time on 'SANB-ISO1' (held for 264 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors
VMB: 49: mbMagic: 2badb002, mbInfo 0x101158
VMB: 54: flags a6d
Ok, nothing there. Normal log entries right up to the reboot lines. Let’s try vpxa.log, the host agent’s log.
2014-05-30T23:47:16.908Z [3F4D3B90 verbose 'vpxavpxaInvtHost' opID=WFU-fe82c6fb] [HostChanged] Found update for tracked MoRef vim.HostSystem:ha-host
2014-05-30T23:47:16.909Z [3F4D3B90 verbose 'VpxaHalCnxHostagent' opID=WFU-fe82c6fb] [WaitForUpdatesDone] Starting next WaitForUpdates() call to hostd
2014-05-30T23:47:16.909Z [3F4D3B90 verbose 'VpxaHalCnxHostagent' opID=WFU-fe82c6fb] [WaitForUpdatesDone] Completed callback
Section for VMware ESX, pid=9549, version=5.1.0, build=1743533, option=Release
2014-05-31T00:10:15.434Z [FFC066D0 info 'Default'] Logging uses fast path: false
Still nothing! Normal operation right up until the reboot when it started logging again. Alright, let’s get nasty. We’ll go straight to the dump file at /var/core/vmkernel-zdump. In this case, the log export reconstructed the dump file for me automatically but it still left all of the FRAG parts. Lots and lots of binary data, but what’s this I see right around 23:47, two minutes before my e-mail:
2014-05-30T23:47:38.642Z cpu0:9115)WARNING: NMI: 952: LINT1 motherboard interrupt
That doesn’t look good. Scrolling a bit further to get past the stack dumps, I find the actual PSOD:
2014-05-30T23:47:38.819Z cpu0:9115)@BlueScreen: Panic requested by one or more 3rd party NMI handlers
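Rather than scrolling through the binary dump by eye, grep’s -a flag will treat binary data as text, so you can jump straight to the interesting strings. A sketch; the file name is from my bundle and yours may differ:

```shell
# -a treats binary data as text, -i ignores case. Search the
# reconstructed dump for NMI warnings and the panic banner.
grep -a -i -e 'NMI' -e 'BlueScreen' /var/core/vmkernel-zdump*
```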
Alright, so I have a purple screen requested by a third party NMI. The only third party NMI installed is HPNMI, as mentioned at the beginning of this post. HPNMI has a useful feature in that it will write logs to the iLO (integrated lights out), which is isolated from the running hypervisor on the host. Let’s go see what iLO has to say! I fire up my web browser and open my iLO Integrated Management Log:
Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 9, Function 0, Error status 0x00200000)
Uh oh. That’s a hardware failure.
Let’s see if we can figure out exactly what failed. I’ll SSH into my host so that I can enumerate the PCI Express bus and figure out what’s on Bus 0 Device 9, built into the motherboard. For this, I’ll use lspci.
~ # lspci -v
00:00:09.0 PCI bridge Bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 9 [PCIe RP[00:00:09.0]]
Class 0604: 8086:3410
Bus 0 Device 9 is the Intel PCI Express Root Port controller. Dang. This means something on the motherboard itself went south. I was hoping for a bad NIC or something.
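When the bus listing is long, you can filter straight to the device the iLO named. Bus 0, Device 9, Function 0 shows up in lspci output with the address prefix 00:00:09.0, so a grep with a line of trailing context gets you the answer in one shot:

```shell
# Filter the PCI listing down to Bus 0, Device 9, Function 0,
# which lspci prints with the address prefix 00:00:09.0.
# -A1 also shows the following line (the class/vendor ID info).
lspci -v | grep -A1 '00:00:09\.0'
```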
According to HP, the only way to be sure about this problem is to replace the entire motherboard. Unfortunately, this host is out of warranty and the HP Care Pack has run out. My only hope now is that this is a one-time error and not a symptom of a serious issue. I have since moved my more critical VMs off of this host and demoted it to “Spare host with non-critical VMs only”. Still usable, but not exactly trustworthy.
I hope this write-up encourages you to dig into your own PSODs instead of just rebooting and hoping for the best. The log files can be extremely intimidating, but even if you only understand 5% of the information they contain, it may be enough to give you insight into your issue. Thanks for reading!