Diagnosing a PSOD

Last week, one of my hosts purple-screened. This seems like a bad thing, but it’s really not, and it does happen from time to time. Still, it’s good practice to determine the root cause in case it’s something likely to happen again.

Believe it or not, purple screens are really a good thing. Your system is trying to save you from something much worse. What generally happens when you get a “Purple Screen of Death” is that some piece of hardware or software is misbehaving so badly that you are about to start experiencing data corruption, so the entire system halts to protect itself.

In my case, the PSOD came from a non-maskable interrupt (NMI). This is a special interrupt that the system is not allowed to ignore. It’s a signal that something critical just happened and Bad Things will ensue unless immediate action is taken. In the short term, it means your system just crashed. In the long term, it just saved you from potentially major data corruption. Bad, and yet good. On my HP server, NMIs are generated by the hpnmi driver (which you should have installed on all of your HP ESXi hosts as part of your VMware HP driver package). This driver keeps an eye on your HP hardware and generates an NMI in the event of a catastrophic failure.
