Let me preface this by saying that this is not my first vSphere upgrade. I’m comfortable with the procedures and I’m confident in my ability. I still broke the network. But, like a good admin should, I own up to it. It was my fault. But, also like a good admin, I tracked down the outage and fixed it. Pointing fingers and redirecting blame doesn’t get the packets flowing again. Ok, I’ve got that out of the way so you can’t look upon me with scorn.
It all started during a monthly maintenance cycle. This month, I was deploying a new vCenter Server Appliance (VCSA) 5.5 to replace our non-virtualized Windows 2008 vCenter 5.1 server. I didn’t really care about historical data, and recreating permissions and so forth is relatively simple in this particular environment, so my plan was to stand up the new VCSA, disconnect the hosts from the old vCenter, and connect them to the VCSA. From there, I would use VUM (vSphere Update Manager) to upgrade the hosts from 5.1 to 5.5.
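For anyone who would rather script that host shuffle than click through it, it’s only a few lines of PowerCLI. This is just a sketch with made-up server names and credentials, not a transcript of what I actually ran:

# Disconnect the host from the old vCenter (names and credentials are hypothetical)
Connect-VIServer old-vcenter.example.com
Get-VMHost esx01.example.com | Set-VMHost -State Disconnected

# Add the same host to the new VCSA; -Force is needed because the host is still registered to the old vCenter
Connect-VIServer new-vcsa.example.com
Add-VMHost -Name esx01.example.com -Location (Get-Datacenter DC1) -User root -Password 'not-my-real-password' -Force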
Everything went great until I got home at 3:30am. I VPN’ed into the network to make sure things were running and…couldn’t. The VPN client authenticated but I couldn’t get an IP address. That’s odd, thought I. I fired up my out-of-band management tool to start troubleshooting. Hmm, our DHCP server failed to come back up after reboot. Ok, that’s not too much of an issue. I got that running again and still could not connect.
Ok, so let’s start looking at logs. I fire up my DHCP test tool and learn that I can acquire an address, but only on the same subnet as the DHCP server. That certainly sounds like a networking problem, but I didn’t mess with networking during the upgrade. In any case, while we do have VLANs on the network, the ESXi hosts don’t do anything fancy with them: simple trunk ports on the physical side and VLAN-tagged port groups on the vSwitches. Nothing changed; the configs are all the same as before the upgrade.
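If you want to sanity-check that for yourself after an upgrade, the VLAN assignments are quick to eyeball from the ESXi shell. This assumes standard vSwitches rather than a distributed switch:

# list port groups and the VLAN ID on each
esxcli network vswitch standard portgroup list
# list the vSwitches themselves, including uplinks and MTU
esxcli network vswitch standard list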
Next, it’s time to look at the physical network. I can’t ping the management interface of the core router. Well, that’s a thing, isn’t it? Maybe it’s still VMware, though, so I remote into a non-virtualized server (but on the same physical switches). Same results. Normal traffic everywhere else: I can get out to the internet, I can see the DHCP server, I can get DHCP requests to it (but only on the local subnet), but pinging the management interface of the core router drops something like 75% of the packets.
Now remember, it’s about 4:30am and I’ve been up since 6:00am yesterday. I’m not thinking exceptionally well at this point. The first thing that comes into my head is that the ip helper-address commands have somehow been removed from one or more VLANs on the core router. I know it doesn’t make any sense, but that’s all I can come up with. I spend another hour troubleshooting before coming to the conclusion that the router has somehow freaked out and that maybe a reboot will fix it. I fire off an e-mail to my coworkers asking them to try that when they get in, and then I go to bed; it’s now 5:30am and I’m too tired to drive safely, so going back to the office is out of the question.
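For anyone who hasn’t fought with it before, ip helper-address is what turns a router interface into a DHCP relay. On a Cisco-style core router the per-VLAN config looks roughly like this (the addresses here are invented for illustration), and if those helper statements disappear, clients outside the DHCP server’s subnet stop getting leases, which is exactly the symptom I was staring at:

interface Vlan20
 ip address 10.0.20.1 255.255.255.0
 ! relay DHCP broadcasts from this VLAN to the DHCP server
 ip helper-address 10.0.10.25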
I get about three hours’ sleep before I get up to check my mail to see if there’s still a problem. No trouble tickets! Excellent. Problem solved. VPN in to fiddle a couple of things and…wait, what? Still can’t get an IP address. That’s not sandwich talk, friends. That’s not sandwich talk at all. Clearly, DHCP is still not working and yet I’m not getting buried with trouble tickets about connectivity!
I race to the office at a little over Mach 2 and proceed to dig in onsite, in person, and with a live laptop connected to Stuff(tm).
One of the tier 1 techs had been assigning static IPs to affected users, and the rest still had DHCP leases. OK, that’s one mystery settled. However, as soon as the rest of the leases expire, I’m boned. VoIP phones will stop working, too. I gotta get this worked out tout de suite.
I connect directly to the core router and manually assign myself an IP address on the management network. STILL dropping mad packets when hitting the management IP. Ok, so whatever is wrong, it isn’t (necessarily) somewhere out in the network path; it’s the router itself. Throw a console cable on the router and see that the CPU is sitting at 60% on a box that should be (mostly) idle. So, something is hammering the router’s control plane: the CPU is spiking, and anything the router has to process in software is getting dropped. That’s why the ip helper-address relay can’t get requests to the DHCP server and why the management and VLAN gateway addresses barely answer pings, even though normal routed traffic still flows. Time to fire up my packet sniffer and look for the broadcast storm that I now know must be the cause.
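Assuming a Cisco-style box (the ip helper-address syntax above points that way), the quick checks for “what is eating my CPU, and for how long” are something like:

show processes cpu sorted
show processes cpu history

The first shows which processes are burning cycles right now; the second shows how long the CPU has been pegged.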
I start connecting to a few random switches and I’m seeing nothing. Normal broadcast traffic. Exactly zero out of the ordinary. Not a broadcast storm after all. Unfortunately, I don’t have NetFlow-capable switches, so I can’t look very hard at traffic patterns across the infrastructure. Sit back and think. The only thing that changed was VMware. The network went down right after my upgrade. They HAVE to be related. There’s no way around it. Let’s isolate my hosts.
I have two cross-connects from my VMware stack. The first is a port-channel off the public side, and the other is a simple trunk off the storage side for management and maintenance. Due to a design limitation going back several years, the ESXi management ports are on the storage side. Don’t worry, I’m working on getting this fixed. Stop badgering me.
I started by shutting down the front side channel group. No result. Now it’s time to pull the trunk cable on the storage side. IMMEDIATELY fixes the problem across the entire network. DHCP functioning again, core router entirely accessible. WAT.
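Now I needed to see what was actually crossing that trunk before I condemned it, which meant mirroring the port and putting a sniffer on it. On a Catalyst-style switch a SPAN session is only a couple of lines; the interface names here are invented, not my actual ports:

monitor session 1 source interface GigabitEthernet1/0/10 both
monitor session 1 destination interface GigabitEthernet1/0/48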
The thing is, the link across that trunk had looked fine. I was getting sub-millisecond ping times across it. If it were saturating the network, I would have seen those times go up…right? So what’s going on? Set up the SPAN port, connect my sniffer, plug the cable back in. In the space of seven seconds, I captured 22,927 packets, and they all said the same thing:
Source: (my ESXi hosts) Destination: (my syslog server) Protocol: syslog Info: DAEMON.ERR Terminating on fatal IPC exception
Wow! The network crashed because my ESXi hosts were spewing syslog traffic from the back side to the front so fast that it tanked the management interface on the core router! Well, I could fix the networking side of it simply enough: I turned off remote syslogging on the hosts. Now I could re-connect my management trunk and get everyone working again, and then go find out why all three hosts were erroring out so dramatically.
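For reference, killing the remote log target is a couple of commands per host from the ESXi shell; the same setting lives in the host’s Advanced Settings as Syslog.global.logHost. A blank value just drops the host back to local-only logging:

# see where logs are currently being shipped
esxcli system syslog config get
# clear the remote loghost, then reload the syslog service
esxcli system syslog config set --loghost=""
esxcli system syslog reload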
SSH into one of the hosts and tail -f the (now local) syslog. I find this error scrolling by so fast that I can’t read the timestamps:
2015-08-19T18:38:50Z lsassd[34304]: 0x668f9b70:Terminating on fatal IPC exception
lsassd is the daemon ESXi uses for Active Directory authentication (part of the Likewise agent baked into the host). Something about the hosts’ connection to Active Directory had gotten borked in a horrible way. A quick search of the VMware knowledge base turns up this KB article. It turns out that upgrading my hosts from 5.1 to 5.5 (probably in conjunction with moving them to a new vCenter instance) had broken their Active Directory domain membership and, like a psychotic ex-girlfriend, they decided to complain about it LOUDLY AND WAY TOO OFTEN. Unjoining the hosts from the domain and rejoining them was sufficient to resolve the problem completely.
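The leave/rejoin itself can be done per host in the vSphere Client under Authentication Services, or scripted. Here is a rough PowerCLI sketch, with the host, domain, and account names all made up (check Get-Help Set-VMHostAuthentication if your PowerCLI version behaves differently):

# leave the domain, then rejoin with an account that has join rights
Get-VMHost esx01.example.com | Get-VMHostAuthentication |
    Set-VMHostAuthentication -LeaveDomain -Force -Confirm:$false
Get-VMHost esx01.example.com | Get-VMHostAuthentication |
    Set-VMHostAuthentication -JoinDomain -Domain example.local -Username 'EXAMPLE\svc-adjoin' -Password 'not-my-real-password'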
So, I learned that when I upgrade a host to a new version, I should unjoin it from Active Directory first. Although I’ve never had this problem on any other host I’ve upgraded, it’s easy enough to do and is a simple preventative measure. I also learned that I never would have known there was a problem if I hadn’t been syslogging to an external logging host. While that logging did cause a network outage, it possibly saved me from a bigger problem somewhere down the line.