So I had an interesting little problem this morning. I got a call from a fellow engineer asking if something was wrong with our vCenter 5.1 server: he couldn’t log in. Obviously, that’s more than a little concerning, so I told him I’d take a look at it. I brought up my client, attempted to sign in, and received the following error:
A general system error occurred: Authorize Exception
Well. That’s odd. As it happened, I had an already-logged-in session on the vCenter server itself, so I remoted in to see if anything stood out. Nothing in the event logs, nothing in the client; everything looked fine. According to VMware, this error occurs when the SSO service loses connectivity to your LDAP authentication source, but there were no connection problems between the SSO service and the local domain controller.
I dug into the SSO log file at C:\Program Files\VMware\Infrastructure\SSOServer\logs\ssoAdminServer.log and saw a large string of errors, but this particular one caught my eye:
[2014-08-29 05:37:53,500 WARN opID= DomainKeepAliveThread com.vmware.vim.sso.admin.server.impl.KeepAlive] Unexpected exception in KeepAlive attempt.
com.rsa.common.ConnectionException: Error connecting to the identity source
Caused by: javax.naming.NamingException: getInitialContext failed. javax.resource.spi.ResourceAdapterInternalException: Unable to create a managed connection 'ldaps://<my domain controller>:3269' with 'GSSAPI'
Now, the interesting thing about this is that the domain controller listed is in another AD site. There’s no reason the server should be attempting to use it for authentication. So, now I had two problems:
- Why was SSO attempting to contact that server for authentication instead of the local domain controller?
- Why was that server not responding to requests?
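A quick first swing at the second question is a raw TCP check against the Global Catalog SSL port (3269) named in the error. This is just a sketch: the hostname below is a stand-in for the domain controller from the log, and a successful TCP connect only proves the port answers, not that an LDAPS bind would actually succeed.

```python
import socket

def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 3269 is the Global Catalog over SSL port from the SSO error;
# 'dc01.example.com' is a placeholder, not my real DC name.
if port_reachable("dc01.example.com", 3269):
    print("GC SSL port is answering")
else:
    print("GC SSL port is unreachable")
```

Run it from the vCenter box itself, since that's where the SSO service's connection attempts originate.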
I decided to tackle the first one first and see if I could resolve the authentication problem to get my vCenter accessible again. I fired up my web browser and attempted to log in with the Admin@system-domain account, since I needed to look at the SSO settings. Aaaaand I added a third problem to my list: the password had expired. *sigh*
At least this problem is easy to fix:
- Stop the vCenter SSO service
- Open SQL Management Studio and dig into the RSA database
- Expand tables and locate dbo.IMS_AUTHN_PASSWORD_POLICY
- Right-click and choose “Edit top 200 rows”
- Look for the row whose NOTES field reads “Password Policy for SSO system users”.
- In that row, scroll over to the MAX_LIFE_SEC column and change the value. The default is 1 year (expressed in seconds). I changed mine to 90000000 seconds, which is roughly 1041 days.
- Restart the vCenter SSO service and log in again.
- Now you can change the password or set the password expiration to 0 (never expire) in the web client.
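For reference, the seconds math behind that MAX_LIFE_SEC step checks out like this (plain arithmetic, nothing vCenter-specific; the one-year figure assumes a 365-day year):

```python
# One year expressed in seconds -- the default MAX_LIFE_SEC lifetime,
# assuming a 365-day year.
one_year_sec = 365 * 24 * 60 * 60
print(one_year_sec)  # 31536000

# The value I set: 90,000,000 seconds, a bit over 1041 days.
new_max_life_sec = 90_000_000
print(new_max_life_sec // (24 * 60 * 60))  # 1041
```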
Back to the original problem! Now that I was logged in as Admin@system-domain, I could look at my authentication sources. I clicked Administration, then Sign-On and Discovery, then Configuration.
I chose my AD identity source and clicked Edit.
Sure enough, my backup LDAP server was the inaccessible one. I changed the secondary to a more appropriate (and working) server, saved the config, and my login problem was resolved!
So, what was going on here? Well, a couple of things. We had recently been making some routing changes at our datacenter. One of those changes removed the route to one of our remote sites, making it inaccessible. The other remote site contained the secondary LDAP source, which was down due to a power outage. That left the local DC more or less isolated in the Active Directory replication scheme. Normally that isn’t a problem and AD will repair itself automatically, but in this case it caused SSO to skip the primary and attempt to authenticate against the dead secondary. Correcting the identity sources, restoring the downed server, and repairing AD replication resolved all of the sign-in errors.
Surprisingly, this was the only symptom of the outage. Our systems are built with a lot of redundancy, so things tend to stay up when something breaks, but every once in a while something snaps in a way our monitoring doesn’t detect, and we end up with weird little problems that are hard to track down.