Issue with vSphere vCLS VMs
This short post is about an issue in VMware vCenter that causes vSphere Cluster Services (vCLS) VMs to fail to deploy. Because of this, cluster functions like Distributed Resource Scheduler (DRS) don’t work. My post shows a solution for this problem.
The customer’s environment has clusters with functioning DRS. When a new cluster was deployed, everything worked fine at first. But as soon as DRS and HA were enabled, an error appeared in vCenter:
[vSphere DRS functionality was impacted due to unhealthy state vSphere Cluster Services caused by the unavailability of vSphere Cluster Service VMs. vSphere Cluster Service VMs are required to maintain the health of vSphere DRS.]
This means that vSphere could not successfully deploy the vCLS VMs in the new cluster. Unfortunately, we were not able to find the root cause. What we tried to resolve the issue:
- Deleted and re-created the cluster. We tried creating the cluster and enabling HA and DRS in different orders. No matter what we tried, we always got the same error.
- Since we used hosts that were already used in other clusters, we reinstalled them.
- Attempted to switch Retreat Mode on and off. Here things got weird. For the new cluster, Retreat Mode changed nothing. That made sense, since nothing had worked so far. But when we tried switching the mode for other clusters too, nothing changed there either. So neither the provisioning nor the deletion of vCLS VMs worked.
- Analyzed logs. We found that it couldn’t be a host problem: the ESXi hosts never received any request to create these VMs. But we also didn’t find a hint in the vCenter logs.
- Checked the Security Token Service (STS) certificate. An expired STS certificate can cause many different issues. In this case, the certificate was valid.
- Although the STS certificate was valid, we re-created it using the fixsts script. Also no success.
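For readers who have not used Retreat Mode before: as far as I know, in vSphere 7.0 it is switched through a vCenter advanced setting whose name contains the cluster's managed object ID. The following is only a small Python sketch that builds that setting name and value. The function name and the example ID `domain-c1234` are made up for illustration; take the real ID from your own cluster and check the VMware KB article for your version before changing anything.

```python
import re


def retreat_mode_setting(cluster_moid, retreat=True):
    """Return the (key, value) pair for the vCLS advanced setting.

    cluster_moid is the cluster's managed object ID, e.g. "domain-c1234"
    (a hypothetical example ID). retreat=True yields the value that
    removes the vCLS VMs, retreat=False the value that lets vCenter
    deploy them again.
    """
    if not re.fullmatch(r"domain-c\d+", cluster_moid):
        raise ValueError(f"unexpected cluster ID format: {cluster_moid!r}")
    key = f"config.vcls.clusters.{cluster_moid}.enabled"
    value = "false" if retreat else "true"
    return key, value


if __name__ == "__main__":
    key, value = retreat_mode_setting("domain-c1234")
    print(f"{key} = {value}")
```

The helper only formats the setting; it does not talk to vCenter. Entering the pair in the vCenter advanced settings is still a manual step.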
Solution of the problem
After all the setbacks, we finally found a solution. Re-creating the solution user certificates restored the ability to deploy vCLS VMs. How you re-create solution user certificates depends on the way you handle certificates in vCenter. In this case (and I guess this is the case for most environments), the machine certificate had been replaced by a certificate signed by a local CA. All other certificates had not been replaced. With this setup, the solution user certificates can easily be replaced in Certificate Manager.
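Before and after re-creating certificates, it can be useful to check how long a certificate is still valid. The sketch below is not part of our fix. It assumes you already have the expiry line that `openssl x509 -noout -enddate` prints for a certificate (for example `notAfter=Jan  5 09:34:43 2018 GMT`) and only converts it into the number of days remaining. The function name is hypothetical.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(enddate_line, now=None):
    """Return the whole days left until the certificate expires.

    enddate_line is the text printed by `openssl x509 -noout -enddate`,
    with or without the leading "notAfter=". A negative result means
    the certificate has already expired.
    """
    value = enddate_line.strip()
    if value.startswith("notAfter="):
        value = value[len("notAfter="):]
    # ssl.cert_time_to_seconds parses the "Mon DD HH:MM:SS YYYY GMT" format.
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).days


if __name__ == "__main__":
    # Example line only; replace it with the output for your certificate.
    print(days_until_expiry("notAfter=Jan  5 09:34:43 2018 GMT"))
```

Passing `now` explicitly makes the function easy to test; in normal use you would leave it out.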
If your company has stricter security guidelines, you may have to follow one of these instructions:
- Replace Solution User Certificates with VMCA Certificates (Intermediate CA)
- Replace Solution User Certificates with New VMCA-Signed Certificates
If you are unsure about your current configuration and the implications of this solution, open a support ticket for clarification!
- Here we have found the decisive clue to the solution.
- For more information about replacing the machine certificate in vCenter, read my post.
- If you try this at home, make sure you back up the vCenter appliance (VCSA) before critical steps! In this case, that means at least before you re-create any certificate. At least one of the following methods should be used:
- VM snapshot. To create the most reliable snapshot, shut down the VCSA and take the snapshot while the VM is powered off.
- File-level backup. For this, start a manual backup in the vCenter Server Management Interface (VAMI).