Issue with vSphere vCLS VMs
This short post is about an issue in VMware vCenter that prevents vSphere Cluster Services (vCLS) VMs from being deployed. As a result, cluster functions such as Distributed Resource Scheduler (DRS) do not work. This post shows a solution for the problem.
The customer's environment already had clusters with functioning DRS. When a new cluster was deployed, everything worked fine at first. But as soon as DRS and HA were enabled, an error appeared in vCenter:
[vSphere DRS functionality was impacted due to unhealthy state vSphere Cluster Services caused by the unavailability of vSphere Cluster Service VMs. vSphere Cluster Service VMs are required to maintain the health of vSphere DRS.]
Troubleshooting
This means that vSphere could not successfully deploy the vCLS VMs in the new cluster. Unfortunately, we were not able to find the root cause. Here is what we tried to resolve the issue:
- Deleted and re-created the cluster. We tried creating the cluster and enabling HA and DRS in different orders. No matter what we tried, we always got the same error.
- Reinstalled the hosts, since they had previously been used in other clusters.
- Attempted to switch Retreat Mode on and off. Here things got weird. For the new cluster, Retreat Mode changed nothing. That made sense, since nothing had worked so far. But when we tried switching the mode for other clusters too, nothing changed there either. So neither the provisioning nor the deletion of vCLS VMs worked.
- Analyzed logs. We found that it could not be a host problem: the ESXi hosts never received any request to create these VMs. But we also did not find a hint in the vCenter logs.
- Checked the Security Token Service (STS) certificate. An expired STS certificate can cause many different issues. In this case, the certificate was valid.
- Although the STS certificate was valid, we re-created it using the fixsts script. This did not help either.
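For reference, Retreat Mode is toggled through a per-cluster advanced setting in vCenter. The following Python sketch only builds the setting name and value for one cluster; it does not connect to vCenter. The key format is the one described in VMware's Retreat Mode documentation for vSphere 7.0 U1 and later, not something stated in this post, so verify it against your version. The function name and the cluster ID domain-c1234 are made up for illustration.

```python
import re


def retreat_mode_setting(cluster_moref: str, retreat_mode: bool) -> tuple:
    """Build the (name, value) pair of the per-cluster vCLS advanced setting.

    cluster_moref is the cluster's managed object ID, e.g. "domain-c1234"
    (hypothetical example). Retreat Mode ON means vCLS is disabled for
    that cluster, so the setting value is "false".
    """
    if re.fullmatch(r"domain-c\d+", cluster_moref) is None:
        raise ValueError(f"unexpected cluster ID format: {cluster_moref!r}")
    # Assumed key format, taken from VMware's Retreat Mode documentation.
    name = f"config.vcls.clusters.{cluster_moref}.enabled"
    value = "false" if retreat_mode else "true"
    return name, value
```

Calling retreat_mode_setting("domain-c1234", True) gives the name and value to enter under the vCenter advanced settings to put that cluster into Retreat Mode; passing False gives the value that takes it out again.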
Solution of the problem
After all the setbacks, we finally found a solution: re-creating the solution user certificates restored the ability to deploy vCLS VMs. How you re-create solution user certificates depends on how you handle certificates in vCenter. In this case (and I guess this is the case in most environments) the machine certificate had been replaced by a certificate signed by a local CA. None of the other certificates had been replaced. With this setup, the solution user certificates can easily be replaced in Certificate Manager.
If your company has stricter security guidelines, you may have to follow one of these instructions:
- Replace Solution User Certificates with VMCA Certificates (Intermediate CA)
- Replace Solution User Certificates with New VMCA-Signed Certificates
If you are unsure about your current configuration and the implications of this solution, open a support ticket for clarification!
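Before re-creating anything, it can be useful to note when the current certificates expire, in the same spirit as the STS check during troubleshooting. The post does not say that the solution user certificates had expired, so treat this only as a side check. The Python sketch below takes "Not After" values as they appear in OpenSSL-style certificate text output and reports how many days remain. The function names, the store names in the usage note, and all dates are illustrative assumptions; you would paste in the values from your own vCenter.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` until a certificate's Not After time.

    not_after uses the OpenSSL text format, e.g. "Jan 11 00:00:00 2025 GMT".
    A negative result means the certificate has already expired.
    """
    # ssl.cert_time_to_seconds parses this format into seconds since the epoch.
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def expiring_soon(certs: dict, now: datetime, threshold_days: float = 30) -> list:
    """Return the names of certificates that expire within threshold_days."""
    return sorted(
        name
        for name, not_after in certs.items()
        if days_until_expiry(not_after, now) <= threshold_days
    )
```

A call such as expiring_soon({"vpxd": "Jan 11 00:00:00 2025 GMT", "machine": "Jan  1 00:00:00 2027 GMT"}, datetime.now(timezone.utc)) returns the names that need attention. An empty list does not rule out the problem described here, since in our case the certificates had to be re-created anyway.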
Notes
- Here we have found the decisive clue to the solution.
- For more information about replacing the machine certificate in vCenter, read my post.
- If you try this at home, make sure you back up the vCenter Server Appliance (VCSA) before any critical step! In this case, that means at the very least before you re-create any certificate. At least one of the following methods should be used:
  - VM snapshot. To create the most reliable snapshot, shut down the VCSA and take the snapshot while the VM is powered off.
  - File-level backup. For this, start a manual backup in the vCenter Server Management Interface (VAMI).
I really regret not reading this article until so late. In fact, about a week ago, I changed the IP, subnet, and domain name of my vCenter server. After that, I completely reconfigured everything, but still encountered the problem of vCLS not being created correctly. I also did something similar: sshed to each ESXi and vCenter server, pinging each other to ensure that the network was working, but the problem still couldn’t be solved. I looked at many tutorials online, but it was your article that ultimately saved me! Thank you sooooooooooooooo much!
Hi!
I have 2 different vCenters, each with 1 cluster and 3 ESXi hosts.
I experienced the problem of vCLS not auto-deploying on both of them. I tried your workaround and it worked on one of them, but not on the other.
I tried to create another cluster, but that did not work either.
Can you advise another solution if this one doesn't work? These ESXi hosts were previously in another vCenter. Would you recommend reinstalling them?
Thank you so much <3
Hi!
Before re-installing vCenter, I would try to re-create or reset all certificates using the certificate manager (option 8). But be careful with this step: make as many backups as possible. And be aware of the tasks you have to do after re-creating certificates. Another suggestion would of course be to open a ticket with VMware support!
Hi,
After following the instructions from the KB article (https://kb.vmware.com/s/article/2112577), the vCLS VMs were deployed correctly, and DRS started to work.