Issue with vSphere vCLS VMs

This short post is about an issue in VMware vCenter that causes vSphere Cluster Services (vCLS) VMs to fail to deploy. As a result, cluster functions such as Distributed Resource Scheduler (DRS) do not work. This post shows a solution for the problem.

The customer's environment has clusters with working DRS. When a new cluster was deployed, everything worked fine. But after DRS and HA were enabled, an error appeared in vCenter:

[vSphere DRS functionality was impacted due to unhealthy state vSphere Cluster Services caused by the unavailability of vSphere Cluster Service VMs. vSphere Cluster Service VMs are required to maintain the health of vSphere DRS.]

Troubleshooting

This means that vSphere could not deploy the vCLS VMs in the new cluster. Unfortunately, we were not able to find the root cause. Here is what we tried to resolve the issue:

  • Deleted and re-created the cluster. We tried different orders of creating the cluster and enabling HA and DRS. No matter what we tried, we always got the same error.
  • Since the hosts had previously been used in other clusters, we reinstalled them.
  • Tried switching Retreat Mode on and off. Here things got weird. For the new cluster, Retreat Mode changed nothing. That made sense, since nothing had worked so far. But when we switched Retreat Mode for other clusters, nothing changed there either. So neither the deployment nor the deletion of vCLS VMs worked.
  • Analyzed logs. We found that it could not be a host problem: the ESXi hosts never received any request to create these VMs. However, we did not find any hint in the vCenter logs either.
  • Checked the Security Token Service (STS) certificate. An expired STS certificate can cause many different issues. In this case, the certificate was valid.
  • Although the STS certificate was valid, we re-created it using the fixsts script. This did not help either.
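Two of the manual checks from the list above, toggling Retreat Mode and verifying a certificate's expiry date, can be sketched in Python. This is a minimal sketch under stated assumptions, not the procedure we used: the helper names `retreat_mode_key` and `days_until_expiry` are hypothetical, and the code assumes you already know the cluster's managed object ID (for example `domain-c8`, visible in the vSphere Client URL) and have already exported the certificate and run `openssl x509 -enddate -noout` on it. The `config.vcls.clusters.<moref>.enabled` advanced setting is the vCenter key VMware documents for Retreat Mode; the sketch only builds its name and does not change anything in vCenter.

```python
import ssl

SECONDS_PER_DAY = 86400


def retreat_mode_key(cluster_moref: str) -> str:
    """Hypothetical helper: build the vCenter advanced-setting name that
    controls Retreat Mode for one cluster, e.g. "domain-c8" ->
    "config.vcls.clusters.domain-c8.enabled".
    Value "False" enables Retreat Mode (vCLS VMs are removed);
    value "True" lets vCenter deploy them again."""
    if not cluster_moref.startswith("domain-c"):
        raise ValueError(
            f"expected a cluster MoRef like 'domain-c8', got {cluster_moref!r}"
        )
    return f"config.vcls.clusters.{cluster_moref}.enabled"


def days_until_expiry(openssl_enddate: str, now: float) -> float:
    """Hypothetical helper: parse a line such as
    'notAfter=Jan  5 09:34:43 2018 GMT' (as printed by
    'openssl x509 -enddate -noout') and return the days left
    relative to 'now' (Unix seconds). A negative result means expired."""
    value = openssl_enddate.strip()
    if value.startswith("notAfter="):
        value = value[len("notAfter="):]
    # ssl.cert_time_to_seconds parses the "Mon DD HH:MM:SS YYYY GMT" format
    expires_at = ssl.cert_time_to_seconds(value)
    return (expires_at - now) / SECONDS_PER_DAY


if __name__ == "__main__":
    import time

    print(retreat_mode_key("domain-c8"))
    remaining = days_until_expiry("notAfter=Jan  5 09:34:43 2018 GMT", time.time())
    print("expired" if remaining < 0 else f"{remaining:.0f} days left")
```

For example, `retreat_mode_key("domain-c8")` yields `config.vcls.clusters.domain-c8.enabled`; you would add that key under the vCenter Server advanced settings with the value `False` to turn Retreat Mode on, or `True` to let vCLS VMs be deployed again.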

Solution to the problem

After all the setbacks, we finally found a solution: re-creating the solution user certificates restored the ability to deploy vCLS VMs. How you re-create solution user certificates depends on how you manage certificates in vCenter. In this case (and I guess this applies to most environments), the machine certificate had been replaced by a certificate signed by a local CA, and all other certificates were left unchanged. With this setup, the solution user certificates can easily be replaced with Certificate Manager.

If your company has stricter security guidelines, you may have to follow one of these instructions:

If you are unsure about your current configuration and the implications of this solution, open a support ticket for clarification!

Notes

4 responses to “Issue with vSphere vCLS VMs”

  1. Anduin Xue says:

    I really regret not reading this article sooner. About a week ago, I changed the IP, subnet, and domain name of my vCenter server. After that, I completely reconfigured everything, but still ran into the problem of vCLS VMs not being created correctly. I also did something similar: I SSHed into each ESXi host and the vCenter server and pinged each other to make sure the network was working, but the problem still couldn't be solved. I looked at many tutorials online, but it was your article that ultimately saved me! Thank you sooooooooooooooo much!

  2. CA says:

    Hi!

    I have two different vCenters, each with one cluster of three ESXi hosts.
    I ran into the problem of vCLS VMs not deploying automatically on both. I tried your workaround and it worked on one of them, but not on the other.
    I tried creating another cluster, but that didn't work either.
    Can you advise me on another solution if this doesn't work? These ESXi hosts were previously in another vCenter; would you recommend reinstalling them?

    Thank you so much <3

    • vNote42 says:

      Hi!
      Before reinstalling vCenter, I would try to re-create, or rather reset, all certificates using Certificate Manager (option 8). But be careful with this step: make as many backups as possible, and be aware of the tasks you have to do after re-creating certificates. Another option, of course, is to open a ticket with VMware support!

  3. Artur says:

    Hi,
    After following the instructions from the KB article (https://kb.vmware.com/s/article/2112577), the vCLS VMs were deployed correctly, and DRS started to work.
