In my home lab I made the classic mistake of forgetting to enable EVC mode before adding my second NUC to the vSAN cluster. I was stuck for a day trying to find a solution, but I finally managed to solve it. Here’s how.
The situation
When I created my cluster, I totally forgot to enable EVC (Enhanced vMotion Compatibility). One excuse could be that I started with a single-node vSAN, so everything was running inside the same Intel NUC. Then, when I added the second node, vMotion kept working anyway because the CPUs in both nodes support the same EVC levels, so I didn’t worry too much about it:
Last week I had to do some heavy maintenance on my home lab, and I found out that some virtual machines were running with a higher EVC level than the underlying cluster. I still can’t understand how that is even possible; I searched around for a while but couldn’t find any explanation. Anyway, I had to fix the problem.
Per-VM EVC
Since vSphere 6.7 it’s possible to configure EVC not just at the cluster level, but also per VM. This forces the VM to boot with the configured level, regardless of the underlying cluster level. I thought this could be my way out of the issue, but I was only partially correct. What I did was stop all my VMs one by one, since a VM has to be powered off to configure this setting. As you can see below, the option is greyed out while the VM is running:
This was not a big deal as this is my home lab; I went through all the VMs to force the Haswell EVC mode, and I even created a PowerShell script to check which VMs were still missing it:
$clusters = Get-Cluster
foreach ($cl in $clusters) {
    Get-Cluster $cl | Get-VM |
        Select-Object Name, PowerState, @{Name='VM_EVC_Mode'; Expression={$_.ExtensionData.Runtime.MinRequiredEVCModeKey}} |
        Format-Table
}
For some virtual machines, the option was missing entirely. Thanks to this blog post, I learned that a VM has to be at least at hardware version 14 to enable per-VM EVC. I used the “Schedule VM Compatibility Upgrade” feature and rebooted each VM to fix the hardware version on all of them. I also did it for my VCSA: this operation is not supported by VMware, but since this is a home lab I accepted that I was going to do something unsupported. Every VM rebooted without any problem.
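To check the hardware versions in bulk before scheduling the upgrades, a quick read-only PowerCLI snippet like this can help; it only looks at the same ExtensionData object used in the script above, so nothing here changes any VM:

# List every VM with its virtual hardware version (e.g. vmx-13, vmx-14)
Get-VM | Select-Object Name, @{Name='HWVersion'; Expression={$_.ExtensionData.Config.Version}} | Sort-Object HWVersion | Format-Table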
After some hours of work, I ended up with this situation:
You can see here that the vCenter machine still has this weird EVC mode, higher than what the underlying host can actually do.
Catch-22 on VCSA!
“OK,” I thought, “one last VM to reconfigure before enabling EVC at the cluster level.” And here the problem started: the EVC mode of a VM can only be configured from vCenter, but the vCenter appliance is exactly the VM I needed to reconfigure. And unlike the virtual hardware upgrade, there is no scheduling option for EVC mode. I searched forums and blogs for a couple of hours, and apparently the solution is to create a new empty cluster, enable EVC mode on it immediately, move at least one ESXi host into it and migrate the VMs there as well. That sounded fine, but there was one last problem: I use vSAN as the only storage in my lab, so the procedure had to be a bit different. When I remove a host from the cluster, vSAN is also disabled on that node, and thus there is no shared storage anymore.
NFS to the rescue
FIRST: if you are using distributed switches, check that you have a portgroup with ephemeral binding (it lets you reconnect VMs to the network directly from the host client while vCenter is down)! If not, enable it before starting the whole procedure, or you will have even more problems when powering down the VCSA. As a safety net, you can also think about buying one of those Realtek USB-C to Ethernet adapters: they are very cheap and can save you from dangerous mistakes. I say this because they have already saved me multiple times.
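If you want to double-check the binding type from PowerCLI rather than the UI, a read-only sketch like this should be enough (it assumes the distributed switch cmdlets of PowerCLI are available):

# Show the port binding of every distributed portgroup; look for "Ephemeral" on the one the VCSA uses
Get-VDPortgroup | Select-Object Name, PortBinding, VDSwitch | Format-Table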
To start, I put one of my two hosts into maintenance mode. Since it is also a vSAN node, vSAN got reconfigured and disabled on node #2, so there was no datastore left on that node:
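For reference, the same step can be scripted; the host name below is a placeholder for my second NUC, and the vSAN data-migration option may need adjusting depending on how much spare capacity you have:

# Put the second node into maintenance mode; NoDataMigration is acceptable in a small lab,
# otherwise use EnsureAccessibility or Full to keep the vSAN objects compliant
Set-VMHost -VMHost "nuc02.lab.local" -State Maintenance -VsanDataMigrationMode NoDataMigration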
I created a new temporary cluster, immediately enabled EVC mode at the Haswell level, and migrated the ESXi server into it:
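In PowerCLI terms this is roughly the following, assuming a datacenter called “HomeLab” and the same placeholder host name as before:

# Create the temporary cluster with EVC already set to Haswell, then move the host into it
New-Cluster -Name "Temp_Cluster" -Location (Get-Datacenter "HomeLab") -EVCMode "intel-haswell"
Move-VMHost -VMHost "nuc02.lab.local" -Destination (Get-Cluster "Temp_Cluster")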
Then I needed a temporary datastore that could be shared between the two ESXi servers. Since I have a Synology NAS at home, I quickly created an NFS share and mounted it on both ESXi servers.
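Mounting the share is a one-liner per host; the NAS address and export path below are just placeholders for my Synology setup:

# Mount the temporary NFS export on every host in the inventory
Get-VMHost | ForEach-Object {
    New-Datastore -Nfs -VMHost $_ -Name "temp-nfs" -NfsHost "synology.lab.local" -Path "/volume1/temp-nfs"
}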
The next step was a Storage vMotion of the VCSA appliance onto the NFS share. Once the migration completed, I checked that the VM was visible from the datastore browser of the ESXi server running in the temporary “EVC-enabled” cluster:
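The Storage vMotion can also be driven from PowerCLI; “vcsa” here stands for whatever your appliance is called in the inventory:

# Move only the storage of the VCSA appliance onto the temporary NFS datastore
Move-VM -VM "vcsa" -Datastore (Get-Datastore "temp-nfs")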
This check has to be done in the local interface of the ESXi server, not from the VCSA. In fact, the next step was to power down the VCSA appliance, register it on the second ESXi host and power it on. Fingers crossed! (and HAVE A BACKUP!)
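Since vCenter is down at this point, the registration has to be done against the host itself, either from the ESXi host client or with PowerCLI connected straight to the host. A rough sketch, with a placeholder .vmx path that depends on how the appliance folder is named on the share:

# Connect directly to the second ESXi host, register the VCSA from the NFS share and power it on
Connect-VIServer -Server "nuc02.lab.local" -User "root"
$vcsa = New-VM -VMFilePath "[temp-nfs] vcsa/vcsa.vmx" -VMHost (Get-VMHost "nuc02.lab.local")
Start-VM -VM $vcsa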
And yes! vCenter was able to boot on the second ESXi host, in its own cluster, named “Temp_Cluster”. I was finally able to go into the original cluster, “VCC_Cluster”, and enable EVC mode, since all the other VMs there were powered off:
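Enabling EVC on the original cluster can be done from the UI as in the screenshot, or with one line of PowerCLI:

# Enable Haswell EVC on the original cluster (all the VMs left in it are powered off)
Set-Cluster -Cluster "VCC_Cluster" -EVCMode "intel-haswell" -Confirm:$false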
By running the previous PowerShell script again, I could see the two clusters, and none of the VMs had an unexpected EVC level anymore:
Finally, I was able to move vCenter and the second ESXi host back into the original cluster, and remove the temporary cluster and the NFS share.
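For completeness, the cleanup can be sketched in PowerCLI too, reusing the placeholder names from the previous snippets and assuming the default “vsanDatastore” name; it deliberately skips the vSAN resync details after the host rejoins the cluster:

# vMotion the VCSA back to the first host (the NFS share is visible to both), then back onto vSAN
Move-VM -VM "vcsa" -Destination (Get-VMHost "nuc01.lab.local")
Move-VM -VM "vcsa" -Datastore (Get-Datastore "vsanDatastore")
# Move the now-empty second host back into the original cluster and remove the leftovers
Set-VMHost -VMHost "nuc02.lab.local" -State Maintenance
Move-VMHost -VMHost "nuc02.lab.local" -Destination (Get-Cluster "VCC_Cluster")
Set-VMHost -VMHost "nuc02.lab.local" -State Connected
Remove-Cluster -Cluster "Temp_Cluster" -Confirm:$false
Get-VMHost | ForEach-Object { Remove-Datastore -Datastore (Get-Datastore "temp-nfs") -VMHost $_ -Confirm:$false }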