Thursday, February 2, 2012

Nasty "bug" in ESXi 5 with hot-add cpu and 8+ cores

We recently did some load testing with a very large SQL Server instance running under ESXi, to check the viability of using our VMware environment to provision extra capacity for one of our production physical databases. The VM was set up with hot-add CPU and hot-plug memory so we could increase capacity as required. We had a host dedicated to this role and were pushing the VM up to 32 cores and 128GB RAM, a beast! The underlying server was a Dell M910 with 64 threads and 192GB RAM.

The problem became very noticeable as memory usage increased. While SQL Server was filling its buffer pool, at about 36-38GB it would stop responding normally: privileged CPU usage would increase dramatically, as would the rate at which it filled the buffer pool. Memory would keep climbing until it hit the SQL Server Max Memory value, and then the server would return to normal operation. We also noticed that the number of vCPUs affected the point at which the issue appeared: the fewer vCPUs, the longer the server stayed okay.

VMware were originally happy with the configuration of the VM, so it was over to Microsoft and the SQL Server core support team. Eventually we identified that if we disabled SQL Server NUMA with the startup trace flag T8015, the issue did not happen. So it was back over to VMware support, who came back with some more information.

It turns out that CPU hot-add and vNUMA are currently incompatible. When CPU hot-add is enabled, vNUMA is disabled, and the following is seen in the VMware log for the VM: "NUMA and VCPU hot add are incompatible.  Forcing UMA"
The guest OS tries to use NUMA but the VM cannot, and then the trouble starts.
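
If you want to check whether one of your own VMs has hit this, a quick scan of its log for that message is enough. The snippet below is only a rough sketch: the function name is made up for illustration, and it assumes you have already read the VM's log file into a string. It collapses whitespace before matching, since the message quoted above contains a double space.

```python
import re

# Message text as quoted in this post, with whitespace collapsed to single spaces.
MARKER = "NUMA and VCPU hot add are incompatible. Forcing UMA"


def find_forced_uma(log_text):
    """Return the log lines that contain the 'Forcing UMA' message.

    log_text is the full text of the VM's log (assumed to be read
    from disk by the caller).
    """
    hits = []
    for line in log_text.splitlines():
        # Collapse runs of whitespace so spacing differences do not matter.
        normalised = re.sub(r"\s+", " ", line)
        if MARKER in normalised:
            hits.append(line.strip())
    return hits
```

Any non-empty result means vNUMA has been switched off for that VM and the guest is being presented with a flat UMA topology.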

I hope VMware create a KB article to make people aware of this, as you wouldn't really expect this behaviour. Alternatively, they could alert users to the incompatibility of hot-add and vNUMA on systems with more than 8 vCPUs.
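
In the meantime, a rough check like the following could flag VMs that have the risky combination configured. Treat it as a sketch under assumptions rather than anything official: it assumes the VM's .vmx text stores the settings under the keys `vcpu.hotadd` and `numvcpus`, the function names are hypothetical, and the threshold of 8 vCPUs is simply taken from what we saw above.

```python
def parse_vmx(vmx_text):
    """Parse 'key = "value"' lines from .vmx text into a dict (keys lower-cased)."""
    settings = {}
    for line in vmx_text.splitlines():
        line = line.strip()
        # Skip blanks, comments and anything that is not a key/value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def hot_add_conflicts_with_vnuma(vmx_text, vcpu_threshold=8):
    """Return True if CPU hot-add is on and the VM has more than vcpu_threshold vCPUs.

    The key names 'vcpu.hotadd' and 'numvcpus' are assumptions about the
    .vmx contents; the default threshold comes from this post.
    """
    settings = parse_vmx(vmx_text)
    hot_add = settings.get("vcpu.hotadd", "false").lower() == "true"
    try:
        vcpus = int(settings.get("numvcpus", "1"))
    except ValueError:
        vcpus = 1  # Unparseable value: treat as a single-vCPU VM.
    return hot_add and vcpus > vcpu_threshold
```

For a VM like ours (32 vCPUs with hot-add enabled) this returns True, which is the cue to either turn hot-add off or accept that the guest will not get a NUMA topology.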
