We're running several hundred VMs on one of our clusters, and multiple business units manage the servers on these clusters at the OS level. One business unit runs its own monitoring software on its Windows servers, and that software reports that 'CPU Steal' is very high, which they attribute to CPU contention on the hosts. We manage the underlying infrastructure, so I'm trying to match vSphere metrics against what they are reporting to see whether it is relevant.
I'm not familiar with CPU Steal, but I would typically review the CPU Ready values of a VM experiencing CPU performance issues. As a general rule of thumb in my experience, a CPU Ready value below 5% is nothing to worry about.
Looking at one particular VM with reported high CPU Steal as an example, CPU Ready is very low, peaking at 1.2%; however, CPU Co-Stop reached peaks of 250 ms during these 1.2% Ready peaks. If these two values indicate the same (or similar) thing, namely that a VM's vCPU is waiting to run on the host's physical CPU, how can the values be so different?
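Part of the gap may simply be units: Ready is quoted above as a percentage, while Co-Stop is quoted in milliseconds. Below is a minimal sketch of the commonly used summation-to-percent conversion. It assumes the 250 ms figure is a summation value taken from the real-time performance chart, which samples every 20 seconds. The function name and the `vcpus` parameter are illustrative only and do not come from any vSphere API.

```python
def summation_ms_to_percent(value_ms, interval_s=20, vcpus=1):
    """Convert a vSphere summation counter (ms) to a percentage of the sample interval.

    value_ms   -- summation value in milliseconds (e.g. CPU Ready or Co-Stop)
    interval_s -- chart sample interval in seconds (assumed 20 s for real-time charts)
    vcpus      -- divide by the vCPU count to get a per-vCPU average (assumption:
                  the value is the VM-level aggregate across all vCPUs)
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    # Multiply before dividing to keep the arithmetic exact for simple inputs.
    return (value_ms * 100) / (interval_s * 1000 * vcpus)


if __name__ == "__main__":
    # The Co-Stop peak from the question, under the 20-second assumption.
    print(summation_ms_to_percent(250))
```

Under those assumptions, a 250 ms Co-Stop peak corresponds to 1.25% of a 20-second sample, which is the same order of magnitude as the 1.2% Ready peak. If the value came from a historical chart with a longer sample interval, the percentage would be correspondingly smaller.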
Looking at the Max VM CPU Contention values from vROps at the cluster level, the metric ranges from 4 to 16. What is an acceptable value for this metric?