In this blog post we are going to cover how to detect CPU Steal time using the G8Keeper application. We will also discuss what can cause it and how to remedy it. But before we go into more detail, let’s try to understand what CPU Steal time is.
CPU Steal time (also known as stolen CPU) is the percentage of time that a virtual machine’s processes spend waiting for the physical CPU to give them their CPU time.
For cloud-based virtual machines (VMs), a hypervisor acts as an interface between the physical server and its virtualized environment. This software layer is installed on the physical hardware and manages tasks such as allocating CPU time to processes on the virtual machines, networking operations, storage I/O requests, etc.
CPU steal time occurs when a process is ready to be executed by the virtual CPU, but the virtual CPU is waiting for the hypervisor to allocate a physical CPU to it. This can happen when the hypervisor is servicing another VM.
How to detect CPU Steal time
The easiest and most straightforward way is to run the “top” command on any Linux distribution. In its output, steal time is the “st” value on the CPU summary line, and it is displayed as a percentage.
Any small number up to 5% is generally fine. However, when you see a much bigger number, that is when you should get worried.
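If you want the same number without reading it off “top”, it can be derived from the counters Linux exposes in /proc/stat. The following is a minimal sketch, not part of G8Keeper: the function names are made up for illustration, and it assumes a Linux kernel that reports the steal column (the eighth counter on the aggregate “cpu” line).

```python
import time


def parse_cpu_line(stat_text):
    """Return the integer counters from the aggregate "cpu" line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            counters = [int(value) for value in fields[1:]]
            if len(counters) < 8:
                raise ValueError("this kernel does not report a steal counter")
            return counters
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Percentage of elapsed CPU time that was stolen between two samples."""
    deltas = [new - old for old, new in zip(before, after)]
    # Column order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted inside user and nice,
    # so only the first eight columns are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def measure_steal(interval_seconds=1.0, path="/proc/stat"):
    """Sample /proc/stat twice and return the steal percentage in between."""
    with open(path) as stat_file:
        before = parse_cpu_line(stat_file.read())
    time.sleep(interval_seconds)
    with open(path) as stat_file:
        after = parse_cpu_line(stat_file.read())
    return steal_percent(before, after)
```

Calling measure_steal() on a Linux VM returns the steal percentage over the last second, which you can compare against the 5% rule of thumb above.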
Causes of Steal time
There can be multiple causes for steal time on the CPU. Some are due to the processes on your own VM, while others are due to heavy operations running on other users’ VMs that share the same hardware on the cloud. Let’s discuss both of these in a bit more detail.
- CPU Steal time due to processes running on your server: This type of issue is most common if you are using a burstable type of cloud server. To read more about AWS burstable instances and how they work, click here.
Burstable instances give you CPU credits, which allow you to use more CPU for a limited time. Once the credits are exhausted while tasks that need higher CPU performance are still running, CPU throttling kicks in and shows up as steal time in the CPU usage.
- CPU Steal time due to other users’ VMs: Sometimes the cloud service provider may end up creating a lot of VMs on the same hardware due to a sudden surge in demand. Or some of the VMs that share the same hardware as your VM may suddenly need a lot of CPU. In that case the processes running on your VM may have to wait longer, and this leads to CPU Steal on your VM.
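To make the credit mechanism behind the first cause more concrete, here is a small simulation. It is a simplified sketch, not AWS’s actual accounting: it relies only on the AWS definition that one CPU credit equals one vCPU running at 100% for one minute, it ignores the cap on accrued credits, and the function name and all numbers are made up for illustration.

```python
def simulate_credits(demand_percent, baseline_percent, start_credits, vcpus=1):
    """Simulate a burstable credit balance minute by minute.

    demand_percent: per-minute CPU demand (0-100, averaged across vCPUs).
    Returns a list of (delivered_percent, credit_balance) tuples, one per minute.
    """
    balance = float(start_credits)
    timeline = []
    for demand in demand_percent:
        # Credits are earned at the baseline rate:
        # 1 credit = 1 vCPU at 100% for 1 minute.
        earned = vcpus * baseline_percent / 100.0
        available = balance + earned
        wanted = vcpus * demand / 100.0
        # The VM can only spend the credits it actually has.
        spent = min(wanted, available)
        balance = available - spent
        delivered = 100.0 * spent / vcpus
        timeline.append((delivered, balance))
    return timeline


# Hypothetical workload: 50% demand on a VM with a 10% baseline and 1 credit left.
for minute, (delivered, balance) in enumerate(
    simulate_credits([50] * 5, baseline_percent=10, start_credits=1.0), start=1
):
    print(f"minute {minute}: delivered {delivered:.0f}%, balance {balance:.2f}")
```

In this toy run the VM gets its full 50% only while credits last. After that it is held to the 10% baseline, and the shortfall between demand and delivered CPU is what shows up as steal time.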
Identifying Steal time on CPU
Identifying CPU steal time is difficult since it only shows up in the output of the “top” command. Looking at the CPU usage on your cloud service provider’s dashboard, it may not be clear that CPU Steal is happening. For example, looking at the CPU usage below for an AWS EC2 instance, it is not possible to say whether CPU steal is happening until you look at the output of the “top” command.
This is where G8Keeper comes to the rescue. By default, G8Keeper shows you total CPU usage along with the CPU used by user processes and system processes. You will see something like this
Looking at the chart above, you will notice a huge difference between total CPU usage (49%) and the sum of user and system CPU (2.2 + 5.5 = 7.7%). This is when you can click on the chart to see further details. When you do that, it will show you something like the chart displayed below
This chart clearly shows that CPU Steal is happening, and you can see the entire history of when it started so that you can diagnose the cause and remedy it. Now that we know what CPU steal time is, its types, and how to identify it, let’s look at how we can remedy the situation.
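The comparison of total CPU against user plus system CPU can also be written down as a quick check. This is only a sketch with hypothetical function names. Using 5% as the cut-off is an assumption borrowed from the rule of thumb for steal time, and the gap can also contain other categories such as I/O wait, so treat a large value as a hint to open the detailed chart rather than as proof of steal.

```python
def unaccounted_cpu(total_percent, user_percent, system_percent):
    """CPU usage that is not explained by user and system time."""
    return round(total_percent - (user_percent + system_percent), 1)


def worth_a_closer_look(total_percent, user_percent, system_percent, threshold=5.0):
    """True when the unexplained share is larger than the assumed threshold."""
    return unaccounted_cpu(total_percent, user_percent, system_percent) > threshold


# Values read from the chart discussed above.
print(unaccounted_cpu(49.0, 2.2, 5.5))      # 41.3
print(worth_a_closer_look(49.0, 2.2, 5.5))  # True
```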
Remedies of CPU Steal time
The remedy for CPU Steal time depends on its actual cause. Going back to the 2 types we discussed earlier:
- CPU Steal time due to processes running on your server: Investigate your CPU usage and the processes running on your server around the time CPU Steal started. See whether some processes have been using excessive CPU for some time. If you are on a burstable type of VM, are you running short on CPU credits? If yes, then the steal is caused by the processes running on your own server. In this case, investigate the processes to see whether there is a genuine surge in CPU usage (maybe due to increased traffic). If there is, switch to a bigger server. For example, when we investigated the CPU usage on our AWS instance, we saw the following
This clearly shows that CPU usage had been higher for the last 2 days. We can confirm the same with the G8Keeper logs and also see when CPU steal kicked in.
The chart shows increased CPU usage for some time, after which CPU Steal kicks in (where we see the orange line). Now we can check the AWS console to see whether we are running short on CPU credit balance.
Looking at that, we see that the CPU credit balance was continuously depleting until it finally reached almost 0.
So we conclude that this is a genuine case of CPU steal, and it should be solved by either getting more credits or increasing the server size.
- CPU Steal time due to other users’ VMs: In contrast to the case above, if we do not see any change in the CPU usage pattern on our VM and we have sufficient CPU credit balance, then we can conclude that the steal is caused by other users’ VMs. In this case you need to restart the server from the cloud service provider’s panel and hope that, when the server starts up, the provider places your VM on different hardware (which it generally would if some hardware is being overused). On AWS EC2, this means a stop followed by a start rather than a plain reboot, since a reboot keeps the instance on the same physical host.
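The two cases above boil down to a small decision rule. The sketch below only encodes the reasoning from this post. The function name, its parameters, and the “credits are low” threshold of 1.0 are illustrative assumptions, not something G8Keeper or AWS provides.

```python
def diagnose_steal(usage_increased, credit_balance, low_credit_threshold=1.0):
    """Suggest a likely cause and remedy for CPU steal.

    usage_increased: True if your own CPU usage rose around the time steal started.
    credit_balance: current CPU credit balance of a burstable VM.
    low_credit_threshold: assumed balance below which credits count as exhausted.
    """
    credits_exhausted = credit_balance <= low_credit_threshold
    if usage_increased and credits_exhausted:
        # Case 1: your own workload used up the credits and is being throttled.
        return "own workload: get more credits or move to a bigger server"
    if not usage_increased and not credits_exhausted:
        # Case 2: nothing changed on your side, so a neighbour is the likely cause.
        return "other users' VMs: restart the VM so it can land on different hardware"
    # Mixed signals: look at per-process CPU history before acting.
    return "inconclusive: investigate process-level CPU usage"


print(diagnose_steal(usage_increased=True, credit_balance=0.2))
print(diagnose_steal(usage_increased=False, credit_balance=120.0))
```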
Hopefully this takes care of all your doubts about CPU steal time. With G8Keeper installed, you can go a step further and see which processes are using the CPU, both in real time and historically. Based on that, you can make a more informed decision on whether to increase the server size or deal with a rogue process to bring CPU consumption back to normal.