By Tim Cook on Aug 31, 2007
...and how to measure both at the same time
With the delivery of Solaris 10, Sun made two significant changes to how system utilization is measured. One change was to how CPU utilisation is measured.
Solaris used to measure CPU utilisation by sampling it, and virtually all other POSIX-like OSes still do. The sampling happened once every "clock tick". A clock tick is a kernel administrative routine which is executed once (on one CPU) for every clock interrupt that is received, which happens once every 10 milliseconds. At this time, the state of each CPU was inspected, and a "tick" was added to one of the "usr", "sys", "wt" or "idle" buckets for that CPU.
The problem with this method is two-fold:
- It is statistical, which is to say it is an approximation of something, derived via sampling.
- The sampling happens just before the point when Solaris looks for threads that are waiting to be woken up to do work, so a thread that is woken after the sample and goes back to sleep before the next one is never counted.
Solaris 10 now uses microstate accounting. Microstates are a set of finer-grained states of execution, including USR, SYS, TRP (servicing a trap), LCK (waiting on an intra-process lock), SLP (sleeping) and LAT (on a CPU dispatch queue), each of which falls under one of the traditional USR, SYS and IDLE. These familiar three are still used to report system-wide CPU utilisation (e.g. in vmstat, mpstat, iostat), but you can see the full set of states for each process via "prstat -m".
The key difference in system-wide CPU utilization comes in how microstate accounting is captured - it is captured at each and every transition from one microstate to another, and it is captured in nanosecond resolution (although the granularity of this is platform-dependent). To put it another way, it is event-driven, rather than based on statistical sampling.
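To make the event-driven idea concrete, here is a minimal Python sketch of accounting that charges elapsed nanoseconds at every state transition. It is only an illustration of the principle, not how the Solaris kernel implements it; the class name `MicrostateAccount` and its methods are hypothetical, and the timestamps are made up.

```python
from collections import defaultdict


class MicrostateAccount:
    """Toy event-driven accounting (hypothetical, not the kernel's code)."""

    def __init__(self, start_ns, state="SLP"):
        self.state = state            # microstate the thread is currently in
        self.since_ns = start_ns      # when it entered that state
        self.total_ns = defaultdict(int)

    def transition(self, now_ns, new_state):
        # Charge the exact time spent in the state being left, then switch.
        self.total_ns[self.state] += now_ns - self.since_ns
        self.state = new_state
        self.since_ns = now_ns


acct = MicrostateAccount(start_ns=0, state="SLP")
acct.transition(2_000_000, "USR")   # woken after 2 ms asleep
acct.transition(5_000_000, "SYS")   # 3 ms of user code, then a system call
acct.transition(6_000_000, "SLP")   # 1 ms in the kernel, then back to sleep
print(dict(acct.total_ns))
```

Nothing is estimated here: every nanosecond between two transitions is charged to exactly one state, no matter where the clock ticks happen to fall.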
This eliminates both of the issues listed above, but it is the second issue that can cause some significant variations in observed CPU utilization.
If we have a workload that does a unit of work that takes less than one clock tick, then yields the CPU to be woken up again later, it is likely to avoid being on a CPU when the sampling is done. This is called "hiding from the clock", and is not difficult to achieve (see "hide from the clock" below).
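The effect can be reproduced on paper with a small Python simulation. This is a sketch under stated assumptions: a single hypothetical thread that starts work 1 ms after every 10 ms tick and runs for 8 ms, an idealised sampler that looks at the CPU exactly at the tick instant, and function names that are invented for the example.

```python
TICK_NS = 10_000_000  # one clock tick every 10 ms, as described above


def on_cpu(t_ns, start_offset_ns, busy_ns):
    """True if the hypothetical thread is running at time t_ns.

    The thread starts work start_offset_ns after each tick and runs busy_ns.
    """
    phase = t_ns % TICK_NS
    return start_offset_ns <= phase < start_offset_ns + busy_ns


def sampled_utilization(n_ticks, start_offset_ns, busy_ns):
    # Old method: inspect the CPU once per tick, at the tick instant.
    hits = sum(on_cpu(k * TICK_NS, start_offset_ns, busy_ns)
               for k in range(n_ticks))
    return hits / n_ticks


def event_utilization(n_ticks, busy_ns):
    # New method: add up the exact time spent on CPU.
    return (n_ticks * busy_ns) / (n_ticks * TICK_NS)


sampled = sampled_utilization(1000, 1_000_000, 8_000_000)
exact = event_utilization(1000, 8_000_000)
print(f"sampled: {sampled:.0%}  event-based: {exact:.0%}")
# prints "sampled: 0%  event-based: 80%"
```

The thread uses 80% of the CPU, yet the sampler never catches it running, because it is always asleep at the instant the tick fires.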
Other types of workloads that do not explicitly behave like this, but do involve processes that are regularly on and off the CPU, can appear to have different CPU utilization on Solaris releases prior to 10. The timing of their work interacts with the timing of the sampling, producing an effect much like the spokes of a wheel or propeller captured on video, which can appear to turn slowly, stand still or run backwards. Another factor is how busy the CPUs are: the closer a CPU is to either idle or fully utilized, the more accurate sampling is likely to be.
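The wheel-spoke effect can also be sketched in a few lines of Python. The workload here is hypothetical: it repeats every 10.1 ms (slightly longer than a tick) and is on CPU for the first half of each period, so its true utilization is 50%. The sampler is the same idealised once-per-tick inspection; all names and numbers are assumptions chosen for the illustration.

```python
TICK_NS = 10_000_000     # sampling interval: one clock tick
PERIOD_NS = 10_100_000   # hypothetical workload period, just over one tick
BUSY_NS = 5_050_000      # on CPU for the first half of each period (50% load)


def busy_at(t_ns):
    # The workload is running during the first BUSY_NS of every period.
    return (t_ns % PERIOD_NS) < BUSY_NS


def sampled_windows(n_windows, ticks_per_window):
    """Sampled utilization for consecutive reporting windows."""
    results = []
    for w in range(n_windows):
        first = w * ticks_per_window
        hits = sum(busy_at(k * TICK_NS)
                   for k in range(first, first + ticks_per_window))
        results.append(hits / ticks_per_window)
    return results


# Four consecutive half-second windows (50 ticks each).
windows = sampled_windows(n_windows=4, ticks_per_window=50)
print([f"{u:.0%}" for u in windows])
```

Because the sample point drifts slowly through the workload's cycle, individual half-second windows report values close to 0% or close to 100%, although the real load never changes. Only over a much longer interval does the sampled figure settle near 50%.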
What This Looks Like in the Wild
I was recently involved in an investigation where a customer had changed only their operating system release (to Solaris 10), and they saw a relative increase of almost 100% in reported CPU utilization. We suspected that the change to event-based accounting may have been a factor in this.
During our investigations, I developed a DTrace utility which measures CPU utilization the way Solaris 10 reports it and, at the same time, the way Solaris 9 and 8 would have measured it.
The DTrace utility, called util-old-new, is available here. It works by enabling probes from the "sched" provider to track when threads are put on and taken off CPUs. It is event-driven, and sums up nanoseconds the same way Solaris 10 does, but it also tracks the change in a system variable, "lbolt64", while threads are on CPU, to simulate how many "clock ticks" the thread would have accumulated. This should be a close match, because lbolt64 is updated by the clock tick routine, at pretty much the same time as when the old accounting happened.
Using this utility, we were able to prove that the change in observed utilisation was pretty much in line with the way Solaris has changed how it measures utilisation. The upside for the customer was that their understanding of how much utilisation they had left on their system was now more accurate. The downside was that they now had to re-assess whether, and by how much, this changed the amount of capacity they had left.
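The bookkeeping described above can be sketched as follows. This is not the util-old-new script itself, which is written in D; it is a Python illustration of the idea of recording both a timestamp and the tick counter when a thread goes on CPU, and charging the difference in each when it comes off. The class `DualAccounting`, its method names and the sample values are all hypothetical.

```python
class DualAccounting:
    """Toy model: accumulate on-CPU time in nanoseconds and in clock ticks."""

    def __init__(self):
        self.on_cpu_ns = 0      # event-based total (Solaris 10 style)
        self.on_cpu_ticks = 0   # simulated tick-based total (older style)
        self._start = {}        # cpu id -> (timestamp_ns, lbolt) at on-cpu

    def on_cpu(self, cpu, now_ns, lbolt):
        # A thread was just placed on this CPU: remember when.
        self._start[cpu] = (now_ns, lbolt)

    def off_cpu(self, cpu, now_ns, lbolt):
        # The thread is leaving: charge elapsed ns and elapsed ticks.
        start_ns, start_lbolt = self._start.pop(cpu)
        self.on_cpu_ns += now_ns - start_ns
        self.on_cpu_ticks += lbolt - start_lbolt


acct = DualAccounting()
# Thread A runs for 8 ms between two ticks, so the tick counter never moves.
acct.on_cpu(cpu=0, now_ns=1_000_000, lbolt=100)
acct.off_cpu(cpu=0, now_ns=9_000_000, lbolt=100)
# Thread B runs for 20 ms, during which two ticks occur.
acct.on_cpu(cpu=1, now_ns=5_000_000, lbolt=100)
acct.off_cpu(cpu=1, now_ns=25_000_000, lbolt=102)
print(acct.on_cpu_ns, acct.on_cpu_ticks)
# prints "28000000 2"
```

In this made-up trace, the nanosecond total is 28 ms while the two ticks are worth only 20 ms: thread A's 8 ms of work is invisible to the tick-based view.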
Here is some sample output from the utility. I start the script when I already have one CPU-bound thread on a 2-CPU system, then I start up one instance of Alexander Kolbasov's "hide-from-clock", which event-based accounting sees, but sample-based accounting does not:
mashie[bash]# util-old-new 5
NCPUs = 2
Date-time             s8-tk/1000  s9-tk/1000  ns/1000
2007 Aug 16 12:12:14         508         523      540
2007 Aug 16 12:12:19         520         523      553
2007 Aug 16 12:12:24         553         567      754
2007 Aug 16 12:12:29         549         551      798
2007 Aug 16 12:12:34         539         549      810
^C
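One way to read these numbers is to convert them to percentages and look at the gap between the tick-based and nanosecond-based columns. The short Python sketch below does that arithmetic on the rows shown above. It assumes each column is a figure out of 1000 across both CPUs, which is my reading of the "/1000" in the headers; the helper name `gap_points` is hypothetical.

```python
# (time, s8-tk/1000, s9-tk/1000, ns/1000) copied from the sample output.
rows = [
    ("12:12:14", 508, 523, 540),
    ("12:12:19", 520, 523, 553),
    ("12:12:24", 553, 567, 754),
    ("12:12:29", 549, 551, 798),
    ("12:12:34", 539, 549, 810),
]


def gap_points(row):
    """Percentage points by which event-based exceeds the s9 tick figure."""
    _, _, s9, ns = row
    return (ns - s9) / 10   # assumed: values are parts per thousand


for row in rows:
    time, s8, s9, ns = row
    print(f"{time}  s8 {s8 / 10:.1f}%  s9 {s9 / 10:.1f}%  "
          f"event {ns / 10:.1f}%  gap {gap_points(row):+.1f} pts")
```

Under that assumption, the first two intervals agree to within a few points at a little over 50%, which is what one CPU-bound thread on two CPUs should look like. In the last interval the event-based figure is 81.0% against 54.9% for the tick-based one, a gap of about 26 points that the older method does not see.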
The Other Change in Utilization Measurement
By the way, the other change was to "hard-wire" the Wait I/O ("%wio" or "wt" or "wait time") statistic to zero. The reasoning behind this is that CPUs do not wait for I/O (or any other asynchronous event) to complete - threads do. Trying to characterize how much a CPU is not doing anything in more than one statistic is like having two fuel gauges on your car - one for how much fuel remains for highway driving, and another for city driving.
References & Resources
- My util-old-new utility - based on DTrace
- "How Busy Is Your CPU, Really?" - article by Adrian Cockcroft.
- Eric Schrock's blog entry describing microstate accounting.
- Alexander Kolbasov's "hide from the clock" example program.
- "How busy is the CPU, really" - Adrian Cockcroft's ITworld article from 1998 - good diagrams to explain the shortcoming of sample-based accounting.
- Usenix Security Symposium paper - "Secretly Monopolizing the CPU Without Superuser Privileges"
- Interesting opensolaris.org thread that covers the Wait I/O issue.
- Call Record for the RFE to hard-wire Wait I/O to zero in Solaris 10.
P.S. This entry is intended to cover what I have spoken about in my previous two entries. I will soon delete the previous entries.