...and how to measure both at the same time
With the delivery of Solaris 10, Sun made two significant changes to how system
utilization is measured. One change was to how CPU utilisation is measured.
Solaris used to (and virtually all other POSIX-like OSes still do) measure CPU
utilisation by sampling it. This happened once every "clock tick". A
clock tick is a kernel administrative routine which is executed once (on one
CPU) for every clock interrupt that is received, which happens once
every 10 milliseconds. At this time, the state of each CPU was inspected, and a
"tick" would be added to one of the "usr", "sys", "wt" or "idle" buckets for that CPU.
The problem with this method is two-fold:
- It is statistical, which is to say it is an approximation of something, derived via sampling
- The sampling happens just before the point when Solaris looks for threads that are waiting to be woken up to do work.
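To make the first point concrete, here is a minimal sketch of tick-based bucket counting. It is not Solaris code: the timeline in `state_at` is a made-up workload, and the names `state_at` and `sample_buckets` are hypothetical. The workload is deliberately synchronised with the tick, which is a worst case for sampling.

```python
from collections import Counter

TICK_MS = 10  # clock interrupt period described above


def state_at(t_ms):
    """Hypothetical CPU timeline: every 10 ms is 2 ms usr, 1 ms sys, 7 ms idle."""
    phase = t_ms % 10
    if phase < 2:
        return "usr"
    if phase < 3:
        return "sys"
    return "idle"


def sample_buckets(state_fn, first_tick_ms, n_ticks):
    """Add one tick to whichever bucket the CPU is in at each clock tick."""
    buckets = Counter()
    for k in range(n_ticks):
        buckets[state_fn(first_tick_ms + k * TICK_MS)] += 1
    return buckets


# The true split is 20% usr, 10% sys, 70% idle, but every sample
# lands on the same phase of the workload.
aligned = sample_buckets(state_at, first_tick_ms=0, n_ticks=100)
shifted = sample_buckets(state_at, first_tick_ms=5, n_ticks=100)
print(dict(aligned))  # every tick charged to usr
print(dict(shifted))  # every tick charged to idle
```

The same CPU timeline is reported as either fully busy or fully idle, depending only on where the samples happen to fall.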
Solaris 10 now uses microstate accounting. Microstates are a set of
finer-grained states of execution, including USR, SYS, TRP (servicing a trap), LCK (waiting on an intra-process lock), SLP (sleeping) and LAT (on a CPU dispatch queue), although these all fall under one of the traditional USR, SYS and IDLE. These familiar three are still used to report system-wide CPU utilisation (e.g. in vmstat, mpstat, iostat); however, you can see the full set of states each process is in via "prstat -m".
The key difference in system-wide CPU utilization comes in how
microstate accounting is captured - it is captured at each and every
transition from one microstate to another, and it is captured in nanosecond
resolution (although the granularity of this is platform-dependent). To put it
another way, it is event-driven, rather than based on statistical sampling.
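As a rough illustration of event-driven accounting, the sketch below charges elapsed nanoseconds to the state being left at every transition. The class name `MicrostateAccounting` and the timestamps are invented for the example; this is a toy model, not how the kernel is actually structured.

```python
from collections import defaultdict


class MicrostateAccounting:
    """Toy model: charge elapsed nanoseconds to the state being left."""

    def __init__(self, initial_state, now_ns):
        self.state = initial_state
        self.since_ns = now_ns
        self.total_ns = defaultdict(int)

    def transition(self, new_state, now_ns):
        # Account for the time spent in the old state, then switch.
        self.total_ns[self.state] += now_ns - self.since_ns
        self.state = new_state
        self.since_ns = now_ns


acct = MicrostateAccounting("SLP", now_ns=0)
acct.transition("LAT", 1_000_000)  # woken up, waiting on a dispatch queue
acct.transition("USR", 1_200_000)  # running user code
acct.transition("SYS", 4_200_000)  # entered a system call
acct.transition("SLP", 5_000_000)  # blocked again
totals = dict(acct.total_ns)
print(totals)
```

Nothing is estimated here: each state's total is the exact sum of the intervals spent in it, however short they are.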
This eliminates both of the issues listed above, but it is the second issue that
can cause some significant variations in observed CPU utilization.
If we have a workload that does a unit of work that takes less than one clock
tick, then yields the CPU to be woken up again later, it is likely to avoid
being on a CPU when the sampling is done. This is called "hiding from the
clock", and is not difficult to achieve (see "hide from the clock" below).
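A small simulation shows the effect. It assumes a hypothetical thread that wakes 0.1 ms after each tick, works for 8 ms, then sleeps until after the next tick. The function names and timings are made up for illustration.

```python
TICK_NS = 10_000_000  # 10 ms clock tick


def on_cpu(t_ns, start_offset_ns, work_ns):
    """True if the hypothetical thread is running at time t_ns.

    The thread wakes start_offset_ns after each tick and runs for work_ns.
    """
    phase = t_ns % TICK_NS
    return start_offset_ns <= phase < start_offset_ns + work_ns


def sampled_and_actual(start_offset_ns, work_ns, n_ticks):
    # Sample-based view: inspect the CPU at each tick boundary.
    hits = sum(on_cpu(k * TICK_NS, start_offset_ns, work_ns) for k in range(n_ticks))
    sampled_pct = 100.0 * hits / n_ticks
    # Event-based view: the thread runs work_ns out of every TICK_NS.
    actual_pct = 100.0 * work_ns / TICK_NS
    return sampled_pct, actual_pct


sampled, actual = sampled_and_actual(
    start_offset_ns=100_000, work_ns=8_000_000, n_ticks=1000
)
print(sampled, actual)
```

The thread keeps the CPU busy 80% of the time, yet the sampler never catches it on CPU and reports 0%.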
Other types of workloads that do not explicitly behave like this, but do involve
processes that are regularly on and off the CPU can look like they have
different CPU utilization on Solaris releases prior to 10, because the timing of their work and the timing of the sampling interact to produce an effect somewhat like watching the spokes of a wheel or propeller captured on video. Another factor involved in this is how busy the CPUs are - the closer a CPU is to either idle or fully utilized, the more accurate sampling is likely to be.
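The wheel-spoke effect can be reproduced with a sketch like the one below. It assumes a made-up workload whose period (10.01 ms) is slightly longer than the 10 ms tick and which is busy for exactly half of each period. The sampler reports once per second (100 ticks).

```python
TICK_US = 10_000        # sampling interval: 10 ms
PERIOD_US = 10_010      # hypothetical workload period, slightly longer than a tick
BUSY_US = 5_005         # busy for the first half of each period (50% true utilization)
TICKS_PER_WINDOW = 100  # report once per second


def sampled_utilization_per_window(n_windows):
    """Percentage of ticks in each one-second window that found the CPU busy."""
    results = []
    for w in range(n_windows):
        first = w * TICKS_PER_WINDOW + 1
        busy_ticks = sum(
            1
            for k in range(first, first + TICKS_PER_WINDOW)
            if (k * TICK_US) % PERIOD_US < BUSY_US
        )
        results.append(100 * busy_ticks // TICKS_PER_WINDOW)
    return results


windows = sampled_utilization_per_window(10)
print(windows)
```

With these numbers the sampler reports 0% for five seconds and then 100% for the next five, while the true utilization is a steady 50%. The sample point drifts slowly through the workload's cycle, just as a video frame drifts across the spokes of a wheel.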
What This Looks Like in the Wild
I was recently involved in an investigation where a customer had changed only
their operating system release (to Solaris 10), and they saw an almost 100% relative increase in reported CPU utilization. We suspected that the change to event-based accounting may have been a factor in this.
During our investigations, I developed a DTrace utility which can capture CPU
utilization that is like that reported by Solaris 10, then also measure it the
same way as Solaris 9 and 8, all at the same time.
The DTrace utility, called util-old-new, is available here. It works by
enabling probes from the "sched" provider to track when threads are put on and
taken off CPUs. It is event-driven, and sums up nanoseconds the
same way Solaris 10 does, but it also tracks the change in a system variable,
"lbolt64" while threads are on CPU, to simulate how many "clock ticks" the
thread would have accumulated. This should be a close match, because lbolt64 is
updated by the clock tick routine, at pretty much the same time as when the old
sampling was done.
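The idea of measuring both ways from the same on-CPU and off-CPU events can be sketched in a few lines. This is not the DTrace script itself: `lbolt` here is a stand-in function that assumes a 10 ms tick, and `measure` and the two sample workloads are hypothetical.

```python
TICK_NS = 10_000_000  # assumed 10 ms between tick-counter increments


def lbolt(t_ns):
    """Stand-in for the kernel tick counter: clock ticks elapsed at time t_ns."""
    return t_ns // TICK_NS


def measure(run_intervals):
    """run_intervals: (on_cpu_ns, off_cpu_ns) pairs for one thread."""
    total_ns = 0
    total_ticks = 0
    for on_ns, off_ns in run_intervals:
        total_ns += off_ns - on_ns                   # event-based accounting
        total_ticks += lbolt(off_ns) - lbolt(on_ns)  # ticks that fired while on CPU
    return total_ns, total_ticks


# One thread that stays on CPU across ticks, and one that only runs between ticks.
cpu_bound = [(0, 50_000_000)]
hider = [(k * TICK_NS + 1_000_000, k * TICK_NS + 9_000_000) for k in range(5)]

cpu_bound_result = measure(cpu_bound)
hider_result = measure(hider)
print(cpu_bound_result)
print(hider_result)
```

Over the same 50 ms, the CPU-bound thread accumulates 50 ms and 5 ticks, while the thread that runs between ticks accumulates 40 ms but no ticks at all.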
Using this utility, we were able to prove that the change in observed
utilisation was pretty much in line with the way Solaris has changed how it
measures utilisation. The up-side for the customer was that their understanding
of how much utilisation they had left on their system was now more accurate. The down side was that they now had to re-assess whether, and by how much, this changed the amount of capacity they had left.
Here is some sample output from the utility. I start the script when I already have one CPU-bound thread on a 2-CPU system, then I start up one instance of Alexander Kolbasov's "hide-from-clock", which event-based accounting sees, but sample-based accounting does not:
mashie[bash]# util-old-new 5
NCPUs = 2
Date-time s8-tk/1000 s9-tk/1000 ns/1000
2007 Aug 16 12:12:14 508 523 540
2007 Aug 16 12:12:19 520 523 553
2007 Aug 16 12:12:24 553 567 754
2007 Aug 16 12:12:29 549 551 798
2007 Aug 16 12:12:34 539 549 810
The Other Change in Utilization Measurement
By the way, the other change was to "hard-wire" the Wait I/O
("%wio" or "wt" or "wait time") statistic to zero. The reasoning behind this is
that CPUs do not wait for I/O (or any other asynchronous event) to complete -
threads do. Trying to characterize how much a CPU is not doing anything in more than one statistic is like having two fuel gauges on your car - one for how much fuel remains for highway driving, and another for city driving.
References & Resources
P.S. This entry is intended to cover what I have spoken about in my previous two entries. I will soon delete the previous entries.