Wednesday Jan 21, 2015

CPU utilization of multi-threaded architectures explained

by Martin Tegtmeier from ISV Engineering in Walldorf


Simple CPU metrics (user/system/idle/io-wait) are still widely used, although these numbers need careful interpretation on today's multi-thread/multi-core architectures. "Idle" as measured by the operating system cannot be literally translated into available CPU resources, which turns capacity planning into a more complex problem.
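To make the accounting concrete, here is a minimal sketch of how these percentages are typically derived: the kernel keeps cumulative per-category tick counters, and tools report the deltas over a sampling interval. The /proc/stat interface below is Linux-specific and used purely for illustration; Solaris tools such as vmstat and mpstat report the same categories from their own kernel statistics.

```python
# Minimal sketch: derive the classic user/system/idle/io-wait percentages
# from the aggregate counters in /proc/stat (Linux-specific; illustrative
# only -- Solaris tools expose the same categories via kstat).
import time

def read_cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()  # first line: aggregate "cpu" counters
    user, nice, system, idle, iowait = (int(v) for v in fields[1:6])
    return user + nice, system, idle, iowait

def cpu_percentages(interval=1.0):
    t0 = read_cpu_times()
    time.sleep(interval)
    t1 = read_cpu_times()
    deltas = [b - a for a, b in zip(t0, t1)]
    total = sum(deltas) or 1
    labels = ("user", "system", "idle", "io-wait")
    return {k: 100.0 * d / total for k, d in zip(labels, deltas)}

print(cpu_percentages())
```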


Back in the days when 1 processor contained 1 core capable of running 1 thread, the CPU utilization reported by the operating system indicated the actual resource consumption (and resource availability) of the processor. In such environments, CPU utilization grows linearly with increased workload.
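This linear behaviour is just the classic utilization law from queueing theory, U = X × S (utilization equals throughput times CPU service demand per unit of work). A small illustration, with an assumed service demand of 5 ms of CPU time per transaction:

```python
# Illustrative model (assumed numbers, not from the article): on a
# 1-core/1-thread CPU the utilization law U = X * S holds, where X is
# throughput (transactions/s) and S is CPU service demand per transaction.
service_demand = 0.005  # assumed: 5 ms of CPU time per transaction

for throughput in (50, 100, 150, 200):          # transactions per second
    utilization = throughput * service_demand   # grows linearly with load
    print(f"{throughput:4d} tx/s -> {utilization:5.0%} CPU utilization")
```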

Multi-core CPUs:  1 processor = 2 or more cores
In multi-core CPUs, where 1 processor contains 2 or more cores, each processing core has its own arithmetic and logic unit, floating point unit, set of registers, and pipeline, as well as some amount of cache. However, multi-core CPUs also share some resources between the cores (e.g. the L3 cache and the memory controller).

Simultaneous multi-threading CPUs/cores:  1 processor or core = 2 or more threads  (aka "Hyper-Threading", "Chip Multi-threading")
The hardware components of one physical core are shared between several threads. Each thread has at least its own set of registers; most resources of the core (arithmetic and logic unit, floating point unit, cache) are shared between the threads. Naturally, those threads compete for processing resources and stall if the units they need are already busy.
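Operating systems expose this topology, so the distinction between physical cores and hardware threads is easy to inspect. A small sketch, using the Linux sysfs layout purely as an example (on Solaris, psrinfo -vp prints a similar breakdown):

```python
# Sketch: distinguish physical cores from hardware threads by reading the
# Linux sysfs topology files (paths are Linux-specific and assumed present;
# this is an illustration, not the tooling used for the article).
from pathlib import Path

def thread_siblings(cpu):
    p = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    return p.read_text().strip()  # e.g. "0,4" -> logical CPUs sharing one core

logical = sorted(int(p.name[3:])
                 for p in Path("/sys/devices/system/cpu").glob("cpu[0-9]*"))
cores = {thread_siblings(c) for c in logical}  # one entry per physical core

print(f"{len(logical)} logical CPUs (hardware threads) "
      f"on {len(cores)} physical cores")
```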

What are the benefits of resource sharing?
Resource sharing can increase overall throughput and efficiency by keeping the processing units of a core busy. For instance, hyper-threading can reduce or hide stalls on memory access (cache misses): instead of wasting many cycles while data is fetched from main memory, the core suspends the stalled thread and resumes the next runnable thread, which continues execution.
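A toy model with assumed cycle counts illustrates the effect: if a thread computes for 100 cycles and then stalls for 300 cycles on a memory fetch, a second hardware thread can fill those stall cycles and double the core's useful work per unit of time.

```python
# Toy model (assumed numbers, for illustration only): a thread computes for
# some cycles, then stalls on a cache miss. A second hardware thread can
# run during those stall cycles, raising the core's useful work per cycle.
compute_cycles = 100   # assumed: cycles of real work per burst
stall_cycles = 300     # assumed: cycles waiting on a main-memory fetch

# One thread alone: the core sits idle during every stall.
single = compute_cycles / (compute_cycles + stall_cycles)

# Two SMT threads: while one waits on memory the other computes, so up to
# 2 * compute_cycles of work fit into the same stall-dominated window.
dual = min(1.0, 2 * compute_cycles / (compute_cycles + stall_cycles))

print(f"core busy, 1 thread:  {single:.0%}")   # 25%
print(f"core busy, 2 threads: {dual:.0%}")     # 50% -> throughput doubles
```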

What are the disadvantages?

  • CPU time accounting measurements (sys/usr/idle) as reported by standard tools do not reflect the side-effects of resource sharing between hardware threads
  • It is therefore impossible to translate measured idle time into the computing resources still available

Idle does not indicate how much more work can be accomplished by the CPU

Assume 1 CPU core has 4 hardware threads, and that 2 (single-threaded) processes are currently scheduled to run on this core. If these 2 processes already saturate all available shared compute resources (ALU, FPU, cache, memory bandwidth, etc.) of the core, commonly used performance tools will still report (at least) 50% idle, since 2 logical processors (hardware threads) appear completely idle.
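The arithmetic behind the reported numbers is straightforward, which is exactly the problem: the tools count idle hardware threads, not idle execution units.

```python
# Worked example from the scenario above: 4 hardware threads per core, 2 of
# them running processes that already saturate the shared execution units.
hw_threads = 4
busy_threads = 2

reported_idle = 100 * (hw_threads - busy_threads) / hw_threads
print(f"idle reported by the OS: {reported_idle:.0f}%")  # 50%

# Actual headroom depends on the shared units, not on thread occupancy.
# If the 2 processes already consume the core's full memory bandwidth
# (as assumed in this scenario), adding work to the 2 "idle" hardware
# threads gains almost nothing.
actual_headroom = 0  # approximately, in this saturated scenario
print(f"real additional capacity: ~{actual_headroom}%")
```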

In order to correctly estimate how much work can be added before the system approaches full saturation, the operating system would need detailed utilization information for all shared core processing units (ALU, FPU, cache, memory bandwidth, etc.), as well as knowledge of the characteristics of the workload to be added (!).
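In practice, hardware performance counters are the closest available approximation. As a hedged sketch (not the methodology used for the measurements below), one could sample instructions per cycle (IPC) with Linux perf in CSV mode as a rough proxy for how saturated the execution units are; Solaris exposes similar PIC counters through cpustat. The saturation interpretation of any particular IPC value remains workload-dependent.

```python
# Hedged sketch: sample system-wide IPC via "perf stat" in CSV mode (-x,).
# Linux-specific and illustrative only; requires permission to use -a.
import subprocess

def measure_ipc(seconds=1):
    out = subprocess.run(
        ["perf", "stat", "-x,", "-a", "-e", "cycles,instructions",
         "sleep", str(seconds)],
        capture_output=True, text=True, check=True,
    ).stderr                      # perf stat writes its counts to stderr
    counts = {}
    for line in out.splitlines():
        fields = line.split(",")  # CSV: count, unit, event name, ...
        if len(fields) > 2 and fields[0].isdigit():
            counts[fields[2].split(":")[0]] = int(fields[0])
    return counts["instructions"] / counts["cycles"]

print(f"system-wide IPC: {measure_ipc():.2f}")
```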

Measurements with SAP ABAP workload

To illustrate our case, let's look at a very specific but very common workload in Enterprise Computing: SAP-SD ABAP. We took these measurements on a SPARC T5 system running the latest Solaris 11 release. Simulated benchmark users logged on to the SAP system and entered SD transactions. The maximum number of SD-Users and the maximum SAP transaction throughput the system could handle are represented by the 100% mark on the X-Axis. A series of test runs was carried out to measure the CPU utilization (Y-Axis) reported by the operating system at 0%, 12.5%, 25%, 50%, 60%, 75%, 90% and 100% of the maximum number of SD-Users.