In this blog post, Oracle Linux performance engineers Richard Smith and Jesse Gordon describe modifications made to three Phoronix Test Suite workloads that reduce run-to-run variability, making the results consistent enough that smaller changes in performance can be detected.
The Phoronix Test Suite (PTS) is a comprehensive testing and benchmarking platform that is widely used to help assess the performance of Linux systems. We have been using PTS to understand and track the performance of Oracle Linux (OL) over time, particularly while running on cloud instances, such as those provided by Oracle Cloud Infrastructure (OCI). We also measure PTS performance on cloud instances running other Linux distributions, such as CentOS and Ubuntu.
A key requirement of using workloads to track performance is result repeatability. If run-to-run variability is too high, then the workload results cannot be used to accurately determine changes in performance. While using PTS, we started with the default installation and execution (parameter) settings; however, we noticed that, in several cases, results were highly variable. Detailed analysis helped us identify the reasons for this behavior, showing us the need to modify how we run certain workloads so that more consistent results are achieved, allowing us to detect smaller changes as statistically significant.
In this blog, we explain the changes we have made in how we run three PTS workloads: Redis, Himeno and RAMspeed.
The PTS Redis workload consists of two processes:
a Redis server process
a Redis client process which issues a series of requests to the server
Five different operations are modeled (set, get, lpush, lpop and sadd).
Early on, we noticed that the Redis performance results exhibited high run-to-run variability. Our investigations focused on two issues:
variability due to non-uniform memory access (NUMA) effects
variability due to insufficient run duration
As there are two processes that run on the same system, the scheduler will determine on which CPUs they execute. Analysis showed this can lead to significant differences in performance. Once we bound the server process to a fixed CPU, we experimented by binding the client process in various ways:
when the Redis client process was bound to the same CPU as the Redis server process, performance was poor
when the Redis client process was bound to a CPU in the same NUMA node as the Redis server process, best performance was observed
when the Redis client process was bound to a CPU in a different NUMA node from the Redis server process, lower performance resulted
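The binding experiments above can be sketched with taskset. This is illustrative only: the CPU numbers are placeholders that depend on the machine's topology (check with "lscpu" or "numactl -H"), and it assumes redis-server and redis-benchmark are on the PATH.

```shell
# Pin the Redis server to CPU 0 (placeholder CPU number).
taskset -c 0 redis-server --port 6379 &

# Case 1: client on the same CPU as the server -> poor performance
taskset -c 0 redis-benchmark -p 6379 -t get -n 1000000

# Case 2: client on a different CPU in the same NUMA node -> best performance
taskset -c 1 redis-benchmark -p 6379 -t get -n 1000000

# Case 3: client on a CPU in a different NUMA node -> lower performance
# (CPU 32 stands in for a remote-node CPU on this hypothetical topology)
taskset -c 32 redis-benchmark -p 6379 -t get -n 1000000
```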
Our findings are confirmed on the Redis site, on this page: https://redis.io/topics/benchmarks under the section labeled "Factors impacting Redis performance":
On multi CPU sockets servers, Redis performance becomes dependent on the NUMA configuration and process location. The most visible effect is that Redis benchmark results seem non-deterministic because client and server processes are distributed randomly on the cores.
To minimize the NUMA impact, we use numactl to ensure that the two processes always run on the same NUMA node (via "numactl --cpunodebind 0", as we are guaranteed to always have at least one NUMA node).
Regarding run duration, the PTS default is to run for 1 million iterations. This leads to a short total run duration, often less than 1 second, so Redis results are more sensitive to (level one data) cache misses. To understand this impact, we conducted a series of experiments running the same Redis sub-test with 100K, 1M, 10M and 100M iterations and measured the standard deviation of the results. Standard deviation decreased with increasing iteration count, and was nearly 10x lower at 10 million iterations than at 1 million. We therefore increased the number of iterations to 10 million for our experiment runs, which increased total run duration to 5 or more seconds and reduced the impact of cache misses.
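Putting both fixes together, a minimal sketch of the shape of invocation we converged on (the exact command PTS generates differs; redis-benchmark's -n flag sets the iteration count, and the port number is a placeholder):

```shell
# Keep both processes on NUMA node 0, and run 10 million iterations
# per sub-test instead of the 1 million default.
numactl --cpunodebind 0 redis-server --port 6379 &
numactl --cpunodebind 0 redis-benchmark -p 6379 \
    -t set,get,lpush,lpop,sadd -n 10000000
```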
Himeno evaluates the performance of incompressible fluid analysis code. This benchmark program solves the Poisson equation using the Jacobi iteration method. The PTS Himeno workload is used to help evaluate processor (CPU) performance.
Early experiment results showed a significant difference in reported Himeno throughput (MFLOPS) on OCI AMD cloud instances running Oracle Linux version 7 (OL7) and Ubuntu. We delved deeper when we saw a large improvement (over 8x) in Ubuntu Himeno results for most of September 2019 – we wanted to see if there were changes we could apply to also improve OL7 Himeno results. Inspection of the workload implementation showed the key code segment was a triply nested loop consisting of matrix operations involving a set of three-dimensional arrays. We saw that the performance difference (as measured by memory latency) between Ubuntu and OL7 grew larger as the data set increased in size (i.e., Ubuntu memory latency grew faster than OL7's). This led us to check on the page size(s) being used, and then at the default transparent huge pages (THP) settings for OL7 and Ubuntu, which turned out to differ:
on OL7 (and OL8), THP is enabled with the "always" setting
on Ubuntu, THP is enabled with the "madvise" setting
Subsequent experiments showed that similar performance was obtained when the same THP settings were used for the two operating systems.
To render Himeno results more predictable, we chose to enable THP across all operating system images. Because the operating system is then less busy handling TLB misses and page faults (which exhibit nondeterministic performance), this also reduced run-to-run variance (from as much as 10x to a few percent).
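THP is controlled through a standard sysfs file on Linux, with the active mode shown in brackets. A minimal sketch for checking the current mode (with a fallback for systems where the file is absent); setting it requires root, so that step is shown only as a comment:

```shell
THP_FILE=/sys/kernel/mm/transparent_hugepage/enabled

# Report the current THP mode; the active setting is the one in
# brackets, e.g. "[always] madvise never".
if [ -r "$THP_FILE" ]; then
    mode=$(cat "$THP_FILE")
else
    mode="unavailable (no THP sysfs entry on this system)"
fi
echo "THP mode: $mode"

# To match our setup, enable THP system-wide (as root):
#   echo always > /sys/kernel/mm/transparent_hugepage/enabled
```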
RAMspeed is an open source utility used to measure the cache and memory performance of computer systems. The PTS RAMspeed workload consists of five sub-tests, each of which can be executed in either integer or floating point mode.
Early experiment results showed OL7 results on RAMspeed exceeded those obtained using Ubuntu. An initial check of the two system configurations showed a difference in GCC levels: OL7 was using version 4.8.5 while Ubuntu was using version 7.3.0. Analysis eventually focused on the code emitted by the two compiler versions for the key central loop of the benchmark implementation. There we observed the key difference:
GCC4 utilized 128-bit AVX (Intel Advanced Vector eXtensions) instructions, while
GCC7 utilized 512-bit AVX instructions
Why does this make a difference? Although the 512-bit instructions are more compact, they also consume more power. As more power is consumed, power-saving measures kick in and the CPU frequency is lowered. Lower frequency means lower performance results. This approach, known as dynamic frequency scaling, is used to stop a processor from overheating by limiting the amount of power consumed under heavy load (thermal management). Further experiments showed that, when the code was built with the same compiler level, similar performance was achieved on both OL7 and Ubuntu.
As a result, we chose to standardize the GCC compiler release used for all operating system images, initially using GCC7, and later moved to GCC8 once OL8 was released. By doing so, we can focus on differences in Linux implementation, rather than differences in compiler implementation. We note that GCC8 emits 256-bit vector processing instructions for the key RAMspeed inner loop.
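One quick way to see which vector width a compiler chooses is to disassemble a small vectorizable loop and count the vector register names in the output (xmm = 128-bit, ymm = 256-bit, zmm = 512-bit). A sketch, assuming gcc and objdump on an x86-64 host; the file name and flags are illustrative, not the actual RAMspeed build:

```shell
# A RAMspeed-like copy loop the compiler can vectorize.
cat > vecloop.c <<'EOF'
#define N (1 << 20)
static double a[N], b[N];
void copy(void) {
    for (int i = 0; i < N; i++)
        a[i] = b[i];
}
EOF

# Compile with optimization; adding -mavx2 or -mavx512f (on CPUs that
# support them) shifts the generated code toward ymm or zmm registers.
gcc -O3 -c vecloop.c -o vecloop.o

# Count vector registers of each width in the generated code.
objdump -d vecloop.o | grep -o '[xyz]mm' | sort | uniq -c
```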
We have described three PTS workloads that have caused us to modify how we run PTS:
For Redis, we increase the number of iterations per experiment as well as use numactl to ensure that the server and client processes run on the same NUMA node;
these changes reduce performance variation due to too-short run times and unpredictable remote memory accesses.
For Himeno, we ensure that THP is enabled (set to "always") for all operating system images;
this change reduces performance variation by minimizing operating system overhead due to handling page faults.
For RAMspeed, we ensure that all operating system images use the same GCC compiler release;
this change reduces performance variation by ensuring identical code is generated on all platforms.
P.S. The analysis in this blog was based on PTS redis profile version 1.1.0 and PTS version 9.6.0. PTS redis profile version 1.3.1 has been updated to default to 10 million iterations and PTS version 10.2.1 has added support for recording and displaying the Transparent Huge Pages value as part of the system table.