Tune it Up: Improving Redis Performance for Ampere A1 on Oracle Linux in OCI

June 15, 2022 | 5 minute read

Introduction

In this blog, we investigate how to tune Oracle Linux (OL) to maximize Redis throughput on Ampere's A1 [1] Arm-based two-socket systems in Oracle Cloud Infrastructure (OCI).

Recommendations

Based on our testing of Redis on A1, we recommend:

  • Disabling wakeup preemption by selecting the ‘throughput-performance’ tuned profile.
  • Enabling Transparent Huge Pages (THP) and evaluating application performance, despite the Redis warning that recommends disabling THP.

In the rest of the blog, we provide reasons for these recommendations.

Background

Redis [2] (REmote DIctionary Server) is an in-memory open-source data structure store used as a database, cache and message broker. Redis has two components:

  • A server process
  • One or more client processes which issue a series of requests to the server
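
As a minimal illustration of this split (assuming a default local Redis installation; the port shown is Redis's default), the server and a client can be exercised as follows:

redis-server --port 6379 --daemonize yes   # start the server process in the background
redis-cli -p 6379 ping                     # a client issues a request; the server replies PONG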

Experimental Setup

Using the open-source memtier [3] Redis benchmark and its included scripts, we set up and ran the tests on an 8-core VM and a 160-core bare-metal A1 system, using throughput (operations/second) as the performance metric. We ran a single Redis instance with these parameters:

--test-time 300 --pipeline=100 --ratio 1:10 --clients=25 --run-count=1
--data-size-range=10240-1048576
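
For reference, the full benchmark invocation looks roughly like the following sketch; the server address and port are illustrative assumptions, while the remaining options are the ones listed above:

memtier_benchmark --server=127.0.0.1 --port=6379 \
    --test-time 300 --pipeline=100 --ratio 1:10 --clients=25 \
    --run-count=1 --data-size-range=10240-1048576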

We should note that, for better resource utilization on the 8-core VM and the 160-core bare-metal system, we also ran with 4 and 80 Redis instances, respectively, and observed similar results.
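
A hedged sketch of how such a multi-instance run can be started (the instance count and port numbers are illustrative; each memtier_benchmark client is then pointed at one of these ports):

# start 4 Redis instances on consecutive ports, e.g., for the 8-core VM
for i in $(seq 0 3); do
    redis-server --port $((6379 + i)) --daemonize yes
done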

Understanding the Impact of Kernel Scheduling Parameters

In other, unrelated work, we observed that some workloads are affected by two kernel scheduling parameters: sched_latency_ns and sched_wakeup_granularity_ns. Our experiments showed that Redis is also affected by these parameters. The kernel documentation [4] describes them as follows:

  • sched_latency_ns is the targeted preemption latency for CPU-bound tasks. Increasing this variable increases a CPU-bound task’s timeslice. Default is 24,000,000.
  • sched_wakeup_granularity_ns gives the preemption granularity when tasks wake up. Increasing this variable reduces wakeup preemption, reducing disturbance of compute-bound tasks. Lowering it improves wakeup latency and throughput for latency-critical tasks, particularly when a short duty-cycle load component must compete with CPU-bound components. Setting this larger than half of sched_latency_ns will disable wakeup preemption. Default is 4,000,000.
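
The current values of both parameters can be read with sysctl, for example:

sysctl kernel.sched_latency_ns kernel.sched_wakeup_granularity_ns   # with the default profiles on an 8-OCPU or larger A1 shape, this reports 24000000 and 4000000 (see Table 1)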

Since the default sched_wakeup_granularity_ns is less than half of sched_latency_ns, wakeup preemption is enabled by default on A1 instances in OCI. In order to see the impact of disabling wakeup preemption, we gradually increased the value of sched_wakeup_granularity_ns until it became larger than half of sched_latency_ns. The results are shown in the following graph.
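
A hedged sketch of this sweep; the specific step values are illustrative, and each iteration reruns the memtier workload described above:

for gran in 4000000 6000000 8000000 10000000 12000000 14000000; do
    sudo sysctl -w kernel.sched_wakeup_granularity_ns=$gran
    # rerun the memtier_benchmark workload here and record the reported operations/second
done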

Figure 1: Performance impact of increasing sched_wakeup_granularity_ns

From the graph, two things stand out:

  • Throughput generally improves as sched_wakeup_granularity_ns increases.
  • There is a sudden jump in throughput when sched_wakeup_granularity_ns reaches 12,000,000, i.e., half of sched_latency_ns, at which point wakeup preemption is disabled.

These observations suggest disabling wakeup preemption by setting sched_wakeup_granularity_ns to a very large value. However, setting the parameter to an extremely large value is not advisable, as it may have an undesirable impact on other applications running on the system. Instead of setting these parameters directly, it is simpler to choose a tuned profile that enables or disables wakeup preemption. To this end, we found that the default tuned profiles (oci-rps-xps oci-busy-polling oci-cpu-power oci-nic) enable preemption (Table 1), whereas the throughput-performance tuned profile disables it (Table 2). Table 1 shows the default values of the two parameters for different VM sizes. Although the individual values of sched_latency_ns and sched_wakeup_granularity_ns change with the number of Oracle CPUs (OCPUs), the ratio between them remains the same (6:1).


OCPUs                        | 8 or more  | 4          | 2          | 1
sched_latency_ns             | 24,000,000 | 18,000,000 | 12,000,000 | 6,000,000
sched_wakeup_granularity_ns  | 4,000,000  | 3,000,000  | 2,000,000  | 1,000,000

Table 1: Scheduler parameter values with the default tuned profiles (oci-rps-xps oci-busy-polling oci-cpu-power oci-nic) for OL on A1


sched_latency_ns             | 24,000,000
sched_wakeup_granularity_ns  | 15,000,000

Table 2: Scheduler parameter values with the throughput-performance tuned profile


The individual parameters and tuned profiles can be set using the following commands:

sudo sysctl -w kernel.sched_wakeup_granularity_ns=$x

sudo tuned-adm profile throughput-performance

sudo tuned-adm profile oci-rps-xps oci-busy-polling oci-cpu-power oci-nic
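
After switching profiles, the active profile and the resulting parameter values can be verified, for example:

tuned-adm active                                                    # show the currently active tuned profile
sysctl kernel.sched_latency_ns kernel.sched_wakeup_granularity_ns   # with throughput-performance, expect the Table 2 values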

So, why does disabling preemption help Redis? In the default case, with wakeup preemption enabled, a client process that wakes up and is ready to be scheduled can kick the server process off a CPU before the server finishes its current work. With preemption disabled, the client is not scheduled on that CPU until the server process is done with its current task. Thus, by disabling preemption, we reduce the number of context switches and the cost associated with them. The following graphs show the reduction in the number of context switches (and the improvement in throughput) when preemption is disabled. The data in these graphs was collected with perf stat [5]. To identify the root cause of the performance difference, we also looked at the perf profiles of kernel instructions. In the lower-performance case, there are calls to 'swapper', resulting in idle cycles.
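
A system-wide context-switch count can be taken with perf stat while the benchmark is running, along the lines of the following sketch (the 60-second window is an illustrative choice):

sudo perf stat -e context-switches -a sleep 60   # count context switches system-wide for 60 seconds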

Figure 2: Disabling preemption decreases context switches, resulting in an increase in the throughput

Impact of THP

When THP is enabled on our system, Redis outputs the following warning about its negative impact:

WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').

However, in our experiments, we have not observed any negative performance impact from enabling THP. THP helps workloads with large working sets run more efficiently by reducing the number of pages required to map their memory. Therefore, we recommend enabling THP to reduce the cost associated with servicing page faults by using large memory pages.
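
The current THP setting can be checked and changed through sysfs; the following is a minimal sketch (the change does not persist across reboots):

cat /sys/kernel/mm/transparent_hugepage/enabled                      # the active setting is shown in brackets, e.g., [always] madvise never
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled   # enable THP system-wide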

Figure 3: Enabling THP does not impact performance

Summary

In this blog, we provided tuning recommendations for improving Redis performance on Ampere's A1 in OCI. Specifically, we showed that by disabling wakeup preemption through the throughput-performance tuned profile and by minimizing page faults through Transparent Huge Pages (THP), performance was 1.6x better than with the default settings.

References

  1. Ampere A1 Compute
  2. Redis
  3. memtier_benchmark: A High-Throughput Benchmarking Tool for Redis & Memcached
  4. SUSE Documentation
  5. perf-stat(1) — Linux manual page

Muhammad Shoaib Bin Altaf

