Tuning glibc malloc on ARM: A Case Study

April 25, 2023 | 6 minute read

Introduction

Oracle routinely runs performance tests of key components in Oracle Linux to detect significant changes in performance. One such test, t-test1 from the Phoronix Test Suite (PTS), was observed to generate excessive page faults and consume substantial CPU time in system mode when running on Oracle Linux 9 (OL9) on an Oracle Cloud Infrastructure (OCI) A1 VM. This was triggered by the default behaviour of glibc malloc. By changing that behaviour via glibc malloc tunables, the excessive page faults can be eliminated, resulting in much better performance. In addition to providing a concrete example where non-default glibc malloc tunables help, this blog also discusses the tools and techniques used to analyze the issue.

 

Initial Results

The first step was to reproduce the results generated by the Phoronix test harness. This was done by running the same workload multiple times under /usr/bin/time to check how variable the results were and to obtain some basic execution statistics. By default, PTS runs the t-test1 workload with concurrency of 1 and 2 threads. Although not shown, each test case was run many times to verify the results were consistent.
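As a concrete example, a single run can be timed directly. The binary name and location below are assumptions (PTS builds and stages the benchmark itself), and the arguments correspond to the defaults described in the next section:

$ # 1 thread active; /usr/bin/time -v reports minor page faults and
$ # maximum resident set size among its statistics
$ /usr/bin/time -v ./t-test1 4000 1
$ # 2 threads active
$ /usr/bin/time -v ./t-test1 4000 2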

threads | user (s) | system (s) | elapsed (s) | minor page faults | maxresident (KB)
1       | 8.85     | 12.08      | 21.08       | 12,304,107        | 13,996
2       | 6.88     | 7.07       | 14.07       | 6,846,809         | 16,700

Table 1: Baseline running t-test1 on A1 with OL9 for 1 and 2 threads concurrency

With a small working set, as measured by maxresident, the number of page faults was excessive: far more than the roughly 5,000 pages needed to map 20MB when using 4KB pages.

 

T-test1 microbenchmark

The benchmark itself takes up to 5 positional parameters. The following is based on inspecting the source code.

position | parameter   | default         | description
1        | n_total_max | 4000            | Total number of threads to execute
2        | n_thr       | 2               | Number of concurrent threads to execute
3        | i_max       | 10000           | Number of actions each thread is to perform (an action is either an allocation or a free)
4        | size        | 10000           | The maximum size of objects to allocate; the actual size is uniformly distributed over the range 1..size
5        | bins        | 6710 (computed) | Maximum number of concurrent objects a thread can allocate, such that the total memory consumed by objects is no more than MEMORY (64MB by default)

Table 2: Positional parameters for t-test1 benchmark

At a very high level there is a queue of n_total_max threads to be executed, with n_thr threads concurrently active. Each thread allocates a data structure to hold up to bins objects and initially populates it. It then proceeds to allocate and free objects repeatedly, where the size of each object is randomly chosen in the range 1..size, until i_max actions have been performed. Crucially, before terminating, a thread cleans up by deleting all remaining objects it has allocated. This means that all the virtual memory a thread allocates is freed before the thread terminates.
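For instance, the 2-thread case can be written out with the first four positional parameters given explicitly; the binary path is illustrative and the values are the defaults from Table 2:

$ # 4000 threads in total, 2 concurrent, 10000 actions per thread,
$ # object sizes uniformly distributed over 1..10000
$ ./t-test1 4000 2 10000 10000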

 

Base page size

The ARM architecture supports multiple base page sizes, not just 4K and 64K. Even larger page sizes, commonly known as hugepages, are useful for mapping large amounts of memory efficiently. However, as with popular processors from Intel and AMD, OL9 uses a 4K base page size for the ARM processors in Oracle’s A1 systems.

$ getconf PAGESIZE
4096
$ sudo grep -i pagesize /proc/1/smaps | head -2
KernelPageSize: 4 kB
MMUPageSize: 4 kB

Listing 1: OL9 base page size

 

Disabling Trimming

The default behaviour of glibc malloc is to return most physical memory to the kernel when the virtual memory that it backs is freed. As documented in mallopt(3), this behaviour can be changed programmatically via the M_TRIM_THRESHOLD parameter, or via the MALLOC_TRIM_THRESHOLD_ environment variable. The default threshold is 128*1024 bytes (128KB). With that default, t-test1 was repeatedly allocating and freeing most of its physical memory, incurring the substantial CPU cost of servicing so many page faults. By using a much larger threshold, say 32MB, or disabling trimming altogether with the special value -1, the number of page faults could be slashed to only a few thousand.
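For example, the threshold can be changed from the environment without modifying the program (the binary path is illustrative):

$ # disable trimming entirely
$ MALLOC_TRIM_THRESHOLD_=-1 /usr/bin/time -v ./t-test1 4000 1
$ # or raise the threshold to 32MB
$ MALLOC_TRIM_THRESHOLD_=$((32*1024*1024)) /usr/bin/time -v ./t-test1 4000 1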

threads | elapsed, default (s) | elapsed, threshold -1 (s) | faults, default | faults, threshold -1
1       | 21.08                | 8.23                      | 12,304,107      | 4,766
2       | 14.07                | 6.29                      | 6,846,809       | 5,029

Table 3: Effect of using non-default MALLOC_TRIM_THRESHOLD_=-1

 

Memory Padding

Modern Linux kernels, such as the UEK kernel in OL9, include a feature known as Transparent Huge Pages (THP). With THP, the kernel attempts to use much larger pages (commonly a hugepage size of 2MB) for data when certain criteria are met. In principle, using hugepages should further reduce the number of page faults at allocation time. However, inspecting the /proc/pid/smaps_rollup file for t-test1 showed that AnonHugePages were not being used.
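Whether THP is enabled system-wide, and whether a running process is actually backed by hugepages, can be checked as follows (the pgrep pattern is illustrative):

$ cat /sys/kernel/mm/transparent_hugepage/enabled
$ # a non-zero AnonHugePages value means anonymous memory is backed by hugepages
$ grep AnonHugePages /proc/$(pgrep -n t-test1)/smaps_rollup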

Tracing mprotect(2) syscalls showed that although virtual memory was being allocated in large chunks, much of it was initially given the protection flag PROT_NONE, meaning it could be neither read nor written. As the heap grew, glibc was changing the protection bits to PROT_READ|PROT_WRITE, but only a small amount at a time, and this prevented hugepages from being used. Another mallopt(3) parameter, M_TOP_PAD (environment variable MALLOC_TOP_PAD_), tells glibc malloc to make an additional amount of virtual memory readable/writable, above the current request, when the mprotect syscall is issued. Setting MALLOC_TOP_PAD_ to something quite large, in the range of 4MB to 32MB, resulted in hugepages being used. A side effect, though, was to increase the working set (Rss) of the program, so this is a tradeoff between performance and real memory consumption.
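A sketch of the corresponding commands; the binary path is illustrative, and the 20MB padding matches the best-performing value in Table 4:

$ # observe the heap being made readable/writable in small increments
$ strace -f -e trace=mprotect ./t-test1 4000 2 2>&1 | head
$ # pad each heap extension by 20MB so the region becomes hugepage-friendly
$ MALLOC_TOP_PAD_=$((20*1024*1024)) /usr/bin/time -v ./t-test1 4000 2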


PAD (MB) | elapsed (s) | Rss (KB) | AnonHugePages (KB)
4        | 20.52       | 14,788   | 4,096
12       | 13.48       | 17,048   | 12,288
20       | 9.09        | 22,012   | 20,480
28       | 9.03        | 21,864   | 20,480

Table 4: Impact of MALLOC_TOP_PAD_ on performance.

 

Alternative malloc implementations

While glibc malloc is the default malloc implementation in Oracle Linux, other implementations are available, and they make different tradeoffs. The table below compares tcmalloc (from the gperftools-libs package) and jemalloc (from the jemalloc package) against glibc malloc when running t-test1, all with default values of their tunables. Even with tuning of glibc malloc, tcmalloc and jemalloc were faster for t-test1, albeit at the cost of increased real memory usage.
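A low-effort way to try an alternative allocator is LD_PRELOAD. The package names come from the text above, but the library paths shown are assumptions and may vary between versions:

$ sudo dnf install -y gperftools-libs jemalloc
$ LD_PRELOAD=/usr/lib64/libtcmalloc.so.4 /usr/bin/time -v ./t-test1 4000 2
$ LD_PRELOAD=/usr/lib64/libjemalloc.so.2 /usr/bin/time -v ./t-test1 4000 2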

malloc   | Rss (KB) | AnonHugePages (KB) | 1 thread (s) | 2 threads (s)
glibc    | 13,564   | 0                  | 21.83        | 14.81
tcmalloc | 37,092   | 4,096              | 5.78         | 4.21
jemalloc | 37,980   | 24,576             | 5.10         | 3.93

Table 5: Alternative malloc implementation characteristics and performance.

 

Conclusion

T-test1 represents a class of applications that repeatedly allocate and free large amounts of memory, where the default behaviour of glibc malloc may yield poor performance due to an excessive number of page faults. An application that allocates large numbers of small objects may also suffer from an inefficient mapping of its working set through the use of small pages. Both problems can be ameliorated by carefully specifying non-default values for glibc malloc tunables. In some cases even better performance can be obtained by using an alternative malloc implementation such as jemalloc or tcmalloc. A benefit of tuning glibc malloc to use hugepages is that performance becomes relatively independent of the kernel's choice of base page size. Although this study focused on OL9 and ARM, the same behaviour can be observed with processors from Intel and AMD.

 

Richard Smith

