Oracle Ampere A1 Compute tuning for advanced users

Guest Author
Karsten Guthridge and Charlie Hunt delve into some advanced techniques related to tuning Oracle Ampere A1 Compute instances on Oracle Cloud Infrastructure.

 

 

With CPU-bound benchmarks and applications such as SPEC CPU 2017, it's relatively straightforward to get optimal performance on an Oracle Cloud Infrastructure Ampere A1 Compute instance. But, for applications that have many interrupts or that share memory across non-uniform memory access (NUMA) nodes, it takes some effort to get the best performance. On NUMA systems, a key aspect to control is remote memory accesses. Ensuring that work is done where an application's memory is located reduces expensive remote memory accesses and results in the best and most predictable performance on large-scale applications.

Oracle Ampere A1 Compute VM versus bare metal

Many customers prefer to run their cloud applications in a Virtual Machine (VM). If a VM is suitable for your application, using a VM rather than a bare metal shape is easier because all the NUMA tuning is done for you. A1.Flex shapes cost the same per OCPU as bare metal shapes yet can be as small as 1 CPU or as large as 80 CPUs and 512 GB.

Oracle's Altra-based BM.Standard.A1.160 compute shape features 160 Arm CPUs split between two NUMA nodes. Node 0 contains CPUs 0-79 and node 1 contains CPUs 80-159. As on all NUMA systems, it takes longer to access memory in the node that's remote to a given CPU. For example, A1's local memory latency is roughly four times faster than the remote memory latency.
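To confirm the CPU-to-node layout on your own instance, you can read it from sysfs. The following is a minimal sketch, assuming the standard Linux /sys/devices/system/node layout; the function name list_numa_cpus is ours and purely illustrative.

```shell
# Print "node<N>: <cpulist>" for every NUMA node found under a sysfs-style tree.
# The optional first argument overrides the default sysfs location, which is
# handy for trying the function on a machine without NUMA directories.
list_numa_cpus() {
    node_root="${1:-/sys/devices/system/node}"
    for node_dir in "$node_root"/node[0-9]*; do
        # Skip the unexpanded glob when no node directories exist.
        [ -r "$node_dir/cpulist" ] || continue
        printf '%s: %s\n' "$(basename "$node_dir")" "$(cat "$node_dir/cpulist")"
    done
}

list_numa_cpus
```

On the bare metal shape described above, you would expect one line for node0 covering CPUs 0-79 and one for node1 covering CPUs 80-159.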

This post shows you how to configure your application to access memory with the fastest local latency. We'll share some tuning tips for MySQL and Java, but the concepts are applicable to any memory-intensive application.

Tuning tips for large applications

Use the following configurations and procedures to get the best memory performance for any application with a large code or data footprint that's running on a NUMA system.

Ensure local memory accesses

If your code has a large footprint (for example, databases often have binaries larger than 100 MB) then performance can suffer significantly if the application stalls while waiting for instructions to be fetched from the remote NUMA node.

Clear the filesystem cache to ensure that a previous invocation of your application didn't leave the binary cached in the remote node. Doing this lets the binary be re-cached in the local node at the next invocation.

# echo 3 > /proc/sys/vm/drop_caches

Then, start your application using numactl to control the CPUs and memory node that your application uses. Here's an example that shows how to run an application in node 0 on an A1 bare metal instance:

# numactl -C 0-79 --localalloc <command>

The text (and data) for <command> is now in the filesystem cache in node 0.
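The two steps above can be combined into a small wrapper. This is a sketch, not a definitive script: the function names node_cpu_range and run_on_node are hypothetical, it assumes CPUs are numbered contiguously per node (80 per node, as on the A1 bare metal shape), and it adds a sync before dropping caches because only clean pages can be dropped.

```shell
# Compute the CPU range for a NUMA node, assuming contiguous CPU numbering
# and (by default) 80 CPUs per node.
node_cpu_range() {
    node="$1"
    cpus_per_node="${2:-80}"
    first=$((node * cpus_per_node))
    last=$((first + cpus_per_node - 1))
    printf '%s-%s\n' "$first" "$last"
}

# Drop the page cache, then start a command pinned to one node.
# Must run as root because writing to drop_caches requires it.
run_on_node() {
    node="$1"; shift
    sync                                # flush dirty pages so more of the cache can be dropped
    echo 3 > /proc/sys/vm/drop_caches
    numactl -C "$(node_cpu_range "$node")" --localalloc "$@"
}

# Example:
#   run_on_node 0 <command>
```

Keeping the range calculation separate makes it easy to reuse the same wrapper for node 1 without retyping CPU numbers.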

Check memory locality

To ensure that your application's memory is in the node that’s local to the CPUs that your application is running on, use numastat and numa_maps to check the application's locality. If necessary, you can use migratepages to move the memory. See the MySQL section later in this post for examples of these commands. Your system should already have numactl and migratepages installed; if they're missing, run yum install numactl to get both tools.

Transparent HugePages

Transparent HugePages (THP) improves performance by reducing stalls on page translations, but the magnitude of the effect varies by application and workload. It's been reported that disabling THP sometimes yields small reductions in memory use at the expense of performance, but we generally have not been able to reproduce this on up-to-date systems. If memory efficiency is more important to you than throughput, try disabling THP, but do not expect a significant improvement. THP is enabled by default, and you can disable it immediately, without a reboot, as follows:

# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/defrag

You can reenable THP by echoing "always" into the same pseudofiles and then restarting your application.
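Before and after changing the setting, it helps to check which mode is active. The kernel marks the active value with square brackets (for example, "always madvise [never]"). The following sketch extracts that value; the function name thp_mode is our own.

```shell
# Print the active THP mode, which the kernel marks with square brackets.
# For example, "always madvise [never]" yields "never".
thp_mode() {
    thp_file="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
    sed -n 's/.*\[\([a-z+_]*\)\].*/\1/p' "$thp_file"
}

# Report the current mode if this kernel exposes the THP interface.
if [ -r /sys/kernel/mm/transparent_hugepage/enabled ]; then
    echo "THP enabled: $(thp_mode)"
fi
```

Passing /sys/kernel/mm/transparent_hugepage/defrag as the argument reports the defrag mode in the same way.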

Try a performance-optimized malloc

Alternatives to the default glibc malloc that trade memory efficiency and universality for scalable performance have been around for decades. Currently, two excellent options are included in Oracle Linux: jemalloc and TCMalloc. You can try them without modifying your application, as follows:

# yum install -y gperftools jemalloc
# LD_PRELOAD=/lib64/libtcmalloc.so <command> <arguments>
# LD_PRELOAD=/lib64/libjemalloc.so <command> <arguments>

In this case, <command> is any dynamically linked executable, including a shell script that calls other applications. Performance gains and changes in resident set size (RSS) memory will vary depending on your workload.
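If you script these comparisons, a small wrapper can guard against a missing library file so that a mistyped path does not go unnoticed. This is a sketch under our own assumptions: run_with_malloc is a hypothetical helper name, and the library path you pass should be whatever your installed packages actually provide.

```shell
# Run a command with an alternative allocator preloaded, but only if the
# library file is present; otherwise warn and fall back to the default malloc.
run_with_malloc() {
    lib="$1"; shift
    if [ -e "$lib" ]; then
        LD_PRELOAD="$lib" "$@"
    else
        echo "warning: $lib not found, using default malloc" >&2
        "$@"
    fi
}

# Example (path as used above; adjust if your packages install elsewhere):
#   run_with_malloc /lib64/libjemalloc.so <command> <arguments>
```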

MySQL tuning on Oracle Ampere A1 Compute bare metal

If you're managing your own database, it's important to consider the two nodes on the A1 bare metal instance. This section provides tips for starting the database by using local memory and then shares tips for system configuration and adjusting MySQL's my.cnf file, courtesy of Oracle's MySQL Performance Architect Dimitri Kravtchuk.

Ensure local memory accesses

Although we recommend the previous tips for ensuring local memory, sometimes restarting a database isn't possible or convenient. Thankfully, the numactl package (included in Oracle Linux) has the migratepages tool, which you can use to move memory. For example, you can migrate all the mysqld pages that are in node 1 to node 0 by using this command:

# sudo migratepages `pgrep mysqld` 1 0

Data and text are migrated, but shared libraries are not. You can verify where the text is located by looking at the top of the numa_maps file:

# head -1 /proc/`pgrep mysqld`/numa_maps
00400000 bind:0 file=/usr/sbin/mysqld mapped=345 N0=345 kernelpagesize_kB=64

If pages exist in node 1 (that is, N1=...) then something went wrong. Try freeing up some system memory, and then run migratepages again.

The first line in the numa_maps file is usually the most important for MySQL performance, but you should scan the rest of the numa_maps file for remote memory allocations.
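Scanning a long numa_maps file by eye is tedious, so the following sketch totals the per-node page counts across all mappings. The function name numa_pages_by_node is ours. Note that counts are in pages, and mappings can use different page sizes, so treat the totals as a rough indicator rather than an exact byte count.

```shell
# Sum the pages that a numa_maps file reports for each node.
# Each mapping line carries tokens such as N0=345 or N1=12.
numa_pages_by_node() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^N[0-9]+=[0-9]+$/) {
                split($i, kv, "=")
                pages[kv[1]] += kv[2]
            }
    }
    END { for (n in pages) print n, pages[n] }' "$1" | sort
}

# Example: numa_pages_by_node /proc/`pgrep mysqld`/numa_maps
```

Any non-trivial total for the remote node (N1 when mysqld runs in node 0) is a cue to look at the individual mappings.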

When MySQL is running, check that the mysqld heap is also in node 0 by using numastat, for example:

# numastat -c mysqld

Per-node process memory usage (in MBs) for PID 223931 (mysqld)
        Node 0 Node 1 Total
        ------ ------ -----
Huge         0      0     0
Heap         5      0     5
Stack        0      0     0
Private  24802      0 24802
------- ------ ------ -----
Total    24807      0 24807

Performance varies from socket to socket depending on where the kernel text is located. At boot time, kernel address space layout randomization (KASLR) places the kernel text at an address with a random offset. Because of this, kernel text can be in node 0, node 1, or both, but it doesn't move after the system is booted. You can deduce the dominant NUMA location of performance-sensitive kernel text (mostly the IP stack and scheduler functions) by running MySQL or another benchmark that's sensitive to kernel performance in node 0 and then in node 1, and noting which is faster.

What if you need to scale beyond one socket? Rather than expand your instance into another NUMA node, it's best to create an independent MySQL instance in each node because remote memory latency is especially costly for databases. Details on how to do this are included in the MySQL documentation.

That's it! If your database stays up for a long time you can always use numastat and numa_maps to check the locality of mysqld, and fix any problems with migratepages.

Configure MySQL

If you expect a large number of concurrent connections (a few hundred or more), make the following changes to /etc/security/limits.conf. Then, open a new shell so that they take effect.

*                soft    nofile          131072
*                hard    nofile          131072
*                soft    nproc           65536
*                hard    nproc           65536
*                soft    core            unlimited
*                hard    core            unlimited
*                hard    stack           10240
*                soft    stack           10240
*                hard    fsize           unlimited
*                soft    fsize           unlimited
mysql            hard    nice            -20
mysql            soft    nice            -20

Likewise, the following networking tunables in /etc/sysctl.conf increase the number of remembered connections at the expense of a small amount of memory.

net.ipv4.tcp_max_syn_backlog=10000
net.core.netdev_max_backlog=10000
net.core.somaxconn=10000
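After editing /etc/sysctl.conf, running sysctl -p as root loads the file. To check the live values afterward, you can read them from /proc/sys. The following is a small sketch; sysctl_path is a hypothetical helper that works for simple dotted keys such as the three above.

```shell
# Translate a dotted sysctl key into its /proc/sys path, for example
# net.core.somaxconn -> /proc/sys/net/core/somaxconn.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print the live value of each tunable listed above, if it is readable.
for key in net.ipv4.tcp_max_syn_backlog net.core.netdev_max_backlog net.core.somaxconn; do
    path=$(sysctl_path "$key")
    if [ -r "$path" ]; then
        echo "$key = $(cat "$path")"
    fi
done
```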

Although not directly performance related, making the following change to /etc/systemd/journald.conf results in the systemd journal event log persisting across reboots:

#Storage=auto
Storage=persistent

Then run the following command:

# systemctl restart systemd-journald.service

Besides preserving how the system was configured in past reboots, this change makes troubleshooting easier.

Configure MySQL my.cnf

Dimitri Kravtchuk also offers the following my.cnf tuning heuristics for running in one socket of an A1 bare metal instance:

  innodb_buffer_pool_size=192G        # 75% of total memory in one node
  innodb_buffer_pool_instances=40     # max(CPUs/2 , 4)
  innodb_numa_interleave=off          # We do not want memory interleaved across nodes
  innodb_purge_threads=4              # (CPUs <= 8 ) ? 1 : 4
  innodb_parallel_read_threads=20     # max(CPUs / 4, 4)
  innodb_log_file_size=1G
  innodb_log_files_in_group=16        #  min(CPUs, 16 )
  innodb_log_buffer_size=64M
  max_connections=5000
  back_log=5000
  thread_pool_size=128                # min(CPUs*8, 128), if you use ThreadPool
  performance_schema=on               # This can degrade performance a few percent, but allows diagnostics
  innodb_io_capacity_max=5000         # Assumes block storage
  innodb_io_capacity=2500
  innodb_page_cleaners=40             # Set same as innodb_buffer_pool_instances

Please refer to the MySQL documentation to understand what these configurations do.

Note: If you want to translate these settings to other platforms that have multiple threads per core, change "CPUs" to "cores" in the equations.
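To apply the heuristics to a different CPU count, the formulas in the comments above can be evaluated in a few lines of shell. This is only a sketch of that arithmetic: the helper names are ours, and it covers just the settings whose comments give a CPU-based formula.

```shell
# Helpers for integer min and max.
max() { if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi; }
min() { if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi; }

# Print the CPU-derived my.cnf settings for a given CPU count,
# following the formulas in the comments of the listing above.
mysql_cpu_settings() {
    cpus="$1"
    pool_instances=$(max $((cpus / 2)) 4)
    if [ "$cpus" -le 8 ]; then purge_threads=1; else purge_threads=4; fi
    echo "innodb_buffer_pool_instances=$pool_instances"
    echo "innodb_purge_threads=$purge_threads"
    echo "innodb_parallel_read_threads=$(max $((cpus / 4)) 4)"
    echo "innodb_log_files_in_group=$(min "$cpus" 16)"
    echo "thread_pool_size=$(min $((cpus * 8)) 128)"
    echo "innodb_page_cleaners=$pool_instances"   # same as buffer pool instances
}

mysql_cpu_settings 80
```

With 80 CPUs (one socket of the A1 bare metal shape), the output reproduces the values in the listing: 40, 4, 20, 16, 128, and 40.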

Transparent HugePages

There's plenty of debate online over the pros and cons of using MySQL with THP. Sometimes, small reductions in memory use have been observed at the expense of throughput performance, but as we wrote in the preceding section on THP we were not able to reproduce this behavior on current systems. In addition, the Oracle Linux team continues to improve the performance and efficacy of THP.

Try a performance-optimized malloc

A previous section discussed using jemalloc or TCMalloc. Throughput and response time improvements vary depending on your workload, but gains of about 5% have been observed on the sysbench OLTP benchmark.

Java tuning on Oracle Ampere A1 Compute bare metal

As with MySQL, A1 instances give the best Java performance when attention is given to memory locality. But unlike MySQL, data locality often dominates program or kernel text locality and therefore the numactl tool should be all you need. You shouldn't need to migrate Java text with migratepages, as shown in the MySQL example, because most of the code that's running is generated by the HotSpot JVM and is not from the Java executable file. If one socket is enough CPU power for your Java application then you should run it in a single socket using numactl.

The HotSpot JVM scales to large numbers of CPU sockets and CPU cores. If the Java application can scale, the HotSpot JVM can also scale. There are, however, some general guidelines that can offer additional performance to Java applications.

In general, deploying one Java application instance per CPU socket offers the best performance. Therefore, if a deployment already runs multiple Java application instances, or can be partitioned so that one Java application maps to each CPU socket, it typically performs better than it would under other deployment models.

If multiple Java applications or multiple instances of a Java application are deployed on a system, numactl is the easiest method to use. The following example illustrates how to use numactl on the two NUMA nodes of an A1 bare metal instance. Recall that each CPU socket has one NUMA node that contains 80 single-threaded cores, and the system has two CPU sockets.

# numactl -m 0 -C 0-79 java -cp $CLASS_SEARCH_PATH $ARGS ... &
# numactl -m 1 -C 80-159 java -cp $CLASS_SEARCH_PATH $ARGS ... &

In this example, the -m flag identifies the memory node that's paired with each CPU socket. (You can determine which CPUs belong to which NUMA node by using lscpu, numactl -H, or various other Linux commands.) The -C flag lists the CPU hardware thread IDs that belong to that socket. So, memory node 0 maps to the CPU socket that has CPU hardware thread IDs 0-79, and memory node 1 maps to the CPU socket that has CPU hardware thread IDs 80-159.
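If you launch one instance per node from a script, generating the command lines avoids hand-typed CPU ranges. The following sketch is a dry run that only prints the commands. The function name per_node_commands is hypothetical, and it assumes CPUs are numbered contiguously per node, as they are on this shape.

```shell
# Print one numactl command line per NUMA node, assuming CPUs are numbered
# contiguously per node. This only prints the commands; it does not run them.
per_node_commands() {
    nodes="$1"; cpus_per_node="$2"; shift 2
    node=0
    while [ "$node" -lt "$nodes" ]; do
        first=$((node * cpus_per_node))
        last=$((first + cpus_per_node - 1))
        echo "numactl -m $node -C $first-$last $* &"
        node=$((node + 1))
    done
}

# Two nodes with 80 CPUs each, as on the A1 bare metal instance.
per_node_commands 2 80 java -cp '$CLASS_SEARCH_PATH' '$ARGS'
```

Review the printed lines against the output of numactl -H before running them.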

What if you can't partition your application?

If partitioning your application as shown above is too difficult then the HotSpot JVM command line option -XX:+UseNUMA can improve Java application performance. This option configures the JVM in a way that allocates memory locally to the CPU where a Java thread is running, rather than on a remote node that would experience higher memory latency.

Confirm memory locality

You can check that the preceding numactl commands worked as intended by using the numastat tool. For example:

 # numastat -c java

  Per-node process memory usage (in MBs)
  PID            Node 0 Node 1 Total
  -------------  ------ ------ -----
  233957 (java)   13935      0 13935
  233958 (java)      17  11877 11895
  -------------  ------ ------ -----
  Total           13953  11877 25830

Here we see that each java process is mostly in its own node. PID 233958 has 17 MB allocated in the remote node because of shared libraries that are linked to the Java binary, and shared libraries can live only in one place.
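For routine checks, the same table can be reduced to one number per process: the memory that lies outside the process's dominant node. The following is a sketch that assumes the multi-process table layout shown above (PID, name in parentheses, one column per node, then a total). The function name remote_mb_per_pid is ours.

```shell
# Read "numastat -c <name>" multi-process output on stdin and print, for each
# PID row, the memory (in MB) that lies outside the process's dominant node.
remote_mb_per_pid() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^\(.*\)$/ {
        sum = 0; biggest = 0
        # Columns 3..NF-1 are per-node values; the last column is the total.
        for (i = 3; i < NF; i++) {
            sum += $i
            if ($i > biggest) biggest = $i
        }
        print $1, $2, "remote_mb=" (sum - biggest)
    }'
}

# Example: numastat -c java | remote_mb_per_pid
```

Applied to the table above, this reports 0 MB remote for PID 233957 and 17 MB remote for PID 233958.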

Other tuning tips for Java on Oracle Ampere A1 Compute

Check your garbage

As on any platform, consult the Oracle Garbage Collection Tuning Guide.

Transparent HugePages

Performance and memory use with THP vary depending on your workload's data locality and other factors. As with MySQL, we are generally not able to reproduce cases where disabling THP is an improvement on current systems. See the previous THP section for the steps to disable and enable THP.

Give it a try

Are you ready to test your application on Oracle Ampere A1 Compute? You can start with a free tier VM or configure Oracle's most powerful A1 bare metal server. Helpful resources are available at the Oracle Developers Portal and a wide variety of applications are available for your use in the Oracle Cloud Marketplace.
