With CPU-bound benchmarks and applications such as SPEC CPU 2017, it's relatively straightforward to get optimal performance on an Oracle Cloud Infrastructure Ampere A1 Compute instance. But, for applications that have many interrupts or that share memory across non-uniform memory access (NUMA) nodes, it takes some effort to get the best performance. On NUMA systems, a key aspect to control is remote memory accesses. Ensuring that work is done where an application's memory is located reduces expensive remote memory accesses and results in the best and most predictable performance on large-scale applications.
Many customers prefer to run their cloud applications in a Virtual Machine (VM). If a VM is suitable for your application, using a VM rather than a bare metal shape is easier because all the NUMA tuning is done for you. A1.Flex shapes cost the same per OCPU as bare metal shapes yet can be as small as 1 CPU or as large as 80 CPUs and 512 GB.
Oracle's Altra-based BM.Standard.A1.160 compute shape features 160 Arm CPUs split between two NUMA nodes. Node 0 contains CPUs 0-79 and node 1 contains CPUs 80-159. As on all NUMA systems, it takes longer to access memory in the node that's remote to a given CPU. On A1, for example, remote memory latency is roughly four times the local memory latency.
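You can confirm this CPU-to-node layout on your own instance with numactl --hardware or lscpu. As a quick sketch, the sample lines below stand in for real lscpu output on this shape so the parsing is visible; in practice, pipe lscpu itself into the filter:

```shell
# Sample "lscpu" lines for an A1 bare metal shape (inlined for illustration);
# on a live system, replace the printf with: lscpu
printf 'NUMA node0 CPU(s):   0-79\nNUMA node1 CPU(s):   80-159\n' |
awk -F':[ ]*' '/^NUMA node[0-9]/ { print $1 " -> CPUs " $2 }'
```

This prints one line per node with its CPU range, which you can feed into the numactl invocations shown later.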
This post shows you how to configure your application to access memory with the fastest local latency. We'll share some tuning tips for MySQL and Java, but the concepts are applicable to any memory-intensive application.
Use the following configurations and procedures to get the best memory performance for any application with a large code or data footprint that's running on a NUMA system.
If your code has a large footprint (for example, databases often have binaries larger than 100 MB) then performance can suffer significantly if the application stalls while waiting for instructions to be fetched from the remote NUMA node.
Clear the filesystem cache to ensure that a previous invocation of your application didn't leave the binary cached in the remote node. Doing this lets the binary be re-cached in the local node at the next invocation.
# echo 3 > /proc/sys/vm/drop_caches
Then, start your application using numactl to control the CPUs and memory node that your application uses. Here's an example that shows how to run an application in node 0 on an A1 bare metal instance:
# numactl -C 0-79 --localalloc <command>
The text (and data) for <command> is now in the filesystem cache in node 0.
To ensure that your application's memory is in the node that’s local to the CPUs that your application is running on, use numastat and numa_maps to check the application's locality. If necessary, you can use migratepages to move the memory. See the MySQL section later in this post for examples of these commands. Your system should already have numactl and migratepages installed; if they're missing, run yum install numactl to get both tools.
Transparent HugePages (THP) improves performance by reducing stalls on page translations, though the size of the gain varies by application and workload. Disabling THP has been reported to yield small reductions in memory use at the expense of performance, but we generally have not been able to reproduce this on up-to-date systems. If memory efficiency matters more to you than throughput, you can try disabling THP, but don't expect a significant improvement. THP is enabled by default, and disabling it is simple and takes effect immediately:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/defrag
You can reenable THP by echoing "always" into the same pseudofile and then restarting your application.
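To check the current state before or after toggling it, read the bracketed value from the same pseudofile. Here's a small helper sketch; the path is parameterized only so that the parsing can be exercised against a saved copy (the /tmp file name is hypothetical):

```shell
# Report the active THP mode. The pseudofile contains a line such as
# "always madvise [never]", where the bracketed word is the active setting.
thp_state() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Example against a saved copy of the pseudofile:
printf 'always madvise [never]\n' > /tmp/thp_sample
thp_state /tmp/thp_sample    # prints: never
```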
Alternatives to the default glibc malloc that trade memory efficiency and universality for scalable performance have been around for decades. Two excellent options are currently included in Oracle Linux: jemalloc and TCMalloc. You can try them without modifying your application, as follows:
# yum install -y gperftools jemalloc
# LD_PRELOAD=/lib64/libtcmalloc.so <command> <arguments>
# LD_PRELOAD=/lib64/libjemalloc.so <command> <arguments>
In this case, <command> is any dynamically linked executable, including a shell script that calls other applications. Performance gains and changes in resident set size (RSS) memory will vary depending on your workload.
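If you want the allocator swap to be robust across hosts where the library may not be installed, a small wrapper helps. This is a sketch, assuming the Oracle Linux library path shown above; the function name with_alloc is ours, and you can point ALLOC at libtcmalloc.so instead:

```shell
#!/bin/sh
# with_alloc: run a command under jemalloc when it's available, falling
# back to the default glibc malloc otherwise. ALLOC defaults to the
# Oracle Linux path used above; adjust it for your distribution.
with_alloc() {
    alloc=${ALLOC:-/lib64/libjemalloc.so}
    if [ -r "$alloc" ]; then
        LD_PRELOAD="$alloc" "$@"
    else
        echo "note: $alloc not found, using the default allocator" >&2
        "$@"
    fi
}

with_alloc echo "hello from the wrapped command"
```

Either way the command runs; the only difference is which malloc implementation backs it.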
If you're managing your own database, it's important to consider the two nodes on the A1 bare metal instance. This section provides tips for starting the database by using local memory and then shares tips for system configuration and adjusting MySQL's my.cnf file, courtesy of Oracle's MySQL Performance Architect Dimitri Kravtchuk.
Although we recommend the previous tips for ensuring local memory, sometimes restarting a database isn't possible or convenient. Thankfully, the numactl package (included in Oracle Linux) has the migratepages tool, which you can use to move memory. For example, you can migrate all the mysqld pages that are in node 1 to node 0 by using this command:
# sudo migratepages `pgrep mysqld` 1 0
Data and text are migrated, but shared libraries are not. You can verify where the text is located by looking at the top of the numa_maps file:
# head -1 /proc/`pgrep mysqld`/numa_maps
00400000 bind:0 file=/usr/sbin/mysqld mapped=345 N0=345 kernelpagesize_kB=64
If pages exist in node 1 (that is, N1=...) then something went wrong. Try freeing up some system memory, and then run migratepages again.
The first line in the numa_maps file is usually the most important for MySQL performance, but you should scan the rest of the numa_maps file for remote memory allocations.
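A quick way to scan the whole file is to sum the N1= page counts across every mapping. The awk filter below is fed a sample numa_maps line so its behavior is visible; in practice, replace the printf with cat /proc/$(pgrep mysqld)/numa_maps:

```shell
# Sum pages resident on node 1 across all mappings. A sample line stands
# in for the real file; on a live system, feed it:
#   cat /proc/$(pgrep mysqld)/numa_maps
printf '00400000 bind:0 file=/usr/sbin/mysqld mapped=345 N0=345 N1=12\n' |
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^N1=/) { split($i, a, "="); total += a[2] } }
     END { print total+0 " pages on node 1" }'
```

A nonzero total means some of the process's memory is remote and is a candidate for migratepages.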
When MySQL is running, check that the mysqld heap is also in node 0 by using numastat, for example:
# numastat -c mysqld

Per-node process memory usage (in MBs) for PID 223931 (mysqld)
          Node 0 Node 1  Total
          ------ ------  -----
Huge           0      0      0
Heap           5      0      5
Stack          0      0      0
Private    24802      0  24802
-------   ------ ------  -----
Total      24807      0  24807
Performance varies from socket to socket depending on where the kernel text is located. Kernel text is allocated starting at an address with random offset at boot time using a KASLR algorithm. Because of this, kernel text can be in node 0, node 1, or both, but it doesn't move after the system is booted. You can deduce the dominant NUMA location of performance-sensitive kernel text (mostly the IP stack and scheduler functions) by running MySQL or another benchmark that's sensitive to kernel performance in node 0 and in node 1, noting which is faster.
What if you need to scale beyond one socket? Rather than expand your instance into another NUMA node, it's best to create an independent MySQL instance in each node because remote memory latency is especially costly for databases. Details on how to do this are included in the MySQL documentation.
That's it! If your database stays up for a long time you can always use numastat and numa_maps to check the locality of mysqld, and fix any problems with migratepages.
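Those two steps can be combined into a small check-and-fix sketch. The function name fix_locality is ours, and it assumes the numactl package's migratepages tool is installed and that the process is pinned to node 0:

```shell
#!/bin/sh
# fix_locality: if the named process has any pages resident on node 1,
# migrate them back to node 0. Usage: fix_locality mysqld
fix_locality() {
    pid=$(pgrep -x "${1:-mysqld}") || return 1   # fail if no such process
    if grep -q ' N1=' "/proc/$pid/numa_maps"; then
        migratepages "$pid" 1 0
    fi
}
```

You could run this from cron on a long-lived database host, though checking manually after memory pressure events is usually enough.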
If you expect a large number of concurrent connections (a few hundred or more), make the following changes to /etc/security/limits.conf. Then, open a new shell so that they take effect.
* soft nofile 131072
* hard nofile 131072
* soft nproc 65536
* hard nproc 65536
* soft core unlimited
* hard core unlimited
* hard stack 10240
* soft stack 10240
* hard fsize unlimited
* soft fsize unlimited
mysql hard nice -20
mysql soft nice -20
Likewise, the following networking tunables in /etc/sysctl.conf increase the number of remembered connections at the expense of a small amount of memory.
net.ipv4.tcp_max_syn_backlog=10000
net.core.netdev_max_backlog=10000
net.core.somaxconn=10000
Although it's not directly performance related, setting Storage=persistent in the [Journal] section of /etc/systemd/journald.conf makes the systemd journal event log persist across reboots.
Then run the following command:
# systemctl restart systemd-journald.service
Besides preserving how the system was configured in past reboots, this change makes troubleshooting easier.
Dimitri Kravtchuk also offers the following my.cnf tuning heuristics for running in one socket of an A1 bare metal instance:
innodb_buffer_pool_size=192G        # 75% of total memory in one node
innodb_buffer_pool_instances=40     # max(CPUs/2, 4)
innodb_numa_interleave=off          # We do not want memory interleaved across nodes
innodb_purge_threads=4              # (CPUs <= 8) ? 1 : 4
innodb_parallel_read_threads=20     # max(CPUs/4, 4)
innodb_log_file_size=1GB
innodb_log_files_in_group=16        # min(CPUs, 16)
innodb_log_buffer_size=64M
max_connections=5000
back_log=5000
thread_pool_size=128                # min(CPUs*8, 128), if you use ThreadPool
performance_schema=on               # Can cost a few percent of performance, but enables diagnostics
innodb_io_capacity_max=5000         # Assumes block storage
innodb_io_capacity=2500
innodb_page_cleaners=40             # Set same as innodb_buffer_pool_instances
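If you're sizing for a different CPU count, the commented heuristics above can be computed mechanically. Here's a sketch; CPUS=80 models one NUMA node of the BM.Standard.A1.160 shape, and the max/min helper names are ours:

```shell
#!/bin/sh
# Evaluate the my.cnf heuristics above for a given CPU count.
# CPUS=80 corresponds to one NUMA node of BM.Standard.A1.160.
CPUS=${CPUS:-80}
max() { if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi; }
min() { if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi; }

echo "innodb_buffer_pool_instances=$(max $((CPUS / 2)) 4)"
echo "innodb_purge_threads=$([ "$CPUS" -le 8 ] && echo 1 || echo 4)"
echo "innodb_parallel_read_threads=$(max $((CPUS / 4)) 4)"
echo "innodb_log_files_in_group=$(min "$CPUS" 16)"
echo "thread_pool_size=$(min $((CPUS * 8)) 128)"
echo "innodb_page_cleaners=$(max $((CPUS / 2)) 4)"
```

With CPUS=80 this reproduces the values in the listing above (40, 4, 20, 16, 128, 40).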
Refer to the MySQL documentation to understand what these settings do.
Note: If you want to translate these settings to other platforms that have multiple threads per core, change "CPUs" to "cores" in the equations.
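Concretely, divide the logical CPU count by the threads-per-core value to get the core count used in the formulas. The numbers below model a hypothetical 2-way SMT machine with 64 logical CPUs:

```shell
# Convert logical CPUs to physical cores for the formulas above.
# On a real system you might derive these with, for example:
#   CPUS=$(nproc)
#   TPC=$(lscpu | awk -F': *' '/^Thread\(s\) per core/ {print $2}')
CPUS=64   # hypothetical logical CPU count
TPC=2     # hypothetical threads per core
echo "cores=$((CPUS / TPC))"    # prints: cores=32
```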
There's plenty of debate online over the pros and cons of using MySQL with THP. Sometimes, small reductions in memory use have been observed at the expense of throughput performance, but as we wrote in the preceding section on THP we were not able to reproduce this behavior on current systems. In addition, the Oracle Linux team continues to improve the performance and efficacy of THP.
A previous section discussed using jemalloc or TCMalloc. Throughput and response time improvements vary depending on your workload, but gains of about 5% have been observed on the sysbench OLTP benchmark.
As with MySQL, A1 instances give the best Java performance when attention is paid to memory locality. Unlike MySQL, though, data locality usually matters more than program or kernel text locality, so the numactl tool should be all you need. You shouldn't need to migrate Java text with migratepages, as shown in the MySQL example, because most of the running code is generated by the HotSpot JVM rather than loaded from the Java executable file. If one socket provides enough CPU power for your Java application, run it in a single socket using numactl.
The HotSpot JVM scales to large numbers of CPU sockets and CPU cores. If the Java application can scale, the HotSpot JVM can also scale. There are, however, some general guidelines that can offer additional performance to Java applications.
In general, deploying one Java application instance per CPU socket offers the best performance. If your deployment already runs multiple Java application instances, or can be partitioned so that each CPU socket runs one instance, it typically performs better than under other deployment models.
If multiple Java applications or multiple instances of a Java application are deployed on a system, numactl is the easiest method to use. The following example illustrates how to use numactl on the two NUMA nodes of an A1 bare metal instance. Recall that each CPU socket has one NUMA node that contains 80 single-threaded cores, and the system has two CPU sockets.
# numactl -m 0 -C 0-79 java -cp $CLASS_SEARCH_PATH $ARGS ... &
# numactl -m 1 -C 80-159 java -cp $CLASS_SEARCH_PATH $ARGS ... &
In this example, the -m flag identifies the memory node that's paired with each CPU socket. (You can determine which CPUs belong to which NUMA node by using lscpu, numactl -H, or various other Linux commands.) The -C flag identifies the CPU hardware thread IDs to run on. So, memory node 0 is paired with the CPU socket that has hardware thread IDs 0-79, and memory node 1 is paired with the socket that has hardware thread IDs 80-159.
If partitioning your application as shown above is too difficult then the HotSpot JVM command line option -XX:+UseNUMA can improve Java application performance. This option configures the JVM in a way that allocates memory locally to the CPU where a Java thread is running, rather than on a remote node that would experience higher memory latency.
You can check that the preceding numactl commands worked as intended by using the numastat tool. For example:
# numastat -c java

Per-node process memory usage (in MBs)
PID            Node 0 Node 1  Total
-------------  ------ ------  -----
233957 (java)   13935      0  13935
233958 (java)      17  11877  11895
-------------  ------ ------  -----
Total           13953  11877  25830
Here we see that each java process is mostly in its own node. PID 233958 has 17 MB allocated in the remote node because of shared libraries that are linked to the Java binary, and shared libraries can live only in one place.
As on any platform, consult the Oracle Garbage Collection Tuning Guide.
Performance and memory use with THP varies depending on your workload's data locality and other factors. As with MySQL, we are generally not able to reproduce cases where disabling THP is an improvement on current systems. See the previous THP section for the steps to disable and enable THP.
Are you ready to test your application on Oracle Ampere A1 Compute? You can start with a free tier VM or configure Oracle's most powerful A1 bare metal server. Helpful resources are available at the Oracle Developers Portal and a wide variety of applications are available for your use in the Oracle Cloud Marketplace.