Daniel Jordan from our core kernel team wrote up an overview of his ktask framework, which has been submitted for review on the Linux kernel mailing list. This post has been edited to add a link to lstat.sh below.
Lately I’ve been working on a project called ktask, which allows the Linux kernel to parallelize large tasks. Although these tasks always run in kernel context, they may be initiated directly in the kernel or indirectly through a system call from an application.
In this post, I’ll explain the problem that brought this framework about, the high-level idea behind the framework, some of the use cases I’ve implemented so far, and some other details about the project. Finally, I’ll wrap up by explaining some of the Linux performance tools I’ve used to make sure each task is parallelized efficiently.
Motivation
To ensure that applications and the kernel itself continue to perform well as core counts and memory sizes increase, the kernel needs to scale. For example, when a system call requests a certain fraction of system resources, the kernel should respond in kind by devoting a similar fraction of system resources to service the request. In some places, however, the kernel doesn’t yet rise to the occasion.
For example, when a large application such as a database or in-memory cache is shutting down and freeing its pages back to the system, that application may easily have been using half the machine’s memory, but the kernel currently devotes only a single thread to this work no matter how much memory is being freed. This creates a bottleneck in which a large application can take minutes to return all its resources, preventing other applications from using them in the meantime.
Concept
Enter ktask, which is designed to scale the number of kernel threads used to service these types of heavy requests.
The concept is fairly simple, but a little terminology is needed up front. A task is the total work there is to do and a chunk is a unit of work given to a thread.
To complete a task using the ktask framework, a kernel client (such as system call code) provides a thread function that is responsible for completing one chunk. The thread function is defined in a standard way, with start and end arguments that delimit the chunk as well as an argument that the client uses to pass data specific to the task.
In addition, the client supplies an object representing the start of the task and an iterator function that knows how to advance some number of units in the task to yield another object representing the new task position. The framework uses the start object and iterator internally to divide the task into chunks.
Finally, the client passes the total task size and a minimum chunk size to indicate the minimum amount of work that’s appropriate to do in one chunk. The sizes are given in task-specific units (e.g. pages, inodes, bytes). The framework uses these sizes, along with the number of online CPUs and an internal maximum number of threads, to decide how many threads to start and how many chunks to divide the task into.
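To make the shape of this interface a little more concrete, here is a rough sketch of the pieces just described. The type names and signatures below are my own illustration rather than the actual interface, which you can find in the upstream patchset linked below:

/*
 * Illustrative sketch only; names and signatures are invented for this
 * post.  See the upstream ktask patchset for the real interface.
 */

/* Completes one chunk; start and end delimit the chunk, arg carries task-specific data. */
typedef int (*ktask_thread_func)(void *start, void *end, void *arg);

/*
 * Advances 'nsteps' task-specific units from 'position' and returns an
 * object representing the new task position.
 */
typedef void *(*ktask_iter_func)(void *position, size_t nsteps);

/*
 * Runs the whole task.  'start' represents the start of the task;
 * 'task_size' and 'min_chunk_size' are in task-specific units (pages,
 * inodes, bytes).  The framework combines these with the number of
 * online CPUs and an internal thread maximum to decide how many threads
 * to start and how many chunks to create.
 */
int ktask_run(void *start, size_t task_size, size_t min_chunk_size,
              ktask_thread_func thread_func, void *func_arg,
              ktask_iter_func iter_func);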
For example, consider the task of clearing a huge page. This used to be done in a single thread with a ‘for’ loop that calls a page clearing function for each constituent base page. To parallelize with ktask, the client first moves the ‘for’ loop to the thread function, adapting it to operate on the range passed to the function. In this simple case, the thread function’s start and end arguments are just addresses delimiting the portion of the huge page to clear. Then, where the ‘for’ loop used to be, the client calls into ktask with the start address of the huge page, the total size of the huge page, and the thread function. Internally, ktask will divide the address range into an appropriate number of chunks and start an appropriate number of threads to complete them.
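Sticking with the same illustrative interface, the huge page example might be wired up roughly as follows. Here clear_one_base_page, advance_address, and the 16 MiB minimum chunk size are stand-ins I've made up for the sketch, while PAGE_SIZE is the kernel's usual base page size constant:

/* Thread function: the old 'for' loop, now limited to one chunk. */
static int clear_huge_page_chunk(void *start, void *end, void *arg)
{
        unsigned long addr;

        for (addr = (unsigned long)start; addr < (unsigned long)end;
             addr += PAGE_SIZE)
                clear_one_base_page(addr);      /* stand-in for the real page-clearing call */

        return 0;
}

/* Iterator: task positions are plain addresses, so just advance the pointer. */
static void *advance_address(void *position, size_t nbytes)
{
        return (char *)position + nbytes;
}

/* Where the single-threaded loop used to be. */
static void clear_huge_page_parallel(void *huge_page_start, size_t huge_page_size)
{
        ktask_run(huge_page_start, huge_page_size,
                  16 * 1024 * 1024,             /* illustrative minimum chunk size, in bytes */
                  clear_huge_page_chunk, NULL, advance_address);
}

From here, the framework splits the address range into chunks of at least the minimum size and hands each chunk, via its start and end addresses, to a worker thread running the thread function.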
For more concrete information on the interface, see the latest upstream patchset.
Use Cases and Performance Results
So far, ktask is planned for use in a few places: the munmap(2) and exit(2) paths where the system frees anonymous pages, struct page initialization at boot time, and zeroing the largest huge page sizes. To show off the performance of the framework, I’ll use the simple example of clearing a huge page.
These results are from an Oracle X5-8 server with the following specs:
CPU type: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz
CPU count: 144 cores (288 threads); 8 nodes @ 18 cores/node
Memory: 1T
Here’s the data from the test. Four range sizes were used to demonstrate the scaling potential both within a single NUMA node and across all nodes on a large multi-node system. The first size, 100 GiB, consists entirely of pages from the same node. As the sizes increase beyond that, the memory zeroed grows to include progressively more of the system’s nodes.
nthread   speedup   size (GiB)   min time (s)   stdev
      1         -          100          41.13    0.03
      2     2.03x          100          20.26    0.14
      4     4.28x          100           9.62    0.09
      8     8.39x          100           4.90    0.05
     16    10.44x          100           3.94    0.03

      1         -          200          89.68    0.35
      2     2.21x          200          40.64    0.18
      4     4.64x          200          19.33    0.32
      8     8.99x          200           9.98    0.04
     16    11.27x          200           7.96    0.04

      1         -          400         188.20    1.57
      2     2.30x          400          81.84    0.09
      4     4.63x          400          40.62    0.26
      8     8.92x          400          21.09    0.50
     16    11.78x          400          15.97    0.25

      1         -          800         434.91    1.81
      2     2.54x          800         170.97    1.46
      4     4.98x          800          87.38    1.91
      8    10.15x          800          42.86    2.59
     16    12.99x          800          33.48    0.83
The test scales well up to 8 threads, finally hitting a wall at 16 threads mostly due to topping out the chip’s memory bandwidth. In fact, the data shows we’re actually getting superlinear speedups from 2 to 8 threads because more threads can use more of the chips’ caches.
The loop we’re stressing here is clear_page_erms, a platform-specific page clearing function with just a few instructions in a tight loop, which tops out at a bandwidth of 2550 MiB/s with one thread. We get the same bandwidth per thread for 2, 4, or 8 threads, but at 16 threads the per-thread bandwidth drops to 1420 MiB/s. In other words, aggregate bandwidth rises from roughly 20 GiB/s at 8 threads to only about 22 GiB/s at 16, which is what running into the memory bandwidth ceiling looks like.
The performance also benefits from ktask’s NUMA awareness: ktask starts worker threads on the node local to the work being done. This becomes a bigger factor as the range of pages to zero grows to include memory from more of the system’s nodes, which is why the speedups actually increase as the memory size increases.
Tools Used to Build this Framework
Using this framework is often only the first of two steps in optimizing a path like huge page clearing or munmap(2). Very often, additional kernel code tuning is needed to remove obstacles to efficient parallelization, such as hot locks, cacheline bouncing, and redundant work across threads. Fortunately, the Linux kernel comes with many tools to help diagnose these problems.
The tool I’ve used the most is lock_stat, which collects a variety of kernel locking metrics, including contention counts, acquisition counts, and wait times, and exposes them to userspace. To use lock_stat, your kernel should be built with CONFIG_LOCK_STAT=y. Like many kernel facilities, lock_stat is controlled through proc files: /proc/lock_stat displays the data collected, and /proc/sys/kernel/lock_stat enables or disables collection. A simple wrapper script, linked below, automates all of this for you, so that you just run:
lstat cmd...
[edit] Download lstat.sh source from github.com/oracle
After the command returns, lock_stat will have been disabled, so that you can browse /proc/lock_stat at your leisure without having to worry about future system activity muddying the data. Alternatively, the -f option allows the data to be written out to a file.
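To show what the wrapper is doing with those two proc files, here is a minimal sketch of the sequence, written in C purely for illustration (the real lstat.sh is a shell script, and error handling is kept minimal here):

/*
 * Illustrative C equivalent of the proc-file sequence described above:
 * clear old data, enable collection, run the workload, then disable
 * collection so /proc/lock_stat can be browsed afterward.
 */
#include <stdio.h>
#include <stdlib.h>

static void write_proc(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
                perror(path);
                exit(1);
        }
}

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s \"cmd...\"\n", argv[0]);
                return 1;
        }

        write_proc("/proc/lock_stat", "0");             /* clear previously collected data */
        write_proc("/proc/sys/kernel/lock_stat", "1");  /* enable collection */

        if (system(argv[1]) == -1)                      /* run the workload */
                perror("system");

        write_proc("/proc/sys/kernel/lock_stat", "0");  /* disable collection */
        /* /proc/lock_stat now holds the data for later browsing. */
        return 0;
}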
Another useful tool is perf probe, a perf subcommand that lets you add dynamic tracepoints to a running kernel without modifying code. In my case I’ve used it to explore workqueue thread latency, but perf probe is generally useful whenever you have specific functions you’d like to trace. It can show you which CPU a function ran on and when it reached any point within the function (entry, return, or even a specific instruction). To use it, your kernel should be configured with CONFIG_KPROBE_EVENTS=y, CONFIG_KPROBES=y, and CONFIG_PERF_EVENTS=y.
Here’s a sample run of perf probe that uses a combination of predefined events and dynamic probes to measure the thread latency mentioned earlier. The purpose of this experiment, to give some context, was to verify that workqueue threads were not adding a significant amount of latency between the time a ktask client calls ktask_run and a ktask thread function, here dispose_list_task, runs. dispose_list_task makes many calls to another function called evict2, thus making it a good candidate for parallelization. perf probe can help by showing when and where each step in this process happened.
First, we add (-a) the dynamic probes verbosely (-v).
# perf probe -v -a evict_inodes
# perf probe -v -a ktask_run
# perf probe -v -a ktask_task
# perf probe -v -a 'dispose_list_task start:x64 end:x64'             # fourth probe
# perf probe -v -a 'evict2 inode:x64'
# perf probe -v -a 'evict2_ret=evict2+309 inode:x64'                 # sixth probe
# perf probe -v -a 'dispose_list_task_ret=dispose_list_task%return'
# perf probe -v -a 'ktask_task_ret=ktask_task%return'
# perf probe -v -a 'ktask_run_ret=ktask_run%return'
# perf probe -v -a 'evict_inodes_ret=evict_inodes%return'
The first three just use a kernel function name to make the probe fire on function entry. The fourth probe, dispose_list_task, fires on entry as well but also prints the function arguments start and end as 64-bit hex values. The sixth fires at a specific instruction in evict2. The remaining probes fire on return from their respective functions.
Now we actually record the command we’re interested in, using the dynamic probes we set up. The events not prefixed with ‘probe:’ (here, the ‘workqueue:’ ones) are predefined perf events that appear in every recent kernel.
# perf record -aR \
    -e probe:evict_inodes \
    -e probe:ktask_run \
    -e workqueue:workqueue_queue_work \
    -e workqueue:workqueue_activate_work \
    -e workqueue:workqueue_execute_start \
    -e probe:ktask_task \
    -e probe:dispose_list_task \
    -e probe:evict2 \
    -e probe:evict2_ret \
    -e probe:dispose_list_task_ret \
    -e probe:ktask_task_ret \
    -e workqueue:workqueue_execute_end \
    -e probe:ktask_run_ret \
    -e probe:evict_inodes_ret \
    cmd...
The -a flag causes events on all cpus in the system to be recorded. Normally, perf records only the events from the command given, but in this case we also want to trace the workqueue threads in the kernel.
# chown user:group perf.data
After changing the file permissions so we’re not needlessly running as root, we can post-process the raw output generated from perf record as shown below:
$ perf script -F cpu,event,time,trace > perf.script.out
$ cat perf.script.out
[006]     0.000000: probe:evict_inodes: (ffffffff811f6c20)
[006]     0.014580: probe:ktask_run: (ffffffff8107e210)
[006]     0.014584: workqueue:workqueue_queue_work: work struct=0xffff8818634b6058 function=ktask_task workqueue=0xffff883fef931a00 req_cpu=1 cpu=1
[006]     0.014585: workqueue:workqueue_activate_work: work struct 0xffff8818634b6058
...snip...
[001]     0.014645: workqueue:workqueue_execute_start: work struct 0xffff8818634b6058: function ktask_task
[001]     0.014667: probe:ktask_task: (ffffffff8107e0c0)
[001]     0.014671: probe:dispose_list_task: (ffffffff811f5a50) start_x64=0x0 end_x64=0x1
[001]     0.014673: probe:evict2: (ffffffff811f5890) inode_x64=0xffff8818a15fbb50
[001]     0.016089: probe:evict2_ret: (ffffffff811f59c5) inode_x64=0xffff8818a15fbb50
[001]     0.016090: probe:evict2: (ffffffff811f5890) inode_x64=0xffff881fc9fceb10
[001]     0.017483: probe:evict2_ret: (ffffffff811f59c5) inode_x64=0xffff881fc9fceb10
...snip...
[001]     0.193898: probe:evict2: (ffffffff811f5890) inode_x64=0xffff8816939c6b10
[001]     0.195335: probe:evict2_ret: (ffffffff811f59c5) inode_x64=0xffff8816939c6b10
[001]     0.195339: probe:dispose_list_task_ret: (ffffffff811f5a50 <- ffffffff8107e134)
[001]     0.195345: probe:dispose_list_task: (ffffffff811f5a50) start_x64=0x15 end_x64=0x16
[001]     0.195347: probe:evict2: (ffffffff811f5890) inode_x64=0xffff8819a55f2350
[001]     0.196753: probe:evict2_ret: (ffffffff811f59c5) inode_x64=0xffff8819a55f2350
...snip...
[001]     2.701235: probe:evict2: (ffffffff811f5890) inode_x64=0xffff8816fdb52290
[001]     2.702268: probe:evict2_ret: (ffffffff811f59c5) inode_x64=0xffff8816fdb52290
[001]     2.702269: probe:dispose_list_task_ret: (ffffffff811f5a50 <- ffffffff8107e134)
[001]     2.702273: probe:ktask_task_ret: (ffffffff8107e0c0 <- ffffffff81072c69)
[001]     2.702275: workqueue:workqueue_execute_end: work struct 0xffff8818634b6058
...snip...
[006]     2.706126: probe:ktask_run_ret: (ffffffff8107e210 <- ffffffff811f6e0a)
[006]     2.706129: probe:evict_inodes_ret: (ffffffff811f6c20 <- ffffffff811db4b4)
Starting from the left, we can see the CPU number in brackets, the event’s timestamp in seconds, and the probe name. Any additional data we requested is shown at the end of the line. In this case, we can see that the time that passed between the call to ktask_run and the moment one thread’s first chunk started running (probe:dispose_list_task) was less than 100 microseconds (0.014671 - 0.014580), which shows that workqueue thread latency is not the issue.
Finally, we remove the dynamic tracepoints from the system:
# perf probe -d '*'
Conclusion
In this post, I’ve explained the scalability issues that motivated ktask, the high-level ideas in the framework, and some of the places it’s been used, and I’ve wrapped up with some of the performance tools used in the process of writing the framework.
The plan for ktask going forward is to solicit more feedback upstream, add more callers, and continue to enhance the core framework. Hopefully ktask will be making its way upstream in the near future so that everyone benefits from enhanced kernel scalability.