Kernel developer Daniel Jordan got a nice writeup on LWN.net for his work on ktask. Daniel wrote a blog post about ktask when the first version of this work was submitted to the Linux kernel community. Since then, the code has evolved to cover additional use cases and to integrate better with other kernel subsystems. LWN.net subscribers can learn more in this recent writeup on the evolution of ktask.
ktask is a generic framework for kernel task parallelization: any task that is currently single-threaded in the kernel can be broken up into chunks and handed off to the ktask helper, which decides how many CPUs should participate and how the work is scheduled so that the task finishes quickly. This change is not automatic; ktask introduces a coding construct that developers must use to take advantage of this parallelized functionality.
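Based on the API in the posted patch series, a client's use of that construct might look roughly like the pseudocode sketch below. The names (`DEFINE_KTASK_CTL`, `ktask_run`, `KTASK_RETURN_SUCCESS`) and signatures are taken from the RFC patches and may change as the work evolves; this is a sketch for illustration, not compilable code outside a patched kernel tree.

```c
/* Sketch only: API names are from the posted ktask RFC patches
 * and may differ in later revisions. */
#include <linux/ktask.h>

/* Thread function: each ktask helper thread is handed one
 * [start, end) slice of the overall range to process. */
static int zero_chunk(void *start, void *end, void *arg)
{
	memset(start, 0, end - start);
	return KTASK_RETURN_SUCCESS;
}

static void zero_range(void *base, size_t len)
{
	/* Describe the task: thread function, its argument, and the
	 * minimum chunk size a single thread should be given. */
	DEFINE_KTASK_CTL(ctl, zero_chunk, NULL, SZ_1M);

	/* ktask splits [base, base + len) into chunks, chooses a
	 * thread count, and load-balances the chunks across threads. */
	ktask_run(base, len, &ctl);
}
```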
Memory initialization (page zeroing) benefits considerably from ktask's parallelization. Initializing memory is a critical task the OS performs to keep data secure, and it can be a significant factor in startup time for database applications and virtual machines. ktask allows this work to be spread across all the cores on a system, so it scales with the number of available CPUs.
As the patches have been reviewed and revised, more use cases have surfaced, and we’re excited to see more opportunities to make this generic framework useful for the kernel, including parallelizing operations in the InfiniBand driver, improving VFIO performance, and more!
ktask: parallelize CPU-intensive kernel work
ktask is a generic framework for parallelizing CPU-intensive work in the kernel. The intended use is for big machines that can use their CPU power to speed up large tasks that can’t otherwise be multithreaded in userland. The API is generic enough to add concurrency to many different kinds of tasks (for example, page clearing over an address range, or freeing a list of pages) and aims to save its clients the trouble of splitting up the work, choosing the number of helper threads to use, maintaining an efficient concurrency level, starting these threads, and load balancing the work between them.
Some Results
Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1 TB memory
Test: Clear a range of gigantic pages (triggered via fallocate)
nthread  speedup  size (GiB)  min time (s)  stdev
   1                 100          41.13      0.03
   2      2.03x      100          20.26      0.14
   4      4.28x      100           9.62      0.09
   8      8.39x      100           4.90      0.05
  16     10.44x      100           3.94      0.03

   1                 200          89.68      0.35
   2      2.21x      200          40.64      0.18
   4      4.64x      200          19.33      0.32
   8      8.99x      200           9.98      0.04
  16     11.27x      200           7.96      0.04

   1                 400         188.20      1.57
   2      2.30x      400          81.84      0.09
   4      4.63x      400          40.62      0.26
   8      8.92x      400          21.09      0.50
  16     11.78x      400          15.97      0.25

   1                 800         434.91      1.81
   2      2.54x      800         170.97      1.46
   4      4.98x      800          87.38      1.91
   8     10.15x      800          42.86      2.59
  16     12.99x      800          33.48      0.83
This data shows the speedup from spreading the zeroing of large amounts of memory across available cores; the speedup column is the single-threaded minimum time divided by the n-thread minimum time (for example, 41.13 s / 3.94 s ≈ 10.44x for 16 threads on 100 GiB). Raw data for these results.
We look forward to seeing ktask as part of upstream Linux!