Kernel developer Daniel Jordan got a nice writeup on LWN.net for his work on ktask. Daniel wrote a blog post about ktask when the first version of this work was submitted to the Linux kernel community. Since then, the code has evolved in many ways to integrate better with other kernel subsystems. LWN.net subscribers can learn more in this recent writeup on the evolution of ktask.

Ktask is a generic framework for kernel task parallelization: any task that is currently single-threaded in the kernel can be broken up into workable chunks and handed off to the ktask helper, which makes clever scheduling and CPU-participation decisions to ensure that the task finishes quickly. This change is not automatic; ktask introduces a coding construct that must be used by developers who wish to take advantage of the parallelized functionality.
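To give a flavor of that coding construct, here is a rough C-style pseudocode sketch loosely modeled on the API proposed in the patch series (DEFINE_KTASK_CTL and ktask_run); treat the names, signatures, and return values as approximations of the posted code, not a finalized kernel interface.

```c
/* Sketch only: loosely modeled on the ktask patch series.
 * Names and signatures are approximations, not a merged API.
 */

/* The per-chunk thread function: called on each piece of the range. */
static int zero_chunk(void *start, void *end, void *arg)
{
	memset(start, 0, end - start);
	return KTASK_RETURN_SUCCESS;
}

static void zero_range(void *addr, size_t len)
{
	/* Describe the task: the per-chunk function, its argument, and a
	 * minimum chunk size below which splitting isn't worthwhile. */
	DEFINE_KTASK_CTL(ctl, zero_chunk, NULL, SZ_16M);

	/* ktask decides how many helper threads to use, splits the range
	 * into chunks, and load-balances the work across them. */
	ktask_run(addr, len, &ctl);
}
```

The client supplies only the per-chunk callback and a description of the range; the thread-count and scheduling decisions stay inside the framework.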

Memory initialization (page zeroing) benefits considerably from ktask's parallelization. Initializing memory is a critical task the OS performs to keep data secure, and it can be a significant factor in startup time for database applications and virtual machines. ktask allows this work to be spread across all the cores on a system, so it scales with the number of available CPUs.

As the patches have been reviewed and revised, more use cases have bubbled up, and we’re excited to see more opportunities to make this generic framework useful for the kernel, including parallelizing operations in the InfiniBand driver, improving vfio performance, and more!

ktask: parallelize CPU-intensive kernel work

ktask is a generic framework for parallelizing CPU-intensive work in the kernel. The intended use is for big machines that can use their CPU power to speed up large tasks that can’t otherwise be multithreaded in userland. The API is generic enough to add concurrency to many different kinds of tasks (for example, page clearing over an address range or freeing a list of pages) and aims to save its clients the trouble of splitting up the work, choosing the number of helper threads to use, maintaining an efficient concurrency level, starting these threads, and load balancing the work between them.

Some Results

Machine:  Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1 TiB memory
Test:     Clear a range of gigantic pages (triggered via fallocate)

nthread   speedup   size (GiB)   min time (s)   stdev (s)
      1                    100          41.13    0.03
      2     2.03x          100          20.26    0.14
      4     4.28x          100           9.62    0.09
      8     8.39x          100           4.90    0.05
     16    10.44x          100           3.94    0.03

      1                    200          89.68    0.35
      2     2.21x          200          40.64    0.18
      4     4.64x          200          19.33    0.32
      8     8.99x          200           9.98    0.04
     16    11.27x          200           7.96    0.04

      1                    400         188.20    1.57
      2     2.30x          400          81.84    0.09
      4     4.63x          400          40.62    0.26
      8     8.92x          400          21.09    0.50
     16    11.78x          400          15.97    0.25

      1                    800         434.91    1.81
      2     2.54x          800         170.97    1.46
      4     4.98x          800          87.38    1.91
      8    10.15x          800          42.86    2.59
     16    12.99x          800          33.48    0.83

This data shows the speedup achieved when zeroing large amounts of memory, and the advantage of spreading the work across available cores. Raw data for these results is available.

We look forward to seeing ktask as part of upstream Linux!