Multithreaded Struct Page Initialization

Guest Author

Oracle Linux kernel developer Daniel Jordan contributes this post on the initial support for multithreaded jobs in padata.

The last padata blog described unbinding padata jobs from specific CPUs. This post will cover padata's initial support for multithreading CPU-intensive kernel paths, which takes us to the memory management system.

The Bottleneck

During boot, the kernel needs to initialize all its page structures so they can be freed to the buddy allocator and put to good use. This became expensive as memory sizes grew into the terabytes, so in 2015 Linux got a new feature called deferred struct page initialization that brought the time down on NUMA machines. Instead of a single thread doing all the work, that thread only initialized a small subset of the pages early on, and then per-node threads did the rest later.

This helped significantly on systems with many nodes, saving hundreds of seconds on a 24 TB server. However, it left some performance on the table for machines with many cores but not enough nodes to take full advantage of deferred init as it was initially implemented. One of the machines I tested had 2 nodes and 768 GB memory, and its pages took 1.7 seconds to be initialized, by far the largest component of the 4 seconds it took to boot the kernel. That may seem like a small amount of time in absolute terms, but it matters in a few different cases as explained in this changelog:

Optimizing deferred init maximizes availability for large-memory systems and allows spinning up short-lived VMs as needed without having to leave them running. It also benefits bare metal machines hosting VMs that are sensitive to downtime. In projects such as VMM Fast Restart, where guest state is preserved across kexec reboot, it helps prevent application and network timeouts in the guests.

So there was a need to use more than one thread per node to take full advantage of system memory bandwidth on machines where memory was concentrated over relatively few nodes.

The Timing

Deferred init turned out to be a good place to start upstreaming support for multithreaded kernel jobs because of how early it happens: before userspace is up, when there is no other significant activity because the system is waiting for page initialization to finish. That timing allowed delaying many of the prerequisites the community has deemed necessary for starting these jobs from userspace.

These prereqs have come up a few times in the past. They blocked attempts at adding similar functionality for page migration and page zapping, and the community raised them again in the initial versions of this work. The concerns involve both the extent to which the extra threads, known as helpers, respect the resource controls of the main thread that initiates the job, and whether the helpers will unduly interfere with other activity on the system.

In the first case, the resource that matters for page init threads is CPU consumption, which can be restricted with cgroup's CPU subsystem. The CPU controller, however, only becomes active after boot is finished, so respecting it during page init is not necessary. And in the second case, there is no concern about interfering with other tasks on the system because the page init threads run when the rest of the system is largely idle and waiting for page init to finish.

Because the multithreading functions are currently only used during boot, they are all marked __init so that the kernel can free their text after boot and enforce that no callers use them later, at least until the proper restrictions are in place.

The Implementation

For this first step in adding multithreading support, the implementation is thankfully fairly simple. padata, an existing framework that assigns single threads to many small jobs, grew to support assigning many threads to single large jobs. To multithread such a job, the user defines a struct padata_mt_job:

    struct padata_mt_job {
        void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
        void             *fn_arg;
        unsigned long    start;
        unsigned long    size;
        unsigned long    align;
        unsigned long    min_chunk;
        int              max_threads;
    };

The job description contains basic information including a pointer to the thread function, an argument to that function containing any required shared data, and the start and size of the job. start and size are in job-specific units. For deferred init, the unit is page frame numbers (PFNs) to be initialized. A user may pass an alignment, which is useful in the page init case for avoiding cacheline bouncing of page section data between threads. The remaining two fields require a bit more explanation.
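
To make the calling convention concrete, here is a minimal, illustrative thread function. Each helper is handed one chunk of the job, expressed as a start and end in the job's units (PFNs here), plus the shared fn_arg. The names example_init_chunk and example_init_one_page are hypothetical stand-ins, not kernel functions:

    /*
     * Illustrative only: the thread function each helper runs.  It is
     * handed one chunk [start_pfn, end_pfn) of the job's PFN range plus
     * the shared fn_arg.  example_init_one_page() is a hypothetical
     * stand-in for the real per-page initialization.
     */
    static void __init example_init_chunk(unsigned long start_pfn,
                                          unsigned long end_pfn, void *arg)
    {
        struct zone *zone = arg;    /* shared data from job->fn_arg */
        unsigned long pfn;

        for (pfn = start_pfn; pfn < end_pfn; pfn++)
            example_init_one_page(pfn, zone);
    }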

The first, min_chunk, describes the minimum amount of work that is appropriate for one helper thread to do in one call to the thread function. Like start and size, it is in job-specific units. min_chunk is a hint to keep an inordinate number of threads from being started for too little work, which could hurt performance. During page init, a job is started for each of the deferred PFN ranges, and some of those ranges may be small enough to warrant starting fewer threads than the other job parameters would otherwise allow.

The second, max_threads, is simply a cap on the number of threads that can be started for the job. It was not obvious at the start of the project what number would work best on all systems, and there was some discussion upstream of setting it to the number of cores on the node, which has performed better than using all SMT CPUs in workloads similar to page init. However, performance testing across several recent CPU types found, surprisingly, that more threads always produced greater speedups, albeit with diminishing returns. Since the system is otherwise idle during page init, it made sense to take full advantage of all the CPUs.
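
Taken together, size, min_chunk, and max_threads bound how many helpers are worth starting. The following is a simplified sketch of that decision, not the actual padata code:

    /*
     * Simplified sketch (not the exact padata logic) of how min_chunk and
     * max_threads bound the helper count: never start more threads than
     * there are min_chunk-sized pieces of work, and never exceed the cap.
     */
    static unsigned long example_nr_helpers(const struct padata_mt_job *job)
    {
        unsigned long nworks = max(job->size / job->min_chunk, 1ul);

        return min(nworks, (unsigned long)job->max_threads);
    }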

With the job defined, the page init code starts it with padata_do_multithreaded. padata internally decides how many threads to start, taking care to assign work in amounts small enough to load balance between helpers, so they finish at roughly the same time, but large enough to minimize management overhead. The function waits for the job to complete before returning.
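
To make the flow concrete, here is a hedged sketch of the caller side: one padata_mt_job is filled in per deferred PFN range and handed to padata_do_multithreaded. Apart from the padata API and the PAGES_PER_SECTION macro, the names are illustrative rather than copied from mm/page_alloc.c:

    /*
     * Illustrative sketch of starting a multithreaded job for one deferred
     * PFN range [spfn, epfn).  Section-sized alignment and minimum chunks
     * keep helpers from bouncing page section data, per the note above.
     */
    static void __init example_init_range(struct zone *zone, unsigned long spfn,
                                          unsigned long epfn, int max_threads)
    {
        struct padata_mt_job job = {
            .thread_fn   = example_init_chunk,    /* sketch shown earlier */
            .fn_arg      = zone,
            .start       = spfn,
            .size        = epfn - spfn,
            .align       = PAGES_PER_SECTION,
            .min_chunk   = PAGES_PER_SECTION,
            .max_threads = max_threads,
        };

        /* Blocks until every helper has finished its chunks. */
        padata_do_multithreaded(&job);
    }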

Multithreaded page init is only available on kernels configured with DEFERRED_STRUCT_PAGE_INIT, and since performance testing has only been done on x86 systems, that is the only architecture where the feature is currently available. Other architectures are free to override deferred_page_init_max_threads with the per-node thread counts right for them.
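
For example, an architecture enabling the feature could supply an override along these lines, allowing one helper per CPU on the node as the results below suggest. This is a sketch that assumes the hook is handed the node's cpumask, as the x86 implementation is:

    /*
     * Sketch of an architecture override: allow one helper thread per CPU
     * on the node (all SMT CPUs), with a floor of one.  Assumes the hook
     * receives the node's cpumask.
     */
    int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
    {
        return max_t(int, cpumask_weight(node_cpumask), 1);
    }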

The Results

Here are the numbers from all the systems tested. This is the data that led to using all SMT CPUs on a node.

    Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
      2 nodes * 26 cores * 2 threads = 104 CPUs
      384G/node = 768G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   4089.7 (  8.1)         --   1785.7 (  7.6)
       2% (  1)       1.7%   4019.3 (  1.5)       3.8%   1717.7 ( 11.8)
      12% (  6)      34.9%   2662.7 (  2.9)      79.9%    359.3 (  0.6)
      25% ( 13)      39.9%   2459.0 (  3.6)      91.2%    157.0 (  0.0)
      37% ( 19)      39.2%   2485.0 ( 29.7)      90.4%    172.0 ( 28.6)
      50% ( 26)      39.3%   2482.7 ( 25.7)      90.3%    173.7 ( 30.0)
      75% ( 39)      39.0%   2495.7 (  5.5)      89.4%    190.0 (  1.0)
     100% ( 52)      40.2%   2443.7 (  3.8)      92.3%    138.0 (  1.0)

    Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, kvm guest)
      1 node * 16 cores * 2 threads = 32 CPUs
      192G/node = 192G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1988.7 (  9.6)         --   1096.0 ( 11.5)
       3% (  1)       1.1%   1967.0 ( 17.6)       0.3%   1092.7 ( 11.0)
      12% (  4)      41.1%   1170.3 ( 14.2)      73.8%    287.0 (  3.6)
      25% (  8)      47.1%   1052.7 ( 21.9)      83.9%    177.0 ( 13.5)
      38% ( 12)      48.9%   1016.3 ( 12.1)      86.8%    144.7 (  1.5)
      50% ( 16)      48.9%   1015.7 (  8.1)      87.8%    134.0 (  4.4)
      75% ( 24)      49.1%   1012.3 (  3.1)      88.1%    130.3 (  2.3)
     100% ( 32)      49.5%   1004.0 (  5.3)      88.5%    125.7 (  2.1)

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
      2 nodes * 18 cores * 2 threads = 72 CPUs
      128G/node = 256G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1680.0 (  4.6)         --    627.0 (  4.0)
       3% (  1)       0.3%   1675.7 (  4.5)      -0.2%    628.0 (  3.6)
      11% (  4)      25.6%   1250.7 (  2.1)      67.9%    201.0 (  0.0)
      25% (  9)      30.7%   1164.0 ( 17.3)      81.8%    114.3 ( 17.7)
      36% ( 13)      31.4%   1152.7 ( 10.8)      84.0%    100.3 ( 17.9)
      50% ( 18)      31.5%   1150.7 (  9.3)      83.9%    101.0 ( 14.1)
      75% ( 27)      31.7%   1148.0 (  5.6)      84.5%     97.3 (  6.4)
     100% ( 36)      32.0%   1142.3 (  4.0)      85.6%     90.0 (  1.0)

    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
      1 node * 8 cores * 2 threads = 16 CPUs
      64G/node = 64G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1029.3 ( 25.1)         --    240.7 (  1.5)
       6% (  1)      -0.6%   1036.0 (  7.8)      -2.2%    246.0 (  0.0)
      12% (  2)      11.8%    907.7 (  8.6)      44.7%    133.0 (  1.0)
      25% (  4)      13.9%    886.0 ( 10.6)      62.6%     90.0 (  6.0)
      38% (  6)      17.8%    845.7 ( 14.2)      69.1%     74.3 (  3.8)
      50% (  8)      16.8%    856.0 ( 22.1)      72.9%     65.3 (  5.7)
      75% ( 12)      15.4%    871.0 ( 29.2)      79.8%     48.7 (  7.4)
     100% ( 16)      21.0%    813.7 ( 21.0)      80.5%     47.0 (  5.2)

Server-oriented distros that enable deferred page init sometimes run in small VMs, and they still benefit even though the fraction of boot time saved is smaller:

    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
      1 node * 2 cores * 2 threads = 4 CPUs
      16G/node = 16G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --    716.0 ( 14.0)         --     49.7 (  0.6)
      25% (  1)       1.8%    703.0 (  5.3)      -4.0%     51.7 (  0.6)
      50% (  2)       1.6%    704.7 (  1.2)      43.0%     28.3 (  0.6)
      75% (  3)       2.7%    696.7 ( 13.1)      49.7%     25.0 (  0.0)
     100% (  4)       4.1%    687.0 ( 10.4)      55.7%     22.0 (  0.0)

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
      1 node * 2 cores * 2 threads = 4 CPUs
      14G/node = 14G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --    787.7 (  6.4)         --    122.3 (  0.6)
      25% (  1)       0.2%    786.3 ( 10.8)      -2.5%    125.3 (  2.1)
      50% (  2)       5.9%    741.0 ( 13.9)      37.6%     76.3 ( 19.7)
      75% (  3)       8.3%    722.0 ( 19.0)      49.9%     61.3 (  3.2)
     100% (  4)       9.3%    714.7 (  9.5)      56.4%     53.3 (  1.5)

Future Work

This post described page init, the first user of padata's support for multithreaded jobs. All future users will need to be aware of various resource controls, such as cgroup's CPU and cpuset controllers, sched_setaffinity, and NUMA memory policy. There are also some draft patches written that will be part of the next phase, such as ones to run helpers at the highest nice level on the system to avoid disturbing other tasks.

The plan for the immediate future is to get the CPU controller ready to throttle kernel threads.
