News, tips, partners, and perspectives for the Oracle Linux operating system and upstream Linux kernel work

Unbinding Parallel Jobs in Padata

Scott Michael
Director, Software Development

Oracle Linux kernel developer Daniel Jordan contributes this post on enhancing the performance of padata.

padata is a generic framework for parallel jobs in the kernel -- with a twist. It not only schedules jobs on multiple CPUs but also ensures that each job is properly serialized, i.e. finishes in the order it was submitted. This post will provide some background on this somewhat obscure feature of the core kernel and cover recent efforts to enhance its parallel performance in preparation for more multithreading in the kernel.

How Padata Works

padata allows users to create an instance that represents a certain class of parallel jobs, for example IPsec decryption (see pdecrypt in the kernel source). The instance serves as a handle when submitting jobs to padata so that all jobs submitted with the same handle are serialized amongst themselves. An instance also allows for fine-grained control over which CPUs are used to run work, and contains other internal data such as the next sequence number to assign for serialization purposes and the workqueue used for parallelization.

To initialize a job (known cryptically as padata_priv in the code), a pair of function pointers are required, parallel and serial, where parallel is responsible for doing the actual work in a workqueue worker and serial completes the job once padata has serialized it. The user submits the job along with a corresponding instance to the framework via padata_do_parallel to start it running, and once the job's parallel part is finished, the user calls padata_do_serial to inform padata of this. padata_do_serial is currently always called from parallel, but this is not strictly required. padata ensures that a job's serial function is called only when the serial functions of all previously-submitted jobs from the same instance have been called.

Though parallelization is ultimately padata's (and this blog post's) reason for being, its serialization algorithm is the most technically interesting part, so I'll go on a tangent to explain a bit about it. For scalability reasons, padata allocates internal per-CPU queues, and there are three types, parallel, reorder, and serial, where each type is used for a different phase of a padata job's lifecycle. When a job is submitted to padata, it's atomically assigned a unique sequence number within the instance that determines the order its serialization callback runs. The sequence number is hashed to a CPU that is used to select which queue a job is placed on. When the job is preparing to execute its parallel function, it is placed on a parallel per-CPU queue that determines which CPU it runs on (this becomes important later in the post). Using a per-CPU queue allows multiple tasks to submit parallel jobs concurrently with only minimal contention from the atomic op on the sequence number, avoiding a shared lock. When the parallel part finishes and the user calls padata_do_serial, padata then places the job on the reorder queue, again corresponding to the CPU that the job hashed to. And finally, a job is placed on the serial queue once all jobs before it have been serialized.

During the parallel phase, jobs may finish out of order relative to when they were submitted. Nevertheless, each call to padata_do_serial places the job on its corresponding reorder queue and attempts to process the entire reorder queue across all CPUs, which entails repeatedly checking whether the job with the next unserialized sequence number has finished until there are no more jobs left to reorder. These jobs may or may not include the one passed to padata_do_serial because again, jobs finish out of order.

This process of checking for the next unserialized job is the biggest potential bottleneck in all of padata because a global lock is used. Without the lock, multiple tasks might process the reorder queues at once, leading to duplicate serial callbacks and list corruption. However, if all calls to padata_do_serial were to wait on the lock when only one call actually ends up processing all the jobs, the rest of the tasks would be waiting for no purpose and introduce unnecessary latency in the system. To avoid this situation, the lock is acquired with a trylock call, and if a task fails to get the lock, it can safely bail out of padata knowing that a current or future lock holder will take care of its job.

This serialization process is important for the use case that prompted padata to begin with, IPsec. IPsec throughput was a bottleneck in the kernel because a single CPU, the one that the receiving NIC's interrupt ran on, was doing all the work, with the CPU-intensive portion largely consisting of the crypto operations. Parallelization could address this, but serialization was required to maintain the packet ordering that the upper layer protocols required, and getting that right was not an easy task. See this presentation from Steffen Klassert, the original author of padata, for more background.

More Kernel Multithreading

Though padata was designed to be generic, it currently has just the one IPsec user. There are more kernel codepaths that can benefit from parallelization, such as struct page initialization, page clearing in various memory management paths (huge page fallocate, get_user_pages), and page freeing at munmap and exit time. Two previous blog posts and an LWN article on ktask have covered some of these. Recent upstream feedback has called for merging ktask with padata, and the first step in that process is to change where padata schedules its parallel workers.

To that end, I posted a series on the mailing lists, merged for the v5.3 release, that adds a second workqueue per padata instance dedicated to parallel jobs. Earlier in the post, I described padata's per-CPU parallel queues. To assign a job to one of these queues, padata uses a simple round-robin algorithm to hash a job's sequence number to a CPU, and then runs the job bound to that CPU alone. Each successive job submitted to the instance runs on the next CPU. There are two problems with this approach. First, it's not NUMA-aware, so on multi-socket systems, a job may not run locally. Second, on a busy system, a job will likely complete faster if it allows the scheduler to select the CPU within the NUMA node it's run on. To solve both problems, the series uses an unbound workqueue, which is NUMA-aware by default and not bound to a particular CPU (hence the name).

Performance Results

The numbers from tcrypt, a test module in the kernel's crypto layer, look promising. Parts are shown here, see the upstream post for the full data.

Measurements are from a 2-socket, 20-core, 40-CPU Xeon server.

For repeatability, modprobe was bound to a CPU and the serial cpumasks
for both pencrypt and pdecrypt were also restricted to a CPU different
from modprobe's.

  # modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
  # modprobe tcrypt mode=211 sec=1
  # modprobe tcrypt mode=215 sec=1

Busy system (tcrypt run while 10 stress-ng tasks were burning 100% CPU)

                             base                test
                             ----------------    ---------------
speedup    key_sz  blk_sz    ops/sec    stdev    ops/sec   stdev

(pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)

 117.2x       160      16        960       30     112555   24775
 135.1x       160      64        845      246     114145   25124
 113.2x       160     256        993       17     112395   24714
 111.3x       160     512       1000        0     111252   23755
 110.0x       160    1024        983       16     108153   22374
 104.2x       160    2048        985       22     102563   20530
  98.5x       160    4096        998        3      98346   18777
  86.2x       160    8192       1000        0      86173   14480

multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)

 242.2x       160      16       2363      141     572189   16846
 242.1x       160      64       2397      151     580424   11923
 231.1x       160     256       2472       21     571387   16364
 237.6x       160     512       2429       24     577264    8692
 238.3x       160    1024       2384       97     568155    6621
 216.3x       160    2048       2453       74     530627    3480
 209.2x       160    4096       2381      206     498192   19177
 176.5x       160    8192       2323      157     410013    9903

Idle system (tcrypt run by itself)

                             base                test
                             ----------------    ---------------
speedup    key_sz  blk_sz    ops/sec    stdev    ops/sec   stdev

(pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)

   2.5x       160      16      63412    43075     161615    1034
   4.1x       160      64      39554    24006     161653     981
   6.0x       160     256      26504     1436     160110    1158
   6.2x       160     512      25500       40     157018     951
   5.9x       160    1024      25777     1094     151852     915
   5.8x       160    2048      24653      218     143756     508
   5.6x       160    4096      24333       20     136752     548
   5.0x       160    8192      23310       15     117660     481

multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)

   1.0x       160      16     412157     3855     426973    1591
   1.0x       160      64     412600     4410     431920    4224
   1.1x       160     256     410352     3254     453691   17831
   1.2x       160     512     406293     4948     473491   39818
   1.2x       160    1024     395123     7804     478539   27660
   1.2x       160    2048     385144     7601     453720   17579
   1.2x       160    4096     371989     3631     449923   15331
   1.2x       160    8192     346723     1617     399824   18559

A few tools were used in the initial performance analysis to confirm the source of the speedups. I'll show results from one of them, ftrace. Custom kernel events were added to record the runtime and CPU number of each crypto request, which runs a padata job under the hood. For analysis only (not the runs that produced these results), the threads of the competing workload stress-ng were bound to a known set of CPUs, and two histograms were created of crypto request runtimes, one for just the CPUs without the stress-ng tasks ("uncontended") and one with ("contended"). The histogram clearly shows increased times for the padata jobs with contended CPUs, as expected:

Crypto request runtimes (usec) on uncontended CPUs

# request-count: 11980; mean: 41; stdev: 23; median: 45

    runtime (usec)     count
    --------------   --------
         0 -     1   [     0]:
         1 -     2   [     0]:
         2 -     4   [     0]:
         4 -     8   [   209]: *
         8 -    16   [  3630]: *********************
        16 -    32   [   188]: *
        32 -    64   [  6571]: **************************************
        64 -   128   [  1381]: ********
       128 -   256   [     1]:
       256 -   512   [     0]:
       512 -  1024   [     0]:
      1024 -  2048   [     0]:
      2048 -  4096   [     0]:
      4096 -  8192   [     0]:
      8192 - 16384   [     0]:
     16384 - 32768   [     0]:

Crypto request runtimes (usec) on contended CPUs
# request-count: 3991; mean: 3876; stdev: 455; median 3999

    runtime (usec)     count
    --------------   --------
         0 -     1   [     0]:
         1 -     2   [     0]:
         2 -     4   [     0]:
         4 -     8   [     0]:
         8 -    16   [     0]:
        16 -    32   [     0]:
        32 -    64   [     0]:
        64 -   128   [     0]:
       128 -   256   [     0]:
       256 -   512   [     4]:
       512 -  1024   [     4]:
      1024 -  2048   [     0]:
      2048 -  4096   [  3977]: **************************************
      4096 -  8192   [     4]:
      8192 - 16384   [     2]:
     16384 - 32768   [     0]:


Now that padata has unbound workqueue support, look out for further enhancements to padata in coming releases! Next steps include creating padata threads in cgroups so they can be properly throttled and adding multithreaded job support to padata.

Join the discussion

Comments ( 2 )
  • Erman Arslan Friday, February 14, 2020
    Very interesting and useful blog post for the ones who wants to understand the internals in a way and expanding their perspectives.

    According to my understanding,
    Jobs are submitted to parallel workers(through parallel queues) and thus they are executed in parallel in different cpus..
    Each job is placed in the reorder queue(corresponding to the CPU that the job is executed on), when its parallel processing is completed.
    So let's suppose we have a job with the name X.
    When X completes its parallel processing and enters into the reorder queue, it is checked if there are any jobs which is finished earlier than X and still not placed in the serial queue and still not serialized there by the serialization fuction.
    -- I think we change this serialization term to sorting, as the code give a sequence number to the jobs, and it seems it sorts them accordingly in the serialization fuction.
    This serialization function's job is to reorder the entries(the outputs, the jobs) placed in the serial queue and thus modifying the serial queue when a new job is placed there. (until the last job)
    This means that, the entries in the serial queue is actually ordered in any given time. So if there are no jobs which is finished earlier than X and serialized in the serial queue, then X is placed in serial queue and serialization function runs again to re-serialize the entries of the serial queue by taking this X into account.
    (Probably, the client code is clearing out the serial queue at the same time by consuming the correctly ordered data continously ..)
    If this serialization is run during the parallel processing, then this means, although this serialization seems to be a serial work here, it is actually happening at the same time while parallel processing is happening.
    Ofcourse it is still serial, but it is also partially parallel in a way. (probably sorting during parallel processing inside the Oracle Database is working in the same way)
    In my opinion, the sorting algorithm used in the serialization function is also important here. This partial parallelism for the serialization may be increased using parallel sorts there.

    This my understanding about pdata and its mechanism.

    Please shed a light on this matter and please correct me if I m wrong with my inferences above.

    Thank you.
    Erman Arslan
    Oracle ACE
  • Daniel Jordan Wednesday, February 19, 2020
    Thanks for your comment, Erman.

    Yes, you can think of the serialization step as sorting, that's exactly what's going on, and the serialization can indeed happen at the same time as other parallel jobs. Both of these terms--sorting and serialization--describe the code.

    The serialization is done by adding finished jobs to a per-CPU list, so it's actually already sorting in parallel.

Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.