Oracle Linux kernel developer Daniel Jordan contributes this post on enhancing the performance of padata.
padata is a generic framework for parallel jobs in the kernel -- with a twist. It not only schedules jobs on multiple CPUs but also ensures that each job is properly serialized, i.e. finishes in the order it was submitted. This post will provide some background on this somewhat obscure feature of the core kernel and cover recent efforts to enhance its parallel performance in preparation for more multithreading in the kernel.
padata allows users to create an instance that represents a certain class of parallel jobs, for example IPsec decryption (see pdecrypt in the kernel source). The instance serves as a handle when submitting jobs to padata so that all jobs submitted with the same handle are serialized amongst themselves. An instance also allows for fine-grained control over which CPUs are used to run work, and contains other internal data such as the next sequence number to assign for serialization purposes and the workqueue used for parallelization.
To initialize a job (known cryptically as padata_priv in the code), a pair of function pointers is required, parallel and serial, where parallel is responsible for doing the actual work in a workqueue worker and serial completes the job once padata has serialized it. The user submits the job, along with a corresponding instance, to the framework via padata_do_parallel to start it running, and once the job's parallel part is finished, the user calls padata_do_serial to inform padata of this. padata_do_serial is currently always called from parallel, but this is not strictly required. padata guarantees that a job's serial function is called only after the serial functions of all previously-submitted jobs from the same instance have been called.
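As a concrete illustration, here is a minimal sketch of what a padata user looks like under this API. The names my_job, my_parallel, and my_serial are hypothetical, and the exact signature of padata_do_parallel has varied across kernel releases, so treat this as the shape of a user rather than copy-paste code:

```c
#include <linux/padata.h>

struct my_job {
	struct padata_priv padata;	/* embedded padata job handle */
	/* ... job-specific data ... */
};

/* Runs in a workqueue worker; does the CPU-intensive work. */
static void my_parallel(struct padata_priv *padata)
{
	struct my_job *job = container_of(padata, struct my_job, padata);

	/* ... do the actual work on job ... */

	/* Tell padata the parallel part is finished. */
	padata_do_serial(padata);
}

/* Called only after all earlier jobs' serial functions have run. */
static void my_serial(struct padata_priv *padata)
{
	struct my_job *job = container_of(padata, struct my_job, padata);

	/* ... complete the job, in submission order ... */
}

/* Submission, given an instance pinst and a callback CPU cb_cpu: */
	job->padata.parallel = my_parallel;
	job->padata.serial   = my_serial;
	err = padata_do_parallel(pinst, &job->padata, cb_cpu);
```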
Though parallelization is ultimately padata's (and this blog post's) reason for being, its serialization algorithm is the most technically interesting part, so I'll go on a tangent to explain a bit about it. For scalability, padata allocates internal per-CPU queues of three types, parallel, reorder, and serial, each used during a different phase of a padata job's lifecycle. When a job is submitted to padata, it's atomically assigned a unique sequence number within the instance that determines the order in which its serialization callback runs, and that sequence number is hashed to a CPU to select which per-CPU queue the job lands on in each phase. When the job is preparing to execute its parallel function, it is placed on the parallel queue of the CPU it hashed to, which determines where it runs (this becomes important later in the post). Using per-CPU queues allows multiple tasks to submit parallel jobs concurrently with only the minimal contention of the atomic op on the sequence number, avoiding a shared lock. When the parallel part finishes and the user calls padata_do_serial, padata places the job on the reorder queue, again corresponding to the CPU the job hashed to. Finally, the job is placed on the serial queue once all jobs before it have been serialized.
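To make the hashing step concrete, the round-robin hash looks roughly like this (lightly simplified from kernel/padata.c of this era; field and helper names vary by release):

```c
/* Lightly simplified from kernel/padata.c; details vary by release. */
static int padata_cpu_hash(struct parallel_data *pd)
{
	/*
	 * Atomically take the next sequence number, then map it
	 * round-robin onto the CPUs in the parallel cpumask.  This
	 * atomic op is the only shared state touched on submission.
	 */
	unsigned int seq_nr = atomic_inc_return(&pd->seq_nr);
	int cpu_index = seq_nr % cpumask_weight(pd->cpumask.pcpu);

	return padata_index_to_cpu(pd, cpu_index);
}
```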
During the parallel phase, jobs may finish out of order relative to when they were submitted. Nevertheless, each call to padata_do_serial places the job on its corresponding reorder queue and then attempts to process the reorder queues across all CPUs, which entails repeatedly checking whether the job with the next unserialized sequence number has finished, until no jobs are left to reorder. The jobs processed this way may or may not include the one passed to padata_do_serial because, again, jobs finish out of order.
This process of checking for the next unserialized job is the biggest potential bottleneck in all of padata because a global lock is used: without the lock, multiple tasks might process the reorder queues at once, leading to duplicate serial callbacks and list corruption. However, if every call to padata_do_serial waited on the lock when only one call actually ends up processing all the jobs, the rest of the tasks would be waiting to no purpose, adding needless latency to the system. To avoid this, the lock is acquired with a trylock call, and a task that fails to get the lock can safely bail out of padata knowing that a current or future lock holder will take care of its job.
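Condensed, the reorder path has roughly this shape (not the verbatim kernel code; padata_next_in_order and padata_queue_serial are stand-ins for the kernel's internal helpers):

```c
static void padata_reorder(struct parallel_data *pd)
{
	struct padata_priv *padata;

	/*
	 * Only one task may walk the reorder queues at a time.  If the
	 * trylock fails, bail out: a current or future lock holder
	 * will find this job and serialize it for us.
	 */
	if (!spin_trylock_bh(&pd->lock))
		return;

	/* Drain every job that is now next in sequence order. */
	while ((padata = padata_next_in_order(pd)) != NULL)
		padata_queue_serial(pd, padata);  /* hand off for the serial callback */

	spin_unlock_bh(&pd->lock);
}
```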
This serialization process is important for the use case that prompted padata to begin with, IPsec. IPsec throughput was a bottleneck in the kernel because a single CPU, the one that the receiving NIC's interrupt ran on, was doing all the work, with the CPU-intensive portion largely consisting of the crypto operations. Parallelization could address this, but serialization was required to maintain the packet ordering that the upper layer protocols required, and getting that right was not an easy task. See this presentation from Steffen Klassert, the original author of padata, for more background.
Though padata was designed to be generic, it currently has just the one IPsec user. There are more kernel codepaths that can benefit from parallelization, such as struct page initialization, page clearing in various memory management paths (huge page fallocate, get_user_pages), and page freeing at munmap and exit time. Two previous blog posts and an LWN article on ktask have covered some of these. Recent upstream feedback has called for merging ktask with padata, and the first step in that process is to change where padata schedules its parallel workers.
To that end, I posted a series on the mailing lists, merged for the v5.3 release, that adds a second workqueue per padata instance dedicated to parallel jobs. Earlier in the post, I described padata's per-CPU parallel queues. To assign a job to one of these queues, padata uses a simple round-robin algorithm to hash the job's sequence number to a CPU, and then runs the job bound to that CPU alone, so each successive job submitted to the instance runs on the next CPU. There are two problems with this approach. First, it's not NUMA-aware, so on multi-socket systems, a job may not run locally. Second, on a busy system, a job will likely complete sooner if the scheduler is free to choose any CPU in the job's NUMA node rather than being forced onto one particular, possibly busy, CPU. The series solves both problems by using an unbound workqueue, which is NUMA-aware by default and not bound to any particular CPU (hence the name).
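Conceptually, the heart of the change is a per-instance allocation along these lines (a hedged sketch; see the upstream series for the exact flags and naming):

```c
/*
 * Parallel jobs now run on a dedicated unbound workqueue.  WQ_UNBOUND
 * workers are not pinned to a single CPU, and the workqueue code
 * places them NUMA-locally by default.
 */
pinst->parallel_wq = alloc_workqueue("%s_parallel", WQ_UNBOUND, 0, name);
```

With the parallel workqueue unbound, the scheduler, rather than the sequence-number hash, decides where each parallel worker actually runs.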
The numbers from tcrypt, a test module in the kernel's crypto layer, look promising. Parts are shown here; see the upstream post for the full data.
Measurements are from a 2-socket, 20-core, 40-CPU Xeon server.
For repeatability, modprobe was bound to a CPU, and the serial cpumasks for both pencrypt and pdecrypt were also restricted to a CPU different from modprobe's.
# modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
# modprobe tcrypt mode=211 sec=1
# modprobe tcrypt mode=215 sec=1
Busy system (tcrypt run while 10 stress-ng tasks were burning 100% CPU)
base test
---------------- ---------------
speedup key_sz blk_sz ops/sec stdev ops/sec stdev
(pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)
117.2x 160 16 960 30 112555 24775
135.1x 160 64 845 246 114145 25124
113.2x 160 256 993 17 112395 24714
111.3x 160 512 1000 0 111252 23755
110.0x 160 1024 983 16 108153 22374
104.2x 160 2048 985 22 102563 20530
98.5x 160 4096 998 3 98346 18777
86.2x 160 8192 1000 0 86173 14480
multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)
242.2x 160 16 2363 141 572189 16846
242.1x 160 64 2397 151 580424 11923
231.1x 160 256 2472 21 571387 16364
237.6x 160 512 2429 24 577264 8692
238.3x 160 1024 2384 97 568155 6621
216.3x 160 2048 2453 74 530627 3480
209.2x 160 4096 2381 206 498192 19177
176.5x 160 8192 2323 157 410013 9903
Idle system (tcrypt run by itself)
base test
---------------- ---------------
speedup key_sz blk_sz ops/sec stdev ops/sec stdev
(pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)
2.5x 160 16 63412 43075 161615 1034
4.1x 160 64 39554 24006 161653 981
6.0x 160 256 26504 1436 160110 1158
6.2x 160 512 25500 40 157018 951
5.9x 160 1024 25777 1094 151852 915
5.8x 160 2048 24653 218 143756 508
5.6x 160 4096 24333 20 136752 548
5.0x 160 8192 23310 15 117660 481
multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)
1.0x 160 16 412157 3855 426973 1591
1.0x 160 64 412600 4410 431920 4224
1.1x 160 256 410352 3254 453691 17831
1.2x 160 512 406293 4948 473491 39818
1.2x 160 1024 395123 7804 478539 27660
1.2x 160 2048 385144 7601 453720 17579
1.2x 160 4096 371989 3631 449923 15331
1.2x 160 8192 346723 1617 399824 18559
A few tools were used in the initial performance analysis to confirm the source of the speedups; I'll show results from one of them, ftrace. Custom kernel events were added to record the runtime and CPU number of each crypto request, which runs a padata job under the hood (a sketch of such an event appears after the histograms). For the analysis only (not the runs that produced the results above), the threads of the competing stress-ng workload were bound to a known set of CPUs, and two histograms of crypto request runtimes were created: one for the CPUs without stress-ng tasks ("uncontended") and one for the CPUs with them ("contended"). As expected, the histograms clearly show increased runtimes for the padata jobs on contended CPUs:
Crypto request runtimes (usec) on uncontended CPUs
# request-count: 11980; mean: 41; stdev: 23; median: 45
runtime (usec) count
-------------- --------
0 - 1 [ 0]:
1 - 2 [ 0]:
2 - 4 [ 0]:
4 - 8 [ 209]: *
8 - 16 [ 3630]: *********************
16 - 32 [ 188]: *
32 - 64 [ 6571]: **************************************
64 - 128 [ 1381]: ********
128 - 256 [ 1]:
256 - 512 [ 0]:
512 - 1024 [ 0]:
1024 - 2048 [ 0]:
2048 - 4096 [ 0]:
4096 - 8192 [ 0]:
8192 - 16384 [ 0]:
16384 - 32768 [ 0]:
Crypto request runtimes (usec) on contended CPUs
# request-count: 3991; mean: 3876; stdev: 455; median: 3999
runtime (usec) count
-------------- --------
0 - 1 [ 0]:
1 - 2 [ 0]:
2 - 4 [ 0]:
4 - 8 [ 0]:
8 - 16 [ 0]:
16 - 32 [ 0]:
32 - 64 [ 0]:
64 - 128 [ 0]:
128 - 256 [ 0]:
256 - 512 [ 4]:
512 - 1024 [ 4]:
1024 - 2048 [ 0]:
2048 - 4096 [ 3977]: **************************************
4096 - 8192 [ 4]:
8192 - 16384 [ 2]:
16384 - 32768 [ 0]:
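For reference, the custom events mentioned above might be defined along these lines. This is a hypothetical sketch, with crypto_request_complete as an invented name; the events used for the analysis were ad hoc and are not in the upstream tree:

```c
/* In a trace header, alongside the usual TRACE_SYSTEM boilerplate. */
TRACE_EVENT(crypto_request_complete,

	TP_PROTO(unsigned int cpu, u64 runtime_us),

	TP_ARGS(cpu, runtime_us),

	TP_STRUCT__entry(
		__field(unsigned int,	cpu)
		__field(u64,		runtime_us)
	),

	TP_fast_assign(
		__entry->cpu = cpu;
		__entry->runtime_us = runtime_us;
	),

	TP_printk("cpu=%u runtime_us=%llu",
		  __entry->cpu, __entry->runtime_us)
);
```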
Now that padata has unbound workqueue support, look out for further enhancements to padata in coming releases! Next steps include creating padata threads in cgroups so they can be properly throttled and adding multithreaded job support to padata.
According to my understanding,
Jobs are submitted to parallel workers (through the parallel queues), and thus they are executed in parallel on different CPUs.
Each job is placed in the reorder queue (corresponding to the CPU that the job was executed on) when its parallel processing is completed.
So let's suppose we have a job with the name X.
When X completes its parallel processing and enters the reorder queue, it is checked whether there are any jobs which finished earlier than X but are still not placed in the serial queue, still not serialized there by the serialization function.
-- I think we could change this "serialization" term to "sorting", as the code gives a sequence number to the jobs, and it seems to sort them accordingly in the serialization function.
This serialization function's job is to reorder the entries (the outputs, the jobs) placed in the serial queue, thus modifying the serial queue each time a new job is placed there (until the last job).
This means that the entries in the serial queue are actually ordered at any given time. So if there are no jobs which finished earlier than X and are not yet serialized in the serial queue, then X is placed in the serial queue and the serialization function runs again to re-serialize the entries of the serial queue, taking this X into account.
(Probably, the client code is clearing out the serial queue at the same time by continuously consuming the correctly ordered data...)
If this serialization runs during the parallel processing, then although the serialization seems to be serial work here, it is actually happening at the same time as the parallel processing.
Of course it is still serial, but it is also partially parallel in a way. (Probably sorting during parallel processing inside the Oracle Database works in the same way.)
In my opinion, the sorting algorithm used in the serialization function is also important here. This partial parallelism of the serialization could be increased by using parallel sorts there.
This is my understanding of padata and its mechanism.
Please shed some light on this matter, and please correct me if I'm wrong in my inferences above.
Thank you.
Erman Arslan
Oracle ACE
Yes, you can think of the serialization step as sorting; that's exactly what's going on, and the serialization can indeed happen at the same time as other parallel jobs. Both of these terms -- sorting and serialization -- describe the code.
The serialization is done by adding finished jobs to a per-CPU list, so it's actually already sorting in parallel.
Daniel