Moving from UEK5 to UEK6 brought about an unwelcome surprise: an increase in syscall latency on some x86 systems. The root cause, as we will see, was slightly slower evaluation of audit rules, which, given that they are evaluated for every syscall, is not great.
In this post we start off by exploring the root cause, which turns out not to be UEK specific: it impacts upstream kernels as well. Then we detail the fixes and how they take advantage of the speculative, out-of-order nature of the CPU pipeline.
The changes, even though they target low-level optimizations, are quite straightforward, almost trivial.
Execution latency of getpid()[1] increased by about 15% (measured on an Intel Skylake-X system), from 191ns on UEK5 to 217ns on UEK6.
This was measured in the usual way:
clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < large_number; i++)
        syscall(SYS_getpid);
clock_gettime(CLOCK_MONOTONIC, &stop);
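For reference, a minimal standalone version of such a benchmark might look like the sketch below (illustrative only; the iteration count and output format are arbitrary, not the exact harness used for the numbers in this post):

/* getpid.c: measure per-syscall latency of getpid().
 * Build: gcc -O2 -o getpid getpid.c
 * Illustrative sketch; not the exact benchmark used in this post. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
        const long iters = 10 * 1000 * 1000;
        struct timespec start, stop;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (long i = 0; i < iters; i++)
                syscall(SYS_getpid);    /* raw syscall, bypasses any library caching */
        clock_gettime(CLOCK_MONOTONIC, &stop);

        double ns = (stop.tv_sec - start.tv_sec) * 1e9 +
                    (stop.tv_nsec - start.tv_nsec);
        printf("%.1f ns/call\n", ns / iters);
        return 0;
}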
A quick perf record showed that almost all of the increased latency was in audit_filter_syscall(), which was more expensive in UEK6.
Oracle Exadata, where this problem was seen, has 37 audit rules that are evaluated in the syscall path. Since audit only wants to log unusual or exceptional events, the benchmark would evaluate these rules on every iteration but never generate any output. Essentially, purely local computation that became slower without any material changes to the audit code or to the audit rules.
Some Intel Skylake-X parameters that we'll make use of later:
L1-load-latency:  4-6 cycles
L2-load-latency:  14 cycles
L1-cache-size:    32K (512 cachelines: 64 sets, 8 ways each)
ROB size:         224 micro-ops
The parameters are taken from the Intel SDM.
Note: L1/L2 are the respective data-cache levels, and ROB is the Reorder Buffer, where instructions are staged for in-order retirement.
Drilling down with perf stat -d:
UEK5 (191 ns):
# perf stat -d -r 5 ./getpid
# output normalized for a single getpid() call

     677.9    cycles                  #  3.542 GHz
    1635.0    instructions            #  2.40  insn per cycle
     325.0    branches
       0.5    branch-misses           #  0.16% of all branches
     404.0    L1-dcache-loads
       0.4    L1-dcache-load-misses   #  0.10% of all L1-dcache accesses
UEK6 (217ns):
# perf stat -d -r 5 ./getpid
# output normalized for a single getpid() call

     770.4    cycles                  #  3.545 GHz
    1652.0    instructions            #  2.14  insn per cycle
     332.2    branches
       1.5    branch-misses           #  0.45% of all branches
     407.3    L1-dcache-loads
       8.6    L1-dcache-load-misses   #  2.13% of all L1-dcache accesses
Comparing the two, this is an increase of ~100 cycles, with the L1d-load and instruction counts almost identical across UEK5 and UEK6. This underscores the fact that the audit code, which forms the bulk of the instructions executed, hasn’t changed all that much.
The IPC is commensurately lower[2]. The proximal cause seems to be the increased L1d-load-misses and the one extra branch-miss.
These observations were confirmed via enough non-correlated runs (with an intervening reboot for each) to be statistically significant. The L1d-load-miss numbers are somewhat variable across boot cycles, but the trend is close to what we see above.
audit_filter_syscall()
From perf record we know that the bulk of the increased runtime went to audit_filter_syscall(). The procedure itself is primarily a loop that walks the list of rules, calling audit_in_mask() for each rule to check if it needs to be evaluated for the current syscall. For getpid() the answer will be false most of the time (32 of 37 times.)
audit_filter_syscall(...)
{
        struct audit_entry *e;
        struct audit_context *ctx;

        list = &audit_filter_list[AUDIT_FILTER_EXIT];

        list_for_each_entry_rcu(e, list, list) {
                if (audit_in_mask(&e->rule, ctx->major) &&
                    audit_filter_rules(tsk, &e->rule, ctx, NULL,
                                       &state, false, x)) {
                        rcu_read_unlock();
                        ctx->current_state = state;
                        return state;
                }
        }
}

audit_in_mask(const struct audit_krule *rule, unsigned long val)
{
        if (val > 0xffffffff)
                return false;

        /*
         * val contains the current syscall number. AUDIT_WORD does
         * some bit shifting on it.
         */
        word = AUDIT_WORD(val);
        if (word >= AUDIT_BITMASK_SIZE)
                return false;

        bit = AUDIT_BIT(val);

        /*
         * The load in rule->mask[word] depends on the audit_krule (which
         * hangs off the current rule entry) and the syscall number.
         */
        return rule->mask[word] & bit;
}

audit_filter_rules(...)
{
        /*
         * Large switch statement which we ignore for the rest of this
         * analysis because, as we will see later, loads executed in it don't
         * have an "interesting" alignment and so their latency should be easy
         * enough to hide.
         */
}
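For context, AUDIT_WORD() and AUDIT_BIT() just index into the per-rule syscall bitmap rule->mask[]: the word is the syscall number divided by 32, the bit is the remainder. A sketch of the equivalent computation (the actual macro definitions live in include/uapi/linux/audit.h; the helper names below are mine):

/* Equivalent of AUDIT_WORD()/AUDIT_BIT(): rule->mask[] is a bitmap with one
 * bit per syscall number; val is the syscall number (ctx->major). */
static inline unsigned int audit_word(unsigned long val)
{
        return (unsigned int)val / 32;          /* which u32 word of mask[] */
}

static inline unsigned int audit_bit(unsigned long val)
{
        return 1u << ((unsigned int)val % 32);  /* which bit within that word */
}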
Next let’s look at the data structures accessed in the audit_filter_syscall() loop and where the L1d-load-misses might be coming from.
/* Data structure layout annotated with size and cacheline occupancy
 * information using pahole. */

struct audit_entry {            /* via audit_filter_list[AUDIT_FILTER_EXIT] */
        struct list_head        list;   /*     0    16 */
        struct callback_head    rcu;    /*    16    16 */
        struct audit_krule      rule;   /*    32   376 */
        ...
        /* size: 408, cachelines: 7, members: 3 */
        /* last cacheline: 24 bytes */
};

struct audit_krule {            /* inlined in struct audit_entry */
        ...
        u32                     mask[64];       /*    16   256 */
        ...
        /* size: 376, cachelines: 6, members: 17 */
        /* last cacheline: 56 bytes */
};

struct audit_context {
        ...
        int                     major;          /*    20     4 */
        ...
        /* size: 920, cachelines: 15, members: 46 (slightly larger on UEK6) */
        /* sum members: 912, holes: 2, sum holes: 8 */
        /* last cacheline: 24 bytes */
};
The effective execution loop in audit_filter_syscall() (with cacheline access annotations):
struct audit_entry *e = &audit_filter_list[AUDIT_FILTER_EXIT];

for_each_iteration {
        e = e->next;                            /* cacheline-0 of audit_entry */

        if (e == list)
                jmp out;

        if (audit_in_mask(e->rule.mask,         /* cacheline-0 of audit_entry */
                          ctx->major))          /* cacheline-0 of audit_context */
                audit_filter_rules(e->rule);
}
out:
As the annotations above mention, there are a total of three loads:
1. e->next: the first cacheline of struct audit_entry.
2. e->rule.mask[]: accesses the same cacheline as load (1) above.
3. ctx->major: accesses the first cacheline of struct audit_context.

Loads (1) and (2) will access a total of 37 cachelines, corresponding to one rule per iteration. Also notice that every single basic block in the rest of the iteration (apart from some error checking in audit_in_mask()) has a data dependence on the evaluation of e = e->next. Worse, this is a loop-carried dependency, so each iteration depends on the previous one.
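To see why the loop-carried dependence matters, compare pointer-chasing with walking an array (a hypothetical sketch, unrelated to the audit code): in the former, every load address is the value produced by the previous load, so the loads serialize at L1 latency per iteration; in the latter, addresses are known up front and the loads can overlap.

/* Hypothetical illustration of a loop-carried dependence. */
struct node { struct node *next; int payload; };

/* Linked-list walk: the address of the next load is the *value* returned
 * by the previous load, so consecutive loads cannot overlap. */
int sum_list(const struct node *n)
{
        int sum = 0;

        while (n) {
                sum += n->payload;
                n = n->next;    /* next iteration must wait for this load */
        }
        return sum;
}

/* Array walk: addresses are independent of loaded values, so the CPU can
 * issue many loads in parallel and hide their latency. */
int sum_array(const int *a, int len)
{
        int sum = 0;

        for (int i = 0; i < len; i++)
                sum += a[i];    /* a[i+1], a[i+2], ... can load concurrently */
        return sum;
}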
The cacheline for load (3) is accessed once every iteration. This load is unnecessary: ctx->major contains the syscall number, which is constant for the duration of the syscall. However, because the compiler’s alias analysis cannot prove that ctx->major is not modified elsewhere, it does not get cached in a register. This also means that audit_in_mask() will do its out-of-bounds validation checks on ctx->major over and over.
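The compiler's problem can be seen in miniature in the hypothetical sketch below (none of these names are from the kernel): because the pointer escapes into an opaque call, GCC must assume the field may change across that call and reload it on every iteration, unless the programmer caches it in a local.

/* Hypothetical sketch of the aliasing problem. */
struct ctx { int major; };

void maybe_modifies(struct ctx *c);     /* opaque to the compiler */

int count_matches(struct ctx *c, const int *vals, int n)
{
        int matches = 0;

        for (int i = 0; i < n; i++) {
                /* c->major must be reloaded every iteration: the compiler
                 * cannot prove maybe_modifies() leaves it untouched. */
                if (vals[i] == c->major) {
                        maybe_modifies(c);
                        matches++;
                }
        }
        return matches;
}

int count_matches_cached(struct ctx *c, const int *vals, int n)
{
        /* Safe only because *we* know the callee never changes c->major;
         * the value now lives in a register for the whole loop. */
        const int major = c->major;
        int matches = 0;

        for (int i = 0; i < n; i++) {
                if (vals[i] == major) {
                        maybe_modifies(c);
                        matches++;
                }
        }
        return matches;
}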
Recalling the perf stat -d output above, there are a total of around 400 L1d-loads for each getpid() call. Of those, the loop does a total of 37*3 loads, which map to 38 unique cachelines.
Alright, I hear you think: granted, walking linked-lists is difficult, there are a lot of cachelines in a lot of iterations or whatever, life is hard and the compiler doesn’t know what it is doing[3]. Even given all of that, nothing here has changed from UEK5 to UEK6, so none of this explains why UEK6 would incur more L1d-load-misses[4].
Which is true, so that’s next.
From the background above, we know that the loop is pure computation, and purely local computation at that, so code changes elsewhere should have no effect. And there were no significant code changes from UEK5 to UEK6, so the loop is unchanged (which also applies to the generated assembly.)
Now insofar as L1d-load-misses are concerned: the number of cachelines accessed (from about 400 L1d-loads per getpid() call, not all of which are to unique cachelines) is comfortably below the Skylake-X L1d-cache capacity of 512 cachelines. So this loop should not incur any capacity misses.
Which leaves conflict misses as the probable cause[5]. Skylake-X has an 8-way associative L1: if more than 8 loads in the loop map to the same cache-set, some accesses will incur conflict misses.
Accesses in the loop and how they map to cache-sets:
- struct audit_entry: aligns at a 512B boundary, which limits it to cache-sets {0, 8, 16, ..., 56}, for a total of 8*8 cache-slots.
- struct audit_context: aligns at a 1024B boundary, which resolves to cache-sets {0, 16, 32, 48}, for a total of 4*8 cache-slots. As described earlier, this is a single cacheline which competes with a subset of the struct audit_entry cachelines.

Even then, this is 37 cachelines slotted into 64 slots, and another one slotting into 32 of those 64. This should be easy enough to satisfy, assuming that the kernel allocator has a reasonably sane distribution and isn’t skewed towards a particular set of cachelines (or is similarly skewed on both UEK5 and UEK6.)
If allocations for struct audit_entry were distributed uniformly, they would map into cache-sets uniformly, ending up with similar populations across the cache-sets. This would give a cacheline-spread metric of ~0 (obtained by calculating the standard deviation of populations across cache-sets.)
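To make the mapping and the metric concrete, here is a small user-space sketch (hypothetical; not the actual measurement tooling) that buckets allocation addresses into Skylake-X L1d cache-sets and computes the cacheline-spread as the standard deviation of the per-set populations:

/* spread.c -- build with: gcc -O2 -o spread spread.c -lm */
#include <math.h>
#include <stdio.h>

#define CACHELINE       64
#define NSETS           64      /* Skylake-X L1d: 32K / 64B line / 8 ways */

/* L1d cache-set that an address maps to. */
static unsigned int cache_set(unsigned long addr)
{
        return (addr / CACHELINE) % NSETS;
}

/* cacheline-spread: standard deviation of the cache-set populations of the
 * first cacheline of each allocated object. */
static double cacheline_spread(const unsigned long *addrs, int n)
{
        int population[NSETS] = { 0 };
        double mean = 0.0, var = 0.0;

        for (int i = 0; i < n; i++)
                population[cache_set(addrs[i])]++;

        for (int s = 0; s < NSETS; s++)
                mean += population[s];
        mean /= NSETS;

        for (int s = 0; s < NSETS; s++)
                var += (population[s] - mean) * (population[s] - mean);

        return sqrt(var / NSETS);
}

int main(void)
{
        /* 512B-aligned allocations can only land in sets 0, 8, 16, ..., 56. */
        unsigned long rule_addrs[] = { 0x10000, 0x10200, 0x10400, 0x11000, 0x11200 };
        int n = sizeof(rule_addrs) / sizeof(rule_addrs[0]);

        printf("cacheline-spread: %.2f\n", cacheline_spread(rule_addrs, n));
        return 0;
}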
What we see:
cacheline-spread on UEK5: 1.58
cacheline-spread on UEK6: 1.91
(These results are from a large number (> 100) of non-correlated runs. auditd allocates at boot, so this was done by rebooting between each run.)
From these numbers, UEK5 is far from a flat distribution, and UEK6 is somewhat worse, but not dispositively so. Additionally, a slight imbalance will not cause performance degradation: that happens only after cache conflicts kick in, which is after cache-set population crosses the associativity threshold.
To validate this, we measure how well cycles correlate[6] with 1) L1d-misses, and 2) cacheline-spread:
For both UEK5 and UEK6, “cycles:L1d-misses” is tightly correlated (though the value of 0.74 for both is happenstance), which makes sense. “cycles:cacheline-spread”, however, is well correlated only on UEK6, not UEK5. This suggests that the UEK6 allocator skew is meaningfully worse, enough to cause lower performance.
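For reference, the correlation here is the plain Pearson coefficient over the per-boot samples (footnote [6]); a minimal sketch of the computation, assuming the samples have already been collected into arrays:

#include <math.h>

/* Pearson correlation coefficient between samples x[] and y[]. */
double pearson(const double *x, const double *y, int n)
{
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;

        for (int i = 0; i < n; i++) {
                sx  += x[i];
                sy  += y[i];
                sxx += x[i] * x[i];
                syy += y[i] * y[i];
                sxy += x[i] * y[i];
        }
        return (n * sxy - sx * sy) /
               sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
}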
Alright, having beaten this dead horse enough, let’s figure out how to fix it next[7].
To get back our lost performance, our task is simple: optimize a hot-loop[8] which is itself executed in the hot syscall path. Compounding the problem, the critical load in the loop is accessed via a linked list.
Stated like that, it sounds pretty bad. But, as we will see, the structure of the problem helps quite a bit:
A no-op loop implies that the code does not actually care about most of the values it computes. It just inches towards a foregone conclusion.
This it does (as all code does) by means of dependency chains that transform the input state to output. Here, most dependency chains are short and are really only used to predict the control flow. The only long dependency chain, woven through all the loop iterations, is the one walking the linked-list.
Now, critically, since the branches are predicted perfectly or almost so, the control flow can run quite a bit ahead of the loads and any dependent computation. The control flow thus essentially feeds these loads and other instructions to the ROB, where they wait until resources/dependencies become available and compute the output of their chain which, to reiterate, will only be used to predict the control flow.
Given that the control flow is already feeding instructions from the correct direction, these are in effect orphan chains that eventually retire without anyone having cared for the output they compute or how long that took.
Except: this happy state continues only until we run into a resource constraint. For instance, the size of the ROB on Skylake-X is 224 entries and each loop iteration is ~20 instructions. This means instructions worth around 10 loop iterations can be present in the ROB. Now, given that instructions retire on x86 in-order, long running instructions (L1d-load-misses of course, but also L1d-load hits[9]) with long dependence chains would slow retirement down, even were control-flow to be predicted perfectly.
Bearing these observations in mind, our fixes will try to reduce the amount and cost of work per loop iteration. This allows the loop to retire as close as possible to the gating latency of any long-running instructions in it.
Cache ctx->major in audit_filter_syscall()
@@ -785,13 +785,14 @@ static enum audit_state audit_filter_syscall(struct task_struct *tsk,
 {
        struct audit_entry *e;
        enum audit_state state;
+       unsigned long major = ctx->major;

        if (auditd_test_task(tsk))
                return AUDIT_DISABLED;

        rcu_read_lock();
        list_for_each_entry_rcu(e, list, list) {
-               if (audit_in_mask(&e->rule, ctx->major) &&
+               if (audit_in_mask(&e->rule, major) &&
                    audit_filter_rules(tsk, &e->rule, ctx, NULL,
                                       &state, false)) {
                        rcu_read_unlock();
Caching ctx->major in a local variable helps in two ways:

1. audit_in_mask() operates on ctx->major, doing some bit-shifting and error checking. Now that the compiler knows that major is not modified, it can hoist most of that logic out of the loop so it is not reevaluated over and over in every loop iteration (see the sketch after the results below).
2. struct audit_context has similar natural alignment concerns as struct audit_entry. Allowing the compiler to cache ctx->major in a register (or on the stack) reduces one potential source of contention.

With this change the number of instructions executed per loop iteration drops by 8 (of 20.) Note that most of those were almost-free ALU instructions.
L1d-loads: we removed one L1d-load but added two (due to the compiler now spilling and reloading some state to/from the stack.) However, given that stack accesses are much less likely to have conflicting alignment constraints, the increased loads are less of a concern than the one we got rid of.
cycles: improve by about 40 cycles. This is because the greater room in the ROB allows our almost perfect branch prediction to speculatively run even further ahead of other instructions.
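The hoisting in point (1) is, conceptually, the following transformation (pseudo-code sketch in the style of the earlier listings, not the actual generated code):

/* Pseudo-code: with 'major' known to be loop-invariant, the audit_in_mask()
 * range checks and the word/bit selection can be evaluated once, up front. */
unsigned long major = ctx->major;
unsigned int word = 0, bit = 0;
bool in_range = (major <= 0xffffffff);

if (in_range) {
        word = AUDIT_WORD(major);
        in_range = (word < AUDIT_BITMASK_SIZE);
        bit = AUDIT_BIT(major);
}

list_for_each_entry_rcu(e, list, list) {
        /* per-iteration work shrinks to the list walk, one mask load and a test */
        if (in_range && (e->rule.mask[word] & bit) &&
            audit_filter_rules(tsk, &e->rule, ctx, NULL, &state, false)) {
                /* matched: handle as before */
        }
}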
Change in latency for UEK6:
From the min-max range, there is a rather large variation in latency that’s caused by variations in allocation resulting in high or low cacheline-spread. In almost all cases though, the latency improves by ~10ns or thereabouts.
That said, after removing 8 instructions and one load (and adding two less consequential loads), the performance gain is rather minuscule: ~1 cycle/iteration. It’s just that the loop executes 37 times, so we make it up in volume.
More details (perf-stat and the before/after versions of the generated code) in UEK6 commit-1 and in Upstream commit-1.
Annotate audit_in_mask() as unlikely()
@@ -790,12 +790,13 @@ static enum audit_state audit_filter_syscall(struct task_struct *tsk,

        rcu_read_lock();
        list_for_each_entry_rcu(e, list, list) {
-               if (audit_in_mask(&e->rule, major) &&
-                   audit_filter_rules(tsk, &e->rule, ctx, NULL,
-                                      &state, false)) {
 ...
+               if (unlikely(audit_in_mask(&e->rule, major))) {
+                       if (audit_filter_rules(tsk, &e->rule, ctx, NULL,
+                                              &state, false)) {
Annotate audit_in_mask() as unlikely() to allow the compiler to pessimize the call to audit_filter_rules() (a short sketch of what unlikely() does follows the results below). Two reasons for this change:
L1d-loads: reduce by 2 for each loop iteration. This is because the spills and reloads introduced in the “Cache ctx->major …” patch have now been shifted to the unlikely path (the prologue and epilogue of the audit_filter_rules() call.)
cycles: performance improves on average by ~30 cycles/call.
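For reference, unlikely() is (roughly) a wrapper around GCC's __builtin_expect(); the hypothetical sketch below shows the kind of layout decision it enables, with the cold block and its surrounding spill/reload code moved off the hot, fall-through path:

/* Roughly what the kernel's unlikely() expands to: */
#define unlikely(x)     __builtin_expect(!!(x), 0)

void handle_rare_case(int i);   /* hypothetical cold-path helper */

void process(const int *items, int n)
{
        for (int i = 0; i < n; i++) {
                /* The guarded block (and the register spills/reloads around
                 * the call) is laid out away from the loop's hot path. */
                if (unlikely(items[i] < 0))
                        handle_rare_case(i);
        }
}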
Change in latency for UEK6:
More details (perf-stat and the before/after versions of the generated code) in UEK6 commit-2.
Avoid const-propagation in audit_filter_syscall()
@@ -777,7 +777,7 @@ static bool audit_in_mask(const struct audit_krule *rule, unsigned long
  * also not high enough that we already know we have to write an audit
  * record (i.e., the state is AUDIT_SETUP_CONTEXT or AUDIT_BUILD_CONTEXT).
  */
-static enum audit_state audit_filter_syscall(struct task_struct *tsk,
+enum audit_state audit_filter_syscall(struct task_struct *tsk,
                                             struct audit_context *ctx,
                                             struct list_head *list)
audit_filter_syscall() is only used locally in the file and so is marked static. Additionally, it’s only ever called with a fixed list value of &audit_filter_list[AUDIT_FILTER_EXIT].
GCC’s constant propagation pass makes use of these two things to, quite reasonably, const-propagate the third argument to the point of use.
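The same interprocedural constant propagation can be reproduced with a toy example (hypothetical; compile with gcc -O2 -S and inspect the assembly — depending on inlining heuristics, GCC may emit a specialized walk.constprop.N clone with the address of head baked in as an immediate, much like the kernel case below):

/* toy.c -- gcc -O2 -S toy.c */
struct node { struct node *next; long val; };

static struct node head = { &head, 0 };         /* the only list ever passed in */

/* noinline so the effect shows up as a clone rather than being inlined away */
static __attribute__((noinline)) long walk(struct node *list)
{
        long sum = 0;

        for (struct node *n = list->next; n != list; n = n->next)
                sum += n->val;  /* exit check can become: cmp $head, %reg */
        return sum;
}

long total(void)
{
        return walk(&head);     /* constant argument + internal linkage */
}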
In audit_filter_syscall(), this const-propagation causes the exit check in the list_for_each loop to look like this:
audit_filter_syscall.constprop.18(task, ctx):
   0:   48 8b 1b                mov    (%rbx),%rbx
   3:   48 81 fb e0 67 ac 82    cmp    $0xffffffff82ac67e0,%rbx
                ffffffff8118b5ed: R_X86_64_32S  audit_filter_list+0x40
  10:   75 e2                   jne    start_iter
while, without const-propagation it would have looked like this:
audit_filter_syscall(task, ctx, list):
   0:   48 8b 1b                mov    (%rbx),%rbx
   3:   4c 39 e3                cmp    %r12,%rbx
   6:   75 e6                   jne    start_iter
Now either one ought to be alright: both the cmp imm32,r and cmp r,r forms are equivalent, with a latency of 1 cycle, and both are a single micro-op each.
The second form of the cmp, however, can be macro-op fused with the jne; I'm not entirely sure if the first form can be[11]. The second form is also denser, though that’s not a concern here.
Disallowing GCC from making assumptions about calling contexts, by removing the static linkage from audit_filter_syscall(), forces it to pass the list parameter in a register, which results in a small performance improvement: ~20 cycles (about 0.5 cycles/loop iteration.)
Change in latency for UEK6:
More details (perf-stat and the before/after versions of the generated code) in UEK6 commit-3 and Upstream commit-3.
The audit subsystem is fairly stable in the Linux kernel, not given to frequent changes. So it was puzzling when it became slower in recent kernels and, because a primary user is the syscall path, concerning[12].
The cause turned out to be higher skew in allocated buffers, which results in a more lopsided cache-set distribution.
The fixes compensate for the higher costs in the loop by taking advantage of the peculiarities of the execution path and optimizing for the speculative nature of the CPU pipeline.
The three patches, in sum, reduce the overhead by about 30ns (~100 cycles).
Final perf stat -d -r 5 numbers go from:
# perf stat -d -r 5 ./getpid
# output normalized for a single getpid() call

cycles                    761.65  ( +-  5.22% )
instructions             1639.17  ( +-  0.00% )
IPC                          2.18  ( +-  5.50% )
branches                   328.21  ( +-  0.00% )
branch-misses                1.37  ( +-  6.56% )
L1-dcache-loads            404.35  ( +-  0.00% )
L1-dcache-load-misses        7.99  ( +- 70.71% )
to:
# perf stat -d -r 5 ./getpid
# output normalized for a single getpid() call

cycles                    669.09  ( +- 11.23% )
instructions              1342.04  ( +-  0.00% )
IPC                          2.03  ( +-  9.85% )
branches                   328.19  ( +-  0.00% )
branch-misses                0.56  ( +-  5.35% )
L1-dcache-loads            384.31  ( +-  0.00% )
L1-dcache-load-misses        5.77  ( +- 84.57% )
This compares quite well to the UEK5-baseline:
# perf stat -d -r 5 ./getpid
# output normalized for a single getpid() call

cycles                    672.90  ( +-  1.65% )
instructions              1622.08  ( +-  0.00% )
IPC                          2.41  ( +-  1.65% )
branches                   321.20  ( +-  0.00% )
branch-misses                0.51  ( +-  0.00% )
L1-dcache-loads            401.32  ( +-  0.00% )
L1-dcache-load-misses        2.28  ( +- 59.62% )
Note for non-Skylake-X architectures: Intel Icelake and AMD Milan (the other architectures tested) cope with L1d-load-misses much better, so their baseline performance is also better.
With these patches, they only show a small improvement (~10ns): Icelake has a bigger L1d-cache (48K) and a much bigger ROB. Milan also has a bigger ROB and does memory renaming and a bunch of other pipeline optimizations that limit the effect of these changes.
Endnote: what I found personally instructive was how much C really is “a portable assembler” and the significant codegen (and performance) changes that can result from minimal changes to the code.
[1] getpid() has a minimal kernel execution path (it only does a PID lookup), and so is generally used to measure the overhead of the syscall path.
[2] Comparing the IPC for the audit-only portion shows a starker drop:

UEK5: 1427.0 instructions      # 3.41 insn per cycle
UEK6: 1432.0 instructions      # 2.84 insn per cycle
[3] Alas no, alias analysis is an undecidable problem.
[4] Or for that matter, what causes the extra branch-miss.
[5] Another possibility is out-of-line code -- frequent interrupts, vmexits etc -- trashing the cache but from profiling these were a non-issue.
[6] Measured using the pearson-quotient(x, y): correlation coefficient between quantities x and y.
[7] You might notice that this analysis does not address the extra branch-miss. That's because I still have no clue what causes it.
[8] The correct fix would be to fix whatever ails the allocator. However, from a quick look at the changes that have gone into related code, it seems non-trivial to find a particular commit which points to the root cause of the skew (especially given that the skew is not constant, but varies from run-to-run.) Also, notably, the fixes described below also apply to UEK5, which means that even if UEK6 becomes faster, UEK5 will also improve somewhat.
[9] As mentioned in CPU-parameters, L1d-loads take 4-6 cycles on Skylake-X. We also know that in the good case (UEK5), this loop is capable of an IPC of 3.41 insn per cycle. So, hiding L1d-load latency is critical for good performance.
[10] https://www.agner.org/optimize/instruction_tables.pdf, pg 298 (Skylake-X)
[11] The first form, fused, needs three inputs: %rbx, an imm32 encoding the distance to the address being compared, and an imm8 encoding the distance to the branch-dest; the second needs two registers, %rbx and %r12, and only the imm8.
[12] Just for context, a kernel build (x86-defconfig) makes an aggregate of 27M syscalls, with a syscall every 44us.