Syscall latency... and some uses of speculative execution

September 12, 2023 | 23 minute read

Introduction

Moving from UEK5 to UEK6 brought about an unwelcome surprise: an increase in syscall latency on some x86 systems. The root cause, as we will see, was slightly slower evaluation of audit rules, which, given that they are evaluated for every syscall, is not great.

In this post we start off by exploring the root cause, which turns out not to be UEK specific: it impacts upstream kernels as well. Then we detail the fixes and how they take advantage of the speculative, out-of-order nature of the CPU pipeline.

The changes, even though they target low-level optimizations, are quite straightforward, almost trivial.

Background

Execution latency of the getpid() syscall[1] increased by about 15% (measured on an Intel Skylake-X system): from 191ns on UEK5 to 217ns on UEK6.

This was measured in the usual way:

clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < large_number; i++)
        syscall(SYS_getpid);
clock_gettime(CLOCK_MONOTONIC, &stop);
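For completeness, a minimal self-contained version of the above might look like the sketch below (iteration count and any warmup/averaging are left to taste):

/* Minimal sketch of the measurement; not the exact benchmark used. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
        const long iters = 10 * 1000 * 1000;
        struct timespec start, stop;
        long long ns;
        long i;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (i = 0; i < iters; i++)
                syscall(SYS_getpid);
        clock_gettime(CLOCK_MONOTONIC, &stop);

        ns = (stop.tv_sec - start.tv_sec) * 1000000000LL +
             (stop.tv_nsec - start.tv_nsec);
        printf("getpid: %lld ns/call\n", ns / iters);
        return 0;
}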

A quick perf record showed that almost all of the increased latency was in audit_filter_syscall(), which was more expensive on UEK6.

Oracle Exadata, where this problem was seen, has 37 audit rules that are evaluated in the syscall path. Since audit only wants to log unusual or exceptional events, the benchmark would evaluate these rules in every iteration but never generate any output. Essentially, this is purely local computation that became slower without any material changes to the audit code or to the audit rules.

CPU-parameters

Some Intel Skylake-X parameters that we'll make use of later:

L1-load-latency: 4-6 cycles
L2-load-latency: 14 cycles 
L1-cache-size: 32K (512 cachelines: 64 sets, 8 ways each)

ROB size: 224 micro-ops

The parameters are taken from the Intel SDM.

Note: L1/L2 are the respective data-cache levels, and the ROB is the Reorder Buffer, where instructions are staged for in-order retirement.
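On a live system the same geometry can be cross-checked from sysfs (index0 is usually the L1 data cache, but the level/type files confirm which cache it is):

# grep . /sys/devices/system/cpu/cpu0/cache/index0/{level,type,number_of_sets,ways_of_associativity,coherency_line_size}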

Root cause analysis

Drilling down with perf stat -d:

UEK5 (191 ns):

# perf stat -d -r 5 ./getpid
# output normalized for a single getpid() call

 677.9    cycles                #  3.542 GHz
1635.0    instructions          #  2.40  insn per cycle
 325.0    branches
   0.5    branch-misses         #  0.16% of all branches
 404.0    L1-dcache-loads
   0.4    L1-dcache-load-misses #  0.10% of all L1-dcache accesses

UEK6 (217ns):

# perf stat -d -r 5 ./getpid
# output normalized for a single getpid() call

 770.4    cycles                #    3.545 GHz
1652.0    instructions          #    2.14  insn per cycle
 332.2    branches
   1.5    branch-misses         #    0.45% of all branches
 407.3    L1-dcache-loads
   8.6    L1-dcache-load-misses #    2.13% of all L1-dcache accesses

Comparing the two, this is an increase of ~100 cycles, with the L1d-load and instruction counts being almost identical across UEK5 and UEK6. This underscores the fact that the audit code, which accounts for the bulk of the instructions executed, hasn't changed all that much.

The IPC is commensurately lower[2]. The proximal cause seems to be the increased L1d-load-misses and the one extra branch-miss.

These observations were confirmed over enough non-correlated runs (with an intervening reboot for each) to be statistically significant. The L1d-load-miss numbers are somewhat variable across boot cycles, but the trend is close to what we see above.

audit_filter_syscall()

From perf record we know that the bulk of the increased runtime went to audit_filter_syscall(). The procedure itself is primarily a loop that walks the list of rules, calling audit_in_mask() for each rule to check if it needs to be evaluated for the current syscall. For getpid() the answer will be false most of the time (32 of 37 times.)

audit_filter_syscall(...) {
    struct audit_entry *e;
    struct audit_context *ctx;
    enum audit_state state;

    list = &audit_filter_list[AUDIT_FILTER_EXIT];

    list_for_each_entry_rcu(e, list, list) {

        if (audit_in_mask(&e->rule, ctx->major) &&
            audit_filter_rules(tsk, &e->rule, ctx, NULL,
                               &state, false)) {
                rcu_read_unlock();
                ctx->current_state = state;
                return state;
        }
    }
    ...
}
 
audit_in_mask(const struct audit_krule *rule, unsigned long val) {
    int word, bit;

    if (val > 0xffffffff)
        return false;

    /*
     * val contains the current syscall number. AUDIT_WORD does
     * some bit shifting on it.
     */
    word = AUDIT_WORD(val);
    if (word >= AUDIT_BITMASK_SIZE)
        return false;

    bit = AUDIT_BIT(val);

    /*
     * The load in rule->mask[word] depends on the audit_krule (which
     * hangs off the current rule entry) and the syscall number.
     */
    return rule->mask[word] & bit;
}
audit_filter_rules(...) {
    /*
     * Large switch statement which we ignore for the rest of this
     * analysis because, as we will see later, loads executed in it don't
     * have an "interesting" alignment and so their latency should be easy
     * enough to hide.
     */
}
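For reference, the AUDIT_WORD()/AUDIT_BIT() helpers used above are simple bit arithmetic on the syscall number, roughly as defined in include/uapi/linux/audit.h:

/* Which 32-bit word of rule->mask[] the syscall number falls into,
 * and the bit within that word. */
#define AUDIT_WORD(nr)  ((__u32)((nr)/32))
#define AUDIT_BIT(nr)   (1 << ((nr) - AUDIT_WORD(nr)*32))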

Memory accesses

Next let’s look at the data structures accessed in the audit_filter_syscall() loop and where the L1d-load-misses might be coming from.

/* Data structure layout annotated with size and cacheline occupancy
 * information using pahole. */

struct audit_entry {    /* via audit_filter_list[AUDIT_FILTER_EXIT] */

        struct list_head           list;                 /*     0    16 */
        struct callback_head       rcu;                  /*    16    16 */
        struct audit_krule         rule;                 /*    32   376 */
        ...
        /* size: 408, cachelines: 7, members: 3 */
        /* last cacheline: 24 bytes */
};

struct audit_krule {    /* inlined in struct audit_entry */
        ...
        u32                        mask[64];             /*    16   256 */
        ...
        /* size: 376, cachelines: 6, members: 17 */
        /* last cacheline: 56 bytes */
};

struct audit_context {
        ...
        int                        major;                /*    20     4 */
        ...
        /* size: 920, cachelines: 15, members: 46 (slightly larger on UEK6) */
        /* sum members: 912, holes: 2, sum holes: 8 */
        /* last cacheline: 24 bytes */
};
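(The layout annotations above come from pahole; given a vmlinux with debug info, something like the following reproduces them:)

# pahole -C audit_entry vmlinux
# pahole -C audit_context vmlinux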

The effective execution loop in audit_filter_syscall() (with cacheline access annotations):

struct audit_entry *e = &audit_filter_list[AUDIT_FILTER_EXIT];

for_each_iteration {
    e = e->next;                    /* cacheline-0 of audit_entry */
    if (e == list)
        jmp out;
    if (audit_in_mask(e->rule.mask, /* cacheline-0 of audit_entry */
                      ctx->major))  /* cacheline-0 of audit_context */
        audit_filter_rules(e->rule);
}
out:

As the annotations above mention, there are a total of three loads:

  1. Pointer chasing in e->next: the first cacheline of struct audit_entry.
  2. e->rule.mask[]: accesses the same cacheline as load (1) above.
  3. ctx->major: accesses the first cacheline of struct audit_context.

Loads (1) and (2) will access a total of 37 cachelines, one rule per iteration. Also notice that every single basic block in the rest of the iteration (apart from some error checking in audit_in_mask()) has a data dependence on the evaluation of e = e->next. Worse, this is a loop-carried dependency, so each iteration depends on the previous one.

The cacheline for load (3) is accessed once every iteration. This load is unnecessary: ctx->major contains the syscall number, which is a constant for the duration of the syscall. However, because the compiler's alias analysis cannot prove that ctx->major is not modified, it does not get cached in a register. This also means that audit_in_mask() will redo the bounds checks related to ctx->major over and over.

Recalling the perf stat -d output above, there are a total of around 400 L1d-loads for each getpid() call. Of those, the loop does a total of 37*3 = 111 loads, which map to 38 unique cachelines.

Alright, I hear you think: granted, walking linked-lists is difficult, there are a lot of cachelines in a lot of iterations or whatever, life is hard and the compiler doesn’t know what it is doing[3]. Even given all of that, nothing here has changed from UEK5 to UEK6, so none of this explains why UEK6 would incur more L1d-load-misses[4].

Which is true, so that’s next.

Theory of the case

From the background above, we know that the loop is pure computation, and purely local computation at that, so code changes elsewhere should have no effect. And there were no significant code changes from UEK5 to UEK6, so the loop is unchanged (which also applies to the generated assembly.)

Now insofar as L1d-load-misses are concerned: the number of cachelines accessed (from about 400 L1d-loads per getpid() call, not all of which are to unique cachelines) amounts to a number comfortably below the Skylake-X L1d-cache capacity of 512 cachelines. So this loop should not incur any capacity misses.

Which leaves conflict misses as the probable cause[5]. Skylake-X has an 8-way set-associative L1d: if more than 8 of the cachelines accessed in the loop map to the same cache-set, some accesses will incur conflict misses.

Accesses in the loop and how they map to cache-sets:

  • struct audit_entry: aligns at a 512B boundary, which limits it to cache-sets {0, 8, 16, ... 56}, for a total of 8*8 cache-slots.
  • struct audit_context: aligns at a 1024B boundary, which resolves to cache-sets {0, 16, 32, 48}, for a total of 4*8 cache-slots. As described earlier, this is a single cacheline which competes with a subset of the struct audit_entry cachelines.

Even then, this is 37 cachelines slotted into 64 slots and another slotting into 32 of those 64. This should be easy enough to satisfy, assuming that the kernel allocator has a reasonably sane distribution and isn’t skewed towards a particular set of cachelines (or is similarly skewed on both UEK5 and UEK6.)
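To make the alignment-to-cache-set mapping concrete, here is a small sketch. It assumes the 64-byte lines and 64 sets from the CPU-parameters section, and the 512B/1024B allocation alignments discussed above; the program itself is purely illustrative:

/* Sketch: which L1d cache-sets a structure of a given alignment can
 * start in, assuming 64-byte lines and 64 sets (Skylake-X). */
#include <stdio.h>

#define LINE_SIZE   64
#define NUM_SETS    64

static unsigned int cache_set(unsigned long addr)
{
        return (addr / LINE_SIZE) % NUM_SETS;
}

int main(void)
{
        unsigned long align;

        /* 512B: audit_entry alignment; 1024B: audit_context alignment. */
        for (align = 512; align <= 1024; align *= 2) {
                printf("alignment %4lu: first cacheline lands in sets:", align);
                for (unsigned long addr = 0; addr < LINE_SIZE * NUM_SETS;
                     addr += align)
                        printf(" %u", cache_set(addr));
                printf("\n");
        }
        return 0;
}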

Allocation skew

If allocations for struct audit_entry were distributed uniformly, they would map into cache-sets uniformly, ending up with similar populations across the cache-sets. This would give a cacheline-spread metric of ~0 (obtained by calculating the standard deviation of the populations across cache-sets.)
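A sketch of how the metric could be computed, assuming the cache-set index of each rule's first cacheline has already been extracted (say, from the allocation addresses via a debug printk or a crash/drgn script; that part is not shown):

/*
 * Sketch of the cacheline-spread metric: the standard deviation of
 * per-cache-set populations of audit_entry first-cachelines.
 * set_of[] holds the cache-set index of each entry; how those indices
 * are collected (and whether the spread is taken over all 64 sets or
 * only the sets reachable given the alignment) is an assumption here.
 */
#include <math.h>

#define NUM_SETS 64

static double cacheline_spread(const unsigned int *set_of, int nrules)
{
        int population[NUM_SETS] = { 0 };
        double mean, var = 0.0;
        int i;

        for (i = 0; i < nrules; i++)
                population[set_of[i] % NUM_SETS]++;

        mean = (double)nrules / NUM_SETS;
        for (i = 0; i < NUM_SETS; i++)
                var += (population[i] - mean) * (population[i] - mean);

        return sqrt(var / NUM_SETS);
}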

What we see:

cacheline-spread on UEK5:   1.58
cacheline-spread on UEK6:   1.91

(These results are from a large number (> 100) of non-correlated runs. auditd allocates at boot, so this was done by rebooting between each run.)

From these numbers, UEK5 is far from a flat distribution, and UEK6 is somewhat worse, but not dispositively so. Additionally, a slight imbalance will not cause performance degradation: that happens only after cache conflicts kick in, which is after cache-set population crosses the associativity threshold.

To validate this, we measure how well cycles correlate[6] with 1) L1d-misses, and 2) cacheline-spread:

 

Kernel    cycles:L1d-misses    cycles:cacheline-spread
UEK5                   0.74                       0.22
UEK6                   0.74                       0.61

 

For both UEK5 and UEK6, “cycles:L1d-misses” is tightly correlated (though the identical value of 0.74 for both is happenstance), which makes sense. “cycles:cacheline-spread”, however, is well correlated only on UEK6, not UEK5. This suggests that the UEK6 allocator skew is meaningfully worse, enough to cause lower performance.
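(For reference, the correlation values are plain Pearson coefficients computed over the per-boot samples; a minimal sketch:)

/* Pearson correlation coefficient between two per-run sample arrays,
 * e.g. x[] = cycles, y[] = L1d-misses (or cacheline-spread). */
#include <math.h>

static double pearson(const double *x, const double *y, int n)
{
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        int i;

        for (i = 0; i < n; i++) {
                sx  += x[i];
                sy  += y[i];
                sxx += x[i] * x[i];
                syy += y[i] * y[i];
                sxy += x[i] * y[i];
        }

        return (n * sxy - sx * sy) /
               sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
}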

Alright, having beaten this dead horse enough, let’s figure out how to fix it next[7].

Speeding it up

To get back our lost performance, our task is simple: optimize a hot-loop[8] which is itself executed in the hot syscall path. Compounding the problem, the critical load in the loop is accessed via a linked list.

Stated like that, it sounds pretty bad. But, as we will see the structure of the problem helps quite a bit:

  1. On a sane system, the common-case is extremely common: syscalls are frequent, and audit logging is unusual. This means that branch mispredictions will be rare, something we might even depend on.
  2. We are optimizing a no-op loop: the loop walks a bunch of rules, does error checking, and decides if it needs to log. In the common-case, it will conclude that it doesn’t. (This is really (1) restated to stress the no-op nature of the loop.)

A no-op loop implies that the code does not actually care about most of the values it computes. It just inches towards a foregone conclusion.

This it does (as all code does) by means of dependency chains that transform the input state to output. Here, most dependency chains are short and are really only used to predict the control flow. The only long dependency chain, woven through all the loop iterations, is the one walking the linked-list.

Now, critically, since the branches are predicted perfectly or almost so, the control flow can run quite a bit further ahead than any loads and dependent computation. The control flow thus essentially feeds these loads and other instructions to the ROB, where they wait until resources/dependencies become available and compute the output from their chain which, to reiterate, will only be used to predict the control flow.

Given that the control flow is already feeding instructions from the correct direction, these are in effect orphan chains that eventually retire without anyone having cared for the output they compute or how long that took.

Except: this happy state continues only until we run into a resource constraint. For instance, the ROB on Skylake-X has 224 entries and each loop iteration is ~20 instructions, which means instructions worth around 10 loop iterations can be present in the ROB. Now, given that x86 retires instructions in-order, long-running instructions (L1d-load-misses of course, but also L1d-load hits[9]) with long dependence chains would slow retirement down, even were the control-flow to be predicted perfectly.

Bearing these observations in mind, our fixes will try to reduce the amount and cost of work per loop iteration. This allows the loop to retire as close as possible to the gating latency of any long-running instructions in the loop.

Cache ctx->major in audit_filter_syscall()

@@ -785,13 +785,14 @@ static enum audit_state audit_filter_syscall(struct task_struct *tsk,
 {
        struct audit_entry *e;
        enum audit_state state;
+       unsigned long major = ctx->major;

        if (auditd_test_task(tsk))
                return AUDIT_DISABLED;

        rcu_read_lock();
        list_for_each_entry_rcu(e, list, list) {
-               if (audit_in_mask(&e->rule, ctx->major) &&
+               if (audit_in_mask(&e->rule, major) &&
                    audit_filter_rules(tsk, &e->rule, ctx, NULL,
                                       &state, false)) {
                        rcu_read_unlock();

Caching ctx->major in a local variable helps in two ways:

  • Explicitly indicates to the compiler that there are no stores to the cached value. audit_in_mask() operates on ctx->major, doing some bit-shifting and error checking. Now that the compiler knows that major is not modified, it can hoist most of that logic out of the loop so it is not reevaluated over and over in every loop iteration (a generic sketch of this effect follows the list.)
  • As described earlier, struct audit_context has similar natural alignment concerns as struct audit_entry. Allowing the compiler to cache ctx->major in a register (or on the stack) reduces one potential source of contention.
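The hoisting effect is the same as in the generic (hypothetical, non-audit) pattern below: with the field read through the pointer, the compiler must assume the opaque call might modify it and so reloads and re-checks it every iteration; with a local copy it does the load and check once.

/* Hypothetical illustration of the hoisting effect, not the audit code. */
struct vals { unsigned long val; };

/* Defined in another translation unit, so opaque to the compiler:
 * for all it knows, process() could modify *v. */
int process(int i);

int count_slow(struct vals *v, int n)
{
        int i, hits = 0;

        for (i = 0; i < n; i++)
                if (v->val < 64 && process(i))  /* reloaded and rechecked */
                        hits++;
        return hits;
}

int count_fast(struct vals *v, int n)
{
        unsigned long val = v->val;             /* loaded and checked once */
        int i, hits = 0;

        for (i = 0; i < n; i++)
                if (val < 64 && process(i))
                        hits++;
        return hits;
}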

With this change, the number of instructions executed per loop iteration drops by 8 (of 20.) Note that most of those were almost-free ALU instructions.

L1d-loads: we removed one L1d-load but added two (due to the compiler now spilling and reloading some state to/from the stack.) However, given that stack accesses are much less likely to have conflicting alignment constraints, the increased loads are less of a concern than the one we got rid of.

cycles: improves by about 40 cycles/call. This is because the greater room in the ROB allows our almost perfect branch prediction to speculatively run even further ahead of other instructions.

Change in latency for UEK6:

Version       Min (ns)   Mean (ns)   Median (ns)   Max (ns)
baseline        196.26      212.00       207.80     240.52
ctx->major      183.50      201.41       198.80     226.93

 

From the min-max range, there is a rather large variation in latency, caused by run-to-run variations in allocation that result in high or low cacheline-spread. In almost all cases though, the latency improves by ~10ns or thereabouts.

That said, after removing 8 instructions and one load (and adding two less consequential loads), the performance gain is rather minuscule: ~1 cycle/iteration. It's just that the loop executes 37 times, so we make it up in volume.

More details (perf-stat and the before/after versions of the generated code) in UEK6 commit-1 and in Upstream commit-1.

Annotate branch direction for audit_in_mask()

@@ -790,12 +790,13 @@ static enum audit_state audit_filter_syscall(struct task_struct *tsk,
         rcu_read_lock();
         list_for_each_entry_rcu(e, list, list) {
-               if (audit_in_mask(&e->rule, major) &&
-                   audit_filter_rules(tsk, &e->rule, ctx, NULL,
-                                      &state, false)) {
                                        ...
+               if (unlikely(audit_in_mask(&e->rule, major))) {
+                       if (audit_filter_rules(tsk, &e->rule, ctx, NULL,
+                                              &state, false)) {

Annotate audit_in_mask() as unlikely() to allow the compiler to pessimize the call to audit_filter_rules(). Two reasons for this change:

  • The primary motivation was to get rid of the extra branch mispred. This change succeeds in that task but it is unclear why: there's no significant change in the basic-block structure. The only change is a branch inversion due to the unlikely clause.
  • The branch inversion means that the not-taken direction is chosen more often: 32/37 times (changing from 5/37 earlier.) This is slightly cheaper: the issue latency for a not-taken branch is 0.5-1 cycles, versus 1-2 cycles for a taken branch[10].
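For reference, unlikely() just wraps GCC's branch-probability hint (as defined in the kernel's include/linux/compiler.h):

/* Tells the compiler the expression is expected to be false, so the
 * code under if (unlikely(...)) is laid out off the fall-through path. */
#define likely(x)       __builtin_expect(!!(x), 1)
#define unlikely(x)     __builtin_expect(!!(x), 0)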

L1d-loads: reduce by 2 for each loop iteration. This is because the spills and reloads introduced in the “Cache ctx->major…” patch have now been shifted to the unlikely path (the prologue and epilogue of the audit_filter_rules() call.)

cycles: performance improves on average by ~30 cycles/call.

Change in latency for UEK6:

Version             Min (ns)   Mean (ns)   Median (ns)   Max (ns)
ctx->major            183.50      201.41       198.80     226.93
ctx->major+annot      165.26      188.72       184.25     230.34

 

More details (perf-stat and the before/after versions of the generated code) in UEK6 commit-2.

Remove static linkage from audit_filter_syscall()

@@ -777,7 +777,7 @@ static bool audit_in_mask(const struct audit_krule *rule, unsigned long
  * also not high enough that we already know we have to write an audit
  * record (i.e., the state is AUDIT_SETUP_CONTEXT or AUDIT_BUILD_CONTEXT).
  */
-static enum audit_state audit_filter_syscall(struct task_struct *tsk,
+enum audit_state audit_filter_syscall(struct task_struct *tsk,
                                              struct audit_context *ctx,
                                              struct list_head *list)

audit_filter_syscall() is only used locally in the file and so is marked static. Additionally, it's only ever called with a fixed list value of &audit_filter_list[AUDIT_FILTER_EXIT].

GCC’s constant propagation pass makes use of these two things to, quite reasonably, const-propagate the third argument to the point of use.

This causes the exit check in the list_for_each loop to look like this:

audit_filter_syscall.constprop.18(task, ctx):
   0:       48 8b 1b                mov    (%rbx),%rbx
   3:       48 81 fb e0 67 ac 82    cmp    $0xffffffff82ac67e0,%rbx
                    ffffffff8118b5ed: R_X86_64_32S  audit_filter_list+0x40
  10:       75 e2                   jne    start_iter

while, without const-propagation it would have looked like this:

audit_filter_syscall(task, ctx, list):
   0:       48 8b 1b                mov    (%rbx),%rbx
   3:       4c 39 e3                cmp    %r12,%rbx
   6:       75 e6                   jne    start_iter

Now either one ought to be alright: both the cmp imm32,r and cmp r,r forms have a latency of 1 cycle, and both are a single micro-op each.

The second form of the cmp, however, can be macro-op fused with the jne; not entirely sure if the first form can be[11]. The second form is also denser, though that’s not a concern here.

Disallowing GCC from making assumptions about calling contexts by removing the static linkage from audit_filter_syscall() forces it to pass the list parameter in a register, which results in a small performance improvement: ~20 cycles (about 0.5 cycles/loop iteration.)

Change in latency for UEK6:

Version                    Min (ns)   Mean (ns)   Median (ns)   Max (ns)
ctx->major+annot             165.26      188.72       184.25     230.34
ctx->major+annot+extern      159.88      184.35       177.62     250.82

 

More details (perf-stat and the before/after versions of the generated code) in UEK6 commit-3 and Upstream commit-3.

Summary

The audit subsystem is fairly stable in the Linux kernel, not given to frequent changes. So it was puzzling when it became slower in recent kernels and, because a primary user is the syscall path, concerning[12].

The cause turned out to be higher skew in the allocated audit buffers, which results in a more lopsided cache-set distribution.

The fixes compensate for the higher costs in the loop by taking advantage of the peculiarities of the execution path and optimizing for the speculative nature of the CPU pipeline.

The three patches, in sum, reduce the overhead by about 30ns (~100 cycles).

The final perf stat -d -r 5 numbers go from:

# perf stat -d -r 5 ./getpid
# output normalized for a single getpid() call

cycles                  761.65  (  +- 5.22% )
instructions           1639.17  (  +- 0.00% )
IPC                       2.18  (  +- 5.50% )
branches                328.21  (  +- 0.00% )
branch-misses             1.37  (  +- 6.56% )
L1-dcache-loads         404.35  (  +- 0.00% )
L1-dcache-load-misses     7.99  (  +- 70.71% )

to:

# perf stat -d -r 5 ./getpid
# output normalized for a single getpid() call

cycles                  669.09  (  +- 11.23% )
instructions           1342.04  (  +-  0.00% )
IPC                       2.03  (  +-  9.85% )
branches                328.19  (  +-  0.00% )
branch-misses             0.56  (  +-  5.35% )
L1-dcache-loads         384.31  (  +-  0.00% )
L1-dcache-load-misses     5.77  (  +- 84.57% )

This compares quite well to the UEK5-baseline:

# perf stat -d -r 5 ./getpid
# output normalized for a single getpid() call

cycles                  672.90  (  +-  1.65% )
instructions           1622.08  (  +-  0.00% )
IPC                       2.41  (  +-  1.65% )
branches                321.20  (  +-  0.00% )
branch-misses             0.51  (  +-  0.00% )
L1-dcache-loads         401.32  (  +-  0.00% )
L1-dcache-load-misses     2.28  (  +- 59.62% )

Note for non-Skylake-X architectures: Intel Icelake and AMD Milan (the other architectures tested) cope with L1d-load-misses much better, so their baseline performance is also better.

With these patches, they only show a small improvement (~10ns): Icelake has a bigger L1d-cache (48K) and a much bigger ROB. Milan also has a bigger ROB and does memory renaming and a bunch of other pipeline optimizations that limit the effect of these changes.

Endnote: what I found personally instructive was how much C really is “a portable assembler” and the significant codegen (and performance) changes that can result from minimal changes to the code.

References

  1. getpid() has a minimal kernel execution path (only does a PID lookup), and so is generally used to measure the overhead of the syscall path.

  2. Comparing the IPC for the audit-only portion shows a starker drop:

    UEK5: 1427.0  instructions  #  3.41  insn per cycle
    UEK6: 1432.0  instructions  #  2.84  insn per cycle
  3. Alas no, alias analysis is an undecidable problem.

  4. Or for that matter, what causes the extra branch-miss.

  5. Another possibility is out-of-line code -- frequent interrupts, vmexits etc. -- trashing the cache, but from profiling these were a non-issue.

  6. Measured using pearson-quotient(x, y): the Pearson correlation coefficient between quantities x and y.

  7. You might notice that this analysis does not address the extra branch-miss. That's because I still have no clue what causes it.

  8. The correct fix would be to fix whatever ails the allocator. However, from a quick look at the changes that have gone into related code, it seems non-trivial to find a particular commit which points to the root cause of the skew (especially given that the skew is not constant, but varies from run-to-run.) Also, notably, the fixes described below also apply to UEK5, which means that even if UEK6 becomes faster, UEK5 will also improve somewhat.

  9. As mentioned in CPU-parameters, L1d-loads take 4-6 cycles on Skylake-X. We also know that in the good case (UEK5), this loop is capable of an IPC of 3.41 insn per cycle. So, hiding L1d-load latency is critical for good performance.

  10. https://www.agner.org/optimize/instruction_tables.pdf, pg 298 (Skylake-X)

  11. The first form, fused, needs three inputs: %rbx, an imm32 encoding the distance to the address being compared, and an imm8 encoding the distance to the branch-dest; the second needs two registers, %rbx and %r12, and only the imm8.

  12. Just for context, a kernel build (x86-defconfig) makes an aggregate of 27M syscalls, with a syscall every 44us.

Ankur Arora

