Background: what are memory protection keys?

Memory protection keys (pkeys) are a hardware feature used for inexpensive and fine-grained (page-level) access control, present in Intel/AMD x86_64 processors.

On x86_64 processors, pkeys are 4 bits long, so there can be 16 of them. Each process has a complete set of 16 keys, shared among its threads. For each pkey, access can be configured so that all access is allowed, write access is disabled, or all access is disabled. Each thread has a thread-local, 32-bit register called PKRU that encodes the access rights for each key: 2 bits per key, one to disable all access and one to disable writes. This register is read and written with two new instructions, rdpkru and wrpkru. The feature is only available in 64-bit mode. If a thread accesses a memory region in a way that violates the restrictions set in its PKRU register, it receives a SIGSEGV with si_code set to SEGV_PKUERR.

pkeys vs. mprotect()

With protection keys, access restrictions for memory regions in a process’s address space can be changed quickly, with just an update of the PKRU register rather than a costly page table update and TLB flushes. mprotect() can be slow, and it applies to all threads in the process: threads cannot have separate access control for the same memory region. pkeys are thread-local and fine-grained, with almost no overhead. However, they require hardware support (unlike mprotect()), and there can only be 16 pkeys per process (as of kernel version v6.19; that might change in the future). Using pkeys also makes the code slightly more complex, which could lead to more programmer errors. Some special cases (like the one covered in this blog) are not handled well by pkeys even in the latest upstream kernel, so this is still a work in progress.
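To illustrate the difference in practice, here is a minimal user-space sketch using the glibc wrappers pkey_alloc(), pkey_mprotect(), pkey_set(), pkey_get() and pkey_free(). The function name pkey_toggle_write_demo() is made up for this example, and the sketch assumes a Linux system with a glibc recent enough to provide these wrappers. It tags a page with a key once (a page-table operation) and then flips write access with register-only updates. On machines without pkey support it simply returns -1.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for the pkey_*() declarations */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 on success, -1 if pkeys are unavailable
 * or any step fails. */
int pkey_toggle_write_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int rc = -1;
    int pkey;
    char *buf;

    assert(page > 0);

    pkey = pkey_alloc(0, 0);
    if (pkey < 0)
        return -1;              /* no hardware/kernel support, or no free key */

    buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        goto out_free;

    /* One-time page-table step: associate the page with the key. */
    if (pkey_mprotect(buf, (size_t)page, PROT_READ | PROT_WRITE, pkey) != 0)
        goto out_unmap;

    /* Register-only step: this thread loses write access to the page. */
    if (pkey_set(pkey, PKEY_DISABLE_WRITE) != 0)
        goto out_unmap;
    if (pkey_get(pkey) != PKEY_DISABLE_WRITE)
        goto out_unmap;

    /* Register-only step: write access is back, no mprotect() needed. */
    if (pkey_set(pkey, 0) != 0)
        goto out_unmap;
    buf[0] = 1;
    rc = 0;

out_unmap:
    munmap(buf, (size_t)page);
out_free:
    pkey_free(pkey);
    return rc;
}
```

Only the pkey_mprotect() call touches the page tables; the two pkey_set() calls change nothing but the calling thread’s PKRU.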

The problem

Let’s assume there’s a multithreaded application that runs untrusted user code. Each thread has its stack/code protected by a non-zero pkey, and the PKRU register is set up such that only that particular non-zero pkey is enabled, i.e. pkey zero is disabled. Each thread also sets up an alternate signal stack to handle signals, which is protected by pkey zero. The pkeys man page documents that the PKRU will be reset to init_pkru when the signal handler is invoked, which means that pkey zero access will be enabled. But this reset happens only after the kernel attempts to push the FPU state onto the alternate stack. At that point the alternate stack is not yet accessible to the kernel, so the write faults and a new SIGSEGV is sent to the application, terminating it.

Enabling both the thread’s non-zero pkey and pkey zero in the application will not work for this use case, because it would make the alt stack writeable by the untrusted code. The code running in that thread (using a non-zero pkey) is untrusted and should not have access to the alternate signal stack (which uses pkey zero), so that it cannot change a function’s return address there. The expectation is that the kernel should be able to set up the alternate signal stack and deliver the signal to the application even if pkey zero is explicitly disabled by the application (as documented in the pkeys man page). The signal handler’s accessibility should not be dictated by whatever PKRU value the thread sets up.
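For readers who have not used alternate signal stacks before, the sketch below shows the mechanism on its own, with no pkeys involved. It is not the test program discussed in this post. The names run_handler_on_altstack() and on_signal(), the 64 KiB stack size, and the use of SIGUSR1 are all choices made for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define ALT_STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

static void *alt_base;
static volatile sig_atomic_t ran_on_altstack;

/* Record whether the handler's locals live inside the alternate stack. */
static void on_signal(int sig)
{
    int marker;
    uintptr_t sp = (uintptr_t)&marker;
    uintptr_t lo = (uintptr_t)alt_base;

    (void)sig;
    ran_on_altstack = (sp >= lo && sp < lo + ALT_STACK_SIZE);
}

/* Returns 1 if the handler ran on the alternate stack, 0 if it did not,
 * and -1 if the setup failed. */
int run_handler_on_altstack(void)
{
    stack_t ss;
    struct sigaction sa;

    alt_base = mmap(NULL, ALT_STACK_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (alt_base == MAP_FAILED)
        return -1;

    /* Tell the kernel where the alternate signal stack lives. */
    memset(&ss, 0, sizeof(ss));
    ss.ss_sp = alt_base;
    ss.ss_size = ALT_STACK_SIZE;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    /* SA_ONSTACK asks for this handler to run on that stack. */
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sa.sa_flags = SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    ran_on_altstack = 0;
    raise(SIGUSR1);
    return ran_on_altstack;
}
```

The kernel writes the signal frame into the memory registered with sigaltstack(). In the scenario described above, that memory would additionally be tagged with pkey zero, and that write is the one that fails.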

The context

More about PKRU

From the Intel manual:

The protection-key feature provides an additional mechanism by which IA-32e paging controls access to usermode addresses. When CR4.PKE = 1, every linear address is associated with the 4-bit protection key located in bits 62:59 of the paging-structure entry that mapped the page containing the linear address (see Section 4.5). The PKRU register determines, for each protection key, whether user-mode addresses with that protection key may be read or written.

The PKRU register (protection key rights for user pages) is a 32-bit register with the following format: for each i (0 ≤ i ≤ 15), PKRU[2i] is the access-disable bit for protection key i (ADi); PKRU[2i+1] is the write-disable bit for protection key i (WDi). Software can use the RDPKRU and WRPKRU instructions with ECX = 0 to read and write PKRU. In addition, the PKRU register is XSAVE-managed state and can thus be read and written by instructions in the XSAVE feature set.

Use of the protection key i of a user-mode address depends on the value of the PKRU register: – If ADi = 1, no data accesses are permitted. – If WDi = 1, permission may be denied to certain data write accesses: * User-mode write accesses are not permitted. * Supervisor-mode write accesses are not permitted if CR0.WP = 1. (If CR0.WP = 0, WDi does not affect supervisor-mode write accesses to user-mode addresses with protection key i.)

In other words:

# define PKEY_DISABLE_ACCESS    0x1
# define PKEY_DISABLE_WRITE     0x2
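
To make the encoding concrete, here is a minimal sketch of how the two bits for a given key could be read from, and written into, a 32-bit PKRU value. The helpers pkru_get_rights() and pkru_set_rights() are made up for this illustration (they are not kernel or glibc functions), and the code only manipulates an integer; it never touches the real register.

```c
#include <assert.h>
#include <stdint.h>

#define PKEY_DISABLE_ACCESS 0x1u
#define PKEY_DISABLE_WRITE  0x2u
#define PKRU_BITS_PER_PKEY  2u
#define NR_PKEYS            16u

/* Hypothetical helper: return the 2-bit rights field (AD | WD) of one key. */
static uint32_t pkru_get_rights(uint32_t pkru, unsigned int pkey)
{
    assert(pkey < NR_PKEYS);
    return (pkru >> (pkey * PKRU_BITS_PER_PKEY)) & 0x3u;
}

/* Hypothetical helper: return a copy of pkru with one key's rights replaced. */
static uint32_t pkru_set_rights(uint32_t pkru, unsigned int pkey,
                                uint32_t rights)
{
    unsigned int shift;

    assert(pkey < NR_PKEYS);
    shift = pkey * PKRU_BITS_PER_PKEY;
    pkru &= ~(0x3u << shift);               /* clear the old AD/WD bits */
    return pkru | ((rights & 0x3u) << shift);
}
```

Starting from 0 (everything allowed), disabling all access for key 1 gives 0x4, and disabling writes for key 2 gives 0x20.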

Since the PKRU register is thread-local, two different threads can have different access rights to the same memory region. Each thread’s PKRU controls access to a page regardless of whether the thread is running in user mode or kernel mode.

For kernel addresses however, the protection keys do not apply:

The protection key of a supervisor-mode address is ignored and does not control data accesses to the address. Because of this, Section 4.6.1 does not refer to protection keys when specifying the access rights for supervisor-mode addresses.
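
The rules quoted above can be condensed into one predicate. The following is a sketch of my reading of the manual text, not code from the kernel or from Intel. The struct data_access type and the pkey_permits() function are hypothetical names, and the model covers data accesses only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a single data access. */
struct data_access {
    bool write;          /* true for a write, false for a read */
    bool user_address;   /* true if the target is a user-mode address */
    bool supervisor;     /* true if the CPU is running in supervisor mode */
};

/* Does the given PKRU value permit this access to a page tagged with pkey? */
static bool pkey_permits(uint32_t pkru, unsigned int pkey,
                         struct data_access a, bool cr0_wp)
{
    bool ad, wd;

    assert(pkey < 16);
    ad = (pkru >> (2 * pkey)) & 1u;        /* access-disable bit */
    wd = (pkru >> (2 * pkey + 1)) & 1u;    /* write-disable bit */

    if (!a.user_address)
        return true;    /* keys are ignored for supervisor-mode addresses */
    if (ad)
        return false;   /* no data access at all */
    if (a.write && wd)
        return a.supervisor && !cr0_wp;    /* only kernel writes with WP=0 */
    return true;
}
```

With CR0.WP set, a kernel write to a user page whose key has either bit set is refused, exactly like a user-mode write would be. This is the case that matters for the rest of this post.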

The default PKRU value set at boot time is 0x55555554, which disallows access to all pkeys except pkey 0 (for which both reads and writes are enabled). This change was made in commit:

commit acd547b29880800d29222c4632d2c145e401988c
Author: Dave Hansen <dave.hansen@linux.intel.com>
Date:   Fri Jul 29 09:30:21 2016 -0700

    x86/pkeys: Default to a restrictive init PKRU
...
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index a4f4d693e2c1..3725976d0af5 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1643,6 +1643,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.

        initrd=         [BOOT] Specify the location of the initial ramdisk

+       init_pkru=      [x86] Specify the default memory protection keys rights
+                       register contents for all processes.  0x55555554 by
+                       default (disallow access to all but pkey 0).  Can
+                       override in debugfs after boot.
+

Signal Handler Behavior

From the pkeys man page:

Each time a signal handler is invoked (including nested signals), the thread is temporarily given a new, default set of protection key rights that override the rights from the interrupted context. This means that applications must re-establish their desired protection key rights upon entering a signal handler if the desired rights differ from the defaults. The rights of any interrupted context are restored when the signal handler returns.

This signal behavior is unusual and is due to the fact that the x86 PKRU register (which stores protection key access rights) is managed with the same hardware mechanism (XSAVE) that manages floating-point registers. The signal behavior is the same as that of floating-point registers.

fpu__clear_user_states() does reset PKRU, but that happens much later in the flow. Before that, the kernel tries to save registers onto the alternate signal stack in setup_rt_frame(), and that fails if the application has explicitly disabled pkey 0 (and the alt stack is protected by pkey 0). The solution (described in a later section) updates PKRU earlier in the flow, so that setup_rt_frame() can succeed.

XSAVE

XSAVE is a CPU instruction for saving and restoring extended processor state, on Intel and AMD x86 systems. The XSAVE area contains all the registers it supports (in either a compacted or standard format) – including floating point registers, MPX, PKRU, etc. The register XCR0 contains the bitmap that specifies which of these features are enabled and managed by XSAVE. The bit corresponding to the PKRU register in XCR0 is bit 9 – i.e. if XCR0[9] = 1, then the PKRU register state will be managed by the XSAVE instruction set. The registers will be saved to the XSAVE area in memory using XSAVE (or XSAVEC for compacted format) and restored from this XSAVE area back to the processor’s registers using XRSTOR.
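
As a small illustration of the bitmap, the sketch below tests individual feature bits in a mask. The helpers xfeature_is_enabled() and xfeature_count() are hypothetical names for this example. The value 0x2e7 used in the usage note is the "Enabled xstate features" mask from the boot log shown later in this post, which I am treating as an XCR0-style bitmap purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define XFEATURE_PKRU 9   /* bit position of PKRU state in the bitmap */

/* Hypothetical helper: is state component nr enabled in this bitmap? */
static int xfeature_is_enabled(uint64_t mask, unsigned int nr)
{
    assert(nr < 64);
    return (int)((mask >> nr) & 1u);
}

/* Hypothetical helper: how many state components are enabled? */
static unsigned int xfeature_count(uint64_t mask)
{
    unsigned int n = 0;

    for (; mask != 0; mask >>= 1)
        n += (unsigned int)(mask & 1u);
    return n;
}
```

For 0x2e7, bit 9 (PKRU) is set, bits 3 and 4 are clear, and seven components are enabled in total, which matches the seven "Supporting XSAVE feature" lines in that log.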

For this issue, our requirement is that the PKRU register in the XSAVE area (and therefore the user sigframe) contain the user-defined value, so that it is XRSTOR’d correctly in sigreturn. But the actual PKRU register itself must be updated to be more permissive (i.e. enable all pkeys) temporarily, so that the kernel can access both the current thread’s execution stack (protected by pkey X) as well as the alternate signal stack (protected by pkey 0). This is a bit tricky, and unfortunately, the solution is also a bit hacky. It involves updating the PKRU register to enable all pkeys, and then after it is pushed onto the sigframe, overwriting the PKRU value on the sigframe with the user-defined PKRU value, so that the correct value is restored from the sigcontext.

The investigation

Kernel debug

The testcase, authored by Keith Lucas, is this: in a simple multithreaded application, create one thread so that its stack/code/etc. is protected by a non-zero pkey, and have that thread set up an alternate signal stack that’s protected by pkey 0. The PKRU register is set so that only the non-zero pkey is enabled, i.e. pkey 0 is disabled. The thread also sets up a signal handler for SIGSEGV and induces a segmentation fault by accessing unmapped memory. It expects the signal handler to be invoked, but it crashes instead because the kernel generates an unexpected second SIGSEGV while setting up the alternate signal stack (due to the pkey access restriction).

[opc@aruramak-ol8 ~]$ ./altstackmpk-fixed --segfault
initial pkru = 0x55555554
signal stack:   0x7fb98f7da000-0x7fb98f7dd000
thread stack:   0x7fb98e7cc000-0x7fb98efcc000
relocated code: 0x7fb98f7d9000-0x7fb98f7da000
set pkru = 0xfffffff0
thread signal stack:    0x7fb98f7cf000-0x7fb98f7d2000
initial thread pkru = 0xfffffff0
relocated code
Segmentation fault (core dumped)

In the syslog, we can see that the process received a SIGSEGV that was not caught/handled, hence the crash:

...
Dec  7 21:25:35 aruramak-ol8 kernel: potentially unexpected fatal signal 11.
Dec  7 21:25:35 aruramak-ol8 kernel: CPU: 1 PID: 2184194 Comm: altstackmpk-fix Kdump: loaded Not tainted 5.4.17-2102.204.4.4.el8uek.x86_64 #2
Dec  7 21:25:35 aruramak-ol8 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.5.1 06/16/2021
Dec  7 21:25:35 aruramak-ol8 kernel: RIP: 0033:0x7fb98f7d91b6
Dec  7 21:25:35 aruramak-ol8 kernel: Code: 8b 45 a8 48 8b 7d a0 48 8b 75 98 48 8b 55 90 0f 05 48 89 85 70 ff ff ff 0f b6 85 08 ff ff ff 84 c0 74 0d 48 8b 85 10 ff ff ff <c7> 00 01 00 00 00 48 c7 85 68 ff ff ff 3c 00 00 00 48 c7 85 60 ff
Dec  7 21:25:35 aruramak-ol8 kernel: RSP: 002b:00007fb98efcae20 EFLAGS: 00010202
Dec  7 21:25:35 aruramak-ol8 kernel: RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 00007fb98f7d919d
Dec  7 21:25:35 aruramak-ol8 kernel: RDX: 000000000000000f RSI: 00007fb98efcade0 RDI: 0000000000000001
Dec  7 21:25:35 aruramak-ol8 kernel: RBP: 00007fb98efcaec0 R08: 0000000000000000 R09: 0000000000000000
Dec  7 21:25:35 aruramak-ol8 kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe1814ddfe
Dec  7 21:25:35 aruramak-ol8 kernel: R13: 00007ffe1814ddff R14: 00007ffe1814ded0 R15: 00007fb98efcafc0
Dec  7 21:25:35 aruramak-ol8 kernel: FS:  00007fb98efcb700 GS:  0000000000000000
Dec  7 21:25:35 aruramak-ol8 systemd[1]: Started Process Core Dump (PID 2184195/UID 0).
Dec  7 21:25:35 aruramak-ol8 systemd-coredump[2184196]: Process 2184193 (altstackmpk-fix) of user 1000 dumped core.#012#012Stack trace of thread 2184194:#012#0  0x00007fb98f7d91b6 n/a (n/a)#012#1  0x0000000000400c12 n/a (n/a)#012#2  0x00007fb98f39915a n/a (n/a)
...

The call that fails is setup_rt_frame(), invoked from handle_signal():

...
        failed = (setup_rt_frame(ksig, regs) < 0);
        if (!failed) {
                /*
                 * Clear the direction flag as per the ABI for function entry.
                 *
                 * Clear RF when entering the signal handler, because
                 * it might disable possible debug exception from the
                 * signal handler.
                 *
                 * Clear TF for the case when it wasn't set by debugger to
                 * avoid the recursive send_sigtrap() in SIGTRAP handler.
                 */
                regs->flags &= ~(X86_EFLAGS_DF|X86_EFLAGS_RF|X86_EFLAGS_TF);
                /*
                 * Ensure the signal handler starts with the new fpu state.
                 */
                fpu__clear_user_states(fpu);
        }
        signal_setup_done(failed, ksig, stepping);
...

Failure path: setup_rt_frame() -> x64_setup_rt_frame() -> get_sigframe() -> copy_fpstate_to_sigframe() -> __clear_user() -> failure, with SIGSEGV and si_code set to SEGV_PKUERR.

bool copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
{
...
        if (use_xsave()) {
                struct xregs_state __user *xbuf = buf_fx;

                /*
                 * Clear the xsave header first, so that reserved fields are
                 * initialized to zero.
                 */
                if (__clear_user(&xbuf->header, sizeof(xbuf->header))) <--
                        return false;
        }
...

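__clear_user() fails while trying to zero the XSAVE header on the alt stack, which the kernel cannot write to because pkey 0 is disabled at this point.

The PKRU value is reset to the default (enabling pkey 0 only) in fpu__clear_user_states(), but that call is only reached after setup_rt_frame() has succeeded.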

Debug log:

...
[76813.749337][  T586] pkru = fffffff3
[76813.749536][  T586] __clear_user(&xbuf->header, sizeof(xbuf->header)) = 64 failed
...

copy_fpstate_to_sigframe() stores the floating point unit registers to the userspace signal frame (which sits on the alternate stack, protected by pkey 0).

The default value of pkru is set to 0x55555554 – which denies all access to pkeys 1-15, but pkey 0 is allowed. The test code sets this as the PKRU value:

    if (info.enable_mpk) {
        __builtin_ia32_wrpkru(0xfffffff3);
    }

which disables pkey 0 and leaves only pkey 1 enabled, so the alternate signal stack (protected by pkey 0) becomes inaccessible.
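
The value 0xfffffff3 can be derived mechanically: start with every key fully disabled and clear the two bits that belong to the one key that should stay enabled. The helper pkru_allow_only() below is a hypothetical name for this sketch and does plain integer arithmetic only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: PKRU value in which only one key is accessible. */
static uint32_t pkru_allow_only(unsigned int pkey)
{
    assert(pkey < 16);
    /* All AD/WD bits set, except the pair that belongs to pkey. */
    return 0xffffffffu & ~(0x3u << (2 * pkey));
}
```

pkru_allow_only(1) evaluates to 0xfffffff3, the value written by the test code above.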

With the patch (described later in this blog post), the crash does not happen anymore – the application successfully receives a SIGSEGV and handles it:

[opc@aruramak-ol8 ~]$ ./altstackmpk-fixed --segfault
initial pkru = 0x55555554
signal stack:   0x7efe6897a000-0x7efe6897d000
thread stack:   0x7efe67600000-0x7efe67e00000
relocated code: 0x7efe68979000-0x7efe6897a000
set pkru = 0xfffffff0
thread signal stack:    0x7efe68976000-0x7efe68979000
initial thread pkru = 0xfffffff0
relocated code
signal handler rsp = 0x7efe68978450
signal handler pkru = 0xfffffff0
handling signal 11
info->si_code 1
info->si_addr 0xffffffff
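
Note that si_code 1 in this output is SEGV_MAPERR, i.e. the original fault caused by touching unmapped memory, and not SEGV_PKUERR. A handler that wants to tell the two apart could use something like the sketch below. The function segv_code_name() is a made-up helper, and the fallback definitions are only there in case the libc headers do not expose the constants (the numeric values are the Linux ones).

```c
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Fallbacks with the Linux values, in case the headers hide them. */
#ifndef SEGV_MAPERR
#define SEGV_MAPERR 1
#endif
#ifndef SEGV_ACCERR
#define SEGV_ACCERR 2
#endif
#ifndef SEGV_PKUERR
#define SEGV_PKUERR 4
#endif

/* Hypothetical helper: readable name for the si_code of a SIGSEGV. */
static const char *segv_code_name(int si_code)
{
    switch (si_code) {
    case SEGV_MAPERR:
        return "SEGV_MAPERR";   /* address not mapped */
    case SEGV_ACCERR:
        return "SEGV_ACCERR";   /* invalid permissions for mapped object */
    case SEGV_PKUERR:
        return "SEGV_PKUERR";   /* access denied by a protection key */
    default:
        return "other";
    }
}
```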

Where is the PKRU register stored in the XSAVE buffer?

Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: [Debug] setup_xstate_features(): extended_state_area begins at offset 576
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: [Debug] setup_xstate_features(): xstate_offset[2]: 576, xstate_sizes[2]: 256
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: [Debug] setup_xstate_features(): xstate_offset[5]: 1088, xstate_sizes[5]: 64
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: [Debug] setup_xstate_features(): xstate_offset[6]: 1152, xstate_sizes[6]: 512
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: [Debug] setup_xstate_features(): xstate_offset[7]: 1664, xstate_sizes[7]: 1024
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: [Debug] setup_xstate_features(): xstate_offset[9]: 2688, xstate_sizes[9]: 8
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: [Debug] setup_xstate_features(): new last_good_offset = 2688

These values are slightly different from the ones printed in setup_init_fpu_buf():

Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]:    8
Mar  7 23:23:31 aruramak-ol7-testvm3 kernel: x86/fpu: Enabled xstate features 0x2e7, context size is 2440 bytes, using 'compacted' format.

The difference is that the latter prints xstate_comp_offsets[] (which is also used by get_xsave_addr() to calculate the offset) while I’m printing xstate_offsets[]. These are the standard, non-compacted offsets, which we need to use since we’re reading from a standard user buffer (the sigframe) that should not be in the compacted format. (See the comment on copy_xregs_to_user() in the kernel code.)
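
The compacted offsets in the second log can be reproduced from the component sizes alone: in the compacted format, the enabled extended components are packed back to back after the 512-byte legacy area and the 64-byte header. The sketch below does that arithmetic with the sizes from the logs above. compacted_offset() and example_sizes[] are hypothetical names, and the sketch assumes that no component asks for extra 64-byte alignment, which happens to hold for these numbers.

```c
#include <assert.h>
#include <stdint.h>

#define XSAVE_HDR_END 576u   /* 512-byte legacy area + 64-byte header */
#define MAX_XFEATURES 10u

/* Component sizes taken from the debug output above. */
static const unsigned int example_sizes[MAX_XFEATURES] = {
    [2] = 256, [5] = 64, [6] = 512, [7] = 1024, [9] = 8,
};

/* Hypothetical helper: offset of component nr in the compacted format. */
static unsigned int compacted_offset(const unsigned int sizes[],
                                     uint64_t mask, unsigned int nr)
{
    unsigned int off = XSAVE_HDR_END;
    unsigned int i;

    assert(nr >= 2 && nr < MAX_XFEATURES);
    /* Every enabled component below nr is packed directly before it. */
    for (i = 2; i < nr; i++)
        if (mask & (1ULL << i))
            off += sizes[i];
    return off;
}
```

With the mask 0x2e7, this places PKRU (component 9) at offset 2432, and 2432 + 8 gives the 2440-byte context size reported by the kernel. In the standard format the same component sits at the fixed offset 2688.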

get_xsave_addr() assumes that the xsave area will always be in compacted format (if compacted format is supported by the processor, which it is, here). I just worked around this (for now) by defining this:

+
+/*
+ * Get address of an xfeature in the xsave buffer using non-compacted format.
+ */
+
+void *__raw_xsave_usr_addr(struct xregs_state *xsave, int xfeature_nr)
+{
+       if (!xfeature_enabled(xfeature_nr)) {
+               WARN_ON_FPU(1);
+               return NULL;
+       }
+
+       return (void *)xsave + xstate_offsets[xfeature_nr];
+}

This is identical to __raw_xsave_addr() but assumes the standard format.

The solution

The idea

The PKRU register is managed by XSAVE, which means XSAVE writes whatever is in the register into the sigframe. That is not what we want here: the sigframe must contain the user-defined PKRU value (so that it is restored correctly from sigcontext), but the actual register must be reset to 0 so that the alt stack is accessible and the signal can be delivered to the application. The proper fix would seem to be removing PKRU from the XSAVE framework and managing it separately, which is quite complicated. As a workaround, Dave Hansen suggested that we do something like this:

        orig_pkru = rdpkru();
        wrpkru(0);
        xsave_to_user_sigframe();
        put_user(orig_pkru, pkru_sigframe_addr);
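
To see why the ordering matters, here is a toy user-space model of the two flows. It is not kernel code: struct model_thread, can_write(), deliver_old() and deliver_fixed() are all invented for this sketch, and "writing the sigframe" is reduced to a permission check against the PKRU value that is live at that moment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the per-thread state that matters here. */
struct model_thread {
    uint32_t pkru;              /* live PKRU register */
    uint32_t frame_pkru;        /* PKRU slot saved in the sigframe */
    unsigned int altstack_pkey; /* key protecting the alternate stack */
};

/* The kernel (with CR0.WP set) can write only if AD and WD are clear. */
static bool can_write(uint32_t pkru, unsigned int pkey)
{
    assert(pkey < 16);
    return ((pkru >> (2 * pkey)) & 0x3u) == 0;
}

/* Original ordering: the frame is written under the thread's own PKRU. */
static bool deliver_old(struct model_thread *t, uint32_t init_pkru)
{
    if (!can_write(t->pkru, t->altstack_pkey))
        return false;           /* the second, fatal SIGSEGV */
    t->frame_pkru = t->pkru;
    t->pkru = init_pkru;        /* reset happens only after the write */
    return true;
}

/* Workaround ordering: open up PKRU, write the frame, then patch it. */
static bool deliver_fixed(struct model_thread *t, uint32_t init_pkru)
{
    uint32_t orig_pkru = t->pkru;

    t->pkru = 0;                /* wrpkru(0): enable all keys */
    if (!can_write(t->pkru, t->altstack_pkey)) {
        t->pkru = orig_pkru;
        return false;
    }
    t->frame_pkru = t->pkru;    /* XSAVE stores the permissive value */
    t->frame_pkru = orig_pkru;  /* overwrite it with the user's value */
    t->pkru = init_pkru;        /* the handler starts with the default */
    return true;
}
```

With pkru = 0xfffffff3 and the alt stack on pkey 0, deliver_old() fails, while deliver_fixed() succeeds and leaves 0xfffffff3 in the frame for sigreturn to restore.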

The patchset

The complete patchset for this bugfix is:

6998a73efbb8 selftests/mm: Add new testcases for pkeys
d10b554919d4 x86/pkeys: Restore altstack access in sigreturn()
70044df250d0 x86/pkeys: Update PKRU to enable all pkeys before XSAVE
84ee6e8d195e x86/pkeys: Add helper functions to update PKRU on the sigframe
24cf2bc982ff x86/pkeys: Add PKRU as a parameter in signal handling functions

This was part of the v6.12 release.

Some issues that were encountered while trying to solve this problem

There were a lot of issues, including getting the XSAVE register offsets right, restoring PKRU at the right time, etc., that were resolved before the patch worked as intended. Apart from those, there were a few others that are more noteworthy; I’ve listed them below.

wrpkru(0)

In one of the earlier iterations of the patch, I had:

write_pkru(current_pkru & init_pkru_snapshot);

instead of:

wrpkru(0);

But that made the assumption that the current PKRU permissions allow writes to the current stack and that the init PKRU value allows writes to the alternate stack, which may not always be the case. To simplify matters and make the code more generic, the kernel enables all pkeys via wrpkru(0) before setting up the signal stack, and the user-defined PKRU value is restored when the signal handler returns. We do not want to assume or restrict the alt sigstack to pkey 0. As Dave Hansen pointed out:

This code has ZERO knowledge of the permissions on either the current or alternate stack.

sigreturn

Jeff Xu pointed out another part of the flow where this could blow up:

in sigreturn(), it calls restore_altstack(), and requires read access to altstack. However, at the time, PKRU is already restored from sigframe, so SEGV will raise (the value in sigframe doesn’t have read access to the PKEY).

The sigreturn patch (d10b554919d4, patch #4 of the series) was added to the patchset to fix this.

PKRU register was not correctly restored on AMD systems

When PKRU is set to 0 in the signal handling flow, it seems that xstate_bv[9] is also set to 0, which effectively means that XSAVE treats PKRU as being in its initial state. XRSTOR then does not restore the PKRU register from the sigframe, so PKRU stays at 0 when control returns to the user program (i.e. the PKRU value written to the XSAVE area is ignored). This is not the expected behavior.

This is only seen on systems with AMD CPUs – not Intel CPUs. On Intel systems, the XRSTOR works, which means xstate_bv[9] is not set to 0. This can be fixed in the kernel by reenabling PKRU (i.e. xstate_bv[9]=1) after PKRU is set to 0, or before it’s updated on the sigframe.

From the Intel manual:

13.6 PROCESSOR TRACKING OF XSAVE-MANAGED STATE

The following notation describes the state of the init and modified optimizations: – XINUSE denotes the state-component bitmap corresponding to the init optimization. If XINUSE[i] = 0, state component i is known to be in its initial configuration; otherwise XINUSE[i] = 1. It is possible for XINUSE[i] to be 1 even when state component i is in its initial configuration. On a processor that does not support the init optimization, XINUSE[i] is always 1 for every value of i. … – PKRU state. PKRU state is in its initial configuration if the value of the PKRU is 0. …


13.8.1 Standard Form of XRSTOR

XRSTOR updates state component i based on the value of bit i in the XSTATE_BV field of the XSAVE header: – If XSTATE_BV[i] = 0, the state component is set to its initial configuration. Section 13.6 specifies the initial configuration of each state component. The initial configuration of state component 1 pertains only to the XMM registers and not to MXCSR. See below for the treatment of MXCSR – If XSTATE_BV[i] = 1, the state component is loaded with data from the XSAVE area. See Section 13.5 for specifics for each state component and for details regarding mode-specific operation and operation determined by instruction prefixes. See Section 13.13 for details regarding faults caused by memory accesses.

The line “PKRU state is in its initial configuration if the value of the PKRU is 0” seems to imply that when the PKRU register is set to 0, xinuse[9] is also automatically set to 0 and that is expected behavior, which causes XRSTOR to not load the register value from XSAVE area. But we do not want xinuse[9] to be set to 0 here, as we want the PKRU value to be correctly restored from the sigframe.

Dave Hansen remarked:

If ->xfeatures[PKRU]==0, then XRSTOR will ignore the data that __put_user() put in place.

How does ->xfeatures[PKRU] end up set to 0? On AMD, a WRPKRU(0) sets PKRU=0 and XINUSE[PKRU]=0. Intel doesn’t do that. Either behavior is architecturally permitted.

To make this behavior consistent across Intel and AMD systems, and to ensure that the PKRU value updated on the sigframe is always restored correctly, explicitly set XSTATE_BV[PKRU] to 1.

+   /* Mark PKRU as in-use so that it is restored correctly. */
+   xstate_bv = (mask & xfeatures_in_use()) | XFEATURE_MASK_PKRU;
+
+   err =  __put_user(xstate_bv, &buf->header.xfeatures);
+   if (err)
+       return err;
+
+   /* Update PKRU value in the userspace xsave buffer. */
    return __put_user(pkru, (unsigned int __user *)get_xsave_addr_user(buf, XFEATURE_PKRU));
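
The effect of forcing the bit can be illustrated with a toy model of XSAVE and XRSTOR for the PKRU component alone. Everything here is invented for the sketch (struct model_xsave, model_xsave_pkru(), model_xrstor_pkru(), and the clears_xinuse_on_zero flag that stands in for the AMD-style behavior described above). It follows the rules quoted from the manual but is not how the hardware or the kernel is implemented.

```c
#include <assert.h>
#include <stdint.h>

#define XFEATURE_MASK_PKRU (1ULL << 9)

/* Toy stand-in for the PKRU-related part of an XSAVE buffer. */
struct model_xsave {
    uint64_t xstate_bv;   /* header bitmap of components present */
    uint32_t pkru;        /* PKRU slot in the buffer */
};

/* XSAVE: the component is written only if the CPU tracks it as in use. */
static void model_xsave_pkru(struct model_xsave *buf, uint32_t live_pkru,
                             int clears_xinuse_on_zero)
{
    int in_use = !(clears_xinuse_on_zero && live_pkru == 0);

    assert(buf != 0);
    if (in_use) {
        buf->xstate_bv |= XFEATURE_MASK_PKRU;
        buf->pkru = live_pkru;
    } else {
        buf->xstate_bv &= ~XFEATURE_MASK_PKRU;
    }
}

/* XRSTOR: load from the buffer if the bit is set, else use the init value. */
static uint32_t model_xrstor_pkru(const struct model_xsave *buf)
{
    assert(buf != 0);
    return (buf->xstate_bv & XFEATURE_MASK_PKRU) ? buf->pkru : 0;
}
```

In this model, saving PKRU = 0 with clears_xinuse_on_zero set leaves the bit clear, so a value later stored in buf.pkru is ignored by the restore and PKRU comes back as 0. Setting XFEATURE_MASK_PKRU in xstate_bv by hand, as the patch does, makes the restore pick up the stored value.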

A new patchset was submitted upstream to fix this issue, and was merged into v6.13, as well as the v6.12 stable tree.

References

  1. Intel® 64 and IA-32 Architectures Software Developer’s Manual
  2. Memory Protection Keys
  3. Pkeys man page
  4. lwn.net: System calls for memory protection keys