This blog post focuses on PKRAM, one of several proposed methods for preserving memory in a running kernel so that it remains available after a kexec into a new kernel.

Why preserve memory across reboot?

Minimal downtime is extremely important to customers and cloud service providers. In particular, rebooting the host kernel in a virtualization environment is disruptive to running guests. However, a reboot is sometimes necessary to pick up critical software fixes that cannot be applied through other means such as ksplice. In this situation the options for mitigating the impact on guests are limited. Live migration to another host is one, but it can be resource intensive or even impractical if the guest is using a large amount of local resources.

PKRAM, in conjunction with guest state saved and restored by QEMU as outlined here, provides the ability to reboot into an updated host kernel with guest RAM remaining intact. Guests are paused only long enough to complete the reboot, restore preserved data, and start the necessary userland services.

A usage example

Let’s start with an example to show how PKRAM can be used to preserve a shmem memory filesystem:

  1. Mount tmpfs with ‘pkram=NAME’ option.

    NAME is an arbitrary string specifying a preserved memory node. Different tmpfs filesystems may be saved to PKRAM if different names are passed.

    # mkdir -p /mnt
    # mount -t tmpfs -o pkram=mytmpfs none /mnt
  2. Populate a file under /mnt.

    # head -c 2G /dev/urandom > /mnt/testfile
    # md5sum /mnt/testfile
    e281e2f019ac3bfa3bdb28aa08c4beb3  /mnt/testfile
  3. Add the preserve mount option to preserve blocks across the next kexec.

    # mount -o remount,preserve,ro /mnt
  4. Load the new kernel image.

    Pass the PKRAM super block pfn via the ‘pkram’ boot option. The pfn is exported via the sysfs file /sys/kernel/pkram.

    Strip the previous value from the kernel args, and add the new one:

    # args="$(cat /proc/cmdline|sed -e 's/pkram=[^ ]*//g') pkram=$(cat /sys/kernel/pkram)"

    Now set the new kernel image and args:

    # kexec -s -l /boot/vmlinuz-$kernel --initrd=/boot/initramfs-$kernel.img --append="$args"
  5. Boot to the new kernel.

    # systemctl kexec
  6. Mount tmpfs with ‘pkram=NAME’ option.

    The mount should find the PKRAM node holding the tmpfs filesystem saved by the previous kernel and restore it.

    # mount -t tmpfs -o pkram=mytmpfs none /mnt
  7. Use the restored file under /mnt.

    # md5sum /mnt/testfile
    e281e2f019ac3bfa3bdb28aa08c4beb3  /mnt/testfile
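
The command-line handling in step 4 is easy to get wrong, because the old ‘pkram’ value must be removed and the new one must be separated from the remaining arguments by a space. A minimal sketch of that step as a reusable function is shown below. The function name build_pkram_args is hypothetical and not part of PKRAM, and the values in the usage line are read from the same files used in step 4.

```shell
#!/bin/sh
# Hypothetical helper (not part of PKRAM): build the next kernel's command line.
#   $1 = current command line, e.g. the contents of /proc/cmdline
#   $2 = PKRAM super block pfn, e.g. the contents of /sys/kernel/pkram
build_pkram_args() {
    # Drop any stale pkram=... token, squeeze repeated spaces, trim both ends.
    cleaned=$(printf '%s\n' "$1" \
        | sed -e 's/pkram=[^ ]*//g' -e 's/  */ /g' -e 's/^ //' -e 's/ $//')
    # Append the new token, adding a separating space only when other
    # arguments remain.
    printf '%spkram=%s\n' "${cleaned:+$cleaned }" "$2"
}
```

It could then be used in place of the one-liner in step 4:

    # args="$(build_pkram_args "$(cat /proc/cmdline)" "$(cat /sys/kernel/pkram)")"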

Implementation

The implementation of PKRAM is adapted from an older RFC patchset sent out in 2013. Here are the basics:

  • Saving files to PKRAM is done by walking the pages of the files and building a list of the pages and attributes needed to restore them later. The pages containing this metadata as well as the file pages have their refcount incremented to prevent them from being freed even after the last user drops their reference (i.e. the filesystem is unmounted).

  • A single page is allocated for the PKRAM super block. So that the next kernel can find the preserved memory metadata after kexec, the pfn of the PKRAM super block, which is exported via /sys/kernel/pkram, is passed in the ‘pkram’ boot option.

  • A list of physical memory ranges representing all preserved pages is built and passed to the newly booted kernel. During early boot, the new kernel adds these ranges to the memblock reserve list so that the preserved pages are not reused.

  • Since kexec may load the new kernel code into any memory region, it could destroy preserved memory. When the kernel selects the memory region (kexec_file_load syscall), kexec will avoid preserved pages. When the user selects the memory region to use (kexec_load syscall), the load will fail if there is a conflict with preserved pages.
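
Two of the ideas above, describing preserved pages as a list of ranges and rejecting a load region that conflicts with them, can be illustrated in userspace. The sketch below is only a model of the logic and is not the kernel code. The function names and the "START-END" range notation (decimal pfns, END inclusive) are assumptions made for this example.

```shell
#!/bin/sh
# Illustrative model only; the real bookkeeping is done inside the kernel.
# Ranges are written "START-END" in decimal pfns, with END inclusive.

# Coalesce a sorted list of pfns (one per argument) into inclusive ranges.
pfns_to_ranges() {
    [ "$#" -gt 0 ] || return 0
    start=$1
    prev=$1
    shift
    for pfn in "$@"; do
        # A jump in the sequence closes the current range and opens a new one.
        if [ "$pfn" -ne $((prev + 1)) ]; then
            printf '%s-%s\n' "$start" "$prev"
            start=$pfn
        fi
        prev=$pfn
    done
    printf '%s-%s\n' "$start" "$prev"
}

# Return 0 if the requested range $1 overlaps any preserved range in $2..$N,
# and 1 if it is clear of all of them.
range_conflicts() {
    req_start=${1%-*}
    req_end=${1#*-}
    shift
    for r in "$@"; do
        start=${r%-*}
        end=${r#*-}
        # Inclusive ranges overlap when each starts no later than the other ends.
        if [ "$req_start" -le "$end" ] && [ "$start" -le "$req_end" ]; then
            return 0
        fi
    done
    return 1
}
```

In this model, the user-selected case (kexec_load) corresponds to failing whenever range_conflicts reports an overlap, while the kernel-selected case (kexec_file_load) corresponds to searching for a region for which it reports none.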

Limitations

While effective at preserving memory, PKRAM has important limitations to overcome:

  • Repeated reboots and certain usage patterns could fragment preserved memory enough to make it impossible to load a new kernel without clobbering something.

  • Preserving memory that cannot be relocated isn’t supported. This impacts the ability to preserve guest memory with VFIO PCI devices since the memory must be pinned.
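
The fragmentation concern can be made concrete with a simplified model: the new kernel needs some contiguous run of free pages, and scattered preserved ranges shrink the largest such run even when the total amount of preserved memory stays the same. The sketch below computes the largest free gap. The function name and the range notation are assumptions for illustration, and the model ignores everything else that occupies memory on a real system.

```shell
#!/bin/sh
# Simplified model, not kernel code.
# usage: largest_free_gap TOTAL_PAGES RANGE...
# Ranges are sorted, non-overlapping "START-END" decimal pfns (END inclusive).
largest_free_gap() {
    total=$1
    shift
    next_free=0   # first pfn after the preserved ranges seen so far
    best=0
    for r in "$@"; do
        start=${r%-*}
        end=${r#*-}
        # Free pages between the previous preserved range and this one.
        gap=$((start - next_free))
        if [ "$gap" -gt "$best" ]; then
            best=$gap
        fi
        next_free=$((end + 1))
    done
    # Free pages after the last preserved range.
    gap=$((total - next_free))
    if [ "$gap" -gt "$best" ]; then
        best=$gap
    fi
    echo "$best"
}
```

For example, with 1000 pages in total, 50 preserved pages in one block (0-49) leave a gap of 950 pages, while the same 50 pages split into 100-109, 300-309, 500-509, 700-709, and 900-909 leave no gap larger than 190 pages.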

Other Efforts

Within the last year, several other approaches to preserving memory across kexec have been proposed:

  1. Kexec HandOver (KHO)

    Proposes a generalized method for passing state across kexec.

  2. Pkernfs

    Proposes an in-memory filesystem with an eye towards supporting preservation of IOMMU data structures in addition to guest memory.

    Linux Plumbers Conference 2023 discussion

  3. prmem

    Proposal for a separate allocator of persistent memory.

  4. persistent memory pools

    Another allocator of persistent memory, primarily for kernel drivers and modules to use.

Looking Forward

The rising interest in preserving data across a kernel reboot has led to several implementation proposals, of which PKRAM is one. Through community interest and effort, there is hope that this functionality will become available for general use.