What's Inside a Linux Kernel Core Dump

February 8, 2024 | 35 minute read

Linux kernel core dumps are often critical for diagnosing and fixing problems with the OS. We’ve published several blogs related to kernel core dumps, including how to generate them, how to estimate their size, how to analyze them with Drgn, and even how to manually extract stack function arguments from them. But have you ever wondered what’s really in a core dump? In this blog, we’ll answer that question. We’ll start by discussing the different software which can actually produce a vmcore (there are more than you might think). Next, we’ll discuss the contents of vmcores, including what important metadata needs to be present in order to make analysis possible. We’ll then dive into the details of a few of the most prominent vmcore formats and variations on them, before finishing with a quick overview of some tools that can be used to analyze them.

This topic has so much history and variation that there’s no hope of covering it all. Instead, we’ll focus on the sources and formats most commonly found with Oracle Linux, which should cover much of the modern desktop and server Linux landscape. This blog isn’t intended to be a step-by-step guide to achieving a particular task; it’s just a reference and introduction to a field that’s not frequently discussed.

Core dump sources

When the Linux kernel encounters an unrecoverable error (a “panic”) or a hang, it’s incredibly useful to save the state of memory, registers, etc. into a file for later analysis & debugging. In these situations, the system may be in a dire state, so the process which creates the core dump must be reliable. Further, downtime and disk space can be expensive, so the process must also be reasonably quick, and should avoid capturing unnecessary information. This is a tall order, and so there are multiple ways a core dump can be created, each with different trade-offs.

For distributions like Oracle Linux, the most common way to create a vmcore is by using kexec(8), generally with makedumpfile. However, it can also be done by a hypervisor or firmware level crash dump system. We’ll explain and discuss the different possibilities in this section.

Kexec crash kernels

On a system which is configured for kexec crash dumps, some memory is reserved at boot time for a second Linux kernel (the “kdump kernel”). On startup, the system uses kexec_load(2) to load a kernel image into this reserved memory region. If a panic occurs, all CPUs are halted and control is transferred to the kdump kernel. The kdump kernel boots up and represents the memory image of the previous kernel as the file /proc/vmcore - thus the name “vmcore”. Normally, the kdump kernel is configured to execute a tool (typically makedumpfile) which will then save this file to a disk or network location, and then reboot.
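
On such a system, you can confirm that a crash kernel is loaded and see the reserved region on the kernel command line (the values below are illustrative; your crashkernel= setting will differ):

$ cat /sys/kernel/kexec_crash_loaded
1
$ grep -o 'crashkernel=[^ ]*' /proc/cmdline
crashkernel=1G-64G:448M,64G-:512M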

The /proc/vmcore file is thus one of the most common sources for kernel core dumps. The data is represented in ELF (Executable and Linkable Format), which we will discuss a bit more later on. However, if you were to go searching on your Linux machine for a /proc/vmcore file, you probably wouldn’t find it, because it only appears when you’re running within a kdump kernel (more precisely, only when the elfcorehdr= command-line option points at a valid ELF header created by the original kernel). But there is also another ELF-formatted core image which the kernel provides: /proc/kcore.

Running Linux kernels

Unlike /proc/vmcore, the file /proc/kcore is always available. Rather than showing the memory image of the previously crashed kernel, it shows the live memory image of the currently running kernel. It’s common for people to run live debugging tools like crash or drgn against /proc/kcore, since it’s always present and easy to access, assuming that security features such as lockdown=confidentiality are not active.
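
For example, assuming the matching kernel debuginfo is installed (the vmlinux path below is typical for RPM-based distributions, but yours may vary), a live crash session can be started like so:

# crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /proc/kcore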

You could also create a copy of /proc/kcore, much like you could of /proc/vmcore; however, this isn’t terribly common. Typical use of /proc/kcore is for live debugging, while /proc/vmcore is normally saved for later inspection.

Makedumpfile

Both /proc/vmcore and /proc/kcore are direct representations of the memory space of a kernel, using the ELF format. This means that the files take up roughly the same amount of space as the physical memory in use by the kernel. On a laptop with 8GiB or 16GiB of memory, that may not be too bad, but for servers with hundreds of GiB or even several TiB of memory, it’s just not acceptable.

The makedumpfile tool is used to create a much smaller dump file, using two main strategies. First, it can omit data in memory that may not be useful for debugging the kernel (e.g. memory filled with zeros, free memory, userspace memory, etc). Second, for the data that is included, it can compress each page of memory, if the output format supports it. Incidentally, it also supports the ability to “filter out” data for particular symbols or data structures, but this doesn’t reduce the data size: it simply redacts the data.

In a typical configuration, the kdump kernel is configured to use makedumpfile to save the /proc/vmcore file to disk or the network. You might expect this to take longer than simply copying the file to disk, but usually it’s faster: the slow part is writing the file to disk, and by omitting and compressing data, the time spent doing I/O is greatly decreased. So the end result is a vmcore which is quicker to generate, and takes up much less space than the original /proc/vmcore file would have. It’s less common, but perfectly valid to use makedumpfile to create a compressed dump of your currently-running kernel too, by running it on /proc/kcore instead. Makedumpfile also supports running against already-created core dump files, in order to re-filter them or convert the format.
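
On Oracle Linux, this behavior is configured in /etc/kdump.conf via the core_collector directive. The exact flags vary by release, but a typical line looks something like this:

core_collector makedumpfile -l --message-level 7 -d 31

Here, -l selects LZO compression, and -d 31 is the dump level, which controls which kinds of pages are excluded.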

Hypervisors

So far, all of the core dumps we’ve discussed came from Linux’s /proc/kcore or /proc/vmcore, with a possible intermediate step through makedumpfile. However, the Linux kernel is frequently run as a virtual machine guest, and in those cases, a hypervisor is responsible for managing the VM’s memory. It’s entirely possible for a hypervisor to create a core dump itself, by pausing the execution of the VM (to ensure consistency), and then saving a copy of that memory and the CPU state. This may be necessary in rare cases where the VM guest is unresponsive for some reason. If the normal means of triggering a panic & kdump within the guest OS are unsuccessful, a hypervisor core dump could provide you the necessary information to resolve the issue.

A common example today of a hypervisor creating a core dump is QEMU, which supports a dump-guest-memory QMP/HMP command, frequently used via libvirt’s virsh dump command. You can also create memory dumps with Hyper-V, Xen, and other hypervisors.
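
For example, to have QEMU write an ELF-formatted vmcore through libvirt (the domain name and output path here are placeholders):

$ virsh dump --memory-only --format elf mydomain /var/tmp/mydomain.vmcore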

Some hypervisors, such as Hyper-V, have their own custom dump formats. Others, like QEMU, support a range of formats. And naturally, each hypervisor tends to create its core dump format using its own implementation and quirks. For example, ELF vmcores generated by QEMU look different from the ELF /proc/vmcore, which itself looks different from the /proc/kcore file. As a result, it’s not always enough to simply know what format a vmcore is in: you may also need to know what created it.

Others

While /proc/vmcore, /proc/kcore, and hypervisors are certainly the most common sources of core dumps we see, this list is by no means exhaustive. Some server vendors provide firmware-based diagnostics and dumping mechanisms, which are similar in principle to hypervisor core dumps: the firmware takes the place of the hypervisor, halting the machine and saving elements of physical memory. These vendor-specific solutions will vary widely in capability, quality, and their output formats. And even outside of these solutions, there are other application-specific solutions (e.g. those which are better suited for embedded devices). And of course, there are historical systems & formats which are mostly unused today.

For the purpose of this article, we can’t cover all of those, so we’ll stick to the world of Linux ELF vmcores, makedumpfile, and hypervisors. These are the ones we see most commonly with Oracle Linux.

Data contained in core dumps

As we’ve seen, there’s quite a diversity of tools which can be used to create a kernel core dump. But they’re all trying to achieve the same end goal: provide enough data that an analysis tool can understand the dump, so that a user can debug whatever issue led to the core dump being created in the first place.

The main information provided by the core dump is, of course, the contents of memory. However, on their own, the memory contents are not enough for tools to extract meaningful information such as variable values, stack traces, log buffer contents, etc. Tools need additional information to properly interpret a core dump:

  1. Probably the most important information is the exact kernel release as reported by uname -r, for example: 5.15.0-104.119.4.2.el9uek.x86_64. This string usually identifies a specific kernel build released by your distribution, and it is generally enough to identify the specific *-debuginfo package which applies to that kernel. Increasingly, debuggers are relying on a special value called the “build ID” which is unique to each build and can also be used to find debugging information, so the build ID may be an important piece of metadata as well. The debugging information typically contains an ELF symbol table as well as DWARF information that describes variables, types, and much more, allowing debuggers to introspect data structures.
  2. Another fundamental requirement is to know the architecture that the core dump came from. This obviously includes the architecture name (x86_64, aarch64, etc.), but it may also include details such as page size, word size, or endianness that could be left unspecified by the architecture.
  3. The kernel is responsible for managing memory, and that includes maintaining the page tables. The vast majority of the kernel works with virtual addresses, and so debugging tools need to understand those virtual address mappings. Some core dump formats can represent the virtual address space as part of their memory encoding. In other cases, the core dump simply represents physical memory addresses, and leaves the debugger to find and interpret the page tables via other metadata. In Linux, the symbol swapper_pg_dir refers to the virtual address of the root page table. With some architecture specific metadata to translate that virtual address to its corresponding physical address, and with page table support for the architecture, a debugger can traverse these tables and understand the virtual address mappings.
  4. Most kernel configurations randomize at boot time the physical base address at which the kernel is loaded, as well as the virtual memory addresses where the kernel is mapped. These are different forms of so-called KASLR, or Kernel Address Space Layout Randomization. While it’s possible to search for well-known data in a core dump in order to “break” the KASLR, this takes a long time and can be error-prone, so it’s important for the dump to contain these KASLR offsets.
  5. Finally, the values of the registers for each CPU are very important. These include the program counter & stack pointer, which are crucial for creating an accurate stack trace for the code executing on a particular CPU. These are typically represented in an architecture-specific ELF note called NT_PRSTATUS.

These are just the low-level metadata that a debugger would need in order to interpret variables and data structures in memory, as well as unwind the stacks of active tasks. However, there are other kinds of metadata which need to be considered. For example, makedumpfile, which we’ve already discussed, has the ability to omit certain kinds of memory pages from its output. In order to do that, makedumpfile needs to be able to understand the kernel’s array of page frames. It can use debuginfo for this, but in practice, debuginfo is rarely available at runtime. So makedumpfile can use additional metadata that declares the sizes and member offsets of certain structures, as well as addresses of certain essential symbols. Other tools directly use the vmcore (without debuginfo) to extract the kernel log buffer, and so metadata for symbols and types related to the log buffer are also frequently needed.
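
The vmcore-dmesg utility from kexec-tools is one such tool: it relies only on the metadata embedded in an ELF vmcore to locate and print the log buffer, no debuginfo required. For example:

$ vmcore-dmesg /proc/vmcore > dmesg.txt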

All told, there’s quite a bit of metadata that’s necessary to interpret the kernel core dump. Some of this metadata is represented by the dump format intrinsically (e.g. ELF contains the architecture information in its header, and can hold virtual address mappings in the program headers). However, many of the Linux-specific details, such as page table locations, KASLR offsets, kernel release, and more are contained in a special piece of data called the “vmcoreinfo”.

The vmcoreinfo is a text-formatted, key=value block of data. Normally, a page of memory is allocated at startup by the Linux kernel, and the data is formatted and written into this page. The page is never overwritten or deallocated, so it’s always available – if you can find it. The kernel includes this already-created vmcoreinfo in the ELF /proc/kcore and /proc/vmcore files as an ELF note. This makes it easy to explore on a running system: simply run eu-readelf -n /proc/kcore as root and look for the “VMCOREINFO” note and its data. Much of the above metadata is represented as keys in this text, for example:

  • OSRELEASE - the kernel release
  • BUILD-ID - the build ID of the kernel (vmlinux)
  • PAGESIZE - physical page frame size
  • KERNELOFFSET - this is x86_64 specific, but it contains the KASLR offset
  • SYMBOL(swapper_pg_dir) - the root page table
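
Here’s a truncated look at some of these keys on a live system (the values, including the build ID, are made up for this example):

$ sudo eu-readelf -n /proc/kcore | grep -E 'OSRELEASE|BUILD-ID|PAGESIZE|KERNELOFFSET|swapper_pg_dir'
  OSRELEASE=5.15.0-104.119.4.2.el9uek.x86_64
  BUILD-ID=83852a5a83a4f28e55c36005f0b3863500b5adcb
  PAGESIZE=4096
  SYMBOL(swapper_pg_dir)=ffffffff9c813000
  KERNELOFFSET=1a600000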

The strength of the vmcoreinfo is that it is text-based, so it’s easy to extend with new information that wasn’t anticipated, unlike binary file formats. Another benefit is that it’s included in kernel memory anyway, so it’s almost always present somewhere in a core dump, if you’re willing to search for it. The downside is that core dump formats need to be aware of it and either include a copy, or include a pointer to it. If that copy or pointer is lost, or if the tool which created the dump never knew where it was (e.g. hypervisors), then it can be difficult or impossible to use with tools like Crash and Drgn.

Core dump formats

Now that we know the diversity of core dump producers, and that there is a large amount of metadata that needs to be well represented in a core dump, we’re ready to tackle an incomplete list of the more common core dump formats. In each section, we’ll show how to identify the format (frequently, by using the file utility, but sometimes using a hex dumper like xxd). We’ll also describe the benefits and drawbacks, along with the common producers and consumers.

ELF

The Executable and Linkable Format (ELF) is ubiquitous in the Linux world. Most programs are in ELF, as are the intermediate outputs of the compiler, and also the core dumps of userspace programs. In order to suit that diversity of use cases, ELF has to be very flexible.

ELF is essentially composed of four parts:

  • The ELF header
  • The section headers (optional)
  • The program headers (optional)
  • Data

The ELF header points to the section & program headers, and gives some basic information about the architecture. The section & program headers are optional (but at least one of the two must be included), and they define regions of the data and provide metadata about them. The section headers are intended to be more useful for a compiler or linker: they define a series of “sections” of data in the file, each of which is named and has a type and flags. The program headers, on the other hand, define “segments”, which are intended to be used while executing a program. Each entry describes which parts of the file should be loaded into memory, at what address, and with what permissions. It’s like a recipe for creating a process image in memory.

ELF is typically used to represent core dumps using the program headers. Rather than describing how a loader should create the program in memory, a core dump contains the memory contents of the program, and the program headers describe where the contents were mapped at the time of the crash. The Linux Kernel creates userspace core dumps using the ELF format in this manner, and so it’s not surprising that /proc/kcore and /proc/vmcore are also represented as ELF files with memory regions described by program headers.

But how is the metadata handled? Some of it is included in the ELF header, especially the architecture. The remainder is usually stored in ELF “notes”. Program headers and section headers can both declare a section of the file as containing specially-formatted “note” data that can have arbitrary contents. The following notes are contained in vmcore/kcore files generated by Linux:

  • Notes of type PRSTATUS, which contain the registers for every CPU. (The /proc/kcore file only contains the PRSTATUS for the running CPU. Getting the data for other CPUs would require sending inter-processor interrupts to collect the data, and by the time it was returned to userspace, the data would be stale anyway.)
  • A note of type PRPSINFO can contain the kernel command line
  • A note of type VMCOREINFO will contain the vmcoreinfo note with the metadata described above.

Detecting ELF Core Dumps

You can tell a file is an ELF core dump via the file command. For example:

$ sudo file /proc/kcore
/proc/kcore: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from 'BOOT_IMAGE=(hd0,gpt3)/vmlinuz-5.14.0-284.25.1.0.1.el9_2.x86_64 root=/dev/mapper'

You can also view the program headers that define the core dump memory via readelf -l FILE (or eu-readelf -l FILE), and you can see the notes (with contents, if possible) via eu-readelf -n FILE. For example, here are the program headers for a /proc/kcore and a /proc/vmcore respectively:

sh-5.1# readelf -l /proc/kcore

Elf file type is CORE (Core file)
Entry point 0x0
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NOTE           0x0000000000000238 0x0000000000000000 0x0000000000000000
                 0x0000000000001b60 0x0000000000000000         0x0
  LOAD           0x00007fffff602000 0xffffffffff600000 0xffffffffffffffff
                 0x0000000000001000 0x0000000000001000  RWE    0x1000
  LOAD           0x00007fff81002000 0xffffffff81000000 0x000000007d600000
                 0x0000000001630000 0x0000000001630000  RWE    0x1000
  LOAD           0x0000490000002000 0xffffc90000000000 0xffffffffffffffff
                 0x00001fffffffffff 0x00001fffffffffff  RWE    0x1000
  LOAD           0x00007fffc0002000 0xffffffffc0000000 0xffffffffffffffff
                 0x000000003f000000 0x000000003f000000  RWE    0x1000
  LOAD           0x0000088000003000 0xffff888000001000 0x0000000000001000
                 0x000000000009e000 0x000000000009e000  RWE    0x1000
  LOAD           0x00006a0000002000 0xffffea0000000000 0xffffffffffffffff
                 0x0000000000003000 0x0000000000003000  RWE    0x1000
  LOAD           0x000008806f003000 0xffff88806f001000 0x000000006f001000
                 0x000000000ffff000 0x000000000ffff000  RWE    0x1000
  LOAD           0x00006a0001bc2000 0xffffea0001bc0000 0xffffffffffffffff
                 0x0000000000400000 0x0000000000400000  RWE    0x1000
sh-5.1# readelf -l /proc/vmcore

Elf file type is CORE (Core file)
Entry point 0x0
There are 4 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NOTE           0x0000000000001000 0x0000000000000000 0x0000000000000000
                 0x0000000000001700 0x0000000000001700         0x0
  LOAD           0x0000000000003000 0xffffffff88600000 0x0000000056200000
                 0x0000000001630000 0x0000000001630000  RWE    0x0
  LOAD           0x0000000001633000 0xffff9a7ec0100000 0x0000000000100000
                 0x000000006ef00000 0x000000006ef00000  RWE    0x0
  LOAD           0x0000000070533000 0xffff9a7f3f000000 0x000000007f000000
                 0x0000000000fe0000 0x0000000000fe0000  RWE    0x0

You can see the /proc/kcore file contains many program header entries: this is because it is representing several virtual address regions, like the kernel’s vmalloc and vmemmap regions, in addition to the kernel’s mapping of code, and the kernel’s direct mapping of physical memory.

By comparison, /proc/vmcore contains far fewer headers. It doesn’t represent vmalloc or vmemmap regions, because the currently running kernel doesn’t have enough information to find them.

Benefits & Drawbacks of ELF Core Dumps

Since ELF is such a widespread, enduring standard, there are many tools that support it. As we’ve seen above, readelf and its elfutils-based variant, eu-readelf, can both show detailed information related to the file format, without the need for a user to write code to analyze it. Further, ELF is one of the few formats that “normal debuggers” (i.e. those which don’t specialize in the Linux kernel) can use.

However, ELF has a few drawbacks. Much of the data in a kernel core dump is not useful for debugging, and can or should be omitted. The ELF program headers, which define the memory of the core dump, can be used to help omit this data: segments can be split up so that the gaps between them exclude unneeded data. Unfortunately, this comes at a cost. Each additional segment requires an entry of 56 bytes (for a 64-bit ELF file), and the number of program headers is limited by the size of the header field e_phnum: a 16-bit integer whose maximum value is 65,535. That limit could conceivably be hit by systems with large amounts of memory. It can be exceeded (search for PN_XNUM in elf(5)), but doing so is a bit clunky. Despite these oddities, makedumpfile -E does support outputting ELF-formatted vmcores with pages excluded in exactly this way. It’s worth trying this out for yourself, if you want to explore: makedumpfile -E -d 31 /proc/kcore my_dump.elf would create such a file.

The far more important limitation is that the ELF program header defines no field or flag to indicate compression for a segment, so there is no broadly compatible way for an ELF vmcore to include compressed data. Of course, it is possible to compress the entire ELF file, but this may not be practical due to memory or CPU constraints.

Variant: QEMU ELF

Not all ELF kernel core dumps are created by the kernel or makedumpfile, though. QEMU (and thus, virsh dump) has the option to create an ELF-formatted vmcore. It uses mostly the same format as the Linux kernel, but it does include some extra QEMU-specific data for each CPU, in addition to the PRSTATUS note. It also includes a nearly empty section header table containing no information of value.

The main concern with QEMU’s ELF core dumps is whether they contain the vmcoreinfo note. Hypervisors may have access to the guest memory, but they don’t have any general-purpose way to see the vmcoreinfo data that the kernel has prepared at boot time. This means that, unless you’ve explicitly configured it, vmcores generated by QEMU won’t have a vmcoreinfo note. Thankfully, QEMU implements a virtualized “vmcoreinfo” device. The guest Linux kernel can detect its presence at boot and write its vmcoreinfo data into this device once it is ready. The QEMU hypervisor can then store this data alongside the virtual machine, in case it must later create a vmcore. If you’ve run QEMU with -device vmcoreinfo and you have a properly configured kernel, then QEMU will include that data into its core dumps.

It’s worth noting that, even if the ELF core dump doesn’t contain the vmcoreinfo as an ELF note, the data is still there buried in the core dump. With the right tools, you can search through the memory contents and find it.
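
For an uncompressed ELF dump, even a naive search can turn it up (output illustrative):

$ strings -a vmcore | grep '^OSRELEASE='
OSRELEASE=5.14.0-284.25.1.0.1.el9_2.x86_64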

More Variation: Virtual Addresses in ELF

As we’ve seen, ELF program headers are a quite flexible way of storing memory metadata. They allow each memory segment to be assigned a virtual and physical memory address. This means that an ELF core dump can represent the kernel’s virtual address space as well as its physical address space. For /proc/kcore and /proc/vmcore, the kernel does exactly that, since it already knows its own virtual address mappings. When creating ELF vmcores, makedumpfile will also populate the virtual address field according to the kernel’s virtual address mappings.

However, hypervisor or firmware level vmcores tend not to include the virtual address information in core dumps. While the memory mapping information is available to a hypervisor, it may not be easily accessible. For example, KVM-based hypervisors delegate page table management to processor virtualization extensions wherever possible, so the hypervisor itself may not even know the current guest page tables. Parsing the page tables from the guest memory would be expensive and could result in a denial of service if a malicious guest crafted a very large set of page tables. So hypervisors tend to avoid that complexity, and they’ll only include the physical addresses for memory. QEMU does have the support for including virtual address information, but it must be explicitly enabled (e.g. dump-guest-memory -p).
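
With libvirt, you can reach this HMP command directly, passing -p to request virtual address information (the domain name and output path are placeholders):

$ virsh qemu-monitor-command --hmp mydomain 'dump-guest-memory -p /var/tmp/mydomain.vmcore'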

When hypervisors don’t have virtual address information available, the behavior is not well-defined. Some, like QEMU, include a fake virtual address (the same value as the physical address). This can make it difficult for a debugger to detect whether a vmcore really has accurate virtual memory mappings.

Kdump-compressed Format

While ELF may be a ubiquitous file format for programs, libraries, and userspace core dumps, the kdump-compressed format is by far the most common format used by our customers for kernel core dumps. A default Oracle Linux installation with kdump enabled will use makedumpfile to create core dumps in kdump-compressed format, which offers several advantages over ELF (the main one being compression, which reduces file size significantly). However, this format wasn’t always ubiquitous, nor did it originate with makedumpfile.

In the dark days prior to kexec being a viable way to make crash dumps, there were a few projects that contained out-of-tree Linux kernel patches to enable core dumps, as well as utilities to use these dumps. Your author was not of an age to pay attention at the time that these systems were at their zenith, so you’ll be spared the history lesson. One such project was lkdump, in whose diskdumputils-0.1.5.tar.bz2 distribution you can see an early definition of a file format called diskdump, in dumpheader.h. While this project seems to have died off, the format seems to be the predecessor to the one currently in use by makedumpfile.

For the purposes of this description, we’ll call the dump format “kdump-compressed”, though in truth, it goes by many names. This is because there is no single standard, except what’s in makedumpfile’s source code. You would be forgiven for calling it a modified “diskdump” format, as that’s what makedumpfile’s own code seems to call it. However, the file utility, as well as makedumpfile(8), refer to it as kdump-compressed, so we’ll try to use that name too. You can see how recent versions of file recognize the format below:

$ file vmcore
vmcore: Kdump compressed dump v6, system Linux, node stepbren-ol9.local, release 5.14.0-284.25.1.0.1.el9_2.x86_64, version #1 SMP PREEMPT_DYNAMIC Fri Aug 4 09:00:16 PDT 2023, machine x86_64, domain (none)

Leaving aside matters of naming and provenance, the kdump-compressed format is primarily output by makedumpfile (though QEMU can output a variant of the format), and it is primarily read by crash, although the libkdumpfile library is a powerful option for consuming it as well. The strength of the format is in its ability to include or omit any page, as well as compress each page. A typical kdump-compressed vmcore can be just a fraction of the size of physical memory, though this will depend on the memory usage of the system, as well as the dump level and compression arguments passed to makedumpfile. This makes it ideal for use when a vmcore needs to be sent to remote engineers for analysis.

The downside of the kdump-compressed format, of course, is that it is quite niche. Any kernel-specific debugger, such as Crash or Drgn, can understand it, so in the common case, there aren’t many compatibility issues. However, if you’re having trouble opening a kdump-compressed vmcore (e.g. due to implementation bugs, corruption, etc), you’ll find fewer tools available to help you. There is no readelf equivalent for kdump-compressed files. You’ll likely find yourself writing your own tools to help in diagnosing these issues, possibly with the help of libkdumpfile. In fact, the examples directory in the libkdumpfile project contains some genuinely useful programs for investigating issues.

As shown above, recent versions of the file tool do a good job of identifying this format. However, older versions may unhelpfully report only “data”. You can manually look for the file header using xxd:

$ xxd ./vmcore | head -n 4
00000000: 4b44 554d 5020 2020 0600 0000 4c69 6e75  KDUMP   ....Linu
00000010: 7800 0000 0000 0000 0000 0000 0000 0000  x...............
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................

If the first 8 bytes are "KDUMP " (that is, KDUMP followed by three spaces), then you can be confident that you’re looking at a kdump-compressed vmcore. If the first 8 bytes are "DISKDUMP", then you have a quite old core dump from the old diskdump utilities. If, however, the first 12 bytes are "makedumpfile", then you should continue reading, as this is a “flattened” kdump file, discussed below.

Kdump Format Details

The modern kdump-compressed format manages to achieve impressive compactness through a few optimizations. First, it uses a bitmap to represent the availability of any given page. This is quite efficient for the common cases, where there is a mix of pages included and excluded. For the common page size of 4096 bytes, each gigabyte of physical memory requires 32KiB of bitmap space in the output file (a gigabyte spans 262,144 pages, and at one bit per page, that’s 32KiB), regardless of whether any of that memory is actually included in the output file.

In addition to this bitmap, the format maintains an array of 24-byte descriptors, one for each page included in the dump. The descriptor contains information about compression, size, and location of the data in the dump. It also reserves 8 bytes within this descriptor for a field called page_flags, which seems intended to be populated from the corresponding struct page in the kernel. From quick inspection, neither crash nor libkdumpfile, the two major consumers of the format, use this field. So there is always room for improvement.

To store metadata, the format uses an interesting trick. Rather than define its own format to store PRSTATUS or VMCOREINFO notes, the latest version (v6) of the format simply reuses the ELF Note structures & formats. This is a major advantage, since it allows any defined ELF note type (or custom notes) to be inserted into the vmcore, and it allows sharing code in systems which may output either ELF or kdump-compressed files.

Finally, the kdump-compressed format is not designed with any particular mechanism for encoding the kernel’s virtual address mappings. Instead, consumers like crash and libkdumpfile are expected to use architecture-specific code and VMCOREINFO metadata to find and interpret the page tables.

Variant: flattened format

To properly write the kdump-compressed format, the core dump creator (typically makedumpfile) needs to seek between two locations in the output file: the header area, which contains the bitmaps and page descriptors, and the data area, where the actual compressed page contents are written out. Since the compressed size of pages cannot be known ahead of time, the descriptors can’t be filled in until after the pages are written, yet they need to sit near the beginning of the file to serve as an index for the variable-size page data. While seeking between these locations is no problem for conventional files, it is impossible if you would like to output the core dump to stdout, or transmit it via a network socket: pipes and sockets do not support lseek(2).

To resolve this issue, makedumpfile introduced the “flattened” variant of the kdump-compressed format. Instead of using lseek(fd, offset, SEEK_SET) to go to a particular offset, and then write(fd, data, size), makedumpfile simply writes the offset and size out, followed by the data. Before reading the data, a “reassembly” phase is necessary, which simply reads each offset + size record, and follows the instructions to create the final output file.

This format has a header which starts with the 12 bytes “makedumpfile”, so it can be recognized quite easily with xxd, as shown below.
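
For example, the first 16 bytes of a flattened file look like this (the signature is NUL-padded to 16 bytes):

$ xxd -l 16 ./vmcore.flat
00000000: 6d61 6b65 6475 6d70 6669 6c65 0000 0000  makedumpfile....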

There is only one advantage to this format: it’s the most practical way to output the kdump-compressed format to a socket or pipe. However, once the core dump is saved, there is no point in using the flattened format! Flattened vmcores can be “reassembled” into a normal kdump-compressed vmcore using the command makedumpfile -R, or makedumpfile-R.pl.
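
Reassembly reads the flattened stream from standard input and writes the standard kdump-compressed file named on the command line:

$ makedumpfile -R ./vmcore < ./vmcore.flat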

For convenience, some analysis tools support directly reading the flattened variant of the format: Crash, starting with version 5.1.2; libkdumpfile, starting with 0.5.3; and Drgn, starting with 0.0.25 (alongside a capable libkdumpfile). However, this support comes at a cost, which is frequently not advertised to the user. In order to directly read a flattened vmcore, these tools must first build an index of what the reassembled file should look like. This process is time-consuming for large vmcores, and as of this writing, no implementation saves the resulting index, so it needs to be recomputed each time the file is opened. Frequently, tools like Crash don’t explain the indexing process to the user, nor do they mention that the indexing phase could be avoided entirely by running makedumpfile -R one time to generate a standard vmcore.

To add even more confusion to the situation, QEMU’s default “kdump” output setting creates a flattened vmcore, not a standard kdump-compressed vmcore. This was done for much the same reason as makedumpfile: to allow outputting core dumps to pipe file descriptors. But unlike makedumpfile, which falls back to the flattened format only when lseek() fails, QEMU versions prior to 8.2 implement only the flattened format. Since QEMU 8.2, a new output setting called “kdump-raw” has been added, which corresponds to the standard kdump format. Since it is opt-in, users are forced to know the difference between the flattened and standard formats. Users who don’t may end up with a vmcore incompatible with their tools, or one which is painfully slow to analyze.

What’s worse, it’s possible for QEMU to omit the VMCOREINFO note from its vmcores, as we’ve discussed already. In the unhappy case where a flattened vmcore is produced without a VMCOREINFO note, the resulting file is nearly impossible to load in Crash, but Crash will spend a good deal of time trying to open it prior to failing. Your author has witnessed many well-qualified engineers give up on these sorts of core dumps, unaware of the subtleties of these formats, and unaware that none of the obstacles are insurmountable.

Xen ELF

While we have already covered ELF formatted vmcores, the ELF format used by the Xen hypervisor really deserves its own category. A core dump created by running xm dump-core from the dom0 (hypervisor) results in an ELF file, on which file will cheerfully report:

$ file xendump
xendump: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), no program header

This all seems fine, except for that last bit: “no program header”. As we saw previously, the ELF vmcores produced by Linux and QEMU use the program headers to describe the memory segments and where they belong in memory. ELF section headers don’t have fields to represent this sort of information. Let’s take a look at the ELF sections that are present:

$ readelf -S xendump
There are 7 section headers, starting at offset 0x40:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .shstrtab         STRTAB           0000000000000000  100bfdfc0
       0000000000000048  0000000000000000           0     0     0
  [ 2] .note.Xen         NOTE             0000000000000000  00000200
       0000000000000568  0000000000000000           0     0     0
  [ 3] .xen_prstatus     PROGBITS         0000000000000000  00000768
       0000000000002860  0000000000001430           0     0     8
  [ 4] .xen_shared_info  PROGBITS         0000000000000000  00002fc8
       0000000000001000  0000000000001000           0     0     8
  [ 5] .xen_pages        PROGBITS         0000000000000000  00004000
       00000001003f8000  0000000000001000           0     0     4096
  [ 6] .xen_pfn          PROGBITS         0000000000000000  1003fc000
       0000000000801fc0  0000000000000008           0     0     8

There are some immediately recognizable items here. .note.Xen is a NOTE section, so it probably contains some metadata about the hypervisor. The .xen_prstatus section shares a name with the PRSTATUS notes we saw in previous core dumps, so it almost certainly contains the register state of each CPU. For the remainder, we can refer to a greatly appreciated documentation file turned up by a web search. To quote:

".xen_pfn" section
        name            ".xen_pfn"
        type            SHT_PROGBITS
        structure       array of uint64_t
        description
                This elements represents the frame number of the page
                in .xen_pages section.
                The size of arrays is stored in xch_nr_pages member of header
                note descriptor in .note.Xen note section.
                The entries are stored in ascending order.
                The value, ~(uint64_t)0, means invalid pfn and the
                corresponding page has zero. There might exist invalid
                pfn's at the end part of this array.
                This section must exist when the domain is auto translated
                physmap mode. Currently x86 full virtualized domain and
                ia64 domain.
[...]
".xen_pages" section
        name            ".xen_pages"
        type            SHT_PROGBITS
        structure       array of page where page is page size byte array
        description
                This section includes the contents of pages.
                The corresponding address is described in .xen_p2m section
                or .xen_pfn section.
                The page size is stored in xch_page_size member of header note
                descriptor in .note.Xen section.
                The array size is stored in xch_nr_pages member of header note
                descriptor in .note.Xen section.
                This section must exist.

So, the .xen_pages section contains the actual memory data (which makes sense, as it is the largest), and the .xen_pfn section provides the PFN (essentially the physical address divided by the page size) of each page. This also gives some info on the note contents: they provide some of that critical metadata, like page sizes.

All this is to say that, while the Xen core dump may be in the ELF format, it is not at all similar to the ELF vmcores we’ve seen before. This is both a testament to, and a consequence of, ELF’s incredible flexibility. It’s worth noting that, thanks to the .xen_pfn section, it seems like it’s possible for this format to exclude pages from the vmcore, much like makedumpfile does with the kdump-compressed format. However, it’s unlikely that Xen actually uses this capability, since deciding which pages should be excluded is difficult for a hypervisor. On the other hand, unlike the kdump-compressed format, this format still cannot support per-page compression: the metadata is designed with the expectation that each page in the .xen_pages takes up the full PAGE_SIZE bytes.

Without more experience with the format, it’s difficult to say what its advantages are (beyond being the format available if you’re using Xen). It could be that the various metadata can provide insight into the hypervisor state as well. However, the disadvantages here are clear. Though ELF tools will analyze it, the non-standard use of application-specific sections instead of program headers precludes most standard debugging tools from understanding it. Of course, this does not apply to Crash, which does fully support Xen vmcores.

Other formats

Like we’ve already said, this listing of kernel core dump formats is far from complete. Simply browsing the code in Crash which is responsible for identifying the core dump format is overwhelming, as it reveals several others which aren’t even mentioned here:

  • A few additional Xen-specific formats
  • A “kvmdump” format which seems to be obsolete now
  • An “sadump” format which seems related to a BIOS dump capability on some Fujitsu servers
  • Dump formats for the netdump and LKCD projects which predate kdump
  • Some formats for VMWare, as well as one related to Mission Critical Linux.

And these are just the formats Crash supports! Your author has also had the pleasure of using Hyper-V to create a virtual machine snapshot, and then using the vm2core tool provided by azure-linux-utils to create an ELF file which was marginally useful (with some tweaking) with Crash. Surely there are even more exotic formats yet to be found.

Core dump consumers

Throughout this article, we’ve mentioned a few tools which can be used to analyze Linux vmcores, namely: GDB, Crash, libkdumpfile, and drgn. In this section, we’ll briefly give some pros and cons of each tool, especially as they relate to their supported input formats and analysis capabilities.

GDB

GDB needs no introduction: it is the GNU project’s very well-known general-purpose debugger. GDB is of course capable of debugging running processes, but it also supports ELF-formatted core dumps, typically by running:

$ gdb EXECUTABLE
...
(gdb) core-file CORE

It is capable of debugging some Linux core dumps or /proc/kcore, and in fact the Linux kernel contains a set of scripts which could be used for debugging the kernel with GDB. However, this solution is limited in a number of ways. GDB only supports standard ELF core dumps, and it relies on the virtual addresses in the program headers to do address translations. If the program headers are missing for certain ranges, or if the ELF vmcore was from a source that didn’t include the virtual address translations, then GDB won’t be able to understand the core.

It also seems that GDB may not support kernels with KASLR enabled, but further research is needed to confirm that this is still a relevant concern. Needless to say, while GDB is a quite powerful debugger, it’s not designed for the kernel, and so it’s not regularly used for it.

Crash

The Crash utility can be thought of as a successor to the kernel’s GDB scripts. It wraps a GDB process and implements support for a broad variety of core dump formats, as well as details like page table translation for various architectures. Users can run common GDB commands or several quite useful kernel-specific helpers. The PyKdump framework can be used to further extend this system with Python scripting.

Crash is able to combine the power of GDB with support for almost every core dump format under the sun, which makes it a quite impressive debugging tool. It sets the standard for interactive kernel core dump debuggers.
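
A typical invocation pairs the debuginfo vmlinux with the vmcore; the path below is typical for RPM-based distributions, but yours may vary:

$ crash /usr/lib/debug/lib/modules/5.14.0-284.25.1.0.1.el9_2.x86_64/vmlinux vmcore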

libkdumpfile

Unlike Crash and GDB, libkdumpfile is not a debugger per se. Instead, it is a library which implements the ability to read many different types of core dumps, including ELF and kdump-compressed formats. Its bundled library, libaddrxlat, implements the details of address translation for a variety of architectures. You can write simple applications to read data out of a vmcore, and by using this library, you’ll find it shocking how many core dump formats and architectures your tools will support.

drgn

Drgn is a kernel (and userspace) debugger as a Python library. Rather than using debugger commands, it allows users to write Python code that treats the kernel’s data structures like regular Python objects. It supports standard ELF core dumps (and the running kernel) natively, and it relies on libkdumpfile to understand other formats, like kdump-compressed files. Its support for more esoteric core dumps (e.g. hypervisor core dumps which may be missing metadata) is behind alternatives like Crash, but it is improving.
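
For example, here’s a brief session against a vmcore, assuming debuginfo is available (output illustrative):

$ drgn -c vmcore
>>> prog["init_uts_ns"].name.release
(char [65])"5.14.0-284.25.1.0.1.el9_2.x86_64"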

Other Tools

When presented with a core dump that is difficult to understand, you may want to revisit some of the tools used throughout this article to better understand it:

  • file is a good first step! Be sure to use the most recent possible version, as its detection is constantly changing.
  • readelf and eu-readelf (from binutils and elfutils, respectively) provide critical tools for examining ELF files:
    • -l for viewing program headers
    • -S for viewing section headers
    • -n for viewing notes. Prefer eu-readelf here, because it supports printing the contents of some note types, like build IDs and VMCOREINFO.
  • xxd is quite useful for viewing data headers, though any hex dumper will do.
  • Some more specific needs may require some scripts, or even a libkdumpfile based tool. You can find some example tools in the libkdumpfile examples directory, as well as some of my own tools here.
  • Finally, some of the best diagnostic information when analyzing a core dump can come from reading the code of tools designed to generate or read them. To that end, here are some links to a few relevant portions of several important projects:
    • Linux: kcore.c implements /proc/kcore
    • Makedumpfile: diskdump_mod.h contains the definition of kdump-compressed format
    • QEMU: dump.c contains QEMU’s implementation of creating ELF and kdump-compressed vmcores.
    • Crash: diskdump.c contains the implementation of reading kdump-compressed files. However, there are lots of other relevant files for different formats and variants.
    • libkdumpfile: diskdump.c contains implementations related to kdump-compressed vmcores, and elfdump.c contains implementations related to ELF

Conclusion

Kernel core dumps are complex. They are not simply copies of system memory; they contain plenty of extra metadata which is critical to understanding their contents. And like any other type of data, the design of the file formats can enable lots of flexibility and power. However, due to the broad variety of tools out there, the diversity of dump formats is overwhelming, and the lack of documentation or specifications compounds the problem. While the ecosystem, and especially high quality tools like Crash, makedumpfile, libkdumpfile, and drgn, generally work very well together, there are still compatibility issues that can be difficult to work around. Hopefully this guide can provide the first step in understanding these issues, so that you can be better equipped to fix your vmcores in the future.

If you’ve made it this far in this article, then your interest in debugging the kernel is quite impressive. Your feedback, experiences, questions, and contributions related to core dump formats and kernel debuggers are wanted! Please join us on the linux-debuggers kernel mailing list and share your experiences.

Stephen Brennan

