News, tips, partners, and perspectives for the Oracle Linux operating system and upstream Linux kernel work

Accelerating Linux Boot Time

This blog post is written by Pasha Tatashin, on the Oracle Linux kernel team.

Linux Reboot

There are many reasons why it is important for machines to reboot quickly. Here are just a few examples:

  1. Increase uptime and availability
  2. Improve customer experience: machines feel faster overall when the boot/reboot process is fast
  3. Due to a shorter downtime, customers may choose to apply critical patches faster, thereby improving security

All of the above provide a competitive advantage.

Linux Early Boot Improvements

Since the beginning of this year, I have been looking into ways to improve Linux boot and reboot performance. So far, I have looked only at early boot and have identified a number of problems. Fixes for several of them have already been merged into mainline Linux, and others are in the process of being integrated.

Here is the current state of Linux reboot time on one of the largest SPARC machines, which contains 3064 CPUs and 32T of memory (most of the fixes are in generic code, so similar improvements are expected on other architectures with large configurations):

  [Table: boot time comparison, Linux1 (before) vs. Linux2 (after)]

Note: Firmware initialization is many times longer on raw hardware compared to the above. The data was measured in an LDom guest with virtualized I/O.

  • Linux1 is a mainline kernel before my changes.
  • Linux2 includes the following changes in addition:
        1. add_node_ranges() is slow
          On a 32T system, this function takes 82 seconds because it walks memory one page at a time to determine which NUMA node each page belongs to. After optimization, it takes less than one second. This work is already in mainline Linux.
        2. Skip caching OpenFirmware CPU nodes for faster boot
          Linux caches information about every CPU from OpenBoot, and with 3064 CPUs this takes 75.56s. However, we do not really need any information about CPU nodes on sun4v, because all of it is contained in the hypervisor's machine descriptor. With this fix applied, caching data from OpenBoot during boot takes less than a second. This change has not yet been submitted to mainline.
        3. Zero "struct pages" in parallel
          During boot the memblock allocator zeroes all the memory for struct pages using one thread. This fix uses the deferred struct page initialization capability to zero struct pages in parallel. The improvement from this fix is about 73s on a 32T machine. The patches have been submitted to the upstream community and are now under active review.
        4. Enable deferred struct page initialization (more on this below)
          Linux has an option to initialize "struct page" in parallel later in boot. This fix saves 171s of boot time. This option is currently disabled on SPARC pending completion of some dependent work (e.g., memory hotplugging).
        5. Slow hash table initialization
          Hash tables such as inode_hashtable, dentry_hashtable, and others that are allocated using alloc_large_system_hash() can become large as memory sizes increase. Many of them can be initialized by simply zeroing them (using memset()). Also, some of them do not have to grow as fast as they currently do. I added a new adaptive scaling algorithm to tackle this problem. The total time saved is 58s. The patch is in mainline.
        6. Early boot time stamps
          This work does not improve performance by itself, but it is what enabled us to detect many of the problems described above.
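
The add_node_ranges() speedup in the first item comes from no longer doing work per page. The following is a minimal Python sketch of that general idea, not the kernel's actual code: the NodeRange type, the function names, and the 8K page size are assumptions made for illustration, and the sketch assumes that node ownership is described by sorted, non-overlapping, page-aligned address ranges.

```python
from typing import List, NamedTuple, Tuple

PAGE_SIZE = 8192  # assumed page size, for illustration only


class NodeRange(NamedTuple):
    start: int  # inclusive byte address
    end: int    # exclusive byte address
    node: int   # NUMA node that owns [start, end)


def node_of(addr: int, layout: List[NodeRange]) -> int:
    """Return the node owning addr (linear search keeps the sketch short)."""
    for r in layout:
        if r.start <= addr < r.end:
            return r.node
    raise ValueError(f"address {addr:#x} is not covered by any node range")


def split_per_page(start: int, end: int,
                   layout: List[NodeRange]) -> List[Tuple[int, int, int]]:
    """Slow approach: look up the owning node for every single page."""
    out: List[Tuple[int, int, int]] = []
    addr = start
    while addr < end:
        node = node_of(addr, layout)
        if out and out[-1][2] == node and out[-1][1] == addr:
            # Same node as the previous page: extend the current piece.
            out[-1] = (out[-1][0], addr + PAGE_SIZE, node)
        else:
            out.append((addr, addr + PAGE_SIZE, node))
        addr += PAGE_SIZE
    return out


def split_per_range(start: int, end: int,
                    layout: List[NodeRange]) -> List[Tuple[int, int, int]]:
    """Fast approach: intersect the block with each node range once."""
    out: List[Tuple[int, int, int]] = []
    for r in layout:
        lo, hi = max(start, r.start), min(end, r.end)
        if lo >= hi:
            continue  # no overlap with this node range
        if out and out[-1][2] == r.node and out[-1][1] == lo:
            # Adjacent piece on the same node: merge it.
            out[-1] = (out[-1][0], hi, r.node)
        else:
            out.append((lo, hi, r.node))
    return out
```

Both functions return the same (start, end, node) pieces for page-aligned input, but the first does work proportional to the number of pages while the second does work proportional to the number of node ranges, which is why the difference becomes dramatic at 32T.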

Complete Deferred Page Initialization

Every physical page frame in the system has an associated "struct page", which describes the status of the page. During boot, the kernel allocates memory for all struct pages and initializes it. Traditionally, the initialization is done early in boot, before the secondary CPUs are brought online, and can therefore take a significant amount of time on larger machines. This problem was partially resolved in 2015, when Mel Gorman added the parallel struct page initialisation feature. Mel's project made the kernel initialize only a small subset of pages early in boot, with the rest being initialized later when other CPUs are available. However, there are a couple of ways to improve the initialization time of struct pages even further:

  1. Even though the initialization is done in parallel, the zeroing of the memory where struct pages are stored is still done by a single thread. This is because Linux's early boot memory allocator (memblock) always zeroes the memory that it allocates. The fix is to move the zeroing later in boot, to the time when other CPUs are available. However, there are some complications, because some parts of the kernel expect struct page memory to be either initialized or zeroed. Therefore, we must make sure that nothing accesses "struct page"s before the memory is zeroed and initialized. This project is currently under review and makes an Oracle X5-8 with 1T of memory boot 22 seconds faster. Because the savings are proportional to memory size and the X5-8 scales up to 6T, the boot time savings on a fully configured X5-8 would be around 132 seconds (6 x 22s).
    The project also fixed some preexisting boot-time bugs and added debugging checks to the memblock allocator to make sure new bugs won't be introduced in the future.

  2. The second optimization that we can apply to struct page initialization is to use more than one thread per node. A single memory node can contain a very large amount of memory. As an example, an X5-8 with 6T of memory has 0.75T per node, and this value keeps growing with new systems. The newer Xeon processors support up to 1.5T per node, while the EPYC processors from AMD support up to 2T per node. Daniel Jordan included this optimization in his ktask work, which is also currently under review -- read about it in his Oracle Linux Kernel blog post, ktask: A Generic Framework for Parallelizing CPU-Intensive Work.
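
The two optimizations above share one structure: split the memory that must be zeroed into chunks and hand the chunks to several workers, with more than one worker per node. Here is a minimal user-space Python sketch of that structure, offered under stated assumptions rather than as the kernel or ktask implementation: the function names are hypothetical, a bytearray stands in for struct page memory, and each (start, end) pair stands in for one node's portion of it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def split_range(start: int, end: int, parts: int) -> List[Tuple[int, int]]:
    """Split [start, end) into at most `parts` contiguous, non-empty chunks."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    size = end - start
    if size <= 0:
        return []
    step = -(-size // parts)  # ceiling division
    return [(s, min(s + step, end)) for s in range(start, end, step)]


def zero_chunk(view: memoryview, start: int, end: int) -> None:
    """Zero one chunk; each worker touches only its own byte range."""
    view[start:end] = bytes(end - start)


def zero_in_parallel(buf: bytearray,
                     node_ranges: List[Tuple[int, int]],
                     threads_per_node: int) -> int:
    """Zero every node range using several workers per node.

    Returns the number of chunks that were processed.
    """
    chunks = [chunk
              for start, end in node_ranges
              for chunk in split_range(start, end, threads_per_node)]
    if not chunks:
        return 0
    with memoryview(buf) as view, \
            ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        futures = [pool.submit(zero_chunk, view, s, e) for s, e in chunks]
        for future in futures:
            future.result()  # re-raise any worker exception
    return len(chunks)
```

For example, zero_in_parallel(buf, [(0, 500), (500, 1000)], 4) treats a 1000-byte buffer as two nodes and zeroes it in eight chunks. In CPython the interpreter lock means this will not actually run faster than a single thread; the sketch only illustrates how the work is partitioned so that no two workers touch the same memory.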

Early Boot Time Stamps

Linux has a very nice feature that puts a timestamp at the beginning of each line printed on the console. CONFIG_PRINTK_TIME is enabled by default on many distros. This allows us to quickly catch boot and even runtime regressions by using the dmesg -d command (the -d flag is only available in newer versions of dmesg). However, the timestamps become available quite late in boot, so if we spend a significant amount of time early in boot, that time is never reported. In fact, all of the improvements described in this post happen before printk timestamps are available. Therefore, two projects were done to make timestamps available early in boot: one for SPARC and another for x86. The SPARC work is already in mainline, while the x86 work is currently under review.
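
To show how timestamped console lines help locate slow boot steps, here is a small Python sketch in the spirit of what dmesg -d makes visible: it parses the bracketed timestamps and reports the largest gaps between consecutive messages. The function names are hypothetical, and the log lines in the usage note are invented for illustration, not taken from a real boot.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as "[   12.345678] some message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def parse_log(lines: Iterable[str]) -> List[Tuple[float, str]]:
    """Extract (timestamp, message) pairs, skipping lines without a timestamp."""
    entries = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            entries.append((float(match.group(1)), match.group(2)))
    return entries


def largest_gaps(lines: Iterable[str],
                 top: int = 3) -> List[Tuple[float, str, str]]:
    """Return up to `top` (gap_seconds, message_before, message_after) tuples,
    ordered from the largest gap to the smallest."""
    entries = parse_log(lines)
    gaps = [
        (cur_time - prev_time, prev_msg, cur_msg)
        for (prev_time, prev_msg), (cur_time, cur_msg)
        in zip(entries, entries[1:])
    ]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)[:top]
```

Given the invented lines "[    0.000000] Booting Linux", "[    0.250000] Memory: ready", "[   82.750000] NUMA ranges added", and "[   83.000000] Bringing up secondary CPUs", largest_gaps(lines, top=1) reports an 82.5 second gap between the second and third messages. Note that such a tool can only see time spent after the timestamps start counting, which is exactly why early boot timestamps were needed.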


