Steve Sistare is a kernel development architect at Oracle. In this blog post, he gives tips and tricks for making your systems boot faster.
In October Pasha Tatashin shared his work on booting the kernel substantially faster, especially for large systems:
However, booting the kernel is only part of the job, as services must also be started in userland. That time can be significant and depends on the configuration. Oracle Linux is configured to satisfy a wide range of requirements by default, but if you are willing to tweak the configuration, you can substantially reduce your boot time.
The largest single improvement you can make is setting the kernel command line "quiet" option, which stops kernel messages from being printed to the console. These messages can consume many seconds of time during boot when the console is connected via a serial interface with a low baud rate setting. On an x86 system running Oracle Linux 7 where the default ILOM serial console rate is 9600 baud, the kernel+initrd+userspace time as reported by systemd-analyze was reduced from 62.259 seconds to 21.425 seconds after enabling the quiet option. You can instead increase the serial rate to 115200, but doing so is a pain: you must change it in the ILOM, in the BIOS, in GRUB, and in the kernel command line arguments. Even then, quiet is still a few seconds faster than 115200 baud. Also, the kernel messages are not lost; they can still be viewed with the dmesg and journalctl commands, provided the kernel boots successfully! Don't use quiet during kernel development when you need to see the kernel output up to the point of failure. Lastly, you won't see much improvement using quiet when booting VMs that use para-virtualized (as opposed to emulated) console devices, because they are not limited by serial baud rate.
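To get a feel for why a slow serial console costs so much, here is a back-of-the-envelope estimate of how much console output would account for the time saved above. It assumes 8N1 framing (10 bits on the wire per character) and that the whole difference is serial output, neither of which is stated in the measurements, so treat the result as a rough figure only.

```shell
# Rough estimate: how many characters fit in the time that quiet saved?
# Assumption: 8N1 serial framing, so one character costs 10 bits.
baud=9600
saved=$(awk 'BEGIN { printf "%.3f", 62.259 - 21.425 }')
chars=$(awk -v b="$baud" -v s="$saved" 'BEGIN { printf "%d", (b / 10) * s }')
echo "saved ${saved}s, roughly ${chars} characters at ${baud} baud"
```

Under those assumptions the saved time corresponds to roughly 39,000 characters of console output, which is a plausible size for a verbose boot log.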
To enable the quiet option:
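One possible way to do this, assuming GRUB 2 with its defaults kept in /etc/default/grub (the usual layout on Oracle Linux 7), is to append quiet to GRUB_CMDLINE_LINUX and then regenerate the GRUB configuration. The helper name add_quiet is made up for this sketch, and the grub.cfg path in the comment applies to BIOS systems; it differs on UEFI.

```shell
# add_quiet is a hypothetical helper, not part of any package.
# It appends "quiet" to GRUB_CMDLINE_LINUX in the given file,
# unless the option is already there.
add_quiet() {
    file=$1
    # Already present as a separate word? Then leave the file alone.
    if grep -Eq '^GRUB_CMDLINE_LINUX="(.* )?quiet( .*)?"$' "$file"; then
        return 0
    fi
    sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 quiet"/' "$file"
}

# Possible usage as root (not run here):
#   add_quiet /etc/default/grub
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

After a reboot, "cat /proc/cmdline" should show quiet among the kernel arguments.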
Use the latest Oracle Linux updates whenever possible to get the latest packages and optimizations. For example, an older version of the NetworkManager package had a bug where a gratuitous "carrier wait" timeout was enabled, delaying the boot by 5 seconds. This was fixed in OL7.4. Similarly, I analyzed a case where the kbd package was waiting 5 seconds for a PUT_FONT ioctl to succeed, on a KVM guest with no graphical console, so waiting was futile. Worse yet, this occurred once during the initrd boot phase, and again during the userspace phase, for a total delay of 10 seconds. This is also fixed in OL7.4.
Reduce or eliminate delays in grub. By default grub pauses for 5 seconds so you can choose a different kernel or modify kernel parameters, but this is configurable. Set it to 1 if you want this crutch for emergencies. Set it to 0 if you are running under KVM and have other ways to modify the parameters when a kernel fails to boot, such as by manually mounting your guest image and editing the grub files directly.
To change the timeout:
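A minimal sketch, again assuming GRUB 2 with its defaults in /etc/default/grub: set GRUB_TIMEOUT there and regenerate the configuration. The helper name set_grub_timeout is hypothetical, and the grub.cfg path in the comment is the BIOS location.

```shell
# set_grub_timeout is a hypothetical helper, not part of any package.
# It sets GRUB_TIMEOUT in the given file, adding the line if it is missing.
set_grub_timeout() {
    file=$1
    seconds=$2
    if grep -q '^GRUB_TIMEOUT=' "$file"; then
        sed -i "s/^GRUB_TIMEOUT=.*/GRUB_TIMEOUT=$seconds/" "$file"
    else
        echo "GRUB_TIMEOUT=$seconds" >> "$file"
    fi
}

# Possible usage as root (not run here):
#   set_grub_timeout /etc/default/grub 1
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```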
Verify that autofs is configured correctly. Recent changes in this package cause a 10-second delay at boot time if automount entries in /etc/nsswitch.conf are misconfigured, and you will see a message like "automount: problem reading master map, maximum wait exceeded" in the journal. See https://lkml.org/lkml/2017/5/26/64 for more details.
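A quick way to check whether a boot hit this delay is to search the journal for that message. The sketch below wraps the search in a small filter; the name automount_wait_seen is made up for illustration, and it matches only the message text so that a process ID in the journal prefix does not get in the way.

```shell
# automount_wait_seen is a hypothetical filter: it reads log text on stdin
# and succeeds if the automount master map timeout message appears.
automount_wait_seen() {
    grep -q 'problem reading master map, maximum wait exceeded'
}

# Possible usage (not run here):
#   journalctl -b | automount_wait_seen && grep '^automount' /etc/nsswitch.conf
```

If the message is present, the automount line printed from /etc/nsswitch.conf is the place to start looking.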
After following the above tips, let's see how we are doing. Use the systemd-analyze command to see the overall boot time and the time taken by the top services. Here I boot an OL7.4 KVM guest on an Intel Xeon Platinum CPU:
<... boot and login to the console ...>
# systemd-analyze
Startup finished in 697ms (kernel) + 879ms (initrd) + 3.877s (userspace) = 5.454s

# systemd-analyze blame
    1.799s kdump.service
    1.166s NetworkManager-wait-online.service
     396ms postfix.service
     321ms network.service
     247ms systemd-udev-settle.service
     244ms tuned.service
     151ms dev-mapper-ol\x2droot.device
     148ms lvm2-monitor.service
     119ms lvm2-pvscan@251:2.service
     113ms plymouth-quit.service
     112ms plymouth-quit-wait.service
     104ms systemd-vconsole-setup.service
    ...
Next, if you are using KVM with a bridge interface between the host and guest, and you know that the network topology has no loops, then disable the spanning tree protocol (STP) on the bridge. This saves 2 or more seconds of guest boot time. When STP is enabled, the bridge drops all packets received from a newly discovered address until the forwarding delay has expired; this avoids packet flooding when loops are present. Hence the guest's DHCP request packet is dropped, leading to dhclient timeout and retry.
To see your bridges:
# brctl show
bridge name     bridge id               STP enabled     interfaces
virbr0          8000.525400fe7ca3       yes             virbr0-nic
To see the forwarding delay:
# brctl showstp virbr0 | grep forward
 forward delay              2.00        bridge forward delay       2.00
To disable stp:
# brctl stp virbr0 off
If you are using libvirt, disable stp in the default configuration to make the change persist across host reboot:
# virsh net-edit default
Find the line "<bridge name='virbr0' stp='on' delay='0'/>"
Change on to off.
(At first glance, you might think that delay='0' would solve the problem. However, the kernel enforces a minimum value of 2 seconds.)
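If you would rather script the libvirt change than edit it interactively, something along these lines may work. It assumes the network XML uses single-quoted attributes as in the line quoted above; stp_off is a made-up helper name and /tmp/default.xml is an arbitrary scratch path.

```shell
# stp_off is a hypothetical filter: it reads libvirt network XML on stdin
# and writes the same XML with stp='on' changed to stp='off' on the
# <bridge> element. Other lines pass through untouched.
stp_off() {
    sed "/<bridge /s/stp='on'/stp='off'/"
}

# Possible usage as root (not run here):
#   virsh net-dumpxml --inactive default | stp_off > /tmp/default.xml
#   virsh net-define /tmp/default.xml
#   virsh net-destroy default && virsh net-start default
```

Restarting the network disconnects running guests from the bridge, so this is best done while no guests are using it.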
After disabling stp, we get the following boot time. Note that the NetworkManager-wait-online time is much smaller, as we have eliminated the DHCP delay:
$ systemd-analyze
Startup finished in 678ms (kernel) + 903ms (initrd) + 2.746s (userspace) = 4.328s

$ systemd-analyze blame
    1.700s kdump.service
     397ms postfix.service
     294ms network.service
     252ms systemd-udev-settle.service
     206ms tuned.service
     203ms dev-mapper-ol\x2droot.device
     168ms NetworkManager-wait-online.service
     157ms plymouth-quit-wait.service
     157ms plymouth-quit.service
     135ms lvm2-monitor.service
      95ms systemd-vconsole-setup.service
    ...
Disable services you don't need, using "systemctl disable <name.service>". Use the "systemd-analyze blame" command to see the candidates, as shown above. For example, if you don't need a kernel crash dump after panics, or are willing to wait until a problem occurs to enable it, then disabling kdump.service saves almost 2 seconds of boot time. If you are not running a mail server, then disable postfix.service. Unconvinced that tuned.service does any noticeable tuning? Disable it.
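To shortlist candidates without reading the whole list, you could filter the blame output by a time threshold. The sketch below is one way to do that; slow_units is a made-up helper name, and it only understands entries with a single time field in "ms" or "s" form like the ones shown above, silently skipping anything else (such as entries that took minutes).

```shell
# slow_units is a hypothetical filter: it reads "systemd-analyze blame"
# output on stdin and prints the units that took at least the given
# number of milliseconds.
slow_units() {
    threshold_ms=$1
    awk -v limit="$threshold_ms" '
        NF == 2 {
            t = $1
            if (t ~ /^[0-9.]+ms$/)     { sub(/ms$/, "", t); ms = t + 0 }
            else if (t ~ /^[0-9.]+s$/) { sub(/s$/, "", t);  ms = t * 1000 }
            else next
            if (ms >= limit) print $2
        }'
}

# Possible usage (not run here):
#   systemd-analyze blame | slow_units 200
```

Whether a listed unit is safe to disable is still a judgment call; the filter only tells you where the time goes.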
Putting it all together:
$ systemctl disable kdump.service
$ systemctl disable postfix.service
$ systemctl disable tuned.service
<... reboot and login to the console ...>

$ systemd-analyze
Startup finished in 681ms (kernel) + 863ms (initrd) + 1.136s (userspace) = 2.681s

$ systemd-analyze blame
     299ms network.service
     273ms systemd-udev-settle.service
     178ms dev-mapper-ol\x2droot.device
     160ms NetworkManager-wait-online.service
     149ms lvm2-monitor.service
     148ms plymouth-quit-wait.service
     136ms plymouth-quit.service
      94ms systemd-vconsole-setup.service
    ...