
News, tips, partners, and perspectives for the Oracle Linux operating system and upstream Linux kernel work

Recent Posts

Oracle Linux sessions at Open Source Summit Europe 2020

Open Source Summit connects the open source ecosystem under one roof. It covers cornerstone open source technologies, helps ecosystem leaders navigate open source transformation, and delves into the newest technologies and latest trends touching open source. It is an extraordinary opportunity for cross-pollination between the developers, sysadmins, DevOps professionals, IT architects, and business and community leaders driving the future of technology. Check out the Oracle Linux sessions at this event and register today:

Tuesday, October 27, 13:00 GMT
DTrace: Leveraging the Power of BPF - Kris Van Hees, Oracle Corp.

BPF and the overall tracing infrastructure in the kernel have improved tremendously and provide a powerful framework for tracing tools. DTrace is a well-known and versatile tracing tool that is being re-implemented to make use of BPF and kernel tracing facilities. The goal of this open source project (hosted on GitHub) is to provide a full-featured implementation of DTrace, leveraging the power of BPF to provide well-known functionality. The presentation will provide an update on the progress of the DTrace re-implementation project. Kris will share some of the lessons learnt along the way, highlighting how BPF provides the building blocks to implement a complex tracing tool. He will provide examples of creative techniques that showcase the power of BPF as an execution engine. Like any project, the re-implementation of DTrace has not been without some pitfalls, and Kris will highlight some of the limitations and unsolved problems the development team has encountered.

Wednesday, October 28, 13:00 GMT
The Compact C Type (CTF) Debugging Format in the GNU Toolchain: Progress Report - Elena Zannoni & Nicholas Alcock, Oracle

The Compact C Type Format (CTF) is a reduced form of debug information describing the types of C entities such as structures, unions, etc. It has been ported to Linux (from Solaris) and used to reduce the size of the debugging information for the Linux kernel and DTrace. It was extended to remove limits and add support for additional parts of the C type system. Last year, we integrated it into GCC and GNU binutils and added support for dumping CTF data in ELF objects and some support for linking CTF data into a final executable (and presented at this conference). This linking support was preliminary: it was slow and the CTF was large. Since last year, the libctf library and ld in binutils have gained the ability to properly deduplicate CTF with little performance hit: output CTF in linked ELF objects is now often smaller than the CTF in any input .o file. The libctf API has also improved, with support for new features, better error reporting, and a much-improved CTF iterator. This talk will provide an overview of CTF and the novel type deduplication algorithm used to reduce CTF size, and discuss the other contributions of CTF to the toolchain, such as compiler and debugger support.

Wednesday, October 28, 18:30 GMT
KVM Address Space Isolation - Alexandre Chartre, Oracle

First investigations into Kernel Address Space Isolation (ASI) were presented at Linux Plumbers Conference and KVM Forum last year. Kernel Address Space Isolation aims to mitigate some CPU hyper-threading data leaks possible with speculative execution attacks (like L1 Terminal Fault (L1TF) and Microarchitectural Data Sampling (MDS)). In particular, Kernel Address Space Isolation will provide a separate kernel address space for KVM when running virtual machines, in order to protect against a malicious guest VM attacking the host kernel using speculative execution attacks. Several RFCs for implementing this solution have been submitted. This presentation will describe the current state of the Kernel Address Space Isolation proposal, focusing on its usage with KVM, in particular the page table mapping requirements and the performance impact.

Thursday, October 29, 16:00 GMT
Changing Paravirt Lock-ops for a Changing World - Ankur Arora, Oracle

Paravirt ops are set in stone once a guest has booted. As an example, we might expose `KVM_HINTS_REALTIME` to a guest, and this hint is expected to stay true for the lifetime of the guest. However, events in a guest's life, like changed host conditions or migration, might mean that it would be more optimal to revoke this hint. This talk discusses two aspects of this revocation: first, support for a revocable `KVM_HINTS_REALTIME` and, second, work done in the paravirt ops subsystem to dynamically modify spinlock-ops.

Friday, October 30, 14:00 GMT
QEMU Live Update - Steven J. Sistare, Oracle

The ability to update software with critical bug fixes and security mitigations while minimizing downtime is valued highly by customers and providers. In this talk, Steve presents a new method for updating a running instance of QEMU to a new version while minimizing the impact on the VM guest. The guest pauses briefly, for less than 200 msec in the prototype, without loss of internal state or external connections. The old QEMU process exec's the new QEMU binary and preserves anonymous guest RAM at the same virtual address via a proposed Linux madvise variant. Descriptors for external connections are preserved, and VFIO pass-through devices are supported by preserving the VFIO device descriptors and attaching them to a new KVM instance after exec. The update method requires code changes to QEMU, but no changes are required in system libraries or the KVM kernel module.


Oracle Cloud Infrastructure

Creating an SSH Key Pair on the Linux Command Line for OCI Access

Introduction

SSH is the standard for command-line access to Linux systems. Oracle Linux Tips and Tricks: Using SSH is a good initial read. When an Oracle Cloud Infrastructure (OCI) instance is being created, a public SSH key needs to be provided in the web interface to enable password-less SSH access to the new instance. The question is: how do you produce the public SSH key that is needed? This post aims to help the reader achieve that objective on the Linux command line, where the ssh-keygen command is used to generate the necessary key pair.

Starting Up

Open a terminal in your Linux desktop GUI and make sure that you are logged in as the user account (e.g. my_user - avoid using the root account for general security reasons) that you would use to access the new Oracle Cloud Infrastructure instance via SSH.

Run ssh-keygen:

$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/my_user/.ssh/id_rsa):

Give a name to the key pair to be generated (e.g. my_ssh_key):

Enter file in which to save the key (/home/my_user/.ssh/id_rsa): my_ssh_key
Enter passphrase (empty for no passphrase):

Do not provide any passphrase and skip with enter.

Enter same passphrase again:
Your identification has been saved in my_ssh_key.
Your public key has been saved in my_ssh_key.pub.
The key fingerprint is:
SHA256:tXpJNaug8iUdIEVCM+7WHX8gqS/AfRi//tUKanA1Eo8 my_user@my_desktop
The key's randomart image is:
+---[RSA 2048]----+
|   .=.o          |
|   . =  ..       |
|    o o ++o o    |
|   o + BE=++ o   |
|    = = So+.o    |
|   . ..=.* + .   |
|    . +o* = . .  |
|     o =.o o .   |
|      ..o.. .    |
+----[SHA256]-----+

The file my_ssh_key.pub will have been created in your home directory:

$ cat my_ssh_key.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCkDBM0WOv+AzboCPaqhr8cAN/G
HBoclnR+Gvo9x4JZA9gPYQIhCgGet4E8YgcWLwa0tDrZJvg/DuVMfQ0oA2JiaWHN
W54lrfuACJVdF/8wZGKpgK5vnd7/pcAIZ9r6rdeaDyFSMEscNwX3pjEnkMp92ykQ
tO4rmxnHtqefsvh+O4i4DT4EQE0bUanLriYs59K1XMkA2bIUvnjjD7ILKyNqVeYK
hu5w/iS72+9l0U6nfifbyzy4VbqtOI1uU8bvdqeL7J6okTQjeJl/fW2tha//pNbm
/nTVyLOOdYXxmAZ8zXX7r6X4pZE5lmbmowk3AZTojlI7MTrYOKuQcxsusUJ my_user@my_desktop

Providing Key Information to the Oracle Cloud Infrastructure Instance

While creating the Oracle Cloud Infrastructure instance, in the "Add SSH Keys" section, choose "PASTE PUBLIC KEYS" and copy/paste the contents of the public key file (alternatively, you can upload the file too).

After the instance is created, use the ssh command with the private key to access it (where <ip_addr> is the IP address of the new Oracle Cloud Infrastructure instance):

$ ssh -i my_ssh_key opc@<ip_addr>
The authenticity of host '<ip_addr> (<ip_addr>)' can't be established.
ECDSA key fingerprint is SHA256:qD2zZE5hO0TYYEMQdDpSPz5izTuaFslwZiMOZp7kwDc.
ECDSA key fingerprint is MD5:ea:c3:e8:61:e9:29:7a:df:ae:b6:43:ad:5b:71:f7:90.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '<ip_addr>' (ECDSA) to the list of known hosts.
[opc@<ip_addr> ~]$

Summary

To access an Oracle Cloud Infrastructure instance via SSH from a Linux desktop, one can use the ssh-keygen command to generate the necessary SSH key pair and add the relevant information to the Oracle Cloud Infrastructure instance as described.
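To avoid typing the -i option and the IP address every time, an entry can also be added to the SSH client configuration file. This is only a minimal sketch; the host alias, key path, and address are placeholders for your own values, not something created by the steps above:

Host my-oci-instance
    HostName <ip_addr>
    User opc
    IdentityFile ~/my_ssh_key

With that in ~/.ssh/config, the instance can then be reached with a short command:

$ ssh my-oci-instance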


Announcements

Announcing updated Oracle Linux Templates for Oracle Linux KVM

Oracle is pleased to announce updated Oracle Linux Templates for Oracle Linux KVM and Oracle Linux Virtualization Manager. New templates include:

Oracle Linux 7 Update 8 Template
- Unbreakable Enterprise Kernel 5 Update 4 - kernel-uek-4.14.35-2025.400.8
- 8GB of RAM
- 37GB OS virtual disk

Oracle Linux 8 Update 2 Template
- Unbreakable Enterprise Kernel 6 - kernel-uek-5.4.17-2011.4.4
- 8GB of RAM
- 37GB OS virtual disk

The new Oracle Linux Templates for Oracle Linux KVM and Oracle Linux Virtualization Manager supply powerful automation. These templates are built on cloud-init, the same technology used today on Oracle Cloud Infrastructure, and include improvements and regression fixes.

Downloading Oracle Linux Templates for Oracle Linux KVM

Oracle Linux Templates for Oracle Linux KVM are available on the yum.oracle.com website in the "Oracle Linux Virtual Machine" download section.

Further information

The Oracle Linux 7 Template for Oracle Linux KVM allows you to configure different options on the first boot of your Virtual Machine; the cloud-init options configured on the Oracle Linux 7 Template are:

- VM Hostname: define the Virtual Machine hostname
- Configure Timezone: define the Virtual Machine timezone (from an existing list)
- Authentication - Username: define a custom Linux user on the Virtual Machine
- Password / Verify Password: define the password for the custom Linux user on the Virtual Machine
- SSH Authorized Keys: SSH authorized keys for password-less access to the Virtual Machine
- Regenerate SSH Keys: option to regenerate the Virtual Machine host SSH keys
- Networks - DNS Servers: define the Domain Name Servers for the Virtual Machine
- DNS Search Domains: define the Domain Name Server search domains for the Virtual Machine
- In-guest Network Interface Name: define the virtual NIC device name for the Virtual Machine (e.g. eth0)
- Custom script: execute a custom script at the end of the cloud-init configuration process (an example is sketched after the resource links below)

These options can be easily managed in the Oracle Linux Virtualization Manager web interface by editing the Virtual Machine and enabling the "Cloud-Init/Sysprep" option.

Further details on how to import and use the Oracle Linux 7 Template for Oracle Linux KVM are available in this technical article on Simon Coter's Oracle Blog.

Oracle Linux KVM & Virtualization Manager Support

Support for Oracle Linux Virtualization Manager is available to customers with an Oracle Linux Premier Support subscription. Refer to the Oracle Unbreakable Linux Network for additional resources on Oracle Linux support.

- Oracle Linux Virtualization Manager Resources
- Oracle Linux Resources
- Oracle Virtualization Resources
- Oracle Linux yum server
- Oracle Linux Virtualization Manager Training
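As an illustration of the "Custom script" cloud-init option mentioned above, a short first-boot script can be pasted into that field. This is only a hedged sketch; the package, service, and file path are arbitrary examples and not part of the template itself:

#!/bin/bash
# Hypothetical first-boot customization executed by cloud-init on the new VM.
# Installs and enables a web server as an example workload.
yum install -y httpd
systemctl enable --now httpd
echo "provisioned by cloud-init on $(hostname)" > /var/www/html/index.html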


Linux

NVMe over TCP

Oracle Linux kernel engineer Alan Adamson provides an introduction to connecting NVMe flash storage using TCP.

Oracle Linux UEK5 introduced NVMe over Fabrics, which allows transferring NVMe storage commands over an InfiniBand or Ethernet network using RDMA technology. UEK5U1 extended NVMe over Fabrics to also include Fibre Channel storage networks. Now with UEK6, NVMe over TCP is introduced, which again extends NVMe over Fabrics to use a standard Ethernet network without having to purchase special RDMA-capable network hardware.

What is NVMe-TCP?

The NVMe multi-queuing model implements up to 64K I/O submission and completion queues, as well as an administration submission queue and completion queue, within each NVMe controller. For a PCIe-attached NVMe controller, these queues are implemented in host memory and shared by both the host CPUs and the NVMe controller. I/O is submitted to an NVMe device when the device driver writes a command to an I/O submission queue and then writes to a doorbell register to notify the device. When the command has been completed, the device writes to an I/O completion queue and generates an interrupt to notify the device driver.

NVMe over Fabrics extends this design so that the submission and completion queues in host memory are duplicated in the remote controller, and a host-based queue pair is mapped to a controller-based queue pair. NVMe over Fabrics defines command and response capsules that are used by queues to communicate across the fabric, as well as data capsules. NVMe-TCP defines how these capsules are encapsulated within a TCP PDU (Protocol Data Unit). Each host-based queue pair and its associated controller-based queue pair maps to its own TCP connection and can be assigned to a separate CPU core.

NVMe-TCP Benefits

- The ubiquitous nature of TCP. TCP is one of the most common network transports in use and is already implemented in most data centers across the world.
- Designed to work with existing network infrastructure. In other words, there is no need to replace existing Ethernet routers, switches, or NICs, which simplifies maintenance of the network infrastructure.
- Unlike RDMA-based implementations, TCP is fully routable and is well suited for larger deployments and longer distances while maintaining high performance and low latency.
- TCP is actively being maintained and enhanced by a large community.

NVMe-TCP Drawbacks

- TCP can increase CPU usage because certain operations, like calculating checksums, must be done by the CPU as part of the TCP stack.
- Although TCP provides high performance with low latency, when compared with RDMA implementations the additional latency could affect some applications, in part because of the additional copies of data that must be maintained.

Setting up an NVMe-TCP Example

UEK6 was released with NVMe-TCP enabled by default, but to try it with an upstream kernel you will need to build with the following kernel configuration parameters:

CONFIG_NVME_TCP
CONFIG_NVME_TARGET_TCP=m

Setting Up a Target

$ sudo modprobe nvme_tcp
$ sudo modprobe nvmet
$ sudo modprobe nvmet-tcp
$ sudo mkdir /sys/kernel/config/nvmet/subsystems/nvmet-test
$ cd /sys/kernel/config/nvmet/subsystems/nvmet-test
$ echo 1 |sudo tee -a attr_allow_any_host > /dev/null
$ sudo mkdir namespaces/1
$ cd namespaces/1/
$ echo -n /dev/nvme0n1 |sudo tee -a device_path > /dev/null
$ echo 1 |sudo tee -a enable > /dev/null

If you don't have access to an NVMe device on the target host, you can use a null block device instead:

$ sudo modprobe null_blk nr_devices=1
$ sudo ls /dev/nullb0
/dev/nullb0
$ echo -n /dev/nullb0 > device_path
$ echo 1 > enable

Next, configure the target port and link it to the subsystem:

$ sudo mkdir /sys/kernel/config/nvmet/ports/1
$ cd /sys/kernel/config/nvmet/ports/1
$ echo 10.147.27.85 |sudo tee -a addr_traddr > /dev/null
$ echo tcp |sudo tee -a addr_trtype > /dev/null
$ echo 4420 |sudo tee -a addr_trsvcid > /dev/null
$ echo ipv4 |sudo tee -a addr_adrfam > /dev/null
$ sudo ln -s /sys/kernel/config/nvmet/subsystems/nvmet-test/ /sys/kernel/config/nvmet/ports/1/subsystems/nvmet-test

You should now see the following message captured in dmesg:

$ dmesg |grep "nvmet_tcp"
[24457.458325] nvmet_tcp: enabling port 1 (10.147.27.85:4420)

Setting Up the Client

$ sudo modprobe nvme
$ sudo modprobe nvme-tcp
$ sudo nvme discover -t tcp -a 10.147.27.85 -s 4420

Discovery Log Number of Records 1, Generation counter 3
=====Discovery Log Entry 0======
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not specified, sq flow control disable supported
portid:  1
trsvcid: 4420
subnqn:  nvmet-test
traddr:  10.147.27.85
sectype: none

$ sudo nvme connect -t tcp -n nvmet-test -a 10.147.27.85 -s 4420
$ sudo nvme list
Node          SN                Model  Namespace  Usage              Format       FW Rev
------------- ----------------- ------ ---------- ------------------ ------------ -------
/dev/nvme0n1  610d2342db36e701  Linux  1          2.20 GB / 2.20 GB  512 B + 0 B

You now have a remote NVMe block device exported via an NVMe over Fabrics network using TCP. You can write to and read from it like any other locally attached high-performance block device.

Performance

To compare NVMe-RDMA and NVMe-TCP, we used a pair of Oracle X7-2 hosts, each with a Mellanox ConnectX-5, running Oracle Linux (OL8.2) with UEK6 (v5.4.0-1944). A pair of 40Gb ConnectX-5 ports were configured with RoCEv2 (RDMA), performance tests were run, then the ports were reconfigured to use TCP and the performance tests were rerun. The performance utility fio was used to measure I/Os per second (IOPS) and latency (an example invocation is sketched at the end of this post).

When testing for IOPS, a single-threaded 8k read test with a queue depth of 32 showed RDMA significantly outperforming TCP, but when additional threads were added to take better advantage of the NVMe queuing model, TCP IOPS performance increased. When the number of threads reached 32, TCP IOPS performance matched that of RDMA. Latency was measured using an 8k read from a single thread with a queue depth of 1. TCP latency was 30% higher than RDMA; much of the difference is due to the buffer copies required by TCP.

Conclusion

Although NVMe-TCP suffers from newness, TCP does not, and with its dominance in the data center there is no doubt NVMe-TCP will be a dominant player in the data center SAN space. Over the next year, expect many third-party NVMe-TCP products to be introduced, from NVMe-TCP optimized Ethernet adapters to NVMe-TCP SAN products.
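The exact fio job definitions used for the tests above are not shown in the post; a command along the following lines reproduces the kind of 8k random-read test described, where the device path, runtime, and thread count are assumptions to adapt to your own setup:

$ sudo fio --name=8k-randread --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
      --rw=randread --bs=8k --iodepth=32 --numjobs=1 --runtime=60 --time_based --group_reporting

Increasing --numjobs (and optionally lowering --iodepth to 1 for the latency measurement) mirrors the thread-scaling and latency comparisons discussed in the Performance section.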


Announcements

Announcing the Unbreakable Enterprise Kernel Release 5 Update 4 for Oracle Linux

The Unbreakable Enterprise Kernel (UEK) for Oracle Linux provides the latest open source innovations, key optimizations, and security for enterprise cloud and on-premises workloads. It is the Linux kernel that powers Oracle Cloud and Oracle Engineered Systems such as Oracle Exadata Database Machine, as well as Oracle Linux on 64-bit Intel and AMD or 64-bit Arm platforms.

What's New?

The Unbreakable Enterprise Kernel Release 5 Update 4 (UEK R5U4) is based on the mainline kernel version 4.14.35. Through actively monitoring upstream check-ins and collaborating with partners and customers, Oracle continues to improve and apply critical bug and security fixes to UEK R5 for Oracle Linux. This update includes several new features, added functionality, and bug fixes across a range of subsystems. UEK R5 Update 4 can be recognized by a release number starting with 4.14.35-2025.400.8.

Notable changes:

- Core kernel functionality. UEK R5U4 provides equivalent core kernel functionality to UEK R5U3 and older, making use of the same upstream mainline kernel release, with additional patches to enhance existing functionality and provide some minor bug fixes and security improvements.
- Process Virtual Address Space Reservation. This feature allows reservation of process virtual address ranges; it is specifically developed to improve Oracle Database stability when ASLR (Address Space Layout Randomization) is enabled.
- File system and storage fixes. NFS is updated with several upstream fixes, along with improvements and optimizations for the page cache, RPC call handling, and NFSv4 clients. Issues with the NFS kernel server when hosted on an OCFS2 filesystem are resolved.
- Networking. UEK R5U4 supports 1/10/25/50/100 Gb Ethernet ports. 200 Gb Ethernet ports are not enabled in UEK R5U4 because the changes required to support them affect the kernel ABI. If you require the use of 200 Gb Ethernet ports, use UEK R6.
- Enhanced TCP stack diagnostics. Enhancements are added to the TCP stack to facilitate better diagnostics through extended Berkeley Packet Filter (eBPF) tracepoints, along with several optimizations that allow more rapid diagnostics and testing while also reducing the performance overhead related to tracing.
- Security. Spectre-v1 mitigation extensions: patches available in the upstream Linux 5.6 kernel are included to extend the Spectre-v1 mitigation by preventing index computations from causing speculative loads into the L1 cache.
- Driver updates. UEK R5 supports a large number of hardware devices. In close cooperation with hardware and storage vendors, Oracle has updated several device drivers from the versions in mainline Linux 4.14.35. Further details are available in section "1.2.1 Notable Driver Features and Updates" of the Release Notes.

For more details on these and other new features and changes, please consult the Release Notes for UEK R5 Update 4.

Security (CVE) Fixes

A full list of CVEs fixed in this release can be found in the Release Notes for UEK R5 Update 4.

Supported Upgrade Path

Customers can upgrade existing Oracle Linux 7 servers using the Unbreakable Linux Network or the Oracle Linux yum server by pointing to the "UEK Release 5" yum channel (an example is sketched at the end of this post).

Software Download

Oracle Linux can be downloaded, used, and distributed free of charge, and all updates and errata are freely available. This allows organizations to decide which systems require a support subscription and makes Oracle Linux an ideal choice for development, testing, and production systems. The user decides which support coverage is best for each system individually, while keeping all systems up to date and secure. Customers with Oracle Linux Premier Support also receive access to zero-downtime kernel updates using Oracle Ksplice.

Compatibility

UEK R5 Update 4 is fully compatible with the UEK R5 GA release. The kernel ABI for UEK R5 remains unchanged in all subsequent updates to the initial release.

About Oracle Linux

Oracle Linux is an open and complete operating environment that helps accelerate digital transformation. It delivers leading performance and security for hybrid and multi-cloud deployments. Oracle Linux is 100% application binary compatible with Red Hat Enterprise Linux. And, with an Oracle Linux Support subscription, customers have access to award-winning Oracle support resources and Linux support specialists, zero-downtime patching with Ksplice, cloud native tools such as Kubernetes and Kata Containers, KVM virtualization and the oVirt-based virtualization manager, DTrace, clustering tools, Spacewalk, Oracle Enterprise Manager, and lifetime support. All this and more is included in a single cost-effective support offering. Unlike many other commercial Linux distributions, Oracle Linux is easy to download and completely free to use, distribute, and update.

Resources

- Oracle Linux Documentation
- Oracle Linux Software Download
- Oracle Linux Blogs: Oracle Linux Blog, Oracle Virtualization Blog
- Community Pages
- Oracle Linux Social Media: Oracle Linux on YouTube, Oracle Linux on Facebook, Oracle Linux on Twitter
- Data Sheets, White Papers, Videos, Training, Support & more
- Oracle Linux Product Training and Education: Oracle Linux - education.oracle.com/linux
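As a concrete illustration of the upgrade path described above, the commands below show roughly how an Oracle Linux 7 system using the Oracle Linux yum server could switch to the UEK Release 5 channel and install the updated kernel. This is a sketch only: ol7_UEKR5 is the usual repository id on yum.oracle.com, but verify it against your configured repositories, and yum-config-manager requires the yum-utils package.

$ sudo yum-config-manager --enable ol7_UEKR5
$ sudo yum update kernel-uek
$ sudo reboot

Systems registered with the Unbreakable Linux Network would instead subscribe to the equivalent UEK Release 5 channel through ULN before running the update.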


Linux

Fuzzing the Linux kernel (x86) entry code, Part 3 of 3

In this third and final post in the series, Oracle Linux kernel and Ksplice engineer Vegard Nossum explores kernel fuzzing even further.

Note: Make sure to read part 1 and part 2 if you missed them.

In the previous blog post we ended up with a basic working fuzzer, but there are still many things we can do to improve its usability and utility. In this part we will explore some obvious and some non-obvious extensions.

General-purpose registers + delta states

One thing we haven't really covered so far is setting the other general-purpose registers to random values as well. The entry code does use some of those general-purpose registers in the course of its work, and if we do run into a bug somewhere then it might be more likely to crash with a random value. We might also want to try to find more subtle bugs where the kernel doesn't crash outright, but perhaps leaks a kernel address in some register that userspace never otherwise looks at. One way to check that the kernel behaves correctly and preserves our registers/flags/etc. would be to write out the register state just after returning from kernel mode. It's not very difficult to achieve, as we can movq all (or at least most of) the registers into fixed addresses (e.g. in the data page which we're already using for other things). The difficulty here is integrating this with running multiple entry attempts/syscalls in a single child process, since one would need to interleave the sanity checking with the entry attempts, and that could be quite fiddly to get right.

Minimising the probability of crashing

We already mentioned in part 2 that crashing the child process is pretty expensive, since it means starting an entirely new child process. Avoiding crashes as much as possible (and running as many entry attempts as possible in the same child process) could therefore be a viable strategy for improving the performance of the fuzzer. There are two main parts to this:

- Saving/restoring state that is needed down the line, e.g. you'll want to save and restore %rsp so that subsequent pushf/popf instructions will continue to work.
- Recovering from signal handlers, e.g. by installing handlers which can restore the process to a known good state.

Checking the generated assembly code

It's very easy to make a mistake in the assembly code generation and never notice, because the program was crashing anyway and you couldn't tell that you were getting an unexpected result. I had a bug like that where I didn't notice for 2 years that I had accidentally used the wrong byte order when encoding the address of the ljmp operand, so it never actually ran anything in 32-bit compatibility mode — oops! One very quick and easy way to check the assembly code is to use a disassembly library like udis86 and then verify some of the generated code by hand. All you need is something like this:

#include <udis86.h>

...

ud_t u;
ud_init(&u);
ud_set_vendor(&u, UD_VENDOR_INTEL);
ud_set_mode(&u, 64);
ud_set_pc(&u, (uint64_t) mem);
ud_set_input_buffer(&u, (unsigned char *) mem, (char *) out - (char *) mem);
ud_set_syntax(&u, UD_SYN_ATT);

while (ud_disassemble(&u))
    fprintf(stderr, " %08lx %s\n", ud_insn_off(&u), ud_insn_asm(&u));

fprintf(stderr, "\n");

KVM/Xen/Intel/AMD interactions

In at least one case, we saw an interaction with KVM where starting any KVM instance would corrupt the size of the GDTR (GDT register) and allow the fuzzer to hit a crash by using one of the segments outside the intended size of the GDT. This turned out to be exploitable to get ring 0 execution. In at least one other case, we saw an interaction when running in a hardware-accelerated nested guest (a guest within a guest). In general, KVM needs to emulate some aspects of the underlying hardware, and this adds quite a lot of complexity. It is quite possible that the fuzzer can find bugs in hypervisors such as KVM or Xen, so it is probably valuable to run the fuzzer both on different bare metal CPUs and under a variety of hypervisors. To create a KVM instance programmatically, see KVM host in a few lines of code by Serge Zaitsev. A related fun experiment could be to compile the fuzzer for Windows or other operating systems running on x86 and see how they fare. I briefly tested the Linux binary on WSL (Windows Subsystem for Linux) and nothing bad happened, so there's that.

Config/boot options

Config and boot options affect the exact operation of the entry code. On a recent kernel, I get these:

$ grep -o 'CONFIG_[A-Z0-9_]*' arch/x86/entry/entry_64*.S | sort | uniq
CONFIG_DEBUG_ENTRY
CONFIG_IA32_EMULATION
CONFIG_PARAVIRT
CONFIG_RETPOLINE
CONFIG_STACKPROTECTOR
CONFIG_X86_5LEVEL
CONFIG_X86_ESPFIX64
CONFIG_X86_L1_CACHE_SHIFT
CONFIG_XEN_PV

There are actually more options hidden behind header files as well. Building multiple kernels with different combinations of these options could help reveal combinations that are broken, perhaps only in edge cases triggered by the fuzzer. By looking through Documentation/admin-guide/kernel-parameters.txt you can also find a number of boot options that may influence the entry code. Here's an example Python script that generates random combinations of boot options, which is useful for passing on the kernel command line with KVM:

import random

flags = """nopti nospectre_v1 nospectre_v2 spectre_v2_user=off
spec_store_bypass_disable=off l1tf=off mds=off tsx_async_abort=off
kvm.nx_huge_pages=off noapic noclflush nosmap nosmep noexec32 nofxsr
nohugeiomap nosmt nosmt noxsave noxsaveopt noxsaves intremap=off
nolapic nomce nopat nopcid norandmaps noreplace-smp nordrand nosep
nosmp nox2apic""".split()

print(' '.join(random.sample(flags, 5)),
    "nmi_watchdog=%u" % (random.randrange(2), ))

ftrace

Ftrace inserts some code into the entry code when enabled, e.g. for system call and irqflags tracing. This could be worth testing as well, so I would recommend occasionally tweaking these files (under /sys/kernel/tracing) before running the fuzzer:

Path                                  Value
ftrace/enable                         1
trace_options                         userstacktrace
events/preemptirq/enable              1
events/preemptirq/irq_disable/enable  1
events/page_isolation/enable          1
events/raw_syscalls/enable            1
events/irq/enable                     1
events/irq_vectors/enable             1
options/func_stack_trace              1
events/stacktrace                     1
events/function-trace                 1

PTRACE_SYSCALL

We've already seen that ptrace changes the way system call entry/exit is handled (since the process needs to be stopped and the tracer needs to be notified), so it's a good idea to run some fraction of entry attempts under ptrace() using PTRACE_SYSCALL. It could also be interesting to try to tweak some or all of the traced process's registers while it is stopped by ptrace. Getting this completely right is pretty hairy, so I'll consider it out of scope for this blog post.
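That said, the basic shape of running a child's entry attempts under PTRACE_SYSCALL is straightforward. The following is only a rough sketch, reusing the fork()/PTRACE_TRACEME setup from part 2 (with child being the tracee's pid); it is not the fuzzer's actual implementation:

// Resume the child repeatedly, stopping it at every system call entry and exit.
while (1) {
    if (ptrace(PTRACE_SYSCALL, child, 0, 0) == -1)
        break;

    int status;
    if (waitpid(child, &status, 0) == -1)
        break;
    if (WIFEXITED(status) || WIFSIGNALED(status))
        break;

    if (WIFSTOPPED(status)) {
        // The child is stopped at a syscall entry/exit (or a signal); this is
        // where its registers could be read or tweaked, e.g. with
        // ptrace(PTRACE_GETREGS, ...) / ptrace(PTRACE_SETREGS, ...).
    }
}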

mkinitrd.sh

When I do my testing in a VM I prefer to bundle up the program in an initrd and run it as init (pid 1) so that I don't need to copy it onto a filesystem image. You can use a simple script like this:

#! /bin/bash

set -e
set -x

rm -rf initrd/
mkdir initrd/

g++ -static -Wall -std=c++14 -O2 -g -o initrd/init main.cc -lm

(cd initrd/ && (find | cpio -o -H newc)) \
    | gzip -c \
    > initrd.entry-fuzz.gz

If you're using Qemu/KVM, just pass -initrd initrd.entry-fuzz.gz and it will run the fuzzer as the first thing after booting.

Taint checking

If the fuzzer ever does stumble upon some kind of kernel crash or bug, it's useful to make sure we don't miss it! I personally pass oops=panic panic_on_warn panic=-1 on the kernel command line and -no-reboot to Qemu/KVM; this will ensure that any warning or panic will immediately cause Qemu to exit (leaving any diagnostics on the terminal). If you are doing a dedicated bare metal run (e.g. using the initrd method above), you would probably want panic=0 instead, which just hangs the machine. If you're doing a bare metal run on your regular workstation and don't want your whole machine to hang, another thing you can do is to check whether the kernel becomes tainted (which it does whenever there is a WARNING or a BUG) and simply exit:

int tainted_fd = open("/proc/sys/kernel/tainted", O_RDONLY);
if (tainted_fd == -1)
    error(EXIT_FAILURE, errno, "open()");

char tainted_orig_buf[16];
ssize_t tainted_orig_len = pread(tainted_fd, tainted_orig_buf, sizeof(tainted_orig_buf), 0);
if (tainted_orig_len == -1)
    error(EXIT_FAILURE, errno, "pread()");

while (1) {
    // generate + run test case
    ...

    char tainted_buf[16];
    ssize_t tainted_len = pread(tainted_fd, tainted_buf, sizeof(tainted_buf), 0);
    if (tainted_len == -1)
        error(EXIT_FAILURE, errno, "pread()");

    if (tainted_len != tainted_orig_len || memcmp(tainted_buf, tainted_orig_buf, tainted_len)) {
        fprintf(stderr, "Kernel became tainted, stopping.\n");
        // TODO: dump hex bytes or disassembly
        exit(EXIT_FAILURE);
    }
}

Network logging

In case the kernel ever crashes and it's not clear from the crash what the problem was, it can be very useful to log everything that is being attempted to the network. I'll just give a quick sketch of a UDP logging scheme:

int main(...)
{
    int udp_socket = socket(AF_INET, SOCK_DGRAM, 0);
    if (udp_socket == -1)
        error(EXIT_FAILURE, errno, "socket(AF_INET, SOCK_DGRAM, 0)");

    struct sockaddr_in remote_addr = {};
    remote_addr.sin_family = AF_INET;
    remote_addr.sin_port = htons(21000);
    inet_pton(AF_INET, "10.5.0.1", &remote_addr.sin_addr.s_addr);

    if (connect(udp_socket, (const struct sockaddr *) &remote_addr, sizeof(remote_addr)) == -1)
        error(EXIT_FAILURE, errno, "connect()");

    ...
}

Then, after the code for each entry/exit has been generated, you can simply dump it on this socket:

write(udp_socket, (char *) mem, out - (uint8_t *) mem);

Hopefully the last data received by the logging server (here 10.5.0.1:21000) will contain the assembly code that caused the crash. Depending on the exact use case, it might be worth adding some kind of framing so you can easily tell exactly where a test case starts and ends.

Check that the fuzzer catches known bugs

There have been a number of bugs in the entry code over the years. It could be interesting to build some of these old, buggy kernels and run the fuzzer on them to make sure it actually catches those known bugs as a sanity check.
We could perhaps also use the time it takes to find each bug as a measure of the fuzzer's efficiency, although we have to be careful not to optimize the fuzzer to only find these known bugs!

Code coverage/instrumentation feedback

Instrumentation

One of the things that makes fuzzers like AFL and syzkaller so effective is that they use code coverage to very accurately gauge the effect of tweaking individual bits of a test case. This is usually achieved by compiling C code with a special compiler flag that emits extra code to gather the coverage data. That's a lot trickier with assembly code, especially the entry code, since we don't know exactly what state the CPU is in (and what registers/state we can clobber) without manually inspecting each instruction of the code. However, if we really want code coverage, there is a way to do it. The x86 instruction set fortunately includes an instruction that takes both an immediate value and an immediate address, and which doesn't affect any other state (e.g. flags): movb $value, (addr). The only thing we need to be careful about is making sure that addr is a compile-time constant address which is always mapped to some physical memory and marked present in the page tables, so we don't incur a page fault while accessing it. Linux fortunately already has a mechanism for this: fixmaps, AKA "compile-time virtual memory allocation". With this we can statically allocate a compile-time constant virtual address which points to the same underlying physical page for all tasks and contexts. Since it is shared between tasks, we would have to clear or otherwise save/restore these values when switching between processes. By using a combination of C macros and assembler macros we can obtain a fairly non-intrusive coverage primitive that you can drop in anywhere in the entry code to record a taken code path. I have a patch for this, but there are a few corner cases to work out (e.g. it doesn't quite work when SMAP is enabled). Besides, I doubt the x86 maintainers would relish the thought of littering the entry code with these coverage annotations.

One thing that makes instrumentation feedback a lot more complicated on the fuzzer side is that you need a whole system to keep track of test cases, outcomes, and (possibly) which mutations you've applied to each test case. Because of this I've chosen to ignore code coverage for now; in any case, that's a general fuzzing topic and doesn't pertain much to x86 or the entry code in particular.

Performance counters/hardware feedback

A completely different approach to gathering code coverage would be to use performance counters. I know of two recent projects that are doing this:

- Resmack Fuzz Test
- kAFL

The big benefit here is obviously that no instrumentation (modification of the kernel) is required. Probably the biggest potential drawback is that performance counters are not completely deterministic (perhaps due to external factors like hardware interrupts or thermal throttling). Perhaps it also won't really work for the entry code, since only a really short amount of time is spent in the assembly code. In any case, here are a couple of links for further reading:

https://man7.org/linux/man-pages/man2/perf_event_open.2.html
http://www.brendangregg.com/perf.html

Bugs found

- CVE-2018-10901 — Local ring-0 privilege escalation
- Tracing vs. CR2 (multiple crashes)
- Enable FSGSBASE instructions
- at least one otherwise undisclosed crash in a third-party kernel
- several crashes during internal testing

Ksplice — also, we're hiring!
Ksplice is Oracle's technology for patching security vulnerabilities in the Linux kernel without rebooting. Ksplice supports patching entry code, and we have shipped several updates that do exactly this, including workarounds for many of the CPU vulnerabilities discovered in recent years:

- CVE-2014-9090: Privilege escalation in double-fault handling on bad stack segment.
- CVE-2014-9322: Denial-of-service in double-fault handling on bad stack segment. (BadIRET)
- CVE-2015-2830: Mis-handling of int80 fork from 64-bit applications.
- CVE-2015-3290, CVE-2015-3291, CVE-2015-5157: Multiple privilege escalations in NMI handling.
- CVE-2017-5715: Spectre v2.
- CVE-2018-14678: Privilege escalation in Xen PV guests.
- CVE-2018-8897: Denial-of-service in KVM breakpoint handling. (MovSS)
- CVE-2019-1125: Information leak in kernel entry code when swapping GS.
- CVE-2019-11091, CVE-2018-12126, CVE-2018-12130, CVE-2018-12127: Microarchitectural Data Sampling.
- CVE-2019-11135: Side-channel information leak in Intel TSX.
- Information leak in compatibility syscalls (x86/asm/entry/32: Simplify pushes of zeroed pt_regs->REGs; x86/entry/64/compat: Clear registers for compat syscalls, to reduce speculation attack surface).
- SMAP bypass in NMI handler (x86/asm/64: Clear AC on NMI entries).

Some of these updates were pretty challenging for various reasons and required ingenuity and a lot of attention to detail. Part of the reason we decided to write a fuzzer for the entry code was so that we could test our updates more effectively.

If you've enjoyed this blog post and you think you would enjoy working on these kinds of problems, feel free to drop us a line at ksplice-support_ww@oracle.com. We are a diverse, fully remote team spanning 3 continents. We look at a ton of Linux kernel patches and ship updates for 5-6 different distributions, totalling more than 1,100 unique vulnerabilities in a year. Of course, nobody can ever hope to be familiar with every corner of the kernel (and vulnerabilities can appear anywhere), so patch and source-code comprehension are essential skills. We also patch important userspace libraries like glibc and OpenSSL, which enables us to update programs using those libraries without restarting anything and without requiring any special support in those applications themselves. Other projects we've worked on include Known Exploit Detection and porting Ksplice to new architectures like ARM.


Linux

Fuzzing the Linux kernel (x86) entry code, Part 2 of 3

Oracle Linux kernel and ksplice engineer Vegard Nossum delves further into kernel fuzzing in this second of a three part series of blogs.   In part 1 of this series we looked at what the Linux kernel entry code does and how to JIT-assemble and call a system call. In this part, we'll have a closer look at flag registers, the stack pointer, segment registers, debug registers, and different ways to enter the kernel. More flags (%rflags) The Direction flag is not the only flag that could be interesting to play with. The Wikipedia article for %rflags lists a couple of others that look interesting to me: bit 8: Trap flag (used for single-step debugging) bit 18: Alignment check Most of the arithmetic-related flags (carry, parity, etc.) are not so interesting because they change a lot during normal operation of regular code, which means the kernel's handling of those is probably quite well tested. Some of the flags that could be interesting (e.g. the Interrupt Enable flag) may not be modified by userspace, so it's not very useful to even try. The Trap flag is interesting because when set, the CPU delivers a debug exception after every instruction, which naturally also interferes with the normal operation of the entry code. The Alignment check flag is interesting because it causes the CPU to deliver an alignment check exception when a misaligned pointer is dereferenced. Although the CPU is not supposed to perform alignment checking when executing in ring 0, it could still be interesting to see whether there are any bugs related to entering the kernel because of an alignment check exception (we'll get back to this later). The Wikipedia article gives a procedure for modifying these flags, but we can do a tiny bit better: 0: 9c pushfq 1: 48 81 34 24 00 01 00 00 xorq $0x100,(%rsp) 9: 48 81 34 24 00 04 00 00 xorq $0x400,(%rsp) 11: 48 81 34 24 00 00 04 00 xorq $0x40000,(%rsp) 19: 9d popfq This code pushes the contents of %rflags onto the stack, then directly modifies the flag to the value on the stack before popping that value back into %rflags. We actually have a choice here between using orq or xorq; I'm going with xorq since it will toggle whatever value was already in the register. This way, if we do multiple system calls (or kernel entries) in a row, we can toggle the flags at random without having to care what the existing value was. Since we're modifying %rflags anyway, we might as well bake the Direction Flag change into it and combine the modification of all three flags together into a single instruction. It's a tiny optimization, but there is no reason not to do it. The result is something like this: // pushfq *out++ = 0x9c; uint32_t mask = 0; // trap flag mask |= std::uniform_int_distribution<unsigned int>(0, 1)(rnd) << 8; // direction flag mask |= std::uniform_int_distribution<unsigned int>(0, 1)(rnd) << 10; // alignment check mask |= std::uniform_int_distribution<unsigned int>(0, 1)(rnd) << 18; // xorq $mask, 0(%rsp) *out++ = 0x48; *out++ = 0x81; *out++ = 0x34; *out++ = 0x24; *out++ = mask; *out++ = mask >> 8; *out++ = mask >> 16; *out++ = mask >> 24; // popfq *out++ = 0x9d; If we don't want our process to be immediately killed with SIGTRAP when the trap flag is set, we need to register a signal handler that will effectively ignore this signal (apparently using SIG_IGN is not enough): static void handle_child_sigtrap(int signum, siginfo_t *siginfo, void *ucontext) { // this gets called when TF is set in %rflags; do nothing } ... 
struct sigaction sigtrap_act = {}; sigtrap_act.sa_sigaction = &handle_child_sigtrap; sigtrap_act.sa_flags = SA_SIGINFO | SA_ONSTACK; if (sigaction(SIGTRAP, &sigtrap_act, NULL) == -1) error(EXIT_FAILURE, errno, "sigaction(SIGTRAP)"); You might wonder about the reason for the SA_ONSTACK flag; we will talk about that in the next section! Stack pointer (%rsp) After modifying %rflags, we don't really need to use the stack again, which means we are free to change it without impacting the execution of our program. Why would we want to change the stack pointer, though? It's not like the kernel will use our userspace stack for anything, right..? Well, actually, it might: Debugging tools like ftrace and perf occasionally dereference the userspace stack during e.g. system call tracing. In fact, I found at least two different bugs in this area: report 1 (July 16, 2019), report 2 (May 10, 2020). When delivering signals to userspace, the signal handler's stack frame is created by the kernel and usually located just above the interrupted thread's current stack pointer. If, by some mistake, %rsp is accessed directly by the kernel, it might not be noticed during normal operation, since the stack pointer usually always points to a valid address. To catch this kind of bug, we can simply point it at a non-mapped address (or perhaps even a kernel address!). To help us test various potentially interesting values of the stack pointer, we can define a helper: static void *page_not_present; static void *page_not_writable; static void *page_not_executable; static uint64_t get_random_address() { // very occasionally hand out a non-canonical address if (std::uniform_int_distribution<int>(0, 100)(rnd) < 5) return 1UL << 63; uint64_t value = 0; switch (std::uniform_int_distribution<int>(0, 4)(rnd)) { case 0: break; case 1: value = (uint64_t) page_not_present; break; case 2: value = (uint64_t) page_not_writable; break; case 3: value = (uint64_t) page_not_executable; break; case 4: static const uint64_t kernel_pointers[] = { 0xffffffff81000000UL, 0xffffffff82016000UL, 0xffffffffc0002000UL, 0xffffffffc2000000UL, }; value = kernel_pointers[std::uniform_int_distribution<int>(0, ARRAY_SIZE(kernel_pointers))(rnd)]; // random ~2MiB offset value += PAGE_SIZE * std::uniform_int_distribution<unsigned int>(0, 512)(rnd); break; } // occasionally intentionally misalign it if (std::uniform_int_distribution<int>(0, 100)(rnd) < 25) value += std::uniform_int_distribution<int>(-7, 7)(rnd); return value; } int main(...) { page_not_present = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0); page_not_writable = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0); page_not_executable = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0); ... } Here I used a few kernel pointers that I found in /proc/kallsyms on my system. They are not necessarily very good choices, but you get the idea. As I mentioned earlier, there is a balance to be found between picking values that are so crazy that nobody had ever thought of handling them (we are after all trying to find corner cases here), but without getting lost in the huge sea of uninteresting values; we could just uniformly pick any random 64-bit value, but that is exceedingly unlikely to give us any valid pointers at all (most of them will probably be non-canonical addresses). 
Part of the art of fuzzing is smoking out the relevant corner cases by making informed guesses about what will and what won't matter. Now it's just a matter of setting the value, which we can luckily do by loading the 64-bit value directly into %rsp: movq $0x12345678aabbccdd, %rsp In code: uint64_t rsp = get_random_address(); // movq $imm, %rsp *out++ = 0x48; *out++ = 0xbc; for (int i = 0; i < 8; ++i) *out++ = rsp >> (8 * i); However, there is an important interaction with %rflags mentioned above that we need to take care of. The problem is that once we enable single-stepping in %rflags, the CPU will deliver a debug exception on every subsequently executed instruction. The kernel will handle the debug exception by delivering a SIGTRAP signal to the process. By default, this signal is delivered on the stack given by the value of %rsp when the trap is delivered... if %rsp is not valid, the kernel instead kills the process with an uncatcheable SIGSEGV. In order to deal with situations like this, the kernel does offer a mechanism to set %rsp to a known-good value when delivering the signal: sigaltstack(). All we have to do is use it like this: stack_t ss = {}; ss.ss_sp = malloc(SIGSTKSZ); if (!ss.ss_sp) error(EXIT_FAILURE, errno, "malloc()"); ss.ss_size = SIGSTKSZ; ss.ss_flags = 0; if (sigaltstack(&ss, NULL) == -1) error(EXIT_FAILURE, errno, "sigaltstack()"); and then pass SA_ONSTACK in the sa_flags of the sigaction() call for SIGTRAP. Segment registers When it comes to segment registers, you will frequently see the claim that they are not actually used that much on 64-bit anymore. However, that is not the whole truth. It is true that you can't change the base address or segment sizes, but almost everything else is still relevant. In particular, some things that are relevant for us are: %cs, %ds, %es, and %ss must hold valid 16-bit segment selectors referring to valid entries in the GDT (global descriptor table) or the LDT (local descriptor table). %cs cannot be loaded using the mov instruction, but we can use the ljmp (far/long jump) instruction. The CPL (current privilege level) field of %cs is the privilege level that the CPU is executing at. Normally, 64-bit userspace processes run with a %cs of 0x33, which is index 6 of the GDT and privilege level 3, and the kernel runs with a %cs of 0x10, which is index 2 of the GDT and privilege level 0 (hence the term "ring 0"). We can actually install entries in the LDT using the modify_ldt() system call, but note that the kernel does sanitize the entries so that we can't for example create a call gate pointing to a segment with DPL 0. %fs and %gs have base addresses specified by MSRs. These registers are typically used for TLS (Thread-Local Storage) and per-CPU data by userspace processes and the kernel, respectively. We can change the values of these registers using the arch_prctl() system call. On some CPUs/kernels, we can use the wrfsbase and wrgsbase instructions. Using the mov or pop instructions to set %ss causes the CPU to inhibit interrupts, NMIs, breakpoints, and single-stepping traps for one instruction following the mov or pop instruction. If this next instruction causes an entry into the kernel, those interrupts, NMIs, breakpoints, or single-stepping traps will be delivered after the CPU has started executing in kernel space. This was the source of CVE-2018-8897, where the kernel did not properly handle this case. LDT Since we'll potentially load segment registers with segments from the LDT we might as well start by setting up the LDT. 
There is no glibc wrapper for modify_ldt() so we have to call it using the syscall() function: #include <asm/ldt.h> #include <sys/syscall.h> #include <sys/types.h> #include <sys/user.h> for (unsigned int i = 0; i < 4; ++i) { struct user_desc desc = {}; desc.entry_number = i; desc.base_addr = std::uniform_int_distribution<unsigned long>(0, ULONG_MAX)(rnd); desc.limit = std::uniform_int_distribution<unsigned int>(0, UINT_MAX)(rnd); desc.seg_32bit = std::uniform_int_distribution<int>(0, 1)(rnd); desc.contents = std::uniform_int_distribution<int>(0, 3)(rnd); desc.read_exec_only = std::uniform_int_distribution<int>(0, 1)(rnd); desc.limit_in_pages = std::uniform_int_distribution<int>(0, 1)(rnd); desc.seg_not_present = std::uniform_int_distribution<int>(0, 1)(rnd); desc.useable = std::uniform_int_distribution<int>(0, 1)(rnd); syscall(SYS_modify_ldt, 1, &desc, sizeof(desc)); } We may want to check the return value here; we shouldn't be generating invalid LDT entries, so it's useful to know if we ever do. static uint16_t get_random_segment_selector() { unsigned int index; switch (std::uniform_int_distribution<unsigned int>(0, 2)(rnd)) { case 0: // The LDT is small, so favour smaller indices index = std::uniform_int_distribution<unsigned int>(0, 3)(rnd); break; case 1: // Linux defines 32 GDT entries by default index = std::uniform_int_distribution<unsigned int>(0, 31)(rnd); break; case 2: // Max table size index = std::uniform_int_distribution<unsigned int>(0, 255)(rnd); break; } unsigned int ti = std::uniform_int_distribution<unsigned int>(0, 1)(rnd); unsigned int rpl = std::uniform_int_distribution<unsigned int>(0, 3)(rnd); return (index << 3) | (ti << 2) | rpl; } Data segment (%ds) And using it (only showing %ds here): if (std::uniform_int_distribution<unsigned int>(0, 100)(rnd) < 20) { uint16_t sel = get_random_segment_selector(); // movw $imm, %ax *out++ = 0x66; *out++ = 0xb8; *out++ = sel; *out++ = sel >> 8; // movw %ax, %ds *out++ = 0x8e; *out++ = 0xd8; } %fs and %gs For %fs and %gs we need to use the system call arch_prctl(). In normal (non-JIT-assembled) code, this would be: #include <asm/prctl.h> #include <sys/prctl.h> ... syscall(SYS_arch_prctl, ARCH_SET_FS, get_random_address()); syscall(SYS_arch_prctl, ARCH_SET_GS, get_random_address()); Unfortunately, doing this is very likely to cause glibc/libstdc++ to crash on any code that uses thread-local storage (which may happen even as soon as the second get_random_address() call). If we want to generate the system calls to do it, we can do it more easily with a bit of support code: enum machine_register { // 0 RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI, // 8 R8, R9, R10, R11, R12, R13, R14, R15, }; const unsigned int REX = 0x40; const unsigned int REX_B = 0x01; const unsigned int REX_W = 0x08; static uint8_t *emit_mov_imm64_reg(uint8_t *out, uint64_t imm, machine_register reg) { *out++ = REX | REX_W | (REX_B * (reg >= 8)); *out++ = 0xb8 | (reg & 7); for (int i = 0; i < 8; ++i) *out++ = imm >> (8 * i); return out; } static uint8_t *emit_call_arch_prctl(uint8_t *out, int code, unsigned long addr) { // int arch_prctl(int code, unsigned long addr); out = emit_mov_imm64_reg(out, SYS_arch_prctl, RAX); out = emit_mov_imm64_reg(out, code, RDI); out = emit_mov_imm64_reg(out, addr, RSI); // syscall *out++ = 0x0f; *out++ = 0x05; return out; } Note that in addition to needing a few registers to do the system call itself, the syscall instruction also overwrites %rcx with the return address (i.e. 
the address of the instruction after the syscall instruction), so we'll probably want to make these calls before anything else. Stack segment (%ss) %ss should be the last register we set before the instruction that enters the kernel so that we're sure to see the effect of any delayed traps or exceptions. We can use the same code as for %ds above; the reason we don't use popw %ss is because we might have already set %rsp to point to a "weird" location and so the stack is probably not usable at that point. 32-bit compatibility mode (%cs) Fun fact: you can actually change your 64-bit process into a 32-bit process on the fly, no need to even tell the kernel about it. The CPU includes a mechanism for this which is allowed in ring 3: far jumps. In particular, the instruction we'll be using is "jump far, absolute indirect, address given in m16:32". Since this can be a bit tricky to work out the exact syntax and bytes for, I'll give a full assembly example first: .global main main: ljmpl *target 1: .code32 movl $1, %eax # __NR_exit == 1 from asm/unistd_32.h movl $2, %ebx # status == 0 sysenter ret .data target: .long 1b # address (32 bits) .word 0x23 # segment selector (16 bits) Here, the ljmpl instruction uses the memory at our target label, which is a 32-bit instruction pointer followed by a 16-bit segment selector (here pointing to the 32-bit code segment for userspace, 0x23). The target address here, 1b, is not a hexadecimal value, it's actually a reference to the label 1; the b stands for "backwards". The code at this label is 32-bit, which is why we are using sysenter and not syscall that we used before. The calling conventions are also different, and, in fact, we need to use the system call numbers from the 32-bit ABI (SYS_exit is 60 on 64-bit, but 1 here). Another fun thing is that if you try to run this under strace, you will see something like this: [...] write(1, "\366\242[\204\374\177\0\0\0\0\0\0\0\0\0\0\376\242[\204\374\177\0\0\t\243[\204\374\177\0\0"..., 140722529079224 <unfinished ...> +++ exited with 0 +++ strace clearly thought we were still a 64-bit process and thought that we had called write(), when we were really calling exit() (as evidenced by the last line, which plainly told us the process exited). Now that we know what bytes to use, we can port the whole thing to C. Since both the ljmp memory operand and the target address are 32 bits, we need to make sure that they are both located in addresses where the upper 32 bits are all 0. The best way to do this is to allocate memory using mmap() and the MAP_32BIT flag. struct ljmp_target { uint32_t rip; uint16_t cs; } __attribute__((packed)); struct data { struct ljmp_target ljmp; }; static struct data *data; int main(...) { ... void *addr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0); if (addr == MAP_FAILED) error(EXIT_FAILURE, errno, "mmap()"); data = (struct data *) addr; ... } void emit_code() { ... // ljmp *target *out++ = 0xff; *out++ = 0x2c; *out++ = 0x25; for (unsigned int i = 0; i < 4; ++i) *out++ = ((uint64_t) &data->ljmp) >> (8 * i); // cs:rip (jump target; in our case, the next instruction) data->ljmp.cs = 0x23; data->ljmp.rip = (uint64_t) out; ... } A couple of things to note here: This changes the CPU mode, which means subsequent instructions must be valid in 32-bit (otherwise, you may get a general protection fault or invalid opcode exception). The instruction sequence we used to load segment registers above (e.g. 
movw ..., %ax; movw %ax, %ss) has the exact same encoding on 32-bit and 64-bit, so we can execute it after switching to a 32-bit code segment without any trouble — this is particularly useful for ensuring that we can still load %ss just before entering the kernel. We can choose whether to always change to segment 4 (segment selector 0x23), or we can try changing to a random segment selector (e.g. using get_random_segment_selector()). If we select a random one, we might not even know whether we would still be executing in 32-bit or 64-bit mode. We may want to try to jump back to our normal code segment (segment 6, segment selector 0x33) after returning from the kernel, assuming we didn't exit, crash, or get killed. The procedure is exactly the same modulo the different segment selector. Debug registers (%dr0, etc.) Debug registers on x86 are used to set code breakpoints and data watchpoints. The registers %dr0 through %dr3 are used to set the actual breakpoint/watchpoint addresses, and register %dr7 is used to control how those four addresses are used (whether they are breakpoints or watchpoints, etc.). Setting debug registers is a bit trickier than what we've seen so far because you can't load them directly in userspace. Like changing the LDT, the kernel wants to make sure that we don't, for example, set a breakpoint or watchpoint on a kernel address, but even more importantly, the CPU itself doesn't allow ring 3 to modify these registers directly. The only way to set the debug registers that I know of is using ptrace(). ptrace() is a notoriously difficult API to use. There is a lot of implicit state that the tracer needs to track manually and a lot of corner cases around signal handling. Luckily, in this case, we can get by with just attaching to the child process, setting the debug registers, and detaching; the debug register changes will persist even after we stop tracing: #include <sys/ptrace.h> #include <sys/user.h> #include <signal.h> #include <stddef.h> // for offsetof() int main(...) { pid_t child = fork(); if (child == -1) error(EXIT_FAILURE, errno, "fork()"); if (child == 0) { // make us a tracee of the parent if (ptrace(PTRACE_TRACEME, 0, 0, 0) == -1) error(EXIT_FAILURE, errno, "ptrace(PTRACE_TRACEME)"); // give the parent control raise(SIGTRAP); ... exit(EXIT_SUCCESS); } // parent; wait for child to stop while (1) { int status; if (waitpid(child, &status, 0) == -1) { if (errno == EINTR) continue; error(EXIT_FAILURE, errno, "waitpid()"); } if (WIFEXITED(status)) exit(WEXITSTATUS(status)); if (WIFSIGNALED(status)) exit(EXIT_FAILURE); if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP) break; continue; } // set debug registers and stop tracing if (ptrace(PTRACE_POKEUSER, child, offsetof(struct user, u_debugreg[0]), ...) == -1) error(EXIT_FAILURE, errno, "ptrace(PTRACE_POKEUSER)"); if (ptrace(PTRACE_POKEUSER, child, offsetof(struct user, u_debugreg[7]), ...) == -1) error(EXIT_FAILURE, errno, "ptrace(PTRACE_POKEUSER)"); if (ptrace(PTRACE_DETACH, child, 0, 0) == -1) error(EXIT_FAILURE, errno, "ptrace(PTRACE_DETACH)"); ... } Even in this small example, waiting for the child to stop is a bit fiddly. It's always possible that waitpid() returns before the child has reached raise(SIGTRAP), e.g. if it was killed by some external process. We handle those cases by simply exiting as well.
Since setting debug registers requires tracing, signals, and multiple context switches (which are all pretty slow), I would suggest doing this just once for each child process and then letting the child run multiple attempts at entering the kernel in a row. Setting any of the debug registers could fail, so in the actual fuzzer we would probably want to ignore any errors and set %dr7 one breakpoint at a time, e.g. something like: // stddef.h offsetof() doesn't always allow non-const array indices, // so precompute them here. const unsigned int debugreg_offsets[] = { offsetof(struct user, u_debugreg[0]), offsetof(struct user, u_debugreg[1]), offsetof(struct user, u_debugreg[2]), offsetof(struct user, u_debugreg[3]), }; for (unsigned int i = 0; i < 4; ++i) { // try random addresses until we succeed while (true) { unsigned long addr = get_random_address(); if (ptrace(PTRACE_POKEUSER, child, debugreg_offsets[i], addr) != -1) break; } // Condition: // 0 - execution // 1 - write // 2 - (unused) // 3 - read or write unsigned int condition = std::uniform_int_distribution<unsigned int>(0, 2)(rnd); if (condition == 2) condition = 3; // Size // 0 - 1 byte // 1 - 2 bytes // 2 - 8 bytes // 3 - 4 bytes unsigned int size = std::uniform_int_distribution<unsigned int>(0, 3)(rnd); // In %dr7, the local enable bit for breakpoint i is bit 2 * i, and the condition and size fields for breakpoint i sit at bits 16 + 4 * i and 18 + 4 * i. unsigned long dr7 = ptrace(PTRACE_PEEKUSER, child, offsetof(struct user, u_debugreg[7]), 0); dr7 &= ~((1UL << (2 * i)) | (0xfUL << (16 + 4 * i))); dr7 |= (1UL << (2 * i)) | ((unsigned long) condition << (16 + 4 * i)) | ((unsigned long) size << (18 + 4 * i)); ptrace(PTRACE_POKEUSER, child, offsetof(struct user, u_debugreg[7]), dr7); } Entering the kernel We already saw how to emit the code for making a system call in part 1 of this blog series; here, we use the same basic approach, but also take into account all the other ways to enter the kernel. As I mentioned earlier, the syscall instruction is not the only way to enter the kernel on 64-bit; it's not even the only way to make a system call. For system calls, we have the following options: int $0x80 sysenter syscall It can also be useful to look at the table of hardware-generated exceptions. Many of these exceptions are handled slightly differently from system calls and regular interrupts; for example, when you try to load a segment register with an invalid segment selector, the CPU will push an error code onto the (kernel) stack. We can trigger many of the exceptions, but not all of them. For example, it is trivial to generate a divide error by simply carrying out a division by zero, but we can't easily generate an NMI on demand. (That said, there may be things we can do to make NMIs more likely to happen, albeit in an uncontrollable fashion: if we are testing the kernel in a VM, we can inject NMIs from the host, or we can enable the kernel NMI watchdog feature.)
enum entry_type { // system calls + software interrupts ENTRY_SYSCALL, ENTRY_SYSENTER, ENTRY_INT, ENTRY_INT_80, ENTRY_INT3, // exceptions ENTRY_DE, // Divide error ENTRY_OF, // Overflow ENTRY_BR, // Bound range exceeded ENTRY_UD, // Undefined opcode ENTRY_SS, // Stack segment fault ENTRY_GP, // General protection fault ENTRY_PF, // Page fault ENTRY_MF, // x87 floating-point exception ENTRY_AC, // Alignment check NR_ENTRY_TYPES, }; enum entry_type type = (enum entry_type) std::uniform_int_distribution<int>(0, NR_ENTRY_TYPES - 1)(rnd); // Some entry types require a setup/preamble; do that here switch (type) { case ENTRY_DE: // xor %eax, %eax *out++ = 0x31; *out++ = 0xc0; break; case ENTRY_MF: // pxor %xmm0, %xmm0 *out++ = 0x66; *out++ = 0x0f; *out++ = 0xef; *out++ = 0xc0; break; case ENTRY_BR: // xor %eax, %eax *out++ = 0x31; *out++ = 0xc0; break; case ENTRY_SS: { uint16_t sel = get_random_segment_selector(); // movw $imm, %bx *out++ = 0x66; *out++ = 0xbb; *out++ = sel; *out++ = sel >> 8; } break; default: // do nothing break; } ... switch (type) { // system calls + software interrupts case ENTRY_SYSCALL: // syscall *out++ = 0x0f; *out++ = 0x05; break; case ENTRY_SYSENTER: // sysenter *out++ = 0x0f; *out++ = 0x34; break; case ENTRY_INT: // int $x *out++ = 0xcd; *out++ = std::uniform_int_distribution<uint8_t>(0, 255)(rnd); break; case ENTRY_INT_80: // int $0x80 *out++ = 0xcd; *out++ = 0x80; break; case ENTRY_INT3: // int3 *out++ = 0xcc; break; // exceptions case ENTRY_DE: // div %eax *out++ = 0xf7; *out++ = 0xf0; break; case ENTRY_OF: // into (32-bit only!) *out++ = 0xce; break; case ENTRY_BR: // bound %eax, data *out++ = 0x62; *out++ = 0x05; *out++ = 0x09; for (unsigned int i = 0; i < 4; ++i) *out++ = ((uint64_t) &data->bound) >> (8 * i); break; case ENTRY_UD: // ud2 *out++ = 0x0f; *out++ = 0x0b; break; case ENTRY_SS: // Load %ss again, with a random segment selector (this is not // guaranteed to raise #SS, but most likely it will). The reason // we don't just rely on the load above to do it is that it could // be interesting to trigger #SS with a "weird" %ss too. // movw %bx, %ss *out++ = 0x8e; *out++ = 0xd3; break; case ENTRY_GP: // wrmsr *out++ = 0x0f; *out++ = 0x30; break; case ENTRY_PF: // testl %eax, (xxxxxxxx) *out++ = 0x85; *out++ = 0x04; *out++ = 0x25; for (int i = 0; i < 4; ++i) *out++ = ((uint64_t) page_not_present) >> (8 * i); break; case ENTRY_MF: // divss %xmm0, %xmm0 *out++ = 0xf3; *out++ = 0x0f; *out++ = 0x5e; *out++ = 0xc0; break; case ENTRY_AC: // testl %eax, (page_not_writable + 1) *out++ = 0x85; *out++ = 0x04; *out++ = 0x25; for (int i = 0; i < 4; ++i) *out++ = ((uint64_t) page_not_writable + 1) >> (8 * i); break; } Putting it all together We now have almost everything we need to actually start some fuzzing! Just a couple more things, though... If you run the code we have so far, you'll quickly run into some issues. First of all, many of the instructions we've used may cause crashes (and deliberately so), which makes the fuzzer slow. By installing signal handlers for a few common terminating signals (SIGBUS, SIGSEGV, etc.), we can skip over the faulting instruction and (hopefully) continue executing within the same child process. Secondly, some of the system calls we make may have unintended side effects. In particular, we don't really want to block on I/O, since that will effectively stop the fuzzer in its tracks. One way to do this is to install an interval timer alarm to detect when a child process has hung. 
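Here is a minimal parent-side sketch of that idea (my own illustration under stated assumptions, not the fuzzer's actual implementation): alarm() interrupts the waitpid() call after a fixed budget and the parent then kills the child. The helper name wait_for_child_with_timeout() and the one-second budget are made up for the example; setitimer() would work just as well if sub-second timeouts are wanted.

#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void alarm_handler(int sig)
{
    // Deliberately empty: we only need waitpid() below to be
    // interrupted (EINTR) when the timer fires.
    (void) sig;
}

// Wait for the child, but give up (and kill it) after `seconds` seconds.
static void wait_for_child_with_timeout(pid_t child, unsigned int seconds)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = alarm_handler;
    // No SA_RESTART, so waitpid() really does return with EINTR.
    sigaction(SIGALRM, &sa, NULL);

    alarm(seconds);

    int status;
    if (waitpid(child, &status, 0) == -1) {
        // The alarm fired (or some other signal arrived); assume the
        // child is stuck in a blocking system call and get rid of it.
        kill(child, SIGKILL);
        waitpid(child, &status, 0);
    }

    alarm(0);
}

In the fork loop we already have, the parent would then call something like wait_for_child_with_timeout(child, 1) instead of looping around a plain waitpid().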
Another way could be to filter out certain system calls which are known to block (e.g. read(), select(), sleep(), etc.). Other "unfortunate" system calls might be fork(), exit(), and kill(). It's less likely that the fuzzer is able to delete files or otherwise mess up the system, but we might want to use some form of sandboxing (e.g. setuid(65534)). If you just want to see the final result, here is a link to the code (made available under The Universal Permissive License): https://github.com/oracle/linux-blog-sample-code/tree/fuzzing-the-linux-kernel-x86-entry-code To be continued... Be sure to check out part 3, where we discuss further improvements and ideas for the fuzzer. Ksplice — also, we're hiring! Ksplice is Oracle's technology for patching security vulnerabilities in the Linux kernel without rebooting. Ksplice supports patching entry code, and we have shipped several updates that do exactly this, including workarounds for many of the CPU vulnerabilities that were discovered in recent years: CVE-2014-9090: Privilege escalation in double-fault handling on bad stack segment. CVE-2014-9322: Denial-of-service in double-fault handling on bad stack segment. (BadIRET) CVE-2015-2830: mis-handling of int80 fork from 64bits application. CVE-2015-3290, CVE-2015-3291, CVE-2015-5157: Multiple privilege escalation in NMI handling. CVE-2017-5715: Spectre v2 CVE-2018-14678: Privilege escalation in Xen PV guests. CVE-2018-8897: Denial-of-service in KVM breakpoint handling. (MovSS) CVE-2019-1125: Information leak in kernel entry code when swapping GS. CVE-2019-11091, CVE-2018-12126, CVE-2018-12130, CVE-2018-12127: Microarchitectural Data Sampling. CVE-2019-11135: Side-channel information leak in Intel TSX. Information leak in compatibility syscalls. x86/asm/entry/32: Simplify pushes of zeroed pt_regs->REGs x86/entry/64/compat: Clear registers for compat syscalls, to reduce speculation attack surface SMAP bypass in NMI handler. x86/asm/64: Clear AC on NMI entries Some of these updates were pretty challenging for various reasons and required ingenuity and a lot of attention to detail. Part of the reason we decided to write a fuzzer for the entry code was so that we could test our updates more effectively. If you've enjoyed this blog post and you think you would enjoy working on these kinds of problems, feel free to drop us a line at ksplice-support_ww@oracle.com. We are a diverse, fully remote team, spanning 3 continents. We look at a ton of Linux kernel patches and ship updates for 5-6 different distributions, totalling more than 1,100 unique vulnerabilities in a year. Of course, nobody can ever hope to be familiar with every corner of the kernel (and vulnerabilities can appear anywhere), so patch- and source-code comprehension are essential skills. We also patch important userspace libraries like glibc and OpenSSL, which enables us to update programs using those libraries without restarting anything and without requiring any special support in those applications themselves. Other projects we've worked on include Known Exploit Detection or porting Ksplice to new architectures like ARM.

Oracle Linux kernel and ksplice engineer Vegard Nossum delves further into kernel fuzzing in this second of a three part series of blogs.   In part 1 of this series we looked at what the Linux kernel...

Linux

Fuzzing the Linux kernel (x86) entry code, Part 1 of 3

Oracle Linux kernel and ksplice engineer Vegard Nossum provides some great insight into kernel fuzzing in this first of a three part series of blogs.   Introduction If you've been following Linux kernel development or system call fuzzing at all, then you probably know about trinity and syzkaller. These programs have uncovered many kernel bugs over the years. The way they work is basically throwing random system calls at the kernel in the hope that some of them will crash the kernel or otherwise trigger a detectable fault (e.g. a buffer overflow) in the kernel code. While these fuzzers effectively test the system calls themselves (and the code reachable through system calls), one thing they don't test very well is what happens at the actual transition point between userspace and the kernel. There is more to this boundary than meets the eye; it is written in assembly code and there is a lot of architectural state (CPU state) that must be verified or sanitized before the kernel can safely start executing its C code. This blog post explores how one might go about writing a fuzzer targeting the Linux kernel entry code on x86. Before continuing, you may want to have a brief look at the main two files involved on 64-bit kernels: entry_64.S: entry code for 64-bit processes entry_64_compat.S: entry code for 32-bit processes In all, the entry code is around 1,700 lines of assembly code (including comments), so it's not exactly trivial, but at the same time only a very tiny part of the whole kernel. A memset() example To start off, I would like to give a concrete example of CPU state that the kernel needs to verify when entering the kernel from userspace. On x86, memset() is typically implemented using the rep stos instruction, since it is highly optimized by the CPU/microcode for writing to a contiguous range of bytes. Conceptually, this is a hardware loop that repeats (rep) a store (stos) a number of times; the target address is given by the %rdi register and the number of iterations is given by the %rcx register. You could for example implement memset() using inline assembly like this: static inline void memset(void *dest, int value, size_t count) { asm volatile ("rep stosb" // 4 : "+D" (dest), "+c" (count) // 1, 2 : "a" (value) // 3 : "cc", "memory"); // 5 } If you are unfamiliar with inline assembly, this tells GCC to: place the dest variable in the %rdi register (the + means the value might be modified by the inline assembly), place count in the %rcx register, place value in the %eax register (whether we place it in %rax, %eax, %ax, or %al is irrelevant, since rep stosb only uses the lower byte, which corresponds to the value in %al), insert rep stosb into the assembly code, reload any values which might depend on condition codes ("cc" AKA %rflags on x86) or memory. For reference, you can also have a look at the current mainline implementation of memset() on x86. Here is the kicker: The %rflags register contains a rarely used bit called DF (the "Direction flag"). This bit controls whether rep stos increments or decrements %rdi after writing each byte; when DF is set to 0, the affected memory range is %rdi..(%rdi + %rcx), whereas when DF is set to 1, the affected memory range is (%rdi - %rcx)..%rdi! This has pretty big implications for what memset() ends up doing, so we had better make sure that DF is always set to 0. 
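To make the effect of DF concrete, here is a small stand-alone demonstration (my own illustration, not code from the kernel or from this series' fuzzer) that reuses the same rep stosb sequence but deliberately runs it with DF set to 1; the bytes behind the destination pointer get overwritten instead of the ones after it. The trailing cld restores the flag so the rest of the program (and libc) keeps working.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[16];
    memset(buf, 'A', sizeof(buf));

    char *dest = &buf[8];
    size_t count = 4;
    int value = 'B';

    // Same "rep stosb" as above, but with DF = 1: the CPU now decrements
    // %rdi after each store, so the bytes written are buf[8], buf[7],
    // buf[6], buf[5] -- i.e. the range (dest - count, dest].
    asm volatile ("std; rep stosb; cld"
                  : "+D" (dest), "+c" (count)
                  : "a" (value)
                  : "cc", "memory");

    fwrite(buf, 1, sizeof(buf), stdout);
    putchar('\n');   // prints AAAAABBBBAAAAAAA
    return 0;
}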
The x86_64 SysV ABI (which is more or less what the kernel uses) actually mandates that the DF bit must always be 0 on function entry and return (page 15): The direction flag DF in the %rFLAGS register must be cleared (set to “forward” direction) on function entry and return. Other user flags have no specified role in the standard calling sequence and are not preserved across calls. This is a convention that the kernel actually relies heavily on internally; if the DF flag was somehow set to 1 when calling memset() it would overwrite a completely wrong part of memory. Consequently, one of the jobs of the kernel's entry code is to make sure that the DF flag is always set to 0 before we enter any of the kernel's C code. We can do this with a single instruction: cld ("clear direction flag"), and this is indeed what the kernel does in many of its entry paths, see e.g. paranoid_entry() or error_entry(). The fuzzer Now that we understand how important even a single bit of CPU state can affect the kernel, let's try to enumerate all the various CPU state variables that the entry code needs to handle: flags register (%rflags) stack pointer (%rsp) segment registers (%cs, %fs, %gs) debug registers (%dr0 through %dr3, %dr7) Something we've skirted around until now is that there are many different ways to enter the kernel from userspace, not just system calls (and not just one mechanism for system calls). Let's try to enumerate those too: int instruction sysenter instruction syscall instruction int3/into/int1 instructions division by zero debug exception breakpoint exception overflow exception invalid opcode general protection fault page fault floating-point exception external hardware interrupts non-maskable interrupts The goal of the fuzzer should be to test all possible combinations of CPU states and userspace/kernel transitions. Ideally, we would do an exhaustive search, but if you consider all possible combinations of register values and methods of entry the search space is simply too large. We will consider two main strategies to improve our chances of finding a bug: Focus on those values/cases that we suspect are more likely to cause interesting/unusual things to happen. This typically comes down to looking at x86 documentation (Wikipedia, Intel manuals, etc.) and the entry code itself. For example, the entry code documents several cases of processor errata which we can use directly to hit known edge cases. Collapse classes of values that we think will not make a difference. For example, when picking random values to load a register with, it is much more important to try different classes of pointers (e.g. kernel, userspace, non-canonical, mapped, non-mapped, etc.) than trying all possible values. It's also worth mentioning that the kernel already has an excellent regression test suite for x86-specific code under tools/testing/selftests/x86/, developed mainly by Andy Lutomirski. It contains test cases for various methods of entering/leaving the kernel and can be a useful source of inspiration. High-level architecture Our fuzzer will be a userspace program run by the kernel we are fuzzing. Since we need very precise control over some of the instructions used to trigger a transition to the kernel we will actually not write this code directly in C; instead, we will dynamically generate x86 machine code at runtime and then execute it. 
For simplicity, and in order to avoid having to recover to a clean state after setting up the desired CPU state (if that is even possible), we will execute the generated machine code in a child process which can be thrown away after the entry attempt. We start with a basic fork loop: #include <sys/mman.h> #include <sys/wait.h> #include <error.h> #include <errno.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> static void *mem; static void emit_code(); typedef void (*generated_code_fn)(void); int main(int argc, char *argv[]) { mem = mmap(NULL, PAGE_SIZE, // prot PROT_READ | PROT_WRITE | PROT_EXEC, // flags MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, // fd, offset -1, 0); if (mem == MAP_FAILED) error(EXIT_FAILURE, errno, "mmap()"); while (1) { emit_code(); pid_t child = fork(); if (child == -1) error(EXIT_FAILURE, errno, "fork()"); if (child == 0) { // we're the child; call our newly generated function ((generated_code_fn) mem)(); exit(EXIT_SUCCESS); } // we're the parent; wait for the child to exit while (1) { int status; if (waitpid(child, &status, 0) == -1) { if (errno == EINTR) continue; error(EXIT_FAILURE, errno, "waitpid()"); } break; } } return 0; } We may then also implement a very basic emit_code(), which so far just creates a function containing just a single retq instruction: static void emit_code() { uint8_t *out = (uint8_t *) mem; // retq *out++ = 0xc3; } If you read the code carefully, you might wonder why we're creating the mapping using the MAP_32BIT flag. This is because we'll want the fuzzer to be able to enter the kernel while executing in 32-bit compatibility mode, and for that we need to be executing at a valid 32-bit address. Making a system call System calls are a bit of a messy affair on x86. First of all, there is the fact that system calls first evolved on 32-bit, where they started out using the relatively slow int instruction. Then Intel and AMD developed their own separate mechanisms for fast system calls (using brand new mutually-incompatible instructions, sysenter and syscall, respectively). And to make matters worse, 64-bit needs to handle both 32-bit processes (using any of the 32-bit system call mechanisms), 64-bit processes, and (potentially) a third mode of operation referred to as x32, where code is 64-bit like usual (and has access to 64-bit registers), but pointers are 32 bits to save memory. As they vary in exactly what CPU state is saved/modified when entering kernel mode, most of these different system call mechanisms take slightly different paths through the kernel's entry code. This is also one of the reasons why the entry code can be quite difficult to follow! For a more in-depth guide to system calls on x86, see this excellent (if a bit dated by now) series over at LWN: Anatomy of a system call, part 1 Anatomy of a system call, part 2 A good way to get started with system calls by hand is to get comfortable using the GNU assembler to prototype assembly snippets that we can then use in the fuzzer. For example, we can make a single read(STDIN_FILENO, NULL, 0) call to the kernel like this (save as e.g. read.S): .text .global main main: movl $0, %eax # SYS_read/__NR_read movl $0, %edi # fd = STDIN_FILENO movl $0, %esi # buf = NULL movl $0, %edx # count = 0 syscall movl $0, %eax retq As you can see from this snippet, when using the syscall instruction the system call number itself is passed in %rax and arguments are passed in %rdi, %rsi, %rdx, etc. 
The Linux x86 syscall ABI is as far as I know documented "officially" in entry_SYSCALL_64() in the entry code itself (We use %eXX instead of %rXX here since the machine code is slightly shorter; setting %eXX to 0 will also clear the upper 32 bits of %rXX). We can build this with gcc read.S and check that it does the right thing using strace: $ strace ./a.out execve("./a.out", ["./a.out"], [/* 53 vars */]) = 0 [...] read(0, NULL, 0) = 0 exit_group(0) = ? +++ exited with 0 +++ To get the bytes of machine code after assembling, we can compile with gcc -c read.S and use objdump -d read.o: 0000000000000000 <main>: 0: b8 00 00 00 00 mov $0x0,%eax 5: bf 00 00 00 00 mov $0x0,%edi a: be 00 00 00 00 mov $0x0,%esi f: ba 00 00 00 00 mov $0x0,%edx 14: 0f 05 syscall 16: b8 00 00 00 00 mov $0x0,%eax 1b: c3 retq To add this sequence to our JIT-assembled function, we can use code such as this: // mov $0, %eax *out++ = 0xb8; *out++ = 0x00; *out++ = 0x00; *out++ = 0x00; *out++ = 0x00; [...] // syscall *out++ = 0x0f; *out++ = 0x05; Revisiting memset() and the Direction flag We now have most of the pieces we need to write a test for our memset() example above. To set DF, we can use the std instruction ("set direction flag") before making our system call: // std *out++ = 0xfd; Since we're writing a fuzzer we probably want to actually randomize the value of the flag. If we're using C++ we can initialize a PRNG with this code: #include <random> static std::default_random_engine rnd; int main(...) { std::random_device rdev; rnd = std::default_random_engine(rdev()); ... } and then we can set (or clear) the flag before making the system call using something like this: switch (std::uniform_int_distribution<int>(0, 1)(rnd)) { case 0: // cld *out++ = 0xfc; break; case 1: // std *out++ = 0xfd; break; } (Again, these bytes came from assembling a short test program and then looking at the objdump output.) Note: We need to be careful about generating random numbers in a child process; we don't want all the children to generate the same thing! That's why we're actually generating the code in the parent and simply executing it in the child process. To be continued... Be sure to check out part 2, where we dig into the stack pointer, segment registers (including 32-bit compatibility mode), debug registers, and actually entering the kernel! Ksplice — also, we're hiring! Ksplice is Oracle's technology for patching security vulnerabilities in the Linux kernel without rebooting. Ksplice supports patching entry code, and we have shipped several updates that do exactly this, including workarounds for many of the CPU vulnerabilities that were discovered in recent years: CVE-2014-9090: Privilege escalation in double-fault handling on bad stack segment. CVE-2014-9322: Denial-of-service in double-fault handling on bad stack segment. (BadIRET) CVE-2015-2830: mis-handling of int80 fork from 64bits application. CVE-2015-3290, CVE-2015-3291, CVE-2015-5157: Multiple privilege escalation in NMI handling. CVE-2017-5715: Spectre v2 CVE-2018-14678: Privilege escalation in Xen PV guests. CVE-2018-8897: Denial-of-service in KVM breakpoint handling. (MovSS) CVE-2019-1125: Information leak in kernel entry code when swapping GS. CVE-2019-11091, CVE-2018-12126, CVE-2018-12130, CVE-2018-12127: Microarchitectural Data Sampling. CVE-2019-11135: Side-channel information leak in Intel TSX. Information leak in compatibility syscalls. 
x86/asm/entry/32: Simplify pushes of zeroed pt_regs->REGs x86/entry/64/compat: Clear registers for compat syscalls, to reduce speculation attack surface SMAP bypass in NMI handler. x86/asm/64: Clear AC on NMI entries Some of these updates were pretty challenging for various reasons and required ingenuity and a lot of attention to detail. Part of the reason we decided to write a fuzzer for the entry code was so that we could test our updates more effectively. If you've enjoyed this blog post and you think you would enjoy working on these kinds of problems, feel free to drop us a line at ksplice-support_ww@oracle.com. We are a diverse, fully remote team, spanning 3 continents. We look at a ton of Linux kernel patches and ship updates for 5-6 different distributions, totalling more than 1,100 unique vulnerabilities in a year. Of course, nobody can ever hope to be familiar with every corner of the kernel (and vulnerabilities can appear anywhere), so patch- and source-code comprehension are essential skills. We also patch important userspace libraries like glibc and OpenSSL, which enables us to update programs using those libraries without restarting anything and without requiring any special support in those applications themselves. Other projects we've worked on include Known Exploit Detection or porting Ksplice to new architectures like ARM.

Oracle Linux kernel and ksplice engineer Vegard Nossum provides some great insight into kernel fuzzing in this first of a three part series of blogs.   Introduction If you've been following Linux kernel...

Announcements

Announcing the release of Oracle Linux 7 Update 9 Beta

Oracle is pleased to announce the availability of the Oracle Linux 7 Update 9 Beta Release for the 64-bit Intel and AMD (x86_64) and 64-bit Arm (aarch64) platforms. Oracle Linux 7 Update 9 Beta is an update release that includes bug fixes, security fixes, and enhancements. The beta release allows Oracle partners and customers to test these capabilities before Oracle Linux 7 Update 9 becomes generally available. It is 100% application binary compatible with Red Hat Enterprise Linux 7 Update 9 Beta. Updates include: An improved SCAP security guide for Oracle Linux 7 Updated device drivers for both UEK and the Red Hat Compatible Kernel The Wayland display server protocol is now available as a technology preview An updated virt-v2v release that now supports Ubuntu and Debian conversion from VMware to Oracle Linux KVM The Oracle Linux 7 Update 9 Beta Release includes the following kernel packages: kernel-uek-5.4.17-2011.4.4 for x86_64 and aarch64 platforms - The Unbreakable Enterprise Kernel Release 6, which is the default kernel. kernel-3.10.0-1136 for x86_64 platform - The latest Red Hat Compatible Kernel (RHCK). To get started with Oracle Linux 7 Update 9 Beta Release, you can simply perform a fresh installation by using the ISO images available for download from Oracle Technology Network. Or, you can perform an upgrade from an existing Oracle Linux 7 installation by using the Beta channels for Oracle Linux 7 Update 9 Beta on the Oracle Linux yum server or the Unbreakable Linux Network (ULN). # vi /etc/yum.repos.d/oracle-linux-ol7.repo [ol7_beta] name=Oracle Linux $releasever Update 9 Beta ($basearch) baseurl=https://yum$ociregion.oracle.com/repo/OracleLinux/OL7/beta/$basearch/ gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle gpgcheck=1 enabled=1 [ol7_optional_beta] name=Oracle Linux $releasever Update 9 Beta ($basearch) Optional baseurl=https://yum$ociregion.oracle.com/repo/OracleLinux/OL7/optional/beta/$basearch/ gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle gpgcheck=1 enabled=1 If your instance is running on Oracle Cloud Infrastructure (OCI), the value "$ociregion" will be automatically set so that OCI yum mirrors are used. Modify the yum channel setting and enable the Oracle Linux 7 Update 9 Beta channels. You can then perform the upgrade. # yum update After the upgrade is completed, reboot the system and you will have Oracle Linux 7 Update 9 Beta running. [root@v2v-app: ~]# cat /etc/oracle-release Oracle Linux Server release 7.9 This release is provided for development and test purposes only and is not covered by Oracle Linux support; Beta releases cannot be used in production and no support will be provided to any customers running beta in production environments. Further technical details and known issues for Oracle Linux 7 Update 9 Beta Release are available on the Oracle Community - Oracle Linux and UEK Preview space. The Oracle Linux team welcomes your questions and feedback on the Oracle Linux 7 Update 9 Beta Release. You may contact the Oracle Linux team at oraclelinux-info_ww_grp@oracle.com or post your questions and comments on the Oracle Linux and UEK Preview Space on the Oracle Community.

Oracle is pleased to announce the availability of the Oracle Linux 7 Update 9 Beta Release for the 64-bit Intel and AMD (x86_64) and 64-bit Arm (aarch64) platforms. Oracle Linux 7 Update 9 Beta is an...

Announcements

Announcing the release of Spacewalk 2.10 for Oracle Linux

Oracle is pleased to announce the release of Spacewalk 2.10 Server for Oracle Linux 7 along with updated Spacewalk 2.10 Client for Oracle Linux 7 and Oracle Linux 8. Client support is also provided for Oracle Linux 6 and Oracle Linux 5 (for extended support customers only). In addition to numerous fixes and other small enhancements, the Spacewalk 2.10 release includes the following significant features: Spacewalk can now sync and distribute Oracle Linux 8 content, including support for mirroring a repository that contains module metadata. The module metadata can then be made available to downstream clients. Python 2 packages are no longer required on systems that have Python 3 as the default. It is now possible to manage errata severity via the Spacewalk server. The dwr package has been updated to version 3.0.2 to fix security vulnerabilities. Updated API calls: errata.create/setDetails: provides the capability for managing severities. system.schedulePackageRemoveByNevra: supports the removal of packages that are not in the database. For more details on this release, including additional new features and changes, please consult the Spacewalk 2.10 Release Notes. Limited support for Oracle Linux 8 clients Spacewalk 2.10 Server can mirror a repository that contains module and AppStream metadata and make that metadata available to downstream clients. This feature is sufficient to support an Oracle Linux 8 client when using the dnf tool. However, the Spacewalk 2.10 web interface and API are not AppStream or module aware and therefore have limited support for managing Oracle Linux 8 clients. Please review section 1.4 of the Spacewalk 2.10 Release Notes for a comparison of the Spacewalk functionality that is available to each Oracle Linux client version.

Oracle is pleased to announce the release of Spacewalk 2.10 Server for Oracle Linux 7 along with updated Spacewalk 2.10 Client for Oracle Linux 7 and Oracle Linux 8. Client support is also provided...

Announcements

Shoe Carnival Increases Security and Availability with Oracle Ksplice

In this article, we will discuss how Shoe Carnival increased their IT systems security and availability using Oracle Ksplice. Shoe Carnival, Inc. is one of the nation’s largest family footwear retailers, offering a broad assortment of moderately priced dress, casual and athletic footwear for men, women and children with emphasis on national name brands. The company operates 390 stores in 35 states and Puerto Rico, and offers online shopping. In keeping with the carnival spirit of rewarding surprises, Shoe Carnival offers their customers chances to win various coupons and discounts. Customers can spontaneously win while spinning the carnival wheel in the store or redeeming a promotional offer. These specials encourage customers to make a purchase. Customers are also eligible to earn loyalty rewards via a “Shoe Perks” membership. This loyalty program allows them to earn points with each purchase and receive exclusive offers. Members can redeem points and awards either when in store or shopping online. Shoe Carnival is focused on customer service and giving its clients a positive experience. To this end, its technology infrastructure plays a very important role in supporting customer-facing operations. In each of its 395 stores, Shoe Carnival has 2 servers supporting the register systems. These systems are critical for helping to ensure business runs smoothly. A store clerk uses this system to look up a customer’s loyalty account, apply appropriate discounts, and ultimately complete sales. When the system is down, it can impede Shoe Carnival’s ability to provide a high-quality customer experience. Previously, Shoe Carnival was running its systems on Red Hat. They switched to Oracle Linux for several reasons, including increased availability and security, improved support, and lower overall costs. Security is top of mind for retailers handling customer information. In particular, there are many compliance and regulatory mandates for handling a customer’s personal and financial payment information. To thwart and protect against cyber security threats, the Shoe Carnival IT team needs to regularly patch and update its Linux operating systems (OS) with the latest fixes. Oracle Ksplice allows them to do automated live patching without any downtime. Before, it was a struggle to update all servers in a timely fashion and avoid service disruptions. Today, Ksplice has enabled Shoe Carnival to reduce planned downtime by more than 20%. The automated features have also saved up to 35% of administrator time per system. Lastly, being a long-standing Oracle Database and Oracle Exadata customer, Shoe Carnival found value in using the same support vendor for its OS. Using Oracle Linux Premier Support has yielded 50% faster support ticket resolution. We are proud to help customers like Shoe Carnival increase IT systems security and availability, enabling it to deliver an improved customer experience. Watch this video to learn more!

In this article, we will discuss how Shoe Carnival increased their IT systems security and availability using Oracle Ksplice. Shoe Carnival, Inc. is one of the nation’s largest family...

Linux

Pella Optimizes IT Infrastructure and Reduces License Costs With Oracle Linux and Virtualization

In this article, we will discuss how Pella transformed their IT infrastructure with a newly virtualized environment. The Pella Corporation is a privately held window and door manufacturing company headquartered in Pella, Iowa. They have manufacturing and sales operations in a number of locations in the United States. Pella Corporation employs more than 8,000 people with 17 manufacturing sites and 200 showrooms throughout the United States and select regions of Canada. Pella’s continuous business growth has proved to be a big challenge for the IT department. As the company’s needs increased, its older infrastructure, which was based on Unix physical servers, struggled to keep pace. Pella needed a more flexible platform that would allow them to easily build out capacity and improve functionality. This provided a unique opportunity for the IT team. The team wanted a reliable infrastructure that could support the current capacity and easily expand to accommodate growth while keeping costs to a minimum. For these reasons, the IT team decided to move to a virtualized x86-server environment. As a long-time Oracle customer, Pella was already using Oracle applications and Oracle Database. Therefore, Pella was inclined to evaluate Oracle’s Virtualization and Linux solutions to facilitate their IT transformation. Oracle Linux was an obvious choice for Pella, primarily because it is optimized for existing Oracle workloads. They also decided to virtualize their environment with Oracle VM, mainly for the license structure advantages. With Oracle VM, Pella is able to pin CPUs to specific VMs, which in turn translated to saving on licensing costs for Oracle applications. Today, Pella’s IT department uses Oracle Linux in 95% of their Linux environment and Oracle VM to run all of their Linux VMs. This combination has proven to be very advantageous for multiple reasons. First, Pella has significantly reduced their IT costs, including a savings of 75% on licensing costs by switching to a virtualized environment. It also saved nearly 5x on what would have been spent on purchasing physical hardware. Second, Pella saved on CPU utilization. In its previous Unix physical environment, the utilization was 50%; now it’s down to 5%. Third, Pella has simplified its operations and streamlined their support ticket process. Because they run a large number of Oracle workloads, the team has been able to use one portal to share tickets between the DBA, Linux, and applications teams. Having a single vendor has improved their support experience and ensured their mission-critical applications are running at their best. With its new IT platform based on Oracle Linux and Oracle VM, Pella now has the ability to scale up and out as needed. It also has a reliable platform to support its manufacturing. We are proud to help customers like Pella transform their IT landscapes! Watch this video to learn more.

In this article, we will discuss how Pella transformed their IT infrastructure with a newly virtualized environment. The Pella Corporation is a privately held window and door manufacturing company...

Linux

Getting Started With The Oracle Cloud Infrastructure Python SDK

In a recent blog post I illustrated how to use the OCI Command Line Interface (CLI) in shell scripts. While the OCI CLI is comprehensive and powerful, it may not be the best solution when you need to handle a lot of data in shell scripts. In such cases using a programming language such as Python and the Oracle Cloud Infrastructure Python SDK makes more sense. Data manipulation is much easier, and the API is —as expected— more complex. In an attempt to demystify the use of the OCI Python SDK, I have re-written and improved the sample oci-provision.sh shell script in Python. This sample project is named oci-compute and is published on GitHub. This blog post highlights the key concepts of the OCI Python SDK, and together with the oci-compute sample code it should help you to get started easily. About oci-compute The oci-compute tool does everything oci-provision.sh does; better, faster and with some additional capabilities: List available Platform, Custom and Marketplace images Create Compute Instances from a Platform, Custom or Marketplace image A cloud-init file can be specified to run custom scripts during instance configuration List, start, stop and terminate Compute Instances Command line syntax and parameters naming are similar to the OCI CLI tool. See the project README for more information on usage and configuration. I am using this tool on a daily basis to easily manage OCI Compute instances from the command line. OCI Python SDK installation At the time of this writing, the SDK supports Python version 3.5 or 3.6 and can be easily installed using pip, preferably in a Python virtual environment. Installation and required dependencies are described in detail in the documentation. oci-compute installation The oci-compute utility is distributed as a Python package. The setup.py file lists the SDK as dependency; installing the tool will automatically pull the SDK if not already installed. See the README file for detailed installation steps, but in short it is as simple as creating a virtual environment and running: $ pip3 install . The package is split in two main parts: cli.py: handles the command line parsing using the Click package. It defines all the commands, sub-commands and their parameters; instantiate the OciCompute class and invoke its methods. oci_compute.py: defines the OciCompute class which interacts with the OCI SDK. This is the most interesting part of this project. OCI SDK Key concepts This section describes the key concepts used by the OCI SDK. Configuration The first step for using the OCI SDK is to create a configuration dictionary (Python dict). While you can build it manually, you will typically use the oci.config.from_file API call to load it from a configuration file. The default configuration file is ~/.oci/config. It is worth noticing that the OCI CLI uses the same configuration file and provides a command to create it: $ oci setup config For oci-compute, the configuration file is loaded during the class initialization: self._config = oci.config.from_file(config_file, profile) API Service Clients The OCI API is organized in Services, and for each Service you will have to instantiate a Service Client. For example, our oci-compute package uses the following Services: Compute Service (part of Core Services): to manage the Compute Services (provision and manage compute hosts). Virtual Network Service (part of Core Services): to manage the Networking Components (virtual cloud network, Subnet, …) Identity Service: to manage users, groups, compartments, and policies. 
Marketplace Service: to manage applications in Oracle Cloud Infrastructure Marketplace. We instantiate the Service Clients in the class initialization: # Instantiate clients self._compute_client = oci.core.ComputeClient(self._config) self._identity_client = oci.identity.IdentityClient(self._config) self._virtual_network_client = oci.core.VirtualNetworkClient(self._config) self._marketplace_client = oci.marketplace.MarketplaceClient(self._config) Models Models allow you to create the objects needed by the API calls. Example: to use an image from the Marketplace, we need to subscribe to the Application Catalog. This is done with the ComputeClient create_app_catalog_subscription method. This method needs a CreateAppCatalogSubscriptionDetails object as a parameter. We will use the corresponding model to create such an object: oci.core.models.CreateAppCatalogSubscriptionDetails. In oci-compute: app_catalog_subscription_detail = oci.core.models.CreateAppCatalogSubscriptionDetails( compartment_id=compartment_id, listing_id=app_catalog_listing_agreements.listing_id, listing_resource_version=app_catalog_listing_agreements.listing_resource_version, oracle_terms_of_use_link=app_catalog_listing_agreements.oracle_terms_of_use_link, eula_link=app_catalog_listing_agreements.eula_link, signature=app_catalog_listing_agreements.signature, time_retrieved=app_catalog_listing_agreements.time_retrieved ) self._compute_client.create_app_catalog_subscription(app_catalog_subscription_detail).data Pagination All list operations are paginated; that is: they will return a single page of data and you will need to call the method again to get additional pages. The pagination module allows you, amongst other things, to retrieve all data in a single API call. Example: to list the available images in a compartment we could do: response = self._compute_client.list_images(compartment_id) which will only return the first page of data. To get all images at once we will do instead: response = oci.pagination.list_call_get_all_results(self._compute_client.list_images, compartment_id) The first parameter to list_call_get_all_results is the paginated list method; subsequent parameters are those of the list method itself. Waiters and Composite operations To wait for an operation to complete (e.g.: wait until an instance is started), you can use the wait_until function. Alternatively, there are convenience classes in the SDK which will perform an action on a resource and wait for it to enter a particular state: the CompositeOperation classes. Example: start an instance and wait until it is started. The following code snippet shows how to start an instance and wait until it is up and running: compute_client_composite_operations = oci.core.ComputeClientCompositeOperations(self._compute_client) compute_client_composite_operations.instance_action_and_wait_for_state( instance_id=instance_id, action='START', wait_for_states=[oci.core.models.Instance.LIFECYCLE_STATE_RUNNING]) Error handling A complete list of exceptions raised by the SDK is available in the exception handling section of the documentation. In short, if your API calls are valid (correct parameters, …) the main exception you should care about is ServiceError, which is raised when a service returns an error response; that is: a non-2xx HTTP status. For the sake of simplicity and clarity in the sample code, oci-compute does not capture most exceptions. Service Errors will result in a Python stack traceback.
A simple piece of code where we have to consider the Service Error exception is illustrated here: for vnic_attachment in vnic_attachments: try: vnic = self._virtual_network_client.get_vnic(vnic_attachment.vnic_id).data except oci.exceptions.ServiceError: vnic = None if vnic and vnic.is_primary: break Putting it all together The oci-compute sample code should be self explanatory, but let’s walk through what happens when e.g. oci-compute provision platform --operating-system "Oracle Linux" --operating-system-version 7.8 --display-name ol78 is invoked. First of all, the CLI parser will instantiate an OciCompute object. This is done once at the top level, for any oci-compute command: ctx.obj['oci'] = OciCompute(config_file=config_file, profile=profile, verbose=verbose The OciCompute class initialization will: Load the OCI configuration from file Instantiate the Service Clients The Click package will then invoke provision_platform function which in turn will call the OciCompute.provision_platform method. We use the oci.core.ComputeClient.list_images to retrieve the most recent Platform Image matching the given Operating System and its version: images = self._compute_client.list_images( compartment_id, operating_system=operating_system, operating_system_version=operating_system_version, shape=shape, sort_by='TIMECREATED', sort_order='DESC').data if not images: self._echo_error("No image found") return None image = images[0] We then call OciCompute._provision_image for the actual provisioning. This method uses all of the key concepts explained earlier. Pagination is used to retrieve the Availability Domains using the Identity Client list_availability_domains method: availability_domains = oci.pagination.list_call_get_all_results( self._identity_client.list_availability_domains, compartment_id ).data VCN and subnet are retrieved using the Virtual Network Client (list_vcns and list_subnets methods) Metadata is populated with the SSH public key and a cloud-init file if provided: # Metadata with the ssh keys and the cloud-init file metadata = {} with open(ssh_authorized_keys_file) as ssh_authorized_keys: metadata['ssh_authorized_keys'] = ssh_authorized_keys.read() if cloud_init_file: metadata['user_data'] = oci.util.file_content_as_launch_instance_user_data(cloud_init_file) Models are used to create an instance launch details (oci.core.models.InstanceSourceViaImageDetails, oci.core.models.CreateVnicDetails and oci.core.models.LaunchInstanceDetails methods): instance_source_via_image_details = oci.core.models.InstanceSourceViaImageDetails(image_id=image.id) create_vnic_details = oci.core.models.CreateVnicDetails(subnet_id=subnet.id) launch_instance_details = oci.core.models.LaunchInstanceDetails( display_name=display_name, compartment_id=compartment_id, availability_domain=availability_domain.name, shape=shape, metadata=metadata, source_details=instance_source_via_image_details, create_vnic_details=create_vnic_details) Last step is to use the launch_instance_and_wait_for_state Composite Operation to actually provision the instance and wait until it is available: compute_client_composite_operations = oci.core.ComputeClientCompositeOperations(self._compute_client) response = compute_client_composite_operations.launch_instance_and_wait_for_state( launch_instance_details, wait_for_states=[oci.core.models.Instance.LIFECYCLE_STATE_RUNNING], waiter_kwargs={'wait_callback': self._wait_callback}) We use the optional waiter callback to display a simple progress indicator oci-compute demo Short demo of the 
oci-compute tool: Conclusion In this post, I’ve shown how to use oci-compute to easily provision and manage your OCI Compute Instances from the command line as well as how to create your own Python scripts using the Oracle Cloud Infrastructure Python SDK.

In a recent blog post I illustrated how to use the OCI Command Line Interface (CLI) in shell scripts. While the OCI CLI is comprehensive and powerful, it may not be the best solution when you need to...

Partners

Noesis Solutions Certifies its Optimus Process Integration and Design Optimization Software with Oracle Linux

We are pleased to introduce Noesis Solutions’ Optimus into the ecosystem of ISV applications certified with Oracle Linux. Noesis recently certified its Optimus 2020.1 release with Oracle Linux 6 and 7. Optimus is an industry-leading process integration and design optimization (PIDO) software platform, bundling a powerful range of capabilities for engineering process integration, design space exploration, engineering optimization, and robustness and reliability. These PIDO technologies help direct engineering simulations toward design candidates that can outsmart competition while taking into account relevant design constraints - effectively implementing an objectives-driven engineering process. Optimus advanced workflow technologies offer the unique capability to intuitively automate engineering processes by capturing the related simulation workflow. These workflows free users from repetitive manual model changes, data processing, and performance evaluation tasks. Optimus simulation workflows can be executed in local environments or cloud infrastructures. Optimus also offers effective data mining technologies that help engineering teams to gain deeper insights and visualize the design space in a limited time window, to help them make informed decisions. Noesis Solutions is an engineering innovation company that works with manufacturers in engineering-intense industries. Specialized in solutions that enable objectives driven draft-to-craft engineering processes, its software products and services help customers adopt a targeted development strategy that helps resolve their toughest multi-disciplinary engineering challenges.  

We are pleased to introduce Noesis Solutions’ Optimus into the ecosystem of ISV applications certified with Oracle Linux. Noesis recently certified its Optimus 2020.1 release with Oracle Linux 6 and 7. Op...

Linux Kernel Development

Cgroup v2 Checkpoint

In this blog post, Oracle Linux kernel developer Tom Hromatka provides a checkpoint on Oracle Linux's journey to embrace cgroup v2. Cgroup v2 Checkpoint With the release of UEK5 in 2018, Oracle embarked on the long journey to fully transition to cgroup v2. UEK6 is the latest major milestone on the path to this significant upgrade. In UEK5, we added the cpu, cpuset, io, memory, pids, and rdma cgroup v2 controllers. While no new controllers were added for UEK6, emphasis was placed on reliability, usability, and security. Furthermore, we continue to focus on defining and implementing a holistic solution that, once adopted by applications, will allow them to seamlessly operate on a cgroup-v1 system or a cgroup-v2 system. Note that both UEK5 and UEK6 can currently meet your cgroup v1, cgroup v2, or multi-mode cgroup needs. Entirely cgroup v1 applications - This is the default and no special action is required of the user Entirely cgroup v2 applications - By passing cgroup_no_v1=all on the kernel command line, all cgroup v1 controllers will be disabled. The cgroup v2 filesystem can then be mounted via mount -t cgroup2 cgroup2 /path/to/mount/cgroupv2 Applications that need a combination of v1 and v2 - By passing cgroup_no_v1=controller1,controller2, controller1 and controller2 will not be enabled in cgroup v1. They can then be mounted as a cgroup v2 mount as outlined above. Brief Cgroup v2 vs v1 Recap Cgroup v1 was a jack-of-all-trades and master-of-none solution. It provided the user with tremendous flexibility and a myriad of configuration options. This came at the cost of complexity, performance, and (at least within the kernel code itself) maintainability. In practice most users only utilized cgroup v1 in a couple of different fashions, yet the kernel still needed to support the possibility of the many, many other quirky and now nonstandard v1 configurations. With cgroup v2, these nonstandard and unintuitive usages were removed, and a much more streamlined hierarchy was established. LWN ran an excellent article on the high-level differences between cgroup v1 and v2. The challenges to enterprise users go well above and beyond these differences; below are but a few of the changes that may affect our enterprise customers: In cgroup v2, many of the cgroup pseudofiles have been renamed and their ranges of values have changed as well. For example, cpu.shares in cgroup v1 provides similar behavior to cpu.weight or cpu.weight.nice in cgroup v2, but with a different span of valid settings. Cgroup v1's memory.limit_in_bytes correlates with v2's memory.max, memory.soft_limit_in_bytes is analogous to memory.high, and so on. Some v1 pseudofiles have been removed entirely. As cgroup v1 grew and changed organically over time, many controls were added. Ultimately this led to a large, confusing folder hierarchy with an inconsistent and complex interface for the user to manage. As cgroup v2 was being designed, these suboptimal pseudofiles were removed. For example, the cgroup v1 memory controller has 26 pseudofiles, whereas v2 has only 13 files. All new development is going into cgroup v2. Cgroup v1 will continue to see bug fixes for the foreseeable future, but no new features are being added to v1. Finally, there's another major advantage to moving to cgroup v2 - PSI. Pressure Stall Information is a powerful performance-monitoring tool that was added to UEK5-U2 and is again available in UEK6.
If a UEK5/UEK6 system is booted with the kernel command line parameter psi=1, then system-wide psi data is available in /proc/pressure/. If the system is also using cgroup v2, then PSI data is available for each cgroup as well. PSI can immediately pinpoint the culprit of performance bottlenecks - be it I/O, memory, or CPU. A Bright Future But a Challenging Road Cgroup v2 is undoubtedly a technical improvement for both the kernel and the users of cgroups, but it currently comes at a heavy opportunity cost to enterprise cgroup users. Enterprise customers will soon face a difficult decision - which cgroup version to support within their applications? Should a customer jump directly to cgroup v2? Cgroup v1 still largely reigns supreme, but its time may be nearing an end. Unfortunately many applications interact directly with the cgroup mount in sysfs which makes the transition to v2 even more arduous. With cgroup v2's drastically different hierarchy, restrictions on leaf nodes, and different pseudofiles, migrating to cgroup v2 is much more challenging than simply performing a find and replace. And what if an application needs to run on both older and newer systems? In this case the application will need to be cognizant of the underlying system and its capabilities, adjusting its cgroup settings and configurations accordingly. This is a large and complex undertaking that may consume many, many engineering-hours, stealing precious resources away from development on the revenue-generating features of the code. Help is on the Way We at Oracle have been working hard to ease the transition from cgroup v1 to cgroup v2 for our customers. We have been working closely with internal partners to devise a plan that will allow them - and all our customers - to take advantage of the new and exciting features of cgroup v2 without endangering their product lines, schedules, and bottom line. Some key requirements we have identified: Minimize the changes required within the enterprise application Many applications provide long-term support and need to be able to run on systems with a wide variety of features and capabilities. A major goal is to run the exact same user binary on a cgroup v1 or a cgroup v2 system. Encourage enterprise customers to interact with helper libraries (like libcgroup) rather than directly interacting with cgroup's sysfs. This will centralize the cgroup management in a single location rather than having a bunch of piecemeal solutions spread throughout each application. Stretch goal - implement a usability layer that will allow applications to specify required behavior rather than specific cgroup settings. Even with helper libraries, managing cgroups is complex and often requires expert-level knowledge to maximize performance and minimize security risks. In some cases, a user would prefer to request a behavior (e.g. protection from side-channel attacks) rather than identify the cgroup settings required to implement such a behavior. Given the above requirements, we have embarked on the following roadmap: Add cgroup v2 support to libcgroup. Libcgroup was started in 2008 during the early days of cgroup v1 but has largely languished over the last few years. As maintainers of libcgroup, Oracle's Dhaval Giani and I are defining, guiding, and implementing the library's transition to full cgroup v2 support. In 2019, we restarted development on libcgroup and have since added automated unit tests, automated functional tests, and code coverage. 
We recently added an "ignore" feature to cgrules for an internal customer, and currently have a patchset out for review to add cgroup v2 support to cgget and cgset. Create an abstraction layer that can receive cgroup v1 (or v2) requests and translate them to the correct underlying system settings - be it v1 or v2. This layer should allow cgroup v1 users to continue to specify v1 settings even if the application is running on a v2 system, thus minimizing changes to the application. And finally create a usability layer to further remove the user from the intricacies and pitfalls of cgroup management. Not all users are cgroup experts and not all users want to be cgroup experts. A usability layer would give these users the ability to consistently and safely configure their systems every single time. What Should a User Do Now? While we are making good progress on the abstraction and usability layers, they will not be ready for some time yet. In the meantime, users can ease the transition to cgroup v2 by: Identifying where the application interacts with cgroups. If the application is directly interacting with sysfs, I would strongly recommend that the code be updated to interact with libcgroup's APIs. There are several current advantages of using libcgroup over sysfs directly, and these advantages will continue to grow as more features and abstractions are added to libcgroup. Using libcgroup's APIs will significantly ease the transition to cgroup v2. Documenting the application's cgroup hierarchy. As outlined in the cgroup v2 vs v1 recap above, cgroup v2 only supports having processes in leaf nodes. Now would be a good time to revisit the application's cgroup hierarchy and ensure that it is compatible with v2's stricter requirements. Flattening the application's cgroup hierarchy. Due to a shared semaphore in the kernel, heavily nested cgroups are potentially subject to nontrivial performance degradations. If possible, flattening the application's cgroup hierarchy could be an easy path to improved performance. Helping us define the abstraction and usability layers. Comments and thoughts are always welcome on the libcgroup mailing list. Conclusion Cgroup v2 is an exciting technology with a lot of benefits over its predecessor. Oracle is working on defining and implementing the kernel and low-level userspace code that will allow our users to take advantage of all that cgroup v2 has to offer. Interested in following along? Subscribe to the libcgroup mailing list and monitor our progress at libcgroup by clicking the "Watch" button in the top right. Interested in participating or helping to define the abstraction or usability layers? Please email the libcgroup mailing list with your thoughts. We would love to hear your input.
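To make the pseudofile differences described above concrete, here is a minimal, hedged sketch in C that picks the CPU "weight" knob based on which cgroup version is mounted. The mount point, the demo cgroup directories, and the values written are illustrative assumptions only; this is not the abstraction layer discussed in the post, and a real application would be better served by libcgroup's APIs.

/*
 * Minimal sketch only -- not the abstraction layer discussed in this post.
 * Chooses between the v1 and v2 CPU "weight" pseudofiles for a cgroup that
 * is assumed to already exist. Paths, cgroup names, and values are
 * illustrative assumptions.
 */
#include <stdio.h>
#include <sys/statfs.h>
#include <linux/magic.h>        /* CGROUP2_SUPER_MAGIC */

static int write_knob(const char *path, const char *value)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%s\n", value);
        return fclose(f);
}

int main(void)
{
        struct statfs fs;

        /* Assumed mount point; adjust for your system. */
        if (statfs("/sys/fs/cgroup", &fs) != 0) {
                perror("statfs");
                return 1;
        }

        if (fs.f_type == CGROUP2_SUPER_MAGIC)
                /* cgroup v2: cpu.weight, valid range 1-10000, default 100 */
                return write_knob("/sys/fs/cgroup/demo/cpu.weight", "100") ? 1 : 0;

        /* Otherwise assume a typical v1 layout: cpu.shares, default 1024 */
        return write_knob("/sys/fs/cgroup/cpu/demo/cpu.shares", "1024") ? 1 : 0;
}

The branch on the filesystem type is exactly the kind of per-application boilerplate that the planned abstraction layer is intended to absorb.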


Linux

Zero Copy Networking in UEK6

Rao Shoaib, Oracle Linux kernel developer, talks about enhancements to Linux networking relating to zero copy networking. Zero copy networking has always been the goal of Linux networking, and over the years a lot of techniques have been developed in the mainline Linux kernel to achieve it. This blog post highlights recent enhancements to zero copy networking. All of these enhancements are included in UEK6. Zero Copy Send Zero copy send avoids copying on transmit by pinning down the send buffers. It introduces a new socket option, SO_ZEROCOPY, and a new send flag, MSG_ZEROCOPY. To use this feature, the SO_ZEROCOPY socket option has to be set and the send calls have to be made using the flag MSG_ZEROCOPY. setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)); send(fd, buf, sizeof(buf), MSG_ZEROCOPY); Once the kernel has transmitted the buffer, it notifies the application via the socket error queue (MSG_ERRQUEUE) that the buffer is available for reuse. Each successful send is identified by a 32 bit sequentially increasing number, so the application can correlate the sends and buffers. A notification may contain a range that covers several sends. Zero copy and non-zero-copy sends can be intermixed. recvmsg(fd, &msg, MSG_ERRQUEUE); Due to the overhead associated with pinning down pages, MSG_ZEROCOPY is generally only effective for writes larger than 10 KB. MSG_ZEROCOPY is supported for TCP, UDP, RAW, and RDS TCP sockets. For various reasons, zero copy may not always be possible, in which case a copy of the data is used. Performance Process cycles measurements were made over all CPUs in the system. CPU cycles measurements were made only over CPUs on which the application ran. Cycles measured are UNHALTED_CORE_CYCLES. TCP Zero Copy Recv Zero copy receive is a much harder problem to solve as it requires the data buffer to be page aligned so that it can be mmapped. The presence of protocol headers and a small MTU size inhibits this alignment. The new zero copy receive API introduces mmap() for TCP sockets. An mmap() call on a TCP socket carves out address space where an incoming data buffer can be mapped, provided the data buffer is page aligned and urgent data is not present. Note that it does not require the size to be a multiple of the page size. To use this feature, the application issues an mmap() call on the TCP socket to obtain an address range for mapping data buffers. When data becomes available for reading, the application issues a getsockopt() call with the TCP_ZEROCOPY_RECEIVE option to pass the address information to the kernel. The following structure is used to pass the mapping information; address contains the address where the mapping begins and length contains the length of the mapped space. struct tcp_zerocopy_receive { __u64 address; /* Input, address of mmap carved space */ __u32 length; /* Input/Output, length of address space, amount of data mapped */ __u32 recv_skip_hint; /* Output, bytes to read via recvmsg before reading mmap region */ }; If the getsockopt() call is successful, the kernel will have mapped the data buffer at the address passed in. On return, length contains the length of the mapped data. In certain circumstances it may not be possible to map the entire data, in which case recv_skip_hint contains the length of data to be read via regular recv calls before reading mapped data. If no data can be mapped, getsockopt() returns with an error and regular recv calls have to be used. Once the data has been consumed, the mmap address space can be freed for reuse via another TCP_ZEROCOPY_RECEIVE call. 
The kernel will reuse the address space for mapping the next data buffer. munmap() can be used to unmap and release the address space. Note that reuse does not require an mmap() call. This API shows performance gains only when conditions pertaining to buffer alignment are met, so this is not a general purpose API. On a setup using header-splitting-capable NICs and an MTU of 61512 bytes, data processing went from 129µs/MB to just 45µs/MB. AF_XDP, Zero Copy AF_XDP sockets have been enhanced to support zero copy send and receive by using Rx and Tx rings in user space. A large chunk of memory called UMEM is allocated. This memory is registered with the socket via the setsockopt XDP_UMEM_REG. The block of memory is further divided into smaller chunks called descriptors, which are addressed by their index within the UMEM block. Descriptors are used as send and receive buffers. UMEM can be shared between different sockets and processes. To use the chunks as Rx buffers, two circular queues, 'fill queue' and 'receive queue', are created. 'fill queue' is used to pass the index of the descriptor within UMEM to be used for copying data; 'receive queue' is updated by the kernel with indexes that have been used. 'fill queue' is allocated via the setsockopt XDP_UMEM_FILL_QUEUE and is mapped in user space. 'receive queue' is allocated via the setsockopt XDP_RX_QUEUE and is also mmapped in user space. Once the kernel has copied data into one of the descriptors whose index was provided in the fill queue, it places that descriptor's index in this queue to indicate that it has been used and data is available to be read. A similar mechanism is used for Tx. A transmission queue 'Tx Q' is created via the setsockopt XDP_TX_QUEUE and UMEM indexes of packets to be transmitted are placed in this queue. A 'Completion Q' is created via the setsockopt XDP_UMEM_COMPLETION_QUEUE. The kernel adds UMEM indexes of buffers that have been transmitted to the Completion Q. Performance Performance improved from 15 Mpps to 39 Mpps for Rx and from 25 Mpps to 68 Mpps for Tx. This enhancement makes XDP performance on par with DPDK. Use of this feature requires that the application create its own packets by inserting all the necessary headers and also perform protocol processing. It also requires use of libbpf. Similar requirements exist when using DPDK. Conclusion UEK6 delivers continued network performance enhancements and new technology to build faster networking products. References: Material presented has been borrowed from several sources, some of which are listed below. sendmsg copy avoidance with MSG_ZEROCOPY [Patch net-next 5/5] selftests: net: add tcp_mmap program AF_XDP, zero-copy support The Path to DPDK Speeds for AF_XDP AF_XDP
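As an addendum, the send-side calls quoted earlier in this post can be pulled together into a minimal, hedged C sketch of the MSG_ZEROCOPY flow. It assumes fd is an already-connected TCP socket, buf is a long-lived send buffer, and that the toolchain headers expose SO_ZEROCOPY and MSG_ZEROCOPY (recent glibc and kernel headers); a real application would poll for the completion notification instead of issuing a single blocking error-queue read.

/*
 * Hedged sketch of the send-side MSG_ZEROCOPY flow described above.
 * 'fd' is assumed to be a connected TCP socket and 'buf' a long-lived
 * buffer; error handling and the polling loop a real application needs
 * are omitted for brevity.
 */
#include <stddef.h>
#include <stdio.h>
#include <sys/socket.h>
#include <linux/errqueue.h>     /* struct sock_extended_err, SO_EE_ORIGIN_ZEROCOPY */

static int zerocopy_send(int fd, const void *buf, size_t len)
{
        struct msghdr msg = { 0 };
        char control[128];
        struct cmsghdr *cm;
        int one = 1;

        if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
                return -1;

        /* The pages behind 'buf' must not be modified until the completion arrives. */
        if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
                return -1;

        /* Read the completion notification from the socket error queue. */
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);
        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
                return -1;

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                struct sock_extended_err *serr;

                serr = (struct sock_extended_err *)CMSG_DATA(cm);
                if (serr->ee_errno == 0 && serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
                        /* ee_info..ee_data is the range of completed send numbers. */
                        printf("sends %u..%u completed\n", serr->ee_info, serr->ee_data);
        }
        return 0;
}

In production code the cmsg level and type would also be checked before interpreting the payload, and the application would keep the buffer untouched until the matching sequence number range has been reported back.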


Linux

An Introduction to the io_uring Asynchronous I/O Framework

In this blog Oracle Linux kernel developer Bijan Mottahedeh talks about the io_uring asynchronous I/O framework included in the Unbreakable Enterprise Kernel 6.     Introduction This blog post is a brief introduction to the io_uring asynchronous I/O framework available in release 6 of the Unbreakable Enterprise Kernel (UEK). It highlights the motivations for introducing the new framework, describes its system call and library interfaces with the help of sample applications, and provides a list of references that describe the technology in further detail including more usage examples. The io_uring Asynchronous I/O (AIO) framework is a new Linux I/O interface, first introduced in upstream Linux kernel version 5.1 (March 2019). It provides a low-latency and feature-rich interface for applications that require AIO functionality but prefer the kernel to perform the I/O. This could be in order to exploit benefits running on top of a filesystem, or to leverage features such as mirroring and block-level encryption. This is in contrast to SPDK applications for example, that explicitly do not want the kernel to perform I/O because they implement their own filesystem and features. Motivation The native Linux AIO framework suffers from various limitations, which io_uring aims to overcome: It does not support buffered I/O, only direct I/O is supported. It has non-deterministic behavior which may block under various circumstances. It has a sub-optimal API, which requires at least two system calls per I/O, one to submit a request, and one to wait for its completion. Each submission needs to copy 64 + 8 bytes of data, and each completion needs to copy 32 bytes.   Communication channel An io_uring instance has two rings, a submission queue (SQ) and a completion queue (CQ), shared between the kernel and the application. The queues are single producer, single consumer, and power of two in size. The queues provide a lock-less access interface, coordinated with memory barriers. The application creates one or more SQ entries (SQE), and then updates the SQ tail. The kernel consumes the SQEs , and updates the SQ head. The kernel creates CQ entries (CQE) for one or more completed requests, and updates the CQ tail. The application consumes the CQEs and updates the CQ head. Completion events may arrive in any order but they are always associated with specific SQEs. System call API The io_uring API consists of three system calls: io_uring_setup(2), io_uring_register(2) and io_uring_enter(2), described in the following sections. The full manual pages for the system calls are available here. io_uring_setup setup a context for performing asynchronous I/O int io_uring_setup(u32 entries, struct io_uring_params *p); The io_uring_setup() system call sets up a submission queue and completion queue with at least entries elements, and returns a file descriptor which can be used to perform subsequent operations on the io_uring instance. The submission and completion queues are shared between the application and the kernel, which eliminates the need to copy data when initiating and completing I/O. params is used by the application in order to configure the io_uring instance, and by the kernel to convey back information about the configured submission and completion ring buffers. An io_uring instance can be configured in three main operating modes: Interrupt driven - By default, the io_uring instance is setup for interrupt driven I/O. 
I/O may be submitted using io_uring_enter() and can be reaped by checking the completion queue directly. Polled - Perform busy-waiting for an I/O completion, as opposed to getting notifications via an asynchronous IRQ (Interrupt Request). The file system (if any) and block device must support polling in order for this to work. Busy-waiting provides lower latency, but may consume more CPU resources than interrupt driven I/O. Currently, this feature is usable only on a file descriptor opened using the O_DIRECT flag. When a read or write is submitted to a polled context, the application must poll for completions on the CQ ring by calling io_uring_enter(). It is illegal to mix and match polled and non-polled I/O on an io_uring instance. Kernel polled - In this mode, a kernel thread is created to perform submission queue polling. An io_uring instance configured in this way enables an application to issue I/O without ever context switching into the kernel. By using the submission queue to fill in new submission queue entries and watching for completions on the completion queue, the application can submit and reap I/Os without doing a single system call. If the kernel thread is idle for more than a user configurable amount of time, it will go idle after notifying the application first. When this happens, the application must call io_uring_enter() to wake the kernel thread. If I/O is kept busy, the kernel thread will never sleep. io_uring_setup() returns a new file descriptor on success. The application may then provide the file descriptor in a subsequent mmap(2) call to map the submission and completion queues, or to the io_uring_register() or io_uring_enter() system calls. io_uring_register register files or user buffers for asynchronous I/O int io_uring_register(unsigned int fd, unsigned int opcode, void *arg, unsigned int nr_args); The io_uring_register() system call registers user buffers or files for use in an io_uring instance referenced by fd. Registering files or user buffers allows the kernel to take long term references to internal kernel data structures associated with the files, or create long term mappings of application memory associated with the buffers, only once during registration as opposed to during processing of each I/O request, therefore reducing per-I/O overhead. Registered buffers will be locked in memory and charged against the user's RLIMIT_MEMLOCK resource limit. Additionally, there is a size limit of 1GiB per buffer. Currently, the buffers must be anonymous, non-file-backed memory, such as that returned by malloc(3) or mmap(2) with the MAP_ANONYMOUS flag set. Huge pages are supported as well. Note that the entire huge page will be pinned in the kernel, even if only a portion of it is used. It is perfectly valid to setup a large buffer and then only use part of it for an I/O, as long as the range is within the originally mapped region. An application can increase or decrease the size or number of registered buffers by first unregistering the existing buffers, and then issuing a new call to io_uring_register() with the new buffers. An application can dynamically update the set of registered files without unregistering them first. It is possible to use eventfd(2) to get notified of completion events on an io_uring instance. If this is desired, an eventfd file descriptor can be registered through this system call. The credentials of the running application can be registered with io_uring which returns an id associated with those credentials. 
Applications wishing to share a ring between separate users/processes can pass in this credential id in the SQE personality field. If set, that particular SQE will be issued with these credentials. io_uring_enter initiate and/or complete asynchronous I/O int io_uring_enter(unsigned int fd, unsigned int to_submit, unsigned int min_complete, unsigned int flags, sigset_t *sig); io_uring_enter() is used to initiate and complete I/O using the shared submission and completion queues set up by a call to io_uring_setup(). A single call can both submit new I/O and wait for completions of I/O initiated by this call or previous calls to io_uring_enter(). fd is the file descriptor returned by io_uring_setup(). to_submit specifies the number of I/Os to submit from the submission queue. If so directed by the application, the system call will attempt to wait for min_complete event completions before returning. If the io_uring instance was configured for polling, then min_complete has a slightly different meaning. Passing a value of 0 instructs the kernel to return any events which are already complete, without blocking. If min_complete is a non-zero value, the kernel will still return immediately if any completion events are available. If no event completions are available, then the call will poll either until one or more completions become available, or until the process has exceeded its scheduler time slice. Note that for interrupt driven I/O, an application may check the completion queue for event completions without entering the kernel at all. io_uring_enter() supports a wide variety of operations, including: open, close, and stat files; read and write into multiple buffers or pre-mapped buffers; socket I/O operations; synchronize file state; asynchronously monitor a set of file descriptors; create a timeout linked to a specific operation in the ring; attempt to cancel an operation that is currently in flight; and create I/O chains, with ordered execution within a chain and parallel execution of multiple chains. When the system call returns that a certain number of SQEs have been consumed and submitted, it's safe to reuse SQE entries in the ring. This is true even if the actual IO submission had to be punted to async context, which means that the SQE may in fact not have been submitted yet. If the kernel requires later use of a particular SQE entry, it will have made a private copy of it. Liburing Liburing provides a simple higher level API for basic use cases and allows applications to avoid having to deal with the full system call implementation details. The API also avoids duplicated code for operations such as setting up an io_uring instance. For example, after getting back a ring file descriptor from io_uring_setup(), an application must always call mmap() in order to map the submission and completion queues for access, as described in the io_uring_setup() manual page. The entire sequence is somewhat lengthy but can be accomplished with the following liburing call: int io_uring_queue_init(unsigned entries, struct io_uring *ring, unsigned flags); The sample applications below are included in the liburing source and help illustrate these points. The source code for the liburing sample applications is available here. There is no liburing API documentation currently available; the API is described in the liburing.h header file. Sample application io_uring-test io_uring-test reads a maximum of 16KB from a user specified file using 4 SQEs. Each SQE is a request to read 4KB of data from a fixed file offset. 
io-uring then reaps each CQE and checks whether the full 4KB was read from the file as requested. If the file is smaller than 16KB, all 4 SQEs are still submitted but some CQE results will indicate either a partial read, or zero bytes read, depending on the actual size of the file. io-uring finally reports the number of SQEs and CQEs it has processed. The full source code is listed below followed by the description of the liburing calls. /* SPDX-License-Identifier: MIT */ /* * Simple app that demonstrates how to setup an io_uring interface, * submit and complete IO against it, and then tear it down. * * gcc -Wall -O2 -D_GNU_SOURCE -o io_uring-test io_uring-test.c -luring */ #include <stdio.h> #include <fcntl.h> #include <string.h> #include <stdlib.h> #include <unistd.h> #include "liburing.h" #define QD 4 int main(int argc, char *argv[]) { struct io_uring ring; int i, fd, ret, pending, done; struct io_uring_sqe *sqe; struct io_uring_cqe *cqe; struct iovec *iovecs; off_t offset; void *buf; if (argc < 2) { printf("%s: file\n", argv[0]); return 1; } ret = io_uring_queue_init(QD, &ring, 0); if (ret < 0) { fprintf(stderr, "queue_init: %s\n", strerror(-ret)); return 1; } fd = open(argv[1], O_RDONLY | O_DIRECT); if (fd < 0) { perror("open"); return 1; } iovecs = calloc(QD, sizeof(struct iovec)); for (i = 0; i < QD; i++) { if (posix_memalign(&buf, 4096, 4096)) return 1; iovecs[i].iov_base = buf; iovecs[i].iov_len = 4096; } offset = 0; i = 0; do { sqe = io_uring_get_sqe(&ring); if (!sqe) break; io_uring_prep_readv(sqe, fd, &iovecs[i], 1, offset); offset += iovecs[i].iov_len; i++; } while (1); ret = io_uring_submit(&ring); if (ret < 0) { fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret)); return 1; } done = 0; pending = ret; for (i = 0; i < pending; i++) { ret = io_uring_wait_cqe(&ring, &cqe); if (ret < 0) { fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret)); return 1; } done++; ret = 0; if (cqe->res != 4096) { fprintf(stderr, "ret=%d, wanted 4096\n", cqe->res); ret = 1; } io_uring_cqe_seen(&ring, cqe); if (ret) break; } printf("Submitted=%d, completed=%d\n", pending, done); close(fd); io_uring_queue_exit(&ring); return 0; } Description An io_uring instance is created in the default interrupt driven mode, specifying only the size of the ring. ret = io_uring_queue_init(QD, &ring, 0); All of the ring SQEs are next fetched and prepared for the IORING_OP_READV operation which provides an asynchronous interface to readv(2) system call. Liburing provides numerous helper functions to prepare io_uring operations. Each SQE will point to an allocated buffer described by an iovec structure. The buffer will contain the result of the corresponding readv operation upon completion. sqe = io_uring_get_sqe(&ring); io_uring_prep_readv(sqe, fd, &iovecs[i], 1, offset); The SQEs are submitted with a single call to io_uring_submit() which returns the number of submitted SQEs. ret = io_uring_submit(&ring); The CQEs are reaped with repeated calls to io_uring_wait_cqe(), and the success of a given submission is verified with the cqe->res field; each matching call to io_uring_cqe_seen() informs the kernel that the given CQE has been consumed. ret = io_uring_wait_cqe(&ring, &cqe); io_uring_cqe_seen(&ring, cqe); The io_uring instance is finally dismantled. void io_uring_queue_exit(struct io_uring *ring) Sample application link-cp link-cp copies a file using the io_uring SQE chaining feature. As noted before, io_uring supports the creation of I/O chains. 
The I/O operations within a chain are sequentially executed, and multiple I/O chains can execute in parallel. To copy the file, link-cp creates SQE chains of length two. The first SQE in the chain is a read request from the input file into a buffer. The second request, linked to the first, is a write request from the same buffer to the output file. /* SPDX-License-Identifier: MIT */ /* * Very basic proof-of-concept for doing a copy with linked SQEs. Needs a * bit of error handling and short read love. */ #include <stdio.h> #include <fcntl.h> #include <string.h> #include <stdlib.h> #include <unistd.h> #include <assert.h> #include <errno.h> #include <inttypes.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/ioctl.h> #include "liburing.h" #define QD 64 #define BS (32*1024) struct io_data { size_t offset; int index; struct iovec iov; }; static int infd, outfd; static unsigned inflight; static int setup_context(unsigned entries, struct io_uring *ring) { int ret; ret = io_uring_queue_init(entries, ring, 0); if (ret < 0) { fprintf(stderr, "queue_init: %s\n", strerror(-ret)); return -1; } return 0; } static int get_file_size(int fd, off_t *size) { struct stat st; if (fstat(fd, &st) < 0) return -1; if (S_ISREG(st.st_mode)) { *size = st.st_size; return 0; } else if (S_ISBLK(st.st_mode)) { unsigned long long bytes; if (ioctl(fd, BLKGETSIZE64, &bytes) != 0) return -1; *size = bytes; return 0; } return -1; } static void queue_rw_pair(struct io_uring *ring, off_t size, off_t offset) { struct io_uring_sqe *sqe; struct io_data *data; void *ptr; ptr = malloc(size + sizeof(*data)); data = ptr + size; data->index = 0; data->offset = offset; data->iov.iov_base = ptr; data->iov.iov_len = size; sqe = io_uring_get_sqe(ring); io_uring_prep_readv(sqe, infd, &data->iov, 1, offset); sqe->flags |= IOSQE_IO_LINK; io_uring_sqe_set_data(sqe, data); sqe = io_uring_get_sqe(ring); io_uring_prep_writev(sqe, outfd, &data->iov, 1, offset); io_uring_sqe_set_data(sqe, data); } static int handle_cqe(struct io_uring *ring, struct io_uring_cqe *cqe) { struct io_data *data = io_uring_cqe_get_data(cqe); int ret = 0; data->index++; if (cqe->res < 0) { if (cqe->res == -ECANCELED) { queue_rw_pair(ring, BS, data->offset); inflight += 2; } else { printf("cqe error: %s\n", strerror(cqe->res)); ret = 1; } } if (data->index == 2) { void *ptr = (void *) data - data->iov.iov_len; free(ptr); } io_uring_cqe_seen(ring, cqe); return ret; } static int copy_file(struct io_uring *ring, off_t insize) { struct io_uring_cqe *cqe; size_t this_size; off_t offset; offset = 0; while (insize) { int has_inflight = inflight; int depth; while (insize && inflight < QD) { this_size = BS; if (this_size > insize) this_size = insize; queue_rw_pair(ring, this_size, offset); offset += this_size; insize -= this_size; inflight += 2; } if (has_inflight != inflight) io_uring_submit(ring); if (insize) depth = QD; else depth = 1; while (inflight >= depth) { int ret; ret = io_uring_wait_cqe(ring, &cqe); if (ret < 0) { printf("wait cqe: %s\n", strerror(ret)); return 1; } if (handle_cqe(ring, cqe)) return 1; inflight--; } } return 0; } int main(int argc, char *argv[]) { struct io_uring ring; off_t insize; int ret; if (argc < 3) { printf("%s: infile outfile\n", argv[0]); return 1; } infd = open(argv[1], O_RDONLY); if (infd < 0) { perror("open infile"); return 1; } outfd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644); if (outfd < 0) { perror("open outfile"); return 1; } if (setup_context(QD, &ring)) return 1; if (get_file_size(infd, &insize)) return 1; ret = 
copy_file(&ring, insize); close(infd); close(outfd); io_uring_queue_exit(&ring); return ret; } Description The three routines copy_file(), queue_rw_pair(), and handle_cqe(), implement the file copy. copy_file() implements the high level copy loop; it calls queue_rw_pair() to construct each SQE pair queue_rw_pair(ring, this_size, offset); and submits all the constructed SQE pairs in each iteration with a single call to io_uring_submit(). if (has_inflight != inflight) io_uring_submit(ring); copy_file() maintains up to QD SQEs inflight as long as data remains to be copied; it waits for and reaps all CQEs after the input file has been fully read. ret = io_uring_wait_cqe(ring, &cqe); if (handle_cqe(ring, cqe)) queue_rw_pair() constructs a read-write SQE pair. The IOSQE_IO_LINK flag is set for the read SQE which marks the start of the chain. The flag is not set for the write SQE which marks the end of the chain. The user data field is set for both SQEs to the same data descriptor for the pair, and will be used during completion handling. sqe = io_uring_get_sqe(ring); io_uring_prep_readv(sqe, infd, &data->iov, 1, offset); sqe->flags |= IOSQE_IO_LINK; io_uring_sqe_set_data(sqe, data); sqe = io_uring_get_sqe(ring); io_uring_prep_writev(sqe, outfd, &data->iov, 1, offset); io_uring_sqe_set_data(sqe, data); handle_cqe() retrieves from a CQE the data descriptor saved by queue_rw_pair() and records the retrieval in the descriptor. struct io_data *data = io_uring_cqe_get_data(cqe); data->index++; handle_cqe() resubmits the read-write pair if the request was canceled. if (cqe->res == -ECANCELED) { queue_rw_pair(ring, BS, data->offset); The following excerpt from the io_uring_enter() manual page describes the behavior of chained requests in more detail: IOSQE_IO_LINK When this flag is specified, it forms a link with the next SQE in the submission ring. That next SQE will not be started before this one completes. This, in effect, forms a chain of SQEs, which can be arbitrarily long. The tail of the chain is denoted by the first SQE that does not have this flag set. This flag has no effect on previous SQE submissions, nor does it impact SQEs that are outside of the chain tail. This means that multiple chains can be executing in parallel, or chains and individual SQEs. Only members inside the chain are serialized. A chain of SQEs will be broken, if any request in that chain ends in error. io_uring considers any unexpected result an error. This means that, eg, a short read will also terminate the remainder of the chain. If a chain of SQE links is broken, the remaining unstarted part of the chain will be terminated and completed with -ECANCELED as the error code. handle_cqe() frees the shared data descriptor after both members of a CQE pair have been processed. if (data->index == 2) { void *ptr = (void *) data - data->iov.iov_len; free(ptr); } handle_cqe() finally informs the kernel that the given CQE has been consumed. io_uring_cqe_seen(ring, cqe); Liburing API io_uring-test and link-cp use the following subset of the liburing API: /* * Returns -1 on error, or zero on success. On success, 'ring' * contains the necessary information to read/write to the rings. */ int io_uring_queue_init(unsigned entries, struct io_uring *ring, unsigned flags); /* * Return an sqe to fill. Application must later call io_uring_submit() * when it's ready to tell the kernel about it. The caller may call this * function multiple times before calling io_uring_submit(). * * Returns a vacant sqe, or NULL if we're full. 
*/ struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring); /* * Set the SQE user_data field. */ void io_uring_sqe_set_data(struct io_uring_sqe *sqe, void *data); /* * Prepare a readv I/O operation. */ void io_uring_prep_readv(struct io_uring_sqe *sqe, int fd, const struct iovec *iovecs, unsigned nr_vecs, off_t offset); /* * Prepare a writev I/O operation. */ void io_uring_prep_writev(struct io_uring_sqe *sqe, int fd, const struct iovec *iovecs, unsigned nr_vecs, off_t offset); /* * Submit sqes acquired from io_uring_get_sqe() to the kernel. * * Returns number of sqes submitted */ int io_uring_submit(struct io_uring *ring); /* * Return an IO completion, waiting for it if necessary. Returns 0 with * cqe_ptr filled in on success, -errno on failure. */ int io_uring_wait_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr); /* * Must be called after io_uring_{peek,wait}_cqe() after the cqe has * been processed by the application. */ static inline void io_uring_cqe_seen(struct io_uring *ring, struct io_uring_cqe *cqe); void io_uring_queue_exit(struct io_uring *ring); References Efficient IO with io_uring Ringing in a new asynchronous I/O API The rapid growth of io_uring Faster I/O through io_uring System call API Liburing API Examples Liburing repository
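As a small addendum: both sample applications above use the default interrupt driven mode. The following is a minimal, hedged sketch of setting up the kernel polled (SQPOLL) mode described earlier, using liburing's io_uring_queue_init_params() helper. The ring size and idle timeout are arbitrary illustration values, and on older kernels this mode may require additional privileges.

/*
 * Hedged sketch: create an io_uring instance with kernel submission-queue
 * polling (IORING_SETUP_SQPOLL), the "kernel polled" mode described above.
 * Assumes a liburing version that provides io_uring_queue_init_params().
 */
#include <stdio.h>
#include <string.h>
#include "liburing.h"

int main(void)
{
        struct io_uring ring;
        struct io_uring_params params;
        int ret;

        memset(&params, 0, sizeof(params));
        params.flags = IORING_SETUP_SQPOLL;
        params.sq_thread_idle = 2000;   /* kernel thread goes idle after ~2000 ms */

        ret = io_uring_queue_init_params(8, &ring, &params);
        if (ret < 0) {
                fprintf(stderr, "queue_init_params: %s\n", strerror(-ret));
                return 1;
        }

        /*
         * SQEs prepared from here on are picked up by the kernel polling
         * thread; io_uring_submit() only calls io_uring_enter() when the
         * thread has gone idle and needs to be woken up.
         */

        io_uring_queue_exit(&ring);
        return 0;
}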


Linux Toolchain & Tracing

DTrace for the Application Developer - Counting Function Calls

This blog entry was provided by Ruud van der Pas   Introduction DTrace is often positioned as an operating system analysis tool for the system administrators, but it has a wider use than this. In particular the application developer may find some features useful when trying to understand a performance problem. In this article we show how DTrace can be used to print a list of the user-defined functions that are called by the target executable. We also show how often these functions are called. Our solution presented below works for a multithreaded application and the function call counts for each thread are given. Motivation There are several reasons why it may be helpful to know how often functions are called: Identify candidates for compiler-based inlining. With inlining, the function call is replaced by the source code of that function. This eliminates the overhead associated with calling a function and also provides additional opportunities for the compiler to better optimize the code. The downsides are an increase in the usage of registers and potentially a reduced benefit from an instruction cache. This is why inlining works best on small functions called very often. Test coverage. Although much more sophisticated tools exist for this, for example gcov, function call counts can be useful to quickly verify if a function is called at all. Note that gcov requires the executable to be instrumented and the source has to be compiled with the appropriate options. In case the function call counts vary across the threads of a multithreaded program, there may be a load imbalance. The counts can also be used to verify which functions are executed by a single thread only.   Target Audience No background in DTrace is assumed. All DTrace features and constructs used are explained. It is expected the reader has some familiarity with developing applications, knows how to execute an executable, and has some basic understanding of shell scripts. The DTrace Basics DTrace provides dynamic tracing of both the operating system kernel and user processes. Kernel and process activities can be observed across all processes running, or be restricted to a specific process, command, or executable. There is no need to recompile or have access to the source code of the process(es) that are monitored. A probe is a key concept in DTrace. Probes define the events that are available to the user to trace. For example, a probe can be used to trace the entry to a specific system call. The user needs to specify the probe(s) to monitor. The simple D language is available to program the action(s) to be taken in case an event occurs. DTrace is safe, unintrusive, and supports kernel as well as application observability. DTrace probes are organized in sets called providers. The name of a provider is used in the definition of a probe. The user can bind one or more tracing actions to any of the probes that have been provided. A list of all of the available probes on the system is obtained using the -l option on the dtrace command that is used to invoke DTrace. Below an example is shown, but only snippets of the output are listed, because on this system there are over 110,000 probes. 
# dtrace -l ID PROVIDER MODULE FUNCTION NAME 1 dtrace BEGIN 2 dtrace END 3 dtrace ERROR <lines deleted> 16 profile tick-1000 17 profile tick-5000 18 syscall vmlinux read entry 19 syscall vmlinux read return 20 syscall vmlinux write entry 21 syscall vmlinux write return <lines deleted> 656 perf vmlinux syscall_trace_enter sys_enter 657 perf vmlinux syscall_slow_exit_work sys_exit 658 perf vmlinux emulate_vsyscall emulate_vsyscall 659 lockstat vmlinux intel_put_event_constraints spin-release 660 lockstat vmlinux intel_stop_scheduling spin-release 661 lockstat vmlinux uncore_pcibus_to_physid spin-release <lines deleted> 1023 sched vmlinux __sched_setscheduler dequeue 1024 lockstat vmlinux tg_set_cfs_bandwidth spin-release 1025 sched vmlinux activate_task enqueue 1026 sched vmlinux deactivate_task dequeue 1027 perf vmlinux ttwu_do_wakeup sched_wakeup 1028 sched vmlinux do_set_cpus_allowed enqueue <many more lines deleted> 155184 fbt xt_comment comment_mt return 155185 fbt xt_comment comment_mt_exit entry 155186 fbt xt_comment comment_mt_exit return 163711 profile profile-99 163712 profile profile-1003 # Each probe in this output is identified by a system-dependent numeric identifier and four fields with unique values:   provider - The name of the DTrace provider that is publishing this probe. module - If this probe corresponds to a specific program location, the name of the kernel module, library, or user-space program in which the probe is located. function - If this probe corresponds to a specific program location, the name of the kernel, library, or executable function in which the probe is located. name - A name that provides some idea of the probe's semantic meaning, such as BEGIN, END, entry, return, enqueue, or dequeue.   All probes have a provider name and a probe name, but some probes, such as the BEGIN, END, ERROR, and profile probes, do not specify a module and function field. This type of probe does not instrument any specific program function or location. Instead, these probes refer to a more abstract concept. For example, the BEGIN probe always triggers at the start of the tracing process. Wild cards in probe descriptions are supported. An empty field in the probe description is equivalent to * and therefore matches any possible value for that field. For example, to trace the entry to the malloc() function in libc.so.6 in a process with PID 365, the pid365:libc.so.6:malloc:entry probe can be used. To probe the malloc() function in this process regardless of the specific library it is part of, either the pid365::malloc:entry or pid365:*:malloc:entry probe can be used.   Upon invocation of DTrace, probe descriptions are matched to determine which probes should have an action associated with them and need to be enabled. A probe is said to fire when the event it represents is triggered.   The user defines the actions to be taken in case a probe fires. These need to be written in the D language, which is specific to DTrace, but readers with some programming experience will find it easy to learn. Different actions may be specified for different probe descriptions. While these actions can be specified at the command line, in this article we put all the probes and associated actions in a file. This D program, or script, by convention has the extension ".d". Aggregations are important in DTrace. Since they play a key role in this article we add a brief explanation here. The syntax for an aggregation is @user_defined_name[keys] = aggregation_function(). 
An example of an aggregation function is sum(arg). It takes a scalar expression as an argument and returns the total value of the specified expressions. For those readers who like to learn more about aggregations in particular we recommend to read this section on aggregations from the Oracle Linux DTrace Guide. This section also includes a list of the available aggregation functions. Testing Environment and Installation Instructions The experiments reported upon here have been conducted in an Oracle Cloud Infrastructure ("OCI") instance running Oracle Linux. The following kernel has been used: $ uname -srvo Linux 4.14.35-1902.3.1.el7uek.x86_64 #2 SMP Mon Jun 24 21:25:29 PDT 2019 GNU/Linux $ The 1.6.4 version of the D language and the 1.2.1 version of DTrace have been used: $ sudo dtrace -Vv dtrace: Sun D 1.6.4 This is DTrace 1.2.1 dtrace(1) version-control ID: e543f3507d366df6ffe3d4cff4beba2d75fdb79c libdtrace version-control ID: e543f3507d366df6ffe3d4cff4beba2d75fdb79c $ DTrace is available on Oracle Linux and can be installed through the following yum command: $ sudo yum install dtrace-utils After the installation has completed, please check your search path! DTrace is invoked through the dtrace command in /usr/sbin. Unfortunately there is a different tool with the same name in /usr/bin. You can check the path is correct through the following command: $ which dtrace /usr/sbin/dtrace $   Oracle Linux is not the only operating system that supports DTrace. It actually has its roots in the Oracle Solaris operating system, but it is also available on macOS and Windows. DTrace is also supported on other Linux based operating systems. For example, this blog article outlines how DTrace could be used on Fedora. Counting Function Calls In this section we show how DTrace can be used to count function calls. Various D programs are shown, successively refining the functionality. The Test Program In the experiments below, a multithreaded version of the multiplication of a matrix with a vector is used. The program is written in C and the algorithm has been parallelized using the Pthreads API. This is a relatively simple test program and makes it easy to verify the call counts are correct. Below is an example of a job that multiplies a 1000x500 matrix with a vector of length 500 using 4 threads. The output echoes the matrix sizes, the number of threads used, and the time it took to perform the multiplication: $ ./mxv.par.exe -m 1000 -n 500 -t 4 Rows = 1000 columns = 500 threads = 4 time mxv = 510 (us) $   A First DTrace Program The D program below lists all functions that are called when executing the target executable. It also shows how often these functions have been executed. Line numbers have been added for ease of reference: 1 #!/usr/sbin/dtrace -s 2 3 #pragma D option quiet 4 5 BEGIN { 6 printf("\n======================================================================\n"); 7 printf(" Function Call Count Statistics\n"); 8 printf("======================================================================\n"); 9 } 10 pid$target:::entry 11 { 12 @all_calls[probefunc,probemod] = count(); 13 } 14 END { 15 printa(@all_calls); 16 } The first line invokes DTrace and uses the -s option to indicate the D program is to follow. At line 3, a pragma is used to supress some information DTrace prints by default. The BEGIN probe spans lines 5-9. This probe is executed once at the start of the tracing and is ideally suited to initialize variables and, as in this case, print a banner. 
At line 10 we use the pid provider to enable tracing of a user process. The target process is either specified using a particular process id (e.g. pid365), or through the $target macro variable that expands to the process id of the command specified at the command line. The latter form is used here. The pid provider offers the flexibility to trace any command, which in this case is the execution of the matrix-vector multiplication executable. We use wild cards for the module name and function. The probe name is entry and this means that this probe fires upon entering any function of the target process. Lines 11 and 13 contain the mandatory curly braces that enclose the actions taken. In this case there is only one action and it is at line 12. Here, the count() aggregation function is used. It returns how often it has been called. Note that this is on a per-probe basis, so this line counts how often each probe fires. The result is stored in an aggregation with the name @all_calls. Since this is an aggregation, the name has to start with the "@" symbol. The aggregation is indexed through the probefunc and probemod built-in DTrace variables. They expand to the function name that caused the probe to trigger and the module this function is part of. This means that line 12 counts how many times each function of the parent process is executed and the library or exectuable this function is part of. The END probe spans lines 14-16. Recall this probe is executed upon termination of the tracing. Although aggregations are automatically printed upon termination, we explicitly print the aggregation using the printa function. The function and module name(s), plus the respective counts, are printed. Below is the output of a run using the matrix-vector program. It is assumed that the D program shown above is stored in a file with the name fcalls.d. Note that root privileges are needed to use DTrace. This is why we use the sudo tool to execute the D program. By default the DTrace output is mixed with the program output. The -o option is used to store the DTrace output in a separate file. The -c option is used to specifiy the command or executable that needs to be traced. Since we use options on the executable, quotes are needed to delimit the full command. Since the full output contains 149 lines, only some snippets are shown here:   $ sudo ./fcalls.d -c "./mxv.par.exe -m 1000 -n 500 -t 4" -o fcalls.out $ cat fcalls.out ====================================================================== Function Call Count Statistics ====================================================================== _Exit libc.so.6 1 _IO_cleanup libc.so.6 1 _IO_default_finish libc.so.6 1 _IO_default_setbuf libc.so.6 1 _IO_file_close libc.so.6 1 <many more lines deleted> init_data mxv.par.exe 1 main mxv.par.exe 1 <many more lines deleted> driver_mxv mxv.par.exe 4 getopt libc.so.6 4 madvise libc.so.6 4 mempcpy ld-linux-x86-64.so.2 4 mprotect libc.so.6 4 mxv_core mxv.par.exe 4 pthread_create@@GLIBC_2.2.5 libpthread.so.0 4 <many more lines deleted> _int_free libc.so.6 1007 malloc libc.so.6 1009 _int_malloc libc.so.6 1012 cfree libc.so.6 1015 strcmp ld-linux-x86-64.so.2 1205 __drand48_iterate libc.so.6 500000 drand48 libc.so.6 500000 erand48_r libc.so.6 500000 $   The output lists every function that is part of the dynamic call tree of this program, the module it is part of, and how many times the function is called. The list is sorted by default with respect to the function call count. 
The functions from module mxv.par.exe are part of the user source code. The other functions are from shared libraries. We know that some of these, e.g. drand48(), are called directly by the application, but the majority of these library functions are called indirectly. To make things a little more complicated, a function like malloc() is called directly by the application, but may also be executed by library functions deeper in the call tree. From the above output we cannot make such a distinction. Note that the DTrace functions stack() and/or ustack() could be used to get callstacks to see the execution path(s) where the calls originate from. In many cases this feature is used to zoom in on a specific part of the execution flow and therefore restricted to a limited set of probes. A Refined DTrace Program While the D program shown above is correct, the list with all functions that are called is quite long, even for this simple application. Another drawback is that there are many probes that trigger, slowing down program execution. In the second version of our D program, we'd like to restrict the list to user functions called from the executable mxv.par.exe. We also want to format the output, print a header and display the function list in alphabetical order. The modified version of the D program is shown below: 1 #!/usr/sbin/dtrace -s 2 3 #pragma D option quiet 4 #pragma D option aggsortkey=1 5 #pragma D option aggsortkeypos=0 6 7 BEGIN { 8 printf("\n======================================================================\n"); 9 printf(" Function Call Count Statistics\n"); 10 printf("======================================================================\n"); 11 } 12 pid$target:a.out::entry 13 { 14 @call_counts_per_function[probefunc] = count(); 15 } 16 END { 17 printf("%-40s %12s\n\n", "Function name", "Count"); 18 printa("%-40s %@12lu\n", @call_counts_per_function); 19 } Two additional pragmas appear at lines 4-5. The pragma at line 4 enables sorting the aggregations by a key and the next one sets the key to the first field, the name of the function that triggered the probe. The BEGIN probe is unchanged, but the probe spanning lines 12-15 has two important differences compared to the similar probe used in the first version of our D program. At line 12, we use a.out for the name of the module. This is an alias for the module name in the pid probe. It is replaced with the name of the target executable, or command, to be traced. In this way, the D program does not rely on a specific name for the target. The second change is at line 14, where the use of the probemod built-in variable has been removed because it is no longer needed. By design, only functions from the target executable trigger this probe now. The END probe has also been modified. At line 17, a statement has been added to print the header. The printa statement at line 18 has been extended with a format string to control the layout. This string is optional, but ideally suitable to print (a selection of) the fields of an aggregation. We know the first field is a string and the result is a 64 bit unsigned integer number, hence the use of the %s and %lu formats. The thing that is different compared to a regular printf format string in C/C++ is the use of the "@" symbol. This is required when printing the result of an aggregation function. Below is the output using the modified D program. The command to invoke this script is exactly the same as before. 
====================================================================== Function Call Count Statistics ====================================================================== Function name Count allocate_data 1 check_results 1 determine_work_per_thread 4 driver_mxv 4 get_user_options 1 get_workload_stats 1 init_data 1 main 1 mxv_core 4 my_timer 2 print_all_results 1 The first thing to note is that with 11 entries, the list is much shorter. By design, the list is alphabetically sorted with respect to the function name. Since we no longer trace every function called, the tracing overhead has also been reduced substantially. A DTrace Program with Support for Multithreading With the above D program one can easily see how often our functions are executed. Although our goal of counting user function calls has been achieved, we'd like to go a little further. In particular, to provide statistics on the multithreading characteristics of the target application:   Print the name of the executable that has been traced, as well as the total number of calls to user defined functions. Print how many function calls each thread executed. This shows whether all threads approximately execute the same number of function calls. Print a function list with the call counts for each thread. This allows us to identify those functions executed sequentially and also provides a detailed comparison to verify load balancing at the level of the individual functions.   The D program that implements this additional functionality is shown below. 1 #!/usr/sbin/dtrace -s 2 3 #pragma D option quiet 4 #pragma D option aggsortkey=1 5 #pragma D option aggsortkeypos=0 6 7 BEGIN { 8 printf("\n======================================================================\n"); 9 printf(" Function Call Count Statistics\n"); 10 printf("======================================================================\n"); 11 } 12 pid$target:a.out:main:return 13 { 14 executable_name = execname; 15 } 16 pid$target:a.out::entry 17 { 18 @total_call_counts = count(); 19 @call_counts_per_function[probefunc] = count(); 20 @call_counts_per_thr[tid] = count(); 21 @call_counts_per_function_and_thr[probefunc,tid] = count(); 22 } 23 END { 24 printf("\n============================================================\n"); 25 printf("Name of the executable : %s\n" , executable_name); 26 printa("Total function call counts : %@lu\n", @total_call_counts); 27 28 printf("\n============================================================\n"); 29 printf(" Aggregated Function Call Counts\n"); 30 printf("============================================================\n"); 31 printf("%-40s %12s\n\n", "Function name", "Count"); 32 printa("%-40s %@12lu\n", @call_counts_per_function); 33 34 printf("\n============================================================\n"); 35 printf(" Function Call Counts Per Thread\n"); 36 printf("============================================================\n"); 37 printf("%6s %12s\n\n", "TID", "Count"); 38 printa("%6d %@12lu\n", @call_counts_per_thr); 39 40 printf("\n============================================================\n"); 41 printf(" Thread Level Function Call Counts\n"); 42 printf("============================================================\n"); 43 printf("%-40s %6s %10s\n\n", "Function name", "TID", "Count"); 44 printa("%-40s %6d %@10lu\n", @call_counts_per_function_and_thr); 45 } The first 11 lines are unchanged. Lines 12-15 define an additional probe that looks remarkably similar to the probe we have used so far, but there is an important difference. 
The wild card for the function name is gone and instead we specify main explicitly. That means this probe only fires upon entry of the main program. This is exactly what we want here, because this probe is only used to capture the name of the executable. It is available through the built-in variable execname. Another minor difference is that this probe triggers upon the return from this function. This is purely for demonstration purposes, because the same result would be returned if the trigger was on the entry to this function. One may wonder why we do not capture the name of the executable in the BEGIN probe. After all, it fires at the start of the tracing process and only once. The issue is that at this point in the tracing, execname does not return the name of the executable, but the file name of the D program. The probe used in the previous version of the D program has been extended to gather more statistics. There are now four aggregations at lines 18-21:   At line 18 we simply increment the counter each time this probe triggers. In other words, aggregation @total_call_counts contains the total number of function calls. The statement at line 19 is identical to what was used in the previous version of this probe. At line 20, the tid built-in variable is used as the key into an aggregation called @call_counts_per_thr. This variable contains the integer id of the thread triggering the probe. The count() aggregation function is used as the value. Therefore this statement counts how many function calls a specific thread has executed. Another aggregation called @call_counts_per_function_and_thr is used at line 21. Here we use both the probefunc and tid built-in variables as a key. Again the count() aggregation function is used as the value. In this way we break down the number of calls from the function(s) triggering this probe by the thread id.   The END probe is more extensive than before and spans lines 23-45. There are no new features or constructs though. The aggregations are printed in a similar way and the "@" symbol is used in the format string to print the results of the aggregations. The results of this D program are shown below. 
====================================================================== Function Call Count Statistics ====================================================================== ============================================================ Name of the executable : mxv.par.exe Total function call counts : 21 ============================================================ Aggregated Function Call Counts ============================================================ Function name Count allocate_data 1 check_results 1 determine_work_per_thread 4 driver_mxv 4 get_user_options 1 get_workload_stats 1 init_data 1 main 1 mxv_core 4 my_timer 2 print_all_results 1 ============================================================ Function Call Counts Per Thread ============================================================ TID Count 20679 13 20680 2 20681 2 20682 2 20683 2 ============================================================ Thread Level Function Call Counts ============================================================ Function name TID Count allocate_data 20679 1 check_results 20679 1 determine_work_per_thread 20679 4 driver_mxv 20680 1 driver_mxv 20681 1 driver_mxv 20682 1 driver_mxv 20683 1 get_user_options 20679 1 get_workload_stats 20679 1 init_data 20679 1 main 20679 1 mxv_core 20680 1 mxv_core 20681 1 mxv_core 20682 1 mxv_core 20683 1 my_timer 20679 2 print_all_results 20679 1 Right below the header, the name of the executable (mxv.par.exe) and the total number of function calls (21) are printed. This is followed by the same table we saw before. The second table is titled "Function Call Counts Per Thread". The data confirms that 5 threads have been active. There is one master thread and it creates the other four threads. The thread ids are in the range 20679-20683. Note that these numbers are not fixed. A subsequent run most likely shows different numbers. What is presumably the main thread executes 13 function calls. The other four threads execute two function calls each. These numbers don't tell us much about what is really going on under the hood and this is why we generate a third table titled "Thread Level Function Call Counts". The data is sorted with respect to the function names. What we see in this table is that the main thread executes all functions, other than driver_mxv and mxv_core. These two functions are executed by the four threads that have been created. We also see that function determine_work_per_thread is called four times by the main thread. This function is used to compute the amount of work to be executed by each thread. In a more scalable design, this should be handled by the individual threads. Function my_timer is executed twice by the main thread. That is because this function is called at the start and end of the matrix-vector multiplication. While this table shows the respective thread ids, it is not immediately clear which function(s) each thread executes. It is not difficult to create a table that shows the sorted thread ids in the first column and the function names, as well as the respective counts, next to the ids. This is left as an exercise to the reader. There is one more thing we would like to mention. While the focus has been on the user written functions, there is no reason why other functions cannot be included. For example, we know this program uses the Pthreads library libpthreads.so. 
In case functions from this library should be counted as well, a one line addition to the main probe is sufficient: 1 pid$target:a.out::entry, 2 pid$target:libpthread.so:pthread_*:entry 3 { 4 @total_call_counts = count(); 5 @call_counts_per_function[probefunc] = count(); 6 @call_counts_per_thr[tid] = count(); 7 @call_counts_per_function_and_thr[probefunc,tid] = count(); 8 } The differences are in lines 1-2. Since we want to use the same actions for both probes, we simply place them back to back, separated by a comma. The second probe specifies the module (libpthread.so), but instead of tracing all functions from this library, for demonstration purposes we use a wild card to only select function names starting with pthread_. Additional Reading Material The above examples, plus the high level coverage of the DTrace concepts and terminology, are hopefully sufficient to get started. More details are beyond the scope of this article, but luckily, DTrace is very well documented. For example, the Oracle Linux DTrace Guide, covers DTrace in detail and includes many short code fragments. In case more information is needed, there are many other references and examples. Regarding the latter, the Oracle DTrace Tutorial contains a variety of example programs.

This blog entry was provided by Ruud van der Pas   Introduction DTrace is often positioned as an operating system analysis tool for the system administrators, but it has a wider use than this. In...

Linux Kernel Development

Writing the Ultimate Locking Check

In this blog post, Oracle Linux kernel developer Dan Carpenter discusses a recent improvement that he made to his static analyzer Smatch to look for locking errors. Introduction One common type of bug is when programmers forget to unlock on an error path and it leads to a system hang later when the kernel tries to re-take the lock. These kinds of bugs hard to catch in testing because they happen on the failure paths but they're ideally suited for static analysis. Static analysis tools look at the source code to find bugs instead of doing it through testing. I wrote a static analysis tool called Smatch and the website for it is here: https://github.com/error27/smatch In theory a clever programmer could discover all the bugs in a piece of software just by examining it carefully, but in reality humans can't keep track of everything and they get distracted easily. A computer could use the same logic and find the bugs through static analysis. There are two main limitations for static analysis. The first is that it is hard to know the difference between a bug and feature. Here we're going to specify that holding a lock for certain returns is a bug. This rule is generally is true but occasionally the kernel programmers hold a lock deliberately. The second limitation is that to understand the code, sometimes you need to understand how the variables are related to each other. It's difficult to know in advance which variables are related and it's impossible to track all the relationships without running out of memory. This will become more clear later. Nevertheless, static analysis can find many bugs so it is a useful tool. Many static analysis tools have a check for locking bugs. Smatch has had one since 2002 but it wasn't exceptional. My first ten patches in the Linux kernel git history fixed locking bugs and I have written hundreds of these fixes in the years since. When Smatch gained the ability to do cross function analysis in 2010, I knew that I had to re-write the locking check to take advantage of the new cross function analysis feature. When you combine cross function analysis with top of the line flow analysis available and in depth knowledge of kernel locks then the result is the Ultimate Locking Check! Unfortunately, I have a tendency towards procrastination and it took me a decade to get around to it, but it is done now. This blog will step through how the locking analysis works. Locking Functions The kernel uses the __acquires() and __releases() annotations to mark the locking functions. Smatch ignores these. Partly it is for legacy reasons but it's also because the locking annotations are a bit clumsy. Not all locks are annotated so it was never going to be a complete solution. The annotations serve as a sort of cross function markup, but since Smatch can do cross function analysis directly they aren't required. Instead Smatch has a table which describes the locking functions. 
The locking table has around 300 entries which look like this: code{white-space: pre-wrap;} span.smallcaps{font-variant: small-caps;} span.underline{text-decoration: underline;} div.column{display: inline-block; vertical-align: top; width: 50%;} div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;} ul.task-list{list-style: none;} pre > code.sourceCode { white-space: pre; position: relative; } pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } code.sourceCode > span { color: inherit; text-decoration: inherit; } div.sourceCode { margin: 1em 0; } pre.sourceCode { margin: 0; } @media screen { div.sourceCode { overflow: auto; } } @media print { pre > code.sourceCode { white-space: pre-wrap; } pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } pre.numberSource code > span { position: relative; left: -4em; counter-increment: source-line; } pre.numberSource code > span > a:first-child::before { content: counter(source-line); position: relative; left: -1em; text-align: right; vertical-align: baseline; border: none; display: inline-block; -webkit-touch-callout: none; -webkit-user-select: none; -khtml-user-select: none; -moz-user-select: none; -ms-user-select: none; user-select: none; padding: 0 4px; width: 4em; color: #aaaaaa; } pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; } div.sourceCode { } @media screen { pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; } } code span.al { color: #ff0000; font-weight: bold; } /* Alert */ code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */ code span.at { color: #7d9029; } /* Attribute */ code span.bn { color: #40a070; } /* BaseN */ code span.bu { } /* BuiltIn */ code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */ code span.ch { color: #4070a0; } /* Char */ code span.cn { color: #880000; } /* Constant */ code span.co { color: #60a0b0; font-style: italic; } /* Comment */ code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */ code span.do { color: #ba2121; font-style: italic; } /* Documentation */ code span.dt { color: #902000; } /* DataType */ code span.dv { color: #40a070; } /* DecVal */ code span.er { color: #ff0000; font-weight: bold; } /* Error */ code span.ex { } /* Extension */ code span.fl { color: #40a070; } /* Float */ code span.fu { color: #06287e; } /* Function */ code span.im { } /* Import */ code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */ code span.kw { color: #007020; font-weight: bold; } /* Keyword */ code span.op { color: #666666; } /* Operator */ code span.ot { color: #007020; } /* Other */ code span.pp { color: #bc7a00; } /* Preprocessor */ code span.sc { color: #4070a0; } /* SpecialChar */ code span.ss { color: #bb6688; } /* SpecialString */ code span.st { color: #4070a0; } /* String */ code span.va { color: #19177c; } /* Variable */ code span.vs { color: #4070a0; } /* VerbatimString */ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */ {"down_write_trylock", LOCK, write_lock, 0, ret_one}, {"down_write_killable", LOCK, write_lock, 0, ret_zero}, {"write_unlock", UNLOCK, write_lock, 0, ret_any}, The fields are mostly self explanatory. The type was originally supposed to be used to warn about places that call schedule() while holding a spin_lock. 
Currently, it's only used to warn about nesting calls where some type of locks are allow to be nested, like bottom half locks, and some like spin_lock_irq() are not. The "0" means that the first (zeroeth) argument to down_write_trylock(&my_lock) is the lock variable. The "ret_one" means if down_write_trylock(&my_lock) returns one the lock is acquired but "ret_zero" for down_write_killable(&my_lock) a zero return means the lock is acquired. Loading the Locking Table Information The lock table is loaded into the function hooks: if (lock->return_type == ret_zero) { return_implies_state(lock->function, 0, 0, &match_lock_held, idx); return_implies_state(lock->function, -4095, -1, &match_lock_failed, idx); } In this code sample, "lock->function" is down_write_killable(). The next two parameters are a range of possible returns. The "0, 0" range means the lock is held. The "-4095, -1" represents the range of normal error codes in the kernel and these returns means that we weren't able to take the lock. This works for almost everything except for one caller which does this: if (mutex_lock_interruptible(&my_lock) == -EINTR) return; Here only the -EINTR error code is handled. When we build the cross function database, Smatch thinks mutex_lock_interruptible() can return either -EINTR or -EALREADY. So when Smatch parses this code it prints a warning that the -EALREADY return wasn't handled correctly: drivers/scsi/foo.c:1374 reset_fn() error: double unlocked '&my_lock' (orig line 1369) The work around for this is to add a line in smatch_data/db/kernel.return_fixes mutex_lock_interruptible (-35),(-4) (-4) After the database is rebuilt, the .return_fixes file is used to update the database and changes the return from "(-35),(-4)" to "(-4)". Going back to the function hooks code, the final two arguments are a pointer to the match_lock_held() function which takes "idx" as an argument. The simplified code for match_lock_held() looks like this: static void do_lock(const char *name, struct symbol *sym) { struct sm_state *sm; add_tracker(&locks, my_id, name, sym); sm = get_sm_state(my_id, name, sym); if (!sm) set_start_state(name, sym, &unlocked); warn_on_double(sm, &locked); set_state(my_id, name, sym, &locked); } static void match_lock_held(const char *fn, struct expression *call_expr, struct expression *assign_expr, void *_index) { int index = PTR_INT(_index); char *name; struct symbol *sym; name = get_full_name(call_expr, index, &sym); do_lock(name, sym); } The name is "my_lock". The add_tracker() function keeps a list of locks we have touched in this function. If this is the first time we have seen "my_lock" then we record that "my_lock" started as unlocked. The warn_on_double() function prints a warning if it was already locked. And finally, we set the state of "my_lock" to &locked. States, SM States, and Strees There are three important struct types to understand this code: smatch_state, sm_state and stree. The smatch_states in this function are &locked and &unlocked. The states are declared at the start of the check_locking.c file. STATE(locked); STATE(unlocked); STATE(restore); STATE(start_state); STATE(impossible); There are two additional global smatch_states that which are &undefined and &merged. The sm_state struct links a variable to a smatch_state. 
struct sm_state { const char *name; // <<-- variable ("my_lock") struct symbol *sym; unsigned short owner; // <-- check_locking.c unsigned short merged:1; unsigned int line; struct smatch_state *state; // <<-- state ("&locked") struct stree *pool; // <-- where this state was created struct sm_state *left; struct sm_state *right; struct state_list *possible; // <-- possible states ("&locked") }; In this case, the variable is "my_lock" and the smatch_state is &locked. If the smatch_state is &merged then we could look at the list of sm->possible states to see if "my_lock" is ever &locked/&unlocked at this point. The sm_state struct also has a pointer to the stree where the sm_state was created. Finally, it has left and right pointers which point to previous sm_states and previous strees if smatch_state is &merged. A stree is a group of sm_states. The name comes from "states" which are stored in a "tree". The "cur_stree" is the current set of all the sm_states that Smatch is tracking. The sm_state has the "owner" field so that we can identify the locking states out of the "cur_stree" and ignore the other checks. In programming languages we have branches and merges: if (something) { // <-- branch ... } // <-- merge When a merge happens, the strees from both side are frozen and preserved until the end of the function and a new merged stree is created. We saw earlier that each sm_state has links to previous strees. These links let us manipulate strees in useful ways. We can ask "assume mutex_lock_interruptible() returned -5", then Smatch looks through the history and returns the stree based on that assumption. The code for this is in smatch_implied.c. In that returned stree "my_lock" would be &unlocked. In this way, the stree represents connection between states and the relationship between the return value and the locked state. Printing Warning Messages To get back to the locking check, after a function has been parsed we look at all the locks from add_tracker(). For each lock, if it is held on some success paths but not others, or if it is held on an error path but not on the success path then print a warning message. We can ignore cases where the lock is always held on the success path because those will be always caught in testing. The simplified code to implement that looks like: static void check_lock(char *name, struct symbol *sym) { int locked_buckets[NUM_RETURN_BUCKETS] = {}; int unlocked_buckets[NUM_RETURN_BUCKETS] = {}; struct smatch_state *state, *return_state; struct stree *stree, *orig; int bucket; FOR_EACH_PTR(get_all_return_strees(), stree) { orig = __swap_cur_stree(stree); return_state = get_state(RETURN_ID, "return_ranges", NULL); state = get_state(my_id, name, sym); if (!return_state || !state) goto swap_stree; if (state != &locked && state != &unlocked) goto swap_stree; bucket = success_fail_positive(estate_rl(return_state)); if (state == &locked) locked_buckets[bucket] = true; else unlocked_buckets[bucket] = true; swap_stree: __swap_cur_stree(orig); } END_FOR_EACH_PTR(stree); if (locked_buckets[SUCCESS] && unlocked_buckets[SUCCESS]) goto complain; if (locked_buckets[FAIL] && unlocked_buckets[SUCCESS])) goto complain; if (locked_buckets[ERR_PTR]) goto complain; return; complain: sm_msg("warn: inconsistent returns '%s'", name); } static void match_func_end(struct symbol *sym) { struct tracker *tracker; FOR_EACH_PTR(locks, tracker) { check_lock(tracker->name, tracker->sym); } END_FOR_EACH_PTR(tracker); } One new feature introduced in these functions is "estate_rl(return_state)". 
An "estate" is a "smatch extra" state. The "extra" naming seems silly now because smatch_extra.c is the core feature of Smatch. The smatch_extra.c module tracks all the possible values of the variables used in a function. The "rl" in estate_rl() stands for struct range_list which is how Smatch represents a range of numbers like "(-35),(-4),0-1". This "estate_rl(return_state)" code is looking at the returned values to decide if it is an error path or a success path. The Cross Function Database When a function is called, we record information from the current stree in the caller_info table. Each call becomes a single stree in the called function. Unfortunately that means that the relationships between variables and links to previous strees are lost. For example, sometimes developers will write code like: if (pointer && !size_from_user_is_valid(size)) return -EINVAL; some_function(pointer, size); For the programmer, it's obvious that if "pointer" is NULL then we do not care about the size. But when Smatch records this in the database the relationship is lost and the call is flattened to a single stree. Smatch only knows that pointer can be NULL or non-NULL and that the size has not necessarily been checked. What I'm saying is don't write code like that. Always check the sizes from the user even when the pointer is not used. Do this: if (!size_from_user_is_valid(size)) return -EINVAL; some_function(pointer, size); Similarly, each return is represented in the cross function database as a stree. Actually, returns are more complicated because often we need to split a single return into multiple strees. For example, if we are parsing code like this: ret = 0; fail_unlock: mutex_unlock(&my_lock); return ret; The example code uses one return statement but Smatch would probably try split it up into success vs failure path. There are around ten different approaches or heuristics that Smatch uses to split returns into meaningful strees. Sometimes there are too many states and the return cannot be split. We need to parse the return information quickly so there are hard limits on how much data we can record in the database. Storing Information in the Cross Function Database The first step to store locking information in the cross function database is to add some new enum info_type types to smatch.h LOCKED = 8020, UNLOCKED = 8021, LOCK_RESTORED = 9023, The numbers are the type used in the database. They don't mean anything. They are out of order because I didn't realized until later that LOCK_RESTORED was required. LOCK_RESTORED is for irqrestore because restoring is not necessarily the same as enabling the IRQs. The Smatch database has a number of different tables but the lock check only uses the return_states table. The code to insert data into return_states looks like this: static void match_return_info(int return_id, char *return_ranges, struct expression *expr) { struct sm_state *sm; const char *param_name; int param; FOR_EACH_MY_SM(my_id, __get_cur_stree(), sm) { if (sm->state != &locked && sm->state != &unlocked && sm->state != &restore) continue; if (sm->state == get_start_state(sm) continue; param = get_param_lock_name(sm, expr, ¶m_name); sql_insert_return_states(return_id, return_ranges, get_db_type(sm), param, param_name, ""); } END_FOR_EACH_SM(sm); } The "return_id" is a unique ID and "return_ranges" is a string like "(-4095)-(-1)". The "expr" is returned value. This function iterates through all the locks and if they have changed then it records that in the return_states table. 
The get_db_type() function returns LOCKED or UNLOCKED that we added to info_type. If the function returns a struct holding the lock then param is -1, otherwise it's the parameter which holds the lock. The param_name is going to be something like "$->lock" where the dollar sign is a wild card because the callers might call the parameter different names. Reading from the Database That's how we insert locking information into return_states and the code to select it looks like this: static void db_param_locked_unlocked(struct expression *expr, int param, char *key, char *value, enum action lock_unlock) { struct expression *call, *arg; char *name; struct symbol *sym; call = expr; while (call->type == EXPR_ASSIGNMENT) call = strip_expr(call->right); if (call->type != EXPR_CALL) return; if (func_in_lock_table(call->fn)) return; if (param == -1) { if (expr->type != EXPR_ASSIGNMENT) return; name = get_variable_from_key(expr->left, key, &sym); } else { arg = get_argument_from_call_expr(call->args, param); if (!arg) return; name = get_variable_from_key(arg, key, &sym); } if (!name || !sym) goto free; if (lock_unlock == LOCK) do_lock(name, sym); else if (lock_unlock == UNLOCK) do_unlock(name, sym); else if (lock_unlock == RESTORE) do_restore(name, sym); free: free_string(name); } In this code "expr" can either be an assignment like "ret = spin_trylock(&my_lock);" or it can be a function call like "spin_lock(&my_lock);". The "param" variable is the parameter that is locked. If param is -1 that means the returned pointer is locked. The other thing to note is the check: if (func_in_lock_table(call->fn)) return; Functions such as spin_lock_irq() are in both the database and the function table so they were counted as two locks in a row and triggered a double lock warning. This check means we ignore information from the database when we have the locking information in both places. In an ideal world this would be the end of the story. But it is only the beginning. Guess work The first problem is that there are some functions where it's hard to tie the lock to a specific parameter. Perhaps the lock is a global variable, or the parameter might be a key and we have to look up the lock in a hash table. Or maybe we have two pointers which point to the same lock. The work around for this is that if the caller cannot tie a lock to a parameter, then it returns that the parameter is -2. In the caller, if the parameter is -2 or it otherwise fails to match an unlock to a lock then it uses the get_best_match() function to find the correct lock. The get_best_match() function looks takes a lock name such as "$->foo->bar->baz" and tries to find a known lock which ends with "->bar->baz". Bugs in Smatch Another early problem that I faced was parsing code like: sound/isa/gus/gus_mem.c 18 void snd_gf1_mem_lock(struct snd_gf1_mem * alloc, int xup) 19 { 20 if (!xup) { 21 mutex_lock(&alloc->memory_mutex); 22 } else { 23 mutex_unlock(&alloc->memory_mutex); 24 } 25 } This function locks or unlocks depending on the "xup" parameter. These type of locking functions are normally small so Smatch parses them inline. This raised a problem because if you have a literal zero, Smatch treats it as known, but if you have a variable set to zero Smatch treats it as only "sort of known". The caller is passing literal values to this function but they are assigned to "xup" and downgraded to only sort of known. I made known inline parameters a special case where the "sort of known" values get promoted to "all the way known". 
I ran into a number of other general bugs in Smatch. Here is an example of some code that was hard to parse. I have snipped away the irrelevant lines. drivers/scsi/libsas/sas_port.c 108 for (i = 0; i < sas_ha->num_phys; i++) { 109 port = sas_ha->sas_port[i]; 110 spin_lock(&port->phy_list_lock); 111 if (...) { 116 break; 117 } 118 spin_unlock(&port->phy_list_lock); 119 } 121 if (i == sas_ha->num_phys) { 133 } 134 135 if (i >= sas_ha->num_phys) { 138 return; 139 } The problem is that Smatch isn't sure if we are holding the "port->phy_list_lock" when we return on line 138. The Smatch subsystem to handle comparisons where the values of the variables are unknown is smatch_comparison.c. It was not preserving the links to previous strees correctly and the "i == sas_has->num_phys" comparison over wrote the links to the previous strees from the for loop. I also discovered a bug in Smatch flow analysis where if we were in unreachable code and encountered a switch statement then the code was marked reachable again. That resulted in a bug where Smatch recorded that a return unlocked but really that return was unreachable. A different bug was that Smatch did not handle conditional returns correctly when the conditional was a function. return foo() ?: ({ spin_lock(&my_lock; 0; }); This style of return worked where the condition was a variable but not when it was a function. I had to create an module called smatch_parsed_conditions.c to handle this. Work Arounds There were other inlines that were tricky to handle: drivers/md/bcache/btree.h 228 static inline void rw_lock(bool w, struct btree *b, int level) 229 { 230 w ? down_write_nested(&b->lock, level + 1) 231 : down_read_nested(&b->lock, level + 1); 232 if (w) 233 b->seq++; 234 } In this case, I just added rw_lock() and rl_unlock() to the locking table (as mentioned earlier the locking table trumps the database). It is not perfect solution and there are still a couple false positives related to rw_lock(). One function that was especially difficult was the dma_resv_lock() function. It returns zero on success or -EINTR on failure, however if the second argument is NULL then it can't fail. The dma_resv_lock() is called over a hundred times and generated a lot of false positives. I wrote a special return_implies hook to handle it: static void match_dma_resv_lock_NULL(const char *fn, struct expression *call_expr, struct expression *assign_expr, void *_index) { struct expression *lock, *ctx; char *lock_name; struct symbol *sym; lock = get_argument_from_call_expr(call_expr->args, 0); ctx = get_argument_from_call_expr(call_expr->args, 1); if (!expr_is_zero(ctx)) return; lock_name = lock_to_name_sym(lock, &sym); if (!lock_name || !sym) goto free; do_lock(lock_name, sym, NULL); free: free_string(lock_name); } It just tests if the "ctx" parameter is NULL and sets the state to &locked if it is. What happens is the standard code sets this to a merged locked and unlocked state and then this function immediately over writes it to say that it's locked. I considered other fixes such as marking the -EINTR path as impossible. Those approaches are valid but this seemed easiest. One place where Smatch struggles is if the function calls a callback in a different thread. My solution was to create a file with manual locking annotations in smatch_data/db/kernel.insert.return_states These entries are inserted into the cross function database after it is rebuilt. 
# The format to insert is: # file (optional), function, "return values" | type, param, "key", "value" # mlx5_cmd_comp_handler, "" | 8021, -2, "*sem", "" smc_connect_abort, "1-s32max[==$1]" | 8021, -2, "smc_client_lgr_pending", "" smc_connect_abort, "s32min-(-12),(-10)-(-1)[==$1]" | 8021, -2, "smc_client_lgr_pending", "" smc_connect_abort, "1-s32max[==$1]" | 8021, -2, "smc_server_lgr_pending", "" smc_connect_abort, "s32min-(-12),(-10)-(-1)[==$1]" | 8021, -2, "smc_server_lgr_pending", "" dlfb_submit_urb, "0" | 8021, 0, "$->urbs.limit_sem", "" There are a few locking functions which are just strange. drivers/char/ipmi/ipmi_ssif.c 314 static unsigned long *ipmi_ssif_lock_cond(struct ssif_info *ssif_info, 315 unsigned long *flags) 316 { 317 spin_lock_irqsave(&ssif_info->lock, *flags); 318 return flags; 319 } Both the "flags" parameter and the returned "ret_flags" represent the saved IRQ flags, but the callers always use the returned values. The database only records that the flags are set in the parameter. To work around this problem I added a line to smatch_data/db/fixup_kernel.sh to change it to the returned value. update return_states set parameter = -1, key = '\$' where function = 'ipmi_ssif_lock_cond' and type = 8020 and parameter = 1; I had to add the "ipmi_ssif_lock_cond" function to the smatch_data/kernel.no_inline_functions file so that Smatch would use the modified information from the disk instead of parsing it inline. Inline functions use the same code as the on-disk database, but they work using an in-memory database. Parsing inline functions is normally transparent, but sometimes the in-memory database will return different, hopefully more accurate information than the on-disk database. Changes from smatch_data/db/kernel.return_fixes will be applied to the in memory database, but raw SQL commands from the smatch_data/db/fixup_kernel.sh only affect the on-disk database. Finally, when all else failed I decided to just silence some warnings so I created a false positive table. static const char *false_positives[][2] = { {"fs/jffs2/", "->alloc_sem"}, {"fs/xfs/", "->b_sema"}, }; The Debugging Process Writing a Smatch check is an iterative process. I started with a basic heuristic that forgetting to unlock on an error path should generate a warning. Then I tested my code. Then I tried to fix the issues one at a time. There three main debugging methods I used. The first is to use the --debug=check_locking option which prints all the locking state transitions. The second option is to add "#include "/path/to/smatch/check_debug.h" to the parsed file and __smatch_sates("check_locking"); at the appropriate lines. Like this: arch/x86/mm/init.c 710 #include "/home/dcarpenter/smatch/check_debug.h" ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 711 void __init poking_init(void) 712 { 713 spinlock_t *ptl; 714 pte_t *ptep; 715 716 poking_mm = copy_init_mm(); [ snip ] 737 ptep = get_locked_pte(poking_mm, poking_addr, &ptl); 738 __smatch_states("check_locking"); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 739 BUG_ON(!ptep); 740 __smatch_states("check_locking"); 741 pte_unmap_unlock(ptep, ptl); 742 __smatch_states("check_locking"); When I run Smatch it displays: arch/x86/mm/init.c:738 poking_init() [check_locking] *ptl = 'merged' [merged] (locked, merged, start_state) arch/x86/mm/init.c:740 poking_init() [check_locking] *ptl = 'locked' [merged] arch/x86/mm/init.c:742 poking_init() [check_locking] *ptl = 'unlocked' This output represents the correct fixed output after I had finished debugging the issue. 
The other key debugging tool is the local_debug flag. Include the check_debug.h as in the previous example, then add: 736 __smatch_local_debug_on(); 737 ptep = get_locked_pte(poking_mm, poking_addr, &ptl); 738 __smatch_local_debug_off(); Then add the appropriate debug printfs to the Smatch check: if (local_debug) sm_msg("expr = '%s' key = '%s'", expr_to_str(expr), key); There is a fourth option to use __smatch_debug_on/off() which prints every state change for every single module but if you have to resort to that then you should probably just give up. Future Work There were a few problems which I wasn't able to fix. One is that Smatch doesn't handle call backs correctly: if (kref_put(&zhdr->refcount, release_z3fold_page_locked)) { In this code release_z3fold_page_locked() unlocks zhdr->page_lock but Smatch does not see it. It should be possible for Smatch to parse kref_put() correctly but I have not implemented that code yet. Another problem is that Smatch doesn't parse bitwise logic correctly like: if (test_bit(PAGE_HEADLESS, &page->private)) spin_lock(&lock); ... if (test_bit(PAGE_HEADLESS, &page->private)) spin_unlock(&lock); In this code Smatch doesn't know that the lock is always unlocked at the end. Handling bitwise logic is not necessarily difficult to do but it is a lot of code to write so I haven't gotten around to it. I knew that this was a problem before but as I wrote this code I was able to silence around 90% of the false positives. As the false positives got fewer the perceived seriousness went from "Bitwise logic only causes a few false positives" to "This a major cause of the remaining false positives". A second idea to silence that warning would be to mark test_bit() as a pure function which doesn't have side effects. With pure functions, if the parameters are the same, then the result is the same so we can pair the two conditions nicely. I believe that GCC does this sort of analysis. The kernel uses assertions such as spin_is_locked(). It would be nice to add a warning to say if these tests can fail. Another potential use would be to print a warning when data structures are accessed without holding a lock. We would need some annotation to tie the lock to the data it protects. Summary Looking back, the original Smatch locking check makes me cringe. It was the first check I wrote and I made some design errors. I'm very happy with the new check. It is still new so there is likely to be the occasional embarrassing bug, but overall the approach is sound and the bugs can now be fixed one at a time without doing a another rewrite. The original Smatch locking check generated 98 warnings about inconsistent locking. Kernel developers have fixed all the real bugs so all 98 warnings are false positives. The new check has 28 warnings all together. Around half of the new warnings look like real bugs. They are mostly minor bugs in old code and don't likely have a noticeable real life impact. However, this check will be used for many years to come and I expect it will lead to hundreds of bug fixes.

In this blog post, Oracle Linux kernel developer Dan Carpenter discusses a recent improvement that he made to his static analyzer Smatch to look for locking errors. Introduction One common type of bug...

Announcements

Oracle’s Linux Team Wishes the Java Community a Happy 25th

Thanks to Kurt Goebel and Van Okamura for their help with this post.   From one open source community to another, Oracle’s Linux team would like to congratulate the Java community on its 25th anniversary! Java has an impressive history. It was a breakthrough in programming languages, allowing developers to write once and have code run anywhere. And, it has enabled developers to create a myriad of innovative solutions that help run our world. Read Georges Saab’s post to learn more. Both open source technologies, Java and Linux benefit from communities that collectively drive their advancements. While the technologies aren’t similar, there are areas where both work together and complement each other. One area is Java’s support for Linux HugePages. Using Linux HugePages can improve system performance by reducing the amount of resources needed to manage memory. The result of less overhead in the system means more resources are available for Java and the Java app, which can make both run faster. Another area is OpenJDK. It has been shipping with Linux distributions continuously and every Linux distribution has Java support out of the box. Linux was and is ubiquitous on a wide range of hardware platforms. As part of Linux, OpenJDK and Java are also running on many different hardware architectures. This helped to bring a Java ecosystem to embedded devices. Today, Java and Linux are used in virtually all industries and on everything from laptops to data centers, clouds to satellites, game consoles to scientific supercomputers. Here's to 25 more years of being moved by Java. From all of us (and Oracle Tux), we wish the Java community continued success. #MovedbyJava #OracleLinux  

Thanks to Kurt Goebel and Van Okamura for their help with this post.   From one open source community to another, Oracle’s Linux team would like to congratulate the Java community on its...

Linux

Btrfs on the Unbreakable Enterprise Kernel 6

In this blog we delve into the new features and enhancements for Btrfs that are available in the Unbreakable Enterprise Kernel 6, as described by Oracle Linux kernel engineer Anand Jain. Oracle's release of the Unbreakable Enterprise Kernel 6 (UEK6) is based on the Linux kernel version 5.4. In which Btrfs continues to be a fully supported file-system. Let's look at some of the notable new features and enhancements in Btrfs on UEK6. Compression level Btrfs supports three compression types, zlib, lzo and zstd and there are three hierarchical ways to set the scale of compression in a Btrs file-system. You can set the scale to encompass the entire file-system, a specific subvolume or at the file/directory level. Like for example, mount option -o compress=<type> will set the compression type at the scale of file-system, where as btrfs property set <subvolume> compression <type> will set the compression type at the scale of subvolume and btrfs property set <file|directory-path> compression <type> is used to set the compression type on the file or directory level. The compression type applies to new writes only. Existing data may be compressed using the command btrfs filesystem defragment -r -c<type> <path>. In UEK6, the compression types zlib and zstd expose the compression level as a tunable parameter. The level matches the default zlib and zstd levels. You can now set the compression level using the mount option mount -o compress=<type>:<level> as shown below. $ mount -o compress=zstd:9 /dev/sda /mnt The zlib and zstd compressoin level ranges and the default level are outlined in the following table: .divTable { display: table; width: 80%; } .divTableRow { display: table-row; } .divTableHeading { display: table-header-group; background-color: #ddd; font-weight: bold; } .divTableCell { display: table-cell; padding: 3px 10px; border: 1px solid #999999; } Type Level Default zlib 1 - 9 3 zstd 1 - 15 3   Any level specified outside of the accepted range will simply set the level to the default level without any error. System log output using the command dmesg -k can be used to determine the actual applied compression level. For example: $ dmesg -k | grep compression BTRFS info (device sda): use zstd compression, level 9 The compression speed and ratio depends on the file data. A higher level provides a better ratio at the cost of slower compression speed. At any level, the decompression-speed and memory consumption remain almost constant. The higher compression level is expected to benefit read-mostly file-systems, or when creating images. Early detection of in-compressible data Until now Btrfs used the trial and error method to determine if the file would benefit from compression. The file inode which doesn't provide any compression benefit gets the NOCOMPRESS flag and the file is written uncompressed. While the trial and error method is the most accurate, it is less efficient, and wastes a lot of CPU cycles if the data is determined to be incompressible. It wastes even more CPU cycles if the file-system is mounted with the -o compress-force option. This mount option ignores the NOCOMPRESS flag for every new write on the file. In UEK6, this trial and error method for the early detection of incompressible data is replaced with a heuristic that does repeated pattern detection, frequency sampling, and Shannon entropy calculation to find out if the file is compressible. 
Fallocate zero-range Btrfs on UEK6 adds support for fallocate zero-range (FALLOC_FL_ZERO_RANGE) and joins the other file-systems (ext4 and xfs) that support it. So now after calling fallocate(1) with the zero-range option, you can expect the blocks on the device to be zeroed. Swapfile support Btrfs didn't support swapfile because it uses bmap to make a mapping of extents in the file. The Btrfs bmap call would return logical addresses that weren't suitable for IO as they would change frequently as COW operations happen. The logical addresses could be on different devices configured as a raid, and therfore the swapfile mapping of extents in the file would be wrong. Now, with the address_space_operations activation for the swapfiles, Btrfs code is enhanced to support swapfiles. Note that using a Btrfs swapfile comes with a few restrictions. The swapfile must be fully allocated as NOCOW, compression cannot be used, and it must reside on one device. rmdir(1) a subvolume rmdir(1) is now allowed to delete an empty subvolume. The rmdir(1) call will check the necessary user permission for the delete. Non-sudo users can now fully manage a subvolume similar to a directory. $ id -u 1000 $ btrfs subvolume create /btrfs/sv1 Create subvolume '/btrfs/sv1' $ btrfs subvolume delete /btrfs/sv1 Delete subvolume (no-commit): '/btrfs/sv1' ERROR: Could not destroy subvolume/snapshot: Operation not permitted $ rmdir /btrfs/sv1 The mount option -o user_subvol_rm_allowed will continue to allow non-empty subvolume delete from a non-sudoer. Forget scanned devices You can now un-register devices previously added by a device scan. A new ioctl BTRFS_IOC_FORGET_DEV frees the previously scanned devices that are unmounted. $ btrfs device scan $ btrfs device scan --forget Out of band deduplication In UEK5, the deduplication limit for ioctl fideduperange(2) is 16 MiB, and Btrfs silently limited the deduplication to the first 16 MiB. In UEK6 deduplication is no longer limited to the first 16 MiB, this is overcome by splitting the range of out-of-band deduplication into 16 MiB chunks. Change file-system UUID instantly With Btrfs on UEK6 you can assign a new file-system UUID without overwriting all metadata blocks. The original UUID is stored as metadata_uuid in the super-block. This provides a faster way to change the file-system UUID using the btrfstune(1) command with the -M|m option as shown below: $ btrfs filesystem show /dev/sda Label: none uuid: 16a0e00c-cb98-4f44-8fb4-730bf0a32ab4 Total devices 1 FS bytes used 176.00KiB devid 1 size 12.00GiB used 20.00MiB path /dev/sda The following example shows how to change the file-system UUID: $ time btrfstune -m /dev/sda real 0m0.052s user 0m0.009s sys 0m0.009s $ btrfs filesystem show /dev/sda Label: none uuid: caa6a218-4b23-43d4-9b0a-08a42f0ddca5 Total devices 1 FS bytes used 176.00KiB devid 1 size 12.00GiB used 20.00MiB path /dev/sda Summary In this blog, we have looked at the notable enhancements and new features in Btrfs on UEK6, which also contains a number of other Brtfs stability fixes as well.

In this blog we delve into the new features and enhancements for Btrfs that are available in the Unbreakable Enterprise Kernel 6, as described by Oracle Linux kernel engineer Anand Jain. Oracle's...

Linux

IT Convergence Improves End User Experience with Quicker Server Builds, Improved SLAs and Reduced Support Costs

IT Convergence is a global applications services provider. For the past 20 years, it has offered customers Oracle solutions, such as enterprise applications like Oracle E-Business Suite. This article explores how IT Convergence built servers faster, improved its SLAs, and reduced support costs since moving to a hosted cloud services environment running on Oracle Linux and Oracle VM.   As an Oracle Platinum Partner, IT Convergence has a comprehensive service offering across all three pillars of the Cloud (IaaS, PaaS, SaaS).  It can build, manage, and optimize customer solutions.  Additionally, it can provide connectivity into Oracle Cloud Infrastructure . These solutions create value for thousands of customers globally, including one-third of Fortune 500 companies. Before IT Convergence moved their environment to Oracle Linux and Oracle VM, they had a hybrid environment running Red Hat Enterprise Linux and VMware. Upon choosing Oracle Linux, IT Convergence decided to use the Unbreakable Enterprise Kernel (UEK) for Oracle Linux as it proved particularly fast with Oracle E-Business Suite. This video interview explains how easy and painless the conversion to Oracle Linux and Oracle VM was for IT Convergence. Its teams were able to convert 2000 servers online, without any downtime or reboots, within three months. This move to Oracle Linux and Oracle VM has resulted in several benefits. By using Oracle VM and Oracle VM Templates, IT Convergence can now build servers more rapidly for its customers. What previously took 20 hours to manually build now takes the team about two hours. Oracle VM Templates are self-contained and pre-configured virtual machines of key Oracle technologies. Each Oracle VM Template is packaged using Oracle best practices, which helps eliminate installation and configuration costs, reduces risk, and dramatically shortens deployment time. Other benefits from migrating to Oracle Linux and Oracle VM are related to technical support.  IT Convergence was supporting multiple operating systems (OS) and hypervisor solutions. This added complexity when attempting to resolve support tickets across different vendors. Specifically, lots of time was spent trying to determine the root cause analysis. Consequently, the operations team was not always able to complete a support ticket within its two-hour SLA window. Given the rest of IT Convergence’s stack is largely Oracle, from the database to the applications level, using Oracle Linux and Oracle VM simplified its vendor portfolio. It also unified support across the applications, OS and hypervisors. Now, any support tickets go through a single vendor for resolution. This has improved  the team’s overall technical support SLA capabilities.   Additionally, by moving to Oracle Linux and Oracle VM Premier Support, IT Convergence saved approximately $100,000 annually. These support cost savings in turn allow IT Convergence to offer more competitive pricing to its customers.  A win-win!    We are proud to help customers like IT Convergence to improve operational capabilities and its customer offerings.  Watch this video to learn more!  

IT Convergence is a global applications services provider. For the past 20 years, it has offered customers Oracle solutions, such as enterprise applications like Oracle E-Business Suite. This article...

Linux

Getting Started With The Vagrant Libvirt Provider For Oracle Linux

Introduction As recently announced by Sergio we now support the libvirt provider for our Oracle Linux Vagrant Boxes. The libvirt provider is a good alternative to the virtualbox one when you already use KVM on your host, as KVM and VirtualBox virtualization are mutually exclusive. It is also a good choice when running Vagrant on Oracle Cloud Infrastructure. This blog post will guide you through the simple steps needed to use these new boxes on your Oracle Linux host (Release 7 or 8). Virtualization Virtualization is easily installed using the Virtualization Host package group. On Oracle Linux 7, first enable the ol7_kvm_utils channel to get recent version of the packages: sudo yum-config-manager --enable ol7_kvm_utils After installing the packages, start the libvirtd service and add you user to the libvirt group: sudo yum group install "Virtualization Host" sudo systemctl enable --now libvirtd sudo usermod -a -G libvirt opc Do not forget to re-login to activate the group change for your user! Vagrant We need to install HashiCorp Vagrant as well as the Vagrant Libvirt Provider contributed plugin: # Vagrant itself: sudo yum install https://releases.hashicorp.com/vagrant/2.2.9/vagrant_2.2.9_x86_64.rpm # Libraries needed for the plugin: sudo yum install libxslt-devel libxml2-devel libvirt-devel \ libguestfs-tools-c ruby-devel gcc make Oracle Linux 8: at the time of this writing there is a compatibility issue between system libraries and the ones embedded with Vagrant. Run the following script as root to update the Vagrant libraries: #!/usr/bin/env bash # Description: override krb5/libssh libraries in Vagrant embedded libraries set -e # Get pre-requisites dnf -y install \ libxslt-devel libxml2-devel libvirt-devel \ libguestfs-tools-c ruby-devel \ gcc byacc make cmake gcc-c++ mkdir -p vagrant-build cd vagrant-build dnf download --source krb5-libs libssh # krb5 rpm2cpio krb5-1.17-*.src.rpm | cpio -idmv krb5-1.17.tar.gz tar xzf krb5-1.17.tar.gz pushd krb5-1.17/src ./configure make cp -a lib/crypto/libk5crypto.so.3* /opt/vagrant/embedded/lib64/ popd # libssh rpm2cpio libssh-0.9.0-*.src.rpm | cpio -imdv libssh-0.9.0.tar.xz tar xJf libssh-0.9.0.tar.xz mkdir build pushd build cmake ../libssh-0.9.0 -DOPENSSL_ROOT_DIR=/opt/vagrant/embedded make cp lib/libssh* /opt/vagrant/embedded/lib64/ popd We are now ready to install the plugin (as your non-privileged user): vagrant plugin install vagrant-libvirt Firewall The libvirt provider uses NFS to mount the /vagrant shared folder in the guest. Your firewall must be configured to allow the NFS traffic between the host and the guest. Oracle Linux 7 You can allow NFS traffic in your default zone: sudo firewall-cmd --permanent --add-service=nfs3 sudo firewall-cmd --permanent --add-service=mountd sudo firewall-cmd --permanent --add-service=rpc-bind sudo systemctl restart firewalld Alternatively you can add the libvirt bridge to your trusted zone: sudo firewall-cmd --zone=trusted --add-interface=virbr1 sudo systemctl restart firewalld Oracle Linux 8 With Oracle Linux 8, the libvirt bridge is automatically added to the libvirt zone. Traffic must be allowed in that zone: sudo firewall-cmd --permanent --zone libvirt --add-service=nfs3 sudo firewall-cmd --permanent --zone libvirt --add-service=mountd sudo firewall-cmd --permanent --zone libvirt --add-service=rpc-bind sudo systemctl restart firewalld Privileges considerations To configure NFS, Vagrant will require root privilege when you start/stop guest instances. 
Unless you are happy to enter your password on every vagrant up you should consider enabling password-less sudo for your user. Alternatively you can enable fine grained sudoers access as described in Root Privilege Requirement section of the Vagrant documentation. Using libvirt boxes Your first libvirt guest You are now ready to use livirt enabled boxes! mkdir ol7 cd ol7 vagrant init oraclelinux/7 https://oracle.github.io/vagrant-boxes/boxes/oraclelinux/7.json vagrant up Libvirt configuration While the libvirt provider exposes quite a lot of configuration parameters, most Vagrantfiles will run with no or little modification. Typically when you have for VirtualBox: config.vm.provider "virtualbox" do |vb| vb.cpus = 4 vb.memory = 4096 end You will need for libvirt: config.vm.provider :libvirt do |libvirt| libvirt.cpus = 4 libvirt.memory = 4096 end The Oracle vagrant-boxes repository is being updated to support the new libvirt boxes. Tips and tricks Virsh The virsh command can be used to monitor the libvirt resources. By default vagrant-libvirt uses the qemu:///system URI to connect to the KVM hypervisor and images are stored in the default storage pool. Example: [opc@bommel ~]$ vagrant global-status id name provider state directory -------------------------------------------------------------------------------------------------- 7ec55b3 ol7-vagrant libvirt shutoff /home/opc/src/vagrant-boxes/OracleLinux/7 3fd9dd9 registry libvirt shutoff /home/opc/src/vagrant-boxes/ContainerRegistry c716711 ol7-docker-engine libvirt running /home/opc/src/vagrant-boxes/DockerEngine 6a0cb46 worker1 libvirt running /home/opc/src/vagrant-boxes/OLCNE a262a29 worker2 libvirt running /home/opc/src/vagrant-boxes/OLCNE 538e659 master1 libvirt running /home/opc/src/vagrant-boxes/OLCNE b6d2661 ol6-vagrant libvirt running /home/opc/src/vagrant-boxes/OracleLinux/6 41aaa7e oracle-19c-vagrant libvirt running /home/opc/src/vagrant-boxes/OracleDatabase/19.3.0 [opc@bommel ~]$ virsh -c qemu:///system list --all Id Name State ------------------------------------------------- 23 DockerEngine_ol7-docker-engine running 24 OLCNE_worker1 running 25 OLCNE_worker2 running 26 OLCNE_master1 running 30 6_ol6-vagrant running 31 19.3.0_oracle-19c-vagrant running - 7_ol7-vagrant shut off - ContainerRegistry_registry shut off [opc@bommel ~]$ virsh -c qemu:///system vol-list --pool default Name Path ----------------------------------------------------------------------------------------------------------------------------------------------- 19.3.0_oracle-19c-vagrant.img /var/lib/libvirt/images/19.3.0_oracle-19c-vagrant.img 6_ol6-vagrant.img /var/lib/libvirt/images/6_ol6-vagrant.img 7_ol7-vagrant.img /var/lib/libvirt/images/7_ol7-vagrant.img ContainerRegistry_registry-vdb.qcow2 /var/lib/libvirt/images/ContainerRegistry_registry-vdb.qcow2 ContainerRegistry_registry.img /var/lib/libvirt/images/ContainerRegistry_registry.img DockerEngine_ol7-docker-engine-vdb.qcow2 /var/lib/libvirt/images/DockerEngine_ol7-docker-engine-vdb.qcow2 DockerEngine_ol7-docker-engine.img /var/lib/libvirt/images/DockerEngine_ol7-docker-engine.img ol7-latest_vagrant_box_image_0.img /var/lib/libvirt/images/ol7-latest_vagrant_box_image_0.img OLCNE_master1.img /var/lib/libvirt/images/OLCNE_master1.img OLCNE_worker1.img /var/lib/libvirt/images/OLCNE_worker1.img OLCNE_worker2.img /var/lib/libvirt/images/OLCNE_worker2.img oraclelinux-VAGRANTSLASH-6_vagrant_box_image_6.10.130.img /var/lib/libvirt/images/oraclelinux-VAGRANTSLASH-6_vagrant_box_image_6.10.130.img 
oraclelinux-VAGRANTSLASH-6_vagrant_box_image_6.10.132.img /var/lib/libvirt/images/oraclelinux-VAGRANTSLASH-6_vagrant_box_image_6.10.132.img oraclelinux-VAGRANTSLASH-7_vagrant_box_image_7.7.17.img /var/lib/libvirt/images/oraclelinux-VAGRANTSLASH-7_vagrant_box_image_7.7.17.img oraclelinux-VAGRANTSLASH-7_vagrant_box_image_7.8.135.img /var/lib/libvirt/images/oraclelinux-VAGRANTSLASH-7_vagrant_box_image_7.8.135.img Removing box image The vagrant box remove command removes the box from the user .vagrant directory, but not from the storage pool. Use virsh to cleanup the pool: [opc@bommel ~]$ vagrant box list oraclelinux/6 (libvirt, 6.10.130) oraclelinux/6 (libvirt, 6.10.132) oraclelinux/7 (libvirt, 7.8.131) oraclelinux/7 (libvirt, 7.8.135) [opc@bommel ~]$ vagrant box remove oraclelinux/6 --provider libvirt --box-version 6.10.130 Removing box 'oraclelinux/6' (v6.10.130) with provider 'libvirt'... Vagrant-libvirt plugin removed box only from your LOCAL ~/.vagrant/boxes directory From Libvirt storage pool you have to delete image manually(virsh, virt-manager or by any other tool) [opc@bommel ~]$ virsh -c qemu:///system vol-delete --pool default oraclelinux-VAGRANTSLASH-6_vagrant_box_image_6.10.130.img Vol oraclelinux-VAGRANTSLASH-6_vagrant_box_image_6.10.130.img deleted Libvirt CPU emulation mode The default libvirt CPU emulation mode is host-model, that is: the guest inherits capabilities from the host. Should the guest not start in this mode, you can override it using the custom mode – e.g.: config.vm.provider :libvirt do |libvirt| libvirt.cpu_mode = 'custom' libvirt.cpu_model = 'Skylake-Server-IBRS' libvirt.cpu_fallback = 'allow' end You can list the available CPU models with virsh cpu-models x86_64. Storage By default, the Vagrant Libvirt provider will use the default libvirt storage pool which stores images in /var/lib/libvirt/images. The storage_pool_name option allows you to use any other pool/location. Example: On the libvirt side, create a pool: [opc@bommel ~]$ virsh -c qemu:///system Welcome to virsh, the virtualization interactive terminal. Type: 'help' for help with commands 'quit' to quit virsh # pool-define-as vagrant dir --target /data/vagrant Pool vagrant defined virsh # pool-start vagrant Pool vagrant started virsh # pool-autostart vagrant Pool vagrant marked as autostarted In your Vagrantfile, set the storage_pool_name option: config.vm.provider :libvirt do |libvirt| libvirt.storage_pool_name = 'vagrant' end Vagrant Libvirt defaults If you have site specific options, instead of modifying all your Vagrantfiles, you can define them globally in ~/.vagrant.d/Vagrantfile (see Load Order and Merging). E.g: # Vagrant local defaults Vagrant.configure(2) do |config| config.vm.provider :libvirt do |libvirt| libvirt.cpu_mode = 'custom' libvirt.cpu_model = 'Skylake-Server-IBRS' libvirt.cpu_fallback = 'allow' libvirt.storage_pool_name = 'vagrant' end end VirtualBox and libvirt on the same host You cannot run VirtualBox and libvirt guests at the same time, but you still can have both installed and switch from the one to the other providing there is no guest VM running when you switch. The only thing you have to do is to stop/start their respective services – e.g. to switch from VirtualBox to libvirt: systemctl stop vboxdrv.service systemctl start libvirtd.service Screencast

Introduction As recently announced by Sergio we now support the libvirt provider for our Oracle Linux Vagrant Boxes. The libvirt provider is a good alternative to the virtualbox one when you already...

Linux

Using rclone to copy data in and out of Oracle Cloud Object Storage

Introduction In this blog post I’ll show how to use rclone on Oracle Linux with free object storage services included in Oracle Cloud Free Tier. Free tier includes 20GiB of object storage. Rclone is a command line program to sync files and directories to and from various cloud-based storage services. Oracle Cloud Object Storage is Amazon S3 compatible, so I’ll use Rclone’s S3 capabilities to move data between my local Oracle Linux system and object storage. One way to configure Rclone is to run rclone config and step through a series of questions, adding Oracle Cloud Object Storage as an S3 compatible provider. Instead, I’m going to use Oracle Cloud Infrastructure’s Cloud Shell to gather the relevant data and construct what’s ultimately a small configuration file. The high level steps are: On Oracle Cloud Infrastructure: Create an object storage bucket Create an Access Key/Secret Key pair Gather relevant values for Rclone configuration On your local Linux system Install Rclone Create an Rclone config file Create Object Storage Bucket One of the benefits of Cloud Shell is that it includes pre-configured OCI Client tools so you can begin using the command line interface without any configuration steps. Accessing OCI Cloud Shell   Starting in Cloud Shell, set up environment variables to make running subsequent commands easier. The following stores your region and tenancy OCID, and storage namespace in environment variables. I'm using both JMESPath and jq to parse JSON for illustration purposes. export R=$(curl -s http://169.254.169.254/opc/v1/instance/ | jq -r '.region') export C=$(oci os ns get-metadata --query 'data."default-s3-compartment-id"' --raw-output) export N=$(oci os ns get | jq -r '.data') Cloud Shell in action   To create a storage bucket: oci os bucket create --name mybucket --compartment-id $C Create an Access Key/Secret Key pair The Amazon S3 Compatibility API relies on a signing key called a Customer Secret Key. export U=$(oci os bucket list --compartment-id=$C --query 'data [?"name"==`mybucket`] | [0]."created-by"' --raw-output) oci iam customer-secret-key create --display-name storagekey --user-id $U export K=$(oci iam customer-secret-key list --user-id $U | jq -r '[.data[] | select (."display-name"=="storagekey")][0]."id"') In the response from oci iam customer-secret-key, id corresponds to the access key and key represents the secret key. Make a note of the key immediately because it will not be shown to you again! Finally, gather up the relevant values for the Rclone configuration. Remember to copy the secret key and save it somewhere. Run the following to collect and display the information you need. echo "ACCESS KEY: $K"; echo "SECRET KEY: check your notes"; echo "NAMESPACE: $N"; echo "REGION: $R" Set Up Linux System with Rclone Over to the local system on which Rclone will be used to move files to- and from object storage. 
Install Rclone To install Rclone: $ sudo yum install -y oracle-epel-release-el7 && sudo yum install -y rclone Create the Rclone Configuration File In your home directory, create a file named .rclone.conf using the contents below, replacing the values you gathered earlier: [myobjectstorage] type = s3 provider = Other env_auth = false access_key_id = <ACCESS KEY> secret_access_key = <SECRET KEY> endpoint = <NAMESPACE>.compat.objectstorage.<REGION>.oraclecloud.com Note that if the storage bucket you created is not in your home region, you also need to add this entry to the [myobjectstorage] stanza: region = <REGION> Running Rclone You are now ready to start copying files to object storage. The following copies a file, myfile.txt, to object storage. You can show the contents of object storage using rclone ls. $ echo `date` > myfile.txt $ rclone copy myfile.txt myobjectstorage:/mybucket $ rclone ls myobjectstorage:/mybucket Conclusion Rclone is a useful command line utility to interact with, among other types, S3-compatible cloud-based object storage. Oracle Cloud Object Storage has an S3-compatible API. In this blog post, I showed how to install Rclone from the Oracle Linux yum server and configure it using free Oracle Cloud Object Storage. References Rclone documentation
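Beyond one-off copies, rclone can also mirror a whole directory into the bucket. A small sketch using the myobjectstorage remote and mybucket created above; the local directory name is just an example:
$ rclone sync ~/reports myobjectstorage:/mybucket/reports
$ rclone ls myobjectstorage:/mybucket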

Introduction In this blog post I’ll show how to use rclone on Oracle Linux with free object storage services included in Oracle Cloud Free Tier. Free tier includes 20GiB of object storage. Rclone is a...

Linux

Oracle Linux Vagrant Boxes Now Include Catalog Data and Add Support for libvirt Provider

Introduction We recently made some changes in the way we publish Oracle Linux Vagrant Boxes. First, we added boxes for the libvirt provider, for use with KVM. Secondly, we added Vagrant catalog data in JSON format. Vagrant Box Catalog Data With the catalog data in place, instead of launching vagrant environments using a URL to a box file, you launch them by pointing to a JSON file. For example: $ vagrant init oraclelinux/7 https://oracle.github.io/vagrant-boxes/boxes/oraclelinux/7.json $ vagrant up $ vagrant ssh This creates a Vagrantfile that includes the following two lines, which cause the most recently published Oracle Linux 7 box to be downloaded (if needed) and started: config.vm.box = "oraclelinux/7" config.vm.box_url = "https://oracle.github.io/vagrant-boxes/boxes/oraclelinux/7.json" Using this catalog-based approach to referencing Vagrant boxes adds version-awareness and the ability to update boxes. During vagrant up you’ll be notified when a newer version of a box is available for your environment: $ vagrant up; vagrant ssh Bringing machine 'default' up with 'virtualbox' provider... ==> default: Checking if box 'oraclelinux/7' version '7.7.15' is up to date... ==> default: A newer version of the box 'oraclelinux/7' for provider 'virtualbox' is ==> default: available! You currently have version '7.7.15'. The latest is version ==> default: '7.8.128'. Run `vagrant box update` to update. To update a box: $ vagrant box update ==> default: Checking for updates to 'oraclelinux/7' default: Latest installed version: 7.7.15 default: Version constraints: default: Provider: virtualbox ==> default: Updating 'oraclelinux/7' with provider 'virtualbox' from version ==> default: '7.7.15' to '7.8.128'... ==> default: Loading metadata for box 'https://oracle.github.io/vagrant-boxes/boxes/oraclelinux/7.json' ==> default: Adding box 'oraclelinux/7' (v7.8.128) for provider: virtualbox default: Downloading: https://yum.oracle.com/boxes/oraclelinux/ol7/ol7u8-virtualbox-b128.box default: Calculating and comparing box checksum... ==> default: Successfully added box 'oraclelinux/7' (v7.8.128) for 'virtualbox'! To check if a later version of a box is available: $ vagrant box outdated --global * 'rpmcheck' for 'virtualbox' wasn't added from a catalog, no version information * 'oraclelinux/7' for 'virtualbox' is outdated! Current: 7.7.15. Latest: 7.8.128 * 'oraclelinux/6' for 'virtualbox' is outdated! Current: 6.10.13. Latest: 6.10.127 * 'oraclelinux/6' for 'virtualbox' is outdated! Current: 6.8.3. Latest: 6.10.127 Another benefit of using catalog data to install Oracle Linux Vagrant boxes is that checksums are automatically verified after download. Vagrant Boxes for libvirt Provider With the newly released boxes for the libvirt provider you can create Oracle Linux Vagrant environments using KVM as the hypervisor. In this blog post, Philippe explains how to get started. Conclusion In this blog post, I discussed changes we made to the way we publish Oracle Linux Vagrant boxes and showed how to use Vagrant box catalog data to install and manage box versions.
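Because the catalog adds version awareness, you can also pull a specific box version rather than the latest. A hedged sketch using the catalog URL and a version number shown in the output above; verify the --box-version flag against your Vagrant release:
$ vagrant box add https://oracle.github.io/vagrant-boxes/boxes/oraclelinux/7.json --box-version 7.7.15
$ vagrant box list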

Introduction We recently made some changes in the way we publish Oracle Linux Vagrant Boxes. First, we added boxes for the libvirt provider, for use with KVM. Secondly, we added Vagrant catalog data in...

Announcements

Updated Oracle Database images now available in the Oracle Cloud Marketplace

Oracle is pleased and honored to announce the availability of the updated "Oracle Database" image in the Oracle Cloud Marketplace. By leveraging the "Oracle Database" image you have the option to automatically deploy a fully functional database environment by pasting a simple cloud-config script; the deployment allows for basic customization of the environment, and further configuration, like adding extra disks or NICs, is always possible post-deployment. The framework allows for simple cleanup and re-deployment, either via the Marketplace interface (terminate the instance and re-launch), or by cleaning up the instance from within and re-deploying the same instance with changed settings (see Usage Info below). To introduce the different customization options available with the "Oracle Database" image, we also created a dedicated document with examples of Oracle Database deployment customization. The deployed instance is based on the following software stack: Oracle Cloud Infrastructure Native Instance Oracle Linux 7.8 UEK5 (Unbreakable Enterprise Kernel, release 5) Update 3 Updated Oracle Database 12cR2, 18c and 19c with the April 2020 Critical Patch Update For further information: Oracle Database deployment on Oracle Cloud Infrastructure Oracle Database on Oracle Cloud Marketplace Oracle Cloud Marketplace Oracle Cloud: Try it for free Oracle Database Templates for Oracle VM

Oracle is pleased and honored to announce the updated "Oracle Database" availability in the "Oracle Cloud MarketPlace". By leveraging the "Oracle Database" you will have the option to automatically...

Announcements

Announcing the release of Oracle Linux 8 Update 2

Oracle is pleased to announce the general availability of Oracle Linux 8 Update 2. Individual RPM packages are available on the Unbreakable Linux Network (ULN) and the Oracle Linux yum server. ISO installation images will soon be available for download from the Oracle Software Delivery Cloud, and Docker images will soon be available via Oracle Container Registry and Docker Hub. Starting with Oracle Linux 8 Update 2, the Unbreakable Enterprise Kernel Release 6 (UEK R6) is included on the installation image along with the Red Hat Compatible Kernel (RHCK). For new installations, UEK R6 is enabled and installed as the default kernel on first boot. UEK R6 is a heavily tested and optimized operating system kernel for Oracle Linux 7 Update 7, and later, and Oracle Linux 8 Update 1, and later. The kernel is developed, built, and tested on 64-bit Arm (aarch64) and 64-bit AMD/Intel (x86-64) platforms. UEK R6 is based on the mainline Linux kernel version 5.4 and includes driver updates, bug fixes, and security fixes; additional features are enabled to provide support for key functional requirements and patches are applied to improve performance and optimize the kernel for use in enterprise operating environments. Oracle Linux 8 Update 2 ships with: UEK R6 (kernel-uek-5.4.17-2011.1.2.el8uek) for x86_64 (Intel & AMD) and aarch64 (Arm) platforms RHCK (kernel-4.18.0-193.el8) for x86_64 (Intel & AMD) platform where both include bug fixes, security fixes and enhancements. Notable New Features for All Architectures Unbreakable Enterprise Kernel Release 6 (UEK6) For information about UEK6, please refer to the UEK6 announcement Red Hat Compatible Kernel (RHCK) "kexec-tools" documentation now includes Kdump FCoE target support "numactl" manual page updated to clarify information about memory usage "rngd" can run with non-root privileges Secure Boot available by default Compiler and Development Toolset (available as Application Streams) Compiler and Toolset Clang toolset updated to version 9.0.0 Rust toolset updated to version 1.39 Go toolset updated to 1.13.4 GCC Toolset 9 GCC version updated to 9.2.1 GDB version updated to 8.3 For further GCC Toolset updates, please check Oracle Linux 8 Update 2 release notes Database Oracle Linux 8 Update 2 ships with version 8.0 of the MySQL database Dynamic Programming Languages, Web "maven:3.6" module stream available "Python 3.8" is provided by a new python38 module. Python 3.6 continues to be supported in Oracle Linux 8. The introduction of "Python 3.8" in Oracle Linux 8 Update 2 requires that you specify which version of "mod_wsgi" you want to install, as "Python 3.6" is also supported in this release. "perl-LDAP" and "perl-Convert-ASN1" packages are now released as part of Oracle Linux 8 Update 2. Infrastructure Services "bind" updated to version 9.11.13 "tuned" updated to version 2.13 Networking "eBPF" for Traffic Control kernel subsystem supported (previously available as a technology preview) "firewalld" updated to version 0.8 Podman, Buildah, and Skopeo Container Tools are now supported on both UEK R6 and RHCK Security "audit" updated to version 3.0-0.14; several improvements introduced between Kernel version 4.18 (RHCK) and version 5.4 (UEK R6) of Audit. "lvmdbusd" service confined by SELinux. "openssl-pkcs11" updated to version 0.4.10. "rsyslog" updated to version 8.1911.0. SCAP Security Guide includes ACSC (Australian Cyber Security Centre) Essential Eight support. 
SELinux SELinux setools-gui and setools-console-analyses packages included SELinux improved to enable confined users to manage user session services semanage export able to display customizations related to permissive domains semanage includes capability for listing and modifying SCTP and DCCP ports "sudo" updated to version 1.8.29-3. "udica" is now capable of adding new allow rules generated from SELinux denials to an existing container policy. Virtualization Nested Virtual Machines (VM) capability added; this enhancement enables an Oracle Linux 7 or Oracle Linux 8 VM that is running on an Oracle Linux 8 physical host to perform as a hypervisor, and host its own VMs. Note: On AMD64 systems, nested KVM virtualization continues to be a Technology Preview. virt-manager application deprecated; Oracle recommends using the Cockpit web console to manage virtualization in a GUI. Cockpit Web Console Cockpit web console login timeout; web console sessions will be automatically logged out after 15 minutes of inactivity Option for logging into the web console with a TLS client certificate added Creating a new file system in the web console requires a specified mount point Virtual Machines management page improvements Important changes in this release UEK R6 brought back support for Btrfs and OCFS2 file systems. These are not available while using the Red Hat Compatible Kernel (RHCK). Further Information on Oracle Linux 8 For more details about these and other new features and changes, please consult the Oracle Linux 8 Update 2 Release Notes and Oracle Linux 8 Documentation. Oracle Linux can be downloaded, used, and distributed free of charge and all updates and errata are freely available. Customers decide which of their systems require a support subscription. This makes Oracle Linux an ideal choice for development, testing, and production systems. The customer decides which support coverage is best for each individual system while keeping all systems up to date and secure. Customers with Oracle Linux Premier Support also receive support for additional Linux programs, including Gluster Storage, Oracle Linux Software Collections, and zero-downtime kernel updates using Oracle Ksplice. Application Compatibility Oracle Linux maintains user-space compatibility with Red Hat Enterprise Linux (RHEL), which is independent of the kernel version that underlies the operating system. Existing applications in user space will continue to run unmodified on the Unbreakable Enterprise Kernel Release 6 (UEK R6) and no re-certifications are needed for RHEL certified applications. For more information about Oracle Linux, please visit www.oracle.com/linux.
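As a quick illustration of the Application Streams changes listed above, the new Python 3.8 stream can be installed alongside the default Python 3.6 interpreter. A minimal sketch, assuming standard dnf module behavior on Oracle Linux 8 Update 2:
sudo dnf module list python38
sudo dnf module install python38
python3.8 --version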

Oracle is pleased to announce the general availability of Oracle Linux 8 Update 2. Individual RPM packages are available on the Unbreakable Linux Network (ULN) and the Oracle Linux yum server. ISO...

Linux

How to emulate block devices with QEMU

QEMU has a really great feature of being able to emulate block devices; this article by Dongli Zhang from the Oracle Linux kernel team shows you how to get started. Block devices are significant components of the Linux kernel. In this article, we introduce the usage of QEMU to emulate some of these block devices, including SCSI, NVMe, Virtio and NVDIMM. This ability helps Linux administrators and developers to study, debug and develop the Linux kernel, as it is much easier to customize the configuration and topology of block devices with QEMU. In addition, it is also considerably faster to reboot a QEMU/KVM virtual machine than to reboot a bare-metal server. For all examples in this article the KVM virtual machine is running Oracle Linux 7, the virtual machine kernel version is 5.5.0, and the QEMU version is 4.2.0. All examples run the boot disk (ol7.qcow2) as the default IDE disk, while all other disks (e.g., disk.qcow2, disk1.qcow2 and disk2.qcow2) are used for corresponding devices. The article focuses on the usage of QEMU with block devices. It does not cover any prerequisite knowledge on block device mechanisms. megasas The below example demonstrates how to emulate megasas by adding two SCSI LUNs to the HBA. qemu-system-x86_64 -machine accel=kvm -vnc :0 -smp 4 -m 4096M \ -net nic -net user,hostfwd=tcp::5022-:22 \ -hda ol7.qcow2 -serial stdio \ -device megasas,id=scsi0 \ -device scsi-hd,drive=drive0,bus=scsi0.0,channel=0,scsi-id=0,lun=0 \ -drive file=disk1.qcow2,if=none,id=drive0 \ -device scsi-hd,drive=drive1,bus=scsi0.0,channel=0,scsi-id=1,lun=0 \ -drive file=disk2.qcow2,if=none,id=drive1 The figure below depicts the SCSI bus topology for this example. There are two SCSI adapters (each with one SCSI LUN) connecting to the HBA via the same SCSI channel. Syslog output extracted from the virtual machine verifies that the SCSI bus topology scanned by the kernel matches QEMU's configuration. The only exception is the channel id: the QEMU configuration specifies channel=0, while the kernel assigns a channel id of 2. The disks disk1.qcow2 and disk2.qcow2 map to '2:2:0:0' and '2:2:1:0' respectively. [ 2.439170] scsi host2: Avago SAS based MegaRAID driver [ 2.445926] scsi 2:2:0:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5 [ 2.446098] scsi 2:2:1:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5 [ 2.463466] sd 2:2:0:0: Attached scsi generic sg1 type 0 [ 2.467891] sd 2:2:1:0: Attached scsi generic sg2 type 0 [ 2.478002] sd 2:2:0:0: [sdb] Attached SCSI disk [ 2.485895] sd 2:2:1:0: [sdc] Attached SCSI disk lsi53c895a This section demonstrates how to emulate lsi53c895a by adding two SCSI LUNs to the HBA. qemu-system-x86_64 -machine accel=kvm -vnc :0 -smp 4 -m 4096M \ -net nic -net user,hostfwd=tcp::5023-:22 \ -hda ol7.qcow2 -serial stdio \ -device lsi53c895a,id=scsi0 \ -device scsi-hd,drive=drive0,bus=scsi0.0,channel=0,scsi-id=0,lun=0 \ -drive file=disk1.qcow2,if=none,id=drive0 \ -device scsi-hd,drive=drive1,bus=scsi0.0,channel=0,scsi-id=1,lun=0 \ -drive file=disk2.qcow2,if=none,id=drive1 The figure below depicts the SCSI bus topology for this example. There are two SCSI adapters (each with one SCSI LUN) connecting to the HBA via the same SCSI channel. As with megasas, syslog output from the virtual machine shows that the SCSI bus topology scanned by the kernel matches what is configured with QEMU. Disks disk1.qcow2 and disk2.qcow2 are mapped to '2:0:0:0' and '2:0:1:0' respectively. 
[ 2.443221] scsi host2: sym-2.2.3 [ 5.534188] scsi 2:0:0:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5 [ 5.544931] scsi 2:0:1:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5 [ 5.558896] sd 2:0:0:0: Attached scsi generic sg1 type 0 [ 5.559889] sd 2:0:1:0: Attached scsi generic sg2 type 0 [ 5.574487] sd 2:0:0:0: [sdb] Attached SCSI disk [ 5.579512] sd 2:0:1:0: [sdc] Attached SCSI disk virtio-scsi This section demonstrates the usage of paravirtualized virtio-scsi. The virtio-scsi device provides really good 'multiqueue' support. Therefore it can be used to study or debug the 'multiqueue' feature of SCSI and the block layer. The below example creates a 4-queue virtio-scsi HBA with two LUNs (which both belong to the same SCSI target). qemu-system-x86_64 -machine accel=kvm -vnc :0 -smp 4 -m 4096M \ -net nic -net user,hostfwd=tcp::5023-:22 \ -hda ol7.qcow2 -serial stdio \ -device virtio-scsi-pci,id=scsi0,num_queues=4 \ -device scsi-hd,drive=drive0,bus=scsi0.0,channel=0,scsi-id=0,lun=0 \ -drive file=disk1.qcow2,if=none,id=drive0 \ -device scsi-hd,drive=drive1,bus=scsi0.0,channel=0,scsi-id=0,lun=1 \ -drive file=disk2.qcow2,if=none,id=drive1 The figure below illustrates the SCSI bus topology for this example, which has one single SCSI adapter with two LUNs. Again, as with previous examples, syslog extracted from the virtual machine verifies that the SCSI bus topology scanned by the kernel matches QEMU's configuration. In this scenario disks disk1.qcow2 and disk2.qcow2 map to '2:0:0:0' and '2:0:0:1' respectively. [ 1.212182] scsi host2: Virtio SCSI HBA [ 1.213616] scsi 2:0:0:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5 [ 1.213851] scsi 2:0:0:1: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5 [ 1.371305] sd 2:0:0:0: [sdb] Attached SCSI disk [ 1.372284] sd 2:0:0:1: [sdc] Attached SCSI disk [ 2.400542] sd 2:0:0:0: Attached scsi generic sg0 type 0 [ 2.403724] sd 2:0:0:1: Attached scsi generic sg1 type 0 The information below extracted from the virtual machine confirms that the virtio-scsi HBA has 4 I/O queues. Each I/O queue has one virtio0-request interrupt. # ls /sys/block/sdb/mq/ 0 1 2 3 # ls /sys/block/sdc/mq/ 0 1 2 3 24: 0 0 0 0 PCI-MSI 65536-edge virtio0-config 25: 0 0 0 0 PCI-MSI 65537-edge virtio0-control 26: 0 0 0 0 PCI-MSI 65538-edge virtio0-event 27: 30 0 0 0 PCI-MSI 65539-edge virtio0-request 28: 0 140 0 0 PCI-MSI 65540-edge virtio0-request 29: 0 0 34 0 PCI-MSI 65541-edge virtio0-request 30: 0 0 0 276 PCI-MSI 65542-edge virtio0-request virtio-blk In this section we demonstrate usage of the paravirtualized virtio-blk device. This example shows virtio-blk with 4 I/O queues, with the backend device being disk.qcow2. qemu-system-x86_64 -machine accel=kvm -vnc :0 -smp 4 -m 4096M \ -net nic -net user,hostfwd=tcp::5023-:22 \ -hda ol7.qcow2 -serial stdio \ -device virtio-blk-pci,drive=drive0,id=virtblk0,num-queues=4 \ -drive file=disk.qcow2,if=none,id=drive0 The information below extracted from the virtual machine confirms that the virtio-blk device has 4 I/O queues. Each I/O queue has one virtio0-req.X interrupt. # ls /sys/block/vda/mq/ 0 1 2 3 24: 0 0 0 0 PCI-MSI 65536-edge virtio0-config 25: 3 0 0 0 PCI-MSI 65537-edge virtio0-req.0 26: 0 31 0 0 PCI-MSI 65538-edge virtio0-req.1 27: 0 0 33 0 PCI-MSI 65539-edge virtio0-req.2 28: 0 0 0 0 PCI-MSI 65540-edge virtio0-req.3 nvme This section demonstrates how to emulate NVMe. This example shows an NVMe device with 8 hardware queues. As the virtual machine has 4 vcpus, only 4 hardware queues will be used. 
The backend NVMe device is disk.qcow2. qemu-system-x86_64 -machine accel=kvm -vnc :0 -smp 4 -m 4096M \ -net nic -net user,hostfwd=tcp::5023-:22 \ -hda ol7.qcow2 -serial stdio \ -device nvme,drive=nvme0,serial=deadbeaf1,num_queues=8 \ -drive file=disk.qcow2,if=none,id=nvme0 Syslog extracted from the virtual machine confirms that detection of the NVMe device by the Linux kernel is successful. [ 1.181405] nvme nvme0: pci function 0000:00:04.0 [ 1.212434] nvme nvme0: 4/0/0 default/read/poll queues The information below taken from the virtual machine confirms that the NVMe device has 4 I/O queues in addition to the admin queue. Each queue has one nvme0qX interrupt. 24: 0 11 0 0 PCI-MSI 65536-edge nvme0q0 25: 40 0 0 0 PCI-MSI 65537-edge nvme0q1 26: 0 41 0 0 PCI-MSI 65538-edge nvme0q2 27: 0 0 0 0 PCI-MSI 65539-edge nvme0q3 28: 0 0 0 4 PCI-MSI 65540-edge nvme0q4 NVMe in QEMU also supports 'cmb_size_mb', which is used to configure the amount of memory available as Controller Memory Buffer (CMB). In addition, upstream developers are continually adding more features to NVMe emulation in QEMU, such as multiple namespaces. nvdimm This section briefly demonstrates how to emulate NVDIMM by adding one 6GB NVDIMM to the virtual machine. qemu-system-x86_64 -machine pc,nvdimm,accel=kvm -vnc :0 -smp 4 -m 4096M \ -net nic -net user,hostfwd=tcp::5023-:22 \ -hda ol7.qcow2 -serial stdio \ -m 2G,maxmem=10G,slots=4 \ -object memory-backend-file,share,id=md1,mem-path=nvdimm.img,size=6G \ -device nvdimm,memdev=md1,id=nvdimm1 The following information extracted from the virtual machine demonstrates that the NVDIMM device is exported as block device /dev/pmem0. # dmesg | grep NVDIMM [ 0.020285] ACPI: SSDT 0x000000007FFDFD85 0002CD (v01 BOCHS NVDIMM 00000001 BXPC 00000001) # ndctl list [ { "dev":"namespace0.0", "mode":"raw", "size":6442450944, "sector_size":512, "blockdev":"pmem0" } ] # lsblk | grep pmem pmem0 259:0 0 6G 0 disk NVDIMM features and configuration are quite complex, with extra support continually being added to QEMU. For more NVDIMM usage with QEMU, please refer to QEMU NVDIMM documentation at https://docs.pmem.io Power Management This section introduces how QEMU can be used to emulate power management, e.g. freeze/resume. While this is not limited to block devices, we will demonstrate using an NVMe device. This helps to understand how block device drivers work with power management. The first step is to boot a virtual machine with an NVMe device. The only difference from the prior NVMe example is to use "-monitor stdio" instead of "-serial stdio" to facilitate interaction with QEMU via a shell. qemu-system-x86_64 -machine accel=kvm -vnc :0 -smp 4 -m 4096M \ -net nic -net user,hostfwd=tcp::5023-:22 \ -hda ol7.qcow2 -monitor stdio \ -device nvme,drive=nvme0,serial=deadbeaf1,num_queues=8 \ -drive file=disk.qcow2,if=none,id=nvme0 (qemu) To suspend the operating system, run the following within the virtual machine: # echo freeze > /sys/power/state This will have the effect of freezing the virtual machine. To resume, run the following from the QEMU shell: (qemu) system_powerdown The following extract from the virtual machine syslog demonstrates the behaviour of the Linux kernel during the freeze/resume cycle. [ 26.945439] PM: suspend entry (s2idle) [ 26.951256] Filesystems sync: 0.005 seconds [ 26.951922] Freezing user space processes ... (elapsed 0.000 seconds) done. [ 26.953489] OOM killer disabled. [ 26.953942] Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done. 
[ 26.955631] printk: Suspending console(s) (use no_console_suspend to debug) [ 26.962704] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 26.962972] sd 0:0:0:0: [sda] Stopping disk [ 54.674206] sd 0:0:0:0: [sda] Starting disk [ 54.678859] nvme nvme0: 4/0/0 default/read/poll queues [ 54.707283] OOM killer enabled. [ 54.707710] Restarting tasks ... done. [ 54.708596] PM: suspend exit [ 54.834191] ata2.01: NODEV after polling detection [ 54.834379] ata1.01: NODEV after polling detection [ 56.770115] e1000: ens3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX This method can be used to reproduce and analyze bugs related to NVMe and power management such as http://lists.infradead.org/pipermail/linux-nvme/2019-April/023237.html This highlights that QEMU is not just an emulator; it can also be used to study, debug and develop the Linux kernel. Summary In this article we discussed how to emulate numerous different block devices with QEMU, facilitating the ability to customize the configuration of block devices such as the topology of the SCSI bus or the number of queues in NVMe. Block devices support many features, and the best way to fully understand how to configure block devices in QEMU is to study the QEMU source code. This will also assist in understanding block device specifications. This article should primarily be used as a cheat sheet.

QEMU has a really great feature of being able to emulate block devices, this article by Dongli Zhang from the Oracle Linux kernel team shows you how to get started. Block devices are significant...

Announcements

NTT Data Intellilink - Powering Mission Critical Workloads with Oracle Linux

This article highlights customer NTT Data Intellilink and their use of Oracle Linux. As an NTT DATA group company, they aim to provide value for their customers through the design, implementation and operation of mission-critical information and communication systems platforms built with the latest technologies. Within NTT Data Intellilink, there is a business unit which focuses on providing customers with Oracle solutions, support, and implementation services using a wide range of Oracle products such as Oracle Database, Oracle Fusion Middleware, Oracle Linux, and Oracle Engineered Systems. These solutions are deployed on premises or in Oracle Cloud. Previously, NTT Data Intellilink had been using Red Hat Enterprise Linux before switching to Oracle Linux with the Unbreakable Enterprise Kernel (UEK). This change resulted in multiple benefits which they speak about in this video. These include optimized workload performance, improved support across the entire stack, increased security, and a 50% overall reduction in costs. NTT Data Intellilink also found that Oracle's flexible support contracts were easier to manage. Additionally, NTT Data Intellilink has improved its systems management experience by leveraging Oracle Enterprise Manager and Oracle Ksplice across its portfolio. Ksplice, a zero-downtime patching technology, allows patches and critical bug fixes to be applied without taking systems down. Both are included at no additional cost with Oracle Linux Premier Support. We are proud to enable customers like NTT Data Intellilink to deliver mission-critical systems at a lower cost. Watch this video to learn more!

This article highlights customer NTT Data Intellilink and their use of Oracle Linux. As an NTT DATA group company, they aim to provide value for their customers through design, implementation...

Announcements

Staying Ahead of Cyberthreats: Protecting Your Linux Systems with Oracle Ksplice

In this recently published white paper, "Staying Ahead of Cyberthreats: Protecting Your Linux Systems with Oracle Ksplice," we explain why regular operating system patching is so important and how Oracle Ksplice can help better protect your Linux systems. In the face of increasingly sophisticated cyberthreats, protecting IT systems has become vitally important. To help administrators more easily and regularly apply Linux updates, Oracle Ksplice offers an automated zero-downtime solution that simplifies the patching process. Ksplice allows users to automate patching of the Linux kernel, both Xen and KVM hypervisors, and critical user space libraries. It is currently the only solution to offer user space patching. Ksplice also offers several other customer benefits, which are explained in the white paper. Additionally, you will find links to customer videos that highlight the value Ksplice is providing in production environments. Customers with an Oracle Linux Premier Support subscription have access to Ksplice at no additional cost. It is available for both on premise and cloud deployments. We hope you have a chance to learn more by reading the white paper and listening to what customers are saying about Ksplice.  
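For readers who want to see what zero-downtime patching looks like day to day, the Ksplice Uptrack client exposes a small set of commands. A minimal sketch, assuming the Uptrack client is already installed and registered on an Oracle Linux system:
sudo uptrack-upgrade -y   # download and apply all available rebootless updates
sudo uptrack-show         # list the Ksplice updates currently applied
uptrack-uname -r          # report the effective kernel version after patching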

In this recently published white paper, "Staying Ahead of Cyberthreats: Protecting Your Linux Systems with Oracle Ksplice," we explain why regular operating system patching is so important and how...

Announcements

Announcing the release of Oracle Linux 7 Security Technical Implementation Guide (STIG) OpenSCAP profile

On February 28 2020, the Defence Information Systems Agency (DISA) released the Oracle Linux 7 Security Technical Implementation Guide (STIG) Release 1 Version 1 (R1V1). Oracle has implemented the published STIG in Security Content Automation Protocol (SCAP) format and included it in the latest release of the scap-security-guide package for Oracle Linux 7. This can be used in conjunction with the OpenSCAP tool shipped with Oracle Linux to validate a server against the published implementation guide. The validation process can also suggest and in some cases automatically apply remediation in cases where compliance is not met. Running a STIG compliance scan with OpenSCAP To validate a server against the published profile, you will need to install the OpenSCAP scanner tool and the SCAP Security Guide content: # yum install openscap scap-security-guide Loaded plugins: ovl, ulninfo Resolving Dependencies --> Running transaction check ---> Package openscap.x86_64 0:1.2.17-9.0.3.el7 will be installed ... Dependencies Resolved =============================================================================================================================== Package Arch Version Repository Size =============================================================================================================================== Installing: openscap x86_64 1.2.17-9.0.3.el7 ol7_latest 3.8 M scap-security-guide noarch 0.1.46-11.0.2.el7 ol7_latest 7.9 M Installing for dependencies: libxslt x86_64 1.1.28-5.0.1.el7 ol7_latest 241 k openscap-scanner x86_64 1.2.17-9.0.3.el7 ol7_latest 62 k xml-common noarch 0.6.3-39.el7 ol7_latest 26 k Transaction Summary =============================================================================================================================== Install 2 Packages (+3 Dependent packages) ... Installed: openscap.x86_64 0:1.2.17-9.0.3.el7 scap-security-guide.noarch 0:0.1.46-11.0.2.el7 ... Complete! To confirm you have the STIG profile available, run: # oscap info --profile stig /usr/share/xml/scap/ssg/content/ssg-ol7-xccdf.xml Document type: XCCDF Checklist Profile Title: DISA STIG for Oracle Linux 7 Id: stig Description: This profile contains configuration checks that align to the DISA STIG for Oracle Linux V1R1. To start an evaluation of the host against the profile, run: # oscap xccdf eval --profile stig \ --results /tmp/`hostname`-ssg-results.xml \ --report /var/www/html/`hostname`-ssg-results.html \ --cpe /usr/share/xml/scap/ssg/content/ssg-ol7-cpe-dictionary.xml \ /usr/share/xml/scap/ssg/content/ssg-ol7-xccdf.xml WARNING: This content points out to the remote resources. Use `--fetch-remote-resources' option to download them. WARNING: Skipping https://linux.oracle.com/security/oval/com.oracle.elsa-all.xml.bz2 file which is referenced from XCCDF content Title Remove User Host-Based Authentication Files Rule no_user_host_based_files Result pass Title Remove Host-Based Authentication Files Rule no_host_based_files Result pass Title Uninstall rsh-server Package Rule package_rsh-server_removed Result pass ... The results will be saved to /tmp/hostname-ssg-results.xml and a human-readable report will be saved to /var/www/html/hostname-ssg-results.html as well. For further details on additional options for running OpenSCAP compliance checks, including ways to generate a full security guide from SCAP content, please see the Oracle Linux 7 Security Guide. 
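As mentioned above, the validation process can also apply remediation automatically in some cases. A hedged sketch reusing the paths from the evaluation example; the --remediate option is part of the standard oscap xccdf eval interface, and you should review the resulting changes before relying on them:
# oscap xccdf eval --profile stig --remediate \
   --results /tmp/`hostname`-ssg-remediation-results.xml \
   --cpe /usr/share/xml/scap/ssg/content/ssg-ol7-cpe-dictionary.xml \
   /usr/share/xml/scap/ssg/content/ssg-ol7-xccdf.xml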
For details on methods to automate OpenSCAP scanning using Spacewalk, please see the Spacewalk for Oracle Linux: Client Life Cycle Management Guide. For community-based support, please visit the Oracle Linux space on the Oracle Groundbreakers Community.

On February 28 2020, the Defence Information Systems Agency (DISA) released the Oracle Linux 7 Security Technical Implementation Guide (STIG) Release 1 Version 1 (R1V1). Oracle has implemented the...

Announcements

Announcing Oracle Linux Cloud Native Environment Release 1.1

Oracle is pleased to announce the general availability of Oracle Linux Cloud Native Environment Release 1.1. This release includes several new features for cluster management, updates to the existing Kubernetes module, and introduces new Helm and Istio modules. Oracle Linux Cloud Native Environment is an integrated suite of software and tools for the development and management of cloud-native applications. Based on the Open Container Initiative (OCI) and Cloud Native Computing Foundation (CNCF) standards, Oracle Linux Cloud Native Environment delivers a simplified framework for installations, updates, upgrades, and configuration of key features for orchestrating microservices. New features and notable changes Several improvements and enhancements have been made to the installation and management of Oracle Linux Cloud Native Environment, including: Cluster installation: the olcnectl module install command automatically installs and configures any required RPM packages and services. Load balancer installation: the olcnectl module install command automatically deploys a software load balancer when the --virtual-ip parameter is provided. Cluster upgrades: the olcnectl module update command can update module components. For multi-master deployments, this is done with no cluster service downtime. Cluster scaling: the olcnectl module update command can add and remove both master and worker nodes in a running cluster.  Updated Kubernetes module Oracle Linux Cloud Native Environment Release 1.1 includes Kubernetes Release 1.17.4. Please review the Release Notes for a list of the significant administrative and API changes between Kubernetes 1.14.8 and 1.17. New Helm module Helm is a package manager for Kubernetes that simplifies the task of deploying and managing software inside Kubernetes clusters. In this release, the Helm module is not supported for general use but is required and supported to deploy the Istio module. New Istio module Istio is a fully featured service mesh for deploying microservices into Kubernetes clusters. Istio can handle most aspects of microservice management, including identity, authentication, transport security, and metric scraping. The Istio module includes embedded instances of the Prometheus monitoring and Grafana graphing tools which are automatically configured with specific dashboards to better understand Istio-managed workloads. For more information about installing and using the Istio module, see Service Mesh. Installation and upgrade Oracle Linux Cloud Native Environment is installed using packages from the Unbreakable Linux Network or the Oracle Linux yum server as well as container images from the Oracle Container Registry. Existing deployments can be upgraded in place using the olcnectl module update command. For more information on installing or upgrading Oracle Linux Cloud Native Environment, please see Getting Started. Support for Oracle Linux Cloud Native Environment Support for Oracle Linux Cloud Native Environment is included with an Oracle Linux Premier Support subscription. Documentation and training Oracle Linux Cloud Native Environment documentation Oracle Linux Cloud Native Environment training
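As a hedged sketch of the cluster update workflow described above — the environment and module names here are placeholders, and the exact flags should be checked against the olcnectl documentation for your release:
olcnectl module update \
  --environment-name myenvironment \
  --name mycluster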

Oracle is pleased to announce the general availability of Oracle Linux Cloud Native Environment Release 1.1. This release includes several new features for cluster management, updates to the existing...

Linux

What’s new for NFS in Unbreakable Enterprise Kernel Release 6?

Oracle Linux kernel engineer Calum Mackay provides some insight into the new features for NFS in release 6 of the Unbreakable Enterprise Kernel (UEK).   UEK R6 is based on the upstream long-term stable Linux kernel v5.4, and introduces many new features compared to the previous version UEK R5, which is based on the upstream stable Linux kernel v4.14. In this blog, we look at what has been improved in the UEK R6 NFS client & server implementations. Server-side Copy (NFSv4.2 clients & servers) UEK R6 adds initial experimental support for parts of the NFSv4.2 server-side copy (SSC) mechanism. This is a feature that considerably increases efficiency when copying a file between two locations on a server, via NFS. Without SSC, this operation requires that the NFS client use READ requests to read all the file's data, then WRITE requests to write it back to the server as a new file, with every byte travelling over the network twice. With SSC, the NFS client may use one of two new NFSv4.2 operations to ask the server to perform the copy locally, on the server itself, without the file data needing to traverse the network at all. Obviously this will be enormously faster. 1. NFS COPY NFS COPY is a new operation which can be used by the client to request that the server locally copy a range of bytes from one file to another, or indeed the entire file. However, NFS COPY requires use of the copy_file_range client system call. Currently, no bundled utilities in Linux distributions appear to make use of this system call, but an application may easily be written or converted to use it. As soon as support for the copy_file_range client system call is added to client utilities, they will be able to make use of the NFSv4.2 COPY operation. Note that NFS COPY does not require any special support within the NFS server filesystem itself. 2. NFS_CLONE The new NFS CLONE operation allows clients to ask the server to use the exported filesystem's reflink mechanism to create a copy-on-write clone of the file, elsewhere within the same server filesystem. NFS CLONE requires the use of client utilities that support reflink; currently cp includes this support, with its --reflink option. In addition, NFS CLONE requires that the NFS server filesystem supports the reflink operation. The filesystems available in Oracle Linux NFS servers that support the reflink operation are btrfs, OCFS2 & XFS. NFS CLONE is much faster even than NFS COPY, since it uses copy-on-write, on the NFS server, to clone the file, provided the source and destination files are within the same filesystem. Note that in some cases the server filesystem may need to have been originally created with reflink support, especially if they were created on Oracle Linux 7 or earlier. The NFSv4.2 SSC design specifies both intra-server and inter-server operations. UEK R6 supports intra-server operations, i.e. the source and destination files exist on the same NFS server. Support for inter-server SSC (copies between two NFS servers) will be added in the future. Use of these features requires that both NFS client and server support NFSv4.2 SSC; currently server support for SSC is only available with Linux NFS servers. 
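To try NFS CLONE from an Oracle Linux client, mount with NFSv4.2 and use cp --reflink against an export that lives on a reflink-capable filesystem (XFS, btrfs or OCFS2, as noted above). A minimal sketch; the server name and paths are placeholders:
mount -t nfs -o vers=4.2 server:/export /mnt
cp --reflink=always /mnt/bigfile /mnt/bigfile.clone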
As an example of the performance gains possible with NFSv4.2 SSC, here are timings from copying a 2GB file between two locations on an NFS server, over a relatively slow network: the traditional NFS READ/WRITE method took 5 minutes 22 seconds, while NFSv4.2 COPY (via a custom app using the copy_file_range syscall) took 12 seconds. SSC is specific to NFS version 4.2 or greater. In Oracle Linux 7, NFSv4.2 is supported (provided the latest UEK kernel and userland packages are installed), but it is not the default NFS version used by an NFS client when mounting filesystems. By default, an OL7 NFS client will mount using NFSv4.1 (provided the NFS server supports it). An NFSv4.2 mount may be performed on an OL7 client, as follows: # command line mount -o vers=4.2 server:/export /mnt # /etc/fstab server:/export /mnt nfs noauto,vers=4.2 0 0 An Oracle Linux 8 NFS client will mount using NFSv4.2 by default. Just like OL7, if the NFS server does not support that, the OL8 client will successively try lower NFS versions until it finds one that the server supports. Multiple TCP connections per NFS server (NFSv4.1 and later clients) For NFSv4.1 and later mounts over TCP, a new nconnect mount option enables an NFS client to set up multiple TCP connections, using the same client network interface, to the same NFS server. This may improve total throughput in some cases, particularly with bonded networks. Multiple transports allow hardware parallelism on the network path to be fully exploited. However, there are also improvements even when using just one NIC, thanks to various efficiency savings. Using multiple connections will help most when a single TCP connection is saturated while the network itself and the server still have capacity. It will not help if the network itself is saturated, and it will still be bounded by the performance of the storage at the NFS server. Enhanced statistics reporting has been added to report on all transports when using multiple connections. Improved handling of soft mounts (NFSv4 clients) NOTE: we do not recommend the use of the soft and rw mount options together (and remember that rw is the default) unless you fully understand the implications, including possible data loss or corruption. By default, i.e. without the soft mount option, NFS mounts are described as hard, which means that NFS operations will not time out in the case of an unresponsive NFS server or network partition. NFS operations, including READ and WRITE, will wait indefinitely, until the NFS server is again reachable. In particular, this means that any such affected NFS filesystem cannot be unmounted, and the NFS client system itself cannot be cleanly shut down, until the NFS server responds. When an NFS filesystem is mounted with the soft mount option, NFS operations will time out after a certain period (based on the timeo and retrans mount options) and the associated system calls (e.g. read, write, fsync, etc.) will return an EIO error to the application. The NFS filesystem may be unmounted, and the NFS client system may be cleanly shut down. 
This might sound like a useful feature, but it can cause problems, which can be especially severe in the case of rw (read-write) filesystems, because of the following: Client applications often don't expect to get EIO from file I/O request system calls, and may not handle them appropriately. NFS uses asynchronous I/O, which means that the first client write system call isn't necessarily going to return the error, which may instead get reported by a subsequent write, or close, or perhaps only by a subsequent fsync, which the client might not even perform; close may not be guaranteed to report the error, either. Obviously, reporting the error via a subsequent write/fsync makes it harder for the application to deal with correctly. Write interleaving may mean that the NFS client kernel can't always precisely track which file descriptors are involved, so the error may perhaps not even be guaranteed to be delivered, or not delivered via the right descriptor on close/fsync. It's important to realize that the above issues may result in NFS WRITE operations being lost, when using the soft mount option, resulting in file corruption and data loss, depending on how well the client application handles these situations. For that reason, it is dangerous to use the soft mount option with rw mounted filesystems, even with UEK R6, unless you are fully aware of how your application(s) handle EIO errors from file I/O request system calls. In UEK R6, the handling of soft mounts with NFSv4 has been improved, in particular: Reducing the risk of false-positive timeouts, e.g. in the case where the NFS server is merely congested. Faster failover of NFS READ and WRITE operations after a timeout. Better association of errors with process/fd. A new optional additional softerr mount option to return ETIMEDOUT (instead of EIO) to the application after a timeout, so that applications written to be aware of this may better differentiate between the timeout case, e.g. to drive a failover response, and other I/O errors, and may choose a different recovery action for those cases. Mounts using only the soft mount option will see the other improvements, but timeout errors will still be returned to the application with the EIO error. General improvements will still benefit, to an extent, existing applications not written specifically to deal with ETIMEDOUT/EIO using the soft mount option with NFSv4, as follows: The client kernel will give the server longer to reply, without returning EIO to the application, as long as the network connection remains connected, for example if the server is badly congested. Swifter handling of real timeouts, and better correlation of error to process file descriptor. Be aware that the same caveats still apply: it's still dangerous to use soft with rw mounts unless you fully understand the implications, and all client applications are correctly written to handle the issues. If you are in any doubt about whether your applications behave correctly in the face of EIO & ETIMEDOUT errors, do not use soft rw mounts. New knfsd file descriptor cache (NFSv3 servers) UEK R6 NFSv3 servers benefit from a new knfsd file descriptor cache, so that the NFS server's kernel doesn't have to perform internal open and close calls for each NFSv3 READ or WRITE. This can speed up I/O in some cases. It also replaces the readahead cache. 
When an NFSv3 READ or WRITE request comes in to an NFS server, knfsd initially opens a new file descriptor, then it performs the read/write, and finally it closes the fd. While this is often a relatively inexpensive thing to do for most local server filesystems, it is usually less so for FUSE, clustered, networked and other filesystems with a slow open routine that are being exported by knfsd. This improvement attempts to reduce some of that cost by caching open file descriptors so that they may be reused by other incoming NFSv3 READ/WRITE requests for the same file. Performance General Much work has been done to further improve the performance of NFS & RPC over RDMA transports. The performance of RPC requests has been improved by removing BH (bottom-half soft IRQ) spinlocks. NFS clients Optimization of the default readahead size, to suit modern NFS block sizes and server disk latencies. RPC client parallelization optimizations. Performance optimizations for NFSv4 LOOKUP operations, and delegations, including not unnecessarily returning NFSv4 delegations, and locking improvements. Support the statx mask & query flags to enable optimizations when the user is requesting only attributes that are already up to date in the inode cache, or is specifying AT_STATX_DONT_SYNC. NFS servers Removal of the artificial limit on NFSv4.1 performance caused by limiting the number of outstanding RPC requests from a single client. Increase the limit of concurrent NFSv4.1 clients, i.e. stop a few greedy clients using up all the NFS server's session slots. Diagnostics To improve debugging and diagnosability, a large number of ftrace events have been added. Work to follow will include having a subset of these events optionally enabled during normal production, to aid fix on first occurrence without adversely impacting performance. Expose information about NFSv4 state held by servers on behalf of clients. This is especially important for NFSv4 OPEN calls, which are currently invisible to user space on the server, unlike locks (/proc/locks) and local processes' opens (/proc/pid/). A new directory (/proc/fs/nfsd/clients/) is added, with subdirectories for each active NFSv4 client. Each subdirectory has an info file with some basic information to help identify the client and a states/ directory that lists the OPEN state held by that client (see the sketch at the end of this post). This also allows forced revocation of client state. (NFSv3/NLM) Cleanup and modify lock code to show the pid of lockd as the owner of NLM locks. Miscellaneous NFS clients Finer-grained NFSv4 attribute checking. For NFS mounts over RDMA, the port=20049 (sic) mount option is now the default. NFS servers Locking and data structure improvements for duplicate request cache (DRC) and other caches. Improvements for running NFS servers in containers, including replacing the global duplicate reply cache with separate caches per network namespace; it is now possible to run separate NFS server processes in each network namespace, each with their own set of exports. NFSv3 clients and servers Improved correctness and reporting of NFS WRITE errors, on both NFSv3 clients and servers. This is especially important given that NFS WRITE operations are generally done asynchronously to application write system calls. Summary In this blog we've looked at the changes and new features relating to NFS & RPC, for both clients and servers available in the latest Unbreakable Enterprise Kernel Release 6.
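As an illustration of the NFSv4 state reporting described in the Diagnostics section above, the per-client entries can be inspected directly on the NFS server; the client id shown here is a placeholder:
ls /proc/fs/nfsd/clients/
cat /proc/fs/nfsd/clients/4/info      # basic information identifying this client
# the states entry under the same directory lists the OPEN state held by the client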

Oracle Linux kernel engineer Calum Mackay provides some insight into the new features for NFS in release 6 of the Unbreakable Enterprise Kernel (UEK).   UEK R6 is based on the upstream long-term stable...

Announcements

Downloading Oracle Linux ISO Images

Updated to incorporate new download options and changes in Software Delivery Cloud This post summarizes options to download Oracle Linux installation media Introduction There are several types of downloads: Full ISO image: contains everything needed to boot a system and install Oracle Linux. This is the most common download. UEK Boot ISO image: contains everything that is required to boot a system with Unbreakable Enterprise Kernel (UEK) and start an installation Boot ISO image: contains everything that is required to boot a system with Red Hat compatible kernel (RHCK) and start an installation Source DVDs ISO image: ISO images containing all the source RPMs for Oracle Linux Other types of images: Depending on the release of Oracle Linux, there are optional ISO images with additional drivers, etc. See also the documentation for Oracle Linux 7 and Oracle Linux 8 for more details on obtaining and preparing installation media. Download Oracle Linux from Oracle Linux Yum Server If all you need is an ISO image to perform an installation of a recent Oracle Linux release, your best bet is to download directly from the Oracle Linux yum server. From here you can directly download full ISO images and boot ISO images for the last few updates of Oracle Linux 8, 7 and 6 for both x86_64 and Arm (aarch64). No registration or sign in required. Start here. Download from Oracle Software Delivery Cloud Oracle Software Delivery Cloud is the official source to download all Oracle software, including Oracle Linux. Unless you are looking for older releases of Oracle Linux or complementary downloads other than the regular installation ISO, it’s probably quicker and easier to download from the Oracle Linux yum server. That said, to download from Oracle Software Delivery Cloud, start here and sign in. Choose one of the following methods to obtain your product: If your product is included in the Popular Downloads window, then select that product to add it to the cart. If your product is not included in the Popular Downloads window, then do the following: Type “Oracle Linux 7” or “Oracle Linux 8” in the search box, then click Search. From the search results list, select the product you want to download to add it to the cart. Note that for these instructions, there is no difference between a Release (REL) and a Download Package (DLP). Click Checkout. From the Platform/Languages drop-down list, select your system’s platform, then Continue. On the next page, review and accept the terms of licenses, then click Continue. Next, you have several options to download the files you are interested in. Directly by Clicking on the File Link If you only need one or two of the files and don’t anticipate any download hiccups that require stopping and resuming a download, simply click on the filename, e.g. V995537-01.iso   Image showing how to download a single file   Using a Download Manager Use a download manager if you want to download multiple files at the same time or pause and resume file download. A download manager can come in handy when you are having trouble completing a download, or want to queue up several files for unattended downloading. Remember to de-select any files you are not interested in. Image of download manager   Using wget If you want to download directly to a system with access to a command line only, use the WGET option to download a shell script. Download from Unofficial Mirrors In addition to the locations listed above, Oracle Linux ISOs can be downloaded from several unofficial mirror sites. 
Note that these sites are not endorsed by Oracle, but that you can verify the downloaded files using the procedure outlined below. Remember to Verify Oracle Linux downloads can be verified to ensure that they are exactly the downloads as published by Oracle and that they were downloaded without any corruption. For checksum files, signing keys and steps to verify the integrity of your downloads, see these instructions. Downloading Oracle Linux Source Code To download Oracle Linux source code, use the steps described under Download from Oracle Software Delivery Cloud to obtain Source DVD ISOs. Alternatively, you can find individual source RPMs on oss.oracle.com/sources or the Oracle Linux yum server
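As a hedged example of the verification step mentioned above, you can compare a downloaded image's checksum against the published value; the ISO name is the one shown earlier in this post, and the authoritative checksums are available via the linked instructions:
sha256sum V995537-01.iso
# compare the printed hash with the published checksum for this image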

Updated to incorporate new download options and changes in Software Delivery Cloud This post summarizes options to download Oracle Linux installation media Introduction There are several types of...

Announcements

Announcing the release of Oracle Linux 7 Update 8

Oracle is pleased to announce the general availability of Oracle Linux 7 Update 8. Individual RPM packages are available on the Unbreakable Linux Network (ULN) and the Oracle Linux yum server. ISO installation images will soon be available for download from the Oracle Software Delivery Cloud and Docker images are available via Oracle Container Registry and Docker Hub. Oracle Linux 7 Update 8 ships with the following kernel packages, which include bug fixes, security fixes and enhancements: Unbreakable Enterprise Kernel (UEK) Release 5 for x86-64 and aarch64 kernel-uek-4.14.35-1902.300.11.el7uek Red Hat Compatible Kernel (RHCK) for x86-64 only kernel-3.10.0-1127.el7 Notable new features for all architectures The Oracle Linux 7 Update 8 ISO includes the latest Unbreakable Enterprise Kernel Release 5 Update 3 Unbreakable Enterprise Kernel Release 6 available via an Oracle Linux 7 yum channel (a sketch of enabling it follows at the end of this post) SELinux enhancements for Tomcat domain access and graphical login sessions rsyslog has a new option for managing letter-case preservation by using the FROMHOST property for the imudp and imtcp modules Pacemaker concurrent-fencing cluster property defaults to true, speeding up recovery in a large cluster where multiple nodes are fenced Further information is available in the Release Notes for Oracle Linux 7 Update 8. Application Compatibility Oracle Linux maintains user space compatibility with Red Hat Enterprise Linux (RHEL), which is independent of the kernel version that underlies the operating system. Existing applications in user space will continue to run unmodified on Oracle Linux 7 Update 8 with UEK Release 5 and no re-certifications are needed for applications already certified with Red Hat Enterprise Linux 7 or Oracle Linux 7. About Oracle Linux The Oracle Linux operating environment delivers leading performance, scalability and reliability for business-critical workloads deployed on premise or in the cloud. Oracle Linux is the basis of Oracle Autonomous Linux and runs Oracle Gen 2 Cloud. Unlike many other commercial Linux distributions, Oracle Linux is easy to download and completely free to use, distribute, and update. Oracle Linux Support offers access to award-winning Oracle support resources and Linux support specialists; zero-downtime updates using Ksplice; additional management tools such as Oracle Enterprise Manager and Spacewalk; and lifetime support, all at a low cost. For more information about Oracle Linux, please visit www.oracle.com/linux.
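A hedged sketch of enabling the UEK R6 yum channel mentioned above on an Oracle Linux 7 system that uses the Oracle Linux yum server; the repository id ol7_UEKR6 is the conventional name and should be verified against your repository configuration:
sudo yum-config-manager --enable ol7_UEKR6
sudo yum install kernel-uek
# reboot into the new kernel once the installation completes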

Oracle is pleased to announce the general availability of Oracle Linux 7 Update 8. Individual RPM packages are available on the Unbreakable Linux Network (ULN) and the Oracle Linux yum server....

Announcements

Announcing Oracle Linux Virtualization Manager 4.3

Oracle is pleased to announce the general availability of Oracle Linux Virtualization Manager, release 4.3. This server virtualization management platform can be easily deployed to configure, monitor, and manage an Oracle Linux Kernel-based Virtual Machine (KVM) environment with enterprise-grade performance and support from Oracle. This release is based on the 4.3.6 release of the open source oVirt project. New Features with Oracle Linux Virtualization Manager 4.3 In addition to the base virtualization management features required to operate your data center, notable features added with the Oracle Linux Virtualization Manager 4.3 release include: Self-Hosted Engine: The oVirt Self-Hosted Engine is a hyper-converged solution in which the oVirt engine runs on a virtual machine on the hosts managed by that engine. The virtual machine is created as part of the host configuration, and the engine is installed and configured in parallel to the host configuration process. The primary benefit of the Self-Hosted Engine is that it requires less hardware to deploy an instance of the Oracle Linux Virtualization Manager as the engine runs as a virtual machine, not on physical hardware. Additionally, the engine is configured to be highly available. If the Oracle Linux host running the engine virtual machine goes into maintenance mode, or fails unexpectedly, the virtual machine will be migrated automatically to another Oracle Linux host in the environment. Gluster File System 6.0: oVirt has been integrated with GlusterFS, an open source scale-out distributed filesystem, to provide a hyper-converged solution where both compute and storage are provided from the same hosts. Gluster volumes residing on the hosts are used as storage domains in oVirt to store the virtual machine images. Oracle Linux Virtualization Manager is run as the Self Hosted Engine within a virtual machine on these hosts. GlusterFS 6.0 is released as an Oracle Linux 7 program. Virt-v2v: The virt-v2v tool converts a single guest from another hypervisor to run on Oracle Linux KVM. It can read Linux and Windows Virtual Machines running on Oracle VM or other hypervisors, and convert them to KVM machines managed by Oracle Linux Virtualization Manager. New guest OS support: Oracle Linux Virtualization Manager guest operating system support has been extended to include Oracle Linux 8, Red Hat Enterprise Linux 8, CentOS 8, SUSE Linux Enterprise Server (SLES) 12 SP5 and SLES 15 SP1. oVirt 4.3 features and bug fixes: Improved performance when running Windows as a guest OS. Included with this release are the latest Oracle VirtIO Drivers for Microsoft Windows. Higher level of security with TLSv1 and TLSv1.1 protocols now disabled for vdsm communications. Numerous engine, vdsm, UI, and bug fixes. More information on these features can be found in the Oracle Linux Virtualization Manager Document Library which has been updated for this release. Visit the Oracle Linux Virtualization Manager Training website for videos, documents, other useful links, and further information on setting up and managing this solution. Oracle Linux Virtualization Manager allows enterprise customers to continue supporting their on-premise data center deployments with the KVM hypervisor available on Oracle Linux 7 Update 7 with the Unbreakable Enterprise Kernel Release 5. This 4.3 release is an update release for Oracle Linux Virtualization Manager 4.2. 
Getting Started Oracle Linux Virtualization Manager 4.3 can be installed from the Oracle Linux yum server or the Oracle Unbreakable Linux Network. Customers that have already deployed Oracle Linux Virtualization Manager 4.2 can upgrade to 4.3 using these same sites. Two new channels have been created in the Oracle Linux 7 repositories that users will access to install or update Oracle Linux Virtualization Manager: oVirt 4.3 - base packages required for Oracle Linux Virtualization Manager oVirt 4.3 Extra Packages - additional packages for Oracle Linux Virtualization Manager Oracle Linux 7 Update 7 hosts can be installed with installation media (ISO images) available from the Oracle Software Delivery Cloud. Step-by-step instructions to download the Oracle Linux 7 Update 7 ISO can be found on the Oracle Linux Community website. Using the "Minimal Install" option during the installation process sets up a base KVM system which can then be updated using the KVM Utilities channel in the Oracle Linux 7 repositories. These KVM enhancements and other important packages for your Oracle Linux KVM host can be installed from the Oracle Linux yum server and the Oracle Unbreakable Linux Network: Latest - Latest packages released for Oracle Linux 7 UEK Release 5 - Latest Unbreakable Enterprise Kernel Release 5 packages for Oracle Linux 7 KVM Utilities - KVM enhancements (QEMU and libvirt) for Oracle Linux 7 Optional Latest - Latest optional packages released for Oracle Linux 7 Gluster 6 Packages - Latest Gluster 6 packages for Oracle Linux 7 Both Oracle Linux Virtualization Manager and Oracle Linux can be downloaded, used, and distributed free of charge and all updates and errata are freely available. Oracle Linux Virtualization Manager Support Support for Oracle Linux Virtualization Manager is available to customers with an Oracle Linux Premier Support subscription. Refer to Oracle Linux 7 License Information User Manual for information about Oracle Linux support levels. Oracle Linux Virtualization Manager Resources Oracle Linux Resources Oracle Virtualization Resources Oracle Linux yum server Oracle Linux Virtualization Manager Training
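As a rough sketch, the Getting Started steps above boil down to adding the release package that configures the oVirt 4.3 channels on an Oracle Linux 7 engine host and then installing and configuring the engine. The oracle-ovirt-release-el7 package name below is an assumption based on Oracle's usual release-package convention; follow the Oracle Linux Virtualization Manager documentation for the authoritative procedure:
$ sudo yum install -y oracle-ovirt-release-el7
$ sudo yum install -y ovirt-engine
$ sudo engine-setup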


Linux

Oracle Linux Learning Library: Start On a Video Path Now

The Oracle Linux Learning Library provides you learning paths that are adapted to different environments and infrastructures. These free video based learning paths permit you to start training at any time, from anywhere and to advance at your own pace. Learning paths are enhanced on an ongoing basis. Get started today on the learning path that suits your needs and interests: Linux on Oracle Cloud Infrastructure: See how to use Linux to deliver powerful compute and networking performance with a comprehensive portfolio of infrastructure and platform cloud services. Oracle Linux Cloud Native Environment: Learn how you can deploy the software and tools to develop microservices-based applications in-line with open standards and specifications. Oracle Linux 8: This learning path is being built out so you can develop skills to use Linux on Oracle Cloud Infrastructure, on-premise, or on other public clouds. Become savvy on this operating system that is free to use, free to distribute, free to update and easy to download. Oracle Linux Virtualization Manager: Use resources available to adopt this open-source distributed server virtualization solution. Gain proficiency in deploying, configuring, monitoring, and managing an Oracle Linux Kernel-based Virtual Machine (KVM) environment with enterprise-grade performance. Resources: Oracle Linux product documentation Oracle Cloud Infrastructure product documentation Oracle Linux Virtualization Manager product documentation  


Announcements

Announcing the Unbreakable Enterprise Kernel Release 6 for Oracle Linux

Oracle is pleased to announce the general availability of the Unbreakable Enterprise Kernel Release 6 for Oracle Linux. The Unbreakable Enterprise Kernel (UEK) for Oracle Linux provides the latest open source innovations and business-critical performance and security optimizations for cloud and on-premise deployment. It is the Linux kernel that powers Oracle Gen 2 Cloud and Oracle Engineered Systems such as Oracle Exadata Database Machine. Oracle Linux with UEK is available on the x86-64 and 64-bit Arm (aarch64) architectures.
Notable UEK6 new features and enhancements:
Linux 5.4 kernel: Based on the mainline Linux kernel version 5.4, this release includes many upstream enhancements.
Arm: Enhanced support for the Arm (aarch64) platform, including improvements in the areas of security and virtualization.
Cgroup v2: Cgroup v2 functionality was first introduced in UEK R5 to enable the CPU controller functionality. UEK R6 includes all Cgroup v2 features, along with several enhancements.
ktask: ktask is a framework for parallelizing CPU-intensive work in the kernel. It can be used to speed up large tasks on systems with available CPU power, where a task is single-threaded in user space.
Parallelized kswapd: Page replacement is handled in the kernel asynchronously by kswapd, and synchronously by direct reclaim. When free pages within the zone free list are low, kswapd scans pages to determine if there are unused pages that can be evicted to free up space for new pages. This optimization improves performance by avoiding direct reclaims, which can be resource intensive and time consuming.
Kexec firmware signing: The option to check and validate a kernel image signature is enabled in UEK R6. When kexec is used to load a kernel from within UEK R6, kernel image signature checking and validation can be implemented to ensure that a system only loads a signed and validated kernel image.
Memory management: Several performance enhancements have been implemented in the kernel's memory management code to improve the efficiency of clearing pages and cache, as well as enhancements to fault management and reporting.
NVDIMM: NVDIMM feature updates have been implemented so that persistent memory can be used as traditional RAM.
DTrace: DTrace support is enabled and has been re-implemented to use the Berkeley Packet Filter (BPF) that is integrated into the Linux kernel.
OCFS2: Support for the OCFS2 file system is enabled.
Btrfs: Support for the Btrfs file system is enabled, and support to select Btrfs as a file system type when formatting devices is available.
Important UEK6 changes in this release: The following sections describe the important changes in the Unbreakable Enterprise Kernel Release 6 (UEK R6) relative to UEK R5.
Core Kernel Functionality
High-performance asynchronous I/O with io_uring: io_uring is a fast, scalable asynchronous I/O interface for both buffered and unbuffered I/Os. It also supports asynchronous polled I/O. A user space library, liburing, provides basic functionality for applications, with helpers that allow applications to easily set up an io_uring instance and submit/complete I/O.
NVDIMM: Persistent memory can now be used as traditional RAM. Furthermore, fixes were implemented around the security-related commands within libnvdimm that allowed the use of keys where payload data was filled with zero values, so that secure operations can continue to take place where a zero-key is in use.
Cryptography
Simplified key description management: Keys and keyrings are more namespace aware.
Zstandard compression: Zstandard compression (zstd) is added to crypto and compress.
Filesystems
Btrfs: Btrfs continues to be supported. Several improvements and patches have been applied in this update, including support for swap files, Zstandard compression, and various performance improvements.
ext4: 64-bit timestamps have been added to the superblock fields.
OCFS2: OCFS2 continues to be supported. Several improvements and patches have been applied in this update, including support for the 'nowait' AIO feature and support on Arm platforms.
XFS: A new online health reporting infrastructure with a user space ioctl provides metadata health status after online fsck. Added support for fallocate swap files and swap files on real-time devices. Various performance improvements have also been made.
NFS: Performance improvements and enhancements have been made to RPC and the NFS client and server components.
Memory Management
TLB flushing code is improved to avoid unnecessary flushes and to reduce TLB shootdowns. Memory management is enhanced to improve throughput by leveraging clearing of huge pages more optimally. Page cache efficiency is improved by using the more efficient XArray data type. Fragmentation avoidance algorithms are improved and compaction and defragmentation times are faster. Improvements have been implemented to the handling of Transparent Huge Page faults and to provide better reporting on Transparent Huge Page status.
Networking
TCP Early Departure Time: The TCP stack now uses the Early Departure Time model, instead of the As Fast As Possible model, for sending packets. This brings several performance gains as it resolves a limitation in the original TCP/IP framework, and introduces the scheduled release of packets, to overcome hardware limitations and bottlenecks.
Generic Receive Offload (GRO): GRO is enabled for the UDP protocol.
TLS Receive: UEK R5 enabled the kernel to send TLS messages. This release enables the kernel to also receive TLS messages.
Zero-copy TCP Receive: UEK R5 introduced a zero-copy TCP feature for sending packets to the network. The UEK R6 release enables receive functionality for zero-copy TCP.
Packet Filtering: nftables is now the default backend for firewall rules. BPF-based network filtering (bpfilter) is also added in this release.
Express Data Path (XDP): XDP, a flexible, minimal, kernel-based packet transport for high-speed networking, has been added.
Security
Lockdown mode: Lockdown mode is improved. This release distinguishes between the integrity and confidentiality modes. When Secure Boot is enabled in UEK R6, lockdown integrity mode is enforced by default.
IBRS: Indirect Branch Restricted Speculation (IBRS) continues to be supported for processors that do not have built-in hardware mitigations for Speculative Execution Side Channel Vulnerabilities.
Improved protection in world-writable directories: UEK R6 discourages spoofing attacks by disallowing the opening of FIFOs or regular files not owned by the user in world-writable sticky directories, such as /tmp.
Arm KASLR: Kernel virtual address randomization is enabled by default for Arm platforms.
aarch64 pointer authentication: Adds primitives that can be used to mitigate certain classes of memory stack corruption attacks on Arm platforms.
Storage, Virtualization, and Driver Updates
NVMe: NVMe over Fabrics TCP host and target drivers have been added. Support for multi-path and passthrough commands has been added.
VirtIO: The VirtIO PMEM feature adds a VirtIO-based asynchronous flush mechanism and simulates persistent memory to a guest, allowing it to bypass a guest page cache. A VirtIO-IOMMU para-virtualized driver is also added in this release, allowing IOMMU requests over the VirtIO transport without emulating page tables. Arm platform: Guests on Arm aarch64 platform systems include pointer authentication (ARM v8.3) and Scalable Vector Extension (SVE) support. Device drivers: UEK R6 supports a large number of hardware server platforms and devices. In close cooperation with hardware and storage vendors, Oracle has updated several device drivers from the versions in mainline Linux 5.4. A complete list of the driver modules/versions included in UEK R6 is provided in the Release Notes appendix, "Appendix B, Driver Modules in Unbreakable Enterprise Kernel Release 6 (x86_64)". Security (CVE) Fixes A full list of CVEs fixed in this release can be found in the Release Notes for the UEK R6. Supported Upgrade Path Customers can upgrade existing Oracle Linux 7 and Oracle Linux 8 servers using the Unbreakable Linux Network or the Oracle Linux yum server by pointing to "UEK Release 6" Yum Channel. Software Download Oracle Linux can be downloaded, used, and distributed free of charge and updates and errata are freely available. This allows organizations to decide which systems require a support subscription and makes Oracle Linux an ideal choice for development, testing, and production systems. The user decides which support coverage is the best for each system individually, while keeping all systems up-to-date and secure. Customers with Oracle Linux Premier Support also receive access to zero-downtime kernel updates using Oracle Ksplice. About Oracle Linux The Oracle Linux operating environment delivers leading performance, scalability and reliability for business-critical workloads deployed on premise or in the cloud. Oracle Linux is the basis of Oracle Autonomous Linux and runs Oracle Gen 2 Cloud. Unlike many other commercial Linux distributions, Oracle Linux is easy to download and completely free to use, distribute, and update. Oracle Linux Support offers access to award-winning Oracle support resources and Linux support specialists; zero-downtime updates using Ksplice; additional management tools such as Oracle Enterprise Manager and Spacewalk; and lifetime support, all at a low cost.
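As an illustration of the upgrade path above, switching an Oracle Linux 7 system that uses the Oracle Linux yum server to UEK R6 might look like the following. The repository ID is an assumption; check yum repolist all for the exact channel name on your system:
$ sudo yum install -y yum-utils
$ sudo yum-config-manager --enable ol7_UEKR6
$ sudo yum install -y kernel-uek
$ sudo reboot
[After the reboot, uname -r should report a 5.4-based UEK R6 kernel]
$ uname -r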


Announcements

Announcing the Unbreakable Enterprise Kernel Release 5 Update 3 for Oracle Linux

The Unbreakable Enterprise Kernel (UEK) for Oracle Linux provides the latest open source innovations and key optimizations and security to enterprise cloud workloads. It is the Linux kernel that powers Oracle Cloud and Oracle Engineered Systems such as Oracle Exadata Database Machine as well as Oracle Linux on Intel-64, AMD-64 or ARM hardware. What's New? UEK R5 Update 3 is based on the mainline kernel version 4.14.35. Through actively monitoring upstream check-ins and collaboration with partners and customers, Oracle continues to improve and apply critical bug and security fixes to the Unbreakable Enterprise Kernel (UEK) R5 for Oracle Linux. This update includes several new features, added functionality, and bug fixes across a range of subsystems. UEK R5 Update 3 can be recognized with release number starting with 4.14.35-1902.300. Notable changes: 64-bit Arm (aarch64) Architecture. Significant improvements have been made to a number of drivers, through vendor contributions, for better support on embedded 64-bit Arm platforms. Core Kernel Functionality. UEK R5U3 provides equivalent core kernel functionality to UEK R5U2, making use of the same upstream mainline kernel release, with additional patches to enhance existing functionality and provide some minor bug fixes and security improvements. On-Demand Paging. On-Demand-Paging (ODP) is a virtual memory management technique to ease memory registration. File system and storage fixes.  XFS.  A deadlock bug that caused the file system to freeze lock and not release has been fixed. CIFS.  An upstream patch was applied to resolve an issue that could cause POSIX lock leakages and system crashes. Virtualization and QEMU. Minor bugfix for hardware incompatibility with QEMU.  A minor bugfix was applied to KVM code in line with upstream fixes that resolved a trivial testing issue with certain versions of QEMU on some hardware. Driver updates. In close cooperation with hardware and storage vendors, Oracle has updated several device drivers from the versions in mainline Linux 4.14.35; further updates are provided in the Appendix A (Driver Modules in Unbreakable Enterprise Kernel Release 5 Update 3) of the Release notes. For more details on these and other new features and changes, please consult the Release Notes for the UEK R5 Update 3. Security (CVE) Fixes A full list of CVEs fixed in this release can be found in the Release Notes for the UEK R5 Update 3. Supported Upgrade Path Customers can upgrade existing Oracle Linux 7 servers using the Unbreakable Linux Network or the Oracle Linux yum server by pointing to "UEK Release 5" Yum Channel. Software Download Oracle Linux can be downloaded, used, and distributed free of charge and all updates and errata are freely available. This allows organizations to decide which systems require a support subscription and makes Oracle Linux an ideal choice for development, testing, and production systems. The user decides which support coverage is the best for each system individually, while keeping all systems up-to-date and secure. Customers with Oracle Linux Premier Support also receive access to zero-downtime kernel updates using Oracle Ksplice. Compatibility UEK R5 Update 3 is fully compatible with the UEK R5 GA release. The kernel ABI for UEK R5 remains unchanged in all subsequent updates to the initial release. About Oracle Linux The Oracle Linux operating system is engineered for an open cloud infrastructure. 
It delivers leading performance, scalability and reliability for enterprise SaaS and PaaS workloads as well as traditional enterprise applications. Oracle Linux Support offers access to award-winning Oracle support resources and Linux support specialists; zero-downtime updates using Ksplice; additional management tools such as Oracle Enterprise Manager and Spacewalk; and lifetime support, all at a low cost. And unlike many other commercial Linux distributions, Oracle Linux is easy to download, completely free to use, distribute, and update. Oracle tests the UEK intensively with demanding Oracle workloads, and recommends the UEK for Oracle deployments and all other enterprise deployments. Resources – Oracle Linux Documentation Oracle Linux Software Download Oracle Linux Blogs Oracle Linux Blog Oracle Virtualization Blog Community Pages Oracle Linux Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter Data Sheets, White Papers, Videos, Training, Support & more Oracle Linux Product Training and Education Oracle Linux - education.oracle.com/linux
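To pick up the update described above and confirm the new kernel is running, something along these lines should work on an Oracle Linux 7 host already subscribed to the UEK Release 5 channel (a sketch; the exact release suffix on your system may differ):
$ sudo yum update -y kernel-uek
$ sudo reboot
[After the reboot, the UEK R5 Update 3 kernel reports a version starting with 4.14.35-1902.300]
$ uname -r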


Linux

Connect PHP 7 to Oracle Database using packages from Oracle Linux Yum Server

Note: This post was updated to include the latest available release of PHP as well as simplified installation instructions for Oracle Instant Client introduced starting with 19c. We recently added PHP 7.4 to our repos on Oracle Linux yum server. These repos also include the PHP OCI8 extension to connect your PHP applications to Oracle database. In this post I describe the steps to install PHP 7.4, PHP OCI8 and Oracle Instant Client on Oracle Linux to connect PHP to Oracle Database. For this blog post, I used a free Autonomous Database included in Oracle Cloud Free Tier. Install Oracle Instant Client Oracle Instant Client RPMs are available on Oracle Linux yum server also. To access them, install the oracle-release-el7 package first to set up the appropriate repositories: $ sudo yum -y install oracle-release-el7 $ sudo yum -y install oracle-instantclient19.5-basic If you want to be able to use SQL*Plus (this can come in handy for some sanity checks), install the SQL*Plus RPM also: $ sudo yum -y install oracle-instantclient19.5-sqlplus Create a Schema and Install the HR Sample Objects (Optional) You can use any schema you already have in your database. I’m going to use the HR schema from the Oracle Database Sample Schemas on github.com. If you already have a schema with database objects to work with, you can skip this step. $ yum -y install git $ git clone https://github.com/oracle/db-sample-schemas.git $ cd db-sample-schemas/human_resources As SYSTEM (or ADMIN, if you are using Autonomous Database), create a user PHPTEST: SQL> grant connect, resource, create view to phptest identified by <YOUR DATABASE PASSWORD>; SQL> alter user PHPTEST quota 5m on USERS; If you are using Autonomous Database like I am, change the tablespace above to DATA: SQL> alter user phptest quota 5m on DATA; As the PHPTEST user, run the scripts hr_cre.sql and hr_popul.sql to create and populate the HR database objects: SQL> connect phptest/<YOUR DATABASE PASSWORD>@<YOUR CONNECT STRING> SQL> @hr_cre.sql SQL> @hr_popul.sql Install PHP and PHP OCI8 To install PHP 7.4, make sure you have the latest oracle-php-release-el7 package installed first: $ sudo yum install -y oracle-php-release-el7 Next, install PHP and the PHP OCI8 extension corresponding to the Oracle Instant Client installed earlier: $ sudo yum -y install php php-oci8-19c Running a short PHP snippet should verify that we can connect PHP to the database and bring back data; a sketch of such a script is shown after the output below. Make sure you replace the schema and connect string as appropriate. Create a file emp.php based on that code. Run it! $ php emp.php This should produce the following: King Kochhar De Haan Hunold Ernst Austin Pataballa Lorentz Greenberg Faviet Chen Sciarra ...
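For reference, a minimal emp.php along the lines used above could look like the following sketch. It uses the standard PHP OCI8 calls; the password and connect string are placeholders you need to replace for your own environment:
<?php
// Connect as the PHPTEST user created earlier; replace the placeholders.
$conn = oci_connect('phptest', '<YOUR DATABASE PASSWORD>', '<YOUR CONNECT STRING>');
if (!$conn) {
    $e = oci_error();
    trigger_error(htmlentities($e['message'], ENT_QUOTES), E_USER_ERROR);
}

// Fetch and print the last name of each employee in the HR sample schema.
$stid = oci_parse($conn, 'SELECT last_name FROM employees');
oci_execute($stid);
while (($row = oci_fetch_array($stid, OCI_ASSOC)) != false) {
    echo $row['LAST_NAME'] . "\n";
}

oci_free_statement($stid);
oci_close($conn);
?>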


Linux

Taming Tracepoints in the Linux Kernel

Have you always wanted to learn how to implement tracepoints in the Linux Kernel? Then this blog is for you. Oracle Linux kernel engineer Alan Maguire explains how to implement a tracepoint in the Linux kernel. Here we are going to describe what tracepoints are, how they are defined and finally demonstrate the various ways they can be used. By fleshing out all of the steps, I'm hoping others may find this process a bit easier. As always it's a good idea to start with the Linux kernel documentation: https://www.kernel.org/doc/Documentation/trace/tracepoints.txt What are tracepoints? In kernel tracing, we have various classes of events we can trace. We can trace kernel function entry and return with kprobes, but the problem is functions appear and disappear as the kernel evolves, so this isn't a very stable way of tracing functionality. Further, kernel functions may be inlined sometimes. If you're not familiar, inlining occurs when the compiler (perhaps with an inline suggestion from the developer or indeed an insistence using __always_inline) replaces a call to a function with the body of that function every time it is called. This can increase the kernel image size, but can be a saving too, avoiding the dance required when the function is called. If a function is inlined, it cannot be kprobe'd as easily. Note that it's not impossible, but in such cases we'd need to figure out the offset in the function body where inline code begins (kprobes can be placed on most instructions, not just function entry). So all of the above is a long-winded way of establishing that we need something other than kprobes in kernel tracing. Tracepoints are that something else. Tracepoints simply represent hooks to events of interest. We'll talk about what we hook to them later on, but the idea is to represent events which may occur in possibly multiple places in the code rather than have something so tied to implementation details such as function entry. What tracepoints are available? Take a look at /sys/kernel/debug/tracing/events. The directories here represent the tracing subsystems that are available.
On a 5.3 kernel this looks like $ sudo ls /sys/kernel/debug/tracing/events alarmtimer ftrace kmem page_pool sunrpc bcache gpio kvm percpu swiotlb block hda kvmmmu power syscalls bpf_test_run hda_controller kyber preemptirq task bridge hda_intel libata printk tcp cfg80211 header_event lock qdisc thermal cgroup header_page mac80211 random timer clk huge_memory mac80211_msg ras tlb cma hwmon mce raw_syscalls udp compaction hyperv mdio rcu v4l2 context_tracking i2c mei regmap vb2 cpuhp i915 migrate regulator vmscan devfreq initcall module rpcgss vsyscall devlink intel_ish mpx rpm wbt dma_fence iommu msr rseq workqueue drm irq napi rtc writeback enable irq_matrix neigh sched x86_fpu exceptions irq_vectors net scsi xdp fib iwlwifi nfsd signal xen fib6 iwlwifi_data nmi skb xfs filelock iwlwifi_io oom smbus xhci-hcd filemap iwlwifi_msg page_isolation snd_pcm fs_dax iwlwifi_ucode pagemap sock An individual subsystem may define multiple events; for example in the net subsystem we see: $ sudo ls /sys/kernel/debug/tracing/events/net enable netif_receive_skb filter netif_receive_skb_entry napi_gro_frags_entry netif_receive_skb_exit napi_gro_frags_exit netif_receive_skb_list_entry napi_gro_receive_entry netif_receive_skb_list_exit napi_gro_receive_exit netif_rx net_dev_queue netif_rx_entry net_dev_start_xmit netif_rx_exit net_dev_xmit netif_rx_ni_entry net_dev_xmit_timeout netif_rx_ni_exit The directories correspond to recognizable events in generic network stack processing. What about the enable and filter file entries above? If we look we notice they are marked rw. As you might guess, the enable file is used to switch on the probe or set of probes. We can echo 1 to the appropriate enable file to enable events. For more details please refer to: https://www.kernel.org/doc/Documentation/trace/events.txt We can also filter events using filter expressions which operate on trace data fields. We'll describe both these features below. If we look in the directory of a specific trace event - say netif_rx - we see: $ sudo ls -al /sys/kernel/debug/tracing/events/net/netif_rx total 0 drwxr-xr-x. 2 root root 0 Oct 18 10:20 . drwxr-xr-x. 20 root root 0 Oct 18 10:20 .. -rw-r--r--. 1 root root 0 Oct 18 10:20 enable -rw-r--r--. 1 root root 0 Oct 18 10:20 filter -r--r--r--. 1 root root 0 Oct 18 10:20 format -r--r--r--. 1 root root 0 Oct 18 10:20 id -rw-r--r--. 1 root root 0 Oct 18 10:20 trigger So we see some additional files here. id is a unique identifier that can be used to attach to a specific tracepoint: $ sudo cat /sys/kernel/debug/tracing/events/net/netif_rx/id 1320 The format describes the fields used by the tracepoint. Each tracepoint prints a message which can be displayed by tracing tools, and to help it to do so it defines a set of fields which are populated by the raw tracing arguments. $ sudo cat /sys/kernel/debug/tracing/events/net/netif_rx/format name: netif_rx ID: 1320 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1;signed:0; field:int common_pid; offset:4; size:4; signed:1; field:void * skbaddr; offset:8; size:8; signed:0; field:unsigned int len; offset:16; size:4; signed:0; field:__data_loc char[] name; offset:20; size:4; signed:1; print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr, REC->len Here we see how the formatted message displays the device name, the address of the skb, etc. 
Finding tracepoints in the Linux source Taking the netif_rx tracepoint, let's see where it's placed. The convention is that tracepoints are called via trace_<tracepoint_name>, so we need to look for trace_netif_rx(). If we look in net/core/dev.c we see: static int netif_rx_internal(struct sk_buff *skb) { int ret; net_timestamp_check(netdev_tstamp_prequeue, skb); trace_netif_rx(skb); ... We see the tracepoint is called with a struct sk_buff * as an argument. From that sk_buff *, the fields described in the format are filled out, and those field values can be used in the formatted trace event output. How do we define a tracepoint? We'll describe that briefly here. Tracepoint definition Tracepoints are defined in header files under include/trace/events. Our netif_rx tracepoint is defined in include/trace/events/net.h. Each tracepoint definition consists of:
TP_PROTO, the function prototype used in calling the tracepoint. In the case of netif_rx(), that's simply struct sk_buff *skb.
TP_ARGS, the argument names: (skb).
TP_STRUCT__entry, the field definitions; these correspond to the fields which are assigned when the tracepoint is triggered.
TP_fast_assign, statements which take the raw argument to the tracepoint (the skb) and set the associated field values (skb len, skb pointer etc).
TP_printk, which is responsible for using those field values to display a relevant tracing message.
Note: tracepoints defined via TRACE_EVENT specify all of the above, whereas we can also define an event class which shares fields, assignments and messages. In fact netif_rx is of event class net_dev_template, so our field assignments and message come from that event class. Using tracepoints 1: debugfs We can use the debugfs enable file associated with a tracepoint to switch it on, and harvest the trace data from /sys/kernel/debug/tracing/trace_pipe. $ echo 1 > /sys/kernel/debug/tracing/events/net/netif_receive_skb/enable $ cat /sys/kernel/debug/tracing/trace_pipe <idle>-0 [030] ..s. 17429.831685: netif_receive_skb: dev=eth0 skbaddr=ffff99a3b231d600 len=52 <idle>-0 [030] .Ns. 17429.831694: netif_receive_skb: dev=eth0 skbaddr=ffff99a3b231d600 len=52 The messages we see above come from the netif_receive_skb tracepoint, which is defined as being of the net_dev_template class of events; the latter defines the message format: TP_printk("dev=%s skbaddr=%p len=%u", __get_str(name), __entry->skbaddr, __entry->len) Next we can use the filter to limit event reporting to larger skbs: $ echo "len > 128" > filter $ cat filter len > 128 $ cat /sys/kernel/debug/tracing/trace_pipe <idle>-0 [038] ..s. 578.249650: netif_receive_skb: dev=eth0 skbaddr=ffff9fa2043c6e00 len=156 <idle>-0 [038] ..s. 579.249051: netif_receive_skb: dev=eth0 skbaddr=ffff9fa2043c6e00 len=156 <idle>-0 [038] ..s. 580.250971: netif_receive_skb: dev=eth0 skbaddr=ffff9fa2043c6e00 len=156 Using tracepoints 2: perf perf record -e can be used to trace tracepoint entry, and the associated tracing messages will be recorded.
Here we record the net subsystem's netif_receive_skb tracepoint, and while doing so we ping the system doing the recording to trigger events: $ perf record -e net:netif_receive_skb $ perf report Samples: 8 of event 'net:netif_receive_skb', Event count (approx.): 8 Overhead Trace output 50.00% dev=eth0 skbaddr=0xffff99a3b1c10400 len=84 12.50% dev=eth0 skbaddr=0xffff999a9658ef00 len=76 12.50% dev=eth0 skbaddr=0xffff99a3b26de400 len=52 12.50% dev=eth0 skbaddr=0xffff99a3b2765500 len=76 12.50% dev=eth0 skbaddr=0xffff99a7026f0300 len=46 The messages we see above again come from the TP_printk associated with the netif_receive_skb event. Using tracepoints 3: bpf BPF gives us a few ways to connect to tracepoints via different program types: BPF_PROG_TYPE_TRACEPOINT: this program type gives access to the TP_STRUCT_entry data available at tracepoint entry; i.e. the data assigned from the raw tracepoint arguments via the TP_fast_assign() section in the tracepoint definition. The tricky part with these tracepoints is the context definition must match that described in the TP_STRUCT_entry definition exactly. For example, in samples/bpf/xdp_redirect_cpu_kern.c the tracepoint format for xdp_redirect tracepoints is defined as follows /* Tracepoint format: /sys/kernel/debug/tracing/events/xdp/xdp_redirect/format * Code in: kernel/include/trace/events/xdp.h */ struct xdp_redirect_ctx { u64 __pad; // First 8 bytes are not accessible by bpf code int prog_id; // offset:8; size:4; signed:1; u32 act; // offset:12 size:4; signed:0; int ifindex; // offset:16 size:4; signed:1; int err; // offset:20 size:4; signed:1; int to_ifindex; // offset:24 size:4; signed:1; u32 map_id; // offset:28 size:4; signed:0; int map_index; // offset:32 size:4; signed:1; }; // offset:36 Then the tracepoint program uses this context: static __always_inline int xdp_redirect_collect_stat(struct xdp_redirect_ctx *ctx) { u32 key = XDP_REDIRECT_ERROR; struct datarec *rec; int err = ctx->err; if (!err) key = XDP_REDIRECT_SUCCESS; ... BPF_PROG_TYPE_RAW_TRACEPOINT: BPF programs are in this case provided with the raw arguments to the tracepoint, i.e. before the fast assign is done. The context to such programs is defined as struct bpf_raw_tracepoint_args { __u64 args[0]; }; ...so the __u64 representation of the raw arguments can be accessed from args[]. Using tracepoints 4: DTrace DTrace in Oracle Linux supports a perf provider which, similar to BPF, allows us to access the raw tracepoint arguments. We can see the available probes via dtrace -l: $ modprobe sdt $ dtrace -l -P perf ID PROVIDER MODULE FUNCTION NAME 656 perf vmlinux syscall_trace_enter sys_enter 657 perf vmlinux syscall_slow_exit_work sys_exit 658 perf vmlinux emulate_vsyscall emulate_vsy scall 665 perf vmlinux trace_xen_mc_flush_reason xen_mc_flus h_reason ... 
We can gather stack traces, count events etc using DTrace; here we count how many times the tracepoint fires with a specific stack trace, and see we get 6 events with the same stack trace: $ dtrace -n 'perf:::netif_receive_skb { @c[stack()] = count(); }' dtrace: description 'perf:::netif_receive_skb ' matched 1 probe ^C vmlinux`do_invalid_op+0x20 vmlinux`invalid_op+0x11a vmlinux`__netif_receive_skb_core+0x3f vmlinux`__netif_receive_skb+0x18 vmlinux`netif_receive_skb_internal+0x45 vmlinux`napi_gro_receive+0xd8 virtio_net`__dta_receive_buf_110+0xf2 virtio_net`__dta_virtnet_poll_91+0x134 vmlinux`net_rx_action+0x289 vmlinux`__do_softirq+0xd9 vmlinux`irq_exit+0xe6 vmlinux`do_IRQ+0x59 vmlinux`common_interrupt+0x1c2 vmlinux`native_safe_halt+0x12 vmlinux`default_idle+0x1e cpuidle_haltpoll`haltpoll_enter_idle+0xa9 vmlinux`cpuidle_enter_state+0xa4 vmlinux`cpuidle_enter+0x17 vmlinux`call_cpuidle+0x23 vmlinux`do_idle+0x172 4 $ Next steps Now that we've seen some of the basics around tracepoints and how to gather information from them, we're ready to create a new one!
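To give a flavour of what that involves, here is a rough sketch of a TRACE_EVENT definition modeled on the net_dev_template fields discussed above. The subsystem and event names are invented for illustration; this is not the kernel's actual definition of netif_rx:
/* include/trace/events/mysubsys.h -- hypothetical example */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM mysubsys

#if !defined(_TRACE_MYSUBSYS_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MYSUBSYS_H

#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/tracepoint.h>

TRACE_EVENT(mysubsys_rx,

	/* prototype used when calling trace_mysubsys_rx(skb) */
	TP_PROTO(struct sk_buff *skb),

	TP_ARGS(skb),

	/* fields recorded in the trace buffer */
	TP_STRUCT__entry(
		__string(name, skb->dev->name)
		__field(void *, skbaddr)
		__field(unsigned int, len)
	),

	/* assignments from the raw argument to those fields */
	TP_fast_assign(
		__assign_str(name, skb->dev->name);
		__entry->skbaddr = skb;
		__entry->len = skb->len;
	),

	/* message shown by tracing tools */
	TP_printk("dev=%s skbaddr=%p len=%u",
		  __get_str(name), __entry->skbaddr, __entry->len)
);

#endif /* _TRACE_MYSUBSYS_H */

/* This part must be outside protection */
#include <trace/define_trace.h>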


Announcements

Announcing Gluster Storage Release 6 for Oracle Linux

The Oracle Linux and Virtualization team is pleased to announce the release of Gluster Storage Release 6 for Oracle Linux, bringing customers higher performance, new storage capabilities and improved management. Gluster Storage is an open source, POSIX compatible file system capable of supporting thousands of clients while using commodity hardware. Gluster provides a scalable, distributed file system that aggregates disk storage resources from multiple servers into a single global namespace. Gluster provides built-in optimization for different workloads and can be accessed using an optimized Gluster FUSE client or standard protocols including SMB/CIFS. Gluster can be configured to enable both distribution and replication of content with quota support, snapshots, and bit-rot detection for self-healing. New Features Gluster Storage Release 6 introduces support for the following important capabilities:
Gluster Geo-Replication: Geo-replication provides a continuous, asynchronous, and incremental replication service from one site to another over a LAN, WAN or across the Internet. Geo-replication uses a master–slave model, whereby replication and mirroring occurs between:
Master – the geo-replication source GlusterFS volume
Slave – the geo-replication target GlusterFS volume
Session – the unique identifier of a geo-replication session
Differences between replicated volumes and geo-replication:
Replicated volumes mirror data across clusters, provide high availability, and use synchronous replication (each and every file operation is sent across all the bricks).
Gluster geo-replication mirrors data across geographically distributed clusters, ensures data is backed up for disaster recovery, and uses asynchronous replication (it checks for changes in files periodically and syncs them upon detecting differences).
Support for Oracle Linux 8: Gluster Storage Release 6 for Oracle Linux introduces support for Oracle Linux 8 with the Red Hat Compatible Kernel, in addition to Oracle Linux 7. Gluster Storage Release 6 for Oracle Linux 8 has been built as an Application Stream Module named glusterfs; the module profiles available are server and client. An additional glusterfs-developer module is available as a technology preview and introduces the option to leverage gluster-ansible RPMs to automate Gluster deployment and management with Ansible.
Additional enhancements in Gluster Storage Release 6 for Oracle Linux: several stability fixes, client-side inode garbage collection, and performance improvements.
Gluster Storage Release 6 for Oracle Linux is currently supported on the following configurations:
x86_64, Oracle Linux 8 (minimum Oracle Linux 8 Update 1): Red Hat Compatible Kernel (RHCK)
x86_64, Oracle Linux 7 (minimum Oracle Linux 7 Update 7): Unbreakable Enterprise Kernel Release 5 (UEK R5), Unbreakable Enterprise Kernel Release 4 (UEK R4), Red Hat Compatible Kernel (RHCK)
aarch64, Oracle Linux 7 (minimum Oracle Linux 7 Update 7): Unbreakable Enterprise Kernel Release 5 (UEK R5)
Installation Gluster Storage is available on the Unbreakable Linux Network (ULN) and the Oracle Linux yum server. For more information on hardware requirements and how to install and configure Gluster, please review the Gluster Storage for Oracle Linux Release 6 Documentation. Support Support for Gluster Storage is available to customers with an Oracle Linux Premier Support Subscription.
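For example, on an Oracle Linux 7 server the Gluster server packages can be pulled in roughly as follows. The oracle-gluster-release-el7 package name is an assumption based on Oracle's usual release-package convention; consult the documentation above for the authoritative steps:
$ sudo yum install -y oracle-gluster-release-el7
$ sudo yum install -y glusterfs-server
$ sudo systemctl enable --now glusterd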
Oracle Linux Resources: Documentation Oracle Linux Software Download Oracle Linux Oracle Container Registry Blogs Oracle Linux Blog Community Pages Oracle Linux Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter Data Sheets, White Papers, Videos, Training, Support & more Oracle Linux Product Training and Education Oracle Linux For community-based support, please visit the Oracle Linux space on the Oracle Developer Community.


Linux

Easy Provisioning Of Cloud Instances On Oracle Cloud Infrastructure With The OCI CLI

As a developer, I often provision ephemeral instances in OCI for small projects or for testing purposes. Between the Browser User Interface, which is not very convenient for repetitive tasks, and Terraform, which would be over-engineered for my simple needs, the OCI Command Line Interface (CLI) offers a simple but powerful interface to the Oracle Cloud Infrastructure. In this article I will share my experience with this tool and provide, as an example, the script I am using to provision cloud instances. The OCI CLI The OCI CLI requires Python version 3.5 or later, running on Mac, Windows, or Linux. Installation instructions are provided on the OCI CLI Quickstart page. The examples from this article have been tested on Linux, macOS and Windows. Windows users can use either Windows Subsystem for Linux or Git BASH. These examples assume that the OCI CLI is already installed and configured; and that the compartment is saved in the ~/.oci/oci_cli_rc file: [DEFAULT] compartment-id = ocid1.compartment.oc1..xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Handling the OCI CLI output: JMESPath The main challenge of using the OCI CLI in scripts is handling its responses. By default, all responses to a command are returned in JSON format. E.g. $ oci os ns get { "data": "mynamespace" } Alternatively, a table format is also available: $ oci os ns get --output table +-------------+ | Column1 | +-------------+ | mynamespace | +-------------+ But none of these formats are directly usable in a shell script. One could use the well known jq JSON processor, but the OCI CLI is built with the JMESPath library which allows JSON manipulation without the need for a third-party tool. With the same simple request we can select the data field: $ oci os ns get --query 'data' "mynamespace" Finally we can get rid of the quotes using the raw output format: $ oci os ns get --query 'data' --raw-output mynamespace And to capture the output in a shell variable: $ ns=$(oci os ns get --query 'data' --raw-output) $ echo $ns mynamespace As a less trivial example, the following returns the image OCID of the latest Oracle Linux 7.7 image compatible with the VM.Standard2.1 shape: $ ocid=$(oci compute image list \ --operating-system "Oracle Linux" \ --operating-system-version "7.7" \ --shape "VM.Standard2.1" \ --sort-by TIMECREATED \ --query 'data[0].id' \ --raw-output) $ echo $ocid ocid1.image.oc1.eu-frankfurt-1.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx The --raw-output option is only effective when the output of the query returns a single string value. When multiple values are expected we will concatenate them in the query. Depending on the format of the fields, I typically use two different constructions to retrieve the data: concatenate with space or new line separators. The space construct is the simplest, but it obviously won’t work if your fields are free text. $ response=$(oci compute image list \ --operating-system "Oracle Linux" \ --operating-system-version "7.7" \ --shape "VM.Standard2.1" \ --sort-by TIMECREATED \ --query '[data[0].id, data[0]."display-name"] | join('"'"' '"'"',@)' \ --raw-output) $ read ocid display_name <<< "${response}" $ echo $ocid ocid1.image.oc1.eu-frankfurt-1.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx $ echo $display_name Oracle-Linux-7.7-2020.01.28-0 Note: never use pipes to read and store data in shell variables as pipes are run in sub-shells!
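A toy illustration of why that note matters: in the first form below the assignment happens in the pipeline's subshell and is lost, while the here-string form keeps it in the current shell.
$ oci os ns get --query 'data' --raw-output | read ns
$ echo "$ns"

$ read ns <<< "$(oci os ns get --query 'data' --raw-output)"
$ echo "$ns"
mynamespace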
The new line construct is slightly more complex, but can be used with fields containing spaces: $ response=$(oci compute image list \ --operating-system "Oracle Linux" \ --operating-system-version "7.7" \ --shape "VM.Standard2.1" \ --sort-by TIMECREATED \ --query '[data[0].id, data[0]."display-name"] | join(`\n`,@)' \ --raw-output) $ { read ocid; read display_name; } <<< "${response}" $ echo $ocid ocid1.image.oc1.eu-frankfurt-1.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx $ echo $display_name Oracle-Linux-7.7-2020.01.28-0 Notes: You can use inverted quotes instead of quotes for strings in JMESPath queries, it makes the overall quoting more readable. If you use a bash shell under Windows (Git BASH), make sure it properly handles DOS type end-of-line by setting the IFS environment variable: IFS=$' \t\r\n' Provisioning script Using the above constructions you can easily write a script to facilitate image provisioning. The provisioning script from which the code snippets are extracted is part of the ol-sample-scripts project on GitHub. My goal is to be able to swiftly provision instances in the same environment, so the script assumes there are a Virtual Cloud Network (VCN) and a Subnet already defined in the tenancy. A Public IP is always assigned. The following sections describe the high level steps needed to provision an image. Platform images This is the easiest case: the image OCID can be retrieved with a simple query: image_list=$(oci compute image list \ --operating-system "${operating_system}" \ --operating-system-version "${operating_system_version}" \ --shape ${shape} \ --sort-by TIMECREATED \ --query '[data[0].id, data[0]."display-name"] | join(`\n`,@)' \ --raw-output) it is important to include the target shape in the query to only retrieve compatible images. The Availability Domain is retrieved using pattern matching: availability_domain=$(oci iam availability-domain list \ --all \ --query 'data[?contains(name, `'"${availability_domain}"'`)] | [0].name' \ --raw-output) We also need the VCN and Subnet OCIDs: ocid_vcn=$(oci network vcn list \ --query "data [?\"display-name\"=='${vcn_name}'] | [0].id" \ --raw-output) ocid_subnet=$(oci network subnet list \ --vcn-id ${ocid_vcn} \ --query "data [?\"display-name\"=='${subnet_name}'] | [0].id" \ --raw-output) We now have all the data needed to launch the instance: ocid_instance=$(oci compute instance launch \ --display-name ${instance_name} \ --availability-domain "${availability_domain}" \ --subnet-id "${ocid_subnet}" \ --image-id "${ocid_image}" \ --shape "${shape}" \ --ssh-authorized-keys-file "${public_key}" \ --assign-public-ip true \ --wait-for-state RUNNING \ --query 'data.id' \ --raw-output) We use the --wait-for-state option to wait until the image is up and running. This allows us to retrieve and print the IP address, so we can immediately connect to our new instance: public_ip=$(oci compute instance list-vnics \ --instance-id "${ocid_instance}" \ --query 'data[0]."public-ip"' \ --raw-output) Marketplace images Unfortunately, the oci compute image list command only returns Platform and Custom images. What if we want to provision Oracle images from the Marketplace (Cloud Developer, Autonomous Linux, …)? This is a bit more complex as these images require you to accept the Oracle Standard Terms and Restrictions before using them. The Marketplace is also known as the Product Image Catalog (PIC) and the corresponding API calls are done with the oci pic commands. 
To instantiate an image from the Marketplace we need to:
Get the image listing OCID – the query must be specific enough to return a single row. pic_listing=$(oci compute pic listing list \ --all \ --query 'data[?contains("display-name", `'"${image_name}"'`)].join('"'"' '"'"', ["listing-id", "display-name"]) | join(`\n`, @)' \ --raw-output)
Using that listing OCID, find the latest image OCID in that listing: version_list=$(oci compute pic version list --listing-id "${ocid_listing}" \ --query 'sort_by(data,&"time-published")[*].join('"'"' '"'"',["listing-resource-version", "listing-resource-id"]) | join(`\n`, reverse(@))' \ --raw-output) The above query does not allow us to specify a shape like we do for the Platform images. We have to browse the list until we find a compatible image: available=$(oci compute pic version get --listing-id "${ocid_listing}" \ --resource-version "${image_version}" \ --query 'data."compatible-shapes"|contains(@, `'${shape}'`)' \ --raw-output)
Now that we have a compatible image OCID, we need to retrieve the agreement for the listing OCID: agreement=$(oci compute pic agreements get --listing-id "${ocid_listing}" \ --resource-version "${image_version}" \ --query '[data."oracle-terms-of-use-link", data.signature, data."time-retrieved"] | join(`\n`,@)' \ --raw-output)
And eventually subscribe to the agreement: subscription=$(oci compute pic subscription create --listing-id "${ocid_listing}" \ --resource-version "${image_version}" \ --signature "${signature}" \ --oracle-tou-link "${oracle_tou_link}" \ --time-retrieved "${time_retrieved}" \ --query 'data."listing-id"' \ --raw-output)
Once subscribed, we can proceed as we did for the Platform images. Cloud-init Beyond the simple provisioning, I like to have a ready-to-use instance with my favorite tools installed and configured (shell, editor preferences, …). This can be done with a cloud-init file. Cloud-init files can be very complex (see the cloud-init documentation), but in its simplest form it can just be a shell script. The file is passed as a parameter to the oci compute instance launch command. As an illustration, the project repository contains a simple oci-cloud-init.sh file. Sample session $ ./oci-provision.sh --help Usage: oci-provision.sh OPTIONS Provision an OCI compute instance.
Options: --help, -h show this text and exit --os operating system (default: Oracle Linux) --os-version operating system version --image IMAGE image search pattern in the Marketplace os/os-version are ignored when image is specified --name NAME compute VM instance name --shape SHAPE VM shape (default: VM.Standard2.1) --ad AD Availability Domain (default: AD-1) --key KEY public key to access the instance --vcn VCN name of the VCN to attach to the instance --subnet SUBNET name of the subnet to attach to the instance --cloud-init CLOUD-INIT optional clout-init file to provision the instance Default values for parameters can be stored in ./oci-provision.env $ ./oci-provision.sh --image "Cloud Dev" \ --name Development \ --ad AD-3 \ --key ~/.ssh/id_rsa.pub \ --vcn "VCN-Dev" \ --subnet "Public Subnet" \ --cloud-init oci-cloud-init.sh +++ oci-provision.sh: Getting image listing oci-provision.sh: Selected image: oci-provision.sh: Image : Oracle Cloud Developer Image oci-provision.sh: Summary : Oracle Cloud Developer Image oci-provision.sh: Description: An Oracle Linux 7-based image with the latest development tools, languages, Oracle Cloud Infrastructure Software Development Kits and Database connectors at your fingertips +++ oci-provision.sh: Getting latest image version oci-provision.sh: Version Oracle_Cloud_Developer_Image_19.11 selected +++ oci-provision.sh: Getting agreement and subscribing... oci-provision.sh: Term of use: https://objectstorage.us-ashburn-1.oraclecloud.com/n/partnerimagecatalog/b/eulas/o/oracle-apps-terms-of-use.txt oci-provision.sh: Subscribed +++ oci-provision.sh: Retrieving AD name +++ oci-provision.sh: Retrieving VCN +++ oci-provision.sh: Retrieving subnet +++ oci-provision.sh: Provisioning Development with VM.Standard2.1 (oci-cloud-init.sh) Action completed. Waiting until the resource has entered state: ('RUNNING',) +++ oci-provision.sh: Getting public IP address oci-provision.sh: Public IP is: xxx.xxx.xxx.xxx Demo


Linux

Generating a vmcore in OCI

In this blog, Oracle Linux kernel developer Manjunath Patil demonstrates how you can configure your Oracle Linux instances (both bare metal and virtual machine) running in Oracle Cloud for crash dumps. OCI instances can generate a vmcore using kdump. kdump is a mechanism to dump the 'memory contents of a system' [vmcore] when the system crashes. The vmcore can later be analyzed using the crash utility to understand the cause of the system crash. The kdump mechanism works by booting a second kernel [called the kdump kernel or capture kernel] when the system running the first kernel [called the panicked kernel] crashes. The kdump kernel runs in its own reserved memory so that it won't affect the memory used by the system. The OCI systems are all pre-configured with kdump. When an OCI instance crashes, it will generate the vmcore which can be shared with developers to understand the cause of the crash.
How to configure your Oracle Linux system with kdump
1. Pre-requisites Make sure you have the kexec-tools rpm installed: # yum install kexec-tools # yum list installed | grep kexec-tools This is the main rpm which contains the tools to configure kdump.
2. Reserve memory for the kdump kernel The kdump kernel needs its own reserved memory so that when it boots, it won't use the first kernel's memory. The first kernel is told to reserve the memory for the kdump kernel using the crashkernel=auto kernel parameter. The first kernel needs to be rebooted for the kernel parameter to be effective. a. Here is how we can check if the memory is reserved: # cat /proc/iomem | grep -i crash 27000000-370fffff : Crash kernel # dmesg | grep -e "Reserving .* crashkernel" [0.000000] Reserving 257MB of memory at 624MB for crashkernel (System RAM: 15356MB) b. How to set kernel parameters: OL6 systems - update the /etc/grub.conf file. OL7 systems - update the /etc/default/grub file [GRUB_CMDLINE_LINUX= line] and re-generate the grub.cfg [grub2-mkconfig -o /boot/grub2/grub.cfg].
3. Set up the serial console Setting the serial console prints the progress of the kdump kernel onto the serial console. It also helps with debugging any kdump kernel related issues. This setting is optional. To set: add the 'console=tty0 console=ttyS0,115200n8' kernel parameters. Addition of kernel parameters requires a reboot to be effective.
4. Configure kdump /etc/kdump.conf is used to configure kdump. The following are the two main configurations: a. Where to dump the vmcore? The default location is /var/crash/. To change it, update the line starting with 'path'. Make sure the new path has enough space to accommodate the vmcore. b. Minimize the size of the vmcore. We can reduce the size of the vmcore by excluding memory pages such as pages filled with zero, user process data pages, free pages etc. This is controlled by the 'core_collector' line in the config file. The default value is 'core_collector makedumpfile -p --message-level 1 -d 31' -p = compress the data using snappy --message-level = print messages on console. Range 0[brevity] to 31[verbose]. -d = dump level = dictates size of vmcore. Range 0[biggest] to 31[smallest] More on dump level and message-level in 'man makedumpfile'
5. Make the kdump service run at boot time OL6: # chkconfig kdump on; chkconfig kdump --list OL7: # systemctl enable kdump; systemctl is-enabled kdump
6. Manually crash the system to make sure it's working # echo c > /proc/sysrq-trigger [After reboot] # ls -l /var/crash/*
7. Keep the system in the configured state, so that when the system crashes a vmcore is collected.
8. Examples a. 
OL6U10 - VM [root@ol6u10-vm ~]# cat /proc/cmdline ro root=UUID=... crashkernel=auto ... console=tty0 console=ttyS0,9600 [root@ol6u10-vm ~]# service kdump status Kdump is operational [root@ol6u10-vm ~]# cat /proc/iomem | grep -i crash 27000000-370fffff : Crash kernel [root@ol6u10-vm ~]# dmesg | grep -e "Reserving .* crashkernel" [ 0.000000] Reserving 257MB of memory at 624MB for crashkernel (System RAM: 15356MB) [root@ol6u10-vm ~]# echo c > /proc/sysrq-trigger [After reboot] [root@ol6u10-vm ~]# cat /etc/kdump.conf | grep -v '#' | grep -v '^$' path /var/crash core_collector makedumpfile -l --message-level 1 -d 31 [root@ol6u10-vm ~]# ls -lhs /var/crash/127.0.0.1-20.../ total 96M 96M -rw-------. 1 root root 96M Dec 13 00:48 vmcore 44K -rw-r--r--. 1 root root 41K Dec 13 00:48 vmcore-dmesg.txt [root@ol6u10-vm ~]# free -h total used free shared buffers cached Mem: 14G 395M 14G 208K 10M 114M -/+ buffers/cache: 270M 14G Swap: 8.0G 0B 8.0G b. OL6U10 - BM [root@ol6u10-bm ~]# cat /proc/cmdline ro root=UUID=... crashkernel=auto ... console=tty0 console=ttyS0,9600 [root@ol6u10-bm ~]# service kdump status Kdump is operational [root@ol6u10-bm ~]# cat /proc/iomem | grep -i crash 27000000-37ffffff : Crash kernel [root@ol6u10-bm ~]# dmesg | grep -e "Reserving .* crashkernel" [ 0.000000] Reserving 272MB of memory at 624MB for crashkernel (System RAM: 262010MB) [root@ol6u10-bm ~]# echo c > /proc/sysrq-trigger [After Reboot] [root@ol6u10-bm ~]# cat /etc/kdump.conf | grep -v '#' | grep -v '^$' path /var/crash core_collector makedumpfile -l --message-level 1 -d 31 [root@ol6u10-bm ~]# ls -lhs /var/crash/127.0.0.1-20.../ total 1.1G 1.1G -rw-------. 1 root root 1.1G Dec 18 05:24 vmcore 92K -rw-r--r--. 1 root root 90K Dec 18 05:23 vmcore-dmesg.txt [root@ol6u10-bm ~]# free -h total used free shared buffers cached Mem: 251G 1.4G 250G 224K 13M 157M -/+ buffers/cache: 1.2G 250G Swap: 8.0G 0B 8.0G c. OL7U7 - VM [root@ol7u7-vm opc]# free -h total used free shared buff/cache available Mem: 14G 290M 13G 16M 274M 13G Swap: 8.0G 0B 8.0G [root@ol7u7-vm opc]# service kdump status Redirecting to /bin/systemctl status kdump.service kdump.service - Crash recovery kernel arming Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled) Active: active (exited) ... ... ol7u7-vm systemd[1]: Started Crash recovery kernel arming. [root@ol7u7-vm opc]# cat /proc/cmdline BOOT_IMAGE=... crashkernel=auto ... console=tty0 console=ttyS0,9600 [root@ol7u7-vm opc]# dmesg | grep -e "Reserving .* crashkernel" [ 0.000000] Reserving 257MB of memory at 624MB for crashkernel (System RAM: 15356MB) [root@ol7u7-vm opc]# cat /proc/iomem | grep -i crash 27000000-370fffff : Crash kernel [root@ol7u7-vm ~]# ls -lhs /var/crash/127.0.0.1-20.../ total 90M 90M -rw-------. 1 root root 90M Dec 18 13:48 vmcore 48K -rw-r--r--. 1 root root 47K Dec 18 13:48 vmcore-dmesg.txt [root@ol7u7-vm ~]# cat /etc/kdump.conf | grep -v '#' | grep -v "^$" path /var/crash core_collector makedumpfile -l --message-level 1 -d 31 d. OL7U7-BM [root@ol7u7-bm opc]# free -h total used free shared buff/cache available Mem: 251G 1.1G 250G 17M 298M 249G Swap: 8.0G 0B 8.0G [root@ol7u7-bm opc]# service kdump status Redirecting to /bin/systemctl status kdump.service kdump.service - Crash recovery kernel arming Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled) Active: active (exited) ... ... ol7u7-bm systemd[1]: Started Crash recovery kernel arming. [root@ol7u7-bm opc]# cat /proc/cmdline BOOT_IMAGE=... crashkernel=auto ... 
console=tty0 console=ttyS0,9600 [root@ol7u7-bm opc]# cat /proc/iomem | grep -i crash 25000000-35ffffff : Crash kernel [root@ol7u7-bm opc]# dmesg | grep -e "Reserving .* crashkernel" [ 0.000000] Reserving 272MB of memory at 592MB for crashkernel (System RAM: 262010MB) [root@ol7u7-bm opc]# cat /etc/kdump.conf | grep -v '#' | grep -v '^$' path /var/crash core_collector makedumpfile -l --message-level 1 -d 31 [root@ol7u7-bm opc]# echo c > /proc/sysrq-trigger [root@ol7u7-bm ~]# ls -lhs /var/crash/127.0.0.1-20.../ total 1.1G 1.1G -rw-------. 1 root root 1.1G Dec 18 14:16 vmcore 116K -rw-r--r--. 1 root root 114K Dec 18 14:16 vmcore-dmesg.txt
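To pull the steps above together, here is a minimal, illustrative sequence for an OL7 instance, run as root. The commands and kernel parameters are the ones used in this article; on OCI images crashkernel=auto is normally already present, so the grub step is often a no-op there, and sizes and paths will differ on your system.

yum install -y kexec-tools

# Add "crashkernel=auto console=tty0 console=ttyS0,115200n8" to the
# GRUB_CMDLINE_LINUX= line in /etc/default/grub, then regenerate grub.cfg
# and reboot so the reservation takes effect:
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

# After the reboot, verify the reservation and arm the kdump service:
grep -i crash /proc/iomem
systemctl enable kdump && systemctl start kdump
systemctl is-enabled kdump

# Optional sanity test -- this crashes and reboots the machine:
echo c > /proc/sysrq-trigger
# [after reboot]
ls -l /var/crash/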

Linux

Building (Small) Oracle Linux Images For The Cloud

Overview Oracle Linux Image Tools is a sample project to build small or customized Oracle Linux Cloud images in a repeatable way. It provides a modular bash framework which uses HashiCorp Packer to build images in Oracle VM VirtualBox. Images are then converted to an appropriate format depending on the Cloud provider. This article shows you how to build the sample images from this repository and how to use the framework to build custom images. The framework is based around two concepts: Distribution and Cloud modules. A Distribution module is responsible for the installation and configuration of Oracle Linux as well as the packages needed for your project. The sample ol7-slim and ol8-slim distributions provide Oracle Linux images with a minimalist set of packages (about 250 packages – smaller than an Oracle Linux Minimal Install). A Cloud module ensures that the image is properly configured and packaged for a particular cloud provider. The following modules are currently available: oci: Oracle Cloud Infrastructure (QCOW2 file) olvm: Oracle Linux Virtualization Manager (OVA file) ovm: Oracle VM Server (OVA file) azure: Microsoft Azure (VHD file) vagrant-virtualbox and vagrant-libvirt: Vagrant boxes (BOX file) none: no cloud customization (OVA file) Build requirements Environment A Linux environment is required for building images. The project is developed and tested with Oracle Linux 7 and 8, but should run on most Linux distributions. If your environment is a virtual machine, it must support nested virtualization. The build tool needs root privileges to mount the generated images. Ensure sudo is properly configured for the user running the build. Software You will need the following software installed: HashiCorp Packer and Oracle VM VirtualBox Oracle Linux 7: yum --enablerepo=ol7_developer install packer VirtualBox-6.1 Oracle Linux 8: dnf --enablerepo=ol8_developer install VirtualBox-6.1 and download and install Packer from HashiCorp kpartx and qemu-img to manipulate the artifacts: yum install kpartx qemu-img Disk space You will need at least twice the size of your images as free disk space. That is: building a 30GB image will require 60GB of free space. Building the project images Building the images from the project is straightforward. Configuration Build configuration is done by editing the env.properties file (or better, a copy of it). Options are documented in the property file, but at the very least you must provide: WORKSPACE: the directory used for the build ISO_URL / ISO_SHA1_CHECKSUM: location of the Oracle Linux ISO image. You can download it from the Oracle Software Delivery Cloud or use a public mirror. The image is cached in the workspace. DISTR: the Distribution to build CLOUD: the target cloud provider. Sample build The following env.properties.oci property file is used to build a minimal OL7 image for the Oracle Cloud Infrastructure, using all default parameters: WORKSPACE="/data/workspace" ISO_URL="http://my.mirror.example.com/iso/ol7/OracleLinux-R7-U7-Server-x86_64-dvd.iso" ISO_SHA1_CHECKSUM="3ef94628cf1025dab5f10bbc1ed2005ca0cb0933" DISTR="ol7-slim" CLOUD="oci" Run the script: $ ./bin/build-image.sh --env env.properties.oci +++ build-image.sh: Parse arguments +++ build-image.sh: Load environment +++ build-image.sh: Stage Packer files +++ build-image.sh: Stage kickstart file +++ build-image.sh: Generate Packer configuration file +++ build-image.sh: Run Packer build-image.sh: Spawn HTTP server build-image.sh: Invoke Packer ... 
build-image.sh: Package image +++ build-image.sh: Cleanup Workspace +++ build-image.sh: All done +++ build-image.sh: Image available in /data/workspace/OL7U7_x86_64-oci-b0 $ That’s it! The /data/workspace/OL7U7_x86_64-oci-b0 directory now contains OL7U7_x86_64-oci-b0.qcow, a QCOW2 file which can be imported and run on OCI. Adding new modules Directory layout Each Distribution module is represented by a subdirectory of the distr directory. Each Cloud module is represented by a subdirectory of the cloud directory. Additionally, Cloud actions for a specific Distribution can be defined in the cloud/<cloud>/<distr> directory. Any element that is not necessary can be omitted – e.g. the none cloud module only provides a packaging function. All the env.properties files are merged and made available to the scripts at runtime. They define parameters with default values which can be overridden by the user in the global env.properties file in the project base directory. Adding a distribution To add a new distribution, create a directory in distr/ with the following files: env.properties: parameters for the distribution. ks.cfg: a kickstart file to bootstrap the installation. This is the only mandatory file. image-scripts.sh: a shell script with the following optional functions which will be invoked on the build host: distr::validate: validate the parameters before the build. distr::kickstart: tailor the kickstart file based on the parameters. distr::image_cleanup: disk image cleanup run at the end of the build. provision.sh: a shell script with the following optional functions which will be invoked on the VM used for the build: distr::provision: image provisioning (install/configure software) distr::cleanup: image cleanup (uninstall software, …) files directory: the files in this directory are copied to the image in /tmp/distr and can be used by the provisioning scripts. Adding a cloud The process is similar to the distribution: create a directory in cloud/ with the following files: env.properties: parameters for the cloud. image-scripts.sh: a shell script with the following optional functions which will be invoked on the build host: cloud::validate: validate the parameters before the build. cloud::kickstart: tailor the kickstart file based on the parameters. cloud::image_cleanup: disk image cleanup run at the end of the build. cloud::image_package: package the image in a suitable format for the cloud provider. This is the only mandatory function. provision.sh: a shell script with the following optional functions which will be invoked on the VM used for the build: cloud::provision: image provisioning (install/configure software) cloud::cleanup: image cleanup (uninstall software, …) files directory: the files in this directory are copied to the image in /tmp/cloud and can be used by the provisioning scripts. If some cloud actions are specific to a particular distribution, they can be specified in the <cloud>/<distr> subdirectory. If a cloud_distr::image_package function is provided it will override the cloud::image_package one. (A minimal skeleton of such a cloud module is sketched after the builder flow below.) Builder flow The builder goes through the following steps: Build environment All the env.properties files are sourced and merged. The user-provided one is sourced last and defines the build behavior. The validate() functions are called. 
These hooks perform a sanity check on the parameters Packer configuration and run The distribution kickstart file is copied and the kickstart() hooks have the opportunity to customize it The distribution is installed in a VirtualBox VM using this kickstart file The files directories are copied to /tmp on the VM The provision() functions are run in the VM The cleanup() functions are run in the VM Packer will then shut down and export the VM Image cleanup The generated image is unpacked and mounted on the host The image_cleanup() functions are called The image is unmounted The final package is created by the image_package() function, either from cloud_distr or from cloud
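To make the module layout more concrete, here is a purely hypothetical skeleton of a new cloud module, cloud/mycloud/image-scripts.sh. The hook names (cloud::validate, cloud::image_cleanup, cloud::image_package) come from the description above, but the module name, the MYCLOUD_FORMAT parameter, the System.img/VM_NAME file names and the argument passed to the cleanup hook are illustrative assumptions, not part of the project:

#!/usr/bin/env bash
# Hypothetical cloud/mycloud/image-scripts.sh skeleton -- sketch only.

cloud::validate() {
  # Sanity-check parameters from env.properties before the build starts.
  # MYCLOUD_FORMAT is an assumed, module-specific parameter.
  [[ "${MYCLOUD_FORMAT:-qcow2}" =~ ^(raw|qcow2)$ ]] || {
    echo "unsupported MYCLOUD_FORMAT" >&2
    return 1
  }
}

cloud::image_cleanup() {
  # Runs on the build host with the generated image mounted; trim
  # anything the target cloud does not need.
  local root_fs="$1"   # assumed to be the mount point of the image
  rm -rf "${root_fs}"/var/log/*.log
}

cloud::image_package() {
  # Mandatory hook: convert the built disk image into the format
  # expected by the cloud provider (file names are assumptions).
  qemu-img convert -O "${MYCLOUD_FORMAT:-qcow2}" System.img "${VM_NAME:-image}.qcow2"
}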

Events

Live Webcast: Top 5 Reasons to Build your Virtualization with Oracle Linux KVM

Register Today: February 27, 2020 EMEA: 10:00 a.m. GMT/11:00 CET/12:00 SAST/14:00 GST APAC: 10:30 AM IST/ 1:00 PM SGT/4:00 PM AEDT North America: 09:00 AM Pacific Standard Time Recent industry surveys indicate that most enterprises have a strategy of using multiple clouds. Most who are planning to migrate to the cloud start with modernizing their on-premises data center. The choice of Linux and virtualization can make a big impact on their infrastructure, both today and tomorrow. Oracle Linux Virtualization Manager, based on the open source oVirt project, can be easily deployed to configure, monitor, and manage an Oracle Linux KVM environment with enterprise-grade performance and support for both on-premises and cloud deployments. Join this webcast to learn from Oracle experts about the top 5 reasons to build your virtualization infrastructure using Oracle Linux KVM: Accelerated deployment with ready-to-go VMs that include Oracle software Increased performance and security Simplified, easy management of the full stack Improved licensing costs through hard partitioning Lower licensing and support costs while increasing benefits Featured Speakers Simon Coter Director of Product Management for Linux and Virtualization, Oracle Simon is responsible for both Oracle Linux and Virtualization, the Unbreakable Enterprise Kernel along with all its sub-components and add-ons, including Oracle Linux KVM, Oracle Linux Virtualization Manager, Ceph, Gluster, Oracle VM and VirtualBox. John Priest Product Management Director for Oracle Server Virtualization John covers all aspects of the Oracle Linux Virtualization Manager and Oracle VM product life-cycles.

Libcgroup in the Twenty-First Century

In this blog post, Oracle Linux kernel developer Tom Hromatka writes about the new testing frameworks, continuous integration and code coverage capabilities that have been added to libcgroup. In 2008 libcgroup was created to simplify how users interact with and manage cgroups. At the time, only cgroups v1 existed, the libcgroup source was hosted in a Subversion repository on SourceForge, and System V still ruled the universe. Fast forward to today and the landscape is changing quickly. To pave the way for cgroups v2 support in libcgroup, we have added unit tests, functional tests, continuous integration, code coverage, and more. Unit Testing In May 2019 we added the googletest unit testing framework to libcgroup. libcgroup has many large, monolithic functions that perform the bulk of the cgroup management logic, and adding cgroup v2 support to these complex functions could easily introduce regressions. To combat this, we plan on adding tests before we add cgroup v2 support. Functional Testing In June 2019 we added a functional test framework to libcgroup. The functional test framework consists of several Python classes that either represent cgroup data or can be used to manage cgroups and the system. Years ago tests were added to libcgroup, but they have proven difficult to run and maintain because they are destructive to the host system's libcgroup hierarchy. With the advent of containers, this problem can easily be avoided. The functional test framework utilizes LXC containers and the LXD interfaces to encapsulate the tests. Running the tests within a container provides a safe environment where cgroups can be created, deleted, and modified in an easily reproducible setting - without destructively modifying the host's cgroup hierarchy. libcgroup's functional tests are quick and easy to write and provide concise and informative feedback on the status of the run. Here's a simple example of a successful test run: $ ./001-cgget-basic_cgget.py ----------------------------------------------------------------- Test Results: Run Date: Dec 02 17:54:28 Passed: 1 test(s) Skipped: 0 test(s) Failed: 0 test(s) ----------------------------------------------------------------- Timing Results: Test Time (sec) --------------------------------------------------------- setup 5.02 001-cgget-basic_cgget.py 0.76 teardown 0.00 --------------------------------------------------------- Total Run Time 5.79 And here's an example of where something went wrong. In this case I have artificially caused the Run() class to raise an exception early in the test run. The framework reports the test and the exact command that failed. The return code, stdout, and stderr from the failing command are also reported to facilitate debugging. And of course the log file contains a chronological history of the entire test run to further help in troubleshooting the root cause. $ ./001-cgget-basic_cgget.py ----------------------------------------------------------------- Test Results: Run Date: Dec 02 18:11:47 Passed: 0 test(s) Skipped: 0 test(s) Failed: 1 test(s) Test: 001-cgget-basic_cgget.py - RunError: command = ['sudo', 'lxc', 'exec', 'TestLibcg', '--', '/home/thromatka/git/libcgroup/src/tools/cgset', '-r', 'cpu.shares=512', '001cgget'] ret = 0 stdout = b'' stderr = b'I artificially injected this exception' ----------------------------------------------------------------- Continuous Integration and Code Coverage In September 2019 we added continuous integration and code coverage to libcgroup. 
libcgroup's github repository is now linked with Travis CI to automatically configure the library, build the library, run the unit tests, and run the functional tests every time a commit is pushed to the repo. If the tests pass, Travis CI invokes coveralls.io to gather code coverage metrics. The continuous integration status and the current code coverage percentage are prominently displayed on the github source repository. Currently all two :) tests are passing and code coverage is at 16%. I have many more tests currently in progress, so expect to see these numbers improve significantly in the next few months. Future Work Ironically, after all these changes, we're now nearly ready to start the "real work." A loose roadmap of our upcoming improvements: Add an "ignore" rule to cgrulesengd. (While not directly related to the cgroup v2 work, this new ignore rule will heavily utilize the testing capabilities outlined above) Add a ton more tests - both unit and functional Add cgroup v2 support to our functional testing framework. I have a really rough prototype working, but I think automating it will require help from the Travis CI development team Add cgroup v2 capabilities to libcgroup utilities like cgget, cgset, etc. Design and implement a cgroup abstraction layer that will abstract away all of the gory detail differences between cgroup v1 and cgroup v2
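If you want to try the tests yourself, a local run looks roughly like the sketch below. The repository location and the autotools-based build steps are my assumptions about the current layout and may differ between releases; the functional tests additionally assume a working LXC/LXD setup on the host:

# Sketch only -- paths and build steps may differ between releases.
git clone https://github.com/libcgroup/libcgroup.git
cd libcgroup
./bootstrap.sh && ./configure && make    # autotools build
make check                               # runs the googletest unit tests

# Functional tests run inside an LXD-managed container:
cd tests/ftests
./001-cgget-basic_cgget.py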

Events

Join the Oracle Linux and Virtualization Team in London at Oracle OpenWorld Europe

The Oracle OpenWorld Global Series continues with our next stop at ExCeL London, February 12–13, 2020. With just 5 days left to register, you’ll want to sign up now for your complimentary pass and reserve your place. Across the two days, you can immerse yourself in the infinite possibilities of a data-driven world. Wednesday, 12 February | Insight Starts Here | Outpace Change with Intelligence Explore how leading companies—faced with an ever-accelerating pace of change—are unlocking insights with data to re-engineer the core of their business, elevate the value they deliver to customers, pioneer new ways of working, and drive completely new opportunities. Thursday, 13 February | Innovation Starts Here | Technology-Powered Possibilities Dive deep into the transformational and autonomous technologies fundamentally changing work and life. Fuel innovation by pulling value from vast amounts of data at scale and unleashing opportunities with AI and machine learning, alongside a long list of featured speakers and luminaries. We look forward to seeing you there. Be sure to add these two sessions to your agenda: Wim Coekaerts, SVP, Software Development, will present a Solution Keynote: Cloud Platform and Middleware Strategy and Roadmap [SOL1194-LON] Thursday, Feb 13 | 09:00 - 10:20 | Arena F - Zone 6 In this session, Wim Coekaerts will discuss the strategy and vision for Oracle’s comprehensive cloud platform services and on-premise software. Customers are on a number of journeys to the cloud: moving and modernizing workloads out of data centers; transitioning off on-premises apps to SaaS; innovating with new API-first, chatbot-based container native applications; optimizing IT operations and security from the cloud; and getting real-time insight leveraging big data and analytics from the cloud. Hear from customers about how they leverage Oracle Cloud for their digital transformation. And hear how Oracle’s application development, integration, systems management, and security solutions leverage artificial intelligence to drive cost savings and operational efficiency for hybrid and multicloud ecosystems. Simon Coter, Product Management Director, Oracle Linux and Virtualization, delivers a Breakout Session: Tools and Techniques for Modern Cloud Native Development [SES1270-LON] Thursday, Feb 13 | 13:05 - 13:40 | Arena C - Zone 2 Simon Coter will explore the tools, techniques, and strategies you can apply using Oracle Linux to help you evolve toward a cloud native future. On-premise or in the cloud, you'll learn how Oracle Linux Cloud Native Environment enables you to deploy reliable, secure, and scalable applications. You will also discover how Kubernetes, Docker, CRI-O, and Kata Containers, available for free with Oracle Linux Premier Support, and Oracle VM VirtualBox deliver an exceptional DevSecOps solution. Explore all of the conference’s content through the detailed content catalogue and attend keynote sessions and other sessions of interest. Join Us at The Exchange | Zone 3 Talk with product experts and experience the latest Oracle Linux and Oracle Virtualization technologies first hand. You’ll find us at two stands in Zone 3 of The Exchange. Don’t miss the Raspberry Pi “Mini” Super Computer as it searches for aliens! @ Groundbreakers Hub | Zone 3 Mini is the sibling of Super Pi, the super computer demonstrated at Oracle OpenWorld San Francisco in October, 2019, and among the top 10 Raspberry Pi projects last year.  
Mini is a portable Pi cluster in a large pelican-like case on wheels with 84 Raspberry Pi 3B+ boards running Oracle Linux 8. Check out Mini as it searches for aliens with SETI@home.   Bold ideas. Breakthrough technologies. Better possibilities. It all starts here. Register now. We look forward to meeting you in London. Join the conversation: @OracleLinux @OracleOpenWorld #OOWLON #OracleTux

Linux Kernel Development

Unbinding Parallel Jobs in Padata

Oracle Linux kernel developer Daniel Jordan contributes this post on enhancing the performance of padata. padata is a generic framework for parallel jobs in the kernel -- with a twist. It not only schedules jobs on multiple CPUs but also ensures that each job is properly serialized, i.e. finishes in the order it was submitted. This post will provide some background on this somewhat obscure feature of the core kernel and cover recent efforts to enhance its parallel performance in preparation for more multithreading in the kernel. How Padata Works padata allows users to create an instance that represents a certain class of parallel jobs, for example IPsec decryption (see pdecrypt in the kernel source). The instance serves as a handle when submitting jobs to padata so that all jobs submitted with the same handle are serialized amongst themselves. An instance also allows for fine-grained control over which CPUs are used to run work, and contains other internal data such as the next sequence number to assign for serialization purposes and the workqueue used for parallelization. To initialize a job (known cryptically as padata_priv in the code), a pair of function pointers are required, parallel and serial, where parallel is responsible for doing the actual work in a workqueue worker and serial completes the job once padata has serialized it. The user submits the job along with a corresponding instance to the framework via padata_do_parallel to start it running, and once the job's parallel part is finished, the user calls padata_do_serial to inform padata of this. padata_do_serial is currently always called from parallel, but this is not strictly required. padata ensures that a job's serial function is called only when the serial functions of all previously-submitted jobs from the same instance have been called. Though parallelization is ultimately padata's (and this blog post's) reason for being, its serialization algorithm is the most technically interesting part, so I'll go on a tangent to explain a bit about it. For scalability reasons, padata allocates internal per-CPU queues, and there are three types, parallel, reorder, and serial, where each type is used for a different phase of a padata job's lifecycle. When a job is submitted to padata, it's atomically assigned a unique sequence number within the instance that determines the order its serialization callback runs. The sequence number is hashed to a CPU that is used to select which queue a job is placed on. When the job is preparing to execute its parallel function, it is placed on a parallel per-CPU queue that determines which CPU it runs on (this becomes important later in the post). Using a per-CPU queue allows multiple tasks to submit parallel jobs concurrently with only minimal contention from the atomic op on the sequence number, avoiding a shared lock. When the parallel part finishes and the user calls padata_do_serial, padata then places the job on the reorder queue, again corresponding to the CPU that the job hashed to. And finally, a job is placed on the serial queue once all jobs before it have been serialized. During the parallel phase, jobs may finish out of order relative to when they were submitted. Nevertheless, each call to padata_do_serial places the job on its corresponding reorder queue and attempts to process the entire reorder queue across all CPUs, which entails repeatedly checking whether the job with the next unserialized sequence number has finished until there are no more jobs left to reorder. 
These jobs may or may not include the one passed to padata_do_serial because again, jobs finish out of order. This process of checking for the next unserialized job is the biggest potential bottleneck in all of padata because a global lock is used. Without the lock, multiple tasks might process the reorder queues at once, leading to duplicate serial callbacks and list corruption. However, if all calls to padata_do_serial were to wait on the lock when only one call actually ends up processing all the jobs, the rest of the tasks would be waiting for no purpose and introduce unnecessary latency in the system. To avoid this situation, the lock is acquired with a trylock call, and if a task fails to get the lock, it can safely bail out of padata knowing that a current or future lock holder will take care of its job. This serialization process is important for the use case that prompted padata to begin with, IPsec. IPsec throughput was a bottleneck in the kernel because a single CPU, the one that the receiving NIC's interrupt ran on, was doing all the work, with the CPU-intensive portion largely consisting of the crypto operations. Parallelization could address this, but serialization was required to maintain the packet ordering that the upper layer protocols required, and getting that right was not an easy task. See this presentation from Steffen Klassert, the original author of padata, for more background. More Kernel Multithreading Though padata was designed to be generic, it currently has just the one IPsec user. There are more kernel codepaths that can benefit from parallelization, such as struct page initialization, page clearing in various memory management paths (huge page fallocate, get_user_pages), and page freeing at munmap and exit time. Two previous blog posts and an LWN article on ktask have covered some of these. Recent upstream feedback has called for merging ktask with padata, and the first step in that process is to change where padata schedules its parallel workers. To that end, I posted a series on the mailing lists, merged for the v5.3 release, that adds a second workqueue per padata instance dedicated to parallel jobs. Earlier in the post, I described padata's per-CPU parallel queues. To assign a job to one of these queues, padata uses a simple round-robin algorithm to hash a job's sequence number to a CPU, and then runs the job bound to that CPU alone. Each successive job submitted to the instance runs on the next CPU. There are two problems with this approach. First, it's not NUMA-aware, so on multi-socket systems, a job may not run locally. Second, on a busy system, a job will likely complete faster if it allows the scheduler to select the CPU within the NUMA node it's run on. To solve both problems, the series uses an unbound workqueue, which is NUMA-aware by default and not bound to a particular CPU (hence the name). Performance Results The numbers from tcrypt, a test module in the kernel's crypto layer, look promising. Parts are shown here, see the upstream post for the full data. Measurements are from a 2-socket, 20-core, 40-CPU Xeon server. For repeatability, modprobe was bound to a CPU and the serial cpumasks for both pencrypt and pdecrypt were also restricted to a CPU different from modprobe's. 
# modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
# modprobe tcrypt mode=211 sec=1
# modprobe tcrypt mode=215 sec=1

Busy system (tcrypt run while 10 stress-ng tasks were burning 100% CPU)

                              base              test
                         ---------------   ---------------
 speedup  key_sz  blk_sz  ops/sec  stdev    ops/sec  stdev

(pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)
  117.2x     160      16      960     30     112555  24775
  135.1x     160      64      845    246     114145  25124
  113.2x     160     256      993     17     112395  24714
  111.3x     160     512     1000      0     111252  23755
  110.0x     160    1024      983     16     108153  22374
  104.2x     160    2048      985     22     102563  20530
   98.5x     160    4096      998      3      98346  18777
   86.2x     160    8192     1000      0      86173  14480

multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)
  242.2x     160      16     2363    141     572189  16846
  242.1x     160      64     2397    151     580424  11923
  231.1x     160     256     2472     21     571387  16364
  237.6x     160     512     2429     24     577264   8692
  238.3x     160    1024     2384     97     568155   6621
  216.3x     160    2048     2453     74     530627   3480
  209.2x     160    4096     2381    206     498192  19177
  176.5x     160    8192     2323    157     410013   9903

Idle system (tcrypt run by itself)

                              base              test
                         ---------------   ---------------
 speedup  key_sz  blk_sz  ops/sec  stdev    ops/sec  stdev

(pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)
    2.5x     160      16    63412  43075     161615   1034
    4.1x     160      64    39554  24006     161653    981
    6.0x     160     256    26504   1436     160110   1158
    6.2x     160     512    25500     40     157018    951
    5.9x     160    1024    25777   1094     151852    915
    5.8x     160    2048    24653    218     143756    508
    5.6x     160    4096    24333     20     136752    548
    5.0x     160    8192    23310     15     117660    481

multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)
    1.0x     160      16   412157   3855     426973   1591
    1.0x     160      64   412600   4410     431920   4224
    1.1x     160     256   410352   3254     453691  17831
    1.2x     160     512   406293   4948     473491  39818
    1.2x     160    1024   395123   7804     478539  27660
    1.2x     160    2048   385144   7601     453720  17579
    1.2x     160    4096   371989   3631     449923  15331
    1.2x     160    8192   346723   1617     399824  18559

A few tools were used in the initial performance analysis to confirm the source of the speedups. I'll show results from one of them, ftrace. Custom kernel events were added to record the runtime and CPU number of each crypto request, which runs a padata job under the hood. For analysis only (not the runs that produced these results), the threads of the competing workload stress-ng were bound to a known set of CPUs, and two histograms were created of crypto request runtimes, one for just the CPUs without the stress-ng tasks ("uncontended") and one with ("contended"). The histogram clearly shows increased times for the padata jobs with contended CPUs, as expected:

Crypto request runtimes (usec) on uncontended CPUs
# request-count: 11980; mean: 41; stdev: 23; median: 45

 runtime (usec)      count
 --------------      --------
     0 - 1      [    0]:
     1 - 2      [    0]:
     2 - 4      [    0]:
     4 - 8      [  209]: *
     8 - 16     [ 3630]: *********************
    16 - 32     [  188]: *
    32 - 64     [ 6571]: **************************************
    64 - 128    [ 1381]: ********
   128 - 256    [    1]:
   256 - 512    [    0]:
   512 - 1024   [    0]:
  1024 - 2048   [    0]:
  2048 - 4096   [    0]:
  4096 - 8192   [    0]:
  8192 - 16384  [    0]:
 16384 - 32768  [    0]:

Crypto request runtimes (usec) on contended CPUs
# request-count: 3991; mean: 3876; stdev: 455; median 3999

 runtime (usec)      count
 --------------      --------
     0 - 1      [    0]:
     1 - 2      [    0]:
     2 - 4      [    0]:
     4 - 8      [    0]:
     8 - 16     [    0]:
    16 - 32     [    0]:
    32 - 64     [    0]:
    64 - 128    [    0]:
   128 - 256    [    0]:
   256 - 512    [    4]:
   512 - 1024   [    4]:
  1024 - 2048   [    0]:
  2048 - 4096   [ 3977]: **************************************
  4096 - 8192   [    4]:
  8192 - 16384  [    2]:
 16384 - 32768  [    0]:

Conclusion Now that padata has unbound workqueue support, look out for further enhancements to padata in coming releases! 
Next steps include creating padata threads in cgroups so they can be properly throttled and adding multithreaded job support to padata.
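As a rough illustration of the measurement setup described above, the cpumask restriction and the CPU binding of modprobe can be reproduced with something like the following. The /sys/kernel/pcrypt paths and the hexadecimal cpumask format are my reading of the padata/pcrypt sysfs interface and may vary by kernel version, so treat this as a sketch rather than a recipe:

# Instantiate the pcrypt template first:
modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3

# Restrict the serial callbacks of pencrypt/pdecrypt to CPU 1
# (bitmask 0x2), keeping them away from the CPU running modprobe:
echo 2 > /sys/kernel/pcrypt/pencrypt/serial_cpumask
echo 2 > /sys/kernel/pcrypt/pdecrypt/serial_cpumask

# Pin modprobe, the job submitter, to CPU 0 and run the benchmarks:
taskset -c 0 modprobe tcrypt mode=211 sec=1
taskset -c 0 modprobe tcrypt mode=215 sec=1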

Announcements

Announcing the First Oracle Linux 7 Template for Oracle Linux KVM

We are proud to announce the first Oracle Linux 7 Template for Oracle Linux KVM and Oracle Linux Virtualization Manager. The new Oracle Linux 7 Template for Oracle Linux KVM and Oracle Linux Virtualization Manager supplies powerful automation. It is built on cloud-init, the same technology used today on Oracle Cloud Infrastructure. The template has been built with the following components/options: Oracle Linux 7 Update 7 x86_64 Unbreakable Enterprise Kernel 5 - kernel-uek-4.14.35-1902.5.2.2.el7uek.x86_64 Red Hat Compatible Kernel - kernel-3.10.0-1062.1.2.el7.x86_64 8GB of RAM 15GB of OS virtual disk Downloading Oracle Linux 7 Template for Oracle Linux KVM Oracle Linux 7 Template for Oracle Linux KVM is available on Oracle Software Delivery Cloud. Search for "Oracle Linux KVM" and select "Oracle Linux KVM Templates for Oracle Linux" Click on the "Add to Cart" button and then click on "Checkout" in the right upper corner. On the following window, select "Linux-x86_64" and click on the "Continue" button: Accept the "Oracle Standard Terms and Restrictions" to continue and, on the following window, click on "V988166-01.zip" to download the Oracle Linux 7 Template for Oracle Linux KVM and on "V988167-01.zip" to download the README with instructions: Further information Oracle Linux 7 Template for Oracle Linux KVM allows you to configure different options on the first boot for your Virtual Machine; cloud-init options configured on Oracle Linux 7 Template are: VM Hostname define the Virtual Machine hostname Configure Timezone define the Virtual Machine timezone (within an existing available list) Authentication Username define a custom Linux user on the Virtual Machine Password Verify Password define the password for the custom Linux user on the Virtual Machine SSH Authorized Keys SSH authorized keys to get password-less access to the Virtual Machine Regenerate SSH Keys Option to regenerate the Virtual Machine Host SSH Keys Networks DNS Servers define the Domain Name Servers for the Virtual Machine DNS Search Domains define the Domain Name Servers Search Domain for the Virtual Machine In-guest Network Interface Name define the virtual-NIC device name for the Virtual Machine (ex. eth0) Custom script Execute a custom-script at the end of the cloud-init configuration process All of those options can be easily managed by "Oracle Linux Virtualization Manager" web interface by editing the Virtual Machine and enabling "Cloud-Init/Sysprep" option: Further details on how to import and use the Oracle Linux 7 Template for Oracle Linux KVM are available in this Technical Article on Simon Coter's Oracle Blog. Oracle Linux KVM & Virtualization Manager Support Support for Oracle Linux Virtualization Manager is available to customers with an Oracle Linux Premier Support subscription. Refer to Oracle Unbreakable Linux Network for additional resources on Oracle Linux support. Oracle Linux Resources Documentation Oracle Linux Virtualization Manager Documentation Blogs Oracle Linux Blog Oracle Virtualization Blog Community Pages Oracle Linux Product Training and Education Oracle Linux Administration - Training and Certification Data Sheets, White Papers, Videos, Training, Support & more Oracle Linux Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter

Linux Kernel Development

The Benefit of Static Trace Points

Chuck Lever is a Linux Kernel Architect working with the Oracle Linux and Unbreakable Enterprise Kernel team at Oracle. He contributed this article about replacing printk debugging with static trace points in the kernel. On The Benefits of Static Trace Points These days, kernel developers have several choices when it comes to reporting exceptional events. Among them: a console message, a static trace point, DTrace, or a Berkeley Packet Filter script. Arguably the best choice for building an observability framework into the kernel is the judicious use of static trace points. Amongst the several kernel debugging techniques that are currently in vogue, we like static trace points. Here's why. A little history Years ago IBM coined the term First Failure Data Capture (FFDC). Capture enough data about a failure, just as it occurs the first time, so that reproducing the failure is all but unnecessary. An observability framework is a set of tools that enable system administrators to monitor and troubleshoot systems running in production, without interfering with efficient operation. In other words, it captures enough data about any failure that occurs so that a failure can be root-caused and possibly even fixed without the need to reproduce the failure in vitro. Of course, FFDC is an aspirational goal. There will always be a practical limit to how much data can be collected, managed, and analyzed without impacting normal operation. The key is to identify important exceptional events and place hooks in those areas to record those events as they happen. These exceptional events are hopefully rare enough that the captured data is manageable. And the hooks themselves must introduce little or no overhead to a running system. The trace point facility The trace point facility, also known as ftrace, has existed in the Linux kernel for over a decade. Each static trace point is an individually-enabled call out that records a set of data as a structured record into a circular buffer. An area expert determines where each trace point is placed, what data is stored in the structured record, and how the stored record should be displayed (i.e., a print format specifier string). The format of the structured record acts as a kernel API. It is much simpler to parse than the strings output by printk. User space tools can filter trace data based on values contained in the fields (e.g., show me just trace events where "status != 0"). Each trace point is always available to use, as it is built into the code. When triggered, a trace point can do more than capture the values of a few variables. It also records a timestamp and whether interrupts are enabled, and which CPU, which PID, and which executable is running. It is also able to enable or disable other trace points, or provide a stack trace. DTrace and eBPF scripts can attach to a trace point, and hist triggers are also possible. Trace point buffers are allocated per CPU to eliminate memory contention and lock waiting when a trace event is triggered. There is a default set of buffers ready from system boot onward. However, trace point events can be directed into separate buffers. This permits several different tracing operations to occur concurrently without interfering with each other. These buffers can be recorded into files, transmitted over the network, or read from a pipe. If a system crash should occur, captured trace records still reside in these buffers and can be examined using crash dump analysis tools. 
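As a small, hedged illustration of how an individual trace point is switched on from user space, the tracefs interface can be driven directly from a shell; the sched_process_exec event is just a convenient, widely available example:

# tracefs is usually mounted at /sys/kernel/debug/tracing
# (or /sys/kernel/tracing on newer kernels).
cd /sys/kernel/debug/tracing

echo 1 > events/sched/sched_process_exec/enable   # enable one trace point
head trace                                        # read the per-CPU buffers

# Filter on a field of the structured record, as described above:
echo 'pid > 1000' > events/sched/sched_process_exec/filter

echo 0 > events/sched/sched_process_exec/enable   # disable it again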
The benefits of trace points Trace points can be safely placed in code that runs at interrupt context as well as code that runs in process context, unlike printk(). Also unlike printk(), individual trace points can be enabled, rather than every printk() at a particular log level. Groups of trace points can be conveniently enabled or disabled with a single operation, and can be combined with other more advanced ftrace facilities such as function_graph tracing. Trace points are designed to be low overhead, especially when they are disabled. The code that structures trace point data and inserts it into a trace buffer is out-of-line, so that an uncalled trace point adds very little instruction cache footprint. The actual code at the call site is nothing more than a load and a conditional branch. This is unlike some debugging mechanisms that place no-ops in the code, and then modify the code when they are enabled. This technique would not be effective if the executable resides in read-only memory, but a trace point in the same code can continue to work. What about printk? In contrast, printk() logs messages onto the system console and directly into the kernel's log file (typically /var/log/messages). In recent Linux distributions, kernel log output is rate-limited, which means an important but noisy stream of printk() messages can be temporarily disabled by software just before that one critical log message comes out. In addition, in lights-out environments, the console can be a serial device set to a relatively low baud rate. A copious stream of printk() messages can trigger workqueue stalls or worse as the console device struggles to keep up. How do I use trace points? We've described a good way to identify and record exceptional events, using static trace points. How are the captured events recorded to a file for analysis? The trace-cmd(1) tool permits a privileged user to specify a set of events to enable and direct the output to a file or across a network, and then filter and display the captured data. This tool is packaged and available for download in Oracle Linux RPM channels. A graphical front-end for trace-cmd called kernelshark is also available. In addition, Oracle has introduced a facility for continuous monitoring of trace point events called Flight Data Recorder (FDR for short). FDR is started by systemd and enables a configured set of trace points to monitor. It captures event data to a round-robin set of files, limiting the amount of data so it does not overrun the local root filesystem. A configuration file allows administrators to adjust the set of trace points that are monitored. Because this facility is always on, it can capture events at the time of a crash. The captured trace point data is available in files or it can be examined by crash analysis. To keep this article short, we've left out plenty of other benefits and details about static trace points. You can read more about them by following the links below. These links point to articles about trace point-related user space tools, clever tips and tricks, how to insert trace points into your code, and much, much more. There are several links to lwn.net (http://lwn.net/) above. lwn.net is such a valuable resource to the Linux community. I encourage everyone to consider a subscription! 
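For completeness, a minimal trace-cmd session might look like the sketch below; the event names are only examples:

trace-cmd list -e | head                # events available on this kernel
# Record two events system-wide while a workload runs:
trace-cmd record -e sched:sched_switch -e irq:irq_handler_entry sleep 10
trace-cmd report | head                 # filter and display the captured data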
First Failure Data Capture https://www.ibm.com/garage/method/practices/manage/first-failure-data-capture Using the Linux Kernel Tracepoints https://www.kernel.org/doc/html/latest/trace/tracepoints.html Debugging the kernel using Ftrace https://lwn.net/Articles/365835/ trace-cmd: A front-end for Ftrace https://lwn.net/Articles/410200/ Flight Data Recorder https://github.com/oracle/fdr Hist Triggers in Linux 4.7 http://www.brendangregg.com/blog/2016-06-08/linux-hist-triggers.html Ftrace: The hidden light switch https://lwn.net/Articles/608497/ Triggers for Tracing https://lwn.net/Articles/556186/ Finding Origins of Latencies Using Ftrace https://static.lwn.net/images/conf/rtlws11/papers/proc/p02.pdf

Linux

Linux Kernel Developments Since 5.0: Features and Developments of Note

Introduction Last year, I covered features in Linux kernel 5.0 that we thought were worth highlighting. Unbreakable Enterprise Kernel 6 is based on stable kernel 5.4 and was recently made available as a developer preview. So, now is as good a time as any to review developments that have occurred since 5.0. While the features below are roughly in chronological order, there is no significance to the order otherwise. BPF spinlock patches BPF (Berkeley Packet Filter) spinlock patches give BPF programs increased control over concurrency. Learn more about BPF and how to use it in this seven part series by Oracle developer Alan Maguire. Btrfs ZSTD compression The Btrfs filesystem now supports the use of multiple ZSTD (Zstandard) compression levels. See this commit for some information about the feature and the performance characteristics of the various levels. Memory compaction improvements Memory compaction has been reworked, resulting in significant improvements in compaction success rates and CPU time required. In benchmarks that try to allocate Transparent HugePages in deliberately fragmented virtual memory, the number of pages scanned for migration was reduced by 65% and scanning by the free scanner was reduced by 97.5%. io_uring API for high-performance async IO The io_uring API has been added, providing a new (and hopefully better) way of achieving high-performance asynchronous I/O. Build improvements to avoid unnecessary retpolines The GCC compiler can use indirect jumps for switch statements; those can end up using retpolines on x86 systems. The resulting slowdown is evidently inspiring some developers to recode switch statements as long if-then-else sequences. In 5.1, the compiler’s case-values-threshold will be raised to 20 for builds using retpolines — meaning that GCC will not create indirect jumps for statements with fewer than 20 branches — addressing the performance issue without the need for code changes that might well slow things down on non-retpoline systems. See patch Improved fanotify() to efficiently detect changes on large filesystems fanotify is a mechanism for monitoring filesystem events. This improvement enables watching the super block root to be notified when any file is changed anywhere on the filesystem. See patch. Higher frequency Pressure Stall Information monitoring First introduced in 4.20, Pressure Stall Information (PSI) tells a system administrator how much wall clock time an application spends, on average, waiting for system resources such as memory or CPU. This view into how resource-constrained a system is can help prevent catastrophe. Whereas previously PSI only reported averages for fixed, relatively large time windows, these improvements enable user-defined and more fine-grained measurements as well as mechanisms to be notified when thresholds are reached. For more information, see this article. devlink health notifications The new “devlink health” mechanism provides notifications when an interface device has problems. See this merge commit and this documentation for details. BPF optimizations The BPF verifier has seen some optimization work that yields a 20x speedup on large programs. That has enabled an increase in the maximum program size (for the root user) from 4096 instructions to 1,000,000. Read more about the BPF Verifier here. Pressure stall monitors Pressure stall monitors, which allow user space to detect and respond quickly to memory pressure, have been added. See this commit for documentation and a sample program. 
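As a quick, hedged illustration of the PSI interface mentioned above (the exact fields depend on kernel version):

# Per-resource stall averages over 10s/60s/300s windows, plus a
# cumulative total in microseconds:
cat /proc/pressure/cpu /proc/pressure/memory /proc/pressure/io

# The newer monitoring support lets a program register a trigger by
# writing, for example, "some 150000 1000000" (150ms of stall within a
# 1s window) to one of these files and then poll()ing the open file
# descriptor for POLLPRI; the trigger only lives as long as that
# descriptor stays open, so it is set up from a program rather than a
# one-shot shell redirect.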
MM optimizations to reduce unnecessary cache line movements/TLB misses Optimizations to memory management code reduce TLB (translation lookaside buffer) misses. More details in this commit. Control Group v2 enhancements Control Group or cgroup is a kernel feature that enables hierarchical grouping of processes such that their use of system resources (memory, CPU, I/O, etc) can be controlled, monitored and limited. Version 1 of this feature has been in the kernel for a long time and is a crucial element of the implementation of containers in Linux. Version 2 or cgroup v2 is a re-work of control group, under development since version 4 of the kernel, that intends to remove inconsistencies and enable better resource isolation and better management for containers. Some of its characteristics include: unified hierarchy better support for rootless, unprivileged containers secure delegation of cgroups See also this documentation Power efficient userspace waiting The x86 umonitor, umwait, and tpause instructions are available in user-space code; they make it possible to efficiently execute small delays without the need for busy loops on Intel “Tremont” chips. Thus, applications can employ short waits while using less power and with reduced impact on the performance of other hyperthreads. A tunable has been provided to allow system administrators to control the maximum period for which the CPU can be paused. pidfd_open() system call The pidfd_open() system call has been added; it allows a process to obtain a pidfd for another, existing process. It is also now possible to use poll() on a pidfd to get notification when the associated process dies. kdump support for AMD Secure Memory Encryption (SME) See this article for more details. Exposing knfsd state to userspace The NFSv4 server now creates a directory under /proc/fs/nfsd/clients with information about current NFS clients, including which files they have open. See patch. Previously, it was not possible to get information about files held open by NFSv4 clients. haltpoll CPU idle governor The “haltpoll” CPU idle governor has been merged. This governor will poll for a while before halting an otherwise idle CPU; it is intended for virtualized guest applications where it can improve performance by avoiding exits to the hypervisor. See this commit for some more information. New madvise() commands There are two new madvise() commands to force the kernel to reclaim specific pages. MADV_COLD moves the indicated pages to the inactive list, essentially marking them unused and suitable targets for page reclaim. A stronger variant is MADV_PAGEOUT, which causes the pages to be reclaimed immediately. dm-clone target The new dm-clone target makes a copy of an existing read-only device. “The main use case of dm-clone is to clone a potentially remote, high-latency, read-only, archival-type block device into a writable, fast, primary-type device for fast, low-latency I/O”. More information can be found in this commit. virtio-fs Virtio-fs is a shared file system that lets virtual machines access a directory tree on the host. See this document and this commit message for more information. Kernel lockdown Kernel lockdown seeks to improve on guarantees that a system is running software intended by its owner. The idea is to build on protections offered at boot time (e.g. by UEFI secure boot) and extend them so that no program can modify the running kernel. This has recently been implemented as a security module. 
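To make the cgroup v2 item above a little more concrete, here is a minimal sketch of the unified hierarchy in action, assuming a v2-only mount at /sys/fs/cgroup (the "demo" group name and the 100M limit are arbitrary):

cat /sys/fs/cgroup/cgroup.controllers        # controllers available at the root
echo "+cpu +memory" > /sys/fs/cgroup/cgroup.subtree_control

mkdir /sys/fs/cgroup/demo                    # create a child group
echo 100M > /sys/fs/cgroup/demo/memory.max   # cap its memory usage
echo $$ > /sys/fs/cgroup/demo/cgroup.procs   # move the current shell into it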
Improved AMD EPYC scheduler/load balancing Fixes to ensure the scheduler properly load balances across NUMA nodes on different sockets. See commit message Preparations for realtime preemption Those who need realtime support in Linux have to this day had to settle for using the out-of-tree patchset PREEMPT_RT. 5.4 saw a number of patches preparing the kernel for native PREEMPT_RT support. pidfd API pidfd is a new concept in the kernel that represents a process as a file descriptor. As described in this article, the primary purpose is to prevent the delivery of signals to the wrong process should the target exit and be replaced —at the same ID— by an unrelated process, also known as PID recycling. Conclusion In slightly less than a year, a lot has happened in mainline kernel development. While the features covered here represent a mere subset of all the work that went into the kernel since 5.0, we thought they were noteworthy. If there are features you think we missed, please let us know in the comments! Acknowledgments lwn.net kernelnewbies.org Chuck Anderson, Oracle Scott Davenport, Oracle

Linux Kernel Development

XFS - Online Filesystem Checking

XFS Upstream maintainer Darrick Wong provides another instalment, this time focusing on how to facilitate sysadmins in maintaining healthy filesystems. Since Linux 4.17, I have been working on an online filesystem checking feature for XFS. As I mentioned in the previous update, the online fsck tool (named xfs_scrub) walks all internal filesystem metadata records. Each record is checked for obvious corruptions before being cross-referenced with all other metadata in the filesystem. If problems are found, they are reported to the system administrator through both xfs_scrub and the health reporting system. As of Linux 5.3 and xfsprogs 5.3, online checking is feature complete and has entered the stabilization and performance optimization stage. For the moment it remains tagged experimental, though it should be stable. We seek early adopters to try out this new functionality and give us feedback. Health Reporting A new feature under development since Linux 5.2 is the new metadata health reporting feature. In its current draft form, it collects checking and corruption reports from the online filesystem checker, and can report that to userspace via the xfs_spaceman health command. Soon, we will begin connecting it to all other places in the XFS codebase where we test for metadata problems so that administrators can find out if a filesystem observed any errors during operation. Reverse Mapping Three years ago, I also introduced the reverse space mapping feature to XFS. At its core is a secondary index of storage space usage that effectively provides a redundant copy of primary space usage metadata. This adds some overhead to filesystem operations, but its inclusion in a filesystem makes cross-referencing very fast. It is an essential feature for repairing filesystems online because we can rebuild damaged primary metadata from the secondary copy. The feature graduated from EXPERIMENTAL status in Linux 4.16 and is production ready. However, online filesystem checking and repair is (so far) the only use case for this feature, so it will remain opt-in at least until online checking graduates to production readiness. To try out this feature, pass the parameter -m rmapbt=1 to mkfs.xfs when formatting a new filesystem. Online Filesystem Repair Work has continued on online repair over the past two years. The basic core of how it works has not changed (we use reverse mapping information to reconnect damaged primary metadata), but our rigorous review processes have revealed other areas of XFS that could be improved significantly ahead of landing online repair support. For example, the offline repair tool (xfs_repair) rebuilds the filesystem btrees in bulk by regenerating all the records in memory and then writing out fully formed btree blocks all at once. The original online repair code would rebuild indices one record at a time to avoid running afoul of other transactions, which was not efficient. Because this is an opportunity to share code, I have cleaned up xfs_repair's code into a generic btree bulk load function and have refactored both repair tools to use it. Another part of repair that has been re-engineered significantly is how we stage those new records in memory. In the original design, we simply used kernel memory to hold all the records. The memory stress that this introduced made running repair a risky operation until I realized that repair should be running on a fully operational system. This means that we can store those records in memory that can be swapped out to conserve working set size. 
A potential third area for improvement is avoiding filesystem freezes to repair metadata. While freezing the filesystem to run a repair probably involves less downtime than unmounting, it would be very useful if we could isolate an allocation group that is found to be bad. This will reduce service impacts and is probably the only practical way to repair the reverse mapping index. I look forward to sending out a new revision of the online repair code in 2020 for further review. Demonstration: Online File System Check Online filesystem checking is a component that must be built into the Linux kernel at compile time by enabling the CONFIG_XFS_ONLINE_SCRUB kernel option. Checks are driven by a userspace utility named xfs_scrub. When run, this program announces itself as an experimental technical preview. Your kernel distributor must enable the option for the feature to work. On Debian and Ubuntu systems, the program is shipped in the regular xfsprogs package. On RedHat and Fedora systems, it is shipped in the xfsprogs-xfs_scrub package and must be installed separately. You can, of course, compile kernel and userspace from source. Let's try out the new program. It isn't very chatty by default, so we invoke it with the -v option to display status information and the -n option because we only want to check metadata: # xfs_scrub -n -v /storage/ EXPERIMENTAL xfs_scrub program in use! Use at your own risk! Phase 1: Find filesystem geometry. /storage/: using 4 threads to scrub. Phase 2: Check internal metadata. Info: AG 1 superblock: Optimization is possible. Info: AG 2 superblock: Optimization is possible. Info: AG 3 superblock: Optimization is possible. Phase 3: Scan all inodes. Info: /storage/: Optimizations of inode record are possible. Phase 5: Check directory tree. Info: inode 139431063 (1/5213335): Unicode name "arn.lm" in directory could be confused with "am.lm". Info: inode 407937855 (3/5284671): Unicode name "obs-l.I" in directory could be confused with "obs-1.I". Info: inode 407937855 (3/5284671): Unicode name "obs-l.X" in directory could be confused with "obs-1.X". Info: inode 688764901 (5/17676261): Unicode name "empty-fl.I" in directory could be confused with "empty-f1.I". Info: inode 688764901 (5/17676261): Unicode name "empty-fl.X" in directory could be confused with "empty-f1.X". Info: inode 688764901 (5/17676261): Unicode name "l.I" in directory could be confused with "1.I". Info: inode 688764901 (5/17676261): Unicode name "l.X" in directory could be confused with "1.X". Info: inode 944886180 (7/5362084): Unicode name "l.I" in directory could be confused with "1.I". Info: inode 944886180 (7/5362084): Unicode name "l.X" in directory could be confused with "1.X". Phase 7: Check summary counters. 279.1GiB data used; 3.5M inodes used. 262.2GiB data found; 3.5M inodes found. 3.5M inodes counted; 3.5M inodes checked. As you can see, metadata checking is split into different phases: This phase gathers information about the filesystem and tests whether or not online checking is supported. Here we examine allocation group metadata and aggregated filesystem metadata for problems. These include free space indices, inode indices, reverse mapping and reference count information, and quota records. In this example, the program lets us know that the secondary superblocks could be updated, though they are not corrupt. Now we scan all inodes for problems in the storage mappings, extended attributes, and directory contents, if applicable. No problems found here! 
Repairs are performed on the filesystem in this phase, though only if the user did not invoke the program with -n. Directories and extended attributes are checked for connectivity and naming problems. Here, we see that the program has identified several directories containing file names that could render similarly enough to be confusing. These aren't filesystem errors per se, but should be reviewed by the administrator. If enabled with -x, this phase scans the underlying disk media for latent failures. In the final phase, we compare the summary counters against what we've seen and report on the effectiveness of our scan. As you can see, we found all the files and most of the file data. Our sample filesystem is in good shape! We saw a few things that could be optimized or reviewed, but no corruptions were reported. No data have been lost. However, this is not the only way we can run xfs_scrub! System administrators can set it up to run in the background when the system is idle. xfsprogs ships with the appropriate job control files to run as a systemd timer service or a cron job. The systemd timer service can be run automatically by enabling the timer: # systemctl start xfs_scrub_all.timer # systemctl list-timers NEXT LEFT LAST PASSED UNIT ACTIVATES Thu 2019-11-28 03:10:59 PST 12h left Wed 2019-11-27 07:25:21 PST 7h ago xfs_scrub_all.timer xfs_scrub_all.service <listing shortened for brevity> When enabled, the background service will email failure reports to root. Administrators can configure when the service runs by running systemctl edit xfs_scrub_all.timer, and where the failure reports are sent by running systemctl edit xfs_scrub_fail@.service to change the EMAIL_ADDR variable. The systemd service takes advantage of systemd's sandboxing capabilities to restrict the program to idle priority and to run with as few privileges as possible. For systems that have cron installed (but not systemd), a sample cronjob file is shipped in /usr/lib/xfsprogs/xfs_scrub_all.cron. This file can be edited as necessary and copied to /etc/cron.d/. Failure reports are dispatched to wherever cronjob errors are sent. Demonstration: Health Reporting A comprehensive health report can be generated with the xfs_spaceman tool. The report contains health status about allocation group metadata and inodes in the filesystem: # xfs_spaceman -c 'health -c' /storage filesystem summary counters: ok AG 0 superblock: ok AG 0 AGF header: ok AG 0 AGFL header: ok AG 0 AGI header: ok AG 0 free space by block btree: ok AG 0 free space by length btree: ok AG 0 inode btree: ok AG 0 free inode btree: ok AG 0 overall inode state: ok <snip> inode 501370 inode core: ok inode 501370 data fork: ok inode 501370 extended attribute fork: ok This concludes our demonstrations. We hope you'll try out these new features and let us know what you think!
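If you would like to exercise the reverse mapping and online checking features together on a scratch device, a minimal sketch is shown below. It formats a new filesystem with the reverse mapping index enabled (as described above), mounts it, and runs a read-only scrub; the device name and mount point are placeholders for your own test storage.

# mkfs.xfs -m rmapbt=1 /dev/sdb1
# mount /dev/sdb1 /storage
# xfs_scrub -n -v /storage

Once you are comfortable with the read-only results, dropping the -n option lets xfs_scrub apply the optimizations (and, on kernels that support it, repairs) that it reports.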


Announcements

Announcing Oracle VirtIO Drivers 1.1.5 for Microsoft Windows

We are pleased to announce Oracle VirtIO Drivers for Microsoft Windows release 1.1.5. The Oracle VirtIO Drivers for Microsoft Windows are paravirtualized (PV) drivers for Microsoft Windows guests that are running on Oracle Linux KVM. The Oracle VirtIO Drivers for Microsoft Windows improve performance for network and block (disk) devices on Microsoft Windows guests and resolve common issues. What's New? Oracle VirtIO Drivers for Microsoft Windows 1.1.5 provides: An updated installer to configure a guest VM for migration from another VM technology to Oracle Cloud Infrastructure (OCI) without the need to select a custom installation VirtIO SCSI and Block storage drivers, updated to release 1.1.5, with support for dumping crash files The signing of the drivers to Microsoft Windows 2019 The installer enables the use of the VirtIO drivers at boot time so that the migrated guest can boot in OCI Note: If installing these drivers on Microsoft Windows 2008 SP2 and 2008 R2, you will need to first install the following update from Microsoft: 2019-08 Security Update for Windows Server 2008 for x64-based Systems (KB4474419) Failure to do this may result in errors during installation due to the inability to validate signatures of the drivers. Please follow normal Windows installation procedure for this Microsoft update. Oracle VirtIO Drivers Support Oracle VirtIO Drivers 1.1.5 support the KVM hypervisor with Oracle Linux 7 on premise and on Oracle Cloud Infrastructure. The following guest Microsoft Windows operating systems are supported: Guest OS  64-bit   32-bit  Microsoft Windows Server 2019 Yes N/A  Microsoft Windows Server 2016 Yes N/A  Microsoft Windows Server 2012 R2 Yes N/A  Microsoft Windows Server 2012 Yes N/A  Microsoft Windows Server 2008 R2 SP1 Yes N/A  Microsoft Windows Server 2008 SP2 Yes Yes Microsoft Windows Server 2003 R2 SP2 Yes Yes Microsoft Windows 10 Yes Yes Microsoft Windows 8.1 Yes Yes Microsoft Windows 8 Yes Yes Microsoft Windows 7 SP1 Yes Yes Microsoft Windows Vista SP2 Yes Yes   For further details related to support and certifications, refer to the Oracle Linux 7 Administrator's Guide. Additional information on the Oracle VirtIO Drivers 1.1.5 certifications can be found in the Windows Server Catalog. Downloading Oracle VirtIO Drivers Oracle VirtIO Drivers release 1.1.5 is available on the Oracle Software Delivery Cloud by searching on "Oracle Linux" and select "DLP:Oracle Linux 7.7.0.0.0 ( Oracle Linux )" Click on the "Add to Cart" button and then click on "Checkout" in the right upper corner. On the following window, select "x86-64" and click on the "Continue" button: Accept the "Oracle Standard Terms and Restrictions" to continue and, on the following window, click on "V984560-01.zip - Oracle VirtIO Drivers Version for Microsoft Windows 1.1.5" to download the drivers: Oracle Linux Resources Documentation Oracle Linux Virtualization Manager Documentation Oracle VirtIO Drivers for Microsoft Windows Blogs Oracle Linux Blog Oracle Virtualization Blog Community Pages Oracle Linux Product Training and Education Oracle Linux Administration - Training and Certification Data Sheets, White Papers, Videos, Training, Support & more Oracle Linux Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter


Comparing Workload Performance

In this blog post, Oracle Linux performance engineer Jesse Gordon presents an alternate approach to comparing the performance of a workload when measured in two different scenarios.  This improves on the traditional "perf diff" method. The benefits of this approach are as follows: ability to compare based on either inclusive time (time spent in a given method and all the methods it calls) or exclusive time (time spent only in a given method) fields in perf output can be applied to any two experiments that have common function names more readable output Comparing Perf Output from Different Kernels You’ve just updated your Oracle Linux kernel – or had it updated autonomously -- and you notice that the performance of your key workload has changed.  How do you figure out what is responsible for the difference?  The basic tool for this task is the perf profile 1, which can be used to generate traces of the workload on the two kernels.  Once you have the two perf outputs, the current Linux approach is to use "perf diff" 2 to compare the resulting traces.  The problem with the approach is that "perf diff" output is neither easy to read nor to use.  Here is an example: # # Baseline Delta Abs Shared Object Symbol # ........ ......... ................................... .............................................. # +3.38% [unknown] [k] 0xfffffe0000006000 29.46% +0.98% [kernel.kallsyms] [k] __fget 8.42% +0.91% [kernel.kallsyms] [k] fput +0.88% [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe +0.68% [kernel.kallsyms] [k] syscall_trace_enter 2.98% -0.67% [kernel.kallsyms] [k] _raw_spin_lock +0.55% [kernel.kallsyms] [k] do_syscall_64 0.40% -0.34% syscall [.] [main] In this blog, we outline an alternate approach which produces easier to read and use output.  Here is what the above output looks like using this approach: Command Symbol Before# After# Delta -------------------- ------------------------------ ------- ------- ------- syscall __fget 29.46 30.43 0.97 syscall fput 8.41 9.33 0.92 syscall entry_SYSCALL_64_after_hwframe 0.00 0.88 0.88 syscall syscall_trace_enter 0.00 0.68 0.68 syscall _raw_spin_lock 2.98 2.31 -0.67 syscall do_syscall_64 0.00 0.55 0.55 syscall main 0.40 0.06 -0.34 Furthermore, this alternate approach extends the comparison options, allowing one to compare based on any of the fields in the perf output report.  In the remainder of this blog, we detail the steps involved in producing such output.   Step 1: Generate the perf traces Taking a trace involves running the workload while invoking perf.  In this article, we chose to use the syscall workload from UnixBench 3 suite, a typical sequence would be: $ perf record -a -g -c 1000001 \<PATH-TO\>/byte-unixbench-master/UnixBench/Run syscall -i 1 -c 48 where: -a asks perf to monitor all online CPUs; -g asks perf to collect data so call graphs (stack traces) may be generated; -c 1000001 asks perf to collect a sample once every 1000001 cycles Step 2: Post-process the trace data Samples collected by perf record are saved into a binary file called, by default, perf.data. The "perf report" command reads this file and generates a concise execution profile. By default, samples are sorted by functions with the most samples first.  To post-process the perf.data file generated in step 1: $ mv perf.data perf.data.KERNEL $ perf report -i perf.data.KERNEL -n > perf.report.KERNEL Step 3: Compare the traces To be able to compare the two traces, first ensure that they are in a common directory on the system.  
So, we would have, for example, perf.report.KERNEL1 and perf.report.KERNEL2.  This is what one such trace profile looks like for UnixBench syscall: # # Samples: 1M of event 'cycles' # Event count (approx.): 1476340476339 # # Children Self Samples Command Shared Object Symbol # ........ ........ ............ ............... .................................. .................................................... # 98.60% 0.00% 0 syscall [unknown] [.] 0x7564207325203a65 | ---0x7564207325203a65 85.91% 0.24% 3538 syscall [kernel.kallsyms] [k] system_call_fastpath | ---system_call_fastpath | |--60.76%-- __GI___libc_close | 0x7564207325203a65 | |--37.72%-- __GI___dup | 0x7564207325203a65 | |--1.30%-- __GI___umask | 0x7564207325203a65 --0.21%-- [...] Listing 1: example perf trace profile The columns of interest shown are as follows: Children -- the percent of time spent in this method and all the methods that it calls, also referred to as inclusive time Self -- the percent of time spent in this method only, also referred to as exclusive time Samples -- the number of trace samples that fell in this method only Command -- the process name Shared Object -- the library Symbol -- the method (or function) name Now, we can use perf diff as follows: $ perf diff perf.data.KERNEL1 perf.data.KERNEL2 > perf.diff.KERNEL1.vs.KERNEL2   Here is what the resulting output looks like: # # Baseline Delta Abs Shared Object Symbol # ........ ......... ................................... .............................................. # +3.38% [unknown] [k] 0xfffffe0000006000 29.46% +0.98% [kernel.kallsyms] [k] __fget 8.42% +0.91% [kernel.kallsyms] [k] fput +0.88% [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe +0.68% [kernel.kallsyms] [k] syscall_trace_enter 2.98% -0.67% [kernel.kallsyms] [k] _raw_spin_lock +0.55% [kernel.kallsyms] [k] do_syscall_64 0.40% -0.34% syscall [.] main Listing 2: example perf diff profile trace This output, by default, has done the comparison using the “Self” column, or time spent in just this one method.  This can be useful output, but is often insufficient as part of a performance analysis.  We next present an approach to comparing using the “Children” column, for time spent in this method and all its children.  Step 4: Generate comparison using the “Children” column To perform the comparison, we first extract all of the lines that have entries in all six columns i.e., all fields are present.  These are the lines at the top of each of the call graphs. You can find the allfields.delta.py program that we use to render these results on github at https://github.com/oracle/linux-blog-sample-code/tree/comparing-workload-performance/allfields.delta.py $ grep "\\[" perf.data.DESCRIPTOR | grep -v "|" | grep -v "\\-\\-" > perf.report.DESCRIPTOR.allfields The output of this script looks as follows: 98.60% 0.00% 0 syscall [unknown] [.] 0x7564207325203a65 85.91% 0.24% 3538 syscall [kernel.kallsyms] [k] system_call_fastpath 55.20% 0.16% 2403 syscall libc-2.17.so [.] __GI___libc_close 52.27% 0.14% 2020 syscall [kernel.kallsyms] [k] sys_close 52.11% 0.08% 1207 syscall [kernel.kallsyms] [k] __close_fd 50.39% 21.98% 324434 syscall [kernel.kallsyms] [k] filp_close 35.44% 0.13% 1958 syscall libc-2.17.so [.] 
__GI___dup 32.39% 0.15% 2181 syscall [kernel.kallsyms] [k] sys_dup 29.46% 29.46% 434902 syscall [kernel.kallsyms] [k] __fget 19.92% 19.92% 294070 syscall [kernel.kallsyms] [k] dnotify_flush Listing 3: perf output showing lines with all fields present Now we compare two "allfields" files, using a Python script which reads in the two files and compares lines for which the combination of SharedObject + Symbol are the same.  This script allows the user to compare based on each of the three left side columns (children, self, or samples) and would be run as follows: $ allfields.delta.py -b perf.report.KERNEL1.allfields -a perf.report.KERNEL2.allfields -d children > allfields.delta.children.KERNEL1.vs.KERNEL2 For the UnixBench syscall workload, comparing a two distinct kernels, the output would look like this: perf report allfields delta report before file name == perf.report.KERNEL1.allfields after file name == perf.report.KERNEL2.allfields delta type == children Command Symbol Before# After# Delta -------------------- ------------------------------ ------- ------- ------- syscall 0x7564207325203a65 98.60 99.81 1.21 syscall system_call_fastpath 85.91 0.00 -85.91 syscall __GI___libc_close 55.20 56.73 1.53 syscall sys_close 52.27 53.69 1.42 syscall __close_fd 52.11 53.62 1.51 [...] Listing 4: example output from script, comparing using "children" field Lastly, we can sort this output to highlight the largest differences in each direction, as follows: $ sort -rn -k 5 allfields.delta.children.KERNEL1.vs.KERNEL2 | less where the head of the file shows those methods where more time was spent in KERNEL1 and the tail of the file shows those methods where more time was spent in KERNEL2: syscall entry_SYSCALL_64_after_hwframe 0.00 92.18 92.18 syscall do_syscall_64 0.00 91.07 91.07 syscall filp_close 50.39 52.70 2.31 syscall syscall_slow_exit_work 0.00 1.67 1.67 syscall __GI___libc_close 55.20 56.73 1.53 [...] syscall tracesys 1.18 0.00 -1.18 syscall syscall_trace_leave 1.70 0.00 -1.70 syscall int_very_careful 1.83 0.00 -1.83 syscall system_call_after_swapgs 2.42 0.00 -2.42 syscall system_call 3.47 0.00 -3.47 syscall system_call_fastpath 85.91 0.00 -85.91 Listing 5: sorted output of script We may now use these top and bottom methods as starting points into root causing the performance differences observed when executing the workload on the two kernels. Summary We have presented an alternate approach to comparing the performance of a workload when measured in two different scenarios.  This method can be applied to any two experiments that have common function names.  The benefits of this approach are as follows: ability to compare based on either inclusive time (Children) or exclusive time (Self) fields in perf output more readable output Please try it out. perf: Linux profiling with performance counters↩ perf-diff man page↩ UnixBench on GitHub↩
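For reference, the full flow from steps 1 through 4 can be condensed into the following sketch. The kernel labels, the UnixBench path and the location of allfields.delta.py are placeholders, and the grep filter is the same extraction shown in step 4, applied here to the text report produced in step 2. Repeat the first four commands on each kernel, then place both .allfields files in a common directory before running the delta script.

$ perf record -a -g -c 1000001 <PATH-TO>/byte-unixbench-master/UnixBench/Run syscall -i 1 -c 48
$ mv perf.data perf.data.KERNEL1
$ perf report -i perf.data.KERNEL1 -n > perf.report.KERNEL1
$ grep "\[" perf.report.KERNEL1 | grep -v "|" | grep -v "\-\-" > perf.report.KERNEL1.allfields
$ allfields.delta.py -b perf.report.KERNEL1.allfields -a perf.report.KERNEL2.allfields -d children > allfields.delta.children.KERNEL1.vs.KERNEL2
$ sort -rn -k 5 allfields.delta.children.KERNEL1.vs.KERNEL2 | less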


Announcements

Announcing Oracle Linux 7 Update 8 Beta Release

We are pleased to announce the availability of the Oracle Linux 7 Update 8 Beta release for the 64-bit Intel and AMD (x86_64) and 64-bit Arm (aarch64) platforms. Oracle Linux 7 Update 8 Beta is an updated release that includes bug fixes, security fixes and enhancements. It is fully binary compatible with Red Hat Enterprise Linux 7 Update 8 Beta. Updates include: a revised protection profile for General Purpose Operating Systems (OSPP) in the SCAP Security Guide packages; SELinux enhancements for Tomcat domain access and graphical login sessions; a new rsyslog option for managing letter-case preservation by using the FROMHOST property for the imudp and imtcp modules; and the Pacemaker concurrent-fencing cluster property now defaulting to true, which speeds up recovery in a large cluster where multiple nodes are fenced. The Oracle Linux 7 Update 8 Beta release includes the following kernel packages: kernel-uek-4.14.35-1902.7.3.1 for the x86_64 and aarch64 platforms - the Unbreakable Enterprise Kernel Release 5, which is the default kernel; and kernel-3.10.0-1101 for the x86_64 platform - the latest Red Hat Compatible Kernel (RHCK). To get started with the Oracle Linux 7 Update 8 Beta release, you can simply perform a fresh installation by using the ISO images available for download from Oracle Technology Network. Or, you can perform an upgrade from an existing Oracle Linux 7 installation by using the Beta channels for Oracle Linux 7 Update 8 on the Oracle Linux yum server or the Unbreakable Linux Network (ULN). # vi /etc/yum.repos.d/oracle-linux-ol7.repo [ol7_beta] name=Oracle Linux $releasever Update 8 Beta ($basearch) baseurl=https://yum$ociregion.oracle.com/repo/OracleLinux/OL7/beta/$basearch/ gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle gpgcheck=1 enabled=1 [ol7_optional_beta] name=Oracle Linux $releasever Update 8 Beta ($basearch) Optional baseurl=https://yum$ociregion.oracle.com/repo/OracleLinux/OL7/optional/beta/$basearch/ gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle gpgcheck=1 enabled=1 If your instance is running on OCI, the value "$ociregion" is resolved automatically so that the OCI regional yum mirrors are used. Modify the yum channel settings to enable the Oracle Linux 7 Update 8 Beta channels, then perform the upgrade. # yum update After the upgrade completes, reboot the system and you will have Oracle Linux 7 Update 8 Beta running. [root@v2v-app: ~]# cat /etc/oracle-release Oracle Linux Server release 7.8 This release is provided for development and test purposes only and is not covered by Oracle Linux support. Oracle does not recommend using Beta releases in production. Further technical details and known issues for the Oracle Linux 7 Update 8 Beta release are available on the Oracle Community - Oracle Linux and UEK Preview space. We welcome your questions and feedback on the Oracle Linux 7 Update 8 Beta release. You may contact the Oracle Linux team at oraclelinux-info_ww_grp@oracle.com or post your questions and comments on the Oracle Linux and UEK Preview Space on the Oracle Community.
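Returning to the upgrade steps above, a quick sanity check before applying the update is to confirm that the two beta repositories defined in the file are actually enabled; this assumes the repository file was created exactly as shown.

# yum repolist enabled | grep -i beta
# yum update -y
# reboot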


Announcements

UEK Release 6 Developer Preview available for Oracle Linux 7 and Oracle Linux 8

The Unbreakable Enterprise Kernel (UEK), included as part of Oracle Linux, provides the latest open source innovations, optimizations and security for enterprise cloud workloads. UEK Release 5, based on the upstream 4.14 kernel, is the current UEK release that powers production workloads on Oracle Linux 7 in the cloud or on-premises. Linux 5.4 is the latest stable kernel release, and it is the mainline kernel that UEK Release 6 tracks. You can experiment with the UEK Release 6 preview today on Oracle Linux 7 and Oracle Linux 8 on both the x86_64 and aarch64 platforms. The example below uses an Oracle Linux 8 x86_64 instance on Oracle Cloud Infrastructure; the kernel was upgraded to the UEK Release 6 preview within a few minutes. The same upgrade procedure applies to an Oracle Linux 7 or Oracle Linux 8 instance running on-premises. The Oracle Linux 8 instance starts out running the current RHCK (Red Hat Compatible Kernel). [root@ol8-uek6 ~]# uname -a Linux ol8-uek6 4.18.0-147.el8.x86_64 #1 SMP Tue Nov 12 11:05:49 PST 2019 x86_64 x86_64 x86_64 GNU/Linux Update the system: [root@ol8-uek6 ~]# yum update -y Enable the "ol8_developer_UEKR6" UEK Release 6 preview channel: [root@ol8-uek6 ~]# dnf config-manager --set-enabled ol8_developer_UEKR6 Run the "dnf" command to perform the UEKR6 Developer Preview installation: [root@ol8-uek6 ~]# dnf install kernel-uek kernel-uek-devel Reboot the Oracle Linux 8 instance for the new kernel to take effect. When the Oracle Linux 8 instance comes back, the UEK Release 6 preview is running. [root@ol8-uek6 ~]# uname -a Linux ol8-uek6 5.4.2-1950.2.el8uek.x86_64 #2 SMP Thu Dec 19 17:07:00 PST 2019 x86_64 x86_64 x86_64 GNU/Linux Further technical details and known issues for UEK6 are available in a dedicated article on the Oracle Community - Oracle Linux and UEK Preview space. If you have any questions, post them to the Oracle Linux Community.
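Returning to the upgrade itself, if you want to confirm which kernel will be used on the next boot before rebooting, grubby can report the default boot entry. The output shown is an assumption based on the kernel version installed above and the usual Oracle Linux 8 behavior of making the newest installed kernel the default.

[root@ol8-uek6 ~]# grubby --default-kernel
/boot/vmlinuz-5.4.2-1950.2.el8uek.x86_64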


Linux Kernel Development

XFS - Data Block Sharing (Reflink)

Following on from his recent blog XFS - 2019 Development Retrospective, XFS Upstream maintainer Darrick Wong dives a little deeper into the Reflinks implementation for XFS in the mainline Linux Kernel. Three years ago, I introduced to XFS a new experimental "reflink" feature that enables users to share data blocks between files. With this feature, users gain the ability to make fast snapshots of VM images and directory trees; and deduplicate file data for more efficient use of storage hardware. Copy on write is used when necessary to keep file contents intact, but XFS otherwise continues to use direct overwrites to keep metadata overhead low. The filesystem automatically creates speculative preallocations when copy on write is in use to combat fragmentation. I'm pleased to announce with xfsprogs 5.1, the reflink feature is now production ready and enabled by default on new installations, having graduated from the experimental and stabilization phases. Based on feedback from early adopters of reflink, we also redesigned some of our in-kernel algorithms for better performance, as noted below: iomap for Faster I/O Beginning with Linux 4.18, Christoph Hellwig and I have migrated XFS' IO paths away from the old VFS/MM infrastructure, which dealt with IO on a per-block and per-block-per-page ("bufferhead") basis. These mechanisms were introduced to handle simple filesystems on Linux in the 1990s, but are very inefficient. The new IO paths, known as "iomap", iterate IO requests on an extent basis as much as possible to reduce overhead. The subsystem was written years ago to handle file mapping leases and the like, but nowadays we can use it as a generic binding between the VFS, the memory manager, and XFS whenever possible. The conversion finished as of Linux 5.4. In-Core Extent Tree For many years, the in-core file extent cache in XFS used a contiguous chunk of memory to store the mappings. This introduces a serious and subtle pain point for users with large sparse files, because it can be very difficult for the kernel to fulfill such an allocation when memory is fragmented. Christoph Hellwig rewrote the in-core mapping cache in Linux 4.15 to use a btree structure. Instead of using a single huge array, the btree structure reduces our contiguous memory requirements to 256 bytes per chunk, with no maximum on the number of chunks. This enables XFS to scale to hundreds of millions of extents while eliminating a source of OOM killer reports. Users need only upgrade their kernel to take advantage of this improvement. Demonstration: Reflink To begin experimenting with XFS's reflink support, one must format a new filesystem: # mkfs.xfs /dev/sda1 meta-data=/dev/sda1 isize=512 agcount=4, agsize=6553600 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=1 data = bsize=4096 blocks=26214400, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=12800, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 If you do not see the exact phrase "reflink=1" in the mkfs output then your system is too old to support reflink on XFS. Now one must mount the filesystem: # mount /dev/sda1 /storage At this point, the filesystem is ready to absorb some new files. Let's pretend that we're running a virtual machine (VM) farm and therefore need to manage deployment images. 
This and the next example are admittedly contrived, as any serious VM and container farm manager takes care of all these details. # mkdir /storage/images # truncate -s 30g /storage/images/os8_base.img # qemu-system-x86_64 -hda /storage/images/os8_base.img -cdrom /isoz/os8_install.iso Now we install a base OS image that we will later use for fast deployment. Once that's done, we shut down the QEMU process. But first, we'll check that everything's in order: # xfs_bmap -e -l -p -v -v -v /storage/images/os8_base.img /storage/images/os8_base.img: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..15728639]: 52428960..68157599 1 (160..15728799) 15728640 000000 <listing shortened for brevity> # df -h /storage Filesystem Size Used Avail Use% Mounted on /dev/sda1 100G 32G 68G 32% /storage Now, let's say that we want to provision a new VM using the base image that we just created. In the old days we would have had to copy the entire image, which can be very time consuming. Now, we can do this very quickly: # /usr/bin/time cp -pRdu --reflink /storage/images/os8_base.img /storage/images/vm1.img 0.00user 0.00system 0:00.02elapsed 39%CPU (0avgtext+0avgdata 2568maxresident)k 0inputs+0outputs (0major+108minor)pagefaults 0swaps # xfs_bmap -e -l -p -v -v -v /storage/images/vm1.img /storage/images/vm1.img: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..15728639]: 52428960..68157599 1 (160..15728799) 15728640 100000 <listing shortened for brevity> FLAG Values: 0100000 Shared extent 0010000 Unwritten preallocated extent 0001000 Doesn't begin on stripe unit 0000100 Doesn't end on stripe unit 0000010 Doesn't begin on stripe width 0000001 Doesn't end on stripe width # df -h /storage Filesystem Size Used Avail Use% Mounted on /dev/sda1 100G 32G 68G 32% /storage This was a very quick copy! Notice how the extent map on the new image file shows file data pointing to the same physical storage as the original base image, but is now marked as a shared extent, and there's about as much free space as there was before the copy. Now let's start that new VM and let it run for a little while before re-querying the block mapping: # xfs_bmap -e -l -p -v -v -v /storage/images/os8_base.img /storage/images/vm1.img: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..15728639]: 52428960..68157599 1 (160..15728799) 15728640 100000 <listing shortened for brevity> # xfs_bmap -e -l -p -v -v -v /storage/images/vm1.img /storage/images/vm1.img: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..255]: 102762656..102762911 1 (50333856..50334111) 256 000000 1: [256..15728639]: 52429216..68157599 1 (416..15728799) 15728384 100000 <listing shortened for brevity> # df -h /storage Filesystem Size Used Avail Use% Mounted on /dev/sda1 100G 36G 64G 32% /storage Notice how the first 128K of the file now points elsewhere. This is evidence that the VM guest wrote to its storage, causing XFS to employ copy on write on the file so that the original base image remains unmodified. We've apparently used another 4GB of space, which is far better than the 64GB that would have been required in the old days. Let's turn our attention to the second major feature for reflink: fast(er) snapshotting of directory trees. Suppose now that we want to manage containers with XFS. After a fresh formatting, create a directory tree for our container base: # mkdir -p /storage/containers/os8_base In the directory we just created, install a base container OS image that we will later use for fast deployment. 
Once that's done, we shut down the container and check that everything's in order: # df /storage/ Filesystem Size Used Avail Use% Mounted on /dev/sda1 100G 2.0G 98G 2% /storage # xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash /storage/containers/os8_base/bin/bash: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..2175]: 52440384..52442559 1 (11584..13759) 2176 000000 Ok, that looks like a reasonable base system. Let's use reflink to make a fast copy of this system: # /usr/bin/time cp -pRdu --reflink=always /storage/containers/os8_base /storage/containers/container1 0.01user 0.64system 0:00.68elapsed 96%CPU (0avgtext+0avgdata 2744maxresident)k 0inputs+0outputs (0major+129minor)pagefaults 0swaps # xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash /storage/containers/os8_base/bin/bash: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..2175]: 52440384..52442559 1 (11584..13759) 2176 100000 # df /storage/ Filesystem Size Used Avail Use% Mounted on /dev/sda1 100G 2.0G 98G 2% /storage Now we let the container runtime do some work and update (for example) the bash binary: # xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash /storage/containers/os8_base/bin/bash: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..2175]: 52440384..52442559 1 (11584..13759) 2176 000000 # xfs_bmap -e -l -p -v -v -v /storage/containers/container1/bin/bash /storage/containers/container1/bin/bash: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..2175]: 52442824..52444999 1 (14024..16199) 2176 000000 Notice that the two copies of bash no longer share blocks. This concludes our demonstration. We hope you enjoy this major new feature!
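If you want to experiment with reflink at the level of a single file rather than a whole image or directory tree, xfs_io can clone one file directly into another. The sketch below is illustrative: the file names are placeholders, and the first command simply confirms that the mounted filesystem was formatted with reflink support.

# xfs_info /storage | grep reflink
# xfs_io -f -c "pwrite 0 1m" /storage/original.dat
# xfs_io -f -c "reflink /storage/original.dat" /storage/clone.dat
# xfs_bmap -e -v /storage/clone.dat

The clone's extent map should carry the shared-extent flag (100000) seen in the VM image example above, and writing to either file afterwards triggers the same copy-on-write behavior.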


Linux Kernel Development

XFS - 2019 Development Retrospective

Darrick Wong, Upstream XFS Maintainer and kernel developer for Oracle Linux, returns to talk about what's been happening with XFS. Hi folks! It has been a little under two years since my last post about upcoming XFS features in the mainline Linux kernel. In that time, the XFS development community have been hard at work fixing bugs and rolling out new features! Let's talk about the improvements that have landed recently in the mainline Linux Kernel, and our development roadmap for 2020. The new reflink and online fsck features will be covered in separate future blog posts. Lazy Timestamp Updates Starting with Linux 4.17, XFS implements the lazytime mount option. This mount option permits the filesystem to skip updates to the last modification timestamp and file metadata change timestamp if they have been updated within the last 24 hours. When used in combination with the relatime mount option to skip updates to a last access timestamp when it is newer than the file modification timestamp, we see a marked decrease in metadata writes, which in turn improves filesystem performance on non-volatile storage. This enhancement was provided by Christoph Hellwig. Filesystem Label Management In Linux 4.18, Eric Sandeen added to XFS support for btrfs' label get and set ioctls. This change enables administrators to change a filesystem label while that filesystem is mounted. A future xfsprogs release will adapt xfs_admin to take advantage of this interface. Large Directory Dave Chinner contributed a series of patches for Linux 5.4 that reduce the amount of time that XFS spends searching for free space in a directory when creating a file. This change improves performance on very large directories, which should be beneficial for object stores and container deployment systems. Solving the Y2038 Problem The year 2038 poses a special problem for Linux -- any signed 32-bit seconds counter will overflow back to 1901. Work is underway in the kernel to extend all of those counters to support 64-bit counters fully. In 2020, we will begin work on extending XFS's metadata (primarily inode timestamps and quota expiration timer) to support timestamps out to the year 2486. It should be possible to upgrade to existing V5 filesystems. Metadata Directory Tree This feature, which I showed off late in 2018, creates a separate directory tree for filesystem metadata. This feature is not itself significant for users, but it will enable the creation of many more metadata structures. This in turn can enable us to provide reverse mapping and data block sharing for realtime volumes; support creating subvolumes for container hosts; store arbitrary properties in the filesystem; and attach multiple realtime volumes to the filesystem. Deferred Inode Reclaim and Inactivation We frequently hear two complaints lodged against XFS -- memory reclamation runs very slowly because XFS inode reclamation sometimes has to flush dirty inodes to disk; and deletions are slow because we charge all costs of freeing all the file's resources to the process deleting files. Dave Chinner and I have been collaborating this year and last on making those problems go away. Dave has been working on replacing the current inode memory reclaim code with a simpler LRU list and reorganizing the dirty inode flushing code so that inodes aren't handed to memory reclaim until the metadata log has finished flushing the inodes to disk. This should eliminate the complaints that slow IO gets in the way of reclaiming memory in other parts of the system. 
Meanwhile, I have been working on the deletion side of the equation by adding new states to the inode lifecycle. When a file is deleted, we can tag it as needing to have its resources freed, and move on. A background thread can free all those resources in bulk. Even better, on systems with a lot of IOPs available, these bulk frees can be done on a per-AG basis with multiple threads. Inode Parent Pointers Allison Collins continues developing the inode parent pointer feature. This has led to the introduction of atomic setting and removal of extended attributes and a refactoring of the existing extended attribute code. When completed, this will enable both filesystem check and repair tools to check the integrity of a filesystem's directory tree and rebuild subtrees when they are damaged. Anyway, that wraps up our new feature retrospective and discussion of 2020 roadmap! See you on the mailing lists!
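As a small illustration of the lazytime option described at the start of this post, it can be enabled on an existing XFS filesystem at remount time; the device and mount point below are placeholders borrowed from the earlier examples.

# mount -o remount,lazytime /storage

To make the option persistent across reboots, an /etc/fstab entry along these lines can be used:

/dev/sda1  /storage  xfs  defaults,lazytime  0 0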


Announcements

Announcing Oracle Container Runtime for Docker Release 19.03

Oracle is pleased to announce the release of Oracle Container Runtime for Docker version 19.03. Oracle Container Runtime allows you to create and distribute applications across Oracle Linux systems and other operating systems that support Docker. Oracle Container Runtime for Docker consists of the Docker Engine, which packages and runs the applications, and integrates with the Oracle Container Registry and Docker Hub to share the applications in a Software-as-a-Service (SaaS) cloud. Notable Updates The docker run and docker create commands now include an option to set the domain name, using the --domainname option. The docker image pull command now includes an option to quietly pull an image, using the --quiet option. Faster context switching using the docker context command. Added the ability to list kernel capabilities with --capabilities instead of --capadd and --capdrop. Added the ability to define sysctl options with --sysctl list, --sysctl-add list, and --sysctl-rm list. Added inline cache support to the builder with the --cache-from option. The IPVLAN driver is now supported and no longer considered experimental. Upgrading To learn how to upgrade from a previously supported version of Oracle Container Runtime for Docker, please review the Upgrading Oracle Container Runtime for Docker chapter of the documentation. Note that upgrading from a developer preview release is not supported by Oracle. Support Support for the Oracle Container Runtime for Docker is available to customers with either an Oracle Linux Premier or Basic support subscription. Resources Documentation Oracle Container Runtime for Docker Oracle Linux Cloud Native Environment Oracle Linux Software Download Oracle Linux download instructions Oracle Software Delivery Cloud Oracle Container Registry Oracle Groundbreakers Community Oracle Linux Space Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter Product Training and Education Oracle Linux
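Returning to the notable updates above, a few of the new options can be exercised directly from the command line once the 19.03 packages are installed. The image name is just an example; any image available to the engine will do, and the second command prints the domain name as seen from inside the container.

# docker image pull --quiet oraclelinux:7-slim
# docker run --rm --domainname example.com oraclelinux:7-slim cat /proc/sys/kernel/domainname
# docker context ls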


Technologies

Kata Containers: What, When and How

When we began work to include Kata Containers with Oracle Linux Cloud Native Environment I was immediately impressed with the change they bring to the security boundary of containers but I had to wonder, how does it work? This article attempts to briefly cover the What, When and How of Kata Containers. Before we dive into Kata Containers you may want to read the brief history of Linux containers. 1. What are Kata Containers? Kata Containers is an open source [project with a] community working to build a secure container runtime with lightweight virtual machines that feel and perform like containers, but provide stronger workload isolation using hardware virtualization technology as a second layer of defense. Kata Containers stem from the Intel® Clear Containers and Hyper RunV projects. Kata Containers use existing CPU features like Intel VT-X and AMD-V™ to better isolate containers from each other when run on the same host. Each container can run in its own VM and have its own Linux Kernel. Due to the boundaries between VMs a container should not be able to access the memory of another container (Hypervisors+EPT/RVI). runc is the runtime-spec reference implementation on Linux and when it spawns containers it uses standard Linux Kernel features like AppArmour, capabilities(7), Control Groups, seccomp, SELinux and namespaces(7) to control permissions and flow of data in and out of the container. Kata Containers extends this by wrapping the containers in VMs. 2. When should I use Kata Containers? runc is the most common container runtime, the default for Docker™, CRI-O and in turn Kubernetes®. Kata Containers give you an alternative, one which provides higher isolation for mixed-use or multi-tenant environments. Kubernetes worker nodes are capable of using both runc and Kata Containers simultaneously so dedicated hardware is not required. For intra-container communication efficiency and to reduce resource usage overhead, Kata Containers executes all containers of a Kubernetes pod in a single VM.   Deciding when to use runc and when to use Kata Containers is dependent on your own security policy and posture. Factors that may influence when higher levels of isolation are necessary include: The source of the image - trusted vs untrusted. Was the image built in-house or downloaded from a public registry? The contents of the container In-house software that brings a competitive advantage The dataset the container works on (public vs confidential) Working in a virtual environment may impact performance so workload-specific testing is recommended to evaluate the extent, if any, of that impact in your environment. 3. How do Kata Containers work? When installing the Kubernetes module of Oracle Linux Cloud Native Environment both runc and Kata Containers are deployed along with CRI-O, which provides the necessary support between Kubernetes and the container runtimes. A heavily optimized and purpose-tuned Linux Kernel is used by Kata Containers to boot the VM. This is paired with a minimized user space to support container operations and together they provide fast initialization. In order to create a Kata Container, a Kubernetes user must initially create a RuntimeClass object. After that Pods or Deployments can reference the RuntimeClass to indicate the runtime to use. Examples are available in the Using Container Runtimes documentation. 
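For readers who want a concrete picture before turning to that documentation, here is a minimal, hypothetical sketch of the flow: it creates a RuntimeClass and then a Pod that requests it. The handler value must match the runtime name configured in CRI-O ("kata" is assumed here), the RuntimeClass name is arbitrary, and the image reference is only an example.

$ cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata-containers
handler: kata
---
apiVersion: v1
kind: Pod
metadata:
  name: kata-demo
spec:
  runtimeClassName: kata-containers
  containers:
  - name: demo
    image: container-registry.oracle.com/os/oraclelinux:7-slim
    command: ["sleep", "3600"]
EOF

If the pod starts, it is being executed inside a lightweight virtual machine by the Kata runtime rather than by runc.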
Kata Containers are designed to provide "lightweight virtual machines that feel and perform like containers"; your developers shouldn't need to be aware that their code is executing in a VM and should not need to change their workflow to gain the benefits. Trademarks Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.


Linux

Oracle Linux Training at Your Own Pace

Knowing that taking training at your own pace, when you have time, suits many people's schedules and learning styles, Oracle has just released new Training-on-Demand courses for those aspiring to build their Linux administration skills. Why not take advantage of the newly released training to build your Linux skills? Start your Linux learning with the Oracle Linux System Administration I course. This course covers a range of skills including installation, using the Unbreakable Enterprise Kernel, configuring Linux services, preparing the system for the Oracle Database, monitoring and troubleshooting. After gaining essential knowledge and skills from taking the Oracle Linux System Administration I course, students are encouraged to continue their Linux learning with Oracle Linux System Administration II. The Oracle Linux System Administration II course teaches you how to automate the installation of the operating system, implement advanced software package management, and configure advanced networking and authentication services. Resources: Oracle Linux Curriculum Oracle Linux Product Documentation Linux on Oracle Cloud Infrastructure learning path Oracle Linux Cloud Native Environment learning path Linux Containers and Orchestration Groundbreakers Community


Events

Meet the Oracle Linux Team at Open FinTech Forum in New York

For IT decision makers in the financial services sector, you won’t want to miss Open FinTech Forum, December 9, 2019, at the Convene Conference Center at One Liberty Plaza, New York, NY. This event is designed to better inform you about the open technologies driving digital transformation, and how to best utilize an open source strategy. This information-packed day starts with several brief keynotes. Be sure to mark your schedule and join Robert Shimp, Group Vice President of Infrastructure Software Product Management at Oracle, for the keynote: A New Blueprint for Modern Application Development and Runtime Environment Mr. Shimp will discuss new open source technologies that make it easier and faster than ever to design and deliver modern cloud native applications. When: Monday, December 9, 2019 Time: 10:05 a.m. Location: The Forum Meet Our Experts Register for Open FinTech Forum today and stop by Oracle’s table to chat with our Linux experts. Learn more about Oracle Linux and how it delivers a complete, open DevOps environment featuring leading performance, scalability, reliability and security for enterprise applications deployed in the cloud or on premise. One of the most secure Linux environments available with certification from Common Criteria as well as FIPS 140-2 validation of its cryptographic modules, Oracle Linux is currently the only Linux distribution on the NIAP Product Compliant List. It is also the only Linux with Ksplice zero-downtime automated patching for kernel, hypervisor, and critical user space libraries. We look forward to meeting you at Open FinTech Forum. #osfintech    #OracleLinux


Linux

A Brief History of Linux Containers

The latest update to Oracle Linux Cloud Native Environment introduces new functionality that we'll cover in upcoming posts, but before we dive into those features, let's take a look at the history of Linux containers to see how we got here. The first fundamental building block that led to the creation of Linux containers was submitted to the Kernel by Google. It was the first version of a feature named control groups, or cgroups. A cgroup is a collection of processes whose use of system resources is constrained by the Kernel to better manage competing workloads and to contain their impact on other processes. The second building block was the namespaces feature, which allows the system to apply a restricted view of system resources to processes or groups of processes. Namespaces were actually introduced much earlier than cgroups, but they were limited to specific object types like hostnames and Process IDs. It wasn't until 2008 and the creation of network namespaces that we could create different views of key network objects for different processes. Processes could now be prevented from knowing about each other and communicating with each other, and each could have a unique network configuration. The availability of these Kernel features led to the formation of LXC (Linux Containers), which provides a simple interface to create and manage containers, which are simply processes that are limited in what they can see by namespaces and what they can use by cgroups. Docker expanded LXC's functionality by providing a full container life-cycle management tool (also named Docker™). Docker's popularity led to its name now being synonymous with containers. In time Docker replaced LXC with its own libcontainer, and additional controls were added which utilized Kernel features such as AppArmor, capabilities, seccomp and SELinux. These gave developers improved methods of restricting what container processes could see and do. A key component of Docker's success was the introduction of a container and image format for portability, i.e. a container or image could be transferred between systems without any impact on functionality. This provided the assurance of repeatable, consistent deployments. This container and image format is based on individual file system layers, where each layer is intentionally separated for re-use but is presented as a unified file system when the container starts. Layers also allow images to extend another image. For example, an image may use oraclelinux:7-slim as its first layer and add additional software in other layers. This separation of layers allows images and containers to share the same bits on disk across multiple instances. This improves resource utilization and start-up time. A new API was created by Docker to facilitate image transfer, but pre-existing union filesystems like aufs and then OverlayFS were the base methods used to present the unified container filesystem. While most of us are familiar with Docker, what is less well-known is that those container and image formats, and how a container is launched from an image, are published standards of the Open Container Initiative. These standards are designed for interoperability so any runtime-spec compliant runtime can use an image-spec compliant image to launch a container. Docker was a founding member of the Open Container Initiative and contributed the Docker V2 Image specification to act as the basis of the image specification.
Through standards like these, interoperability is enhanced which helps the industry continue to develop and grow at a rapid pace. So if you're creating images today with Docker or other Open Container Initiative compliant tools, they can continue to work on other compliant tools like Kata containers which we'll be looking at in an upcoming blog post. Note: There were alternatives available in other operating systems and outside of mainline Linux prior to LXC but the focus of the article was Linux. Similarly, there were other projects in parallel to LXC which contributed to the industry but are not mentioned for brevity. Docker™ is a trademark of Docker, Inc. in the United States and/or other countries. Linux® is a registered trademark of Linus Torvalds in the United States and/or other countries.


Linux Kernel Development

Using Tracepoints to Debug iSCSI

Using Tracepoints to Debug iSCSI Modules Oracle Linux kernel developer Fred Herard offered this blog post on how to use tracepoints with iSCSI kernel modules. The scsi_transport_iscsi, libiscsi, libiscsi_tcp, and iscsi_tcp modules have been modified to leverage Linux kernel tracepoints to capture debug messages. Before this modification, debug messages for these modules were simply directed to syslog when enabled. This enhancement gives users the option to use the tracepoint facility to dump enabled events (debug messages) into the ftrace ring buffer. The following tracepoint events are available: # perf list 'iscsi:*' List of pre-defined events (to be used in -e): iscsi:iscsi_dbg_conn [Tracepoint event] iscsi:iscsi_dbg_eh [Tracepoint event] iscsi:iscsi_dbg_session [Tracepoint event] iscsi:iscsi_dbg_sw_tcp [Tracepoint event] iscsi:iscsi_dbg_tcp [Tracepoint event] iscsi:iscsi_dbg_trans_conn [Tracepoint event] iscsi:iscsi_dbg_trans_session [Tracepoint event] These tracepoint events can be enabled on the fly to aid in debugging iSCSI issues. Here's a sample output of tracing the iscsi:iscsi_dbg_eh tracepoint event using the perf utility: # /usr/bin/perf trace --no-syscalls --event="iscsi:iscsi_dbg_eh" 0.000 iscsi:iscsi_dbg_eh:session25: iscsi_eh_target_reset tgt Reset [sc ffff883fee609500 tgt iqn.1986-03.com.sun:02:fa41d51f-45a5-cea4-d661-a854dd13cf07]) 0.009 iscsi:iscsi_dbg_eh:session25: iscsi_exec_task_mgmt_fn tmf set timeout) 3.214 iscsi:iscsi_dbg_eh:session25: iscsi_eh_target_reset tgt iqn.1986-03.com.sun:02:fa41d51f-45a5-cea4-d661-a854dd13cf07 reset result = SUCCESS) Tracepoint events that have been inserted into the ftrace ring buffer can be extracted using, for example, the crash utility version 7.1.6 or higher: crash> extend ./extensions/trace.so ./extensions/trace.so: shared object loaded crash> trace show ... <...>-18646 [023] 20618.810958: iscsi_dbg_eh: session4: iscsi_eh_target_reset tgt Reset [sc ffff883fead741c0 tgt iqn.2019-10.com.example:storage] <...>-18646 [023] 20618.810968: iscsi_dbg_eh: session4: iscsi_exec_task_mgmt_fn tmf set timeout <...>-18570 [016] 20848.578257: iscsi_dbg_trans_session: session4: iscsi_session_event Completed handling event 105 rc 0 <...>-18570 [016] 20848.578260: iscsi_dbg_trans_session: session4: __iscsi_unbind_session Completed target removal This enhancement can be found in Oracle Linux UEK-qu7 and newer releases.
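Besides perf and crash, the same events can be toggled directly through the tracefs interface. The sketch below assumes tracefs is mounted at /sys/kernel/debug/tracing (newer kernels also expose it at /sys/kernel/tracing) and that you are running as root.

# echo 1 > /sys/kernel/debug/tracing/events/iscsi/iscsi_dbg_eh/enable
# echo 1 > /sys/kernel/debug/tracing/events/iscsi/iscsi_dbg_trans_session/enable
# cat /sys/kernel/debug/tracing/trace_pipe
# echo 0 > /sys/kernel/debug/tracing/events/iscsi/enable

Reading trace_pipe streams the enabled events live as they are emitted; writing 0 to the group-level enable file switches all of the iscsi events off again.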


Announcements

Announcing Oracle Linux 8 Update 1

Oracle is pleased to announce the general availability of Oracle Linux 8 Update 1. Individual RPM packages are available on the Unbreakable Linux Network (ULN) and the Oracle Linux yum server. ISO installation images will soon be available for download from the Oracle Software Delivery Cloud and Docker images will soon be available via Oracle Container Registry and Docker Hub. Oracle Linux 8 Update 1 ships with Red Hat Compatible Kernel (RHCK) (kernel-4.18.0-147.el8) kernel packages for x86_64 Platform (Intel & AMD), that include bug fixes, security fixes, and enhancements; the 64-bit Arm (aarch64) platform is also available for installation as a developer preview release. Notable new features for all architectures Security Udica package added You can use udica to create a tailored security policy, for better control of how a container accesses host system resources. This capability enables you to harden container deployments against security violations. SELinux SELinux user-space tools have been updated to release 2.9 SETools collection and libraries have been updated to release 4.2.2 New boltd_t SELinux type added (used to manage Thunderbolt 3 devices) New bpf SELinux policy added (used to control Berkeley Packet Filter) SELinux policy packages have been updated to release 3.14.3 OpenSCAP OpenSCAP packages have been updated to release 1.3.1 OpenSCAP includes SCAP 1.3 data stream version scap-security-guide packages have been updated to release 0.1.44 OpenSSH have been updated to 8.0p1 release Other Intel Optane DC Persistent Memory Memory Mode for the Intel Optane DC Persistent Memory technology has been added; this technology is transparent to the operating system. Database Oracle Linux 8 Update 1 ships with version 8.0 of the MySQL database. Cockpit Web Console Capability for Simultaneous Multi-Threading (SMT) configuration using the Cockpit web console. The ability to disable SMT in the Cockpit web console is also included. For further details please see Oracle Linux Simultaneous Multithreading Notice Networking page updated with new firewall settings on the web console's Networking page Several improvements to the Virtual Machines management page Important changes in this release virt-manager The Virtual Machine Manager application (virt-manager) is deprecated in Oracle Linux 8 Update 1. Oracle recommends that you use the web console (Cockpit) for managing virtualization in a graphical user interface (GUI). RHCK Btrfs file system removed from RHCK. OCFS2 file system removed from RHCK. VDO Ansible module moved to Ansible packages; the VDO Ansible module is provided by the ansible package and is located in /usr/lib/python3.6/site-packages/ansible/modules/system/vdo.py Further information on Oracle Linux 8 For more details about these and other new features and changes, please consult the Oracle Linux 8 Update 1 Release Notes and Oracle Linux 8 Documentation. Oracle Linux can be downloaded, used, and distributed free of charge and all updates and errata are freely available. Customers decide which of their systems require a support subscription. This makes Oracle Linux an ideal choice for development, testing, and production systems. The customer decides which support coverage is best for each individual system while keeping all systems up to date and secure. Customers with Oracle Linux Premier Support also receive support for additional Linux programs, including Gluster Storage, Oracle Linux Software Collections, and zero-downtime kernel updates using Oracle Ksplice. 
Application Compatibility Oracle Linux maintains user-space compatibility with Red Hat Enterprise Linux (RHEL), which is independent of the kernel version that underlies the operating system. To minimize impact on interoperability during releases, the Oracle Linux team works closely with third-party vendors for hardware and software that have dependencies on kernel modules. For more information about Oracle Linux, please visit www.oracle.com/linux.
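As an example of putting the updated security tooling to work, an OSPP compliance scan can be run with OpenSCAP against the SCAP Security Guide content mentioned above. The package names and profile identifier below are the standard ones, but the exact data stream file name may differ on your system; check /usr/share/xml/scap/ssg/content for the Oracle Linux 8 data stream before running the scan.

# dnf install -y openscap-scanner scap-security-guide
# oscap xccdf eval --profile xccdf_org.ssgproject.content_profile_ospp \
      --report /tmp/ospp-report.html /usr/share/xml/scap/ssg/content/ssg-ol8-ds.xml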


Announcements

Unified Management for Oracle Linux Cloud Native Environment

Delivering a production-ready, cloud-native application development and operating environment, Oracle Linux Cloud Native Environment has gained some notable additions: three core components for unified management, the Oracle Linux Cloud Native Environment Platform API Server, Platform Agent, and Platform Command-Line Interface (CLI). These new open source management tools simplify the installation and day-to-day management of the cloud native environment, and provide extensibility to support new functionality.

Oracle Linux Cloud Native Environment was announced at Oracle OpenWorld 2018 as a curated set of open source projects that are based on open standards, specifications, and APIs defined by the Open Container Initiative and the Cloud Native Computing Foundation, that can be easily deployed, have been tested for interoperability, and for which enterprise-grade support is offered. Since then we have released several new components, either generally available under an existing Oracle Linux support subscription or as technical preview releases.

Here's what the three core components provide:

The Platform API Server is responsible for performing all of the business logic required to deploy and manage an Oracle Linux Cloud Native Environment. We recommend using a dedicated operator node to host the Platform API Server, though it can run on any node within the environment. The business logic used by the Platform API Server is encapsulated within the metadata associated with each module we publish. An Oracle Linux Cloud Native Environment module is a method of packaging software so that it can be deployed by the Platform API Server to provide either core or optional cluster-wide functionality. Today, we are shipping the Kubernetes module, which provides the core container orchestration functionality for the entire cluster. Included within the Kubernetes module are additional components that provide required services, including CoreDNS for name resolution and Flannel for layer 3 networking services.

The Platform API Server interacts with a Platform Agent that must be installed on each host within the environment. The Platform Agent knows how to gather the state of resources on its host and how to change the state of those resources. For example, the Platform Agent can determine whether a package is installed and at which version, or whether a firewall port is open or closed. It can then be requested to change the state of those resources, that is, to upgrade the package if it is old or to open the port if it is closed. New instructions on how to gather and set state values can be added at any time by the Platform API Server, which makes the Platform Agent easily extensible at runtime, without requiring a cluster-wide upgrade.

You interact with the Platform API Server using the Platform CLI tool. The Platform CLI tool is the primary interface for the administration of Oracle Linux Cloud Native Environment. Like the Platform Agent, it is simply an interface to the functionality provided by the Platform API Server. The Platform CLI tool can be installed on the operator node within the environment.

Kata Containers support and other updates

Oracle Linux Cloud Native Environment contains several new or updated components over the previously released Oracle Container Services for use with Kubernetes product.
The following changes are in addition to the new management functionality:

The Kubernetes® module for the Oracle Linux Cloud Native Environment, which is based on upstream Kubernetes v1.14 and is a Certified Kubernetes distribution, now automatically installs the CRI-O runtime interface, which supports both the runC and Kata Containers runtime engines.
The Kata Containers runtime engine, which uses lightweight virtual machines for improved container isolation, is now fully supported for production use and is automatically installed by the Kubernetes module.
The Kubernetes module can either be configured to use an external load balancer, or the Platform API Server can deploy a software-based load balancer to ensure multi-master high availability.
The Platform API Server is capable of providing full cluster-wide backup/restore functionality for disaster recovery.

Join us at KubeCon + CloudNativeCon!

Grab a coffee with the Oracle Linux and Virtualization team at Booth #P26 and get an Oracle Tux cup of your own. While you're there, our Linux and Virtualization experts can answer your questions and provide one-on-one demos of the unified management for Oracle Linux Cloud Native Environment.

Installation

Oracle Linux Cloud Native Environment RPM packages are available on the Unbreakable Linux Network and the Oracle Linux yum server. The installation of Oracle Linux Cloud Native Environment requires downloading container images directly from the Oracle Container Registry or creating and using a local mirror of the images. Both options are covered in the Getting Started Guide. Oracle recommends reviewing the known issues list before starting the installation.

Support

Support for Oracle Linux Cloud Native Environment is included with an Oracle Linux Premier support subscription.

Documentation and training

Oracle Linux Cloud Native Environment documentation
Oracle Linux Cloud Native Environment training

Kubernetes® is a registered trademark of The Linux Foundation in the United States and other countries, and is used pursuant to a license from The Linux Foundation.


Events

Join Us at KubeCon + CloudNativeCon

Oracle’s Linux, Virtualization, and Cloud Infrastructure teams are heading south for KubeCon + CloudNativeCon, November 18 – 21, at the San Diego Convention Center. This conference gathers leading technologists from open source and cloud native communities to further the education and advancement of cloud native computing. If you’re making the move from traditional application design to cloud native – orchestrating containers as part of a microservices architecture – you probably have questions about the latest technologies, available solutions, and deployment best practices. Let us help you put the pieces together.

Meet us at Booth #P26

Our Linux and Virtualization experts can answer your questions and provide one-on-one demos. Learn about the latest advancements in Oracle Linux Cloud Native Environment’s rich set of curated software components for DevSecOps, on premises and in the cloud. Oracle Linux Cloud Native Environment is based on open standards, specifications, and APIs defined by the Open Container Initiative and the Cloud Native Computing Foundation™. It offers an open, integrated operating environment that is popular with developers and makes it easy for IT operations to deliver containers and orchestration, management, and development tools. With Oracle Linux Cloud Native Environment, you can:

Accelerate time-to-value and deliver agility through modularity and developer productivity
Modernize applications and lower costs by fully exploiting the economic advantages of cloud and open source
Achieve vendor independence

At Oracle’s booth, you can also learn about:

Oracle Cloud Native Services: public cloud for Kubernetes, Container Registry, open source serverless, and more.
GraalVM: Run programs faster anywhere. More agility and increased performance on premises and in the cloud.

Throughout the conference, you can sit in on informative “lightning talks” in the adjacent Oracle Cloud lounge. Have a cup of coffee on us – and you can enjoy it in your own Oracle Tux cup. To partake, visit the Coffee Lounge.

Booth hours
Tuesday, Nov. 19: 10:25 am – 8:40 pm
Wednesday, Nov. 20: 10:25 am – 5:20 pm
Thursday, Nov. 21: 10:25 am – 4:30 pm

We look forward to meeting you at the conference!


Linux Kernel Development

Thoughts on Attending and Presenting at Linux Security Summit North America 2019

Oracle Linux kernel developer Tom Hromatka attended Linux Security Summit NA 2019. In this blog post, Tom discusses the presentation that he gave as well as other talks he found interesting.

Linux Security Summit North America 2019

I was one of the lucky attendees at the Linux Security Summit North America 2019 conference in sunny San Diego from August 19th through the 21st. Three major topics dominated this year's agenda - trusted computing, containers, and overall kernel security. This was largely my first interaction with trusted computing and hardware attestation, so it was very interesting to hear about all of the innovative work going on in this area.

My Presentation - The Why and How of Libseccomp
https://sched.co/RHaK

For 2019, LSS added three tutorial sessions to the schedule. These 90-minute talks were envisioned to be interactive and provide more in-depth details of a given technology. Paul Moore (co-maintainer of libseccomp) and I presented the first tutorial of the conference. We dedicated the first 20 minutes or so to a slide-show introduction to the technology. Paul has given various flavors of this talk before, and he delivered the "why" part of the talk with a brief history of seccomp and libseccomp. He has a charismatic and entertaining delivery that can captivate an audience on even the driest of subjects - which seccomp is not :). I took over with the "how" portion of the discussion and jumped right in with a comparison of white- vs. blacklists. (Spoiler - if security is of the utmost concern, I recommend a whitelist.) I briefly touched on other seccomp considerations such as supporting additional architectures (x86_64, x32, etc.), strings in seccomp, and parameter filtering pitfalls.

The bulk of our timeslot was then spent writing a seccomp/libseccomp filter by hand (a minimal sketch in the same spirit appears at the end of this post). My goal was to highlight how easy it is to write a filter while simultaneously demonstrating some of the pitfalls (e.g. string handling) and how to debug them. In hindsight, this was a slightly crazy idea as many, many things could have gone horribly wrong. I had a rough plan of the program we were going to write and had tested it out beforehand. But, as they say, no battle plan survives first contact with the enemy. Here is what we ended up writing. My laptop behaved differently at the conference than it did at home, which led to more involved debugging than I had envisioned. I admit that I did want some live debugging, but... not that much. (I think the cause of the behavior differences was that I had done my testing at home using STDERR, but I inadvertently switched to using STDOUT at the conference.) Ultimately, though, these issues were the exact catalyst I was looking for, and audience participation soared. By the end of the talk I had the attention of the entire room, and many, many people were actively throwing out ideas. There was no shortage of great ideas on how to fix the problem and, perhaps more importantly, how to debug the problem. Afterward, a large number of people came up and thanked us for a fun talk. Several said that they were running the test program on their laptops while I was writing it and trying to actively debug it themselves. All in all, the talk didn't go exactly as I had envisioned, but perhaps that is for the better. The audience was amazing, and I sure had a lot of fun.

Making Containers Safer - Stéphane Graber and Christian Brauner
https://sched.co/RHa5

Stéphane and Christian are two of the lead engineers working on LXC for Canonical.
They are both intelligent and often working at the forefront of containers, so when they make an upstream proposal, it's wise to pay attention. In this talk, they mentioned several things they have worked on lately to improve kernel and container security:

Their users tend to run a single app in a container, rather than an entire distro. Thus, they can really lock down security via seccomp, SELinux, and cgroups to grant the bare minimum of permissions to run the application
LXC supports unprivileged containers launched by an unprivileged user, but there are still some limitations. Stay tuned for patches and improvements from them :)
Multiple users within such a container is difficult
Several helper binaries are required
Christian again reemphasized the importance of using unprivileged containers when possible and listed several CVEs that would have been ineffective against an unprivileged container
They spent quite a bit of time discussing the newly added seccomp/libseccomp user notification feature. (I worked on the libseccomp side of it.)
They are considering adding a "keep running the kernel filter" option to the user notification filter

Keynote: Retrospective: 26 Years of Flexible MAC - Stephen Smalley
https://sched.co/RHaH

Stephen Smalley was one of the early innovators in the Mandatory Access Control (MAC) arena (think SELinux and similar) and continues to innovate and advocate for better MAC solutions. Stephen presented an amazingly detailed and lengthy history of MACs in computing, from ~1999 through today. He touched on early NSA work with closed source OSes and the NSA's inability to gain traction there. These failures drove the NSA to look at open source OSes, and early experiments with the University of Utah and the OS they maintained proved the viability of a MAC. SELinux work started shortly after that, and it was added to Linux in 2003. Stephen applauded the Android work as a good example of how to apply a MAC. Android is 100% confined+enforcing and has a large automated validation and testing suite. Going forward, Stephen said that MACs are being effectively used by higher-level services and emerging technologies. For better security, this is critical.

Application Whitelisting - Steven Grubb
https://sched.co/RHb9

Steve Grubb is working on a rather novel approach to improve security. He's working on a daemon, fapolicyd, that can whitelist files on the system. His introduction quickly spelled out the problem space. Antivirus is an effective blacklisting approach: it can identify untrusted files and rapidly neutralize them. fapolicyd is effectively the opposite. A sysadmin should generally know the expected files that will be on the system and can create an application whitelist based upon these known files. He then went on a small tangent showing how easy it is to hijack a Python process and start up a webserver without touching the disk. fapolicyd uses seccomp to restrict execve access. Another quick demo showed how fapolicyd will allow /bin/ls to run, but a copy of it in /tmp was blocked. It's an interesting project in its early stages, and I'm eager to see how it progresses, so I started following it on github.

How to Write a Linux Security Module - Casey Schaufler
https://sched.co/RHa2

Casey gave the third and final tutorial of the conference on how (and why) to write a Linux Security Module (LSM).
As an aside, I had lunch with Casey prior to his presentation, and he good-naturedly said that he wasn't "crazy enough" to write software live in front of a large audience. Hmmm :). Anyway... my key takeaways from this tutorial:

* Why write your own LSM? You may have unique things you want to check beyond what SELinux or AppArmor are checking. Perhaps there's one little thing that your LSM can do... and do well
* One LSM cannot override another LSM's denial. In fact, once a check fails, no other LSMs are run
* If the check can be readily done in userspace, do it there. This includes LSM logic
* You only need to implement the LSM hooks you are interested in

Kernel Self-Protection Project - Kees Cook
https://sched.co/RHbF

Kees (kernel seccomp maintainer, amongst many other things) gave another excellent talk on the status of security in the Linux kernel. His talks are usually so engaging that I don't take notes, and this one was no exception. He outlined the many security (and otherwise) fixes that have gone into the kernel over the last year. He also opined that he would love to see the kernel move away from C and replace it with Rust, but he acknowledges there are a lot of challenges (both technical and human) before that could happen.

Hallway Track

As with any major Linux conference, the hallway track is every bit as invaluable as the official presentations. This was the first time I met my co-maintainer of libseccomp (Paul Moore) in person, and we were able to meet up a few times to talk seccomp/libseccomp and their roadmap going forward. I was lucky to be able to spend some time with several of the presenters, talking containers, seccomp, cgroups, and whatever other topics we had in common. And of course I talked seccomp with many conference attendees and gladly offered my assistance in getting their seccomp filters up and running.

Summary

This was my first LSS and hopefully not my last. I really enjoyed my time with the outstanding conference attendees, and the conversations (both formal and informal) were excellent. In summary, I learned a ton, ate way too much really good food, and met many intelligent and wonderful people. I hope to see you at LSS 2020!
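For readers who would like to experiment with the kind of filter we built during the tutorial, here is a minimal, self-contained libseccomp sketch. It is my own illustrative example, not the program from the session: it whitelists only the syscalls the program needs and, echoing the STDOUT/STDERR mix-up described above, only allows write() to stderr, while everything else returns EPERM. Build it with something like gcc -o demo demo.c -lseccomp on a system with the libseccomp development package installed.

#include <errno.h>
#include <seccomp.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int rc = -1;
	scmp_filter_ctx ctx;

	/* Whitelist approach: anything not explicitly allowed returns EPERM. */
	ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
	if (ctx == NULL)
		goto out;

	/* Allow the program to exit cleanly. */
	rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
	if (rc < 0)
		goto out;

	/* Allow write(), but only to stderr (fd 2) - parameter filtering. */
	rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
			      SCMP_A0(SCMP_CMP_EQ, STDERR_FILENO));
	if (rc < 0)
		goto out;

	/* Load the filter into the kernel for this process. */
	rc = seccomp_load(ctx);
	if (rc < 0)
		goto out;

	/* This write is allowed by the filter... */
	fprintf(stderr, "hello from inside the seccomp filter\n");

	/* ...but this one is not, and fails with EPERM. */
	if (write(STDOUT_FILENO, "blocked\n", 8) < 0)
		fprintf(stderr, "write to stdout failed: %s\n", strerror(errno));

out:
	seccomp_release(ctx);
	return -rc;
}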


Events

Join Wim Coekaerts, Oracle SVP, for a webinar: What you need to know about Oracle Autonomous Linux

When: Tuesday, November 19, 2019

In this webinar, Wim Coekaerts, Senior Vice President, Software Development at Oracle, will discuss the world's first Autonomous Linux and the benefits it offers customers. Today, security and data protection are among the biggest challenges faced by IT. Keeping systems up to date and secure involves tasks that can be error prone and extremely difficult to manage in large-scale environments. Learn how Oracle is helping to solve these challenges:

Automation is driving the cloud
Automating common management tasks to greatly reduce complexity and human error
Delivering increased security and reducing downtime by self-patching, self-updating, and known exploit detection
Freeing up critical IT resources to tackle more strategic initiatives
Full application compatibility
How to configure automation in your private clouds - reducing cost

This webinar will be held at three times on November 19, 2019, to accommodate global locations. Please use the link below to register for the session in your region/local time zone.

APAC | 01:00 PM Singapore Time | Register
EMEA | 10:00 AM (GMT) Europe / London Time | Register
North America | 09:00 AM Pacific Standard Time / 12:00 PM Eastern Standard Time | Register

During the webinar you will have the opportunity to have your questions answered. Please join us.


Announcements

Easier access to open source images on Oracle Container Registry

We recently updated Oracle Container Registry so that images which contain only open source software no longer require Oracle Single Sign-On authentication to the web interface, nor do they require the Docker client to log in prior to issuing a pull request. The change was made to simplify the installation workflow for open source components hosted on the registry and to allow those components to be more easily accessible to a continuous integration platform.

Downloading open source images

To determine whether an image is available without authentication, go to the Oracle Container Registry and navigate to the product category that contains the repository you're interested in. If the repository table states that the image "... is licensed under one or more open source licenses..." then that image can be pulled from Oracle Container Registry without any manual intervention or login required. See below for an example.

If you click the name of a repository, a table of available tags with their associated pull command is displayed on the image detail page. For example, the oraclelinux repository in the OS product category had around 15 tags available when this blog was published. The Tags table also provides a list of download mirrors across the world that can be used. For best performance, select the download mirror closest to you.

To pull an open source image, you simply issue the docker pull command as documented. No need to log in beforehand. For example, to pull the latest Oracle Linux 7 slim image from our Sydney download mirror, simply run:

# docker pull container-registry-sydney.oracle.com/os/oraclelinux:7-slim
7-slim: Pulling from os/oraclelinux
Digest: sha256:c2d507206f62119db3a07014b445dd87f85b0d6f204753229bf9b72f82ac9385
Status: Downloaded newer image for container-registry-sydney.oracle.com/os/oraclelinux:7-slim
container-registry-sydney.oracle.com/os/oraclelinux:7-slim

Obtaining images that contain licensed Oracle product binaries

For details on the process required to download images that contain licensed Oracle product binaries, please review the Using the Oracle Container Registry chapter of the Oracle Container Runtime for Docker manual.


Events

Resources at the Ready: Oracle OpenWorld 2019 Offers Continued Learning

Before Oracle OpenWorld 2019 is too far in the “rear view mirror” and we’re on to the holiday season, year’s end, and the regional OpenWorld events that start in early 2020, we wanted to highlight some of the content from the September conference. At the link below, you’ll find many of the presentations given by product experts, partners, and customers, with valuable information that can help you – and it’s all at your fingertips. Topics include:

Securing Oracle Linux 7
Setting Up a Kernel-Based VM with Oracle Linux 7, UEK5, Oracle Linux Virtualization Manager
Maximizing Performance with Oracle Linux
Secure Container Orchestration Using Oracle Linux Cloud Native (Kubernetes/Kata)
Infrastructure as Code: Oracle Linux, Terraform, and Oracle Cloud Infrastructure
Building Flexible, Multicloud Solutions: Oracle Private Cloud Appliance / Oracle Linux
Oracle Linux and Oracle VM VirtualBox: The Enterprise Cloud Development Platform
Oracle Linux: A Cloud-Ready, Optimized Platform for Oracle Cloud Infrastructure
Open Container Virtualization: Security of Virtualization, Speed of Containers
Server Virtualization in Your Data Center and Migration Paths to Oracle Cloud

This link is to the Session Catalog. Simply search for the title above or a topic of interest. You’ll see a green download arrow to the right of the session title, which lets you know that content is just two clicks away.

Videos of note:

Announcing Oracle Autonomous Linux (0:36)
Oracle's Infrastructure Strategy for the Cloud and On Premise (43:16)
Cloud Platform and Middleware Strategy and Roadmap (32:56)

We hope you find this content helpful. Let us know. And, if you’re already planning for 2020, here’s the line-up of regional conferences:

Oracle OpenWorld Middle East: Dubai | January 14-15 | World Trade Centre
Oracle OpenWorld Europe: London | February 12-13 | ExCeL
Oracle OpenWorld Asia: Singapore | April 21-22 | Marina Bay Sands
Oracle OpenWorld Latin America: Sao Paulo | June 17-18

Bold ideas. Breakthrough technologies. Better possibilities. It all starts here.


Linux Kernel Development

What it means to be a maintainer of Linux seccomp

In this blog post, Linux kernel developer Tom Hromatka talks about becoming a co-maintainer of libseccomp, what that means, and his recent presentation at Linux Security Summit North America 2019.

Seccomp Maintainer

Recently I was named a libseccomp co-maintainer. As a brief background, the Linux kernel provides a mechanism - called SECure COMPuting mode, or seccomp for short - to block a process or thread's access to some syscalls. seccomp filters are written in a pseudo-assembly instruction set called Berkeley Packet Filter (BPF), but these filters can be difficult to write by hand and are challenging to maintain as updates are applied and syscalls are added (a small example of such a raw filter appears at the end of this post). libseccomp is a low-level userspace library designed to simplify the creation of these seccomp BPF filters.

My role as a maintainer is diverse and varies greatly from day to day:

I initially started working with libseccomp because we in Oracle identified opportunities that could significantly improve seccomp performance for containers and virtual machines. This work then grew into fixing bugs, helping others with their seccomp issues, and in general trying to improve seccomp and libseccomp for the future. Becoming a maintainer was the next logical progression.
Our code is publicly available on github and we also maintain a public mailing list. Most questions, bug reports, and feature requests come in via github. Ideally the submitter will work with us to triage the issue, but that is not required.
Pull requests are a great way for others to get involved in seccomp and libseccomp. If a user identifies a bug or wants to add a new feature, they are welcome to modify the libseccomp code and submit a pull request to propose changes to the library. In cases like this, I will work with users to make sure the code meets our guidelines. I will help them match the coding style, create automated tests, or whatever else needs to be done to ensure their pull request meets our stringent requirements.
We have an extensive automated test suite, code coverage, and static analysis integrated directly into github to maintain our high level of code quality. These checks run against every pull request and every commit.
Periodically we release new versions of libseccomp. (At present the release schedule is "as needed" rather than on a set timeline. This could change in the future if need be.) We maintain two milestones within github - a major release milestone and a minor release milestone. Major releases are based upon the master branch of the repo and will contain new features, bug fixes, etc. - including potentially major changes. On the other hand, the minor release is based upon the git release- branch. Changes to the minor branch consist of bug fixes, security CVEs, etc. - and do not contain major new features. As a maintainer, the release process is fairly involved to ensure the release is of the highest quality.
Of course, I get to add new features, fix bugs - and hopefully not add any new ones :) - and add tests.
And finally, I work with others both within Oracle and throughout the greater Linux community to plan libseccomp and seccomp's future. For example, Christian Brauner (Canonical) and Kees Cook (Google) are interested in adding deep argument inspection to seccomp. This will require non-trivial changes to both the kernel and libseccomp.
This is a challenging feature that has significant security risks and will require cooperation up and down the software stack to ensure it's done safely and with a user-friendly API.

Libseccomp at Linux Security Summit North America 2019

In August my co-maintainer, Paul Moore (Cisco), and I attended the Linux Security Summit (LSS) conference in San Diego. We presented a tutorial on the "Why and How of Libseccomp". Paul opened up the 90-minute session with an entertaining retelling of the history of seccomp and libseccomp, and why it has evolved into its current form. I took over and presented the "how" portion of the presentation with a comparison of white- vs. blacklists and common pitfalls like string filters and parameter filtering. But the bulk of our tutorial was how to actually write a libseccomp filter, so with a tremendous amount of help from the audience, we wrote a filter by hand and debugged several troublesome issues.

Full disclosure: I wanted to highlight some of the challenges when writing a filter, but as Murphy's Law would have it, even more went awry than I expected. Hijinks didn't ensue, but thankfully, I had an engaged and wonderful audience, and together we debugged the filter into existence. The live writing of code really did drive home some of the pitfalls as well as outline methods to overcome these challenges. Overall, things didn't go exactly as I had envisioned, but I feel the talk was a success. Thanks again to our wonderful audience! The full recording of the tutorial is available here.
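To illustrate why libseccomp exists, here is a minimal raw seccomp filter written directly in classic BPF. It is my own sketch, not code from the libseccomp project: even this tiny deny-one-syscall filter has to check the architecture, load the syscall number, and get every jump offset right by hand, and it only works on x86_64. libseccomp generates equivalent (and multi-architecture) programs from a few API calls.

#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct sock_filter filter[] = {
		/* Verify the architecture so syscall numbers mean what we think. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, arch)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
		/* Load the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* Deny chdir(2) with EPERM; allow everything else. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_chdir, 0, 1),
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* Required so an unprivileged process may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
		perror("prctl(PR_SET_NO_NEW_PRIVS)");
		return 1;
	}
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
		perror("prctl(PR_SET_SECCOMP)");
		return 1;
	}

	/* chdir() is now rejected by the kernel before it runs. */
	if (chdir("/") < 0)
		perror("chdir");	/* prints: chdir: Operation not permitted */

	return 0;
}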


Linux Kernel Development

Notes on BPF (7) - BPF, tc and Generic Segmentation Offload

Oracle Linux kernel developer Alan Maguire continues our blog series on BPF, wherein he presents an in-depth look at the kernel's "Berkeley Packet Filter" - a useful and extensible kernel function for much more than packet filtering.

In the previous BPF blog entry, I warned against enabling generic segmentation offload (GSO) when using tc-bpf. The purpose of this blog entry is to describe how it in fact can be used, even for cases where BPF programs add encapsulation. A caveat however: this is only true for the 5.2 kernel and later. So here we will describe GSO briefly, then why it matters for BPF, and finally demonstrate how new flags added to the bpf_skb_adjust_room helper facilitate using it for cases where encapsulation is added.

What is Generic Segmentation Offload?

Generic segmentation offload took the hardware concept of allowing Linux to pass down a large packet - termed a megapacket - which the hardware would then dice up into individual MTU-size frames for transmission. This is termed TSO (TCP segmentation offload), and GSO generalizes it beyond TCP to UDP, tunnels, etc., and performs the segmentation in software. Performance benefits are still significant, even in software, and we find ourselves reaching line rate on 10Gb/s and faster NICs, even with an MTU of 1500 bytes. Because a lot of the per-packet costs in the networking stack are paid by one megapacket rather than dozens of smaller MTU packets traversing the stack, the benefits really accrue.

Enter BPF

Now consider the BPF case. If we are doing processing or encapsulation in BPF, we are adding per-packet overhead. This overhead could come in the form of map lookups, adding encapsulation, etc. The beautiful thing about GSO is that it happens after tc-bpf processing, so any costs we accrue in BPF are only paid for the megapacket, rather than for each individual MTU-sized packet. As such, switching GSO on is highly desirable. There is a problem however. For GSO to work on encapsulated packets, the packets must mark their inner encapsulated headers. This is done for native tunnels via the skb_set_inner_[mac|transport]_header() functions, but if we added encapsulation in BPF there was no way to mark the inner headers accordingly.

The solution

In BPF we carry out the marking of inner headers via flags passed to bpf_skb_adjust_room(). For usage examples, I would recommend looking at tools/testing/selftests/bpf/progs/test_tc_tunnel.c: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/test_tc_tunnel.c

GRE and UDP tunnels are supported via the BPF_F_ADJ_ROOM_ENCAP_L4_GRE and BPF_F_ADJ_ROOM_ENCAP_L4_UDP flags. For L3, we need to specify whether the inner header is IPv4 or IPv6 via the BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 and BPF_F_ADJ_ROOM_ENCAP_L3_IPV6 flags. Finally, if we have L2 encapsulation such as an inner ether header or MPLS label(s), we need to pass in the inner L2 header length via BPF_F_ADJ_ROOM_ENCAP_L2(inner_maclen). So we simply OR together the flags that specify the encapsulation we are adding (a minimal sketch appears at the end of this post).

Conclusion

Generic Segmentation Offload and BPF work beautifully together, but BPF encapsulation presented a difficulty since GSO did not know that the encapsulation had been added. With 5.2 and later kernels, this problem is now solved! Be sure to visit the previous installments of this series on BPF, here, and stay tuned for our next blog posts!

1. BPF program types
2. BPF helper functions for those programs
3. BPF userspace communication
4. BPF program build environment
5. BPF bytecodes and verifier
6. BPF Packet Transformation
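As a companion to the flags described above, here is a heavily trimmed sketch (my own, not the selftest itself) of a tc egress program that grows the packet and marks the inner headers so GSO can still segment correctly. Header sizes are hard-coded, the code that actually writes the outer IPv4 and GRE headers is elided, and the include paths and section name are assumptions that depend on your build environment, so treat test_tc_tunnel.c as the authoritative example.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Outer IPv4 header (20 bytes) plus a basic GRE header (4 bytes);
 * hard-coded to keep the sketch short.
 */
#define OUTER_HLEN	(20 + 4)

SEC("classifier")
int encap_ipv4_gre(struct __sk_buff *skb)
{
	/* Tell the stack what we are about to add: the inner packet is
	 * IPv4 and the outer encapsulation is GRE.  These flags are what
	 * let GSO segment the megapacket correctly after tc-bpf runs.
	 */
	__u64 flags = BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 |
		      BPF_F_ADJ_ROOM_ENCAP_L4_GRE;

	/* Grow the packet at the MAC layer to make room for the outer
	 * headers, marking the inner headers via the flags above.
	 */
	if (bpf_skb_adjust_room(skb, OUTER_HLEN, BPF_ADJ_ROOM_MAC, flags))
		return TC_ACT_SHOT;

	/* ... write the outer IPv4 and GRE headers here with
	 * bpf_skb_store_bytes(), as test_tc_tunnel.c does ...
	 */

	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";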
