
News, tips, partners, and perspectives for the Oracle Linux operating system and upstream Linux kernel work

Recent Posts

Linux

Generating a vmcore in OCI

  In this blog, Oracle Linux kernel developer Manjunath Patil demonstrates how you can configure your Oracle Linux instances (both bare metal and virtual machine) running in Oracle Cloud for crash dumps. OCI instances can generate a vmcore using kdump. kdump is a mechanism to dump the memory contents of a system (the vmcore) when the system crashes. The vmcore can later be analyzed using the crash utility to understand the cause of the system crash. The kdump mechanism works by booting a second kernel (called the kdump kernel or capture kernel) when the system running the first kernel (called the panicked kernel) crashes. The kdump kernel runs in its own reserved memory so that it won't affect the memory used by the system. OCI systems are all pre-configured with kdump. When an OCI instance crashes, it will generate a vmcore which can be shared with developers to understand the cause of the crash.

How to configure your Oracle Linux system with kdump

1. Prerequisites
Make sure you have the kexec-tools rpm installed:
# yum install kexec-tools
# yum list installed | grep kexec-tools
This is the main rpm containing the tools used to configure kdump.

2. Reserve memory for the kdump kernel
The kdump kernel needs its own reserved memory so that when it boots, it won't use the first kernel's memory. The first kernel is told to reserve memory for the kdump kernel using the crashkernel=auto kernel parameter. The first kernel needs to be rebooted for the kernel parameter to take effect.
a. Check whether the memory is reserved:
# cat /proc/iomem | grep -i crash
27000000-370fffff : Crash kernel
# dmesg | grep -e "Reserving .* crashkernel"
[0.000000] Reserving 257MB of memory at 624MB for crashkernel (System RAM: 15356MB)
b. How to set kernel parameters:
OL6 systems - update the /etc/grub.conf file.
OL7 systems - update the /etc/default/grub file (the GRUB_CMDLINE_LINUX= line) and re-generate grub.cfg (grub2-mkconfig -o /boot/grub2/grub.cfg). A sketch of this change appears after the examples at the end of this post.

3. Set up the serial console
Setting up the serial console prints the progress of the kdump kernel onto the serial console. It also helps with debugging any kdump kernel related issues. This setting is optional. To set it, add the 'console=tty0 console=ttyS0,115200n8' kernel parameters. Adding kernel parameters requires a reboot to take effect.

4. Configure kdump
/etc/kdump.conf is used to configure kdump. The two main configuration items are:
a. Where to dump the vmcore. The default location is /var/crash/. To change it, update the line starting with 'path'. Make sure the new path has enough space to accommodate the vmcore.
b. Minimize the size of the vmcore. We can reduce the size of the vmcore by excluding memory pages such as pages filled with zero, user process data pages, free pages, etc. This is controlled by the 'core_collector' line in the config file. The default value is 'core_collector makedumpfile -p --message-level 1 -d 31':
-p = compress the data using snappy
--message-level = print messages on the console; range 0 (brief) to 31 (verbose)
-d = dump level, which dictates the size of the vmcore; range 0 (biggest) to 31 (smallest)
More on dump level and message level in 'man makedumpfile'.

5. Make the kdump service run at boot time
OL6: # chkconfig kdump on; chkconfig kdump --list
OL7: # systemctl enable kdump; systemctl is-enabled kdump

6. Manually crash the system to make sure it is working
# echo c > /proc/sysrq-trigger
[After reboot]
# ls -l /var/crash/*
Keep the system in the configured state, so that when the system crashes a vmcore is collected.

7. Examples
a. OL6U10 - VM
[root@ol6u10-vm ~]# cat /proc/cmdline
ro root=UUID=... crashkernel=auto ... console=tty0 console=ttyS0,9600
[root@ol6u10-vm ~]# service kdump status
Kdump is operational
[root@ol6u10-vm ~]# cat /proc/iomem | grep -i crash
27000000-370fffff : Crash kernel
[root@ol6u10-vm ~]# dmesg | grep -e "Reserving .* crashkernel"
[ 0.000000] Reserving 257MB of memory at 624MB for crashkernel (System RAM: 15356MB)
[root@ol6u10-vm ~]# echo c > /proc/sysrq-trigger
[After reboot]
[root@ol6u10-vm ~]# cat /etc/kdump.conf | grep -v '#' | grep -v '^$'
path /var/crash
core_collector makedumpfile -l --message-level 1 -d 31
[root@ol6u10-vm ~]# ls -lhs /var/crash/127.0.0.1-20.../
total 96M
 96M -rw-------. 1 root root 96M Dec 13 00:48 vmcore
 44K -rw-r--r--. 1 root root 41K Dec 13 00:48 vmcore-dmesg.txt
[root@ol6u10-vm ~]# free -h
             total       used       free     shared    buffers     cached
Mem:           14G       395M        14G       208K        10M       114M
-/+ buffers/cache:       270M        14G
Swap:         8.0G         0B       8.0G

b. OL6U10 - BM
[root@ol6u10-bm ~]# cat /proc/cmdline
ro root=UUID=... crashkernel=auto ... console=tty0 console=ttyS0,9600
[root@ol6u10-bm ~]# service kdump status
Kdump is operational
[root@ol6u10-bm ~]# cat /proc/iomem | grep -i crash
27000000-37ffffff : Crash kernel
[root@ol6u10-bm ~]# dmesg | grep -e "Reserving .* crashkernel"
[ 0.000000] Reserving 272MB of memory at 624MB for crashkernel (System RAM: 262010MB)
[root@ol6u10-bm ~]# echo c > /proc/sysrq-trigger
[After Reboot]
[root@ol6u10-bm ~]# cat /etc/kdump.conf | grep -v '#' | grep -v '^$'
path /var/crash
core_collector makedumpfile -l --message-level 1 -d 31
[root@ol6u10-bm ~]# ls -lhs /var/crash/127.0.0.1-20.../
total 1.1G
1.1G -rw-------. 1 root root 1.1G Dec 18 05:24 vmcore
 92K -rw-r--r--. 1 root root 90K Dec 18 05:23 vmcore-dmesg.txt
[root@ol6u10-bm ~]# free -h
             total       used       free     shared    buffers     cached
Mem:          251G       1.4G       250G       224K        13M       157M
-/+ buffers/cache:       1.2G       250G
Swap:         8.0G         0B       8.0G

c. OL7U7 - VM
[root@ol7u7-vm opc]# free -h
              total        used        free      shared  buff/cache   available
Mem:            14G        290M         13G         16M        274M         13G
Swap:          8.0G          0B        8.0G
[root@ol7u7-vm opc]# service kdump status
Redirecting to /bin/systemctl status kdump.service
kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: active (exited) ...
   ... ol7u7-vm systemd[1]: Started Crash recovery kernel arming.
[root@ol7u7-vm opc]# cat /proc/cmdline
BOOT_IMAGE=... crashkernel=auto ... console=tty0 console=ttyS0,9600
[root@ol7u7-vm opc]# dmesg | grep -e "Reserving .* crashkernel"
[ 0.000000] Reserving 257MB of memory at 624MB for crashkernel (System RAM: 15356MB)
[root@ol7u7-vm opc]# cat /proc/iomem | grep -i crash
27000000-370fffff : Crash kernel
[root@ol7u7-vm ~]# ls -lhs /var/crash/127.0.0.1-20.../
total 90M
 90M -rw-------. 1 root root 90M Dec 18 13:48 vmcore
 48K -rw-r--r--. 1 root root 47K Dec 18 13:48 vmcore-dmesg.txt
[root@ol7u7-vm ~]# cat /etc/kdump.conf | grep -v '#' | grep -v "^$"
path /var/crash
core_collector makedumpfile -l --message-level 1 -d 31

d. OL7U7 - BM
[root@ol7u7-bm opc]# free -h
              total        used        free      shared  buff/cache   available
Mem:           251G        1.1G        250G         17M        298M        249G
Swap:          8.0G          0B        8.0G
[root@ol7u7-bm opc]# service kdump status
Redirecting to /bin/systemctl status kdump.service
kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: active (exited) ...
   ... ol7u7-bm systemd[1]: Started Crash recovery kernel arming.
[root@ol7u7-bm opc]# cat /proc/cmdline
BOOT_IMAGE=... crashkernel=auto ... console=tty0 console=ttyS0,9600
[root@ol7u7-bm opc]# cat /proc/iomem | grep -i crash
25000000-35ffffff : Crash kernel
[root@ol7u7-bm opc]# dmesg | grep -e "Reserving .* crashkernel"
[ 0.000000] Reserving 272MB of memory at 592MB for crashkernel (System RAM: 262010MB)
[root@ol7u7-bm opc]# cat /etc/kdump.conf | grep -v '#' | grep -v '^$'
path /var/crash
core_collector makedumpfile -l --message-level 1 -d 31
[root@ol7u7-bm opc]# echo c > /proc/sysrq-trigger
[root@ol7u7-bm ~]# ls -lhs /var/crash/127.0.0.1-20.../
total 1.1G
1.1G -rw-------. 1 root root 1.1G Dec 18 14:16 vmcore
116K -rw-r--r--. 1 root root 114K Dec 18 14:16 vmcore-dmesg.txt
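To make step 2b concrete, here is a minimal sketch of the OL7 side of that change. The parameters already present on the GRUB_CMDLINE_LINUX line will differ on your instance, so treat the line below as illustrative rather than literal. Edit /etc/default/grub so that GRUB_CMDLINE_LINUX contains crashkernel=auto (and, optionally, the serial console parameters from step 3), for example:

GRUB_CMDLINE_LINUX="... crashkernel=auto console=tty0 console=ttyS0,115200n8"

Then regenerate the grub configuration and reboot:

# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot

Once a vmcore has been collected, it can be opened with the crash utility mentioned at the top of this post. This is only a sketch: it assumes the debuginfo package matching the crashed kernel is installed (the package and repository names vary with the kernel in use) and that the dump landed under the default path shown in the examples above:

# yum install kernel-uek-debuginfo
# crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/127.0.0.1-<date>/vmcore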


Linux

Building (Small) Oracle Linux Images For The Cloud

Overview
Oracle Linux Image Tools is a sample project to build small or customized Oracle Linux Cloud images in a repeatable way. It provides a bash modular framework which uses HashiCorp Packer to build images in Oracle VM VirtualBox. Images are then converted to an appropriate format depending on the Cloud provider. This article shows you how to build the sample images from this repository and how to use the framework to build custom images. The framework is based around two concepts: Distribution and Cloud modules. A Distribution module is responsible for the installation and configuration of Oracle Linux as well as the packages needed for your project. The sample ol7-slim distribution provides an Oracle Linux 7 image with a minimalist set of packages (about 250 packages – smaller than an Oracle Linux 7 Minimal Install). A Cloud module ensures that the image is properly configured and packaged for a particular cloud provider. The following modules are currently available:
oci: Oracle Cloud Infrastructure (QCOW2 file)
olvm: Oracle Linux Virtualization Manager (OVA file)
ovm: Oracle VM Server (OVA file)
azure: Microsoft Azure (VHD file)
none: no cloud customization (OVA file)

Build requirements

Environment
A Linux environment is required for building images. The project is developed and tested with Oracle Linux 7, but should run on most Linux distributions. If your environment is a virtual machine, it must support nested virtualization. The build tool needs root privileges to mount the generated images. Ensure sudo is properly configured for the user running the build.

Software
You will need the following software installed:
HashiCorp Packer and Oracle VM VirtualBox:
yum --enablerepo=ol7_developer install packer VirtualBox-6.0
kpartx and qemu-img to manipulate the artifacts:
yum install kpartx qemu-img

Disk space
You will need at least twice the size of your images as free disk space. That is: building a 30GB image will require 60GB of free space.

Building the project images
Building the images from the project is straightforward.

Configuration
Build configuration is done by editing the env.properties file (or better, a copy of it). Options are documented in the property file, but at the very least you must provide:
WORKSPACE: the directory used for the build
ISO_URL / ISO_SHA1_CHECKSUM: location of the Oracle Linux ISO image. You can download it from the Oracle Software Delivery Cloud or use a public mirror. The image is cached in the workspace.
DISTR: the Distribution to build
CLOUD: the target cloud provider

Sample build
The following env.properties.oci property file is used to build a minimal OL7 image for Oracle Cloud Infrastructure, using all default parameters:
WORKSPACE="/data/workspace"
ISO_URL="http://my.mirror.example.com/iso/ol7/OracleLinux-R7-U7-Server-x86_64-dvd.iso"
ISO_SHA1_CHECKSUM="3ef94628cf1025dab5f10bbc1ed2005ca0cb0933"
DISTR="ol7-slim"
CLOUD="oci"
Run the script:
$ ./bin/build-image.sh --env env.properties.oci
+++ build-image.sh: Parse arguments
+++ build-image.sh: Load environment
+++ build-image.sh: Stage Packer files
+++ build-image.sh: Stage kickstart file
+++ build-image.sh: Generate Packer configuration file
+++ build-image.sh: Run Packer
    build-image.sh: Spawn HTTP server
    build-image.sh: Invoke Packer
    ...
    build-image.sh: Package image
+++ build-image.sh: Cleanup Workspace
+++ build-image.sh: All done
+++ build-image.sh: Image available in /data/workspace/OL7U7_x86_64-oci-b0
$
That's it!
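As a quick sanity check of the freshly built artifact, you can inspect it with qemu-img, which is already listed in the build requirements above. This is just a sketch; the exact directory and file name depend on your DISTR, CLOUD and build number:

$ qemu-img info /data/workspace/OL7U7_x86_64-oci-b0/OL7U7_x86_64-oci-b0.qcow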
The /data/workspace/OL7U7_x86_64-oci-b0 directory now contains OL7U7_x86_64-oci-b0.qcow, a QCOW2 file which can be imported and run on OCI.

Adding new modules

Directory layout
Each Distribution module is represented by a subdirectory of the distr directory. Each Cloud module is represented by a subdirectory of the cloud directory. Additionally, Cloud actions for a specific Distribution can be defined in the cloud/<cloud>/<distr> directory. Any element that is not needed can be omitted – e.g. the none cloud module only provides a packaging function. All the env.properties files are merged and made available to the scripts at runtime. They define parameters with default values which can be overridden by the user in the global env.properties file in the project base directory.

Adding a distribution
To add a new distribution, create a directory in distr/ with the following files (a minimal skeleton of these scripts is sketched at the end of this article):
env.properties: parameters for the distribution.
ks.cfg: a kickstart file to bootstrap the installation. This is the only mandatory file.
image-scripts.sh: a shell script with the following optional functions which will be invoked on the build host:
  distr::validate: validate the parameters before the build.
  distr::kickstart: tailor the kickstart file based on the parameters.
  distr::image_cleanup: disk image cleanup run at the end of the build.
provision.sh: a shell script with the following optional functions which will be invoked on the VM used for the build:
  distr::provision: image provisioning (install/configure software)
  distr::cleanup: image cleanup (uninstall software, ...)
files directory: the files in this directory are copied to the image in /tmp/distr and can be used by the provisioning scripts.

Adding a cloud
The process is similar to the distribution: create a directory in cloud/ with the following files:
env.properties: parameters for the cloud.
image-scripts.sh: a shell script with the following optional functions which will be invoked on the build host:
  cloud::validate: validate the parameters before the build.
  cloud::kickstart: tailor the kickstart file based on the parameters.
  cloud::image_cleanup: disk image cleanup run at the end of the build.
  cloud::image_package: package the image in a suitable format for the cloud provider. This is the only mandatory function.
provision.sh: a shell script with the following optional functions which will be invoked on the VM used for the build:
  cloud::provision: image provisioning (install/configure software)
  cloud::cleanup: image cleanup (uninstall software, ...)
files directory: the files in this directory are copied to the image in /tmp/cloud and can be used by the provisioning scripts.
If some cloud actions are specific to a particular distribution, they can be specified in the <cloud>/<distr> subdirectory. If a cloud_distr::image_package function is provided, it will override the cloud::image_package one.

Builder flow
The builder goes through the following steps:
Build environment
  All the env.properties files are sourced and merged. The user-provided one is sourced last and defines the build behavior.
  The validate() functions are called. These hooks perform a sanity check on the parameters.
Packer configuration and run
  The distribution kickstart file is copied and the kickstart() hooks have the opportunity to customize it.
  The distribution is installed in a VirtualBox VM using this kickstart file.
  The files directories are copied to /tmp on the VM.
  The provision() functions are run in the VM.
  The cleanup() functions are run in the VM.
  Packer then shuts down and exports the VM.
Image cleanup
  The generated image is unpacked and mounted on the host.
  The image_cleanup() functions are called.
  The image is unmounted.
  The final package is created by the image_package() function, either from cloud_distr or from cloud.
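To make the hook layout above more tangible, here is a minimal, hypothetical skeleton of the two scripts a new distribution module might provide. The function names come from the lists above; the bodies, the module name mydistr and the parameter MY_PARAM are placeholders rather than code from the project, and the arguments assumed to be passed to each hook should be checked against the framework sources:

distr/mydistr/image-scripts.sh (runs on the build host):

#!/usr/bin/env bash

distr::validate() {
  # Fail early if a required parameter is missing
  [[ -n "${MY_PARAM}" ]] || { echo "MY_PARAM must be set" >&2; exit 1; }
}

distr::kickstart() {
  # Tailor the staged kickstart file (path assumed to be passed in as $1)
  sed -i "s/@MY_PARAM@/${MY_PARAM}/g" "$1"
}

distr::image_cleanup() {
  # Clean the mounted disk image (mount point assumed to be passed in as $1)
  rm -rf "$1/var/log/anaconda"
}

distr/mydistr/provision.sh (runs inside the build VM):

#!/usr/bin/env bash

distr::provision() {
  # Install the packages your project needs
  yum install -y rsync
}

distr::cleanup() {
  # Remove anything that should not ship in the final image
  yum clean all
}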


Events

Live Webcast: Top 5 Reasons to Build your Virtualization with Oracle Linux KVM

Register Today: February 27, 2020
EMEA: 10:00 a.m. GMT / 11:00 CET / 12:00 SAST / 14:00 GST
APAC: 10:30 AM IST / 1:00 PM SGT / 4:00 PM AEDT
North America: 09:00 AM Pacific Standard Time

Recent industry surveys indicate that most enterprises have a strategy of using multiple clouds. Most who are planning to migrate to the cloud start by modernizing their on-premises data center. The choice of Linux and virtualization can make a big impact on their infrastructure, both today and tomorrow. Oracle Linux Virtualization Manager, based on the open source oVirt project, can be easily deployed to configure, monitor, and manage an Oracle Linux KVM environment with enterprise-grade performance and support, both on premises and in the cloud.

Join this webcast to learn from Oracle experts about the top 5 reasons to build your virtualization infrastructure using Oracle Linux KVM:
Accelerated deployment with ready-to-go VMs with Oracle software
Increased performance and security
Simplified, easy management of the full stack
Improved licensing costs through hard partitioning
Lower licensing and support costs while increasing benefits

Featured Speakers
Simon Coter, Director of Product Management for Linux and Virtualization, Oracle. Simon is responsible for both Oracle Linux and Virtualization, the Unbreakable Enterprise Kernel along with all its sub-components and add-ons, including Oracle Linux KVM, Oracle Linux Virtualization Manager, Ceph, Gluster, Oracle VM and VirtualBox.
John Priest, Product Management Director for Oracle Server Virtualization. John covers all aspects of the Oracle Linux Virtualization Manager and Oracle VM product life-cycles.


Libcgroup in the Twenty-First Century

In this blog post, Oracle Linux kernel developer Tom Hromatka writes about the new testing frameworks, continuous integration and code coverage capabilities that have been added to libcgroup. In 2008 libcgroup was created to simplify how users interact with and manage cgroups. At the time, only cgroups v1 existed, the libcgroup source was hosted in a Subversion repository on SourceForge, and System V still ruled the universe. Fast forward to today and the landscape is changing quickly. To pave the way for cgroups v2 support in libcgroup, we have added unit tests, functional tests, continuous integration, code coverage, and more.

Unit Testing
In May 2019 we added the googletest unit testing framework to libcgroup. libcgroup has many large, monolithic functions that perform the bulk of the cgroup management logic, and adding cgroup v2 support to these complex functions could easily introduce regressions. To combat this, we plan on adding tests before we add cgroup v2 support.

Functional Testing
In June 2019 we added a functional test framework to libcgroup. The functional test framework consists of several Python classes that either represent cgroup data or can be used to manage cgroups and the system. Years ago tests were added to libcgroup, but they have proven difficult to run and maintain because they are destructive to the host system's libcgroup hierarchy. With the advent of containers, this problem can easily be avoided. The functional test framework utilizes LXC containers and the LXD interfaces to encapsulate the tests. Running the tests within a container provides a safe environment where cgroups can be created, deleted, and modified in an easily reproducible setting - without destructively modifying the host's cgroup hierarchy. libcgroup's functional tests are quick and easy to write and provide concise and informative feedback on the status of the run. Here's a simple example of a successful test run:

$ ./001-cgget-basic_cgget.py
-----------------------------------------------------------------
Test Results:
    Run Date:                           Dec 02 17:54:28
    Passed:                                   1 test(s)
    Skipped:                                  0 test(s)
    Failed:                                   0 test(s)
-----------------------------------------------------------------
Timing Results:
    Test                                     Time (sec)
    ---------------------------------------------------------
    setup                                          5.02
    001-cgget-basic_cgget.py                       0.76
    teardown                                       0.00
    ---------------------------------------------------------
    Total Run Time                                 5.79

And here's an example of where something went wrong. In this case I have artificially caused the Run() class to raise an exception early in the test run. The framework reports the test and the exact command that failed. The return code, stdout, and stderr from the failing command are also reported to facilitate debugging. And of course the log file contains a chronological history of the entire test run to further help in troubleshooting the root cause.

$ ./001-cgget-basic_cgget.py
-----------------------------------------------------------------
Test Results:
    Run Date:                           Dec 02 18:11:47
    Passed:                                   0 test(s)
    Skipped:                                  0 test(s)
    Failed:                                   1 test(s)
        Test: 001-cgget-basic_cgget.py - RunError:
            command = ['sudo', 'lxc', 'exec', 'TestLibcg', '--',
                       '/home/thromatka/git/libcgroup/src/tools/cgset',
                       '-r', 'cpu.shares=512', '001cgget']
            ret = 0
            stdout = b''
            stderr = b'I artificially injected this exception'
-----------------------------------------------------------------

Continuous Integration and Code Coverage
In September 2019 we added continuous integration and code coverage to libcgroup.
libcgroup's GitHub repository is now linked with Travis CI to automatically configure the library, build the library, run the unit tests, and run the functional tests every time a commit is pushed to the repo. If the tests pass, Travis CI invokes coveralls.io to gather code coverage metrics. The continuous integration status and the current code coverage percentage are prominently displayed on the GitHub source repository. Currently all two :) tests are passing and code coverage is at 16%. I have many more tests currently in progress, so expect to see these numbers improve significantly in the next few months.

Future Work
Ironically, after all these changes, we're now nearly ready to start the "real work." A loose roadmap of our upcoming improvements:
Add an "ignore" rule to cgrulesengd. (While not directly related to the cgroup v2 work, this new ignore rule will heavily utilize the testing capabilities outlined above.)
Add a ton more tests - both unit and functional.
Add cgroup v2 support to our functional testing framework. I have a really rough prototype working, but I think automating it will require help from the Travis CI development team.
Add cgroup v2 capabilities to libcgroup utilities like cgget, cgset, etc.
Design and implement a cgroup abstraction layer that will abstract away all of the gory detail differences between cgroup v1 and cgroup v2.
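For readers who have not used the libcgroup command-line tools that these tests exercise, here is a small cgroup v1 example built around cgcreate, cgset and cgget. It assumes the libcgroup tools are installed and that the cpu controller is mounted in its usual v1 location; the output shown is typical but may vary by system:

# Create a cgroup named demo under the cpu controller
$ sudo cgcreate -g cpu:/demo
# Set and read back the same kind of value the failing test command above manipulates
$ sudo cgset -r cpu.shares=512 demo
$ cgget -r cpu.shares demo
demo:
cpu.shares: 512
# Remove the cgroup when done
$ sudo cgdelete cpu:/demo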


Events

Join the Oracle Linux and Virtualization Team in London at Oracle OpenWorld Europe

The Oracle OpenWorld Global Series continues with our next stop at ExCeL London, February 12–13, 2020. With just 5 days left to register, you'll want to sign up now for your complimentary pass and reserve your place. Across the two days, you can immerse yourself in the infinite possibilities of a data-driven world.

Wednesday, 12 February | Insight Starts Here | Outpace Change with Intelligence
Explore how leading companies, faced with an ever-accelerating pace of change, are unlocking insights with data to re-engineer the core of their business, elevate the value they deliver to customers, pioneer new ways of working, and drive completely new opportunities.

Thursday, 13 February | Innovation Starts Here | Technology-Powered Possibilities
Dive deep into the transformational and autonomous technologies fundamentally changing work and life. Fuel innovation by pulling value from vast amounts of data at scale and unleashing opportunities with AI and machine learning, with a long list of featured speakers and luminaries.

We look forward to seeing you there. Be sure to add these two sessions to your agenda:

Wim Coekaerts, SVP, Software Development, will present a Solution Keynote: Cloud Platform and Middleware Strategy and Roadmap [SOL1194-LON]
Thursday, Feb 13 | 09:00 - 10:20 | Arena F - Zone 6
In this session, Wim Coekaerts will discuss the strategy and vision for Oracle's comprehensive cloud platform services and on-premises software. Customers are on a number of journeys to the cloud: moving and modernizing workloads out of data centers; transitioning off on-premises apps to SaaS; innovating with new API-first, chatbot-based container native applications; optimizing IT operations and security from the cloud; and getting real-time insight leveraging big data and analytics from the cloud. Hear from customers about how they leverage Oracle Cloud for their digital transformation. And hear how Oracle's application development, integration, systems management, and security solutions leverage artificial intelligence to drive cost savings and operational efficiency for hybrid and multicloud ecosystems.

Simon Coter, Product Management Director, Oracle Linux and Virtualization, delivers a Breakout Session: Tools and Techniques for Modern Cloud Native Development [SES1270-LON]
Thursday, Feb 13 | 13:05 - 13:40 | Arena C - Zone 2
Simon Coter will explore the tools, techniques, and strategies you can apply using Oracle Linux to help you evolve toward a cloud native future. On premises or in the cloud, you'll learn how Oracle Linux Cloud Native Environment enables you to deploy reliable, secure, and scalable applications. You will also discover how Kubernetes, Docker, CRI-O, and Kata Containers, available for free with Oracle Linux Premier Support, and Oracle VM VirtualBox deliver an exceptional DevSecOps solution.

Explore all of the conference's content through the detailed content catalogue and add the keynotes and other sessions of interest to your agenda.

Join Us at The Exchange | Zone 3
Talk with product experts and experience the latest Oracle Linux and Oracle Virtualization technologies first hand. You'll find us at two stands in Zone 3 of The Exchange. Don't miss the Raspberry Pi "Mini" supercomputer's search for aliens at the Groundbreakers Hub in Zone 3. Mini is the sibling of Super Pi, the supercomputer demonstrated at Oracle OpenWorld San Francisco in October 2019 and among the top 10 Raspberry Pi projects last year. Mini is a portable Pi cluster in a large pelican-like case on wheels with 84 Raspberry Pi 3B+ boards running Oracle Linux 8. Check out Mini as it searches for aliens with SETI@home.

Bold ideas. Breakthrough technologies. Better possibilities. It all starts here. Register now. We look forward to meeting you in London. Join the conversation: @OracleLinux @OracleOpenWorld #OOWLON #OracleTux


Linux Kernel Development

Unbinding Parallel Jobs in Padata

Oracle Linux kernel developer Daniel Jordan contributes this post on enhancing the performance of padata. padata is a generic framework for parallel jobs in the kernel -- with a twist. It not only schedules jobs on multiple CPUs but also ensures that each job is properly serialized, i.e. finishes in the order it was submitted. This post will provide some background on this somewhat obscure feature of the core kernel and cover recent efforts to enhance its parallel performance in preparation for more multithreading in the kernel. How Padata Works padata allows users to create an instance that represents a certain class of parallel jobs, for example IPsec decryption (see pdecrypt in the kernel source). The instance serves as a handle when submitting jobs to padata so that all jobs submitted with the same handle are serialized amongst themselves. An instance also allows for fine-grained control over which CPUs are used to run work, and contains other internal data such as the next sequence number to assign for serialization purposes and the workqueue used for parallelization. To initialize a job (known cryptically as padata_priv in the code), a pair of function pointers are required, parallel and serial, where parallel is responsible for doing the actual work in a workqueue worker and serial completes the job once padata has serialized it. The user submits the job along with a corresponding instance to the framework via padata_do_parallel to start it running, and once the job's parallel part is finished, the user calls padata_do_serial to inform padata of this. padata_do_serial is currently always called from parallel, but this is not strictly required. padata ensures that a job's serial function is called only when the serial functions of all previously-submitted jobs from the same instance have been called. Though parallelization is ultimately padata's (and this blog post's) reason for being, its serialization algorithm is the most technically interesting part, so I'll go on a tangent to explain a bit about it. For scalability reasons, padata allocates internal per-CPU queues, and there are three types, parallel, reorder, and serial, where each type is used for a different phase of a padata job's lifecycle. When a job is submitted to padata, it's atomically assigned a unique sequence number within the instance that determines the order its serialization callback runs. The sequence number is hashed to a CPU that is used to select which queue a job is placed on. When the job is preparing to execute its parallel function, it is placed on a parallel per-CPU queue that determines which CPU it runs on (this becomes important later in the post). Using a per-CPU queue allows multiple tasks to submit parallel jobs concurrently with only minimal contention from the atomic op on the sequence number, avoiding a shared lock. When the parallel part finishes and the user calls padata_do_serial, padata then places the job on the reorder queue, again corresponding to the CPU that the job hashed to. And finally, a job is placed on the serial queue once all jobs before it have been serialized. During the parallel phase, jobs may finish out of order relative to when they were submitted. Nevertheless, each call to padata_do_serial places the job on its corresponding reorder queue and attempts to process the entire reorder queue across all CPUs, which entails repeatedly checking whether the job with the next unserialized sequence number has finished until there are no more jobs left to reorder. 
These jobs may or may not include the one passed to padata_do_serial because again, jobs finish out of order. This process of checking for the next unserialized job is the biggest potential bottleneck in all of padata because a global lock is used. Without the lock, multiple tasks might process the reorder queues at once, leading to duplicate serial callbacks and list corruption. However, if all calls to padata_do_serial were to wait on the lock when only one call actually ends up processing all the jobs, the rest of the tasks would be waiting for no purpose and introduce unnecessary latency in the system. To avoid this situation, the lock is acquired with a trylock call, and if a task fails to get the lock, it can safely bail out of padata knowing that a current or future lock holder will take care of its job. This serialization process is important for the use case that prompted padata to begin with, IPsec. IPsec throughput was a bottleneck in the kernel because a single CPU, the one that the receiving NIC's interrupt ran on, was doing all the work, with the CPU-intensive portion largely consisting of the crypto operations. Parallelization could address this, but serialization was required to maintain the packet ordering that the upper layer protocols required, and getting that right was not an easy task. See this presentation from Steffen Klassert, the original author of padata, for more background. More Kernel Multithreading Though padata was designed to be generic, it currently has just the one IPsec user. There are more kernel codepaths that can benefit from parallelization, such as struct page initialization, page clearing in various memory management paths (huge page fallocate, get_user_pages), and page freeing at munmap and exit time. Two previous blog posts and an LWN article on ktask have covered some of these. Recent upstream feedback has called for merging ktask with padata, and the first step in that process is to change where padata schedules its parallel workers. To that end, I posted a series on the mailing lists, merged for the v5.3 release, that adds a second workqueue per padata instance dedicated to parallel jobs. Earlier in the post, I described padata's per-CPU parallel queues. To assign a job to one of these queues, padata uses a simple round-robin algorithm to hash a job's sequence number to a CPU, and then runs the job bound to that CPU alone. Each successive job submitted to the instance runs on the next CPU. There are two problems with this approach. First, it's not NUMA-aware, so on multi-socket systems, a job may not run locally. Second, on a busy system, a job will likely complete faster if it allows the scheduler to select the CPU within the NUMA node it's run on. To solve both problems, the series uses an unbound workqueue, which is NUMA-aware by default and not bound to a particular CPU (hence the name). Performance Results The numbers from tcrypt, a test module in the kernel's crypto layer, look promising. Parts are shown here, see the upstream post for the full data. Measurements are from a 2-socket, 20-core, 40-CPU Xeon server. For repeatability, modprobe was bound to a CPU and the serial cpumasks for both pencrypt and pdecrypt were also restricted to a CPU different from modprobe's. 
# modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
# modprobe tcrypt mode=211 sec=1
# modprobe tcrypt mode=215 sec=1

Busy system (tcrypt run while 10 stress-ng tasks were burning 100% CPU)

                           base              test
                           ---------------   ---------------
speedup   key_sz   blk_sz  ops/sec   stdev   ops/sec   stdev

(pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)
117.2x    160      16      960       30      112555    24775
135.1x    160      64      845       246     114145    25124
113.2x    160      256     993       17      112395    24714
111.3x    160      512     1000      0       111252    23755
110.0x    160      1024    983       16      108153    22374
104.2x    160      2048    985       22      102563    20530
98.5x     160      4096    998       3       98346     18777
86.2x     160      8192    1000      0       86173     14480

multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)
242.2x    160      16      2363      141     572189    16846
242.1x    160      64      2397      151     580424    11923
231.1x    160      256     2472      21      571387    16364
237.6x    160      512     2429      24      577264    8692
238.3x    160      1024    2384      97      568155    6621
216.3x    160      2048    2453      74      530627    3480
209.2x    160      4096    2381      206     498192    19177
176.5x    160      8192    2323      157     410013    9903

Idle system (tcrypt run by itself)

                           base              test
                           ---------------   ---------------
speedup   key_sz   blk_sz  ops/sec   stdev   ops/sec   stdev

(pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=211)
2.5x      160      16      63412     43075   161615    1034
4.1x      160      64      39554     24006   161653    981
6.0x      160      256     26504     1436    160110    1158
6.2x      160      512     25500     40      157018    951
5.9x      160      1024    25777     1094    151852    915
5.8x      160      2048    24653     218     143756    508
5.6x      160      4096    24333     20      136752    548
5.0x      160      8192    23310     15      117660    481

multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)
1.0x      160      16      412157    3855    426973    1591
1.0x      160      64      412600    4410    431920    4224
1.1x      160      256     410352    3254    453691    17831
1.2x      160      512     406293    4948    473491    39818
1.2x      160      1024    395123    7804    478539    27660
1.2x      160      2048    385144    7601    453720    17579
1.2x      160      4096    371989    3631    449923    15331
1.2x      160      8192    346723    1617    399824    18559

A few tools were used in the initial performance analysis to confirm the source of the speedups. I'll show results from one of them, ftrace. Custom kernel events were added to record the runtime and CPU number of each crypto request, which runs a padata job under the hood. For analysis only (not the runs that produced these results), the threads of the competing workload stress-ng were bound to a known set of CPUs, and two histograms were created of crypto request runtimes, one for just the CPUs without the stress-ng tasks ("uncontended") and one with ("contended"). The histogram clearly shows increased times for the padata jobs with contended CPUs, as expected:

Crypto request runtimes (usec) on uncontended CPUs
# request-count: 11980; mean: 41; stdev: 23; median: 45
runtime (usec)          count
--------------          --------
    0 - 1      [    0]:
    1 - 2      [    0]:
    2 - 4      [    0]:
    4 - 8      [  209]: *
    8 - 16     [ 3630]: *********************
   16 - 32     [  188]: *
   32 - 64     [ 6571]: **************************************
   64 - 128    [ 1381]: ********
  128 - 256    [    1]:
  256 - 512    [    0]:
  512 - 1024   [    0]:
 1024 - 2048   [    0]:
 2048 - 4096   [    0]:
 4096 - 8192   [    0]:
 8192 - 16384  [    0]:
16384 - 32768  [    0]:

Crypto request runtimes (usec) on contended CPUs
# request-count: 3991; mean: 3876; stdev: 455; median 3999
runtime (usec)          count
--------------          --------
    0 - 1      [    0]:
    1 - 2      [    0]:
    2 - 4      [    0]:
    4 - 8      [    0]:
    8 - 16     [    0]:
   16 - 32     [    0]:
   32 - 64     [    0]:
   64 - 128    [    0]:
  128 - 256    [    0]:
  256 - 512    [    4]:
  512 - 1024   [    4]:
 1024 - 2048   [    0]:
 2048 - 4096   [ 3977]: **************************************
 4096 - 8192   [    4]:
 8192 - 16384  [    2]:
16384 - 32768  [    0]:

Conclusion
Now that padata has unbound workqueue support, look out for further enhancements to padata in coming releases!
Next steps include creating padata threads in cgroups so they can be properly throttled and adding multithreaded job support to padata.
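For readers who want to reproduce the measurement setup described above, padata exposes the pcrypt instance cpumasks through sysfs once the pcrypt module is loaded. The sketch below shows the general idea of pinning the serial callbacks and the modprobe to different CPUs; verify the exact sysfs paths and cpumask format on your kernel before relying on them:

# Load pcrypt and restrict the serial callbacks of both instances to CPU 0
$ sudo modprobe pcrypt
$ echo 1 | sudo tee /sys/kernel/pcrypt/pencrypt/serial_cpumask
$ echo 1 | sudo tee /sys/kernel/pcrypt/pdecrypt/serial_cpumask
# Run the tcrypt measurements from a different CPU for repeatability
$ sudo taskset -c 1 modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
$ sudo taskset -c 1 modprobe tcrypt mode=211 sec=1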


Announcements

Announcing the First Oracle Linux 7 Template for Oracle Linux KVM

We are proud to announce the first Oracle Linux 7 Template for Oracle Linux KVM and Oracle Linux Virtualization Manager. The new Oracle Linux 7 Template for Oracle Linux KVM and Oracle Linux Virtualization Manager supplies powerful automation. It is built on cloud-init, the same technology used today on Oracle Cloud Infrastructure. The template has been built with the following components/options:
Oracle Linux 7 Update 7 x86_64
Unbreakable Enterprise Kernel 5 - kernel-uek-4.14.35-1902.5.2.2.el7uek.x86_64
Red Hat Compatible Kernel - kernel-3.10.0-1062.1.2.el7.x86_64
8GB of RAM
15GB of OS virtual disk

Downloading Oracle Linux 7 Template for Oracle Linux KVM
Oracle Linux 7 Template for Oracle Linux KVM is available on Oracle Software Delivery Cloud:
Search for "Oracle Linux KVM" and select "Oracle Linux KVM Templates for Oracle Linux".
Click on the "Add to Cart" button and then click on "Checkout" in the upper right corner.
On the following window, select "Linux-x86_64" and click on the "Continue" button.
Accept the "Oracle Standard Terms and Restrictions" to continue and, on the following window, click on "V988166-01.zip" to download the Oracle Linux 7 Template for Oracle Linux KVM and on "V988167-01.zip" to download the README with instructions.

Further information
Oracle Linux 7 Template for Oracle Linux KVM allows you to configure different options on the first boot of your Virtual Machine; the cloud-init options configured on the Oracle Linux 7 Template are:
VM Hostname: define the Virtual Machine hostname
Configure Timezone: define the Virtual Machine timezone (within an existing available list)
Authentication - Username: define a custom Linux user on the Virtual Machine
Authentication - Password / Verify Password: define the password for the custom Linux user on the Virtual Machine
Authentication - SSH Authorized Keys: SSH authorized keys to get password-less access to the Virtual Machine
Authentication - Regenerate SSH Keys: option to regenerate the Virtual Machine host SSH keys
Networks - DNS Servers: define the Domain Name Servers for the Virtual Machine
Networks - DNS Search Domains: define the Domain Name Server search domains for the Virtual Machine
Networks - In-guest Network Interface Name: define the virtual-NIC device name for the Virtual Machine (ex. eth0)
Custom script: execute a custom script at the end of the cloud-init configuration process
All of those options can be easily managed in the "Oracle Linux Virtualization Manager" web interface by editing the Virtual Machine and enabling the "Cloud-Init/Sysprep" option. Further details on how to import and use the Oracle Linux 7 Template for Oracle Linux KVM are available in this Technical Article on Simon Coter's Oracle Blog.

Oracle Linux KVM & Virtualization Manager Support
Support for Oracle Linux Virtualization Manager is available to customers with an Oracle Linux Premier Support subscription. Refer to Oracle Unbreakable Linux Network for additional resources on Oracle Linux support.

Oracle Linux Resources
Documentation: Oracle Linux Virtualization Manager Documentation
Blogs: Oracle Linux Blog; Oracle Virtualization Blog
Community Pages: Oracle Linux
Product Training and Education: Oracle Linux Administration - Training and Certification
Data Sheets, White Papers, Videos, Training, Support & more
Oracle Linux Social Media: Oracle Linux on YouTube; Oracle Linux on Facebook; Oracle Linux on Twitter
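As an illustration of the "Custom script" option, here is a small, hypothetical script that could be pasted into that field. It installs one extra package and leaves a marker file so you can confirm that cloud-init ran on first boot; adapt it to whatever first-boot customization your Virtual Machines actually need:

#!/bin/bash
# Hypothetical first-boot customization executed by cloud-init
yum install -y git
echo "cloud-init custom script completed on $(date)" > /var/log/firstboot-marker.log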


On The Benefits of Static Trace Points

Chuck Lever is a Linux Kernel Architect working with the Oracle Linux and Unbreakable Enterprise Kernel team at Oracle. He contributed this article about replacing printk debugging with static trace points in the kernel. On The Benefits of Static Trace Points These days, kernel developers have several choices when it comes to reporting exceptional events. Among them: a console message; a static trace point; Dtrace; or, a Berkeley Packet Filter script. Arguably the best choice for building an observability framework into the kernel is the judicious use of static trace points. Amongst the several kernel debugging techniques that are currently in vogue, we like static trace points. Here's why. A little history Years ago IBM coined the term First Failure Data Capture (FFDC). Capture enough data about a failure, just as it occurs the first time, so that reproducing the failure is all but unnecessary. An observability framework is a set of tools that enable system administrators to monitor and troubleshoot systems running in production, without interfering with efficient operation. In other words, it captures enough data about any failure that occurs so that a failure can be root-caused and possibly even fixed without the need to reproduce the failure in vitro. Of course, FFDC is an aspirational goal. There will always be a practical limit to how much data can be collected, managed, and analyzed without impacting normal operation. The key is to identify important exceptional events and place hooks in those areas to record those events as they happen. These exceptional events are hopefully rare enough that the captured data is manageable. And the hooks themselves must introduce little or no overhead to a running system. The trace point facility The trace point facility, also known as ftrace, has existed in the Linux kernel for over a decade. Each static trace point is an individually-enabled call out that records a set of data as a structured record into a circular buffer. An area expert determines where each trace point is placed, what data is stored in the structured record, and how the stored record should be displayed (i.e., a print format specifier string). The format of the structured record acts as a kernel API. It is much simpler to parse than string output by printk. User space tools can filter trace data based on values contained in the fields (e.g., show me just trace events where "status != 0"). Each trace point is always available to use, as it is built into the code. When triggered, a trace point can do more than capture the values of a few variables. It also records a timestamp and whether interrupts are enabled, and which CPU, which PID, and which executable is running. It is also able to enable or disable other trace points, or provide a stack trace. Dtrace and eBPF scripts can attach to a trace point, and hist triggers are also possible. Trace point buffers are allocated per CPU to eliminate memory contention and lock waiting when a trace event is triggered. There is a default set of buffers ready from system boot onward. However, trace point events can be directed into separate buffers. This permits several different tracing operations to occur concurrently without interfering with each other. These buffers can be recorded into files, transmitted over the network, or read from a pipe. If a system crash should occur, captured trace records still reside in these buffers and can be examined using crash dump analysis tools. 
The benefits of trace points
Trace points can be safely placed in code that runs at interrupt context as well as code that runs in process context, unlike printk(). Also unlike printk(), individual trace points can be enabled, rather than every printk() at a particular log level. Groups of trace points can be conveniently enabled or disabled with a single operation, and can be combined with other more advanced ftrace facilities such as function_graph tracing. Trace points are designed to be low overhead, especially when they are disabled. The code that structures trace point data and inserts it into a trace buffer is out-of-line, so that an uncalled trace point adds very little instruction cache footprint. The actual code at the call site is nothing more than a load and a conditional branch. This is unlike some debugging mechanisms that place no-ops in the code, and then modify the code when they are enabled. This technique would not be effective if the executable resides in read-only memory, but a trace point in the same code can continue to work.

What about printk?
In contrast, printk() logs messages onto the system console and directly into the kernel's log file (typically /var/log/messages). In recent Linux distributions, kernel log output is rate-limited, which means an important but noisy stream of printk() messages can be temporarily disabled by software just before that one critical log message comes out. In addition, in lights-out environments, the console can be a serial device set to a relatively low baud rate. A copious stream of printk() messages can trigger workqueue stalls or worse as the console device struggles to keep up.

How do I use trace points?
We've described a good way to identify and record exceptional events, using static trace points. How are the captured events recorded to a file for analysis? The trace-cmd(1) tool permits a privileged user to specify a set of events to enable and direct the output to a file or across a network, and then filter and display the captured data. This tool is packaged and available for download in Oracle Linux RPM channels. A graphical front-end for trace-cmd called kernelshark is also available. In addition, Oracle has introduced a facility for continuous monitoring of trace point events called Flight Data Recorder (FDR for short). FDR is started by systemd and enables a configured set of trace points to monitor. It captures event data to a round-robin set of files, limiting the amount of data so it does not overrun the local root filesystem. A configuration file allows administrators to adjust the set of trace points that are monitored. Because this facility is always on, it can capture events at the time of a crash. The captured trace point data is available in files or it can be examined by crash analysis. To keep this article short, we've left out plenty of other benefits and details about static trace points. You can read more about them by following the links below. These links point to articles about trace point-related user space tools, clever tips and tricks, how to insert trace points into your code, and much much more. Several of the links below point to lwn.net, which is such a valuable resource to the Linux community. I encourage everyone to consider a subscription!
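As a concrete starting point, here is a minimal trace-cmd session. The sched:sched_switch event is used only because it exists on every recent kernel; substitute the trace points you actually care about:

# Record a well-known event while a workload runs, then display the captured data
$ sudo trace-cmd record -e sched:sched_switch sleep 5
$ sudo trace-cmd report | head

The same event can also be toggled directly through the tracefs interface (mounted under /sys/kernel/debug/tracing on most systems):

$ echo 1 | sudo tee /sys/kernel/debug/tracing/events/sched/sched_switch/enable
$ sudo head /sys/kernel/debug/tracing/trace
$ echo 0 | sudo tee /sys/kernel/debug/tracing/events/sched/sched_switch/enable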
First Failure Data Capture: https://www.ibm.com/garage/method/practices/manage/first-failure-data-capture
Using the Linux Kernel Tracepoints: https://www.kernel.org/doc/html/latest/trace/tracepoints.html
Debugging the kernel using Ftrace: https://lwn.net/Articles/365835/
trace-cmd: A front-end for Ftrace: https://lwn.net/Articles/410200/
Flight Data Recorder: https://github.com/oracle/fdr
Hist Triggers in Linux 4.7: http://www.brendangregg.com/blog/2016-06-08/linux-hist-triggers.html
Ftrace: The hidden light switch: https://lwn.net/Articles/608497/
Triggers for Tracing: https://lwn.net/Articles/556186/
Finding Origins of Latencies Using Ftrace: https://static.lwn.net/images/conf/rtlws11/papers/proc/p02.pdf


Linux

Linux Kernel Developments Since 5.0: Features and Developments of Note

Introduction
Last year, I covered features in Linux kernel 5.0 that we thought were worth highlighting. Unbreakable Enterprise Kernel 6 is based on stable kernel 5.4 and was recently made available as a developer preview. So, now is as good a time as any to review developments that have occurred since 5.0. While the features below are roughly in chronological order, there is no significance to the order otherwise.

BPF spinlock patches
BPF (Berkeley Packet Filter) spinlock patches give BPF programs increased control over concurrency. Learn more about BPF and how to use it in this seven part series by Oracle developer Alan Maguire.

Btrfs ZSTD compression
The Btrfs filesystem now supports the use of multiple ZSTD (Zstandard) compression levels. See this commit for some information about the feature and the performance characteristics of the various levels.

Memory compaction improvements
Memory compaction has been reworked, resulting in significant improvements in compaction success rates and CPU time required. In benchmarks that try to allocate Transparent HugePages in deliberately fragmented virtual memory, the number of pages scanned for migration was reduced by 65% and the free scanner was reduced by 97.5%.

io_uring API for high-performance async IO
The io_uring API has been added, providing a new (and hopefully better) way of achieving high-performance asynchronous I/O.

Build improvements to avoid unnecessary retpolines
The GCC compiler can use indirect jumps for switch statements; those can end up using retpolines on x86 systems. The resulting slowdown is evidently inspiring some developers to recode switch statements as long if-then-else sequences. In 5.1, the compiler's case-values-threshold will be raised to 20 for builds using retpolines, meaning that GCC will not create indirect jumps for statements with fewer than 20 branches, addressing the performance issue without the need for code changes that might well slow things down on non-retpoline systems. See patch.

Improved fanotify() to efficiently detect changes on a large filesystem
fanotify is a mechanism for monitoring filesystem events. This improvement enables watching the super block root, so a listener can be notified when any file is changed anywhere on the filesystem. See patch.

Higher frequency Pressure Stall Information monitoring
First introduced in 4.20, Pressure Stall Information (PSI) tells a system administrator how much wall clock time an application spends, on average, waiting for system resources such as memory or CPU. This view into how resource-constrained a system is can help prevent catastrophe. Whereas previously PSI only reported averages for fixed, relatively large time windows, these improvements enable user-defined and more fine-grained measurements as well as mechanisms to be notified when thresholds are reached. For more information, see this article.

devlink health notifications
The new "devlink health" mechanism provides notifications when an interface device has problems. See this merge commit and this documentation for details.

BPF optimizations
The BPF verifier has seen some optimization work that yields a 20x speedup on large programs. That has enabled an increase in the maximum program size (for the root user) from 4096 instructions to 1,000,000. Read more about the BPF Verifier here.

Pressure stall monitors
Pressure stall monitors, which allow user space to detect and respond quickly to memory pressure, have been added. See this commit for documentation and a sample program.
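To make the two PSI items above concrete, here is a small shell sketch of the interface. Reading the pressure files works on any kernel with PSI enabled; the trigger format in the comment follows the kernel's PSI documentation, so double-check it against the kernel version you are running:

# Show the 10s/60s/300s pressure averages for cpu, memory and io
$ cat /proc/pressure/cpu /proc/pressure/memory /proc/pressure/io
# A user-defined trigger is registered by writing, for example,
# "some 150000 1000000" (150ms of stall within a 1s window) to one of these
# files and then poll()ing the open file descriptor for notifications.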
MM optimizations to reduce unnecessary cache line movements/TLB misses
Optimizations to memory management code reduce TLB (translation lookaside buffer) misses. More details in this commit.

Control Group v2 enhancements
Control Group, or cgroup, is a kernel feature that enables hierarchical grouping of processes such that their use of system resources (memory, CPU, I/O, etc.) can be controlled, monitored and limited. Version 1 of this feature has been in the kernel for a long time and is a crucial element of the implementation of containers in Linux. Version 2, or cgroup v2, is a rework of control group, under development since version 4 of the kernel, that intends to remove inconsistencies and enable better resource isolation and better management for containers. Some of its characteristics include:
a unified hierarchy
better support for rootless, unprivileged containers
secure delegation of cgroups
See also this documentation.

Power efficient userspace waiting
The x86 umonitor, umwait, and tpause instructions are available in user-space code; they make it possible to efficiently execute small delays without the need for busy loops on Intel "Tremont" chips. Thus, applications can employ short waits while using less power and with reduced impact on the performance of other hyperthreads. A tunable has been provided to allow system administrators to control the maximum period for which the CPU can be paused.

pidfd_open() system call
The pidfd_open() system call has been added; it allows a process to obtain a pidfd for another, existing process. It is also now possible to use poll() on a pidfd to get notification when the associated process dies.

kdump support for AMD Secure Memory Encryption (SME)
See this article for more details.

Exposing knfsd state to userspace
The NFSv4 server now creates a directory under /proc/fs/nfsd/clients with information about current NFS clients, including which files they have open. See patch. Previously, it was not possible to get information about files held open by NFSv4 clients.

haltpoll CPU idle governor
The "haltpoll" CPU idle governor has been merged. This governor will poll for a while before halting an otherwise idle CPU; it is intended for virtualized guest applications where it can improve performance by avoiding exits to the hypervisor. See this commit for some more information.

New madvise() commands
There are two new madvise() commands to force the kernel to reclaim specific pages. MADV_COLD moves the indicated pages to the inactive list, essentially marking them unused and suitable targets for page reclaim. A stronger variant is MADV_PAGEOUT, which causes the pages to be reclaimed immediately.

dm-clone target
The new dm-clone target makes a copy of an existing read-only device. "The main use case of dm-clone is to clone a potentially remote, high-latency, read-only, archival-type block device into a writable, fast, primary-type device for fast, low-latency I/O". More information can be found in this commit.

virtio-fs
Virtio-fs is a shared file system that lets virtual machines access a directory tree on the host. See this document and this commit message for more information.

Kernel lockdown
Kernel lockdown seeks to improve on guarantees that a system is running software intended by its owner. The idea is to build on protections offered at boot time (e.g. by UEFI secure boot) and extend them such that no program can modify the running kernel. This has recently been implemented as a security module.
Improved AMD EPYC scheduler/load balancing
Fixes to ensure the scheduler properly load balances across NUMA nodes on different sockets. See commit message.

Preparations for realtime preemption
Those who need realtime support in Linux have to this day had to settle for using the out-of-tree patchset PREEMPT_RT. 5.4 saw a number of patches preparing the kernel for native PREEMPT_RT support.

pidfd API
pidfd is a new concept in the kernel that represents a process as a file descriptor. As described in this article, the primary purpose is to prevent the delivery of signals to the wrong process should the target exit and be replaced, at the same ID, by an unrelated process, also known as PID recycling.

Conclusion
In slightly less than a year, a lot has happened in mainline kernel development. While the features covered here represent a mere subset of all the work that went into the kernel since 5.0, we thought they were noteworthy. If there are features you think we missed, please let us know in the comments!

Acknowledgments
lwn.net
kernelnewbies.org
Chuck Anderson, Oracle
Scott Davenport, Oracle


Linux Kernel Development

XFS - Online Filesystem Checking

XFS Upstream maintainer Darrick Wong provides another instalment, this time focusing on how to facilitate sysadmins in maintaining healthy filesystems. Since Linux 4.17, I have been working on an online filesystem checking feature for XFS. As I mentioned in the previous update, the online fsck tool (named xfs_scrub) walks all internal filesystem metadata records. Each record is checked for obvious corruptions before being cross-referenced with all other metadata in the filesystem. If problems are found, they are reported to the system administrator through both xfs_scrub and the health reporting system. As of Linux 5.3 and xfsprogs 5.3, online checking is feature complete and has entered the stabilization and performance optimization stage. For the moment it remains tagged experimental, though it should be stable. We seek early adopters to try out this new functionality and give us feedback. Health Reporting A new feature under development since Linux 5.2 is the new metadata health reporting feature. In its current draft form, it collects checking and corruption reports from the online filesystem checker, and can report that to userspace via the xfs_spaceman health command. Soon, we will begin connecting it to all other places in the XFS codebase where we test for metadata problems so that administrators can find out if a filesystem observed any errors during operation. Reverse Mapping Three years ago, I also introduced the reverse space mapping feature to XFS. At its core is a secondary index of storage space usage that effectively provides a redundant copy of primary space usage metadata. This adds some overhead to filesystem operations, but its inclusion in a filesystem makes cross-referencing very fast. It is an essential feature for repairing filesystems online because we can rebuild damaged primary metadata from the secondary copy. The feature graduated from EXPERIMENTAL status in Linux 4.16 and is production ready. However, online filesystem checking and repair is (so far) the only use case for this feature, so it will remain opt-in at least until online checking graduates to production readiness. To try out this feature, pass the parameter -m rmapbt=1 to mkfs.xfs when formatting a new filesystem. Online Filesystem Repair Work has continued on online repair over the past two years. The basic core of how it works has not changed (we use reverse mapping information to reconnect damaged primary metadata), but our rigorous review processes have revealed other areas of XFS that could be improved significantly ahead of landing online repair support. For example, the offline repair tool (xfs_repair) rebuilds the filesystem btrees in bulk by regenerating all the records in memory and then writing out fully formed btree blocks all at once. The original online repair code would rebuild indices one record at a time to avoid running afoul of other transactions, which was not efficient. Because this is an opportunity to share code, I have cleaned up xfs_repair's code into a generic btree bulk load function and have refactored both repair tools to use it. Another part of repair that has been re-engineered significantly is how we stage those new records in memory. In the original design, we simply used kernel memory to hold all the records. The memory stress that this introduced made running repair a risky operation until I realized that repair should be running on a fully operational system. This means that we can store those records in memory that can be swapped out to conserve working set size. 
A potential third area for improvement is avoiding filesystem freezes to repair metadata. While freezing the filesystem to run a repair probably involves less downtime than unmounting, it would be very useful if we could isolate an allocation group that is found to be bad. This will reduce service impacts and is probably the only practical way to repair the reverse mapping index. I look forward to sending out a new revision of the online repair code in 2020 for further review. Demonstration: Online File System Check Online filesystem checking is a component that must be built into the Linux kernel at compile time by enabling the CONFIG_XFS_ONLINE_SCRUB kernel option. Checks are driven by a userspace utility named xfs_scrub. When run, this program announces itself as an experimental technical preview. Your kernel distributor must enable the option for the feature to work. On Debian and Ubuntu systems, the program is shipped in the regular xfsprogs package. On RedHat and Fedora systems, it is shipped in the xfsprogs-xfs_scrub package and must be installed separately. You can, of course, compile kernel and userspace from source. Let's try out the new program. It isn't very chatty by default, so we invoke it with the -v option to display status information and the -n option because we only want to check metadata: # xfs_scrub -n -v /storage/ EXPERIMENTAL xfs_scrub program in use! Use at your own risk! Phase 1: Find filesystem geometry. /storage/: using 4 threads to scrub. Phase 2: Check internal metadata. Info: AG 1 superblock: Optimization is possible. Info: AG 2 superblock: Optimization is possible. Info: AG 3 superblock: Optimization is possible. Phase 3: Scan all inodes. Info: /storage/: Optimizations of inode record are possible. Phase 5: Check directory tree. Info: inode 139431063 (1/5213335): Unicode name "arn.lm" in directory could be confused with "am.lm". Info: inode 407937855 (3/5284671): Unicode name "obs-l.I" in directory could be confused with "obs-1.I". Info: inode 407937855 (3/5284671): Unicode name "obs-l.X" in directory could be confused with "obs-1.X". Info: inode 688764901 (5/17676261): Unicode name "empty-fl.I" in directory could be confused with "empty-f1.I". Info: inode 688764901 (5/17676261): Unicode name "empty-fl.X" in directory could be confused with "empty-f1.X". Info: inode 688764901 (5/17676261): Unicode name "l.I" in directory could be confused with "1.I". Info: inode 688764901 (5/17676261): Unicode name "l.X" in directory could be confused with "1.X". Info: inode 944886180 (7/5362084): Unicode name "l.I" in directory could be confused with "1.I". Info: inode 944886180 (7/5362084): Unicode name "l.X" in directory could be confused with "1.X". Phase 7: Check summary counters. 279.1GiB data used; 3.5M inodes used. 262.2GiB data found; 3.5M inodes found. 3.5M inodes counted; 3.5M inodes checked. As you can see, metadata checking is split into different phases: This phase gathers information about the filesystem and tests whether or not online checking is supported. Here we examine allocation group metadata and aggregated filesystem metadata for problems. These include free space indices, inode indices, reverse mapping and reference count information, and quota records. In this example, the program lets us know that the secondary superblocks could be updated, though they are not corrupt. Now we scan all inodes for problems in the storage mappings, extended attributes, and directory contents, if applicable. No problems found here! 
Repairs are performed on the filesystem in this phase, though only if the user did not invoke the program with -n. Directories and extended attributes are checked for connectivity and naming problems. Here, we see that the program has identified several directories containing file names that could render similarly enough to be confusing. These aren't filesystem errors per se, but should be reviewed by the administrator. If enabled with -x, this phase scans the underlying disk media for latent failures. In the final phase, we compare the summary counters against what we've seen and report on the effectiveness of our scan. As you can see, we found all the files and most of the file data. Our sample filesystem is in good shape! We saw a few things that could be optimized or reviewed, but no corruptions were reported. No data have been lost. However, this is not the only way we can run xfs_scrub! System administrators can set it up to run in the background when the system is idle. xfsprogs ships with the appropriate job control files to run as a systemd timer service or a cron job. The systemd timer service can be run automatically by enabling the timer: # systemctl start xfs_scrub_all.timer # systemctl list-timers NEXT LEFT LAST PASSED UNIT ACTIVATES Thu 2019-11-28 03:10:59 PST 12h left Wed 2019-11-27 07:25:21 PST 7h ago xfs_scrub_all.timer xfs_scrub_all.service <listing shortened for brevity> When enabled, the background service will email failure reports to root. Administrators can configure when the service runs by running systemctl edit xfs_scrub_all.timer, and where the failure reports are sent by running systemctl edit xfs_scrub_fail@.service to change the EMAIL_ADDR variable. The systemd service takes advantage of systemd's sandboxing capabilities to restrict the program to idle priority and to run with as few privileges as possible. For systems that have cron installed (but not systemd), a sample cronjob file is shipped in /usr/lib/xfsprogs/xfs_scrub_all.cron. This file can be edited as necessary and copied to /etc/cron.d/. Failure reports are dispatched to wherever cronjob errors are sent. Demonstration: Health Reporting A comprehensive health report can be generated with the xfs_spaceman tool. The report contains health status about allocation group metadata and inodes in the filesystem: # xfs_spaceman -c 'health -c' /storage filesystem summary counters: ok AG 0 superblock: ok AG 0 AGF header: ok AG 0 AGFL header: ok AG 0 AGI header: ok AG 0 free space by block btree: ok AG 0 free space by length btree: ok AG 0 inode btree: ok AG 0 free inode btree: ok AG 0 overall inode state: ok <snip> inode 501370 inode core: ok inode 501370 data fork: ok inode 501370 extended attribute fork: ok This concludes our demonstrations. We hope you'll try out these new features and let us know what you think!
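As a practical follow-up to the background checking discussion above, here is a minimal sketch of enabling the periodic scrub on a systemd-based system; the unit names come from the xfsprogs packaging described in this post, but verify them on your distribution before relying on this:

# systemctl enable --now xfs_scrub_all.timer
# systemctl list-timers xfs_scrub_all.timer
# systemctl edit xfs_scrub_fail@.service

The last command opens an override file in which the EMAIL_ADDR variable can be changed so that failure reports go somewhere other than root.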


Announcements

Announcing Oracle VirtIO Drivers 1.1.5 for Microsoft Windows

We are pleased to announce Oracle VirtIO Drivers for Microsoft Windows release 1.1.5. The Oracle VirtIO Drivers for Microsoft Windows are paravirtualized (PV) drivers for Microsoft Windows guests that are running on Oracle Linux KVM. The Oracle VirtIO Drivers for Microsoft Windows improve performance for network and block (disk) devices on Microsoft Windows guests and resolve common issues. What's New? Oracle VirtIO Drivers for Microsoft Windows 1.1.5 provides:
An updated installer to configure a guest VM for migration from another VM technology to Oracle Cloud Infrastructure (OCI) without the need to select a custom installation
VirtIO SCSI and Block storage drivers, updated to release 1.1.5, with support for dumping crash files
The signing of the drivers for Microsoft Windows Server 2019
The installer enables the use of the VirtIO drivers at boot time so that the migrated guest can boot in OCI
Note: If installing these drivers on Microsoft Windows 2008 SP2 and 2008 R2, you will need to first install the following update from Microsoft: 2019-08 Security Update for Windows Server 2008 for x64-based Systems (KB4474419). Failure to do this may result in errors during installation due to the inability to validate signatures of the drivers. Please follow the normal Windows installation procedure for this Microsoft update. Oracle VirtIO Drivers Support Oracle VirtIO Drivers 1.1.5 support the KVM hypervisor with Oracle Linux 7 on premises and on Oracle Cloud Infrastructure. The following guest Microsoft Windows operating systems are supported:
Guest OS                              64-bit  32-bit
Microsoft Windows Server 2019         Yes     N/A
Microsoft Windows Server 2016         Yes     N/A
Microsoft Windows Server 2012 R2      Yes     N/A
Microsoft Windows Server 2012         Yes     N/A
Microsoft Windows Server 2008 R2 SP1  Yes     N/A
Microsoft Windows Server 2008 SP2     Yes     Yes
Microsoft Windows Server 2003 R2 SP2  Yes     Yes
Microsoft Windows 10                  Yes     Yes
Microsoft Windows 8.1                 Yes     Yes
Microsoft Windows 8                   Yes     Yes
Microsoft Windows 7 SP1               Yes     Yes
Microsoft Windows Vista SP2           Yes     Yes
For further details related to support and certifications, refer to the Oracle Linux 7 Administrator's Guide. Additional information on the Oracle VirtIO Drivers 1.1.5 certifications can be found in the Windows Server Catalog. Downloading Oracle VirtIO Drivers Oracle VirtIO Drivers release 1.1.5 is available on the Oracle Software Delivery Cloud by searching on "Oracle Linux" and selecting "DLP:Oracle Linux 7.7.0.0.0 ( Oracle Linux )". Click on the "Add to Cart" button and then click on "Checkout" in the upper right corner. On the following window, select "x86-64" and click on the "Continue" button. Accept the "Oracle Standard Terms and Restrictions" to continue and, on the following window, click on "V984560-01.zip - Oracle VirtIO Drivers Version for Microsoft Windows 1.1.5" to download the drivers. Oracle Linux Resources Documentation Oracle Linux Virtualization Manager Documentation Oracle VirtIO Drivers for Microsoft Windows Blogs Oracle Linux Blog Oracle Virtualization Blog Community Pages Oracle Linux Product Training and Education Oracle Linux Administration - Training and Certification Data Sheets, White Papers, Videos, Training, Support & more Oracle Linux Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter


Comparing Workload Performance

In this blog post, Oracle Linux performance engineer Jesse Gordon presents an alternate approach to comparing the performance of a workload when measured in two different scenarios.  This improves on the traditional "perf diff" method. The benefits of this approach are as follows: ability to compare based on either inclusive time (time spent in a given method and all the methods it calls) or exclusive time (time spent only in a given method) fields in perf output can be applied to any two experiments that have common function names more readable output Comparing Perf Output from Different Kernels You’ve just updated your Oracle Linux kernel – or had it updated autonomously -- and you notice that the performance of your key workload has changed.  How do you figure out what is responsible for the difference?  The basic tool for this task is the perf profile 1, which can be used to generate traces of the workload on the two kernels.  Once you have the two perf outputs, the current Linux approach is to use "perf diff" 2 to compare the resulting traces.  The problem with the approach is that "perf diff" output is neither easy to read nor to use.  Here is an example: # # Baseline Delta Abs Shared Object Symbol # ........ ......... ................................... .............................................. # +3.38% [unknown] [k] 0xfffffe0000006000 29.46% +0.98% [kernel.kallsyms] [k] __fget 8.42% +0.91% [kernel.kallsyms] [k] fput +0.88% [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe +0.68% [kernel.kallsyms] [k] syscall_trace_enter 2.98% -0.67% [kernel.kallsyms] [k] _raw_spin_lock +0.55% [kernel.kallsyms] [k] do_syscall_64 0.40% -0.34% syscall [.] [main] In this blog, we outline an alternate approach which produces easier to read and use output.  Here is what the above output looks like using this approach: Command Symbol Before# After# Delta -------------------- ------------------------------ ------- ------- ------- syscall __fget 29.46 30.43 0.97 syscall fput 8.41 9.33 0.92 syscall entry_SYSCALL_64_after_hwframe 0.00 0.88 0.88 syscall syscall_trace_enter 0.00 0.68 0.68 syscall _raw_spin_lock 2.98 2.31 -0.67 syscall do_syscall_64 0.00 0.55 0.55 syscall main 0.40 0.06 -0.34 Furthermore, this alternate approach extends the comparison options, allowing one to compare based on any of the fields in the perf output report.  In the remainder of this blog, we detail the steps involved in producing such output.   Step 1: Generate the perf traces Taking a trace involves running the workload while invoking perf.  In this article, we chose to use the syscall workload from UnixBench 3 suite, a typical sequence would be: $ perf record -a -g -c 1000001 \<PATH-TO\>/byte-unixbench-master/UnixBench/Run syscall -i 1 -c 48 where: -a asks perf to monitor all online CPUs; -g asks perf to collect data so call graphs (stack traces) may be generated; -c 1000001 asks perf to collect a sample once every 1000001 cycles Step 2: Post-process the trace data Samples collected by perf record are saved into a binary file called, by default, perf.data. The "perf report" command reads this file and generates a concise execution profile. By default, samples are sorted by functions with the most samples first.  To post-process the perf.data file generated in step 1: $ mv perf.data perf.data.KERNEL $ perf report -i perf.data.KERNEL -n > perf.report.KERNEL Step 3: Compare the traces To be able to compare the two traces, first ensure that they are in a common directory on the system.  
So, we would have, for example, perf.report.KERNEL1 and perf.report.KERNEL2.  This is what one such trace profile looks like for UnixBench syscall: # # Samples: 1M of event 'cycles' # Event count (approx.): 1476340476339 # # Children Self Samples Command Shared Object Symbol # ........ ........ ............ ............... .................................. .................................................... # 98.60% 0.00% 0 syscall [unknown] [.] 0x7564207325203a65 | ---0x7564207325203a65 85.91% 0.24% 3538 syscall [kernel.kallsyms] [k] system_call_fastpath | ---system_call_fastpath | |--60.76%-- __GI___libc_close | 0x7564207325203a65 | |--37.72%-- __GI___dup | 0x7564207325203a65 | |--1.30%-- __GI___umask | 0x7564207325203a65 --0.21%-- [...] Listing 1: example perf trace profile The columns of interest shown are as follows: Children -- the percent of time spent in this method and all the methods that it calls, also referred to as inclusive time Self -- the percent of time spent in this method only, also referred to as exclusive time Samples -- the number of trace samples that fell in this method only Command -- the process name Shared Object -- the library Symbol -- the method (or function) name Now, we can use perf diff as follows: $ perf diff perf.data.KERNEL1 perf.data.KERNEL2 > perf.diff.KERNEL1.vs.KERNEL2   Here is what the resulting output looks like: # # Baseline Delta Abs Shared Object Symbol # ........ ......... ................................... .............................................. # +3.38% [unknown] [k] 0xfffffe0000006000 29.46% +0.98% [kernel.kallsyms] [k] __fget 8.42% +0.91% [kernel.kallsyms] [k] fput +0.88% [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe +0.68% [kernel.kallsyms] [k] syscall_trace_enter 2.98% -0.67% [kernel.kallsyms] [k] _raw_spin_lock +0.55% [kernel.kallsyms] [k] do_syscall_64 0.40% -0.34% syscall [.] main Listing 2: example perf diff profile trace This output, by default, has done the comparison using the “Self” column, or time spent in just this one method.  This can be useful output, but is often insufficient as part of a performance analysis.  We next present an approach to comparing using the “Children” column, for time spent in this method and all its children.  Step 4: Generate comparison using the “Children” column To perform the comparison, we first extract all of the lines that have entries in all six columns i.e., all fields are present.  These are the lines at the top of each of the call graphs. You can find the allfields.delta.py program that we use to render these results on github at https://github.com/oracle/linux-blog-sample-code/tree/comparing-workload-performance/allfields.delta.py $ grep "\\[" perf.data.DESCRIPTOR | grep -v "|" | grep -v "\\-\\-" > perf.report.DESCRIPTOR.allfields The output of this script looks as follows: 98.60% 0.00% 0 syscall [unknown] [.] 0x7564207325203a65 85.91% 0.24% 3538 syscall [kernel.kallsyms] [k] system_call_fastpath 55.20% 0.16% 2403 syscall libc-2.17.so [.] __GI___libc_close 52.27% 0.14% 2020 syscall [kernel.kallsyms] [k] sys_close 52.11% 0.08% 1207 syscall [kernel.kallsyms] [k] __close_fd 50.39% 21.98% 324434 syscall [kernel.kallsyms] [k] filp_close 35.44% 0.13% 1958 syscall libc-2.17.so [.] 
__GI___dup 32.39% 0.15% 2181 syscall [kernel.kallsyms] [k] sys_dup 29.46% 29.46% 434902 syscall [kernel.kallsyms] [k] __fget 19.92% 19.92% 294070 syscall [kernel.kallsyms] [k] dnotify_flush Listing 3: perf output showing lines with all fields present Now we compare two "allfields" files, using a Python script which reads in the two files and compares lines for which the combination of SharedObject + Symbol are the same.  This script allows the user to compare based on each of the three left side columns (children, self, or samples) and would be run as follows: $ allfields.delta.py -b perf.report.KERNEL1.allfields -a perf.report.KERNEL2.allfields -d children > allfields.delta.children.KERNEL1.vs.KERNEL2 For the UnixBench syscall workload, comparing a two distinct kernels, the output would look like this: perf report allfields delta report before file name == perf.report.KERNEL1.allfields after file name == perf.report.KERNEL2.allfields delta type == children Command Symbol Before# After# Delta -------------------- ------------------------------ ------- ------- ------- syscall 0x7564207325203a65 98.60 99.81 1.21 syscall system_call_fastpath 85.91 0.00 -85.91 syscall __GI___libc_close 55.20 56.73 1.53 syscall sys_close 52.27 53.69 1.42 syscall __close_fd 52.11 53.62 1.51 [...] Listing 4: example output from script, comparing using "children" field Lastly, we can sort this output to highlight the largest differences in each direction, as follows: $ sort -rn -k 5 allfields.delta.children.KERNEL1.vs.KERNEL2 | less where the head of the file shows those methods where more time was spent in KERNEL1 and the tail of the file shows those methods where more time was spent in KERNEL2: syscall entry_SYSCALL_64_after_hwframe 0.00 92.18 92.18 syscall do_syscall_64 0.00 91.07 91.07 syscall filp_close 50.39 52.70 2.31 syscall syscall_slow_exit_work 0.00 1.67 1.67 syscall __GI___libc_close 55.20 56.73 1.53 [...] syscall tracesys 1.18 0.00 -1.18 syscall syscall_trace_leave 1.70 0.00 -1.70 syscall int_very_careful 1.83 0.00 -1.83 syscall system_call_after_swapgs 2.42 0.00 -2.42 syscall system_call 3.47 0.00 -3.47 syscall system_call_fastpath 85.91 0.00 -85.91 Listing 5: sorted output of script We may now use these top and bottom methods as starting points into root causing the performance differences observed when executing the workload on the two kernels. Summary We have presented an alternate approach to comparing the performance of a workload when measured in two different scenarios.  This method can be applied to any two experiments that have common function names.  The benefits of this approach are as follows: ability to compare based on either inclusive time (Children) or exclusive time (Self) fields in perf output more readable output Please try it out. perf: Linux profiling with performance counters↩ perf-diff man page↩ UnixBench on GitHub↩
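For convenience, the workflow described above can be condensed into the following sequence. KERNEL1 and KERNEL2 are placeholder labels, the UnixBench path is an assumption, and note that the extraction step greps the text report produced by perf report (not the binary perf.data file), so the patterns may need adjusting to match your report layout:

$ perf record -a -g -c 1000001 <path-to>/UnixBench/Run syscall -i 1 -c 48
$ mv perf.data perf.data.KERNEL1
$ perf report -i perf.data.KERNEL1 -n > perf.report.KERNEL1

Repeat the same three steps on the second kernel to produce perf.report.KERNEL2, then extract the fully populated lines and compare:

$ grep '\[' perf.report.KERNEL1 | grep -v '|' | grep -v -e '--' > perf.report.KERNEL1.allfields
$ grep '\[' perf.report.KERNEL2 | grep -v '|' | grep -v -e '--' > perf.report.KERNEL2.allfields
$ ./allfields.delta.py -b perf.report.KERNEL1.allfields -a perf.report.KERNEL2.allfields -d children > allfields.delta.children.KERNEL1.vs.KERNEL2
$ sort -rn -k 5 allfields.delta.children.KERNEL1.vs.KERNEL2 | head
$ sort -rn -k 5 allfields.delta.children.KERNEL1.vs.KERNEL2 | tail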


Announcements

Announcing Oracle Linux 7 Update 8 Beta Release

We are pleased to announce the availability of the Oracle Linux 7 Update 8 Beta release for the 64-bit Intel and AMD (x86_64) and 64-bit Arm (aarch64) platforms. Oracle Linux 7 Update 8 Beta is an updated release that includes bug fixes, security fixes and enhancements. It is fully binary compatible with Red Hat Enterprise Linux 7 Update 8 Beta. Updates include: A revised protection profile for General Purpose Operating Systems (OSPP) in the SCAP Security Guide packages SELinux enhancements for Tomcat domain access and graphical login sessions rsyslog has a new option for managing letter-case preservation by using the FROMHOST property for the imudp and imtcp modules Pacemaker concurrent-fencing cluster property defaults to true, speeding up recovery in a large cluster where multiple nodes are fenced. The Oracle Linux 7 Update 8 Beta release includes the following kernel packages: kernel-uek-4.14.35-1902.7.3.1 for x86_64 and aarch64 platforms - The Unbreakable Enterprise Kernel Release 5, which is the default kernel. kernel-3.10.0-1101 for the x86_64 platform - The latest Red Hat Compatible Kernel (RHCK). To get started with the Oracle Linux 7 Update 8 Beta release, you can simply perform a fresh installation by using the ISO images available for download from Oracle Technology Network. Or, you can perform an upgrade from an existing Oracle Linux 7 installation by using the Beta channels for Oracle Linux 7 Update 8 on the Oracle Linux yum server or the Unbreakable Linux Network (ULN).
# vi /etc/yum.repos.d/oracle-linux-ol7.repo
[ol7_beta]
name=Oracle Linux $releasever Update 8 Beta ($basearch)
baseurl=https://yum$ociregion.oracle.com/repo/OracleLinux/OL7/beta/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1
[ol7_optional_beta]
name=Oracle Linux $releasever Update 8 Beta ($basearch) Optional
baseurl=https://yum$ociregion.oracle.com/repo/OracleLinux/OL7/optional/beta/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1
If your instance is running on OCI, the value "$ociregion" will automatically resolve so that the OCI local-region yum mirrors are used. Modify the yum channel settings to enable the Oracle Linux 7 Update 8 Beta channels as shown above, then perform the upgrade. # yum update After the upgrade is completed, reboot the system and you will have Oracle Linux 7 Update 8 Beta running. [root@v2v-app: ~]# cat /etc/oracle-release Oracle Linux Server release 7.8 This release is provided for development and test purposes only and is not covered by Oracle Linux support. Oracle does not recommend using Beta releases in production. Further technical details and known issues for the Oracle Linux 7 Update 8 Beta release are available on the Oracle Community - Oracle Linux and UEK Preview space. We welcome your questions and feedback on the Oracle Linux 7 Update 8 Beta release. You may contact the Oracle Linux team at oraclelinux-info_ww_grp@oracle.com or post your questions and comments on the Oracle Linux and UEK Preview Space on the Oracle Community.
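As a quick sanity check around the upgrade steps above, the following sketch confirms that the Beta channels are enabled and that the system reports the new release afterwards (output will vary by system):

# yum repolist enabled | grep -i beta
# yum update -y
# reboot

After the system comes back up, cat /etc/oracle-release should report Oracle Linux Server release 7.8, as shown above.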


Announcements

UEK Release 6 Developer Preview available for Oracle Linux 7 and Oracle Linux 8

The Unbreakable Enterprise Kernel (UEK), included as part of Oracle Linux, provides the latest open source innovations, optimizations and security for enterprise cloud workloads. UEK Release 5, based on the upstream kernel 4.14, is the current UEK release that powers production workloads on Oracle Linux 7 in the cloud or on-premises. Linux 5.4 is the latest stable kernel release, and it is the mainline kernel that UEK Release 6 tracks. You can experiment with the UEK Release 6 preview today with Oracle Linux 7 and Oracle Linux 8 on both x86_64 and aarch64 platforms. The example below uses an Oracle Linux 8 x86_64 instance on Oracle Cloud Infrastructure. The kernel was upgraded to the UEK Release 6 preview within a few minutes. The same upgrade procedure applies to an Oracle Linux 7 or Oracle Linux 8 instance running on-premises. The Oracle Linux 8 instance runs the current RHCK (Red Hat Compatible Kernel). [root@ol8-uek6 ~]# uname -a Linux ol8-uek6 4.18.0-147.el8.x86_64 #1 SMP Tue Nov 12 11:05:49 PST 2019 x86_64 x86_64 x86_64 GNU/Linux Update the system: [root@ol8-uek6 ~]# yum update -y Enable the "ol8_developer_UEKR6" UEK Release 6 preview channel: [root@ol8-uek6 ~]# dnf config-manager --set-enabled ol8_developer_UEKR6 Run the "dnf" command to perform the UEKR6 Developer Preview installation: [root@ol8-uek6 ~]# dnf install kernel-uek kernel-uek-devel Reboot the Oracle Linux 8 instance to have the new kernel take effect. When the Oracle Linux 8 instance comes back, you will have the UEK Release 6 preview running. [root@ol8-uek6 ~]# uname -a Linux ol8-uek6 5.4.2-1950.2.el8uek.x86_64 #2 SMP Thu Dec 19 17:07:00 PST 2019 x86_64 x86_64 x86_64 GNU/Linux Further technical details and known issues for UEK6 are available in this dedicated article on the Oracle Community - Oracle Linux and UEK Preview space. If you have any questions, post them to the Oracle Linux Community.
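For an Oracle Linux 7 instance, the equivalent steps would look roughly like the following; the ol7_developer_UEKR6 channel name is an assumption based on the OL8 naming above, so confirm it first with yum repolist all:

# yum install -y yum-utils
# yum-config-manager --enable ol7_developer_UEKR6
# yum install -y kernel-uek kernel-uek-devel
# reboot

After the reboot, uname -r should report a 5.4-based kernel, as in the Oracle Linux 8 example above.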


Linux Kernel Development

XFS - Data Block Sharing (Reflink)

Following on from his recent blog XFS - 2019 Development Retrospective, XFS Upstream maintainer Darrick Wong dives a little deeper into the Reflinks implementation for XFS in the mainline Linux Kernel. Three years ago, I introduced to XFS a new experimental "reflink" feature that enables users to share data blocks between files. With this feature, users gain the ability to make fast snapshots of VM images and directory trees; and deduplicate file data for more efficient use of storage hardware. Copy on write is used when necessary to keep file contents intact, but XFS otherwise continues to use direct overwrites to keep metadata overhead low. The filesystem automatically creates speculative preallocations when copy on write is in use to combat fragmentation. I'm pleased to announce with xfsprogs 5.1, the reflink feature is now production ready and enabled by default on new installations, having graduated from the experimental and stabilization phases. Based on feedback from early adopters of reflink, we also redesigned some of our in-kernel algorithms for better performance, as noted below: iomap for Faster I/O Beginning with Linux 4.18, Christoph Hellwig and I have migrated XFS' IO paths away from the old VFS/MM infrastructure, which dealt with IO on a per-block and per-block-per-page ("bufferhead") basis. These mechanisms were introduced to handle simple filesystems on Linux in the 1990s, but are very inefficient. The new IO paths, known as "iomap", iterate IO requests on an extent basis as much as possible to reduce overhead. The subsystem was written years ago to handle file mapping leases and the like, but nowadays we can use it as a generic binding between the VFS, the memory manager, and XFS whenever possible. The conversion finished as of Linux 5.4. In-Core Extent Tree For many years, the in-core file extent cache in XFS used a contiguous chunk of memory to store the mappings. This introduces a serious and subtle pain point for users with large sparse files, because it can be very difficult for the kernel to fulfill such an allocation when memory is fragmented. Christoph Hellwig rewrote the in-core mapping cache in Linux 4.15 to use a btree structure. Instead of using a single huge array, the btree structure reduces our contiguous memory requirements to 256 bytes per chunk, with no maximum on the number of chunks. This enables XFS to scale to hundreds of millions of extents while eliminating a source of OOM killer reports. Users need only upgrade their kernel to take advantage of this improvement. Demonstration: Reflink To begin experimenting with XFS's reflink support, one must format a new filesystem: # mkfs.xfs /dev/sda1 meta-data=/dev/sda1 isize=512 agcount=4, agsize=6553600 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=1 data = bsize=4096 blocks=26214400, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=12800, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 If you do not see the exact phrase "reflink=1" in the mkfs output then your system is too old to support reflink on XFS. Now one must mount the filesystem: # mount /dev/sda1 /storage At this point, the filesystem is ready to absorb some new files. Let's pretend that we're running a virtual machine (VM) farm and therefore need to manage deployment images. 
This and the next example are admittedly contrived, as any serious VM and container farm manager takes care of all these details. # mkdir /storage/images # truncate -s 30g /storage/images/os8_base.img # qemu-system-x86_64 -hda /storage/images/os8_base.img -cdrom /isoz/os8_install.iso Now we install a base OS image that we will later use for fast deployment. Once that's done, we shut down the QEMU process. But first, we'll check that everything's in order: # xfs_bmap -e -l -p -v -v -v /storage/images/os8_base.img /storage/images/os8_base.img: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..15728639]: 52428960..68157599 1 (160..15728799) 15728640 000000 <listing shortened for brevity> # df -h /storage Filesystem Size Used Avail Use% Mounted on /dev/sda1 100G 32G 68G 32% /storage Now, let's say that we want to provision a new VM using the base image that we just created. In the old days we would have had to copy the entire image, which can be very time consuming. Now, we can do this very quickly: # /usr/bin/time cp -pRdu --reflink /storage/images/os8_base.img /storage/images/vm1.img 0.00user 0.00system 0:00.02elapsed 39%CPU (0avgtext+0avgdata 2568maxresident)k 0inputs+0outputs (0major+108minor)pagefaults 0swaps # xfs_bmap -e -l -p -v -v -v /storage/images/vm1.img /storage/images/vm1.img: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..15728639]: 52428960..68157599 1 (160..15728799) 15728640 100000 <listing shortened for brevity> FLAG Values: 0100000 Shared extent 0010000 Unwritten preallocated extent 0001000 Doesn't begin on stripe unit 0000100 Doesn't end on stripe unit 0000010 Doesn't begin on stripe width 0000001 Doesn't end on stripe width # df -h /storage Filesystem Size Used Avail Use% Mounted on /dev/sda1 100G 32G 68G 32% /storage This was a very quick copy! Notice how the extent map on the new image file shows file data pointing to the same physical storage as the original base image, but is now marked as a shared extent, and there's about as much free space as there was before the copy. Now let's start that new VM and let it run for a little while before re-querying the block mapping: # xfs_bmap -e -l -p -v -v -v /storage/images/os8_base.img /storage/images/vm1.img: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..15728639]: 52428960..68157599 1 (160..15728799) 15728640 100000 <listing shortened for brevity> # xfs_bmap -e -l -p -v -v -v /storage/images/vm1.img /storage/images/vm1.img: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..255]: 102762656..102762911 1 (50333856..50334111) 256 000000 1: [256..15728639]: 52429216..68157599 1 (416..15728799) 15728384 100000 <listing shortened for brevity> # df -h /storage Filesystem Size Used Avail Use% Mounted on /dev/sda1 100G 36G 64G 32% /storage Notice how the first 128K of the file now points elsewhere. This is evidence that the VM guest wrote to its storage, causing XFS to employ copy on write on the file so that the original base image remains unmodified. We've apparently used another 4GB of space, which is far better than the 64GB that would have been required in the old days. Let's turn our attention to the second major feature for reflink: fast(er) snapshotting of directory trees. Suppose now that we want to manage containers with XFS. After a fresh formatting, create a directory tree for our container base: # mkdir -p /storage/containers/os8_base In the directory we just created, install a base container OS image that we will later use for fast deployment. 
Once that's done, we shut down the container and check that everything's in order: # df /storage/ Filesystem Size Used Avail Use% Mounted on /dev/sda1 100G 2.0G 98G 2% /storage # xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash /storage/containers/os8_base/bin/bash: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..2175]: 52440384..52442559 1 (11584..13759) 2176 000000 Ok, that looks like a reasonable base system. Let's use reflink to make a fast copy of this system: # /usr/bin/time cp -pRdu --reflink=always /storage/containers/os8_base /storage/containers/container1 0.01user 0.64system 0:00.68elapsed 96%CPU (0avgtext+0avgdata 2744maxresident)k 0inputs+0outputs (0major+129minor)pagefaults 0swaps # xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash /storage/containers/os8_base/bin/bash: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..2175]: 52440384..52442559 1 (11584..13759) 2176 100000 # df /storage/ Filesystem Size Used Avail Use% Mounted on /dev/sda1 100G 2.0G 98G 2% /storage Now we let the container runtime do some work and update (for example) the bash binary: # xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash /storage/containers/os8_base/bin/bash: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..2175]: 52440384..52442559 1 (11584..13759) 2176 000000 # xfs_bmap -e -l -p -v -v -v /storage/containers/container1/bin/bash /storage/containers/container1/bin/bash: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..2175]: 52442824..52444999 1 (14024..16199) 2176 000000 Notice that the two copies of bash no longer share blocks. This concludes our demonstration. We hope you enjoy this major new feature!
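If you are unsure whether an existing filesystem was formatted with reflink support, or want to request it explicitly when formatting, something like the following should work (device names are placeholders):

# xfs_info /storage | grep -o 'reflink=[01]'
# mkfs.xfs -m reflink=1 /dev/sdb1

A cp --reflink=always on a filesystem that reports reflink=0 will fail with an error rather than fall back to a normal copy, which makes it easy to spot filesystems created before the feature became the default.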


Linux Kernel Development

XFS - 2019 Development Retrospective

Darrick Wong, Upstream XFS Maintainer and kernel developer for Oracle Linux, returns to talk about what's been happening with XFS. Hi folks! It has been a little under two years since my last post about upcoming XFS features in the mainline Linux kernel. In that time, the XFS development community have been hard at work fixing bugs and rolling out new features! Let's talk about the improvements that have landed recently in the mainline Linux Kernel, and our development roadmap for 2020. The new reflink and online fsck features will be covered in separate future blog posts. Lazy Timestamp Updates Starting with Linux 4.17, XFS implements the lazytime mount option. This mount option permits the filesystem to skip updates to the last modification timestamp and file metadata change timestamp if they have been updated within the last 24 hours. When used in combination with the relatime mount option to skip updates to a last access timestamp when it is newer than the file modification timestamp, we see a marked decrease in metadata writes, which in turn improves filesystem performance on non-volatile storage. This enhancement was provided by Christoph Hellwig. Filesystem Label Management In Linux 4.18, Eric Sandeen added to XFS support for btrfs' label get and set ioctls. This change enables administrators to change a filesystem label while that filesystem is mounted. A future xfsprogs release will adapt xfs_admin to take advantage of this interface. Large Directory Dave Chinner contributed a series of patches for Linux 5.4 that reduce the amount of time that XFS spends searching for free space in a directory when creating a file. This change improves performance on very large directories, which should be beneficial for object stores and container deployment systems. Solving the Y2038 Problem The year 2038 poses a special problem for Linux -- any signed 32-bit seconds counter will overflow back to 1901. Work is underway in the kernel to extend all of those counters to support 64-bit counters fully. In 2020, we will begin work on extending XFS's metadata (primarily inode timestamps and quota expiration timer) to support timestamps out to the year 2486. It should be possible to upgrade to existing V5 filesystems. Metadata Directory Tree This feature, which I showed off late in 2018, creates a separate directory tree for filesystem metadata. This feature is not itself significant for users, but it will enable the creation of many more metadata structures. This in turn can enable us to provide reverse mapping and data block sharing for realtime volumes; support creating subvolumes for container hosts; store arbitrary properties in the filesystem; and attach multiple realtime volumes to the filesystem. Deferred Inode Reclaim and Inactivation We frequently hear two complaints lodged against XFS -- memory reclamation runs very slowly because XFS inode reclamation sometimes has to flush dirty inodes to disk; and deletions are slow because we charge all costs of freeing all the file's resources to the process deleting files. Dave Chinner and I have been collaborating this year and last on making those problems go away. Dave has been working on replacing the current inode memory reclaim code with a simpler LRU list and reorganizing the dirty inode flushing code so that inodes aren't handed to memory reclaim until the metadata log has finished flushing the inodes to disk. This should eliminate the complaints that slow IO gets in the way of reclaiming memory in other parts of the system. 
Meanwhile, I have been working on the deletion side of the equation by adding new states to the inode lifecycle. When a file is deleted, we can tag it as needing to have its resources freed, and move on. A background thread can free all those resources in bulk. Even better, on systems with a lot of IOPs available, these bulk frees can be done on a per-AG basis with multiple threads. Inode Parent Pointers Allison Collins continues developing the inode parent pointer feature. This has led to the introduction of atomic setting and removal of extended attributes and a refactoring of the existing extended attribute code. When completed, this will enable both filesystem check and repair tools to check the integrity of a filesystem's directory tree and rebuild subtrees when they are damaged. Anyway, that wraps up our new feature retrospective and discussion of 2020 roadmap! See you on the mailing lists!
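As a usage note on the lazytime feature mentioned at the top of this post: it is simply a mount option, so it can be tried on an existing filesystem without reformatting. A sketch, with the device and mount point as placeholders:

# mount -o remount,lazytime /storage

or persistently in /etc/fstab:

/dev/sda1  /storage  xfs  defaults,lazytime  0 0

Combined with relatime (the default), this reduces timestamp-only metadata writes, as described above.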


Announcements

Announcing Oracle Container Runtime for Docker Release 19.03

Oracle is pleased to announce the release of Oracle Container Runtime for Docker version 19.03. Oracle Container Runtime allows you to create and distribute applications across Oracle Linux systems and other operating systems that support Docker. Oracle Container Runtime for Docker consists of the Docker Engine, which packages and runs the applications, and integrates with the Oracle Container Registry and Docker Hub to share the applications in a Software-as-a-Service (SaaS) cloud. Notable Updates The docker run and docker create commands now include an option to set the domain name, using the --domainname option. The docker image pull command now includes an option to quietly pull an image, using the --quiet option. Faster context switching using the docker context command. Added ability to list kernel capabilities with --capabilities instead of --capadd and --capdrop. Added ability to define sysctl options with --sysctl list, --sysctl-add list, and --sysctl-rm list. Added inline cache support to builder with the --cache-from option. The IPVLAN driver is now supported and no longer considered experimental. Upgrading To learn how to upgrade from a previously supported version of Oracle Container Runtime for Docker, please review the Upgrading Oracle Container Runtime for Docker chapter of the documentation. Note that upgrading from a developer preview release is not supported by Oracle. Support Support for the Oracle Container Runtime for Docker is available to customers with either an Oracle Linux Premier or Basic support subscription. Resources Documentation Oracle Container Runtime for Docker Oracle Linux Cloud Native Environment Oracle Linux Software Download Oracle Linux download instructions Oracle Software Delivery Cloud Oracle Container Registry Oracle Groundbreakers Community Oracle Linux Space Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter Product Training and Education Oracle Linux
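To illustrate a few of the options listed above, here is a short sketch; the image name and the sysctl value are only examples:

# docker context ls
# docker image pull --quiet oraclelinux:7-slim
# docker run --rm --domainname example.com oraclelinux:7-slim cat /proc/sys/kernel/domainname
# docker run --rm --sysctl net.ipv4.ip_unprivileged_port_start=80 oraclelinux:7-slim cat /proc/sys/net/ipv4/ip_unprivileged_port_start

The last two commands simply print the value inside the container to show that the option took effect; namespaced sysctls such as the net.* family are the ones that can be set per container this way.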


Technologies

Kata Containers: What, When and How

When we began work to include Kata Containers with Oracle Linux Cloud Native Environment I was immediately impressed with the change they bring to the security boundary of containers but I had to wonder, how does it work? This article attempts to briefly cover the What, When and How of Kata Containers. Before we dive into Kata Containers you may want to read the brief history of Linux containers. 1. What are Kata Containers? Kata Containers is an open source [project with a] community working to build a secure container runtime with lightweight virtual machines that feel and perform like containers, but provide stronger workload isolation using hardware virtualization technology as a second layer of defense. Kata Containers stem from the Intel® Clear Containers and Hyper RunV projects. Kata Containers use existing CPU features like Intel VT-X and AMD-V™ to better isolate containers from each other when run on the same host. Each container can run in its own VM and have its own Linux Kernel. Due to the boundaries between VMs a container should not be able to access the memory of another container (Hypervisors+EPT/RVI). runc is the runtime-spec reference implementation on Linux and when it spawns containers it uses standard Linux Kernel features like AppArmour, capabilities(7), Control Groups, seccomp, SELinux and namespaces(7) to control permissions and flow of data in and out of the container. Kata Containers extends this by wrapping the containers in VMs. 2. When should I use Kata Containers? runc is the most common container runtime, the default for Docker™, CRI-O and in turn Kubernetes®. Kata Containers give you an alternative, one which provides higher isolation for mixed-use or multi-tenant environments. Kubernetes worker nodes are capable of using both runc and Kata Containers simultaneously so dedicated hardware is not required. For intra-container communication efficiency and to reduce resource usage overhead, Kata Containers executes all containers of a Kubernetes pod in a single VM.   Deciding when to use runc and when to use Kata Containers is dependent on your own security policy and posture. Factors that may influence when higher levels of isolation are necessary include: The source of the image - trusted vs untrusted. Was the image built in-house or downloaded from a public registry? The contents of the container In-house software that brings a competitive advantage The dataset the container works on (public vs confidential) Working in a virtual environment may impact performance so workload-specific testing is recommended to evaluate the extent, if any, of that impact in your environment. 3. How do Kata Containers work? When installing the Kubernetes module of Oracle Linux Cloud Native Environment both runc and Kata Containers are deployed along with CRI-O, which provides the necessary support between Kubernetes and the container runtimes. A heavily optimized and purpose-tuned Linux Kernel is used by Kata Containers to boot the VM. This is paired with a minimized user space to support container operations and together they provide fast initialization. In order to create a Kata Container, a Kubernetes user must initially create a RuntimeClass object. After that Pods or Deployments can reference the RuntimeClass to indicate the runtime to use. Examples are available in the Using Container Runtimes documentation. 
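A minimal sketch of that flow is shown below; the handler name kata and the pod image are assumptions, so check the names registered in your CRI-O configuration before using them:

$ kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
---
apiVersion: v1
kind: Pod
metadata:
  name: kata-demo
spec:
  runtimeClassName: kata
  containers:
  - name: demo
    image: oraclelinux:7-slim
    command: ["sleep", "3600"]
EOF
$ kubectl exec kata-demo -- uname -r

If the pod is really running under Kata Containers, uname -r inside it reports the guest VM kernel rather than the host kernel, which is a simple way to confirm the extra isolation layer is in place.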
Kata Containers are designed to provide "lightweight virtual machines that feel and perform like containers"; your developers shouldn't need to be aware that their code is executing in a VM and should not need to change their workflow to gain the benefits. Trademarks Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.


Linux

Oracle Linux Training at Your Own Pace

Knowing that taking training at your own pace, when you have time, suits many people's schedules and learning style, Oracle has just released new Training-on-Demand courses for those aspiring to build their Linux administration skills. Why not take advantage of the newly released training to build your Linux skills? Start your Linux learning with the Oracle Linux System Administration I course. This course covers a range of skills including installation, using the Unbreakable Enterprise Kernel, configuring Linux services, preparing the system for the Oracle Database, monitoring and troubleshooting. After gaining essential knowledge and skills from taking the Oracle Linux System Administration I course, students are encouraged to continue their Linux learning with Oracle Linux System Administration II. The Oracle Linux System Administration II course teaches you how to automate the installation of the operating system, implement advanced software package management, and configure advanced networking and authentication services. Resources: Oracle Linux Curriculum Oracle Linux Product Documentation Linux on Oracle Cloud Infrastructure learning path Oracle Linux Cloud Native Environment learning path Linux Containers and Orchestration Groundbreakers Community


Events

Meet the Oracle Linux Team at Open FinTech Forum in New York

For IT decision makers in the financial services sector, you won’t want to miss Open FinTech Forum, December 9, 2019, at the Convene Conference Center at One Liberty Plaza, New York, NY. This event is designed to better inform you about the open technologies driving digital transformation, and how to best utilize an open source strategy. This information-packed day starts with several brief keynotes. Be sure to mark your schedule and join Robert Shimp, Group Vice President of Infrastructure Software Product Management at Oracle, for the keynote: A New Blueprint for Modern Application Development and Runtime Environment Mr. Shimp will discuss new open source technologies that make it easier and faster than ever to design and deliver modern cloud native applications. When: Monday, December 9, 2019 Time: 10:05 a.m. Location: The Forum Meet Our Experts Register for Open FinTech Forum today and stop by Oracle’s table to chat with our Linux experts. Learn more about Oracle Linux and how it delivers a complete, open DevOps environment featuring leading performance, scalability, reliability and security for enterprise applications deployed in the cloud or on premise. One of the most secure Linux environments available with certification from Common Criteria as well as FIPS 140-2 validation of its cryptographic modules, Oracle Linux is currently the only Linux distribution on the NIAP Product Compliant List. It is also the only Linux with Ksplice zero-downtime automated patching for kernel, hypervisor, and critical user space libraries. We look forward to meeting you at Open FinTech Forum. #osfintech    #OracleLinux


Linux

A Brief History of Linux Containers

The latest update to Oracle Linux Cloud Native Environment introduces new functionality that we'll cover in upcoming posts but before we dive into those features, let's take a look at the history of Linux containers to see how we got here. The first fundamental building block that led to the creation of Linux containers was submitted to the Kernel by Google. It was the first version of a feature named control groups or cgroups. A cgroup is a collection of processes whose use of system resources is constrained by the Kernel to better manage competing workloads and to contain their impact on other processes. The second building block was the namespaces feature which allows the system to apply a restricted view of system resources to processes or groups of processes. Namespaces were actually introduced much earlier than cgroups but they were limited to specific object types like hostnames and Process IDs. It wasn't until 2008 and the creation of network namespaces that we could create different views of key network objects for different processes. Processes could now be prevented from knowing about each other, communicating with each other and each could have a unique network configuration. The availability of these Kernel features led to the formation of LXC (Linux Containers) which provides a simple interface to create and manage containers, which are simply processes that are limited in what they can see by namespaces and what they can use by cgroups. Docker expanded LXC's functionality by providing a full container life-cycle management tool (also named Docker™). Docker's popularity led to its name now being synonymous with containers. In time Docker replaced LXC with its own libcontainer and additional controls were added which utilized Kernel features such as AppArmor, capabilities, seccomp and SELinux. These gave developers improved methods of restricting what container processes could see and do. A key component of Docker's success was the introduction of a container and image format for portability, i.e. a container or image could be transferred between systems without any impact to functionality. This provided the assurance of repeatable, consistent deployments. This container and image format is based on individual file system layers where each layer is intentionally separated for re-use but is presented as a unified file system when the container starts. Layers also allow images to extend another image. For example, an image may use oraclelinux:7-slim as its first layer and add additional software in other layers. This separation of layers allows images and containers to share the same bits on disk across multiple instances. This improves resource utilization and start-up time. A new API was created by Docker to facilitate image transfer but pre-existing union filesystems like aufs and then OverlayFS were the base methods to present the unified container filesystem. While most of us are familiar with Docker, what is less well-known is that those container and image formats and how a container is launched from an image are published standards of the Open Container Initiative. These standards are designed for interoperability so any runtime-spec compliant runtime can use an image-spec compliant image to launch a container. Docker was a founding member of the Open Container Initiative and contributed the Docker V2 Image specification to act as the basis of the image specification.
Through standards like these, interoperability is enhanced which helps the industry continue to develop and grow at a rapid pace. So if you're creating images today with Docker or other Open Container Initiative compliant tools, they can continue to work on other compliant tools like Kata containers which we'll be looking at in an upcoming blog post. Note: There were alternatives available in other operating systems and outside of mainline Linux prior to LXC but the focus of the article was Linux. Similarly, there were other projects in parallel to LXC which contributed to the industry but are not mentioned for brevity. Docker™ is a trademark of Docker, Inc. in the United States and/or other countries. Linux® is a registered trademark of Linus Torvalds in the United States and/or other countries.


Linux Kernel Development

Using Tracepoints to Debug iSCSI

Using Tracepoints to Debug iSCSI Modules

Oracle Linux kernel developer Fred Herard offered this blog post on how to use tracepoints with iSCSI kernel modules. The scsi_transport_iscsi, libiscsi, libiscsi_tcp, and iscsi_tcp modules have been modified to leverage Linux Kernel Tracepoints to capture debug messages. Before this modification, debug messages for these modules were simply directed to syslog when enabled. This enhancement gives users the option to use the Tracepoint facility to dump enabled events (debug messages) into the ftrace ring buffer. The following tracepoint events are available:

# perf list 'iscsi:*'
List of pre-defined events (to be used in -e):
  iscsi:iscsi_dbg_conn           [Tracepoint event]
  iscsi:iscsi_dbg_eh             [Tracepoint event]
  iscsi:iscsi_dbg_session        [Tracepoint event]
  iscsi:iscsi_dbg_sw_tcp         [Tracepoint event]
  iscsi:iscsi_dbg_tcp            [Tracepoint event]
  iscsi:iscsi_dbg_trans_conn     [Tracepoint event]
  iscsi:iscsi_dbg_trans_session  [Tracepoint event]

Here's a simple diagram depicting the tracepoint enhancement:

These tracepoint events can be enabled on the fly to aid in debugging iscsi issues. Here's a sample output of tracing the iscsi:iscsi_dbg_eh tracepoint event using the perf utility:

# /usr/bin/perf trace --no-syscalls --event="iscsi:iscsi_dbg_eh"
 0.000 iscsi:iscsi_dbg_eh:session25: iscsi_eh_target_reset tgt Reset [sc ffff883fee609500 tgt iqn.1986-03.com.sun:02:fa41d51f-45a5-cea4-d661-a854dd13cf07])
 0.009 iscsi:iscsi_dbg_eh:session25: iscsi_exec_task_mgmt_fn tmf set timeout)
 3.214 iscsi:iscsi_dbg_eh:session25: iscsi_eh_target_reset tgt iqn.1986-03.com.sun:02:fa41d51f-45a5-cea4-d661-a854dd13cf07 reset result = SUCCESS)

Tracepoint events that have been inserted into the ftrace ring buffer can be extracted using, for example, the crash utility version 7.1.6 or higher:

crash> extend ./extensions/trace.so
./extensions/trace.so: shared object loaded
crash> trace show
...
<...>-18646 [023] 20618.810958: iscsi_dbg_eh: session4: iscsi_eh_target_reset tgt Reset [sc ffff883fead741c0 tgt iqn.2019-10.com.example:storage]
<...>-18646 [023] 20618.810968: iscsi_dbg_eh: session4: iscsi_exec_task_mgmt_fn tmf set timeout
<...>-18570 [016] 20848.578257: iscsi_dbg_trans_session: session4: iscsi_session_event Completed handling event 105 rc 0
<...>-18570 [016] 20848.578260: iscsi_dbg_trans_session: session4: __iscsi_unbind_session Completed target removal

This enhancement can be found in Oracle Linux UEK-qu7 and newer releases.
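For reference, these events can also be toggled directly through the tracefs interface without perf; a minimal sketch, assuming tracefs/debugfs is mounted at the usual /sys/kernel/debug location (the event names come from the perf list output above):

# echo 1 > /sys/kernel/debug/tracing/events/iscsi/iscsi_dbg_eh/enable
# echo 1 > /sys/kernel/debug/tracing/events/iscsi/iscsi_dbg_trans_session/enable
# cat /sys/kernel/debug/tracing/trace_pipe
# echo 0 > /sys/kernel/debug/tracing/events/iscsi/enable

The first two commands enable individual events, trace_pipe streams them as they are emitted, and the final echo disables the whole iscsi event group when you are done.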


Announcements

Announcing Oracle Linux 8 Update 1

Oracle is pleased to announce the general availability of Oracle Linux 8 Update 1. Individual RPM packages are available on the Unbreakable Linux Network (ULN) and the Oracle Linux yum server. ISO installation images will soon be available for download from the Oracle Software Delivery Cloud and Docker images will soon be available via Oracle Container Registry and Docker Hub. Oracle Linux 8 Update 1 ships with Red Hat Compatible Kernel (RHCK) (kernel-4.18.0-147.el8) kernel packages for the x86_64 platform (Intel & AMD), which include bug fixes, security fixes, and enhancements; the 64-bit Arm (aarch64) platform is also available for installation as a developer preview release.

Notable new features for all architectures

Security
• Udica package added. You can use udica to create a tailored security policy, for better control of how a container accesses host system resources. This capability enables you to harden container deployments against security violations. (A usage sketch follows below.)
• SELinux: user-space tools have been updated to release 2.9; the SETools collection and libraries have been updated to release 4.2.2; a new boltd_t SELinux type was added (used to manage Thunderbolt 3 devices); a new bpf SELinux policy was added (used to control Berkeley Packet Filter); SELinux policy packages have been updated to release 3.14.3.
• OpenSCAP: packages have been updated to release 1.3.1; OpenSCAP includes the SCAP 1.3 data stream version; scap-security-guide packages have been updated to release 0.1.44.
• OpenSSH has been updated to release 8.0p1.

Other
• Intel Optane DC Persistent Memory: Memory Mode for the Intel Optane DC Persistent Memory technology has been added; this technology is transparent to the operating system.
• Database: Oracle Linux 8 Update 1 ships with version 8.0 of the MySQL database.
• Cockpit Web Console: capability for Simultaneous Multi-Threading (SMT) configuration using the Cockpit web console, including the ability to disable SMT. For further details please see the Oracle Linux Simultaneous Multithreading Notice. New firewall settings on the web console's Networking page. Several improvements to the Virtual Machines management page.

Important changes in this release
• virt-manager: The Virtual Machine Manager application (virt-manager) is deprecated in Oracle Linux 8 Update 1. Oracle recommends that you use the web console (Cockpit) for managing virtualization in a graphical user interface (GUI).
• RHCK: the Btrfs file system has been removed from RHCK; the OCFS2 file system has been removed from RHCK.
• VDO: the Ansible module moved to the Ansible packages; the VDO Ansible module is provided by the ansible package and is located in /usr/lib/python3.6/site-packages/ansible/modules/system/vdo.py.

Further information on Oracle Linux 8

For more details about these and other new features and changes, please consult the Oracle Linux 8 Update 1 Release Notes and Oracle Linux 8 Documentation. Oracle Linux can be downloaded, used, and distributed free of charge and all updates and errata are freely available. Customers decide which of their systems require a support subscription. This makes Oracle Linux an ideal choice for development, testing, and production systems. The customer decides which support coverage is best for each individual system while keeping all systems up to date and secure. Customers with Oracle Linux Premier Support also receive support for additional Linux programs, including Gluster Storage, Oracle Linux Software Collections, and zero-downtime kernel updates using Oracle Ksplice.
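Here is a brief, hedged sketch of the udica workflow mentioned above (the container name web and policy name my_container are illustrative, and the template path may vary by release; consult the udica documentation for the authoritative steps):

# yum install -y udica
# podman inspect web | udica my_container
# semodule -i my_container.cil /usr/share/udica/templates/*.cil
# podman run --security-opt label=type:my_container.process <image>

udica generates a custom SELinux type from the container's inspected mounts and ports, semodule loads it together with the udica templates, and the final run restricts the container to that tailored policy.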
Application Compatibility

Oracle Linux maintains user-space compatibility with Red Hat Enterprise Linux (RHEL), which is independent of the kernel version that underlies the operating system. To minimize impact on interoperability during releases, the Oracle Linux team works closely with third-party vendors for hardware and software that have dependencies on kernel modules. For more information about Oracle Linux, please visit www.oracle.com/linux.


Announcements

Unified Management for Oracle Linux Cloud Native Environment

Delivering a production-ready, cloud-native application development and operating environment, Oracle Linux Cloud Native Environment has gained some notable additions. Specifically, three core components for unified management: the Oracle Linux Cloud Native Environment Platform API Server, Platform Agent and Platform Command-Line Interface (CLI). These new open source management tools simplify the installation and day-to-day management of the cloud native environment, and provide extensibility to support new functionality.

Oracle Linux Cloud Native Environment was announced at Oracle OpenWorld 2018 as a curated set of open source projects that are based on open standards, specifications and APIs defined by the Open Container Initiative and Cloud Native Computing Foundation that can be easily deployed, have been tested for interoperability and for which enterprise-grade support is offered. Since then we have released several new components, either generally available under an existing Oracle Linux support subscription or as technical preview releases. Here's what the three core components provide:

The Platform API Server is responsible for performing all of the business logic required to deploy and manage an Oracle Linux Cloud Native Environment. We recommend using a dedicated operator node to host the Platform API Server, though it can run on any node within the environment. The business logic used by the Platform API Server is encapsulated within the metadata associated with each module we publish. An Oracle Linux Cloud Native Environment module is a method of packaging software so that it can be deployed by the Platform API Server to provide either core or optional cluster-wide functionality. Today, we are shipping the Kubernetes module which provides the core container orchestration functionality for the entire cluster. Included within the Kubernetes module are additional components that provide required services, including CoreDNS for name resolution and Flannel for layer 3 networking services.

The Platform API Server interacts with a Platform Agent that must be installed on each host within the environment. The Platform Agent knows how to gather the state of resources on its host and how to change the state of those resources. For example, the Platform Agent can determine if a package is installed and at which version, or if a firewall port is open or closed. It could then be requested to change the state of those resources, that is, to upgrade the package if it is old or to open the port if it is closed. New instructions on how to gather and set state values can be added at any time by the Platform API Server, which makes the Platform Agent easily extensible at runtime, without requiring a cluster-wide upgrade.

You interact with the Platform API Server using the Platform CLI tool. The Platform CLI tool is the primary interface for the administration of Oracle Linux Cloud Native Environment. Like the Platform Agent, it is simply an interface for the functionality provided by the Platform API Server. The Platform CLI tool can be installed on the operator node within the environment.

Kata Containers support and other updates

Oracle Linux Cloud Native Environment contains several new or updated components over the previously released Oracle Container Services for use with Kubernetes product.
The following changes are in addition to the new management functionality: The Kubernetes® module for the Oracle Linux Cloud Native Environment which is based on upstream Kubernetes v1.14 and is a Certified Kubernetes distribution, now automatically installs the CRI-O runtime interface which supports both runC and Kata Container runtime engines. The Kata Containers runtime engine which uses lightweight virtual machines for improved container isolation is now fully supported for production use and is automatically installed by the Kubernetes module. The Kubernetes module can either be configured to use an external load balancer or the Platform API Server can deploy a  software-based load balancer to ensure multi-master high availability. The Platform API Server is capable of providing full cluster-wide backup/restore functionality for disaster recovery. Join us at KubeCon + CloudNativeCon! Grab a coffee with the Oracle Linux and Virtualization team at Booth #P26 and get an Oracle Tux cup of your own. While you're there, our Linux and Virtualization experts can answer your questions and provide one-on-one demos of the unified management for Oracle Linux Cloud Native Environment. Installation Oracle Linux Cloud Native Environment RPM packages are available on the Unbreakable Linux Network and the Oracle Linux yum server. The installation of Oracle Linux Cloud Native Environment requires downloading container images directly from the Oracle Container Registry or by creating and using a local mirror of the images. Both options are covered in the Getting Started Guide. Oracle recommends reviewing the known issues list before starting the installation. Support Support for Oracle Linux Cloud Native Environment is included with an Oracle Linux Premier support subscription. Documentation and training Oracle Linux Cloud Native Environment documentation Oracle Linux Cloud Native Environment training   Kubernetes® is a registered trademark of The Linux Foundation in the United States and other countries, and is used pursuant to a license from The Linux Foundation.
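As a rough, hedged sketch of how the components described above map onto hosts -- the package and service names here are assumptions based on the component names and may differ from the shipped packages, so treat the Getting Started Guide as authoritative:

On the operator node:
# yum install olcnectl olcne-api-server
# systemctl enable --now olcne-api-server.service

On every node that will join the environment:
# yum install olcne-agent
# systemctl enable --now olcne-agent.service

Once the Platform API Server and Platform Agents are running, the Platform CLI on the operator node is used to create an environment and deploy the Kubernetes module into it.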


Events

Join Us at KubeCon + CloudNativeCon

Oracle’s Linux, Virtualization, and Cloud Infrastructure teams are heading south for KubeCon + CloudNativeCon, November 18 – 21, at the San Diego Convention Center. This conference gathers leading technologists from open source and cloud native communities to further the education and advancement of cloud native computing. If you’re making the move from traditional application design to cloud native – orchestrating containers as part of a microservices architecture – you probably have questions about the latest technologies, available solutions, and deployment best practices. Let us help you put the pieces together.

Meet us at booth #P26

Our Linux and Virtualization experts can answer your questions and provide one-on-one demos. Learn about the latest advancements in Oracle Linux Cloud Native Environment’s rich set of curated software components for DevSecOps, on premise and in the cloud. Oracle Linux Cloud Native Environment is based on open standards, specifications and APIs defined by the Open Container Initiative and Cloud Native Computing Foundation™. It offers an open, integrated operating environment that is popular with developers and makes it easy for IT operations to deliver containers and orchestration, management and development tools. With Oracle Linux Cloud Native Environment, you can:
• Accelerate time-to-value and deliver agility through modularity and developer productivity
• Modernize applications and lower costs by fully exploiting the economic advantages of cloud and open source
• Achieve vendor independence

At Oracle’s booth, you can also learn about:
• Oracle Cloud Native Services: public cloud for Kubernetes, Container registry, open source serverless, and more.
• GraalVM: Run programs faster anywhere. More agility and increased performance on premise and in the cloud.

Throughout the conference, you can sit in on informative “lightning talks” in the adjacent Oracle Cloud lounge. Have a cup of coffee on us – and you can enjoy it in your own Oracle Tux cup. To partake, visit the Coffee Lounge.

Booth hours
Tuesday, Nov. 19          10:25 am – 8:40 pm
Wednesday, Nov. 20   10:25 am – 5:20 pm
Thursday, Nov. 21        10:25 am – 4:30 pm

We look forward to meeting you at the conference!


Linux Kernel Development

Thoughts on Attending and Presenting at Linux Security Summit North America 2019

Oracle Linux kernel developer Tom Hromatka attended Linux Security Summit NA 2019. In this blog post, Tom discusses the presentation that he gave as well as other talks he found interesting. Linux Security Summit North America 2019 I was one of the lucky attendees at the Linux Security Summit North America 2019 conference in sunny San Diego from August 19th through the 21st. Three major topics dominated this year's agenda - trusted computing, containers, and overall kernel security. This was largely my first interaction with trusted computing and hardware attestation, so it was very interesting to hear about all of the innovative work going on in this area. My Presentation - The Why and How of Libseccomp https://sched.co/RHaK For 2019, LSS added three tutorial sessions to the schedule. These 90-minute talks were envisioned to be interactive and provide more in-depth details of a given technology. Paul Moore (co-maintainer of libseccomp) and I presented the first tutorial of the conference. We dedicated the first 20 minutes or so to a slide show introduction to the technology. Paul has given various flavors of this talk before, and he delivered the "why" part of the talk with a brief history of seccomp and libseccomp. He has a charismatic and entertaining delivery that can captivate an audience on even the driest of subjects - which seccomp is not :). I took over with the "how" portion of the discussion and jumped right in with a comparison of white- vs blacklists. (Spoiler - if security is of the utmost concern, I recommend a whitelist.) I briefly touched on other seccomp considerations such as supporting additional architectures (x86_64, x32, etc.), strings in seccomp, and parameter filtering pitfalls. The bulk of our timeslot was then spent writing a seccomp/libseccomp filter by hand. My goal was to highlight how easy it is to write a filter while simultaneously demonstrating some of the pitfalls (e.g. string handling) and how to debug them. In hindsight, this was a slightly crazy idea as many, many things could have gone horribly wrong. I had a rough plan of the program we were going to write and had tested it out beforehand. But like all good plans, no battle plan survives first contact with the enemy. Here is what we ended up writing. My laptop behaved differently at the conference than it did at home which led to more involved debugging than I had envisioned. I admit that I did want some live debugging, but... not that much. (I think the cause of the behavior differences was because I had done my testing at home using STDERR, but I inadvertently switched to using STDOUT at the conference.) Ultimately though these issues were the exact catalyst I was looking for, and audience participation soared. By the end of the talk I had the attention of the entire room and many, many people were actively throwing out ideas. There was no shortage of great ideas on how to fix the problem and perhaps more importantly, how to debug the problem. Afterward, a large number of people came up and thanked us for a fun talk. Several said that they were running the test program on their laptops while I was writing it and trying to actively debug it themselves. All in all, the talk didn't go exactly as I had envisioned, but perhaps that is for the better. The audience was amazing, and I sure had a lot of fun. Making Containers Safer - Stéphane Graber and Christian Brauner https://sched.co/RHa5 Stéphane and Christian are two of the lead engineers working on LXC for Canonical. 
They are both intelligent and often working at the forefront of containers, so when they make an upstream proposal, it's wise to pay attention. In this talk, they mentioned several things they have worked on lately to improve kernel and container security:
• Their users tend to run a single app in a container, rather than an entire distro. Thus, they can really lock down security via seccomp, SELinux, and cgroups to grant the bare minimum of permissions to run the application
• LXC supports unprivileged containers launched by an unprivileged user, but there are still some limitations. Stay tuned for patches and improvements from them :)
• Multiple users within such a container is difficult
• Several helper binaries are required
• Christian again reemphasized the importance of using unprivileged containers when possible and listed several CVEs that would have been ineffective against an unprivileged container
• They spent quite a bit of time discussing the newly-added seccomp/libseccomp user notification feature. (I worked on the libseccomp side of it.)
• They are considering adding a "keep running the kernel filter" option to the user notification filter

Keynote: Retrospective: 26 Years of Flexible MAC - Stephen Smalley https://sched.co/RHaH

Stephen Smalley was one of the early innovators in the Mandatory Access Control (MAC) arena (think SELinux and similar) and continues to innovate and advocate for better MAC solutions. Stephen presented an amazingly detailed and lengthy history, from ~1999 through today, of MACs in computing. He touched on early NSA work with closed source OSes and the NSA's inability to gain traction there. These failures drove the NSA to look at open source OSes, and early experiments with the University of Utah and the OS they maintained proved the viability of a MAC. SELinux work started shortly after that and was added to Linux in 2003. Stephen applauded the Android work as a good example of how to apply a MAC. Android is 100% confined+enforcing and has a large automated validation and testing suite. Going forward, Stephen said that MACs are being effectively used by higher-level services and emerging technologies. For better security, this is critical.

Application Whitelisting - Steven Grubb https://sched.co/RHb9

Steve Grubb is working on a rather novel approach to improve security. He's working on a daemon, fapolicyd, that can whitelist files on the system. His introduction quickly spelled out the problem space. Antivirus is an effective blacklisting approach. It can identify untrusted files and rapidly neutralize them. fapolicyd is effectively the opposite. A sysadmin should generally know the expected files that will be on the system and can create an application whitelist based upon these known files. He then went on a small tangent showing how easy it is to hijack a Python process and start up a webserver without touching the disk. fapolicyd uses seccomp to restrict execve access. Another quick demo showed how fapolicyd will allow /bin/ls to run, but a copy of it in /tmp was blocked. It's an interesting project in its early stages, and I'm eager to see how it progresses, so I started following it on github.

How to Write a Linux Security Module - Casey Schaufler https://sched.co/RHa2

Casey gave the third and final tutorial of the conference on how (and why) to write a Linux Security Module (LSM).
As an aside, I had lunch with Casey prior to his presentation, and he good-naturedly said that he wasn't "crazy enough" to write software live in front of a large audience. Hmmm :). Anyway... My key takeaways from this tutorial: * Why write your own LSM? You may have unique things you want to check beyond what SELinux or Apparmor are checking. Perhaps there's one little thing that your LSM can do... and do well * One LSM cannot override another LSM's denial. In fact, once a check fails, no other LSMs are run * If the check can be readily done in userspace, do it there. This includes LSM logic * You only need to implement the LSM hooks you are interested in Kernel Self-Protection Project - Kees Cook https://sched.co/RHbF Kees (kernel seccomp maintainer amongst many other things) gave another excellent talk on the status of security in the Linux kernel. His talks are usually so engaging that I don't take notes, and this one was no exception. He outlined the many security (and otherwise) fixes that have gone into the kernel over the last year. He also opined that he would love to see the kernel move away from C and replace it with rust, but he acknowledges there are a lot of challenges (both technical and human) before that could happen. Hallway Track As with any major Linux conference, the hallway track is every bit as invaluable as the official presentations. This was the first time I met my co-maintainer of libseccomp (Paul Moore) in person, and we were able to meet up a few times to talk seccomp/libseccomp and their roadmap going forward. I was lucky to be able to spend some time with several of the presenters, talking containers, seccomp, cgroups and whatever other topics we had in common. And of course I talked seccomp with many conference attendees and gladly offered my assistance in getting their seccomp filters up and running. Summary This was my first LSS and hopefully not my last. I really enjoyed my time with the outstanding conference attendees, and the conversations (both formal and informal) were excellent. In summary, I learned a ton, ate way too much really good food, and met many intelligent and wonderful people. I hope to see you at LSS 2020!
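For readers who want to reproduce the fapolicyd demo described earlier in this post, here is a minimal, hedged sketch on an Oracle Linux 8 style system (package availability and the exact denial message can vary between releases):

# yum install fapolicyd
# systemctl enable --now fapolicyd
# cp /bin/ls /tmp/ls
# /tmp/ls
# /bin/ls /tmp

The copy in /tmp is expected to be denied (typically with an "Operation not permitted" error) because it is not part of the RPM-backed whitelist, while the trusted binary in /bin continues to run.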


Events

Join Wim Coekaerts, Oracle SVP, for a webinar: What you need to know about Oracle Autonomous Linux

When: Tuesday, November 19, 2019

In this webinar, Wim Coekaerts, Senior Vice President, Software Development at Oracle, will discuss the world's first Autonomous Linux and the benefits it offers customers. Today, security and data protection are among the biggest challenges faced by IT. Keeping systems up to date and secure involves tasks that can be error prone and extremely difficult to manage in large-scale environments. Learn how Oracle is helping to solve these challenges:
• Automation is driving the cloud
• Automating common management tasks to greatly reduce complexity and human error
• Delivering increased security and reducing downtime by self-patching, self-updating, and known exploit detection
• Freeing up critical IT resources to tackle more strategic initiatives
• Full application compatibility
• How to configure automation in your private clouds - reducing cost

This webinar will be held at three times on November 19, 2019, to accommodate global locations. Please use the link below to register for the session in your region/local time zone.
APAC | 01:00 PM Singapore Time | Register
EMEA | 10:00 AM (GMT) Europe / London Time | Register
North America | 09:00 AM Pacific Standard Time / 12:00 PM Eastern Standard Time | Register

During the webinar you will have the opportunity to have your questions answered. Please join us.


Announcements

Easier access to open source images on Oracle Container Registry

We recently updated Oracle Container Registry so that images which contain only open source software no longer require Oracle Single Sign-On authentication to the web interface, nor do they require the Docker client to log in prior to issuing a pull request. The change was made to simplify the installation workflow for open source components hosted on the registry and allow those components to be more easily accessible by a continuous integration platform.

Downloading open source images

To determine whether an image is available without authentication, navigate to the Oracle Container Registry and then to the product category that contains the repository you're interested in. If the repository table states that the image "... is licensed under one or more open source licenses..." then that image can be pulled from Oracle Container Registry without any manual intervention or login required. See below for an example. If you click the name of a repository, a table of available tags with their associated pull command is displayed on the image detail page. For example, the oraclelinux repository in the OS product category had around 15 tags available when this blog was published. The Tags table also provides a list of download mirrors across the world that can be used. For best performance, select the download mirror closest to you.

To pull an open source image, you simply issue the docker pull command as documented. No need to log in beforehand. For example, to pull the latest Oracle Linux 7 slim image from our Sydney download mirror, simply run:

# docker pull container-registry-sydney.oracle.com/os/oraclelinux:7-slim
7-slim: Pulling from os/oraclelinux
Digest: sha256:c2d507206f62119db3a07014b445dd87f85b0d6f204753229bf9b72f82ac9385
Status: Downloaded newer image for container-registry-sydney.oracle.com/os/oraclelinux:7-slim
container-registry-sydney.oracle.com/os/oraclelinux:7-slim

Obtaining images that contain licensed Oracle product binaries

For details on the process required to download images that contain licensed Oracle product binaries, please review the Using the Oracle Container Registry chapter of the Oracle Container Runtime for Docker manual.
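The same image can also be pulled from the primary registry host rather than a regional mirror; a minimal sketch, assuming the os/oraclelinux repository remains in the open source, login-free category described above:

# docker pull container-registry.oracle.com/os/oraclelinux:7-slim

Everything else works exactly as with the mirror; only the registry hostname changes.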


Events

Resources at the Ready: Oracle OpenWorld 2019 Offers Continued Learning

Before Oracle OpenWorld 2019 is too far in the “rear view mirror” and we’re on to the holiday season, year’s end, and regional OpenWorld events that start in early 2020, we wanted to highlight some of the content from the September conference. At the link below, you’ll find many of the presentations given by product experts, partners, and customers with valuable information that can help you – and it’s all at your fingertips. Topics include:
• Securing Oracle Linux 7
• Setting Up a Kernel-Based VM with Oracle Linux 7, UEK5, Oracle Linux Virtualization Manager
• Maximizing Performance with Oracle Linux
• Secure Container Orchestration Using Oracle Linux Cloud Native (Kubernetes/Kata)
• Infrastructure as Code: Oracle Linux, Terraform, and Oracle Cloud Infrastructure
• Building Flexible, Multicloud Solutions: Oracle Private Cloud Appliance / Oracle Linux
• Oracle Linux and Oracle VM VirtualBox: The Enterprise Cloud Development Platform
• Oracle Linux: A Cloud-Ready, Optimized Platform for Oracle Cloud Infrastructure
• Open Container Virtualization: Security of Virtualization, Speed of Containers
• Server Virtualization in Your Data Center and Migration Paths to Oracle Cloud

This link is to the Session Catalog. Simply search for the title above or topic of interest. You’ll see a green download arrow, to the right of the session title, which lets you know that content is just two clicks away.

Videos of note:
• Announcing Oracle Autonomous Linux (0:36)
• Oracle's Infrastructure Strategy for the Cloud and On Premise (43:16)
• Cloud Platform and Middleware Strategy and Roadmap (32:56)

We hope you find this content helpful. Let us know. And, if you’re already planning for 2020, here’s the line-up of regional conferences:
• Oracle OpenWorld Middle East: Dubai | January 14-15 | World Trade Centre
• Oracle OpenWorld Europe: London | February 12-13 | ExCel
• Oracle OpenWorld Asia: Singapore | April 21-22 | Marina Bay Sands
• Oracle OpenWorld Latin America: Sao Paulo | June 17-18

Bold ideas. Breakthrough technologies. Better possibilities. It all starts here.


Linux Kernel Development

What it means to be a maintainer of Linux seccomp

In this blog post, Linux kernel developer Tom Hromatka talks about becoming a co-maintainer of libseccomp, what it means and his recent presentation at Linux Security Summit North America 2019.

Seccomp Maintainer

Recently I was named a libseccomp co-maintainer. As a brief background, the Linux kernel provides a mechanism - called SECure COMPuting mode or seccomp for short - to block a process or thread's access to some syscalls. seccomp filters are written in a pseudo-assembly instruction set called Berkeley Packet Filter (BPF), but these filters can be difficult to write by hand and are challenging to maintain as updates are applied and syscalls are added. libseccomp is a low-level userspace library designed to simplify the creation of these seccomp BPF filters. My role as a maintainer is diverse and varies greatly from day to day:
• I initially started working with libseccomp because we in Oracle identified opportunities that could significantly improve seccomp performance for containers and virtual machines. This work then grew into fixing bugs, helping others with their seccomp issues, and in general trying to improve seccomp and libseccomp for the future. Becoming a maintainer was the next logical progression.
• Our code is publicly available on github and we also maintain a public mailing list. Most questions, bug reports, and feature requests come in via github. Ideally the submitter will work with us to triage the issue, but that is not required.
• Pull requests are a great way for others to get involved in seccomp and libseccomp. If a user identifies a bug or wants to add a new feature, they are welcome to modify the libseccomp code and submit a pull request to propose changes to the library. In cases like this, I will work with users to make sure the code meets our guidelines. I will help them match the coding style, create automated tests, or whatever else needs to be done to ensure their pull request meets our stringent requirements. We have an extensive automated test suite, code coverage, and static analysis integrated directly into github to maintain our high level of code quality. These checks run against every pull request and every commit.
• Periodically we release new versions of libseccomp. (At present the release schedule is "as needed" rather than on a set timeline. This could change in the future if need be.) We maintain two milestones within github - a major release milestone and a minor release milestone. Major releases are based upon the master branch of the repo and will contain new features, bug fixes, etc. - including potentially major changes. On the other hand, the minor release is based upon the git release- branch. Changes to the minor branch consist of bug fixes, security CVEs, etc. - and do not contain major new features. As a maintainer, the release process is fairly involved to ensure the release is of the highest quality.
• Of course, I get to add new features, fix bugs - and hopefully not add any new ones :), and add tests.
• And finally I work with others both within Oracle and throughout the greater Linux community to plan libseccomp and seccomp's future. For example, Christian Brauner (Canonical) and Kees Cook (Google) are interested in adding deep argument inspection to seccomp. This will require non-trivial changes to both the kernel and libseccomp.
This is a challenging feature that has significant security risks and will require cooperation up and down the software stack to ensure it's done safely and with a user-friendly API Libseccomp at Linux Security Summit North America 2019 In August my co-maintainer, Paul Moore (Cisco), and I attended the Linux Security Summit (LSS) conference in San Diego. We presented a tutorial on the "Why and How of libseccomp" Paul opened up the 90-minute session with an entertaining retelling of the history of seccomp, libseccomp, and why it has evolved into its current form. I took over and presented the "how" portion of the presentation with a comparison of white- vs. blacklists, common pitfalls like string filters and parameter filtering. But the bulk of our tutorial was how to actually write a libseccomp filter, so with a tremendous amount of help from the audience, we wrote a filter by hand and debugged several troublesome issues. Full disclosure: I wanted to highlight some of the challenges when writing a filter, but as Murphy's Law would have it, even more went awry than I expected. Hijinks didn't ensue, but thankfully, I had an engaged and wonderful audience, and together we debugged the filter into existence. The live writing of code really did drive home some of the pitfalls as well as outline methods to overcome these challenges. Overall, things didn't go exactly as I had envisioned, but I feel the talk was a success. Thanks again to our wonderful audience! The full recording of the tutorial is available here


Linux Kernel Development

Notes on BPF (7) - BPF, tc and Generic Segmentation Offload

Notes on BPF (7) - BPF, tc and Generic Segmentation Offload

Oracle Linux kernel developer Alan Maguire continues our blog series on BPF, wherein he presents an in-depth look at the kernel's "Berkeley Packet Filter" -- a useful and extensible kernel function for much more than packet filtering. In the previous BPF blog entry, I warned against enabling generic segmentation offload (GSO) when using tc-bpf. The purpose of this blog entry is to describe how it in fact can be used, even for cases where BPF programs add encapsulation. A caveat however; this is only true for the 5.2 kernel and later. So here we will describe GSO briefly, then why it matters for BPF and finally demonstrate how new flags added to the bpf_skb_adjust_room helper facilitate using it for cases where encapsulation is added.

What is Generic Segmentation Offload?

Generic segmentation offload took the hardware concept of allowing Linux to pass down a large packet - termed a megapacket - which the hardware would then dice up into individual MTU-size frames for transmission. This is termed TSO (TCP segmentation offload), and GSO generalizes this beyond TCP to UDP, tunnels, etc., and performs the segmentation in software. Performance benefits are still significant, even in software, and we find ourselves reaching line rate on 10Gb/s and faster NICs, even with an MTU of 1500 bytes. Because a lot of the per-packet costs in the networking stack are paid by one megapacket rather than dozens of smaller MTU packets traversing the stack, the benefits really accrue.

Enter BPF

Now consider the BPF case. If we are doing processing or encapsulation in BPF, we are adding per-packet overhead. This overhead could come in the form of map lookups, adding encapsulation etc. The beautiful thing about GSO is that it happens after tc-bpf processing, so any costs we accrue in BPF are only paid for the megapacket, rather than each individual MTU-sized packet. As such, switching GSO on is highly desirable. There is a problem however. For GSO to work on encapsulated packets, the packets must mark their inner encapsulated headers. This is done for native tunnels via the skb_set_inner_[mac|transport]_header() functions, but if we added encapsulation in BPF there was no way to mark the inner headers accordingly.

The solution

In BPF we carry out the marking of inner headers via flags passed to bpf_skb_adjust_room(). For usage examples, I would recommend looking at tools/testing/selftests/bpf/progs/test_tc_tunnel.c. https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/test_tc_tunnel.c GRE and UDP tunnels are supported via the BPF_F_ADJ_ROOM_ENCAP_L4_GRE and BPF_F_ADJ_ROOM_ENCAP_L4_UDP flags. For L3, we need to specify if the inner header is IPv4 or IPv6 via the BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 and BPF_F_ADJ_ROOM_ENCAP_L3_IPV6 flags. Finally, if we have L2 encapsulation such as an inner ether header or MPLS label(s), we need to pass in the inner L2 header length via BPF_F_ADJ_ROOM_ENCAP_L2(inner_maclen). So we simply OR together the flags that specify the encapsulation we are adding.

Conclusion

Generic Segmentation Offload and BPF work beautifully together, but BPF encapsulation presented a difficulty since GSO did not know that the encapsulation had been added. With 5.2 and later kernels, this problem is now solved! Be sure to visit the previous installments of this series on BPF, here, and stay tuned for our next blog posts!
1. BPF program types
2. BPF helper functions for those programs
3. BPF userspace communication
4. BPF program build environment
5. BPF bytecodes and verifier
6. BPF Packet Transformation
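As a hedged illustration of how a tc-bpf encapsulation program is typically attached with GSO left enabled on recent kernels, the following commands use the standard tc and ethtool tools; the object file name tc_encap_kern.o and its ELF section name encap are placeholders for whatever your own BPF program uses:

# ethtool -k eth0 | grep generic-segmentation-offload
# ethtool -K eth0 gso on
# tc qdisc add dev eth0 clsact
# tc filter add dev eth0 egress bpf direct-action obj tc_encap_kern.o sec encap

The clsact qdisc provides the egress hook, and direct-action mode lets the BPF program return tc verdicts itself; with the bpf_skb_adjust_room() encapsulation flags described above, the GSO megapacket is segmented correctly after the program has added its headers.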


Linux Kernel Development

Mirroring a running system into a ramdisk

In this blog post, Oracle Linux kernel developer William Roche presents a method to mirror a running system into a ramdisk. A RAM mirrored System ? There are cases where a system can boot correctly but after some time, can lose its system disk access - for example an iSCSI system disk configuration that has network issues, or any other disk driver problem. Once the system disk is no longer accessible, we rapidly face a hang situation followed by I/O failures, without the possibility of local investigation on this machine. I/O errors can be reported on the console: XFS (dm-0): Log I/O Error Detected.... Or losing access to basic commands like: # ls -bash: /bin/ls: Input/output error The approach presented here allows a small system disk space to be mirrored in memory to avoid the above I/O failures situation, which provides the ability to investigate the reasons for the disk loss. The system disk loss will be noticed as an I/O hang, at which point there will be a transition to use only the ram-disk. To enable this, the Oracle Linux developer Philip "Bryce" Copeland created the following method (more details will follow): Create a "small enough" system disk image using LVM (a minimized Oracle Linux installation does that) After the system is started, create a ramdisk and use it as a mirror for the system volume when/if the (primary) system disk access is lost, the ramdisk continues to provide all necessary system functions. Disk and memory sizes: As we are going to mirror the entire system installation to the memory, this system installation image has to fit in a fraction of the memory - giving enough memory room to hold the mirror image and necessary running space. Of course this is a trade-off between the memory available to the server and the minimal disk size needed to run the system. For example a 12GB disk space can be used for a minimal system installation on a 16GB memory machine. A standard Oracle Linux installation uses XFS as root fs, which (currently) can't be shrunk. In order to generate a usable "small enough" system, it is recommended to proceed to the OS installation on a correctly sized disk space. Of course, a correctly sized installation location can be created using partitions of large physical disk. Then, the needed application filesystems can be mounted from their current installation disk(s). Some system adjustments may also be required (services added, configuration changes, etc...). This configuration phase should not be underestimated as it can be difficult to separate the system from the needed applications, and keeping both on the same space could be too large for a RAM disk mirroring. The idea is not to keep an entire system load active when losing disks access, but to be able to have enough system to avoid system commands access failure and analyze the situation. We are also going to avoid the use of swap. When the system disk access is lost, we don't want to require it for swap data. Also, we don't want to use more memory space to hold a swap space mirror. The memory is better used directly by the system itself. The system installation can have a swap space (for example a 1.2GB space on our 12GB disk example) but we are neither going to mirror it nor use it. Our 12GB disk example could be used with: 1GB /boot space, 11GB LVM Space (1.2GB swap volume, 9.8 GB root volume). Ramdisk memory footprint: The ramdisk size has to be a little larger (8M) than the root volume size that we are going to mirror, making room for metadata. 
But we can deal with 2 types of ramdisk:
• A classical Block Ram Disk (brd) device
• A memory compressed Ram Block Device (zram)

We can expect roughly 30% to 50% memory space gain from zram compared to brd, but zram must use 4k I/O blocks only. This means that the filesystem used for root has to only deal with a multiple of 4k I/Os.

Basic commands: Here is a simple list of commands to manually create and use a ramdisk and mirror the root filesystem space. We create a temporary configuration that needs to be undone or the subsequent reboot will not work. But we also provide below a way of automating at startup and shutdown.

Note the root volume size (considered to be ol/root in this example):
# lvs --units k -o lv_size ol/root
LSize
10268672.00k

Create a ramdisk a little larger than that (at least 8M larger):
# modprobe brd rd_nr=1 rd_size=$((10268672 + 8*1024))

Verify the created disk:
# lsblk /dev/ram0
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
ram0 1:0 0 9.8G 0 disk

Put the disk under lvm control:
# pvcreate /dev/ram0
Physical volume "/dev/ram0" successfully created.
# vgextend ol /dev/ram0
Volume group "ol" successfully extended
# vgscan --cache
Reading volume groups from cache.
Found volume group "ol" using metadata type lvm2
# lvconvert -y -m 1 ol/root /dev/ram0
Logical volume ol/root successfully converted.

We now have ol/root mirrored to our /dev/ram0 disk.
# lvs -a -o +devices
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
root ol rwi-aor--- 9.79g 40.70 root_rimage_0(0),root_rimage_1(0)
[root_rimage_0] ol iwi-aor--- 9.79g /dev/sda2(307)
[root_rimage_1] ol Iwi-aor--- 9.79g /dev/ram0(1)
[root_rmeta_0] ol ewi-aor--- 4.00m /dev/sda2(2814)
[root_rmeta_1] ol ewi-aor--- 4.00m /dev/ram0(0)
swap ol -wi-ao---- <1.20g /dev/sda2(0)

A few minutes (or seconds) later, the synchronization is completed:
# lvs -a -o +devices
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
root ol rwi-aor--- 9.79g 100.00 root_rimage_0(0),root_rimage_1(0)
[root_rimage_0] ol iwi-aor--- 9.79g /dev/sda2(307)
[root_rimage_1] ol iwi-aor--- 9.79g /dev/ram0(1)
[root_rmeta_0] ol ewi-aor--- 4.00m /dev/sda2(2814)
[root_rmeta_1] ol ewi-aor--- 4.00m /dev/ram0(0)
swap ol -wi-ao---- <1.20g /dev/sda2(0)

We have our mirrored configuration running! For security, we can also remove the swap and /boot, /boot/efi (if it exists) mount points:
# swapoff -a
# umount /boot/efi
# umount /boot

Stopping the system also requires some actions, as you need to clean up the configuration so that it will not be looking for a gone ramdisk on reboot.
# lvconvert -y -m 0 ol/root /dev/ram0
Logical volume ol/root successfully converted.
# vgreduce ol /dev/ram0
Removed "/dev/ram0" from volume group "ol"
# mount /boot
# mount /boot/efi
# swapon -a

What about in-memory compression? As indicated above, zRAM devices can compress data in-memory, but 2 main problems need to be fixed:
• LVM does not take zRAM devices into account by default
• zRAM only works with 4K I/Os

Make lvm work with zram: The lvm configuration file has to be changed to take into account the "zram" type of devices.
Including the following "types" entry to the /etc/lvm/lvm.conf file in its "devices" section: devices { types = [ "zram", 16 ] } Root file system I/Os: A standard Oracle Linux installation uses XFS, and we can check the sector size used (depending on the disk type used) with # xfs_info / meta-data=/dev/mapper/ol-root isize=256 agcount=4, agsize=641792 blks = sectsz=512 attr=2, projid32bit=1 = crc=0 finobt=0 spinodes=0 data = bsize=4096 blocks=2567168, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=1 log =internal bsize=4096 blocks=2560, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 We can notice here that the sector size (sectsz) used on this root fs is a standard 512 bytes. This fs type cannot be mirrored with a zRAM device, and needs to be recreated with 4k sector sizes. Transforming the root file system to 4k sector size: This is simply a backup (to a zram disk) and restore procedure after recreating the root FS. To do so, the system has to be booted from another system image. Booting from an installation DVD image can be a good possibility. Boot from an OL installation DVD [Choose "Troubleshooting", "Rescue a Oracle Linux system", "3) Skip to shell"] Activate and mount the root volume sh-4.2# vgchange -a y ol 2 logical volume(s) in volume group "ol" now active sh-4.2# mount /dev/mapper/ol-root /mnt create a zram to store our disk backup sh-4.2# modprobe zram sh-4.2# echo 10G > /sys/block/zram0/disksize sh-4.2# mkfs.xfs /dev/zram0 meta-data=/dev/zram0 isize=256 agcount=4, agsize=655360 blks = sectsz=4096 attr=2, projid32bit=1 = crc=0 finobt=0, sparse=0 data = bsize=4096 blocks=2621440, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=1 log =internal log bsize=4096 blocks=2560, version=2 = sectsz=4096 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 sh-4.2# mkdir /mnt2 sh-4.2# mount /dev/zram0 /mnt2 sh-4.2# xfsdump -L BckUp -M dump -f /mnt2/ROOT /mnt xfsdump: using file dump (drive_simple) strategy xfsdump: version 3.1.7 (dump format 3.0) - type ^C for status and control xfsdump: level 0 dump of localhost:/mnt ... xfsdump: dump complete: 130 seconds elapsed xfsdump: Dump Summary: xfsdump: stream 0 /mnt2/ROOT OK (success) xfsdump: Dump Status: SUCCESS sh-4.2# umount /mnt recreate the xfs on the disk with a 4k sector size sh-4.2# mkfs.xfs -f -s size=4096 /dev/mapper/ol-root meta-data=/dev/mapper/ol-root isize=256 agcount=4, agsize=641792 blks = sectsz=4096 attr=2, projid32bit=1 = crc=0 finobt=0, sparse=0 data = bsize=4096 blocks=2567168, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=1 log =internal log bsize=4096 blocks=2560, version=2 = sectsz=4096 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 sh-4.2# mount /dev/mapper/ol-root /mnt restore the backup sh-4.2# xfsrestore -f /mnt2/ROOT /mnt xfsrestore: using file dump (drive_simple) strategy xfsrestore: version 3.1.7 (dump format 3.0) - type ^C for status and control xfsrestore: searching media for dump ... 
xfsrestore: restore complete: 337 seconds elapsed xfsrestore: Restore Summary: xfsrestore: stream 0 /mnt2/ROOT OK (success) xfsrestore: Restore Status: SUCCESS sh-4.2# umount /mnt sh-4.2# umount /mnt2 reboot the machine on its disk (may need to remove the DVD) sh-4.2# reboot login and verify the root filesystem $ xfs_info / meta-data=/dev/mapper/ol-root isize=256 agcount=4, agsize=641792 blks = sectsz=4096 attr=2, projid32bit=1 = crc=0 finobt=0 spinodes=0 data = bsize=4096 blocks=2567168, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=1 log =internal bsize=4096 blocks=2560, version=2 = sectsz=4096 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 With sectsz=4096, our system is now ready for zRAM mirroring. Basic commands with a zRAM device: # modprobe zram # zramctl --find --size 10G /dev/zram0 # pvcreate /dev/zram0 Physical volume "/dev/zram0" successfully created. # vgextend ol /dev/zram0 Volume group "ol" successfully extended # vgscan --cache Reading volume groups from cache. Found volume group "ol" using metadata type lvm2 # lvconvert -y -m 1 ol/root /dev/zram0 Logical volume ol/root successfully converted. # lvs -a -o +devices LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices root ol rwi-aor--- 9.79g 12.38 root_rimage_0(0),root_rimage_1(0) [root_rimage_0] ol iwi-aor--- 9.79g /dev/sda2(307) [root_rimage_1] ol Iwi-aor--- 9.79g /dev/zram0(1) [root_rmeta_0] ol ewi-aor--- 4.00m /dev/sda2(2814) [root_rmeta_1] ol ewi-aor--- 4.00m /dev/zram0(0) swap ol -wi-ao---- <1.20g /dev/sda2(0) # lvs -a -o +devices LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices root ol rwi-aor--- 9.79g 100.00 root_rimage_0(0),root_rimage_1(0) [root_rimage_0] ol iwi-aor--- 9.79g /dev/sda2(307) [root_rimage_1] ol iwi-aor--- 9.79g /dev/zram0(1) [root_rmeta_0] ol ewi-aor--- 4.00m /dev/sda2(2814) [root_rmeta_1] ol ewi-aor--- 4.00m /dev/zram0(0) swap ol -wi-ao---- <1.20g /dev/sda2(0) # zramctl NAME ALGORITHM DISKSIZE DATA COMPR TOTAL STREAMS MOUNTPOINT /dev/zram0 lzo 10G 9.8G 5.3G 5.5G 1 The compressed disk uses a total of 5.5GB of memory to mirror a 9.8G volume size (using in this case 8.5G). Removal is performed the same way as brd, except that the device is /dev/zram0 instead of /dev/ram0. Automating the process: Fortunately, the procedure can be automated on system boot and shutdown with the following scripts (given as examples). The start method: /usr/sbin/start-raid1-ramdisk: [https://github.com/oracle/linux-blog-sample-code/blob/ramdisk-system-image/start-raid1-ramdisk] After a chmod 555 /usr/sbin/start-raid1-ramdisk, running this script on a 4k xfs root file system should show something like: # /usr/sbin/start-raid1-ramdisk Volume group "ol" is already consistent. RAID1 ramdisk: intending to use 10276864 K of memory for facilitation of [ / ] Physical volume "/dev/zram0" successfully created. Volume group "ol" successfully extended Logical volume ol/root successfully converted. Waiting for mirror to synchronize... LVM RAID1 sync of [ / ] took 00:01:53 sec Logical volume ol/root changed. NAME ALGORITHM DISKSIZE DATA COMPR TOTAL STREAMS MOUNTPOINT /dev/zram0 lz4 9.8G 9.8G 5.5G 5.8G 1 The stop method: /usr/sbin/stop-raid1-ramdisk: [https://github.com/oracle/linux-blog-sample-code/blob/ramdisk-system-image/stop-raid1-ramdisk] After a chmod 555 /usr/sbin/stop-raid1-ramdisk, running this script should show something like: # /usr/sbin/stop-raid1-ramdisk Volume group "ol" is already consistent. 
Logical volume ol/root changed.
Logical volume ol/root successfully converted.
Removed "/dev/zram0" from volume group "ol"
Labels on physical volume "/dev/zram0" successfully wiped.

A service unit file can also be created: /etc/systemd/system/raid1-ramdisk.service [https://github.com/oracle/linux-blog-sample-code/blob/ramdisk-system-image/raid1-ramdisk.service]

[Unit]
Description=Enable RAMdisk RAID 1 on LVM
After=local-fs.target
Before=shutdown.target reboot.target halt.target

[Service]
ExecStart=/usr/sbin/start-raid1-ramdisk
ExecStop=/usr/sbin/stop-raid1-ramdisk
Type=oneshot
RemainAfterExit=yes
TimeoutSec=0

[Install]
WantedBy=multi-user.target

Conclusion: When the system disk access problem manifests itself, the ramdisk mirror branch provides the possibility to investigate the situation. The goal of this procedure is not to keep the system running on this memory mirror configuration, but to help investigate a bad situation. When the problem is identified and fixed, I really recommend coming back to a standard configuration -- enjoying the entire memory of the system, a standard system disk, a possible swap space etc. Hoping the method described here can help. I also want to thank, for their reviews, Philip "Bryce" Copeland who also created the first prototype of the above scripts, and Mark Kanda who also helped test many aspects of this work.
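As a footnote to the automation section above, once the two scripts and the raid1-ramdisk.service unit file are installed in the paths shown, wiring them into systemd follows the usual pattern (a minimal sketch using standard systemctl commands):

# systemctl daemon-reload
# systemctl enable --now raid1-ramdisk.service
# systemctl status raid1-ramdisk.service

Using enable --now both registers the oneshot unit for the next boot and runs the start script immediately, so the mirror is created without waiting for a reboot.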


Announcements

Oracle OpenWorld 2019 & Oracle Code One – It’s a Wrap!

Oracle OpenWorld 2019 and Oracle Code One went by in a flash. It was a wonderful week, packed with great content. There was a fun new vibe at the conference, lots of learning, and much to inspire and celebrate. Below are some of the Oracle Linux and virtualization highlights. We launched Oracle Autonomous Linux, which marks a major milestone in the company’s autonomous strategy. Oracle Autonomous Linux, along with the new Oracle OS Management Service, is the first and only autonomous operating environment that helps greatly reduce complexity and human error to deliver increased cost savings, security, and availability for customers. For more details, watch the highlights from the keynote announcement, listen to an interview with Wim Coekaerts, Senior Vice President, Operating Systems and Virtualization Engineering, and read the news release, blog, and website. Go to Oracle Cloud Marketplace to get started. We learned more about Oracle Linux Cloud Native Environment; container orchestration with Kubernetes and Kata; how to use the Developer Image for Oracle Cloud Infrastructure; setting up a kernel-based VM with Oracle Linux; securing systems with zero-downtime using Oracle Ksplice; developing for the cloud with Oracle Linux and Oracle VM VirtualBox, the impact of selecting the right infrastructure software from first-hand customer experiences, and more. In this interview, Wim Coekaerts talks about what Oracle is doing to help make things easier for developers – across apps, database, and infrastructure – whether the target is deployment in the cloud or on premise. Hardware partners Dell Technologies and Lenovo, who work with the Oracle Linux and virtualization team to certify their hardware through the HCL program to run our software,  showcased joint solutions, while in a session about optimizing environments, Cisco shared how the company uses Oracle Database running on Oracle Linux. Several sessions covered more about the interconnect collaboration between Microsoft and Oracle, in which our team is closely involved. We celebrated our amazing customers! To all those who joined us at the conference, we appreciate your time and hope you found the demonstrations, sessions, keynotes, and 1:1s valuable. Sharing your experiences helps drive innovation. Your feedback, knowledge, and perspectives are vital in helping us deliver industry-leading products and services. Making you successful with your Oracle Linux and virtualization deployments – on premise or in the cloud – is among our most important goals and a tremendously rewarding achievement. It was a pleasure to host the Developer Appreciation Event where we shared some brews, casual conversation, and demos including Oracle Linux, Oracle VM VirtualBox, GraalVM, Helidon, Oracle Digital Assistant, Oracle Integration Cloud, and Oracle Content and Experience. Oracle CloudFest.19 at the Chase Center was a big hit. The performers and the venue were superb! It was a special honor to recognize eight of our customers with the 2019 Oracle Excellence Awards for their leadership in infrastructure transformation. Congratulations! Tell us about your Oracle OpenWorld and Code One highlights via the comments section. Until next year, it's a wrap!   #OOW19  #CodeOne #OracleLinux  #OracleAutonomousLinux

Announcements

Oracle Announces 2019 Oracle Excellence Awards – Congratulations to our “Leadership in Infrastructure Transformation" Winners

We are pleased to announce the 2019 Oracle Excellence Awards “Leadership in Infrastructure Transformation" Winners. This elite group of recipients includes customers and partners who are using Oracle Infrastructure Technologies to accelerate innovation and drive business transformation by increasing agility, lowering costs, and reducing IT complexity. This year, our 8 award recipients were selected from among hundreds of nominations. The winners represent 8 different countries: Australia, Austria, China, India, Japan, Netherlands, New Zealand, and the United States. Winners must use at least one, or a combination, of the following for category qualification: • Oracle Linux • Oracle Virtualization • Oracle Private Cloud Appliance • Oracle SPARC • Oracle Solaris • Oracle Storage, Tape/Disk. Oracle is pleased to honor these leaders who have delivered value to their organizations through the use of multiple Oracle technologies, which have resulted in reduced cost of IT operations, improved time to deployment, and performance and end-user productivity gains. This year’s winners are Jan-Pier Loonstra, enterprise infrastructure architect, KPN; Laszlo Beres, engineering manager, GVC Holdings; Prasanta Kumar Nayak, deputy general manager (systems), State Bank of India; Ryan Lea, solution consultant, CCL: Together with Revera; Sumesh Vadassary, software development director, PayPal, Inc.; Suping Dong, CEO, Bowmicro Ltd.; Yasushi Taki, CEO, CTO and founder, JustPlayer Co., Ltd.; Vivek Jaiswal, group general manager DevSpecOps Shared Services, National Roads and Motorist Association Australia. More information can be found here.

Linux Kernel Development

Persistent Memory and Oracle Linux

In this blog post, Oracle Linux kernel developer Jane Chu talks about persistent memory, the support we have for it in Oracle Linux, and some examples of how to use it.

Persistent Memory Introduction

Persistent Memory Overview

More than ever, applications have been driving hardware technologies. Big Data is responsible for recent advances in artificial intelligence which demand massive parallel processing capability and immediate access to large data sets. For example, realtime business analytics, providing results based on realtime consumer/product information; realtime traffic pattern analysis based on data collected by smart cameras, etc. These applications present challenges in transporting, storing, and processing big data. Modern disk technology provides access to large amounts of persistent data, but access times are not fast enough. DRAM offers fast access to data, but it's generally not feasible to store one and a half terabytes of data for CPU intensive computation without involving IO. Various NVRAM solutions have been used to address the needs for speed and capacity. NAND flash offers capacity for a reasonable price, but is slow and not byte-addressable. NVDIMMs based on a combination of DDR4, supercapacitors, and flash offer fast speed and byte-addressability, but capacity is limited by DDR4 capacity. A new technology developed jointly by Intel and Micron, 3D XPoint, offers all these features. Intel Optane DC PMEM is the product from Intel that uses 3D XPoint technology.

Intel Optane Data Center Persistent Memory (Optane DC PMEM)

The Optane DC PMEM DIMM is pin compatible with a DDR4 DIMM and is available in 128G, 256G and 512G capacities. It has much higher endurance than NAND flash and its density is eight times that of DDR4. It supports byte-addressable load and store instructions, where read latency is roughly 4 times that of DDR4, and write latency is roughly 10 times that of DDR4. There are two PMEM modes that are supported on Intel systems equipped with Optane DC PMEM.

2-Level memory mode (2LM)

In this mode, NVDIMMs are used as main memory, and DRAM is treated as a write-back cache. A load instruction ends up fetching data from DRAM if there is a cache hit; otherwise data will be fetched from the second level memory, PMEM, incurring longer latency. A store instruction is not cached, so it will always incur the longer latency. Therefore, performance in 2LM mode depends on both the nature of the workload and the DRAM:PMEM ratio. PMEM in 2LM mode can be considered volatile, as the hardware ensures that data is not available after a power cycle.

Application Direct mode (AppDirect)

As the name implies, this is the mode where applications can directly access PMEM in byte-addressable style. Unlike in 2LM mode, PMEM in this mode is not presented as system memory, but rather as device memory that an application can map into its address space. PMEM in AppDirect mode is persistent, persistence being achieved via Asynchronous DRAM Refresh (ADR). ADR is a mechanism activated by the power loss signal. It ensures that data that has reached the ADR domain is flushed to the media to be persisted. The ADR domain consists of the memory controller, the Write Pending Queue (WPQ), the Transaction Pending Queue (TPQ), and the NVDIMM media. To ensure no data loss, the application is responsible for flushing data out of the CPU caches (into the ADR domain). ADR could in theory fail due to an unqualified power supply having a signal issue or being unable to sustain sufficient voltage long enough to flush the pending queues.
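To make that flushing responsibility concrete, here is a minimal sketch of a store to an AppDirect (FS-DAX) mapping followed by explicit cache-line flushes toward the ADR domain. The file path is hypothetical, and a real application would more likely use PMDK's libpmem (for example, pmem_persist()) than hand-rolled intrinsics:

/* pmem_store.c - sketch only: write to an FS-DAX mapping and flush the stores
 * toward the ADR domain.  The path below is hypothetical; it assumes a file of
 * at least 2MiB on a filesystem mounted with -o dax (created beforehand, e.g.
 * with fallocate).  Build with: gcc -O2 -mclflushopt pmem_store.c -o pmem_store */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <immintrin.h>

#define CACHELINE 64

static void flush_range(const void *addr, size_t len)
{
    const char *p = (const char *)((uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1));
    const char *end = (const char *)addr + len;
    for (; p < end; p += CACHELINE)
        _mm_clflushopt((void *)p); /* push each dirty cache line toward the memory controller */
    _mm_sfence();                  /* order the flushes before data is considered persistent */
}

int main(void)
{
    const size_t len = 2 * 1024 * 1024;          /* map one 2MiB page */
    int fd = open("/mnt_xfs/pmem_demo", O_RDWR); /* hypothetical file on a DAX mount */
    if (fd < 0) { perror("open"); return 1; }

    /* Production code would typically also pass MAP_SHARED_VALIDATE | MAP_SYNC. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(buf, "hello, persistent memory"); /* byte-addressable store, no read()/write() */
    flush_range(buf, strlen(buf) + 1);       /* data has now reached the ADR domain */

    munmap(buf, len);
    close(fd);
    return 0;
}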
To enable PMEM, these basic components are required:
• NVDIMMs and CPUs that support PMEM, such as Intel's Cascade Lake
• BIOS or UEFI firmware that supports PMEM
• Kernel support: NVDIMM drivers and filesystem DAX support
• Administrative management tools such as ipmctl and ndctl to manage the NVDIMMs

Oracle Linux support for PMEM

Oracle is committed to providing solid PMEM support to its customers in Oracle Linux. Oracle Linux 7/UEK5 has the latest full set of Linux PMEM support, including but not limited to:
• device-dax support, and device-dax memory for the memory hot-plug feature
• filesystem-dax support, including MAP_SYNC and DAX-XFS metadata protection
• btt block driver support for PMEM used as a traditional block device
• the ndctl utility
Oracle is actively participating in the upstream development of Linux PMEM support, as well as backporting new upstream PMEM features/fixes to OL/UEK kernels.

How to use PMEM on Oracle Linux

Interleaved vs Non-Interleaved

NVDIMM interleaving works the same way as DDR4 interleaving; what is worth noting is its storage characteristics. In non-interleaved mode, each NVDIMM is like a disk that can be carved up into partitions in which one can put filesystems. In N-way interleaved mode (N >= 2), the disk is formed by the N-way striped NVDIMMs, hence a single partition spans the N participating NVDIMMs. In PMEM terms, such a disk is called a region, defined as physically contiguous memory. Its raw capacity can be partitioned into logical devices called namespaces. The interleaved configuration is accomplished at the BIOS/UEFI level, hence a firmware reboot is required for an action initiated in the OS.

PMEM Configuration

To configure 2LM mode with NVDIMMs in non-interleaved mode:
# ipmctl create -goal memory-mode=100 PersistentMemoryType=AppDirectNotInterleaved
To configure 2LM mode with NVDIMMs in interleaved mode:
# ipmctl create -goal memory-mode=100 PersistentMemoryType=AppDirect
To configure NVDIMMs in AppDirect in interleaved mode:
# ipmctl create -goal memory-mode=0 PersistentMemoryType=AppDirect
To configure NVDIMMs in AppDirect in non-interleaved mode:
# ipmctl create -goal memory-mode=0 PersistentMemoryType=AppDirectNotInterleaved
For more information, see the ipmctl github and the NDCTL User Guide.

On a 2-node system with 8 NVDIMMs configured in AppDirect non-interleaved mode, there will be 8 regions.
# ndctl list -Ru | grep -c region
8
# ndctl list -NRui -r region0
{ "regions":[ { "dev":"region0", "size":"126.00 GiB (135.29 GB)", "available_size":"126.00 GiB (135.29 GB)", "max_available_extent":"126.00 GiB (135.29 GB)", "type":"pmem", "iset_id":"0xcc18da901a1e8a22", "persistence_domain":"memory_controller", "namespaces":[ { "dev":"namespace0.0", "mode":"raw", "size":0, "uuid":"00000000-0000-0000-0000-000000000000", "sector_size":512, "state":"disabled" } ] } ] }
The above shows that region0 has a capacity of 126GiB, as indicated by the max_available_extent value, and that no namespace has been created in region0 yet. namespace0.0 above is a seed namespace purely for programming purposes. Now, create an fsdax type namespace in region0, called pmem0, to make a PMEM block device.
# ndctl create-namespace -m fsdax -r region0
{ "dev":"namespace0.0", "mode":"fsdax", "map":"dev", "size":"124.03 GiB (133.18 GB)", "uuid":"e04893f8-8b50-4232-b71c-9742ea3a6a3b", "sector_size":512, "align":2097152, "blockdev":"pmem0" }
where "align":2097152 indicates the default namespace alignment size: 2MiB.

Filesystem-DAX

Filesystem DAX is supported in XFS and EXT2/4.
To use FS-DAX, an fsdax mode namespace has to be created, such as /dev/pmem0 above. Then a filesystem is created on /dev/pmem0 and mounted with the -o dax option. To reduce memory footprint and improve performance, a 2MiB page size is preferred with FS-DAX. To achieve that, two things must be done: the fsdax namespace must be created with 2MiB alignment, as above, and the mkfs parameters must be specified as below.
# mkdir /mnt_xfs
# mkfs.xfs -d agcount=2,extszinherit=512,su=2m,sw=1 -f /dev/pmem0
# mount -o dax /dev/pmem0 /mnt_xfs
To verify the 2MiB hugepage support, one can turn on the PMD fault debug trace and look for dax_pmd_fault_done events in the trace log.
# echo 1 > /sys/kernel/debug/tracing/events/fs_dax/dax_pmd_fault_done/enable
# fallocate --length 1G /mnt_xfs/1G_file
# xfs_bmap -v /mnt_xfs/1G_file
/mnt_xfs/1G_file: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..2097151]: 8192..2105343 0 (8192..2105343) 2097152
Turn off the debug trace by
# echo 0 > /sys/kernel/debug/tracing/events/fs_dax/dax_pmd_fault_done/enable

Device-DAX

Device DAX is a character device that supports byte-addressable direct access to PMEM. The command below forces (-f) reconfiguration of namespace2.0 into a devdax device, setting the namespace alignment to 1GiB. The device name is /dev/dax2.0.
# ndctl create-namespace -f -e namespace2.0 -m devdax -a 1G
{ "dev":"namespace2.0", "mode":"devdax", "map":"dev", "size":"124.00 GiB (133.14 GB)", "uuid":"2d058872-4025-4440-b49a-105501d366b7", "daxregion":{ "id":2, "size":"124.00 GiB (133.14 GB)", "align":1073741824, "devices":[ { "chardev":"dax2.0", "size":"124.00 GiB (133.14 GB)" } ] }, "align":1073741824 }
Then, a process can mmap(2) /dev/dax2.0 into its address space.

RDMA with PMEM device

Traditional RDMA via pinned pages works with device-dax, but is not (yet) supported with filesystem-DAX. For filesystem-DAX, RDMA via ODP (On-Demand Paging) is the alternative.

PMEM UE report and "badblocks"

Similar to DRAM, PMEM might have errors. Most of the errors are correctable errors (CE) that can be corrected by ECC without OS intervention. But rarely, an Uncorrectable Error (UE) occurs in PMEM, and when the act of reading consumes the UE, a Machine Check Event (MCE) is triggered which traps the CPU, and consequently the memory error handler is invoked. The handler marks the offending page in order to prevent it from being prefetched. If the page belongs to a kernel thread, the system will panic. If the page belongs to user processes, the page will be unmapped from every process that has a mapping of the page. Then the kernel sends a SIGBUS to the user process. In the signal payload, these fields are worth noting:
.si_code := always BUS_MCEERR_AR for a PMEM UE
.si_addr := page aligned user address
.si_addr_lsb := Least Significant Bit (LSB) of the .si_addr field; for a 2MiB page, that's 21
In addition, the libnvdimm driver subscribes to a callback service from MCE in order to get a record of the UE. The driver maintains a badblocks list to keep track of the known UEs. A known UE is also called a poison, as the initial consumption of the UE causes the poison bit to be set in the media. Here is the command that displays the existing badblocks.
# ndctl list -n namespace6.0 --media-errors
[ { "dev":"namespace6.0", "mode":"devdax", "map":"mem", "size":135289372672, "uuid":"d45108b4-2b9d-48f6-a55b-1a095fd1eb51", "chardev":"dax6.0", "align":2097152, "badblock_count":3, "badblocks":[ { "offset":4128778, "length":1, "dimms":[ "nmem6" ] }, ] } ]
Following the block device tradition, the bad blocks are in units of 512 bytes. The offset starts from the beginning of the user-visible area in the namespace. In the above example, offset = 4128778 * 512 = 0x7e001400. If a process mmap()s /dev/dax6.0 entirely at virtual address vaddr, then vaddr + 0x7e001400 is the starting address of the poisoned block. To clear the poison, one may issue
# ndctl clear-errors -r region6 --scrub -v
The command scrubs the entire region6, clearing the known poison as well as poison from UEs that have not yet been consumed in the region.

PMEM Emulation using DRAM

It is possible to emulate PMEM with DRAM via the memmap kernel parameter. First, examine the e820 table via dmesg and select a usable region.
[ 0.000000] user: [mem 0x0000000100000000-0x000000407fffffff] usable
Second, add the memmap parameters to the boot cmdline: memmap=16G!20G memmap=16G!36G memmap=16G!52G memmap=16G!68G and reboot. After the system boots up, dmesg shows
[ 0.000000] user: [mem 0x0000000100000000-0x00000004ffffffff] usable
[ 0.000000] user: [mem 0x0000000500000000-0x00000008ffffffff] persistent (type 12)
[ 0.000000] user: [mem 0x0000000900000000-0x0000000cffffffff] persistent (type 12)
[ 0.000000] user: [mem 0x0000000d00000000-0x00000010ffffffff] persistent (type 12)
[ 0.000000] user: [mem 0x0000001100000000-0x00000014ffffffff] persistent (type 12)
[ 0.000000] user: [mem 0x0000001500000000-0x000000407fffffff] usable
By default, the four memmap regions are emulated as four fsdax devices -
$ sudo ndctl list -Nu
[ { "dev":"namespace1.0", "mode":"fsdax", "map":"mem", "size":"16.00 GiB (17.18 GB)", "sector_size":512, "blockdev":"pmem1" }, { "dev":"namespace3.0", "mode":"fsdax", "map":"mem", "size":"16.00 GiB (17.18 GB)", "sector_size":512, "blockdev":"pmem3" }, { "dev":"namespace0.0", "mode":"fsdax", "map":"mem", "size":"16.00 GiB (17.18 GB)", "sector_size":512, "blockdev":"pmem0" }, { "dev":"namespace2.0", "mode":"fsdax", "map":"mem", "size":"16.00 GiB (17.18 GB)", "sector_size":512, "blockdev":"pmem2" } ]
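Tying back to the UE handling described above, an application that maps PMEM directly can catch SIGBUS and inspect the payload fields listed earlier. The following is a minimal, hedged sketch; the actual recovery policy (remapping the range, restoring from a replica, and so on) is application-specific and only hinted at in the comments:

/* sigbus_pmem.c - sketch of handling the SIGBUS payload described above.
 * BUS_MCEERR_AR, si_addr and si_addr_lsb are standard Linux siginfo fields;
 * everything else here is illustrative. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    if (si->si_code == BUS_MCEERR_AR) {
        /* si_addr is page aligned; si_addr_lsb gives the page-size shift
         * (21 for a 2MiB mapping), so the poisoned range is
         * [si_addr, si_addr + (1UL << si_addr_lsb)). */
        fprintf(stderr, "uncorrectable PMEM error at %p, lsb=%d\n",
                si->si_addr, (int)si->si_addr_lsb);
        /* (fprintf is not async-signal-safe; acceptable for a demo only.)
         * A real handler would unmap or repair the poisoned range here. */
    }
    _exit(1);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, NULL);

    /* ... mmap(2) /dev/dax2.0 or an FS-DAX file here and access it ... */
    pause();
    return 0;
}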

Announcements

Oracle OpenWorld 2019 & Oracle Code One Day 3 Recap, Day 4 Closing

Day 3 of Oracle OpenWorld 2019 and Oracle Code One was packed with Linux and Virtualization sessions. Customers from Lawrence Livermore National Lab, KPN, and GVC shared their experiences and the importance of selecting the right infrastructure software, which should be cost-effective, secure, and available, and provide business agility. Wim Coekaerts and Ali Alasti’s solution keynote covered the key differentiators and strategy for Oracle’s infrastructure software and hardware – noting that Oracle is unique as a cloud vendor in that the infrastructure software and hardware we use in Oracle Cloud is the same as customers can use on premise. Thanks to Tim Hall, Oracle ACE Director and Oracle Groundbreaker Ambassador, who provides insights from two of the sessions he attended: Practical DevOps with Linux, Virtualization, and Oracle Application Express Understanding the Oracle Linux Cloud Native Environment (OLCNE) In this short video, Simon Coter, Director of Product Management, provides a few reasons why Oracle Linux is the best choice for developers. While our sessions are done, there’s still a lot to learn in today’s sessions. Safe travels, Tim and all!   Breakthroughs start here. #OOW19  #CodeOne #OracleLinux  #OracleAutonomousLinux

Announcements

Oracle OpenWorld 2019 & Oracle Code One – Day 2 Recap, Day 3 Highlights

Day 2 of Oracle OpenWorld and Oracle Code One has finished. It was another great day – great weather, great to meet friends and colleagues from all over the world, and great to see all the buzz and enthusiasm for Oracle Autonomous Linux, announced yesterday in this keynote. Several articles have been written and many conference sessions covered the news in more detail. Here’s an interview with Wim Coekaerts. The day also included HOLs, strategy and roadmap sessions on Linux, VirtualBox, cloud infrastructure and security. Day 3, Wednesday, September 18, has more in store to inspire you. Make it a great day! Customer Panel: Impact of Selecting the Right Infrastructure Software Philip Adams, CTO/Lead Architect, Lawrence Livermore National Lab Karen Sigman, Vice President, Product and Partner Marketing, Oracle Jan-Pier Loonstra, Architect, KPN Laszlo Beres, Engineering Manager Unix, Linux Technology Operations, GVC 9:00 a.m. - 9:45 a.m. Moscone South - Room 214 Solution Keynote: Oracle's Infrastructure Strategy for Cloud and On-Premise Wim Coekaerts, Senior Vice President, Software Development, Oracle Ali Alasti, SVP of Hardware Engineering, Oracle 11:15 a.m. - 12:00 p.m. YBCA Theater Maximize Performance with Oracle Linux Srini Eeda, Senior Director, Oracle Linux, Oracle Aruna Ramakrishna, Principal Member of Staff, Oracle 12:30 p.m. - 1:15 p.m. Moscone South (Esplanade Ballroom) - Room 152D Simplify Secure Cloud Access with Oracle Secure Global Desktop Jan Mangold, Secure Global Desktop Senior Product Manager, Oracle 12:30 p.m. - 1:15 p.m. Moscone South (Esplanade Ballroom) - Room 155A App Development with Oracle Cloud Infrastructure/Oracle Autonomous Database: Get Started Sergio Leunissen, VP, Oracle 1:30 p.m. - 2:15 p.m. Moscone South - Room 312 Practical DevOps with Linux, Virtualization, and Oracle Application Express Simon Coter, Director of Product Management, Linux and Virtualization, Oracle 2:30 p.m. - 3:15 p.m. Moscone South - Room 312 Learning Oracle Linux Cloud Native from the Ground Up - BYOL Michele Casey, Senior Director Product Management, Oracle Linux, Oracle Thomas Tanaka, Oracle Wiekus Beukes, Oracle Tom Cocozzello, Oracle 2:45 p.m. - 4:45 p.m. Moscone West - Room 3024C Create a HA-NFS server Using Gluster, Corosync, and Pacemaker Avi Miller, Senior Manager, Oracle Linux and Virtualization Product Management, Oracle David Gilpin, Principal Product Manager, Oracle 3:45 p.m. - 4:45 p.m. Moscone West - Room 3022B Server Virtualization in Your Data Center and Migration Paths to Oracle Cloud Kurt Hackel, Vice President, Oracle John Priest, Product Management Director, Oracle Virtualization, Oracle 3:45 p.m. - 4:30 p.m. Moscone South (Esplanade Ballroom) - Room 152D High-Performance Apps: Oracle Private Cloud Appliance/Oracle Private Cloud at Customer Steve Callahan, Senior Principal Product Manager, Oracle 3:45 p.m. - 4:30 p.m. Moscone South (Esplanade Ballroom) - Room 155A Oracle Linux: A Cloud-Ready, Optimized Platform for Oracle Cloud Infrastructure Julie Wong, Product Management Director, Linux and Virtualization, Oracle Ryan Volkmann, Sr. Manager IT PMO, Nidec 4:45 p.m. - 5:30 p.m. Moscone South (Esplanade Ballroom) - Room 152D Understanding the Oracle Linux Cloud Native Environment Thomas Tanaka, Oracle Wiekus Beukes, Oracle Tom Cocozzello, Oracle 5:00 p.m. - 5:45 p.m. Moscone South - Room 206 Add these sessions to your schedule and don't forget to bookmark the Oracle Linux and Virtualization Program Guide for more information.
And don’t miss The Exchange in Moscone South for demos, conversations, opportunities for Q&As, and theater presentations.   Breakthroughs start here. #OOW19  #CodeOne #OracleLinux  #OracleAutonomousLinux

Announcements

Oracle OpenWorld 2019 & Oracle Code One – Day 1 Recap, Day 2 Highlights

Day 1 of Oracle OpenWorld and Oracle Code One has come to a close. It was a great day – filled with lots of learning and exciting news. Customers, partners, and Oracle Linux and Virtualization product experts presented sessions that covered a range of subjects, including building secure systems; getting started with application development in OCI using Oracle Cloud Developer Image; the advantages of Kata containers; using Oracle Linux Cloud Native Environment, and more. The big news was saved for last—the announcement of Oracle Autonomous Linux. Day 2, Tuesday, September 17, has more to offer. Please join us for: Oracle’s Open Cloud Infrastructure Strategy Ajay Srivastava, Senior Vice President, Operating Systems and Virtualization, Oracle 11:15 a.m. - 12:00 p.m. Moscone South, Room 210 Cloud Platform and Middleware Strategy Roadmap Edward Screven, Chief Corporate Architect, Oracle 12:30 p.m. – 01:15 p.m. YCBA Theater Oracle Linux: State of the Penguin (Learn more about Autonomous Linux) Wim Coekaerts, Senior Vice President, Linux and Virtualization Engineering, Oracle 01:45 p.m. - 02:30 p.m. Moscone South, Room 210 Strategic Considerations to Achieve Business Impact with Cloud Native Projects Mickey Bharat, Senior Director, Worldwide Embedded Sales, Oracle Linux and Virtualization 03:15 p.m. – 04:00 p.m. Moscone South, Room 152B How to Get Started with Cloud Native Karen Sigman, Vice President, Product and Partner Marketing 04:00 p.m. - 04:20 p.m. The Exchange, Moscone South, Theater 1 Securing Oracle Linux 7 Erik Benner, Vice President of Enterprise Transformation, Mythics, Inc. Avi Miller, Senior Manager, Oracle Linux and Virtualization Product Management 04:15 p.m. - 05:00 p.m. Moscone South, Room 210 Oracle Linux and Oracle VM VirtualBox: The Enterprise Cloud Development Platform Simon Coter, Director, Oracle Linux and Virtualization Product Management 05:15 p.m. - 06:00 p.m. Moscone South, Room 152B Add these sessions to your schedule and bookmark the Oracle Linux and Virtualization Program Guide for more on all of our sessions. And don’t miss The Exchange in Moscone South for demos, conversations, opportunities for Q&As, and theater presentations.   Breakthroughs start here. #OOW19  #CodeOne #OracleLinux  #OracleAutonomousLinux

Announcements

Partner: Microsoft @ Oracle OpenWorld

We are happy to welcome our partner, and conference Bronze Sponsor, Microsoft to Oracle OpenWorld 2019. There are many exciting things you will want to learn about the areas of collaboration between companies. To begin, please visit Microsoft representatives at The Exchange, in Moscone South, booth #1511. For more in-depth information, make sure to register for these sessions: Monday, September 16, 11:15 AM - 12:00 PM | Moscone South - Room 151D Oracle on Azure: An Overview [CON6620] Speakers: Romit Girdhar, Senior Software Engineer, Microsoft Edward Burns, Principal Architect, Microsoft Monday, September 16, 01:45 PM - 02:30 PM | Moscone South - Room 210 Microsoft Azure and Oracle Cloud Infrastructure: Seattle's Newest Power Couple [BUS3362] Speakers: FS Nooruddin, Vice President Information Technology, Gap Inc. Umakanth Puppala, Program Manager, Microsoft Chinmay Joshi, Principal Product Manager, Oracle Monday, September 16, 02:45 PM - 03:05 PM | The Exchange Lobby (Moscone South) Elevating Your Multicloud Strategy: Azure and Oracle Cloud Infrastructure Interconnect [THT6633] Speakers: Umakanth Puppala, Program Manager, Microsoft Chinmay Joshi, Principal Product Manager, Oracle Wednesday, September 18, 10:00 AM - 10:45 AM | Moscone South - Room 155C Connecting Oracle Cloud to Microsoft Azure: Technical Deep-Dive and Demo [CON6696] Speakers: Romit Girdhar, Senior Software Engineer, Microsoft Chinmay Joshi, Principal Product Manager, Oracle Oracle OpenWorld is the perfect opportunity to immerse yourself in all that’s new with Oracle and Microsoft.  

Announcements

Partner: Lenovo @ Oracle OpenWorld 2019

The Oracle Linux and Virtualization team, on behalf of Oracle, is delighted to welcome Platinum Sponsor Lenovo to Oracle OpenWorld 2019. Lenovo works closely with us to certify their hardware, through the HCL program, to run our software. Please be sure to attend our joint session: Monday, September 16, 02:45 PM - 03:30 PM | Moscone South - Room 152D Data-Driven to Insight-Driven Transformation by Oracle Autonomous Database [CON4565] This session discusses how Oracle Autonomous Database and Oracle Linux on Lenovo infrastructure enable this digital transformation through insight-driven analytics, eliminating the manual task of database patching and upgrades, reducing human error, and increasing productivity. Speakers: Michele Resta, Sr Director, Oracle Prasad Venkatachar, Sr Solutions Product Manager, Lenovo (United States) Inc. You can also meet Lenovo representatives at The Exchange, in Moscone South, booth #1211. While you’re there, check out the theater schedule for a joint presentation by the Oracle Linux and Virtualization team and Lenovo on the companies’ alliance.

Linux Kernel Development

Soft Affinity - When Hard Partitioning Is Too Much

Scheduler Soft Affinity

Oracle Linux kernel developer Subhra Mazumdar presents a new interface to the Linux scheduler that he is proposing.

Servers are getting bigger and bigger, with more CPU cores, memory and I/O. This trend has led to workload consolidation (e.g. multiple virtual machines (VMs) and containers running on the same physical host). Each VM or container can run a different instance of the same or different workload. Oracle Database has a similar virtualization feature called Oracle Multitenant where a root Database can be enabled to act as a Container Database (CDB) and house multiple lightweight Pluggable Databases (PDBs), all running in the same host. This allows for very dense DB consolidation. Large servers usually have multiple sockets or NUMA (Non Uniform Memory Access) nodes, with each node having its own CPU cores and attached memory. Cache coherence and remote memory access are facilitated by inter-socket links (QPI in the case of Intel) but are usually much costlier than local access and coherence. When running multiple instances of a workload in a single NUMA host, it is good practice to partition them, e.g. give a NUMA node partition to each DB instance for best performance. Currently the Linux kernel provides two interfaces to hard partition instances: the sched_setaffinity() system call or the cpuset.cpus cgroup interface. This doesn't allow one instance to burst out of its partition and use potentially available CPUs of other partitions when they are idle. Another option is to allow all instances to spread across the system without any affinity, but this suffers from a cache coherence penalty across sockets when all instances are busy.

Autonuma Balancer

One potential way to achieve the desired behavior is to use the Linux autonuma balancer, which migrates pages and threads to align them. For example, if each DB instance has memory pinned to one NUMA node, autonuma can migrate threads to their corresponding nodes when all instances are busy, thus automatically partitioning them. Motivational experiments, however, show that not much benefit is achieved by enabling autonuma. In this case 2 DB instances were run on a 2-socket system, each with 22 cores. Each DB instance was running an OLTP load (TPC-C) and had its memory allocated from one NUMA node using numactl. But autonuma ON vs OFF didn't make any difference. The following statistics show (for different numbers of TPC-C users) the migration of pages by autonuma, which didn't have any performance benefit. This also shows that numactl only restricts the initial memory allocation to a NUMA node; the autonuma balancer is free to migrate the pages later. Below, numa_hint_faults is the total number of NUMA hinting faults, numa_hint_faults_local is the number of local faults (so the rest are remote), and numa_pages_migrated is the number of pages migrated by autonuma.

users (2x16), no affinity: numa_hint_faults 1672485, numa_hint_faults_local 1158283, numa_pages_migrated 373670
users (2x24), no affinity: numa_hint_faults 2267425, numa_hint_faults_local 1548501, numa_pages_migrated 586473
users (2x32), no affinity: numa_hint_faults 1916625, numa_hint_faults_local 1499772, numa_pages_migrated 229581

Other disadvantages of the autonuma balancer are a) it can be ineffective in case of memory spread among all NUMA nodes and b) it can be slow to react due to periodic scanning.

Soft Affinity

Given the above drawbacks, a logical way to achieve the best of both worlds is via the Linux task scheduler.
A new interface can be added to tell the scheduler to prefer a given set of CPUs when scheduling a task, but to use other available CPUs if the preferred set is all busy. The interface can either be a new system call (e.g. sched_setaffinity2(), which takes an extra parameter to specify HARD or SOFT affinity) or a new parameter added to cpuset (e.g. cpuset.soft_cpus). It is important to note that Soft Affinity is orthogonal to cpu.shares: the latter decides how many CPU cycles to consume, while the former decides where to preferably consume those cycles. Under the hood the scheduler will add an extra set, cpu_preferred, in addition to the existing cpu_allowed set per task. cpu_preferred will be set as requested by the user using any of the above interfaces and will be a subset of cpu_allowed. In the first level of search, the scheduler chooses the last level cache (LLC) domain, which is typically a NUMA node. Here the scheduler will always use cpu_preferred to prune out remaining CPUs. Once the LLC domain is selected, it will first search the cpu_preferred set and then the (cpu_allowed - cpu_preferred) set to find an idle CPU and enqueue the thread. This only changes the wake-up path of the scheduler; the idle balancing path is intentionally kept unchanged: together they achieve the "softness" of scheduling. With such an implementation, experiments were run with 2 instances of Hackbench and then 2 instances of DB, by soft affinitizing each instance to one NUMA node on a 2-socket system. Another set of runs was done with only 1 instance active but still soft affinitized to the corresponding node. The load in each instance of Hackbench or DB was varied by varying the number of groups and the number of users, respectively. The following graphs outline the performance gain (or regression) for hard affinity and soft affinity with respect to no affinity. Hackbench shows little improvement for hard or soft affinity (possibly due to less data sharing), while the DB shows substantial improvement for the 2-instance case. For 1 instance, Hackbench shows significant regression while the DB achieves performance very close to no affinity. The DB seems to achieve the best of both worlds with such a basic implementation: improvement like hard affinity and almost no regression like no affinity.

Load Based Soft Affinity

While the basic implementation of Soft Affinity above worked well for the DB, Hackbench showed serious regression in the 1-instance case due to not using the CPUs in the system efficiently. This begs the question: should the decision to trade off cache coherence for CPUs be conditional? The optimum trade-off point of a given workload will depend on the amount of data sharing between threads, the coherence overhead of the system, and how much extra CPUs will help the workload. Unfortunately the kernel can't determine this online; offline workload profiling is needed to quantify the different cost metrics. A reasonable approach to solve this is having kernel tunables that allow tuning for different workloads. Two scheduler-related kernel tunables are introduced for this purpose: sched_preferred and sched_allowed. The ratio of CPU utilization of the cpu_preferred set to that of the cpu_allowed set is compared to the ratio sched_allowed:sched_preferred; if greater, the scheduler will choose the cpu_allowed set in the first level of search, and if lesser, it will choose the cpu_preferred set. By setting the relative values of the tunables, Soft Affinity can be made "harder" or "softer".
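Conceptually, the decision described above amounts to something like the following. This is only a schematic sketch with hypothetical names, not the actual kernel patch (which is linked at the end of this post):

/* Schematic sketch of the load-based decision described above.  The
 * utilization values and tunables are hypothetical names, not kernel APIs. */
static int prefer_allowed_set(unsigned long util_preferred,
                              unsigned long util_allowed,
                              unsigned int sched_preferred,
                              unsigned int sched_allowed)
{
    /* Choose the wider cpu_allowed set when
     * util_preferred / util_allowed > sched_allowed / sched_preferred;
     * cross-multiplying avoids integer division. */
    return util_preferred * sched_preferred > util_allowed * sched_allowed;
}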
To compare the utilization of the two sets we can't iterate over all CPUs, as that would add significant overhead. Hence two sample CPUs are chosen, one from each set, and compared. The same experiments were run with the new load-based Soft Affinity. The following graphs have the tunable pair (sched_preferred, sched_allowed) sorted from softest to hardest value. As can be seen, for the DB case, harder Soft Affinity works best, similar to the previous basic implementation. For Hackbench, a softer Soft Affinity works best, thereby preserving the improvements while minimizing the regression. A separate set of experiments was also done (graphs not shown) where the memory of each DB instance was spread evenly among the NUMA nodes. This showed similar improvements, proving that the benefit of partitioning is primarily due to LLC sharing and saving cross-socket coherence overhead.

Soft Affinity Overhead

The final goal of Soft Affinity is to introduce no overhead if it is not used. The scheduler wake-up path adds a few more conditions but breaks out early if cpu_preferred == cpu_allowed. This keeps the overhead minimal, as shown in the following graph which compares the performance of Hackbench for the 1- and 2-instance cases for a varying number of groups. The difference in the last column is actually the improvement of the Soft Affinity kernel with respect to the baseline kernel. This is within the noise margin, but it shows that the overhead of Soft Affinity is negligible. The latest version of Soft Affinity with load-based tunables has been posted upstream; you can find it here: https://lkml.org/lkml/2019/6/26/1044
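For contrast, the hard-partitioning interface that exists today, sched_setaffinity(2), looks like this in practice. This is a minimal sketch that pins the calling process to the CPUs of one socket; the CPU numbers are illustrative and assume 22 cores per socket as in the experiments above:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);

    /* Illustrative: restrict the process to CPUs 0-21, i.e. one socket of a
     * 2 x 22-core system.  With hard affinity these are the only CPUs the
     * process may use, even when the other socket is completely idle. */
    for (int cpu = 0; cpu < 22; cpu++)
        CPU_SET(cpu, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pid %d hard-affinitized to CPUs 0-21\n", getpid());
    /* ... run the workload ... */
    return 0;
}

With Soft Affinity, the same CPU set would instead be expressed as a preference that the scheduler can overflow when those CPUs are busy.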

Announcements

Join Oracle Executives for Linux and Virtualization Sessions at Oracle OpenWorld 2019

With Oracle OpenWorld and Code One starting next week, you’ll want to have your schedule locked down soon. If the sessions below aren’t on your agenda, you’ll want to register and add them. These sessions are presented by Oracle’s Linux and Virtualization executives. In these sessions, you’ll hear from the technology leaders helping to foster innovation at Oracle. These executives envision, develop, and help build the products, services, and technologies that are enabling customers’ success – in the cloud and on premise. Join them to learn about the vision, strategies, and breakthroughs that are paving the way to a bold new future. Inspiration starts here. We look forward to seeing you at these sessions! Date/Time Title Speaker(s) Location Tuesday September 17     11:15 a.m. – 12:00 p.m. Oracle’s Open Cloud Infrastructure Strategy    Ajay Srivastava, Senior Vice President, Operating Systems and Virtualization, Oracle Moscone South Room 210   12:30 p.m. – 01:15 p.m. Cloud Platform and Middleware Strategy Roadmap    Edward Screven, Chief Corporate Architect, Oracle YCBA Theater 01:45 p.m. – 02:30 p.m. Oracle Linux : State of the Penguin    Wim Coekaerts, Senior Vice President, Linux and Virtualization Engineering, Oracle Moscone South Room 210 04:00 p.m. – 04:20 p.m. How to Get Started with Cloud Native   Karen Sigman, Vice President, Product and Partner Marketing The Exchange, Moscone South  Theater 1 Wednesday,    September 18     11:15 a.m. – 12:00 p.m.   Oracle's Infrastructure Strategy for Cloud and On-Premises   Wim Coekaerts, SVP, Linux and Virtualization Engineering Ali Alasti, SVP, Hardware Development, x86 Management Ajay Srivastava, SVP, Operating Systems and Virtualization YCBA Theater To learn more about Oracle Linux and Virtualization sessions, HOLs, and demo kiosks (at The Exchange), take a look at recent blogs. 

Announcements

Two Places to Meet the Oracle Linux and Virtualization Team at Oracle OpenWorld

The Oracle Linux and Virtualization team will be out in force at Oracle OpenWorld. There is a lot in store for attendees. We look forward to sharing the latest updates and demoing the latest innovations. In addition to our sessions and Hands on Labs, there are two more places to find Linux and virtualization experts… 1. @ The Exchange, Exhibition Level, Moscone South At The Exchange, the team will be ready to answer your questions and show you how the latest products and cloud offerings can help address your business needs. You will find us at the following kiosks: Public Cloud Infrastructure Showcase CIS-001 > Increasing IT Efficiency and Agility with Oracle Virtualization  Operating systems, containers, and virtualization are the fundamental building blocks of modern IT infrastructure. Come to this demo kiosk to learn how Oracle Linux and Oracle virtualization products help increase IT efficiency and agility—on premises and in the cloud. CIS-002 > Jump-Start Your Development with Oracle Linux and Oracle Cloud  Oracle Linux offers an open, integrated operating environment with application development tools, management tools, containers, and orchestration capabilities that enable DevOps teams to efficiently build reliable, secure cloud native applications. Developers worldwide use Oracle VM VirtualBox to run Oracle Linux with the cloud native software on their desktop and easily deploy to the cloud. In addition, Oracle Cloud developer tools such as Terraform, SDKs, and CLI are available on Oracle Linux for an improved experience. Come to this demo kiosk to learn more about speeding up your development and your move to the cloud. CIS-003 > Oracle Linux and Virtualization Management with Oracle Enterprise Manager 13c In this demo, learn how to monitor and manage Oracle Linux and Oracle Virtualization technologies with Oracle Enterprise Manager 13c. Learn how Oracle Enterprise Manager 13c optimizes Oracle Linux and virtualization resources in a multi-private-cloud environment. CIS-005 > Secure Your Cloud Infrastructure with Oracle Linux, Ksplice, Oracle Secure Global Desktop Oracle Linux is the only Linux distribution that supports live, nondisruptive patching, both in the kernel space and in the user space. That means you can immediately apply security patches without impacting your production environment-and without rebooting. To date, more than 1 million patches have been delivered in this fashion through Ksplice. In this demo, learn how to use Oracle Secure Global Desktop to enable your workforce to connect from nearly any device, anywhere, while providing administrators with the tools they need to control access to applications and desktop environments, in the cloud and in the data center. On-Premise Infrastructure Showcase OPI-005 > Building a Cloud Native Environment with Oracle Linux Oracle Linux offers an open, integrated operating environment with application development tools, management tools, containers, and orchestration capabilities that enable DevOps teams to efficiently build reliable, secure cloud native applications. Come to this demo kiosk to learn how Oracle Linux can help you enhance productivity. OPI-006 > Oracle Linux Solutions for ISVs, OEMs, Embedded, and Cloud Platforms Come to this demo to understand how Oracle Linux solutions can be a foundation to help you grow your applications or services to extend your reach into new markets. 
OPI-007 > Increasing IT Efficiency and Agility with Oracle Virtualization Operating systems, containers, and virtualization are the fundamental building blocks of modern IT infrastructure. Come to this demo kiosk to learn how Oracle Linux and Oracle virtualization products help increase IT efficiency and agility-on premises and in the cloud. OPI-008 > Secure Your Cloud Infrastructure with Oracle Linux, Ksplice, Oracle Secure Global Desktop  Oracle Linux is the only Linux distribution that supports live, nondisruptive patching, both in the kernel space and in the user space. That means you can immediately apply security patches without impacting your production environment—and without rebooting. To date, more than 1 million patches have been delivered in this fashion through Ksplice. In this demo, learn how to use Oracle Secure Global Desktop to enable your workforce to connect from nearly any device, anywhere, while providing administrators with the tools they need to control access to applications and desktop environments, in the cloud and in the data center.   2. @ The Developer Appreciation Event – Sunday, September 15, 6:30 p.m. – 9:30 p.m. Join us for an informal gathering. Have a brew and some light hors d’oeuvres; converse with Oracle product experts; and experience some of the latest demos in a developer-only environment. Please let us know if you can make it. Kindly reply via: sign me up for the Oracle Developer Appreciation event on Sunday evening, September 15 in San Francisco. We look forward to spending time with you at Oracle OpenWorld. The Linux and Virtualization Team

Announcements

Oracle Linux and Virtualization Hands On Labs at Oracle OpenWorld & Oracle CodeOne

Ready to roll up your sleeves and dive into the latest tools and technologies? You’ll find five Hands On Labs (HOLs) at this year’s Oracle OpenWorld and CodeOne conferences that will help you optimize and secure your environment. Topics include Oracle Linux Cloud Native Environment, Oracle VM VirtualBox, Kata Containers, KVM, Terraform and more. Register now to be sure you have a seat. Let the learning begin! @Oracle OpenWorld – HOLs are 1-hour sessions Date HOL Title/Description/Speaker Time Location Monday, September 16 Infrastructure as Code: Oracle Linux, Terraform, and Oracle Cloud Infrastructure HOL1512 In this hands-on lab see how to easily install and configure Terraform for Oracle Cloud Infrastructure on Oracle Linux 7, and then use it to provision infrastructure in Oracle Cloud Infrastructure. Speakers: Christophe Pauliat, Master Principal Sales Consultant, Oracle Solution Center, Oracle Simon Hayler, Product Manager, Oracle Matthieu Bordonne, Principal Sales Consultant, Emea Oracle Solution Center, Oracle 10:00 a.m. - 11:00 a.m.   Moscone West Room 3022B Tuesday, September 17 Secure Container Orchestration Using Oracle Linux Cloud Native (Kubernetes/Kata) HOL5303 Learn to use Vagrant to automatically deploy Oracle Cloud Infrastructure Container Service Classic for use with a Kubernetes cluster, on an Oracle Linux 7 virtual machine using Oracle VM VirtualBox. Once the cluster is deployed, learn how to deploy secured containers with Kata Containers. Speaker: Simon Coter, Director of Product Management, Linux and Virtualization, Oracle   2:15 p.m. - 3:15 p.m.     Moscone West Room 3022B   Set Up a Kernel-Based VM with Oracle Linux 7, UEK5, Oracle Linux Virtualization Manager HOL5308 Walk through the planning and deployment of an infrastructure-as-a-service (IaaS) environment with an Oracle Linux KVM as the foundation. Speaker: Simon Coter, Director of Product Management, Linux and Virtualization, Oracle   3:45 p.m. - 4:45 p.m.     Moscone West Room 3022B Wednesday, September 18 Create a HA-NFS server Using Gluster, Corosync, and Pacemaker HOL5373 Learn how to build a three-node highly available shared-nothing NFS storage cluster on Oracle Linux 7 using open source tools. The lab also covers the installation and configuration of Gluster to enable storage replication between all three nodes, followed by the configuration of a highly available NFS server. Speakers: Avi Miller, Senior Manager, Oracle Linux and Virtualization Product Management, Oracle David Gilpin, Principal Product Manager, Oracle   3:45 p.m. - 4:45 p.m.  Moscone West - Room 3022B   Moscone West Room 3022B     @Oracle CodeOne – HOLs are 2-hour sessions Date HOL Title/Description/Speaker Time Location Tuesday, September 17 Learning Oracle Linux Cloud Native from the Ground Up – BYOL HOL5780 PLEASE NOTE: YOU MUST BRING YOUR OWN LAPTOP (BYOL) TO PARTICIPATE IN THIS HANDS-ON LAB. This lab will walk participants through a full installation of the Oracle Linux Cloud Native Environment. Go through the basic installation and configuration of the core components, including the container runtime engine, Kubernetes for orchestration, Istio, Prometheus, and Grafana—to name a few. In addition, the lab will cover more-advanced concepts. No preparation required. Speakers: Michele Casey, Senior Director Product Management, Oracle Linux, Oracle Thomas Tanaka, Principal Member of Technical Staff, Oracle Wiekus Beukes, Software Development Senior Director, Oracle Tom Cocozzello, Principal Member of Technical Staff, Oracle   9:00 a.m. 
- 11:00 a.m.     Moscone West Room 3024B Wednesday, September 18 Learning Oracle Linux Cloud Native from the Ground Up – BYOL HOL5780 PLEASE NOTE: YOU MUST BRING YOUR OWN LAPTOP (BYOL) TO PARTICIPATE IN THIS HANDS-ON LAB. This lab will walk participants through a full installation of the Oracle Linux Cloud Native Environment. Go through the basic installation and configuration of the core components, including the container runtime engine, Kubernetes for orchestration, Istio, Prometheus, and Grafana—to name a few. In addition, the lab will cover more-advanced concepts. No preparation required. Speakers: Michele Casey, Senior Director Product Management, Oracle Linux, Oracle Thomas Tanaka, Principal Member of Technical Staff, Oracle Wiekus Beukes, Software Development Senior Director, Oracle Tom Cocozzello, Principal Member of Technical Staff, Oracle   2:45 p.m. - 4:45 p.m.     Moscone West Room 3024C    

Announcements

Join Oracle’s Linux Developers @ the Linux Plumbers Conference in Lisbon – September 9-11

Oracle is pleased to support the open source community as a Silver Sponsor of the Linux Foundation’s Linux Plumbers Conference (LPC). We look forward to meeting with peers from around the world in Lisbon, Portugal, September 9 – 11 at the Corinthia Hotel. LPC is a developer conference for the open source community. It brings together the top developers working on the “plumbing” of Linux — kernel subsystems, core libraries, windowing systems, etc.  LPC brings Linux and open source experts together for three days of intensive work on core design problems. This year, LPC is composed of several tracks: Refereed Talks; Networking Summit; Kernel Summit; and many Microconferences. Oracle’s Linux and MySQL developers will be presenting in several sessions in the various tracks. If you are attending LPC, please be sure to join us for these sessions: In the Refereed Talks track, you can attend: Kernel Address Space Isolation, with Alexandre Chartre + others In the Networking track, join us for: BPF packet capture helpers, libbpf interfaces, with Alan Maguire Some of this year’s Microconferences, including Testing and Fuzzing, Toolchains, and Scheduler are organized by Oracle Linux Engineers. Within the Microconferences, Oracle engineers will lead discussions on the following topics: Testing and Fuzzing Microconference: Collaboration/unification around unit testing frameworks, with Dr. Knut Omang Toolchain Microconference: eBPF support in the GNU Toolchain, with Jose Marchesi CTF in the GNU toolchains, with Nick Alcock Scheduler Microconference: Task latency-nice, with Subhra Mazumdar Distribution kernels Microconference: Being Kernel Maintainer at Oracle - Lessons & Challenges, with Allen Pais Databases Microconference: Dimitri Kravtchuk of MySQL will be discussing several topics: io_uring - excitement - looking for feedback & potential issues Filesystem atomic writes / O_ATOMIC MySQL @EXT4 performance impacts with latest Linux kernels MySQL @XFS IP / UNIX Socket Backlog Syscall overhead from Spectre/Meltdown fixes New InnoDB REDO log design and MT sync challenges, with Pawal Olchawa Containers and Checkpoint/Restore Microconference: Cgroup v1/v2 Abstraction Layer, with Tom Hromatka RDMA Microconference: Shared IB Objects, with Yuval Shaia System Boot and Security Microconference: TrenchBoot - how to nicely boot system with Intel TXT and AMD SVM, with Daniel Kiper LPC provides a forum to generate vigorous discussion and helps lead the community to beneficial change. Oracle’s Linux developers look forward to meeting and collaborating with everyone in Lisbon.

Announcements

Top 10 Oracle Linux and Virtualization Sessions at Oracle OpenWorld 2019

The Oracle Linux and Virtualization team welcomes you to Oracle OpenWorld and Oracle Code One 2019, September 16-19, in San Francisco. We look forward to bringing you – our customers and partners – together with product experts, executives, and industry luminaries to discuss the future and highlight new developments. The lineup of keynotes and sessions will help answer your questions and enable you to bring your best ideas to bear on your business strategies. Hands on Labs and Developer sessions will offer deep dives into the technologies you need to drive innovation. Remember to register for sessions ahead of time to make sure you have a seat.  To help you plan your time, below is a sampling of the Linux and Virtualization sessions.  Top 10 Sessions, and a few more… Date Session Title/Speaker Time Location Monday, September 16 Building Government-Grade Secure Systems Using Open Source Customer Case Study Session Speakers: Kai Martius, Chief Technical Officer, secunet Security Networks AG Honglin Su, Senior Director, Oracle Linux and Virtualization Product Management 09:00 a.m. - 09:45 a.m. Moscone South Room 155A   App Development with Oracle Cloud Infrastructure/Oracle Autonomous Database: Get Started Developer Session Speaker: Sergio Leunissen, Vice President, Oracle Linux and Virtualization Development 01:30 p.m. - 02:15 p.m. Moscone South Room 201   Oracle Cloud Infrastructure Behind the Scenes: Deep Dive into the Software Conference Session Speaker: Rita Ousterhout, Senior Director, Oracle Linux and Virtualization Development 01:45 p.m. - 02:30 p.m. Moscone South Room 155A   Using Oracle's Cloud Native Environment to Kickstart Your Private Cloud Conference Session Speakers: Michele Casey, Senior Director, Oracle Linux Product Management Tom Cocozzello, Principal Member of Technical Staff, Oracle Linux and Virtualization Development David Gilpin, Principal Product Manager, Oracle Linux 01:45 p.m. - 02:30 p.m. Moscone South Room 155B   Open Container Virtualization: Security of Virtualization, Speed of Containers Customer Session Speakers: Katsuaki Shimadera, Security Architect, Recruit Technologies Co., Ltd. Simon Coter, Director, Oracle Linux and Virtualization Product Management 02:45 p.m. - 03:30 p.m. Moscone South Room 152B Tuesday, September 17 Oracle’s Open Cloud Infrastructure Strategy  Executive Session Speaker: Ajay Srivastava, Senior Vice President, Operating Systems and Virtualization, Oracle 11:15 a.m. - 12:00 p.m. Moscone South Room 210     Cloud Platform and Middleware Strategy Roadmap  Executive Session Speaker: Edward Screven, Chief Corporate Architect, Oracle 12:30 p.m. – 01:15 p.m. YCBA Theater   Oracle Linux: State of the Penguin  Executive Session Speaker: Wim Coekaerts, Senior Vice President, Linux and Virtualization Engineering, Oracle 01:45 p.m. - 02:30 p.m. Moscone South Room 210   Strategic Considerations to Achieve Business Impact with Cloud Native Projects Conference Session Speaker: Mickey Bharat, Senior Director, Worldwide Embedded Sales, Oracle Linux and Virtualization 03:15 p.m. – 04:00 p.m. Moscone South Room 152B   How to Get Started with Cloud Native Theater Session Speaker: Karen Sigman, Vice President, Product and Partner Marketing 04:00 p.m. - 04:20 p.m. The Exchange, Moscone South  Theater 1   Securing Oracle Linux 7 Conference Session Speakers: Erik Benner, Vice President of Enterprise Transformation, Mythics, Inc. Avi Miller, Senior Manager, Oracle Linux and Virtualization Product Management 04:15 p.m. - 05:00 p.m. 
Moscone South  Room 210   Oracle Linux and Oracle VM VirtualBox: The Enterprise Cloud Development Platform Product Overview and Roadmap Session Speaker: Simon Coter, Director, Oracle Linux and Virtualization Product Management 05:15 p.m. - 06:00 p.m. Moscone South Room 152B Wednesday, September 18 Oracle's Infrastructure Strategy for Cloud and On-Premises Executive Session Speakers: Wim Coekaerts, SVP, Linux and Virtualization Engineering Ali Alasti, SVP, Hardware Development, x86 Management Ajay Srivastava, SVP, Operating Systems and Virtualization 11:15 a.m. - 12:00 p.m.   YCBA Theater   Oracle Linux: A Cloud-Ready, Optimized Platform for Oracle Cloud Infrastructure Customer Session Speakers: Ryan Volkmann, Senior Manager IT PMO, Nidec Julie Wong, Director, Oracle Linux and Virtualization Product Management 04:45 p.m. - 05:30 p.m. Moscone South Room 152D   Understanding the Oracle Linux Cloud Native Environment Developer Session Speaker: Michele Casey, Senior Director, Oracle Linux Product Management 05:00 p.m. - 05:45 p.m. Moscone South  Room 206 Add these sessions to your schedule and don't forget to bookmark the Oracle Linux and Virtualization Program Guide for more details on these and all of our other sessions. #OOW19 and #CodeOne will provide opportunities to discover innovative technologies, get answers to your most important questions, and foster ideas with like-minded peers. Stay tuned to this blog for more information on Hands-on-Labs (HOLs) and demo areas in The Exchange in the coming days. We look forward to spending time with you at Oracle OpenWorld 2019!

Perspectives

Getting started with Oracle Linux Virtualization Manager

Oracle recently announced the general availability of Oracle Linux Virtualization Manager. This new server virtualization management platform can be easily deployed to configure, monitor, and manage an Oracle Linux Kernel-based Virtual Machine (KVM) environment with enterprise-grade performance and support from Oracle. Installing the new Manager and getting Oracle Linux KVM servers connected for your test or development environment is simple and can be done very quickly. Oracle Linux Virtualization Manager 4.2.8 can be installed from the Oracle Linux Yum Server or the Oracle Unbreakable Linux Network. The steps to get up and running from these two sites are outlined below:

Oracle Linux Yum Server
1. Install Oracle Linux 7 Update 6 on the host machine.
2. # yum install https://yum.oracle.com/repo/OracleLinux/OL7/ovirt42/x86_64/ovirt-release42.rpm
3. # yum install ovirt-engine
4. Run the engine-setup command to configure Oracle Linux Virtualization Manager.
5. Add Oracle Linux KVM Compute Hosts, Storage and Logical Networks - and then create your new Virtual Machines.

Oracle Unbreakable Linux Network
1. Install Oracle Linux 7 Update 6 on the host machine.
2. Log in to https://linux.oracle.com with your ULN user name and password.
3. On the Systems tab, click the link named for the host registered machine.
4. On the System Details page, click Manage Subscriptions.
5. On the System Summary page, subscribe to the following channels: ol7_x86_64_latest, ol7_x86_64_optional_latest, ol7_x86_64_kvm_utils, ol7_x86_64_ovirt42, ol7_x86_64_ovirt42_extras, ol7_x86_64_gluster312, ol7_x86_64_UEKR5
6. Click Save Subscriptions.
7. # yum install ovirt-engine
8. Run the engine-setup command to configure Oracle Linux Virtualization Manager.
9. Add Oracle Linux KVM Compute Hosts, Storage and Logical Networks - and then create your new Virtual Machines.

For additional information on setting up your Oracle Linux Virtualization Manager, please review the Installation Guide and the Getting Started Guide, which are available from the Oracle Linux Virtualization Manager Document Library.

Oracle Linux Virtualization Manager Support
Support for Oracle Linux Virtualization Manager is available to customers with an Oracle Linux Premier Support subscription. Refer to the Oracle Linux 7 License Information User Manual for information about Oracle Linux support levels.


Announcements

Announcing Oracle Linux 7 Update 7

Oracle is pleased to announce the general availability of Oracle Linux 7 Update 7. Individual RPM packages are available on the Unbreakable Linux Network (ULN) and the Oracle Linux yum server. ISO installation images will soon be available for download from the Oracle Software Delivery Cloud and Docker images will soon be available via Oracle Container Registry and Docker Hub.

Oracle Linux 7 Update 7 ships with the following kernel packages, which include bug fixes, security fixes and enhancements:
- Unbreakable Enterprise Kernel (UEK) Release 5 (kernel-uek-4.14.35-1902.3.2.el7) for x86-64 and aarch64
- Red Hat Compatible Kernel (RHCK) (kernel-3.10.0-1062.el7) for x86-64 only

Notable new features for all architectures

NetworkManager
- NetworkManager enables you to configure virtual LAN (VLAN) filtering on bridge interfaces, and define VLANs directly on bridge ports (a short nmcli sketch appears further down in this post).
- NetworkManager also adds the capability to configure policy routing rules by using the GUI.

Security
- Package updates for Network Security Services (NSS), scap-security-guide and shadow-utils.
- SCAP Security Guide support for Universal Base Image (UBI) containers and images. UBI containers and images can now be scanned against any profile that is shipped in the SCAP Security Guide. Rules that are inapplicable to UBI images and containers are automatically skipped.

Important changes introduced in this release
- btrfs: Starting with Oracle Linux 7 Update 4, btrfs is deprecated in RHCK. Note that btrfs is fully supported with UEK R4 and UEK R5.
- MySQL Community Packages: Starting with Oracle Linux 7 Update 5, the MySQL Community Packages are no longer included on the Oracle Linux 7 ISO. These packages are available for download from the Oracle Linux yum server and ULN.

Notable features available as a technology preview in RHCK

Systemd
- Importd features for container image imports and exports

File Systems
- Block and object storage layouts for parallel NFS (pNFS)
- DAX (Direct Access) for direct persistent memory mapping from an application for the ext4 and XFS file systems
- OverlayFS remains in technical preview

Kernel
- Heterogeneous memory management (HMM)
- No-IOMMU mode virtual I/O feature

Networking
- Cisco VIC InfiniBand kernel driver and Cisco libusnic_verbs driver for Cisco User Space Network
- Single-Root I/O virtualization (SR-IOV) in the qlcnic driver
- Cisco proprietary User Space Network Interface Controller in UCM servers provided in the libusnic_verbs driver
- Trusted Network Connect

Storage
- Multi-queue I/O scheduling for SCSI (disabled by default)
- Plug-in for the libStorageMgmt API used for storage array management

For more details about these and other new features and changes, please consult the Oracle Linux 7 Update 7 Release Notes for x86-64 and aarch64 platforms.

Oracle Linux can be downloaded, used, and distributed free of charge and all updates and errata are freely available. Customers decide which of their systems require a support subscription. This makes Oracle Linux an ideal choice for development, testing, and production systems. The customer decides which support coverage is best for each individual system while keeping all systems up to date and secure. Customers with Oracle Linux Premier Support also receive support for additional Linux programs, including Gluster Storage, Oracle Linux Software Collections, and zero-downtime kernel updates using Oracle Ksplice.
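As an illustration of the bridge VLAN filtering support mentioned under NetworkManager above, a minimal nmcli sketch might look like the following. The bridge and port device names (br0, eth1) are placeholders, and the bridge.vlan-filtering, bridge.vlan-default-pvid and bridge-port.vlans property names are the ones this feature introduces; check the nmcli documentation on your system before relying on the exact syntax.

# Sketch: VLAN filtering on a NetworkManager-managed bridge.
# br0 and eth1 are placeholder device names.

# Create a bridge with VLAN filtering enabled.
nmcli connection add type bridge ifname br0 con-name br0 \
      bridge.vlan-filtering yes bridge.vlan-default-pvid 1

# Attach a port and define its VLANs directly on the bridge port.
nmcli connection add type ethernet ifname eth1 con-name br0-port1 \
      master br0 bridge-port.vlans "10 pvid untagged, 20"

# Activate the connections.
nmcli connection up br0
nmcli connection up br0-port1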
Application Compatibility Oracle Linux maintains user space compatibility with Red Hat Enterprise Linux (RHEL), which is independent of the kernel version that underlies the operating system. Existing applications in user space will continue to run unmodified on Oracle Linux 7 Update 7 with UEK Release 5 and no re-certifications are needed for applications already certified with Red Hat Enterprise Linux 7 or Oracle Linux 7. For more information about Oracle Linux, please visit www.oracle.com/linux. Oracle Linux Resources: Documentation Oracle Linux Software Download Oracle Linux Oracle Container Registry Blogs Oracle Linux Blog Community Pages Oracle Linux Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter Data Sheets, White Papers, Videos, Training, Support & more Oracle Linux Product Training and Education Oracle Linux For community-based support, please visit the Oracle Linux space on the Oracle Developer Community.


Linux

Learn to Monitor Cloud Apps and Services with Prometheus

Want to know what is happening in your cloud-native environment at a given time? When using Prometheus to monitor your cloud workloads, you can collect and view historical metrics from configured targets to determine the moment in time when failures occur. Get an introduction to Prometheus and learn how to install and configure Prometheus through video content.

Prometheus is a monitoring system that collects metrics from configured targets at given intervals. It uses a powerful multidimensional data model with its own query language (PromQL). These features aid in discovering trends, events, and errors. The Prometheus server stores the collected metrics in a time-series database, and its query language allows for the effortless monitoring of CPU usage, memory usage, the number of HTTP requests served, and so on.

The four major components of Prometheus are:
- Gathering component for accumulating data from various systems
- Storage component for storing gathered data for future use
- Viewing component for getting information from stored data
- Alerting component for processing the stored data and triggering alerts based on certain conditions

Start your discovery today to learn more about Prometheus' architectural components and its multidimensional data model. Continue your learning on how to install, configure and test Prometheus.

Resources:
- Oracle Linux Cloud Native Environment Training
- Oracle Linux Cloud Native Environment product information
- Oracle Linux training courses
- Oracle Linux product documentation
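To make the description above concrete, here is a minimal, hypothetical Prometheus setup: a scrape configuration that polls a single target every 15 seconds, followed by an example PromQL query. The target address (a node_exporter on localhost:9100) and the job name are placeholders.

# Sketch: minimal Prometheus configuration and server start.
# localhost:9100 (a node_exporter) and the job name are placeholders.

cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s        # collect metrics from targets every 15 seconds

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
EOF

# Start the server (binary from the upstream release tarball);
# the expression browser is then available at http://localhost:9090/graph.
./prometheus --config.file=prometheus.yml &

# Example PromQL query to run there: per-second rate of non-idle CPU time
# over the last 5 minutes.
#   rate(node_cpu_seconds_total{mode!="idle"}[5m])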


Linux

Learn About Communication Between Microservices with Istio

Get started today developing your Istio service mesh expertise by learning more about implementing Istio in a Kubernetes cluster. Istio is an open source independent service mesh that provides the fundamentals you need to successfully run a distributed microservice architecture. Istio provides a uniform way to integrate microservices and includes service discovery, load balancing, security, recovery, telemetry, and policy enforcement capabilities.

An Istio service mesh is logically split into a data plane and a control plane. The data plane is composed of a set of intelligent proxies (Envoy) deployed as sidecars. This sidecar design means that communication proxies run in their own containers beside every service container. These proxies mediate and control all network communication between microservices and help ensure that communication is reliable and secure.

The control plane consists of Pilot, Mixer, Citadel, and Galley:
- Pilot enables service discovery by the proxies, provides input for proxy load balancing pools, and provides routing rules to proxies.
- Mixer collects telemetry from Envoy sidecars and provides policy checking.
- Citadel is responsible for certificate issuance and rotation.
- Galley validates and distributes configuration information within Istio.

Leverage these videos to follow technical presentations and demonstrations on how to build your Oracle Container Services for Kubernetes cluster, and how to install Istio and deploy an application with automatic proxy sidecar injection enabled.

Oracle Linux Cloud Native Environment Training
Oracle Container Services for use with Kubernetes User’s Guide
Oracle Linux Cloud Native Environment datasheet
Oracle Linux Curriculum
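A small sketch of the automatic sidecar injection mentioned above: label a namespace so Istio's admission webhook injects the Envoy proxy, then deploy a workload into it. The namespace, deployment and image names are placeholders.

# Sketch: automatic Envoy sidecar injection.
# The namespace, deployment and image names are placeholders.

# Label a namespace so the Istio webhook injects sidecars automatically.
kubectl create namespace demo
kubectl label namespace demo istio-injection=enabled

# Deploy a workload into that namespace.
kubectl -n demo create deployment hello --image=nginx

# Each pod should report 2/2 containers: the application plus its Envoy proxy.
kubectl -n demo get pods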


Linux

Learn to Drive Efficient Deployments with Kata Containers

Kata containers help you drive efficiency for your container deployments. With Oracle Linux Cloud Native Environment training, we bring you a series of free, short videos to help get you started with implementing Kata container technologies in your Kubernetes cluster with Oracle Linux.

Supporting Open Container Initiative compatible containers, Kata allows you to efficiently deploy containers in lightweight virtual machines (VMs) that deliver performance and security with less overhead than standard container deployments. Kata container lightweight VMs look and operate like regular containers, but they do not share the host's underlying kernel: hardware virtualization is used to run each container in its own VM with its own kernel. Kata on a Kubernetes cluster is easy to install and set up, bringing solid performance and security returns to your Oracle Linux infrastructure investment.

Get started today with developing your Kata container expertise by learning more about implementing Kata. Follow these links to find technical videos on how to build your Kubernetes cluster, and how to install and set up Kata on your cluster worker nodes.

Oracle Linux Cloud Native Environment Training
Oracle Container Services for use with Kubernetes User’s Guide
Oracle Linux Curriculum
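As a sketch of what running a workload under Kata looks like on a Kubernetes cluster, the snippet below defines a RuntimeClass and a pod that requests it. The handler name "kata" is an assumption -- use the handler your container runtime is actually configured with for Kata Containers -- and the pod and image names are placeholders.

# Sketch: run a pod in a Kata lightweight VM via a Kubernetes RuntimeClass.
# The handler name "kata" is an assumption; use the handler configured
# for Kata in your container runtime. Pod and image names are placeholders.

cat <<'EOF' | kubectl apply -f -
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
---
apiVersion: v1
kind: Pod
metadata:
  name: kata-demo
spec:
  runtimeClassName: kata      # run this pod inside a lightweight VM
  containers:
  - name: app
    image: nginx
EOF

# Because the pod runs its own kernel, its kernel version can differ from the host's.
kubectl exec kata-demo -- uname -r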


Linux

First Step on Your Oracle Linux System Admin Learning

Get started on your Linux learning with Oracle Linux System Administration I. This course offers extensive hands-on experience including installing the Oracle Linux operating system, configuring basic Linux services, preparing a system for the Oracle database, and monitoring and troubleshooting a running Oracle Linux system. Important to sys admins, this course provides students with the skills to handle networking, storage, security, monitoring, troubleshooting and more. Students are also introduced to Oracle Cloud Infrastructure and learn how to create an Oracle Linux instance on the cloud, set up a Virtual Cloud Network (VCN), and attach a block volume to an Oracle Linux instance on the cloud.

By taking Oracle Linux System Administration I, students learn to:
- Install the Oracle Linux 7 operating system
- Configure a system to use the Unbreakable Enterprise Kernel (UEK)
- Set up users and groups
- Configure networking and storage devices
- Update a system using Oracle's Unbreakable Linux Network (ULN)
- Use Ksplice technology to update the kernel on a running system
- And many more

(A few of these tasks are sketched as shell commands at the end of this post.)

This Oracle Linux System Administration I course is the first of 3 new Oracle Linux System Administration courses, so you can continue your learning with:
- Oracle Linux System Administration II
- Oracle Linux System Administration III

Resources:
- Oracle Linux curriculum
- Oracle Linux product documentation
- Linux on Oracle Cloud Infrastructure learning path
- Oracle Linux Cloud Native Environment learning path
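As a taste of the course content, here are a few of the tasks listed above sketched as shell commands on an Oracle Linux 7 system. The user, group, device and mount point names are placeholders, and the Ksplice step assumes the system is registered for Ksplice updates.

# Sketch of a few administration tasks covered by the course.
# devgrp, devuser, /dev/sdb1 and /data are placeholder names.

# Create a group and a user that belongs to it.
groupadd devgrp
useradd -m -G devgrp devuser
passwd devuser

# Create a filesystem on a spare block device and mount it.
mkfs.xfs /dev/sdb1
mkdir -p /data
mount /dev/sdb1 /data

# Apply Ksplice updates to the running kernel (system must be Ksplice-registered).
uptrack-upgrade -y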


Announcements

Announcing the Release of Oracle Linux 8

Oracle is pleased to announce the general availability of Oracle Linux 8. With Oracle Linux 8, the core operating environment and associated packages for a typical Oracle Linux 8 server are distributed through a combination of BaseOS and Application Streams. BaseOS gives you a running user space for the operating environment. Application Streams provides a range of applications that were previously distributed in Software Collections, as well as other products and programs, that can run within the user space.

Notable new features in this release

Oracle Linux 8 introduces numerous enhancements and new features. Highlights include:

Application Streams
- Oracle Linux 8 introduces the concept of Application Streams, where multiple versions of user space components can be delivered and updated more frequently than the core operating system packages. Application Streams contain the necessary system components and a range of applications that were previously distributed in Software Collections, as well as other products and programs. A list of Application Streams supported on Oracle Linux 8 is available here.

System Management
- Dandified Yum, a new version of the yum tool based on DNF technology, is a software package manager that installs, updates, and removes packages on RPM-based Linux distributions
- Cockpit, an easy-to-use, lightweight and simple yet powerful remote manager for GNU/Linux servers, is an interactive server administration interface that offers a live Linux session via a web browser

RPM Improvements
- Oracle Linux 8 ships with version 4.14 of RPM, which introduces many improvements and support for several new features

Installation, Boot and Image Creation
- The inst.addrepo=name boot parameter has been added to the installer. You can use this parameter to specify an additional repository during an installation
- By default, the Oracle Linux 8 installer uses the LUKS2 (Linux Unified Key Setup 2) disk encryption format

Kernel
- The modinfo command has been updated to recognize and display signature information for modules that are signed with CMS and PKCS#7 formatted signatures
- A set of kernel modules has been moved to the kernel-modules-extra package, which means none of these modules is installed by default; as a consequence, non-root users cannot load these components, as they are also blacklisted by default
- Memory bus limits have been extended to 128 PiB of virtual address space and 4 PB of physical memory capacity. The I/O memory management unit (IOMMU) code in the Linux kernel is also updated to enable 5-level paging tables
- The early kdump feature enables the crash kernel and initramfs to load early so that they can capture vmcore information, including early kernel crashes

Containers and Virtualization
- New container tools: podman, buildah and skopeo, compatible with the Open Container Initiative (OCI), are now available with Oracle Linux 8. These tools can be used to manage the same Linux containers that are produced and managed by Docker and other compatible container engines (a short dnf and podman sketch appears at the end of this post).
- Support for the Q35 machine type, a more modern PCI Express-based machine type, is now available for KVM
- Additional information is included in KVM guest crash reports, which makes it easier to diagnose and fix problems when using KVM virtualization

Filesystem and Storage
- Enhanced Device Mapper Multipathing
- The SCSI Multiqueue driver enables block layer performance to scale well with fast solid-state drives (SSDs) and multi-core systems
- Stratis, an easy solution to manage local storage
- XFS support for shared copy-on-write (COW) data extents, whereby two or more files can share a common set of data blocks. This feature is similar to the COW functionality found in other file systems: if either of the files sharing common blocks changes, XFS breaks the link to those common blocks and then creates a new file

Identity Management
- Several major identity management (IdM) features and enhancements, including session recording, enhanced Microsoft AD integration and a new password syntax check
- IdM server and client packages are distributed as a module; the IdM server module stream is called the DL1 stream and it contains multiple profiles (server, dns, adtrust, client, and default)

Networking
- The iptables network packet filtering framework has been replaced with nftables; the nftables framework includes packet classification facilities, several improvements and provides improved performance
- iptables-translate and ip6tables-translate commands are now available to convert existing rules to their nftables equivalents, thereby facilitating the move to Oracle Linux 8 (see the short sketch after the Support section below)
- The IPVLAN virtual network driver enables network connectivity for multiple containers by exposing a single MAC address to the local network
- The networking stack, including UDP and TCP, has been updated to release 4.18 with improved performance

Security
- OpenSSH updated to release 7.8p1, enhancing access security
- LUKS2 (Linux Unified Key Setup) is now the default format for encrypted volumes
- OpenSCAP has been updated to release 1.3.0, with improvements to the command-line interface as well as consolidation of the OpenSCAP API
- SELinux now supports the map permission feature, to help prevent direct memory access to various file system objects, and introduces new SELinux booleans
- Transport Layer Security (TLS) 1.3 is enabled by default in major back-end cryptographic libraries

Support

Oracle Linux can be downloaded, used, and distributed free of charge and updates and errata are freely available. Customers decide which of their systems require a support subscription. This makes Oracle Linux an ideal choice for development, testing, and production systems. The customer decides which support coverage is best for each individual system while keeping all systems up-to-date and secure. Customers with Oracle Linux Premier Support also receive support for additional Linux programs, including zero-downtime kernel updates using Oracle Ksplice, Oracle Linux Virtualization Manager and Oracle Linux Cloud Native Environment. Oracle Linux Premier Support is included with Oracle Cloud Infrastructure subscriptions at no additional cost.
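Here is the sketch referenced in the Networking list above: translating legacy iptables rules to nftables. The individual rule and the saved-ruleset path are arbitrary examples.

# Sketch: moving from iptables to nftables with the translation tools.
# The rule and the saved-ruleset path are arbitrary examples.

# Print the nft equivalent of a single iptables rule without applying it.
iptables-translate -A INPUT -p tcp --dport 22 -j ACCEPT

# Translate a whole saved ruleset in one pass, then review and load it.
iptables-restore-translate -f /etc/sysconfig/iptables > ruleset.nft
nft -f ruleset.nft
nft list ruleset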
Further information Oracle Linux 8 installation software is now available as: ISO from the Oracle Software Delivery Cloud for x86-64 architecture Individual RPM packages via the Unbreakable Linux Network (ULN) and the Oracle Linux Yum Server Developer Preview ISO from Oracle Linux on Oracle Technology Network for aarch64 architecture Additional Oracle Linux 8 software options are also available on: Docker images via Oracle Container Registry and Docker Hub Platform image on Oracle Cloud Infrastructure Marketplace Oracle Linux 8 ships with the Red Hat Compatible Kernel (RHCK) kernel package kernel-4.18.0-80.el8. It is tested as a bundle, as shipped on the installation media image. The Unbreakable Enterprise Kernel (UEK), which is being built from a more current upstream kernel version, is undergoing final development. Oracle Linux 8 offers developers the opportunity to get started with 8.0 capabilities as well as get updates for free. Oracle Linux maintains binary compatibility with Red Hat Enterprise Linux (RHEL), which is independent of the kernel version that underlies the operating environment. Existing applications in user space will continue to run unmodified on Oracle Linux 8 and no re-certifications are needed for applications already certified with Red Hat Enterprise Linux 8.  Resources Documentation Oracle Linux Oracle Linux 8 Release Notes Software Download Oracle Linux download instructions Oracle Software Delivery Cloud Oracle Container Registry Community Pages Oracle Linux Community Space Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter Data Sheets, White Papers, Videos, Training, Support & more Oracle Linux Product Training and Education Oracle Linux
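As referenced earlier in this post, here is a short sketch of Application Streams, Cockpit and the new container tools in action on an Oracle Linux 8 host. The postgresql:10 module stream is used purely as an example stream name, and the container image path shown is one common location for the Oracle Linux 8 image.

# Sketch: Application Streams, Cockpit and podman on Oracle Linux 8.
# postgresql:10 is only an example stream name.

# List the streams of a module and install one of them.
dnf module list postgresql
dnf -y module install postgresql:10

# Install and enable the Cockpit web console (listens on port 9090).
dnf -y install cockpit
systemctl enable --now cockpit.socket

# Run an Oracle Linux 8 container with podman instead of docker.
podman run --rm docker.io/library/oraclelinux:8 cat /etc/oracle-release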


Linux Kernel Development

Improve Security with Address Space Isolation (ASI)

Address Space Isolation

In this blog post, Oracle Linux kernel developers Alexandre Chartre and Konrad Rzeszutek Wilk give an update on the Spectre v1 and L1TF software solutions.

Introduction

In August of 2018 the L1TF speculative execution side channel vulnerabilities were presented (see Foreshadow – Next Generation (NG)). However the story is more complicated. In particular, an explanation in L1TF - L1 Terminal Fault mentions that if hyper-threading is enabled, and the host is running an untrusted guest, there is a possibility of one thread snooping the other thread. Also the recent Microarchitectural Data Sampling vulnerabilities, aka Fallout, aka RIDL, aka ZombieLoad, demonstrated that there are more low-level hardware resources that are shared between hyperthreads.

Guests on Linux are treated the same way as any application in the host kernel, which means that the Completely Fair Scheduler (CFS) does not distinguish whether guests should only run on specific threads of a core. Patches for core scheduling provide such a capability but, unfortunately, their performance is rather abysmal and as Linus mentions:

Because performance is all that matters. If performance is bad, then it's pointless, since just turning off SMT is the answer.

However turning off SMT (hyperthreading) is not a luxury that everyone can afford. But then, with hyperthreading enabled, a malicious guest running on one hyperthread can snoop on the other hyperthread if the host kernel is not hardened.

Details of data leak exploit

There are two pieces of exploitation technology combined:
- Spectre v1 code gadgets - that is, any code in the kernel that accesses user-controlled memory under speculation
- A malicious guest which is executing the actual L1TF attack

In fact a Proof of Concept has been posted, RFC x86/speculation: add L1 Terminal Fault / Foreshadow demo, which does exactly that. The reason this is possible is that hyperthreads share CPU resources, and a well-timed attack can occur between the time we exit into the hypervisor and go back to running the guest:

1. Thread #0 performs an operation that requires the help of the hypervisor, such as cpuid.
2. Thread #1 spins in its attack code without invoking the hypervisor.
3. Thread #0 pulls data into the cache using a Spectre v1 code gadget.
4. Thread #1 measures CPU resources to leak speculatively accessed data.
5. Thread #0 flushes the cache and then resumes executing the guest.

This is how an attacker can leak kernel data - using a combination of Spectre v1 code gadgets and an L1TF attack in the little VMEXIT windows that a guest can force.

Solutions

As mentioned, disabling hyperthreading automatically solves the security problem, but that may not be a solution as it halves the capacity of a cluster of machines. All the solutions revolve around the idea of allowing code gadgets to exist, but they would either not be able to execute in the speculative path, or they can execute but are only able to collect non-sensitive data.

The first solution that comes to mind is: can we inhibit the secondary thread from executing code gadgets? One naive approach is to simply always kick the other sibling whenever we enter the kernel (or hypervisor) and have the other sibling spin until we are done in a safe space. Not surprisingly the performance was abysmal.
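Before looking at the other proposed solutions, note that the kernel's view of these vulnerabilities, and the current SMT state, can be inspected (and SMT disabled) from sysfs. This is a sketch; the smt/control file is only present on kernels that include the L1TF mitigation work.

# Sketch: inspect speculative-execution mitigation state and control SMT.
# The smt/control file requires a kernel with the L1TF mitigation work.

cat /sys/devices/system/cpu/vulnerabilities/l1tf
cat /sys/devices/system/cpu/vulnerabilities/mds

# Current SMT state: on, off, forceoff or notsupported.
cat /sys/devices/system/cpu/smt/control

# Disable hyperthreading at runtime (halves the number of logical CPUs).
echo off > /sys/devices/system/cpu/smt/control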
Several other solutions that follow this path have been proposed, including:
- Co-scheduling - modifying the scheduler so that tasks of the same group run on the same core, executing simultaneously, whenever they are executed
- Core-scheduling - a different implementation but the same effect.

Both of those follow the same pattern - lock-step entering the kernel (or hypervisor) when needed on both threads. This mitigates the Spectre v1 issue because the guest or user space program is not able to leverage it - but it comes with unpleasant performance characteristics (on some workloads worse performance than turning hyperthreading off).

Another solution includes proactively patching the kernel for Spectre v1 code gadgets, along with meticulous nanny-sitting of the scheduler to never schedule one customer's guests on siblings shared with another customer's guests, and other low-level mitigations not explained in this blog. However that solution also does not solve the problem of the host kernel being leaked using the Spectre v1 code gadgets and L1TF attack (see _Details of data leak exploit_ above).

But what if we just remove sensitive data from being mapped into that virtual address space to begin with? This would mean that even if the code gadgets were found, they would never be able to bridge the gap to the attacker-controlled signal array.

eXclusive Page Frame Ownership (XPFO)

One idea that has been proposed in order to reduce sensitive data is to remove from kernel memory pages that solely belong to a userspace process and that the kernel doesn't currently need to access. This idea is implemented in a patch series called XPFO that can be found here: Add support for eXclusive Page Frame Ownership, earlier explained in 2016 in Exclusive page-frame ownership and the original author's patches Add support for eXclusive Page Frame Ownership (XPFO).

Unfortunately this solution does not help with protecting the hypervisor from having data leaked; it just protects user space data. For guest-to-guest protection that is enough - even if a naughty guest caused the hypervisor to speculatively execute Spectre v1 code gadgets along with spilling the hypervisor data using an L1TF attack, the hypervisor at that point has only the naughty guest mapped on the core and not the other guests' memory on the same core - so only hypervisor data and other guests' vCPU register data are leaked. "Only" is not good enough - we want better security. And if one digs in deeper there are also some other issues, such as a non-trivial performance hit as a result of TLB flushes, which make it slower than just disabling hyperthreading. Also, if vhost is used then XPFO does not help at all, as each guest vhost thread ends up mapping the guest memory in the kernel virtual address space and re-opening the can of worms.

Process-local memory allocations

Process-local memory allocations (v2) addresses this problem a bit differently - mainly, each process has a kernel virtual address space slot (local, or more of a secret) in which the kernel can squirrel away sensitive data on behalf of the process. The patches focus only on one module (kvm), which would save guest vCPU registers in this secret area. Each guest is considered a separate process, which means that each guest is precluded from touching the other guests' secret data. The "goal here is to make it harder for a random thread using cache load gadget (usually a bounds check of a system call argument plus array access suffices) to prefetch interesting data into the L1 cache and use L1TF to leak this data."
However there are still issues - all of the guests' memory is globally mapped inside the kernel, and the kernel memory itself can still be leaked to the guest. This is similar to XPFO in that it is a black-list approach - we decide on specific items in the kernel virtual address space and remove them. And it falls short of what XPFO does (XPFO removes the guest memory from the kernel address space). Combining XPFO with Process-local memory allocations would provide much better security than using them separately.

Address Space Isolation

Address Space Isolation is a new solution which isolates restricted/secret and non-secret code and data inside the kernel. This effectively introduces a firewall between sensitive and non-sensitive kernel data while retaining the performance (we hope). This design is inspired by the Microsoft Hyper-V HyperClear Mitigation for L1 Terminal Fault. Liran Alon, who sketched out the idea, thought about it as follows:

The most naive approach to prevent the SMT attack vector is to force sibling hyperthreads to exit every time one hyperthread exits. But it introduces an impractical perf hit. Therefore, the next thought was to just remove what could be leaked to begin with. We assume that everything that could be leaked is something that is mapped into the virtual address space that the hyperthread is executing in after it exits to the host, because we assume that leakable CPU resources are only loaded with sensitive data from the virtual address space. This is an important assumption. Going forward with this assumption, we need techniques to remove sensitive information from the host virtual address space. The XPFO and Kernel-Process-Local-Memory patch series go with a black-list approach to explicitly remove specific parts of the virtual address space which we consider to have sensitive information. The problem with this approach is that we may be missing something, and therefore a white-list approach is preferred. At this point, after being inspired by Microsoft HyperClear, KVM ASI came about. The unique distinction about KVM ASI is that it creates a separate virtual address space for most of the exits to the host that is built with a white-list approach: we only map the minimum information necessary to handle these exits and do not map sensitive information. Some exits may require more, or sensitive, information, and in those cases we kick the sibling hyperthreads and switch to the full address space.

Details of Address Space Isolation

QEMU and the KVM kernel module work together to manage a guest, and each guest is associated with a QEMU process. From userspace, QEMU uses the KVM_RUN ioctl (#1 and #2) to request KVM to run the VM (#3) from the kernel using Intel Virtual Machine Extensions (VMX). When an event causes the VM to return (VM-Exit, step #4) to KVM, KVM handles the VM-Exit (#5) and then transfers control to the VM again (VM-Enter).

However, most of the KVM VM-Exit handlers only need to access per-VM structures and KVM/vmlinux code and data that is not sensitive. Therefore, these KVM VM-Exit handlers can be run in an address space different from the standard kernel address space. So, we can define a KVM address space, separated from the kernel address space, which only needs to map the code and data required for running these KVM VM-Exit handlers (#5). This provides a white-list approach of exactly what could be leaked while running the KVM VM-Exit code.
When the KVM VM-Exit code (#5a) reaches a point where it does architecturally need to access sensitive data (which is therefore not mapped in this isolated virtual address space), it will kick all sibling hyperthreads out of the guest and switch to the full kernel address space. This kicking guarantees that there is no untrusted guest code running on sibling hyperthreads while KVM is bringing data into the L1 cache with the full kernel address space mapped. This overall operation happens, for example, when KVM needs to return to QEMU or the host needs to run an interrupt handler. Note that KVM flushes the L1 cache before VM-Enter back to running guest code to ensure nothing is leaked via the L1 cache back to the guest.

In effect, we have made the KVM module a less privileged kernel module. That has three fantastic side-effects:
- The guest already knows about the guest data on which KVM operates most of the time, so if it is leaked to the guest that is okay.
- If the attacker does exploit a code gadget, it will only be able to run in the KVM module address space, not outside of it. A nice side effect of ASI is that it can also assist against ROP exploitation and architectural (not speculative) info-leak vulnerabilities, because much less information is mapped in the exit handler virtual address space.
- If the KVM module needs to access restricted data or routines, it needs to switch to the full kernel page-table, and also bring the other sibling back to the kernel so that the other thread will be unable to insert code gadgets and slurp data in.

Show me the code?!

The first version, posted back in May, RFC KVM 00/27 KVM Address Space Isolation, received many responses from the community. These patches - RFC v2 00/27 Kernel Address Space Isolation, posted by Alexandre Chartre - are the second step in this. The patches are posted as a Request For Comments which solicits guidance from the Linux Kernel community on how they would like this to be done. The framework is more generic, with the first user being KVM, but could very well be extended to other modules.

Thanks

We would also like to thank the following folks for help with this article: Mark Kanda, Darren Kenny, Liran Alon, Bhavesh Davda


Announcements

OpenSSL Cryptographic Module for Oracle Linux 7.5 and 7.6 Received FIPS 140-2 Certification

The OpenSSL cryptographic module for Oracle Linux 7.5 and 7.6 has just received FIPS 140-2 Level 1 certification. This is the first completed FIPS 140-2 certification with the latest Oracle Linux 7.6 update, ahead of any other Linux distribution. This certification adds to recent, related certifications and advancements, which enable Oracle Linux to deliver more security features that can help keep systems secure and improve the speed and stability of your operations on premises and in the cloud.

Conformance with the FIPS 140-2 standard provides assurance to government and industry purchasers that products are correctly implementing cryptographic functions as the FIPS 140-2 standard specifies. FIPS 140-2 is a public sector procurement requirement in both the United States and Canada for any products claiming or providing encryption. The FIPS 140-2 program is jointly administered by the National Institute of Standards and Technology (NIST) in the US and the Canadian Centre for Cyber Security (CCCS) in Canada. The joint program is called the CMVP (Cryptographic Module Validation Program).

The platforms that are used for Oracle Linux 7.5 and 7.6 OpenSSL cryptographic module FIPS 140 validation testing include Oracle Server X7-2, running Oracle Linux 7.5 and 7.6. Oracle “vendor affirms” that the FIPS validation is maintained on other x86-64 equivalent hardware that has been qualified, per the Oracle Linux Hardware Certification List (HCL), on the corresponding Oracle Linux releases.

Oracle Linux cryptographic modules enable FIPS 140 compliant operations for key use cases such as data protection and integrity, remote administration, cryptographic key generation, and key/certificate management. The packages that are FIPS 140-2 level 1 certified for Oracle Linux 7 can be obtained from the Oracle Linux yum server. When the packages are installed, you can enable FIPS mode by following the Oracle Linux 7 Documentation.

Oracle Linux is engineered for open cloud infrastructure. It delivers leading performance, scalability, reliability, and security for enterprise SaaS and PaaS workloads, as well as traditional enterprise applications. Oracle Linux Support offers access to award-winning Oracle support resources and Linux support specialists, zero-downtime updates using Ksplice, additional management tools such as Oracle Enterprise Manager and lifetime support, all at a low cost. Unlike many other commercial Linux distributions, Oracle Linux is easy to download and completely free to use, distribute, and update. The Oracle Linux images that are available on Oracle Cloud Infrastructure are updated frequently to provide access to the latest security updates, and Oracle Linux Premier Support is provided at no additional cost to Oracle Cloud Infrastructure subscribers.

For a matrix of Oracle security evaluations that are currently in progress, as well as those completed, please refer to Oracle Security Evaluations. Visit Oracle Linux Security to learn how Oracle Linux can help keep your systems secure and improve the speed and stability of your operations.
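For reference, a condensed sketch of the commonly documented way to enable and verify FIPS mode on Oracle Linux 7 follows; treat the Oracle Linux 7 documentation mentioned above as authoritative, since details such as a separate /boot partition add extra steps not shown here.

# Sketch: enable and verify FIPS mode on Oracle Linux 7.
# Condensed from the commonly documented procedure; the Oracle Linux 7
# documentation is authoritative (a separate /boot partition needs an
# additional boot=<device> kernel argument, not shown here).

yum -y install dracut-fips
dracut -f                                   # rebuild the initramfs with FIPS support
grubby --update-kernel=ALL --args=fips=1    # add fips=1 to the kernel command line
reboot

# After the reboot, a value of 1 confirms the kernel is in FIPS mode.
cat /proc/sys/crypto/fips_enabled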


Oracle Sponsors KubeCon + CloudNativeCon + Open Source Summit China 2019

Oracle is a committed and active member of the Linux community and is a gold sponsor of KubeCon + CloudNativeCon + Open Source Summit China 2019 (Shanghai, June 24-26, 2019). A founding platinum member of The Linux Foundation® and also a platinum member of Cloud Native Computing Foundation® (CNCF®), Oracle is dedicated to the worldwide success of Linux for organizations of all sizes and across all industries. Oracle continues to expand its commitment to open source and cloud native solutions targeted at helping move enterprise workloads to the cloud. At KubeCon + CloudNativeCon Europe 2019 in Barcelona last month, Oracle announced Oracle Cloud Infrastructure Service Broker for Kubernetes and highlighted a recent set of Oracle open source solutions that facilitate enterprise cloud migrations including Helidon, GraalVM, Fn Project, MySQL Operator for Kubernetes, and WebLogic Operator for Kubernetes. Oracle is enabling enterprise developers to embrace cloud native culture and open source and make it easier to move enterprise workloads to the cloud. That includes everyone, from database application teams, to Java developers, to WebLogic system engineers, to Go, Python, Ruby, Scala, Kotlin, JavaScript, Node.js developers and more. For example, the Oracle Cloud Developer Image provides a comprehensive development platform on Oracle Cloud Infrastructure that includes Oracle Linux, Oracle Java SE support, Terraform, and many SDKs.  It reduces the time it takes to get started on Oracle’s cloud infrastructure and makes it fast and easy, just a matter of minutes, to provision and run Oracle Autonomous Database. Operating systems, containers, and virtualization are the fundamental building blocks of modern IT infrastructure. Oracle combines them all into one integrated open source offering: Oracle Linux. Operating on your choice of hardware—in your data center or in the cloud—Oracle Linux provides the reliability, scalability, security, and performance for demanding enterprise and cloud workloads. We are pleased to share, below, the latest Oracle Linux developments and releases that can help accelerate your digital transformation. With Oracle Linux, you have a complete DevOps environment which is modern, optimized, and secure and is designed for hybrid and multi-cloud deployments at enterprise scale. Oracle Linux Cloud Native Environment—This curated set of open source software is selected from CNCF projects. Recently, the technology preview of Oracle Container Runtime for Kata was released, which aims to further protect cloud native, container-based microservices, by leveraging the security and isolation provided by virtual machines. Updates have been made to Oracle Container Runtime for Docker and Oracle Container Services for use with Kubernetes. Additionally, many Oracle software products are available as Docker container images that can be downloaded from Oracle Container Registry, and you can download Dockerfiles and samples from GitHub to build your own Docker container images for Oracle software. Unbreakable Enterprise Kernel (UEK) Release 5 Update 2—Available on Intel and AMD (x86_64) and Arm (aarch64) platforms, UEK Release 5 Update 2 for Oracle Linux 7 is based on the mainline kernel version 4.14.35 and includes several new features, added functionality, and bug fixes across a range of subsystems. 
Oracle Linux Virtualization Manager—This new server virtualization management platform can be easily deployed to configure, monitor, and manage an Oracle Linux Kernel-based Virtual Machine (KVM) environment with enterprise-grade performance and support from Oracle. Based on the open source oVirt project, Oracle Linux Virtualization Manager allows enterprise customers to continue supporting their on-premises data center deployments with the KVM hypervisor already available on Oracle Linux 7.6 with the Unbreakable Enterprise Kernel Release 5. Oracle Linux KVM is a feature that has been delivered and supported as part of Oracle Linux for some time. With the release of the UEK Release 5, the Oracle Linux server virtualization solution with KVM has been enhanced. Oracle Linux KVM is the same hypervisor used in Oracle Cloud Infrastructure, giving users an easy migration path to move workloads into Oracle Cloud in the future. Gluster Storage Release 5 for Oracle Linux 7—Gluster is a scalable, distributed file system that aggregates disk storage resources from multiple servers into a single global namespace. The new Gluster Storage Release 5 for Oracle Linux 7, based on the stable release of the upstream Gluster 5, brings customers higher performance, new storage capabilities and improved management. Security and Compliance—Oracle Linux is one of the most secure operating environments. Oracle Linux 7 has just received both a Common Criteria (CC) Certification which was performed against the National Information Assurance Partnership (NIAP) General Purpose Operating System Protection Profile (OSPP) v4.1 as well as a FIPS 140-2 validation of its cryptographic modules. Oracle Linux is currently one of only two operating systems—and the only Linux distribution—on the NIAP Product Compliant List. AMD Secure Memory Encryption—Oracle Linux 7 with UEK Release 5 enables hardware-accelerated memory encryption for data-in-use protection, such as Secure Memory Encryption (SME) for bare metal servers and Secure Encrypted Virtualization (SEV) for virtual machines, available on AMD EPYC processor-based systems. In particular, the SEV capability encrypts the memory of KVM guests so that the hypervisor can’t see the memory even when dumped. Zero-Downtime Patching with Oracle Ksplice—With Oracle Ksplice, you can immediately apply security patches (hypervisor, kernel, and user space) without impacting production environments—and without rebooting. When patching systems with the new Ksplice feature, Known Exploit Detection, not only is the security vulnerability closed, but tripwires are laid down for privilege escalation vulnerabilities. This means that if an attacker attempts to exploit a CVE that was patched, Ksplice notifies you. Moreover, Ksplice Known Exploit Detection will work from inside a container. If a container attempts to exploit a privilege escalation vulnerability, Ksplice will notify at the host level. This, combined with Kata Containers and AMD SEV for secure memory, provides strong protection for running containers. Ksplice zero-downtime patching support is provided to Oracle Cloud Infrastructure subscribers at no additional cost, for Oracle Linux instances, and is also available for Red Hat Enterprise Linux and CentOS instances deployed on Oracle Cloud Infrastructure. To get started, Oracle Linux is freely available—to download, use, and distribute—at Oracle Software Delivery Cloud. Updates can be obtained from Oracle Linux yum server. 
Additionally, Oracle VM VirtualBox, the most popular cross-platform virtualization software for development environments, can be downloaded on your desktop to run Oracle Linux and the cloud native software covered above, allowing you to easily deploy to the cloud. By using Vagrant boxes for Oracle software on GitHub, you have a more streamlined way to create virtual machines with Oracle software fully configured and ready to go inside of them. Oracle is offering up to 3,500 free hours on Oracle Cloud to developers that would like to use our cloud for their development environment. To learn more about Oracle Linux at KubeCon + CloudNativeCon + Open Source Summit China 2019, attend this session (June 25) and visit the Oracle booth.


Linux Kernel Development

The Power of XDP

The Power of XDP

Oracle Linux kernel developer Alan Maguire talks about XDP, the eXpress Data Path, which uses BPF to accelerate packet processing. For more background on BPF, see the series on BPF, wherein he presented an in-depth look at the kernel's "Berkeley Packet Filter" -- a useful and extensible kernel function for much more than packet filtering.

[Important note: the BPF blog series referred to BPF functionality available in the 4.14 kernel. The functionality described here is for the most part present in that kernel also, but a few of the libbpf functions used in the example program and the layout of the xdp_md metadata structure have changed, and here we refer to the up-to-date (as of the 5.2 kernel) versions.]

In previous blog entries I gave a general description of BPF and applied BPF concepts to building tc-bpf programs. In that case, such programs are attached to tc ingress and egress hooks and can carry out packet transformation and other activities there. However, such processing happens after the packet metadata - in Linux this is a "struct sk_buff" - has been allocated. As such there are earlier intervention points where BPF could operate. The goal of XDP is to offer comparable performance to kernel bypass solutions while working with the existing kernel networking stack. For example, we may drop or forward packets directly using XDP, or perhaps simply pass them through the network stack for normal processing.

XDP metadata

As mentioned in the first article of the BPF series, XDP allows us to attach BPF programs early in packet receive codepaths. A key focus of the design is to minimize overheads, so each packet uses a minimal metadata descriptor:

/* user accessible metadata for XDP packet hook
 * new fields must be added to the end of this structure
 */
struct xdp_md {
        __u32 data;
        __u32 data_end;
        __u32 data_meta;
        /* Below access go through struct xdp_rxq_info */
        __u32 ingress_ifindex;  /* rxq->dev->ifindex */
        __u32 rx_queue_index;   /* rxq->queue_index */
};

Contrast this to the struct sk_buff definition as described here: https://www.netdevconf.org/2.2/slides/miller-datastructurebloat-keynote.pdf Each sk_buff requires an allocation of at least 216 bytes of metadata. This translates into observable performance costs.

XDP program execution

XDP comes in two flavours:

- Native XDP requires driver support, and packets are processed before sk_buffs are allocated. This allows us to realize the benefits of a minimal metadata descriptor. The hook comprises a call to bpf_prog_run_xdp, and after calling this function the driver must handle the possible return values - see below for a description of these. As an example, the bnxt_rx_pkt function calls bnxt_rx_xdp, which in turn verifies if an XDP program has been loaded for the RX ring, and if so sets up the metadata buffer and calls bpf_prog_run_xdp. bnxt_rx_pkt is called directly from device polling functions and so is called via net_rx_action for both interrupt processing and polling; in short we are getting our hands on the packet as soon as possible in the receive codepath.

- Generic XDP, where the XDP hooks are called from within the networking stack after the sk_buff has been allocated. Generic XDP allows us to use the benefits of XDP - though at a slightly higher performance cost - without underlying driver support. In this case bpf_prog_run_xdp is called via netdev's netif_receive_generic_xdp function; i.e. after the skb has been allocated and set up. To ensure that XDP processing works, the skb has to be linearized (made contiguous rather than chunked in data fragments) - again this can cost performance.
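For quick experiments, a compiled XDP object can be attached with iproute2, which also lets you choose between the two flavours just described. The object file name, section name and interface below are placeholders.

# Sketch: attach a compiled XDP object with iproute2.
# prog.o, the section name "xdp" and eth0 are placeholders.

# Native XDP (driver hook; needs driver support):
ip link set dev eth0 xdpdrv obj prog.o sec xdp

# Generic XDP (stack hook; works without driver support, lower performance):
ip link set dev eth0 xdpgeneric obj prog.o sec xdp

# Inspect the attachment and detach it again.
ip -details link show dev eth0
ip link set dev eth0 xdp off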
XDP actions

XDP programs can signal a desired behaviour by returning:

- XDP_DROP: drops with XDP are fast, the buffers are just recycled to the rx ring queue
- XDP_PASS: pass to the normal networking stack, possibly after modification
- XDP_TX: send the packet out of the same NIC it arrived on, after modifying it
- XDP_REDIRECT: using the XDP_REDIRECT action from an XDP program, the program can redirect ingress frames to another XDP-enabled netdev

Adding support for XDP to a driver requires adding the receive hook calling bpf_prog_run_xdp and handling the various outcomes, and adding setup/teardown functions which dedicate buffer rings to XDP.

An example - xdping

From the above set of actions, and the desire to minimize per-packet overhead, we can see that use cases such as Distributed Denial of Service mitigation and load balancing make sense. To help illustrate the key concepts in XDP, here we present a fully-worked example of our own. This example is available in recent bpf-next kernels; see

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/xdping.c
...for the userspace program;

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/xdping.h
...for the shared header; and

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/xdping_kern.c
...for the BPF program.

xdping is a C program that uses XDP, BPF maps and the ping program to measure round-trip times (RTT) in a similar manner to ping, but with xdping we measure round-trip time from XDP itself, instead of invoking all the additional layers of IP, ICMP and user-space-to-kernel interactions. The idea is that by presenting round-trip times as measured in XDP versus those measured via a traditional ping we can

- see how much processing traffic in XDP directly can save us in terms of response latency
- eliminate variations in RTT due to the additional processing layers

xdping can operate in either client or server mode. As a client, it is responsible for generating ICMP requests and receiving ICMP replies, measuring the RTT and saving the result in a BPF map. It does this by receiving a ping-generated ICMP reply, turning that back into an ICMP request, noting the time and sending it. When the reply is received, the RTT can be calculated. As a server, it is responsible for receiving ICMP requests and turning them back into replies.

Note that the above approach is necessary because XDP is receive-driven; i.e. the XDP hooks are in the receive codepaths. With AF_XDP - the topic of our next XDP blog entry - transmission is also possible, but here we stick to core XDP. Let's see what the program looks like!

# ./xdping -I eth4 192.168.55.7
Setting up xdp for eth4, please wait...
Normal ping RTT data:
PING 192.168.55.7 (192.168.55.7) from 192.168.55.8 eth4: 56(84) bytes of data.
64 bytes from 192.168.55.7: icmp_seq=1 ttl=64 time=0.206 ms 64 bytes from 192.168.55.7: icmp_seq=2 ttl=64 time=0.165 ms 64 bytes from 192.168.55.7: icmp_seq=3 ttl=64 time=0.162 ms 64 bytes from 192.168.55.7: icmp_seq=8 ttl=64 time=0.470 ms --- 192.168.55.7 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3065ms rtt min/avg/max/mdev = 0.162/0.250/0.470/0.129 ms XDP RTT data: 64 bytes from 192.168.55.7: icmp_seq=5 ttl=64 time=0.03003 ms 64 bytes from 192.168.55.7: icmp_seq=6 ttl=64 time=0.02665 ms 64 bytes from 192.168.55.7: icmp_seq=7 ttl=64 time=0.02453 ms 64 bytes from 192.168.55.7: icmp_seq=8 ttl=64 time=0.02633 ms Note that - unlike ping where it is optional - we must specify an interface for use in ping'ing; we need to know where to load the XDP program. Note also that the RTT measurements from XDP are significantly quicker than those reported by ping. Now ping has support for timestaming, where the network stack processing can use IP timestamps to get more accurate numbers, but not all systems have timestamping enabled. Finally notice one other thing; each ICMP echo packet has an associated sequence number, and we see these reported in the ping output. However note that the final icmp_seq=8 and not 4 as we might expect. This is because our XDP program took that 4th reply, rewrote as a request with sequence number 5 and sent it out. Then when it got that reply and measured the RTT, it did the same again for seq number 6 and so on until it got the 8th reply, realized it had all the numbers it needed (by defalt we do 4 requests, that can be changed with the "-c count" option to xdping) and instead of returning XDP_TX ("send out this modified packet") the program returns XDP_PASS ("pass this packet to the networking stack"). So the ping program finally sees ICMP reply number 8, hence the output. To store RTTs we need a common data structure to store in a BPF map which we shall key using the target (remote) IP address. xdping.h can store this info and be included by the userspace and kernel programs: /* SPDX-License-Identifier: GPL-2.0 */ /* Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved. */ #define XDPING_MAX_COUNT 10 #define XDPING_DEFAULT_COUNT 4 struct pinginfo { __u64 start; __be16 seq; __u16 count; __u32 pad; __u64 times[XDPING_MAX_COUNT]; }; We store the number of ICMP requests to make ("count"), the start time for the current request ("start"), the current sequence number ("seq") and the RTTs ("times"). Next, here is the implementation of the ping client code for the BPF program, xdping_kern.c: SEC("xdpclient") int xdping_client(struct xdp_md *ctx) { void *data_end = (void *)(long)ctx->data_end; void *data = (void *)(long)ctx->data; struct pinginfo *pinginfo = NULL; struct ethhdr *eth = data; struct icmphdr *icmph; struct iphdr *iph; __u64 recvtime; __be32 raddr; __be16 seq; int ret; __u8 i; ret = icmp_check(ctx, ICMP_ECHOREPLY); if (ret != XDP_TX) return ret; iph = data + sizeof(*eth); icmph = data + sizeof(*eth) + sizeof(*iph); raddr = iph->saddr; /* Record time reply received. */ recvtime = bpf_ktime_get_ns(); pinginfo = bpf_map_lookup_elem(&ping_map, &raddr); if (!pinginfo || pinginfo->seq != icmph->un.echo.sequence) return XDP_PASS; if (pinginfo->start) { #pragma clang loop unroll(full) for (i = 0; i < XDPING_MAX_COUNT; i++) { if (pinginfo->times[i] == 0) break; } /* verifier is fussy here... */ if (i < XDPING_MAX_COUNT) { pinginfo->times[i] = recvtime - pinginfo->start; pinginfo->start = 0; i++; } /* No more space for values? 
*/ if (i == pinginfo->count || i == XDPING_MAX_COUNT) return XDP_PASS; } /* Now convert reply back into echo request. */ swap_src_dst_mac(data); iph->saddr = iph->daddr; iph->daddr = raddr; icmph->type = ICMP_ECHO; seq = bpf_htons(bpf_ntohs(icmph->un.echo.sequence) + 1); icmph->un.echo.sequence = seq; icmph->checksum = 0; icmph->checksum = ipv4_csum(icmph, ICMP_ECHO_LEN); pinginfo->seq = seq; pinginfo->start = bpf_ktime_get_ns(); return XDP_TX; } In the full program, there are two ELF sections; one for the client mode (turn replies into requests and send them, measure RTT), and one for the server (turn requests into replies and send them out). Finally, the user-space program loads the XDP program, intializes the map used by it and kicks off the ping. Here is the main() function that sets up XDP and runs the ping: int main(int argc, char **argv) { __u32 mode_flags = XDP_FLAGS_DRV_MODE | XDP_FLAGS_SKB_MODE; struct addrinfo *a, hints = { .ai_family = AF_INET }; struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; __u16 count = XDPING_DEFAULT_COUNT; struct pinginfo pinginfo = { 0 }; const char *optstr = "c:I:NsS"; struct bpf_program *main_prog; int prog_fd = -1, map_fd = -1; struct sockaddr_in rin; struct bpf_object *obj; struct bpf_map *map; char *ifname = NULL; char filename[256]; int opt, ret = 1; __u32 raddr = 0; int server = 0; char cmd[256]; while ((opt = getopt(argc, argv, optstr)) != -1) { switch (opt) { case 'c': count = atoi(optarg); if (count < 1 || count > XDPING_MAX_COUNT) { fprintf(stderr, "min count is 1, max count is %d\n", XDPING_MAX_COUNT); return 1; } break; case 'I': ifname = optarg; ifindex = if_nametoindex(ifname); if (!ifindex) { fprintf(stderr, "Could not get interface %s\n", ifname); return 1; } break; case 'N': xdp_flags |= XDP_FLAGS_DRV_MODE; break; case 's': /* use server program */ server = 1; break; case 'S': xdp_flags |= XDP_FLAGS_SKB_MODE; break; default: show_usage(basename(argv[0])); return 1; } } if (!ifname) { show_usage(basename(argv[0])); return 1; } if (!server && optind == argc) { show_usage(basename(argv[0])); return 1; } if ((xdp_flags & mode_flags) == mode_flags) { fprintf(stderr, "-N or -S can be specified, not both.\n"); show_usage(basename(argv[0])); return 1; } if (!server) { /* Only supports IPv4; see hints initiailization above. */ if (getaddrinfo(argv[optind], NULL, &hints, &a) || !a) { fprintf(stderr, "Could not resolve %s\n", argv[optind]); return 1; } memcpy(&rin, a->ai_addr, sizeof(rin)); raddr = rin.sin_addr.s_addr; freeaddrinfo(a); } if (setrlimit(RLIMIT_MEMLOCK, &r)) { perror("setrlimit(RLIMIT_MEMLOCK)"); return 1; } snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); if (bpf_prog_load(filename, BPF_PROG_TYPE_XDP, &obj, &prog_fd)) { fprintf(stderr, "load of %s failed\n", filename); return 1; } main_prog = bpf_object__find_program_by_title(obj, server ? 
"xdpserver" : "xdpclient"); if (main_prog) prog_fd = bpf_program__fd(main_prog); if (!main_prog || prog_fd < 0) { fprintf(stderr, "could not find xdping program"); return 1; } map = bpf_map__next(NULL, obj); if (map) map_fd = bpf_map__fd(map); if (!map || map_fd < 0) { fprintf(stderr, "Could not find ping map"); goto done; } signal(SIGINT, cleanup); signal(SIGTERM, cleanup); printf("Setting up XDP for %s, please wait...\n", ifname); printf("XDP setup disrupts network connectivity, hit Ctrl+C to quit\n"); if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) { fprintf(stderr, "Link set xdp fd failed for %s\n", ifname); goto done; } if (server) { close(prog_fd); close(map_fd); printf("Running server on %s; press Ctrl+C to exit...\n", ifname); do { } while (1); } /* Start xdping-ing from last regular ping reply, e.g. for a count * of 10 ICMP requests, we start xdping-ing using reply with seq number * 10. The reason the last "real" ping RTT is much higher is that * the ping program sees the ICMP reply associated with the last * XDP-generated packet, so ping doesn't get a reply until XDP is done. */ pinginfo.seq = htons(count); pinginfo.count = count; if (bpf_map_update_elem(map_fd, &raddr, &pinginfo, BPF_ANY)) { fprintf(stderr, "could not communicate with BPF map: %s\n", strerror(errno)); cleanup(0); goto done; } /* We need to wait for XDP setup to complete. */ sleep(10); snprintf(cmd, sizeof(cmd), "ping -c %d -I %s %s", count, ifname, argv[optind]); printf("\nNormal ping RTT data\n"); printf("[Ignore final RTT; it is distorted by XDP using the reply]\n"); ret = system(cmd); if (!ret) ret = get_stats(map_fd, count, raddr); cleanup(0); done: if (prog_fd > 0) close(prog_fd); if (map_fd > 0) close(map_fd); return ret; Conclusion We've talked about XDP programs; where they run, what they can do and provided a code example. I hope this inspires you to play around with XDP! Next time we'll cover AF_XDP, a new socket type which uses XDP to support a more complete range of kernel bypass functionality. Be sure to visit our series on BPF,  and stay tuned for our next blog posts! 1. BPF program types 2. BPF helper functions for those programs 3. BPF userspace communication 4. BPF program build environment 5. BPF bytecodes and verifier 6. BPF Packet Transformation


Linux

Getting Started with Oracle Arm Toolset 8

Contents: Why Arm Toolset 8? | devtoolset-8 or armtoolset-8? | Steps | (1) Download .repo | (2) Enable the collection | (3) yum install | (4) Start a shell | (5) Verify | (6) Problems? | Sources Why Use Oracle Arm Toolset 8? Oracle Linux 7 for Arm includes "Oracle Arm Toolset 8", which provides many popular development tools, including: gcc v8.2.0 Supports the 2017 revision of the ISO C standard. g++ v8.2.0 Supports the 2017 revision of the  ISO C++ standard. gfortran v8.2.0 Supports Fortran 2018 go 1.11.1 The Go Programming Language gdb v8.2 The GNU debugger binutils v2.31   Binary utilities The above versions are much more recent than the base system versions. The base system versions are intentionally kept stable for many years, in order to help ensure compatibility for device drivers and other components that may be intimately tied to a specific compiler version. For your own applications, you might want to use more modern language features. For example, Oracle Arm Toolset 8 includes support for C++17.   Illustration credit: adapted by Jamie Henning from wikipedia, license CC-by-2.0 For a complete list of the software packages in Oracle Arm Toolset 8, see the yum repo page Oracle Linux 7 Software Collections. devtoolset-8 or armtoolset-8? If you want to use GCC v8, you will see 2 package sets at Oracle Linux 7 Software Collections: devtoolset-8-gcc-8.2.1-3.el7.aarch64.rpm devtoolset-8-gcc-c++-8.2.1-3.el7.aarch64.rpm devtoolset-8-gcc-gdb-plugin-8.2.1-3.el7.aarch64.rpm . . . [etc] and oracle-armtoolset-8-gcc-8.2.0-6.el7_6.aarch64.rpm oracle-armtoolset-8-gcc-c++-8.2.0-6.el7_6.aarch64.rpm oracle-armtoolset-8-gcc-gdb-plugin-8.2.0-6.el7_6.aarch64.rpm . . . How can you decide which collection to choose? A few differences can be seen in the lists of packages. For example: oracle-armtoolset-8 includes the languages Ada and Go; devtoolset-8 includes an updated version of GNU make. oracle-armtoolset-8 updates support for certain platform-specific optimizations. The most important difference is in shared library handling for C++ applications: C++ applications compiled with oracle-armtoolset-8 require run-time systems to install oracle-armtoolset-8-libstdc++ C++ applications compiled with devtoolset-8 rely only on the system libstdc++ v4.8.5 Of course, the v4.8.5 library does not support C++17 features. The devtoolset compilers solve that problem using non-shared linking for library functions that are newer than the 4.8.5 system C++ library. (To be specific: /opt/rh/devtoolset-8/root/usr/lib/gcc/aarch64-redhat-linux/8/libstdc++.so is a linker script that resolves symbols from the v4.8.5 shared library /usr/lib64/libstdc++.so.6 when possible, or from the v8 libstdc++_nonshared.a otherwise.) The devtoolset method has the usual advantage of static linking: fewer runtime dependencies.  The system administrator need not install a new C++ library. The devtoolset method has the usual disadvantages, reducing both security and maintainability. For more detail, use your favorite search engine to look for: static linking considered harmful Summary: The choice is yours: both provide modern GCC v8 features; from a security and maintainability point of view, you may prefer Oracle Arm Toolset 8.     Installation Steps for Oracle Arm Toolset 8 (1) Download the .repo Download the Oracle Linux repo file: # cd /etc/yum.repos.d # wget http://yum.oracle.com/aarch64/public-yum-ol7.repo (2) Enable the collection In the repo file, set enabled=1 for ol7_software_collections: Edit the .repo file. 
Notice that there are many repositories. At minimum, you should edit the section about the Software Collection Library to set enabled=1. While you are there, review the other repositories, and decide whether you would like to enable any others. You can view the Software Collection Library in a browser by going to: http://yum.oracle.com/repo/OracleLinux/OL7/SoftwareCollections/aarch64/index.html

[ol7_software_collections]
name=Software Collection Library for Oracle Linux 7 ($basearch)
baseurl=https://yum.oracle.com/repo/OracleLinux/OL7/SoftwareCollections/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1

(3) Yum Install

# yum install policycoreutils-python
# yum install 'oracle-armtoolset-8*'

(3a) Why 2 steps? The reason for doing the installation in two steps above is that it avoids a possible installation issue -- one user reported Error unpacking rpm package oracle-armtoolset-8-runtime when the installation was done as a single step. As of April 2019, the possible issue is under investigation; in the meantime, the above method is recommended.

(3b) To start over: If you encounter the above installation issue, to start over, try this sequence:

# yum remove 'oracle-armtoolset-8*'
# yum remove policycoreutils-python
# rm -Rf /opt/oracle/oracle-armtoolset-8/
# yum install policycoreutils-python
# yum install 'oracle-armtoolset-8*'

(4) Start a shell with the software collection

$ scl enable oracle-armtoolset-8 bash

Note that this will start a new shell. (Of course, you could change the word ‘bash’ above to some other shell if you prefer.)

(5) Verify Verify that the gcc command invokes the correct copy, and that paths are set as expected:

which gcc
echo $PATH
echo $MANPATH
echo $INFOPATH
echo $LD_LIBRARY_PATH

Expected output: the which command should return /opt/oracle/oracle-armtoolset-8/root/usr/bin/gcc, and the output of all four echo commands should begin with /opt/oracle/oracle-armtoolset-8/

(6) Problems? Wrong gcc? Wrong paths? If Step (5) gives unexpected output, then check whether your shell initialization files are re-setting the path variables. If so, here are four possible solutions:

(6a) norc Depending on your shell, there is probably an option to start up without initialization. For example, if you are a bash user, you could say:

scl enable oracle-armtoolset-8 "bash --noprofile --norc"

(6b) silence Alternatively, you can edit your shell initialization files to avoid setting paths, leaving it up to scl instead.

(6c) (RECOMMENDED) Set paths only in your login shell initialization files. The easiest solution is probably to check out the documentation for your shell and notice that it probably executes certain file(s) at login time and certain other file(s) when a new sub shell is created. For example, bash at login time will look for ~/.bash_profile, ~/.bash_login, or ~/.profile and for sub shells it looks for ~/.bashrc If you do your path setting in ~/.bash_profile and avoid touching paths in .bashrc, then the scl enable command will successfully add Oracle Arm Toolset 8 to your paths.

(6d) (Kludge) enable last If for some reason you wish to set paths in your sub shell initialization file, then please ensure that the toolset's enable scriptlet is done last. Here is an example from the bottom of my current .bashrc

# If this is a shell created by 'scl enable', then make sure that the
# 'enable' scriptlet is done last, after all other path setting has
# been completed.
grandparent_cmd=$(ps -o cmd= $(ps -o ppid= $PPID))
if [[ "$grandparent_cmd" =~ "scl enable" ]] ; then
    #echo "looks like scl"
    grandparent_which=${grandparent_cmd/scl enable}
    grandparent_which=${grandparent_which/bash}
    grandparent_which=${grandparent_which// }
    grandparent_enable=$(ls /opt/*/$grandparent_which/enable 2>/dev/null)
    if [[ -f $grandparent_enable ]] ; then
        sourceit="source $grandparent_enable"
        echo doing "'$sourceit'"
        $sourceit
    else
        echo "did not find the enable scriptlet for '$grandparent_which'"
    fi
fi

Sources If you would like the sources, please see http://yum.oracle.com/repo/OracleLinux/OL7/SoftwareCollections/aarch64/index_src.html
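As a quick smoke test of the toolset once it is installed, you can compile a small C++17 program from inside the scl shell. The file name and program below are purely illustrative:

$ scl enable oracle-armtoolset-8 bash
$ cat > hello17.cpp << 'EOF'
#include <iostream>
#include <optional>
int main() {
    std::optional<int> answer{42};   // std::optional is a C++17 library feature
    std::cout << "answer = " << *answer << "\n";
    return 0;
}
EOF
$ g++ -std=c++17 hello17.cpp -o hello17 && ./hello17
answer = 42

If the program builds and runs, the C++17-capable g++ from Oracle Arm Toolset 8 is the one on your path.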


Announcements

Announcing Oracle Linux Virtualization Manager

Announcing Oracle Linux Virtualization Manager  Oracle is pleased to announce the general availability of Oracle Linux Virtualization Manager. This new server virtualization management platform can be easily deployed to configure, monitor, and manage an Oracle Linux Kernel-based Virtual Machine (KVM) environment with enterprise-grade performance and support from Oracle. Based on the open source oVirt project, Oracle Linux Virtualization Manager allows enterprise customers to continue supporting their on-premises data center deployments with the KVM hypervisor already available on Oracle Linux 7.6 with the Unbreakable Enterprise Kernel Release 5. Oracle Linux KVM is a feature that has been delivered and supported as part of Oracle Linux for some time. With the release of the Unbreakable Enterprise Kernel Release 5, the Oracle Linux server virtualization solution with KVM has been enhanced. Oracle Linux KVM is the same hypervisor used in Oracle Cloud Infrastructure, giving users an easy migration path to move workloads into Oracle Cloud in the future. Oracle Linux Virtualization Manager release 4.2.8, the first release of this new management platform, supports multiple hosts running Oracle Linux KVM. The heart of the manager is the ovirt-engine which is used to discover KVM hosts and configure storage and networking for the virtualized data center. Oracle Linux Virtualization Manager offers a web-based User Interface (UI) and a Representation State Transfer (REST) Application Programming Interface (API) which can be used to manage your Oracle Linux KVM infrastructure. Oracle Linux Virtualization Manager delivers high performance with a modern web UI. A REST API is available for users that need to integrate with other management systems, or prefer to automate repetitive tasks with scripts. For most day to day operations, many users will rely on the administrative portal or the lighter weight VM portal. These portals (and the REST API Guide) can be accessed from the Oracle Linux Virtualization Manager landing page when first connected with a browser: After logging in from the main landing page, users are presented with a dashboard view which shows all of the key information about their deployment (VM counts, Host counts, Clusters, Storage, etc.), including the current status of each entity, in addition to key performance metrics: From the dashboard, users can move to the Compute view for Hosts, Virtual Machines, Templates, Data Centers, Clusters and Pools, to configure or edit their virtual environments. Additional menus and sub-menus for Network, Storage, Administration, and Events provide full control, with logical workflows, in an easy to use web interface. Notable Features In addition to the base virtualization management features required to operate your data center, notable features in Oracle Linux Virtualization Manager include: Snapshot - create a view of a running virtual machine at a given point in time. Multiple snapshots can be saved and used to return to a previous state, in the event of a problem. The snapshot feature is accessed from the Virtual Machines view: Role Based Access - define different users with different levels of operational permission within Oracle Linux Virtualization Manager: More information on these features can be found in the Oracle Linux Virtualization Manager Document Library. Additional features will be described in more detail in future blogs. 
In addition to these supported features, planned features may first be made available as technology previews, to allow users to test them in a development environment and offer feedback before the feature is supported. Getting Started Users can take either a previously deployed version of Oracle Linux and turn the OS into a KVM host, or a KVM configuration can be set up from a base Oracle Linux installation. Instructions and reference material can be found in the Oracle Linux Administrator's Guide for Release 7. Oracle Linux Virtualization Manager 4.2.8 can be installed from the Oracle Linux yum server or the Oracle Unbreakable Linux Network. Two new channels have been created in the Oracle Linux 7 repositories that users will access to install or update Oracle Linux Virtualization Manager: oVirt 4.2 - base packages required for Oracle Linux Virtualization Manager oVirt 4.2 Extra Packages - extra packages for Oracle Linux Virtualization Manager Oracle Linux 7.6 hosts can be installed with installation media (ISO images) that is available from Oracle Software Delivery Cloud. Instructions to download the Oracle Linux 7.6 ISO can be found on Oracle Technology Network. Using the "Minimal Install" option, during the installation process, sets up a base KVM system which can then be updated using the KVM Utilities channel in the Oracle Linux 7 repositories. This and other important packages for your Oracle Linux KVM host can be installed from the Oracle Linux yum server and the Oracle Unbreakable Linux Network: Latest - Latest packages released for Oracle Linux 7 UEK Release 5 - Latest Unbreakable Enterprise Kernel Release 5 packages for Oracle Linux 7 KVM Utilities - KVM Utils for Oracle Linux 7 Optional Latest - Latest packages released for Oracle Linux 7 Both Oracle Linux Virtualization Manager and Oracle Linux can be downloaded, used, and distributed free of charge and all updates and errata are freely available Oracle Linux Virtualization Manager Support Support for Oracle Linux Virtualization Manager is available to customers with an Oracle Linux Premier Support subscription. Refer to Oracle Linux 7 License Information User Manual for information about Oracle Linux support levels.
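As a small illustration of the REST API mentioned above, once the manager is installed you can query it with a standard HTTP client. This is only a sketch: the host name and credentials are placeholders (admin@internal is the default administrative account created during engine setup), and -k is used here only because a freshly installed engine typically presents a self-signed certificate:

$ curl -k -u 'admin@internal:password' \
       -H 'Accept: application/xml' \
       https://olvm-manager.example.com/ovirt-engine/api/vms

The same base URL, https://<manager>/ovirt-engine/api, is the entry point used by scripts and by other management systems that integrate with Oracle Linux Virtualization Manager.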


Announcements

New Lenovo Servers with Ampere Arm Processors now Qualified on Oracle Linux

Continuing the companies’ close collaboration, Oracle, Ampere, and Lenovo have completed joint development and testing to qualify Oracle Linux on new Lenovo servers. The Lenovo ThinkSystem HR330A and HR350A include the powerful Ampere eMAG™ Arm® (aarch64) processor and are certified and supported, through the Oracle HCL program, with Oracle Linux 7 Update 6 with the Unbreakable Enterprise Kernel (UEK) Release 5. UEK5 is based on the upstream LTS (long-term stable) kernel version 4.14 and is designed and recommended for enterprise workloads requiring stability, scalability, and performance. Oracle Linux 7 Update 6 (aarch64) is available from Oracle Software Delivery Cloud. Customers deploying these world-class systems deserve world-class support, and that’s just what they’ll get with Oracle Linux support. Oracle offers two levels of support subscriptions for Oracle Linux: Basic and Premier. As always, to support developers and users, Oracle Linux is free to download, use, and distribute. All Oracle Linux updates are freely available on the Oracle Linux yum server, to help users match development and test environments to the same patch level used in production. For more information on the engineering efforts involving Oracle Linux for Arm, please read this blog from Wim Coekaerts, Oracle Senior Vice President, Development. Additional Resources: Oracle Linux Hardware Certification List (HCL) Oracle Linux for Arm data sheet Oracle Linux 7 documentation Oracle Linux FAQ Oracle Linux Support Lenovo ThinkSystem datasheet: HR330A and HR350A


Announcements

Announcing Gluster Storage Release 5 for Oracle Linux 7

The Oracle Linux and Virtualization team is pleased to announce the release of Gluster Storage Release 5 for Oracle Linux 7, bringing customers higher performance, new storage capabilities and improved management. Gluster Storage is an open source, POSIX compatible file system capable of supporting thousands of clients while using commodity hardware. Gluster provides a scalable, distributed file system that aggregates disk storage resources from multiple servers into a single global namespace. Gluster provides built-in optimization for different workloads and can be accessed using an optimized Gluster FUSE client or standard protocols including SMB/CIFS. Gluster can be configured to enable both distribution and replication of content with quota support, snapshots, and bit-rot detection for self-healing. New Features Gluster Storage Release 5 introduces support for the following important new capabilities: Gluster block storage: Gluster volumes can be set up as an iSCSI back-store to provide block storage using the gluster-block and tcmu-runner packages. Files on volumes are exported as block storage (iSCSI LUNs). Thanks to this newly supported feature, your Gluster cluster can act as iSCSI storage for your development as well as production environments, providing enterprise-level storage at a lower TCO. For further details see "Chapter 4 Accessing Volumes" in the "Gluster Storage for Oracle Linux User's Guide". Heketi scripted cluster automation: The heketi and heketi-client packages automate the management of a Gluster cluster. Trusted storage pools and volumes can be provisioned and managed using the heketi-cli command, and custom scripts can be written using the API functions exposed by the Heketi service. It is particularly useful for set-up steps during cloud-based deployments that can be automated without requiring manual systems administration. The introduction of Heketi API support opens up Gluster as a true Storage-as-a-Service solution for your infrastructure. For further details see "Chapter 5 Automating Volume Lifecycle with Heketi" in the "Gluster Storage for Oracle Linux User's Guide". Further enhancements and new features in Gluster Storage Release 5 for Oracle Linux 7 include: Performance: network throughput usage increased up to 5 times. Standalone: the Dentry serializer feature is now enabled by default; Python code in Gluster packages is Python 3 ready; the noatime option was added to the utime xlator. Enabling the utime and ctime feature enables Gluster to maintain consistent change and modification timestamps on files and directories across bricks. Gluster Storage Release 5 for Oracle Linux 7 supports: The Unbreakable Enterprise Kernel (Release 4 and higher) and the Red Hat Compatible Kernel on the x86_64 architecture. The Unbreakable Enterprise Kernel (Release 5) on the aarch64 architecture. Configurations upgraded from an existing Gluster Storage Release 3.12 and Gluster Storage Release 4.1. Installation Gluster Storage is available on the Unbreakable Linux Network (ULN) and the Oracle Linux yum server. It is currently available for the x86_64 and aarch64 architectures and can be installed on any Oracle Linux 7 server running the Unbreakable Enterprise Kernel (UEK) Release 4 or 5 or the Red Hat Compatible Kernel (RHCK). For more information on hardware requirements and how to install and configure Gluster, please review the Gluster Storage for Oracle Linux Release 5 documentation. Support Support for Gluster Storage is available to customers with an Oracle Linux Premier support subscription. 
Refer to Oracle Linux 7 License Information User Manual for information about Oracle Linux support levels. Oracle Linux Resources: Documentation Oracle Linux Software Download Oracle Linux Oracle Container Registry Blogs Oracle Linux Blog Community Pages Oracle Linux Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter Data Sheets, White Papers, Videos, Training, Support & more Oracle Linux Product Training and Education Oracle Linux For community-based support, please visit the Oracle Linux space on the Oracle Developer Community.
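To give a feel for the Gluster CLI itself, here is a minimal sketch of creating and starting a three-way replicated volume once the glusterfs-server packages are installed and the glusterd service is running on each node. The host names and brick paths are placeholders; the Gluster Storage for Oracle Linux User's Guide remains the authoritative reference:

# gluster peer probe node2.example.com
# gluster peer probe node3.example.com
# gluster volume create myvolume replica 3 \
    node1.example.com:/data/glusterfs/myvolume/brick1 \
    node2.example.com:/data/glusterfs/myvolume/brick1 \
    node3.example.com:/data/glusterfs/myvolume/brick1
# gluster volume start myvolume
# gluster volume info myvolume

With the volume started, it can then be mounted with the Gluster FUSE client or exposed through the block or SMB access methods described above.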


Announcements

Announcing the general availability of the Unbreakable Enterprise Kernel Release 5 Update 2

The Unbreakable Enterprise Kernel (UEK) for Oracle Linux provides the latest open source innovations and key optimizations and security to enterprise cloud workloads. It is the Linux kernel that powers Oracle Cloud and Oracle Engineered Systems such as Oracle Exadata Database Machine as well as Oracle Linux on any Intel-64, AMD-64 or ARM hardware. What's New? UEK R5 Update 2 is based on the mainline kernel version 4.14.35. Through actively monitoring upstream check-ins and collaboration with partners and customers, Oracle continues to improve and apply critical bug and security fixes to the Unbreakable Enterprise Kernel (UEK) R5 for Oracle Linux. This update includes several new features, added functionality, and bug fixes across a range of subsystems. Notable changes: Pressure Stall Information (PSI) patchset implemented. PSI is designed to help system administrators maximize server resources and can be used to pinpoint and troubleshoot resource utilization issues. Implementation of the ktask framework for parallelizing CPU-intensive work. The ktask framework parallelizes CPU-intensive work in the kernel. This helps improve performance by harnessing idle CPUs, to complete jobs more quickly. DTrace support for libpcap packet capture. Kernel and userspace updates enable support for libpcap-based packet capture in DTrace. File system and storage fixes. Fixes to btrfs, CIFS, ext4, OCFS2, and XFS file systems. Virtualization features and updates. Upstream improvements from the 4.19 kernel for KVM, Xen, and Hyper-V, including major updates and security fixes for KVM; numerous security fixes and code enhancements for Hyper-V; fix for the Xen blkfront hotplug issue; and a fix for the Xen x86 guest clock scheduler. Driver updates. In close cooperation with hardware and software vendors, several device drivers have been updated. Kernel tuning dedicated to the Arm platform. Further kernel tuning for Arm platforms and parameters for unsupported hardware have been disabled, to improve stability and performance. NVMe updates. Fixes and improvements for NVMe are included from upstream Linux kernel versions 4.18 through 4.21. For more details on these and other new features and changes, please consult the Release Notes for the UEK R5 Update 2. Security (CVE) Fixes A full list of CVEs fixed in this release can be found in the Release Notes for the UEK R5 Update 2. Supported Upgrade Path Customers can upgrade existing Oracle Linux 7 Update 5 (and later) servers using the Unbreakable Linux Network or the Oracle Linux yum server. Software Download Oracle Linux can be downloaded, used, and distributed free of charge and all updates and errata are freely available. This allows organizations to decide which systems require a support subscription and makes Oracle Linux an ideal choice for development, testing, and production systems. The user decides which support coverage is the best for each system individually, while keeping all systems up-to-date and secure. Customers with Oracle Linux Premier Support also receive access to zero-downtime kernel updates using Oracle Ksplice. Compatibility UEK R5 Update 2 is fully compatible with the UEK R5 GA release. The kernel ABI for UEK R5 remains unchanged in all subsequent updates to the initial release. UEK R5 includes changes to the kernel ABI relative to UEK R4 that require recompilation of third-party kernel modules. About Oracle Linux The Oracle Linux operating system is engineered for an open cloud infrastructure. 
It delivers leading performance, scalability and reliability for enterprise SaaS and PaaS workloads as well as traditional enterprise applications. Oracle Linux Support offers access to award-winning Oracle support resources and Linux support specialists; zero-downtime updates using Ksplice; additional management tools such as Oracle Enterprise Manager and Spacewalk; and lifetime support, all at a low cost. And unlike many other commercial Linux distributions, Oracle Linux is easy to download, completely free to use, distribute, and update. Oracle tests the UEK intensively with demanding Oracle workloads, and recommends the UEK for Oracle deployments and all other enterprise deployments. Resources – Oracle Linux Documentation Oracle Linux Software Download Oracle Linux Blogs Oracle Linux Blog Oracle Virtualization Blog Community Pages Oracle Linux Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter Data Sheets, White Papers, Videos, Training, Support & more Oracle Linux Product Training and Education Oracle Linux - education.oracle.com/linux
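On an existing Oracle Linux 7 system, moving to the latest UEK R5 kernel is typically a matter of enabling the UEK R5 repository and updating the kernel-uek package. A sketch follows; the repository ID assumes the public Oracle Linux yum server configuration, and yum-config-manager comes from the yum-utils package:

# yum-config-manager --enable ol7_UEKR5
# yum update -y kernel-uek
# reboot
...
# uname -r

After the reboot, uname -r should report the newly installed UEK R5 kernel version.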


Linux

Cisco Qualifies Cisco Tetration Platform on Oracle Linux Running on Oracle Exadata Database Machine

To provide mutual customers with greater application insights and security, Cisco has qualified the Cisco Tetration platform on Oracle Linux running on Oracle Exadata Database Machine. This combination is available in a Tetration SaaS offering, powered by Oracle Cloud Infrastructure, or on-premises deployments. The Cisco Tetration platform uses workload and network telemetry data to perform advanced analytics using an algorithmic approach and provides comprehensive workload protection for a multi-cloud data center. This algorithmic approach includes unsupervised machine-learning techniques and behavioral analysis. The platform provides a ready-to-use solution for: Visibility into application components, communications, and dependencies, to enable implementation of a zero-trust model in the data center Automatic generation of a whitelist policy, based on application behavior, including existing security policy mandated by business requirements Consistent enforcement of segmentation policies across a multi-cloud infrastructure to minimize lateral movement Identification of software vulnerabilities and exposures to reduce the attack surface Process behavior baselining and identification of deviations for faster detection of Indicators of Compromise (IOCs) By using a multidimensional workload protection approach, Cisco Tetration significantly reduces the attack surface, minimizes lateral movement in case of security incidents, and quickly identifies anomalous behaviors within the data center. Learn more at: https://www.cisco.com/c/en/us/products/collateral/data-center-analytics/tetration-analytics/datasheet-c78-737256.html https://www.cisco.com/c/en/us/products/data-center-analytics/tetration-analytics/index.html https://www.youtube.com/watch?v=_LGLFLDiTTU  


Linux Kernel Development

Using AMD Secure Memory Encryption with Oracle Linux

Oracle Linux kernel developer Boris Ostrovsky wrote this explanation of AMD's memory encryption technologies.  AMD SME and SEV Introduction Disk encryption by now has become a standard procedure to protect information from an intruder who has physical access to the system but is not able, for example, to log in. However, the other system component used for storing data, system memory, remains largely vulnerable. It is true that extracting data from memory is typically more difficult but techniques like cold-boot attacks show that this is not an impossible task. To make things worse, introduction of non-volatile memory allows one to physically remove the NVDIMM chips from the system and examine their contents at some later time, making data there as easy to access as it would be on a non-encrypted hard drive. To protect system memory from such attacks, hardware manufacturers have been adding support for memory encryption. For example, when AMD recently introduced their EPYC processors, one of the new features was the support for Secure Memory Encryption. (Some of the desktop variants, such as Ryzen Pro, also included this). Secure Memory Encryption (SME) With SME, the data that the processor writes to memory passes through an encryption engine that scrambles it before committing. Conversely, when the data is read, the encryption engine unscrambles it and presents to the processor in its original format. All this is done without any software intervention. The encryption engine implements AES algorithm with an 128-bit encryption key. The key is managed by on-the-chip AMD Secure Processor (AMD-SP) and is generated anew after each reset. The key is not accessible to the software. There are a couple of ways SME can be used. The first is Transparent SME (TSME). In this mode, any software (operating system or hypervisor) will have its memory encrypted, without any special SME support in SW. This mode is enabled by BIOS setting (if the BIOS vendor decides to expose it). While TSME is the easiest to use, it has some limitations. The biggest one is that it does not allow the use of SEV which we will discuss in a moment. The other way of using SME is more flexible in that, in addition to enabling SEV, it also allows encrypting only certain memory regions (with page granularity). This is achieved by setting (typically) bit 47 of the physical address, and therefore requires OS/hypervisor support: for pages that should be encrypted, bit 47 (known as C bit) needs to be set. Secure Encrypted Virtualization (SEV) When a guest is executing on a hypervisor, the latter has access to all the resources used by the guest, including guest's memory. This is obviously not an ideal situation: the guest may be running a highly sensitive application and does not want anyone to see its data. If the hypervisor is compromised, then the guest's secrets can be too. That's where SEV comes to help. With SEV, each guest is assigned (by AMD-SP) an encryption key and can encrypt its pages using the same technique as what is used for SME on bare metal (PTE's C bit). The most important part to keep in mind here is that the key is not available to the hypervisor and therefore it cannot snoop on guest's data (unless the guest decides not to encrypt specific pages, for example, those shared with the hypervisor, such as DMA buffers). Software support SME UEK support for SME is enabled by setting CONFIG_CRYPTO_DEV_SP_PSP and CONFIG_AMD_MEM_ENCRYPT build options. 
After that, specifying mem_encrypt=on on the kernel boot command line will activate SME. Alternatively, if CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is set in the kernel's .config file, then SME is active by default. To verify that SME is on:

[root@host ~]# dmesg | grep SME
[ 0.000000] AMD Secure Memory Encryption (SME) active
[root@host ~]#

Keep in mind that SME needs to be enabled by system firmware, and some BIOSes may have it turned off by default. You can check whether it is on by first making sure that the feature is present in the hardware by looking at CPUID Fn8000_001F[EAX].[0]:

[root@host ~]# cpuid -r -1 -l 0x8000001f
CPU:
0x8000001f 0x00: eax=0x0000000f ebx=0x0000016f ecx=0x0000000f edx=0x00000001
[root@host ~]#

and then see if it is enabled by verifying that bit 23 of MSR 0xC0010010 is set:

[root@host ~]# rdmsr 0xC0010010
f40000
[root@host ~]#

For a quick demo of SME functionality we can use smetest.c, which is provided at the end of this blog post. The driver allocates a page where a secret string is stored and then prints the contents of that page (as stored in DRAM) either with SME enabled on that page (i.e. bit C set on the PTE) or when the page is accessed as unencrypted (bit C is cleared). Since the data was originally stored in memory in encrypted form, trying to access it with encryption disabled should be unsuccessful. The relevant part of the driver is the ioctl routine:

static long smetest_ioctl(struct file *file, unsigned int cmd,
			  unsigned long arg)
{
	int ret = 0;
	char buf[strlen(SECRET_DATA) + 1];

	if (!mem_encrypt_active())
		return -ENXIO;

	switch (cmd) {
	case 1:
		ret = set_memory_decrypted((unsigned long)secret, 1);
	case 0:
		break;
	default:
		return -EINVAL;
	}

	if (ret)
		return ret;

	memcpy(buf, secret, strlen(SECRET_DATA) + 1);

	if (cmd == 1) {
		/* Re-encrypt memory */
		ret = set_memory_encrypted((unsigned long)secret, 1);
		/* Make sure string is terminated */
		buf[strlen(SECRET_DATA)] = 0;
	}

	printk("Secret data is: %s\n", buf);

	return ret;
}

When cmd is 0, the C bit on the PTE is kept and therefore the data is decrypted before it is copied into buf. When cmd is 1, set_memory_decrypted() will clear the bit (and also flush caches and TLBs) so the contents of the memory will be read by the processor without passing through the encryption engine. (Notice that we need to terminate the string in this case since the NULL character will be scrambled). The userspace code is:

#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

int main(int argc, char **argv)
{
	int f;

	f = open("/dev/smetest", 0);
	if (f == -1) {
		perror("open");
		exit(errno);
	}

	if (ioctl(f, 0))
		perror("ioctl(0)");
	if (ioctl(f, 1))
		perror("ioctl(1)");

	close(f);

	return 0;
}

Here are the results:

[root@host ~]# insmod ./smetest.ko
[root@host ~]# ./a.out
[root@host ~]# dmesg
[ 1129.283633] secret is my secret
[ 1133.687482] Secret data is: my secret
[ 1133.696322] Secret data is: \xffffff81\xffffff83\xffffff93\xffffffa8\xffffffe6\xffffffc\xffffff84\xfffffffc\xffffffb7
[root@host ~]#

SEV To enable SEV, CONFIG_KVM_AMD_SEV needs to be set in the Linux configuration file. A newer qemu (such as qemu-3.0.0-4.el7) and OVMF are also required. Start the guest by specifying the new qemu object, sev-guest, and setting the machine's memory-encryption attribute. 
For example: [root@host ~]# qemu-system-x86_64 -enable-kvm -cpu EPYC -machine q35 -smp 1 -m 1G -drive if=pflash,format=raw,unit=0,file=/usr/share/OVMF/OVMF_CODE.pure-efi.fd,readonly -drive if=pflash,format=raw,unit=1,file=/usr/share/OVMF/OVMF_VARS.fd -drive file=./ol76-uefi.qcow2,if=none,id=disk0,format=qcow2 -device virtio-scsi-pci,id=scsi,disable-legacy=on,iommu_platform=true -device scsi-hd,drive=disk0 -nographic -s -device virtio-rng-pci,disable-legacy=on,iommu_platform=true -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1 -machine memory-encryption=sev0 To see whether SEV is available check CPUID Fn8000_001F[EAX].[1]: [root@guest ~]# cpuid -r -1 -l 0x8000001f CPU: 0x8000001f 0x00: eax=0x00000002 ebx=0x0000006f ecx=0x00000000 edx=0x00000000 [root@guest ~]# And to verify that it is active, look at bit 1 of MSR 0xc0010131: [root@guest ~]# rdmsr 0xc0010131 1 [root@guest ~]# You can also verify this by looking at dmesg output to see whether SEV is on: [root@guest ~]# dmesg | grep SEV [ 0.001000] AMD Secure Encrypted Virtualization (SEV) active [ 1.727193] SEV is active and system is using DMA bounce buffers [root@guest ~]# Recall that the main reason behind SEV is to protect guest's memory from being snooped on by the hypervisor. Here is a small example that demonstrates this: #include <stdio.h> #include <stdlib.h> main(int argc, char *argv[]) { char str[32]; int secret = -1; if (argc > 1) secret = atoi(argv[1]); sprintf(str, "My secret is %d\n", secret); sleep(10000); } We run the above code as: root@guest ~]# ./a.out 123 & [1] 3698 [root@guest ~]# We then drop to qemu monitor (Ctrl-A C) and save guest's memory into a file: (qemu) dump-guest-memory /tmp/encrypted (qemu) Now start the guest without SEV (by dropping '-object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1 -machine memory-encryption=sev0' options) and save its memory in /tmp/unencrypted. Let's first search unencrypted guest's memory: [root@host ~]# strings /tmp/unencrypted | grep "My secret" My secret is 123 My secret is %d My secret is %d [root@host ~]# and then [root@host ~]# strings /tmp/encrypted | grep "My secret" My secret is %d [root@host ~]# The secret string cannot be discovered when SEV is turned on. (Note that we still see "My secret is %d" string. This is because when the executable was fetched from the disk it was first placed into a buffer shared between the hypervisor (host) and the guest. Since the hypervisor cannot access the guest's encrypted memory, those shared buffers are not encrypted.) Limitations While SEV allows guests to hide contents of their memory, another component that a guest may wish to hide from the host is guest's registers. For example, various encryption keys (such as ssh keys, pgp keys etc.) are often stored in floating-point registers such as %xmm and %ymm and therefore it is important that access to that information is not allowed to any entity outside the guest. Currently it is not possible to limit hypervisor's visibility into this state, although AMD promises that future processors will support SEV-ES (Encrypted State) to address this issue. Another limitation of running guests with SEV is that at the moment live migration (and save/restore in general) are not properly supported. 
References https://developer.amd.com/sev/ AMD64 Architecture Programmer's Manual Volume 2: System Programming (chapters 7.10 and 15.34 in particular) https://www.kernel.org/doc/Documentation/x86/amd-memory-encryption.txt   Sample kernel module smetest.c https://github.com/oracle/linux-blog-sample-code/blob/amd-sev/smetest.c  
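If you would like to try the smetest.c sample linked above, a typical out-of-tree module build against the running kernel looks something like the following. This is only a sketch: it assumes smetest.c sits in the current directory and that the kernel headers/devel package matching the running kernel are installed:

# echo 'obj-m += smetest.o' > Makefile
# make -C /lib/modules/$(uname -r)/build M=$PWD modules
# insmod smetest.ko
# dmesg | tail

The make invocation uses the standard kbuild mechanism to build smetest.ko, which can then be loaded and exercised as shown in the demo output earlier in the post.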


Announcements

Oracle Cloud Developer Image Adds Java SE 11 and 12 and Oracle SQL Developer

We are pleased to announce an exciting new release of the Oracle Cloud Developer Image on Oracle Cloud Infrastructure. The Oracle Cloud Developer Image is an Oracle Linux 7 based, ready-to-run image that allows you to rapidly set up a development environment on Oracle Cloud Infrastructure with the latest Oracle Cloud Infrastructure client tools and Software Development Kits (SDKs), choice of development languages, and database connectors and tools. By deploying the Oracle Cloud Developer Image on Oracle Cloud Infrastructure, you can dramatically reduce the time and cost to develop your cloud applications.  Why is this release exciting?  Two reasons:  First, Oracle Java SE 11 and 12 have been added to the Oracle Cloud Developer Image, and support is now included with Oracle Cloud Infrastructure subscriptions. With the bundling of Oracle Java SE in the image, you can get your enterprise Java development environment up and running in no time and quickly start developing secure, portable, and high-performance applications in the cloud. Second, the Oracle Cloud Developer Image now makes it faster and easier for you to deploy the Oracle SQL Developer integrated development environment in Oracle Cloud, including Oracle SQL Developer Command Line (SQLcl), both now bundled in this new release. Oracle SQL Developer is a free graphical tool that enhances productivity and simplifies database development tasks and management of Oracle databases in both traditional and cloud deployments. SQLcl is a powerful free command line interface that allows you to author, and interactively or batch execute SQL and PL/SQL on Oracle Database.   Here’s a list of what’s included in this latest release of the Oracle Cloud Developer Image on Oracle Cloud Infrastructure: Latest Oracle Linux 7 image for Oracle Cloud Infrastructure Development Languages, Oracle Database Connectors and Tools Oracle Java Platform, Standard Edition (Java SE) 8, 11, 12 Python 3.6 and cx_Oracle 7  Node.js 10 and node-oracledb Go 1.12 Oracle Instant Client 18.5 Oracle SQL Developer 19.1 Oracle SQL Developer Command Line (SQLcl) 19.1 Oracle Cloud Infrastructure Command Line Interface (CLI), Software Development Kits (SDKs) and Tools Oracle Cloud Infrastructure CLI Python, Java, Go, and Ruby Oracle Cloud Infrastructure SDKs Terraform and Oracle Cloud Infrastructure Terraform Provider Oracle Cloud Infrastructure Utilities Other Oracle Container Runtime for Docker Access to Extra Packages for Enterprise Linux (EPEL) via Oracle Linux Yum Server GUI Desktop with access via VNC Server The Oracle Cloud Developer Image is available at no additional cost to Oracle Cloud Infrastructure subscribers. If you do not have an Oracle Cloud Infrastructure account, register for one here. You can try out the Oracle Cloud Developer Image today with available free subscription credits on Oracle Cloud Infrastructure. Getting started is easy and just takes minutes. Simply log into your Oracle Cloud Infrastructure console, and deploy the image from the Marketplace by selecting Marketplace under the main navigation menu under Solutions, Platform and Edge. Search for and select ‘Oracle Cloud Developer Image’. Follow the click-through instructions to launch the Oracle Cloud Developer Image instance. We are always looking for ways to enhance developers' user experience with the Oracle Cloud Developer Image. Your feedback is appreciated. 
Please send your comments and questions to oraclelinux-info_ww_grp@oracle.com or post them on the Oracle Linux for Oracle Cloud Infrastructure Community. Learn more about the Oracle Cloud Developer Image  Oracle Cloud Marketplace: Oracle Cloud Developer Image Support for Oracle Java SE now Included with Oracle Cloud Infrastructure Run Oracle SQL Developer and Connect to Oracle Autonomous Database Get Started with Autonomous Database and SQLcl in No Time Using Oracle Cloud Developer Image Announcing the Oracle Cloud Developer Image for Oracle Cloud Infrastructure
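Once an instance based on the image is up, a quick way to confirm that the bundled tools are on your path is to check their versions from a terminal. The commands below are a sketch; the exact version strings you see will depend on the image release:

$ java -version
$ python3 -c 'import cx_Oracle; print(cx_Oracle.version)'
$ node --version
$ go version
$ docker --version
$ oci --version

If any of these report the expected releases listed above, the corresponding SDK or runtime is ready to use for development against Oracle Cloud Infrastructure.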


Announcements

Oracle Database now available in the Oracle Cloud Marketplace

"From this day forward, Oracle Database deployment will never be the same" ....just because in about 7~20 minutes you will have a fully functional Single Instance Oracle Database on any Oracle Cloud Infrastructure shape, BareMetal included!!! I'm so pleased and honored to announce the "Oracle Database" availability in the "Oracle Cloud MarketPlace". By leveraging the "Oracle Database" you will have the option to automatically deploy a fully functional Database environment by pasting a simple cloud-config script; the deployment allows for basic customization of the environment, further configurations, like adding extra disks, NICs, is always possible post-deployment. The framework allows for simple cleanup and re-deployment, via the Marketplace interface (terminate instance and re-launch), or cleanup the Instance within and re-deploy the same Instance with changed settings (see Usage Info below). To easily introduce to the different customization options, available with the "Oracle Database" we also created a dedicated document with examples on the Oracle Database customization deployment. The deployed Instance will be based on the following software stack: Oracle Cloud Infrastructure Native Instance Oracle Linux 7.6 UEK5 (Unbreakable Enterprise Kernel, release 5) Oracle Database 12cR2 or Oracle Database 18c For further information: Oracle Database deployment on Oracle Cloud Infrastructure Oracle Database on Oracle Cloud MarketPlace Oracle Cloud Marketplace Oracle Cloud: Try it for free

"From this day forward, Oracle Database deployment will never be the same" ....just because in about 7~20 minutes you will have a fully functional Single Instance Oracle Database on any Oracle...

Linux Kernel Development

An update on Meltdown and Enhanced IBRS

In this blog post, Oracle Linux kernel developer Konrad Rzeszutek Wilk gives an update on the state of speculative execution vulnerabilities and mitigations in 2019. In early 2018, researchers announced a novel mechanism to extract sensitive data from CPU cores using the processor's own speculative execution engine. These exploits are termed Speculative Execution Side Channel Vulnerabilities and were described in the meltdown.pdf and spectre.pdf papers. An additional side channel attack was disclosed later in the year, called L1TF. Speculative execution side channel vulnerabilities exploit a race condition in the complicated out-of-order architecture of CPUs. This post describes the state-of-the-art mitigations for such vulnerabilities. A Brief Review of Mitigations First, a brief review of the existing mitigations for speculative execution side channel vulnerabilities. The Linux mitigation for Meltdown is known as KPTI, also known as KAISER: Kernel page table isolation - the window of kernel code that each application has to have mapped is shrunk. For Spectre_v2, there are two existing mitigations: Updated microcode and the use of a new Model Specific Register (MSR) opcode to frob the CPU to flush its branch predictors. This is known as Indirect Branch Restricted Speculation (IBRS), albeit the MSR in the documentation is called SPEC_CTRL. A software-only mitigation known as retpoline, where the branch predictor is slogged through the rodeo so that its predictions are always incorrect. Either mitigation is used on every transition to the kernel. Because they are called so often, these mitigations can have a serious impact on system performance. The L1TF mitigations for applications are much simpler and require changes in handling application page tables. Mitigations to run VMs also required another microcode update and usage of a new MSR. Note that upstream Linux has not accepted the IBRS mitigation - however Oracle (along with other Linux distributions) provides this support so that systems with Skylake CPUs can be mitigated. Read more about that in our blog post on retpoline. EIBRS, you're my only hope When this started (January of 2018), Intel added a flag which would tell the operating system whether any or some of these mitigations would be necessary. If the CPU exposes that it is impervious to Rogue Data Cache Load (RDCL) then this CPU is not affected by L1TF and Meltdown attacks. Great! The Spectre_v2 story is much more complicated. Recall that there are two mitigations: Updated microcode and usage of a new MSR called SPEC_CTRL. Using retpoline - a software construct generated by the compiler. There is a third one, called Enhanced Indirect Branch Restricted Speculation or EIBRS. It only has to be activated once, not on every transition to a more privileged mode like the prior Spectre_v2 mitigations. This means that there are now three mitigations against Spectre_v2: IBRS, retpoline, and EIBRS. Oracle's X8 Generation of Engineered Systems and x86 Servers Oracle's X8 generation of Engineered Systems and x86 servers is powered by Intel Cascade Lake CPUs. This family of CPUs is also known as EIBRS-capable and is not susceptible to Rogue Data Cache Load (RDCL) attacks. Simply put, this means that the CPU is not affected by Meltdown or L1TF exploits and that it can pick the fastest of the Spectre_v2 mitigations. 
The Unbreakable Enterprise Kernel (UEK) takes advantage of that and reports this using both SysFS and the kernel ring buffer.

Using SysFS:

$ cat /sys/devices/system/cpu/vulnerabilities/meltdown
Not affected
$ cat /sys/devices/system/cpu/vulnerabilities/l1tf
Not affected
$ cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
Mitigation: Enhanced IBRS, IBPB: conditional

Kernel ring buffer:

# dmesg | grep Spectre
[ 0.085762] Spectre V2 : Options: IBRS(enhanced) IBPB retpoline
[ 0.085763] Spectre V2 : Mitigation: Enhanced IBRS
[ 0.085765] Spectre V2 : Spectre v2 mitigation: Filling RSB on context switch
[ 0.085778] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier

Or the /proc/cpuinfo bugs flags:

$ cat /proc/cpuinfo | grep bugs | uniq
bugs : spectre_v1 spectre_v2 spec_store_bypass

The spectre_v2 flag is still visible, as what EIBRS offers is a hardware mechanism to squash branch prediction attacks. Folks can toggle between retpoline and Enhanced IBRS on these CPUs. You can confirm this by looking at the kernel ring buffer output:

[ 0.085762] Spectre V2 : Options: IBRS(enhanced) IBPB retpoline

The same output on X7 (Skylake) would be:

Spectre V2 : Options: IBRS(basic) IBPB retpoline

What about Spectre v1? A keen observer might notice that the list of bugs still includes spectre_v1. However, a mitigation is in place for this as well:

$ cat /sys/devices/system/cpu/vulnerabilities/spectre_v1
Mitigation: __user pointer sanitization

N.B. UEK4 will report lfence mitigation. Solving Spectre_v1 attacks, also known as code gadgets, is a continuing effort. Oracle is using an internally developed static analyzer called Parfait along with an open source static analyzer known as smatch (see its documentation) to find them and fix them as they are discovered. The story doesn't end here, though. There is on-going research in the compiler communities to come up with a better solution to this problem. However, it is a very difficult one to solve completely. Resources Documentation that goes into more detail: Reading privileged memory with a side-channel Retpoline: a software construct for preventing branch-target-injection Retpoline: A Branch Target Injection Mitigation Deep Dive: Retpoline: A Branch Target Injection Mitigation Speculative Execution Side Channel Mitigations v3.0 Intel Analysis of Speculative Execution Side Channels SOFTWARE TECHNIQUES FOR MANAGING SPECULATION ON AMD PROCESSORS L1TF Deep Dive: Indirect Branch Restricted Speculation - Linux kernel boot parameters L1TF - L1 Terminal Fault
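Rather than cat'ing each sysfs file individually as shown above, a convenient way to see every mitigation state at once on any reasonably recent kernel (including UEK5) is to let grep print all of the entries under the vulnerabilities directory:

$ grep . /sys/devices/system/cpu/vulnerabilities/*

Each line of output names the file (the vulnerability) followed by its current status, so on an EIBRS-capable system you would see entries similar to the individual examples shown earlier in this post.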


Linux

Linux kernel 5.0: Features and Developments We Are Watching

Thanks to Chuck Anderson, Linux kernel developer, Oracle, for compiling the information in this post. Enhancements to mainline Linux continue at a steady pace, though we don’t always hear a lot about this work. With the 5.1 release upon us, we wanted to give a shout out to some important and notable additions that the 5.0 release brought to bear. Some we chose because our kernel developers are directly involved, some because they affect Oracle workloads, and others simply because they piqued our interest. Here are our top picks. For a complete overview of Linux kernel 5.0 new features, see LWN. Valuable New Features: arm64 support The arm64 architecture has gained support for a number of features including the kexec_file_load() system call, 52-bit virtual address support for user space, hotpluggable memory, per-thread stack canaries, and pointer authentication (for user space only at this point). This commit has some documentation for the pointer-authentication feature. Retpoline-elimination The first two of the retpoline-elimination mechanisms described in this article have been merged, improving performance in core parts of the DMA mapping and networking layers. Core Kernel Changes: The long-awaited energy-aware scheduling patches have found their way into the mainline. This code adds a new energy model that allows the scheduler to determine the relative power cost of scheduling decisions. This enables the mainline scheduler to get better results on mobile devices and, should reduce or eliminate the scheduler patching that various vendors engage in now. The cpuset controller now works (with reduced features) under the version-2 control-group API. See the documentation updates in this commit for details. There is also a new "dynamic events" interface to the tracing subsystem. It unifies the three distinct interfaces (for kprobes, uprobes, and synthetic events) into a single control file. See this patch posting for a brief overview of how this interface works. Improving idle behavior in tickless systems Lead paragraph: "Most processors spend a great deal of their time doing nothing, waiting for devices and timer interrupts. In these cases, they can switch to idle modes that shut down parts of their internal circuitry, especially stopping certain clocks. This lowers power consumption significantly and avoids draining device batteries. There are usually a number of idle modes available; the deeper the mode is, the less power the processor needs. The tradeoff is that the cost of switching to and from deeper modes is higher; it takes more time and the content of some caches is also lost. In the Linux kernel, the cpuidle subsystem has the task of predicting which choice will be the most appropriate. Recently, Rafael Wysocki proposed a new governor for systems with tickless operation enabled that is expected to be more accurate than the existing menu governor." Ringing in a new asynchronous I/O API: io_uring io_uring is a new asynchronous I/O kernel interface whose development we’ve been watching with great interest, not only because it promises to deliver buffered asynchronous I/O via a simplified interface, but especially for its efficiency, scalability and the performance gains that come with it. For more background, see this article by the architect and lead io_uring developer, Jens Axboe. Lead paragraph: "While the kernel has had support for asynchronous I/O (AIO) since the 2.5 development cycle, it has also had people complaining about AIO for about that long. 
The current interface is seen as difficult to use and inefficient; additionally, some types of I/O are better supported than others. That situation may be about to change with the introduction of a proposed new interface from Jens Axboe called "io_uring". As might be expected from the name, io_uring introduces just what the kernel needed more than anything else: yet another ring buffer." Pressure stall monitors Lead paragraph: "One of the useful features added during the 4.20 development cycle was the availability of pressure-stall information, which provides visibility into how resource-constrained the system is. Interest in using this information has spread beyond the data-center environment where it was first implemented, but it turns out that there some shortcomings in the current interface that affect other use cases. Suren Baghdasaryan has posted a patch set aimed at making pressure-stall information more useful for the Android use case — and, most likely, for many other use cases as well." Persistent memory for transient data Lead paragraph: "Arguably, the most notable characteristic of persistent memory is that it is persistent: it retains its contents over power cycles. One other important aspect of these persistent-memory arrays that, we are told, will soon be everywhere, is their sheer size and low cost; persistent memory is a relatively inexpensive way to attach large amounts of memory to a system. Large, cheap memory arrays seem likely to be attractive to users who may not care about persistence and who can live with slower access speeds. Supporting such users is the objective of a pair of patch sets that have been circulating in recent months." Concurrency management in BPF Lead paragraph "In the beginning, programs run on the in-kernel BPF virtual machine had no persistent internal state and no data that was shared with any other part of the system. The arrival of eBPF and, in particular, its maps functionality, has changed that situation, though, since a map can be shared between two or more BPF programs as well as with processes running in user space. That sharing naturally leads to concurrency problems, so the BPF developers have found themselves needing to add primitives to manage concurrency (the "exchange and add" or XADD instruction, for example). The next step is the addition of a spinlock mechanism to protect data structures, which has also led to some wider discussions on what the BPF memory model should look like." io_uring, SCM_RIGHTS, and reference-count cycles Lead paragraph: "The io_uring mechanism that was described here, in January has been through a number of revisions since then; those changes have generally been fixing implementation issues rather than changing the user-space API. In particular, this patch set seems to have received more than the usual amount of security-related review, which can only be a good thing. Security concerns became a bit of an obstacle for io_uring, though, when virtual filesystem (VFS) maintainer Al Viro threatened to veto the merging of the whole thing. It turns out that there were some reference-counting issues that required his unique experience to straighten out." Per-vector software-interrupt masking Lead paragraph: "Software interrupts (or "softirqs") are one of the oldest deferred-execution mechanisms in the kernel, and that age shows at times. 
Some developers have been occasionally heard to mutter about removing them, but softirqs are too deeply embedded into how the kernel works to be easily ripped out; most developers just leave them alone. So the recent per-vector softirq masking patch set from Frederic Weisbecker is noteworthy as an exception to that rule. Weisbecker is not getting rid of softirqs, but he is trying to reduce their impact and improve their latency." Memory-mapped I/O without mysterious macros Lead paragraph: "Concurrency is hard even when the hardware's behavior is entirely deterministic; it gets harder in situations where operations can be reordered in seemingly random ways. In these cases, developers tend to reach for barriers as a way of enforcing ordering, but explicit barriers are tricky to use and are often not the best way to think about the problem. It is thus common to see explicit barriers removed as code matures. That now seems to be happening with an especially obscure type of barrier used with memory-mapped I/O (MMIO) operations." Reimplementing printk() Lead paragraph: "The venerable printk() function has been part of Linux since the very beginning, though it has undergone a fair number of changes along the way. Now, John Ogness is proposing to fundamentally rework printk() in order to get rid of a handful of issues that currently plague it. The proposed code does this by adding yet another ring-buffer implementation to the kernel; this one is aimed at making printk() work better from hard-to-handle contexts. For a task that seems conceptually simple—printing messages to the console—printk() is actually a rather complex beast; that won't change if these patches are merged, though many of the problems with the current implementation will be removed." The RCU API, 2019 edition Lead paragraph: "Read-copy update (RCU) is a synchronization mechanism that was added to the Linux kernel in October 2002. RCU is most frequently described as a replacement for reader-writer locking, but has also been used in a number of other ways. RCU is notable in that readers do not directly synchronize with updaters, which makes RCU read paths extremely fast; that also permits RCU readers to accomplish useful work even when running concurrently with updaters. Although the basic idea behind RCU has not changed in the decades following its introduction into DYNIX/ptx, the API has evolved significantly over the five years since the 2014 edition of the RCU API, to say nothing of the nine years since the 2010 edition of the RCU API." Containers as kernel objects — again Lead paragraph: "Linus Torvalds once famously said that there is no design behind the Linux kernel. That may be true, but there are still some guiding principles behind the evolution of the kernel; one of those, to date, has been that the kernel does not recognize "containers" as objects in their own right. Instead, the kernel provides the necessary low-level features, such as namespaces and control groups, to allow user space to create its own container abstraction. This refusal to dictate the nature of containers has led to a diverse variety of container models and a lot of experimentation. But that doesn't stop those who would still like to see the kernel recognize containers as first-class kernel-supported objects." Internal Kernel Changes: There is a new "software node" concept that is meant to be analogous to the "firmware nodes" created in ACPI or device-tree descriptions. See this commit for some additional information. 
The software-tag-based mode for KASAN has been added for the arm64 architecture. The switch to using JSON schemas for device-tree bindings has begun with the merging of the core infrastructure and the conversion of a number of binding files. The long-deprecated SUBDIRS= build option is going away in the 5.3 merge window; users will start seeing a warning as of 5.0. The M= option should be used instead. The venerable access_ok() function, which verifies that an address lies within the user-space region, has lost its first argument. This argument was either VERIFY_READ or VERIFY_WRITE depending on the type of access, but no implementation of access_ok() actually used that information. Filesystems and Block Layer Changes: The Btrfs filesystem has regained the ability to host swap files, though with a lot of limitations (no copy-on-write, must be stored on a single device, and no compression allowed, for example). The fanotify() mechanism supports a new FAN_OPEN_EXEC request to receive notifications when a file is opened to be executed. The legacy (non-multiqueue) block layer code has been removed, now that no drivers require it. The legacy I/O schedulers (including CFQ and deadline) have been removed as well. Networking Changes: Generic receive offload (GRO) can now be enabled on plain UDP sockets. Benchmark numbers in this commit show a significant increase in receive bandwidth and a large reduction in the number of system calls required. ICMP error handling for UDP tunnels is now supported. The MSG_ZEROCOPY option is now supported for UDP sockets. Security Changes: Support for the Streebog hash function (also known as GOST R 34.11-2012) has been added to the cryptographic subsystem. The kernel is now able to support non-volatile memory arrays with built-in security features; see Documentation/nvdimm/security.txt for details. A small piece of the secure-boot lockdown patch set has landed in the form of additional control over the kexec_file_load() system call. There is a new keyring (called .platform) for keys provided by the platform; it cannot be updated by a running system. Keys in this ring can be used to control which images may be run via kexec_file_load(). It has also become possible for security modules to prevent calls to kexec_load(), which cannot be verified in the same manner. The secure computing (seccomp) mechanism can now defer policy decisions to user space. See this new documentation for details on the final version of the API. The fscrypt filesystem encryption subsystem has gained support for the Adiantum encryption mode (which was added earlier in the merge window). The semantics of the mincore() system call have changed. In this commit, Linus Torvalds explains how the new semantics of this system call restrict access to pages that are mapped by the calling process. An ancient OpenSSH vulnerability Lead paragraph: "An advisory from Harry Sintonen describes several vulnerabilities in the scp clients shipped with OpenSSH, PuTTY, and others. "Many scp clients fail to verify if the objects returned by the scp server match those it asked for. This issue dates back to 1983 and rcp, on which scp is based. A separate flaw in the client allows the target directory attributes to be changed arbitrarily. Finally, two vulnerabilities in clients may allow server to spoof the client output." The outcome is that a hostile (or compromised) server can overwrite arbitrary files on the client side. There do not yet appear to be patches available to address these problems." 
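As a quick, concrete illustration of the SUBDIRS= deprecation noted under Internal Kernel Changes above: an out-of-tree module build that used to pass SUBDIRS= should switch to the M= form. The paths below are illustrative and are not taken from the original post.

# Old, deprecated invocation (warns as of 5.0, scheduled for removal in 5.3):
make -C /lib/modules/$(uname -r)/build SUBDIRS=$PWD modules
# Equivalent invocation using the M= option:
make -C /lib/modules/$(uname -r)/build M=$PWD modules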
Defending against page-cache attacks Lead paragraph: "The kernel's page cache works to improve performance by minimizing disk I/O and increasing the sharing of physical memory. But, like other performance-enhancing techniques that involve resources shared across security boundaries, the page cache can be abused as a way to extract information that should be kept secret. A recent paper [PDF] by Daniel Gruss and colleagues showed how the page cache can be targeted for a number of different attacks, leading to an abrupt change in how the mincore() system call works at the end of the 5.0 merge window. But subsequent discussion has made it clear that mincore() is just the tip of the iceberg; it is unclear what will really need to be done to protect a system against page-cache attacks or what the performance cost might be." Fixing page-cache side channels, second attempt Lead paragraph: "The kernel's page cache, which holds copies of data stored in filesystems, is crucial to the performance of the system as a whole. But, as has recently been demonstrated, it can also be exploited to learn about what other users in the system are doing and extract information that should be kept secret. In January, the behavior of the mincore() system call was changed in an attempt to close this vulnerability, but that solution was shown to break existing applications while not fully solving the problem. A better solution will have to wait for the 5.1 development cycle, but the shape of the proposed changes has started to come into focus." A proposed API for full-memory encryption Lead paragraph: "Hardware memory encryption is, or will soon be, available on multiple generic CPUs. In its absence, data is stored — and passes between the memory chips and the processor — in the clear. Attackers may be able to access it by using hardware probes or by directly accessing the chips, which is especially problematic with persistent memory. One new memory-encryption offering is Intel's Multi-Key Total Memory Encryption (MKTME) [PDF]; AMD's equivalent is called Secure Encrypted Virtualization (SEV). The implementation of support for this feature is in progress for the Linux kernel. Recently, Alison Schofield proposed a user-space API for MKTME, provoking a long discussion on how memory encryption should be exposed to the user, if at all." Other Developments of Note Snowpatch: continuous-integration testing for the kernel Lead paragraph: "Many projects use continuous-integration (CI) testing to improve the quality of the software they produce. By running a set of tests after every commit, CI systems can identify problems quickly, before they find their way into a release and bite unsuspecting users. The Linux kernel project lags many others in its use of CI testing for a number of reasons, including a fundamental mismatch with how kernel developers tend to manage their workflows. At linux.conf.au 2019, Russell Currey described a CI system called Snowpatch that, he hopes, will bridge the gap and bring better testing to the kernel development process." The Firecracker virtual machine monitor The Firecracker virtual machine monitor is not strictly speaking a Linux kernel 5.0 feature but it does use the KVM API. Lead paragraph: "Cloud computing services that run customer code in short-lived processes are often called "serverless". But under the hood, virtual machines (VMs) are usually launched to run that isolated code on demand. The boot times for these VMs can be slow. 
This is the cause of noticeable start-up latency in a serverless platform like Amazon Web Services (AWS) Lambda. To address the start-up latency, AWS developed Firecracker, a lightweight virtual machine monitor (VMM), which it recently released as open-source software. Firecracker emulates a minimal device model to launch Linux guest VMs more quickly. It's an interesting exploration of improving security and hardware utilization by using a minimal VMM built with almost no legacy emulation." As covered above, there are many interesting developments in mainline Linux kernel 5.0, some of which we believe are particularly relevant to Oracle customers. As of this writing, 5.0.10 is considered stable. We will continue to monitor developments in upcoming kernels, so look for a blog post with highlights in the next few months. Additional Resources: For more on mainline Linux and other related topics, see: LWN Oracle’s Linux Kernel Development blog


Events

Meet the Oracle Linux and Virtualization Team at Dell Technologies World

April 29 – May 2, 2019, Las Vegas, Nevada   Heading to Las Vegas next week for Dell Technologies World, April 29 – May 2, 2019, at The Venetian hotel? It’s a great opportunity to learn about the latest advancements in Oracle Linux and virtualization technologies. Optimized for hybrid cloud environments, these Oracle offerings are used in both on-premises and cloud deployments, running billions of transactions per day. At the conference, you can also learn about Oracle and Dell EMC’s deep engineering relationship. The companies have been working together for many years on industry solutions like data integrity, and to provide support for mutual customers. Dell EMC works closely with Oracle to qualify its servers and storage on Oracle Linux and Oracle VM and to provide validated configurations to help customers efficiently deploy joint solutions. Here’s where you can learn more: Oracle Linux and Virtualization @ Dell Technologies World Demos with Product Experts in Booth #124 Stop by our booth – #124 – meet our team and learn about Oracle Linux, Oracle VM, Oracle VM VirtualBox, cloud native solutions, and more. Talk with product experts, let us answer your questions, and guide you to the best solution for your business. Presentations Tuesday, April 30, @ 4:00 PM Location: Theater 4 in the Storage Section of the Dell Technologies Infrastructure Solutions Booth Simplify Cloud Infrastructure Deployments, Increase Performance and Enhance Data Integrity with Oracle and Dell EMC This session will cover how to achieve peak performance for Oracle workloads running on Oracle Linux with Dell EMC servers and storage. You'll also learn how Oracle and Dell EMC's collaboration, including joint qualifications and data integrity standards, can help you save time and costs on cloud infrastructure and on-premises deployments. Speakers: Michele Resta, Product Management Senior Director, Oracle Linux and Virtualization Yaron Dar, Director, Partner Engineering, Dell EMC Wednesday, May 1, @ 12:10 PM Location: World Chat Theatre A on the Exhibit Floor Build a Cloud Native Environment with Oracle Linux Tried, tested, and tuned for enterprise workloads, Oracle Linux is used by developers worldwide. Oracle Linux offers an open, integrated operating environment with application development tools, management tools, containers, and orchestration capabilities, which enable DevOps teams to efficiently build reliable, secure cloud native applications. In this session, learn how Oracle Linux can help you enhance productivity. Speaker: Ken Ellis, Sales Consulting Director, Oracle Monday, Apr 29, 4:30 PM – Location: Marco Polo 704 or Thursday, May 2, 1:00 PM – Location: Delfino 4005 Dell EMC PowerMax & Oracle: Performance, Availability & Efficiency Deep-Dive This session will focus on proof points, best practices, and guidelines for achieving peak performance for Oracle workloads, maintaining high availability through disasters, and achieving amazing data reduction and storage efficiency for Oracle databases. Speaker: Yaron Dar, Director, Partner Engineering, Dell EMC Engage with Us on Social Media We will keep you up-to-date on conference happenings. Join the conversation via #OracleLinux @OracleLinux Register for Dell Technologies World, where you can learn about new capabilities, how to reinvent processes, innovate faster, and create value that will change the game for your business. We look forward to seeing you at the conference. The Oracle Linux and Virtualization team


Linux Kernel Development

Towards A More Secure QEMU Hypervisor, Part 3 of 3

In this blog, the third in a series of three, Oracle Linux developer Jag Raman analyzes the performance of the disaggregated QEMU. Performance of Separated LSI device It is essential to check how the Separated LSI device performs in comparison with the LSI device built into QEMU. For this purpose, we ran the CloudHarmony block-storage benchmark on both. We ran this benchmark on a BareMetal instance in Oracle Cloud. The results are summarized in the detailed CloudHarmony reports; please follow the links below: Performance of Built-in LSI device (pdf) Performance of Separated LSI device (pdf) Built-in LSI vs. Separated LSI Analysis of Performance The Separated LSI device performs very similarly to the Built-in LSI device. There are some cases where there is a gap between them. We are working on improving our understanding of why this is the case, and have detailed proposals available to bridge the gap. Following is a technical discussion of the performance problem and plans to fix it. Message passing overhead The current model for multi-process QEMU uses a communication channel with Unix sockets to transfer messages between QEMU & the Separated process. Since MMIO read/write is also passed as a message, there is a significant overhead associated with the syscall used to move the MMIO request to the Separated device. We believe that this overhead adds up, especially in cases where the IOPS are large, resulting in a noticeable performance drop. The following proposal tries to minimize this overhead. MMIO acceleration proposal The majority of data transfer between the VM & Separated process happens over DMA (which has no overhead). However, MMIO writes are used to initiate these transfers, and MMIO reads are used to monitor the status/completion of IO requests. We think that in some cases, these MMIO accesses could limit IO performance. Even a small overhead per MMIO could cumulatively result in a performance drop, especially in the case of high-IOPS devices. As a result, it is essential to reduce this overhead as much as possible. We are working on the following proposal to accelerate MMIO access. First, the proposal is to bypass QEMU and forward all the MMIOs trapped by Kernel/KVM directly to the Separated process. Secondly, a shared ring buffer would be used to send messages to the Separated process, instead of using Unix sockets. These two changes are expected to reduce the overhead associated with MMIO access, thereby improving performance.
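One rough way to observe the message-passing overhead described above, which is not part of the original post, is to count the socket-related system calls the remote process makes while a benchmark runs. strace's summary mode can do this, keeping in mind that strace adds its own overhead and that qemu-scsi-dev is the remote program name used elsewhere in this series:

# Attach to the remote device process and collect a syscall summary;
# interrupt with Ctrl-C after the benchmark has run for a while.
sudo strace -c -f -p $(pgrep -f qemu-scsi-dev)
# The summary shows how often sendmsg/recvmsg were issued and how much
# time was spent in them, giving a feel for the per-message cost.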


Announcements

Announcing the Oracle Cloud Developer Image for Oracle Cloud Infrastructure

We are pleased to introduce the Oracle Cloud Developer Image, an Oracle Linux 7 based, ready-to-run image that provides a comprehensive out-of-the-box development platform on Oracle Cloud Infrastructure. The Oracle Cloud Developer Image enables you to rapidly launch a pre-installed, automatically configured development environment on Oracle Cloud Infrastructure that includes the latest tools, a choice of popular development languages, Oracle Cloud Infrastructure Software Development Kits (SDKs), and database connectors. The Oracle Cloud Developer Image for Oracle Cloud Infrastructure puts all the tools needed throughout the development lifecycle at your fingertips. You can use command line and GUI tools to write, debug, and run code in a variety of languages, and develop modern applications on Oracle Cloud Infrastructure. The introductory release of the Oracle Cloud Developer Image includes the following tools and packages: Latest Oracle Linux 7 image for Oracle Cloud Infrastructure Languages and Oracle Database Connectors Java Platform, Standard Edition (Java SE) 8 Python 3.6 cx_Oracle 7 Python module for Python 2.7 Node.js 10 and node-oracledb Go 1.12 Oracle Instant Client 18.5 Oracle Cloud Infrastructure Command Line Interface (CLI), Software Development Kits (SDKs) and Tools Oracle Cloud Infrastructure CLI Python, Java, Go, and Ruby Oracle Cloud Infrastructure SDKs Terraform and Oracle Cloud Infrastructure Terraform Provider Oracle Cloud Infrastructure Utilities Other Oracle Container Runtime for Docker Extra Packages for Enterprise Linux (EPEL) via Yum GUI Desktop with access via VNC Server More tools and packages will be included with the Oracle Cloud Developer Image in upcoming releases. Here are some examples of use cases where you can take advantage of the out-of-the-box tools included with the Oracle Cloud Developer Image: Easily access a VNC client or SSH to connect to a desktop environment Use Oracle Cloud Infrastructure CLI and SDKs Create an Autonomous Transaction Processing Database and connect to it using Python, Java, PHP, and Node.js Use Terraform scripts and templates to configure and build the cloud application infrastructure Run Docker containers To get started, simply log into your Oracle Cloud Infrastructure console, and deploy the image from the Marketplace by selecting Marketplace in the main navigation menu, under Solutions, Platform and Edge. Search for and select ‘Oracle Cloud Developer Image’. Follow the click-through instructions to launch the Oracle Cloud Developer Image instance. The Oracle Cloud Developer Image is available at no additional cost to Oracle Cloud Infrastructure subscribers. If you do not already have an Oracle Cloud Infrastructure account, register for one here. You can try out the Oracle Cloud Developer Image today with available free subscription credits on Oracle Cloud Infrastructure. We welcome your feedback on the Oracle Cloud Developer Image. Please send your comments and questions to oraclelinux-info_ww_grp@oracle.com or post them on the Oracle Linux for Infrastructure Community. For more information, visit the following links: Oracle Linux Oracle Linux 7 Documentation Oracle Linux for Oracle Cloud Infrastructure Blog: Click to Launch Images by Using the Marketplace in Oracle Cloud Infrastructure Oracle Linux Blog Oracle Linux for Oracle Cloud Infrastructure Community Pages Oracle Linux for Oracle Cloud Infrastructure Training
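For example, once an instance based on the image is running, the pre-installed OCI CLI can be configured and exercised immediately. The commands below are standard OCI CLI commands shown purely as an illustration; the interactive setup prompts for your user and tenancy OCIDs, region, and API key:

$ oci --version        # confirm the CLI shipped with the image
$ oci setup config     # one-time interactive CLI configuration
$ oci os ns get        # prints your Object Storage namespace once configured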


Linux

New Oracle Validated Configurations on Lenovo ThinkSystem Servers and Storage

Check out the latest Oracle Validated Configurations on Lenovo, published on the Oracle Validated Configurations website. These configurations use Lenovo ThinkSystem SR650/SR850/SR950 Servers with ThinkSystem Storage DM5000F. Through Oracle Validated Configurations and the Hardware Certification List (HCL), Lenovo has performed thorough testing of its hardware in real-world configurations with Oracle Linux and Oracle VM. This helps assure mutual customers that Lenovo hardware is qualified on Oracle Linux and Oracle VM and that the combined solution can provide optimal performance and reliability, with faster, lower-cost implementations. Additionally, the validated configurations can help minimize risk for enterprises by reducing deployment testing and validation efforts. The three latest validated configurations are qualified on Oracle Linux 7 and Oracle VM. These pre-tested, validated reference architectures - including software, hardware, storage, and network components - are: Lenovo ThinkSystem SR950 with ThinkSystem DM5000F storage array Lenovo ThinkSystem SR850 with ThinkSystem DM5000F storage array Lenovo ThinkSystem SR650 with ThinkSystem DM5000F storage array Lenovo ThinkSystem SR950 Server The Lenovo ThinkSystem SR950 is a 4U rack server that supports up to 8 processors and 96 DIMMs. It is designed for your most demanding, mission-critical workloads, such as Oracle in-memory databases, large transactional databases, batch and real-time analytics, ERP, CRM, and virtualized server workloads. Lenovo ThinkSystem SR850 Server Lenovo ThinkSystem SR850 is a 4-socket server that features a streamlined 2U rack design that is optimized for price and performance, with best-in-class flexibility and expandability. It is built for workloads like general business applications, server consolidation, and accelerating transactional databases and analytics. Lenovo ThinkSystem SR650 Server Lenovo ThinkSystem SR650 is an ideal 2-socket server for small businesses up to large enterprises that need industry-leading reliability, management, and security, as well as the ability to maximize performance and flexibility for future growth. The SR650 server is designed to handle a wide range of workloads, such as databases, virtualization, and cloud computing. Lenovo ThinkSystem DM5000F storage array Lenovo ThinkSystem DM5000F is a unified, all-flash, entry-level storage system that is designed to provide performance, simplicity, capacity, security, and high availability for medium to large businesses. Powered by ONTAP software, ThinkSystem DM5000F delivers enterprise-class storage management capabilities with a wide choice of host connectivity options and enhanced data management features. The ThinkSystem DM5000F can handle a wide range of enterprise workloads, including big data and analytics, artificial intelligence, engineering and design, enterprise applications, and other storage I/O-intensive applications. Validated Configuration Summary: Two Lenovo ThinkSystem SR650/SR850/SR950 Lenovo ThinkSystem Storage DM5000F Lenovo RackSwitch G8272 Lenovo ThinkSystem Storage switch 32 GB FC SAN Switch DB620S Intel Quad 10 GbE SFP+ adapter QLogic 32 GB FC Dual-port HBA Oracle Linux 7 Update 6 with the Unbreakable Enterprise Kernel Release 4 Oracle Database 12c Release 2 These tested configurations, along with the benefits of reliability, availability, and serviceability (RAS) features from Lenovo ThinkSystem servers and storage, are an excellent choice for business-critical Oracle deployments. 
These Oracle Validated Configurations, from the 2-socket SR650 to the 8-socket SR950, provide flexibility for enterprises to choose the configuration that suits their workload demands. To learn more about the benefits of Lenovo ThinkSystem Servers and Storage, visit: Lenovo Servers for Mission Critical Workloads.


Announcements

Announcing Oracle VirtIO Drivers 1.1.3 for Microsoft Windows

We are pleased to announce Oracle VirtIO Drivers for Microsoft Windows release 1.1.3. The Oracle VirtIO Drivers for Microsoft Windows are paravirtualized (PV) drivers for Microsoft Windows guests that are running on Oracle Linux KVM. The Oracle VirtIO Drivers for Microsoft Windows improve performance for network and block (disk) devices on Microsoft Windows guests and resolve common issues. What's New Release 1.1.3 of the Oracle VirtIO Drivers for Microsoft Windows provides a new “Custom” installation option that facilitates the migration of guest VMs to run in PV mode on Oracle Cloud Infrastructure (OCI). This enables you to run existing Microsoft Windows images as PV instances on OCI. The “Default” option installs the Oracle VirtIO Drivers on the Microsoft Windows guest running on Oracle Linux KVM. Oracle VirtIO Drivers 1.1.3 is built on the 1.1.2 release of VirtIO Drivers that have been certified by Microsoft. The update (from 1.1.2 to 1.1.3) is due to a new custom installation option. Existing customers using Oracle VirtIO Drivers 1.1.2 do not need to upgrade to the 1.1.3 release. The new "Custom" installation, executed on a Microsoft Windows virtual machine in "Oracle Cloud Infrastructure - Classic (OCI-C)" or on premises in an Oracle VM virtual machine, adds and activates the Oracle VirtIO drivers required to run in paravirtualized mode on Oracle Cloud Infrastructure. Oracle VirtIO Drivers 1.1.3 supports the KVM hypervisor with Oracle Linux 7 on premises and on Oracle Cloud Infrastructure. The following guest Microsoft Windows operating systems are supported (64-bit / 32-bit):
Microsoft Windows Server 2016: 64-bit Yes, 32-bit Not Available
Microsoft Windows Server 2012 R2: 64-bit Yes, 32-bit Not Available
Microsoft Windows Server 2012: 64-bit Yes, 32-bit Not Available
Microsoft Windows Server 2008 R2 SP1: 64-bit Yes, 32-bit Not Available
Microsoft Windows Server 2008 SP2: 64-bit Yes, 32-bit Yes
Microsoft Windows Server 2003 R2 SP2: 64-bit Yes, 32-bit Yes
Microsoft Windows 10: 64-bit Yes, 32-bit Yes
Microsoft Windows 8.1: 64-bit Yes, 32-bit Yes
Microsoft Windows 7 SP1: 64-bit Yes, 32-bit Yes
Microsoft Windows Vista SP2: 64-bit Yes, 32-bit Yes
For further details related to support and certifications, refer to the Oracle Linux 7 Administrator's Guide. Additional information on the Oracle VirtIO Drivers 1.1.2 certifications can be found in the Windows Server Catalog. Downloading Oracle VirtIO Drivers Oracle VirtIO Drivers release 1.1.3 is available on the Oracle Software Delivery Cloud by searching on "Oracle Linux". Click on the "Add to Cart" button and then click on "Checkout" in the upper right corner. On the following window, select "x86-64" and click on the "Continue" button: Click on "V981734-01.zip - Oracle VirtIO Drivers Version for Microsoft Windows 1.1.3" to download the drivers: The Oracle VirtIO Drivers release 1.1.3 is also available on My Oracle Support under the patch number 27637937. Oracle Linux Resources Documentation Oracle Linux Administrator's Guide for Release 7 - Virtualization Oracle VirtIO Drivers for Microsoft Windows Blogs Oracle Linux Blog Oracle Virtualization Blog Community Pages Oracle Linux Product Training and Education Oracle Linux Administration - Training and Certification Data Sheets, White Papers, Videos, Training, Support & more Oracle Linux Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter


Announcements

Registration is Open for Oracle OpenWorld and Oracle Code One San Francisco 2019

Register now for Oracle OpenWorld and Oracle Code One San Francisco. These concurrent events are happening September 16-19, 2019 at Moscone Center. By registering now, you can take advantage of the Super Saver rate before it expires on April 20, 2019. This year at Oracle OpenWorld San Francisco, you’ll learn how to do more with your applications, adopt new technologies, and network with product experts and peers. Don’t miss the opportunity to experience: Future technology firsthand such as Oracle Autonomous Database, blockchain, and artificial intelligence New connections by meeting some of the brightest technologists from some of the world’s most compelling companies Technical superiority by taking home new skills and getting an insider’s look at the latest Oracle technology Lasting memories while experiencing all that Oracle has to offer, including many opportunities to unwind and have some fun At Oracle Code One, the most inclusive developer conference on the planet, come learn, experiment, and build with us. You can participate in discussions on Linux, Java, Go, Rust, Python, JavaScript, SQL, R, and more. See how you can shape the future and break new ground. Join deep-dive sessions and hands-on labs covering leading-edge technology such as blockchain, chatbots, microservices, and AI. Experience cloud development technology in the Groundbreakers Hub, featuring workshops and other live, interactive experiences and demos. Register Now and Save! Now is the best time to register for these popular conferences and take us up on the Super Saver rate. Then be sure to check back in early May 2019 for the full content catalog, where you will see the breadth and depth of our sessions. You can also sign up to be notified when the content catalog goes live. Register now for Oracle OpenWorld San Francisco 2019 Register now for Oracle Code One San Francisco 2019 We look forward to seeing you in September!


Linux Kernel Development

Towards A More Secure QEMU Hypervisor, Part 2 of 3

In this blog, the second in a series of three, Oracle Linux kernel developer Elena Ufimtseva demonstrates how to configure and build our disaggregated QEMU. Configure and build multi-process QEMU To build the system that supports multi-process device emulation in QEMU, the build system was modified to add new objects. To get the latest development tree with multi-process support, clone it from the git repository and check out the multi-process-qemu-v0.1 branch:
git clone -b multi-process-qemu-v0.1 https://github.com/oracle/qemu.git
Run configure with --enable-mpqemu to enable multi-process qemu and run make:
./configure --disable-xen --disable-tcg --disable-tcg-interpreter --target-list=x86_64-softmmu --enable-guest-agent --enable-mpqemu
make all
make install
Notes on Xen: If Xen support is not needed on the system, --disable-xen should be used. On OL7, --disable-xen should be used. There are a few executable files, some of which are the remote programs. Depending on the options used while configuring QEMU, one may need to add the location of those remote programs to the PATH environment variable. In the current version, the program name is “qemu-scsi-dev”. The configure script can be used with the --install= option to specify the installation directory. Running multi-process QEMU To run qemu device emulation in a separate process, there are the following options that are different from the original qemu: -rdevice and -rdrive. These options are similar to the ones in the original qemu and can be used in the same way. For example, to run a disk attached to an LSI SCSI controller in a remote process, the following command line can be used:
/usr/local/bin/qemu-system-x86_64 -name vm -m 6G -drive file=/root/ol7.qcow2,format=raw -enable-kvm -machine q35,accel=kvm -rdevice lsi53c895a,rid=0,id=scsi0,command=qemu-scsi-dev -rdevice scsi-hd,rid=0,drive=drive0,bus=scsi0.0,scsi-id=0 -rdrive id=drive0,rid=0,file=/root/cirros-0.4.0-x86_64-disk.img,format=qcow2 -object memory-backend-file,id=mem,mem-path=/dev/shm/,size=6G,share=on -numa node,memdev=mem -display none -vnc :0 -monitor stdio -device e1000,netdev=net0 -netdev user,id=net0,hostfwd=tcp::5555-:22
Required options are:
remote device options: -rdevice lsi53c895a,rid=0,id=scsi0,command=qemu-scsi-dev -rdevice scsi-hd,rid=0,drive=drive0,bus=scsi0.0,scsi-id=0 -rdrive id=drive0,rid=0,file=/root/cirros-0.4.0-x86_64-disk.img,format=qcow2
memory object to support file-descriptor-based memory synchronization between the remote process and qemu: -object memory-backend-file,id=mem,mem-path=/dev/shm/,size=6G,share=on -numa node,memdev=mem
Running multi-process qemu with one remote process results in two processes: one is the main qemu process and the second is the qemu-scsi-dev remote process. Debugging and troubleshooting There are additional options to provide more diagnostics for debugging. To enable logging for multi-process qemu, the -D option can be specified with the mask “rdebug”:
-D /tmp/qemu.log -d rdebug
To enable QEMU debugging with gdb, it can be configured with --enable-debug-info to include debug symbols. Since multi-process qemu has additional processes that are spawned during execution, to use gdb to debug child processes the following settings can be used when launching gdb:
set detach-on-fork off
set follow-exec-mode new
set follow-fork-mode child
set print inferior-events on
This will allow debug of the child process automatically. 
Below is the example of such a debug session:
[root@localhost ~]# gdb /usr/local/bin/qemu-system-x86_64
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/local/bin/qemu-system-x86_64...done.
(gdb) r -enable-kvm -machine q35 -smp 4 -m 8000M -vnc :0 -net nic -net user,hostfwd=tcp::5022-:22 -drive file=/root/ol7.qcow2,format=raw -rdevice lsi53c895a,rid=0,id=scsi0 -rdevice scsi-hd,rid=0,drive=drive0,bus=scsi0.0,scsi-id=0 -rdrive id=drive0,rid=0,file=/root/cirros-0.4.0-x86_64-disk.img -object memory-backend-file,id=mem,mem-path=/dev/shm/,size=8000M,share=on -numa node,memdev=mem
Starting program: /usr/local/bin/qemu-system-x86_64 -enable-kvm -machine q35 -smp 4 -m 8000M -vnc :0 -net nic -net user,hostfwd=tcp::5022-:22 -drive file=/root/ol7.qcow2,format=raw -rdevice lsi53c895a,rid=0,id=scsi0 -rdevice scsi-hd,rid=0,drive=drive0,bus=scsi0.0,scsi-id=0 -rdrive id=drive0,rid=0,file=/root/cirros-0.4.0-x86_64-disk.img -object memory-backend-file,id=mem,mem-path=/dev/shm/,size=8000M,share=on -numa node,memdev=mem
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffef5fe700 (LWP 14001)]
[New Thread 0x7ffdfac1a700 (LWP 14003)]
[New Thread 0x7ffdfa419700 (LWP 14005)]
[New Thread 0x7ffdf9c18700 (LWP 14006)]
[New Thread 0x7ffdf9417700 (LWP 14007)]
[New Thread 0x7ffdebfff700 (LWP 14009)]
[New inferior 14010]
[New process 14010]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Thread 0x7ffff7fc5c00 (LWP 14010) is executing new program: /usr/local/bin/qemu-scsi-dev
[New inferior 14010]
(gdb) info inferior
Num  Description      Executable
3    process 14010    /usr/local/bin/qemu-scsi-dev
2    <null>           /usr/local/bin/qemu-system-x86_64
1    process 13997    /usr/local/bin/qemu-system-x86_64
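Outside of gdb, a plain process listing on the host shows the same split between the main QEMU binary and the remote device process. This quick check is not from the original post, and the exact output will differ per system:

$ pgrep -a qemu
# expect two entries: qemu-system-x86_64 (the main process) and
# qemu-scsi-dev (the remote LSI device process)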


Linux Kernel Development

Towards A More Secure QEMU Hypervisor, Part 1 of 3

In this blog, the first in a series of three, Oracle Linux kernel developer John Johnson introduces Oracle’s work towards a more secure QEMU based hypervisor. Disaggregating QEMU QEMU is often used as the hypervisor for virtual machines running in the Oracle cloud. Since one of the advantages of cloud computing is the ability to run many VMs from different tenants in the same cloud infrastructure, a guest that compromised its hypervisor could potentially use the hypervisor’s access privileges to access data it is not authorized to access. QEMU can be susceptible to security attacks because it is a large, monolithic program that provides many services to the VMs it controls. Many of these services can be configured out of QEMU, but even a reduced-configuration QEMU has a large amount of code a guest can potentially attack in order to gain additional privileges. QEMU services QEMU can be broadly described as providing three types of services. One is a VM control point, where VMs can be created, migrated, re-configured, and destroyed. A second service emulates the CPU instructions within the VM, usually accelerated by HW virtualization features such as Intel’s VT extensions. Finally, it provides IO services to the VM by emulating HW IO devices, such as disk and network devices. All these services exist within a single, monolithic QEMU process. A disaggregated QEMU A disaggregated QEMU involves separating these services into multiple host processes. Having these services in separate processes allows us to use SELinux mandatory access controls to constrain each process to only the files needed to provide its service; e.g., a disk emulation process would be given access to only the disk images it provides, and would not be allowed to access other host files or any network devices. An attacker who compromised such a disk emulation process would not be able to exploit it beyond the host files the process has been granted access to. A QEMU control process would remain, but in disaggregated mode, it would be a control point that executes the processes needed to support the VM being created and sets up the communication paths between them. But the QEMU control process would have no direct interfaces to the VM, although it would still provide the user interface to control the VM, such as hot-plugging devices or live migrating the VM. Disaggregating IO services A first step in creating a disaggregated QEMU is to separate IO services from the main QEMU program. The main QEMU process would continue to provide CPU emulation as well as being the VM control point. In a later phase, CPU emulation could be separated from the control process. Disaggregating IO services is a good place to begin QEMU disaggregation for a couple of reasons. One is that the sheer number of IO devices QEMU can emulate provides a large surface of interfaces which could potentially be exploited. Another is that the modular nature of QEMU device emulation code provides interface points where the QEMU functions that perform device emulation can be separated from the QEMU functions that manage the emulation of guest CPU instructions. Disaggregated CPU emulation After IO services have been disaggregated, a second phase would be to separate a process to handle CPU instruction emulation from the main QEMU control function. There are few existing object separation points for this code, so the first task would be to create interfaces between the control plane functions and functions that manage guest CPUs. 
Progress to date We’ve separated our first device from the main QEMU process: an LSI 895 SCSI disk controller. Future blog posts on this topic will cover the design of the project, its performance, as well as where the source code can be found and how to use it. To see part 2 in this blog series, go to: https://blogs.oracle.com/linux/towards-a-more-secure-qemu-hypervisor%2c-part-2-of-3


Linux

libresource - It is time for version 2

In this blog post, Oracle Linux kernel developer Rahul Yadav discusses a few details about version 2 of his libresource project. As discussed in my previous blog[1] on libresource, we are working on a library which provides APIs to get system resource information for user-land applications. The system resource information includes information related to memory, networking, devices and various other statistics. Currently an application developer needs to read this information mostly from procfs and sysfs. The developer needs to open a file, read the desired information, parse that information and then close the file. libresource provides simple APIs to do away with all these steps and allow the application to get the information via one call. In version 1 of libresource we delivered the following: Basic infrastructure so adding new resource information is straightforward We added a lot of memory and networking related system resource information All the user application facing APIs are done I presented the current status at Linux Plumbers Conference 2018[2] and discussed with the community what we should be doing next in libresource version 2. The following things came out of that discussion: We need to add more system resource information in the library. Large applications like databases or web servers need a lot of system-related information to make decisions, and it will be good for them to get all system resource information from one library. We are planning to add the following resource IDs:
NET_TCPSENDBUFSIZE - Send buffer sizes for TCP
NET_TCPRECVBUFSIZE - Recv buffer sizes for TCP
NET_GLOBALSENDBUFSIZE - Send buffer sizes for global
NET_GLOBALRECVBUFSIZE - Global recv buffer sizes
NET_BUFSIZEINFOALL - Send/recv buffer sizes for global and TCP
MMAP_PROC_HEAPINFO - Heap address and heap size for pid
MMAP_PROC_STACKINFO - Stack address and stack size for pid
FS_AIONR - Running total of the number of events specified on the io_setup system call for all currently active aio contexts
FS_AIOMAXNR - Max AIONR possible. If aio-nr reaches aio-max-nr then io_setup will fail with EAGAIN
FS_FILENR - Number of allocated file handles, the number of allocated but unused file handles, and the maximum number of file handles
FS_FILEMAXNR - Maximum number of file-handles that the Linux kernel will allocate
CPU_CORECOUNT - Core count
CPU_THREADCOUNT - Total thread count (CPU count)
PROCSET_NUMCPU - Get current number of CPUs in caller's processor set
CPU_ARCHINFOALL - Struct which has socket, core and thread count
MEM_HUGEPAGESIZE - Size of a huge page
MEM_HUGEPAGEALL - Struct with all information about huge pages
VMSTAT_PAGEIN - Number of page-ins since last boot
VMSTAT_PAGEOUT - Number of page-outs since last boot
VMSTAT_SWAPIN - Number of swap-ins since last boot
VMSTAT_SWAPOUT - Number of swap-outs since last boot
VMSTAT_PGMAJFAULT - Number of major page faults per second
VMSTAT_INFOALL - All information related to VMSTAT
LOADAVG_INFO - All information related to load average; CPU and IO utilization of the last one, five, and 10 minute periods. It also shows the number of currently running processes and the total number of processes.
We need to start thinking about how we can virtualize the information provided by the library. Currently, the information provided is not virtualized because it is fetched from /proc, which itself is not virtualized. 
This means that if an application is running in a containerized environment, the information provided by the library may not necessarily be for the container itself; it might be information about the host system. One suggestion was to read this information using LXCfs, which is a file system that can be bind-mounted over /proc to provide cgroup-aware information. But this seems pretty heavyweight because it uses FUSE to get cgroup-aware container information. Another suggestion was to read this information from cgroup files and provide it to the application. This seems more efficient and an ideal solution for the problem. libresource internally reads the information from procfs or sysfs itself, so currently it does not provide any performance improvement over reading that information directly from procfs or sysfs. We need to figure out ways to get the information in a more efficient manner. Various efforts have been made to add a system call or similar interface to provide this information, but none have been accepted by the community so far. There was a suggestion to use netlink to get some of the kernel information, especially networking related information. I had tried that while working on networking resources in the first version of the library, but I did not see any performance improvement in comparison to getting the information from procfs. This is because we still need to open a socket, read information from it, parse the information and close the socket. If in the future we provide APIs to read system resource information continuously, then this might be useful: we could keep the socket open and read the information continuously. There was a suggestion to standardize the make/install processes. Currently the library has a simple Makefile which does the work. We are working on this in the next version. I am working on a lot of these suggestions for the next version of libresource, and they should be out for review, and later for use, soon. Meanwhile you can get the library from GitHub[3] and start using it. If you have a request or a question, please use the issues[4] page on the GitHub repository. [1] https://blogs.oracle.com/linux/getting-system-resource-information-with-a-standard-api [2] https://www.linuxplumbersconf.org/event/2/contributions/211/ [3] https://github.com/lxc/libresource [4] https://github.com/lxc/libresource/issues
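For reference, the resource IDs planned above map to information the kernel already exposes under procfs. The exact files libresource will read are not spelled out in the post, but the standard locations for the same data look like this:

# File handle counts (FS_FILENR / FS_FILEMAXNR)
cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
# Outstanding and maximum AIO events (FS_AIONR / FS_AIOMAXNR)
cat /proc/sys/fs/aio-nr /proc/sys/fs/aio-max-nr
# Load average and process counts (LOADAVG_INFO)
cat /proc/loadavg
# Page and swap in/out counters since boot (VMSTAT_*)
grep -E '^(pgpgin|pgpgout|pswpin|pswpout|pgmajfault)' /proc/vmstat
# Huge page size and totals (MEM_HUGEPAGESIZE / MEM_HUGEPAGEALL)
grep -i huge /proc/meminfo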


Linux

How to Install Node.js 10 with node-oracledb and Connect it to Oracle Database

This post was updated on 20 March 2019 to reflect changes in the way yum configuration works. A few months ago we added dedicated repositories for Node.js to the Oracle Linux yum server. These repos also include an RPM with the Oracle Database driver for Node.js, node-oracledb, so you can connect your Node.js application to the Oracle Database. In this post I describe the steps to install Node.js 10 and node-oracledb, and to connect Node.js to Oracle Database. If you are in a rush or want to try this out in a non-destructive way, I recommend you use the latest Oracle Linux 7 Vagrant box. Configure Yum with Node.js and Oracle Instant Client Repositories To set up your system to access Node.js and Oracle Instant Client repos on Oracle Linux yum server, install the oracle-nodejs-release-el7 and oracle-release-el7 RPMs. As of this writing, the Node.js 10 repo will be enabled by default when you install the Oracle Node.js release RPM. $ sudo yum -y install oracle-nodejs-release-el7 oracle-release-el7 Next, install Node.js 10 and the compatible node-oracledb, making sure to temporarily disable the EPEL repository to prevent the wrong version of Node.js getting installed. $ sudo yum -y install --disablerepo=ol7_developer_EPEL nodejs node-oracledb-node10 Connecting to Oracle Database For my testing I used Oracle Database 18c Express Edition (XE). You can download it here. Quick Start instructions are here. About Oracle Instant Client node-oracledb depends on Oracle Instant Client. During OpenWorld 2018 we released Oracle Instant Client 18.3 RPMs on Oracle Linux yum server in the ol7_oracle_instantclient and ol6_oracle_instantclient repositories, making installation a breeze. Assuming you have enabled the Oracle Instant Client repository appropriate for your Oracle Linux release, it will be installed as a dependency. As of release 3.0, node-oracledb is built with Oracle Client 18.3, which connects to Oracle Database 11.2 and greater. Older releases of Oracle Instant Client are available on OTN. Add the Oracle Instant Client to the runtime link path. $ sudo sh -c "echo /usr/lib/oracle/18.3/client64/lib > /etc/ld.so.conf.d/oracle-instantclient.conf" $ sudo ldconfig A Quick Node.js Test Program Connecting to Oracle Database I copied this file from the examples in the node-oracledb GitHub repo. Running this will tell us whether Node.js can connect to the database. Copy this code into a file called connect.js. The file below comes from the same GitHub repo. Copy the code into a file called dbconfig.js and edit it to include your Database username, password and connect string. Run connect.js with node Before running connect.js, make sure NODE_PATH is set so that the node-oracledb module can be found. $ export NODE_PATH=`npm root -g` $ node connect.js Connection was successful!
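A few quick checks, not part of the original post, can confirm the pieces are in place before running connect.js: the Node.js version, that the Instant Client libraries are on the runtime link path, and that the oracledb module resolves via NODE_PATH:

$ node --version
$ ldconfig -p | grep libclntsh
$ export NODE_PATH=`npm root -g`
$ node -e "require('oracledb'); console.log('node-oracledb loaded OK')"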


Technologies

Kata Containers: An Important Cloud Native Development Trend

Introduction One of Oracle’s top 10 predictions for developers in 2019 was that a hybrid model that falls between virtual machines and containers will rise in popularity for deploying applications. Kata Containers are a relatively new technology that combine the speed of development and deployment of (Docker) containers with the isolation of virtual machines. In the Oracle Linux and virtualization team we have been investigating Kata Containers and have recently released Oracle Container Runtime for Kata on Oracle Linux yum server for anyone to experiment with. In this post, I describe what Kata Containers are as well as some of the history behind this significant development in the cloud native landscape. For now, I will limit the discussion to Kata as containers in a container engine. Stay tuned for a future post on the topic of Kata Containers running in Kubernetes. History of Containerization in Linux The history of isolation, sharing of resources and virtualization in Linux and in computing in general is rich and deep. I will skip over much of this history to focus on some of the key landmarks along the way. Two Linux kernel features are instrumental building blocks for the Docker Containers we’ve become so familiar with: namespaces and cgroups. Linux namespaces are a way to partition kernel resources such that two different processes have their own view of resources such as process IDs, file names or network devices. Namespaces determine what system resources you can see. Control Groups or cgroups are a kernel feature that enable processes to be grouped hierarchically such that their use of subsystem resources (memory, CPU, I/O, etc) can be monitored and limited. Cgroups determine what system resources you can use. One of the earliest containerization features available in Linux to combine both namespaces and cgroups was Linux Containers (LXC). LXC offered a userspace interface to make the Linux kernel containment features easy to use and enabled the creation of system or application containers. Using LXC, you could run, for example, CentOS 6 and Oracle Linux 7, two completely different operating systems with different userspace libraries and versions, on the same Linux kernel. Docker expanded on this idea of lightweight containers by adding packaging, versioning and component reuse features. Docker Containers have become widely used because they appealed to developers. They shortened the build-test-deploy cycle because they made it easier to package and distribute an application or service as a self-contained unit, together with all the libraries needed to run it. Their popularity also stems from the fact that they appeal to developers and operators alike. Essentially, Docker Containers bridge the gap between dev and ops and shorten the cycle from development to deployment. Because containers —both LXC and Docker-based— share the same underlying kernel, it’s not inconceivable that an exploit able to escape a container could access kernel resources or even other containers. Especially in multi-tenant environments, this is something you want to avoid. Projects like Intel® Clear Containers and Hyper runV took a different approach to parceling out system resources: their goal was to combine the strong isolation of VMs with the speed and density (the number of containers you can pack onto a server) of containers. Rather than relying on namespaces and cgroups, they used a hypervisor to run a container image. 
Intel® Clear Linux OS Containers and Hyper runV came together in Kata Containers, an open source project and community, which saw its first release in March of 2018. Kata Containers: Best of Both Worlds The fact that Kata Containers are lightweight VMs means that, unlike traditional Linux containers or Docker Containers, Kata Containers don’t share the same underlying Linux kernel. Kata Containers fit into the existing container ecosystem because developers and operators interact with them through a container runtime that adheres to the Open Container Initiative (OCI) specification. Creating, starting, stopping and deleting containers works just the way it does for Docker Containers. Image by OpenStack Foundation licensed under CC BY-ND 4.0 In summary, Kata Containers: Run their own lightweight OS and a dedicated kernel, offering memory, I/O and network isolation Can use hardware virtualization extensions (VT) for additional isolation Comply with the OCI (Open Container Initiative) specification as well as CRI (Container Runtime Interface) for Kubernetes Installing Oracle Container Runtime for Kata As I mentioned earlier, we’ve been researching Kata Containers here in the Oracle Linux team and as part of that effort we have released software for customers to experiment with. The packages are available on Oracle Linux yum server and its mirrors in Oracle Cloud Infrastructure (OCI). Specifically, we’ve released a kata-runtime and related components, as well as an optimized Oracle Linux guest kernel and guest image used to boot the virtual machine that will run a container. Oracle Container Runtime for Kata relies on QEMU and KVM as the hypervisor to launch VMs. To install Oracle Container Runtime for Kata on a bare metal compute instance on OCI: Install QEMU QEMU is available in the ol7_kvm_utils repo. Enable that repo and install qemu:
sudo yum-config-manager --enable ol7_kvm_utils
sudo yum install qemu
Install and Enable Docker Next, install and enable Docker:
sudo yum install docker-engine
sudo systemctl start docker
sudo systemctl enable docker
Install kata-runtime and Configure Docker to Use It First, configure yum for access to the Oracle Linux Cloud Native Environment - Developer Preview yum repository by installing the oracle-olcne-release-el7 RPM:
sudo yum install oracle-olcne-release-el7
Now, install kata-runtime:
sudo yum install kata-runtime
To make the kata-runtime an available runtime in Docker, modify Docker settings in /etc/sysconfig/docker. Make sure SELinux is not enabled. The line that starts with OPTIONS should look like this:
$ grep OPTIONS /etc/sysconfig/docker
OPTIONS='-D --add-runtime kata-runtime=/usr/bin/kata-runtime'
Next, restart Docker:
sudo systemctl daemon-reload
sudo systemctl restart docker
Run a Container Using Oracle Container Runtime for Kata Now you can use the usual docker command to run a container with the --runtime option to indicate you want to use kata-runtime. For example:
sudo docker run --rm --runtime=kata-runtime oraclelinux:7 uname -r
Unable to find image 'oraclelinux:7' locally
Trying to pull repository docker.io/library/oraclelinux ...
7: Pulling from docker.io/library/oraclelinux
73d3caa7e48d: Pull complete
Digest: sha256:be6367907d913b4c9837aa76fe373fa4bc234da70e793c5eddb621f42cd0d4e1
Status: Downloaded newer image for oraclelinux:7
4.14.35-1909.1.2.el7.container
To review what happened here: Docker, via the kata-runtime, instructed KVM and QEMU to start a VM based on a special-purpose kernel and minimized OS image. 
Inside the VM a container was created, which ran the uname -r command. You can see from the kernel version that a “special” kernel is running. Running a container this way takes more time than a traditional container based on namespaces and cgroups, but if you consider the fact that a whole VM is launched, it’s quite impressive. Let’s compare:
# time docker run --rm --runtime=kata-runtime oraclelinux:7 echo 'Hello, World!'
Hello, World!
real 0m2.480s
user 0m0.048s
sys 0m0.026s
# time docker run --rm oraclelinux:7 echo 'Hello, World!'
Hello, World!
real 0m0.623s
user 0m0.050s
sys 0m0.023s
That’s about 2.5 seconds to launch a Kata Container versus 0.6 seconds to launch a traditional container. Conclusion Kata Containers represent an important phenomenon in the evolution of cloud native technologies. They address both the need for security, through virtual machine isolation, and the need for speed of development, through seamless integration into the existing container ecosystem, without compromising on computing density. In this blog post I’ve described some of the history that brought us Kata Containers and showed how you can experiment with them yourself using the Oracle Container Runtime for Kata packages.
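As an extra sanity check that is not part of the original post, the kata-runtime ships a self-test subcommand that reports whether the host (CPU virtualization extensions, required kernel modules, and so on) is able to run Kata Containers; the exact output varies by version:

$ kata-runtime --version
$ sudo kata-runtime kata-check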
