
News, tips, partners, and perspectives for the Oracle Linux operating system and upstream Linux kernel work

Recent Posts

Announcements

Announcing Gluster Storage Release 4.1 for Oracle Linux 7

The Oracle Linux and Virtualization team is pleased to announce the release of Gluster Storage Release 4.1 for Oracle Linux 7. Gluster Storage is a runtime component of the Oracle Linux Cloud Native Environment. It is an open source, POSIX-compatible file system capable of supporting thousands of clients while using commodity hardware. Gluster provides a scalable, distributed file system that aggregates disk storage resources from multiple servers into a single global namespace. It provides built-in optimization for different workloads and can be accessed using either an optimized Gluster FUSE client or standard protocols, including SMB/CIFS. Gluster can be configured to both distribute and replicate content, with quota support, snapshots, and bit-rot detection for self-healing.

Gluster 4.1 for Oracle Linux 7 introduces support for:

- Both the Unbreakable Enterprise Kernel (Release 4 and higher) and the Red Hat Compatible Kernel
- The x86_64 and aarch64 architectures
- Upgrades from an existing Gluster 3.12 configuration
- NFS-Ganesha, which provides NFSv3, v4, v4.1, and pNFS server support for Gluster volumes

This release also includes a technology preview of Heketi, which provides a RESTful management interface to manage the lifecycle of GlusterFS volumes.

Notable enhancements and new features:

Management
- Samba volumes can be made inaccessible to clients without restarting Samba; this enhancement can also preserve changes to smb.conf when it is configured externally
- GlusterD2 brings initial support for rebalancing, snapshots, and intelligent volume provisioning (a technology preview and still experimental)

Monitoring
- GlusterFS 4 offers a lightweight method to access internal information, avoiding the performance penalty and complexity of previous approaches
- GlusterFS 4.1 introduces additional metrics to help determine the effectiveness of each xlator in various workloads

Performance
- Gluster FUSE mounts now support a FUSE extension to leverage the kernel "write-back cache"
- Improved performance when there are frequent metadata updates in the workload, typically seen with shared volumes
- FUSE read requests can be processed in parallel
- Better workload distribution on reads for replicate-based volumes

Standalone
- The utime feature enables Gluster to maintain consistent change and modification time stamps on files and directories across bricks
- Thin Arbiter volumes in Gluster (part of the GlusterD2 technology preview)
- Automatic configuration of backup volfile servers in clients (part of the GlusterD2 technology preview)

Installation

Gluster Storage is available on the Unbreakable Linux Network (ULN) and the Oracle Linux yum server. It is currently available for the x86_64 and aarch64 architectures and can be installed on any Oracle Linux 7 server running either the Red Hat Compatible Kernel (RHCK) or the Unbreakable Enterprise Kernel (UEK) Release 4 or 5. For more information on hardware requirements and how to install and configure Gluster, please review the Gluster Storage for Oracle Linux Release 4.1 documentation.

Support

Support for Gluster Storage is available to customers with an Oracle Linux Premier support subscription. Refer to the Oracle Linux 7 License Information User Manual for information about Oracle Linux support levels.
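On Oracle Linux 7, the installation described above typically boils down to a few commands. This is a hedged sketch: the release-package and repository names below are assumptions based on Oracle's usual packaging pattern and should be verified against the Gluster Storage for Oracle Linux Release 4.1 documentation before use.

```shell
# Sketch: enable the Gluster 4.1 yum repository and install the server.
# Package and repo names are assumptions; check the release documentation.
sudo yum install -y oracle-gluster-release-el7   # configures the Gluster 4.1 repo
sudo yum install -y glusterfs-server
sudo systemctl enable --now glusterd             # start the Gluster management daemon
```

From there, peers can be probed and volumes created with the usual gluster(8) CLI.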
Oracle Linux Resources:
- Documentation: Oracle Linux
- Software: Download Oracle Linux, Oracle Container Registry
- Blogs: Oracle Linux Blog
- Community Pages: Oracle Linux
- Social Media: Oracle Linux on YouTube, Oracle Linux on Facebook, Oracle Linux on Twitter
- Data Sheets, White Papers, Videos, Training, Support & more: Oracle Linux
- Product Training and Education: Oracle Linux

For community-based support, please visit the Oracle Linux space on the Oracle Developer Community.


Events

Join the Oracle Linux and Virtualization Team at Oracle OpenWorld Asia

Singapore is the next stop on our world tour! Join us for Oracle OpenWorld Asia, March 26-27, at the Marina Bay Sands. At Oracle OpenWorld Asia, you can gain enterprise expertise and start-up ingenuity directly from experts across retail, manufacturing, financial services, technology, the public sector, and more. Join innovators as they challenge assumptions, design for better outcomes, and leverage transformational technologies to create future possibilities now. Oracle OpenWorld Asia speakers are the innovators, disruptors, and thought leaders of tomorrow, from pioneers in the mobile and data analytics industries to authors, futurists, and many more. Discover your tomorrow, today. Register now, and be sure to attend these sessions:

Oracle Linux and Virtualization Sessions, Wednesday, March 27, Marina Bay Sands, Singapore

- Jumpstart Your Development with Oracle Linux and Oracle Cloud [SOL1993-SIN] -- Wim Coekaerts, Senior Vice President, Operating Systems and Virtualization Engineering, Oracle -- 09:00 AM - 09:45 AM | Arena 8 (Level 3)
- How Oracle Linux Cloud Native Environment and VirtualBox Can Make Developers Life Easier [SES2155-SIN] -- Avi Miller, Director of Product Management, Oracle -- 10:25 AM - 11:10 AM | Arena 8 (Level 3)
- Build a Cloud Native Environment with Oracle Linux [SES2223-SIN] -- Robert Shimp, Product Management Group Vice President, Oracle Linux and Virtualization, Oracle -- 01:05 PM - 01:40 PM | Arena 5 (Level 3)

The Exchange: a Showcase for Attendees to Connect, Discover and Learn

Oracle Linux and Oracle Virtualization experts will be at The Exchange to answer your questions, update you on the latest product enhancements, and demo the latest software releases. Let us know about your experience -- #OOWSIN #OracleLinux @OracleLinux. Enjoy the conference!


Linux

Making Code More Secure with GCC - Part 2

This blog entry was contributed by Maxim Kartashev.

In the previous post I focused on the static analysis capabilities of the gcc 7.3 compiler. Warnings issued at compile time can point to the place in a program where an error at run time might occur, thus enabling the programmer to fix the program even before it is run. Not all run time errors can be predicted at compile time, though, and there are good and bad reasons why. For instance, there might be many annoying false positive warnings that get routinely ignored (and sometimes rightly so), until one of them points to an actual problem but gets silenced together with the rest. Or the programmer invokes undefined behavior, which in many cases is impossible to diagnose at compile time because there are simply no provisions for that in the programming language.

The GNU toolchain continues to help the programmer even past compile time, with the help of code instrumentation and additional features baked into the glibc library. In this post I am going to describe the necessary steps to utilize these capabilities. Apart from flaws in the program that make it work incorrectly even on correct data, an attacker will attempt to create input unforeseen by the programmer in order to take control of the program. Here, too, gcc can help strengthen the code it generates by structuring it differently and providing additional checks. This post lists several of the most useful techniques that gcc 7.3 implements.

Finding Bugs At Run Time

Some compiler warnings can be legitimately - from the point of view of the language - suppressed. One example is shown below: an explicit type cast spelled out in the code makes the compiler believe that you know what you are doing and not complain.
a.c

int global;

int main()
{
    int*  p  = &global;
    long* lp = p;
    long  l1 = *lp;        // warning: initialization from incompatible
                           // pointer type [-Wincompatible-pointer-types]
    long  l2 = *(long*)p;  // same as above, but no warning
}

$ gcc -fsanitize=undefined a.c
a.c: In function 'main':
a.c:5:16: warning: initialization from incompatible pointer type [-Wincompatible-pointer-types]
     long* lp = p;
                ^

These kinds of tricks place the program into undefined behavior territory, meaning that it is no longer predictable what the program will do. It is often tempting to dismiss the severity of undefined behavior; in fact, not many situations really lead to unpredictable results at low optimization levels. The danger increases tenfold with high -O settings, because the undefined behavior starts to break the compiler's understanding of the program and, guessing incorrectly, the compiler can generate code that does peculiar things. As an example, see how undefined behavior can erase your hard disk.

Fortunately, the gcc compiler can still help to find at least some kinds of undefined behavior. It can be asked to instrument the generated code with additional instructions that perform various checks before the actual user code gets executed. To enable this instrumentation, use the -fsanitize=undefined option when compiling and linking your program. When executed, the program will report the problems it spots as "runtime errors".
See, for instance, how the GNU toolchain detects two bugs in the above code at run time:

$ ./a.out
a.c:9:10: runtime error: load of misaligned address 0x0000006010dc for type 'long int', which requires 8 byte alignment
0x0000006010dc: note: pointer points here
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
              ^
a.c:9:10: runtime error: load of address 0x0000006010dc with insufficient space for an object of type 'int'
0x0000006010dc: note: pointer points here
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

The -fsanitize option has many sub-options. If you are interested in finding out which specific situations can be detected by the version of the GNU toolchain you are using, check the Program Instrumentation Options section of its documentation. By default, the first error aborts the program, giving you an opportunity to debug the core file, but it is also possible to attempt to continue execution in order to catch more errors at once. This is what the -fsanitize-recover=undefined compiler option does; remember, though, that errors can cascade, and all but the first one may not be very useful.

Memory Corruption Mitigation

Memory corruption is perhaps the most common source of subtle bugs and vulnerabilities. Unsurprisingly, many tools exist to help the programmer find the origin of such problems (memcheck, discover, etc.). The GNU toolchain has not one but two such technologies: run-time program instrumentation ("AddressSanitizer") and, independent from it, built-in checks in the glibc dynamic memory allocator.

AddressSanitizer

The gcc compiler can instrument memory access instructions so that out-of-bounds and use-after-free bugs can be detected. This method requires recompilation with the -fsanitize=address option and obviously produces code that runs slower than without instrumentation (expect a ~2x slowdown).
When compiling with optimization, -fno-omit-frame-pointer is recommended, since the sanitizer runtime uses a fast and simple frame-based stack unwinder that requires the frame pointer register to serve its primary function. At run time, a detailed error message is issued to stderr, complete with the stack traces at the time of the invalid access and of the allocation of the memory block (if it was in the heap). Many find it helpful not to abort on the first error; the -fsanitize-recover=address option enables this. Here's an example of the sanitizer output from this code:

a.c

// ...
char* p = malloc(2);
p[2] = 0;  // writes past the allocated buffer
// ...

$ gcc -fsanitize=address a.c
$ ./a.out
=================================================================
==27056==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x619000000480 at pc 0x000000400726 bp 0x7fffffffd910
WRITE of size 1 at 0x619000000480 thread T0
    #0 0x400725 in main (/tmp/a.out+0x400725)
    #1 0x7ffff6a7d3d4 in __libc_start_main (/lib64/libc.so.6+0x223d4)
    #2 0x400618 (/tmp/a.out+0x400618)

0x619000000480 is located 0 bytes to the right of 1024-byte region [0x619000000080,0x619000000480)
allocated by thread T0 here:
    #0 0x7ffff6f01900 in __interceptor_malloc /.../asan_malloc_linux.cc:62
    #1 0x4006d8 in main (/tmp/a.out+0x4006d8)
    #2 0x7ffff6a7d3d4 in __libc_start_main (/lib64/libc.so.6+0x223d4)

SUMMARY: AddressSanitizer: heap-buffer-overflow (/tmp/a.out+0x400725) in main
Shadow bytes around the buggy address:
...
  0x0c327fff8080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c327fff8090:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
...

This method works not only on dynamically allocated memory, but also on stack (local, automatic) variables and global statically allocated data. Many aspects of the address sanitizer's work are controlled with the ASAN_OPTIONS environment variable, including what to check and what to report.
For example, specifying ASAN_OPTIONS=log_path=memerr.log will redirect all output to a file named memerr.log.<pid> instead of stderr. See the complete option reference.

Dynamic Memory Checks by glibc

The glibc dynamic memory allocator can perform heap consistency checks and report problems to stderr, supplied with the stack trace and memory map at the time of the error, if requested. To utilize this capability, set the MALLOC_CHECK_ environment variable (its values control what to do on error and can be found in mallopt(3)) prior to running the program. You can also insert explicit heap checks, by either linking with the -lmcheck option or calling the mcheck(3) function before the first call to malloc(3). All the specifics can be found in the mcheck(3) man page. This is an example of this facility:

a.c

char* p = malloc(n);
// ...
if ( argc == 1 ) {
    free(p);
}
// ...
free(p);

No additional compilation options are required:

$ gcc a.c
$ MALLOC_CHECK_=3 ./a.out
*** Error in `./a.out': free(): invalid pointer: 0x0000000000602010 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x8362e)[0x7ffff7a9162e]
./a.out[0x40058b]
...
Aborted (core dumped)

The types of problems found by these checks are limited to heap metadata corruption (heap buffer overruns) and things like double free. Still, the method requires neither changes to the code nor recompilation, has a lower performance impact than the AddressSanitizer described above, and can be used to abort the program to ease debugging, all of which make it a useful tool for keeping your program clear of dynamic memory corruption.

Options to Increase Code Security

The GNU compiler implements several techniques to harden a program against possible attacks. They work by inserting small bits of code and/or by adding checks to some standard functions (strcat(3), for instance) that verify the integrity of vital data at run time and abort the program if the data gets damaged, which may be the result of a programming error or an attempted attack.
All these options are intended to be enabled for production builds.

The -fstack-protector option adds protection against stack smashing attacks by placing a few guard bytes on the vulnerable (see below) function's stack and verifying that those bytes haven't been changed before returning from the function. If they have, an error is printed and the program aborts:

*** stack smashing detected ***: ./a.out terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7ffff7b26677]
/lib64/libc.so.6(+0x118632)[0x7ffff7b26632]
./a.out[0x400589]
./a.out[0x400599]
...

By default, only functions that call alloca(3) and functions with buffers larger than 8 bytes are protected by this option. There are several choices as to which functions to consider vulnerable and protect: -fstack-protector-strong also protects functions that have local array definitions or references to local frame addresses, -fstack-protector-all protects all functions, and -fstack-protector-explicit only protects those with the stack_protect attribute, which you need to add manually.

Another compiler option that helps protect against stack-tampering attacks is -fstack-check. When a single-threaded program goes beyond its stack boundaries, the OS generates a signal (typically SIGSEGV) that terminates the program. With multi-threaded - and, therefore, multi-stack - programs, such a situation is not so easily detectable, because one thread's stack bottom might be another stack's top, and the gap between them (protected by the OS) is small enough that it can be "jumped" over. The -fstack-check option helps mitigate that, making sure the OS knows when a stack is being extended and by how many pages, even if the attacker arranges for the program not to touch every page of the newly extended stack.
As a result, the OS-guarded gap between different threads' stacks is guaranteed to get touched, and the multi-threaded program receives the same neat terminating signal as an offending single-threaded program would.

The next code hardening technique is activated by defining the _FORTIFY_SOURCE macro to 1 (checking without changing semantics) or 2 (more checking, but conforming programs might fail) and provides protection against silent buffer overruns by functions that manipulate strings or memory, such as memset(3) or strcpy(3). Precise information for your version of the toolchain can be found in the feature_test_macros(3) man page. As I mentioned in my previous post, many compiler checks benefit from an increased level of optimization, which allows gcc to collect more data about the program; the use of the _FORTIFY_SOURCE macro requires an optimization level of -O1 or above. Potential errors are detected at run time and, when possible, at compile time. Consider this example:

a.c

#include <string.h>

int main(int argc, char* argv[])
{
    char s[2];
    strcpy(s, "a.out"); // buffer overrun here
    return 0;
}

Compiling it with the usual flags doesn't spot any problems, even though the "a.out" string obviously doesn't fit into the two bytes available in the local variable s:

$ gcc -O2 -Wall -Wextra -Wno-unused a.c

Even running the program gives no hints of possible trouble:

$ ./a.out
$ echo $?
0

Let's add the _FORTIFY_SOURCE macro:

$ gcc -D_FORTIFY_SOURCE=1 -O2 -Wall -Wextra -Wno-unused a.c
In file included from /usr/include/string.h:638:0,
                 from a.c:1:
In function 'strcpy',
    inlined from 'main' at a.c:6:5:
/usr/include/bits/string3.h:104:10: warning: '__builtin___strcpy_chk' writing 6 bytes into a region of size 2 overflows the destination [-Wstringop-overflow=]
   return __builtin___strcpy_chk (__dest, __src, __bos (__dest));
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

And we immediately get a warning from the compiler.
Now let's see what happens if the source string being copied is not a compile-time constant:

a.c

#include <string.h>

int main(int argc, char* argv[])
{
    char s[2];
    strcpy(s, argv[0]); // buffer overrun here as argv[0] will be "./a.out"
    return 0;
}

Notice that this time there are no warnings:

$ gcc -D_FORTIFY_SOURCE=1 -O2 -Wall -Wextra -Wno-unused a.c

At run time, however, the support library detects the buffer overrun and immediately aborts execution of the program:

$ ./a.out
*** buffer overflow detected ***: ./a.out terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7ffff7b26677]
/lib64/libc.so.6(+0x1167f2)[0x7ffff7b247f2]
./a.out[0x40052d]
...
Aborted (core dumped)

Key Take-Aways

The GNU toolchain can be utilized to find bugs and vulnerabilities at run time:

- Compile your program with the -fsanitize=undefined gcc option and run your tests. This exercises a lot of additional checks that help ensure the program actually behaves as intended and doesn't do so simply by accident.
- Both suspected and unsuspected problems with the heap can often be detected by setting the MALLOC_CHECK_ environment variable prior to running your program (see mallopt(3) for more info). No re-compilation required!
- If recompiling is possible, all kinds of memory access problems can be detected by AddressSanitizer: compile with -fsanitize=address (adding -O -fno-omit-frame-pointer to reduce the negative performance impact and, possibly, -fsanitize-recover=address to not abort on the first error).

The GNU toolchain can also harden your program against certain kinds of attacks:

- The -fstack-protector option adds stack integrity checks to certain vulnerable functions. You can control which functions to protect with sub-options.
- Use -fstack-check for multi-threaded programs to prevent one thread from silently extending its stack on top of another.
- Add -D_FORTIFY_SOURCE=1 -O2 to your compilation flags to catch buffer overruns by certain standard memory manipulation functions, both at run time and at compile time. See feature_test_macros(3) for more info.

References

- List of gcc options for program instrumentation (-fsanitize= and friends)
- The complete list of gcc options with descriptions
- glibc built-in heap consistency checks


Linux

Short Cuts to Better Solutions with the Oracle Linux ISV Catalog

Whether you’re an Oracle customer or partner, there are many reasons to look for solutions certified for use with Oracle Linux and Oracle VM – on-premises or in the cloud. There are many applications, tools, plug-ins, and more that you may need to consider for your projects. Narrowing down the options and gathering all the necessary information can be a time-consuming task, so we’d like to offer some short cuts to the right solution for you.

Oracle Software

For Oracle customers, you’ll want to check out our support portal. Here you will find all Oracle software that runs on Oracle Linux and Oracle VM, and you'll learn which versions have been tested and work together, to help you jump-start your project. You’ll also find a handy tool: in less than 10 minutes, the Getting Started with Certifications video provides a wealth of information on the various ways to search for tested software combinations. And, for customers with a support subscription, help is just one call away. Oracle Support provides technical assistance for Oracle Database, Oracle Applications, and all other Oracle software. With Oracle Linux and Oracle VM, the software is free to download, use, and distribute; you can purchase a support subscription, available in different support levels, to meet your needs.

Third-Party Software

Likewise, visit our ISV Catalog for third-party software that has been certified with Oracle Linux and Oracle VM. We work closely with ISVs to help them test their software with Oracle Linux and Oracle VM, so that mutual customers are confident that the software combinations they deploy – on-premises or in the cloud – have been tested and work well together. If you're an ISV and your software isn’t in our catalog – let’s fix that! You’re missing a great opportunity to reach customers who may be looking for your software to run on Oracle Linux – one of the top enterprise Linux distributions. Contact us to learn more about the certification process. We welcome the opportunity to work with you.

Oracle Linux has a very rich set of certified solutions, comparable to other Linux distributions. These short cuts should help you find just the right solution for you.


Linux Kernel Development

Writing kernel tests with the new Kernel Test Framework (KTF)

In this blog, Oracle Linux kernel developers Alan Maguire and Knut Omang explain how to write Kernel Test Framework tests. KTF is available as a standalone git repository, but we are also working to offer it as a patch set for integration into the kernel. Read more about KTF in our introductory blog post here: https://blogs.oracle.com/linux/oracles-new-kernel-test-framework-for-linux-v2

Writing new KTF (Kernel Test Framework) tests

Here we're going to describe how to use KTF to write some tests. The neat thing about KTF is that it allows us to test kernel code in kernel context directly, which means we have a lot of control over the environment our tests run in. We're going to write some tests for a key abstraction in Linux kernel networking, the struct sk_buff. The sk_buff (socket buffer) is the structure used to store packet data as it moves through the networking stack. For an excellent introduction, see http://vger.kernel.org/~davem/skb_data.html

In fact, we're going to base our tests on some of the descriptions there, by creating, manipulating, and freeing skbs and asking questions like: what is the state of an sk_buff when it is first allocated? What about when we reserve space for new packet headers, or add tailroom? And so on. My hope is to show that adding tests is in fact a great way to understand an API. If we can formalize the guarantees of the API such that we can write tests to validate them, we've come a long way in understanding it.

Brief Introduction to KTF

KTF allows us to test both exported and un-exported kernel interfaces in test cases which are added in a dedicated test module. We can make assertions about state during these test cases, and the results are communicated to userspace via netlink sockets. The googletest framework is used in conjunction with KTF. While KTF supports hybrid user- and kernel-mode tests, here we will focus on kernel-only tests.
Creating our project

First, let's grab a copy of KTF and build it. We use separate source and build trees, and because KTF builds kernel modules we need kernel-specific builds and the kernel-uek-devel package. We build googletest from source. Note: these instructions are for Oracle Linux; some package names etc. may differ for other distros. Full instructions can be found in the doc/installation.txt file in KTF. We use Knut's version of googletest as it includes assertion counting and better test case naming.

Building googletest

# yum install cmake3
# cd ~
# mkdir -p src build/`uname -r`
# cd src
# git clone https://github.com/knuto/googletest.git
# cd ~/build/`uname -r`
# mkdir googletest
# cd googletest
# cmake3 ~/src/googletest/ -DBUILD_SHARED_LIBS=ON
# make
# sudo make install

Building KTF

We need the kernel-uek-devel and cpp packages to build. Finally, once we have built KTF, we insert its kernel module.

# sudo yum install kernel-uek-devel cpp libnl3-devel
# cd ~/src
# git clone https://github.com/oracle/ktf
# cd ktf
# autoreconf
# cd ~/build/`uname -r`
# mkdir ktf
# cd ktf
# PKG_CONFIG_PATH=/usr/local/lib64/pkgconfig ~/src/ktf/configure KVER=`uname -r`
# make
# sudo make install
# sudo insmod kernel/ktf.ko

Creating our new test suite

Getting started here is easy; Knut created a "ktfnew" program to populate a new suite:

# ~/src/ktf/scripts/ktfnew -p ~/src skbtest
Creating a new project under ~/src/skbtest

Let's see what we got!

# ls ~/src/skbtest
ac  autom4te.cache  configure.ac  m4  Makefile.in
aclocal.m4  configure  kernel  Makefile.am

The kernel subdir is where we will add tests to our "skbtest" module, and it has already been populated with a file:

# ls ~/src/skbtest/kernel
Makefile.in  skbtest.c

skbtest.c is a simple module with one test "t1" in test set "simple" which evaluates a true expression via the EXPECT_TRUE() macro. The module init function adds the test via the ADD_TEST(name) macro.
ASSERT_*() and EXPECT_*() macros are used to test conditions; if they fail, the test fails. ASSERTs are fatal to test case execution, so to clean up on failure we can use the ASSERT_*_GOTO() variants, which take a label to jump to. We will see more examples of this later on.

#include <linux/module.h>
#include "ktf.h"

MODULE_LICENSE("GPL");

KTF_INIT();

TEST(simple, t1)
{
	EXPECT_TRUE(true);
}

static void add_tests(void)
{
	ADD_TEST(t1);
}

static int __init skbtest_init(void)
{
	add_tests();
	return 0;
}

static void __exit skbtest_exit(void)
{
	KTF_CLEANUP();
}

module_init(skbtest_init);
module_exit(skbtest_exit);

So we're ready to start adding our tests! Before we do anything else, let's ensure we track our progress with git. We remove "configure" as we don't want to track it via git; we can recreate it with "autoreconf".

# cd ~/src/skbtest
# rm configure
# git init .
# git add ac aclocal.m4 configure.ac kernel/ m4 Makefile.*
# git commit -a -m "initial commit"

The first thing we need to do is ensure that our tests have access to the skb interfaces, so we add:

#include <linux/skbuff.h>

Next, let's add a simple test that makes assertions about skb state after allocation:

/**
 * alloc_skb_sizes()
 *
 * Ensure initial skb state is as expected for allocations of various sizes:
 * - head == data
 * - end >= tail + size
 * - len == data_len == 0
 * - nr_frags == 0
 */
TEST(skb, alloc_skb_sizes)
{
	unsigned int i, sizes[] = { 127, 260, 320, 550, 1028, 2059 };
	struct sk_buff *skb = NULL;

	for (i = 0; i < ARRAY_SIZE(sizes); i++) {
		skb = alloc_skb(sizes[i], GFP_KERNEL);
		ASSERT_ADDR_NE_GOTO(skb, 0, done);
		ASSERT_ADDR_EQ_GOTO(skb->head, skb->data, done);
		/*
		 * skb->end will be aligned and include overhead of shared
		 * info.
		 */
		ASSERT_TRUE_GOTO(skb->end >= skb->tail + sizes[i], done);
		ASSERT_TRUE_GOTO(skb->tail == skb->data - skb->head, done);
		ASSERT_TRUE_GOTO(skb->len == 0, done);
		ASSERT_TRUE_GOTO(skb->data_len == 0, done);
		ASSERT_TRUE_GOTO(skb_shinfo(skb)->nr_frags == 0, done);
		kfree_skb(skb);
		skb = NULL;
	}
done:
	kfree_skb(skb);
}

static void add_tests(void)
{
	ADD_TEST(alloc_skb_sizes);
}

If one of our ASSERT_ macros fails, we will goto "done", where we clean up by freeing the skb. Ensuring tests tidy up after themselves is important, as we don't want our tests to induce memory leaks! Now we build and run our test.

Building and running our test

Here we build our test kernel module. Since we installed ktf/googletest in /usr/local, we need to tell configure to look there.

# cd ~/src/skbtest
# autoreconf
# cd ~/build/`uname -r`
# mkdir skbtest
# cd skbtest
# ~/src/skbtest/configure KVER=`uname -r` --prefix=/usr/local --libdir=/usr/local/lib64 --with-ktf=/usr/local
# make
# sudo make install

Now let's load our test module (we loaded ktf.ko above) and run the tests:

# sudo insmod kernel/skbtest.ko
# sudo LD_LIBRARY_PATH=/usr/local/lib64 /usr/local/bin/ktfrun
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from skb
[ RUN      ] skb.alloc_skb_sizes
[       OK ] skb.alloc_skb_sizes, 42 assertions (0 ms)
[----------] 1 test from skb (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 1 test.

Error injection

Now, the above admittedly looks pretty dull. However, it's worth emphasizing something before we move on: this code actually ran in-kernel! With a lot of pain, it would be possible to hack up a user-space equivalent test, but it would require adding definitions for kmalloc, kmem_cache_alloc, and so on. Here we test the code in the same environment in which it normally runs, with no caveats or special-purpose environments.
This makes KTF execution pretty unique: there is no need for extensive stubbing; we're testing the code as-is. Next we're going to inject an error and see how skb allocation behaves in low-memory conditions. KTF allows us to catch function execution and return, and mess with the results, via kprobes - specifically kretprobes. To catch a return value we declare:

KTF_RETURN_PROBE(function_name, function_handler)
{
	void *retval = (void *)KTF_RETURN_VALUE();
	...
	KTF_SET_RETURN_VALUE(newvalue);
	return 0;
}

We get the intended return value with KTF_RETURN_VALUE(), and we can set our own via KTF_SET_RETURN_VALUE(). Note that the handler itself should always return 0 - the value that the function we're probing actually returns is set by KTF_SET_RETURN_VALUE(). For neatness, if it's a memory allocation we should free it; otherwise we'll be inducing a memory leak with our test!

However, we face a few problems with this sort of error injection. First, the kmem_cache used - skbuff_head_cache - is not exported as a symbol, so how do we access it in order to kmem_cache_free() our skb memory? Luckily, KTF has a handy function for cases like this: ktf_find_symbol(). We pass in the module name (NULL in this case, because it's a core kernel variable) and the symbol name, and we get back the address of the symbol. Remember, though, that this is essentially &skbuff_head_cache, so we need to dereference it before use. Second, we don't want to fail skb allocations for everyone, as that would kill our network access. So, by recording the task_struct * for the test in alloc_skb_nomem_task, we can limit the damage to our test thread. Here's what the test looks like in full:

struct task_struct *alloc_skb_nomem_task;

KTF_RETURN_PROBE(kmem_cache_alloc_node, kmem_cache_alloc_nodehandler)
{
	struct sk_buff *retval = (void *)KTF_RETURN_VALUE();
	struct kmem_cache **cache;

	/* We only want alloc failures for this task!
*/ if (alloc_skb_nomem_task != current) return 0; /* skbuff_head_cache is private to skbuff.c */ cache = ktf_find_symbol(NULL, "skbuff_head_cache"); if (!cache || !*cache || !retval) return 0; kmem_cache_free(*cache, retval); KTF_SET_RETURN_VALUE(0); return 0; } /** * alloc_skb_nomem() * * Ensure that in the face of allocation failures (kmem cache alloc of the * skb) alloc_skb() behaves sensibly and returns NULL. **/ TEST(skb, alloc_skb_nomem) { struct sk_buff *skb = NULL; alloc_skb_nomem_task = current; ASSERT_INT_EQ_GOTO(KTF_REGISTER_RETURN_PROBE(kmem_cache_alloc_node, kmem_cache_alloc_nodehandler), 0, done); skb = alloc_skb(128, GFP_KERNEL); ASSERT_ADDR_EQ_GOTO(skb, 0, done); alloc_skb_nomem_task = NULL; done: KTF_UNREGISTER_RETURN_PROBE(kmem_cache_alloc_node, kmem_cache_alloc_nodehandler); kfree_skb(skb); } static void add_tests(void) { ADD_TEST(alloc_skb_sizes); ADD_TEST(alloc_skb_nomem); } Let's run it! # sudo LD_LIBRARY_PATH=/usr/local/lib64 /usr/local/bin/ktfrun [==========] Running 2 tests from 1 test case. [----------] Global test environment set-up. [----------] 2 tests from skb [ RUN ] skb.alloc_skb_nomem [ OK ] skb.alloc_skb_nomem, 2 assertions (27 ms) [ RUN ] skb.alloc_skb_sizes [ OK ] skb.alloc_skb_sizes, 42 assertions (0 ms) [----------] 2 tests from skb (27 ms total) [----------] Global test environment tear-down [==========] 2 tests from 1 test case ran. (27 ms total) [ PASSED ] 2 tests. Neat! Our error injection must have worked since alloc_skb() returned NULL, and we also cleaned up the memory that was really allocated but we pretended wasn't. alloc_skb and bad skb sizes Next we might wonder: given the arguments, can we see what happens when we provide an invalid size? But what is an invalid size? 0? UINT_MAX?
Let's try a test where we pass in 0 and UINT_MAX and expect alloc_skb() to fail: TEST(skb, alloc_skb_invalid_sizes) { unsigned int i, sizes[] = { 0, UINT_MAX }; struct sk_buff *skb = NULL; for (i = 0; i < ARRAY_SIZE(sizes); i++) { skb = alloc_skb(sizes[i], GFP_KERNEL); ASSERT_ADDR_EQ_GOTO(skb, 0, done); } done: kfree_skb(skb); } Build again, and let's see what happens: # sudo LD_LIBRARY_PATH=/usr/local/lib64 ktfrun [==========] Running 3 tests from 1 test case. [----------] Global test environment set-up. [----------] 3 tests from skb [ RUN ] skb.alloc_skb_invalid_sizes /var/tmp/build/4.14.35+/skbtest/kernel/skbtest.c:113: Failure Assertion '(u64)(skb)==(u64)(0)' failed: (u64)(skb)==0xffffa07b53ca6c00, (u64)(0)==0x0 [ FAILED ] skb.alloc_skb_invalid_sizes, where GetParam() = "alloc_skb_invalid_sizes" (19 ms) [ RUN ] skb.alloc_skb_nomem [ OK ] skb.alloc_skb_nomem, 2 assertions (23 ms) [ RUN ] skb.alloc_skb_sizes [ OK ] skb.alloc_skb_sizes, 2 assertions (0 ms) [----------] 3 tests from skb (42 ms total) [----------] Global test environment tear-down [==========] 3 tests from 1 test case ran. (42 ms total) [ PASSED ] 2 tests. [ FAILED ] 1 test, listed below: [ FAILED ] skb.alloc_skb_invalid_sizes, where GetParam() = "alloc_skb_invalid_sizes" 1 FAILED TEST Okay so that failed, which means our allocation succeeded; why? Taking a closer look at alloc_skb(), there's no bar on 0 values. What about UINT_MAX, that shouldn't work, right? Actually it does! If we look at the code however, the size value that gets passed in gets the sizeof(struct skb_shared_info) etc added to it. So we just overflow the value, but what's interesting about that is we'll end up with an skb that invalidates the initial state expectations. 
Let's demonstrate that by adding UINT_MAX to our "sizes" array in our valid skb alloc test "alloc_skb_sizes": unsigned int i, sizes[] = { 0, 127, 260, 320, 550, 1028, 2059, UINT_MAX }; Rebuilding and running we see this: # sudo LD_LIBRARY_PATH=/usr/local/lib64 ktfrun [==========] Running 3 tests from 1 test case. [----------] Global test environment set-up. [----------] 3 tests from skb [ RUN ] skb.alloc_skb_invalid_sizes [ OK ] skb.alloc_skb_invalid_sizes, 2 assertions (0 ms) [ RUN ] skb.alloc_skb_nomem [ OK ] skb.alloc_skb_nomem, 2 assertions (23 ms) [ RUN ] skb.alloc_skb_sizes /var/tmp/build/4.14.35+/skbtest/kernel/skbtest.c:45: Failure Failure '(skb->end >= skb->tail + sizes[i])' occurred [ FAILED ] skb.alloc_skb_sizes, where GetParam() = "alloc_skb_sizes" (15 ms) [----------] 3 tests from skb (38 ms total) So if we pass UINT_MAX to alloc_skb() we end up with a broken skb, in that skb->end isn't pointing where it should be. Seems like there could be some range checking here, but alloc_skb() is such a hot codepath it's likely the pragmatic argument that "no-one should allocate dumb-sized skbs" wins. We can modify our test to use "safer" bad sizes for now: TEST(skb, alloc_skb_invalid_sizes) { /* We cannot just use UINT_MAX here as the "size" argument passed in * has sizeof(struct skb_shared_info) etc added to it; let's settle for * UINT_MAX >> 1, UINT_MAX >> 2, etc. */ unsigned int i, sizes[] = { UINT_MAX >> 1, UINT_MAX >> 2}; struct sk_buff *skb = NULL; for (i = 0; i < ARRAY_SIZE(sizes); i++) { skb = alloc_skb(sizes[i], GFP_KERNEL); ASSERT_ADDR_EQ_GOTO(skb, 0, done); } done: kfree_skb(skb); } In general the skb interfaces assume the data they are provided is sensible, but we've just learned what can happen when it isn't! Writing tests is a great way to learn about an API.

In this blog, Oracle Linux kernel developers Alan Maguire and Knut Omang explain how to write Kernel Test Framework tests. KTF is available as a standalone git repository, but we are also working to...

Announcements

Oracle OpenWorld and Oracle Code One San Francisco 2019 – Call for Speakers is Open!

This year, Oracle OpenWorld and Oracle Code One San Francisco 2019 are taking place Sunday, September 15 – Thursday, September 19, 2019. The Call for Speakers is now open, and the deadline is coming up fast! Oracle customers and partners are encouraged to submit proposals to present at either or both of the conferences. The deadline to submit a proposal has been extended to 4pm ET on Friday, March 22, 2019. Whether you’re focused on securing your enterprise, operating in a hybrid cloud environment, finding ways to optimize cloud native or DevOps solutions, using Oracle Infrastructure Technologies, these conferences are ideal for sharing best practices, case studies, lessons learned, how-to’s, and deep-dives. We’re excited to have you join us, to make these the ultimate cloud and developer learning conferences of 2019. Don’t wait! Submit your proposal by 4pm ET on Friday, March 22. Details and submission guidelines are available on the Oracle OpenWorld and Oracle Code One websites below. Important Links Oracle OpenWorld Conference website Call for Speakers – proposal deadline: March 22, 2019 Oracle Code One Conference website Call for Speakers – proposal deadline: March 22, 2019

This year, Oracle OpenWorld and Oracle Code One San Francisco 2019 are taking place Sunday, September 15 – Thursday, September 19, 2019. The Call for Speakers is now open, and the deadline is coming...

Linux Kernel Development

Reboot faster with kexec

Oracle Linux kernel developer Steve Sistare contributes this article on speeding up kernel reboots for development and for production systems. Fast reboot with kexec The kexec command loads a new kernel and jumps directly to it, bypassing firmware and grub. It is most often used as the first step in generating a crash dump, but it can also be used to perform an administrative reboot. The time saved by skipping firmware is substantial on a server with large memory, many CPUs, and many devices. This is particularly useful during kernel development when you frequently rebuild and reboot the kernel. The kexec options are a bit arcane, so I wrote a script to make it easier to use for basic reboot. You specify the new or old kernel, and new or old or additional kernel command-line parameters. The script loads the new kernel and initramfs using kexec -l, then gracefully stops systemd services and jumps to the new kernel using systemctl kexec. It could save a few more seconds by abruptly killing processes with kexec -e, but I chose the graceful route to mimic a normal reboot as closely as possible. The dramatic time savings come from skipping firmware, rather than skipping systemd shutdown. Here is the bash script, which I call kboot: #!/bin/bash [[ "$1" != '-' ]] && kernel="$1" shift if [[ "$1" == '-' ]]; then reuse=--reuse-cmdline shift fi [[ $# == 0 ]] && reuse=--reuse-cmdline kernel="${kernel:-$(uname -r)}" kargs="/boot/vmlinuz-$kernel --initrd=/boot/initramfs-$kernel.img" kexec -l -t bzImage $kargs $reuse --append="$*" && \ systemctl kexec   Usage: kboot kboot <kernel> [<params>] ... The first arg (if any) specifies the kernel, where a '-' means use the current kernel. If the 2nd arg is '-' or is omitted, then the existing kernel parameters are appended. Any remaining args are also appended to the kernel parameters.   Examples:   Reboot to the same (possibly updated) kernel with same kernel command line.
# kboot Reboot to a different kernel with the same kernel command line. # kboot 4.20.0-rc5 Reboot to the same kernel with same kernel command line plus additional parameters: # kboot - - log_buf_len=16M enforcing=0 Reboot to a different kernel with the same kernel command line plus additional parameters # kboot 4.20.0-rc5 - log_buf_len=16M enforcing=0 Reboot to the same kernel with a new command line. Add single quotes around the parameters if they contain shell meta characters. # kboot - 'root=/dev/mapper/vg00-lv_root ro crashkernel=auto rd.lvm.lv=vg00/lv_root rd.lvm.lv=vg00/lv_swap console=ttyS0,115200 systemd.log_level=debug' On an X6-2 test system with 2 sockets * 22 cores and 448 GB of RAM, running Oracle Linux 7.5 with UEK5, a normal reboot takes 184 seconds, as measured from typing reboot to the availability of the sshd port for logging in. kboot takes 30 seconds, a 6X speedup. For my kernel projects, kexec reboot has made the edit-compile-debug cycle a pleasure rather than a punishment!

Oracle Linux kernel developer Steve Sistare contributes this article on speeding up kernel reboots for development and for production systems. Fast reboot with kexec The kexec command loads a new kernel...

Linux

Easy Compute Instance Metadata Access with OCI Utils

About OCI Utilities Instances created in Oracle Cloud Infrastructure using Oracle-Provided Images based on Oracle Linux include a pre-installed set of utilities that are designed to make it easier to work with Oracle Linux images. This is a quick blog post to demonstrate how the oci-metadata command included in OCI Utilities makes quick work of accessing instance metadata. Update - March 4th, 2019: This post was updated to include the --value-only option As of this writing, the following components are included. You can read more about each of the utilities in the OCI Utilities documentation. ocid oci-growfs oci-iscsi-config oci-metadata oci-network-config oci-network-inspector oci-public-ip Working With Instance Metadata Using oci-metadata Display all instance metadata To display all metadata in human-readable format, simply run oci-metadata $ oci-metadata Instance details: Display Name: autonomous blog Region: iad - us-ashburn-1 (Ashburn, VA, USA) Canonical Region Name: us-ashburn-1 Availability Domain: PDkt:US-ASHBURN-AD-3 Fault domain: FAULT-DOMAIN-2 OCID: ocid1.instance.oc1.iad.abuwcl.................7crrhz2g......aq Compartment OCID: ocid1.tenancy.oc1..aaaaaaaa5............qok3lunzc6.....jw7q Instance shape: VM.Standard2.1 Image ID: ocid1.image.oc1.iad.aaaaaaaawuf..............zjc7klojix6vmk42va Created at: 1548877740674 state: Running Instance Metadata: user_data: dW5kZWZpbmVk ssh_authorized_keys: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAxxxxxxxxxxVMheESQgRukanNBmLxaXA0kZw4DxaCispcEjTgAmBmHpUWQBsG7Y/s3zVQDUZ5irMKr2Rtc5DAkH+y6SsNw+xxxxxx+Zix85RClbmu3vl6Mf1++15VoxxxxxEP16mPZl+Cfk/T9LVIlMtV+brph8AQACxFxxxxxxWSNTj1tE8DTml2QnSA6F6MtP6OvOQ0KzQViNm1kN9MaarGOoNxxxxxxNyJGayh8YA6+n8Y07A3fr870H bmc Networking details: VNIC OCID: ocid1.vnic.oc1.iad.abuwcljsrul..............................hioysuicdmzcq VLAN Tag: 804 MAC address: 02:00:17:01:78:09 Subnet CIDR block: 10.0.2.0/24 Virtual router IP address: 10.0.2.1 Private IP address: 10.0.2.3   To display
all metadata in JSON format: $ oci-metadata -j { "instance": { "compartmentId": "ocid1.tenancy.oc1..aaaaaaaa5............qok3lunzc6.....jw7q", "displayName": "autonomous blog", "timeCreated": 1548877740674, "state": "Running", "image": "ocid1.image.oc1.iad.aaaaaaaawufnve5jxze4xf7orejupw5iq3pms6cuadzjc7klojix6vmk42va", "canonicalRegionName": "us-ashburn-1", "metadata": { ... Display a specific metadata key value The following command displays the value of the canonicalRegionName key, trimming the path to the last component only using the --trim option. $ oci-metadata -g canonicalRegionName --trim Canonical Region Name: us-ashburn-1 Exporting metadata values as environment variables Using eval To set an environment variable with the name and value from instance metadata: $ eval $(oci-metadata --get compartmentId --export) $ echo $compartmentId ocid1.tenancy.oc1..aaaaaaaa5............qok3lunzc6.....jw7q Using the --value-only option The --value-only option, as the name implies, outputs only the key value without a label. For example: $ oci-metadata -g "CanonicalRegionName" --value-only us-phoenix-1 Or, to assign the compartment ID to an environment variable: $ export MYCOMPARTMENT=`oci-metadata -g "compartmentID" --value-only` $ echo $MYCOMPARTMENT ocid1.tenancy.oc1..aaaaaaaa5............qok3lunzc6.....jw7q Using jq To set an environment variable you name yourself, you can extract the raw value of a key using the jq JSON processor: $ export MYCOMPARTMENT=`oci-metadata -j --trim -g /instance/compartmentID | jq -r .[]` $ echo $MYCOMPARTMENT ocid1.tenancy.oc1..aaaaaaaa5............qok3lunzc6.....jw7q Display the Instance's Public IP Address To display the instance's public IP address: $ oci-public-ip Public IP address: 129.xxx.yyy.175   And to extract just the IP address: $ oci-public-ip -j | jq -r .[] 129.xxx.yyy.175

About OCI Utilities Instances created in Oracle Cloud Infrastructure using Oracle-Provided Images based on Oracle Linux include a pre-installed set of utilities that are designed to make it easier to...

Linux Kernel Development

Introducing SPDK for Oracle Linux

Oracle Linux Kernel developer Lance Hartmann contributes this blog post on using SPDK, the Storage Performance Development Kit. Introducing the SPDK Slated to arrive soon in the developer yum channel via ULN, the Storage Performance Development Kit (SPDK) is an open-source project providing user space tools and libraries for writing high performance, scalable storage applications built largely but not solely around a user space NVMe driver. Harnessing the power of multi-core CPUs and the multi-queue architecture of NVMe, SPDK applications can easily achieve the maximum bandwidth that an NVMe drive supports and enjoy low latency by polling for I/O completions instead of using interrupts. Under the Hood Zero-copy I/O is managed through the use of hugepages whose physical pages are always pinned for the data buffers and the I/O queues. A single thread per NVMe queue which both dispatches I/Os and checks for completions enables a lockless I/O path. For those NVMe controllers designated for use by the SPDK, the default Linux kernel nvme driver is unbound from them and replaced with a binding to either the uio_pci_generic or vfio-pci kernel drivers. Using the SPDK API, applications then gain access, via mmap(), to the NVMe controller's register set enabling them to perform admin actions and trigger I/Os. In addition to providing I/O to locally (PCIe) attached NVMe drives, the SPDK also ships with an NVMe over Fabrics target application. The RDMA transport for Infiniband and RoCE has been supported in the SPDK for a while, and the TCP transport was just recently added. A set of patches for supporting the Fibre Channel (FC) transport have been proposed. Configuration of the target is facilitated with either a configuration file, or may be dynamically managed via RPC calls provided by SPDK Python scripts. The growth and active development of the SPDK has yielded additional functionality.
A user space block layer also exists and provides a highly modular architecture enabling the development of "bdevs" which may be used alone or stacked atop one another enabling complex I/O pipelines. Existing bdev modules today include NVMe, RAM disk, Linux AIO, RAID (level-0/striping), iSCSI and more, all of which may be configured as targets to the SPDK's target applications. Common features of the block layer include mechanisms for enumerating SPDK block devices and exposing their supported I/O operations, queueing I/Os when the underlying device's queue is full, support of hotplug remove notification, obtaining I/O statistics which may be used for quality-of-service (QoS) throttling, timeout and reset handling, and more. Traditionally, the SPDK has relied on portions of the Data Plane Development Kit (DPDK) to provide lower level functionality, which is referred to as the run time environment. This includes things like thread and co-process management, memory management, virtual to physical address translation, lockless data structures like rings, and PCI enumeration and mmap()'d I/O. Over time it was realized that a number of consumers of the SPDK already had much of this functionality in place, and moreover, that their implementations were highly tailored to their types of workloads. Hence, an abstraction layer was created in the SPDK enabling consumers to employ their own run time environment if preferred over the default DPDK. Packaging The SPDK is currently in use in a number of production environments around the world, though to date has yet to appear via packages. Instead, consumers of the SPDK have been downloading source and building the SPDK from scratch to enable integration with their applications. Coming soon, SPDK rpm packages "spdk", "spdk-tools" and "spdk-devel" will make their inaugural debut for Oracle Linux.
The aim is to offer users the ability to experiment with some example SPDK applications and provide the include files and libraries to build their own SPDK applications, saving them the need to locate, download and build the SPDK themselves. Both static libraries and their shared equivalents are available, though note that ABI versioning is not yet in place but is planned for a future release.

Oracle Linux Kernel developer Lance Hartmann contributes this blog post on using SPDK, the Storage Performance Development Kit. Introducing the SPDK Slated to arrive soon in the developer yum channel...

Announcements

10 Leaders Share Their Infrastructure Transformation Stories

The detailed use cases in this paper are of the 2018 Winners of the Oracle Excellence Awards “Leadership In Infrastructure Transformation” category. In these 10 individual stories, you'll learn how IT leaders accelerated innovation and drove business transformation.  Each of these leaders ultimately delivered value to their organizations through the use of multiple Oracle technologies which have resulted in reduced cost of IT operations, improved time to deployment, and performance and end user productivity gains. Each story includes the use of at least one, if not a combination of several, of the below:   •    Oracle Linux •    Oracle Virtualization (VM, VirtualBox) •    Oracle Private Cloud Appliance •    Oracle SuperCluster •    Oracle SPARC •    Oracle Solaris •    Oracle Storage, Tape/Disk The stories feature Michael Polepchuk, Deputy Chief Information Officer, BCS Global Markets; Brian Young, Vice President, Cerner; Brian Bream, CTO, Collier IT; Rudolf Rotheneder, CEO, cons4u GmbH; Heidi Ratini, Senior Director of Engineering, IT Convergence; Philip Adams, Chief Technology Officer, Lawrence Livermore National Labs; JK Pareek, Vice President, Global IT and CIO, Nidec Americas Holding Corporation; Baris Findik, CIO, Pegasus Airlines; Michael Myhrén, Senior DBA Senior Systems Engineer, and Charles Mongeon, Vice President Data Center Solutions and Services (TELUS Corporation). Learn more here. 

The detailed use cases in this paper are of the 2018 Winners of the Oracle Excellence Awards “Leadership In Infrastructure Transformation” category. In these 10 individual stories, you'll learn how IT...

Announcements

Oracle Linux 7 Completes Common Criteria Evaluation

Oracle is pleased to announce that Oracle Linux 7 received Common Criteria Certification, which was performed against the National Information Assurance Partnership (NIAP) Protection Profile for General Purpose Operating Systems v4.1 and additionally at Evaluation Assurance Level (EAL) 1. Common Criteria is an international framework (ISO/IEC 15408) which defines a common approach for evaluating security features and capabilities of Information Technology security products. A certified product is one that a recognized Certification Body asserts as having been evaluated by a qualified, accredited, and independent evaluation laboratory competent in the field of IT security evaluation to the requirements of the Common Criteria and Common Methodology for Information Technology Security Evaluation. Security evaluation is a process by which independent but accredited organizations provide assurance in the security of IT products and systems to commercial, government, and military institutions. Such evaluations, and the criteria upon which they are based, are designed to help establish an acceptable level of confidence for IT purchasers and vendors alike. Furthermore, security evaluation criteria and ratings can be used as concise expressions of IT security requirements. The completed evaluation for Oracle Linux 7 update 3 was performed by atsec information security AB, in accordance with the requirements of Common Criteria, version 3.1, release 5, and the Common Methodology for IT Security Evaluation, version 3.1, release 5. The evaluation was performed against the Evaluation Activities for OSPP (Protection Profile for General Purpose Operating Systems v4.1) and SSH-EP (Extended Package for Secure Shell), as well as at Evaluation Assurance Level (EAL) 1, augmented by ALC_FLR.3 Flaw Remediation reporting procedures. The evaluation platform was Oracle Server X7-2 with both the Unbreakable Enterprise Kernel (UEK) and Red Hat Compatible Kernel (RHCK). 
Oracle Linux is engineered for open cloud infrastructure. It delivers leading performance, scalability, reliability, and security for enterprise SaaS and PaaS workloads as well as traditional enterprise applications. Oracle Linux Support offers access to award-winning Oracle support resources and Linux support specialists, zero-downtime updates using Ksplice, additional management tools such as Oracle Enterprise Manager and lifetime support, all at a low cost. Unlike many other commercial Linux distributions, Oracle Linux is easy to download and completely free to use, distribute, and update. For a matrix of Oracle security evaluations currently in progress as well as those completed, please refer to the Oracle Security Evaluations. Visit Oracle Linux Security to learn how Oracle Linux can help keep your systems secure and improve the speed and stability of your operations.

Oracle is pleased to announce that Oracle Linux 7 received Common Criteria Certification which was performed against the National Information Assurance Partnership (NIAP) General Purpose Operating...

Announcements

Announcing Oracle Container Services 1.1.12 for use with Kubernetes

Oracle is pleased to announce the general availability of Oracle Container Services 1.1.12 for use with Kubernetes which is based on Kubernetes version 1.12.5, as released upstream. It is available for Oracle Linux 7 and is designed to integrate with the Oracle Container Runtime for Docker, provided and supported by Oracle. Oracle Container Services for use with Kubernetes runs in a series of Docker containers which are available from the Oracle Container Registry.  This release maintains Oracle's commitment to conformance with the upstream project and is Certified Kubernetes by the Cloud Native Computing Foundation (CNCF). New features in this release: Support for high availability multi-master clusters kubeadm-ha-setup provides a setup and configuration tool to lessen the administrative burden in the creation of  "high availability" clusters.   Replacement of KubeDNS with CoreDNS CoreDNS is introduced and functions as the cluster DNS service. CoreDNS is installed by default on all new clusters, and support for KubeDNS is deprecated. Note that CoreDNS support requires the Unbreakable Enterprise Kernel Release 5 for Oracle Linux 7 or later. Although Oracle makes KubeDNS and support for the Unbreakable Enterprise Kernel Release 4 for Oracle Linux 7 available for users upgrading from earlier versions, the KubeDNS configuration is deprecated and future upgrades from this combination may not be possible.   Flexvolume driver for Oracle Cloud Infrastructure The flexvolume driver enables you to add block storage volumes hosted on Oracle Cloud Infrastructure to your Kubernetes cluster.  
In this release, the flexvolume driver for Oracle Cloud Infrastructure is a technical preview; you can read more at https://github.com/oracle/oci-flexvolume-driver Additional features in this release of Oracle Container Services for use with Kubernetes include upstream Kubernetes 1.12.5 software packaged for Oracle Linux, improved setup and configuration utilities, updated Kubernetes Dashboard software, improved cluster backup and restore tools, and integration testing for use with Oracle Cloud Infrastructure. For more information about these and other new features, please review the Oracle Container Services for Use with Kubernetes User Guide. The guide also contains documentation on how to use the new setup and configuration utility to install a multi-master cluster as well as upgrade existing clusters. Note that Oracle does not support upgrading an existing single-master cluster to a high availability cluster. Installation and Update Oracle Container Services 1.1.12 for use with Kubernetes is free to download from the Oracle Linux yum server. Customers are encouraged to use the latest updates for Oracle Container Services for use with Kubernetes that are released on the Oracle Linux yum server and on Oracle's Unbreakable Linux Network (ULN). You can use the standard yum update command to perform an upgrade. For more information about how to install and configure Oracle Container Services for use with Kubernetes, please review the Oracle Container Services for use with Kubernetes User's Guide. Oracle does not support Kubernetes on systems where the ol7_preview, ol7_developer, or ol7_developer_EPEL yum repositories or ULN channels are enabled, or where software from these repositories, or channels, is currently installed on the systems where Kubernetes runs.  Support This release of Oracle Container Services for use with Kubernetes is made available for Oracle Linux 7 and is designed to integrate with Oracle Container Runtime for Docker. 
Support is available to customers having an Oracle Linux Premier Support subscription and is restricted to the combination of Oracle Container Services for Kubernetes and Oracle Container Runtime for Docker on Oracle Linux 7. Refer to Oracle Linux 7 License Information User Manual for information about Oracle Linux support levels. Kubernetes® is a registered trademark of The Linux Foundation in the United States and other countries, and is used pursuant to a license from The Linux Foundation. Resources – Oracle Linux Documentation Oracle Linux Software Download Oracle Linux Oracle Container Registry Blogs Oracle Linux  Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter Data Sheets, White Papers, Videos, Training, Support & more Oracle Linux Product Training and Education Oracle Linux - http://oracle.com/education/linux For community-based support, please visit the Oracle Linux space on the Oracle Technology Network Community.

Oracle is pleased to announce the general availability of Oracle Container Services 1.1.12 for use with Kubernetes which is based on Kubernetes version 1.12.5, as released upstream. It is available...

Announcements

Click to Launch Oracle Linux KVM and Oracle Linux Storage Appliance using the Oracle Cloud Marketplace

We are pleased to announce the availability of the Oracle Linux KVM Image and the Oracle Linux Storage Appliance application on Oracle Cloud Marketplace.  Oracle Cloud Infrastructure (OCI) now provides ready access to these images for fast and easy deployment using the embedded Marketplace in Oracle Cloud Infrastructure. You can launch these applications directly from the Marketplace on your OCI Compute instance.  With a few clicks, you can get your Oracle Linux KVM and Oracle Linux Storage Appliance instances up and running. To access the Marketplace from the OCI Console, click the navigation menu. Then, under Solutions, Platform and Edge, go to Marketplace. To demonstrate how easy it is to deploy Oracle Linux KVM with a few clicks, simply select the Oracle Linux KVM application from the Marketplace. This will take you to overview details about the application and provides useful links to documentation and resources, and usage information on how to access the KVM instance on OCI.  After clicking on the Launch Instance button, you will need to select the version of the image and the compartment in which you wish to deploy the image, and accept the terms of usage.  Clicking Launch Instance then will take you directly to the Create Compute Instance window with the pre-populated KVM image source and instance configuration details. You may modify instance configuration details here, and clicking on Create Instance will immediately deploy your instance.  This is how easy it is to deploy the Oracle Linux KVM Image on OCI. Deploying the Oracle Linux Storage Appliance is just as easy using the Marketplace in OCI.  You can also find the Oracle Linux KVM Image and Oracle Linux Storage Appliance on the public Oracle Cloud Marketplace at https://cloudmarketplace.oracle.com. Navigating from the public Marketplace will also allow you to deploy these images quickly from within the OCI console. 
By simplifying how software development teams access and deploy Oracle Linux solutions on OCI, customers can innovate and respond quickly to changing business needs. Experience for yourself how easy it is to deploy Oracle Linux solutions on Oracle Cloud Infrastructure. If you are not subscribed to Oracle Cloud Infrastructure, you can try it out by creating a free account with available free credits. For more information, visit: Oracle Linux for Oracle Cloud Infrastructure Oracle Linux KVM Image for Oracle Cloud Infrastructure Getting Started: Oracle Linux KVM for Oracle Cloud Infrastructure Oracle Linux Storage Appliance Blog: Click to Launch Images by Using the Marketplace in Oracle Cloud Infrastructure

We are pleased to announce the availability of the Oracle Linux KVM Image and the Oracle Linux Storage Appliance application on Oracle Cloud Marketplace.  Oracle Cloud Infrastructure (OCI) now...

Announcements

Announcing Oracle Container Runtime for Docker Release 18.09

Oracle is pleased to announce the release of Oracle Container Runtime for Docker version 18.09. Oracle Container Runtime allows you to create and distribute applications across Oracle Linux systems and other operating systems that support Docker. Oracle Container Runtime for Docker consists of the Docker Engine, which packages and runs the applications, and integrates with the Docker Hub, Docker Store and Oracle Container Registry to share the applications in a Software-as-a-Service (SaaS) cloud. Notable Updates Oracle has implemented multi-registry support that makes it possible to run the daemon with the --default-registry flag, which can be used to change the default registry to point to a registry other than the standard Docker Hub registry. More flexibility is provided with the --add-registry option which defines alternate registries to be used in case the default registry is not available. Other functionality available in this feature includes the --block-registry flag which can be used to prevent access to a particular Docker registry. Registry lists help ensure that images are prefixed with their source registry automatically, so that a listing of Docker images indicates the source registry from which an image was pulled.   This release of Docker introduces an integrated SSH connection helper that allows a Docker client to connect to a remote Docker engine securely over SSH.   The Docker client application can now be installed as an independent package, docker-cli, so that the Docker engine daemon does not need to be installed on a system that may be used to manage a remote Docker daemon instance.   Docker 18.09 uses a new version of containerd, version 1.2.0. This version of containerd includes many enhancements for greater compatibility with the most recent Kubernetes release. This release has integrated additional improvements and security fixes, including the fix to CVE-2019-5736.   
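As a sketch of how the registry flags described above might be combined (the registry hostnames below are placeholders, not real endpoints, and the exact file used to pass daemon options can vary by setup), the daemon options could be set along these lines in a sysconfig-style file:

```
# /etc/sysconfig/docker (illustrative sketch; registry hosts are hypothetical)
OPTIONS='--default-registry registry.example.com \
         --add-registry mirror.example.com \
         --block-registry docker.io'
```

With a configuration like this, unqualified image names would be pulled from registry.example.com, fall back to mirror.example.com if the default is unavailable, and access to docker.io would be refused; listings of pulled images would show each image prefixed with its source registry.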
Upgrading

To learn how to upgrade from a previously supported version of Oracle Container Runtime for Docker, please review the Upgrading Oracle Container Runtime for Docker chapter of the documentation. Note that upgrading from a developer preview release is not supported by Oracle.

Support

Support for the Oracle Container Runtime for Docker is available to customers with an Oracle Linux support subscription. Refer to Oracle Linux 7 License Information User Manual for information about Oracle Linux support levels.

Oracle Linux Resources:

Documentation: Oracle Container Runtime for Docker User's Guide; Oracle Container Services for use with Kubernetes User's Guide; Oracle Linux
Software Download: Oracle Linux; Oracle Container Registry
Blogs: Oracle Linux Blog
Community Pages: Oracle Linux
Social Media: Oracle Linux on YouTube; Oracle Linux on Facebook; Oracle Linux on Twitter
Data Sheets, White Papers, Videos, Training, Support & more: Oracle Linux
Product Training and Education: Oracle Linux - education.oracle.com/linux

For community-based support, please visit the Oracle Linux space on the Oracle Developer Community.


Linux

Making Code More Secure with GCC - Part 1

This blog entry was contributed by Maxim Kartashev In today's world, programming securely is the default option for any project. A program that doesn't validate its input, contains buffer overruns or uninitialized variables, uses obsolete interfaces, etc. quickly becomes a liability. Standards, best practices, and tools that help find security-related bugs and prevent them from creeping into code are in no short supply. There are, for instance, SEI CERT secure coding standards designed to be statically verifiable. And a multitude of static checkers such as Fortify, Coverity, Parfait, to name a few. They all help to make the code more secure, each at its own cost to developers, but inevitably involve tools that are generally foreign to the development process. The effort required to start using that software in your project varies and is never zero. Authors of Coverity, a popular static program analyzer, formulated two laws of bug finding: "Law #1: You can't check code you don't see. Law #2: You can't check code you can't parse". From the tool developer's perspective, this means that the code analyzer must mimic the toolchain (compiler, linker, support libraries) that is used to build the program as closely as possible. Or, risk missing bugs and seeing what's not there, which may result in false positives that so frequently throw us all off. On the other hand, the toolchain itself has many qualities of a program checker: the compiler can flag potential errors in the code, often at no additional cost to the user, the linker can help to find inconsistencies in inter-module calls and warn about the use of insecure and outdated interfaces, the run-time support libraries can do additional bookkeeping and help to locate accidental interface misuse. This post starts a short series, in which I am going to explore the capabilities of the GNU 7.3 toolchain in the area of secure programming. I'll focus on the power of the compiler as a static analyzer in this post. 
GCC Static Analysis Options

Before generating executable code, the compiler has to perform a great many checks to make sure the program conforms to the syntactic and semantic constraints of the language it is written in. For example, direct initialization of a 1-byte variable with an integer constant that doesn't fit in it is an error in C++11 and, quite naturally, the error will be reported:

char c{0xFFFF}; // error: narrowing conversion of `65535` from `int` to `char` inside { } [-Wnarrowing]

Some of those checks aren't strictly required, but help to bring potential problems to the programmer's attention. For example, using a different kind of initialization of the same variable with the same value is technically allowed, but still has the same chance of being an error on the programmer's part. This is where compiler warnings come in:

char c(0xFFFF); // warning: overflow in implicit constant conversion [-Woverflow]

gcc 7.3 has almost 150 distinct options that control warnings. Some are useful because they indicate unintended user errors, even if the language rules say nothing about the situation. Some are there to help enforce certain guidelines that may or may not be employed by your project (for instance, -Weffc++). Fortunately, very few of those options need to be mentioned by name thanks to several "macro" options that enable many warnings at once: -Wall and -Wextra. Together, these two options control 50+ warnings, all of which are useful, so they make a sensible default for any build. Despite the name, -Wall doesn't turn on all the warnings; neither does -Wextra. While they give diagnostics worth paying attention to, there could be an overwhelming amount of "unused variable" warnings at first. Those rarely indicate real problems in the code (but see below), so it might be a good idea to add -Wno-unused until all the other warnings have been dealt with.
The proper solution to silencing the "unused" warnings is to add __attribute__((unused)) to those variables that are intentionally unused.

Note: Sometimes the warning about an unused variable hints at a real problem, so don't turn these warnings off forever. For example, in the following code, the "unused" warning indirectly points to the fact that the constructor's parameter was used instead of the class member, which was obviously intended to be initialized in the constructor. This is the result of naming the constructor parameter the same as the class's data member (may compilers be merciful to those adventurous souls who do such a thing).

struct A
{
    int field;
    A(int field) // warning: parameter `field` set but not used [-Wunused-but-set-parameter]
    {
        field = 42; // meant to initialize the data member, but set the constructor's parameter instead
    }
};

Additional Help

The -Wall -Wextra warnings do not fully unleash the potential of gcc's static analysis capabilities. To narrow the compiler's focus on program security, consider adding these options as well: -Wformat-security -Wduplicated-cond -Wfloat-equal -Wshadow -Wconversion -Wjump-misses-init -Wlogical-not-parentheses -Wnull-dereference

Here is why I consider these useful:

-Wformat-security (or -Wformat=2)
Helps to catch all kinds of format-string-related security holes. It always makes sense to keep either option on the command line.

void foo(const char* s)
{
    printf(s); // warning: format not a string literal and no format arguments [-Wformat-security]
}

-Wduplicated-cond
Seems to always indicate a bug; for example, a misspelled comparison operator.

if ( p == 42 ) {
    return 1;
} else if ( 42 == p ) { // warning: duplicated `if` condition [-Wduplicated-cond]
    return 2;
}

-Wfloat-equal
The result of comparing floating point numbers for equality is rarely predictable and therefore indicates a possible bug in the code.
Consult the 1991 article "What Every Computer Scientist Should Know About Floating-Point Arithmetic" for an in-depth explanation of the reasons.

double d = 3;
return d == 3; // warning: comparing floating point with == or != is unsafe [-Wfloat-equal]

-Wshadow
This option helps to catch accidental misuse of variables from different scopes and is highly recommended.

int global = 42;
int main()
{
    char global = 'a'; // warning: declaration of 'global' shadows a global declaration [-Wshadow]
    // ... many lines later ...
    return global; // refers to the char variable, not ::global
}

-Wconversion
The rules for adjusting a value when changing its type are complex and sometimes counter-intuitive. This option helps to spot unintended value adjustments.

unsigned u = -1; // warning: negative integer implicitly converted
                 // to unsigned type [-Wsign-conversion]

-Wjump-misses-init
Unlike in C++, jumping past variable initialization is not an error in C, but it is nevertheless dangerous.

switch(i) {
case 10:
    foo();
    int j = 42;
case 11: // warning: switch jumps over variable initialization [-Wjump-misses-init]
    return j;
default:
    return 42;
}

-Wlogical-not-parentheses
This option helps to find questionable - from the readability point of view - conditions that may or may not indicate a bug in the code.

if ( ! a > 1 ) // warning: logical not is only applied to the left hand
               // side of comparison [-Wlogical-not-parentheses]

-Wnon-virtual-dtor
This option is specific to C++ and usually indicates a problem in the code, but is not sophisticated enough to guarantee the absence of false positives.
struct A // warning: `struct A` has virtual functions and non-virtual destructor
{
    virtual void foo();
    ~A();
};

void foo(A* a)
{
    delete a; // warning: deleting object of polymorphic class type `A` which has
              // non-virtual destructor might cause undefined behavior [-Wdelete-non-virtual-dtor]
}

-Wnull-dereference
This is a very useful warning with little to no false positives, but it requires the -fdelete-null-pointer-checks option, which is enabled by optimizations for most targets.

void foo(int* p)
{
    *p = 1; // warning: null pointer dereference [-Wnull-dereference]
}

int main()
{
    int *p = 0;
    foo(p);
}

Higher Optimization Means Better Analysis

In order to be helpful with some of those warnings (for example, -Wnull-dereference and -Wstringop-overflow), the compiler needs to collect and analyze various kinds of information about the program. Some types of analysis are only performed at higher optimization levels, which is why it is advisable to compile with at least -O1 to get better diagnostics. For example:

#include <string.h>

int main(int argc, char *argv[])
{
    char buf[4];
    const char *s = argc > 10 ? argv[0] : "adbc"; // "s" may require 5 or more bytes
    strcpy(buf, s); // there's only room for 3 characters and the terminating 0 byte in buf
}

With the default optimization level - implying no optimization at all - you get no warnings:

$ gcc -Wall -Wextra a.c

But with -O1, the problem gets spotted:

$ gcc -Wall -Wextra -O1 a.c
a.c: In function ‘main’:
a.c:8:5: warning: ‘strcpy’ writing 5 bytes into a region of size 4 overflows the destination [-Wstringop-overflow=]
 strcpy(buf, s);
 ^~~~~~~~~~~~~~

Inter-module Checks

Even the highest optimization level cannot compensate for lack of information: the compiler is usually given one compilation unit (CU) at a time, making cross-checks between CUs impossible. There's a solution, though: the -flto option. It works with the linker's help and can spot otherwise very hard-to-find bugs.
In this example, a function is defined as char foo(int) in one file, but declared int foo(int) in another:

a.c:
char foo(int i) { /* ... */ }

b.c:
extern int foo(int);
typedef int (*FUNC)(int);

int main()
{
    FUNC fp = &foo;
    int i = fp(1); // foo() actually only returns 1 byte, while we read sizeof(int) here
    return i;      // may return garbage
}

Notice the difference in the size of the return types; when this function is called by the CU that only sees the latter declaration, it can end up reading uninitialized memory (3 bytes more than the function actually returns). Only the final link step with -flto can help to catch this:

$ gcc -flto -c a.c b.c # no warnings
$ gcc -flto a.o b.o
b.c:3:12: warning: type of ‘foo’ does not match original declaration [-Wlto-type-mismatch]
 extern int foo(int);
            ^
a.c:1:6: note: return value type mismatch
 char foo(int i)
      ^
a.c:1:6: note: type ‘char’ should match type ‘int’
a.c:1:6: note: ‘foo’ was previously declared here

As you can see, -flto has enabled gcc to compare the declaration and definition of the function and find that they aren't really compatible.

Key Take-Aways

To make your gcc-compiled program more secure:

Always add -Wall -Wextra to the gcc command line to get an ever-expanding set of useful diagnostics about your program.
Add -Wno-unused if the amount of messages regarding unused variables is overwhelming; consider using __attribute__((unused)) later.
Don't forget that these additional options help to make the code even more secure: -Wformat-security -Wduplicated-cond -Wfloat-equal -Wshadow -Wconversion -Wjump-misses-init -Wlogical-not-parentheses -Wnull-dereference
Compile with optimization (-O1 or higher) to enable the compiler to issue better diagnostics and help find real bugs in the code.
Use the latest possible gcc; each new major version adds dozens of new checks and improves existing ones.
What's Next

Static program analysis is always the result of a trade-off between the quality of the real bugs it finds and the quantity of false positives. In other words, not all true bugs are found and reported at compile time, which is why keeping your eyes open at run time is also important - and the GNU compiler can help with that, too. gcc is capable of adding checks to the code that it generates ("sanitizing" it), thus enabling automatic bug detection at run time. This compiler feature can help to find bugs that completely escape static analysis. I also plan to look at the built-in debugging capabilities of the support libraries the GNU toolchain provides.

References

SEI CERT Coding Standards for C, C++, Java, and Perl.
A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World - an article from the creators of Coverity.
A complete list of gcc 7.3 options with descriptions.
Difference in gcc options between versions, showing the amount of new analysis each new gcc version adds.
Parfait - Oracle Labs static program analysis tool.


Linux Kernel Development

Talk of Huge Pages at Linux Plumbers Conference 2018

Oracle Linux kernel developer Mike Kravetz, who is also the hugetlbfs maintainer, attended Linux Plumbers Conference 2018 and shares some of his thoughts about the conference, especially around huge pages, in this blog post.

Huge Pages and Contiguous Allocations at LPC 2018

At the 2018 Linux Plumbers Conference, Huge Page utilization was discussed during the Performance and Scalability microconf, and the topic of Contiguous Allocations was discussed during the RDMA microconf. Christoph Lameter and I gave brief presentations and led discussions on these topics. Neither of these topics is new to Linux; both are often discussed at conferences and other developer gatherings. One reason for the frequent discussion is that the issues are somewhat complicated and difficult to implement to everyone’s satisfaction. As a result, discussions tend to rehash old ideas, talk about any progress made and look for new ideas. Below are some of my observations from this year’s discussions.

Huge Pages

One may think that there is little to talk about in the realm of huge pages. After all, they have been available in Linux via hugetlbfs for over 15 years. When Transparent Huge Pages (THP) were added, huge pages could be used without the application changes and sysadmin support that hugetlbfs requires. While hugetlbfs functionality is mostly settled, new features have recently been added to THP: notably, work by Kirill Shutemov and others to add THP support in shm and tmpfs. Kirill has even proposed patches that add THP support to ext4. In addition to hugetlbfs and THP, DAX (Persistent Memory) defaults to using huge pages for suitably sized mappings. Ongoing Xarray work by Matthew Wilcox will make page cache management of multiple page sizes much easier. On systems with very large memory sizes, people would ideally like to scale up the base page size. The well-known default base page size is 4K on x86 and most other architectures.
However, it is possible to change the base page size on some architectures such as arm64 and powerpc. There is interest in exploring ways to increase the base page size on x86. However, jumping to the next size supported by the MMU (2M) would be wasteful in most cases. But for really big memory systems (think multi-TB) it may be worth exploring.

Contiguous Allocation

This discussion was a follow-up to the LPC 2017 presentation that formally introduced a new contiguous allocation request. The use case from 2017 was the need for an RDMA driver to have physically contiguous areas for optimal performance. Ideally, these areas would be allocated by and passed in from user space. The ideal size for this driver would be 2G. Two things make this use case especially difficult. First, there is no interface capable of obtaining a physically contiguous area of such a large size. The in-kernel memory allocators are based on the buddy allocator and have a maximum allocation size of MAX_ORDER-1 pages (4M by default on x86). CMA (Contiguous Memory Allocator) can allocate such large areas, but it requires administrative overhead and coordination. Second is the general problem of memory fragmentation. After the system has been up and running for a while, it becomes less and less likely to find large physically contiguous areas. Memory migration is used to try to create large contiguous areas. However, some pages become locked and cannot be moved, which prevents their migration. In a separate presentation, work in the area of fragmentation avoidance was presented by Vlastimil Babka: The hard work behind large physical allocations in the kernel. In addition, Mel Gorman has been working on a patch series to help address this issue. Christoph Lameter suggested an idea to protect large order pages from being broken up so that they would be available for contiguous allocations.
However, he admits this is a controversial hack that will likely not be accepted due to the “memory reservation” aspect of the approach. Even though the likelihood of actually obtaining large contiguous allocations is only slowly moving forward, an in-kernel interface to obtain contiguous pages has been proposed: alloc_contig_pages() would search for and return an arbitrary number of contiguous pages if possible. There is similar special-case code in the kernel today to allocate gigantic huge pages. The idea is to use this new interface for gigantic huge pages as well as other use cases.


Linux Kernel Development

BPF: Using BPF to do Packet Transformation

Notes on BPF (6) - BPF packet transformation using tc

Oracle Linux kernel developer Alan Maguire presents this six-part series on BPF, wherein he presents an in-depth look at the kernel's "Berkeley Packet Filter" -- a useful and extensible kernel function for much more than packet filtering. In earlier blog entries, we've run through some of the concepts in BPF, and hopefully now we're ready to try writing some BPF programs. One of the great use cases for BPF is in network packet handling. Here we will try to do some magic using BPF: we're going to turn IPv4 packets we receive on the wire into IPv6 packets for the receiving Linux networking stack, so that the receiving TCP/IP stack only sees IPv6 traffic, and then we will reverse the trick on outbound. So our system running BPF will only see IPv6 in the networking stack, while IPv4 traffic will be what's seen on the wire. Specifically, we'll do this for an ICMP echo request (ping), converting an inbound ping into an IPv6 echo request. Then we will take the IPv6 echo reply and convert it into IPv4. So the remote ping application thinks it's talking to an IPv4 endpoint, while the local Linux TCP/IP stack thinks it's talking to a remote IPv6 ping client! So on inbound, what happens is this:

+----> 3. IPv6 packet is processed by TCP/IP stack
|
+----> 2. BPF ingress (inbound) filter transforms it into IPv6
|
       1. IPv4 inbound packet arrives

Similarly for outbound packets:

+----- 1. IPv6 packet is sent by TCP/IP stack
V
+----- 2. BPF egress (outbound) filter transforms it into IPv4
|
       3. IPv4 outbound packet is sent on wire.

Why do this? Mostly because it's a non-trivial example of using BPF to do packet transformation, and I couldn't find any existing examples that do IPv4 -> IPv6 transformation. As a reminder though, the samples/bpf directory in the kernel tree has a bunch of different examples that are useful if you're trying to learn how to write BPF programs.
If you want to see the fully worked example, check out https://github.com/alan-maguire/bpf-test/blob/master/bpf/test_bpf_helper_bpf_skb_change_proto_kern.c It's part of a repo which does unit tests of various bpf helpers. This one covers the bpf_skb_change_proto() helper function, which allows us to turn an IPv4 packet into IPv6 and vice versa. The test converts IPv4 ICMP echo requests (pings) into IPv6 echo requests on ingress, and takes IPv6 echo replies on egress and converts them into IPv4 echo replies. So the remote system pings an IPv4 address and BPF translates things so that the echo request is processed as an IPv6 ping. Doing all this allows us to test that the protocol change helper works.

Converting IPv4 to IPv6 - a quick primer

To convert between the protocols, we need to remind ourselves what the differences are between IPv4 and IPv6. As always, consult the RFCs for full details, but to summarize the key details we need to care about:

IPv6 does not use a header checksum, while IPv4 checksums the IPv4 header.
IPv6 headers are 40 bytes in size while IPv4 headers are 20 bytes, largely because...
IPv6 addresses are 128 bits in size rather than 32 bits for IPv4.
IPv6 uses extension headers, while IPv4 uses options which are tacked on the end of the header.

Note that for higher-level protocols, we also need to consider the concept of a pseudo-header. When checksumming TCP, UDP and ICMPv6, we checksum the TCP, UDP and ICMPv6 packet content, but also add a pseudo-header consisting of the source/destination addresses, payload length and protocol type. Again, consult the RFCs for full details, but the consequence for BPF is this: if moving from IPv4 to IPv6, we need to modify layer 4 checksums also, because in changing the IP addresses (from v4 to v6 or vice versa), we also change the pseudo-header and thus the checksum calculation. Another pain point is that ICMPv6 != ICMP; types and codes are different, even for simple packet data like ping echo requests/replies.
So if we're converting ICMPv4 to ICMPv6, we will need to modify these fields too. And ICMPv4 does not use a pseudo-header, so we need to take that into account in checksum calculations. This all seems kind of daunting, but the great news is that BPF provides helpers to do checksum calculations, convert IPv4 to IPv6 and vice versa, and so on.

Choosing our BPF program type

When we initially described the various program types in BPF, we talked about when the BPF program associated with the program type is run. For this case, we have two requirements: we need to be able to run it on ingress for inbound traffic and on egress for outbound traffic, and it needs to process the packet on ingress prior to handing it off to the TCP/IP networking stack, and on egress prior to handing it to the driver for transmission. There are a few options for us to choose from, but a "tc" BPF program makes most sense. tc supports symmetric (ingress and egress) program attach, and the advantage of using XDP - not having to allocate packet metadata - doesn't really buy us much here, since we want to pass our packet upstream to the Linux TCP/IP stack. If we were doing some form of firewalling or DDoS mitigation where we were dropping a lot of the received packets, doing that without the overhead of skbuff packet metadata allocation in XDP would be ideal.

Userspace interactions?

In the real world, you'd likely want to restrict such conversions to a specific IP address or port, so you could store those in a BPF hash map. In the case of our tests, we use a BPF array map to store test status for each test; this allows us to mark a test case failed from within our BPF program and to pick that up in the userspace program that launches the test.

Beware of offload functionality!

If you are doing anything involving tunnel encapsulation/de-encapsulation, it can be difficult to get that functionality working with generic send offload/generic receive offload functionality.
As a reminder, GSO allows us to send a large packet down to the device, which segments it into individual under-MTU-sized packets for transmission. If we are pre-pending tunnel headers etc., we may need to switch off such functionality, as we want each packet to have the tunnel header pre-pended. I haven't had much luck with getting these offload features to work with BPF, so I generally turn them off with ethtool, but your experience may be different.

Direct packet access versus bpf_skb_load/store_bytes

Initially, the way to read and write packet data in BPF was to use bpf_skb_load_bytes() and bpf_skb_store_bytes(). These interfaces were useful because they handled cases where the packet is what is known as non-linear, meaning that the buffers storing packet data are not contiguous. In general packet headers are in the linear portion of an sk_buff, but I've come across cases (in heavily encapsulated traffic for VMs) where header data falls into non-linear parts of packet data. For a review of how sk_buff data structures work, see David Miller's "How SKBs work": http://vger.kernel.org/~davem/skb_data.html Later, direct packet access was added to BPF, which meant we could use the __sk_buff "data" pointer to access packet data like a normal pointer. However, for safety, BPF requires that we first test we have not reached the end of the linear portion of the packet (data_end), so most packet accesses have to be prefixed with checks for this condition. If we fall off the end of the packet, we can explicitly call bpf_skb_pull_data() to request that the desired amount of data be in the linear portion.

Writing our ingress filter

Our goal is to process an IPv4 inbound ICMPv4 echo request packet and convert it into ICMPv6. I've chosen ICMP because it's harder to do than TCP or UDP - for those protocols, L4 checksum modification is done for the changed IP addresses only.
For ICMPv4->ICMPv6 we also need to change the ICMP type and take into account the fact that ICMPv6 has a pseudo-header whereas ICMPv4 does not. So to adapt this example to TCP/UDP, you will just need to modify the checksum computations and the checksum offset.

Verify our packet is IPv4/ICMP

We define our ingress ELF section, and we use direct packet access (hence the initial checks) to ensure we've got an IPv4 (ETH_P_IP) packet, and moreover that it's an ICMP echo request (ICMP_ECHO). Note we could do an explicit bpf_skb_pull_data() for these cases, but since it's unlikely that the first few bytes of the packet are non-linear, we just pass such packets up to Linux intact (by returning TC_ACT_OK).

SEC("ipv4toipv6_ingress")
int ipv4toipv6_ingress(struct __sk_buff *skb)
{
	/* We use an icmp hdr for icmp6 because we only want type/code/check */
	struct icmphdr *icmph, icmp6h = { 0 };
	void *data_end = (void *)(long)skb->data_end;
	void *data = (void *)(long)skb->data;
	struct eth_hdr *eth = data, eth_copy;
	struct iphdr *iph;

	if (data + sizeof(*eth) > data_end)
		return TC_ACT_OK;
	if (bpf_ntohs(eth->h_proto) != ETH_P_IP)
		return TC_ACT_OK;
	if (data + sizeof(*eth) + sizeof(*iph) > data_end)
		return TC_ACT_OK;
	iph = data + sizeof(*eth);
	if (iph->protocol != IPPROTO_ICMP)
		return TC_ACT_OK;
	if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*icmph) > data_end)
		return TC_ACT_OK;
	icmph = data + sizeof(*eth) + sizeof(*iph);
	if (icmph->type != ICMP_ECHO)
		return TC_ACT_OK;

Also note that if IP options were present, we'd need to adjust offsets accordingly, but we will keep things simple here.

Copy our ethernet header, extract needed info from IP header

When we convert from IPv4 to IPv6, we need 20 bytes of extra space for the IPv6 header. The bpf helper bpf_skb_change_proto() will reserve extra headroom in the sk_buff for us to do this, but at the cost of overwriting the existing ethernet header. So let's copy that out and modify the protocol to ETH_P_IPV6.
/* Copy original ethernet header, as it must be moved. */
ret = bpf_skb_load_bytes(skb, 0, &eth_copy, sizeof(eth_copy));
if (ret) {
	bpf_debug("bpf_skb_load_bytes returned %d\n", ret);
	return TC_ACT_OK;
}
eth_copy.h_proto = bpf_htons(ETH_P_IPV6);

/* IPv6 payload len does not include header len. */
payload_len = bpf_ntohs(iph->tot_len) - (iph->ihl << 2);

Construct our ICMPv6, IPv6 headers

Here we use hardcoded IPv6 addresses along with a simple __always_inline function to set the 4 32-bit values comprising an IPv6 address:

static __always_inline void ipv6_addr_set(struct in6_addr *addr,
					  __be32 w1, __be32 w2,
					  __be32 w3, __be32 w4)
{
	addr->in6_u.u6_addr32[0] = w1;
	addr->in6_u.u6_addr32[1] = w2;
	addr->in6_u.u6_addr32[2] = w3;
	addr->in6_u.u6_addr32[3] = w4;
}

The "__always_inline" is needed to ensure the function gets into our ingress ELF section. Back to our ingress handler:

/* Time to construct ICMPv6 header. */
icmp6h.type = ICMPV6_ECHO_REQUEST;
icmp6h.code = icmph->code;

/* Time to construct IPv6 header and copy it. */
__builtin_memset(&ip6h, 0, sizeof(ip6h));
ip6h.version = 6;
ip6h.payload_len = bpf_htons(payload_len);
ip6h.nexthdr = IPPROTO_ICMPV6;
ip6h.hop_limit = 8;
ipv6_addr_set(&ip6h.saddr, BPF_HELPER_IPV6_PREFIX, 0, 0,
	      BPF_HELPER_IPV6_REMOTE_SUFFIX);
ipv6_addr_set(&ip6h.daddr, BPF_HELPER_IPV6_PREFIX, 0, 0,
	      BPF_HELPER_IPV6_LOCAL_SUFFIX);

Calculate value for ICMPv6 checksum

Internet checksums have some really nice mathematical properties; one key property is that if a field of a header changes, we can recalculate the checksum without traversing the whole header if we know the old and new values.
We take advantage of that behaviour here, because in moving from IPv4 ICMP to ICMPv6 we need to add a pseudo-header to our ICMPv6 checksum - to do so we need to sum over the IPv6 addresses, the payload length and the protocol (IPPROTO_ICMPV6) - and we also need to take into account the difference between the old ICMP type (ICMP_ECHO) and the ICMPv6 equivalent (ICMPV6_ECHO_REQUEST). We need a function to generate the sum of 16-bit values, so we use Clang's loop-unrolling feature to define sum16():

static __always_inline __u32 sum16(__u16 *addr, __u8 len)
{
	__u32 sum = 0;
	int i;

#pragma clang loop unroll(full)
	for (i = 0; i < len; i++)
		sum += *addr++;

	return sum;
}

...and then use it to sum up the checksum value changes in adding the pseudo-header and modifying the ICMP type values:

/* Fix up our checksum. Source/destination addresses have changed, and
 * so has ICMP type. Note that ICMPv6 also has a pseudo-header, so
 * we also need to add payload length and ICMPv6 protocol to newsum,
 * but do not add IPv4 equivalents to oldsum because ICMPv4 does not
 * use a pseudo-header in checksum calculation. Only thing that changes
 * for oldsum is ICMP type.
 */
oldsum = icmph->type;
newsum = sum16((__u16 *)&ip6h.saddr, sizeof(ip6h.saddr) >> 1);
newsum += sum16((__u16 *)&ip6h.daddr, sizeof(ip6h.daddr) >> 1);
newsum += icmp6h.type + bpf_htons(payload_len) + bpf_htons(IPPROTO_ICMPV6);

Later we will use these values to modify the checksum.

Change from IPv4 -> IPv6 and store our new ethernet, IPv6 and ICMPv6 data

We also update the checksum via bpf_l4_csum_replace(), specifying our oldsum and newsum values from above:

/* Convert skb to IPv6 and adjust headroom to allow for space for
 * IPv6 header.
 */
ret = bpf_skb_change_proto(skb, bpf_htons(ETH_P_IPV6), 0);
if (ret) {
	bpf_debug("bpf_skb_change_proto returned %d\n", ret);
	return TC_ACT_OK;
}
/* Store our copied ethernet header at new start of packet. */
ret = bpf_skb_store_bytes(skb, 0, &eth_copy, sizeof(eth_copy), 0);
if (ret) {
	bpf_debug("bpf_skb_store_bytes returned %d\n", ret);
	return TC_ACT_SHOT;
}
/* Store our IPv6 header after the copied ether header. */
ret = bpf_skb_store_bytes(skb, sizeof(*eth), &ip6h, sizeof(ip6h), 0);
if (ret) {
	bpf_debug("bpf_skb_store_bytes returned %d\n", ret);
	return TC_ACT_SHOT;
}
/* Only two bytes type/code change. */
ret = bpf_skb_store_bytes(skb, sizeof(*eth) + sizeof(ip6h), &icmp6h, 2, 0);
if (ret) {
	bpf_debug("bpf_skb_store_bytes returned %d\n", ret);
	return TC_ACT_SHOT;
}
/* Lastly, recompute L4 checksum. */
ret = bpf_l4_csum_replace(skb, sizeof(*eth) + sizeof(ip6h) +
			  offsetof(struct icmphdr, checksum),
			  oldsum, newsum,
			  BPF_F_PSEUDO_HDR | sizeof(newsum));
if (ret) {
	bpf_debug("bpf_l4_csum_replace returned %d\n", ret);
	return TC_ACT_SHOT;
}

Note that in failure cases we return TC_ACT_SHOT, since we've modified the packet in bpf_skb_change_proto() such that it's not in a proper state if something goes wrong.

Writing our egress filter

This is mostly reversing the above, with the caveat that we need to calculate the IPv4 checksum. Again, see the referenced example for a fully worked-out version: https://github.com/alan-maguire/bpf-test/blob/master/bpf/test_bpf_helper_bpf_skb_change_proto_kern.c

Conclusion

BPF is an extremely flexible environment in which to do packet processing. We didn't touch on encapsulation/de-encapsulation here, but we can handle cases like that with the helper bpf_skb_adjust_room() to add/remove headroom in a packet. Hopefully the above demonstrates that we can do some interesting things in BPF! Be sure to visit the previous installments of this series on BPF, and stay tuned for our next blog posts!

1. BPF program types
2. BPF helper functions for those programs
3. BPF userspace communication
4. BPF program build environment
5. BPF bytecodes and verifier
6. BPF Packet Transformation



BPF In Depth: The BPF Bytecode and the BPF Verifier

Notes on BPF (5) - BPF bytecodes and the BPF verifier

Oracle Linux kernel developer Alan Maguire presents this six-part series on BPF, wherein he presents an in depth look at the kernel's "Berkeley Packet Filter" -- a useful and extensible kernel function for much more than packet filtering. Previously, we've described:

- what sorts of BPF programs can be used;
- the BPF helper functions those programs can call;
- the ways BPF programs can communicate with userspace;
- how to set up a BPF build environment.

Now we've got one more topic to cover before we're ready to start writing BPF programs. How does BPF ensure that programs are safe?

When working with BPF, the first wall you are likely to hit - after compiling your program and trying to load it - is a BPF verifier complaint such as this one I came across recently when loading a BPF-based tc classifier:

    from 1545 to 1615: R0=inv0 R6=inv2 R7=ctx(id=0,off=0,imm=0) R8=inv(id=0,umin_value=28,umax_value=1048,var_off=(0x0; 0x7fc)) R9=inv(id=0,umax_value=1020,var_off=(0x0; 0x3fc)) R10=fp0 fp-248=inv fp-232=map_value
    1615: (61) r1 = *(u32 *)(r7 +76)
    1616: (79) r2 = *(u64 *)(r10 -152)
    1617: (0f) r1 += r2
    math between pkt pointer and register with unbounded min value is not allowed

To understand what all this means, we need to describe the BPF instruction set and what the verifier does.

BPF instruction set

As mentioned previously, the eBPF instruction set extended the set of bytecodes available, moved to 64-bit registers, and in general created an instruction set that looks quite like x86_64. This isn't a coincidence; the aim was to support just-in-time (JIT) compilation as a speedier alternative to interpreting bytecodes.
JIT compilation can be enabled via

    # sysctl net/core/bpf_jit_enable=1

The instruction set is documented at https://www.kernel.org/doc/Documentation/networking/filter.txt ...but be sure to look at the "BPF kernel internals" section and later, as above that is a description of "classic" BPF, the initial instruction set used for packet filtering. Classic BPF is translated in-kernel to support existing filtering mechanisms in wireshark, tcpdump etc.

BPF Registers

    Register  Function                                                  x86_64 equiv
    R0        return value from in-kernel function/exit value for prog  rax
    R1        first arg to in-kernel function/scratch variable          rdi
    R2        second arg to in-kernel function/scratch variable         rsi
    R3        third arg to in-kernel function/scratch variable          rdx
    R4        fourth arg to in-kernel function/scratch variable         rcx
    R5        fifth arg to in-kernel function/scratch variable          r8
    R6        callee saved registers that in-kernel function preserves  rbx
    R7        callee saved registers that in-kernel function preserves  r13
    R8        callee saved registers that in-kernel function preserves  r14
    R9        callee saved registers that in-kernel function preserves  r15
    R10       read-only frame pointer to access stack                   rbp

As we can see, the maximum number of function register arguments BPF can use is 5 (x86_64 supports more). So just-in-time compilation can simply use the x86_64 equivalents when creating a mapping from BPF instructions to the x86_64 ISA. The x86_64 implementation of JIT compilation for 4.14 is found at https://github.com/oracle/linux-uek/blob/uek5/master/arch/x86/net/bpf_jit_comp.c . bpf_int_jit_compile() makes multiple passes over the bytecodes, shrinking the image each time until no more shrinking occurs. do_jit() carries out the mapping, cycling through the instructions in a big switch() statement. There is no great need to describe the various supported BPF opcodes, as the output from the verifier is rendered in a quite human-readable form.
Returning to our verifier complaint, it showed us a few snippets of our program: 1615: (61) r1 = *(u32 *)(r7 +76) 1616: (79) r2 = *(u64 *)(r10 -152) 1617: (0f) r1 += r2 From the above, we know that r1 and r2 are registers used to pass arguments to BPF functions. So on 1615, we are setting r1 to the u32 value pointed at by (r7 + 76). And on 1616, we are setting r2 to an offset from the frame pointer, i.e. a local variable on the stack. Finally we add both together. The last piece of the puzzle is to describe what the verifier context information "from 1545 to 1615: R0=inv0 R6=inv2 R7=ctx(id=0,off=0,imm=0) R8=inv(id=0,umin_value=28,umax_value=1048,var_off=(0x0; 0x7fc)) R9=inv(id=0,umax_value=1020,var_off=(0x0; 0x3fc)) R10=fp0 fp-248=inv fp-232=map_value" and error "math between pkt pointer and register with unbounded min value is not allowed" ...mean. To describe that, we need a bit more information on what the verifier does. The BPF verifier What does the verifier do? At a high-level, the BPF verifier ensures that BPF programs are safe - i.e. they will not crash the system, access invalid memory addresses, etc. The verifier code is pretty well commented; I'd recommend starting with https://github.com/oracle/linux-uek/blob/uek5/master/kernel/bpf/verifier.c : Here's what it says bpf_check() is a static code analyzer that walks eBPF program instruction by instruction and updates register/stack state. All paths of conditional branches are analyzed until 'bpf_exit' insn. The first pass is depth-first-search to check that the program is a DAG. It rejects the following programs: - larger than BPF_MAXINSNS insns - if loop is present (detected via back-edge) - unreachable insns exist (shouldn't be a forest. program = one function) - out of bounds or malformed jumps The second pass is all possible path descent from the 1st insn. 
    Since it's analyzing all pathes through the program, the length of the
    analysis is limited to 64k insn, which may be hit even if total number of
    insn is less then 4K, but there are too many branches that change
    stack/regs. Number of 'branches to be analyzed' is limited to 1k

DAG is a "directed acyclic graph" - we want to ensure our program has no backward branches. It is a directed graph because we always branch forwards. However, multiple places in the code can branch forward to the same destination, so it's not a tree.

First pass verifier complaints

So from the above, when considering verifier errors, we have a few different classes of problem. In the first pass, we can get verifier errors if we pass in:

- programs that are too big (larger than BPF_MAXINSNS instructions). In Linux 4.14, BPF_MAXINSNS is set to 4096.
- programs with loops. Note that we can unroll simple loops in clang via "#pragma clang loop unroll(full)", but such loops should be simple, as unrolling can fail. In particular, loops which are bounded by a variable value (such as a value retrieved from a packet header) are risky. In addition, complex predicates within a loop body can also be problematic.
- programs which call other functions (that are not BPF helpers or defined as __always_inline).
- unreachable instructions. It is hard to see how this could happen in a restricted C environment; using raw BPF it could be a risk, of course.

Most of these problems are reasonably easy to eliminate.

Second pass verifier complaints

The second pass is trickier. The verifier will try all paths, tracking types of registers used as input to instructions, and updating resulting types via register state values. For example, PTR_TO_PACKET + SCALAR_VALUE → PTR_TO_PACKET. Certain operations are forbidden; e.g. adding two pointer values together gives an invalid value. The BPF verifier explores all paths of the program.
For conditional jumps, a stack is used, so one path is explored while the instruction for the other path is pushed onto the stack. So we do a depth-first search of the instruction set. When we arrive at an instruction with a state equivalent to an earlier instruction state analysis (see the states_equal() function for how this is determined), we can prune the search. If we reach bpf_exit() without any complaints and a valid R0 value (the return value of the BPF program), a state is marked safe. We then backtrack to the first pushed instruction and repeat the cycle until the stack is empty and we're done. In my experience, the verifier can find quite subtle issues in code, and while you will spend a considerable amount of time tracking them down, it is usually a genuine bug! There are cases where compiler optimizations can confuse the verifier, so it's always best to run "llvm-objdump" (see below) to examine your program. So the "math between pkt pointer and register with unbounded min value is not allowed" error we saw above is essentially the verifier figuring out the set of actions in your program can lead to a situation where it is possible to run off the start of the packet by subtracting a value from the packet pointer. It is important to ensure values we add to packet pointers are unsigned, e.g. __u16. Also watch out for overflows when bit-shifting, adding, subtracting or multiplying. Ensure same for accesses on stack - for that we would get a similar error but it would mention the fp (frame pointer) instead. So what about "from 1545 to 1615: R0=inv0 R6=inv2 R7=ctx(id=0,off=0,imm=0) R8=inv(id=0,umin_value=28,umax_value=1048,var_off=(0x0; 0x7fc)) R9=inv(id=0,umax_value=1020,var_off=(0x0; 0x3fc)) R10=fp0 fp-248=inv fp-232=map_value" ? With the above info, we can see that this is the output of the verifier analysis, and it is telling us the state of the registers as per the verifier when processing a given chunk of instructions. 
Direct packet access, non-linear SKBs and verifier complaints

For BPF programs which support sk_buff access, there are two modes with which we can retrieve/store information from/to a packet. The first is to use bpf_skb_load|store_bytes(). The advantage of this interface is that it works for linear and non-linear sk_buffs. Packet data is stored in the sk_buff structure, and it can hold that data in an initial "linear" section from skb->data to skb->data + skb->end (if this all sounds confusing, I'd recommend reading "How SKBs work" by David Miller). However, additional packet data can also be stored in fragments associated with the packet. These are referenced via the skb_shared_info structure.

In BPF, we get a modified version of the skb, "struct __sk_buff", which contains pointers "data" and "data_end" which point at the start and end of the linear portion of the packet. We can directly read and write packet data using these pointers, but the BPF verifier requires that we first ensure that the packet data we wish to read/write is between "data" and "data_end". So a lot of BPF code which uses direct packet access looks like this:

    struct eth_hdr *eth;

    if (data + sizeof(*eth) > data_end)
        return TC_ACT_OK;
    eth = data;
    if (bpf_ntohs(eth->h_proto) == ETH_P_IP) {
    ...

This leads naturally to a question - what if some of our packet data falls into the non-linear portion? I've encountered situations - particularly in VMs which use multiple layers of packet encapsulation - where some packet header data falls into the non-linear part of the skb. The best approach is to test for the condition where additional packet data is present and not in the linear portion. If that is the case, we can call bpf_skb_pull_data(skb, data_len) to ensure that data_len bytes will be in the linear portion. David Miller writes about direct packet access here. However, you may get a bunch more verifier warnings if you do this. Why?
Well, many BPF functions such as bpf_skb_store_bytes(), bpf_skb_pull_data(), bpf_skb_adjust_room() etc will invalidate the data/data_end pointers and any checks done on them. So when using direct packet access, we need to retrieve data/data_end from the skb again and ensure that we verify the data we read/write falls between them. Examining your program with llvm-objdump If you get verifier errors, you will want to figure out where they occur in your restricted C code. We can dump our BPF program with annotated source if we run # llvm-objdump -S -no-show-raw-insn program.o Ensure the original program was compiled with -g to get source annotations. The result will look something like this: program.o: file format ELF64-BPF Disassembly of section program_handle_egress: program_handle_egress: ; { 0: r7 = r1 ; { 1: r6 = 0 ; void *data_end = (void *)(long)skb->data_end; 2: r2 = *(u32 *)(r7 + 80) ; void *data = (void *)(long)skb->data; 3: r1 = *(u32 *)(r7 + 76) ; if (data + sizeof(*eth) > data_end) 4: r3 = r1 5: r3 += 14 6: if r3 > r2 goto 570 You can see the source code interspersed with the BPF program, so it's a neat way to figure out what's happening around wherever the BPF is complaining about. Learning more about BPF Thanks for reading this installment of our series on BPF. We hope you found it educational and useful. Questions or comments? Use the comments field below! Stay tuned for the final installment in this series, BPF Packet Transformation. Be sure to visit the previous installments of this series on BPF, here, and stay tuned for our next blog posts! 1. BPF program types 2. BPF helper functions for those programs 3. BPF userspace communication 4. BPF program build environment 5. BPF bytecodes and verifier



BPF In Depth: Building BPF Programs

Notes on BPF (4) - Setting up your environment to build BPF programs

Oracle Linux kernel developer Alan Maguire presents this six-part series on BPF, wherein he presents an in depth look at the kernel's "Berkeley Packet Filter" -- a useful and extensible kernel function for much more than packet filtering.

Here I'm going to describe how I set up my programming environment to build BPF programs. The advice is mostly based around using Linux UEK5, which is based on a 4.14 Linux kernel, so a bit of adjustment will be needed for other distros. Note - I'm not going to talk about BCC (the BPF Compiler Collection) here; for UEK5 that extra step involves building BCC from source. BCC isn't required to build BPF programs - clang/LLVM support a BPF target, so what we're aiming for here is to compile and use BPF programs. BCC however is a great resource for programs and provides python bindings and much more.

Install dependencies

First, verify the kernel you are working with has the following configuration options enabled:

    CONFIG_BPF=y
    CONFIG_BPF_SYSCALL=y
    CONFIG_NET_CLS_BPF=m
    CONFIG_NET_ACT_BPF=m
    CONFIG_BPF_JIT=y
    CONFIG_HAVE_BPF_JIT=y
    CONFIG_BPF_EVENTS=y

All of these are enabled for our latest release based on Linux kernel 4.14, UEK5. To check these values for your running kernel:

    # grep BPF /boot/config-`uname -r`

To build BPF programs, add LLVM and clang packages. clang is used to compile C programs to BPF bytecodes; to ensure your version supports BPF, run "llc --version" and check that BPF is listed as a registered target. To support BPF compilation, clang should be version > 3.4.0 and LLVM version > 3.7.1, according to http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/end-user/build_environment.html . For UEK5, you can install them from the developer EPEL yum repository. Note: to use the latest LLVM/clang, "scl enable rh-dotnet20 bash" must be run.
For UEK5 # yum install -y yum-utils # yum-config-manager --add-repo=http://yum.oracle.com/public-yum-ol7.repo # yum-config-manager --enable ol7_developer --enable ol7_developer_EPEL # yum install -y rh-dotnet20-clang rh-dotnet20-llvm # scl enable rh-dotnet20 bash # llc --version |grep bpf If you are using tc to build classifiers/actions via BPF, or want to use BPF to manage lightweight tunnel encapsulation for routes, you will need up-to-date versions of iproute and tc; i.e. 4.14-based versions, which support interaction with BPF programs. Follow those links for the UEK5 packages; for other distros you can also build these from source if needs be. # yum-config-manager --enable ol7_UEKR5 # yum install -y iproute iproute-tc Finally, to build BPF programs, you will need an up-to-date kernel development package to compile against, e.g. the UEK5 kernel-uek-devel or kernel-uek-headers package, since the headers shipped with UEK5 in the kernel-headers do not contain the 4.14 definitions for the BPF syscall etc. # yum install -y kernel-uek-devel Warning - installing kernel-uek-headers installs updated headers in /usr/include and can cause compatibility issues, so it is often best avoided in production environments. For an alternative approach to updating the header files in /usr/include using kernel-uek-devel, read on. The kernel-uek-devel package provides kernel headers and makefiles sufficient to build modules against the kernel package, and has nearly everything we need to build BPF programs; the below shows us how to use it and add the extra pieces. Building BPF programs outside the kernel tree This is mostly specific to my needs, but if you want to build BPF programs outside of the kernel tree while having them compatible with samples/bpf (so hopefully you can push them upstream later!), read on... Once the above dependencies have been installed, you are ready to start building BPF programs. 
However, it's important to ensure they are compiled against the right headers with all the latest BPF definitions. The approach I use is to point compilation at the kernel-uek-devel headers from /usr/src/kernels/, so we can pick up the up-to-date BPF headers even if out-of-date headers are still installed in /usr/include (which they often are for backwards compatibility). The only complication with doing this is that the kernel-uek-devel package does not include a few files that are needed for compilation to succeed on the kernel side, and on the user-space side there are some convenience functions etc. implemented which we would like to use in our programs. Here I'll describe how I've tackled this; again this may not make sense for your situation, but there may be aspects of the Makefiles that you can re-use.

For BPF projects, I use a directory structure as follows:

    bpf/
    include/
    Makefile
    user/

I like to mirror the samples/bpf functionality - in particular I want to be able to use bpf.c/bpf_load.c, as these simplify BPF interactions by providing code to scan a BPF program and load ELF sections as maps and programs. Loading BPF programs - when not using tc or iproute for configuring lightweight tunnels, which both have BPF integration - is a pain.

Explaining each subdirectory:

- The "bpf" directory contains the code to be compiled into BPF programs.
- "include" contains headers used to build - bpf_endian.h, bpf_ and linux/types.h (the latter is needed for u64, u32 definitions etc.), along with any common headers needed by the user and bpf subdirectories.
- "user" contains the BPF user-space code that loads BPF programs into the kernel and interacts with them, and uses copies of bpf_load.c/bpf_load.h and bpf.c/bpf.h from the kernel tree to do this.

This approach minimizes dependencies by providing local copies of some of the samples and tools .c and .h files; we provide our own because kernel-uek-devel does not deliver these files.
Then we point compilation at the kernel headers from /usr/src/kernels/, and we can pick up the up-to-date BPF headers even if out-of-date headers are installed in /usr/include. Thus we can build BPF programs without having to install up-to-date kernel-uek-headers which, when installed, could cause breakage elsewhere.

Building BPF programs - kernel

The "bpf" subdirectory is where BPF programs are built with LLVM/clang, and to simplify the build process I add local copies of bpf_helpers.h and bpf_endian.h to the include/ directory. Also added here is linux/types.h, a copy of tools/include/linux/types.h. Here is the full bpf/Makefile I use; in this case we are building one object, socket_filter_kernel.o; more can be added to OBJS as needed.

    # SPDX-License-Identifier: GPL-2.0
    #
    # Copyright (c) 2018, Oracle and/or its affiliates. All rights reserved.
    #
    # This program is free software; you can redistribute it and/or modify
    # it under the terms of the GNU General Public License version 2
    # as published by the Free Software Foundation.
    #
    # Build bpf code (kernel) out-of-tree by referencing local copies of
    # bpf .h files along with headers from kernel source tree.
    # Creates similar environment to that used by samples/bpf by adding
    # ../include/[bpf_endian.h,bpf_helpers.h,linux/types.h]. The latter is
    # used to get definitions for u64, u32 etc which are needed by other kernel
    # headers.
    #
    # - ../include/bpf_helpers.h is a copy of tools/testing/selftest/bpf/bpf_helpers.h
    # - ../include/bpf_endian.h is a copy of tools/testing/selftest/bpf/bpf_endian.h
    # - ../include/linux/types.h is a copy of tools/include/linux/types.h
    #
    # Assumptions:
    #
    # - kernel-uek-devel package or equivalent has installed (partial) source
    #   tree in /usr/src/kernels/`uname -r`
    #
    # - llc/clang are available and support "bpf" target; check with "llc --version"
    #
    OBJS = socket_filter_kernel.o

    LLC ?= llc
    CLANG ?= clang
    INC_FLAGS = -nostdinc -isystem `$(CLANG) -print-file-name=include`
    EXTRA_CFLAGS ?= -O2 -emit-llvm

    # In case up-to-date headers are not installed locally in /usr/include,
    # use source build.
    linuxhdrs ?= /usr/src/kernels/`uname -r`

    LINUXINCLUDE = -I$(linuxhdrs)/arch/x86/include/uapi \
                   -I$(linuxhdrs)/arch/x86/include/generated/uapi \
                   -I$(linuxhdrs)/include/generated/uapi \
                   -I$(linuxhdrs)/include/uapi \
                   -I$(linuxhdrs)/include

    prefix ?= /usr/local
    INSTALLPATH = $(prefix)/lib/bpf

    install_PROGRAM = install
    install_DIR = install -dv

    all: $(OBJS)

    .PHONY: clean

    clean:
    	rm -f $(OBJS)

    INC_FLAGS = -nostdinc -isystem `$(CLANG) -print-file-name=include`

    $(OBJS): %.o:%.c
    	$(CLANG) $(INC_FLAGS) \
    		-D__KERNEL__ -D__ASM_SYSREG_H \
    		-Wno-unused-value -Wno-pointer-sign \
    		-Wno-compare-distinct-pointer-types \
    		-Wno-gnu-variable-sized-type-not-at-end \
    		-Wno-address-of-packed-member -Wno-tautological-compare \
    		-Wno-unknown-warning-option \
    		-I../include $(LINUXINCLUDE) \
    		$(EXTRA_CFLAGS) -c $< -o - | $(LLC) -march=bpf -filetype=obj -o $@

    install: $(OBJS)
    	$(install_DIR) -d $(INSTALLPATH) ; \
    	$(install_PROGRAM) $^ -t $(INSTALLPATH)

    uninstall: $(OBJS)
    	rm -rf $(INSTALLPATH)

Building BPF programs - user-space

In the user/ subdirectory I add local copies of bpf.[ch], bpf_load.[ch], bpf_util.h and perf-sys.h. Here is the Makefile:

    # SPDX-License-Identifier: GPL-2.0
    #
    # Copyright (c) 2018, Oracle and/or its affiliates. All rights reserved.
    #
    # This program is free software; you can redistribute it and/or modify
    # it under the terms of the GNU General Public License version 2
    # as published by the Free Software Foundation.
    #
    # Build bpf userspace code out-of-tree by referencing local copies of
    # bpf .c and .h files.
    #
    # - bpf.[ch] are copies of tools/lib/bpf/bpf.[ch]
    # - bpf_load.[ch] are copies of samples/bpf/bpf_load.[ch], with references
    #   to #include the unneeded libbpf.h removed, replaced by references to bpf.h
    # - bpf_util.h is a copy of tools/testing/selftests/bpf/bpf_util.h
    # - perf-sys.h is a copy of tools/perf/perf-sys.h
    #
    COMMONOBJS = bpf.o bpf_load.o

    SOCKETFILTERPROG = socket_filter_user
    SOCKETFILTEROBJ = $(SOCKETFILTERPROG).o

    PROGS= $(SOCKETFILTERPROG)
    OBJS= $(COMMONOBJS) $(SOCKETFILTEROBJ)

    linuxhdrs ?= /usr/src/kernels/`uname -r`

    LINUXINCLUDE = -I$(linuxhdrs)/arch/x86/include/uapi \
                   -I$(linuxhdrs)/arch/x86/include/generated/uapi \
                   -I$(linuxhdrs)/include/generated/uapi \
                   -I$(linuxhdrs)/include/uapi \
                   -I$(linuxhdrs)/include

    prefix ?= /usr/local
    INSTALLPATH = $(prefix)/bin

    install_PROGRAM = install
    install_DIR = install -d

    LDLIBS = -lelf

    all: $(SOCKETFILTERPROG)

    .PHONY: clean

    clean:
    	rm -f $(OBJS) $(PROGS)

    %.o: %.c
    	$(CC) -g -Wno-unused-variable -I../include $(LINUXINCLUDE) -c -o $@ $< $(CFLAGS)

    $(PROGS): $(OBJS)
    	$(CC) -g -o $@ $(@).o $(COMMONOBJS) $(CFLAGS) $(LDLIBS)

    install: $(PROGS)
    	$(install_DIR) -d $(INSTALLPATH) ; \
    	$(install_PROGRAM) $^ -t $(INSTALLPATH)

    uninstall: $(PROGS)
    	cd $(INSTALLPATH); rm -f $^

Example

For an example of a project, see https://github.com/alan-maguire/bpf-test It consists of a set of BPF helper unit tests that I'm hoping to push upstream, but at present it lives out-of-tree in a format such as I've described above.

Summary

So we've seen how to install dependencies and how to set things up to build BPF programs. From here you might want to build some BPF programs of your own, or install BCC and play around with it. Anyway, hopefully some of this has been useful.
Learning more about BPF Thanks for reading this installment of our six part series on BPF. We hope you found it educational and useful. Questions or comments? Use the comments field below! Stay tuned for the next installment in this series, The BPF Bytecode Verifier. Be sure to visit the previous installments of this series on BPF, here, and stay tuned for our next blog posts! 1. BPF program types 2. BPF helper functions for those programs 3. BPF userspace communication 4. BPF program build environment



DTrace a Docker Container

Content for this blog entry was provided by Tomas Jedlicka and Eugene Loh Software containers have standardized how software is shipped, making it simpler to run an application with its expected dependencies on a wide variety of platforms. Here, we illustrate use of DTrace on a host system to observe activity within a Docker container, running on Oracle Linux using runC.  The central idea is that DTrace predicates can limit tracing on the host system to children of the Docker container process. Note that a runC-like container runtime shares a single kernel instance across all containers.  Some other container implementations would not work with this same DTrace script.  An example is Clear Containers, which creates lightweight virtual machines with their own kernel instances. System Setup 1.  Install DTrace:     # yum install dtrace-utils 2. Install and start Docker: Make sure that the ol7_addons repo is enabled, for example by making sure in /etc/yum.repos.d/public-yum-ol7.repo that [ol7_addons] has enabled=1.  Then:     # yum install docker-engine     [...]     Complete!     # systemctl start docker     # systemctl enable docker     Created symlink [...]     # systemctl status docker       docker.service - Docker Application Container Engine     [...]     Hint: Some lines were ellipsized, use -l to show in full. 3. Get an image: In this example, we use the Oracle Container Registry. You might first have to log into the website at https://container-registry.oracle.com to accept license agreements.      # docker login container-registry.oracle.com     Username: myname     Password:     Login Succeeded     # docker pull container-registry.oracle.com/os/oraclelinux     Using default tag: latest     [...]   
Running in a container We run our container, using -it to allocate a pseudo-TTY, creating an interactive bash shell:    # docker run --rm -it container-registry.oracle.com/os/oraclelinux From the host system, we can get the container ID and thereby also the process ID (PID) of the bash shell running in the container.     # docker ps -q     4c80fafde812     # docker ps -q | xargs docker inspect --format "{{.State.Pid}}"     12345     # ps -p 12345 -o pid,ppid,command       PID  PPID COMMAND     12345 12340 /bin/bash     # ps -p 12340 -o pid,cmd       PID  CMD     12340  docker-containerd-shim -namespace moby -workdir /var/lib/docker/cont     # The bash command has pid 12345 and its parent is docker-containerd-shim, whose pid is 12340.   Using DTrace to observe activities in the container While we manually execute commands in the container's bash shell, we can observe those activities from the host system. The key is that DTrace allows predicates to filter tracing, and progenyof() allows us to trace only processes that are progeny (children, grandchildren, etc.) of the pid of interest. Consider the D script syscalls.d:     #!/usr/sbin/dtrace -s     syscall:::                              /* trace system calls */     / progenyof($1) /                       /* only progeny of specified pid */     {                                       /* count occurrences */       @[pid, execname, probefunc] = count();     } We can run this script on the host system, specifying the pid on the command line, to study system calls in the container, aggregating results that are reported when the script is terminated.     # ./syscalls.d 12345 As with many DTrace scripts, which are short though powerful, we could just run a one-liner.     # dtrace -n 'syscall::: /progenyof($1)/ {@[pid,execname,probefunc]=count()}' 12345 Here is the example output:     dtrace: script './syscalls.d' matched 638 probes     ^C         [...]         
        13012  ls       close              26
        13012  ls       read               28
        13012  ls       mprotect           36
        13012  ls       newlstat           38
        13012  ls       open               42
        13012  bash     rt_sigaction       44
        12345  bash     rt_sigaction       48
        13012  ls       mmap               54
        12345  bash     rt_sigprocmask     58
        13012  ls       rt_sigaction       76

We can also look exclusively at read and write system calls.  Since argument 2 to these calls is the size of the I/O operation, we can form histograms of these sizes:

    #!/usr/sbin/dtrace -s
    syscall::read:entry,             /* trace read and write system calls */
    syscall::write:entry
    /progenyof($1) && arg2 > 1024/   /* only progeny of the specified pid */
    {
      @[probefunc] = quantize(arg2);
    }

This script filters not only on progeny of the specified pid, but also on large (> 1 Kbyte) transfers. While this script is running on the host, we execute the following copy in the container:

    [root@4c80fafde812 /]# dd if=/dev/zero of=/dev/null bs=1M count=10
    10+0 records in
    10+0 records out
    10485760 bytes (10 MB) copied, 0.00237154 s, 4.4 GB/s
    [root@4c80fafde812 /]#

Upon completion, we can terminate the script on the host:

    # ./large_rw.d 12345
    dtrace: script './large_rw.d' matched 2 probes
    ^C

      read
               value  ------------- Distribution ------------- count
              524288 |                                         0
             1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
             2097152 |                                         0

      write
               value  ------------- Distribution ------------- count
              524288 |                                         0
             1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
             2097152 |                                         0

DTrace confirms that there were ten 1-Mbyte read and write transfers in the container.

Conclusions

DTrace is a simple-yet-powerful tracing tool.  Using a progenyof() predicate, it can focus exclusively on processes running within a Docker container.  Using D syntax, the tracing can be filtered, and data can be selected, aggregated, and otherwise manipulated.

Docker documentation can be found at https://docs.docker.com
The Oracle Container Registry is at https://container-registry.oracle.com
An earlier post https://blogs.oracle.com/linux/dtrace-on-linux%3a-an-update discussed the port of DTrace to Linux.



BPF In Depth: Communicating with Userspace

Notes on BPF (3) - How BPF communicates with userspace - BPF maps, perf events, bpf_trace_printk

Oracle Linux kernel developer Alan Maguire presents this six-part series on BPF, wherein he presents an in depth look at the kernel's "Berkeley Packet Filter" -- a useful and extensible kernel function for much more than packet filtering.

We've seen how userspace sets up BPF programs, but once a program is attached and running, how do we gather information from it? There are three ways to do this: using BPF maps, perf events, and bpf_trace_printk.

1. BPF maps

When should I use maps? BPF maps are useful for gathering information during BPF programs to share with other running BPF programs, or with userspace programs which can also see the map data.

How can I use it? The set of map types is described in include/uapi/linux/bpf.h. In our UEK5 release - based on Linux 4.14 - the enumerated bpf_map_type looks like this:

    enum bpf_map_type {
        BPF_MAP_TYPE_UNSPEC,
        BPF_MAP_TYPE_HASH,
        BPF_MAP_TYPE_ARRAY,
        BPF_MAP_TYPE_PROG_ARRAY,
        BPF_MAP_TYPE_PERCPU_HASH,
        BPF_MAP_TYPE_PERCPU_ARRAY,
        BPF_MAP_TYPE_STACK_TRACE,
        BPF_MAP_TYPE_CGROUP_ARRAY,
        BPF_MAP_TYPE_LRU_HASH,
        BPF_MAP_TYPE_LRU_PERCPU_HASH,
        BPF_MAP_TYPE_LPM_TRIE,
        BPF_MAP_TYPE_ARRAY_OF_MAPS,
        BPF_MAP_TYPE_HASH_OF_MAPS,
        BPF_MAP_TYPE_DEVMAP,
        BPF_MAP_TYPE_SOCKMAP,
    };

Map actions

We can create/update, delete and lookup map information, both in BPF programs and in user-space. User-space map interactions are done via the BPF syscall. Their function signatures are slightly different to those of their in-kernel BPF program equivalents. In tools/lib/bpf/bpf.c, wrappers for these actions are present:

    int bpf_create_map(enum bpf_map_type map_type, int key_size,
                       int value_size, int max_entries, __u32 map_flags);

Description: Create BPF map of specified type, with key/value size, of max_entries size with map flags specified.
Returns: File descriptor for map on success, negative error on failure.
int bpf_create_map_node(enum bpf_map_type map_type, int key_size, int value_size, int max_entries, __u32 map_flags, int node);

Description: NUMA node-specific creation of a BPF map.
Returns: File descriptor for the map on success, negative error on failure.

int bpf_create_map_in_map(enum bpf_map_type map_type, int key_size, int inner_map_fd, int max_entries, __u32 map_flags);

Description: Create a map of the specified type, passing in the fd of an inner map as representative.
Returns: File descriptor for the map on success, negative error on failure.

int bpf_create_map_in_map_node(enum bpf_map_type map_type, int key_size, int inner_map_fd, int max_entries, __u32 map_flags, int node);

Description: NUMA node-specific creation of a BPF map-in-map.
Returns: File descriptor for the map on success, negative error on failure.

int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags);

Description: Update the element with the specified key to the new value. A few flag values are supported:

BPF_NOEXIST: The entry for key must not exist in the map.
BPF_EXIST: The entry for key must already exist in the map.
BPF_ANY: No condition on the existence of the entry for key.

Flag value BPF_NOEXIST cannot be used for maps of *_ARRAY types (all elements always exist); the helper would return an error.
Returns: 0 on success, negative errno on failure.

int bpf_map_lookup_elem(int fd, const void *key, void *value);

Description: Look up the value associated with the specified key. On success, value will point to the retrieved value, copied if necessary.
Returns: 0 on success, negative errno on failure.

int bpf_map_delete_elem(int fd, const void *key);

Description: Delete the element with the specified key. Delete is not supported for arrays.
Returns: 0 on success, negative errno on failure.

int bpf_map_get_next_key(int fd, const void *key, void *next_key);

Description: On success, next_key will point at the next key after the specified *key.
Returns: 0 on success, negative error on failure or when no more keys are available.
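The bpf_map_get_next_key() wrapper is what makes it possible to iterate over all keys of a map from user space. As a rough illustration of that canonical loop - without a live kernel map - here is a plain-C simulation; the toy map and all names here (toy_get_next_key, sum_all_keys) are hypothetical stand-ins, not part of libbpf:

```c
#include <stddef.h>

/* Toy stand-in for a BPF map's key set, to illustrate the
 * bpf_map_get_next_key() iteration pattern without a live map fd.
 * toy_get_next_key() mimics the wrapper's contract: 0 on success,
 * negative when no more keys remain. */
static const int toy_keys[] = { 10, 20, 30 };
#define NKEYS (sizeof(toy_keys) / sizeof(toy_keys[0]))

static int toy_get_next_key(const int *key, int *next_key)
{
    size_t i;

    if (key == NULL) {           /* no key yet: return the first key */
        *next_key = toy_keys[0];
        return 0;
    }
    for (i = 0; i < NKEYS - 1; i++) {
        if (toy_keys[i] == *key) {
            *next_key = toy_keys[i + 1];
            return 0;
        }
    }
    return -1;                   /* no more keys */
}

/* Canonical iteration: start with no key, walk until the call fails. */
int sum_all_keys(void)
{
    int key, next_key, sum = 0;
    int *cur = NULL;

    while (toy_get_next_key(cur, &next_key) == 0) {
        sum += next_key;
        key = next_key;
        cur = &key;
    }
    return sum;
}
```

Against a real map, the loop is the same shape, with bpf_map_get_next_key(fd, cur, &next_key) in place of the toy function and a bpf_map_lookup_elem() call in the loop body.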
int bpf_map_get_next_id(__u32 start_id, __u32 *next_id);

Description: Get the id of the next map, given a start id.
Returns: 0 on success, negative error on failure or when no more ids are available.

Defining a map in a BPF program

Under samples/bpf, maps are defined in a kernel BPF program in a dedicated section as a "struct bpf_map_def", which bpf_load.h defines as:

struct bpf_map_def {
        unsigned int type;
        unsigned int key_size;
        unsigned int value_size;
        unsigned int max_entries;
        unsigned int map_flags;
        unsigned int inner_map_idx;
        unsigned int numa_node;
};

An example of a definition using this structure is in samples/bpf/lathist_kern.c:

struct bpf_map_def SEC("maps") my_map = {
        .type = BPF_MAP_TYPE_ARRAY,
        .key_size = sizeof(int),
        .value_size = sizeof(u64),
        .max_entries = MAX_CPU,
};

Once bpf_load.c has scanned the ELF headers, it calls bpf_create_map_node() or bpf_create_map_in_map_node(), which are implemented in tools/lib/bpf/bpf.c as wrappers for the BPF_MAP_CREATE command of the SYS_BPF syscall. Unless you are writing tc or lightweight tunnel BPF programs - which, since they implement BPF program loading themselves, have their own map loading mechanisms - I'd recommend re-using this code rather than re-inventing the wheel. We can see it's generally a case of defining a map type, key/value sizes and a maximum number of entries. Programs which use "tc"/"ip route" for loading can utilize a data structure like this (from tc_l2_redirect_kern.c):

#define PIN_GLOBAL_NS 2

struct bpf_elf_map {
        __u32 type;
        __u32 size_key;
        __u32 size_value;
        __u32 max_elem;
        __u32 flags;
        __u32 id;
        __u32 pinning;
};

struct bpf_elf_map SEC("maps") tun_iface = {
        .type = BPF_MAP_TYPE_ARRAY,
        .size_key = sizeof(int),
        .size_value = sizeof(int),
        .pinning = PIN_GLOBAL_NS,
        .max_elem = 1,
};

The bpf_elf_map data structure mirrors the one defined in https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/tree/include/bpf_elf.h?h=v4.14.1.
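The update flags (BPF_NOEXIST, BPF_EXIST, BPF_ANY) described earlier have simple contract semantics that are worth internalizing. Here is a minimal plain-C model of that contract; the toy fixed-slot "map" and all TOY_* names are hypothetical, not the kernel implementation:

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model of the bpf_map_update_elem() flag semantics.
 * TOY_NOEXIST/TOY_EXIST/TOY_ANY mirror BPF_NOEXIST/BPF_EXIST/BPF_ANY. */
enum { TOY_ANY = 0, TOY_NOEXIST = 1, TOY_EXIST = 2 };

#define TOY_SLOTS 4
static bool occupied[TOY_SLOTS];
static long stored[TOY_SLOTS];

int toy_update_elem(unsigned int key, long value, int flags)
{
    if (key >= TOY_SLOTS)
        return -E2BIG;
    if (flags == TOY_NOEXIST && occupied[key])
        return -EEXIST;   /* entry must not already exist */
    if (flags == TOY_EXIST && !occupied[key])
        return -ENOENT;   /* entry must already exist */
    occupied[key] = true; /* create or overwrite */
    stored[key] = value;
    return 0;
}
```

Note that this toy behaves like a hash map, where entries come into existence on first update; for array map types every element always exists, which is exactly why BPF_NOEXIST always fails for them, as the text above points out.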
Map pinning

In that file, we can see that there are a few options for pinning a map:

/* Object pinning settings */
#define PIN_NONE        0
#define PIN_OBJECT_NS   1
#define PIN_GLOBAL_NS   2

Pinning options determine how the map's file descriptor is exported via the filesystem. Outside of tc etc., we can pin a map fd to a file via libbpf's bpf_obj_pin(fd, path); other programs can then retrieve the fd via bpf_obj_get(). The PIN_* options for iproute determine that path - for example, maps which specify PIN_GLOBAL_NS are found in /sys/fs/bpf/tc/globals/, so to retrieve the map fd one simply runs

mapfd = bpf_obj_get(pinned_file);

...where "pinned_file" is the filename. From looking at the iproute code, it appears a custom pinning path can also be used (by specifying a value > PIN_GLOBAL_NS).

Map operation definitions

Examining include/linux/bpf_types.h, we see that the various map types have associated sets of operations; for example:

BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)

etc. The functions in the various ops variables define how the map allocates, frees, looks up data and much more. For example, as you might imagine, the key for the lookup function for a BPF_MAP_TYPE_ARRAY is simply an index into the array. We see in kernel/bpf/arraymap.c:

/* Called from syscall or from eBPF program */
static void *array_map_lookup_elem(struct bpf_map *map, void *key)
{
        struct bpf_array *array = container_of(map, struct bpf_array, map);
        u32 index = *(u32 *)key;

        if (unlikely(index >= array->map.max_entries))
                return NULL;

        return array->value + array->elem_size * (index & array->index_mask);
}

Array Maps

Array maps are implemented in kernel/bpf/arraymap.c. All arrays restrict key size to 4 bytes (32 bits), and deletion of values is not supported.

BPF_MAP_TYPE_ARRAY: Simple array. Key is the array index, and elements cannot be deleted.
BPF_MAP_TYPE_PERCPU_ARRAY: As above, but kernel programs implicitly write to a per-CPU allocated array, which minimizes lock contention in BPF program context. When bpf_map_lookup_elem() is called from user space, it retrieves NR_CPUS values. For example, if we are summing a stat across CPUs, we would do something like this:

long values[nr_cpus];
...
ret = bpf_map_lookup_elem(map_fd, &next_key, values);
if (ret) {
        perror("Error looking up stat");
        continue;
}
for (i = 0; i < nr_cpus; i++) {
        sum += values[i];
}

Use of a per-CPU data structure is to be preferred in codepaths which are frequently executed, since we will likely be aggregating the results across CPUs in user space much less frequently than writing updates.

BPF_MAP_TYPE_PROG_ARRAY: An array of BPF programs used as a jump table by bpf_tail_call(). See samples/bpf/sockex3_kern.c for an example.

BPF_MAP_TYPE_PERF_EVENT_ARRAY: Array map which is used by the kernel in bpf_perf_event_output() to associate tracing output with a specific key. User-space programs associate fds with each key, and can poll() those fds to receive notification that data has been traced. See the "Perf Events" section below for more details.

BPF_MAP_TYPE_CGROUP_ARRAY: Array map used to store cgroup fds in user space, for later use in BPF programs which call bpf_skb_under_cgroup() to check if the skb is associated with the cgroup in the cgroup array at the specified index.

BPF_MAP_TYPE_ARRAY_OF_MAPS: Allows a map-in-map definition where the values are the fds of the inner maps. Only two levels of map are supported, i.e. a map containing maps, not a map containing maps containing maps. BPF_MAP_TYPE_PROG_ARRAY does not support map-in-map functionality, as it would make tail call verification harder. See https://www.mail-archive.com/netdev@vger.kernel.org/msg159387.html for more.

Hash Maps

Hash maps are implemented in kernel/bpf/hashtab.c. Hash keys do not appear to be limited in size, but must be > 0 for obvious reasons.
Hash lookup matches the key to the appropriate value via a hashing function, rather than an indexed lookup. Unlike the array case, values can be deleted from a hash map. Hash maps are ideal when using a value such as an IP address for storage/retrieval.

BPF_MAP_TYPE_HASH: Simple hash map. Continually adding new elements can fail with E2BIG - if this is likely to be an issue, an LRU (least recently used) hash is recommended, as it will recycle old entries out of buckets.

BPF_MAP_TYPE_PERCPU_HASH: Same as above, but kernel programs implicitly write to the CPU-specific hash. Retrieval works as described above.

BPF_MAP_TYPE_LRU_HASH: Each hash maintains an LRU (least recently used) list for each bucket, to inform delete when the hash bucket fills up.

BPF_MAP_TYPE_HASH_OF_MAPS: Similar to ARRAY_OF_MAPS, but for hash maps. See https://www.mail-archive.com/netdev@vger.kernel.org/msg159383.html for more.

Other

BPF_MAP_TYPE_STACK_TRACE: Defined in kernel/bpf/stackmap.c. Kernel programs can store stacks via the bpf_get_stackid() helper. The idea is we store stacks based on an identifier, which appears to correspond to a 32-bit hash of the instruction pointer addresses that comprise the stack for the current context. The common use case is to get the stack id in the kernel and use it as a key to update another map. So, for example, we could profile specific stack traces by counting their occurrence, or associate a specific stack trace with the current pid as key. See samples/bpf/offwaketime_kern.c for an example of the latter. In user space we can look up the symbols associated with the stackmap to unwind the stack (see samples/bpf/offwaketime_user.c).

BPF_MAP_TYPE_LPM_TRIE: Map supporting efficient longest-prefix matching. Useful for storage/retrieval of IP routes, for example.

BPF_MAP_TYPE_SOCKMAP: Sockmaps are used primarily for socket redirection, where sockets are added to a socket map and referenced by a key which dictates redirection when bpf_sk_redirect_map() is called.
BPF_MAP_TYPE_DEVMAP: Does a similar job to sockmap, but with net devices, for XDP and bpf_redirect().

2. Perf Events

As well as using maps, perf events can be used to gather information from BPF in user space. Perf events allow BPF programs to store data in mmap()ed shared memory accessible by user-space programs.

When should I use perf events?

If you are gathering kernel data that is not amenable to map storage (such as variable-length chunks of memory) and does not need to be shared with other BPF programs.

How can I use it?

To see an example of how to set this up on the user-space side, see samples/bpf/trace_output_user.c and samples/bpf/trace_output_kern.c.

User-space

First we may need to raise the rlimit (resource limit) on how much memory we can lock in RAM (RLIMIT_MEMLOCK) - we need to lock memory for maps. See setrlimit(2)/getrlimit(2).

Create a map of type BPF_MAP_TYPE_PERF_EVENT_ARRAY. It can be keyed by CPU, in which case the associated value for each key will be the fd of the perf event opened for that CPU.

For each CPU, run perf_event_open() with a perf event with attributes of type PERF_TYPE_SOFTWARE, config PERF_COUNT_SW_BPF_OUTPUT, sample_type PERF_SAMPLE_RAW.

Update the BPF_MAP_TYPE_PERF_EVENT_ARRAY for the current CPU with the fd retrieved from perf_event_open(). See test_bpf_perf_event().

Run the PERF_EVENT_IOC_ENABLE ioctl() on the perf event fd.

mmap() read/write shared memory for the perf event fd. See perf_event_mmap(). This will store a struct perf_event_mmap_page * containing the data.

Add the perf event fd to the set of fds used in poll(), so we can poll on events from the set of fds for each CPU to catch events.

Now we are ready to run poll() and handle enqueued events (see perf_event_read()).

Kernel

The program needs to define a BPF_MAP_TYPE_PERF_EVENT_ARRAY to share with userspace. The program should run bpf_perf_event_output(ctx, &map, index, &data, sizeof(data)).
The index is the key of the BPF_MAP_TYPE_PERF_EVENT_ARRAY map, so if we're keying per-CPU it should be a CPU id. As we saw previously, bpf_perf_event_output() is supported for tc, XDP, lightweight tunnel, and kprobe, tracepoint and perf events program types. The context passed in is the relevant context for each of those program types.

3. bpf_trace_printk

When should I use it?

This option is more for debugging, and should not be used in production BPF code. All BPF program types support bpf_trace_printk(), and it is useful for debugging.

How can I use it?

Simply add a bpf_trace_printk() to your program. Messages can be retrieved via:

# cat /sys/kernel/debug/tracing/trace_pipe

One gotcha here: you need to pre-define the format string, otherwise the BPF verifier will complain. I usually use the following approach: define a general error message format, and have it add specifics with a particular string. For example:

char errmsg[] = "egress: got unexpected error (%s) %x\n";
char store_fail[] = "could not store ipv6 hdr";

bpf_trace_printk(errmsg, sizeof(errmsg), store_fail, ret);

One approach to consider is to have a config option BPF map shared between your program and user space, and if the config debug option is set, emit bpf_trace_printk()s.

Summary

We've seen that BPF maps are a good fit for communicating between different BPF programs and user space. More customized data handling requires using perf events, and for debug logging bpf_trace_printk() is really useful.

Learning more about BPF

Thanks for reading this installment of our six-part series on BPF. We hope you found it educational and useful. Questions or comments? Use the comments field below! Stay tuned for the next installment in this series, Building BPF Programs.

Previously:
BPF program types
BPF helper functions for those programs
BPF userspace communication


Linux Kernel Development

Dealing with Realtime Processes in Linux User Namespaces

Linux kernel developer Prakash Sangappa works closely with the Oracle Database team to ensure that the database runs best on Oracle Linux. As the Oracle Database team brings new capabilities to a release, Prakash ensures that any necessary support is in Oracle Linux. It is always exciting when Prakash and the team deliver new features to the Linux operating system. In this blog post, Prakash talks about the challenges of trying to run a process with realtime priority in a user namespace.

Realtime (RT) Processes Inside a User Namespace

User namespaces provide user id and group id isolation. With the use of user namespaces, an unprivileged user can be mapped to the root user (uid 0) inside a user namespace. That unprivileged user gains full privileges and capabilities to perform operations within the user namespace. This includes the ability to create other namespaces, which is useful. Oracle Multitenant, the architecture for the next-generation database cloud, will be using namespaces to create and isolate database instances on a system. Though uid 0 in a user namespace gets all capabilities, some of the capabilities are ineffective (e.g. CAP_SYS_NICE, CAP_IPC_LOCK, CAP_SYS_TIME), as they would allow modifying global resources - setting RT priority, locking memory and setting system time, respectively. This restriction is problematic for Oracle Multitenant, especially the capability CAP_SYS_NICE, which is required to set RT priority on some of its critical processes. Below is a brief description of the architecture and the use case.

Oracle Multitenant Architecture

Oracle Multitenant helps simplify consolidation, provisioning, management and more. This new architecture allows a container database (CDB) to hold zero or more customer databases, called pluggable databases (PDBs). It helps to manage many databases as one. An existing database can be adopted without change as a pluggable database. You can find more information about Oracle Multitenant here.
For security and isolation, Oracle Multitenant will use Linux namespaces, including user namespaces, to sandbox PDBs, which are nested inside the CDB. Namespaces will also be used to isolate many CDBs on the system. Within a CDB, there are critical processes, like the log writer, that have to run at a higher priority: the log writer needs to be scheduled on a CPU as soon as it is ready to run. For this reason, these critical processes are assigned RT priority. However, with the use of user namespaces, setting RT priority from within the user namespace is not possible. One way to handle this limitation would be to use a helper process running as root in the init namespace. This process could set RT priority for the critical processes within the user namespace on request. However, this is not convenient.

Possible Approaches

As RT priority is not a resource that can be namespaced by introducing a new namespace type, the following approaches could address the requirement. The main concern would be runaway processes running with RT priority that render the system unresponsive.

1. Allow the root user (uid 0) from the init namespace, when mapped inside a user namespace, to set RT priority.
2. If a user namespace were to be tagged or marked in some way, permit the CAP_SYS_NICE capability to take effect, allowing RT priority to be set.
3. With the use of cgroups bandwidth control, allow the root user (uid 0) inside a user namespace to set RT priority.
4. Add a scheduler option to run processes at a fixed high priority above all user priorities, like a new scheduling class.

This topic was presented at Linux Plumbers Conference 2018. From the discussion that ensued after the presentation, opinion seems to be leaning towards a solution based on cgroups bandwidth control to allow setting RT priority inside a user namespace. We plan to further explore this approach.
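The limitation described above is easy to observe from user space: inside a user namespace, a sched_setscheduler() call requesting SCHED_FIFO typically fails with EPERM even for the namespace's "root". The short probe below is an illustrative sketch (the function name is hypothetical); the outcome depends on privileges, RLIMIT_RTPRIO, and namespace configuration:

```c
#include <errno.h>
#include <sched.h>
#include <string.h>

/* Try to give the calling process realtime (SCHED_FIFO) priority.
 * Returns 0 if the kernel allowed it (and restores SCHED_OTHER),
 * or -errno on failure - commonly -EPERM when CAP_SYS_NICE is
 * ineffective, as inside a user namespace. */
int try_set_rt_priority(int prio)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = prio;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == 0) {
        /* success: restore the default policy so we don't stay RT */
        sp.sched_priority = 0;
        sched_setscheduler(0, SCHED_OTHER, &sp);
        return 0;
    }
    return -errno;
}
```

Running this as uid 0 in the init namespace should return 0; running it as a mapped root inside a user namespace is where the -EPERM described in this post shows up.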


Events

Join us at Oracle OpenWorld Europe

2019 is off to a great start with Oracle OpenWorld going global. The first stop is London, January 16-17. Join us for insights from leading experts, informative sessions and demos, and opportunities to connect with your peers. Discover how to stay competitive with today’s transformational technology. Highlights to include on your schedule:

Featured Speakers

Oracle OpenWorld Europe’s featured speakers are the innovators, disruptors and thought leaders of tomorrow: a former number two at the United Nations, a digital anthropologist, a pioneer in the mobile and data analytics industries, authors, futurists and many more.

Oracle Linux and Virtualization Sessions

Wednesday, January 16

9:45 a.m.–10:20 a.m. | Arena 4 (Level 3) - ExCeL London
[SES1793] Build a Cloud Native Environment with Oracle Linux
Speaker: Karen Sigman, Vice President, Product and Partner Marketing, Product Channel Marketing

Oracle Linux offers an open, integrated operating environment with application development tools, management tools, containers, and orchestration capabilities, which enable DevOps teams to efficiently build reliable, secure cloud native applications. In this session, learn how Oracle Linux can help you enhance productivity.

10:30 a.m.–11:15 a.m. | Arena 8 (Level 3) - ExCeL London
[SES1792] How Oracle Linux Cloud Native Environment and VirtualBox Can Make a Developer's Life Easier
Speaker: Simon Coter, Product Management Director, Oracle Linux and Virtualization, Oracle

Tried, tested, and tuned for enterprise workloads, Oracle Linux is used by developers worldwide. Oracle Linux yum server and Oracle Container Registry provide easy access to Linux developer preview software, including the latest Oracle Linux Cloud Native Environment software. Thousands of EPEL packages have also been built and signed by Oracle for security and compliance. Software Collections include recent versions of Python, PHP, Node.js, nginx, and more.
In addition, Oracle Cloud developer tools such as Terraform, SDKs, and CLI are available for an improved experience. Use Oracle VM VirtualBox to run Oracle Linux with cloud native software on your desktop and easily deploy to the cloud. Come to this session to learn more about speeding up your development and your move to the cloud.

The Exchange: a Showcase for Attendees to Connect, Discover and Learn

Oracle Linux and Oracle Virtualization experts will be at The Exchange to answer your questions, update you on the latest product enhancements, and demo the latest software releases. Let us know about your experience -- #OOWLON #OracleLinux @OracleLinux

Enjoy the conference!


Linux Kernel Development

BPF In Depth: BPF Helper Functions

Notes on BPF (2) - BPF helper functions

Oracle Linux kernel developer Alan Maguire presents this six-part series on BPF, wherein he presents an in-depth look at the kernel's "Berkeley Packet Filter" -- a useful and extensible kernel facility for much more than packet filtering.

Now that we have a list of program types, what can we do within programs we attach? A good place to start with writing BPF programs is to see what helper functions the various BPF program types have available to them. To see some of this, check out https://github.com/oracle/linux-uek/blob/uek5/master/net/core/filter.c. It contains a set of data structures used by the BPF verifier - struct bpf_verifier_ops. Here's an example for sk_filter programs:

const struct bpf_verifier_ops sk_filter_prog_ops = {
        .get_func_proto         = sk_filter_func_proto,
        .is_valid_access        = sk_filter_is_valid_access,
        .convert_ctx_access     = bpf_convert_ctx_access,
};

"get_func_proto" defines the set of functions supported by the program. The "is_valid_access" function checks if read/write access for the given memory offset is valid. The "convert_ctx_access" function converts accesses to BPF-specific structures (e.g. struct __sk_buff) into real accesses to the "struct sk_buff". This is all so the verifier can ensure your BPF program is calling valid functions and accessing valid data for the given instrumentation point.

Back to the function prototypes. Firstly, there is a base set of functions available, the prototypes of which are returned by bpf_base_func_proto(). The following descriptions come from Quentin Monnet's more recent bpf-next changes, which document the helpers in include/uapi/linux/bpf.h - https://lwn.net/Articles/751527/ - here we're simply organizing them by the program type(s) that can use them.

void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)

Description: Perform a lookup in map for an entry associated to key.
Return: Map value associated to key, or NULL if no entry was found.
int bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)

Description: Add or update the value of the entry associated to key in map with value. flags is one of:

BPF_NOEXIST: The entry for key must not exist in the map.
BPF_EXIST: The entry for key must already exist in the map.
BPF_ANY: No condition on the existence of the entry for key.

Flag value BPF_NOEXIST cannot be used for maps of types BPF_MAP_TYPE_ARRAY or BPF_MAP_TYPE_PERCPU_ARRAY (all elements always exist); the helper would return an error.
Return: 0 on success, or a negative error in case of failure.

int bpf_map_delete_elem(struct bpf_map *map, const void *key)

Description: Delete entry with key from map.
Return: 0 on success, or a negative error in case of failure.

u32 bpf_get_prandom_u32(void)

Description: Get a pseudo-random number. From a security point of view, this helper uses its own pseudo-random internal state, and cannot be used to infer the seed of other random functions in the kernel. However, it is essential to note that the generator used by the helper is not cryptographically secure.
Return: A random 32-bit unsigned value.

u32 bpf_get_smp_processor_id(void)

Description: Get the SMP (symmetric multiprocessing) processor id. Note that all programs run with preemption disabled, which means that the SMP processor id is stable during all the execution of the program.
Return: The SMP id of the processor running the program.

int bpf_get_numa_node_id(void)

Description: Return the id of the current NUMA node.
Return: The id of the current NUMA node.

int bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)

Description: This special helper is used to trigger a "tail call", or in other words, to jump into another eBPF program. The same stack frame is used (but values on stack and in registers for the caller are not accessible to the callee).
This mechanism allows for program chaining, either for raising the maximum number of available eBPF instructions, or to execute given programs in conditional blocks. For security reasons, there is an upper limit to the number of successive tail calls that can be performed. Upon call of this helper, the program attempts to jump into a program referenced at index in prog_array_map, a special map of type BPF_MAP_TYPE_PROG_ARRAY, and passes ctx, a pointer to the context. If the call succeeds, the kernel immediately runs the first instruction of the new program. This is not a function call, and it never returns to the previous program. If the call fails, then the helper has no effect, and the caller continues to run its subsequent instructions. A call can fail if the destination program for the jump does not exist (i.e. index is higher than the number of entries in prog_array_map), or if the maximum number of tail calls has been reached for this chain of programs. This limit is defined in the kernel by the macro MAX_TAIL_CALL_CNT (not accessible to user space), which is currently set to 32.
Return: 0 on success, or a negative error in case of failure.

u64 bpf_ktime_get_ns(void)

Description: Return the time elapsed since system boot, in nanoseconds.
Return: Current *ktime*.

int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)

Description: This helper is a "printk()-like" facility for debugging. It prints a message defined by format fmt (of size fmt_size) to the file /sys/kernel/debug/tracing/trace from DebugFS, if available. It can take up to three additional u64 arguments (as in eBPF helpers, the total number of arguments is limited to five). Each time the helper is called, it appends a line to the trace. The format of the trace is customizable, and the exact output one will get depends on the options set in /sys/kernel/debug/tracing/trace_options (see also the README file under the same directory). However, it usually defaults to something like:

telnet-470   [001] .N.. 419421.045894: 0x00000001: <formatted msg>

In the above:

* ``telnet`` is the name of the current task.
* ``470`` is the PID of the current task.
* ``001`` is the CPU number on which the task is running.
* In ``.N..``, each character refers to a set of options (whether irqs are enabled, scheduling options, whether hard/softirqs are running, level of preempt_disabled respectively). N means that TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set.
* ``419421.045894`` is a timestamp.
* ``0x00000001`` is a fake value used by BPF for the instruction pointer register.
* ``<formatted msg>`` is the message formatted with fmt.

The conversion specifiers supported by fmt are similar to, but more limited than, those for printk(). They are %d, %i, %u, %x, %ld, %li, %lu, %lx, %lld, %lli, %llu, %llx, %p, %s. No modifier (size of field, padding with zeroes, etc.) is available, and the helper will return -EINVAL (but print nothing) if it encounters an unknown specifier. Also, note that bpf_trace_printk() is slow, and should only be used for debugging purposes. For this reason, a notice block (spanning several lines) is printed to kernel logs and states that the helper should not be used "for production use" the first time this helper is used (or more precisely when the trace_printk() buffers are allocated). For passing values to user space, perf events should be preferred.
Return: The number of bytes written to the buffer, or a negative error in case of failure.

Additionally, for each class of instrumentation target we see a _func_proto() function which enumerates the additional functions available, along with the base set. We will describe these functions, grouped by the program types that support them.

1. Socket-related program functions

Socket-related BPF programs support the generic set of operations above, and a set of program-specific functions.
1.1 sk_filter programs

int bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len)

Description: This helper was provided as an easy way to load data from a packet. It can be used to load len bytes from offset from the packet associated to skb, into the buffer pointed to by to. Since Linux 4.7, usage of this helper has mostly been replaced by "direct packet access", enabling packet data to be manipulated with skb->data and skb->data_end pointing respectively to the first byte of packet data and to the byte after the last byte of packet data. However, it remains useful if one wishes to read large quantities of data at once from a packet into the eBPF stack.
Return: 0 on success, or a negative error in case of failure.

u64 bpf_get_socket_cookie(struct sk_buff *skb)

Description: If the struct sk_buff * pointed to by skb has a known socket, retrieve the cookie (generated by the kernel) of this socket. If no cookie has been set yet, generate a new cookie. Once generated, the socket cookie remains stable for the life of the socket. This helper can be useful for monitoring per-socket networking traffic statistics, as it provides a unique socket identifier per namespace.
Return: An 8-byte long non-decreasing number on success, or 0 if the socket field is missing inside skb.

u32 bpf_get_socket_uid(struct sk_buff *skb)

Return: The owner UID of the socket associated to skb. If the socket is NULL, or if it is not a full socket (i.e. if it is a time-wait or a request socket instead), the overflowuid value is returned (note that overflowuid might also be the actual UID value for the socket).

1.2 sock_ops programs

int bpf_setsockopt(struct bpf_sock_ops *bpf_socket, int level, int optname, char *optval, int optlen)

Description: Emulate a call to setsockopt() on the socket associated to bpf_socket, which must be a full socket. The level at which the option resides and the name optname of the option must be specified; see setsockopt(2) for more information.
The option value of length optlen is pointed to by optval. This helper actually implements a subset of setsockopt(). It supports the following levels:

* SOL_SOCKET, which supports the following optnames: SO_RCVBUF, SO_SNDBUF, SO_MAX_PACING_RATE, SO_PRIORITY, SO_RCVLOWAT, SO_MARK.
* IPPROTO_TCP, which supports the following optnames: TCP_CONGESTION, TCP_BPF_IW, TCP_BPF_SNDCWND_CLAMP.
* IPPROTO_IP, which supports optname IP_TOS.
* IPPROTO_IPV6, which supports optname IPV6_TCLASS.

Return: 0 on success, or a negative error in case of failure.

int bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)

Description: Add an entry to, or update, a map referencing sockets. The skops is used as a new value for the entry associated to key. flags is one of:

BPF_NOEXIST: The entry for key must not exist in the map.
BPF_EXIST: The entry for key must already exist in the map.
BPF_ANY: No condition on the existence of the entry for key.

If the map has eBPF programs (parser and verdict), those will be inherited by the socket being added. If the socket is already attached to eBPF programs, this results in an error.
Return: 0 on success, or a negative error in case of failure.

1.3 sk_skb programs

In addition to the base set, the following are supported:

int bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags)

Description: Store len bytes from address from into the packet associated to skb, at offset. flags are a combination of BPF_F_RECOMPUTE_CSUM (automatically recompute the checksum for the packet after storing the bytes) and BPF_F_INVALIDATE_HASH (set skb->hash, skb->swhash and skb->l4hash to 0). A call to this helper is susceptible to change the underlying packet buffer. Therefore, at load time, all checks on pointers previously done by the verifier are invalidated and must be performed again, if the helper is used in combination with direct packet access.
Return: 0 on success, or a negative error in case of failure.
int bpf_skb_pull_data(struct sk_buff *skb, u32 len)
Description Pull in non-linear data in case the skb is non-linear and not all of len are part of the linear section. Make len bytes from skb readable and writable. If a zero value is passed for len, then the whole length of the skb is pulled. This helper is only needed for reading and writing with direct packet access. For direct packet access, testing that offsets to access are within packet boundaries (test on skb->data_end) is susceptible to fail if offsets are invalid, or if the requested data is in non-linear parts of the skb. On failure the program can just bail out, or in the case of a non-linear buffer, use a helper to make the data available. The bpf_skb_load_bytes() helper is a first solution to access the data. Another one consists in using bpf_skb_pull_data() to pull in once the non-linear parts, then retesting and eventually accessing the data. At the same time, this also makes sure the skb is uncloned, which is a necessary condition for direct write. As this needs to be an invariant for the write part only, the verifier detects writes and adds a prologue that calls bpf_skb_pull_data() to effectively unclone the skb from the very beginning in case it is indeed cloned. A call to this helper is susceptible to change the underlying packet buffer. Therefore, at load time, all checks on pointers previously done by the verifier are invalidated and must be performed again, if the helper is used in combination with direct packet access. Return 0 on success, or a negative error in case of failure.

int bpf_skb_change_tail(struct sk_buff *skb, u32 len, u64 flags)
Description Resize (trim or grow) the packet associated to skb to the new len. The flags are reserved for future usage, and must be left at zero. The basic idea is that the helper performs the needed work to change the size of the packet, then the eBPF program rewrites the rest via helpers like bpf_skb_store_bytes(), bpf_l3_csum_replace(), and others.
This helper is a slow path utility intended for replies with control messages. And because it is targeted for slow path, the helper itself can afford to be slow: it implicitly linearizes, unclones and drops offloads from the skb. A call to this helper is susceptible to change the underlying packet buffer. Therefore, at load time, all checks on pointers previously done by the verifier are invalidated and must be performed again, if the helper is used in combination with direct packet access. Return 0 on success, or a negative error in case of failure.

int bpf_skb_change_head(struct sk_buff *skb, u32 len, u64 flags)
Description Grows headroom of packet associated to skb and adjusts the offset of the MAC header accordingly, adding len bytes of space. It automatically extends and reallocates memory as required. This helper can be used on a layer 3 skb to push a MAC header for redirection into a layer 2 device. All values for flags are reserved for future usage, and must be left at zero. A call to this helper is susceptible to change the underlying packet buffer. Therefore, at load time, all checks on pointers previously done by the verifier are invalidated and must be performed again, if the helper is used in combination with direct packet access. Return 0 on success, or a negative error in case of failure.

int bpf_sk_redirect_map(struct bpf_map *map, u32 key, u64 flags)
Description Redirect the packet to the socket referenced by map (of type BPF_MAP_TYPE_SOCKMAP) at index key. Both ingress and egress interfaces can be used for redirection. The BPF_F_INGRESS value in flags is used to make the distinction (ingress path is selected if the flag is present, egress path otherwise). This is the only flag supported for now. Return SK_PASS on success, or SK_DROP on error.

bpf_skb_load_bytes, bpf_get_socket_cookie, bpf_get_socket_uid are also supported. See above for descriptions of these.

2.
tc (traffic control) subsystem program functions

In addition to the base function set, the following are supported:

s64 bpf_csum_diff(__be32 *from, u32 from_size, __be32 *to, u32 to_size, __wsum seed)
Description Compute a checksum difference, from the raw buffer pointed by from, of length from_size (that must be a multiple of 4), towards the raw buffer pointed by to, of size to_size (same remark). An optional seed can be added to the value (this can be cascaded, the seed may come from a previous call to the helper). This is flexible enough to be used in several ways: * With from_size == 0, to_size > 0 and seed set to checksum, it can be used when pushing new data. * With from_size > 0, to_size == 0 and seed set to checksum, it can be used when removing data from a packet. * With from_size > 0, to_size > 0 and seed set to 0, it can be used to compute a diff. Note that from_size and to_size do not need to be equal. This helper can be used in combination with bpf_l3_csum_replace() and bpf_l4_csum_replace(), to which one can feed in the difference computed with bpf_csum_diff(). Return The checksum result, or a negative error code in case of failure.

s64 bpf_csum_update(struct sk_buff *skb, __wsum csum)
Description Add the checksum csum into skb->csum in case the driver has supplied a checksum for the entire packet into that field. Return an error otherwise. This helper is intended to be used in combination with bpf_csum_diff(), in particular when the checksum needs to be updated after data has been written into the packet through direct packet access. Return The checksum on success, or a negative error code in case of failure.

int bpf_l3_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 size)
Description Recompute the layer 3 (e.g. IP) checksum for the packet associated to skb.
Computation is incremental, so the helper must know the former value of the header field that was modified (from), the new value of this field (to), and the number of bytes (2 or 4) for this field, stored in size. Alternatively, it is possible to store the difference between the previous and the new values of the header field in to, by setting from and size to 0. For both methods, offset indicates the location of the IP checksum within the packet. This helper works in combination with bpf_csum_diff(), which does not update the checksum in-place, but offers more flexibility and can handle sizes larger than 2 or 4 for the checksum to update. A call to this helper is susceptible to change the underlying packet buffer. Therefore, at load time, all checks on pointers previously done by the verifier are invalidated and must be performed again, if the helper is used in combination with direct packet access. Return 0 on success, or a negative error in case of failure.

int bpf_l4_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 flags)
Description Recompute the layer 4 (e.g. TCP, UDP or ICMP) checksum for the packet associated to skb. Computation is incremental, so the helper must know the former value of the header field that was modified (from), the new value of this field (to), and the number of bytes (2 or 4) for this field, stored on the lowest four bits of flags. Alternatively, it is possible to store the difference between the previous and the new values of the header field in to, by setting from and the four lowest bits of flags to 0. For both methods, offset indicates the location of the checksum within the packet. In addition to the size of the field, actual flags can be added to flags (bitwise OR). With BPF_F_MARK_MANGLED_0, a null checksum is left untouched (unless BPF_F_MARK_ENFORCE is added as well), and for updates resulting in a null checksum the value is set to CSUM_MANGLED_0 instead.
Flag BPF_F_PSEUDO_HDR indicates the checksum is to be computed against a pseudo-header. This helper works in combination with bpf_csum_diff(), which does not update the checksum in-place, but offers more flexibility and can handle sizes larger than 2 or 4 for the checksum to update. A call to this helper is susceptible to change the underlying packet buffer. Therefore, at load time, all checks on pointers previously done by the verifier are invalidated and must be performed again, if the helper is used in combination with direct packet access. Return 0 on success, or a negative error in case of failure.

int bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags)
Description Clone and redirect the packet associated to skb to another net device of index ifindex. Both ingress and egress interfaces can be used for redirection. The BPF_F_INGRESS value in flags is used to make the distinction (ingress path is selected if the flag is present, egress path otherwise). This is the only flag supported for now. In comparison with the bpf_redirect() helper, bpf_clone_redirect() has the associated cost of duplicating the packet buffer, but this can be executed out of the eBPF program. Conversely, bpf_redirect() is more efficient, but it is handled through an action code where the redirection happens only after the eBPF program has returned. A call to this helper is susceptible to change the underlying packet buffer. Therefore, at load time, all checks on pointers previously done by the verifier are invalidated and must be performed again, if the helper is used in combination with direct packet access. Return 0 on success, or a negative error in case of failure.

int bpf_redirect(u32 ifindex, u64 flags)
Description Redirect the packet to another net device of index ifindex. This helper is somewhat similar to bpf_clone_redirect(), except that the packet is not cloned, which provides increased performance.
Except for XDP, both ingress and egress interfaces can be used for redirection. The BPF_F_INGRESS value in flags is used to make the distinction (ingress path is selected if the flag is present, egress path otherwise). Currently, XDP only supports redirection to the egress interface, and accepts no flag at all. The same effect can be attained with the more generic bpf_redirect_map(), which requires specific maps to be used but offers better performance. Return For XDP, the helper returns XDP_REDIRECT on success or XDP_ABORTED on error. For other program types, the values are TC_ACT_REDIRECT on success or TC_ACT_SHOT on error.

u32 bpf_get_cgroup_classid(struct sk_buff *skb)
Description Retrieve the classid for the current task, i.e. for the net_cls cgroup to which skb belongs. This helper can be used on the TC egress path, but not on ingress. The net_cls cgroup provides an interface to tag network packets based on a user-provided identifier for all traffic coming from the tasks belonging to the related cgroup. See also the related kernel documentation, available from the Linux sources in file Documentation/cgroup-v1/net_cls.txt. The Linux kernel has two versions for cgroups: there are cgroups v1 and cgroups v2. Both are available to users, who can use a mixture of them, but note that the net_cls cgroup is for cgroup v1 only. This makes it incompatible with BPF programs run on cgroups, which is a cgroup-v2-only feature (a socket can only hold data for one version of cgroups at a time). This helper is only available if the kernel was compiled with the CONFIG_CGROUP_NET_CLASSID configuration option set to "y" or to "m". Return The classid, or 0 for the default unconfigured classid.

int bpf_skb_under_cgroup(struct sk_buff *skb, struct bpf_map *map, u32 index)
Description Check whether skb is a descendant of the cgroup2 held by map of type BPF_MAP_TYPE_CGROUP_ARRAY, at index.
Return The return value depends on the result of the test, and can be: * 0, if the skb failed the cgroup2 descendant test. * 1, if the skb succeeded the cgroup2 descendant test. * A negative error code, if an error occurred.

int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
Description Push a vlan_tci (VLAN tag control information) of protocol vlan_proto to the packet associated to skb, then update the checksum. Note that if vlan_proto is different from ETH_P_8021Q and ETH_P_8021AD, it is considered to be ETH_P_8021Q. A call to this helper is susceptible to change the underlying packet buffer. Therefore, at load time, all checks on pointers previously done by the verifier are invalidated and must be performed again, if the helper is used in combination with direct packet access. Return 0 on success, or a negative error in case of failure.

int bpf_skb_vlan_pop(struct sk_buff *skb)
Description Pop a VLAN header from the packet associated to skb. A call to this helper is susceptible to change the underlying packet buffer. Therefore, at load time, all checks on pointers previously done by the verifier are invalidated and must be performed again, if the helper is used in combination with direct packet access. Return 0 on success, or a negative error in case of failure.

int bpf_skb_change_proto(struct sk_buff *skb, __be16 proto, u64 flags)
Description Change the protocol of the skb to proto. Currently supported are transitions from IPv4 to IPv6, and from IPv6 to IPv4. The helper takes care of the groundwork for the transition, including resizing the socket buffer. The eBPF program is expected to fill the new headers, if any, via bpf_skb_store_bytes() and to recompute the checksums with bpf_l3_csum_replace() and bpf_l4_csum_replace(). The main use case for this helper is to perform NAT64 operations out of an eBPF program. Internally, the GSO type is marked as dodgy so that headers are checked and segments are recalculated by the GSO/GRO engine.
The size for GSO target is adapted as well. All values for flags are reserved for future usage, and must be left at zero. A call to this helper is susceptible to change the underlying packet buffer. Therefore, at load time, all checks on pointers previously done by the verifier are invalidated and must be performed again, if the helper is used in combination with direct packet access. Return 0 on success, or a negative error in case of failure.

int bpf_skb_change_type(struct sk_buff *skb, u32 type)
Description Change the packet type for the packet associated to skb. This comes down to setting skb->pkt_type to type, except the eBPF program does not have write access to skb->pkt_type beside this helper. Using a helper here allows for graceful handling of errors. The major use case is to change incoming skbs to PACKET_HOST in a programmatic way instead of having to recirculate via redirect(..., BPF_F_INGRESS), for example. Note that type only allows certain values. At this time, they are: PACKET_HOST Packet is for us. PACKET_BROADCAST Send packet to all. PACKET_MULTICAST Send packet to group. PACKET_OTHERHOST Send packet to someone else. Return 0 on success, or a negative error in case of failure.

int bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
Description Get tunnel metadata. This helper takes a pointer key to an empty struct bpf_tunnel_key of size, that will be filled with tunnel metadata for the packet associated to skb. The flags can be set to BPF_F_TUNINFO_IPV6, which indicates that the tunnel is based on IPv6 protocol instead of IPv4. The struct bpf_tunnel_key is an object that generalizes the principal parameters used by various tunneling protocols into a single struct. This way, it can be used to easily make a decision based on the contents of the encapsulation header, "summarized" in this struct.
In particular, it holds the IP address of the remote end (IPv4 or IPv6, depending on the case) in key->remote_ipv4 or key->remote_ipv6. Also, this struct exposes the key->tunnel_id, which is generally mapped to a VNI (Virtual Network Identifier), making it programmable together with the bpf_skb_set_tunnel_key() helper. Let's imagine that the following code is part of a program attached to the TC ingress interface, on one end of a GRE tunnel, and is supposed to filter out all messages coming from remote ends with IPv4 address other than 10.0.0.1:

    int ret;
    struct bpf_tunnel_key key = {};

    ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
    if (ret < 0)
        return TC_ACT_SHOT;     // drop packet

    if (key.remote_ipv4 != 0x0a000001)
        return TC_ACT_SHOT;     // drop packet

    return TC_ACT_OK;           // accept packet

This interface can also be used with all encapsulation devices that can operate in "collect metadata" mode: instead of having one network device per specific configuration, the "collect metadata" mode only requires a single device where the configuration can be extracted from this helper. This can be used together with various tunnels such as VXLAN, Geneve, GRE or IP in IP (IPIP). Return 0 on success, or a negative error in case of failure.

int bpf_skb_set_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
Description Populate tunnel metadata for packet associated to skb. The tunnel metadata is set to the contents of key, of size. The flags can be set to a combination of the following values: BPF_F_TUNINFO_IPV6 Indicate that the tunnel is based on IPv6 protocol instead of IPv4. BPF_F_ZERO_CSUM_TX For IPv4 packets, add a flag to tunnel metadata indicating that checksum computation should be skipped and checksum set to zeroes. BPF_F_DONT_FRAGMENT Add a flag to tunnel metadata indicating that the packet should not be fragmented.
BPF_F_SEQ_NUMBER Add a flag to tunnel metadata indicating that a sequence number should be added to tunnel header before sending the packet. This flag was added for GRE encapsulation, but might be used with other protocols as well in the future. Here is a typical usage on the transmit path:

    struct bpf_tunnel_key key;

    /* populate key ... */

    bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0);
    bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);

See also the description of the bpf_skb_get_tunnel_key() helper for additional information. Return 0 on success, or a negative error in case of failure.

int bpf_skb_get_tunnel_opt(struct sk_buff *skb, u8 *opt, u32 size)
Description Retrieve tunnel options metadata for the packet associated to skb, and store the raw tunnel option data to the buffer opt of size. This helper can be used with encapsulation devices that can operate in "collect metadata" mode (please refer to the related note in the description of bpf_skb_get_tunnel_key() for more details). A particular example where this can be used is in combination with the Geneve encapsulation protocol, where it allows for pushing (with the bpf_skb_set_tunnel_opt() helper) and retrieving (with this helper) arbitrary TLVs (Type-Length-Value headers) from the eBPF program. This allows for full customization of these headers. Return The size of the option data retrieved.

int bpf_skb_set_tunnel_opt(struct sk_buff *skb, u8 *opt, u32 size)
Description Set tunnel options metadata for the packet associated to skb to the option data contained in the raw buffer opt of size. See also the description of the bpf_skb_get_tunnel_opt() helper for additional information. Return 0 on success, or a negative error in case of failure.

u32 bpf_get_route_realm(struct sk_buff *skb)
Description Retrieve the realm of the route, that is to say the tclassid field of the destination for the skb.
The identifier retrieved is a user-provided tag, similar to the one used with the net_cls cgroup (see description for the bpf_get_cgroup_classid() helper), but here this tag is held by a route (a destination entry), not by a task. Retrieving this identifier works with the clsact TC egress hook (see also tc-bpf(8)), or alternatively on conventional classful egress qdiscs, but not on the TC ingress path. In the case of the clsact TC egress hook, this has the advantage that, internally, the destination entry has not been dropped yet in the transmit path. Therefore, the destination entry does not need to be artificially held via netif_keep_dst() for a classful qdisc until the skb is freed. This helper is available only if the kernel was compiled with the CONFIG_IP_ROUTE_CLASSID configuration option. Return The realm of the route for the packet associated to skb, or 0 if none was found.

u32 bpf_get_hash_recalc(struct sk_buff *skb)
Description Retrieve the hash of the packet, skb->hash. If it is not set, in particular if the hash was cleared due to mangling, recompute this hash. Later accesses to the hash can be done directly with skb->hash. Calling bpf_set_hash_invalid(), changing a packet protocol with bpf_skb_change_proto(), or calling bpf_skb_store_bytes() with the BPF_F_INVALIDATE_HASH flag are actions susceptible to clear the hash and to trigger a new computation for the next call to bpf_get_hash_recalc(). Return The 32-bit hash.

void bpf_set_hash_invalid(struct sk_buff *skb)
Description Invalidate the current skb->hash. It can be used after mangling on headers through direct packet access, in order to indicate that the hash is outdated and to trigger a recalculation the next time the kernel tries to access this hash or when the bpf_get_hash_recalc() helper is called.

u32 bpf_set_hash(struct sk_buff *skb, u32 hash)
Description Set the full hash for skb (set the field skb->hash) to value hash.
Return 0

int bpf_perf_event_output(struct pt_regs *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
Description Write raw data blob into a special BPF perf event held by map of type BPF_MAP_TYPE_PERF_EVENT_ARRAY. This perf event must have the following attributes: PERF_SAMPLE_RAW as sample_type, PERF_TYPE_SOFTWARE as type, and PERF_COUNT_SW_BPF_OUTPUT as config. The flags are used to indicate the index in map for which the value must be put, masked with BPF_F_INDEX_MASK. Alternatively, flags can be set to BPF_F_CURRENT_CPU to indicate that the index of the current CPU core should be used. The value to write, of size, is passed through the eBPF stack and pointed to by data. The context of the program, ctx, also needs to be passed to the helper. On the user space side, a program willing to read the values needs to call perf_event_open() on the perf event (either for one or for all CPUs) and to store the file descriptor into the map. This must be done before the eBPF program can send data into it. An example is available in file samples/bpf/trace_output_user.c in the Linux kernel source tree (the eBPF program counterpart is in samples/bpf/trace_output_kern.c). bpf_perf_event_output() achieves better performance than bpf_trace_printk() for sharing data with user space, and is much better suited for streaming data from eBPF programs. Note that this helper is not restricted to tracing use cases and can be used with programs attached to TC or XDP as well, where it allows for passing data to user space listeners. Data can be: * Only custom structs, * Only the packet payload, or * A combination of both. Return 0 on success, or a negative error in case of failure.

bpf_skb_store_bytes, bpf_skb_load_bytes, bpf_skb_pull_data, bpf_skb_change_tail, bpf_get_socket_cookie, bpf_get_socket_uid are also supported. See above for descriptions.

3.
XDP (eXpress Data Path) program functions

In addition to the base set, bpf_perf_event_output, bpf_get_smp_processor_id, bpf_redirect and bpf_redirect_map are all supported as described above.

int bpf_xdp_adjust_head(struct xdp_buff *xdp_md, int delta)
Description Adjust (move) xdp_md->data by delta bytes. Note that it is possible to use a negative value for delta. This helper can be used to prepare the packet for pushing or popping headers. A call to this helper is susceptible to change the underlying packet buffer. Therefore, at load time, all checks on pointers previously done by the verifier are invalidated and must be performed again, if the helper is used in combination with direct packet access. Return 0 on success, or a negative error in case of failure.

4. kprobes, tracepoints and perf events program functions

To figure out which helper functions are supported for these program types, we need to look at kernel/trace/bpf_trace.c. Here a common set of verifier ops valid for all these program types is defined in tracing_func_proto(). This is the equivalent of the base function prototype in filter.c. The base set of functions for BPF filters is supported here too: bpf_map_lookup_elem, bpf_map_update_elem, bpf_map_delete_elem, bpf_ktime_get_ns, bpf_tail_call, bpf_trace_printk, bpf_get_smp_processor_id, bpf_get_numa_node_id, bpf_get_prandom_u32. In addition, bpf_perf_event_read and bpf_perf_event_output are valid, and defined above. Other functions (not previously described) are:

int bpf_get_stackid(struct pt_regs *ctx, struct bpf_map *map, u64 flags)
Description Walk a user or a kernel stack and return its id. To achieve this, the helper needs ctx, which is a pointer to the context on which the tracing program is executed, and a pointer to a map of type BPF_MAP_TYPE_STACK_TRACE. The last argument, flags, holds the number of stack frames to skip (from 0 to 255), masked with BPF_F_SKIP_FIELD_MASK.
The next bits can be used to set a combination of the following flags: BPF_F_USER_STACK Collect a user space stack instead of a kernel stack. BPF_F_FAST_STACK_CMP Compare stacks by hash only. BPF_F_REUSE_STACKID If two different stacks hash into the same stackid, discard the old one. The stack id retrieved is a 32-bit integer handle which can be further combined with other data (including other stack ids) and used as a key into maps. This can be useful for generating a variety of graphs (such as flame graphs or off-cpu graphs). For walking a stack, this helper is an improvement over bpf_probe_read(), which can be used with unrolled loops but is not efficient and consumes a lot of eBPF instructions. Instead, bpf_get_stackid() can collect up to PERF_MAX_STACK_DEPTH both kernel and user frames. Note that this limit can be controlled with the sysctl program, and that it should be manually increased in order to profile long user stacks (such as stacks for Java programs). To do so, use: # sysctl kernel.perf_event_max_stack=<new value> Return The positive or null stack id on success, or a negative error in case of failure.

int bpf_probe_read(void *dst, u32 size, const void *src)
Description For tracing programs, safely attempt to read size bytes from address src and store the data in dst. Return 0 on success, or a negative error in case of failure.

u64 bpf_get_current_pid_tgid(void)
Return A 64-bit integer containing the current tgid and pid, and created as such: current_task->tgid << 32 | current_task->pid.

u64 bpf_get_current_uid_gid(void)
Return A 64-bit integer containing the current GID and UID, and created as such: current_gid << 32 | current_uid.

int bpf_get_current_comm(char *buf, u32 size_of_buf)
Description Copy the comm attribute of the current task into buf of size_of_buf. The comm attribute contains the name of the executable (excluding the path) for the current task. The size_of_buf must be strictly positive.
On success, the helper makes sure that the buf is NUL-terminated. On failure, it is filled with zeroes. Return 0 on success, or a negative error in case of failure.

u64 bpf_get_current_task(void)
Return A pointer to the current task struct.

int bpf_probe_write_user(void *dst, const void *src, u32 len)
Description Attempt in a safe way to write len bytes from the buffer src to dst in memory. It only works for threads that are in user context, and dst must be a valid user space address. This helper should not be used to implement any kind of security mechanism because of TOC-TOU attacks, but rather to debug, divert, and manipulate execution of semi-cooperative processes. Keep in mind that this feature is meant for experiments, and it has a risk of crashing the system and running programs. Therefore, when an eBPF program using this helper is attached, a warning including PID and process name is printed to kernel logs. Return 0 on success, or a negative error in case of failure.

int bpf_current_task_under_cgroup(struct bpf_map *map, u32 index)
Description Check whether the probe is being run in the context of a given subset of the cgroup2 hierarchy. The cgroup2 to test is held by map of type BPF_MAP_TYPE_CGROUP_ARRAY, at index. Return The return value depends on the result of the test, and can be: * 1, if current task belongs to the cgroup2. * 0, if current task does not belong to the cgroup2. * A negative error code, if an error occurred.

int bpf_probe_read_str(void *dst, int size, const void *unsafe_ptr)
Description Copy a NUL terminated string from an unsafe address unsafe_ptr to dst. The size should include the terminating NUL byte. In case the string length is smaller than size, the target is not padded with further NUL bytes. If the string length is larger than size, just size - 1 bytes are copied and the last byte is set to NUL. On success, the length of the copied string is returned.
This makes this helper useful in tracing programs for reading strings, and more importantly to get their length at runtime. See the following snippet:

    SEC("kprobe/sys_open")
    void bpf_sys_open(struct pt_regs *ctx)
    {
        char buf[PATHLEN]; // PATHLEN is defined to 256
        int res = bpf_probe_read_str(buf, sizeof(buf), ctx->di);

        // Consume buf, for example push it to
        // userspace via bpf_perf_event_output(); we
        // can use res (the string length) as event
        // size, after checking its boundaries.
    }

In comparison, using the bpf_probe_read() helper here instead to read the string would require estimating the length at compile time, and would often result in copying more memory than necessary. Another useful use case is when parsing individual process arguments or individual environment variables: navigating current->mm->arg_start and current->mm->env_start using this helper and the return value, one can quickly iterate at the right offset of the memory area. Return On success, the strictly positive length of the string, including the trailing NUL character. On error, a negative value.

5. cgroups-related program functions

For cgroup sock/skb programs, in addition to the base set, one additional function is supported: bpf_get_current_uid_gid, defined above.

6. Lightweight tunnel program functions

For Lightweight tunnel in/out/xmit programs, in addition to the base set of functions, bpf_skb_load_bytes, bpf_skb_pull_data, bpf_csum_diff, bpf_get_cgroup_classid, bpf_get_route_realm, bpf_get_hash_recalc, bpf_perf_event_output, bpf_get_smp_processor_id and bpf_skb_under_cgroup are all supported, and defined above. For Lightweight tunnel xmit only: bpf_skb_get_tunnel_key, bpf_skb_set_tunnel_key, bpf_skb_get_tunnel_opt, bpf_skb_set_tunnel_opt, bpf_redirect, bpf_clone_redirect, bpf_skb_change_tail, bpf_skb_change_head, bpf_skb_store_bytes, bpf_csum_update, bpf_l3_csum_replace, bpf_l4_csum_replace and bpf_set_hash_invalid are all supported and defined above.
Summary

We've described the various program types and the functions they support. However, before we can start writing BPF programs, we need to talk about BPF maps - a key data structure that can be used (among other things) to share information between BPF programs and user-space.

Learning more about BPF

Thanks for reading this installment of our series on BPF. We hope you found it educational and useful. Questions or comments? Use the comments field below! Stay tuned for the next installment in this series, BPF and Userspace.

Previously: BPF program types

Notes on BPF (2) - BPF helper functions Oracle Linux kernel developer Alan Maguire presents this six-part series on BPF, wherein he presents an in depth look at the kernel's "Berkeley Packet Filter" -- a useful and extensible kernel function for much more than packet filtering.

Notes on BPF (1) - A Tour of Program Types Oracle Linux kernel developer Alan Maguire presents this six-part series on BPF, wherein he presents an in depth look at the kernel's "Berkeley Packet Filter" -- a useful and extensible kernel function for much more than packet filtering. If you follow Linux kernel development discussions and blog posts, you've probably heard BPF mentioned a lot lately. It's being used for high-performance load-balancing, DDoS mitigation and firewalling, safe instrumentation of kernel and user-space code and much more! BPF does this by supporting a safe, flexible programming environment in many different contexts; networking datapaths, kernel probes, perf events and more. Safety is key - in most environments, adding a kernel module introduces significant risk. BPF programs, however, are verified at program load time to ensure no out-of-bounds accesses occur, etc. In addition BPF supports just-in-time compilation of its bytecode to the native instruction set, so BPF programs are also fast. If you're interested in topics like fast packet processing and observability, learning BPF should definitely be on your to-do list. Here we try to give a guide to BPF, covering a range of topics which will hopefully help developers trying to get to grips with writing BPF programs. This guide is based on Linux 4.14 (which is the kernel for Oracle Linux UEK5), so do bear that in mind as there have been a bunch of changes in BPF since, and some package names etc may differ for other distributions. Because BPF in Linux is such a fast-moving target, I'm going to try and point you at relevant places in the kernel codebase that may help you get a sense for what the technology can do. The samples/bpf directory is a great place to look to see what others have done, but here we'll also dig into the implementation as reference, as it may give you some ideas how to create new BPF programs.
The aim here isn't to give a deep dive into BPF internals, but rather to give a few pointers to areas in the code which reveal BPF functionality. The source tree I'm using for reference is our UEK5 release, based on Linux 4.14.35. See https://github.com/oracle/linux-uek/tree/uek5/master . Most of the functionality described can be found in any recent kernel. The bpf-next tree (where BPF kernel development happens) can be found at https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

An important caveat: again, what is below describes the state as per the 4.14 kernel. A lot has changed since; but hopefully, with the pointers into the code, you'll be better equipped to figure out what some of these changes are! The aim here is to be able to get to the point of working on interesting problems with BPF. However, before we get there, let's look at the various pieces and how they fit together.

The first question to ask is what can we do with BPF? What kinds of programs can we write? To get a sense for this, let's examine the enumerated type definition from include/uapi/linux/bpf.h https://github.com/oracle/linux-uek/blob/uek5/master/include/uapi/linux/bpf.h#L117

enum bpf_prog_type {
	BPF_PROG_TYPE_UNSPEC,
	BPF_PROG_TYPE_SOCKET_FILTER,
	BPF_PROG_TYPE_KPROBE,
	BPF_PROG_TYPE_SCHED_CLS,
	BPF_PROG_TYPE_SCHED_ACT,
	BPF_PROG_TYPE_TRACEPOINT,
	BPF_PROG_TYPE_XDP,
	BPF_PROG_TYPE_PERF_EVENT,
	BPF_PROG_TYPE_CGROUP_SKB,
	BPF_PROG_TYPE_CGROUP_SOCK,
	BPF_PROG_TYPE_LWT_IN,
	BPF_PROG_TYPE_LWT_OUT,
	BPF_PROG_TYPE_LWT_XMIT,
	BPF_PROG_TYPE_SOCK_OPS,
	BPF_PROG_TYPE_SK_SKB,
};

What are all of these program types? To understand this, we will ask the same set of questions for each program type:

- what do I do with this program type?
- how do I attach my BPF program for this program type?
- what context is provided to my program? By this we mean what argument(s) and data are provided for us to work with.
- when does the attached program get run?
It's important to understand this, as it gives a sense of where, for example, in the network stack a filter is applied. We won't worry about how you create the programs for now; that side of things is relatively uniform across the various program types.

1. socket-related program types - SOCKET_FILTER, SK_SKB, SOCK_OPS

First, let's consider the socket-related program types which allow us to filter and redirect socket data and monitor socket events. The filtering use case relates to the origins of BPF. When observing the network we want to see only a portion of network traffic, for example all traffic from a troublesome system. Filters are used to describe the traffic we want to see, and ideally we want filtering to be fast, and we want to give users an open-ended set of filtering options. But we have a problem: we want to throw away unneeded data as early as possible, and to do that we need to filter in kernel context. Consider the alternative to an in-kernel solution - incurring the cost of copying packets to user-space and filtering there. That would be very expensive, especially if we only want to see a portion of the network traffic and throw away the rest. To achieve this, a safe mini-language was invented to translate high-level filters into a bytecode program that the kernel can use (termed classic BPF, cBPF). The aim of the language was to support a flexible set of filtering options while being fast and safe. Filters written in this assembly-like language could be pushed by userspace programs such as tcpdump to accomplish filtering in-kernel. See https://www.tcpdump.org/papers/bpf-usenix93.pdf for the classic paper describing this work. Modern eBPF took these concepts, expanded the register and instruction set, added data structures called maps, hugely expanded the kinds of events we can attach to, and much more!
For socket filtering, the common case is to attach to a raw socket (SOCK_RAW), and in fact you'll notice most programs that do socket filtering have a line like this:

	s = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

Creating such a socket, we specify the domain (AF_PACKET), socket type (SOCK_RAW) and protocol (all packet types). In the Linux kernel, receive of raw packets is implemented by the raw_local_deliver() function. It is called by ip_local_deliver_finish(), just prior to calling the relevant IP protocol's handler, which is where the packet is passed to TCP, UDP, ICMP etc. So at this point the traffic has not been associated with a specific socket; that happens later, when the IP stack figures out the mapping from packet to layer 4 protocol, and then to the relevant socket (if any). You can see the cBPF bytecodes generated by tcpdump by using the -d option. Here I want to run tcpdump on the wlp4s0 interface, filtering TCP traffic only:

	# tcpdump -i wlp4s0 -d 'tcp'
	(000) ldh      [12]
	(001) jeq      #0x86dd          jt 2    jf 7
	(002) ldb      [20]
	(003) jeq      #0x6             jt 10   jf 4
	(004) jeq      #0x2c            jt 5    jf 11
	(005) ldb      [54]
	(006) jeq      #0x6             jt 10   jf 11
	(007) jeq      #0x800           jt 8    jf 11
	(008) ldb      [23]
	(009) jeq      #0x6             jt 10   jf 11
	(010) ret      #65535
	(011) ret      #0

Without much deep knowledge we can get a feel for what's happening here. On line 000 we load the halfword at offset 12 of the ether header: the ether header protocol type. On line 001, we jump to 002 if it matches ETH_P_IPV6 (0x86dd) (jt 2); otherwise we jump to 007 (jf 7), which handles the IPv4 case. Let's look at the IPv6 case first. On line 003 we jump to 010 - success - if the IPv6 next header (offset 20) is 6 (IPPROTO_TCP); line 010 returns 65535, which is the max length, so we're accepting the packet. Otherwise we jump to 004. Here we compare to 0x2c, which indicates there's an IPv6 fragment header.
If that's true we check whether the fragment header (offset 54) specifies a next protocol value of IPPROTO_TCP, and if so we jump to 10 (success) or 11 (failure). Returning 0 means dropping the packet for filtering purposes. Handling IPv4 is simpler: on 007 (arrived at via "jf" on 001), we check for ETH_P_IP (0x800) and, if found, we verify that the IP protocol is TCP. And we're done! Remember though, this is cBPF; eBPF has an extended instruction/op set similar to x86_64, and additional registers.

One other thing to note - socket filtering is distinct from netfilter-based filtering. Netfilter defines its own set of hooks with NF_HOOK() definitions, which netfilter-based technologies such as iptables can use to filter traffic also. You might think - couldn't we use eBPF there too? And you'd be right! bpfilter is intended to replace iptables in more recent Linux kernels. So with all that in mind, let's return to examining the socket-related program types.

1.1 BPF_PROG_TYPE_SOCKET_FILTER

What do I do with it? The filtering actions include dropping packets (if the program returns 0) or trimming packets (if the program returns a length less than the original). See sk_filter_trim_cap() and its call to bpf_prog_run_save_cb(). Note that we're not trimming or dropping the original packet, which would still reach the intended socket intact; we're working with a copy of the packet metadata which raw sockets can access for observability. In addition to filtering packet flow to our socket, we can also do things that have side-effects; for example, collecting statistics in BPF maps.

How do I attach my program? BPF programs can be attached to sockets via the SO_ATTACH_BPF setsockopt(), which passes in a file descriptor to the program.

What context is provided? A pointer to the struct __sk_buff containing packet metadata/data. This structure is defined in include/uapi/linux/bpf.h, and includes key fields from the real sk_buff.
The bpf verifier converts access to valid __sk_buff fields into offsets into the "real" sk_buff; see https://lwn.net/Articles/636647/ for more details.

When does it run? Socket filters run for receive in sock_queue_rcv_skb(), which is called by various protocols (TCP, UDP, ICMP, raw sockets etc) and can be used to filter inbound traffic.

To give a sense for what programs look like, here we will create a filter that trims packet data, filtering on the basis of protocol type: for IPv4 TCP, let's grab the IPv4 + TCP header only, while for UDP, we'll take the IPv4 and UDP header only. We won't deal with IPv4 options as it's a simple example, so in all other cases we return 0 (drop packet).

#include <linux/bpf.h>
#include <linux/in.h>
#include <linux/types.h>
#include <linux/string.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/udp.h>
#include "bpf_helpers.h"

#ifndef offsetof
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
#endif

/*
 * We are only interested in TCP/UDP headers, so drop every other protocol
 * and trim packets after the TCP/UDP header by returning length of
 * ether header + IPv4 header + TCP/UDP header.
 */
SEC("socket")
int bpf_prog1(struct __sk_buff *skb)
{
	int proto = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
	int size = ETH_HLEN + sizeof(struct iphdr);

	switch (proto) {
	case IPPROTO_TCP:
		size += sizeof(struct tcphdr);
		break;
	case IPPROTO_UDP:
		size += sizeof(struct udphdr);
		break;
	default:
		size = 0;
		break;
	}
	return size;
}

char _license[] SEC("license") = "GPL";

This program can be compiled into BPF bytecodes using LLVM/clang by specifying an arch of "bpf", and once that is done it will contain an object with an ELF section called "socket". That is our program. The next step is to use the BPF system call to assign a file descriptor to the program, then attach it to the socket.
In samples/bpf, you can see that bpf_load.c scans the ELF sections, and sections with a name prefixed by "socket" are recognized as BPF_PROG_TYPE_SOCKET_FILTER programs. If you're adding a sample, I'd recommend including bpf_load.h so you can just call load_bpf_file() on your BPF program. For example, in samples/bpf/sockex1_user.c we take the filename of our program (sockex1) and load sockex1_kern.o, the associated BPF program. Then we open a raw socket to loopback (lo) and attach the program there:

snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
if (load_bpf_file(filename)) {
	printf("%s", bpf_log_buf);
	return 1;
}
sock = open_raw_sock("lo");
assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
		  sizeof(prog_fd[0])) == 0);

1.2 BPF_PROG_TYPE_SOCK_OPS

What do I do with it? Attach a BPF program to catch socket operations such as connection establishment, retransmit timeout etc. Once an operation is caught, options can also be set via bpf_setsockopt(), so for example on passive establishment of a connection from a system not on the same subnet, we could lower the MTU so we won't have to worry about intermediate routers fragmenting packets. Programs can return success (0) or failure (a negative value) and a reply value can be set to indicate the desired value for a socket option (e.g. TCP rwnd). See https://lwn.net/Articles/727189/ for full details, and look for tcp_call_bpf()'s inline definition in include/net/tcp.h to see how TCP handles execution of such programs. Another use case is for sockmap updates in combination with BPF_PROG_TYPE_SK_SKB programs; the bpf_sock_ops struct pointer passed into the BPF_PROG_TYPE_SOCK_OPS program is used to update the sockmap, associating a value for that socket. Later, sk_skb programs can reference those values to specify which socket to redirect to via bpf_sk_redirect_map() calls. If this sounds confusing, I'd recommend taking a look at the code in samples/sockmap.

How do I attach my program?
It is attached to a cgroup file descriptor using the BPF_CGROUP_SOCK_OPS attach type.

What context is provided? The argument provided is the context, a struct bpf_sock_ops *. The op field specifies the operation: BPF_SOCK_OPS_RWND_INIT, BPF_SOCK_OPS_TCP_CONNECT_CB etc. The reply field can be used to indicate to the caller a new value for a parameter set.

/* User bpf_sock_ops struct to access socket values and specify request ops
 * and their replies.
 * Some of this fields are in network (bigendian) byte order and may need
 * to be converted before use (bpf_ntohl() defined in samples/bpf/bpf_endian.h).
 * New fields can only be added at the end of this structure
 */
struct bpf_sock_ops {
	__u32 op;
	union {
		__u32 reply;
		__u32 replylong[4];
	};
	__u32 family;
	__u32 remote_ip4;	/* Stored in network byte order */
	__u32 local_ip4;	/* Stored in network byte order */
	__u32 remote_ip6[4];	/* Stored in network byte order */
	__u32 local_ip6[4];	/* Stored in network byte order */
	__u32 remote_port;	/* Stored in network byte order */
	__u32 local_port;	/* stored in host byte order */
};

When does it run? As per the above article, unlike other BPF program types that expect to be called at a particular place in the codebase, SOCK_OPS programs can be called at different places and use the "op" field to indicate the context. See include/uapi/linux/bpf.h for the enumerated BPF_SOCK_OPS_* definitions, but they include events like retransmit timeout, connection establishment etc.

1.3 BPF_PROG_TYPE_SK_SKB

What do I do with it? Allows users to access skb and socket details such as port and IP address with a view to supporting redirect of skbs between sockets. See https://lwn.net/Articles/731133/ . This functionality is used in conjunction with a sockmap - a special-purpose BPF map that contains references to socket structures and associated values. sockmaps are used to support redirection. The program is attached and the bpf_sk_redirect_map() helper can be used to carry out the redirection.
The general approach is that we catch socket creation events with sock_ops BPF programs, associate values with the sockmap for those sockets, and then use data at the sk_skb instrumentation points to inform socket redirection - this is termed the verdict, and the program for this is attached to the sockmap via BPF_SK_SKB_STREAM_VERDICT. The verdict can be __SK_DROP, __SK_PASS, or __SK_REDIRECT. Another use case for this program type is in the strparser framework (https://www.kernel.org/doc/Documentation/networking/strparser.txt). BPF programs can be used to parse message streams via callbacks for read operations, verdict and read completion. TLS and KCM use stream parsing.

How do I attach my program? A redirection program is attached to a sockmap as BPF_SK_SKB_STREAM_VERDICT; it should return the result of bpf_sk_redirect_map(). A strparser program is attached via BPF_SK_SKB_STREAM_PARSER and should return the length of data parsed.

What context is provided? A pointer to the struct __sk_buff containing packet metadata/data. However, more fields are accessible to the sk_skb program type. The extra set of fields available is documented in include/linux/bpf.h like so:

/* Accessed by BPF_PROG_TYPE_sk_skb types from here to ... */
__u32 family;
__u32 remote_ip4;	/* Stored in network byte order */
__u32 local_ip4;	/* Stored in network byte order */
__u32 remote_ip6[4];	/* Stored in network byte order */
__u32 local_ip6[4];	/* Stored in network byte order */
__u32 remote_port;	/* Stored in network byte order */
__u32 local_port;	/* stored in host byte order */
/* ... here. */

So from the above alone we can see we can gather information about the socket, since the above represents the key information that identifies the socket uniquely (the protocol is already available in the globally-accessible portion of the struct __sk_buff).

When does it run?
A stream parser can be attached to a socket via BPF_SK_SKB_STREAM_PARSER attachment to a sockmap, and the parser runs on socket receive via smap_parse_func_strparser() in kernel/bpf/sockmap.c. BPF_SK_SKB_STREAM_VERDICT also attaches to the sockmap, and is run via smap_verdict_func().

2. tc (traffic control) subsystem programs

Next let's examine the program type related to the tc kernel packet scheduling subsystem. See the tc(8) manpage for a general introduction, and tc-bpf(8) for BPF specifics.

2.1 tc_cls_act: qdisc classifier

What do I do with it? tc_cls_act allows us to use BPF programs as classifiers and actions in tc, the Linux QoS subsystem. What's even better is that the tc(8) command has eBPF support also, so we can directly load BPF programs as classifiers and actions for inbound (ingress) and outbound (egress) traffic. See http://man7.org/linux/man-pages/man8/tc-bpf.8.html for a description of how to use tc's BPF functionality. tc programs can classify, modify, redirect or drop packets.

How do I attach my program? tc(8) can be used; see tc-bpf(8) for details. The basics are that we create a "clsact" qdisc for a network device, and add ingress and egress classifiers/actions by specifying the BPF object and relevant ELF section. For example, to add an ingress classifier to eth0 in ELF section my_elf_sec from myprog_kernel.o (a bpf-bytecode-compiled object file):

# tc qdisc add dev eth0 clsact
# tc filter add dev eth0 ingress bpf da obj myprog_kernel.o sec my_elf_sec

What context is provided? A pointer to struct __sk_buff packet metadata/data.

When does it get run? As mentioned above, the clsact qdisc must be added, and once it is we can attach BPF programs to classify inbound and outbound traffic. Implementation-wise, act_bpf.c and cls_bpf.c implement action/classifier modules. On ingress/egress, sch_handle_ingress()/sch_handle_egress() call tcf_classify().
In the case of ingress, we do classification via the core network interface receive function, so we are getting the packet after the driver has processed it but before IP etc. see it. On egress, the filtering is done prior to submitting to the device queue for transmit.

3. xdp: the eXpress Data Path

The key design goal for XDP is to introduce programmability in the network datapath. The aim is to provide the XDP hooks as close to the device as possible (before the OS has created sk_buff metadata) to maximize performance while supporting a common infrastructure across devices. To support XDP like this requires driver changes. For an example see drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c. A bpf net device op (ndo_bpf) is added. For bnxt it supports XDP_SETUP_PROG and XDP_QUERY_PROG actions; the former configures the device for XDP, reserving rings and setting the program as active. The latter returns the BPF program id. BPF-specific transmit and receive functions are provided and called by the real send/receive functions if needed.

3.1 BPF_PROG_TYPE_XDP

What do I do with it? XDP allows access to packet data as early as possible, before packet metadata (struct sk_buff) has been assigned. Thus it is a useful place to do DDoS mitigation or load balancing, since such activities can often avoid the expensive overhead of sk_buff allocation. XDP is all about supporting run-time programming of the kernel via BPF hooks, but by working in concert with the kernel itself; i.e. it is not a kernel bypass mechanism. Actions supported include XDP_PASS (pass into network processing as usual), XDP_DROP (drop), XDP_TX (transmit) and XDP_REDIRECT. See include/uapi/linux/bpf.h for the "enum xdp_action".

How do I attach my program? Via a netlink socket message. A netlink socket - socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE) - is created and bound, and then we send a netlink message of type NLA_F_NESTED | 43; this specifies an XDP message.
The message contains the BPF fd and the interface index (ifindex). See samples/bpf/bpf_load.c for an example.

What context is provided? An xdp metadata pointer, struct xdp_md *. XDP metadata is deliberately lightweight; from include/uapi/linux/bpf.h:

/* user accessible metadata for XDP packet hook
 * new fields must be added to the end of this structure
 */
struct xdp_md {
	__u32 data;
	__u32 data_end;
};

When does it get run? "Real" XDP is implemented at the driver level, and transmit/receive ring resources are set aside for XDP usage. For cases where drivers do not support XDP, there is the option of using "generic" XDP, which is implemented in net/core/dev.c. The downside is that we do not bypass skb allocation; generic XDP just allows us to use XDP for such devices also.

4. kprobes, tracepoints and perf events

kprobes, tracepoints and perf events all provide kernel instrumentation. kprobes - https://www.kernel.org/doc/Documentation/kprobes.txt - allow instrumentation of specific functions: entry of a function can be monitored via a kprobe, along with most instructions within a function, or entry/return can be instrumented via a kretprobe. When one of these probes is enabled, the code at the enable point is saved, and replaced with a breakpoint instruction. When this breakpoint is hit, a trap is generated, registers are saved and we branch to the relevant instrumentation handler. For example, kprobes are handled by kprobe_dispatcher(), which gets the address of the kprobe and register context as arguments. kretprobes are implemented via kprobes; a kprobe fires on entry and modifies the return address, saving the original and replacing it with the location of the instrumentation handler.
Tracepoints - https://www.kernel.org/doc/Documentation/trace/tracepoints.rst - are similar, but rather than being enabled at particular instructions, they are explicitly marked at sites in code, and if enabled can be used to collect debugging information at those sites of interest. The same tracepoint can be declared in multiple places; for example trace_drv_return_int() is called in multiple places in net/mac80211/driver-ops.c.

Perf events - https://perf.wiki.kernel.org/index.php/Main_Page - are the basis for eBPF support for these program types. BPF essentially piggy-backs on the existing infrastructure for event sampling, allowing us to attach programs to perf events of interest, which include kprobes, uprobes, tracepoints etc. as well as other software events, and indeed hardware events can be monitored too. These instrumentation points are what give BPF the capability to be a general-purpose tracing tool as well as a means for supporting the original networking-centric use cases like socket filtering.

4.1 BPF_PROG_TYPE_KPROBE

What do I do with it? Instrument code in any kernel function (bar a few exceptions) via kprobe, or instrument entry/return via kretprobe. k[ret]probe_perf_func() executes a BPF program attached to the probe point. Note that this program type can also be used to attach to u[ret]probes - see https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt for details.

How do I attach my program? When the kprobe is created via the tracing filesystem, it has an id associated with it, stored in /sys/kernel/debug/tracing/events/[uk]probe/<probename>/id , /sys/kernel/debug/tracing/events/[uk]retprobe/<probename>/id .
https://www.kernel.org/doc/Documentation/trace/kprobetrace.txt contains details on how to create a kprobe via the tracing filesystem. For example, to create a probe called "myprobe" on entry to tcp_retransmit_skb() and retrieve its id:

# echo 'p:myprobe tcp_retransmit_skb' > /sys/kernel/debug/tracing/kprobe_events
# cat /sys/kernel/debug/tracing/events/kprobes/myprobe/id
2266

We can use that probe id to open a perf event, enable it, and set the BPF program for that perf event to be our program. See samples/bpf/bpf_load.c in the load_and_attach() function for how this can be done for k[ret]probes. The code might look something like this:

struct perf_event_attr attr;
int eventfd, programfd;
int probeid;

/* Load BPF program and assign programfd to it; and get probeid of probe
 * from the tracing filesystem...
 */
attr.type = PERF_TYPE_TRACEPOINT;
attr.sample_type = PERF_SAMPLE_RAW;
attr.sample_period = 1;
attr.wakeup_events = 1;
attr.config = probeid;
eventfd = sys_perf_event_open(&attr, -1 /* pid */, 0 /* cpu */,
			      -1 /* group_fd */, 0 /* flags */);
if (eventfd < 0)
	return -errno;
if (ioctl(eventfd, PERF_EVENT_IOC_ENABLE, 0)) {
	close(eventfd);
	return -errno;
}
if (ioctl(eventfd, PERF_EVENT_IOC_SET_BPF, programfd)) {
	close(eventfd);
	return -errno;
}

What context is provided? A struct pt_regs *ctx, from which the registers can be accessed. Much of this is platform-specific, but some general-purpose functions exist, such as regs_return_value(regs), which returns the value of the register that holds the function return value (regs->ax on x86).

When does it get run? When the probe is enabled and the breakpoint is hit, k[ret]probe_perf_func() executes the BPF program attached to the probe point via trace_call_bpf(). It's a similar story for u[ret]probe_perf_func().

4.2 BPF_PROG_TYPE_TRACEPOINT

What do I do with it? Instrument tracepoints in kernel code. Tracepoints can be enabled via the tracing filesystem, as is the case with kprobes, and in a similar way. The list of trace events can be seen under /sys/kernel/debug/tracing/events.

How do I attach my program?
As we saw above, when the tracepoint is created via the tracing filesystem, it has an id associated with it. We can use that id to open a perf event, enable it, and set the BPF program for that perf event to be our program. See samples/bpf/bpf_load.c in the load_and_attach() function for how this can be done for tracepoints; the above code snippet for kprobes works for tracepoints also. As an example showing how tracepoints are enabled, here we enable the net/net_dev_xmit tracepoint as "myprobe2" and retrieve its id:

# echo 'p:myprobe2 trace:net/net_dev_xmit' > /sys/kernel/debug/tracing/kprobe_events
# cat /sys/kernel/debug/tracing/events/kprobes/myprobe2/id
2270

What context is provided? The context provided by the specific tracepoint; arguments and data types are associated with the tracepoint definition.

When does it get run? When the tracepoint is enabled and hit, perf_trace_() (see the definition in include/trace/perf.h) calls perf_trace_run_bpf_submit(), which will invoke the BPF program via trace_call_bpf().

4.3 BPF_PROG_TYPE_PERF_EVENT

What do I do with it? Instrument software and hardware perf events. These include events like syscalls, timer expiry, sampling of hardware events, etc. Hardware events include PMU (processor monitoring unit) events, which tell us things like how many instructions completed etc. Perf event monitoring can be targeted at a specific process or group, or a processor, and a sample period can be specified for profiling.

How do I attach my program? A similar model as per the above; we call perf_event_open() with an attribute set, enable the perf event via the PERF_EVENT_IOC_ENABLE ioctl(), and set the bpf program via the PERF_EVENT_IOC_SET_BPF ioctl(). For a perf event sampling example, see these snippets from samples/bpf/sampleip_user.c:

...
struct perf_event_attr pe_sample_attr = {
	.type = PERF_TYPE_SOFTWARE,
	.freq = 1,
	.sample_period = freq,
	.config = PERF_COUNT_SW_CPU_CLOCK,
	.inherit = 1,
};
...
...
pmu_fd[i] = sys_perf_event_open(&pe_sample_attr, -1 /* pid */, i,
				-1 /* group_fd */, 0 /* flags */);
if (pmu_fd[i] < 0) {
	fprintf(stderr, "ERROR: Initializing perf sampling\n");
	return 1;
}
assert(ioctl(pmu_fd[i], PERF_EVENT_IOC_SET_BPF, prog_fd[0]) == 0);
assert(ioctl(pmu_fd[i], PERF_EVENT_IOC_ENABLE, 0) == 0);
...

What context is provided? A struct bpf_perf_event_data *, which looks like this:

struct bpf_perf_event_data {
	struct pt_regs regs;
	__u64 sample_period;
};

When does it get run? That depends on the perf event firing and the sample rate chosen, specified by the freq and sample_period fields in the perf event attribute structure.

5. cgroups-related program types

Cgroups are used to handle resource allocation, allowing or denying access to system resources such as CPU, network bandwidth etc. for groups of processes. One key use case for cgroups is containers; a container's resource access is limited via cgroups while its activities are isolated by the various classes of namespace (network namespace, process ID namespace etc). In the BPF context, we can create eBPF programs that allow or deny access. In include/linux/bpf-cgroup.h we can see definitions for execution of socket/skb programs, where __cgroup_bpf_run_filter_skb is called wrapped in a check that cgroup BPF is enabled:

#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)			       \
({									       \
	int __ret = 0;							       \
	if (cgroup_bpf_enabled)						       \
		__ret = __cgroup_bpf_run_filter_skb(sk, skb,		       \
						    BPF_CGROUP_INET_INGRESS);  \
									       \
	__ret;								       \
})

#define BPF_CGROUP_RUN_SK_PROG(sk, type)				       \
({									       \
	int __ret = 0;							       \
	if (cgroup_bpf_enabled) {					       \
		__ret = __cgroup_bpf_run_filter_sk(sk, type);		       \
	}								       \
	__ret;								       \
})

If cgroups are enabled, we attach our program to the cgroup and it will be executed at the relevant hook points. To get an idea of the full list of hooks, consult include/uapi/linux/bpf.h and examine the enumerated type "bpf_attach_type" for BPF_CGROUP_* definitions.

5.1 BPF_PROG_TYPE_CGROUP_SKB

What do I do with it?
Allow or deny network access on IP egress/ingress (BPF_CGROUP_INET_INGRESS/BPF_CGROUP_INET_EGRESS). BPF programs should return 1 to allow access. Any other value results in the function __cgroup_bpf_run_filter_skb() returning -EPERM, which will be propagated to the caller such that the packet is dropped.

How do I attach my program? The program is attached to a specific cgroup's file descriptor.

What context is provided? The relevant skb.

When does it get run? For inet ingress, sk_filter_trim_cap() (see above) contains a call to BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb); if a non-zero value is returned, the error is propagated to the caller (e.g. __sk_receive_skb()) and the packet is discarded and freed. A similar approach is taken on egress, but in ip[6]_finish_output().

5.2 BPF_PROG_TYPE_CGROUP_SOCK

What do I do with it? Allow or deny network access at various socket-related events (BPF_CGROUP_INET_SOCK_CREATE, BPF_CGROUP_SOCK_OPS). As above, BPF programs should return 1 to allow access. Any other value results in the function __cgroup_bpf_run_filter_sk() returning -EPERM, which will be propagated to the caller such that the operation is denied.

How do I attach my program? The program is attached to a specific cgroup's file descriptor.

What context is provided? The relevant socket (sk).

When does it get run? At socket creation time, in inet_create() we call BPF_CGROUP_RUN_PROG_INET_SOCK() with the socket as argument, and if that function fails, the socket is released.

6. Lightweight tunnel program types

Lightweight tunnels - https://lwn.net/Articles/650778/ - are a simple way to do tunneling by attaching encapsulation instructions to routes.
The examples in the patch description make things clearer. iproute examples (without BPF):

VXLAN: ip route add 40.1.1.1/32 encap vxlan id 10 dst 50.1.1.2 dev vxlan0
MPLS:  ip route add 10.1.1.0/30 encap mpls 200 via inet 10.1.1.1 dev swp1

So we're telling Linux, for example, that for traffic to 40.1.1.1/32 addresses, we want to encapsulate with a VXLAN ID of 10 and a destination IPv4 address of 50.1.1.2. BPF programs can do the encapsulation on outbound/transmit (inbound packets are read-only). See https://lwn.net/Articles/705609/ for more details. Similarly to tc, iproute eBPF support allows us to attach the eBPF program ELF section directly:

ip route add 192.168.253.2/32 \
	encap bpf out obj lwt_len_hist_kern.o section len_hist \
	dev veth0

6.1 BPF_PROG_TYPE_LWT_IN

What do I do with it? Examine inbound packets for lightweight tunnel de-encapsulation.

How do I attach my program? Via "ip route add":

# ip route add <route+prefix> encap bpf in obj <bpf object file.o> section <ELF section> dev <device>

What context is provided? A pointer to the struct __sk_buff.

When does it get run? Via lwtunnel_input(); that function supports a number of encapsulation types including BPF. The BPF case runs bpf_input in net/core/lwt_bpf.c with redirection disallowed.

6.2 BPF_PROG_TYPE_LWT_OUT

What do I do with it? Examine outbound packets for specific destination routes. As on input, the packet is read-only at this hook, so encapsulation itself is forbidden here; it is carried out at the transmit stage (see BPF_PROG_TYPE_LWT_XMIT below).

How do I attach my program? Via "ip route add":

# ip route add <route+prefix> encap bpf out obj <bpf object file.o> section <ELF section> dev <device>

What context is provided? A pointer to the struct __sk_buff.

When does it get run? Via lwtunnel_output().

6.3 BPF_PROG_TYPE_LWT_XMIT

What do I do with it? Implement encapsulation/redirection for lightweight tunnels on transmit.

How do I attach my program? Via "ip route add":

# ip route add <route+prefix> encap bpf xmit obj <bpf object file.o> section <ELF section> dev <device>

What context is provided?
A pointer to the struct __sk_buff.

When does it get run? Via lwtunnel_xmit().

Summary

So hopefully this roundup of program types was useful. We can see that BPF's safe in-kernel programmable environment can be used in all sorts of interesting ways! The next thing we will talk about is what BPF helper functions are available within the various program types.

Learning more about BPF

Thanks for reading this installment of our series on BPF. We hope you found it educational and useful. Questions or comments? Use the comments field below! Stay tuned for the next installment in this series, BPF Helper Functions.


Linux Kernel Development

Linux Plumbers Conference 2018 Report

Dhaval Giani, an Oracle Linux kernel developer and development manager, shares some of his thoughts and insights from Linux Plumbers Conference 2018. This is my report from LPC 2018 held in Vancouver, BC on November 13-15, 2018. Oracle had a strong presence at LPC this year. I counted at least 20 of us (so ~4% of the conference attendees). Like last year, I organized the Testing and Fuzzing microconference. It was a popular microconference (I counted over 100 attendees at peak times). Highlights of the testing microconference:

The Automated Testing Summit (ATS) this year was co-located with ELCE. Kevin Hilman provided a report about events at the summit. There was a push to standardize testing procedures, from how testing is done to defining various terms. There was a strong embedded presence this year, but the summit organizers would like to expand it to include more distributions and server folks. A big goal going forward is to unify all the secret sauces which various companies have. LWN has covered this talk, and I highly recommend reading about it here.

KernelCI was another important topic. The goal of this project is to test non-x86 platforms. The original success criterion was a successful compile on the platform. Now, we have reached a point where more often than not, we have a successful boot. This is great work by the kernel community, and KernelCI is only going to get more and more important as we reach a point where we will actually be able to run test suites for more than a few reference platforms. KernelCI will soon be a Linux Foundation project. LWN has covered this as well.

Our own Knut Omang talked about his new "make runchecks" tool, which helps to improve code quality by running a bunch of automated checks (such as checkpatch, smatch, sparse, checkdoc, Coccinelle). The tool aims to categorize errors so that users can filter on them. The talk was quite well received, with a lot of "this is a cool idea" being heard around the room.
Dmitry Vyukov from Google talked about syzkaller and syzbot. These projects are working very well to automate a lot of kernel fuzzing, reporting a number of issues (some of them having security implications). Around 70% of the bugs reported by syzkaller/syzbot are getting fixed upstream. There is also a lot of future automation being worked on.

Matthew Wilcox from Oracle talked about his kernel testing in userspace. Matthew has extracted some kernel code into a library which can then be built into a user space test suite, but also be run as part of kernel test suites. At this point, it only works for Matthew, but if he gets some collaborators, he believes it can be made more generic. Steven Rostedt implemented an ftrace probe filter to inject allocation failures during this talk. This was an enjoyable discussion.

Finally Dan Carpenter, also from Oracle, talked about smatch. Smatch has had quite some success finding issues in the kernel, many of them with security implications. It was a great talk on how smatch worked and what new features were coming. One notable moment was Dan's work with detecting spectre v1 issues, and confirming if something was a real issue or not. Matthew Wilcox stated that the issue Dan highlighted was real and even created a patch for it. This was an excellent session, and I loved the interaction we had. There was talk about expanding ATS next year to get more coverage. Stay tuned to hear more about it.

I also attended an old favourite of mine, the Real Time Scheduling microconference. That was also a fantastic microconference. I have been following the PREEMPT_RT project for many years now, and it is getting close to being all in mainline. Talks have shifted from "when will we get to mainline" to "what do we do after it is mainline" (more testing seemed to be the theme). Oracle's Prakash Sangappa talked about real-time inside namespaces, which led to a spirited discussion.
Once missing bandwidth inheritance comes into place, we will be able to allow real-time inside namespaces, which would make namespaces (and by extension containers) very useful.

Daniel Jordan from Oracle organized the Scalability microconference. This was another exciting microconference, with a lot of great problems being discussed. Steven Sistare and Subhra Mazumdar from Oracle talked about improving the scheduler's load balancer. One of the key issues coming up now is whether the scheduler scales well on both the higher end (big iron) and the lower end (embedded and mobile) at the same time. Is it time to start bringing in tunables? Oracle is currently one of the very few participants looking to improve the high end performance. We will be keeping a close eye on maintaining and improving the performance characteristics of Linux in general and the scheduler in particular. There were also talks on improving hugepages by Mike Kravetz, an Oracle kernel developer (along with Christoph Lameter), and on ktask by Daniel Jordan. This was also a very interactive microconference that I enjoyed greatly.

The rest of my time at the conference was spent in the hallway track, meeting old friends and discussing crazy ideas. We talked about how it is time now for cgroups v3 (just kidding!), and in general about the various other problems people are trying to solve across the stack. The Linux Plumbers Conference 2018 website located here has a link for the detailed conference schedule, which has clickable links for each session. This year, sessions were videotaped so you can watch the presentations as well as the discussions that occurred. Also, the etherpads have all the notes from each session available. Finally, no report of LPC is complete without a mention of the social events. We got to meet a lot of old friends and made new ones. One key takeaway I had was that our push to get patches accepted upstream, and increasing participation, is getting noticed.
It is good to see Oracle's contribution to Linux and open source being acknowledged by other developers. I would like to thank Oracle for sponsoring my travel and the LPC organizing committee for organizing another great edition of the conference.


Linux Kernel Development

Can better task stealing make Linux faster?

Load balancing via scalable task stealing

Oracle Linux kernel developer Steve Sistare contributes this discussion on kernel scheduler improvements.

The Linux task scheduler balances load across a system by pushing waking tasks to idle CPUs, and by pulling tasks from busy CPUs when a CPU becomes idle. Efficient scaling is a challenge on both the push and pull sides on large systems. For pulls, the scheduler searches all CPUs in successively larger domains until an overloaded CPU is found, and pulls a task from the busiest group. This is very expensive, costing 10's to 100's of microseconds on large systems, so search time is limited by the average idle time, and some domains are not searched. Balance is not always achieved, and idle CPUs go unused. I have implemented an alternate mechanism that is invoked after the existing search in idle_balance() limits itself and finds nothing. I maintain a bitmap of overloaded CPUs, where a CPU sets its bit when its runnable CFS task count exceeds 1. The bitmap is sparse, with a limited number of significant bits per cacheline. This reduces cache contention when many threads concurrently set, clear, and visit elements. There is a bitmap per last-level cache. When a CPU becomes idle, it searches the bitmap to find the first overloaded CPU with a migratable task, and steals it. This simple stealing yields a higher CPU utilization than idle_balance() alone, because the search is cheap, costing 1 to 2 microseconds, so it may be called every time the CPU is about to go idle. Stealing does not offload the globally busiest queue, but it is much better than running nothing at all.

Results

Stealing improves utilization with only a modest CPU overhead in scheduler code.
In the following experiment, hackbench is run with varying numbers of groups (40 tasks per group), and the delta in /proc/schedstat is shown for each run, averaged per CPU, augmented with these non-standard stats:

%find - percent of time spent in old and new functions that search for idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
steal - number of times a task is stolen from another CPU.

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
hackbench process 100000
sched_wakeup_granularity_ns=15000000

baseline
grps  time    %busy  slice  sched   idle    wake    %find  steal
1     8.084   75.02  0.10   105476  46291   59183   0.31   0
2     13.892  85.33  0.10   190225  70958   119264  0.45   0
3     19.668  89.04  0.10   263896  87047   176850  0.49   0
4     25.279  91.28  0.10   322171  94691   227474  0.51   0
8     47.832  94.86  0.09   630636  144141  486322  0.56   0

new
grps  time    %busy  slice  sched   idle   wake    %find  steal  %speedup
1     5.938   96.80  0.24   31255   7190   24061   0.63   7433   36.1
2     11.491  99.23  0.16   74097   4578   69512   0.84   19463  20.9
3     16.987  99.66  0.15   115824  1985   113826  0.77   24707  15.8
4     22.504  99.80  0.14   167188  2385   164786  0.75   29353  12.3
8     44.441  99.86  0.11   389153  1616   387401  0.67   38190  7.6

Elapsed time improves by 8 to 36%, costing at most 0.4% more find time. CPU busy utilization is close to 100% for the new kernel, as shown by the green curve in the graph accompanying the original post, versus the orange curve for the baseline kernel. Stealing improves Oracle database OLTP performance by up to 9% depending on load, and we have seen some nice improvements for mysql, pgsql, gcc, java, and networking. In general, stealing is most helpful for workloads with a high context switch rate.

The code

As of this writing, this work is not yet upstream, but the latest patch series is at https://lkml.org/lkml/2018/12/6/1253.
If your kernel is built with CONFIG_SCHED_DEBUG=y, you can verify that it contains the stealing optimization using:

# grep -q STEAL /sys/kernel/debug/sched_features && echo Yes
Yes

If you try it, note that stealing is disabled for systems with more than 2 NUMA nodes, because hackbench regresses on such systems, as I explain in https://lkml.org/lkml/2018/12/6/1250. However, I suspect this effect is specific to hackbench and that stealing will help other workloads on many-node systems. To try it, reboot with kernel parameter sched_steal_node_limit=8 (or larger).

Future work

After the basic stealing algorithm is pushed upstream, I am considering the following enhancements:

If stealing within the last-level cache does not find a candidate, steal across LLC's and NUMA nodes.
Maintain a sparse bitmap to identify stealing candidates in the RT scheduling class. Currently pull_rt_task() searches all run queues.
Remove the core and socket levels from idle_balance(), as stealing handles those levels.
Remove idle_balance() entirely when stealing across LLC is supported.
Maintain a bitmap to identify idle cores and idle CPUs, for push balancing.


Announcements

Announcing Software Collection Library 3.2 for Oracle Linux

We are pleased to announce the release of the Software Collection Library 3.2 to the Unbreakable Linux Network and the Oracle Linux yum server. Software collections are primarily intended for development environments which require access to the latest features of software components such as Perl, PHP, or Python. For these environments, you need to minimize the disruption of system processes that rely on the versions of these components. The Software Collection Library allows you to install and use several versions of the same software on a system, simultaneously, and without disruption. You use the software collection library utility (scl) to run the developer tools from the software collections that you have installed. The scl utility isolates the effects of running these tools from other versions of the same software utilities that you have installed.

Additions and Updates for Oracle Linux 7

The following collections have been added in the 3.2 release of the Software Collection Library:

devtoolset-8
rh-git218
rh-haproxy18
rh-nginx114
rh-nodejs10
rh-perl526
rh-php72
rh-ruby25
rh-varnish5
rh-varnish6

The following collections have been updated in the 3.2 release of the Software Collection Library:

devtoolset-7
httpd24
rh-git29
rh-nodejs6
rh-nodejs8
rh-php70

Software Collections Libraries Available for Oracle Linux 7 (aarch64)

Oracle only provides the latest versions and additions to the software collection library for the Arm (aarch64) platform, and these are only supported for the latest update level of Oracle Linux 7. A subset of the complete software collection library, as available for the x86_64 platform, is available for aarch64.
The following collections are currently available for Oracle Linux 7 (aarch64):

devtoolset-6
devtoolset-7
devtoolset-8
httpd24
oracle-armtoolset-1
python27
rh-git218
rh-git29
rh-maven35
rh-nginx112
rh-nginx114
rh-nodejs10
rh-nodejs6
rh-nodejs8
rh-perl526
rh-php70
rh-php71
rh-php72
rh-python36
rh-ruby25
rh-varnish5
rh-varnish6

The Oracle Linux 7 (aarch64) release of the software collection library additionally includes oracle-armtoolset-1, which provides a solid developer toolset to build code for 64-bit Arm platforms and to compile modules against the provided kernel. This includes version 7.3 of the gcc compiler that is used to build the aarch64 version of UEK R5. Oracle Linux 7 users can find more information in the Software Collection Library 3.2 for Oracle Linux 7 Release Notes in the Oracle Linux 7 documentation library.

Additions and Updates for Oracle Linux 6

The following collections for Oracle Linux 6 have been added in the 3.2 release of the Software Collection Library:

devtoolset-8

The following collections have been updated in the 3.2 release of the Software Collection Library:

devtoolset-7
httpd24
rh-git29
rh-nodejs6
rh-php70

Oracle Linux 6 users can find more information in the Software Collection Library 3.2 for Oracle Linux 6 Release Notes in the Oracle Linux 6 documentation library.

Support

Support for the Software Collection Library is provided at no extra cost to customers with an Oracle Linux Premier Support subscription. If you do not have paid support, you can get peer support via the Oracle Community forums at https://community.oracle.com.

Resources

Oracle Linux Documentation
Oracle Linux Software Download
Oracle Linux Blogs: Oracle Linux Blog, Community Pages
Oracle Linux Social Media: Oracle Linux on YouTube, Oracle Linux on Facebook, Oracle Linux on Twitter
Data Sheets, White Papers, Videos, Training, Support & more
Oracle Linux Product Training and Education: http://oracle.com/education/linux


Announcements

Announcing the general availability of the Unbreakable Enterprise Kernel Release 5 Update 1

The Oracle Linux operating system is engineered for an open cloud infrastructure. It delivers leading performance, scalability and reliability for enterprise SaaS and PaaS workloads as well as traditional enterprise applications. Oracle Linux Support offers access to award-winning Oracle support resources and Linux support specialists; zero-downtime updates using Ksplice; additional management tools such as Oracle Enterprise Manager and Spacewalk; and lifetime support, all at a low cost. And unlike many other commercial Linux distributions, Oracle Linux is easy to download and completely free to use, distribute, and update.

What's New?

The Unbreakable Enterprise Kernel (UEK) Release 5 Update 1 is based on the mainline kernel version 4.14.35 and includes several new features, added functionality, and bug fixes across a range of subsystems.

Notable changes

Improved support for 64-bit Arm (aarch64) architecture. Oracle continues to deliver kernel modifications to improve support for the 64-bit Arm (aarch64) architecture. These changes are built and tested against existing Arm hardware and provide support for Oracle Linux for Arm.

Cgroup v2 CPU controller backported to support kABI. The update includes changes to the code that handles cgroup resource usage statistics and improves performance when handling frequent reads where there are many cgroups that are not active.

Improved scheduler scalability for fast path. This release includes scheduler scalability improvements for the fast path. In addition, a new scheduler feature, SIS_CORE, is introduced to improve performance for certain workloads such as Oracle Database OLTP.

DTrace has been enhanced to include additional runtime options on both the x86_64 and Arm architectures. In addition, DTrace has been enhanced to include the implementation of ustack() along with the implementation of SDT probes, FBT entry probes, and FBT return probes on the Arm architecture.

libnvdimm subsystem updated for PMEM and DAX.
The libnvdimm kernel subsystem, which is responsible for the detection, configuration, and management of Non-Volatile Dual Inline Memory Modules (NVDIMMs), is updated in UEK R5U1 to take advantage of a large number of upstream patches, bug fixes, and backports. Notably, these include fixes to /proc/smaps to reflect the actual PMEM page size and some work to improve Address Range Scrub (ARS). Also included is support for direct access (DAX) page operations on NVDIMMs using either the ext4 or XFS file systems. For more details on these and other new features and changes, please consult the Release Notes for the Unbreakable Enterprise Kernel Release 5 Update 1.

Security (CVE) Fixes

A full list of CVEs fixed in this release can be found in the Release Notes for UEK R5 Update 1.

Supported upgrade path

Customers can upgrade existing Oracle Linux 7 Update 5 (and later) servers using the Unbreakable Linux Network or the Oracle Linux yum server.

Software Download

Oracle Linux can be downloaded, used, and distributed free of charge, and all updates and errata are freely available. This allows you to decide which of your systems require a support subscription and makes Oracle Linux an ideal choice for your development, testing, and production systems. You decide which support coverage is the best for each of your systems individually, while keeping all of your systems up-to-date and secure. For customers with Oracle Linux Premier Support, you also receive access to zero-downtime kernel updates using Oracle Ksplice and support for Oracle OpenStack.

Compatibility

UEK R5 Update 1 is fully compatible with the UEK R5 GA release. The kernel ABI for UEK R5 remains unchanged in all subsequent updates to the initial release. In this release, there are changes to the kernel ABI relative to UEK R4 that require recompilation of third-party kernel modules on the system. Before installing UEK R5, verify its support status with your application vendor.


Linux Kernel Development

Seccomp: Safe and Secure and Slow No More

Linux kernel developer Tom Hromatka has been working in the area of seccomp, looking to improve the performance of large seccomp filters.

Background

Seccomp is a critical component to safely isolate and secure containers by restricting the syscalls that a container is allowed to invoke. In a nod to the many security threats that have arisen lately, current seccomp best practices are to create a (typically large) whitelist of allowed syscalls. This is safer than a small blacklist because new syscalls are occasionally added to the Linux kernel. If a blacklist is used and the seccomp filter and a new kernel are not updated together, malicious code could call this new syscall and use it as an attack vector to harm the system. But libseccomp isn't equipped to manage large whitelists at present. In its current form, it generates a series of sequential if syscall == n statements. Thus, a large seccomp filter can consist of hundreds of classic Berkeley Packet Filter (cBPF) instructions. The kernel must execute every if syscall == n cBPF instruction until the if statement matching the syscall being processed is found. For syscalls near the end of the filter, it can take milliseconds to process the cBPF instructions.

Hope (and better seccomp performance) is on the way!

At Oracle, we are working on significantly improving seccomp performance when running large filters. We have proposed changes to libseccomp to utilize a binary tree which will reduce the cBPF computation time from O(n) down to O(log n). For a seccomp filter with 300 syscalls, this will drastically decrease the number of cBPF instructions executed from 300+ down to as little as 9 instructions.

An Example

We created a simple test program using Docker's default libseccomp filter. In this test program, we call getppid() (a very fast syscall) millions of times and record how quickly it is executed.
By modifying libseccomp and the cBPF instructions it generates, we're able to identify the impact of large syscall filters and their effect on performance. The results are even better than we hoped: the performance of the current libseccomp implementation degrades linearly as the syscall falls later in the filter. Conversely, the performance of the binary tree remains consistent regardless of the location of the syscall within the filter. It is nearly as fast as the best case of the current filter. Expect to see this feature in libseccomp in 2019!

Resources

The binary tree RFC for libseccomp is available here: https://github.com/seccomp/libseccomp/issues/116
Tom presented this topic at Linux Plumbers Conference 2018. The video of that talk and presentation can be found here: https://www.linuxplumbersconf.org/event/2/contributions/213/


Linux Kernel Development

New Concepts in Scalability and Performance

Scalability and Performance Microconference at LPC 2018

This year at the Linux Plumbers Conference, Oracle Linux developer Daniel Jordan co-organized the performance and scalability microconference along with Pasha Tatashin from Microsoft and Ying Huang from Intel. The event had nine speakers, about half of whom were from Oracle, so this was a nice opportunity for our team to raise its concerns with the community. Daniel contributes this writeup on the challenges and opportunities they discussed.

This was a good year for Oracle at the 2018 Linux Plumbers Conference, with several of the attendees telling me that they noticed the heavy representation from Oracle, both in talks and the hallway. Plumbers was most useful for having small, focused discussions that would never happen on mailing lists. You can end up going deeper and finding more common ground with extended face time.

Tim Chen from Intel spoke about a bottleneck in TPC-C with scheduler task accounting for cgroups on multi-socket systems. (Oracle's Unbreakable Enterprise Kernel 5 is configured with the same scheduler options he used in the runs.) An atomic operation to track the aggregate load average in a task group (load_avg in struct task_group) was showing up at the top of his profiles. Rik van Riel pointed out that this wasn't just a problem on multi-socket boxes; they were also seeing this on single-socket systems at Facebook. Peter Zijlstra suggested that Tim revive some old patches that broke this counter up across NUMA nodes, and after further discussion, it was agreed to split along Last-Level Cache (LLC) boundaries instead because systems have grown larger since the patches were last posted. We're hopeful to see good results from these changes soon!

Pasha Tatashin from Microsoft spoke about seamlessly updating a host OS while minimizing downtime of guest VMs, presenting two high-level strategies for the audience to consider.
First, kexec a new host kernel and transfer control between the old and new host kernel as the transition between the two kernels is happening, and second, boot a new host OS inside a VM, migrate the guests into this VM (relying on nested virtualization), and kexec into it, fixing EPT translations before transferring control. There was some concern about how to support SR-IOV devices in the second solution, but in the end, Pasha decided to experiment with the second option.

Steve Sistare and Subhra Mazumdar spoke about scheduler scalability work they've been involved in. Steve's blog post is coming in January; read Subhra's blog post here.

Mike Kravetz and Christoph Lameter led a session on huge page issues in the kernel. Here's an excerpt about the session from Mike: During this MC, Christoph Lameter and myself talked about promoting huge page usage. This was mostly a rehash of material previously presented and discussed. The 'hope' was to spark discussion and possibly new ideas. During this session, one really good suggestion was made: align mmap addresses for THP. I sent out a similar RFC to align for pmd sharing a couple years back (https://lkml.org/lkml/2016/3/28/478) but did not follow through. Will add both to my todo list.

Boqun Feng held a discussion about an issue he had seen with workqueues and CPU hotplug that had come up when optimizing an RCU (Read-Copy Update) path. According to a comment above its definition, queue_work_on requires callers to ensure the requested CPU for the work item can't go offline before the queueing is finished, and the problem was the RCU code path doesn't follow this requirement. Boqun wanted to disable preemption around the queue_work_on call, effectively preventing CPU hotplug, but Thomas Gleixner opposed this, saying that disabling preemption prevents CPU hotplug only by accident and that there was no semantic guarantee for it.
Paul McKenney and Thomas went back and forth about what to do, but to make a long story short, in the hallway afterward it was discovered that the workqueue comment giving the requirement about CPU hotplug was stale. The workqueue splats Boqun Feng saw had some other cause, pending investigation. I spoke about ktask, an interface for parallelizing CPU-intensive kernel work, with the goal of discussing a few of the open problems in this work. The audience had a different idea, and we spent the session fielding questions about how ktask worked and where else it might be used. For example, Junaid Shahid from Google had a use case for ktask to multithread kvm dirty page tracking during live migration, but was concerned about threads having different amounts of work to do in their assigned memory regions such that the load may not be equally shared between them. My new plan, post-conference, will be to alleviate this by splitting up the ranges to be tracked into small pieces interleaved across threads to minimize the chance that one thread would get stuck with a busy range. Finally, Yang Shi led a discussion on mmap_sem, a perennial bottleneck in the kernel that often serializes updates to a process's address space, including its rbtree of VMAs and various fields in mm_struct. This discussion is hard to summarize since there were so many comments: To Yang Shi's suggestion about a per-VMA lock, Davidlohr Bueso said it wouldn't help alleviate contention when multiple threads update the same VMA. Vlastimil Babka suggested splitting large VMAs into many smaller ones, even if they shared the same flags, to make per-VMA locks work better. Rik van Riel was skeptical about this, since application threads may not have an even access pattern across the process's virtual address space. 
On a different topic, Waiman Long warned that a strategy used in one of the recent mmap_sem fixes for alleviating contention, downgrading the holder from writer to reader, may not always be beneficial because readers don't optimistically spin the way writers do.

Matthew Wilcox mentioned a planned experiment to use an RCU-safe B-tree (aka the Maple Tree) to avoid taking mmap_sem for read.

Steve Sistare said the problem with range locks is that you have to traverse a tree of ranges to find which range to operate on, which creates a bottleneck in itself, and suggested potentially using a hashed array of locks, which is parallelizable but can suffer when the VA region to operate on is very large.

Davidlohr Bueso mentioned that a range locking primitive exists already upstream. An rwsem will always outperform it because of optimistic spinning, but the worst case scenario described in the range locking series isn't that much worse than rwsem. He believes the main question right now is how to serialize threads operating on the same VMA.

Laurent Dufour said the problem with the VMA is that there are so many ways to get to it: mm_struct's rbtree, mm_struct's VMA list, and anon_vma lists. Matthew Wilcox agreed, and said the kernel doesn't differentiate the case where the whole address space needs protecting and the case where an individual VMA does. Laurent and Matthew agreed on the need for a per-VMA lock. Matthew hopes for a per-process spinlock for the entire address space and a semaphore for each VMA.

Thanks to Paul McKenney, Davidlohr Bueso, Dave Hansen, and Dhaval Giani, who provided helpful advice in the process of organizing this microconference.

Other Talks

Here are a few recommended talks from LPC. Videos and slides are posted on the talk pages, linked from here.
Mike Kravetz's and Christoph Lameter's "Very large Contiguous regions in userspace" for its useful and very interactive discussion about how to proceed with a common problem between different kernel communities.

"RDMA and get_user_pages" from Matthew Wilcox, Dan Williams, Jan Kara, and John Hubbard for the great audience interaction, problem solving, and interesting technical content.

Vlastimil Babka's "The hard work behind large physical allocations in the kernel" because of how well it laid out current issues in this area. The slide deck is very readable on its own, for those who prefer reading to watching but become frustrated at following powerpoints.

"Concurrency with tools/memory-model" from Andrea Parri and Paul McKenney. If your work involves memory barriers, this is a good one to watch to learn about the expectations of the maintainers of the LKMM (Linux Kernel Memory Model). It turns out they filter for upstream postings containing barriers (e.g. smp_mb) and review the changes to make sure they're correct, and follow the expected commenting style for paired barriers.

Concluding Thoughts

Thanks to everyone who helped organize this event: it is a massive undertaking to make a conference this large happen. Excited for next year!


Announcements

Modularizing the Oracle Linux Yum Server Repository Configurations: Breaking Up is Hard to Do

TL;DR: Oracle Linux yum server changes - Action Required After 18-JAN-2019

Beginning on 18 January 2019:

- the existing repo files (public-yum-ol7.repo and public-yum-ol6.repo) for yum.oracle.com will no longer be updated, in favor of smaller repo files that are more targeted in scope
- running a yum update on an Oracle Linux 6 or Oracle Linux 7 system will automatically install .repo files relevant to your system
- it will be easier to enable specific repositories for Oracle Linux yum server and to keep your yum repository definitions up to date
- to complete the transition from the legacy .repo file, you must run the script /usr/bin/ol_yum_configure.sh after it is installed

Background

Since launching Oracle Linux yum server more than 9 years ago, we've added quite a bit of software: from free updates and errata for Oracle Linux, to Oracle Container Runtime for Docker and Container Services for use with Kubernetes, to Python, Node.js, and PHP with corresponding connectors for Oracle Database. As we continued to add new packages over the years, we simply added more repository definitions to our single yum repo file, turning it into a bit of a monolith. In hindsight, we could have planned this a little better. But let's not dwell on the past. On January 18, 2019 we will publish release RPMs in the _latest repository that deliver separate, smaller repo files. This change will simplify installing new software from Oracle Linux yum server and keeping repository definitions up to date.

New Release RPMs for Oracle Linux 7 and Oracle Linux 6

As part of this change, the following RPMs will be published on Oracle Linux yum server.
- oraclelinux-release-el7, oraclelinux-release-el6: Oracle Linux, UEK & Virtualization tools
- oraclelinux-patchonly-release-el7, oraclelinux-patchonly-release-el6: Oracle Linux patch repositories (for Oracle Cloud Infrastructure customers only)
- oracle-softwarecollection-release-el7, oracle-softwarecollection-release-el6: Software Collection Library for Oracle Linux
- oracle-openstack-release-el7, oracle-openstack-release-el6: Oracle OpenStack for Oracle Linux
- oracle-spacewalk-server-release-el7, oracle-spacewalk-server-release-el6: Spacewalk Server
- oracle-spacewalk-client-release-el7, oracle-spacewalk-client-release-el6: Spacewalk Client
- oracle-gluster-release-el7, oracle-gluster-release-el6: Gluster Storage
- oracle-ceph-release-el7: Ceph Storage
- oracle-release-el7, oracle-release-el6: Oracle Instant Client
- oracle-epel-release-el7: EPEL for Oracle Linux
- oraclelinux-developer-release-el7, oraclelinux-developer-release-el6: Packages for Developers and Oracle Cloud Infrastructure
- mysql-release-el7, mysql-release-el6: MySQL Community releases
- oracle-golang-release-el7: Stable releases of the Go programming language
- oracle-php-release-el7, oracle-php-release-el6: Stable PHP releases
- oracle-nodejs-release-el7, oracle-nodejs-release-el6: Stable Node.js releases

Once these changes take effect, use the following command to see an up to date list of installed and available release packages:

yum list *release-el7

What This Means

Rather than using a single repo file with definitions for all RPMs on Oracle Linux yum server, these release RPMs let you configure and update
repositories for exactly the RPMs you are interested in. Only interested in Oracle Linux 7 with EPEL? Simply install oracle-epel-release-el7 (oraclelinux-release-el7 is already there on your system and will automatically be updated if needed). Let's suppose we release Node.js 11 at some point in the future. Previously, to install Node.js 11, we'd ask you to re-download public-yum-ol7.repo and enable the Node.js repo. Using these release RPMs, you'll run: yum install oracle-nodejs-release-el7. These commands will leave you with a repo file for Oracle Linux, one for the Unbreakable Enterprise Kernel, and one for EPEL or Node.js, as the case may be. These changes do not affect any user-created or custom .repo files under /etc/yum.repos.d/.

What You Need to Do

If you are running Oracle Linux releases or images released after January 18, 2019, whether on premises or in Oracle Cloud Infrastructure, without a monolithic public-yum-ol7.repo or public-yum-ol6.repo file, you will not have to do anything. Simply install the release RPM you are interested in and yum repo files will be configured accordingly. If you have a "legacy system" with a monolithic .repo file, you must complete the transition by running a script. This is so that we can be sure we don't disable previously enabled repositories, resulting in unexpected behavior.

For Legacy Oracle Linux Installations With a public-yum-ol7.repo or public-yum-ol6.repo File

Most of you will fall into this category. If you update an existing oraclelinux-release-el7 package or install one of the new release RPMs, new repo files will be placed into /etc/yum.repos.d/ but they will be disabled via a .disabled extension. Note that a simple yum update may trigger oraclelinux-release-el7 being updated. A script, /usr/bin/ol_yum_configure.sh, will be installed. Run this script to complete the transition to the new Oracle Linux yum server experience.
To complete the yum configuration, run the following as the root user:

/usr/bin/ol_yum_configure.sh

For New Systems Without an Existing public-yum-ol7.repo File

If your system is already configured to use the release RPM-based yum configuration approach, no extra action is required. To see the release RPMs available, then install the ones you are interested in, run:

yum list *release-el7*

To confirm whether your system follows the new approach, check that either oraclelinux-release-el7 or oraclelinux-release-el6 is installed:

rpm -q oraclelinux-release-el7
oraclelinux-release-el7-1-1.el7.noarch

Conclusion

While breaking up can be hard to do, in the long run you will find it makes installing and updating software from Oracle Linux yum server easier. See the Oracle Linux documentation for more details. Also, if you have any questions about these changes or anything else related to Oracle Linux, come find us and other Oracle Linux experts in the Oracle Developer Community.
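To recap the transition logic above in script form, here is a minimal sketch; needs_transition is our own illustrative helper (the file names and script path are the ones from this post):

```shell
# Decide whether a system still uses the legacy monolithic repo file and
# therefore needs /usr/bin/ol_yum_configure.sh run to complete the transition.
needs_transition() {
    repodir="${1:-/etc/yum.repos.d}"
    if [ -f "$repodir/public-yum-ol7.repo" ] || [ -f "$repodir/public-yum-ol6.repo" ]; then
        echo legacy     # run /usr/bin/ol_yum_configure.sh as root
    else
        echo modular    # nothing to do; just install the release RPMs you need
    fi
}
```

On a live legacy system, the follow-up would be something like `[ "$(needs_transition)" = legacy ] && sudo /usr/bin/ol_yum_configure.sh`.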


Announcements

Big News at KubeCon + CloudNativeCon North America 2018: Oracle Cloud Native Framework and Oracle Linux

This post is contributed by Robert Shimp, Group Vice President of Product Management and Strategy, Oracle Linux and Virtualization.  For several years now we have seen the decomposition of applications into microservices running on container infrastructure with developers and operations collaborating using DevOps methodologies. This week at KubeCon, Oracle is introducing its Cloud Native Framework, a consistent, aligned, and unified collection of cloud services and on-premises software based on open, community-driven Cloud Native Computing Foundation (CNCF) projects.  This announcement fills out Oracle’s vision for addressing modern cloud application development and deployment and ushers in a new era for cloud developers and operations. We in the Oracle Linux team are particularly excited about this announcement.  We have included Open Container Initiative (OCI)-compliant container software and CNCF Certified Conformance orchestration software with Oracle Linux for several years.  Earlier this year at Oracle OpenWorld in San Francisco, we announced Oracle Linux Cloud Native Environment, our cloud native development and deployment software, which is being delivered as part of Oracle Linux.  We are planning to make available additional components in 2019. Components available in preview are made available via Oracle Linux yum server or Oracle Container Registry. Customer interest has been overwhelming.  I think it’s because of the value that we are offering.  We are delivering software that supports the open standards, specifications, and APIs defined by CNCF.  In addition, this is the first cloud native solution delivered and supported as both managed cloud services and on-premises software. It is the only solution that provides deployment models for public cloud (Oracle Cloud Infrastructure), hybrid cloud and on-premises users. 
We are the only cloud vendor that supports seamless, bi-directional portability of cloud native applications built anywhere on the Oracle Cloud Native Framework. Applications built on the Oracle Cloud Native Framework will not lock you in.  They are portable to any Kubernetes conformant environment – on any cloud or infrastructure. Oracle is a platinum member of CNCF as well as a platinum member of the Linux Foundation.  Oracle closely tracks the CNCF standards and contributes to the CNCF community. Support for the Oracle Linux Cloud Native Environment is included with an Oracle Linux Premier Support subscription at no additional cost. Getting Started Oracle Linux is freely available to everyone at Oracle Software Delivery Cloud. Updates can be obtained from Oracle Linux yum server.  Oracle VM VirtualBox is the most popular cross-platform virtualization software for development environments. You can download a copy of VirtualBox to run Oracle Linux and the cloud-native software on your desktop and easily deploy to the cloud.  Oracle is offering up to 3,500 free hours on Oracle Cloud to developers that would like to use our cloud for their development environment.


Linux Kernel Development

Linux Scheduler Scalability like a boss!

Oracle Linux developer Subhra Mazumdar has been working on scalability improvements in the Linux scheduler. In this blog post, he talks about some of his latest work. At Oracle, we run big database workloads which are very susceptible to how the OS chooses to schedule threads. Spending too much time in the scheduler to select a CPU can translate to higher transaction latency (at a given throughput) or lower maximum achievable transaction throughput in TPC-C workloads. In this article, we're introducing our "Scheduler Scalability" project, which improves this latency and exposes a knob that allows us to further tune such workloads. The Linux scheduler searches for idle CPUs upon which to enqueue a thread when it becomes runnable. It first searches for a fully idle core using select_idle_core(). If that fails, the scheduler finds any idle CPU via select_idle_cpu(). Both of these routines can end up scanning the entire last level cache (LLC) domain, which is expensive and can hurt context switch intensive workloads like TPC-C, where threads wake up, run for small amounts of time, and go back to sleep. This is a scalability bottleneck on big systems that have a large number of cores per LLC. For such workloads, it is desirable to have a constant bound on the search time while at the same time achieving a good spread of threads. These are two conflicting interests and the right balance needs to be struck. Experimentation with constant upper and lower bounds on the number of CPUs searched in select_idle_cpu() reveals that different bounds work best for different architectures: on an SMT2 Intel processor, an upper bound of 4 and a lower bound of 2 work well, while on an SMT8 SPARC processor, an upper bound of 16 and a lower bound of 8 work well. Expressed in cores, an upper bound of 2 cores and a lower bound of 1 core works on both architectures.
This makes sense, as cores are separate scheduling domains and it is usually a good idea to search beyond the current domain for idle CPUs, since a neighbouring domain may be differently loaded. This can happen because scheduler load balancing works on a per-domain basis. While putting constant bounds on the search reduces search time, it can lead to localization of threads and uneven spreading. To solve this, the scheduler can keep a per-CPU variable to track the boundary of the search. If no idle CPUs are found in one instance, the next search can begin from that boundary, so that idle CPUs elsewhere in the domain are still found quickly. Together these changes work well and improve the scalability of select_idle_cpu(). Next, we focus on select_idle_core(), which searches for a fully idle core (i.e., a core on which all CPUs are idle). Any CPU in such a core is the best CPU to run on, since all of the core's hardware resources can be used to run the thread as fast as possible. While select_idle_core() has a dynamic switch that turns it off if no idle core is present, it is still a bottleneck, since in practice only a few cores may be fully idle and it can end up scanning the entire LLC domain. It is challenging to come up with data structures that make this search fast, as the code path is very sensitive: experiments showed that touching too many cache lines during the search, or using atomic operations, ruins any gains. In practice we found that simply disabling idle core search improves Oracle Database TPC-C on Intel x86 systems, while it regresses some other benchmarks, like hackbench on SPARC systems. This is a common problem in the scheduler: a workload optimized on one architecture can hurt another workload on a different architecture, or even on the same architecture. Linux uses scheduler features to work around this. They can block execution of certain code paths unsuitable for the workload, and can be turned on or off on live systems via /sys/kernel/debug/sched_features.
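As a rough illustration of the sched_features mechanism: the debugfs file holds space-separated feature names, with disabled features shown with a NO_ prefix, and writing the NO_-prefixed name disables a feature at run time. The helper below (feature_state is our own illustrative name, not a kernel interface) parses that format:

```shell
# Minimal sketch of working with /sys/kernel/debug/sched_features.
# Each feature appears by name when enabled and with a NO_ prefix when disabled.
feature_state() {
    features="$1"   # contents of the sched_features file
    name="$2"       # e.g. GENTLE_FAIR_SLEEPERS
    case " $features " in
        *" NO_$name "*) echo disabled ;;
        *" $name "*)    echo enabled ;;
        *)              echo absent ;;   # this kernel doesn't have the feature
    esac
}

# On a live system (as root, with debugfs mounted):
#   feature_state "$(cat /sys/kernel/debug/sched_features)" GENTLE_FAIR_SLEEPERS
# Writing the NO_-prefixed name disables a feature at run time:
#   echo NO_GENTLE_FAIR_SLEEPERS > /sys/kernel/debug/sched_features
```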
A new scheduler feature called SIS_CORE was introduced for this purpose, to disable idle core search at run time. This can be used by Oracle Database instances meant for OLTP.

Results

Following are the performance numbers for various benchmarks with SIS_CORE true (idle core search enabled).

Hackbench process on a 2 socket, 44 core, 88 thread Intel x86 machine (lower is better):

groups  baseline  %stdev  patch            %stdev
1       0.5816    8.94    0.5903 (-1.5%)   11.28
2       0.6428    10.64   0.5843 (9.1%)    4.93
4       1.0152    1.99    0.9965 (1.84%)   1.83
8       1.8128    1.4     1.7921 (1.14%)   1.76
16      3.1666    0.8     3.1345 (1.01%)   0.81
32      5.6084    0.83    5.5677 (0.73%)   0.8

Uperf pingpong on a 2 socket, 44 core, 88 thread Intel x86 machine with message size = 8k (higher is better):

threads  baseline  %stdev  patch            %stdev
8        45.36     0.43    46.28 (2.01%)    0.29
16       87.81     0.82    89.67 (2.12%)    0.38
32       151.19    0.02    153.5 (1.53%)    0.41
48       190.2     0.21    194.79 (2.41%)   0.07
64       190.42    0.35    202.9 (6.55%)    1.66
128      323.86    0.28    343.56 (6.08%)   1.34

Oracle Database on a 2 socket, 44 core, 88 thread Intel x86 machine (normalized, higher is better):

users  baseline  %stdev  patch            %stdev
20     1         0.9     1.0068 (0.68%)   0.27
40     1         0.8     1.0103 (1.03%)   1.24
60     1         0.34    1.0178 (1.78%)   0.49
80     1         0.53    1.0092 (0.92%)   1.5
100    1         0.79    1.0090 (0.9%)    0.88
120    1         0.06    1.0048 (0.48%)   0.72
140    1         0.22    1.0116 (1.16%)   0.05
160    1         0.57    1.0264 (2.64%)   0.67
180    1         0.81    1.0194 (1.94%)   0.91
200    1         0.44    1.028 (2.8%)     3.09
220    1         1.74    1.0229 (2.29%)   0.21

Hackbench process on a 2 socket, 16 core, 128 thread SPARC machine (lower is better):

groups  baseline  %stdev  patch             %stdev
1       1.3085    6.65    1.2213 (6.66%)    10.32
2       1.4559    8.55    1.5048 (-3.36%)   4.72
4       2.6271    1.74    2.5532 (2.81%)    2.02
8       4.7089    3.01    4.5118 (4.19%)    2.74
16      8.7406    2.25    8.6801 (0.69%)    4.78
32      17.7835   1.01    16.759 (5.76%)    1.38
64      36.1901   0.65    34.6652 (4.21%)   1.24
128     72.6585   0.51    70.9762 (2.32%)   0.9

Following are the performance numbers for various benchmarks with SIS_CORE false (idle core search disabled).
Hackbench process on a 2 socket, 44 core, 88 thread Intel x86 machine (lower is better):

groups  baseline  %stdev  patch             %stdev
1       0.5816    8.94    0.5835 (-0.33%)   8.21
2       0.6428    10.64   0.5752 (10.52%)   4.05
4       1.0152    1.99    0.9946 (2.03%)    2.56
8       1.8128    1.4     1.7619 (2.81%)    1.88
16      3.1666    0.8     3.1275 (1.23%)    0.42
32      5.6084    0.83    5.5856 (0.41%)    0.89

Uperf pingpong on a 2 socket, 44 core, 88 thread Intel x86 machine with message size = 8k (higher is better):

threads  baseline  %stdev  patch             %stdev
8        45.36     0.43    46.94 (3.48%)     0.2
16       87.81     0.82    91.75 (4.49%)     0.43
32       151.19    0.02    167.74 (10.95%)   1.29
48       190.2     0.21    200.57 (5.45%)    0.89
64       190.42    0.35    226.74 (19.07%)   1.79
128      323.86    0.28    348.12 (7.49%)    0.77

Oracle Database on a 2 socket, 44 core, 88 thread Intel x86 machine (normalized, higher is better):

users  baseline  %stdev  patch             %stdev
20     1         0.9     1.0056 (0.56%)    0.34
40     1         0.8     1.0173 (1.73%)    0.13
60     1         0.34    0.9995 (-0.05%)   0.85
80     1         0.53    1.0175 (1.75%)    1.56
100    1         0.79    1.0151 (1.51%)    1.31
120    1         0.06    1.0244 (2.44%)    0.5
140    1         0.22    1.034 (3.4%)      0.66
160    1         0.57    1.0362 (3.62%)    0.07
180    1         0.81    1.041 (4.1%)      0.8
200    1         0.44    1.0233 (2.33%)    1.4
220    1         1.74    1.0125 (1.25%)    1.41

Hackbench process on a 2 socket, 16 core, 128 thread SPARC machine (lower is better):

groups  baseline  %stdev  patch            %stdev
1       1.3085    6.65    1.2514 (4.36%)   11.1
2       1.4559    8.55    1.5433 (-6%)     3.05
4       2.6271    1.74    2.5626 (2.5%)    2.69
8       4.7089    3.01    4.5316 (3.77%)   2.95
16      8.7406    2.25    8.6585 (0.94%)   2.91
32      17.7835   1.01    17.175 (3.42%)   1.38
64      36.1901   0.65    35.5294 (1.83%)  1.02
128     72.6585   0.51    71.8821 (1.07%)  1.05


Linux

Encrypting NFS data on the Wire

Oracle Linux developer Chuck Lever has been collaborating on a new internet draft standard to bring transparent, end-to-end encryption to NFS (actually, to all RPC-based protocols). As more Linux workloads traverse shared network infrastructure, we have seen an uptick in requests for encryption of network traffic. While there are many ways to do point-to-point traffic encryption, leading members of the Linux NFS community have proposed a different, and simpler, strategy for achieving over-the-wire encryption of NFS traffic. Linux NFS maintainer Trond Myklebust and Oracle Linux developer Chuck Lever propose NFS-over-TLS, a transparent, easy-to-configure end-to-end encryption standard for RPC-based protocols like NFS. This solution relies on self-signed certificates to set up standard encryption for NFS over-the-wire traffic without the heavy overhead of Kerberos or Active Directory. There are many ways to encrypt NFS traffic over the wire, including IPsec and Kerberos, but in their current incarnations each has significant drawbacks that keep most users away. Much like HTTPS, this proposal to enable RPC-over-TLS makes encryption the "easy" option by opting for self-signed certificates. Although the standard is put forward as the simplest, easiest-to-use solution, it also provides unique benefits in cases where the alternative encryption solutions may not have good answers: for example, per-flow encryption as opposed to the per-connection encryption of IPsec, or deployments where the customer's user authentication domain is separate from the host's identity management (as is often the case in cloud environments!). There are plenty of deployment cases where the client and server trust each other already, and all that is needed is protection of the NFS traffic as it flows over an untrusted network.
Most NFS works this way already: a tenant trusts the IP addresses provided by the DNS service, but does not trust the other tenants not to spy on the traffic. This solution takes a cue from HTTPS for encrypting web traffic: it focuses on encryption separately from authorization and authentication. While it is not as full-featured as the user authentication solutions, it is usable with minimal configuration required of an administrator. And the standard would be rolled out with that in mind, defaulting to a "use-if-available" model: if both ends support it and there is sufficient certificate trust available, NFS traffic is encrypted. Someday this could mean that all NFS traffic is transparently encrypted, as the capability rolls out to NFS clients and servers. This is still a draft standard, so don't expect it on your Oracle Linux servers very soon, but it's already starting to get talked about in the industry press.


Linux

5 things you may not know about Ksplice

Ksplice is a cool technology and I wanted to share a few things that you might not know about it, along with a few tips on how to get started if you aren't already using it.

1. New patching advances

As the security landscape evolves, so does Ksplice, keeping up with more and more complex patches. These new changes and techniques allow Ksplice to safely patch even more of the kernel entry assembly code, even on heavily loaded systems. To date, Ksplice is the only technology that has been able to live patch CVE-2018-3639 (Spectre v4) and CVE-2018-3620+CVE-2018-3646 (L1 Terminal Fault), the latter comprising thousands of lines of changes across the kernel. The Linux kernel continues to advance over time, gaining new features and optimizations to scale to all kinds of workloads, and Oracle Ksplice advances in kind. The Ksplice team is actively developing Ksplice, making sure that we can give the best patching experience with every supported kernel in all configurations. This includes safe integration with DTrace probes in Oracle UEKR4 and UEKR5, full support for Meltdown mitigations including both KAISER and KPTI with no reduction in patch coverage, and support for linker optimization in modern Fedora toolchains. We have optimized some of the Ksplice core to minimize the period the system is paused during the safety checks, ensuring that we can scale to larger SMP systems with no visible impact on the running workload. These enhancements allow us to make sure that we are patching all of the issues that you care about on a running system. With recent developments in speculative side channel attacks, we have seen an increase in the number and complexity of patches to some of the lowest levels of the kernel, which has resulted in some new patching techniques for Ksplice.

2.
A dizzying number of supported kernels

Ksplice supports a wide range of kernels, from the Oracle Linux 5 2.6.18 series (32-bit, 32-bit PAE, 64-bit, and Xen paravirt) to the latest Fedora 28 4.17 64-bit kernels. At any one time, Ksplice supports around 5,000 binary kernels. In the extreme case of a wide-reaching vulnerability affecting all kernels, Ksplice releases updates for all of them in short order. Today, Ksplice supports kernels almost 8 years old, with the oldest kernels having over 700 unique fixes applied through Ksplice. That's a lot of reboots saved. When Ksplice updates are created, we take fixes from the most recently released kernel in a series and then iteratively apply those fixes to all older kernels in that series, as far as they are applicable. In some of the older series this means a lot of versions: at the time of writing, Oracle UEKR2 has 118 distinct source releases, each requiring patches to be backported. Ksplice will take each of the new fixes that are applicable to already running systems and iteratively backport them to all of those older releases where appropriate. You can check if your kernel is supported by Ksplice with our inspector, which will helpfully show you a list of fixes that you could apply with Ksplice today without any downtime.

3. User-space + Xen

Since Oracle acquired Ksplice in 2011, we have continued to invest heavily in the technology and added several noteworthy new features for which other Linux distributions have no competitive offering. In 2015, Oracle introduced Ksplice patching for user space on Oracle Linux 6 and 7 for key components: glibc and OpenSSL. glibc is fundamental to almost all Linux applications, providing the core functions for memory management, networking, threading, and many other essentials.
OpenSSL is the Oracle Linux library used for SSL/TLS and many other common cryptographic functions, seeing use in many security-sensitive applications such as web servers, SSH, postfix, NTP, and many other network clients and servers. When a vulnerability is found in one of these core libraries, a new RPM is created and can be installed on systems, with newly executed processes using the patched version. However, already running applications will continue to use the vulnerable code, and it can be extremely hard even to determine which applications are still using the old libraries and then schedule those to be restarted without any customer-visible downtime. Patching these vulnerabilities with Ksplice means no application downtime or reboots, with all of the same deployment options that you are used to with Ksplice kernel updates. Ksplice has patched some critical, high profile vulnerabilities since we started supporting these libraries, including CVE-2015-7547, a remote code execution bug in the glibc DNS resolver, and CVE-2016-0800 (DROWN), a cipher downgrade in OpenSSL. With OVM 3.4.5, Ksplice can now patch the Xen hypervisor and user-space components such as xenstored, libxenctrl, and qemu. This means you can have a fully live-patchable virtualization stack, from the hypervisor, through the Dom0 kernel and user space, to the guests themselves, something that no other Linux vendor can offer.

4. OCI

Ksplice is just as important in a cloud environment as in an on-premises environment, and we've made it incredibly easy to get started with in Oracle Cloud Infrastructure IaaS. OCI Oracle Linux images come with an Oracle Linux Premier Support entitlement and have Ksplice pre-installed with no registration required. Simply create an instance and you're one "uptrack-upgrade" away from having the latest security patches applied to your kernel with zero downtime. For legacy or custom images, Ksplice can be installed with a simple script, again without registration.
Simply run the following commands in your instance:

# wget -N https://ksplice.oracle.com/uptrack/install-uptrack-oc
# sh install-uptrack-oc --autoinstall

and the system will start automatically installing Ksplice kernel updates without any further interaction and no downtime. Ksplice isn't just available inside OCI tenancies, though: Ksplice powers it. The same technology you use inside your instances is also used to proactively patch OCI without any disruptive downtime, keeping everything running securely and stably.

5. Safety

One of the key principles behind Ksplice is that security and safety come first. This means delivering the right patches quickly and applying them to the running system without any visible side effects. Doing so means that we need to handle a variety of edge cases and unexpected setups; Ksplice is the only live patching system that fully covers these cases. We'll look at a couple of them here, but as always, the devil is in the detail, and there are a lot of details in live patching! Firstly, Ksplice performs integrity checks: we want to make sure that the code you are running on your system matches what we expect, so that a patch doesn't get applied to incorrect code and either make things worse or fail to close the vulnerability. A mismatch sounds unlikely, but there are a number of things that could cause it to happen: you could be running modules from a different kernel, or a compatible module provided by a hardware vendor; or another application could have modified the running code, such as antivirus or intrusion detection software, or even a rootkit or virus. Ksplice handles these conditions automatically, checking every byte of the compilation units that we want to patch and making sure that they match exactly. If a mismatch is found, we safely abort patching and explain why. Secondly, we make sure that we're not replacing anything that is in use. We employ conservative checks here, doing full stack walks.
Ksplice makes sure not only that a function to be patched is not currently being called, but additionally that no local function pointers or data pointers that could call the wrong version of the code are present. Simple frame pointer based stack walks do not cover these cases. Simple walks might work most of the time, but failure to make thorough safety checks could result in a crash or, even worse, a new security vulnerability. Lastly, Ksplice isn't just in-memory patching. The Linux kernel can be changed at runtime by loading new modules, either on explicit request by the user or automatically when doing things like mounting filesystems, opening a network socket, hot plugging hardware, or using a new cryptographic algorithm. Ksplice handles these unloaded modules gracefully, providing on-disk patched modules and arranging for the patched versions to be loaded rather than the old vulnerable ones. Without this, it would be possible to load vulnerable module code, leaving a system unstable or open to exploitation.

Conclusions

Ksplice offers unparalleled functionality and safety, allowing system administrators to take control of patching in all deployments, from on premises to OCI. By leveraging Ksplice in your environment you can avoid unplanned reboots and rapidly patch against the latest vulnerabilities with minimal configuration and maintenance. If you aren't already using Ksplice, why not give it a go? Oracle Linux instances in OCI come preconfigured with Ksplice, and for all other uses, please visit the Ksplice website to learn how to get started in a few minutes.


Linux Kernel Development

Making kernel tasks faster with ktask, an update

Kernel developer Daniel Jordan got a nice writeup on LWN.net for his work on ktask. Daniel wrote a blog post about ktask when the first version of this work was submitted to the Linux kernel community. Since then, the code has evolved to cover many additional dimensions in order to help it integrate with other systems. LWN.net subscribers can learn more in this recent writeup on the evolution of ktask. Ktask is a generic framework for kernel task parallelization: any task which is currently single-threaded in the kernel can be broken up into workable chunks and handed off to the ktask helper, which makes clever scheduling and CPU participation decisions to ensure that the task finishes quickly. This change is not automatic; ktask introduces a coding construct which must be used by developers who wish to take advantage of the parallelized functionality. Memory initialization (page zeroing) will be helped considerably by ktask's parallelization. Initializing memory is a critical task done by the OS to keep data secure, and can be a significant factor in the startup time for database applications and for virtual machines. ktask allows such work to be spread out across all the cores on a system, letting it scale with the CPUs available. As the patches have been reviewed and revised, more use cases have bubbled up, and we're excited to see more opportunities to make this generic framework useful for the kernel, including parallelizing kernel operations in the InfiniBand driver, improving vfio performance, and more!

ktask: parallelize CPU-intensive kernel work

ktask is a generic framework for parallelizing CPU-intensive work in the kernel. The intended use is for big machines that can use their CPU power to speed up large tasks that can't otherwise be multithreaded in userland.
The API is generic enough to add concurrency to many different kinds of tasks--for example, page clearing over an address range or freeing a list of pages--and aims to save its clients the trouble of splitting up the work, choosing the number of helper threads to use, maintaining an efficient concurrency level, starting these threads, and load balancing the work between them.

Some Results

Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
Test: Clear a range of gigantic pages (triggered via fallocate)

  nthread  speedup  size (GiB)  min time (s)  stdev
        1        -         100         41.13   0.03
        2    2.03x         100         20.26   0.14
        4    4.28x         100          9.62   0.09
        8    8.39x         100          4.90   0.05
       16   10.44x         100          3.94   0.03
        1        -         200         89.68   0.35
        2    2.21x         200         40.64   0.18
        4    4.64x         200         19.33   0.32
        8    8.99x         200          9.98   0.04
       16   11.27x         200          7.96   0.04
        1        -         400        188.20   1.57
        2    2.30x         400         81.84   0.09
        4    4.63x         400         40.62   0.26
        8    8.92x         400         21.09   0.50
       16   11.78x         400         15.97   0.25
        1        -         800        434.91   1.81
        2    2.54x         800        170.97   1.46
        4    4.98x         800         87.38   1.91
        8   10.15x         800         42.86   2.59
       16   12.99x         800         33.48   0.83

This data shows the speedup for zeroing large amounts of memory and the advantage of spreading the task across available cores. Raw data for these results. We look forward to seeing ktask as part of upstream Linux!
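The chunk-and-dispatch pattern that ktask applies inside the kernel can be illustrated in user space. The Python sketch below is only an analogue, not the in-kernel ktask API: the names (`parallel_zero`, `zero_chunk`) are invented for illustration, and real speedups in Python would require processes or GIL-releasing work rather than plain threads.

```python
# Illustrative user-space analogue of the chunking strategy ktask uses:
# split one large job into fixed-size chunks and hand the chunks to a
# pool of helper workers. This is NOT the in-kernel ktask API, just a
# sketch of the same divide-and-dispatch idea.
from concurrent.futures import ThreadPoolExecutor

def zero_chunk(buf, start, end):
    # Stand-in for the per-chunk work (e.g. clearing one range of pages).
    for i in range(start, end):
        buf[i] = 0

def parallel_zero(buf, nthreads=4, chunk_size=1024):
    # Split [0, len(buf)) into chunks, then let the pool balance them
    # across helper threads, mirroring ktask's helper scheduling.
    ranges = [(s, min(s + chunk_size, len(buf)))
              for s in range(0, len(buf), chunk_size)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        for start, end in ranges:
            pool.submit(zero_chunk, buf, start, end)
    # Leaving the with-block waits for all chunks to finish.

data = bytearray(b"\xff" * 8192)
parallel_zero(data)
print(all(b == 0 for b in data))  # prints True
```

The key design point mirrored here is that the caller only describes the work and a chunk operation; the framework decides how many helpers to run and how to balance the chunks.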


Announcements

Announcing the release of Oracle Linux 7 Update 6

Oracle is pleased to announce the general availability of Oracle Linux 7 Update 6 for the x86_64 and Arm architectures. You can find the individual RPM packages on both the Unbreakable Linux Network (ULN) and the Oracle Linux yum server. ISO installation images will soon be available for download from the Oracle Software Delivery Cloud, and Docker images will soon be available via Oracle Container Registry and Docker Hub.

Oracle Linux 7 Update 6 ships with the following kernel packages:

Unbreakable Enterprise Kernel (UEK) Release 5 (4.14.35-1818.3.3) for x86_64 and Arm
Red Hat Compatible Kernel (3.10.0-957) for x86_64 only

Application Compatibility

Oracle Linux maintains user space compatibility with Red Hat Enterprise Linux (RHEL), which is independent of the kernel version that underlies the operating system. Existing applications in user space will continue to run unmodified on Oracle Linux 7 Update 6 with UEK Release 5, and no re-certifications are needed for applications already certified with Red Hat Enterprise Linux 7 or Oracle Linux 7.

Notable new features in this release

Pacemaker now supports path, mount, and timer systemd unit files. Previous releases of Pacemaker supported only service and socket systemd units; other unit types would fail.

Package installation and upgrade using rpm can be tracked using audit events. The RPM package manager has been updated to emit audit events so that software package installation and updates can be tracked using the Linux Audit system. Software installation and upgrades using yum are also tracked.

Features specific to the x86_64 architecture

Clevis support for TPM 2.0. The Clevis automated encryption framework, which can automatically encrypt or decrypt data or unlock LUKS volumes, has been updated to support the encryption of keys in a Trusted Platform Module 2.0 (TPM2) chip. Note that this feature is only available for x86_64 systems.
Features now available as a technology preview on the x86_64 architecture

Block and object storage layouts for parallel NFS (pNFS)
DAX (Direct Access) for direct persistent memory mapping from an application. This is a technology preview for the ext4 and XFS file systems
Multi-queue I/O scheduling for SCSI (scsi-mq). Please note that this functionality is disabled by default

Features specific to the Arm architecture

DTrace has been enabled for Arm platforms, and ports of the DTrace code are available in the Unbreakable Enterprise Kernel Release 5 channel on the Oracle Linux yum server. The DTrace user space code in the dtrace-utils package has been ported to run on 64-bit Arm platforms to fully enable DTrace for Oracle Linux 7 Update 6 (aarch64).

For more details on these and other new features and changes, please consult the Oracle Linux 7 Update 6 Release Notes and the Oracle Linux 7 Update 6 (aarch64) Release Notes in the Oracle Linux Documentation Library. Btrfs continues to be fully supported in Oracle Linux 7 Update 6 with UEK R5. Btrfs support is deprecated in the Red Hat Compatible Kernel.

Oracle Linux Support Options

Oracle Linux can be downloaded, used, and distributed free of charge, and all updates and errata are freely available. This makes Oracle Linux an ideal choice for development, testing, and production systems. Customers decide which support coverage is best for each individual system while keeping all systems up to date and secure. Customers with Oracle Linux Premier Support also receive support for additional Linux programs, including Ceph Storage, Oracle Linux software collections, Oracle OpenStack, and zero-downtime kernel updates using Oracle Ksplice. For more information about Oracle Linux, please visit www.oracle.com/linux.


Technologies

Installing cx_Oracle and Oracle Instant Client via Oracle Linux Yum Server

Note: this post was updated on 30 January 2019 to include simplified installation of Oracle Instant Client via Oracle Linux yum server.

cx_Oracle enables access to Oracle Database from Python and conforms to the Python database API specification. The module works with Oracle Database 11g and 12c and with both Python 2.x and 3.x. We have just released the first RPM builds of cx_Oracle on the Oracle Linux yum server, including the latest cx_Oracle 7.0. You can find them in the Oracle Linux 7 (x86_64) Development (ol7_developer) and Oracle Linux 6 (x86_64) Development (ol6_developer) repositories.

This post covers the steps to install and set up cx_Oracle 7.0 with the default Python 2.7.5 on Oracle Linux 7. I used our latest Oracle Linux 7 Vagrant box.

1. Confirm Yum Configuration

First, verify your Oracle Linux yum server configuration, as we've recently made some changes in the way repository definitions are delivered. Follow the steps here to verify your setup.

2. Install Appropriate Release Packages for Instant Client and cx_Oracle

Once you've verified your yum configuration, install the oracle-release-el7 and oraclelinux-developer-release-el7 release packages to set up yum repository access for Oracle Instant Client and cx_Oracle.

$ sudo yum -y install oracle-release-el7 oraclelinux-developer-release-el7

3. Install cx_Oracle RPM

Note that case matters here. The RPM is called python-cx_Oracle.

$ sudo yum -y install python-cx_Oracle
...
Running transaction
  Installing : oracle-instantclient18.3-basic-18.3.0.0.0-2.x86_64   1/2
  Installing : python-cx_Oracle-7.0-1.0.1.el7.x86_64                2/2
  Verifying  : python-cx_Oracle-7.0-1.0.1.el7.x86_64                1/2
  Verifying  : oracle-instantclient18.3-basic-18.3.0.0.0-2.x86_64   2/2

Installed:
  python-cx_Oracle.x86_64 0:7.0-1.0.1.el7

Dependency Installed:
  oracle-instantclient18.3-basic.x86_64 0:18.3.0.0.0-2

Complete!

4. Add the Oracle Instant Client to the Runtime Link Path

cx_Oracle depends on Oracle Instant Client.
During OpenWorld 2018 we released Oracle Instant Client 18.3 RPMs on Oracle Linux yum server in the ol7_oracle_instantclient and ol6_oracle_instantclient repositories, making installation a breeze. Oracle Instant Client was installed as a dependency of cx_Oracle in the previous step. Older releases of Oracle Instant Client are available on OTN.

Before you can make use of Oracle Instant Client, set the runtime link path so that cx_Oracle can find the libraries it needs to connect to Oracle Database.

$ sudo sh -c "echo /usr/lib/oracle/18.3/client64/lib > /etc/ld.so.conf.d/oracle-instantclient.conf"
$ sudo ldconfig

5. Test connection to Oracle Database

$ python
Python 2.7.5 (default, Nov 1 2018, 03:12:47)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36.0.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cx_Oracle
>>> db = cx_Oracle.connect("scott/tiger@10.0.1.127/orclpdb1")
>>> db.version
'12.2.0.1.0'
>>>

These cx_Oracle RPMs offer Python-on-Oracle developers a quick and straightforward way to get started. Give it a try and let us know what you think in the comments or in the Python and Oracle Developer Community.

Read on: in this next post I go into more detail about what the different cx_Oracle RPMs are for, and I show how to install our cx_Oracle RPM on Oracle Linux 6 to connect Python 3.5 to Oracle Database. See this page for information about Python on Oracle Linux.
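The connect string used in the interactive session above follows Oracle's EZConnect format (user/password@host/service). As a sketch, the helper below builds such a host:port/service string; `make_ez_connect` is a hypothetical function invented for illustration (it is not part of cx_Oracle, which has its own `makedsn`), and the default listener port 1521 is an assumption.

```python
# Hypothetical helper (NOT part of cx_Oracle) that builds an EZConnect
# string of the form host:port/service, matching the session above.
# Port 1521 is the conventional Oracle listener port, assumed here.
def make_ez_connect(host, service, port=1521):
    return "{0}:{1}/{2}".format(host, port, service)

dsn = make_ez_connect("10.0.1.127", "orclpdb1")
print(dsn)  # prints 10.0.1.127:1521/orclpdb1

# The connection itself requires the python-cx_Oracle RPM installed as
# shown in step 3; guarded here so the sketch runs without it.
try:
    import cx_Oracle
    db = cx_Oracle.connect("scott", "tiger", dsn)
    print(db.version)
except ImportError:
    pass
```

In production you would of course replace the scott/tiger demo credentials and the sample host with your own.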


Linux

How to Install Node.js 10 with node-oracledb and Connect it to Oracle Database

A few months ago we added dedicated repositories for Node.js to the Oracle Linux yum server. These repos also include an RPM with the Oracle Database driver for Node.js, node-oracledb, so you can connect your Node.js application to Oracle Database. In this post I describe the steps to install Node.js 10 and node-oracledb, and to connect Node.js to Oracle Database. If you are in a rush or want to try this out in a non-destructive way, I recommend you use the latest Oracle Linux 7 Vagrant box.

Grab the Latest Oracle Linux Yum Server Repo File

First, make sure you have the most recent Oracle Linux yum server repo file by grabbing it from the source:

$ sudo mv /etc/yum.repos.d/public-yum-ol7.repo /etc/yum.repos.d/public-yum-ol7.repo.bak
$ sudo wget -O /etc/yum.repos.d/public-yum-ol7.repo http://yum.oracle.com/public-yum-ol7.repo

Enable Node.js 10 Repo, Install Node.js and node-oracledb

$ sudo yum -y install yum-utils
$ sudo yum-config-manager --enable ol7_developer_nodejs10 ol7_oracle_instantclient
$ sudo yum -y install nodejs node-oracledb-node10

Connecting to Oracle Database

For my testing I used Oracle Database 18c Express Edition (XE). You can download it here. Quick Start instructions are here.

About Oracle Instant Client

node-oracledb depends on Oracle Instant Client. During OpenWorld 2018 we released Oracle Instant Client 18.3 RPMs on Oracle Linux yum server in the ol7_oracle_instantclient and ol6_oracle_instantclient repositories, making installation a breeze. Assuming you have enabled the repository for Oracle Instant Client appropriate for your Oracle Linux release, it will be installed as a dependency. As of release 3.0, node-oracledb is built with Oracle Client 18.3, which connects to Oracle Database 11.2 and greater. Older releases of Oracle Instant Client are available on OTN. Add the Oracle Instant Client to the runtime link path:
$ sudo sh -c "echo /usr/lib/oracle/18.3/client64/lib > /etc/ld.so.conf.d/oracle-instantclient.conf"
$ sudo ldconfig

A Quick Node.js Test Program Connecting to Oracle Database

I copied this file from the examples in the node-oracledb GitHub repo. Running it will tell us whether Node.js can connect to the database. Copy the code into a file called connect.js. The dbconfig.js file comes from the same GitHub repo; copy its code into a file called dbconfig.js and edit it to include your database username, password, and connect string.

Run connect.js with node

Before running connect.js, make sure NODE_PATH is set so that the node-oracledb module can be found.

$ export NODE_PATH=`npm root -g`
$ node connect.js
Connection was successful!


Events

How to spend your last day at Oracle OpenWorld 2018

It’s the last day of Oracle OpenWorld 2018, and it has a lot to offer. You’ll find plenty of useful information in these sessions and HOLs. Join us and soak it all up!

Oracle Linux Is Really the Ideal Linux for Oracle Cloud Developers [DEV6017]
SPEAKERS: Wim Coekaerts, Senior Vice President, Operating Systems and Virtualization Engineering, Oracle
09:00 AM - 09:45 AM | Moscone West - Room 2003

Build an ARM64-Based Solution with Oracle Linux [PRM4722]
SPEAKERS: Michele Resta, Product Management Sr. Director - Alliances, Oracle; Honglin Su, Sr. Director of Product Management, Oracle
09:00 AM - 09:45 AM | Moscone West - Room 2000

Building a Cost-Effective Cloud with Oracle OpenStack and Oracle's x86 Servers [PRO4786]
SPEAKERS: Joshua Rosen, Oracle; Dilip Modi, Principal Product Manager, Oracle OpenStack, Linux and VM Development, Oracle; Subban Raghunathan, VP, Product Management, Oracle
09:00 AM - 09:45 AM | Moscone South - Room 207

Infrastructure as Code on Oracle Cloud Infrastructure with Terraform [HOL5139]
SPEAKERS: Christophe Pauliat, Oracle Solution Center Sales Consultant, Oracle; Simon Hayler, Sr Principal Technical Product Manager, Oracle; Paul Bramy, CEO, reloca; Matthieu Bordonne, Oracle Solution Center Sales Consultant, Oracle
10:30 AM - 11:30 AM | Marriott Marquis (Yerba Buena Level) - Salon 12/13

Strategy and Insights from the Oracle Linux and Oracle VM Product Management Team [BQS4730]
SPEAKERS: Avi Miller, Product Management Director, Oracle; Robert Shimp, Product Management Group Vice President - Oracle Linux, Virtualization and Linux and VM Development, Oracle; Honglin Su, Sr. Director of Product Management
11:00 AM - 11:45 AM | Moscone South - Room 206

Observing and Optimizing Your Application on Oracle Linux with DTrace [HOL6339]
SPEAKERS: Jeff Savit, Director, Oracle
12:00 PM - 01:00 PM | Marriott Marquis (Yerba Buena Level) - Salon 12/13

Detecting and Blocking Attacks with Oracle Audit Vault and Database Firewall [PRO4110]
SPEAKERS: Russ Lowenthal, Director, Product Management, Oracle; Ram Subramanian, Director, Database Services, Symantec Corporation; Rohit Muttepawar, IT Architect - Database Platform, Symantec Corporation
Oct 25, 12:00 PM - 12:45 PM | Moscone West - Room 3006

Embrace Open Source Projects on GitHub for Cloud Automation [TIP5795]
SPEAKERS: Avi Miller, Product Management Director, Oracle; Simon Coter, Director of Product Management, Linux and Virtualization, Oracle
12:00 PM - 12:45 PM | Moscone South - Room 160

How Oracle Linux Delivers Superior Application Scalability for Exadata [PRO5798]
SPEAKERS: Swamy Kiran, Infrastructure Architect / DBA Team Technical Lead, The World Bank Group; Sudhakar Dindukurti, Oracle
12:00 PM - 12:45 PM | Moscone West - Room 2000

Why Oracle Linux Is the Best Platform for Oracle Database and Oracle Cloud [PRO5797]
SPEAKERS: Ravi Thammaiah, Director of Software Development, Oracle; Dhaval Giani, Oracle
01:00 PM - 01:45 PM | Moscone South - Room 207

Oracle Database 18c: Reliable DevOps with Vagrant, Oracle VM VirtualBox, and Oracle Linux [HOL6394]
SPEAKERS: Gerald Venzl, Senior Principal Product Manager, Oracle; Simon Coter, Director of Product Management, Linux and Virtualization, Oracle
01:30 PM - 02:30 PM | Marriott Marquis (Yerba Buena Level) - Salon 12/13


Events

The first half of Oracle OpenWorld 2018 was a hit--Wednesday’s sessions will continue to impress

The first two days of Oracle OpenWorld 2018 are in the record books. Sessions were well attended, and the Infrastructure Technologies showcase, #120, drew a steady crowd and long lines at the VR game. Ajay Srivastava, Senior Vice President, Oracle, presented an “Overview of Oracle Infrastructure Technologies” to a full house, providing a behind-the-scenes look at the Oracle servers, Linux operating system, virtualization, and other software components that power Oracle Cloud. Ajay offered this takeaway: “Oracle Linux is the only OS on this planet that allows you to apply patches with zero-downtime. It’s free in Oracle Cloud.”

Sessions recommended for Wednesday, October 24:

Provide Zero Downtime Update for Your Cloud Infrastructure [HOL6340]
08:00 AM - 09:00 AM | Marriott Marquis (Yerba Buena Level) - Salon 12/13
SPEAKERS: Christophe Pauliat, Oracle Solution Center Sales Consultant, Oracle; Simon Coter, Director of Product Management, Linux and Virtualization, Oracle

Keynote: The Role of Security and Privacy in a Globalized Society—Threats, Implications and Opportunities [KEY6573]
09:00 AM - 10:30 AM | Moscone North - Hall D
SPEAKERS: Mark Hurd, Chief Executive Officer, Oracle; General Michael Hayden, Former Director of the CIA and NSA; Jeh Johnson, Former Secretary of Homeland Security; Sir John Scarlett, KCMG OBE, Former Chief of the British Secret Intelligence Service; Edward Screven, Chief Corporate Architect, Oracle

The OS Factor: Advice for the Technology Buyer from IDC [BUS4729]
11:15 AM - 12:00 PM | Moscone West - Room 2000
SPEAKERS: Karen Sigman, Vice President, Product and Partner Marketing, Oracle; Ashish Nadkarni, Research Director, IDC

Secure and Agile Orchestration for Linux Containers [TRN4723]
12:30 PM - 01:15 PM | Moscone West - Room 2000
SPEAKERS: Avi Miller, Product Management Director, Oracle

AMD EPYC: Freeing the Data Center [PRM6946]
12:30 PM - 01:15 PM | Moscone South - Room 154
SPEAKERS: Rajan Panchapakesan, Oracle; Daniel Bounds, Sr. Director, Product Management, AMD

Oracle Private Cloud Appliance: Deploy Your Private Cloud IaaS Out-of-the-Box [CAS6167]
12:30 PM - 01:15 PM | Moscone South - Room 214
SPEAKERS: Sam K Tan, Product Manager, ODA and PCA, Oracle; Ryan Lea, Solution Consultant, Revera

Securing Your Critical Oracle Cloud Infrastructure Workloads [THT6585]
01:00 PM - 01:20 PM | The Exchange @ Moscone South - Theater 2
SPEAKERS: Rich Vorwaller, Product Manager, Symantec

Keynote: Fusion Cloud Applications—Secure and Extensible [KEY3879]
02:00 PM - 03:00 PM | Moscone North - Hall D
SPEAKERS: Larry Ellison, Executive Chairman and CTO, Oracle

The Emergence of New Threats: A Look at Spectre and Meltdown [TIP3992]
04:45 PM - 05:30 PM | Moscone West - Room 2000
SPEAKERS: Greg Marsden, Linux Kernel Development, Oracle; Bruce Lowenthal, Senior Director, Security Alerts Group, Oracle


Announcements

Oracle Announces 2018 Oracle Excellence Awards – Congratulations to our “Leadership in Infrastructure Transformation" Winners

We are pleased to announce the 2018 Oracle Excellence Awards “Leadership in Infrastructure Transformation" winners. This elite group of recipients includes customers and partners who are using Oracle Infrastructure Technologies to accelerate innovation and drive business transformation by increasing agility, lowering costs, and reducing IT complexity.

This year, our 10 award recipients were selected from among hundreds of nominations. The winners represent 5 different countries (Austria, Russia, Turkey, Sweden, United States) and 6 different industries (Communications, Financial, Government, Manufacturing, Technology, Transportation). Winners must use at least one, or a combination, of the following for category qualification:

•    Oracle Linux
•    Oracle Virtualization (Oracle VM, VirtualBox)
•    Oracle Private Cloud Appliance
•    Oracle SuperCluster
•    Oracle SPARC
•    Oracle Solaris
•    Oracle Storage, Tape/Disk

Oracle is pleased to honor these leaders who have delivered value to their organizations through the use of multiple Oracle technologies, resulting in reduced cost of IT operations, improved time to deployment, and gains in performance and end-user productivity.

This year’s winners are: Michael Polepchuk, Deputy Chief Information Officer, BCS Global Markets; Brian Young, Vice President, Cerner; Brian Bream, CTO, Collier IT; Rudolf Rotheneder, CEO, cons4u GmbH; Heidi Ratini, Senior Director of Engineering, IT Convergence; Philip Adams, Chief Technology Officer, Lawrence Livermore National Labs; JK Pareek, Vice President, Global IT and CIO, Nidec Americas Holding Corporation; Baris Findik, CIO, Pegasus Airlines; Michael Myhrén, Senior DBA and Senior Systems Engineer, and Charles Mongeon, Vice President, Data Center Solutions and Services, TELUS Corporation.

More information on these winners can be found at https://www.oracle.com/corporate/awards/leadership-in-infrastructure-transformation/winners.html.


Events

Oracle OpenWorld 2018: Day One is a Wrap. What’s in Store for Day Two?

Cloud native development and security were among the key themes in today’s sessions, from keynotes to HOLs, and in the news, starting with today’s Oracle Linux Cloud Native Environment announcement. In keeping with long-standing Oracle OpenWorld traditions, Wim Coekaerts delivered the “State of the Penguin.” In the session, he shared updates on product releases and new areas of focus, including Oracle Linux Cloud Native Environment, Kata Containers, KVM work, and Oracle Instant Client. Hear more from @WimOracle in this Oracle Groundbreakers Live interview.

In other news today:

Gluster Storage 3.12 for Oracle Linux 7 was announced
Oracle VM VirtualBox 6.0 Beta is out

Catch up on other keynotes on demand. Tomorrow there is another information-packed lineup:

Tuesday, October 23

Accelerating Growth in the Cloud [KEY3877]
09:00 AM - 10:30 AM | Moscone North - Hall D

An Overview of Oracle Infrastructure Technologies in Oracle Cloud [PRO5904]
11:15 a.m. - 12:00 p.m. | Moscone West - Room 2000

Kubernetes, Docker, and Oracle Linux from On-Premises to Oracle Cloud with Ease [DEV6015]
11:30 a.m. - 12:15 p.m. | Moscone West - Room 2009

Accelerate Your Business with Machine Learning and Oracle Linux [PRO4731]
1:45 p.m. - 2:30 p.m. | Moscone West - Room 2000

Best Practices: Oracle Linux and Oracle VM in Oracle Cloud Infrastructure [PRO4721]
4:45 p.m. - 5:30 p.m. | Moscone South - Room 160

Building a Cloud Native Environment with Oracle Linux [THT6913]
5:25 p.m. - 5:45 p.m. | The Exchange @ Moscone South - Theater 6

Maximize Performance with Oracle Linux and Oracle VM [TIP4725]
5:45 p.m. - 6:30 p.m. | Moscone West - Room 2000


Linux

Oracle Instant Client RPMs Now Available on Oracle Linux Yum Server (yum.oracle.com)

Recently, we added Oracle Instant Client RPMs to the yum servers inside Oracle Cloud Infrastructure (OCI). Those yum servers are accessible from systems within OCI only. Today, I'm pleased to announce we have added Oracle Instant Client RPMs to Oracle Linux yum server. That's right: no more manual steps to accept a license before you can download Oracle Instant Client. Simply run yum install from any Oracle Linux system connected to the Internet.

How to access Oracle Instant Client on Oracle Linux yum server (yum.oracle.com)

First, verify your Oracle Linux yum server configuration, as we've recently made some changes in the way repository definitions are delivered. Follow the steps here to verify your setup. Once you've verified your yum configuration, install the oracle-release-el7 or oracle-release-el6 release package to configure repository definitions for Oracle Instant Client:

$ sudo yum install oracle-release-el7

Oracle Linux yum server currently offers Oracle Instant Client 18.3, which can connect to Oracle Database 11.2 or later.
Here are the Instant Client RPMs currently available:

$ sudo yum list oracle-instantclient*
ol7_UEKR4                                          | 1.2 kB  00:00:00
ol7_latest                                         | 1.4 kB  00:00:00
ol7_oracle_instantclient                           | 1.2 kB  00:00:00
(1/2): ol7_oracle_instantclient/x86_64/primary     | 2.2 kB  00:00:00
(2/2): ol7_oracle_instantclient/x86_64/updateinfo  |  145 B  00:00:00
ol7_oracle_instantclient                                          7/7
Available Packages
oracle-instantclient18.3-basic.x86_64      18.3.0.0.0-2  ol7_oracle_instantclient
oracle-instantclient18.3-basiclite.x86_64  18.3.0.0.0-2  ol7_oracle_instantclient
oracle-instantclient18.3-devel.x86_64      18.3.0.0.0-2  ol7_oracle_instantclient
oracle-instantclient18.3-jdbc.x86_64       18.3.0.0.0-2  ol7_oracle_instantclient
oracle-instantclient18.3-odbc.x86_64       18.3.0.0.0-2  ol7_oracle_instantclient
oracle-instantclient18.3-sqlplus.x86_64    18.3.0.0.0-2  ol7_oracle_instantclient
oracle-instantclient18.3-tools.x86_64      18.3.0.0.0-2  ol7_oracle_instantclient
$

Conclusion

With Oracle Instant Client RPMs now on our publicly available Oracle Linux yum server, it's even easier to develop and deploy applications for Oracle Database.


Announcements

Announcing Gluster Storage Release 3.12 for Oracle Linux 7

Oracle is pleased to announce the release of Gluster Storage Release 3.12 for Oracle Linux 7. Gluster Storage is an open source, POSIX compatible file system capable of supporting thousands of clients while using commodity hardware. Gluster provides a scalable, distributed file system that aggregates disk storage resources from multiple servers into a single global namespace. Gluster provides built-in optimization for different workloads and can be accessed using either an optimized Gluster FUSE client or standard protocols including SMB/CIFS. Gluster can be configured to enable both distribution and replication of content with quota support, snapshots, and bit-rot detection for self-healing.

Installation

Gluster Storage is available on the Unbreakable Linux Network (ULN) and the Oracle Linux yum server. It is currently available for the x86_64 architecture only and can be installed on any Oracle Linux 7 server running either the Red Hat Compatible Kernel (RHCK) or the Unbreakable Enterprise Kernel (UEK) Release 4 or 5. For more information on hardware requirements and how to install and configure Gluster, please review the Gluster Storage for Oracle Linux Release 3.12 documentation.

Support

Support for Gluster Storage is available to customers with an Oracle Linux Premier support subscription. Refer to the Oracle Linux 7 License Information User Manual for information about Oracle Linux support levels.

Oracle Linux Resources:

Documentation: Oracle Linux
Software Download: Oracle Linux, Oracle Container Registry
Blogs: Oracle Linux Blog, Oracle Ksplice Blog, Oracle Mainline Linux Kernel Blog
Community Pages: Oracle Linux
Social Media: Oracle Linux on YouTube, Oracle Linux on Facebook, Oracle Linux on Twitter
Data Sheets, White Papers, Videos, Training, Support & more: Oracle Linux
Product Training and Education: Oracle Linux - https://oracle.com/education/linux

For community-based support, please visit the Oracle Linux space on the Oracle Developer Community.


Announcements

Announcing Oracle Linux Cloud Native Environment

Oracle is pleased to announce Oracle Linux Cloud Native Environment, a curated set of open source Cloud Native Computing Foundation (CNCF) projects that can be easily deployed, have been tested for interoperability, and for which enterprise-grade support is offered.

For several years now we have seen the decomposition of applications into microservices running on container infrastructure, with developers and operations collaborating using DevOps methodologies. Enterprises are looking for technologies that can help them reduce time to market and keep ahead of the competition. Cloud native microservices-based applications offer the agility and increased productivity needed. However, most IT operations are overwhelmed by the changing cloud native technology landscape.

One option is to build your own cloud native environment from open source software, but that requires dealing with the complexity of picking the right software and getting it all to work together without any vendor support. The other approach is to use a stack or distribution from a software vendor. This option offers support, but it could mean lock-in with a vendor that may not be up to date with the latest technologies.

Oracle offers a better alternative: one that can give you the best of both worlds by delivering software that supports the open standards, specifications, and APIs defined by the Cloud Native Computing Foundation, or CNCF. The CNCF promulgates guidelines and defines certifications for cloud native microservices software. Oracle is a platinum member of the CNCF as well as a platinum member of the Linux Foundation. Oracle closely tracks the CNCF standards and contributes to the CNCF community. Oracle has been investing in components of the CNCF framework for some time. For example, Open Container Initiative (OCI)-compliant container software and CNCF Certified Conformance orchestration software have been included with Oracle Linux for several years.
“We’re always thrilled to see members and long-standing open source contributors driving cloud native innovations that benefit both developers and enterprises," said Dee Kumar, vice president of marketing, Cloud Native Computing Foundation. "CNCF looks forward to seeing how Oracle continues its efforts to meet the quality, availability, and security needs of enterprises for cloud native DevOps.”

Oracle Linux Cloud Native Environment

With the Oracle Linux Cloud Native Environment, Oracle provides the features for customers to develop microservices-based applications that can be deployed in environments that support open standards and specifications.

Container Infrastructure

Containers are the fundamental infrastructure for deploying modern cloud applications. Oracle delivers the tools to create and provision OCI-compliant containers with the Oracle Container Runtime for Docker package, available for Oracle Linux 7 on both the x86_64 and Arm architectures. To provide additional security and isolation of workloads, Oracle has adopted Kata Containers, an OpenStack Foundation project. Oracle is using Kata Containers software to deliver the framework for creating lightweight virtual machines that can easily plug into a container ecosystem. A combination of Intel's Clear Containers initiative and the Hyper runV project, Kata Containers offer additional levels of security while maintaining the development and deployment speed of traditional containers. Kata Containers are available as a developer preview with Oracle Linux.

Container Orchestration and Management

Oracle Container Services for use with Kubernetes is an extension to Oracle Linux, based on the upstream Kubernetes project and released under the CNCF Kubernetes Certified Conformance program. Oracle Container Services for use with Kubernetes simplifies the configuration and setup of Kubernetes, with support for backup and recovery.
This solution is developed for Oracle Linux and integrates with Oracle Container Runtime for Docker to provide a comprehensive container and orchestration environment for the delivery of microservices and next-generation application development. CRI-O, an implementation of the Kubernetes CRI (Container Runtime Interface) that enables the use of Open Container Initiative compatible runtimes, is available in preview. CRI-O allows you to run containers directly from Kubernetes without any unnecessary code or tooling. As long as the container is Open Container Initiative (OCI)-compliant, CRI-O can run it, cutting out extraneous tooling and allowing containers to do what they do best: fuel your next-generation cloud native applications.

Cloud Native Networking

The CNCF project Flannel provides the overlay network used in Oracle Container Services for use with Kubernetes today and simplifies container-to-container networking. The Container Network Interface (CNI) project, currently incubating under CNCF, seeks to simplify networking for container workloads by defining a common network interface for containers. The CNI plugin is available as a developer preview. Coming soon, additional features like Calico will enable customers to define fine-grained connection policies to further improve container and virtual machine network security.

Cloud Native Storage

There are a number of storage projects associated with the CNCF, and several providers are included by default in Oracle Container Services for use with Kubernetes, including a plugin for Gluster Storage for Oracle Linux Release 3.12. The future of storage integration will be provided through a new plugin referred to as the Container Storage Interface (CSI), which was released in alpha beginning with Kubernetes 1.9. This new plugin will adhere to a standard specification and allow storage vendors to manage their plugins against their own timelines rather than aligning with upstream Kubernetes releases.
The alpha CSI plugin is available as a developer preview.

Continuous Integration / Continuous Delivery

The increased adoption of microservices and the development of cloud native applications require continuous integration and delivery options that keep pace with growing release frequencies. Jenkins X, available in preview, is a CNCF project that rethinks how developers should interact with CI/CD in the cloud, with a focus on making development teams more productive through automation, tooling, and DevOps best practices.

Observability and Diagnostics

Prometheus is a powerful, flexible instrumentation solution for monitoring container environments. It provides time-series dimensional data, powerful query tools, and alerting features to improve visibility across the environment. In addition, integration with third-party “exporters” allows users to collect additional data and turn it into metrics in Prometheus. One example is Fluentd, a data collector that decouples data sources from backend systems by providing a unified logging layer in between. Fluentd provides an exporter for Prometheus, allowing for a simpler integration experience. Both Prometheus and Fluentd are available as previews.

Oracle Linux for Development

Tried, tested, and tuned for enterprise workloads, Oracle Linux is used by developers worldwide. The Oracle Linux yum server provides easy access to Linux developer preview software, including the latest Cloud Native Environment software. Thousands of EPEL packages have also been built and signed by Oracle for security and compliance. Software collections include recent versions of Python, PHP, Node.js, nginx, and more. In addition, Oracle Cloud developer tools such as Terraform, SDKs, and the CLI are available for an improved experience. Finally, Oracle VM VirtualBox helps customers get started with Oracle Linux Cloud Native Environment quickly.
Greater Value

Support for the Oracle Linux Cloud Native Environment is included with an Oracle Linux Premier support subscription at no additional cost. Components available in preview are made available via the Oracle Linux yum server or Oracle Container Registry.

Getting Started

Oracle Linux is freely available to everyone at Oracle Software Delivery Cloud. Updates can be obtained from the Oracle Linux yum server. Oracle VM VirtualBox is the most popular cross-platform virtualization software for development environments; you can download a copy of VirtualBox to run Oracle Linux and the cloud native software on your desktop and easily deploy to the cloud. Oracle is also offering up to 3,500 free hours on Oracle Cloud to developers who would like to use our cloud for their development environment.

Oracle OpenWorld 2018

To learn more about Oracle Linux Cloud Native Environment at Oracle OpenWorld 2018, attend the sessions and visit the Oracle Infrastructure Technologies showcase, booth #120, located in Moscone South, on the right side, just past the Autonomous Database showcase.
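As a quick way to experiment with the container pieces described above, the sketch below pulls the Oracle Linux base image from the Oracle Container Registry with Docker. The registry path shown is the public Oracle Linux repository; the login step assumes you have an Oracle account and have accepted the license terms on the registry web site.

```shell
# Sign in to the Oracle Container Registry (requires an Oracle account and
# prior acceptance of the license terms on container-registry.oracle.com).
docker login container-registry.oracle.com

# Pull the Oracle Linux 7 base image as a starting point for building
# OCI-compliant containers.
docker pull container-registry.oracle.com/os/oraclelinux:7-slim

# Run a throwaway container to confirm the runtime works end to end.
docker run --rm container-registry.oracle.com/os/oraclelinux:7-slim \
    cat /etc/oracle-release
```

The same registry hosts the preview components mentioned above; repository names vary, so browse the registry catalog for exact paths.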


Events

Oracle Sponsors Open Source Summit Europe - Oct 22-25

Open Source Summit Europe (OSSEU) is the leading conference for developers, architects and other technologists – as well as open source community and industry leaders – to collaborate, share information, learn about the latest technologies and gain a competitive advantage by using innovative open solutions. Oracle is a gold sponsor of Open Source Summit in Edinburgh. We have two great sessions:

Tuesday, Oct 23, 16:40: Rapid and Secure Cloud Native DevOps, presented by Shane James. Learn how to rapidly and securely build your cloud native DevOps using tools such as ready-to-deploy Oracle products as Docker images, Oracle Container Runtime for Docker, Oracle Container Services for use with Kubernetes, and Oracle VirtualBox, which enables multiple operating systems on one desktop and transports live virtual machines between hosts and the cloud without interruption.

Wednesday, Oct 24, 16:15: Test Driven Kernel Development, presented by Knut Omang. In this talk Knut will make a case for a pragmatic test-driven approach to Linux kernel development. Most of the testing we are aware of is based on tests that are executed from user space only and can only observe what the kernel exposes. Often this is not sufficient to test the detailed semantics of kernel components, as many of the stimuli needed to activate certain pieces of the code are not easily generated. Also, even when a certain class of problems can be exposed using system-level tests, running them as part of a continuous integration (CI) system may not be feasible due to the hardware needs. A good unit test framework can make it easier to write tests that assert certain behaviour that some code relies on. Oracle is developing and improving KTF (Kernel Test Framework), available on GitHub, to allow unit testing across the user/kernel boundary; it will be demonstrated as part of the talk.
Visit us at Booth #9 to talk to engineers, get more information about Linux and virtualization products, and see demos.


Key Happenings on the First Day of Oracle OpenWorld 2018

Today was a gorgeous day in San Francisco. The temperature was in the mid-60s, the sun was shining, and there was a light breeze. The forecast is for similar weather all week – just in time for Oracle OpenWorld 2018. I flew in to cloud cover, but it burned off between baggage, finding my ride, and the drive from SFO to the city. I made my way to the convention center and the hustle is on! It’s remarkable how the expanse of Moscone South can transform from an empty hall to the booming venue that tomorrow will be The Exchange – and I thought SFO was a busy place.

Monday, October 22, is the first day of Oracle OpenWorld, and there are several key things you’ll want to do:

Register. If you haven’t already, use the advance check-in option. It’s a breeze.

Sign up for sessions. Be sure to register for the ones you must attend to be sure to have a seat. Here are some links to help you finish building your Monday schedule:

Monday, Oct. 22

11:30 a.m. - 12:15 p.m. Oracle Linux: State of the Penguin [PRO4720] – Moscone West - Room 2000
Wim Coekaerts, Senior Vice President, Operating Systems and Virtualization Engineering, Oracle

1:45 p.m. - 3:00 p.m. Keynote: Cloud Generation 2 [KEY3784] – Moscone North - Hall D
Larry Ellison, Executive Chairman and CTO, Oracle

3:45 p.m. - 4:45 p.m. Oracle's Systems Strategy for Cloud and On-Premises [PKN5901] – The Exchange @ Moscone South - The Arena
Ali Alasti, Senior Vice President, Hardware Engineering, Oracle
Wim Coekaerts, Senior Vice President, Operating Systems and Virtualization Engineering, Oracle
Edward Screven, Chief Corporate Architect, Oracle

As you exit The Arena at 4:45, you conveniently have to walk by the Infrastructure Technologies showcase, #120. Take a few minutes to see what’s going on. If the demos and product experts don’t keep you enthralled, the VR game will. Apparently, shooting widgets with a bow and arrow in VR is a lot of fun – hope you’ll give it a try – you could win a prize.
For more session information, visit the Focus on Oracle Linux and Virtualization page.


Events

Don’t Miss the Theater While in San Francisco or at Oracle OpenWorld

There’s a lot to see and do in San Francisco. If you’re joining us for Oracle OpenWorld 2018, be sure to partake in some of the wonderful attractions and culture that the city has to offer. If you like theater, you’ll find lots of options, from large world-renowned venues such as the Orpheum Theater and the Golden Gate Theater to smaller ones like the SF Playhouse on Union Square, offering a variety of productions from Broadway hits to comedies and locally written pieces. You can also find some great content in the theaters of The Exchange @ Moscone South while you’re attending the conference. After walking the streets (and hills) of San Francisco, not to mention the halls of Moscone, who wouldn’t want to take a seat? For 20 minutes, you can rest your feet and gather some knowledge. Here are a few sessions in Theater 6, located in the Infrastructure Technologies showcase, #120, to mark on your schedule:

Monday, October 22

12:40 p.m. – 1:00 p.m. Oracle Infrastructure Technologies in Oracle Cloud [THT6914]
Robert Shimp, Product Management Group Vice President - Oracle Linux, Virtualization and Linux and VM Development, will outline the many infrastructure technologies that Oracle designs, builds, and optimizes to power Oracle Cloud. Learn about the inner workings that make Oracle Cloud unique.

Tuesday, October 23

1:30 p.m. – 1:50 p.m. Oracle Linux/Oracle VM VirtualBox: An Enterprise Development Platform for Oracle Cloud [THT6912]
Simon Coter, Director of Product Management, Linux and Virtualization, will discuss the advantages of using Oracle Linux and Oracle VM VirtualBox as an enterprise development platform for Oracle Cloud.

5:25 p.m. – 5:45 p.m. Building a Cloud Native Environment with Oracle Linux [THT6913]
Avi Miller, Product Management Director, will delve into the open, integrated operating environment that Oracle Linux offers, with application development tools, management tools, containers, and orchestration capabilities, which enable DevOps teams to efficiently build reliable, secure cloud native applications. Learn how Oracle Linux can help you enhance productivity.


Events

Hewlett Packard Enterprise at Oracle OpenWorld 2018

Coming to Oracle OpenWorld 2018? Then come see HPE. We're excited to have our partner returning to the conference this year! And they’re ready to help you learn how the right HPE hardware running Oracle Linux and Oracle VM can provide an optimal solution for your most demanding workloads. Whether your requirements are for increasing database performance or keeping critical applications available, HPE can help you optimize your investments in Oracle. HPE participates in Oracle’s HCL program to qualify hardware on Oracle Linux, Oracle Solaris, and Oracle VM. Qualified solutions can be found here. Meet with HPE compute and storage experts in the Infrastructure Technologies showcase, #120. Learn about HPE's all-flash 3PAR and Nimble Storage with extreme performance, predictive analytics, and robust data protection, or, for your mission-critical compute, the unparalleled scale-up server capacity offered by HPE Superdome servers with Intel® Xeon® Scalable processors. HPE provides a full portfolio of right-sized server and storage solutions, allowing IT organizations to match processing power and scale with current and future needs, from small to large enterprise deployments, at price points that fit within almost any IT budget.


Linux

Announcing Oracle Linux Storage Appliance 1.8 for Oracle Cloud Infrastructure

We are pleased to announce the release of Oracle Linux Storage Appliance 1.8. The Oracle Linux Storage Appliance allows you to easily build NFS and Samba shared file system storage with attached NVMe devices or block volumes on Oracle Cloud Infrastructure (OCI). This release provides Microsoft Active Directory support for greater integration with Windows domain networks. Many Microsoft Windows Server deployments use Active Directory for managing user authentication and access authorization. The Oracle Linux Storage Appliance can now authenticate users defined in the Active Directory server and authorize or restrict access to Samba shared file system directories exported via the Server Message Block (SMB) protocol. To take advantage of Microsoft Active Directory support, you can easily upgrade your existing Oracle Linux Storage Appliance deployment using the Update Appliance option on the Administration page of the web console. To install a new deployment of Oracle Linux Storage Appliance on Oracle Cloud Infrastructure, simply follow the few easy steps provided here. Active Directory support is enabled in the Samba Global Settings option on the web console’s Administration page. For more information, visit:

Oracle Linux Storage Appliance
Oracle Linux Storage Appliance Deployment and User’s Guide


Announcements

Announcing Oracle Linux 7 Update 6 Developer Preview

Oracle is pleased to announce the availability of the developer preview for Oracle Linux 7 Update 6 as part of our ongoing goal of making Oracle Linux the distribution for development. The Oracle Linux 7 Update 6 Developer Preview includes the following kernel packages:

kernel-uek-4.14.35-1818.2.1.el7uek.x86_64 - The Unbreakable Enterprise Kernel Release 5, which is the default kernel.
kernel-3.10.0-933.el7.x86_64 - The latest Red Hat Compatible Kernel (RHCK).

To get started with the Oracle Linux 7 Update 6 Developer Preview, you can perform a fresh installation by using the ISO images available for download from Oracle Technology Network, or you can upgrade an existing Oracle Linux 7 installation by using the developer preview channels for Oracle Linux 7 Update 6 on the Oracle Linux yum server or the Unbreakable Linux Network (ULN).

# vi /etc/yum.repos.d/public-yum-ol7.repo

[ol7_u6_developer]
name=Oracle Linux $releasever Update 6 installation media copy ($basearch)
baseurl=https://yum.oracle.com/repo/OracleLinux/OL7/6/developer/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1

[ol7_u6_developer_optional]
name=Oracle Linux $releasever Update 6 optional packages ($basearch)
baseurl=https://yum.oracle.com/repo/OracleLinux/OL7/optional/developer/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1

The Oracle Linux yum server is mirrored inside Oracle Cloud Infrastructure to enable faster downloads; you can follow the instructions to configure Oracle Linux yum server mirrors in Oracle Cloud Infrastructure. Modify the yum channel settings to enable the Oracle Linux 7 Update 6 Developer Preview channels, then perform the upgrade:

# yum update

After the upgrade completes, reboot the system and you will have the Oracle Linux 7 Update 6 Developer Preview running.
# cat /etc/oracle-release
Oracle Linux Server release 7.6

This release is provided for development and test purposes only and is not covered by Oracle Linux support. Oracle does not recommend using preview releases in production. If you have any questions, please visit the Oracle Linux and UEK Preview space on the Oracle Linux Community. If you come to Oracle OpenWorld and want to learn more about Oracle Linux and Virtualization and to speak with product experts, visit the Oracle Infrastructure Technologies showcase, booth #120, located in Moscone South, on the right side, just past the Autonomous Database showcase.


Events

Join Pure Storage at the Infrastructure Technologies Showcase, #120, at Oracle OpenWorld 2018

Learn how Pure Storage empowers Oracle customers to maximize the value of data.

We’ve shared a lot of information about key sessions and the showcase to help Oracle OpenWorld attendees map out how to best spend time at the conference. There’s more… our partners. This year, we welcome AMD, Hewlett Packard Enterprise, Lenovo, Pure Storage, and Symantec, who are joining us in the Infrastructure Technologies showcase, #120, in Moscone South. Here are some key things to know about Pure Storage.

#1: Pure Storage hardware is qualified and supported on Oracle Linux, Oracle Solaris, and Oracle VM. It is also a good selection for Oracle Private Cloud Appliance customers that need external storage for business continuity and rapid-restore solutions. The Pure Data-Centric Architecture for Oracle, the all-flash storage platform, is virtually effortless to use, efficient from end to end, and evergreen to upgrade, delivering real-time data to power customers’ mission-critical Oracle databases, data warehouses, development activities, and modern analytics environments. Thousands of Oracle customers use Pure Storage to help them deliver faster performance, improved simplicity, and lower costs for their Oracle environments. A case in point: an insurance company in Latin America recently selected Pure Storage to improve the performance and simplify the operation of the mission-critical Oracle databases they run on Oracle Linux. Pure’s embedded Oracle Copy Automation Tool (CAT), based on space-efficient snapshots, helped this customer speed up development activities by over 150x. Copy, clone, and refresh workflows that were taking up to 3 hours to complete are now conducted in as little as 1-2 minutes. Now that’s a benefit that’s hard to pass up.

#2: You can hear more from Pure Storage product experts:

Customer Case Study Session: Oracle Private Cloud Appliance and Pure Storage: An Integrated Disaster Recovery Solution
Thursday, Oct 25, 12:00 p.m. - 12:45 p.m.
Moscone South - Room 214

Theater Session: Accelerate Development with Database Automation
Tuesday, Oct 23, 1:00 p.m. - 1:20 p.m.
The Exchange @ Moscone South - Theater 1

#3: At the Infrastructure Technologies showcase, #120, Pure will be highlighting hardware support on Oracle Linux, Oracle VM, and Oracle Private Cloud Appliance, and demoing tools including:

Oracle Database Copy Automation Tool
Accelerate Oracle DB Development with Automation using Ansible
Oracle Enterprise Manager Plug-in for Pure Storage
Space-efficient Oracle snapshots
Pure ActiveCluster - simple and cost-effective sync replication

Pure Storage is an OPN Gold member.


Events

Q: What do penguins, pop sockets and VR have in common? A: The Oracle Infrastructure Technologies Showcase at Oracle OpenWorld

It’s refreshing to be in San Francisco in the fall. The weather is typically “temperate” but can be unpredictable, so it’s always good to bring layers, just in case. Joining the throng of people heading to Oracle OpenWorld adds an even more energizing buzz to the city by the bay. I enjoy a walk in the Howard and 3rd St. neighborhood as I grab a cappuccino and head to Moscone Center. Like the convention center, which is undergoing an expansion and transformation, so too is The Exchange, this year’s demo grounds at Oracle OpenWorld, located in Moscone South. With a focus on attendees’ experience, there are several new things to make navigating the exhibit floor easier. A wayfinder application provides an easy, self-service portal for finding demos and product experts. On-demand demos join always-on demos to provide time savings, and meetings can be booked on the spot to fit your schedule. Also new this year is the Oracle Infrastructure Technologies showcase, #120. This showcase, located on the right side of the show floor, near the Oracle Cloud Infrastructure and Autonomous Database showcases, is a stop you'll want to make. Attendees will find a wealth of information and an opportunity to have some fun. Here’s an outline of what will be covered in the Oracle Infrastructure Technologies showcase. 
Products, technologies, and training:

Servers: X86 Servers, SPARC Servers
Storage: Zero Data Loss Recovery Appliance, Oracle ZFS Storage Appliance, StorageTek Tape Automation
Operating Systems: Oracle Linux, Oracle Solaris
Virtualization: Oracle VM for x86, Oracle VM Server for SPARC, Oracle VM VirtualBox
Tools and Platform: Oracle Containers, Oracle OpenStack, Oracle Enterprise Manager, Kubernetes
Converged Infrastructure: Oracle MiniCluster, Oracle SuperCluster, Oracle Private Cloud Appliance
Training
Partners: AMD, Hewlett Packard Enterprise, Lenovo, Pure Storage, and Symantec

Fun with VR: Join us for some fun in this virtual world (with all of the gear), where you’ll transform into the role of a solution architect. Shoot down the Oracle Infrastructure Technology product(s) that best fit your IT requirements and you could win an Oracle penguin pop socket.

And there’s more… more fun at CloudFest.18. And if you’re planning to extend your stay in San Francisco, be sure to check out all of the Halloween parties – this city knows how to do them right! Finally, back to Oracle OpenWorld: don’t forget to register for sessions now; they’re filling up fast. Enjoy fall in San Francisco and your time at Oracle OpenWorld 2018.


Linux

Configuring Oracle Linux 7 Instances on Oracle Cloud Infrastructure Using OCI Utilities

Oracle Linux 7 instances created using Oracle-Provided Images on Oracle Cloud Infrastructure (OCI) include a pre-installed set of utilities designed to facilitate configuration tasks for Oracle Linux instances. These utilities are a set of command-line tools included in the oci-utils RPM package, which is pre-installed with the latest Oracle Linux 7 images provided under the ‘Oracle-Provided OS Image’ selection when creating an instance from the Oracle Cloud Infrastructure console. The following OCI utilities are available in the oci-utils package:

oci-iscsi-config - Displays and attaches/detaches iSCSI devices on Oracle Linux instances.
oci-network-config - Displays instance VNICs, configures secondary VNICs, and auto-synchronizes VNIC IP configurations.
oci-network-inspector - Displays network information for an OCI Virtual Cloud Network (VCN), compartment, or tenancy, including the security list and the IP addresses of VNICs and instances.
oci-metadata - Queries instance metadata such as the OCI region, availability domain, shape, state, OCID, compartment, and network.
oci-public-ip - Displays the instance public IP address.
ocid - The oci-utils service daemon component.

For more information on the OCI utilities and how to use them, visit the following links:

Documentation
Oracle Cloud Documentation: OCI Utilities

Blogs
oci-utils-0.6-34.el7
oci-utils for Oracle Cloud Infrastructure
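As a quick illustration, here is how a few of these tools might be invoked on a running instance. The exact options come from the oci-utils man pages and may vary between package versions, so treat the flags below as indicative rather than definitive.

```shell
# Query all instance metadata (region, availability domain, shape, OCID, ...).
sudo oci-metadata

# Restrict the output to a single key, e.g. the instance shape.
sudo oci-metadata --get shape

# Display the instance's public IP address.
sudo oci-public-ip

# Show the iSCSI devices currently visible to the instance.
sudo oci-iscsi-config --show
```

These commands only work on an OCI instance, since they query the instance metadata service and attached cloud resources.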


Events

Enterprise Development Platform founded on Oracle Linux and VirtualBox

"Tried, tested, and tuned for enterprise workloads, Oracle Linux is used by developers worldwide. Oracle Linux’s yum server provides easy access to Linux developer and preview software channels. Thousands of EPEL packages have been built and signed by Oracle for security and compliance. Software collections include recent versions of Python, PHP, Node.js, nginx, and more. Oracle Cloud developer tools such as Terraform, SDKs, and the CLI are available for an improved experience. Oracle VM VirtualBox is the most popular cross-platform virtualization software. In this session, learn about using Oracle Linux and Oracle VM VirtualBox as an enterprise development platform."

Oracle Linux is a true enterprise and open Linux distribution:

It's free to use: the Oracle Linux ISOs can be downloaded and used for free, and no subscription is required.
It's free to distribute: the software can be shared and installed on as many different environments as you like.
It's free to update: you can get access to all the updates from the Oracle Linux yum server and, again, no subscription is required.

On the Oracle Linux yum server you can also find channels dedicated to developers, such as:

Software Collection Library 3.0 for Oracle Linux 7
The EPEL channel, with packages built and signed by Oracle for security and compliance
The development channel, with packages dedicated to development utilities (like VirtualBox) and/or cloud utilities

Oracle VM VirtualBox is the most popular cross-platform virtualization software; it allows you to run any x86 operating system on top of your laptop/desktop environment. It does not matter which OS you've installed on the host: the same VirtualBox release is available for Linux, Windows, and macOS.
By having Oracle VM VirtualBox installed on your host development platform, you create a transparent layer on which your virtual machines (dev environments) run, independent of the host operating system. You can learn more about how your business can take advantage of these technologies during the "Oracle Linux and Oracle VM VirtualBox: The Enterprise Development Platform" session at Oracle OpenWorld on Monday, Oct 22, 9:00 a.m. in room 152, Moscone South. To learn more about Oracle Linux and to speak with product experts, visit the Oracle Infrastructure Technologies showcase, booth #120, located in Moscone South, on the right side, just past the Autonomous Database showcase. See you there!


Events

Six Must-Attend Sessions at Oracle OpenWorld 2018

Building your Oracle OpenWorld 2018 schedule? You won't want to miss these six sessions. Our executives will share details on architecture and technical directions, the latest innovations, business strategies, and customer successes. You’ll come away with a better understanding of the unique capabilities Oracle Linux, Virtualization, and other Oracle Infrastructure Technologies are delivering now and going forward – whether you want to deploy on premises, in the cloud, or integrate between the two. Register now to ensure you have a seat!

Monday, Oct. 22

11:30 a.m. - 12:15 p.m. Oracle Linux: State of the Penguin [PRO4720] – Moscone West - Room 2000
Wim Coekaerts, Senior Vice President, Operating Systems and Virtualization Engineering, Oracle

3:45 p.m. - 4:45 p.m. Oracle's Systems Strategy for Cloud and On-Premises [PKN5901] – The Exchange @ Moscone South - The Arena
Ali Alasti, Senior Vice President, Hardware Engineering, Oracle
Wim Coekaerts, Senior Vice President, Operating Systems and Virtualization Engineering, Oracle
Edward Screven, Chief Corporate Architect, Oracle

Tuesday, Oct. 23

11:15 a.m. - 12:00 p.m. An Overview of Oracle Infrastructure Technologies in Oracle Cloud [PRO5904] – Moscone West - Room 2000
Robert Shimp, Product Management Group Vice President - Oracle Linux, Virtualization and Linux and VM Development, Oracle
Ajay Srivastava, Senior Vice President, Operating Systems and Virtualization, Oracle

11:30 a.m. - 12:15 p.m. Kubernetes, Docker, and Oracle Linux from On-Premises to Oracle Cloud with Ease [DEV6015] – Moscone West - Room 2009
Wim Coekaerts, Senior Vice President, Operating Systems and Virtualization Engineering, Oracle

Wednesday, Oct. 24

11:15 a.m. - 12:00 p.m. The OS Factor: Advice for the Technology Buyer from IDC [BUS4729] – Moscone West - Room 2000
Ashish Nadkarni, Research Director, IDC
Karen Sigman, Vice President, Product and Partner Marketing, Oracle

Thursday, Oct. 25

9:00 a.m. - 9:45 a.m. Oracle Linux Is Really the Ideal Linux for Oracle Cloud Developers [DEV6017] – Moscone West - Room 2003
Wim Coekaerts, SVP, Operating Systems and Virtualization Engineering, Oracle

To learn more about these sessions and to register, click on a session title above or, in the search box, enter the session code and click "+" to complete your registration. Visit and bookmark the Focus on Oracle Linux and Virtualization page to access the full list of our general sessions and hands-on labs. Check the Oracle Linux and Virtualization blogs regularly for news and updates. And, while at Oracle OpenWorld, be sure to stop by the Infrastructure Technologies showcase, booth #120, located in Moscone South (on the right side, just past the Autonomous Database showcase). Featuring Oracle Linux and Virtualization technologies, the showcase offers demos, a virtual reality game, and the chance to speak with product experts and partners.


Events

Agile, reliable and secure DevOps with Oracle Linux and VirtualBox

Building agile collaboration and communication between Development (Dev) and Operations (Ops) is one of the main goals of modern IT: deploying features into production quickly and, at the same time, detecting and correcting problems when they occur, without disrupting other services, is achieved through a culture that focuses on creating a fast and stable workflow through development and IT operations. A good DevOps approach delivers:

Faster time to market: increase the frequency and accuracy of releases (automated processes give people much more time back)
Lower cost: reduce OPEX by automating processes; this also prevents human errors and reduces downtime
Focus on the business: allow employees to concentrate on high-value activities (which also improves their job satisfaction)

Oracle is one of the biggest players in both the cloud and software markets, so DevOps is one of the most important components for achieving the best results possible; the infrastructure technologies we use at Oracle to build a stable and reliable workflow rely on both container and virtualization solutions. VirtualBox and Oracle Linux can help you address most DevOps requirements in terms of build, test, and deploy: Oracle Linux, with all its enterprise features, has been rated the "Top Rated Operating System for Business", and Oracle VM VirtualBox is the most popular cross-platform desktop virtualization solution available today. These technologies, together with Vagrant, let you automate and obtain a reliable, reproducible environment in minutes, free of human error; this is also why, some time ago, we created the official GitHub repository of Vagrant boxes for Oracle products and projects, available at https://github.com/oracle/vagrant-boxes .
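As a sketch of that workflow, the commands below clone the repository and bring up an Oracle Linux box with Vagrant and VirtualBox. The subdirectory name reflects the repository layout at the time of writing and may change, so check the repository's README for the current structure.

```shell
# Prerequisites: git, Oracle VM VirtualBox and Vagrant installed on the host.
git clone https://github.com/oracle/vagrant-boxes
cd vagrant-boxes/OracleLinux/7   # per-release folders; verify against the repo

# Build and start the VM described by the Vagrantfile, then log in to it.
vagrant up
vagrant ssh

# When finished, tear the environment down and reclaim disk space.
vagrant destroy
```

Because the Vagrantfile fully describes the VM, every developer on the team gets an identical, reproducible environment from the same three commands.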
You can learn more about how your business can take advantage of these technologies and the DevOps approach during the Oracle Code "Practical DevOps with Linux and Virtualization" session at Oracle OpenWorld on Thursday, Oct 25, 10:00 a.m. in room 2018, Moscone West. To learn more about Oracle Linux and to speak with product experts, visit the Oracle Infrastructure Technologies showcase, booth #120, located in Moscone South, on the right side, just past the Autonomous Database showcase. See you there!


Getting Started with the Unbreakable Enterprise Kernel Release 5 for Oracle Linux on Oracle Cloud Infrastructure

Oracle Linux images available on Oracle Cloud Infrastructure are frequently updated to help ensure access to the latest software. The latest Oracle Linux images provided in Oracle Cloud Infrastructure now include Oracle Linux 7 Update 5 with the Unbreakable Enterprise Kernel Release 5 (UEK R5). UEK R5 is an extensively tested and optimized Linux kernel designed for 64-bit Intel (x86_64) and ARM (aarch64) architectures and based on mainline version 4.14 LTS. UEK R5 provides secure boot and performance optimization improvements, security and bug fixes, and driver updates. For details about UEK R5 improvements and more, visit these links: Announcing the General Availability of the Unbreakable Enterprise Kernel Release 5; Unbreakable Enterprise Kernel Release 5 for Oracle Linux 7; Unbreakable Enterprise Kernel Release 5 – New Features and Changes. You can take advantage of the new UEK R5 enhancements by deploying the latest Oracle Linux images on Oracle Cloud Infrastructure: simply create an instance with the latest Oracle Linux 7.5 image provided on the Oracle Cloud Infrastructure console. To upgrade your existing Oracle Linux instances to UEK R5 on Oracle Cloud Infrastructure, enable access to the ol7_UEKR5 channel on your Oracle Cloud Infrastructure region’s mirrored Oracle Linux yum server repository or the ol7_x86_64_UEKR5 channel on the Unbreakable Linux Network (ULN), and run the yum update command. After the upgrade, you will need to reboot and select the UEK R5 kernel (version 4.14.35) if it is not the default boot kernel. The UEK R5 update is included with Oracle Linux Premier Support at no additional cost with your Oracle Cloud Infrastructure subscription. This includes access to the latest packages and updates, 24x7 expert support, the My Oracle Support portal with an extensive Linux knowledge base, Oracle Ksplice zero-downtime updates, and more. 
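On an existing instance, the upgrade steps above can be sketched as follows (run as root; the ol7_UEKR5 repo id is taken from the text, and the yum-config-manager step assumes the repo is already defined in your region's mirror configuration):

```shell
# Enable the UEK R5 channel on the region's Oracle Linux yum mirror
yum-config-manager --enable ol7_UEKR5
# Pull in UEK R5 and any dependent updates
yum update -y
# Reboot to pick up the new kernel (select it in GRUB if not the default)
systemctl reboot
# After reboot, verify the running kernel is in the 4.14.35 series
uname -r
```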
For more information, visit the following links: Oracle Linux Oracle Linux for Oracle Cloud Infrastructure Unbreakable Enterprise Kernel for Oracle Linux Release Notes for Unbreakable Enterprise Kernel Release 5 Getting Started: Oracle Linux for Oracle Cloud Infrastructure Guide


Events

Oracle Linux and Virtualization Hands-On Labs at Oracle OpenWorld

We have a great selection of hands-on labs for Oracle Linux and Virtualization at Oracle OpenWorld. To join the product experts for these sessions at the Marriott Marquis (Yerba Buena Level) - Salon 12/13, add the following six sessions to your Oracle OpenWorld calendar.
Session: Container Orchestration Using Oracle Linux (Kubernetes/Docker) - HOL6334. When: Monday, October 22, 3:45 p.m. - 4:45 p.m. Speaker: Avi Miller, Product Management Director, Oracle
Session: Build a High Availability Solution with Oracle Linux: Corosync/Pacemaker - HOL3137. When: Monday, October 22, 5:15 p.m. - 6:15 p.m. Speaker: Jeff Savit, Director, Oracle
Session: Provide Zero Downtime Update for your Cloud Infrastructure - HOL6340. When: Wednesday, October 24, 8:00 a.m. - 9:00 a.m. Speakers: Christophe Pauliat, Oracle Solution Center Sales Consultant, Oracle; Simon Coter, Director of Product Management, Linux and Virtualization, Oracle
Session: Infrastructure as Code on Oracle Cloud Infrastructure with Terraform - HOL5139. When: Thursday, October 25, 10:30 a.m. - 11:30 a.m. Speakers: Simon Hayler, Sr Principal Technical Product Manager; Christophe Pauliat, Oracle Solution Center Sales Consultant, Oracle; Paul Bramy, CEO reloca; Matthieu Bordonne, Oracle Solution Center Sales Consultant
Session: Observing and Optimizing your Application on Oracle Linux with DTrace - HOL6339. When: Thursday, October 25, 12:00 p.m. - 1:00 p.m. Speaker: Jeff Savit, Director, Oracle
Session: Oracle Database 18c: Reliable DevOps with Vagrant, Oracle VM VirtualBox, and Oracle Linux - HOL6394. When: Thursday, October 25, 1:30 p.m. - 2:30 p.m. Speakers: Simon Coter, Director of Product Management, Linux and Virtualization, Oracle; Gerald Venzl, Senior Principal Product Manager, Oracle
At Oracle OpenWorld, to learn more about Oracle Linux and Virtualization, visit the Oracle Infrastructure Technologies showcase, booth #120, located in Moscone South, on the right side, just past the Autonomous Database showcase.


Events

Live Webinar: Secure and Agile Orchestration for Docker Containers

Oracle Webinar: Secure and Agile Orchestration for Docker Containers
Europe, Middle East, Africa - October 9, 2018: 10:00 AM BST / 11:00 AM CEST / 11:00 AM SAST / 1:00 PM GST
North America, Canada - October 9, 2018: 12:00 PM PDT / 3:00 PM EDT
Asia Pacific and Japan - October 9, 2018: 10:30 AM IST / 1:00 PM SGT / 4:00 PM AEDT
The goal of orchestration is to streamline and optimise frequent, repeatable processes to ensure accurate, speedier deployment of software, because companies know that the shorter the time-to-market, the more likely that success will follow. Attend this webinar to: understand how to build a secure and agile production environment by leveraging Docker containers and Kubernetes orchestration; learn about Oracle Container Services for use with Kubernetes, which provides a comprehensive container and orchestration environment for the delivery of microservices and next-generation application development; and watch a demonstration of how to use Vagrant and VirtualBox to automatically deploy a Kubernetes cluster. There will be a live Q&A at the end of the webinar.
Featured Speaker: Avi Miller, Product Management Director, Oracle Linux and Virtualization


Announcements

Action required: Replacement of SSL certificates for the Unbreakable Linux Network

Oracle is replacing Symantec-branded certificates with DigiCert-branded certificates across all of its infrastructure to prevent trust warnings once the Symantec root certificate authority is removed from several web browsers, including Firefox and Chrome.
Immediate action required before October 9, 2018: Due to the nature of how Oracle Linux systems connect to the Unbreakable Linux Network (ULN), this change requires that client certificates on all Oracle Linux systems directly subscribed to and receiving updates from ULN be updated. This does not affect Oracle Linux systems that are managed by Oracle Enterprise Manager or are subscribed to a local Spacewalk instance. The change in server certificates on ULN will occur on October 9, 2018. After that time, Oracle Linux systems will only be able to connect to ULN with an updated client certificate. Please make sure to update the packages listed at the end of this announcement on all servers that are registered directly to ULN before October 9, 2018.
What happens if I can't update before October 9, 2018? If you are unable to update to the packages listed below before October 9, 2018, you will be unable to connect to ULN and will receive one of the following errors:
The certificate /usr/share/rhn/ULN-CA-CERT is expired. Please ensure you have the correct certificate and your system time is correct.
OR
There was an SSL error: [('SSL routines', 'SSL3_GET_SERVER_CERTIFICATE', 'certificate verify failed')] A common cause of this error is the system time being incorrect. Verify that the time on this system is correct.
Resolution: Manually replace the SSL certificate. To manually replace the client SSL certificate on an Oracle Linux machine, run the following steps as root on each server:
# cp /usr/share/rhn/ULN-CA-CERT /usr/share/rhn/ULN-CA-CERT.old
# wget https://linux-update.oracle.com/rpms/ULN-CA-CERT.sha2
# cp ULN-CA-CERT.sha2 /usr/share/rhn/ULN-CA-CERT
After this file has been updated you can continue using ULN as normal. 
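For a fleet of directly registered servers, the same three steps can be scripted; this is a hypothetical sketch (the hostnames are placeholders, and it assumes root SSH access to each machine):

```shell
#!/bin/bash
# Hypothetical helper: apply the ULN certificate replacement on several
# directly registered servers over SSH. Replace the placeholder hostnames.
set -e
SERVERS="ol-server01 ol-server02"
for host in $SERVERS; do
  ssh root@"$host" '
    cp /usr/share/rhn/ULN-CA-CERT /usr/share/rhn/ULN-CA-CERT.old &&
    wget -q https://linux-update.oracle.com/rpms/ULN-CA-CERT.sha2 &&
    cp ULN-CA-CERT.sha2 /usr/share/rhn/ULN-CA-CERT
  '
  echo "updated: $host"
done
```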
After making this manual replacement, connectivity to ULN should be restored. The packages below should then be updated as part of your standard patching cycle. If you have any questions about this update please feel free to contact the ULN team via uln-info_us@oracle.com.
Packages to be updated:
Oracle Linux 7
rhn-client-tools-2.0.2-21.0.9.el7.noarch.rpm
rhn-setup-2.0.2-21.0.9.el7.noarch.rpm
rhn-check-2.0.2-21.0.9.el7.noarch.rpm
rhn-setup-gnome-2.0.2-21.0.9.el7.noarch.rpm (only required if a previous version is already installed)
Oracle Linux 6
rhn-setup-1.0.0.1-45.0.3.el6.noarch.rpm
rhn-client-tools-1.0.0.1-45.0.3.el6.noarch.rpm
rhn-check-1.0.0.1-45.0.3.el6.noarch.rpm
rhn-setup-gnome-1.0.0.1-45.0.3.el6.noarch.rpm (only required if a previous version is already installed)
Oracle Linux 5
x86_64:
up2date-5.10.1-41.30.el5.x86_64.rpm
up2date-gnome-5.10.1-41.30.el5.x86_64.rpm (only required if a previous version is already installed)
i386:
up2date-5.10.1-41.30.el5.i386.rpm
up2date-gnome-5.10.1-41.30.el5.i386.rpm (only required if a previous version is already installed)
ia64:
up2date-5.10.1-41.30.el5.ia64.rpm
up2date-gnome-5.10.1-41.30.el5.ia64.rpm (only required if a previous version is already installed)


Events

Discover why Oracle Linux is the top-rated operating system for business

In May 2017, IT Central Station's readers were asked to rank operating systems, and based on those reviews, Oracle Linux was named the 2017 top operating system for business purposes. It can often be difficult to compare the value of one operating system over another. Pricing is obviously an important consideration, but there are many other factors that should be considered when making such a fundamental platform decision. From the beginning, Oracle Linux was designed to provide a simpler way for Oracle customers to get full-stack support, from the operating system to the application, from an enterprise-class vendor that understands not just the operating system but all the really important things our customers actually need to run: databases, middleware, applications and, more recently, virtual machines and containers. To help ensure the most value for our customers, we've added additional components and products to Oracle Linux without increasing complexity. One of the customers that saw the biggest increases in efficiency and performance after migrating to Oracle Linux is Intel, and this year at OpenWorld they will be presenting the fascinating story of how they migrated their production, mission-critical manufacturing databases from Microsoft Windows to Oracle Linux with no outages or downtime. You'll also learn about the significant performance increase they gained on exactly the same hardware. To discover the value Oracle Linux can deliver for your organization, visit oracle.com/linux to connect with our global Oracle Linux sales team and schedule a customized workshop tailored specifically for you. Coming to Oracle OpenWorld? I will be giving a brief tour of all the other add-on packages that are supported on Oracle Linux at no extra cost and then introducing Intel at the "Why Choose Oracle Linux: The Value of Enterprise Linux" session on Monday, October 22nd at 5:45pm in Room 2000, Moscone West. 
You can also learn more about Oracle Linux directly from the product experts by visiting the Oracle Infrastructure Technologies showcase (booth #120) in Moscone South next to the Arena and just past the Autonomous Database showcase.


Linux

A selection of OpenWorld sessions on Oracle Linux and Oracle VM

Oracle OpenWorld 2018 is only a few weeks away! There are many sessions on Oracle Linux and Oracle VM, and here are a few technical sessions you may find interesting: Tips for Securing Your Cloud Infrastructure, Jan Hendrik Mangold, Jeff Savit [TIP4727], Monday 9:00 a.m. - 9:45 a.m., with products, tools, and techniques for security. Perform In-Place Upgrade for Large-Scale Cloud Infrastructure, Jeff Savit, Jeffery Yoder, Rodolfo Martinez [CAS5088], Monday 3:45 p.m. - 4:30 p.m., with real-world experiences of maintaining and upgrading an extremely large production Oracle VM environment. Maximize Performance with Oracle Linux and Oracle VM, Greg Marsden, Jeff Savit, Kevin Tribbey [TIP4725], Tuesday 5:45 p.m. - 6:30 p.m., with features of Oracle Linux, including DTrace, that enhance performance. Build a High Availability Solution with Oracle Linux: Corosync/Pacemaker, Jeff Savit [HOL3137], a hands-on lab for clustered Oracle Linux under VirtualBox, using Corosync and Pacemaker, Monday 5:15 p.m. - 6:15 p.m. Observing and Optimizing Your Application on Oracle Linux with DTrace, Jeff Savit [HOL6339], a hands-on lab introducing DTrace on Oracle Linux, Thursday 12:00 p.m. - 1:00 p.m. These are sessions I will be at, so I hope you attend and find them useful. To learn more, visit the Oracle Infrastructure Technologies showcase featuring Oracle Linux and Virtualization technologies, booth #120, located in Moscone South (on the right side, just past the Autonomous Database showcase), where you can experience demos, a virtual reality game, and speak with product experts and partners.


Events

Oracle Linux at Oracle OpenWorld 2018

Oracle OpenWorld 2018, in San Francisco, CA, is less than a month away! To help you plan your schedule, below is the lineup of Oracle Linux sessions. The highlighted sessions are ones in which you’ll hear from our executives. This year’s content includes product roadmaps, tips and tricks, product training, customer case studies, and business use cases to enrich your learning experience. Remember to register ahead of time to make sure you have a seat. At the conference, you’ll also have the opportunity to connect with other Oracle customers, product experts, and partners, to help you make the most of your time. Read on and fill up your schedule now.
The Sessions:
Monday, Oct 22:
Tips for Securing Your Cloud Infrastructure, Jan Hendrik Mangold, Jeff Savit [TIP4727], 9:00 a.m. - 9:45 a.m.
Oracle Linux and Oracle VM: Get Trained for Cloud, Hybrid, and On-Premises, Avi Miller, Antoinette O’Sullivan [TRN5828], 10:30 a.m. - 11:15 a.m. -- more from Antoinette
Oracle Linux: State of the Penguin, Wim Coekaerts [PRO4720], 11:30 a.m. - 12:15 p.m.
Automating Workload Migration to Oracle Cloud Infrastructure, Simon Coter, Gilson Melo, Alessandro Pilotti [PRO5796], 12:30 p.m. - 1:15 p.m.
Oracle’s Systems Strategy for Cloud and On-Premises, Ali Alasti, Wim Coekaerts, Edward Screven [PKN5901], 3:45 p.m. - 4:45 p.m.
Perform In-Place Upgrade for Large-Scale Cloud Infrastructure, Jeff Savit, Jeffery Yoder, Rodolfo Martinez [CAS5088], 3:45 p.m. - 4:30 p.m.
Why Choose Oracle Linux: The Value of Enterprise Linux, Deepen Chakraborty, Avi Miller [CAS4726], 5:45 p.m. - 6:30 p.m.
Tuesday, Oct 23:
An Overview of Oracle Infrastructure Technologies in Oracle Cloud, Robert Shimp, Ajay Srivastava [PRO5904], 11:15 a.m. - 12:00 p.m.
Kubernetes, Docker, and Oracle Linux from On-Premises to Oracle Cloud with Ease, Wim Coekaerts [DEV6015], 11:30 a.m. - 12:15 p.m.
Best Practices: Oracle Linux and Oracle VM in Oracle Cloud Infrastructure, Julie Wong, Simon Coter [PRO4721], 4:45 p.m. - 5:30 p.m.
Maximize Performance with Oracle Linux and Oracle VM, Greg Marsden, Jeff Savit, Kevin Tribbey [TIP4725], 5:45 p.m. - 6:30 p.m.
Wednesday, Oct 24:
The OS Factor: Advice for the Technology Buyer from IDC, Karen Sigman, Ashish Nadkarni [BUS4729], 11:15 a.m. - 12:00 p.m.
Secure and Agile Orchestration for Linux Containers, Avi Miller [TRN4723], 12:30 p.m. - 1:15 p.m. -- more from Avi
The Emergence of New Threats: A Look at Spectre and Meltdown, Greg Marsden, Bruce Lowenthal [TIP3992], 4:45 p.m. - 5:30 p.m.
Thursday, Oct 25:
Oracle Linux is really the ideal Linux for Oracle Cloud Developers, Wim Coekaerts [DEV6017], 9:00 a.m. - 9:45 a.m.
Build an ARM64-Based Solution with Oracle Linux, Honglin Su, Michele Resta [PRM4722], 9:00 a.m. - 9:45 a.m.
Practical DevOps with Linux and Virtualization, Simon Coter [DEV5029], 10:00 a.m. - 10:45 a.m.
Embrace Open Source Projects on GitHub for Cloud Automation, Avi Miller, Simon Coter [TIP5795], 12:00 p.m. - 12:45 p.m.
Why Oracle Linux is the Best Platform for Oracle Database and Oracle Cloud, Dhaval Giani [PRO5797], 1:00 p.m. - 1:45 p.m.
Accelerate Your Business with Machine Learning and Oracle Linux, Joost Pronk Van Hoogeveen, Simon Coter [PRO4731], 2:00 p.m. - 2:45 p.m.
Add these sessions to your schedule and don't forget to bookmark our Focus on Oracle Linux and Virtualization page. And, there’s more…
The Showcase, Moscone South – Booth #120: Make sure to find time to visit the Oracle Infrastructure Technologies showcase featuring Oracle Linux and Virtualization technologies, booth #120, located in Moscone South (on the right side, just past the Autonomous Database showcase), where you can learn more about Oracle Linux, experience demos, a virtual reality game, and speak with product experts and partners. #OOW18 is sure to be an informative event. Stay tuned to this blog for more information on sessions, Hands-on Labs (HOLs), and more, in the coming days. We look forward to sharing this open world with you!


Events

Building an open container native platform with Oracle Linux

In today's modern world of cloud-first development and container native deployment, building the infrastructure to support all of your business requirements can be complex. Going with an "all-in-one" product can seem attractive, even at the cost of locking you into that vendor. At Oracle, we're committed to letting our customers build their cloud, their way. Our goal is to provide maximum choice with all components based on open technologies. Whether your goal is to better manage and predict your IT costs while keeping pace with business demands or your developers expect the latest technology and rapid provisioning, Oracle has a solution that will fit. You can choose to migrate your workloads to Oracle Cloud and take advantage of industry leading IaaS and PaaS options, bring Oracle Cloud services into your data center with Cloud at Customer or build your own private cloud using Oracle Linux as the foundation. Oracle has years of experience with providing container-based solutions and most of our flagship products are available as container images. We also provide container runtime and orchestration tools at no extra cost with an Oracle Linux Premier support subscription. You can learn more about how your business can take advantage of these tools during the Secure and Agile Orchestration for Linux Containers session at Oracle OpenWorld on Wednesday, October 24th at 12:30pm in Room 2000, Moscone West.


Linux

Oracle Instant Client RPMs Now Available on Oracle Linux Yum Servers in OCI

Today we added Oracle Instant Client to the Oracle Cloud Infrastructure (OCI) yum mirrors. This makes developing Oracle Database-based apps on OCI a breeze. Previously, installing Oracle Instant Client required either registering a system with ULN or downloading from OTN, each with manual steps to accept license terms. Now you can simply use yum install directly from Oracle Linux running in OCI. See this tutorial on the Oracle Developer blog for an example that connects a Node.js app running on an OCI instance to an Autonomous Transaction Processing (ATP) Database.
Getting Oracle Instant Client RPMs From Your Local OCI Yum Mirror
Grab the latest version of the repo definition from the yum server local to your region as follows:
cd /etc/yum.repos.d
sudo mv public-yum-ol7.repo public-yum-ol7.repo.bak
export REGION=`curl http://169.254.169.254/opc/v1/instance/ -s | jq -r '.region' | cut -d '-' -f 2`
sudo -E wget http://yum-$REGION.oracle.com/yum-$REGION-ol7.repo
Enable the ol7_oci_included repo:
sudo yum-config-manager --enable ol7_oci_included
Behold! 
$ yum list oracle-instantclient*
Loaded plugins: langpacks, ulninfo
Installed Packages
oracle-instantclient12.2-basic.x86_64        12.2.0.1.0-1    @ol7_oci_included
Available Packages
oracle-instantclient12.2-basiclite.x86_64    12.2.0.1.0-1    ol7_oci_included
oracle-instantclient12.2-devel.x86_64        12.2.0.1.0-1    ol7_oci_included
oracle-instantclient12.2-jdbc.x86_64         12.2.0.1.0-1    ol7_oci_included
oracle-instantclient12.2-odbc.x86_64         12.2.0.1.0-1    ol7_oci_included
oracle-instantclient12.2-precomp.x86_64      12.2.0.1.0-1    ol7_oci_included
oracle-instantclient12.2-sqlplus.x86_64      12.2.0.1.0-1    ol7_oci_included
oracle-instantclient12.2-tools.x86_64        12.2.0.1.0-1    ol7_oci_included
oracle-instantclient18.3-basic.x86_64        18.3.0.0.0-1    ol7_oci_included
oracle-instantclient18.3-basiclite.x86_64    18.3.0.0.0-1    ol7_oci_included
oracle-instantclient18.3-devel.x86_64        18.3.0.0.0-1    ol7_oci_included
oracle-instantclient18.3-jdbc.x86_64         18.3.0.0.0-1    ol7_oci_included
oracle-instantclient18.3-odbc.x86_64         18.3.0.0.0-1    ol7_oci_included
oracle-instantclient18.3-precomp.x86_64      18.3.0.0.0-1    ol7_oci_included
oracle-instantclient18.3-sqlplus.x86_64      18.3.0.0.0-1    ol7_oci_included
oracle-instantclient18.3-tools.x86_64        18.3.0.0.0-1    ol7_oci_included
$
Try it Yourself: If you want to give this a try, read the end-to-end example here.
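For example, to install the 18.3 client libraries and SQL*Plus from the listing above (package names come from the listing; the install path shown is the typical Instant Client layout and may vary):

```shell
# Install the Instant Client basic libraries and SQL*Plus
sudo yum install -y oracle-instantclient18.3-basic \
                    oracle-instantclient18.3-sqlplus
# Binaries typically land under /usr/lib/oracle/18.3/client64/bin;
# print the SQL*Plus version banner to confirm the install
/usr/lib/oracle/18.3/client64/bin/sqlplus -V
```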


Announcements

Announcing the developer preview of Oracle Container Services 1.1.10 for use with Kubernetes

Oracle is pleased to announce the developer preview release of Oracle Container Services 1.1.10 for use with Kubernetes®. This release maintains Oracle's commitment to conformance with the upstream project and is Certified Kubernetes by the Cloud Native Computing Foundation (CNCF).
Release Information
Oracle Container Services 1.1.10 for use with Kubernetes is based on Kubernetes version 1.10, as released upstream. It is available for Oracle Linux 7 and is designed to integrate with the Oracle Container Runtime for Docker. Oracle Container Services for use with Kubernetes runs in a series of Docker containers, and these images are available from the new "Container Services (Developer)" section of the Oracle Container Registry. Oracle has provided and tested a setup and configuration script that takes advantage of the kubeadm cluster configuration utility. This setup script eases configuration and setup on Oracle Linux and provides additional support for backup and recovery.
Installation
Oracle Container Services 1.1.10 for use with Kubernetes is free to download from the Oracle Linux 7 Developer channel on the Oracle Linux yum server. You can use the standard yum update command to perform an upgrade; however, Oracle does not support Kubernetes on systems where the ol7_preview, ol7_developer, or ol7_developer_EPEL yum repositories or ULN channels are enabled, or where software from these repositories or channels is currently installed on the systems where Kubernetes runs.
Kubernetes® is a registered trademark of The Linux Foundation in the United States and other countries, and is used pursuant to a license from The Linux Foundation. 
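A minimal install sketch, assuming the package and setup-script names follow earlier Oracle Container Services releases (the kubeadm package and its kubeadm-setup.sh wrapper are assumptions; verify against the release documentation):

```shell
# Enable the Oracle Linux 7 developer channel
sudo yum-config-manager --enable ol7_developer
# Install Oracle's Kubernetes packaging (name assumed from prior releases)
sudo yum install -y kubeadm
# Bring up a single-master cluster with Oracle's setup wrapper around
# the upstream kubeadm utility mentioned in the announcement
sudo kubeadm-setup.sh up
```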
Resources – Oracle Linux Documentation Oracle Linux Software Download Oracle Linux Oracle Container Registry Blogs Oracle Linux Blog Oracle Ksplice Blog Oracle Linux Kernel Development Blog Community Pages Oracle Linux Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter Data Sheets, White Papers, Videos, Training, Support & more Oracle Linux Product Training and Education Oracle Linux - https://oracle.com/education/linux For community-based support, please visit the Oracle Linux space on the Oracle Technology Network Community.


Announcements

Announcing Oracle Container Runtime for Docker Release 18.03

Oracle is pleased to announce the release of Oracle Container Runtime for Docker version 18.03. Oracle Container Runtime allows you to create and distribute applications across Oracle Linux systems and other operating systems that support Docker. Oracle Container Runtime for Docker consists of the Docker Engine, which packages and runs the applications, and integrates with Docker Hub, Docker Store, and the Oracle Container Registry to share the applications in a Software-as-a-Service (SaaS) cloud.
Notable Updates
Oracle has implemented multi-registry support that makes it possible to run the daemon with the --add-registry flag to include a list of additional registries to query when performing a pull operation. This functionality, currently available as a technology preview, enables Oracle Container Runtime for Docker to use the Oracle Container Registry as the default registry to search for container images, before falling back to alternate registry sources such as a local mirror, Docker Hub, or Docker Store. This feature also provides the --block-registry flag, which can be used to prevent access to a particular Docker registry. Registry lists ensure that all images are automatically prefixed with their source registry, so that a listing of Docker images always indicates the source registry from which an image was pulled.
Docker 18.03 introduces enhancements that allow for better integration with Kubernetes orchestration as an alternative to Docker Swarm, including changes to follow namespace conventions used across a variety of other containerization projects.
The Dockerfile can now also exist outside of the build context, allowing you to store Dockerfiles together and reference their paths in the docker build command, or to pass a Dockerfile on stdin.
Several improvements to logging and access to docker logs have been added, including the --until flag to limit the log lines shown to those that occurred before the specified timestamp.
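The registry-list behavior described above might be configured like this (a sketch; the blocked registry hostname is a placeholder, and in practice these daemon options would usually go in the Docker engine's configuration rather than on the command line):

```shell
# Start the daemon with an ordered registry search list (technology
# preview flags described in this release):
dockerd --add-registry container-registry.oracle.com \
        --add-registry docker.io \
        --block-registry blocked-registry.example.com
# Pulls without an explicit registry prefix now try the Oracle Container
# Registry first, then Docker Hub; the blocked registry is refused.
docker pull mysql/mysql-server
```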
Experimental Docker trust management commands have been added to better handle trust management on Docker images. See the docker trust command for more information. Upgrading To learn how to upgrade from a previously supported version of Oracle Container Runtime for Docker, please review the Upgrading Oracle Container Runtime for Docker chapter of the documentation. Note that upgrading from a developer preview release is not supported by Oracle. Support Support for the Oracle Container Runtime for Docker is available to customers with an Oracle Linux Premier support subscription. Refer to Oracle Linux 7 License Information User Manual for information about Oracle Linux support levels. Oracle Linux Resources: Documentation Oracle Linux Software Download Oracle Linux Oracle Container Registry Blogs Oracle Linux Blog Oracle Ksplice Blog Oracle Mainline Linux Kernel Blog Community Pages Oracle Linux Social Media Oracle Linux on YouTube Oracle Linux on Facebook Oracle Linux on Twitter Data Sheets, White Papers, Videos, Training, Support & more Oracle Linux Product Training and Education Oracle Linux - https://oracle.com/education/linux For community-based support, please visit the Oracle Linux space on the Oracle Developer Community.


Announcements

Announcing Oracle OpenStack Release 5.0

We are pleased to announce the release of Oracle OpenStack 5.0, based on the upstream Queens release. Oracle OpenStack 5.0 includes support for the KVM hypervisor included with the Unbreakable Enterprise Kernel Release 5 for Oracle Linux 7.
What's New: Support for OpenStack Queens
For more than two years, beginning with the Kilo release, Oracle OpenStack has deployed the OpenStack control plane in Docker containers, enabling simple, scalable, and reliable deployment, updates, and upgrades of OpenStack services. The Oracle OpenStack containers have been updated to the upstream Queens release.
New Capabilities:
In-place upgrade: easily upgrade Oracle OpenStack Release 4 (Pike) to Release 5 (Queens) without requiring additional hardware. This can be done either service by service or all at once with a single command, with no instance downtime.
Newly Supported Services:
Ironic (Bare Metal-as-a-Service): enables users to deploy workloads onto a physical machine instead of a virtualized instance on a hypervisor. Users of the OpenStack Compute API can launch a bare metal server instance in the same way that they can currently launch a VM instance.
Telemetry and monitoring tools: offers services including Ceilometer (a data collection service), Aodh (an alarming service), and Gnocchi (a time-series database and resource indexing service). These tools enable applications such as metering, monitoring, alarming, and billing.
Designate: provides multitenant DNS-as-a-Service for OpenStack. It can be configured to auto-generate records based on Nova and Neutron actions.
Enhancements: Deployment Configuration Flexibility
Secure-by-default configuration of TLS: automatically installs trusted certificates, or generates and installs self-signed certificates, to protect API endpoints.
Reset-to-defaults: enables quick, automated iterations when testing various deployment configurations. 
Cinder Block Storage Services**
Block storage multi-attach: attach a volume to multiple VMs to enable highly available clustered filesystems, such as ASM for Oracle Real Application Clusters (Oracle RAC).
Ceph Luminous support: for the Cinder backend and Cinder backup.
NFS support: for volume backup, providing a flexible and economical solution for development and test environments.
Nova Compute Services**
Libvirt compute driver: enables the new block storage multi-attach feature in Cinder, critical for highly available, mission-critical workloads such as Oracle RAC.
Neutron Networking Service
Infoblox IPAM plugin integration: provides an interface from Neutron to the Infoblox DDI Appliance, a leading DNS/DHCP/IPAM solution for enterprises and service providers.
Keystone Identity Service
Application Credentials: enables finer-grained access control.
Glance Image Service
Shared storage: is automatically configured for Glance, if available, when using the file backend.
** Oracle has supported multi-attach Cinder/Nova capabilities for the automated deployment of Oracle RAC and Oracle Database 12c single instance since Release 4 (Pike). The OpenStack community incorporated these capabilities with the Queens release.
Tech Preview Features:
Terraform for Oracle Database 12c single instance: Terraform is an alternative to the Murano service for automated deployment of Oracle Database 12c single instance. Some of the advantages of Terraform: excellent portability and cloud-agnostic operation (a single, universal tool for describing infrastructure for OpenStack, Oracle Cloud Infrastructure, or any other public/private cloud); enhanced troubleshooting capability (the progress of the Oracle Database 12c deployment script can be followed and its output viewed). 
Magnum (Container-as-a-Service): an OpenStack API service that makes container orchestration engines (COEs) such as Docker Swarm, Kubernetes, and Apache Mesos available as first-class resources in OpenStack. Magnum uses Heat to orchestrate an OS image containing Docker and the COE, and runs that image in either virtual machines or on bare metal in a cluster configuration.

OpenStack Community Contributions

Oracle has been actively contributing to Nova, Cinder, Kolla, Murano, Oslo, and many other projects. All Oracle enhancements are contributed upstream and are freely available for anyone to use. Below are a few examples of Oracle code contributions available upstream for the Queens release.
- Kolla: provides production-ready containers and deployment tools for operating OpenStack clouds. Oracle developed and contributed a command line interface called kollacli to Kolla. Kollacli provides a simple, intuitive, and consistent user interface for driving kolla-ansible deployments.
- Multi-attach support for Nova/Cinder block devices: required to support shared storage for Oracle RAC and other solutions that need shared storage.
- MySQL Cluster NDB: to address OpenStack scaling issues, Oracle OpenStack employs MySQL Cluster with the NDB storage engine for the database backend. Oracle has contributed upstream enhancements to OpenStack services to help ensure they use the oslo.db framework when doing database creations, upgrades, and migrations.
- Murano service: Oracle contributed numerous new features and bug fixes.

Product Life Cycle Support

Support for Oracle OpenStack is included, at no additional cost, as part of Oracle Premier Support for Oracle Linux or Oracle Premier Support for Systems.

Software Download

Download Oracle OpenStack Docker images from the Oracle Container Registry, Docker Hub, or Oracle Software Delivery Cloud.
Please refer to chapters 2 through 4 of the Installation and Deployment Guide, available in the Oracle Documentation Library, for important steps to take prior to downloading the Docker images. Oracle Linux software packages required to deploy Oracle OpenStack are available from the Oracle Linux yum server and from the Unbreakable Linux Network (ULN).

Resources
- Documentation: Release Notes, Installation and Configuration Guide, Application Deployment Guide
- Data Sheets, Podcast, Videos
- Oracle OpenStack Community Pages

Product Training and Education

Training from Oracle University:
- Oracle OpenStack: Administration Essentials Ed 1 (NEW) - teaches students about essential OpenStack services for creating and managing cloud resources as a cloud administrator and identifies tasks cloud operators perform.
- Oracle OpenStack: Getting Started Ed 1 - teaches students that are new to OpenStack about this cloud computing architecture, core and optional services, Docker images and containers, a multi-node deployment, and troubleshooting deployments.


Linux Kernel Development

A Musical Tour of Hints and Tools for Debugging Host Networks

Shannon Nelson from the Oracle Linux Kernel Development team offers these tips and tricks to help make host network diagnostics easier. He also includes a recommended playlist for accompanying your debugging!

Ain't Misbehavin' (Dinah Washington)

As with many debugging situations, digging into and resolving a network-based problem can seem like a lot of pure guess and magic. In the networking realm, not only do we have the host system's processes and configurations to contend with, but also the exciting and often frustrating asynchronicity of network traffic. Some of the problems that can trigger a debug session are reports of lost packets, corrupt data, poor performance, and even random system crashes. These don't always turn out to be actual network problems, but as soon as the customer mentions anything about their wiring rack or routers, the network engineer is brought in and put on the spot. This post is intended not as a full how-to for debugging any particular network issue, but as a set of some of the tips and tools that we use when investigating network misbehavior.

Start Me Up (The Rolling Stones)

Probably the most important debugging tool available, and the one needed to even get started, is a concise and clear description of what is happening that shouldn't happen. This is harder to get than one might think. You know what I mean, right? The customer might give us anything from "it's broken" to a 3-page dissertation of everything but the actual problem. We start gathering a clearer description by asking simple questions that should be easy to answer. Things like:
- Who found it, and who is the engineering contact?
- Exactly what equipment was it running on?
- When/how often does this happen?
- What machines/configurations/NICs/etc. are involved?
- Do all such machines have this problem, or only one or two?
- Are there routers and/or switches involved?
- Are there Virtual Machines, Virtual Functions, or Containers involved?
- Are there macvlans, bridges, bonds, or teams involved?
- Are there any network offloads involved?
With this information, we should be able to write our own description of the problem and see if the customer agrees with our summary. Once we can refine that, we should have a better idea of what needs to be looked into. Some of the most valuable tools for getting this information are simple commands that the user can run on the misbehaving systems. These should help detail what NICs and drivers are on the system and how they might be connected.
- uname -a: an excellent way to start, if nothing else to get a basic idea of what the system is and how old the kernel is. This can catch the case where the customer isn't running a supported kernel.
These next few are good for finding what is on the system and how it is connected:
- ip addr, ip link: good for getting a view of the network ports that are configured, and perhaps pointing out devices that are either offline or not set to the right address. These can also give a hint as to what bonds or teams might be in place. They replace the deprecated "ifconfig" command.
- ip route: shows which network devices will handle outgoing packets. This is mostly useful on systems with many network ports. It replaces the deprecated "route" command and the similar "netstat -rn".
- brctl show: lists the software bridges that are set up and what devices are connected to them.
- netstat -i: gives a summary list of the interfaces and their basic statistics. These are also available with "ip -s link", just formatted differently.
- lseth: a non-standard command that gives a nice summary combining much of the output of the above commands. (See http://vcojot.blogspot.com/2015/11/introducing-lsethlsnet.html)

Watchin' the Detectives (Elvis Costello)

Once we have an idea which particular device is involved, the following commands can help gather more information about that device.
This can give us an initial clue as to whether or not the device is configured in a generally healthy way.
- ethtool <ethX>: lists driver and connection attributes, such as the current connection speed and whether link is detected.
- ethtool -i <ethX>: lists device driver information, including kernel driver and firmware versions (useful for being sure the customer is working with the right software) and the PCIe device bus address (good for tracking the low-level system hardware interface).
- ethtool -l <ethX>: shows the number of Tx and Rx queues that are set up, which usually should match the number of CPU cores to be used.
- ethtool -g <ethX>: shows the number of packet buffers for each Tx and Rx queue; too many and we're wasting memory, too few and we risk dropping packets under heavy throughput pressure.
- lspci -s <bus:dev:func> -vv: lists detailed information about the NIC hardware and its attributes. You can get the interface's <bus:dev:func> from "ethtool -i".

Diary (Bread)

The system logfiles usually have some good clues in them as to what may have happened around the time of the issue being investigated. "dmesg" gives the direct kernel log messages, but beware that it is a limited-size buffer that can get overrun and lose history over time. In older Linux distributions the system logs are found in /var/log, most usefully in either /var/log/messages or /var/log/syslog, while newer "systemd" based systems use "journalctl" for accessing log messages. Either way, there are often interesting traces to be found that can help describe the behavior. One thing to watch out for is that when the customer sends a log extract, it usually isn't enough. Too often they will capture something like the kernel panic message, but not the few lines before it that show what led up to the panic. Much more useful is a copy of the full logfile, or at least something with several hours of log before the event.
Once we have the full file, it can be searched for error messages, or for any log messages with the ethX name or the PCI device address, to look for more hints. Sometimes just scanning through the file shows patterns of behavior that can be related.

Fakin' It (Simon & Garfunkel)

With the information gathered so far, we should have a chance at creating a simple reproducer. Much of the time we can't go poking at the customer's running systems, but need to demonstrate the problem and the fix on our own lab systems. Of course we don't have the same environment, but with a concise enough problem description we stand a good chance of finding a simple case that shows the same behavior. Some traffic generator tools that help in reproducing issues include:
- ping: send one or a few packets, or a packet flood, to a NIC. It has flags for size, timing, and other send parameters.
- iperf: good for heavy traffic exercise, and several instances can be run in parallel to get a better RSS spread on the receiver.
- pktgen: this kernel module can generate much more traffic than user-level programs, in part because the packets don't have to traverse the sender's network stack. There are also several options for packet shapes and throughput rates.
- scapy: a Python tool that allows scripting of specially crafted packets, useful for making sure certain data patterns are exactly what you need for a particular test.

All Along the Watchtower (The Jimi Hendrix Experience)

With our own model of the problem, we can start looking deeper into the system to see what is happening: looking at throughput statistics and watching actual packet contents. Easy statistic gathering can come from these tools:
- ethtool -S <ethX>: most NIC device drivers offer Tx and Rx packet counts, as well as error data, through the '-S' option of ethtool.
This device-specific information is a good window into what the NIC thinks it is doing, and can show when the NIC sees low-level issues, including malformed packets and bad checksums.
- netstat -s <ethX>: gives protocol statistics from the upper networking stack, such as TCP connections, segments retransmitted, and other related counters.
- ip -s link show <ethX>: another method for getting a summary of traffic counters, including some dropped packets.
- grep <ethX> /proc/interrupts: looking at the interrupt counters can give a better idea of how well the processing is being spread across the available CPU cores. For some loads we can expect a wide dispersal, while other loads might end up with one core more heavily loaded than the others.
- /proc/net/*: the kernel networking stack exposes lots of data files here that can show many different aspects of the network stack's operation. Many of the command line utilities get their information directly from these files, and sometimes it is handy to write your own scripts to pull the very specific data you need from them.
- watch: the above tools give a snapshot of the current status, but sometimes we need a better idea of how things are working over time. The "watch" utility can help here by repeatedly running the snapshot command and displaying the output, even highlighting where things have changed since the last snapshot. Example uses include:

# See the interrupt activity as it happens
watch "grep ethX /proc/interrupts"

# Watch all of the NIC's non-zero stats
watch "ethtool -S ethX | grep -v ': 0'"

Also useful for catching data in flight are tcpdump and its cousins wireshark and tcpreplay. These are invaluable for catching packets from the wire, dissecting exactly what got sent and received, and replaying the conversation for testing.
These have whole tutorials in and of themselves so I won't detail them here, but here's an example of tcpdump output for a single network packet:

23:12:47.471622 IP (tos 0x0, ttl 64, id 48247, offset 0, flags [DF], proto TCP (6), length 52)
    14.0.0.70.ssh > 14.0.0.52.37594: Flags [F.], cksum 0x063a (correct), seq 2358, ack 2055, win 294, options [nop,nop,TS val 2146211557 ecr 3646050837], length 0
    0x0000:  4500 0034 bc77 4000 4006 61d3 0e00 0046
    0x0010:  0e00 0034 0016 92da 21a8 b78a af9a f4ea
    0x0020:  8011 0126 063a 0000 0101 080a 7fec 96e5
    0x0030:  d952 5215

Photographs and Memories (Jim Croce)

Once we've made it this far and have some idea that it might be a particular network device driver issue, we can do a little research into the history of the driver. A good web search is an invaluable friend. For example, a web search for "bnxt_en dropping packets" brings up some references to a bugfix for the Nitro A0 hardware - perhaps this is related to a packet drop problem we are seeing? If we have a clone of the Linux kernel git repository, we can search the patch history for key words. If there's something odd happening with macvlan filters, this will point out some patches that might be related to the issue. For example, here's a macvlan issue with driver resets that was fixed upstream in v4.18:

$ git log --oneline drivers/net/ethernet/intel/ixgbe | grep -i macvlan | grep -i reset
8315ef6 ixgbe: Avoid performing unnecessary resets for macvlan offload
e251ecf ixgbe: clean macvlan MAC filter table on VF reset

$ git describe --contains 8315ef6
v4.18-rc1~114^2~380^2

Reelin' In the Years (Steely Dan)

A couple of examples can show a little of how these tools have been used in real life. Of course, it's never as easy as it sounds when you're in the middle of it.
Lost/broken packets with TSO from sunvnet through a bridge: When doing some performance testing on the sunvnet network driver, a virtual NIC in the SPARC Linux kernel, we found that enabling TSO actually hurt throughput significantly, rather than helping, when sending to a remote system. After using netstat and ethtool -S to find that there were a lot of lost packets and retries through the base machine's physical NIC, we used tcpdump on the NIC and at various points in the internal software bridge to find where packets were getting broken and dropped. We also found comments on the netdev mailing list about an issue with TSO'd packets getting mangled when entering the software bridge. We turned off TSO for packets headed into the host bridge, and the performance issue was fixed.

Log file points out misbehaving process: In a case where NIC hardware was randomly freezing up on several servers, we found that a compute service daemon had recently been updated with a broken version that would immediately die and restart several times a second on scores of servers at the same time, resetting the NICs each time. Once the daemon was fixed, the NIC resetting stopped and the network problem went away.

Bring It On Home

This is just a quick overview of some of the tools for debugging a network issue. Everyone has their favorite tools and different uses; we've only touched on a few here. They are all handy, but all need our imagination and perseverance to be useful in getting to the root of whatever problem we are chasing. Also useful are quick shell scripts written to collect specific sets of data, and shell scripts to process various bits of data when looking for something specific. For more ideas, see the links below. And sometimes, when we've dug so far and haven't yet found the gold, it's best to just get up from the keyboard, take a walk, grab a snack, listen to some good music, and let the mind wander. Good hunting.
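As a concrete illustration of the "quick shell scripts written to collect specific sets of data" mentioned above, here is a minimal sketch of such a collector. The command list is purely illustrative (trim or extend it for the problem at hand), and it skips anything not installed on the system so one transcript can be attached to a bug report:

```shell
#!/bin/sh
# Run a fixed list of inventory commands, skipping any that are not
# installed, and bundle the transcript into one report string.
collect() {
  for cmd in "uname -a" "ip addr" "ip route" "netstat -i" "cat /proc/net/dev"; do
    name=${cmd%% *}                      # first word = the executable to probe
    if command -v "$name" >/dev/null 2>&1; then
      printf '== %s ==\n' "$cmd"
      $cmd 2>&1                          # capture stderr too; errors are data
    else
      printf '== %s == (command not available)\n' "$cmd"
    fi
  done
}
report=$(collect)
printf '%s\n' "$report"
```

Redirect the output to a file and the same script can be run before and after a test to diff the two snapshots.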
Related pages
- Linux network troubleshooting and debugging - https://unix.stackexchange.com/questions/50098/linux-network-troubleshooting-and-debugging
- Tracing NFS: Beyond tcpdump - https://blogs.oracle.com/linux/tracing-nfs%3a-beyond-tcpdump-v2
- Tracing Linux Networking with DTrace on Oracle Linux - https://blogs.oracle.com/linux/tracing-linux-networking-with-dtrace-on-oracle-linux-v2
- iproute2 uses - https://baturin.org/docs/iproute2/
- A tcpdump Tutorial and Primer with Examples - https://danielmiessler.com/study/tcpdump/
- Searching git code and logs - https://git-scm.com/book/en/v2/Git-Tools-Searching and https://git-scm.com/docs/git-log#git-log--Sltstringgt
- Wireshark User's Guide - https://www.wireshark.org/docs/wsug_html/
- systemd: Using the journal - https://fedoramagazine.org/systemd-using-journal/


Linux Kernel Development

Getting system resource information with a Standard API

Oracle Linux kernel developer Rahul Yadav kicked off a new project in LXC this year, called libresource. In this blog post, he talks about how to use libresource to read system statistics effectively and in a stable manner. The project is hosted on GitHub at https://github.com/lxc/libresource

System resource information, like memory, network, and device statistics, is crucial for system administrators to understand the inner workings of their systems, and is increasingly being used by applications to fine-tune performance in different environments. Getting system resource information on Linux is not a straightforward affair. Many tools, like top, free, and sar, can gather system statistics. The common way is to collect the information from procfs or sysfs, but getting such information from procfs or sysfs presents many challenges. Each time an application wants a piece of system resource information, it has to open a file, read the content, and then parse it to extract the actual information. Over time, the format in which the information is provided might change, and with that each application has to change its own code to read the data in the correct manner. Libresource tries to fix a few of these problems by providing a standard library with a set of APIs through which we can get system resource information, e.g. memory, CPU, stat, networking, and security-related information. Libresource provides the following benefits:
- Ease of use: Currently applications need to read this information mostly from the /proc and /sys filesystems. In most cases, complex string parsing is involved, and it has to be done in application code. With the library APIs, an application can get the information directly, and any string parsing is done by the library.
- Stability: If the format in which the information is provided in /proc or /sys changes, then application code has to change to align with those changes.
Also, if a better way to get the information arrives in the future, such as a syscall or a sysconf, application code would again need to change to take advantage of it. The library takes care of such changes, and the application never has to change its code.
- Virtualization: In cases where a DB is running in a virtualized environment using cgroups or namespaces, reading from the /proc and /sys filesystems might not give correct information, as these are not cgroup aware. The library API will take care of this; e.g. if a process is running in a cgroup, the library should provide information local to that cgroup.

Interfaces to libresource

Reading a single resource ID:

/* This is to read a resource information. A valid resource id should be
 * provided in res_id, out should be properly allocated on the basis of
 * size of resource information, hint should be given where needed.
 * Currently pid and flags are not used, they are for future extensions.
 */
int res_read(int res_id, void *out, void *hint, int pid, int flags);

Available Resource IDs:
- RES_MEM_ACTIVE: Total amount of buffer or page cache memory, in kilobytes, that is in active use.
- RES_MEM_INACTIVE: Total amount of buffer or page cache memory, in kilobytes, that is free and available.
- RES_MEM_AVAILABLE: An estimate of how much memory is available for starting new applications, without swapping.
- RES_MEM_FREE: The amount of physical RAM, in kilobytes, left unused by the system.
- RES_MEM_TOTAL: Total amount of physical RAM, in kilobytes.
- RES_MEM_PAGESIZE: Size of a page in bytes.
- RES_MEM_SWAPFREE: Total amount of swap free, in kilobytes.
- RES_MEM_SWAPTOTAL: The total amount of swap available, in kilobytes.
- RES_KERN_COMPILE_TIME: Kernel compile time.
- RES_KERN_RELEASE: Kernel version.
- RES_NET_ALLIFSTAT: Network stats for all interfaces on the system.
- RES_NET_IFSTAT: Network stats for an interface.
- RES_MEM_INFOALL: All memory-related information.

Reading multiple resources in one call:

If an application wants to read multiple pieces of resource information in one call, it can use the res_*_blk APIs described below.

#define RES_UNIT_OUT_SIZE 256

/* This union is used to return resource information of various types */
union r_data {
        int i;
        size_t sz;
        long l;
        char str[RES_UNIT_OUT_SIZE];
        void *ptr;
};

/* In case of res_read_blk, each resource information will be represented by
 * following structure.
 */
typedef struct res_unit {
        int status;
        unsigned int res_id;
        void *hint;
        union r_data data;
} res_unit_t;

/* In case of bulk read (res_read_blk), this structure will hold all required
 * information needed to do so.
 */
typedef struct res_blk {
        int res_count;
        res_unit_t *res_unit[0];
} res_blk_t;

/* It allocates memory for resources and initiates them properly.
 * res_ids holds an array of valid resource ids and res_count holds
 * number of resource ids. It also initializes struct fields properly.
 */
extern res_blk_t *res_build_blk(int *res_ids, int res_count);

/* Reading bulk resource information. Memory must be properly allocated and
 * all fields should be properly filled to return error free resource
 * information. res_build_blk call is suggested to allocate build res_blk_t
 * structure.
 */
extern int res_read_blk(res_blk_t *resblk, int pid, int flags);

/* Free memory allocated by res_build_blk */
extern void res_destroy_blk(res_blk_t *resblk);

Some Examples

Reading total memory:

size_t stemp = 0;
res_read(RES_MEM_TOTAL, &stemp, NULL, 0, 0);
printf("MEMTOTAL is: %zu\n", stemp);

Reading network interface statistics for the interface named "lo":

res_net_ifstat_t ifstat;
res_read(RES_NET_IFSTAT, &ifstat, (void *)"lo", 0, 0);
printf("status for %s: %llu %llu\n",
       ifstat.ifname, ifstat.rx_bytes, ifstat.rx_packets);

Reading multiple resources in one call:

res_blk_t *b = NULL;
int a[NUM] = {RES_MEM_PAGESIZE, RES_MEM_TOTAL, RES_MEM_AVAILABLE,
              RES_MEM_INFOALL, RES_KERN_RELEASE, RES_NET_IFSTAT,
              RES_NET_ALLIFSTAT, RES_KERN_COMPILE_TIME};

b = res_build_blk(a, NUM);
b->res_unit[5]->hint = (void *)"lo";
res_read_blk(b, 0, 0);

printf("pagesize %ld bytes,\n memtotal %ld kb,\n memavailable %ld kb,\n"
       " memfree %ld kb,\n release %s,\n compile time %s\n",
       b->res_unit[0]->data.sz,
       b->res_unit[1]->data.sz,
       b->res_unit[2]->data.sz,
       ((res_mem_infoall_t *)(b->res_unit[3]->data.ptr))->memfree,
       b->res_unit[4]->data.str,
       b->res_unit[7]->data.str);

res_net_ifstat_t *ip = (res_net_ifstat_t *)b->res_unit[5]->data.ptr;
printf("stat for interface %s: %llu %llu\n",
       ip->ifname, ip->rx_bytes, ip->rx_packets);

int k = (int)(long long)b->res_unit[6]->hint;
res_net_ifstat_t *ipp = (res_net_ifstat_t *)b->res_unit[6]->data.ptr;
for (int j = 0; j < k; j++) {
        printf("stat for interface %s: %llu %llu\n",
               ipp[j].ifname, ipp[j].rx_bytes, ipp[j].rx_packets);
}
free(ipp);
res_destroy_blk(b);


Oracle Database Runs Best on Oracle Linux

Why does Oracle Database run best on Oracle Linux?  A new white paper is now available where you’ll learn what makes the Oracle Linux cloud-ready operating system a cost-effective and high-performance choice when modernizing infrastructure or consolidating Oracle Database instances. When you deploy Oracle Database on Oracle Linux, you can have the confidence that you are deploying on an operating system backed by development teams that work closely together to optimize performance, security, mission-critical reliability, availability, and serviceability. Because Oracle’s applications, middleware, and database products are developed on Oracle Linux, you’ll be deploying on the most extensively tested solution, whether it be on-premises or in the cloud. For Oracle Database workloads, advantages are afforded by the operating system’s deep integration with the solution stack, optimizations resulting from Oracle’s upstream Linux kernel work and industry collaborations, and enhancements delivered in the Unbreakable Enterprise Kernel (UEK) for Oracle Linux. With Oracle Linux Support, your software environment is backed by the expertise of Oracle’s global 24x7 support organization, regardless of whether you deploy on certified partner hardware, Oracle servers, an Oracle engineered solution, or Oracle Cloud. You also receive management and high availability solutions at no additional charge, which helps reduce the TCO of your database infrastructure. Additionally, when you deploy Oracle Database on Oracle Cloud, all the benefits of Oracle Linux Support and more are provided at no additional cost. To find out more about these and other Oracle Linux advantages for Oracle Database, download a copy of the white paper: Why Oracle Database Runs Best on Oracle Linux today.


Linux

Getting Started with Oracle Arm Toolset 1

Why Use Oracle Arm Toolset 1?

Oracle Linux 7 for Arm was announced earlier this summer. Oracle includes the "Oracle Arm Toolset 1" [see release notes], which provides many popular development tools, including:
- gcc v7.3.0 - supports the 2011 revision of the ISO C standard
- g++ v7.3.0 - supports the 2014 ISO C++ standard
- gfortran v7.3.0 - supports Fortran 2008
- go v1.10.1 - the Go programming language
- gdb v8.0.1 - the GNU debugger
- binutils v2.30 - binary utilities
The above versions are much more recent than the base system versions. The base system versions are intentionally kept stable for many years, in order to help ensure compatibility for device drivers and other components that may be intimately tied to a specific compiler version. For your own applications, however, you might want to use more modern language features; for example, Oracle Arm Toolset 1 includes support for C++14. For a complete list of the software packages in Oracle Arm Toolset 1, see the packages listed at the Oracle Linux 7 Software Collections yum repo.

Steps

(1) Download the Oracle Linux repo file:
# cd /etc/yum.repos.d
# wget http://yum.oracle.com/aarch64/public-yum-ol7.repo

(2) Enable the collection. In the repo file, set enabled=1 for ol7_software_collections. Notice that there are many repositories; at minimum, edit the section for the Software Collection Library to set enabled=1. While you are there, review the other repositories and decide which others you would like to enable. You can view the Software Collection Library in a browser by going to http://yum.oracle.com/repo/OracleLinux/OL7/SoftwareCollections/aarch64/index.html

(3) Install the toolset:
# yum install 'oracle-armtoolset-1*'

(4) Enable a shell with the software collection:
$ scl enable oracle-armtoolset-1 bash
Note that this will start a new shell. (Of course, you could change the word 'bash' above to some other shell if you prefer.)
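At this point a quick smoke test of the toolchain can be reassuring. The sketch below is not from the release notes: it compiles and runs a throwaway hello.c with whichever C compiler is first on the PATH. Inside an "scl enable oracle-armtoolset-1 bash" shell that should be the toolset's gcc under /opt/oracle/oracle-armtoolset-1; elsewhere it is simply your system compiler.

```shell
#!/bin/sh
# Toolchain smoke test: build and run a trivial program with the first
# C compiler found on the PATH, degrading gracefully if none exists.
workdir=$(mktemp -d)
cat > "$workdir/hello.c" <<'EOF'
#include <stdio.h>
int main(void) { puts("toolchain ok"); return 0; }
EOF
CC=$(command -v gcc || command -v cc || echo "")
if [ -n "$CC" ]; then
  "$CC" -o "$workdir/hello" "$workdir/hello.c" && result=$("$workdir/hello")
  result=${result:-"compile failed"}
else
  result="no C compiler found on PATH"
fi
echo "$result"
rm -rf "$workdir"
```

If the output is "toolchain ok" and "which gcc" (next step) points into /opt/oracle/oracle-armtoolset-1, the collection is wired up correctly.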
(5) Verify that the gcc command invokes the correct copy, and that paths are set as expected:
$ which gcc
$ echo $PATH
$ echo $MANPATH
$ echo $INFOPATH
$ echo $LD_LIBRARY_PATH
Expected output: the which command should return /opt/oracle/oracle-armtoolset-1/root/usr/bin/gcc, and all four echo commands should print paths beginning with /opt/oracle/oracle-armtoolset-1/

(6) Wrong gcc? Wrong paths? If step (5) gives unexpected output, check whether your shell initialization files are re-setting the path variables. If so, here are four possible solutions:

(6a) norc: Depending on your shell, there is probably an option to start up without initialization. For example, if you are a bash user, you could say:
$ scl enable oracle-armtoolset-1 "bash --noprofile --norc"

(6b) silence: Alternatively, you can edit your shell initialization files to avoid setting paths, leaving that up to scl instead.

(6c) (RECOMMENDED) Set paths only in your login shell initialization files: The easiest solution is probably to check the documentation for your shell and notice that it executes certain file(s) at login time and certain other file(s) when a new subshell is created. For example, at login time bash looks for ~/.bash_profile, ~/.bash_login, or ~/.profile, while for subshells it looks for ~/.bashrc. If you do your path setting in ~/.bash_profile and avoid touching paths in ~/.bashrc, then the scl enable command will successfully add Oracle Arm Toolset 1 to your paths.

(6d) (Kludge) enable last: If for some reason you wish to set paths in your subshell initialization file, then please ensure that the toolset's enable scriptlet is done last. Here is an example from the bottom of my current .bashrc:

# If this is a shell created by 'scl enable', then make sure that the
# 'enable' scriptlet is done last, after all other path setting has
# been completed.
grandparent_cmd=$(ps -o cmd= $(ps -o ppid= $PPID))
if [[ "$grandparent_cmd" =~ "scl enable" ]] ; then
    #echo "looks like scl"
    grandparent_which=${grandparent_cmd/scl enable}
    grandparent_which=${grandparent_which/bash}
    grandparent_which=${grandparent_which// }
    grandparent_enable=$(ls /opt/*/$grandparent_which/enable 2>/dev/null)
    if [[ -f $grandparent_enable ]] ; then
        sourceit="source $grandparent_enable"
        echo doing "'$sourceit'"
        $sourceit
    else
        echo "did not find the enable scriptlet for '$grandparent_which'"
    fi
fi

Sources

If you would like the sources:
$ wget http://yum.oracle.com/repo/OracleLinux/OL7/SoftwareCollections/aarch64/getPackageSource/oracle-armtoolset-1-gcc-7.3.0-2.el7.src.rpm

Why Use Oracle Arm Toolset 1? Oracle Linux 7 for Arm was announced earlier this summer. Oracle includes the "Oracle Arm Toolset 1" [see release notes], which provides many popular development...

Linux

DTrace on Linux: an Update

DTrace offers easy-but-powerful dynamic tracing of system behavior, and it is so lightweight and safe that it can routinely be used on production systems. DTrace was originally developed for the Oracle Solaris operating system. Oracle has also ported DTrace to Linux. Recent enhancements for Linux include:

- initial ARM64 support
- implementation of additional providers (lockstat, initial pid provider support)
- improved feature alignment with other DTrace implementations (llquantize, a third argument to tracemem, etc.)
- compile-time array bounds checking
- translator support for kernels 4.12 - 4.14
- pid provider support for userspace tracing
- bug fixes (better address-to-symbolic-name translation, drastically faster dtrace_sync() data updating, etc.)

Providers now include:

    Provider                                                    x86   ARM64
    dtrace (BEGIN, END, ERROR)                                   X      X
    fbt (function boundary tracing)                              X
    io                                                           X
    ip                                                           X
    lockstat                                                     X
    perf                                                         X
    pid (tracing in a specific pid)                              X      X
    proc (e.g. for process creation and termination)             X
    profile                                                      X      X
    sched                                                        X
    sdt (statically defined tracing; e.g., for
         instrumenting specific source code sites)               X
    usdt (statically defined tracing for user applications)      X      X
    syscall                                                      X      X
    tcp                                                          X
    udp                                                          X

DTrace for Linux is shipped as part of the Unbreakable Enterprise Kernel (UEK) Release 4 for Oracle Linux (for x86_64) and UEK Release 5 (for aarch64 and x86_64 with Oracle Linux 7). Going forward, new versions of DTrace will be released exclusively on UEK R5 and beyond, as development for UEK R4 is no longer active. The recent DTrace for Linux presentation at FOSDEM 2018 provides more helpful details. Download information can be found at http://www.oracle.com/technetwork/server-storage/linux/downloads/linux-dtrace-2800968.html. On Oracle Linux, you can install the dtrace-utils and dtrace-utils-devel packages, which can be found on yum.oracle.com or the Unbreakable Linux Network.
In addition, source code is available with other Oracle open source projects at https://oss.oracle.com/projects/DTrace/ and on github at https://github.com/oracle/dtrace-utils. The Linux kernel DTrace code is also merged periodically with more recent upstream kernels and the resulting code can be found in a git repository on oss.oracle.com: https://oss.oracle.com/git/gitweb.cgi?p=dtrace-linux-kernel.git. Help is available on the dtrace-devel mailing list.
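As a quick taste of the tool (illustrative, not from the post; assumes a UEK kernel with DTrace support and root privileges), a classic one-liner counts system calls by program name for five seconds:

```shell
# Install the DTrace userspace packages on Oracle Linux, then count
# system calls per executable for five seconds. Requires root.
sudo yum install -y dtrace-utils dtrace-utils-devel
sudo dtrace -n 'syscall:::entry { @[execname] = count(); } tick-5s { exit(0); }'
```

On exit, the aggregation is printed with one line per executable name and its syscall count.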


Announcing Release 3 of Ceph Storage for Oracle Linux

We are excited to announce Release 3 of Ceph Storage for Oracle Linux. This release presents a uniform view of object and block storage from a cluster of multiple physical and logical commodity-hardware storage devices. Ceph can provide fault tolerance and enhance I/O performance by replicating and striping data across the storage devices in a Ceph Storage Cluster. Ceph's monitoring and self-repair features minimize administration overhead.

Release 3 of Ceph Storage for Oracle Linux is based on the Ceph Community Luminous release (v12.2.5). Differences between Oracle versions of the software and upstream releases are limited to Oracle-specific fixes and patches for specific bugs. Supported features include the Object Store, Block Device, Ceph Storage Cluster, Ceph File System (Ceph FS), Simple Ceph Object Gateway, and Multisite Ceph Object Gateway components.

Notable new features:

- Ceph Manager daemon, ceph-mgr, to monitor clusters
- Ceph Manager web-based dashboard
- OSDs using the BlueStore backend to manage HDDs and SSDs
- Simplified OSD replacement process

Release 3 of Ceph Storage for Oracle Linux adds support for:

- Ceph iSCSI gateway
- Ceph FS
- Export of Ceph FS filesystems and block storage over NFS
- Ceph block devices with QEMU

Supported Upgrade Path
Please refer to the product documentation upgrade section for steps and procedures.

Product Support
Release 3 of Ceph Storage for Oracle Linux replaces the previous 2.0 release. Release 3.0 of Ceph Storage for Oracle Linux is available for Oracle Linux 7 (x86_64) running the Unbreakable Enterprise Kernel Release 5. A minimum of Oracle Linux 7 Update 5 is required. The ceph-deploy package for Release 3.0 is available via ULN or the Oracle Linux yum server.


Announcements

Latest Oracle Linux 7.5 and 6.10 Vagrant Boxes Now Available

We've just updated our Oracle Linux Vagrant boxes for Oracle VM VirtualBox to Oracle Linux 7.5 with Unbreakable Enterprise Kernel Release 5 and Oracle Linux 6.10. These Vagrant boxes include:

- A recent kernel
  - Oracle Linux 7: UEK5 (4.14.35-1818.0.9.el7uek.x86_64)
  - Oracle Linux 6: UEK4 (4.1.12-124.16.4.el6uek.x86_64)
- VirtualBox guest additions RPMs installed
- Minimal package set installed
- 32 GiB root volume
- 4 GiB swap
- XFS root filesystem
- Extra 16 GiB VirtualBox disk image attached, dynamically allocated

The complete latest details are always here: yum.oracle.com/boxes

VirtualBox Guest Addition RPMs
Last year, we introduced RPM versions of the VirtualBox Guest Additions to simplify installation and upgrade of these essential drivers and guest OS optimizations. Our boxes come pre-installed with the guest addition RPMs.

Get Up and Running Quickly with Pre-configured Software Stacks: Vagrantfiles on GitHub
If you'd like to experiment with Oracle Database, Docker, or Kubernetes and are looking to get started quickly without getting bogged down with installation details, the Vagrantfiles we've posted on GitHub are for you. For example, there are Vagrantfiles and instructions to quickly:

- set up a Kubernetes cluster
- install Oracle Database 12c on Oracle Linux
- set up a Docker environment
- set up a local Docker Container Registry

References
- Vagrantfile examples on GitHub
- Oracle Linux Vagrant boxes


Resilient RDMA IP Addresses

Oracle Linux kernel developer Sudhakar Dindukurti contributed this post on the work he's doing to bring the Resilient RDMA IP feature from RDS into upstream. This code is currently maintained in Oracle's open source UEK kernel, and we are working on integrating it into the upstream Linux source code.

1.0 Introduction to Resilient RDMA IP
The Resilient RDMAIP module assists ULPs (RDMA Upper Level Protocols) in doing failover, failback and load balancing for InfiniBand and RoCE adapters. RDMAIP is a feature for RDMA connections in Oracle Linux. When this feature, also known as active-active bonding, is enabled, the Resilient RDMAIP module creates an active bonding group among the ports of an adapter. Then, if any port in the group is lost, the IPs on that port are moved to the other port automatically, providing HA for the application while allowing the full available bandwidth to be used in the non-failure scenario.

Reliable Datagram Sockets (RDS) are high-performance, low-latency, reliable connection-less sockets for delivering datagrams. RDS provides reliable, ordered datagram delivery by using a single reliable transport between two nodes. For more information on the RDS protocol, please see the RDS documentation. RDS RDMA uses the Resilient RDMAIP module to provide HA support. The RDS RDMA module listens for RDMA CM address change events that are delivered by the Resilient RDMAIP module. RDS drops all the RC connections associated with the failing port when it receives an address change event and re-establishes new RC connections before sending data the next time.

Transparent high availability is a harder problem for RDMA-capable NIC adapters than for standard NICs (Network Interface Cards). In the case of standard NICs, the IP layer can decide which path or which netdev interface to use for sending a packet. This is not possible for RDMA-capable adapters for security and performance reasons, which tie the hardware to a specific port and path.
To send a data packet using RDMA to a remote node, there are several steps:

1) The client application registers memory with the RDMA adapter, and the RDMA adapter returns an R_key for the registered memory region to the client. Note that the registration information is saved on the RDMA adapter.
2) The client sends this R_key to the remote server.
3) The server includes this R_key when requesting RDMA_READ/RDMA_WRITE to the client.
4) The RDMA adapter on the client side uses the R_key to find the memory region and proceed with the transaction.

Since the R_key is bound to a particular RDMA adapter, the same R_key cannot be used to send the data over another RDMA adapter. Also, since RDMA applications can talk directly to the hardware, bypassing the kernel, traditional bonding (which lives in the kernel) cannot provide HA. Resilient RDMAIP does not provide transparent failover for kernel ULPs or for OS-bypass applications; rather, it enables ULPs to failover, failback, and load balance over RDMA-capable adapters. The RDS (Reliable Datagram Sockets) protocol is the first client using the Resilient RDMAIP module to provide HA. The sections below describe the role of Resilient RDMAIP for different features.

1.1 Load balancing
All the interfaces in the active-active bonding group have individual IPs. RDMA consumers can use one or more interfaces to send data simultaneously and are responsible for spreading the load across all the active interfaces.

1.2 Failover
If any interface in the active-active bonding group goes down, then the Resilient RDMAIP module moves the IP address(es) of that interface to the other interface in the same group, and it also sends an RDMA CM (Communication Manager) address change event to the RDMA kernel ULPs. RDMA kernel ULPs that are HA capable stop using the interface that went down and start using the other active interfaces.
For example, if there are any Reliable Connections (RC) established on the downed interface, the ULP can close all those connections and re-establish them on the failover interface.

1.3 Failback
If the interface that went down earlier comes back up, then the Resilient RDMAIP module moves the IP address back to the original interface and again sends an RDMA CM address change event to the kernel consumers. RDMA kernel consumers take action when they receive the address change event. For example, RDMA consumers would move back the connections that were moved as part of failover.

2.0 Resilient RDMAIP module parameters
The Resilient RDMAIP module provides the following module parameters:

rdmaip_active_bonding_enabled
    Set to 1 to enable the active-active bonding feature, or 0 to disable it. By default, active-active bonding is disabled. If active bonding is enabled, then the Resilient RDMAIP module creates an active bonding group among ports of the same RDMA adapter. For example, consider a system with two RDMA adapters, each with two ports: one InfiniBand (ib0 and ib1) and one RoCE (eth4 and eth5). On this setup, two active bonding groups will be created:
    1) Bond 1 with ib0 and ib1
    2) Bond 2 with eth4 and eth5

rdmaip_ipv4_exclude_ips_list
    For IPs listed in this parameter, the active bonding feature is disabled. By default, link-local addresses are excluded by Resilient RDMAIP.

3.0 How it works
In Figure 1, there are two nodes, each with one 2-port InfiniBand HCA, and each port of the HCA is connected to a different switch as shown. Two IPoIB interfaces (ib0 and ib1) are created, one for each port, as shown in the diagram. When active-active bonding is enabled, the Resilient RDMAIP module automatically creates a bond between the two ports of the InfiniBand HCA.
1) All the IB interfaces are up and configured:

    # ip a
    ---
    ib0: mtu 2044 qdisc pfifo_fast state UP qlen 256
        link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:00:10:e0:00:01:29:65:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
        inet 10.10.10.92/24 brd 10.10.10.255 scope global ib0
           valid_lft forever preferred_lft forever
    ib1: mtu 2044 qdisc pfifo_fast state UP qlen 256
        link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:00:10:e0:00:01:29:65:02 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
        inet 10.10.10.102/24 brd 10.10.10.255 scope global secondary ib0:P06
           valid_lft forever preferred_lft forever

2) When Port 2 on Node 1 goes down, the ib1 IP '10.10.10.102' is moved to Port 1 (ib0) - Failover:

    # ip a
    --------------
    ib0: mtu 2044 qdisc pfifo_fast state UP qlen 256
        link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:00:10:e0:00:01:29:65:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
        inet 10.10.10.92/24 brd 10.10.10.255 scope global ib0
           valid_lft forever preferred_lft forever
        inet 10.10.10.102/24 brd 10.10.10.255 scope global secondary ib0:P06
           valid_lft forever preferred_lft forever
        inet6 fe80::210:e000:129:6501/64 scope link
           valid_lft forever preferred_lft forever
    ib1: mtu 2044 qdisc pfifo_fast state DOWN qlen 256
        link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:00:10:e0:00:01:29:65:02 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    ----------------

3) When Port 2 on Node 1 comes back, IP '10.10.10.102' is moved back to Port 2 (ib1) - Failback:

    # ip a
    ---
    ib0: mtu 2044 qdisc pfifo_fast state UP qlen 256
        link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:00:10:e0:00:01:29:65:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
        inet 10.10.10.92/24 brd 10.10.10.255 scope global ib0
           valid_lft forever preferred_lft forever
    ib1: mtu 2044 qdisc pfifo_fast state UP qlen 256
        link/infiniband 80:00:02:09:fe:80:00:00:00:00:00:00:00:10:e0:00:01:29:65:02 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
        inet 10.10.10.102/24 brd 10.10.10.255 scope global secondary ib0:P06
           valid_lft forever preferred_lft forever

Example: RDS Implementation
Here is the sequence of steps that occurs during failover and failback. Consider an RDS application establishing an RDS socket between IP1 on Node 1 Port 1 and IP3 on Node 2. In this case, at the RDS kernel level, there will be one RC connection between IP1 and IP3.

Case 1: Port 1 on Node 1 goes down
- The Resilient RDMAIP module moves the IP address IP1 from Port 1 to Port 2; Port 2 now has two IPs (IP1 and IP2).
- The Resilient RDMAIP module sends an RDMA CM address change event to RDS.
- The RDS RDMA driver drops the IB connection between IP1 (Port 1) and IP3 as part of handling the address change event.
- The RDS RDMA driver creates a new RC connection between IP1 (Port 2) and IP3 when it receives a new send request from IP1 to IP3.
- After failover, when RDS resolves IP1, it gets path records for Port 2, as IP1 is now bound to Port 2.

Case 2: Port 1 on Node 1 comes back up
- The Resilient RDMAIP module moves the IP address IP1 from Port 2 back to Port 1.
- The Resilient RDMAIP module sends an RDMA CM address change event to RDS.
- The RDS RDMA driver drops the IB connection between IP1 (Port 2) and IP3 as part of handling the address change event.
- The RDS RDMA driver creates a new RC connection between IP1 (Port 1) and IP3 when it receives a new send request from IP1 to IP3.
- After failback, when RDS resolves IP1, it gets path records for Port 1, as IP1 is now bound to Port 1.

4.0 Future work
The Resilient RDMAIP module's current implementation is not tightly integrated with the network stack implementation. For example, RDMA kernel consumers do not have an option to create active bonding groups, and there are no APIs that can tell RDMA consumers about the active bond groups and which interfaces are configured in each group.
As a result, the current design and implementation are not suitable for upstream. We are currently working on a version of this module that we can submit to upstream Linux; until then, the code for RDMAIP can be found on oss.oracle.com and our GitHub pages.
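For reference, a persistent way to set the module parameters described in section 2.0 would be a modprobe configuration fragment like the one below. This is a sketch: the file name, the module name as passed to `options`, and the exclude-list value are illustrative assumptions; only the parameter names come from the post.

```
# /etc/modprobe.d/rdmaip.conf (illustrative)
# Enable the active-active bonding feature.
options rdmaip rdmaip_active_bonding_enabled=1
# Disable active bonding for a specific IPv4 address
# (link-local addresses are excluded by default).
options rdmaip rdmaip_ipv4_exclude_ips_list=192.168.10.5
```

With this in place, the parameters take effect each time the module is loaded rather than having to be passed on the modprobe command line.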


Linux Kernel Development

Translating Process ID between Namespaces

Oracle Linux kernel developer Nagarathnam Muthusamy contributed this blog post on the challenges of translating PIDs (process IDs) between different namespaces. This is a feature currently lacking from namespace support in the Linux kernel and is an important feature to enable multitenant use of the Oracle database via CDBs.

The process ID (PID) namespace facility in the Linux kernel has been an effective way of providing isolation between groups of processes, which in turn has been employed by various implementations of containers. Though strong isolation between processes is desired, there are always some processes which would like to monitor the activities of other processes and their resource utilization in the system. Each PID namespace has its own sequence of PIDs, which requires any process monitoring them from the top of the hierarchy to translate process IDs to and from its own PID namespace. The Linux kernel has various APIs which return a PID in their results. Any such API can be used for PID translation; the following are a few of the approaches.

SCM_CREDENTIALS
    The sender can translate its PID from its own namespace to a PID in the target namespace by sending and receiving the SCM_CREDENTIALS message. The drawback of this method is the requirement of a socket communication channel for PID translation, which adds management overhead. This method does not enable the sender to translate the PID of another process unless it is root or has CAP_SYS_ADMIN.
    Ref: http://man7.org/linux/man-pages/man7/unix.7.html

/proc/<pid>/status file
    The /proc/<pid>/status file provides a way to find the PIDs associated with a process in different namespaces. Translating a PID from a child namespace to the parent namespace would require searching all the status files in the parent namespace to find the desired PID at the desired level.
    Ref: http://man7.org/linux/man-pages/man5/proc.5.html
    Ref: https://patchwork.kernel.org/patch/5861791/

shmctl(..,IPC_STAT,..), msgctl(..,IPC_STAT,..)
    The struct shmid_ds provided by IPC_STAT on a shared memory segment contains the following two elements:

        pid_t shm_cpid;  /* PID of creator */
        pid_t shm_lpid;  /* PID of last shmat(2)/shmdt(2) */

    The struct msqid_ds provided by IPC_STAT on a message queue contains the following two elements:

        pid_t msg_lspid; /* PID of last msgsnd(2) */
        pid_t msg_lrpid; /* PID of last msgrcv(2) */

    The PIDs in these elements are translated into the PID namespace of the caller. Though these can be used by monitors to keep track of the usage of shared resources by processes regardless of their namespace, these APIs cannot be used for generic PID translation without creating extra shared memory segments or message queues.
    Ref: http://man7.org/linux/man-pages/man2/shmctl.2.html
    Ref: http://man7.org/linux/man-pages/man2/msgctl.2.html

semctl(..,GETPID,..)
    The GETPID command of semctl provides the PID of the process that performed the last operation on a semaphore. Similar to shmctl and msgctl, this is an excellent way to monitor the users of a semaphore but cannot be used for generic PID translation without creating extra semaphores. shmctl and semctl were fixed in upstream Linux kernel 4.17; this facility might not be available in older releases but will be part of the Oracle UEK.
    Ref: http://man7.org/linux/man-pages/man2/semctl.2.html

fcntl(..,F_GETLK,..)
    The F_GETLK command of fcntl provides information on the process which is holding a file lock. This information is translated into the caller's namespace. Any process which requires translation across different PID namespaces can create a dummy file in a common location which it can lock; any query on the owner of the file lock through fcntl will then return the translated PID of the observed process in the caller's namespace.
Though a file is lighter weight than the IPC mechanisms, the creation and cleanup of files for every process in a system just for PID translation is an added overhead.

Is there any cleaner way?
Usually, when your monitor process or any other process in the system requires PID translation, you might be able to work with any of the above-mentioned methods and get around this problem. If none of the above options satisfies your use case, well, you are not alone! I have been working with Konstantin to resurrect his old patch, which provides PID translation capabilities through a new system call called translate_pid. The discussion can be followed at https://lkml.org/lkml/2018/4/4/677; the link also has pointers to previous versions of the API.

The API started off with the following function signature:

    pid_t getvpid(pid_t pid, pid_t source, pid_t target)

The major issue highlighted here was the use of a PID to identify a namespace. Any API which uses PIDs is susceptible to race conditions involving PID recycling. The Linux kernel has many existing PID-based interfaces only because there was no better method to identify the resources when those interfaces were designed. This suggestion led to the following API:

    pid_t translate_pid(pid_t pid, int source, int target);

where source and target are file descriptors referring to the /proc/<pid>/ns/pid files of the source and target namespaces. The major issue with this API is the additional step of opening and closing a file for every PID translation. It also prevents use cases which require PID translation but do not have the privileges to open the /proc/<pid>/ns/pid file. The API under discussion at the time of writing this blog tries to get the best of both worlds, as follows:

    pid_t translate_pid(pid_t pid, int source_type, int source, int target_type, int target);

Here, the *_type arguments change the way source and target are interpreted, as follows.
    TRANSLATE_PID_CURRENT_PIDNS - current pid namespace, argument is unused
    TRANSLATE_PID_TASK_PIDNS - task pid-ns, argument is a task pid
    TRANSLATE_PID_FD_PIDNS - pidns fd, argument is a file descriptor

Once the API is finalized, we will have a cleaner method to translate PIDs without working around the problem with the other existing methods.
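To see the /proc/<pid>/status approach from earlier in action: on kernels since 4.1, the NSpid line lists a process's PID at each pid-namespace level visible to the reader, outermost first. Read from a parent namespace (e.g. the host), a containerized process shows one PID per level; read from inside its own namespace, a single value:

```shell
# Print the Pid and NSpid lines for the current process (here grep
# itself, via /proc/self). When the reader and the process share the
# same pid namespace, NSpid carries a single value equal to Pid.
grep -E '^(Pid|NSpid):' /proc/self/status
```

A monitor at the top of the hierarchy can scan these lines across /proc/*/status to map a namespace-local PID back to its own view, exactly the search the section above describes.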


New Oracle Linux Home Target and Ksplice patching with Oracle Enterprise Manager 13c version 13.3

From Oracle Enterprise Manager 13c version 13.3, we have introduced a new Oracle Linux Home target which enables a simplified approach to the management of Oracle Linux in a single place, including the ability to patch using Ksplice for both kernel and user space updates. We view Oracle Linux Home from the Cloud menu via Enterprise > Cloud > Oracle Linux Home. This new home page exclusively for Oracle Linux enables customers to perform management and monitoring of Oracle Linux hosts from a single page; main features include:

- Oracle Linux host administration and management
- Bare Metal Provisioning (BMP)
- Oracle Linux OS patching
- Oracle Ksplice patching (provides the ability to update the Oracle Linux operating system kernel and key user space libraries while the OS is running, without a reboot or any interruption)
- Add a new Oracle Linux host, which directs the user to the Setup > Add Target > Add Targets Manually wizard to push an Oracle Enterprise Manager agent to the Oracle Linux host

This new target is also visible from the All Targets view. We can navigate to Oracle Linux Home from either the Enterprise or All Targets page. Oracle Linux Home has the following regions:

- General
- Overview of Incidents and Problems
- Host flux
- CPU
- Memory
- Linux patching compliance / summary
- Ksplice patching compliance / summary

General
The General region shows a summary of the Oracle Linux hosts, with total numbers of each Oracle Linux version as well as their status. From here we can click on the OS Version, which will show us in a tabular view all the Oracle Linux hosts matching that version. We have a similar view when we click on any of the total or green arrow links. This view displays useful information such as CPU and memory utilization as well as the total IO/second. These metrics have links which, when clicked, will take you to the metric monitoring area for that host.
Other useful information such as logical memory, CPU load, network interface rate and swap utilization is available.

Overview of Incidents and Problems
From here, we can see any incidents or problems affecting our Oracle Linux hosts with respect to Availability, Performance, Security and others.

Host flux
When Oracle Linux hosts are retired or added, we show when these events occurred over the last 30 days.

CPU
Here we display CPU utilization over a range of Oracle Linux hosts. In our example, we have 12 Oracle Linux hosts where 100% of them have a CPU utilization between 0 - 25%. If we click on the CPU 0-25 bar, we see a table view of each host with its individual CPU utilization.

Memory
For memory, we take a similar approach to CPU. Our example shows 12 Oracle Linux hosts split with regard to their memory utilization. If we click on the Memory 25-50 bar, we see a table view of each host with its individual memory utilization.

Oracle Linux Patching Status / Compliance
Here we show two regions: Oracle Linux Status and Compliance. The status region shows us how many Oracle Linux hosts are compliant with respect to the Oracle Linux packages present on the host compared to packages within ULN-based or custom patching groups. We can change the Compliance region view between Hosts or Patching groups. Both views show any hosts or patching groups that have out-of-date or rogue packages. A rogue package is one that exists on the Oracle Linux host but not in ULN-based or custom patching groups.

Ksplice for Oracle Linux
Ksplice updates the Oracle Linux operating system kernel and key user space libraries whilst the operating system is running, without a reboot or interruption. To enable Oracle Enterprise Manager Ksplice management, all Oracle Linux hosts must have an Oracle Enterprise Manager agent installed and configured with the Ksplice software. For further details, refer to the Ksplice portal and user guide.
Ksplice Configuration metrics are collected on every monitored Oracle Linux host configured with Ksplice software (Uptrack v1.2.45 or Enhanced Ksplice v1.0.29 or higher). To access these metrics, from the Host menu on a host's home page, select Configuration > Latest. This view is for an offline Ksplice host, which is up to date for the kernel but out of date for user space. This view is for an online Ksplice host, which is up to date for the kernel but out of date for user space.

The following metrics are collected:

Ksplice Version
This reports the version of the Ksplice software installed on the target host.

Ksplice Status
This reports whether the host is configured to receive updates from the Ksplice server or is Ksplice offline.

Base Kernel Version
This queries the stock (base) kernel running on the system; this version does not represent the patched version, only the one that booted the system.

Effective Kernel Version
This reports the effective kernel, meaning the kernel version after live Ksplice patching, including security fixes and others. This also reports the last applied patch date.

Kernel Status
This reports whether the kernel of the host is up to date or out of date. A system is up to date if it has all available Ksplice patches installed.

Kernel Patches Installed
This reports the count of Ksplice packages installed on the system.

User Space Status
This reports whether the host's user space Ksplice-aware packages are up to date or out of date. If this is an offline Ksplice host, the status is based upon the local repositories configured on the system.

User Space Packages Installed
This reports the count of Ksplice user space packages installed on the system.

Kernel Installed Patches
This reports the Ksplice patches installed on the system.

Kernel Available Patches
This lists the available Ksplice patches for the kernel; in essence, it lists the patches that have not yet been installed. This information is gathered based on the Ksplice configuration.
In the case of an online Ksplice host configured with a Ksplice server, it gets that information from the ULN (Unbreakable Linux Network). In the case of an offline Ksplice host, it reflects the data based on the uptrack-updates-`uname -r` package installed on the system.

User Space Installed Packages
This reports the Ksplice user space packages installed on the system.

The Ksplice Patching region on the Oracle Linux Home page uses the metrics detailed earlier to collate the Ksplice status over all the Ksplice-enabled Oracle Linux hosts monitored; it contains two sub-regions:

Ksplice Status Region
This region shows the total number of Ksplice-enabled hosts; clicking on that number will open a list of hosts. The Ksplice Status region contains two pie charts:

- Kernel Status
- User Space Status

Each pie chart shows the status of all hosts, i.e. how many hosts are compliant, non-compliant or of unknown compliance. Clicking on a particular compliance status will open another page with the associated hosts.

Ksplice Summary Region
This region shows a table of hosts listing the following:

- Ksplice Status (Online/Offline)
- Kernel Status (Compliant/Non-Compliant/Compliance unknown)
- User Space Status (Compliant/Non-Compliant/Compliance unknown)
- Effective Kernel Version

By clicking on the number next to Ksplice Enabled Hosts (in the screenshot above, "10"), we are taken to the Ksplice Linux Hosts page, which contains a table displaying the following:

- Ksplice Enabled Hosts with Ksplice Software
- Ksplice Software Version
- Ksplice Status (Online - green / Offline - grey)
- Kernel Status (Compliant/Non-Compliant/Compliance unknown in the case of unconfigured/offline systems)
- Number of Kernel Installed Patches
- User Space Status (Compliant/Non-Compliant/Compliance unknown in the case of unconfigured/offline systems)
- Number of User Space Installed Patches
- Base Kernel Version
- Effective Kernel Version

Notice from the above screenshot that the last two hosts have a version of 1.2.47.
This denotes that the Ksplice Enhanced client is not installed (uptrack client) and therefore no user space patches are listed. By clicking on a host name in the Ksplice detail table, a new page is opened listing the installed Ksplice patches on that host. If this host is a Ksplice Online host, it will also list what updates are available; these updates can be added or removed from this page. If the host is a Ksplice Offline host, this page will show all the Ksplice kernel or user space patches available in the local repository. If the Ksplice Enhanced Client software is installed on the host, it will display a list of installed/available user space patches; otherwise, it will show the message "Install/Upgrade/Configure Ksplice Enhanced Client Software". With a Ksplice Offline host, the Ksplice status will be a grey rather than a green dot, which denotes an Online host. In addition, with an Offline host two dotted clocks are present for the Kernel and User Space status, as we can only determine the latest updates from the offline repository, which may not be the latest from the ULN. Notice the Refresh button; this refreshes the latest data to the dashboard. When clicked, a dialogue box asks the user for confirmation. To install or remove any update, you have to select it and enter root privileges or credentials. We offer the use of the uptrack or the enhanced client features. Best practice is to install all updates; therefore, we follow this model even for the uptrack client to keep our deployment model consistent. The removal of kernel updates is possible by ID / individually; however, for user space it is only possible to remove all updates.

Summary
The Oracle Linux Home target brings Oracle Linux management into a single page, providing a simplified Oracle Linux management portal.
The existing Oracle Linux Patching and Bare Metal Provisioning (BMP) frameworks can be accessed from the Oracle Linux Home main menu. For information on Oracle Linux, refer here; for information on Oracle Enterprise Manager 13c Release 3 (13.3), refer here.


Linux Kernel Development

Oracle Data Analytics Accelerator (DAX) for SPARC

This blog post was written by kernel developers Jon Helman and Rob Gardner, whose code for the Oracle Data Analytics Accelerator driver was accepted into the Linux source earlier this year. This is the final installment in the kernel blog series on Linux enablement for SPARC chip features.

Oracle DAX Support in Linux
The Oracle Data Analytics Accelerator (DAX) is a coprocessor built into the SPARC M7, S7, and M8 chips that can perform various operations on data streams. These operations are particularly suited to accelerating database queries, but have a wide variety of uses in the field of data analytics. For the duration of a coprocessor operation, the main processors are free to execute other instruction streams. Since the coprocessor can operate on large data sets, this can free up significant processor resources. Each system may have multiple DAX coprocessors, and each DAX has multiple execution units. Each unit is capable of doing independent work in parallel with the others, and applications may be able to take advantage of this parallelism for some data sets.

DAX Operations
The explanations and drawings below show in detail the basic operations that the DAX can perform.

Scan
The scan operation finds all instances of a value, set of values, or range of values in a list. In the following example, the DAX finds each instance of the search value, A, in the input vector. The resulting bit vector has a 1 set in each position where an A is found.

Select
The select operation pulls elements from a vector to produce a subset corresponding to the bits set in a bitmap. In the following example, the DAX filters the input data so that the resulting output vector consists of only those elements for which a 1 is set in the bit vector.

Extract
The extract operation converts a vector of values from one format to another. In the following example, the DAX converts an RLE-encoded input vector to an expanded output vector.
(RLE, or run-length encoding, is a compression technique in which repeated elements are represented by a tuple consisting of the element and the number of repetitions.) This is just one of many possible format conversions.

Translate
The translate operation takes as input a vector and a bitmap. Each element in the vector is used as an index into the bitmap, and that bit is placed into the output bitmap. The operation is most easily described by this short code segment and illustrated in the diagram which follows:

    for (i = 0; i < N; i++)
        OUTPUT[i] = BITMAP[INPUT[i]];

Coprocessor Features

Control flow
The hardware defines a Coprocessor Control Block (CCB) which specifies the operation to be done, the addresses of the buffers to process, and metadata describing those buffers (format of the data, number of elements in the stream, compression format, etc.). One or more CCBs are presented to the coprocessor via software. Multiple requests may be enqueued in the hardware, and these are serviced as resources allow. Many threads may make requests concurrently, and resources are shared much like the CPU is shared. After submission, software is free to do other work until it requires the computational results from the coprocessor. Upon completion of the request, no interrupt is sent, as is commonly done with other hardware. Rather, completion is signalled via memory, which can be polled by software. The processor provides an efficient mechanism for polling this completion status in the form of two new instructions, monitored load and monitored wait. The monitored load instruction performs a memory load while also marking the address as one of interest. The monitored wait instruction pauses the virtual processor until one of several events occurs, one of which is modification of the memory location of interest. This allows other hardware threads to use core resources while the monitoring thread is suspended.
Data access
The DAX hardware directly reads from and writes to physical memory, avoiding handling large amounts of data in the main processor. In order to optimize cache utilization, an option is provided that directs the DAX to place output directly in the processor's L3 cache. The DAX also optimizes data accesses with its ability to operate on compressed data: it can decompress data while performing the operation, and hence does not need temporary memory to hold decompressed intermediate output. This helps reduce the number of physical memory reads and increases the size of possible data sets. In addition to compressed data, the DAX can work with a variety of data formats and bit widths, including fixed-width bit- and byte-packed, and variable width. The multitude of possible data formats and supported bit widths is documented in the Linux kernel file Documentation/sparc/oradax/dax-hv-api.txt.

Software Stack

Initiating a Request
An application will typically use the available function library (libdax) to utilize the capabilities of the coprocessor, though it is also possible to use the raw driver interface. A request to submit an operation to the DAX starts with a user calling one of the libdax functions (e.g. dax_scan_value). These functions perform rigorous validation of the arguments and convert them into the hardware-defined CCB format before feeding them to the driver. The driver locks the pages containing the input and output buffers and then submits the CCBs to the hypervisor via the hypercall mechanism. The hypervisor translates each address in the CCB from virtual to physical and then initiates the hardware operation. Control immediately returns to the hypervisor, subsequently to the driver, and then back to libdax.
Request Completion
Since the kernel and hypervisor are not involved in processing a CCB after it has been submitted to the DAX, requests to the DAX driver do not block waiting for completion, as is traditional for many other drivers. This means that the userland application has the option of performing other work while waiting for completion. libdax provides two variants of each DAX operation: blocking (e.g. dax_scan_value or dax_extract) and non-blocking (e.g. dax_scan_value_post and dax_extract_post). Completion of a request is signaled via a status byte in shared memory called the completion area. libdax waits on this byte using the monitored load and monitored wait instructions. The function dax_poll is provided for the application to check for completion in the non-blocking scenario. In libdax, the logic for checking the completion area is:

    while (1) {
        uint8_t status = loadmon8(&completion_area->status);
        if (status == INPROGRESS)
            mwait(TIMEOUT);
        else
            break;
    }

Driver Operation
The oradax driver provides a transport mechanism for conveying one or more CCBs from a user application to the coprocessor, and also performs several housekeeping functions essential to security and integrity. The API consists of the Linux system calls open, close, read, write, and mmap. The /open/ call initializes a context for use by a single thread. The context contains buffers to hold CCBs and completion areas, and records the virtual pages used by requests. Multiple threads may utilize the coprocessor, but each thread must do its own /open/. A corresponding /close/ releases all resources associated with all requests submitted by the thread. The /mmap/ call is used to gain access to the completion area buffer. Driver commands are given via /write/, and responses (when necessary) are retrieved via /read/. Driver commands involve a CCB or group of CCBs and are submit, kill, request info, and dequeue.
The submit command is a /write/ of a buffer containing one or more CCBs to be conveyed to the coprocessor. Since the coprocessor accesses physical memory directly, the virtual-to-physical mappings of the I/O buffers must be locked in order to prevent the physical pages from being repurposed by the kernel. The driver locks all pages associated with the request and transmits the CCBs to the hypervisor. If any of the CCBs were not submitted successfully, the corresponding pages are unlocked and the /write/ return value indicates this discrepancy; a /read/ must then be done to retrieve further information describing what went wrong. If all CCBs were submitted successfully, the application may poll for completion, or proceed immediately to other tasks and defer polling until the results are required for further progress. The current state of a CCB may be queried at any time using the request info command, and a CCB may be terminated with the kill command. The dequeue command explicitly unlocks the pages associated with all completed requests; it is not usually necessary to call this, since pages are unlocked implicitly during the submission process. For much more detail, see Documentation/sparc/oradax/oracle-dax.txt.

Conclusion
Oracle DAX is supported by the oradax device driver, available beginning with the Linux 4.16 kernel. A user may make calls directly to the oradax driver to submit requests to the DAX, and the kernel documentation files contain example code demonstrating this. However, we fully expect applications wishing to use the DAX to leverage the libdax library, which provides higher-level services for analytics and frees the application writer from the need to understand the low-level DAX command structure. The library is fully open source, is available at the Oracle open source project webpage, and includes a full set of manpages describing the DAX operations.
Feedback is always welcome, and we would be interested in hearing about your experiences with the DAX.

Reference Links
- Oradax Driver
- Oradax Linux Kernel documentation
- OSS libdax git repo
- Oracle Developer Community Software in Silicon Space
- Introduction to Stream Processing Using the DAX API
- SPARC innovation article
- DAX use in Oracle Database 12c
- DAX use in Apache Spark
- DAX use in Java Streams API


Announcing the release of Oracle Linux 6 Update 10

We're happy to announce the general availability of Oracle Linux 6 Update 10 for the i386 and x86_64 architectures. You can find the individual RPM packages on the Unbreakable Linux Network (ULN) and the Oracle Linux yum server. ISO installation images are available for download from the Oracle Software Delivery Cloud, and Docker images are available via the Oracle Container Registry and Docker Hub.

Oracle Linux 6 Update 10 ships with the following kernel packages:
- Unbreakable Enterprise Kernel (UEK) Release 4 (kernel-uek-4.1.12-124.16.4.el6uek) for x86-64
- Unbreakable Enterprise Kernel (UEK) Release 2 (kernel-uek-2.6.39-400.294.3.el6uek) for i386
- Red Hat Compatible Kernel (kernel-2.6.32-754.el6) for i386 and x86-64

By default, both UEK and RHCK for the specific architecture (i386 or x86-64) are installed, and the system boots the Unbreakable Enterprise Kernel release.

Application Compatibility
Oracle Linux maintains user space compatibility with Red Hat Enterprise Linux (RHEL), which is independent of the kernel version that underlies the operating system. Existing applications in user space will continue to run unmodified on Oracle Linux 6 Update 10 with UEK Release 4, and no re-certifications are needed for applications already certified with Red Hat Enterprise Linux 6 or Oracle Linux 6.

Notable updates in this release:
- Retpoline Support Added to GCC. Support for retpolines has been added to the GNU Compiler Collection (GCC) in this update. The kernel uses this technique to reduce the overhead of mitigating Spectre Variant 2 attacks, described in CVE-2017-5715.

For more details on these and other new features and changes, please consult the Oracle Linux 6 Update 10 Release Notes in the Oracle Linux Documentation Library. Oracle Linux can be downloaded, used, and distributed free of charge, and all updates and errata are freely available. Customers decide which of their systems require a support subscription.
This makes Oracle Linux an ideal choice for development, testing, and production systems. The customer decides which support coverage is best for each individual system, while keeping all systems up to date and secure. Customers with Oracle Linux Premier Support also receive support for additional Linux programs, including Oracle Linux software collections, Oracle OpenStack, and zero-downtime kernel updates using Oracle Ksplice. For more information about Oracle Linux, please visit www.oracle.com/linux.
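The retpoline support mentioned above surfaces as new GCC code-generation flags. A quick sketch of probing whether a given gcc accepts them; -mindirect-branch=thunk and -mindirect-branch-register are the x86 retpoline options, and the temp-file paths are arbitrary:

```shell
#!/bin/sh
# Probe whether the installed gcc accepts the x86 retpoline flags.
cat > /tmp/retpoline_probe.c <<'EOF'
int main(void) { return 0; }
EOF
if gcc -mindirect-branch=thunk -mindirect-branch-register \
       -o /tmp/retpoline_probe /tmp/retpoline_probe.c 2>/dev/null; then
    echo "retpoline flags supported"
else
    echo "retpoline flags not supported"
fi
rm -f /tmp/retpoline_probe /tmp/retpoline_probe.c
```

On an Oracle Linux 6 Update 10 system with the updated GCC, the first branch should be taken; an older toolchain will reject the flags and fall through to the second.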

