Monday Sep 02, 2013

Oracle Linux 6 UEK3 beta

Last week we published the UEK3 beta on our public yum server.

It is very easy to get started with this and play around with the new features. It just takes a few steps :

  • Install Oracle Linux 6 (preferably the latest update) on a system or in a VM
  • Add the beta repository file in /etc/yum.repos.d
  • Enable the beta channel
  • Reboot into the new kernel
  • Add updated packages like lxc tools and dtrace
    Oracle Linux is freely downloadable. It is free to use on as many systems as you want, it is freely redistributable without changing the CD/ISO content (so including our cute penguin), and it provides free security and bugfix errata updates. You only need to pay for a support subscription for those systems that you want/need support for, not for other systems. This allows our customers/users to run the exact same software on test and dev systems as well as production systems, without having to maintain potentially two kinds of repositories. All systems can run the exact same software all the time.

    The free yum repository for security and bugfix errata lives on our public yum site. That site also contains a few other repositories :

  • Playground channel : a yum repository where we publish the latest mainline kernels as they are released upstream. We take the mainline tree and build it into RPMs that can easily be installed on Oracle Linux (Oracle Linux 6 and x86_64 specifically).
  • Beta channel : a yum repository where we publish new early versions of UEK, along with corresponding packages that need to be updated with it.

    Now, back to UEK3 beta. Just a few steps are needed to get started.

    I will assume you have already installed Oracle Linux 6 (update 4) on a system and it is configured to use public-yum as the repository.

    First download and enable the beta repository.

    # cd /etc/yum.repos.d/
    # wget
    # sed -i s/enabled=0/enabled=1/g public-yum-ol6-beta.repo 

    You don't have to use sed; you can just edit the repo file (vi/emacs) and manually set it to 1 (enable). Now you can just run yum update :
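As a sanity check, the same substitution can be tried on a scratch copy first. The stub file below is made up for illustration; it is not the real repo file :

```shell
# Make a stub repo file with the channel disabled (contents are made up)
cat > /tmp/demo-beta.repo <<'EOF'
[ol6_beta]
name=Oracle Linux 6 Beta
enabled=0
EOF

# The same in-place substitution used above
sed -i s/enabled=0/enabled=1/g /tmp/demo-beta.repo

grep enabled /tmp/demo-beta.repo
# -> enabled=1
```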

    # yum update

    This will install UEK3 (3.8.13-13) and it will update any relevant packages that are required to be on a later version as well. At this point you should reboot into UEK3.
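After rebooting, it's worth confirming that the new kernel is actually the one running. The version string in the comment is illustrative of what the UEK3 beta should report :

```shell
# Print the running kernel release; on the UEK3 beta this should be
# a 3.8.13-13 based el6uek string (illustrative, depends on the build)
uname -r
```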

    New features introduced in UEK3 are listed in our release notes. There are tons of detailed improvements in the kernel since UEK2 (3.0 based). Kernelnewbies is an awesome site that keeps a nice list of changes for each version. We will add more detail to our release notes over time but for those that want to browse through all the changes, check it out.

  • To try out dtrace, you need to install the dtrace packages. We introduced USDT in UEK3's version of dtrace; there is some information in the release notes about the changes.

    # yum install dtrace-utils

    To try out lxc, you need to install the lxc packages. lxc is capable of using Oracle VM Oracle Linux templates as a base image to create a container.

    # yum install lxc


    Thursday Jan 03, 2013

    dm nfs

    A little known feature that we make good use of in Oracle VM is called dm nfs : the ability to create a device-mapper device directly on an nfs-based file. We use this in Oracle VM 3 if your shared storage for the cluster is nfs based.

    Oracle VM clustering relies on the OCFS2 cluster stack/filesystem that is native in the kernel (uek2/2.6.39-x). When we create an HA-enabled pool, we create what we call a pool filesystem. That filesystem contains an ocfs2 volume so that we can store cluster-wide data. In particular, we store shared database files that are needed by the Oracle VM agents on the nodes for HA. It contains info on pool membership, which VMs are in HA mode, what the pool IP is, etc...

    When the user provides an nfs filesystem for the pool, we do the following :

  • mount the nfs volume in /nfsmnt/
  • create a 10GB sized file ovspoolfs.img
  • create a dm nfs volume (/dev/mapper/ovspoolfs) on this ovspoolfs.img file
  • create an ocfs2 volume on this dm nfs device
  • mount the ocfs2 volume on /poolfsmnt/
    If someone wants to try out something that relies on block-based shared storage devices, such as ocfs2, but does not have iSCSI or SAN storage, using nfs is an alternative, and dm nfs just makes it really easy.

    To do this yourself, the following commands will do it for you :

  • to find out if any such devices exist, just type dmsetup table --target nfs
  • to create your own device, do something like this :

    # mount mynfsserver:/mountpoint /mnt
    # dd if=/dev/zero of=/mnt/myvolume.img bs=1M count=2000
    # dmsetup create myvolume --table "0 4096000 nfs /mnt/myvolume.img 0"

    So mount the nfs volume, create a file which will be the container of the block device (in this case a 2GB file), and then create the dm device. The values for the dmsetup command are the following:

    myvolume = the name of the /dev/mapper device. Here we end up with /dev/mapper/myvolume

    table = start (normally always 0), then the length in 512-byte sectors (two sectors per KB, so you double the KB count; the 2000MB file above is 4096000 sectors), then nfs since this target sits on nfs, the filename of the nfs-based file, and finally the offset (normally always 0)
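The length value is the one people typically get wrong, so here is the sector arithmetic for the 2GB example spelled out (this just illustrates the math described above) :

```shell
# device-mapper tables count in 512-byte sectors;
# the dd example wrote 2000 blocks of 1M (1 MiB each)
size_bytes=$((2000 * 1024 * 1024))
sectors=$((size_bytes / 512))
echo "$sectors"
# -> 4096000, the length used in the dmsetup table above
```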

    So now you have /dev/mapper/myvolume, it acts like a normal block device. If you do this on multiple servers, you can actually create an ocfs2 filesystem on this block device and it will be consistent across the servers.

    Credits go to Chuck Lever for writing dm nfs in the first place, thanks Chuck :) The code for dm nfs is here.

    Wednesday Jul 04, 2012

    What's up with OCFS2?

    On Linux there are many filesystem choices and even from Oracle we provide a number of filesystems, all with their own advantages and use cases. Customers often confuse ACFS with OCFS or OCFS2 which then causes assumptions to be made such as one replacing the other etc... I thought it would be good to write up a summary of how OCFS2 got to where it is, what we're up to still, how it is different from other options and how this really is a cool native Linux cluster filesystem that we worked on for many years and is still widely used.

    Work on a cluster filesystem at Oracle started many years ago, in the early 2000's when the Oracle Database Cluster development team wrote a cluster filesystem for Windows that was primarily focused on providing an alternative to raw disk devices and help customers with the deployment of Oracle Real Application Cluster (RAC). Oracle RAC is a cluster technology that lets us make a cluster of Oracle Database servers look like one big database. The RDBMS runs on many nodes and they all work on the same data. It's a Shared Disk database design. There are many advantages doing this but I will not go into detail as that is not the purpose of my write up. Suffice it to say that Oracle RAC expects all the database data to be visible in a consistent, coherent way, across all the nodes in the cluster. To do that, there were/are a few options : 1) use raw disk devices that are shared, through SCSI, FC, or iSCSI 2) use a network filesystem (NFS) 3) use a cluster filesystem(CFS) which basically gives you a filesystem that's coherent across all nodes using shared disks. It is sort of (but not quite) combining option 1 and 2 except that you don't do network access to the files, the files are effectively locally visible as if it was a local filesystem.

    So OCFS (Oracle Cluster FileSystem) on Windows was born. Since Linux was becoming a very important and popular platform, we decided to make it available on Linux as well, and thus the porting of OCFS/Windows started. The first version of OCFS was primarily focused on replacing the use of raw devices with a simple filesystem that lets you create files and provides direct IO to these files, to get basically native raw disk performance. The filesystem was not designed to be fully POSIX compliant, and it did not have anywhere near good/decent performance for regular file create/delete/access operations. Cache coherency was easy since it was basically always direct IO down to the disk device: any write() went directly down to the disk and did not return until the write() was completed, and likewise any read() of a datafile went all the way to disk and back. We did not cache any data when it came down to Oracle data files.

    So while OCFS worked well for that, it did not have much of a normal filesystem feel, so it was not something that could be submitted to the kernel mailing list for inclusion into Linux as another native Linux filesystem (setting aside the Windows porting code...). It did its job well: it was very easy to configure, node membership was simple, locking was disk based (so very slow, but it existed), and you could create regular files and do regular filesystem operations to a certain extent. But anything that was not database data file related was just not very useful in general; log files were ok, standard filesystem use, not so much. Up to this point, all the work was done at Oracle, by Oracle developers.

    Once OCFS (1) was out for a while and there was a lot of use in the database RAC world, many customers wanted to do more and were asking for features that you'd expect in a normal native filesystem, a real "general purpose cluster filesystem". So the team sat down and basically started from scratch to implement what's now known as OCFS2 (Oracle Cluster FileSystem release 2). Some basic criteria were :

  • Design it with a real Distributed Lock Manager and use the network for lock negotiation instead of the disk
  • Make it a Linux native filesystem instead of a portable core with a native shim layer
  • Support standard POSIX compliance and be fully cache coherent with all operations
  • Support all the filesystem features Linux offers (ACL, extended Attributes, quotas, sparse files,...)
  • Be modern : support large files, 32/64-bit, journaling, ordered data journaling, and endian neutrality so that we can mount across endianness and architecture boundaries, ...
    Needless to say, this was a huge development effort that took many years to complete. A few big milestones happened along the way...

  • OCFS2 was developed in the open. We did not have a private tree that we worked on without external code review; Linux filesystem maintainers, great folks like Christoph Hellwig, reviewed the code regularly to make sure we were not doing anything out of line, and we submitted the code for review on lkml a number of times to see if we were getting close to mainline inclusion. This development model is standard practice for anyone who wants to write code that goes into the kernel and have any chance of doing so without a complete rewrite or, shall I say, flamefest when submitted. It saved us a tremendous amount of time by not having to re-fit the code into a Linus-acceptable state. Some other filesystems that tried to get into the kernel without following an open development model had a much harder time and got much harsher criticism.
  • March 2006 : when Linus released 2.6.16, OCFS2 officially became part of the mainline kernel (it was accepted a little earlier, in the release candidates). OCFS2 was the first cluster filesystem to make it into the kernel tree, as one of the many filesystems. Our hope was that it would then get picked up by the distribution vendors, making it easy for everyone to have access to a CFS. Today the source code for OCFS2 is approximately 85,000 lines of code.
  • We made OCFS2 production-ready, with full support for customers running the Oracle database on Linux; no extra or separate support contract needed. OCFS2 1.0.0 was built for RHEL4 on x86, x86-64, ppc, s390x and ia64; RHEL5 was covered starting with OCFS2 1.2.
  • SuSE was very interested in high availability and clustering and decided to build and include OCFS2 with SLES9 for their customers and was, next to Oracle, the main contributor to the filesystem for both new features and bug fixes.
  • Source code was always available, even prior to inclusion into mainline, and as of 2.6.16 the source code is simply part of a Linux kernel download, which it still is today. So the latest OCFS2 code is always the upstream mainline Linux kernel.
  • OCFS2 is the cluster filesystem used in Oracle VM 2 and Oracle VM 3 as the virtual disk repository filesystem.
  • Since the filesystem is in the Linux kernel it's released under the GPL v2
  • The release model has always been that new feature development happened in the mainline kernel and we then built consistent, well tested, snapshots that had versions, 1.2, 1.4, 1.6, 1.8. But these releases were effectively just snapshots in time that were tested for stability and release quality.

    OCFS2 is very easy to use, there's a simple text file that contains the node information (hostname, node number, cluster name) and a file that contains the cluster heartbeat timeouts. It is very small, and very efficient. As Sunil Mushran wrote in the manual :

    "OCFS2 is an efficient, easily configured, quickly installed, fully integrated and compatible, feature-rich, architecture and endian neutral, cache coherent, ordered data journaling, POSIX-compliant, shared disk cluster file system."

    Here is a list of some of the important features that are included :

  • Variable Block and Cluster Sizes : supports block sizes ranging from 512 bytes to 4 KB and cluster sizes ranging from 4 KB to 1 MB (increments in powers of 2).
  • Extent-based Allocations : tracks the allocated space in ranges of clusters, making it especially efficient for storing very large files.
  • Optimized Allocations : supports sparse files, inline-data, unwritten extents, hole punching and allocation reservation for higher performance and efficient storage.
  • File Cloning/Snapshots : REFLINK is a feature which introduces copy-on-write clones of files in a cluster-coherent way.
  • Indexed Directories : allows efficient access to millions of objects in a directory.
  • Metadata Checksums : detects silent corruption in inodes and directories.
  • Extended Attributes : supports attaching an unlimited number of name:value pairs to file system objects like regular files, directories, symbolic links, etc.
  • Advanced Security : supports POSIX ACLs and SELinux in addition to the traditional file access permission model.
  • Quotas : supports user and group quotas.
  • Journaling : supports both ordered and writeback data journaling modes to provide file system consistency in the event of power failure or system crash.
  • Endian and Architecture Neutral : supports a cluster of nodes with mixed architectures. Allows concurrent mounts on nodes running 32-bit and 64-bit, little-endian (x86, x86_64, ia64) and big-endian (ppc64) architectures.
  • In-built Cluster Stack with DLM : includes an easy to configure, in-kernel cluster stack with a distributed lock manager.
  • Buffered, Direct, Asynchronous, Splice and Memory-Mapped I/O : supports all modes of I/O for maximum flexibility and performance.
  • Comprehensive Tools Support : provides a familiar EXT3-style tool-set that uses similar parameters for ease of use.
    The filesystem was distributed for Linux distributions in separate RPM form, which had to be built for every single kernel errata release or every updated kernel provided by the vendor. We provided builds from Oracle for Oracle Linux and all kernels released by Oracle, and for Red Hat Enterprise Linux; SuSE provided the modules directly for every kernel they shipped. With the introduction of the Unbreakable Enterprise Kernel for Oracle Linux and our interest in reducing the overhead of building filesystem modules for every minor release, we decided to make OCFS2 available as part of UEK. There was no more need for separate kernel modules; everything was built in, and a kernel upgrade automatically updated the filesystem, as it should. UEK also meant we did not have to backport new upstream filesystem code into an older kernel version; backporting features into older versions introduces risk and requires extra testing because the code is basically partially rewritten. The UEK model works really well for continuing to provide OCFS2 without that extra overhead.

    Because the RHEL kernel does not contain OCFS2 as a kernel module (it is in the source tree, but it is not built by the vendor in kernel module form), we stopped adding the extra packages to Oracle Linux and its RHEL-compatible kernel, and for RHEL. Oracle Linux customers/users obviously get OCFS2 included as part of the Unbreakable Enterprise Kernel, SuSE customers get it distributed by SuSE with SLES, and Red Hat can decide to distribute OCFS2 to their customers if they choose to, as it's just a matter of compiling the module and making it available.

    OCFS2 today, in the mainline kernel, is pretty much feature complete in terms of integration with every filesystem feature Linux offers, and it is still actively maintained, with Joel Becker being the primary maintainer. Since we use OCFS2 as part of Oracle VM, we continue to look at interesting new functionality to add (REFLINK was a good example), and as such we continue to enhance the filesystem where it makes sense. Bugfixes and any code that goes into the mainline Linux kernel affecting filesystems automatically modify OCFS2 as well, so it is in-kernel and actively maintained, even if not a lot of new development is happening at this time. We continue to fully support OCFS2 as part of Oracle Linux and the Unbreakable Enterprise Kernel; other vendors make their own decisions on support, as it's really a native Linux cluster filesystem now, more than something that we provide to customers. It really is just part of Linux, like EXT3 or BTRFS; the OS distribution vendors decide.

    Do not confuse OCFS2 with ACFS (ASM Cluster File System), also known as Oracle Cloud File System. ACFS is a filesystem provided by Oracle on various OS platforms that really integrates into Oracle ASM (Automatic Storage Management). It's a very powerful cluster filesystem, but it's not distributed as part of the operating system; it's distributed with the Oracle Database product and installs with, and lives inside, Oracle ASM. ACFS obviously is fully supported on Linux (Oracle Linux, Red Hat Enterprise Linux), but OCFS2 independently, as a native Linux filesystem, is also, and continues to be, supported. ACFS is very much tied into the Oracle RDBMS; OCFS2 is just a standard native Linux filesystem with no ties into Oracle products. Customers running the Oracle database and ASM really should consider using ACFS, as it also provides storage/clustered volume management. Customers wanting a simple, easy to use, generic Linux cluster filesystem should consider using OCFS2.

    To learn more about OCFS2 in detail, you can find good documentation in the Documentation area, or get the latest mainline kernel and read the source.

    One final, unrelated note : since I am not always able to publicly answer or respond to comments, I do not want to selectively publish comments from readers. Sometimes I forget to publish comments, sometimes I publish them, and sometimes I would publish them but cannot publicly comment on them for some reason, which makes for a very one-sided stream. So for now I am not going to publish comments from anyone, to be fair to all sides. You are always welcome to email me and I will do my best to respond to technical questions; questions about strategy or direction are sometimes not possible to answer, for obvious reasons.

    Thursday Mar 29, 2012

    4.8M wasn't enough so we went for 5.055M tpmc with Unbreakable Enterprise Kernel r2 :-)

    We released a new set of benchmarks today. One is an updated tpc-c from a few months ago where we had just over 4.8M tpmc at $0.98 and we just updated it to go to 5.05M and $0.89. The other one is related to Java Middleware performance. You can find the press release here.

    Now, I don't want to talk about the actual relevance of the benchmark numbers, as I am not in the benchmark team. I want to talk about why these numbers and these efforts, unrelated to what they mean to your workload, matter to customers. The actual benchmark effort is a very big, long, expensive undertaking where many groups work together as a big virtual team. Having the virtual team be within a single company of course helps tremendously... We already start with a very big server setup with tons of storage, many disks, lots of ram, lots of cpu's, cores, threads, large database setups. Getting the whole setup going to start tuning, by itself, is no easy task, but then the real fun starts with tuning the system for optimal performance -and- stability. A benchmark is not just revving an engine at high rpm, it's actually hitting the circuit. The tests require long runs, require surviving availability tests, such as surviving crashes -and- recovery under load.

    In the TPC-C example, the x4800 system had 4TB of RAM, 160 threads (8 sockets, hyperthreaded, 10 cores/socket), tons of storage attached, and tons of LUNs visible to the OS, flash storage and non-flash storage... many things at high scale that all have to be perfectly synchronized.

    During this process, we find bugs, we fix bugs, we find performance issues, we fix performance issues, we find interesting potential features to investigate for the future, we start new development projects for future releases and all this goes back into the products. As more and more customers, for Oracle Linux, are running larger and larger, faster and faster, more mission critical, higher available databases..., these things are just absolutely critical. Unrelated to what anyone's specific opinion is about tpc-c or tpc-h or specjenterprise etc, there is a ton of effort that the customer benefits from. All this work makes Oracle Linux and/or Oracle Solaris better platforms. Whether it's faster, more stable, more scalable, more resilient. It helps.

    Another point that I always like to re-iterate around UEK and UEK2 : we have our kernel source git repository online, with the complete changelog of the mainline kernel and our changes. Easy to pull, easy to dissect, easy to know what went in when, why and where. No need to log into a website and manually click through pages to hopefully discover changes or patches. No need to untar 2 tarballs and run a diff.

    Wednesday Feb 22, 2012

    DTrace update to 0.2

    We just put an updated version of DTrace on ULN.
    This is version 0.2, another preview (hence the 0.x...)

    To test it out :

    Register a server with ULN and add the following channel to the server list : ol6_x86_64_Dtrace_BETA.

    When you have that channel registered with your server, you can install the following required RPMs :


    Once the RPMs are installed, reboot the server into the correct kernel : 2.6.39-101.0.1.el6uek.

    The DTrace modules are installed in /lib/modules/2.6.39-101.0.1.el6uek.x86_64/kernel/drivers/dtrace.

    # cd /lib/modules/2.6.39-101.0.1.el6uek.x86_64/kernel/drivers/dtrace
    # ls
    dtrace.ko  dt_test.ko  profile.ko  sdt.ko  systrace.ko
    Load the DTrace modules into the running kernel:
    # modprobe dtrace 
    # modprobe profile
    # modprobe sdt
    # modprobe systrace
    # modprobe dt_test

    The DTrace compiler is in /usr/sbin/dtrace. There are a few README files in /usr/share/doc/dtrace-0.2.4.
    These explain what's there and what's not yet there...

    New features:

    - The SDT provider is implemented, providing in-kernel static probes. Some of the proc provider is implemented using this facility.


    - Syscall tracing of stub-based syscalls (such as fork, clone, exit, and sigreturn) now works.
    - Invalid memory accesses inside D scripts no longer cause oopses or panics.
    - Memory exhaustion inside D scripts no longer emits spurious oopses.
    - Several crash fixes.
    - Fixes to arithmetic inside aggregations, fixing quantize().
    - Improvements to the installed headers.

    We are also getting pretty good coverage on both userspace and kernel in terms of the DTrace testsuites.
    Thanks to the team working on it!

    Below are a few examples with output and source code for the .d scripts.

    activity.d    - this shows ongoing activity in terms of what program was executing, 
              what its parent is, and how long it ran. This makes use of the proc SDT provider.
    pstrace.d     - this is similar but instead of providing timing, it lists the ancestry
              of a process, based on whatever history is collected during the DTrace runtime 
              of this script.  This makes use of the proc SDT provider.
    rdbufsize.d   - this shows quantised results for buffer sizes used in read syscalls, 
              i.e. it gives a statistical breakdown of sizes passed in the read() syscall, 
              which can be useful to see what buffer sizes are commonly used.
    activity.d :

    #pragma D option quiet

    /* proc-provider probe clauses restored; they were lost in formatting */
    proc:::create
    {
        this->pid = *((int *)arg0 + 171);
        time[this->pid] = timestamp;
        p_pid[this->pid] = pid;
        p_name[this->pid] = execname;
        p_exec[this->pid] = "";
    }

    proc:::exec
    {
        p_exec[pid] = stringof(arg0);
    }

    proc:::exit
    /p_pid[pid] && p_exec[pid] != ""/
    {
        printf("%d: %s (%d) executed %s (%d) for %d msecs\n",
               timestamp, p_name[pid], p_pid[pid], p_exec[pid], pid,
               (timestamp - time[pid]) / 1000);
    }

    proc:::exit
    /p_pid[pid] && p_exec[pid] == ""/
    {
        printf("%d: %s (%d) forked itself (as %d) for %d msecs\n",
               timestamp, p_name[pid], p_pid[pid], pid,
               (timestamp - time[pid]) / 1000);
    }
    pstrace.d :

    #pragma D option quiet

    /* The probe clause boundaries below were lost in formatting; this
       grouping is a best-effort reconstruction, predicates as in the
       original. */
    proc:::create
    /path[pid] == ""/
    {
        this->pid = *((int *)arg0 + 171);
        p_pid[this->pid] = pid;
        p_name[this->pid] = execname;
        p_exec[this->pid] = "";
        path[this->pid] = strjoin(execname, " ->  ");
    }

    proc:::create
    /path[pid] != ""/
    {
        this->pid = *((int *)arg0 + 171);
        p_pid[this->pid] = pid;
        p_name[this->pid] = path[pid];
        p_exec[this->pid] = "";
        path[this->pid] = strjoin(path[pid], " ->  ");
    }

    proc:::exec
    {
        this->path = basename(stringof(arg0));
        path[pid] = strjoin(p_name[pid], strjoin(" ->  ", this->path));
        p_exec[pid] = this->path;
    }

    proc:::exit
    /p_pid[pid] && p_exec[pid] != ""/
    {
        printf("%d: %s[%d] ->  %s[%d]\n",
               timestamp, p_name[pid], p_pid[pid], p_exec[pid], pid);
        p_name[pid] = 0;
        p_pid[pid] = 0;
        p_exec[pid] = 0;
        path[pid] = 0;
    }

    proc:::exit
    /p_pid[pid] && p_exec[pid] == ""/
    {
        printf("%d: %s[%d] ->  [%d]\n",
               timestamp, p_name[pid], p_pid[pid], pid);
        p_name[pid] = 0;
        p_pid[pid] = 0;
        p_exec[pid] = 0;
        path[pid] = 0;
    }
    rdbufsize.d :

    syscall::read:entry
    {
        @["read"] = quantize(arg2);
    }
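For reference, quantize() buckets each value into power-of-two ranges, which is what produces the distribution table in the sample output further below. The little shell sketch here is an illustration of that bucketing idea, not DTrace's implementation :

```shell
# Illustration only: the power-of-two bucket a positive value falls
# into, i.e. the largest power of two that is <= the value
bucket() {
    v=$1
    b=1
    while [ $((b * 2)) -le "$v" ]; do
        b=$((b * 2))
    done
    echo "$b"
}

bucket 1500    # -> 1024 (counted in the 1024 row of the table)
bucket 4096    # -> 4096
```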
    Since we do not yet have CTF support, the scripts do raw memory access to get to the pid field in the task_struct.
    this->pid = *((int *)arg0 + 171);
    Where arg0 is a pointer to struct task_struct (arg0 passed to the proc:::create probe when a new task/thread/process is created).

    Just cut and paste these scripts into text files and run them. I have some sample output below :

    activity.d (here I just run some commands in a separate shell which then show in the output)

    # dtrace -s activity.d 
    2134889238792594: automount (1736) forked itself (as 11484) for 292 msecs
    2134912932312379: bash (11488) forked itself (as 11489) for 1632 msecs
    2134912934171504: bash (11488) forked itself (as 11491) for 1319 msecs
    2134912937531743: bash (11488) forked itself (as 11493) for 2150 msecs
    2134912939231853: bash (11488) forked itself (as 11496) for 1366 msecs
    2134912945152337: bash (11488) forked itself (as 11499) for 1135 msecs
    2134912948946944: bash (11488) forked itself (as 11503) for 1285 msecs
    2134912923230099: sshd (11485) forked itself (as 11486) for 8790195 msecs
    2134912932092719: bash (11489) executed /usr/bin/id (11490) for 1005 msecs
    2134912945773882: bash (11488) forked itself (as 11501) for 328 msecs
    2134912937325453: bash (11493) executed /usr/bin/tput (11495) for 721 msecs
    2134912941951947: bash (11488) executed /bin/grep (11498) for 1418 msecs
    2134912933963262: bash (11491) executed /bin/hostname (11492) for 804 msecs
    2134912936358611: bash (11493) executed /usr/bin/tty (11494) for 626 msecs
    2134912939035204: bash (11496) executed /usr/bin/dircolors (11497) for 789 msecs
    2134912944986994: bash (11499) executed /bin/uname (11500) for 621 msecs
    2134912946568141: bash (11488) executed /bin/grep (11502) for 1003 msecs
    2134912948757031: bash (11503) executed /usr/bin/id (11504) for 796 msecs
    2134913874947141: ksmtuned (1867) forked itself (as 11505) for 2189 msecs
    2134913883976223: ksmtuned (11507) executed /bin/awk (11509) for 8056 msecs
    2134913883854384: ksmtuned (11507) executed /bin/ps (11508) for 8122 msecs
    2134913884227577: ksmtuned (1867) forked itself (as 11507) for 9025 msecs
    2134913874664300: ksmtuned (11505) executed /bin/awk (11506) for 1307 msecs
    2134919238874188: automount (1736) forked itself (as 11511) for 263 msecs
    2134920459512267: bash (11488) executed /bin/ls (11512) for 1682 msecs
    2134930786318884: bash (11488) executed /bin/ps (11513) for 7241 msecs
    2134933581336279: bash (11488) executed /bin/find (11514) for 161853 msecs
    pstrace.d (as daemons or shells/users execute binaries, they show up automatically)
    # dtrace -s pstrace.d 
    2134960378397662: bash[11488] ->  ps[11517]
    2134962360623937: bash[11488] ->  ls[11518]
    2134964238953132: automount[1736] ->  [11519]
    2134965712514625: bash[11488] ->  df[11520]
    2134971432047109: bash[11488] ->  top[11521]
    2134973888279789: ksmtuned[1867] ->  [11522]
    2134973897131858: ksmtuned ->  [11524] ->  awk[11526]
    2134973896999204: ksmtuned ->  [11524] ->  ps[11525]
    2134973897400622: ksmtuned[1867] ->  [11524]
    2134973888019910: ksmtuned ->  [11522] ->  awk[11523]
    2134981995742661: sshd ->  sshd ->  bash[11531] ->  [11532]
    2134981997448161: sshd ->  sshd ->  bash[11531] ->  [11534]
    2134982000599413: sshd ->  sshd ->  bash[11531] ->  [11536]
    2134982002035206: sshd ->  sshd ->  bash[11531] ->  [11539]
    2134982007815639: sshd ->  sshd ->  bash[11531] ->  [11542]
    2134982011627125: sshd ->  sshd ->  bash[11531] ->  [11546]
    2134981989026168: sshd ->  sshd[11529] ->  [11530]
    2134982008472173: sshd ->  sshd ->  bash[11531] ->  [11544]
    2134981995518210: sshd ->  sshd ->  bash ->  [11532] ->  id[11533]
    2134982000393612: sshd ->  sshd ->  bash ->  [11536] ->  tput[11538]
    2134982004531164: sshd ->  sshd ->  bash[11531] ->  grep[11541]
    2134981997256114: sshd ->  sshd ->  bash ->  [11534] ->  hostname[11535]
    2134981999476476: sshd ->  sshd ->  bash ->  [11536] ->  tty[11537]
    2134982001865119: sshd ->  sshd ->  bash ->  [11539] ->  dircolors[11540]
    2134982007610268: sshd ->  sshd ->  bash ->  [11542] ->  uname[11543]
    2134982009271769: sshd ->  sshd ->  bash[11531] ->  grep[11545]
    2134982011408808: sshd ->  sshd ->  bash ->  [11546] ->  id[11547]
    rdbufsize.d (in another shell I just did some random read operations and this
    shows a summary)
    # dtrace -s rdbufsize.d 
    dtrace: script 'rdbufsize.d' matched 1 probe
               value  ------------- Distribution ------------- count    
                  -1 |                                         0        
                   0 |                                         8        
                   1 |                                         59       
                   2 |                                         209      
                   4 |                                         72       
                   8 |                                         488      
                  16 |                                         67       
                  32 |                                         1074     
                  64 |                                         113      
                 128 |                                         88       
                 256 |                                         384      
                 512 |@@@                                      6582     
                1024 |@@@@@@@@@@@@@@@@@@                       44787    
                2048 |@                                        2419     
                4096 |@@@@@@@                                  16239    
                8192 |@@@@                                     10395    
               16384 |@@@@@@                                   14784    
               32768 |                                         427      
               65536 |                                         669      
              131072 |                                         143      
              262144 |                                         43       
              524288 |                                         46       
             1048576 |                                         92       
             2097152 |                                         196      
             4194304 |                                         0   
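    The rdbufsize.d script itself isn't reproduced in the post. A minimal sketch that would produce this kind of power-of-two distribution might look like the following; note the -1 bucket in the output, which suggests the read() return value is being quantized (reads that fail return -1). Treat this as an assumption, not the original script:

    ```d
    /* rdbufsize.d (sketch): power-of-two distribution of read() result sizes.
       arg0 in the return probe is the syscall's return value. */
    syscall::read:return
    {
            @ = quantize(arg0);
    }
    ```

    The anonymous aggregation prints automatically when the script exits, giving a histogram like the one above.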

    Sunday Oct 16, 2011

    Containers on Linux

    At Oracle OpenWorld we talked about Linux Containers. Here is an example of getting a Linux container going with Oracle Linux 6.1, UEK2 beta and btrfs. This is just an example: not released, not production-ready, not guaranteed bug-free... for those that don't read README files ;-)

    This container example is using the existing Linux cgroups features in the mainline kernel (and also in UEK, UEK2) and lxc tools to create the environments.

    Example assumptions :
    - the host OS is Oracle Linux 6.1 with UEK2 beta
    - containers live on a btrfs filesystem (to make use of snapshot capabilities)
    - the filesystem is mounted at /container
    - Oracle VM templates are used as the base environment
    - the containers run Oracle Linux 5

    I have a second disk on my test machine (/dev/sdb) which I will use for this exercise.

    # mkfs.btrfs  -L container  /dev/sdb
    # mount
    /dev/mapper/vg_wcoekaersrv4-lv_root on / type ext4 (rw)
    proc on /proc type proc (rw)
    sysfs on /sys type sysfs (rw)
    devpts on /dev/pts type devpts (rw,gid=5,mode=620)
    tmpfs on /dev/shm type tmpfs (rw)
    /dev/sda1 on /boot type ext4 (rw)
    /dev/mapper/vg_wcoekaersrv4-lv_home on /home type ext4 (rw)
    none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
    sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
    /dev/mapper/loop0p2 on /mnt type ext3 (rw)
    /dev/mapper/loop1p2 on /mnt2 type ext3 (rw)
    /dev/sdb on /container type btrfs (rw)

    lxc tools installed...

    # rpm -qa|grep lxc

    lxc tools come with template config files :

    # ls /usr/lib64/lxc/templates/
    lxc-altlinux lxc-busybox lxc-debian lxc-fedora lxc-lenny lxc-ol4 lxc-ol5 lxc-opensuse lxc-sshd lxc-ubuntu
    I created one for Oracle Linux 5 : lxc-ol5.

    Download an Oracle VM template for OL5; I used OVM_EL5U5_X86_PVM_10GB.
    We want to be able to create one environment that can be used in both container and VM mode, to avoid duplicate effort.

    Untar the VM template.

    # tar zxvf OVM_EL5U5_X86_PVM_10GB.tar.gz
    These are the steps needed (to be automated in the future)...
    Copy the content of the VM virtual disk's root filesystem into a btrfs subvolume in order to easily clone the base template.

    My template configure script defines :

    - create subvolume ol5-template on /container

    # btrfs subvolume create /container/ol5-template
    Create subvolume '/container/ol5-template'
    - loopback mount the Oracle VM template System image / partition
    # kpartx -a System.img 
    # kpartx -l System.img 
    loop0p1 : 0 192717 /dev/loop0 63
    loop0p2 : 0 21607425 /dev/loop0 192780
    loop0p3 : 0 4209030 /dev/loop0 21800205
    I need to mount the 2nd partition of the virtual disk image; kpartx sets up a loopback device for each of the virtual disk partitions. So let's mount loop0p2, which contains the Oracle Linux 5 / filesystem of the template.
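    As an aside, the kpartx -l columns follow the device-mapper table layout: partition name, start-in-table, length in 512-byte sectors, backing loop device, and starting sector within the image. So the byte offset of each partition inside System.img is just its starting sector times 512. A quick sketch using the output above:

    ```shell
    # Each kpartx -l line: name : 0 <length_sectors> <device> <start_sector>
    # Byte offset inside the image file = start_sector * 512
    kpartx_l='loop0p1 : 0 192717 /dev/loop0 63
    loop0p2 : 0 21607425 /dev/loop0 192780
    loop0p3 : 0 4209030 /dev/loop0 21800205'

    echo "$kpartx_l" | awk '{ printf "%s starts at byte %d, length %d bytes\n", $1, $6*512, $4*512 }'
    ```

    This is the same arithmetic that would feed a manual `mount -o loop,offset=...` if you preferred not to use kpartx.
    
    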
    # mount /dev/mapper/loop0p2 /mnt
    # ls /mnt
    bin  boot  dev  etc  home  lib  lost+found  media  misc  mnt  opt  proc  
    root  sbin  selinux  srv  sys  tftpboot  tmp  u01  usr  var
    Great, now we have the entire template / filesystem available. Let's copy this into our subvolume. This subvolume will then become the basis for all OL5 containers.
    # cd /mnt
    # tar cvf - * | ( cd /container/ol5-template ; tar xvf - ; )
    In the near future we will put some automation around the above steps.
    # pwd
    # ls
    bin  boot  dev  etc  home  lib  lost+found  media  misc  mnt  opt  proc  
    root  sbin  selinux  srv  sys  tftpboot  tmp  u01  usr  var
    From this point on, the lxc-create script, using the template config as an argument, should be able to automatically create a snapshot and set up the filesystem correctly.
    # lxc-create -n ol5test1 -t ol5
    Cloning base template /container/ol5-template to /container/ol5test1 ...
    Create a snapshot of '/container/ol5-template' in '/container/ol5test1'
    Container created : /container/ol5test1 ...
    Container template source : /container/ol5-template
    Container config : /etc/lxc/ol5test1
    Network : eth0 (veth) on virbr0
    'ol5' template installed
    'ol5test1' created
    # ls /etc/lxc/ol5test1/
    config  fstab
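    The post doesn't show the contents of the generated config file. With the lxc tools of that era, a container config along these lines would be typical; the values here are illustrative, not the actual generated file:

    ```
    lxc.utsname = ol5test1
    lxc.network.type = veth
    lxc.network.link = virbr0
    lxc.network.flags = up
    lxc.network.hwaddr = 52:19:C0:EF:78:C4
    lxc.rootfs = /container/ol5test1
    lxc.mount = /etc/lxc/ol5test1/fstab
    ```

    The veth pair bridged onto virbr0 is what gives the container the network reported at startup.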
    # ls /container/ol5test1/
    bin  boot  dev  etc  home  lib  lost+found  media  misc  mnt  opt  proc  
    root  sbin  selinux  srv  sys  tftpboot  tmp  u01  usr  var
    Now that it's created and configured, we should be able to simply start it :
    # lxc-start -n ol5test1
    INIT: version 2.86 booting
                    Welcome to Enterprise Linux Server
                    Press 'I' to enter interactive startup.
    Setting clock  (utc): Sun Oct 16 06:08:27 EDT 2011         [  OK  ]
    Loading default keymap (us):                               [  OK  ]
    Setting hostname ol5test1:                                 [  OK  ]
    raidautorun: unable to autocreate /dev/md0
    Checking filesystems
                                                               [  OK  ]
    mount: can't find / in /etc/fstab or /etc/mtab
    Mounting local filesystems:                                [  OK  ]
    Enabling local filesystem quotas:                          [  OK  ]
    Enabling /etc/fstab swaps:                                 [  OK  ]
    INIT: Entering runlevel: 3
    Entering non-interactive startup
    Starting sysstat:  Calling the system activity data collector (sadc): 
                                                               [  OK  ]
    Starting background readahead:                             [  OK  ]
    Flushing firewall rules:                                   [  OK  ]
    Setting chains to policy ACCEPT: nat mangle filter         [  OK  ]
    Applying iptables firewall rules:                          [  OK  ]
    Loading additional iptables modules: no                    [FAILED]
    Bringing up loopback interface:                            [  OK  ]
    Bringing up interface eth0:  
    Determining IP information for eth0... done.
                                                               [  OK  ]
    Starting system logger:                                    [  OK  ]
    Starting kernel logger:                                    [  OK  ]
    Enabling ondemand cpu frequency scaling:                   [  OK  ]
    Starting irqbalance:                                       [  OK  ]
    Starting portmap:                                          [  OK  ]
    FATAL: Could not load /lib/modules/2.6.39-100.0.12.el6uek.x86_64/modules.dep: No such file or directory
    Starting NFS statd:                                        [  OK  ]
    Starting RPC idmapd: Error: RPC MTAB does not exist.
    Starting system message bus:                               [  OK  ]
    Starting o2cb:                                             [  OK  ]
    Can't open RFCOMM control socket: Address family not supported by protocol
    Mounting other filesystems:                                [  OK  ]
    Starting PC/SC smart card daemon (pcscd):                  [  OK  ]
    Starting HAL daemon:                                       [FAILED]
    Starting hpiod:                                            [  OK  ]
    Starting hpssd:                                            [  OK  ]
    Starting sshd:                                             [  OK  ]
    Starting cups:                                             [  OK  ]
    Starting xinetd:                                           [  OK  ]
    Starting crond:                                            [  OK  ]
    Starting xfs:                                              [  OK  ]
    Starting anacron:                                          [  OK  ]
    Starting atd:                                              [  OK  ]
    Starting yum-updatesd:                                     [  OK  ]
    Starting Avahi daemon...                                   [FAILED]
    Starting oraclevm-template...
    Regenerating SSH host keys.
    Stopping sshd:                                             [  OK  ]
    Generating SSH1 RSA host key:                              [  OK  ]
    Generating SSH2 RSA host key:                              [  OK  ]
    Generating SSH2 DSA host key:                              [  OK  ]
    Starting sshd:                                             [  OK  ]
    Regenerating up2date uuid.
    Setting Oracle validated configuration parameters.
    Configuring network interface.
      Network device: eth0
      Hardware address: 52:19:C0:EF:78:C4
    Do you want to enable dynamic IP configuration (DHCP) (Y|n)? 
    This will run the well-known Oracle VM template configure scripts and set up the container the same way as it would an Oracle VM guest.

    The session that runs lxc-start is the local console. It is best to run this session inside screen so you can disconnect and reconnect.

    At this point, I can use lxc-console to log into the local console of the container, or, since the container has its internal network up and running and sshd is running, I can also just ssh into the guest.
    # lxc-console -n ol5test1 -t 1
    Enterprise Linux Enterprise Linux Server release 5.5 (Carthage)
    Kernel 2.6.39-100.0.12.el6uek.x86_64 on an x86_64
    host login: 
    I can simply get out of the console by typing ctrl-a q.

    From inside the container :
    # mount
    proc on /proc type proc (rw,noexec,nosuid,nodev)
    sysfs on /sys type sysfs (rw)
    devpts on /dev/pts type devpts (rw)
    none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
    # /sbin/ifconfig
    eth0      Link encap:Ethernet  HWaddr 52:19:C0:EF:78:C4  
              inet addr:  Bcast:  Mask:
              inet6 addr: fe80::5019:c0ff:feef:78c4/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:141 errors:0 dropped:0 overruns:0 frame:0
              TX packets:19 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000 
              RX bytes:8861 (8.6 KiB)  TX bytes:2476 (2.4 KiB)
    lo        Link encap:Local Loopback  
              inet addr:  Mask:
              inet6 addr: ::1/128 Scope:Host
              UP LOOPBACK RUNNING  MTU:16436  Metric:1
              RX packets:8 errors:0 dropped:0 overruns:0 frame:0
              TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0 
              RX bytes:560 (560.0 b)  TX bytes:560 (560.0 b)
    # ps aux
    root         1  0.0  0.0   2124   656 ?        Ss   06:08   0:00 init [3]  
    root       397  0.0  0.0   1780   596 ?        Ss   06:08   0:00 syslogd -m 0
    root       400  0.0  0.0   1732   376 ?        Ss   06:08   0:00 klogd -x
    root       434  0.0  0.0   2524   368 ?        Ss   06:08   0:00 irqbalance
    rpc        445  0.0  0.0   1868   516 ?        Ss   06:08   0:00 portmap
    root       469  0.0  0.0   1920   740 ?        Ss   06:08   0:00 rpc.statd
    dbus       509  0.0  0.0   2800   576 ?        Ss   06:08   0:00 dbus-daemon --system
    root       578  0.0  0.0  10868  1248 ?        Ssl  06:08   0:00 pcscd
    root       610  0.0  0.0   5196   712 ?        Ss   06:08   0:00 ./hpiod
    root       615  0.0  0.0  13520  4748 ?        S    06:08   0:00 python ./
    root       637  0.0  0.0  10168  2272 ?        Ss   06:08   0:00 cupsd
    root       651  0.0  0.0   2780   812 ?        Ss   06:08   0:00 xinetd -stayalive -pidfile /var/run/
    root       660  0.0  0.0   5296  1096 ?        Ss   06:08   0:00 crond
    root       745  0.0  0.0   1728   580 ?        SNs  06:08   0:00 anacron -s
    root       753  0.0  0.0   2320   340 ?        Ss   06:08   0:00 /usr/sbin/atd
    root       817  0.0  0.0  25580 10136 ?        SN   06:08   0:00 /usr/bin/python -tt /usr/sbin/yum-updatesd
    root       819  0.0  0.0   2616  1072 ?        SN   06:08   0:00 /usr/libexec/gam_server
    root       830  0.0  0.0   7116  1036 ?        Ss   06:08   0:00 /usr/sbin/sshd
    root      2998  0.0  0.0   2368   424 ?        Ss   06:08   0:00 /sbin/dhclient -1 -q -lf /var/lib/dhclient/dhclient-eth0.leases -pf /var/run/dhc
    root      3102  0.0  0.0   5008  1376 ?        Ss   06:09   0:00 login -- root     
    root      3103  0.0  0.0   1716   444 tty2     Ss+  06:09   0:00 /sbin/mingetty tty2
    root      3104  0.0  0.0   1716   448 tty3     Ss+  06:09   0:00 /sbin/mingetty tty3
    root      3105  0.0  0.0   1716   448 tty4     Ss+  06:09   0:00 /sbin/mingetty tty4
    root      3138  0.0  0.0   4584  1436 tty1     Ss   06:11   0:00 -bash
    root      3167  0.0  0.0   4308   936 tty1     R+   06:12   0:00 ps aux
    From the host :
    # lxc-info -n ol5test1
    state:   RUNNING
    pid:     16539
    # lxc-kill -n ol5test1
    # lxc-monitor -n ol5test1
    'ol5test1' changed state to [STOPPING]
    'ol5test1' changed state to [STOPPED]
    So creating more containers is trivial. Just keep running lxc-create.
    # lxc-create -n ol5test2 -t ol5
    # btrfs subvolume list /container
    ID 297 top level 5 path ol5-template
    ID 299 top level 5 path ol5test1
    ID 300 top level 5 path ol5test2
    The lxc tools will be uploaded to the UEK2 beta channel so you can start playing with this.

    Oracle Linux 4 example

    Here is the same principle for Oracle Linux 4, using the template create script lxc-ol4. I started out with the OVM_EL4U7_X86_PVM_4GB template and followed the same steps.

    # kpartx -a System.img 
    # kpartx -l System.img 
    loop0p1 : 0 64197 /dev/loop0 63
    loop0p2 : 0 8530515 /dev/loop0 64260
    loop0p3 : 0 4176900 /dev/loop0 8594775
    # mount /dev/mapper/loop0p2 /mnt
    # cd /mnt
    # btrfs subvolume create /container/ol4-template
    Create subvolume '/container/ol4-template'
    # tar cvf - * | ( cd /container/ol4-template ; tar xvf - ; )
    # lxc-create -n ol4test1 -t ol4
    Cloning base template /container/ol4-template to /container/ol4test1 ...
    Create a snapshot of '/container/ol4-template' in '/container/ol4test1'
    Container created : /container/ol4test1 ...
    Container template source : /container/ol4-template
    Container config : /etc/lxc/ol4test1
    Network : eth0 (veth) on virbr0
    'ol4' template installed
    'ol4test1' created
    # lxc-start -n ol4test1
    INIT: version 2.85 booting
    /etc/rc.d/rc.sysinit: line 80: /dev/tty5: Operation not permitted
    /etc/rc.d/rc.sysinit: line 80: /dev/tty6: Operation not permitted
    Setting default font (latarcyrheb-sun16):                  [  OK  ]
                    Welcome to Enterprise Linux
                    Press 'I' to enter interactive startup.
    Setting clock  (utc): Sun Oct 16 09:34:56 EDT 2011         [  OK  ]
    Initializing hardware...  storage network audio done       [  OK  ]
    raidautorun: unable to autocreate /dev/md0
    Configuring kernel parameters:  error: permission denied on key 'net.core.rmem_default'
    error: permission denied on key 'net.core.rmem_max'
    error: permission denied on key 'net.core.wmem_default'
    error: permission denied on key 'net.core.wmem_max'
    net.ipv4.ip_forward = 0
    net.ipv4.conf.default.rp_filter = 1
    net.ipv4.conf.default.accept_source_route = 0
    kernel.core_uses_pid = 1
    fs.file-max = 327679
    kernel.msgmni = 2878
    kernel.msgmax = 8192
    kernel.msgmnb = 65536
    kernel.sem = 250 32000 100 142
    kernel.shmmni = 4096
    kernel.shmall = 1073741824
    kernel.sysrq = 1
    fs.aio-max-nr = 3145728
    net.ipv4.ip_local_port_range = 1024 65000
    kernel.shmmax = 4398046511104
    Loading default keymap (us):                               [  OK  ]
    Setting hostname ol4test1:                                 [  OK  ]
    Remounting root filesystem in read-write mode:             [  OK  ]
    mount: can't find / in /etc/fstab or /etc/mtab
    Mounting local filesystems:                                [  OK  ]
    Enabling local filesystem quotas:                          [  OK  ]
    Enabling swap space:                                       [  OK  ]
    INIT: Entering runlevel: 3
    Entering non-interactive startup
    Starting sysstat:                                          [  OK  ]
    Setting network parameters:  error: permission denied on key 'net.core.rmem_default'
    error: permission denied on key 'net.core.rmem_max'
    error: permission denied on key 'net.core.wmem_default'
    error: permission denied on key 'net.core.wmem_max'
    net.ipv4.ip_forward = 0
    net.ipv4.conf.default.rp_filter = 1
    net.ipv4.conf.default.accept_source_route = 0
    kernel.core_uses_pid = 1
    fs.file-max = 327679
    kernel.msgmni = 2878
    kernel.msgmax = 8192
    kernel.msgmnb = 65536
    kernel.sem = 250 32000 100 142
    kernel.shmmni = 4096
    kernel.shmall = 1073741824
    kernel.sysrq = 1
    fs.aio-max-nr = 3145728
    net.ipv4.ip_local_port_range = 1024 65000
    kernel.shmmax = 4398046511104
    Bringing up loopback interface:                            [  OK  ]
    Bringing up interface eth0:                                [  OK  ]
    Starting system logger:                                    [  OK  ]
    Starting kernel logger:                                    [  OK  ]
    Starting portmap:                                          [  OK  ]
    Starting NFS statd:                                        [FAILED]
    Starting RPC idmapd: Error: RPC MTAB does not exist.
    Mounting other filesystems:                                [  OK  ]
    Starting lm_sensors:                                       [  OK  ]
    Starting cups:                                             [  OK  ]
    Generating SSH1 RSA host key:                              [  OK  ]
    Generating SSH2 RSA host key:                              [  OK  ]
    Generating SSH2 DSA host key:                              [  OK  ]
    Starting sshd:                                             [  OK  ]
    Starting xinetd:                                           [  OK  ]
    Starting crond:                                            [  OK  ]
    Starting xfs:                                              [  OK  ]
    Starting anacron:                                          [  OK  ]
    Starting atd:                                              [  OK  ]
    Starting system message bus:                               [  OK  ]
    Starting cups-config-daemon:                               [  OK  ]
    Starting HAL daemon:                                       [  OK  ]
    Starting oraclevm-template...
    Regenerating SSH host keys.
    Stopping sshd:                                             [  OK  ]
    Generating SSH1 RSA host key:                              [  OK  ]
    Generating SSH2 RSA host key:                              [  OK  ]
    Generating SSH2 DSA host key:                              [  OK  ]
    Starting sshd:                                             [  OK  ]
    Regenerating up2date uuid.
    Setting Oracle validated configuration parameters.
    Configuring network interface.
      Network device: eth0
      Hardware address: D2:EC:49:0D:7D:80
    Do you want to enable dynamic IP configuration (DHCP) (Y|n)? 
    # lxc-console -n ol4test1
    Enterprise Linux Enterprise Linux AS release 4 (October Update 7)
    Kernel 2.6.39-100.0.12.el6uek.x86_64 on an x86_64
    localhost login: 

    Wednesday Sep 28, 2011

    btrfs scrub - go fix corruptions with mirror copies please!

    Another day, another btrfs entry. I'm trying to learn all the ins and outs of the filesystem here.

    As many of you know, btrfs checksums (CRCs) both data and metadata. I created a simple btrfs filesystem :

    # mkfs.btrfs -L btrfstest -d raid1 -m raid1 /dev/sdb /dev/sdc
    Then I created a file on the volume :
    # dd if=/dev/urandom of=/btrfs/foo bs=1M count=100
    # md5sum /btrfs/foo 
    76f4c03dc7a3477939467ee230696b70  /btrfs/foo
    So now let's play the bad guy and write over the disk itself, underneath the filesystem, so it has no idea. This could be a device shared with another server that accidentally had data written to it, a bad userspace program that spews output to the wrong device, or even a bug in the kernel...

    Step 1: find the physical layout of the file :
    # filefrag -v /btrfs/foo
    Filesystem type is: 9123683e
    File size of /btrfs/foo is 104857600 (25600 blocks, blocksize 4096)
     ext logical physical expected length flags
       0       0   269312           25600 eof
    /btrfs/foo: 1 extent found
    # echo $[4096*269312]
    1103101952
    The filesystem uses a 4k blocksize and the file data starts at block 269312, so its byte offset is 1103101952. Now we call btrfs-map-logical to find out what the physical offsets are on both mirrors (/dev/sdb and /dev/sdc) so I can happily overwrite one copy with junk.
    # btrfs-map-logical -l 1103101952 -o scratch /dev/sdb
    mirror 1 logical 1103101952 physical 1083179008 device /dev/sdc
    mirror 2 logical 1103101952 physical 1103101952 device /dev/sdb
    There we go. Now let's scribble :
    # dd if=/dev/urandom of=/dev/sdc bs=1 count=50000 seek=1083179008
    So we just wrote 50,000 bytes of random data to /dev/sdc at the offset of its copy of the file foo.
    Accessing the file still gives the right md5sum, but btrfs also has a command called scrub that can be run at any time; it goes through the filesystem you specify, checks for any nasty errors and recovers them. It does this by creating a kernel thread that works in the background, and you can use scrub status to see where it's at later.
    # btrfs scrub start /btrfs
    # btrfs scrub status /btrfs
    scrub status for 15e213ad-4e2a-44f6-85d8-86d13e94099f
    scrub started at Wed Sep 28 12:36:26 2011 and finished after 2 seconds
         total bytes scrubbed: 200.48MB with 13 errors
         error details: csum=13
         corrected errors: 13, uncorrectable errors: 0, unverified
    As you can see above, the scrubber found 13 errors. A quick peek in dmesg shows the following :
    btrfs: fixed up at 1103101952
    btrfs: fixed up at 1103106048
    btrfs: fixed up at 1103110144
    btrfs: fixed up at 1103114240
    btrfs: fixed up at 1103118336
    btrfs: fixed up at 1103122432
    btrfs: fixed up at 1103126528
    btrfs: fixed up at 1103130624
    btrfs: fixed up at 1103134720
    btrfs: fixed up at 1103138816
    # md5sum /btrfs/foo 
    76f4c03dc7a3477939467ee230696b70  /btrfs/foo
    Everything got repaired. This happens for both data and metadata. If there were a true IO error reading from one of the 2 sides, the filesystem would handle that as well. Without mirroring, the CRC would still have told you the data was bad and given you an IO error (instead of silently returning junk).
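    The repair logic can be illustrated in miniature: with a checksum recorded at write time and two copies of a block, scrub reads both, keeps the copy whose checksum matches, and rewrites the bad one. A toy sketch using cksum as a stand-in (btrfs actually uses crc32c on a per-block basis):

    ```shell
    # Two copies of the same "block", one of which got scribbled on
    good=$(mktemp); bad=$(mktemp)
    printf 'original block contents' > "$good"
    printf 'random scribbled junk!!' > "$bad"

    # The checksum recorded in metadata when the block was written
    stored=$(cksum < "$good" | awk '{print $1}')

    # Scrub-style check: pick whichever mirror still matches the stored checksum
    picked=""
    for copy in "$bad" "$good"; do
        if [ "$(cksum < "$copy" | awk '{print $1}')" = "$stored" ]; then
            picked="$copy"
            break
        fi
    done
    echo "good copy found in: $picked"
    ```

    If neither copy matched, that would be an "uncorrectable" error in the scrub status output; here the second mirror matches and would be used to rewrite the first.
    
    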

    Monday Sep 26, 2011

    btrfs root and yum update snapshots

    OK, so now it's Monday and I found a few minutes to continue my weekend project at work :)

    Today, I want to take my OL6.1 with UEK setup and convert the root ext4 partition to btrfs, then use yum update to create a snapshot before rpm installs/updates, so that if something goes wrong one can revert to the original state.
    Here's my story :

    The default OL6 install uses ext4 for the root filesystem (/). So the first step in my test is to convert the ext4 filesystem into a btrfs filesystem. The cool thing is that btrfs actually lets you do that: there's a tool called btrfs-convert which takes a volume as an argument, converts ext[2,3,4] to btrfs, and leaves the original ext[2,3,4] filesystem behind as a snapshot so you can even go back to it if you want to.

    In order to do this I did the following :

    - prepare the initrd to include btrfs : rebuild it with mkinitrd using --with-module=btrfs, so the kernel module for the btrfs filesystem is part of the initrd
    - find a boot ISO that has btrfs-convert on it (not yet on the OL6 ISOs)
    - reboot the machine in rescue mode off of the ISO image
    - run btrfs-convert on the root volume (in my case /dev/mapper/vg_wcoekaersrv3-lv_root)
    - edit /etc/fstab, changing the root entry from

    /dev/mapper/vg_wcoekaersrv3-lv_root /                       ext4    defaults        1 1

    to

    /dev/mapper/vg_wcoekaersrv3-lv_root /                       btrfs    defaults        1 1

    - reboot OL6 again
    - at reboot, OL presents a message saying that selinux has to re-label the files; this takes a few minutes and another reboot automatically follows
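    The fstab edit in the steps above is a one-field change; if you prefer doing it non-interactively from the rescue shell, a sed along these lines would do it. Shown here against a scratch copy so the sketch doesn't assume a real /etc/fstab:

    ```shell
    # Work on a scratch copy so this sketch doesn't touch a real /etc/fstab
    fstab=$(mktemp)
    cat > "$fstab" <<'EOF'
    /dev/mapper/vg_wcoekaersrv3-lv_root /                       ext4    defaults        1 1
    EOF

    # Flip the root entry's filesystem type from ext4 to btrfs
    sed -i 's|^\(/dev/mapper/vg_wcoekaersrv3-lv_root */ *\)ext4|\1btrfs|' "$fstab"
    cat "$fstab"
    ```

    Anchoring on the device name means only the root entry changes, even if other ext4 filesystems (like /boot) are listed.
    
    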

    From this point on, you have OL6 running with btrfs as root filesystem.

    # mount
    /dev/mapper/vg_wcoekaersrv3-lv_root on / type btrfs (rw)
    proc on /proc type proc (rw)
    sysfs on /sys type sysfs (rw)
    devpts on /dev/pts type devpts (rw,gid=5,mode=620)
    tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
    /dev/sda1 on /boot type ext4 (rw)
    none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
    sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)

    The original ext snapshot is still available as a subvolume :
    # btrfs subvolume list /
    ID 256 top level 5 path ext2_saved
    I don't need it any more so I am just going to throw it out :
    # btrfs subvolume delete /ext2_saved
    Delete subvolume '//ext2_saved'
    # btrfs subvolume list /
    To run optimally, it's a good idea to defragment the volume, since we inherit the old ext4 layout.
    # btrfs filesystem defragment /
    There. done.
    Next up : make sure yum-plugin-fs-snapshot is installed.
    # rpm -qa|grep yum-plugin
    If not, just run yum install yum-plugin-fs-snapshot ; it's on the OL6 media/ULN.
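    Conceptually, what the plugin does right before the rpm transaction boils down to a single snapshot command. A sketch of the equivalent manual step; the snapshot name here is made up, the real plugin picks its own:

    ```shell
    # What yum-plugin-fs-snapshot effectively does before installing packages:
    # take a snapshot of / so the pre-update state stays reachable.
    snapname="/pre_yum_$(date +%Y%m%d_%H%M%S)"   # hypothetical naming scheme
    cmd="btrfs subvolume snapshot / $snapname"
    # Printed rather than executed here, since it needs a btrfs root filesystem:
    echo "$cmd"
    ```

    If an update goes wrong, the snapshot subvolume holds the old root and you can boot from or copy back out of it.
    
    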

    So, now the big experiment: I want to do a yum update. Thanks to the installed plugin, yum will detect that the filesystem is btrfs and will automatically create a snapshot prior to installing the new rpms, then install.
    It's a long list; the interesting tidbit to look for is the 'Loaded plugins: fs-snapshot' line.
    # yum update
    Loaded plugins: fs-snapshot
    Setting up Update Process
    Resolving Dependencies
    --> Running transaction check
    ---> Package binutils.x86_64 0: will be updated
    ---> Package binutils.x86_64 0: will be an update
    ---> Package ca-certificates.noarch 0:2010.63-3.el6 will be updated
    ---> Package ca-certificates.noarch 0:2010.63-3.el6_1.5 will be an update
    ---> Package certmonger.x86_64 0:0.42-1.el6 will be updated
    ---> Package certmonger.x86_64 0:0.42-1.el6_1.2 will be an update
    ---> Package cifs-utils.x86_64 0:4.8.1-2.el6 will be updated
    ---> Package cifs-utils.x86_64 0:4.8.1-2.el6_1.2 will be an update
    ---> Package cups.x86_64 1:1.4.2-39.el6 will be updated
    ---> Package cups.x86_64 1:1.4.2-39.el6_1.1 will be an update
    ---> Package cups-libs.x86_64 1:1.4.2-39.el6 will be updated
    ---> Package cups-libs.x86_64 1:1.4.2-39.el6_1.1 will be an update
    ---> Package ipa-client.x86_64 0:2.0.0-23.el6_1.1 will be updated
    ---> Package ipa-client.x86_64 0:2.0.0-23.el6_1.2 will be an update
    ---> Package ipa-python.x86_64 0:2.0.0-23.el6_1.1 will be updated
    ---> Package ipa-python.x86_64 0:2.0.0-23.el6_1.2 will be an update
    ---> Package kernel-uek-devel.x86_64 0:2.6.39-100.0.5.el6uek will be installed
    ---> Package kernel-uek-headers.x86_64 0:2.6.32-100.34.1.el6uek will be updated
    ---> Package kernel-uek-headers.x86_64 0:2.6.32-200.16.1.el6uek will be updated
    ---> Package kernel-uek-headers.x86_64 0:2.6.39-100.0.5.el6uek will be an update
    ---> Package kpartx.x86_64 0:0.4.9-41.0.1.el6 will be updated
    ---> Package kpartx.x86_64 0:0.4.9-41.0.1.el6_1.1 will be an update
    ---> Package nss.x86_64 0:3.12.9-9.0.1.el6 will be updated
    ---> Package nss.x86_64 0:3.12.9-12.0.1.el6_1 will be an update
    ---> Package nss-sysinit.x86_64 0:3.12.9-9.0.1.el6 will be updated
    ---> Package nss-sysinit.x86_64 0:3.12.9-12.0.1.el6_1 will be an update
    ---> Package nss-tools.x86_64 0:3.12.9-9.0.1.el6 will be updated
    ---> Package nss-tools.x86_64 0:3.12.9-12.0.1.el6_1 will be an update
    ---> Package perf.x86_64 0:2.6.32-131.6.1.el6 will be updated
    ---> Package perf.x86_64 0:2.6.32-131.12.1.el6 will be an update
    ---> Package phonon-backend-gstreamer.x86_64 1:4.6.2-17.el6 will be updated
    ---> Package phonon-backend-gstreamer.x86_64 1:4.6.2-17.el6_1.1 will be an update
    ---> Package portreserve.x86_64 0:0.0.4-4.el6 will be updated
    ---> Package portreserve.x86_64 0:0.0.4-4.el6_1.1 will be an update
    ---> Package qt.x86_64 1:4.6.2-17.el6 will be updated
    ---> Package qt.x86_64 1:4.6.2-17.el6_1.1 will be an update
    ---> Package qt-sqlite.x86_64 1:4.6.2-17.el6 will be updated
    ---> Package qt-sqlite.x86_64 1:4.6.2-17.el6_1.1 will be an update
    ---> Package qt-x11.x86_64 1:4.6.2-17.el6 will be updated
    ---> Package qt-x11.x86_64 1:4.6.2-17.el6_1.1 will be an update
    ---> Package rsyslog.x86_64 0:4.6.2-3.el6_1.1 will be updated
    ---> Package rsyslog.x86_64 0:4.6.2-3.el6_1.2 will be an update
    ---> Package samba-client.x86_64 0:3.5.6-86.el6 will be updated
    ---> Package samba-client.x86_64 0:3.5.6-86.el6_1.4 will be an update
    ---> Package samba-common.x86_64 0:3.5.6-86.el6 will be updated
    ---> Package samba-common.x86_64 0:3.5.6-86.el6_1.4 will be an update
    ---> Package samba-winbind-clients.x86_64 0:3.5.6-86.el6 will be updated
    ---> Package samba-winbind-clients.x86_64 0:3.5.6-86.el6_1.4 will be an update
    ---> Package selinux-policy.noarch 0:3.7.19-93.0.1.el6_1.2 will be updated
    ---> Package selinux-policy.noarch 0:3.7.19-93.0.1.el6_1.7 will be an update
    ---> Package selinux-policy-targeted.noarch 0:3.7.19-93.0.1.el6_1.2 will be updated
    ---> Package selinux-policy-targeted.noarch 0:3.7.19-93.0.1.el6_1.7 will be an update
    ---> Package tzdata.noarch 0:2011h-2.el6 will be updated
    ---> Package tzdata.noarch 0:2011h-3.el6 will be an update
    ---> Package tzdata-java.noarch 0:2011h-2.el6 will be updated
    ---> Package tzdata-java.noarch 0:2011h-3.el6 will be an update
    ---> Package xmlrpc-c.x86_64 0:1.16.24-1200.1840.el6 will be updated
    ---> Package xmlrpc-c.x86_64 0:1.16.24-1200.1840.el6_1.4 will be an update
    ---> Package xmlrpc-c-client.x86_64 0:1.16.24-1200.1840.el6 will be updated
    ---> Package xmlrpc-c-client.x86_64 0:1.16.24-1200.1840.el6_1.4 will be an update
    --> Finished Dependency Resolution
    --> Running transaction check
    ---> Package kernel-uek-devel.x86_64 0:2.6.32-100.28.9.el6 will be erased
    --> Finished Dependency Resolution
    Dependencies Resolved
     Package                  Arch   Version                   Repository                       Size
     kernel-uek-devel         x86_64 2.6.39-100.0.5.el6uek     kernel-uek-2.6.39-100.0.5-alpha 7.3 M
     binutils                 x86_64  ol6_latest                      2.8 M
     ca-certificates          noarch 2010.63-3.el6_1.5         ol6_latest                      531 k
     certmonger               x86_64 0.42-1.el6_1.2            ol6_latest                      193 k
     cifs-utils               x86_64 4.8.1-2.el6_1.2           ol6_latest                       41 k
     cups                     x86_64 1:1.4.2-39.el6_1.1        ol6_latest                      2.3 M
     cups-libs                x86_64 1:1.4.2-39.el6_1.1        ol6_latest                      314 k
     ipa-client               x86_64 2.0.0-23.el6_1.2          ol6_latest                       88 k
     ipa-python               x86_64 2.0.0-23.el6_1.2          ol6_latest                      491 k
     kernel-uek-headers       x86_64 2.6.39-100.0.5.el6uek     kernel-uek-2.6.39-100.0.5-alpha 716 k
     kpartx                   x86_64 0.4.9-41.0.1.el6_1.1      ol6_latest                       41 k
     nss                      x86_64 3.12.9-12.0.1.el6_1       ol6_latest                      772 k
     nss-sysinit              x86_64 3.12.9-12.0.1.el6_1       ol6_latest                       28 k
     nss-tools                x86_64 3.12.9-12.0.1.el6_1       ol6_latest                      749 k
     perf                     x86_64 2.6.32-131.12.1.el6       ol6_latest                      998 k
     phonon-backend-gstreamer x86_64 1:4.6.2-17.el6_1.1        ol6_latest                      125 k
     portreserve              x86_64 0.0.4-4.el6_1.1           ol6_latest                       22 k
     qt                       x86_64 1:4.6.2-17.el6_1.1        ol6_latest                      4.0 M
     qt-sqlite                x86_64 1:4.6.2-17.el6_1.1        ol6_latest                       50 k
     qt-x11                   x86_64 1:4.6.2-17.el6_1.1        ol6_latest                       12 M
     rsyslog                  x86_64 4.6.2-3.el6_1.2           ol6_latest                      450 k
     samba-client             x86_64 3.5.6-86.el6_1.4          ol6_latest                       11 M
     samba-common             x86_64 3.5.6-86.el6_1.4          ol6_latest                       13 M
     samba-winbind-clients    x86_64 3.5.6-86.el6_1.4          ol6_latest                      1.1 M
     selinux-policy           noarch 3.7.19-93.0.1.el6_1.7     ol6_latest                      741 k
     selinux-policy-targeted  noarch 3.7.19-93.0.1.el6_1.7     ol6_latest                      2.4 M
     tzdata                   noarch 2011h-3.el6               ol6_latest                      438 k
     tzdata-java              noarch 2011h-3.el6               ol6_latest                      150 k
     xmlrpc-c                 x86_64 1.16.24-1200.1840.el6_1.4 ol6_latest                      103 k
     xmlrpc-c-client          x86_64 1.16.24-1200.1840.el6_1.4 ol6_latest                       25 k
    Removing:
     kernel-uek-devel         x86_64 2.6.32-100.28.9.el6       installed                        22 M
    Transaction Summary
    Install       1 Package(s)
    Upgrade      29 Package(s)
    Remove        1 Package(s)
    Total size: 63 M
    Is this ok [y/N]: y
    Downloading Packages:
    Running rpm_check_debug
    Running Transaction Test
    Transaction Test Succeeded
    Running Transaction
    fs-snapshot: snapshotting /: /yum_20110926132957
      Updating   : nss-sysinit-3.12.9-12.0.1.el6_1.x86_64                                       1/61 
      Updating   : nss-3.12.9-12.0.1.el6_1.x86_64                                               2/61 
      Updating   : xmlrpc-c-1.16.24-1200.1840.el6_1.4.x86_64                                    3/61 
      Updating   : xmlrpc-c-client-1.16.24-1200.1840.el6_1.4.x86_64                             4/61 
      Updating   : samba-winbind-clients-3.5.6-86.el6_1.4.x86_64                                5/61 
      Updating   : samba-common-3.5.6-86.el6_1.4.x86_64                                         6/61 
      Updating   : certmonger-0.42-1.el6_1.2.x86_64                                             7/61 
      Updating   : nss-tools-3.12.9-12.0.1.el6_1.x86_64                                         8/61 
      Updating   : ca-certificates-2010.63-3.el6_1.5.noarch                                     9/61 
      Updating   : 1:qt-4.6.2-17.el6_1.1.x86_64                                                10/61 
      Updating   : 1:qt-sqlite-4.6.2-17.el6_1.1.x86_64                                         11/61 
      Updating   : 1:qt-x11-4.6.2-17.el6_1.1.x86_64                                            12/61 
      Updating   : 1:phonon-backend-gstreamer-4.6.2-17.el6_1.1.x86_64                          13/61 
      Updating   : portreserve-0.0.4-4.el6_1.1.x86_64                                          14/61 
      Updating   : ipa-python-2.0.0-23.el6_1.2.x86_64                                          15/61 
      Updating   : 1:cups-libs-1.4.2-39.el6_1.1.x86_64                                         16/61 
      Updating   : selinux-policy-3.7.19-93.0.1.el6_1.7.noarch                                 17/61 
      Updating   : selinux-policy-targeted-3.7.19-93.0.1.el6_1.7.noarch                        18/61 
      Updating   : 1:cups-1.4.2-39.el6_1.1.x86_64                                              19/61 
      Updating   : ipa-client-2.0.0-23.el6_1.2.x86_64                                          20/61 
      Updating   : samba-client-3.5.6-86.el6_1.4.x86_64                                        21/61 
      Updating   : tzdata-2011h-3.el6.noarch                                                   22/61 
      Updating   : cifs-utils-4.8.1-2.el6_1.2.x86_64                                           23/61 
      Updating   : rsyslog-4.6.2-3.el6_1.2.x86_64                                              24/61 
      Installing : kernel-uek-devel-2.6.39-100.0.5.el6uek.x86_64                               25/61 
      Updating   : kernel-uek-headers-2.6.39-100.0.5.el6uek.x86_64                             26/61 
      Updating   : binutils-                                    27/61 
      Updating   : tzdata-java-2011h-3.el6.noarch                                              28/61 
      Updating   : perf-2.6.32-131.12.1.el6.x86_64                                             29/61 
      Updating   : kpartx-0.4.9-41.0.1.el6_1.1.x86_64                                          30/61 
      Cleanup    : selinux-policy-targeted-3.7.19-93.0.1.el6_1.2.noarch                        31/61 
      Cleanup    : selinux-policy-3.7.19-93.0.1.el6_1.2.noarch                                 32/61 
      Cleanup    : tzdata-2011h-2.el6.noarch                                                   33/61 
      Cleanup    : kernel-uek-headers.x86_64                                                   34/61 
      Cleanup    : kernel-uek-headers.x86_64                                                   35/61 
      Cleanup    : tzdata-java-2011h-2.el6.noarch                                              36/61 
      Cleanup    : perf-2.6.32-131.6.1.el6.x86_64                                              37/61 
      Cleanup    : kernel-uek-devel-2.6.32-100.28.9.el6.x86_64                                 38/61 
      Cleanup    : ipa-client-2.0.0-23.el6_1.1.x86_64                                          39/61 
      Cleanup    : certmonger-0.42-1.el6.x86_64                                                40/61 
      Cleanup    : 1:qt-x11-4.6.2-17.el6.x86_64                                                41/61 
      Cleanup    : 1:phonon-backend-gstreamer-4.6.2-17.el6.x86_64                              42/61 
      Cleanup    : samba-client-3.5.6-86.el6.x86_64                                            43/61 
      Cleanup    : 1:cups-1.4.2-39.el6.x86_64                                                  44/61 
      Cleanup    : samba-common-3.5.6-86.el6.x86_64                                            45/61 
      Cleanup    : 1:qt-sqlite-4.6.2-17.el6.x86_64                                             46/61 
      Cleanup    : 1:qt-4.6.2-17.el6.x86_64                                                    47/61 
      Cleanup    : xmlrpc-c-client-1.16.24-1200.1840.el6.x86_64                                48/61 
      Cleanup    : nss-tools-3.12.9-9.0.1.el6.x86_64                                           49/61 
      Cleanup    : ca-certificates-2010.63-3.el6.noarch                                        50/61 
      Cleanup    : nss-sysinit-3.12.9-9.0.1.el6.x86_64                                         51/61 
      Cleanup    : nss-3.12.9-9.0.1.el6.x86_64                                                 52/61 
      Cleanup    : xmlrpc-c-1.16.24-1200.1840.el6.x86_64                                       53/61 
      Cleanup    : samba-winbind-clients-3.5.6-86.el6.x86_64                                   54/61 
      Cleanup    : 1:cups-libs-1.4.2-39.el6.x86_64                                             55/61 
      Cleanup    : portreserve-0.0.4-4.el6.x86_64                                              56/61 
      Cleanup    : ipa-python-2.0.0-23.el6_1.1.x86_64                                          57/61 
      Cleanup    : cifs-utils-4.8.1-2.el6.x86_64                                               58/61 
      Cleanup    : rsyslog-4.6.2-3.el6_1.1.x86_64                                              59/61 
      Cleanup    : binutils-                                        60/61 
      Cleanup    : kpartx-0.4.9-41.0.1.el6.x86_64                                              61/61 
    Removed:
      kernel-uek-devel.x86_64 0:2.6.32-100.28.9.el6
    Installed:
      kernel-uek-devel.x86_64 0:2.6.39-100.0.5.el6uek
    Updated:
      binutils.x86_64 0:
      ca-certificates.noarch 0:2010.63-3.el6_1.5                                                     
      certmonger.x86_64 0:0.42-1.el6_1.2                                                             
      cifs-utils.x86_64 0:4.8.1-2.el6_1.2                                                            
      cups.x86_64 1:1.4.2-39.el6_1.1                                                                 
      cups-libs.x86_64 1:1.4.2-39.el6_1.1                                                            
      ipa-client.x86_64 0:2.0.0-23.el6_1.2                                                           
      ipa-python.x86_64 0:2.0.0-23.el6_1.2                                                           
      kernel-uek-headers.x86_64 0:2.6.39-100.0.5.el6uek                                              
      kpartx.x86_64 0:0.4.9-41.0.1.el6_1.1                                                           
      nss.x86_64 0:3.12.9-12.0.1.el6_1                                                               
      nss-sysinit.x86_64 0:3.12.9-12.0.1.el6_1                                                       
      nss-tools.x86_64 0:3.12.9-12.0.1.el6_1                                                         
      perf.x86_64 0:2.6.32-131.12.1.el6                                                              
      phonon-backend-gstreamer.x86_64 1:4.6.2-17.el6_1.1                                             
      portreserve.x86_64 0:0.0.4-4.el6_1.1                                                           
      qt.x86_64 1:4.6.2-17.el6_1.1                                                                   
      qt-sqlite.x86_64 1:4.6.2-17.el6_1.1                                                            
      qt-x11.x86_64 1:4.6.2-17.el6_1.1                                                               
      rsyslog.x86_64 0:4.6.2-3.el6_1.2                                                               
      samba-client.x86_64 0:3.5.6-86.el6_1.4                                                         
      samba-common.x86_64 0:3.5.6-86.el6_1.4                                                         
      samba-winbind-clients.x86_64 0:3.5.6-86.el6_1.4                                                
      selinux-policy.noarch 0:3.7.19-93.0.1.el6_1.7                                                  
      selinux-policy-targeted.noarch 0:3.7.19-93.0.1.el6_1.7                                         
      tzdata.noarch 0:2011h-3.el6                                                                    
      tzdata-java.noarch 0:2011h-3.el6                                                               
      xmlrpc-c.x86_64 0:1.16.24-1200.1840.el6_1.4                                                    
      xmlrpc-c-client.x86_64 0:1.16.24-1200.1840.el6_1.4                                             

    Well, wasn't that easy! You can see the snapshot here :

    # btrfs subvolume list /
    ID 256 top level 5 path yum_20110926132957
    So if something went wrong during the rpm update, or you want to revert to the prior copy of the OS/filesystem, you can boot back into the snapshot by using subvolid=256 as the filesystem mount option for / in fstab.
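    As a minimal sketch, the /etc/fstab entry for the root filesystem could look like the following (the device path here is hypothetical; substitute the actual device or UUID of your btrfs root):

    ```
    # /etc/fstab -- hypothetical entry; /dev/sda2 stands in for the real btrfs root device
    /dev/sda2    /    btrfs    subvolid=256    0 0
    ```

    On the next boot, / is mounted from the snapshot subvolume instead of the default subvolume.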

    If you want to just default to the snapshot, you can run btrfs subvolume set-default 256 / and you will be running from the old snapshot state going forward.

    Friday Sep 23, 2011

    Getting started with Oracle Linux and ksplice updates

    Since early September, Oracle Linux customers with a Premier or Premier Limited support subscription, or Oracle customers with Oracle Premier Support for Systems and Operating Systems, have access to the newly added Oracle Ksplice technology.

    Oracle Ksplice updates are kernel updates that can be applied on a running system. Note that we are not just talking about being able to install a new update of a package while the system is running and then having it take effect after a reboot or restart. The Ksplice patches are applied to the running Linux kernel and are effective immediately. On a subsequent reboot these patches are re-applied at bootup, or, if a customer has installed a newer version of the kernel rpm itself, then of course the new kernel is simply loaded instead.

    This technology allows a system administrator to apply the latest kernel security errata (CVEs) while their applications, a database, a webserver, and so on, keep running. It does not halt the system and it does not restart applications; the updates are applied in the background with a totally negligible impact (a pause on the order of milliseconds). In an environment where you have multi-tiered applications running across multiple servers, this becomes even more cost effective: very often, if a backend server needs to be brought down for patching, the app/sys admins have to schedule downtime for the entire stack, apply the patches, then bring all the services back up, even if only one of the servers in the hierarchy needs changes. The goals here are to 1) make patching trivial on the kernel side, 2) reduce the impact on sysadmin time, since the patches can be applied automatically on a schedule, 3) increase the security of the systems, because security vulnerabilities can be patched much sooner, and 4) as a result of all the others, drastically reduce TCO.
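    Since the patches can be applied automatically on a schedule, one way to automate this is a cron entry along these lines (a sketch, assuming the Ksplice Uptrack tools are installed; the log path is hypothetical):

    ```
    # /etc/cron.d/uptrack-upgrade -- hypothetical schedule: apply available
    # Ksplice updates every night at 3am and log the result
    0 3 * * * root /usr/sbin/uptrack-upgrade -y >> /var/log/uptrack-cron.log 2>&1
    ```

    With something like this in place, kernel CVE fixes land on the running system unattended, and uptrack-show can be used at any time to list the patches currently in effect.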

    We have published a white paper on how to get started with Ksplice updates here. More documentation is coming. This is very exciting stuff. Also, as before, Ubuntu and Fedora users can continue to make free use of the service: go to the Ksplice website and click on Try it Now.


    Wim Coekaerts is the Senior Vice President of Linux and Virtualization Engineering for Oracle. He is responsible for Oracle's complete desktop to data center virtualization product line and the Oracle Linux support program.

    You can follow him on Twitter at @wimcoekaerts

