Thursday Mar 29, 2012

4.8M wasn't enough so we went for 5.055M tpmc with Unbreakable Enterprise Kernel r2 :-)

We released a new set of benchmarks today. One is an updated tpc-c from a few months ago where we had just over 4.8M tpmc at $0.98 and we just updated it to go to 5.05M and $0.89. The other one is related to Java Middleware performance. You can find the press release here.

Now, I don't want to talk about the actual relevance of the benchmark numbers, as I am not in the benchmark team. I want to talk about why these numbers and these efforts, unrelated to what they mean to your workload, matter to customers. The actual benchmark effort is a very big, long, expensive undertaking where many groups work together as a big virtual team. Having the virtual team be within a single company of course helps tremendously... We already start with a very big server setup with tons of storage, many disks, lots of ram, lots of cpu's, cores, threads, large database setups. Getting the whole setup going to start tuning, by itself, is no easy task, but then the real fun starts with tuning the system for optimal performance -and- stability. A benchmark is not just revving an engine at high rpm, it's actually hitting the circuit. The tests require long runs, require surviving availability tests, such as surviving crashes -and- recovery under load.

In the TPC-C example, the x4800 system had 4TB ram, 160 threads (8 sockets, hyperthreaded, 10 cores/socket), tons of storage attached, tons of luns visible to the OS. flash storage, non flash storage... many things at high scale that all have to be perfectly synchronized.

During this process, we find bugs, we fix bugs, we find performance issues, we fix performance issues, we find interesting potential features to investigate for the future, we start new development projects for future releases and all this goes back into the products. As more and more customers, for Oracle Linux, are running larger and larger, faster and faster, more mission critical, higher available databases..., these things are just absolutely critical. Unrelated to what anyone's specific opinion is about tpc-c or tpc-h or specjenterprise etc, there is a ton of effort that the customer benefits from. All this work makes Oracle Linux and/or Oracle Solaris better platforms. Whether it's faster, more stable, more scalable, more resilient. It helps.

Another point that I always like to re-iterate around UEK and UEK2 : we have our kernel source git repository online. Complete changelog of the mainline kernel, and our changes, easy to pull, easy to dissect, easy to know what went in when, why and where. No need to go log into a website and manually click through pages to hopefully discover changes or patches. No need to untar 2 tar balls and run a diff.

Sunday Oct 16, 2011

Containers on Linux

At Oracle OpenWorld we talked about Linux Containers. Here is an example of getting a Linux container going with Oracle Linux 6.1, UEK2 beta and btrfs. This is just an example, not released, production, bug-free... for those that don't read README files ;-)

This container example is using the existing Linux cgroups features in the mainline kernel (and also in UEK, UEK2) and lxc tools to create the environments.

Example assumptions :
- Host OS is Oracle Linux 6.1 with UEK2 beta.
- using btrfs filesystem for containers (to make use of snapshot capabilities)
- mounting the fs in /container
- use Oracle VM templates as a base environment
- Oracle Linux 5 containers

I have a second disk on my test machine (/dev/sdb) which I will use for this exercise.

# mkfs.btrfs  -L container  /dev/sdb

# mount
/dev/mapper/vg_wcoekaersrv4-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda1 on /boot type ext4 (rw)
/dev/mapper/vg_wcoekaersrv4-lv_home on /home type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/mapper/loop0p2 on /mnt type ext3 (rw)
/dev/mapper/loop1p2 on /mnt2 type ext3 (rw)
/dev/sdb on /container type btrfs (rw)

lxc tools installed...

# rpm -qa|grep lxc
lxc-libs-0.7.5-2.x86_64
lxc-0.7.5-2.x86_64
lxc-devel-0.7.5-2.x86_64

lxc tools come with template config files :

# ls /usr/lib64/lxc/templates/
lxc-altlinux lxc-busybox lxc-debian lxc-fedora lxc-lenny lxc-ol4 lxc-ol5 lxc-opensuse lxc-sshd lxc-ubuntu
I created one for Oracle Linux 5 : lxc-ol5.

Download Oracle VM template for OL5 from http://edelivery.oracle.com/linux. I used OVM_EL5U5_X86_PVM_10GB.
We want to be able to create 1 environment that can be used in both container and VM mode to avoid duplicate effort.

Untar the VM template.

# tar zxvf OVM_EL5U5_X86_PVM_10GB.tar.gz
These are the steps needed (to be automated in the future)...
Copy the content of the VM virtual disk's root filesystem into a btrfs subvolume in order to easily clone the base template.

My template configure script defines :
template_path=/container/ol5-template

- create subvolume ol5-template on /containers

# btrfs subvolume create /container/ol5-template
Create subvolume '/container/ol5-template'
- loopback mount the Oracle VM template System image / partition
# kpartx -a System.img 
# kpartx -l System.img 
loop0p1 : 0 192717 /dev/loop0 63
loop0p2 : 0 21607425 /dev/loop0 192780
loop0p3 : 0 4209030 /dev/loop0 21800205
I need to mount the 2nd partition of the virtual disk image, kpartx will set up loopback devices for each of the virtual disk partitions. So let's mount loop0p2 which will contain the Oracle Linux 5 / filesystem of the template.
# mount /dev/mapper/loop0p2 /mnt

# ls /mnt
bin  boot  dev  etc  home  lib  lost+found  media  misc  mnt  opt  proc  
root  sbin  selinux  srv  sys  tftpboot  tmp  u01  usr  var
Great, now we have the entire template / filesystem available. Let's copy this into our subvolume. This subvolume will then become the basis for all OL5 containers.
# cd /mnt
# tar cvf - * | ( cd /container/ol5-template ; tar xvf ; )
In the near future we will put some automation around the above steps.
# pwd
/container/ol5-template

# ls
bin  boot  dev  etc  home  lib  lost+found  media  misc  mnt  opt  proc  
root  sbin  selinux  srv  sys  tftpboot  tmp  u01  usr  var
From this point on, the lxc-create script, using the template config as an argument, should be able to automatically create a snapshot and set up the filesystem correctly.
# lxc-create -n ol5test1 -t ol5

Cloning base template /container/ol5-template to /container/ol5test1 ...
Create a snapshot of '/container/ol5-template' in '/container/ol5test1'
Container created : /container/ol5test1 ...
Container template source : /container/ol5-template
Container config : /etc/lxc/ol5test1
Network : eth0 (veth) on virbr0
'ol5' template installed
'ol5test1' created

# ls /etc/lxc/ol5test1/
config  fstab

# ls /container/ol5test1/
bin  boot  dev  etc  home  lib  lost+found  media  misc  mnt  opt  proc  
root  sbin  selinux  srv  sys  tftpboot  tmp  u01  usr  var
Now that it's created and configured, we should be able to just simply start it :
# lxc-start -n ol5test1
INIT: version 2.86 booting
                Welcome to Enterprise Linux Server
                Press 'I' to enter interactive startup.
Setting clock  (utc): Sun Oct 16 06:08:27 EDT 2011         [  OK  ]
Loading default keymap (us):                               [  OK  ]
Setting hostname ol5test1:                                 [  OK  ]
raidautorun: unable to autocreate /dev/md0
Checking filesystems
                                                           [  OK  ]
mount: can't find / in /etc/fstab or /etc/mtab
Mounting local filesystems:                                [  OK  ]
Enabling local filesystem quotas:                          [  OK  ]
Enabling /etc/fstab swaps:                                 [  OK  ]
INIT: Entering runlevel: 3
Entering non-interactive startup
Starting sysstat:  Calling the system activity data collector (sadc): 
                                                           [  OK  ]
Starting background readahead:                             [  OK  ]
Flushing firewall rules:                                   [  OK  ]
Setting chains to policy ACCEPT: nat mangle filter         [  OK  ]
Applying iptables firewall rules:                          [  OK  ]
Loading additional iptables modules: no                    [FAILED]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface eth0:  
Determining IP information for eth0... done.
                                                           [  OK  ]
Starting system logger:                                    [  OK  ]
Starting kernel logger:                                    [  OK  ]
Enabling ondemand cpu frequency scaling:                   [  OK  ]
Starting irqbalance:                                       [  OK  ]
Starting portmap:                                          [  OK  ]
FATAL: Could not load /lib/modules/2.6.39-100.0.12.el6uek.x86_64/modules.dep: No such file or directory
Starting NFS statd:                                        [  OK  ]
Starting RPC idmapd: Error: RPC MTAB does not exist.
Starting system message bus:                               [  OK  ]
Starting o2cb:                                             [  OK  ]
Can't open RFCOMM control socket: Address family not supported by protocol

Mounting other filesystems:                                [  OK  ]
Starting PC/SC smart card daemon (pcscd):                  [  OK  ]
Starting HAL daemon:                                       [FAILED]
Starting hpiod:                                            [  OK  ]
Starting hpssd:                                            [  OK  ]
Starting sshd:                                             [  OK  ]
Starting cups:                                             [  OK  ]
Starting xinetd:                                           [  OK  ]
Starting crond:                                            [  OK  ]
Starting xfs:                                              [  OK  ]
Starting anacron:                                          [  OK  ]
Starting atd:                                              [  OK  ]
Starting yum-updatesd:                                     [  OK  ]
Starting Avahi daemon...                                   [FAILED]
Starting oraclevm-template...
Regenerating SSH host keys.
Stopping sshd:                                             [  OK  ]
Generating SSH1 RSA host key:                              [  OK  ]
Generating SSH2 RSA host key:                              [  OK  ]
Generating SSH2 DSA host key:                              [  OK  ]
Starting sshd:                                             [  OK  ]
Regenerating up2date uuid.
Setting Oracle validated configuration parameters.

Configuring network interface.
  Network device: eth0
  Hardware address: 52:19:C0:EF:78:C4

Do you want to enable dynamic IP configuration (DHCP) (Y|n)? 

... 
This will run the well-known Oracle VM template configure scripts and set up the container the same way as it would an Oracle VM guest.

The session that runs lxc-start is the local console. It is best to run this session inside screen so you can disconnect and reconnect.

At this point,I can use lxc-console to log into the local console of the container, or, since the container has its internal network up and running and sshd is running, I can also just ssh into the guest.
# lxc-console -n ol5test1 -t 1

Enterprise Linux Enterprise Linux Server release 5.5 (Carthage)
Kernel 2.6.39-100.0.12.el6uek.x86_64 on an x86_64

host login: 
I can simple get out of the console entering ctrl-a q.

From inside the container :
# mount
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)

# /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 52:19:C0:EF:78:C4  
          inet addr:192.168.122.225  Bcast:192.168.122.255  Mask:255.255.255.0
          inet6 addr: fe80::5019:c0ff:feef:78c4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:141 errors:0 dropped:0 overruns:0 frame:0
          TX packets:19 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:8861 (8.6 KiB)  TX bytes:2476 (2.4 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:560 (560.0 b)  TX bytes:560 (560.0 b)

# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   2124   656 ?        Ss   06:08   0:00 init [3]  
root       397  0.0  0.0   1780   596 ?        Ss   06:08   0:00 syslogd -m 0
root       400  0.0  0.0   1732   376 ?        Ss   06:08   0:00 klogd -x
root       434  0.0  0.0   2524   368 ?        Ss   06:08   0:00 irqbalance
rpc        445  0.0  0.0   1868   516 ?        Ss   06:08   0:00 portmap
root       469  0.0  0.0   1920   740 ?        Ss   06:08   0:00 rpc.statd
dbus       509  0.0  0.0   2800   576 ?        Ss   06:08   0:00 dbus-daemon --system
root       578  0.0  0.0  10868  1248 ?        Ssl  06:08   0:00 pcscd
root       610  0.0  0.0   5196   712 ?        Ss   06:08   0:00 ./hpiod
root       615  0.0  0.0  13520  4748 ?        S    06:08   0:00 python ./hpssd.py
root       637  0.0  0.0  10168  2272 ?        Ss   06:08   0:00 cupsd
root       651  0.0  0.0   2780   812 ?        Ss   06:08   0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
root       660  0.0  0.0   5296  1096 ?        Ss   06:08   0:00 crond
root       745  0.0  0.0   1728   580 ?        SNs  06:08   0:00 anacron -s
root       753  0.0  0.0   2320   340 ?        Ss   06:08   0:00 /usr/sbin/atd
root       817  0.0  0.0  25580 10136 ?        SN   06:08   0:00 /usr/bin/python -tt /usr/sbin/yum-updatesd
root       819  0.0  0.0   2616  1072 ?        SN   06:08   0:00 /usr/libexec/gam_server
root       830  0.0  0.0   7116  1036 ?        Ss   06:08   0:00 /usr/sbin/sshd
root      2998  0.0  0.0   2368   424 ?        Ss   06:08   0:00 /sbin/dhclient -1 -q -lf /var/lib/dhclient/dhclient-eth0.leases -pf /var/run/dhc
root      3102  0.0  0.0   5008  1376 ?        Ss   06:09   0:00 login -- root     
root      3103  0.0  0.0   1716   444 tty2     Ss+  06:09   0:00 /sbin/mingetty tty2
root      3104  0.0  0.0   1716   448 tty3     Ss+  06:09   0:00 /sbin/mingetty tty3
root      3105  0.0  0.0   1716   448 tty4     Ss+  06:09   0:00 /sbin/mingetty tty4
root      3138  0.0  0.0   4584  1436 tty1     Ss   06:11   0:00 -bash
root      3167  0.0  0.0   4308   936 tty1     R+   06:12   0:00 ps aux
From the host :
# lxc-info -n ol5test1
state:   RUNNING
pid:     16539

# lxc-kill -n ol5test1

# lxc-monitor -n ol5test1
'ol5test1' changed state to [STOPPING]
'ol5test1' changed state to [STOPPED]
So creating more containers is trivial. Just keep running lxc-create.
# lxc-create -n ol5test2 -t ol5

# btrfs subvolume list /container
ID 297 top level 5 path ol5-template
ID 299 top level 5 path ol5test1
ID 300 top level 5 path ol5test2
lxc-tools will be uploaded to the uek2 beta channel to start playing with this.

Oracle Linux 4 example

Here is the same principle for Oracle Linux 4. Using the template create script lxc-ol4. I started out using the OVM_EL4U7_X86_PVM_4GB template and followed the same steps.

# kpartx -a System.img 

# kpartx -l System.img 
loop0p1 : 0 64197 /dev/loop0 63
loop0p2 : 0 8530515 /dev/loop0 64260
loop0p3 : 0 4176900 /dev/loop0 8594775

# mount /dev/mapper/loop0p2 /mnt

# cd /mnt

# btrfs subvolume create /container/ol4-template
Create subvolume '/container/ol4-template'

# tar cvf - * | ( cd /container/ol4-template ; tar xvf - ; )

# lxc-create -n ol4test1 -t ol4

Cloning base template /container/ol4-template to /container/ol4test1 ...
Create a snapshot of '/container/ol4-template' in '/container/ol4test1'
Container created : /container/ol4test1 ...
Container template source : /container/ol4-template
Container config : /etc/lxc/ol4test1
Network : eth0 (veth) on virbr0
'ol4' template installed
'ol4test1' created

# lxc-start -n ol4test1
INIT: version 2.85 booting
/etc/rc.d/rc.sysinit: line 80: /dev/tty5: Operation not permitted
/etc/rc.d/rc.sysinit: line 80: /dev/tty6: Operation not permitted
Setting default font (latarcyrheb-sun16):                  [  OK  ]

                Welcome to Enterprise Linux
                Press 'I' to enter interactive startup.
Setting clock  (utc): Sun Oct 16 09:34:56 EDT 2011         [  OK  ]
Initializing hardware...  storage network audio done       [  OK  ]
raidautorun: unable to autocreate /dev/md0
Configuring kernel parameters:  error: permission denied on key 'net.core.rmem_default'
error: permission denied on key 'net.core.rmem_max'
error: permission denied on key 'net.core.wmem_default'
error: permission denied on key 'net.core.wmem_max'
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.core_uses_pid = 1
fs.file-max = 327679
kernel.msgmni = 2878
kernel.msgmax = 8192
kernel.msgmnb = 65536
kernel.sem = 250 32000 100 142
kernel.shmmni = 4096
kernel.shmall = 1073741824
kernel.sysrq = 1
fs.aio-max-nr = 3145728
net.ipv4.ip_local_port_range = 1024 65000
kernel.shmmax = 4398046511104
                                                           [FAILED]
Loading default keymap (us):                               [  OK  ]
Setting hostname ol4test1:                                 [  OK  ]
Remounting root filesystem in read-write mode:             [  OK  ]
mount: can't find / in /etc/fstab or /etc/mtab
Mounting local filesystems:                                [  OK  ]
Enabling local filesystem quotas:                          [  OK  ]
Enabling swap space:                                       [  OK  ]
INIT: Entering runlevel: 3
Entering non-interactive startup
Starting sysstat:                                          [  OK  ]
Setting network parameters:  error: permission denied on key 'net.core.rmem_default'
error: permission denied on key 'net.core.rmem_max'
error: permission denied on key 'net.core.wmem_default'
error: permission denied on key 'net.core.wmem_max'
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.core_uses_pid = 1
fs.file-max = 327679
kernel.msgmni = 2878
kernel.msgmax = 8192
kernel.msgmnb = 65536
kernel.sem = 250 32000 100 142
kernel.shmmni = 4096
kernel.shmall = 1073741824
kernel.sysrq = 1
fs.aio-max-nr = 3145728
net.ipv4.ip_local_port_range = 1024 65000
kernel.shmmax = 4398046511104
                                                           [FAILED]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface eth0:                                [  OK  ]
Starting system logger:                                    [  OK  ]
Starting kernel logger:                                    [  OK  ]
Starting portmap:                                          [  OK  ]
Starting NFS statd:                                        [FAILED]
Starting RPC idmapd: Error: RPC MTAB does not exist.
Mounting other filesystems:                                [  OK  ]
Starting lm_sensors:                                       [  OK  ]
Starting cups:                                             [  OK  ]
Generating SSH1 RSA host key:                              [  OK  ]
Generating SSH2 RSA host key:                              [  OK  ]
Generating SSH2 DSA host key:                              [  OK  ]
Starting sshd:                                             [  OK  ]
Starting xinetd:                                           [  OK  ]
Starting crond:                                            [  OK  ]
Starting xfs:                                              [  OK  ]
Starting anacron:                                          [  OK  ]
Starting atd:                                              [  OK  ]
Starting system message bus:                               [  OK  ]
Starting cups-config-daemon:                               [  OK  ]
Starting HAL daemon:                                       [  OK  ]
Starting oraclevm-template...
Regenerating SSH host keys.
Stopping sshd:                                             [  OK  ]
Generating SSH1 RSA host key:                              [  OK  ]
Generating SSH2 RSA host key:                              [  OK  ]
Generating SSH2 DSA host key:                              [  OK  ]
Starting sshd:                                             [  OK  ]
Regenerating up2date uuid.
Setting Oracle validated configuration parameters.

Configuring network interface.
  Network device: eth0
  Hardware address: D2:EC:49:0D:7D:80

Do you want to enable dynamic IP configuration (DHCP) (Y|n)? 
...
...

# lxc-console -n ol4test1

Enterprise Linux Enterprise Linux AS release 4 (October Update 7)
Kernel 2.6.39-100.0.12.el6uek.x86_64 on an x86_64

localhost login: 

About

Wim Coekaerts is the Senior Vice President of Linux and Virtualization Engineering for Oracle. He is responsible for Oracle's complete desktop to data center virtualization product line and the Oracle Linux support program.

You can follow him on Twitter at @wimcoekaerts

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today