Saturday May 26, 2012

example of transcendent memory and oracle databases

I did some tests with tmem using an Oracle Database 11gR2 and swingbench setup. You can see a graph below. Let me try to explain what this means.

Using Oracle VM 3, with some changes to how dom0 boots (additional parameters at the boot prompt) and with UEK2 as the guest kernel in my VM, I can make use of autoballooning. What you see in the graph below is very simple: it's a timeline (horizontal) of how much memory the VM is actually using (vertical). I created three 16GB VMs that I wanted to run on a 36GB Oracle VM server (so more VM memory than is physically available in the server). When I start a 16GB VM, the Linux guest immediately balloons down to about 700MB in size. It automatically releases the pages it doesn't need to the hypervisor; otherwise they would just be free/idle memory. Then I start a database with a 4GB SGA and, as you can see, the second I start the DB the VM grows to just over 4GB in size. Then I start swingbench runs with 25, 50, 100, 500 and 1000 users. Every time such a run starts, you can see memory use go up; when swingbench stops, it goes down. In the end, after the last run with 1000 users, I also shut down the database instance and memory drops all the way back to 700MB.

I ran 3 guests, each with swingbench and the database. Through dynamic ballooning, and with the guests working cooperatively with the hypervisor, I was able to start all three 16GB VMs and there was no performance impact. When there was free memory in the hypervisor, cleancache kicked in and the guests made use of those pages, including deduping and compression of the pages.
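
To put the overcommit in numbers, here is a small back-of-the-envelope sketch in Python. The figures are the approximate ones quoted in this post; the "active" footprint per guest is only the rough "just over 4GB" seen once the database is up, so treat the result as an estimate and not a measurement.

```python
# Approximate figures quoted in the post, all in GB.
HOST_MEMORY_GB = 36
VM_SIZES_GB = [16, 16, 16]
IDLE_GUEST_GB = 0.7      # roughly what an idle guest balloons down to
ACTIVE_GUEST_GB = 4.0    # rough footprint once the 4GB SGA database is started

configured_gb = sum(VM_SIZES_GB)
overcommit_ratio = configured_gb / HOST_MEMORY_GB
idle_total_gb = IDLE_GUEST_GB * len(VM_SIZES_GB)
active_total_gb = ACTIVE_GUEST_GB * len(VM_SIZES_GB)

print(f"configured: {configured_gb} GB on a {HOST_MEMORY_GB} GB host "
      f"({overcommit_ratio:.2f}x)")
print(f"all guests idle: about {idle_total_gb:.1f} GB")
print(f"all databases started: at least about {active_total_gb:.0f} GB")
```

The configured memory does not fit in the server, but the memory the guests actually hold fits with room to spare, which is what makes starting all three VMs possible.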

If you want to play with this yourself, you can run this command in dom0 to get decent statistics out of the setup : xm tmem-list --long --all | /usr/sbin/xen-tmem-list-parse. It will show you the compression ratio, the cache ratios etc. I used those statistics to generate the graph below. This is yet another example of how, when one can control both the hypervisor and the guest operating system and have things work together, you get better and more interesting results than with just black box VM management.


Thursday May 03, 2012

understanding memory allocation in oracle vm / xen

As a follow up to my previous blog about cpu topology, I wanted to add a little bit about memory topology and memory allocation in the hypervisor. Most multi-socket systems these days are considered NUMA. Even though the NUMA-factor has gone down drastically over the years, there still is a small amount of memory locality involved.

My test setup is a dual socket server with 36GB memory. You can see this in Oracle VM Manager as part of the server info or directly on the server with xm info :

# xm info 
total_memory           : 36852
free_memory            : 25742

I have a few VMs running on this server, which is why free memory is lower than total memory. The 16GB VM is running with tmem enabled and because of that it is not using up all of its memory, only the base memory it needs to be active for the workload it's running.

# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
0004fb00000600001668dac79108cb84             2  4096     4     -b----    129.9
0004fb0000060000804bac06a5087809             1  4096     4     -b----    129.4
0004fb0000060000db9c71d539c940ed             3 16000     4     -b----     28.3
Domain-0                                     0  1244    24     r-----    188.0
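
As a rough cross-check, the sketch below (Python, using only the sample output pasted above) adds up the Mem column of xm list and compares it with total minus free from xm info. It assumes both tools report MB, and it ignores whatever the hypervisor keeps for itself, so the gap is only indicative.

```python
XM_LIST = """\
Name                                        ID   Mem VCPUs      State   Time(s)
0004fb00000600001668dac79108cb84             2  4096     4     -b----    129.9
0004fb0000060000804bac06a5087809             1  4096     4     -b----    129.4
0004fb0000060000db9c71d539c940ed             3 16000     4     -b----     28.3
Domain-0                                     0  1244    24     r-----    188.0
"""

TOTAL_MEMORY_MB = 36852   # from xm info above
FREE_MEMORY_MB = 25742    # from xm info above


def listed_memory_mb(xm_list_text):
    """Sum the Mem column (third field) over all domains, skipping the header."""
    total = 0
    for line in xm_list_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 3:
            total += int(fields[2])
    return total


listed_mb = listed_memory_mb(XM_LIST)
used_mb = TOTAL_MEMORY_MB - FREE_MEMORY_MB
print(f"listed by xm list : {listed_mb} MB")
print(f"total - free      : {used_mb} MB")
print(f"difference        : {listed_mb - used_mb} MB")
```

The listed memory is well above what is actually taken from the host, which is consistent with the tmem-enabled 16GB guest having given most of its memory back.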

Let's start with a clean slate and look at some statistics. The following commands will dump detailed memory information on your server :

# xm debug-key u ; xm dmesg

The debug key 'u' basically makes Xen dump its NUMA memory info, and xm dmesg will show you that debug output.

(XEN) 'u' pressed -> dumping numa info (now-0xFE:A1CFFF69)
(XEN) idx0 -> NODE0 start->0 size->4980736
(XEN) phys_to_nid(0000000000001000) -> 0 should be 0
(XEN) idx1 -> NODE1 start->4980736 size->4718592
(XEN) phys_to_nid(00000004c0001000) -> 1 should be 1
(XEN) CPU10 -> NODE0
(XEN) CPU11 -> NODE0
(XEN) CPU12 -> NODE1
(XEN) CPU13 -> NODE1
(XEN) CPU14 -> NODE1
(XEN) CPU15 -> NODE1
(XEN) CPU16 -> NODE1
(XEN) CPU17 -> NODE1
(XEN) CPU18 -> NODE1
(XEN) CPU19 -> NODE1
(XEN) CPU20 -> NODE1
(XEN) CPU21 -> NODE1
(XEN) CPU22 -> NODE1
(XEN) CPU23 -> NODE1
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 318627):
(XEN)     Node 0: 282976
(XEN)     Node 1: 35651
The above output shows that the first 12 CPUs are bound to memory node 0 and the next 12 to memory node 1. The info also shows how many pages of RAM each node has : NODE0 start->0 size->4980736 and NODE1 start->4980736 size->4718592. The Dom0 domain has about 1.2GB of RAM and it has some memory allocated on each node (it also has all of its 24 vCPUs allocated across all threads in the box). Now let's start a VM.
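
The sizes in this output are page counts, not megabytes. A minimal Python sketch for converting them, assuming the usual 4 KiB x86 page size (the dump itself does not state the page size):

```python
import re

PAGE_SIZE_BYTES = 4096  # assumption: standard 4 KiB x86 pages


def pages_to_mib(pages):
    """Convert a page count to MiB."""
    return pages * PAGE_SIZE_BYTES / (1024 * 1024)


# Lines copied from the debug output above.
SAMPLE = """\
(XEN) idx0 -> NODE0 start->0 size->4980736
(XEN) idx1 -> NODE1 start->4980736 size->4718592
"""

NODE_LINE = re.compile(r"idx\d+ -> NODE(\d+) start->(\d+) size->(\d+)")
node_pages = {int(m.group(1)): int(m.group(3)) for m in NODE_LINE.finditer(SAMPLE)}

for node, pages in sorted(node_pages.items()):
    print(f"node {node}: {pages} pages = {pages_to_mib(pages):.0f} MiB")

# Dom0's total from the same dump.
print(f"Dom0: 318627 pages = {pages_to_mib(318627):.1f} MiB")
```

With that assumption, node 0 comes out at 19456 MiB, node 1 at 18432 MiB, and Dom0 at about 1244.6 MiB, which lines up with the 1244 shown by xm list. The two node sizes add up to slightly more than the total_memory reported by xm info, presumably because they describe address ranges and not only usable RAM.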

# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
0004fb0000060000804bac06a5087809             4  4096     4     r-----      8.8
Domain-0                                     0  1244    24     r-----    240.9

# xm debug-key u ; xm dmesg
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 318627):
(XEN)     Node 0: 282976
(XEN)     Node 1: 35651
(XEN) Domain 4 (total: 1048576):
(XEN)     Node 0: 1048576
(XEN)     Node 1: 0
You can see that the newly started VM (domain 4) has 4GB allocated on node 0.
# xm vcpu-list 4
Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
0004fb0000060000804bac06a5087809     4     0     0   -b-       4.8 0-3
0004fb0000060000804bac06a5087809     4     1     3   -b-      26.1 0-3
0004fb0000060000804bac06a5087809     4     2     2   -b-       3.5 0-3
0004fb0000060000804bac06a5087809     4     3     1   -b-       2.4 0-3
The VM also has its virtual CPUs bound to node 0. Let's start another VM.

# xm vcpu-list 6
Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
0004fb00000600001668dac79108cb84     6     0    19   r--       2.2 19-23
0004fb00000600001668dac79108cb84     6     1    23   r--      24.6 19-23
0004fb00000600001668dac79108cb84     6     2    20   -b-       1.4 19-23
0004fb00000600001668dac79108cb84     6     3    22   -b-       1.1 19-23

# xm debug-key u ; xm dmesg
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 318627):
(XEN)     Node 0: 282976
(XEN)     Node 1: 35651
(XEN) Domain 4 (total: 1048576):
(XEN)     Node 0: 1048576
(XEN)     Node 1: 0
(XEN) Domain 6 (total: 1048576):
(XEN)     Node 0: 0
(XEN)     Node 1: 1048576
As you can see, domain 6 has its vCPUs bound to node 1, and Xen automatically allocates its memory from node 1 as well, to ensure memory locality. It tries hard to keep memory and CPU as local as possible. Of course, when you run many VMs with many vCPUs, memory allocation will be spread out across multiple nodes.
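
To check which node a vCPU affinity lands on, something like the following Python sketch works. The CPU-to-node mapping is hard-coded from the description of this particular server (CPUs 0-11 on node 0, 12-23 on node 1), and the helper names are made up for illustration. It only handles numeric ranges such as 0-3, not other affinity strings xm might print.

```python
def cpu_to_node(cpu):
    """Mapping for this test server only: CPUs 0-11 -> node 0, 12-23 -> node 1."""
    return 0 if cpu < 12 else 1


def parse_affinity(spec):
    """Expand an affinity string such as '0-3' or '1,4-6' into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        low, sep, high = part.partition("-")
        last = int(high) if sep else int(low)
        cpus.update(range(int(low), last + 1))
    return cpus


def nodes_for_affinity(spec):
    """Return the set of NUMA nodes covered by an affinity string."""
    return {cpu_to_node(cpu) for cpu in parse_affinity(spec)}


print("domain 4, affinity 0-3   -> nodes", nodes_for_affinity("0-3"))
print("domain 6, affinity 19-23 -> nodes", nodes_for_affinity("19-23"))
```

An affinity that straddles the boundary, for example 10-13, would report both nodes, which is the case where memory locality can no longer be guaranteed.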

After starting a 16GB VM on this server (domain 7), now that 8GB is already allocated to the two 4GB VMs, you will see that this 16GB VM's memory allocation is spread across the 2 memory nodes :

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 318627):
(XEN)     Node 0: 282976
(XEN)     Node 1: 35651
(XEN) Domain 4 (total: 1048576):
(XEN)     Node 0: 1048576
(XEN)     Node 1: 0
(XEN) Domain 6 (total: 1048576):
(XEN)     Node 0: 0
(XEN)     Node 1: 1048576
(XEN) Domain 7 (total: 4097012):
(XEN)     Node 0: 2524148
(XEN)     Node 1: 1572864
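
If you want the split as percentages, the dump is easy to parse. This is a small Python sketch run against a copy of the output above; the function name is mine, and it relies only on the line format visible in this post.

```python
import re

DUMP = """\
(XEN) Memory location of each domain:
(XEN) Domain 6 (total: 1048576):
(XEN)     Node 0: 0
(XEN)     Node 1: 1048576
(XEN) Domain 7 (total: 4097012):
(XEN)     Node 0: 2524148
(XEN)     Node 1: 1572864
"""


def parse_domain_nodes(dmesg_text):
    """Map domain id -> {node: pages} from the 'Memory location' dump."""
    domains = {}
    current = None
    for line in dmesg_text.splitlines():
        dom = re.search(r"Domain (\d+) \(total: (\d+)\)", line)
        node = re.search(r"Node (\d+): (\d+)", line)
        if dom:
            current = domains.setdefault(int(dom.group(1)), {})
        elif node and current is not None:
            current[int(node.group(1))] = int(node.group(2))
    return domains


domains = parse_domain_nodes(DUMP)
for dom_id, nodes in sorted(domains.items()):
    total = sum(nodes.values())
    shares = ", ".join(f"node {n}: {p / total:.1%}" for n, p in sorted(nodes.items()))
    print(f"domain {dom_id}: {shares}")
```

For domain 7 this gives roughly 62% of its pages on node 0 and 38% on node 1, while domain 6 sits entirely on node 1.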

Thursday May 26, 2011

Linux mainline contains all the Xen code bits for Dom0 and DomU support

After a relatively long road traveled with a few bumps along the way, as of yesterday, Linus's mainline tree (2.6.39+) contains literally every component needed for Linux to run both as a management domain kernel (Dom0) and as a guest (DomU).

Xen has always used Linux as the management OS (Dom0) on top of the hypervisor itself, to do the device management and control of the virtual machines running on top of Xen. And for many years, next to the hypervisor, there was a substantial Linux kernel patch that had to be applied on top of a Linux kernel to transform it into this "Dom0". This code constantly had to be kept in sync with the progress Linux itself was making and as such caused a substantial amount of extra work.

Another bit of code that's been in the kernel for many years is the set of paravirt drivers for Xen in a guest VM (DomU). Linux has had these as part of the codebase for quite a few years : the xen network, block and xenbus drivers that are loaded when you run a hardware virtualized guest (hvm) on Xen with paravirt (pv) drivers. This is always referred to as pv-hvm.

A pure hardware virtualized kernel without any xen drivers, with just emulated qemu devices, is simply called an "hvm guest". This does not perform well, as any type of network or block IO goes through many layers of emulation. As hardware virtualization in the chips has improved over the years, pv-hvm has become performant and is frequently used. The pv-drivers basically are highly optimized virtual devices that communicate through the hypervisor to do network or disk IO, handled behind the scenes by the Dom0 kernel and what are called backend devices (netbk, blockbk).

A pure paravirtualized guest is/was an OS kernel that was totally modified to really be in sync with the hypervisor and let the hypervisor take care of or own a number of tasks, to be as optimized as possible. Performance and integration are the best with a paravirtualized kernel, and this also allowed xen to run on x86 hardware, optimally, without hardware virtualization instruction support - this is referred to as pv-guest. The Dom0 kernel runs in pv mode (more on this later) and the DomU guests could run in hvm, pv-hvm or pv mode.

Over the years, a number of efforts were made to get these pv / dom0 patches submitted into the mainline kernel, but at times the code was not considered acceptable by a number of the Linux kernel maintainers and little progress was made. Over the last 2 years a renewed effort started to really convert the code into patches considered acceptable, and a set of people : Jeremy Fitzhardinge, Konrad Rzeszutek Wilk, Ian Campbell, Stefano Stabellini (and others not mentioned but obviously also important) focused on getting this stuff done once and for all. And so, bit by bit, code was rewritten, submitted for review, and rewritten again until it was considered ok. In terms of timeline, a good chunk of code has gone in over time to handle Linux as a well behaved guest (DomU) first, followed by all the work to make the Dom0 happy as well.

One change that happened in the Linux kernel to be able to better handle such an infrastructure in a virtual world for more than one hypervisor, was called pvops.

pvops is a mode where the kernel can switch into pv, hvm or pvhvm at boot time. Instead of having multiple kernel binaries, there is just one, and it will lay out its operations at boot time when it detects what platform it runs on. Linux as a DomU guest on Xen has had pvops support since 2.6.23/24, with good use starting around 2.6.27. So the frontend network and block drivers and running pvops on xen have also been around for quite some time. As this was finalized, the work focused more on preparing the Dom0 parts of the integration and on a migration from the old classical pure pv kernel mode to what's now called pvops.
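
The idea of one binary that lays out its operations at boot can be pictured with a toy example. The Python below is purely conceptual and is not how the kernel implements it: the class, the function names and the platform strings are all made up for illustration. It only shows the pattern of picking a table of operations once, based on the detected platform, and calling through it afterwards.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PlatformOps:
    """Hypothetical stand-in for a table of low-level operations."""
    name: str
    update_page_table: Callable[[int], str]


def native_update(entry: int) -> str:
    # On bare metal the kernel would touch the hardware structure itself.
    return f"write page table entry {entry:#x} directly"


def xen_pv_update(entry: int) -> str:
    # As a pv guest the kernel would hand the operation to the hypervisor.
    return f"ask the hypervisor to update page table entry {entry:#x}"


OPS_BY_PLATFORM = {
    "native": PlatformOps("native", native_update),
    "xen-pv": PlatformOps("xen-pv", xen_pv_update),
}


def select_ops(detected_platform: str) -> PlatformOps:
    """Pick the operations table once, at 'boot'; fall back to native."""
    return OPS_BY_PLATFORM.get(detected_platform, OPS_BY_PLATFORM["native"])


ops = select_ops("xen-pv")
print(ops.name, "->", ops.update_page_table(0x1000))
```

The rest of the code only ever calls through ops, so the same binary behaves correctly whichever table was selected.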

Late last year in 2.6.37, we had a mainline kernel that was able to actually run as the "Dom0" for the Xen hypervisor. That was a big step, followed shortly by adding the remaining bits that were needed to really handle every area : memory management, grant table stuff, network pv driver backend and block pv driver backend code (and other misc components). The last remaining driver just got merged 2 days ago into 2.6.39+ mainline - the block backend driver blkback.

All this means that every single bit of support needed in Linux to work perfectly well with Xen is -in- the mainline kernel tree. Over the last few years I've heard competitors use "There is no Xen support in Linux" as a tagline to create FUD with the xen userbase and promote alternatives. Well, it's all there, people. As Linux evolves, the Linux/Xen bits will now evolve within that code base at the same rate, without separate patch trees and big chunks of code to carry along. This is great. Xen is a great hypervisor with capabilities and features that one cannot achieve in a non-true hypervisor architecture. We are exploiting this, with more to come in the future. My hat's off to everyone in the Xen community, including of course our guys on the Oracle VM team, Konrad and gang, for their help, and a big shout out to the citrix xensource folks. Good times.


Wim Coekaerts is the Senior Vice President of Linux and Virtualization Engineering for Oracle. He is responsible for Oracle's complete desktop to data center virtualization product line and the Oracle Linux support program.

You can follow him on Twitter at @wimcoekaerts

