Wednesday Jul 29, 2009

A million VMs?

The other day, I was sent this link from HPCWire: Sandia Computer Scientists Boot One Million Linux Kernels as Virtual Machines. Let's see what they did:
To complete the project, Sandia utilized its Albuquerque-based 4,480-node Dell high-performance computer cluster, known as Thunderbird. To arrive at the one million Linux kernel figure, Sandia's researchers ran one kernel in each of 250 VMs and coupled those with the 4,480 physical machines on Thunderbird.
4,480 machines, with 250 VMs each. While we don't have 4,480 machines available, I thought I'd go for 250 Solaris domUs (VM instances) on one Solaris dom0 (the controlling domain), as a proof of concept. If my fellow nerds here in New Mexico can do it, so can I!

One of the machines in our test lab has 40G of memory and 16 CPU cores (4 sockets with 4 cores each), so it sounded like a good dom0 candidate. To be safe with dom0 memory, I pinned it at 4G. I also wanted a small domU, so I picked (Open)Solaris build 105, simply because I knew it would run in 128M of memory in 32-bit mode (I used 32-bit mode only because it saves a bit of memory; 64-bit required 16M more per domU). This would leave enough room to add a few more domUs, should things work out. I also decided not to configure networking in the domUs, simply because there weren't enough IP addresses available in the lab network. This is just a proof of concept, after all.
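Pinning dom0 memory is done on the hypervisor's boot command line. On a Solaris xVM system, the menu.lst entry would look roughly like this (the surrounding module lines are my reconstruction of a typical entry; `dom0_mem` is the relevant bit):

```
kernel$ /boot/$ISADIR/xen.gz dom0_mem=4096M
module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix
module$ /platform/i86pc/$ISADIR/boot_archive
```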

How to set things up? I wanted this done quickly, so I preferred to have the backing storage on local disk, not on NFS. A basic install of Solaris plus swap space fits in 1.5G, so I picked 2G per domU as sufficient.

The first issue was that no more than 400G of disk space was left on this system, so 250 × 2G = 500G of raw backing disk wasn't going to work. No problem, ZFS to the rescue. Since all domUs would be virtually identical, cloned ZFS volumes are a perfect match.
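The initial volume and snapshot step isn't shown below; assuming a sparse 2G volume, it would have amounted to something like this (the `zfs create` options are my reconstruction, not a transcript; the volume and snapshot names match the listings that follow):

```
# zfs create -s -V 2G tank/disk0
[install the paravirtualized domU onto tank/disk0, then:]
# zfs snapshot tank/disk0@init
```

With a sparse volume and clones, only the blocks that a domU actually changes consume space, which is why 250 guests fit in under 400G.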

After doing an install of a paravirtualized Solaris domU on a ZFS volume, I took a snapshot and then cloned it 250 times:

# zfs list -t snapshot
NAME                                    USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/snv_115@bfu-116-configured   181M      -  7.24G  -
tank/disk0@init                            0      -   973M  -

#!/bin/sh
i=1
while [ $i -le 250 ]
do
        echo "cloning instance $i"
        zfs clone tank/disk0@init tank/disk$i
        i=`expr $i + 1`
done
exit 0

# zfs list 
NAME                  USED  AVAIL  REFER  MOUNTPOINT
[...]
tank                 22.7G   379G    25K  /export
tank/disk0           2.95G   381G   973M  -
tank/disk1               0   379G   973M  -
tank/disk10              0   379G   973M  -
tank/disk100             0   379G   973M  -
tank/disk101             0   379G   973M  -
tank/disk102             0   379G   973M  -
tank/disk103             0   379G   973M  -
tank/disk104             0   379G   973M  -
tank/disk105             0   379G   973M  -
tank/disk106             0   379G   973M  -
tank/disk107             0   379G   973M  -
tank/disk108             0   379G   973M  -
tank/disk109             0   379G   973M  -
tank/disk11              0   379G   973M  -
tank/disk110             0   379G   973M  -
tank/disk111             0   379G   973M  -
tank/disk112             0   379G   973M  -
tank/disk113             0   379G   973M  -
tank/disk114             0   379G   973M  -
tank/disk115             0   379G   973M  -
tank/disk116             0   379G   973M  -
tank/disk117             0   379G   973M  -
tank/disk118             0   379G   973M  -
tank/disk119             0   379G   973M  -
[etc]
That was easy enough. Now to create all the domains, using this script:
#!/bin/sh
i=1
while [ $i -le 250 ]
do
        echo "creating VM $i"
        hex=`printf "%02x" $i`
        echo "(
[template sxp file contents with variables]
)" > temp$i.sxp
        xm new -F temp$i.sxp
        rm -f temp$i.sxp
        i=`expr $i + 1`
done
exit 0
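The template contents are elided above. For illustration only, a skeleton SXP config for one of these guests might look roughly like this; the kernel/ramdisk paths, the vbd naming, and the MAC scheme are my assumptions, not the actual template (00:16:3e is the Xen OUI, and `$hex` is the per-instance value from the script above):

```
(domain
    (name spv$i)
    (memory 128)
    (vcpus 1)
    (image
        (linux
            (kernel /platform/i86xpv/kernel/unix)
            (ramdisk /platform/i86pc/boot_archive)))
    (device (vbd (uname phy:/dev/zvol/dsk/tank/disk$i) (dev 0) (mode w)))
    (device (vif (mac 00:16:3e:00:00:$hex)))
)
```

The explicit kernel and ramdisk lines are what force the guest to boot the 32-bit kernel, which is exactly what libvirt was stripping.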
I used a raw SXP config file, because I noticed that creating the domains via libvirt would strip the explicit kernel and ramdisk options that are needed to boot the domU 32-bit instead of 64-bit. That's an item to be dealt with later. All of this took less than an hour to set up, and all domains were ready to go:
ginsberg# virsh list --all
 Id Name                 State
----------------------------------
  0 Domain-0             running
  - spv0                 shut off
  - spv1                 shut off
  - spv10                shut off
  - spv100               shut off
  - spv101               shut off
  - spv102               shut off
  - spv103               shut off
  - spv104               shut off
  - spv105               shut off
  - spv106               shut off
  - spv107               shut off
  - spv108               shut off
  - spv109               shut off
  - spv11                shut off
  - spv110               shut off
  - spv111               shut off
  - spv112               shut off
  - spv113               shut off
  - spv114               shut off
  - spv115               shut off
  - spv116               shut off
  - spv117               shut off
  - spv118               shut off
  - spv119               shut off
[etc]
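Starting them all is just a loop over `xm start`. A dry-run sketch (hypothetical; this version only prints the commands, while on the real dom0 the echo would be the actual invocation):

```shell
#!/bin/sh
# Dry-run sketch of the start loop: print the xm command for each domain.
# On the real dom0, this would invoke `xm start spv$i` directly.
i=1
while [ $i -le 250 ]
do
        echo "xm start spv$i"
        i=`expr $i + 1`
done
```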
So I started a loop to get them all going. After 25 domains had started, the hypervisor complained:
(xVM) Cannot handle page request order 0!
Hm... was enough memory available?
(xVM) Physical memory information:
(xVM)     Xen heap: 12kB free
[...]
Ah, Xen had run out of heap space. The Xen heap is gone in Xen 3.4, which wouldn't have this issue, but I was doing this on our current 3.3-based bits (we have 3.4 lined up for later this year), so this needed to be worked around. 256M of heap space should be plenty (instead of the 16M default), so I rebooted the system with xenheap_megabytes=256 on the Xen command line and started firing up domains again. After a while, I noticed an error message:
Unable to open tty /dev/pts/86: No such file or directory
So I stopped the loop to see what was going on. It quickly turned out that xenconsoled was running out of file descriptors, so I raised its limit using plimit(1) and restarted the loop. Domains were happily starting up, until about the 127th domain:
panic[cpu2]/thread=ffffff000a861c60: No available IRQ to bind to: increase
NR_IRQS!

ffffff000a8618f0 unix:alloc_irq+158 ()
ffffff000a861910 unix:ec_bind_evtchn_to_irq+2e ()
ffffff000a861950 unix:xvdi_bind_evtchn+a3 ()
ffffff000a8619e0 xdb:xdb_bindto_frontend+206 ()
ffffff000a861a30 xdb:xdb_start_connect+ae ()
ffffff000a861a80 xdb:xdb_oe_state_change+99 ()
ffffff000a861af0 genunix:ndi_event_run_callbacks+96 ()
ffffff000a861b20 xpvd:xpvd_post_event+24 ()
ffffff000a861b50 genunix:ndi_post_event+2d ()
ffffff000a861ba0 unix:i_xvdi_oestate_handler+94 ()
ffffff000a861c40 genunix:taskq_thread+1b7 ()
ffffff000a861c50 unix:thread_start+8 ()
Oops. OK, I bumped NR_IRQS (to be more precise, NR_DYNIRQ) and recompiled the dom0 kernel. Obviously, dom0 shouldn't panic like that when it runs out of virtual IRQ space, but that's an issue to be addressed later.

I updated the system, rebooted, and started the loop again. This time, success! All domUs were active, and I could access all of their consoles.

xentop - 10:15:12   Xen 3.3.2-rc1-xvm-debu
252 domains: 1 running, 251 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 41942204k total, 37902624k used, 4039580k free    CPUs: 16 @ 2933MHz
      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS
 NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR SSID
  Domain-0 -----r       6501   39.5    4194304   10.0   no limit       n/a    16
    0        0        0    0        0        0        0    0
      spv0 --b---         21    0.3     131072    0.3     131072       0.3     1
    1        0        0    1        0        0        0    0
      spv1 --b---         21    0.3     131072    0.3     131072       0.3     1
    1        0        0    1        0        0        0    0

# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  4096    16     r-----   6508.4
spv0                                         1   128     1     -b----     21.5
spv1                                         2   128     1     -b----     21.2
spv10                                       11   128     1     -b----     23.4
spv100                                     101   128     1     -b----     18.0
spv101                                     102   128     1     -b----     17.7
spv102                                     103   128     1     -b----     17.8
spv103                                     104   128     1     -b----     17.6
spv104                                     105   128     1     -b----     19.3
spv105                                     106   128     1     -b----     18.1
spv106                                     107   128     1     -b----     18.6
spv107                                     108   128     1     -b----     18.1
spv108                                     109   128     1     -b----     19.6
spv109                                     110   128     1     -b----     19.2
spv11                                       12   128     1     -b----     21.2
spv110                                     111   128     1     -b----     18.9
spv111                                     112   128     1     -b----     18.0
spv112                                     113   128     1     -b----     17.7
spv113                                     114   128     1     -b----     19.6
spv114                                     115   128     1     -b----     17.8
spv115                                     116   128     1     -b----     17.9
spv116                                     117   128     1     -b----     19.9
spv117                                     118   128     1     -b----     18.4
spv118                                     119   128     1     -b----     18.5
spv119                                     120   128     1     -b----     17.8
spv12                                       13   128     1     -b----     22.7
spv120                                     121   128     1     -b----     19.3
spv121                                     122   128     1     -b----     18.0
spv122                                     123   128     1     -b----     17.6
[..you get the idea]

=================

# virsh console spv250
v3.3.2-rc1-xvm-debu chgset 'Wed Jul 29 08:09:08 2009 -0700 18433:844795afdcb4'
SunOS Release 5.11 Version snv_105 32-bit
Copyright 1983-2008 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: spv
Reading ZFS config: done.

spv console login:
spv console login:
spv console login:
spv console login:
spv console login: root
Password:
Last login: Tue Jul 28 15:22:26 on console
Jul 29 10:12:19 spv login: ROOT LOGIN /dev/console
Sun Microsystems Inc.   SunOS 5.11      snv_105 November 2008
# w
 10:12am  up 3 min(s),  1 user,  load average: 0.06, 0.10, 0.04
User     tty           login@  idle   JCPU   PCPU  what
root     console      10:12am                      -sh

Just for fun, I started up some more domUs, as there was memory left for about 30 more. But there seems to be a limit (an 8-bit limit, perhaps) somewhere in the system:
starting instance 256
error: Failed to start domain spv256
error: POST operation failed: xend_post: error from xen daemon:
(xend.err 'Device 0 (vif) could not be connected. Backend device not found.')
syseventconfd[100801]: process 225108 exited with status 1
Hmm, well, that's another item to be looked at.

All in all, this didn't take long to set up, and the domUs were running just fine. The job was made a whole lot easier by ZFS, too. The system has been running all these domUs for about half a day, and I've poked around in them a bit, without any problems.

Now I just need 4,480 of these machines..

UPDATE: The last limit I mentioned was actually an error in my script; it generated a bogus MAC address for the new guest. However, there currently is a limit of 256 guests, because of the value of EVTCHNDRV_DEFAULT_NCLONES, which is 256.

About

fvdl
