Wednesday Jul 29, 2009

A million VMs?

The other day, someone sent me this link from HPCWire: Sandia Computer Scientists Boot One Million Linux Kernels as Virtual Machines. Let's see what they did:
To complete the project, Sandia utilized its Albuquerque-based 4,480-node Dell high-performance computer cluster, known as Thunderbird. To arrive at the one million Linux kernel figure, Sandia's researchers ran one kernel in each of 250 VMs and coupled those with the 4,480 physical machines on Thunderbird.
4480 machines, with 250 VMs each. While we don't have 4480 machines available, I thought I'd go for 250 Solaris domUs (VM instances) on one Solaris dom0 (the controlling domain) as a proof of concept. If my fellow nerds here in New Mexico can do it, so can I!

One of the machines in our test lab has 40G of memory and 4x4 CPU cores in it, so it sounded like a good dom0 candidate. I wanted to be safe with dom0 memory, so I pinned that at 4G. To play it safe, I also wanted a small domU. I picked (Open)Solaris build 105, simply because I knew it would run in 128M of memory in 32-bit mode (32-bit mode was used only because it saves a bit of memory; 64-bit required 16M more per domU). 250 domUs at 128M each comes to roughly 31G, which together with the 4G dom0 fits in 40G with enough room left to add a few more domUs, should things work out. I also decided not to configure networking in the domUs, simply because there weren't enough IP addresses available in the lab network. This is just a proof of concept, after all.

How to set things up? I wanted this done quickly, so I preferred to have the backing storage on the local disk, not on NFS. A basic install of Solaris plus swap space fits in 1.5G, so I picked 2G as sufficient for an install.

The first issue was that no more than 400G of disk space was left on this system, so 250 * 2 = 500G of raw backing disk wasn't going to work. No problem, ZFS to the rescue. Since all domUs would be virtually identical, cloned ZFS volumes are a perfect match.
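Setting up the master volume amounts to something like this (a sketch; the pool and dataset names match what's used below, the install step is described in the next paragraph):

```shell
# Create a 2G ZFS volume as the backing disk for the first domU.
zfs create -V 2g tank/disk0
# ... install the paravirtualized Solaris domU onto tank/disk0 ...
# Freeze the installed state in a snapshot; all clones will share
# its blocks, so 250 clones cost almost no extra disk space.
zfs snapshot tank/disk0@init
```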

After doing an install of a paravirtualized Solaris domU on a ZFS volume, I took a snapshot and then cloned it 250 times:

# zfs list -t snapshot
NAME                                    USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/snv_115@bfu-116-configured   181M      -  7.24G  -
tank/disk0@init                            0      -   973M  -

i=1
while [ $i -le 250 ]
do
        echo "cloning instance $i"
        zfs clone tank/disk0@init tank/disk$i
        i=`expr $i + 1`
done
exit 0

# zfs list 
tank                 22.7G   379G    25K  /export
tank/disk0           2.95G   381G   973M  -
tank/disk1               0   379G   973M  -
tank/disk10              0   379G   973M  -
tank/disk100             0   379G   973M  -
tank/disk101             0   379G   973M  -
tank/disk102             0   379G   973M  -
tank/disk103             0   379G   973M  -
tank/disk104             0   379G   973M  -
tank/disk105             0   379G   973M  -
tank/disk106             0   379G   973M  -
tank/disk107             0   379G   973M  -
tank/disk108             0   379G   973M  -
tank/disk109             0   379G   973M  -
tank/disk11              0   379G   973M  -
tank/disk110             0   379G   973M  -
tank/disk111             0   379G   973M  -
tank/disk112             0   379G   973M  -
tank/disk113             0   379G   973M  -
tank/disk114             0   379G   973M  -
tank/disk115             0   379G   973M  -
tank/disk116             0   379G   973M  -
tank/disk117             0   379G   973M  -
tank/disk118             0   379G   973M  -
tank/disk119             0   379G   973M  -
That was easy enough. Now to create all the domains, using this script:
i=1
while [ $i -le 250 ]
do
        echo "creating VM $i"
        hex=`printf "%02x" $i`
        echo "(
[template sxp file contents with variables]
)" > temp$i.sxp
        xm new -F temp$i.sxp
        rm -f temp$i.sxp
        i=`expr $i + 1`
done
exit 0
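For illustration, the elided template is roughly of this shape (the field values here are hypothetical; the real ones came from the installed domU):

```
(domain
    (name spv$i)
    (memory 128)
    (vcpus 1)
    (image
        (linux
            (kernel /platform/i86xpv/kernel/unix)
            (ramdisk /platform/i86pc/boot_archive)
        )
    )
    (device
        (vbd
            (uname phy:/dev/zvol/dsk/tank/disk$i)
            (dev 0)
            (mode w)
        )
    )
)
```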
I used a raw SXP config file, because I noticed that creating the domains via libvirt would strip the explicit kernel and ramdisk options that are needed to boot the domU 32-bit instead of 64-bit. That's an item to be dealt with later. All of this took less than an hour to set up, and all domains were ready to go:
ginsberg# virsh list --all
 Id Name                 State
  0 Domain-0             running
  - spv0                 shut off
  - spv1                 shut off
  - spv10                shut off
  - spv100               shut off
  - spv101               shut off
  - spv102               shut off
  - spv103               shut off
  - spv104               shut off
  - spv105               shut off
  - spv106               shut off
  - spv107               shut off
  - spv108               shut off
  - spv109               shut off
  - spv11                shut off
  - spv110               shut off
  - spv111               shut off
  - spv112               shut off
  - spv113               shut off
  - spv114               shut off
  - spv115               shut off
  - spv116               shut off
  - spv117               shut off
  - spv118               shut off
  - spv119               shut off
So I started a loop to get them all going. After about 25 domains had started, the hypervisor complained:
(xVM) Cannot handle page request order 0!
Hm... was enough memory available?
(xVM) Physical memory information:
(xVM)     Xen heap: 12kB free
Ah, Xen had run out of heap space. The separate Xen heap is gone in Xen 3.4, so that version wouldn't have this issue. However, I was doing this on our current 3.3-based bits (we have 3.4 lined up for later this year), so this needed to be worked around. 256M of heap space should be enough, instead of the 16M default. So, I rebooted the system with xenheap_megabytes=256 on the Xen command line, and started firing up domains again. After a while, I noticed an error message:
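On Solaris xVM, that means adding the option to the hypervisor line of the GRUB entry; roughly like this (paths abbreviated, and treat the exact module lines as an approximation):

```
title Solaris xVM (xenheap_megabytes=256)
kernel$ /boot/$ISADIR/xen.gz xenheap_megabytes=256
module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix
module$ /platform/i86pc/$ISADIR/boot_archive
```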
Unable to open tty /dev/pts/86: No such file or directory
So I stopped the loop to see what was going on. It quickly turned out that xenconsoled was running out of file descriptors. So I upped its limit using plimit(1), and restarted the loop. Domains were happily starting up, until about the 127th domain:
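The fix was a one-liner along these lines (65536 is just a generously large value I picked; the point is that the default descriptor limit is far too low for hundreds of consoles):

```shell
# Raise the file descriptor limit of the running xenconsoled process.
# plimit(1) adjusts the resource limits of a live process, so no
# restart of the daemon is needed.
plimit -n 65536,65536 `pgrep -x xenconsoled`
```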
panic[cpu2]/thread=ffffff000a861c60: No available IRQ to bind to: increase

ffffff000a8618f0 unix:alloc_irq+158 ()
ffffff000a861910 unix:ec_bind_evtchn_to_irq+2e ()
ffffff000a861950 unix:xvdi_bind_evtchn+a3 ()
ffffff000a8619e0 xdb:xdb_bindto_frontend+206 ()
ffffff000a861a30 xdb:xdb_start_connect+ae ()
ffffff000a861a80 xdb:xdb_oe_state_change+99 ()
ffffff000a861af0 genunix:ndi_event_run_callbacks+96 ()
ffffff000a861b20 xpvd:xpvd_post_event+24 ()
ffffff000a861b50 genunix:ndi_post_event+2d ()
ffffff000a861ba0 unix:i_xvdi_oestate_handler+94 ()
ffffff000a861c40 genunix:taskq_thread+1b7 ()
ffffff000a861c50 unix:thread_start+8 ()
Oops. Ok, I bumped NR_IRQ (to be more precise, NR_DYNIRQ) and recompiled a dom0 kernel. Obviously, dom0 shouldn't panic like that when it runs out of virtual IRQ space, but that's an issue that can be addressed later.

The system was updated and restarted, and I started the loop once more. This time, success! All domUs were active, and I could access all of their consoles.
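The start loop itself was nothing special; something along these lines (the brief sleep between starts is just to pace xend and xenconsoled a bit):

```shell
# Start all 250 domains, pausing briefly between them.
i=1
while [ $i -le 250 ]
do
        echo "starting instance $i"
        xm start spv$i
        sleep 2
        i=`expr $i + 1`
done
```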

xentop - 10:15:12   Xen 3.3.2-rc1-xvm-debu
252 domains: 1 running, 251 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 41942204k total, 37902624k used, 4039580k free    CPUs: 16 @ 2933MHz
      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS
  Domain-0 -----r       6501   39.5    4194304   10.0   no limit       n/a    16
      spv0 --b---         21    0.3     131072    0.3     131072       0.3     1
      spv1 --b---         21    0.3     131072    0.3     131072       0.3     1

# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  4096    16     r-----   6508.4
spv0                                         1   128     1     -b----     21.5
spv1                                         2   128     1     -b----     21.2
spv10                                       11   128     1     -b----     23.4
spv100                                     101   128     1     -b----     18.0
spv101                                     102   128     1     -b----     17.7
spv102                                     103   128     1     -b----     17.8
spv103                                     104   128     1     -b----     17.6
spv104                                     105   128     1     -b----     19.3
spv105                                     106   128     1     -b----     18.1
spv106                                     107   128     1     -b----     18.6
spv107                                     108   128     1     -b----     18.1
spv108                                     109   128     1     -b----     19.6
spv109                                     110   128     1     -b----     19.2
spv11                                       12   128     1     -b----     21.2
spv110                                     111   128     1     -b----     18.9
spv111                                     112   128     1     -b----     18.0
spv112                                     113   128     1     -b----     17.7
spv113                                     114   128     1     -b----     19.6
spv114                                     115   128     1     -b----     17.8
spv115                                     116   128     1     -b----     17.9
spv116                                     117   128     1     -b----     19.9
spv117                                     118   128     1     -b----     18.4
spv118                                     119   128     1     -b----     18.5
spv119                                     120   128     1     -b----     17.8
spv12                                       13   128     1     -b----     22.7
spv120                                     121   128     1     -b----     19.3
spv121                                     122   128     1     -b----     18.0
spv122                                     123   128     1     -b----     17.6
[you get the idea]


# virsh console spv250
v3.3.2-rc1-xvm-debu chgset 'Wed Jul 29 08:09:08 2009 -0700 18433:844795afdcb4'
SunOS Release 5.11 Version snv_105 32-bit
Copyright 1983-2008 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: spv
Reading ZFS config: done.

spv console login:
spv console login:
spv console login:
spv console login:
spv console login: root
Last login: Tue Jul 28 15:22:26 on console
Jul 29 10:12:19 spv login: ROOT LOGIN /dev/console
Sun Microsystems Inc.   SunOS 5.11      snv_105 November 2008
# w
 10:12am  up 3 min(s),  1 user,  load average: 0.06, 0.10, 0.04
User     tty           login@  idle   JCPU   PCPU  what
root     console      10:12am                      -sh

Just for fun, I started up some more domUs, as there was memory left for about 30 more. But there seems to be a limit (an 8-bit limit, perhaps) somewhere in the system:
starting instance 256
error: Failed to start domain spv256
error: POST operation failed: xend_post: error from xen daemon:
(xend.err 'Device 0 (vif) could not be connected. Backend device not found.')
syseventconfd[100801]: process 225108 exited with status 1
Hmm, well, that's another item to be looked at.

All in all, this didn't take long to set up, and the domUs were running just fine. The job was made a whole lot easier by ZFS, too. The system has been running all these domUs for about half a day, and I've poked around in them a bit, without any problems.

Now I just need 4,480 of these machines...

UPDATE: The last limit I mention was actually caused by an error in my script: it generated a bogus MAC address for the new guest. There is, however, a limit of 256 guests currently, because of the value of EVTCHNDRV_DEFAULT_NCLONES, which is 256.
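The bogus address came from formatting the guest index into a single hex octet, which wraps past 255. A fixed sketch (using the 00:16:3e OUI that is conventionally reserved for Xen guest MAC addresses) spreads the index over three octets instead:

```shell
# Derive a unique, well-formed MAC address from the guest index.
# 00:16:3e is the Xen guest OUI; the low three octets encode the
# index, which is good for 2^24 guests.
i=256
mac=`printf "00:16:3e:%02x:%02x:%02x" $(( (i >> 16) & 255 )) $(( (i >> 8) & 255 )) $(( i & 255 ))`
echo $mac
```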

Thursday Aug 17, 2006

Silly IBM statements on OpenSolaris

Ok, we've seen it all before, and it's common practice: sniping at the other guy's products, especially when you see them as a competitor. The latest IBM potshots can be read on CNET, which interviewed IBM's Dan Frye at LinuxWorld. Our own Jim Grisanzio was given the opportunity to reply, and did a good job of countering the spin. However, there's one statement that is just begging to be quoted:

"They have done nothing to build a community," with only 16 non-Sun people contributing code to the project in its first 11 months, Frye said. Linux, in comparison, had 10 times that number in the same period after it was launched by Linus Torvalds in 1991--and that was with no Internet and no advertisements, Frye said.

Jim already countered the first statement, and anyone who has actually been following what is going on will know that it is false. People are participating on many levels, and work is progressing to make it possible for all OpenSolaris developers (inside or outside Sun) to commit to the repositories directly. The second part of the quote is where it gets silly.

First of all, the comparison with Linux is apples and oranges, to say the least. Linux started out from scratch, as a one-man project, in the open right from the start, and lacking many features in the beginning. OpenSolaris started by opening up a large codebase for a sophisticated OS, developed by a large group of engineers in a large company. This meant dealing with incompatible licenses in existing code, getting the infrastructure in place for what is expected of an open source OS these days, and setting things in motion to push development out in the open. There is just no way to compare the two.

Funnier is the "no Internet" part. Maybe Mr. Frye was misquoted, but that's just a silly statement. Yes, Linus had to sell his project door to door using a horse and carriage full of boot floppies, you know. And in the beginning he didn't even have a horse. Those were the days, the Internet-free days of 1991/1992. I remember them fondly.

It's even funnier when you consider that Mr. Frye offers advice in the interview, too. I don't know, somehow I don't think that Sun, a company that had a large part in powering the Internet boom, needs advice from someone who seems to think that there was no "Internet" in 1991/1992.

But, it's all good. If the competition feels the need to snipe at you like this, you're doing something right.

Wednesday May 17, 2006

Open Source Attitude

For my first blog entry, I figured I'd make some sweeping generalizations and be judgemental to boot. Ah, why not, I can get away with it, since no one's reading this ;-) I've been around in the Open Source world since before it was even called Open Source (he said, trying to play up his street cred; you can stop making gag gestures now...), and an important factor in the success of a project is the attitude of the community members. An "open attitude" is essential. Ok, that sounds very general and vague. What do I mean by that? I guess it means a few things:

  • Be open to external input. This may seem obvious, but often it does not seem to be. You've just put your code out there for everyone to see, and a lot of people will have an opinion on it, will find bugs in it, or can simply write a better version of what you did. If you've been in the business for 20 years and are proud of your work, some snotty 16-year-old kid fixing bugs in your code a day after you've released it may make you stop for a moment. Contain that kneejerk reflex, and consider that this kid has just *improved* your work, which can only be a good thing. Don't take it as personal criticism.
  • Open Source means that things move faster, since the number of people participating is much bigger. Don't get left behind. Be responsive, and work with the input you're getting from the community. If you get complacent, people will just take your source code and start their own project, leaving you in the dust. That is their right, and a good thing, but by the time you realize you've been left behind, a lot of time will have been wasted on duplicated code. Of course, it's entirely valid to reject contributions because they are not good enough or do not match the goals of the project. However, being unresponsive or just dropping contributions on the floor will eventually land your project in the Open Source graveyard.
  • Don't be a jerk. Again, that's pretty obvious, but in an Open Source community it tends to be a bigger issue. Within a company, there is a management structure that keeps people in check, since professional behavior is expected by those who control whether you have a job or not. An Open Source community is different: most contributors will probably not work on a project professionally, and may have a more cavalier attitude. This also means that people will leave more easily, or fork a project if they no longer feel welcome. In other words, an Open Source project can be much more volatile. Take this into account, and be polite. Too many splits in Open Source projects have happened because of personality issues, and that's a real shame.
There you have it, Uncle Frank's wise words on Open Source. Ignore them at your own peril leisure.


