Freitag Jun 29, 2012

What's up with LDoms: Part 2 - Creating a first, simple guest

Welcome back!

In the first part, we discussed the basic concepts of LDoms and how to configure a simple control domain.  We saw how resources were put aside for guest systems and what infrastructure we need for them.  With that, we are now ready to create a first, very simple guest domain.  In this first example, we'll keep things very simple.  Later on, we'll have a detailed look at things like sizing, IO redundancy, other types of IO as well as security.

For now,let's start with this very simple guest.  It'll have one core's worth of CPU, one crypto unit, 8GB of RAM, a single boot disk and one network port.  (If this were a T4 system, we'd not have to assign the crypto units.  Since this is T3, it makes lots of sense to do so.)  CPU and RAM are easy.  The network port we'll create by attaching a virtual network port to the vswitch we created in the primary domain.  This is very much like plugging a cable into a computer system on one end and a network switch on the other.  For the boot disk, we'll need two things: A physical piece of storage to hold the data - this is called the backend device in LDoms speak.  And then a mapping between that storage and the guest domain, giving it access to that virtual disk.  For this example, we'll use a ZFS volume for the backend.  We'll discuss what other options there are for this and how to chose the right one in a later article.  Here we go:

root@sun # ldm create mars

root@sun # ldm set-vcpu 8 mars 
root@sun # ldm set-mau 1 mars 
root@sun # ldm set-memory 8g mars

root@sun # zfs create rpool/guests
root@sun # zfs create -V 32g rpool/guests/mars.bootdisk
root@sun # ldm add-vdsdev /dev/zvol/dsk/rpool/guests/mars.bootdisk \
root@sun # ldm add-vdisk root mars.root@primary-vds mars

root@sun # ldm add-vnet net0 switch-primary mars

That's all, mars is now ready to power on.  There are just three commands between us and the OK prompt of mars:  We have to "bind" the domain, start it and connect to its console.  Binding is the process where the hypervisor actually puts all the pieces that we've configured together.  If we made a mistake, binding is where we'll be told (starting in version 2.1, a lot of sanity checking has been put into the config commands themselves, but binding will catch everything else).  Once bound, we can start (and of course later stop) the domain, which will trigger the boot process of OBP.  By default, the domain will then try to boot right away.  If we don't want that, we can set "auto-boot?" to false.  Finally, we'll use telnet to connect to the console of our newly created guest.  The output of "ldm list" shows us what port has been assigned to mars.  By default, the console service only listens on the loopback interface, so using telnet is not a large security concern here.

root@sun # ldm set-variable auto-boot\?=false mars
root@sun # ldm bind mars
root@sun # ldm start mars 

root@sun # ldm list
primary          active     -n-cv-  UART    8     7680M    0.5%  1d 4h 30m
mars             active     -t----  5000    8     8G        12%  1s

root@sun # telnet localhost 5000

Connected to localhost.
Escape character is '^]'.

~Connecting to console "mars" in group "mars" ....
Press ~? for control options ..

{0} ok banner

SPARC T3-4, No Keyboard
Copyright (c) 1998, 2011, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.33.1, 8192 MB memory available, Serial # 87203131.
Ethernet address 0:21:28:24:1b:50, Host ID: 85241b50.

{0} ok 

We're done, mars is ready to install Solaris, preferably using AI, of course ;-)  But before we do that, let's have a little look at the OBP environment to see how our virtual devices show up here:

{0} ok printenv auto-boot?
auto-boot? =            false

{0} ok printenv boot-device
boot-device =           disk net

{0} ok devalias
root                     /virtual-devices@100/channel-devices@200/disk@0
net0                     /virtual-devices@100/channel-devices@200/network@0
net                      /virtual-devices@100/channel-devices@200/network@0
disk                     /virtual-devices@100/channel-devices@200/disk@0
virtual-console          /virtual-devices/console@1
name                     aliases

We can see that setting the OBP variable "auto-boot?" to false with the ldm command worked.  Of course, we'd normally set this to "true" to allow Solaris to boot right away once the LDom guest is started.  The setting for "boot-device" is the default "disk net", which means OBP would try to boot off the devices pointed to by the aliases "disk" and "net" in that order, which usually means "disk" once Solaris is installed on the disk image.  The actual devices these aliases point to are shown with the command "devalias".  Here, we have one line for both "disk" and "net".  The device paths speak for themselves.  Note that each of these devices has a second alias: "net0" for the network device and "root" for the disk device.  These are the very same names we've given these devices in the control domain with the commands "ldm add-vnet" and "ldm add-vdisk".  Remember this, as it is very useful once you have several dozen disk devices...

To wrap this up, in this part we've created a simple guest domain, complete with CPU, memory, boot disk and network connectivity.  This should be enough to get you going.  I will cover all the more advanced features and a little more theoretical background in several follow-on articles.  For some background reading, I'd recommend the following links:

What's up with LDoms: Part 1 - Introduction & Basic Concepts

LDoms - the correct name is Oracle VM Server for SPARC - have been around for quite a while now.  But to my surprise, I get more and more requests to explain how they work or to give advise on how to make good use of them.  This made me think that writing up a few articles discussing the different features would be a good idea.  Now - I don't intend to rewrite the LDoms Admin Guide or to copy and reformat the (hopefully) well known "Beginners Guide to LDoms" by Tony Shoumack from 2007.  Those documents are very recommendable - especially the Beginners Guide, although based on LDoms 1.0, is still a good place to begin with.  However, LDoms have come a long way since then, and I hope to contribute to their adoption by discussing how they work and what features there are today.

 In this and the following posts, I will use the term "LDoms" as a common abbreviation for Oracle VM Server for SPARC, just because it's a lot shorter and easier to type (and presumably, read).

So, just to get everyone on the same baseline, lets briefly discuss the basic concepts of virtualization with LDoms.  LDoms make use of a hypervisor as a layer of abstraction between real, physical hardware and virtual hardware.  This virtual hardware is then used to create a number of guest systems which each behave very similar to a system running on bare metal:  Each has its own OBP, each will install its own copy of the Solaris OS and each will see a certain amount of CPU, memory, disk and network resources available to it.  Unlike some other type 1 hypervisors running on x86 hardware, the SPARC hypervisor is embedded in the system firmware and makes use both of supporting functions in the sun4v SPARC instruction set as well as the overall CPU architecture to fulfill its function.

The CMT architecture of the supporting CPUs (T1 through T4) provide a large number of cores and threads to the OS.  For example, the current T4 CPU has eight cores, each running 8 threads, for a total of 64 threads per socket.  To the OS, this looks like 64 CPUs. 

The SPARC hypervisor, when creating guest systems, simply assigns a certain number of these threads exclusively to one guest, thus avoiding the overhead of having to schedule OS threads to CPUs, as do typical x86 hypervisors.  The hypervisor only assigns CPUs and then steps aside.  It is not involved in the actual work being dispatched from the OS to the CPU, all it does is maintain isolation between different guests.

Likewise, memory is assigned exclusively to individual guests.  Here,  the hypervisor provides generic mappings between the physical hardware addresses and the guest's views on memory.  Again, the hypervisor is not involved in the actual memory access, it only maintains isolation between guests.

During the inital setup of a system with LDoms, you start with one special domain, called the Control Domain.  Initially, this domain owns all the hardware available in the system, including all CPUs, all RAM and all IO resources.  If you'd be running the system un-virtualized, this would be what you'd be working with.  To allow for guests, you first resize this initial domain (also called a primary domain in LDoms speak), assigning it a small amount of CPU and memory.  This frees up most of the available CPU and memory resources for guest domains. 

IO is a little more complex, but very straightforward.  When LDoms 1.0 first came out, the only way to provide IO to guest systems was to create virtual disk and network services and attach guests to these services.  In the meantime, several different ways to connect guest domains to IO have been developed, the most recent one being SR-IOV support for network devices released in version 2.2 of Oracle VM Server for SPARC. I will cover these more advanced features in detail later.  For now, lets have a short look at the initial way IO was virtualized in LDoms:

For virtualized IO, you create two services, one "Virtual Disk Service" or vds, and one "Virtual Switch" or vswitch.  You can, of course, also create more of these, but that's more advanced than I want to cover in this introduction.  These IO services now connect real, physical IO resources like a disk LUN or a networt port to the virtual devices that are assigned to guest domains.  For disk IO, the normal case would be to connect a physical LUN (or some other storage option that I'll discuss later) to one specific guest.  That guest would be assigned a virtual disk, which would appear to be just like a real LUN to the guest, while the IO is actually routed through the virtual disk service down to the physical device.  For network, the vswitch acts very much like a real, physical ethernet switch - you connect one physical port to it for outside connectivity and define one or more connections per guest, just like you would plug cables between a real switch and a real system. For completeness, there is another service that provides console access to guest domains which mimics the behavior of serial terminal servers.

The connections between the virtual devices on the guest's side and the virtual IO services in the primary domain are created by the hypervisor.  It uses so called "Logical Domain Channels" or LDCs to create point-to-point connections between all of these devices and services.  These LDCs work very similar to high speed serial connections and are configured automatically whenever the Control Domain adds or removes virtual IO.

To see all this in action, now lets look at a first example.  I will start with a newly installed machine and configure the control domain so that it's ready to create guest systems.

In a first step, after we've installed the software, let's start the virtual console service and downsize the primary domain. 

root@sun # ldm list
primary  active   -n-c--  UART  512   261632M  0.3%  2d 13h 58m

root@sun # ldm add-vconscon port-range=5000-5100 \
               primary-console primary
root@sun # svcadm enable vntsd
root@sun # svcs vntsd
STATE          STIME    FMRI
online          9:53:21 svc:/ldoms/vntsd:default

root@sun # ldm set-vcpu 16 primary
root@sun # ldm set-mau 1 primary
root@sun # ldm start-reconf primary
root@sun # ldm set-memory 7680m primary
root@sun # ldm add-config initial
root@sun # shutdown -y -g0 -i6 

So what have I done:

  • I've defined a range of ports (5000-5100) for the virtual network terminal service and then started that service.  The vnts will later provide console connections to guest systems, very much like serial NTS's do in the physical world.
  • Next, I assigned 16 vCPUs (on this platform, a T3-4, that's two cores) to the primary domain, freeing the rest up for future guest systems.  I also assigned one MAU to this domain.  A MAU is a crypto unit in the T3 CPU.  These need to be explicitly assigned to domains, just like CPU or memory.  (This is no longer the case with T4 systems, where crypto is always available everywhere.)
  • Before I reassigned the memory, I started what's called a "delayed reconfiguration" session.  That avoids actually doing the change right away, which would take a considerable amount of time in this case.  Instead, I'll need to reboot once I'm all done.  I've assigned 7680MB of RAM to the primary.  That's 8GB less the 512MB which the hypervisor uses for it's own private purposes.  You can, depending on your needs, work with less.  I'll spend a dedicated article on sizing, discussing the pros and cons in detail.
  • Finally, just before the reboot, I saved my work on the ILOM, to make this configuration available after a powercycle of the box.  (It'll always be available after a simple reboot, but the ILOM needs to know the configuration of the hypervisor after a power-cycle, before the primary domain is booted.)

Now, lets create a first disk service and a first virtual switch which is connected to the physical network device igb2. We will later use these to connect virtual disks and virtual network ports of our guest systems to real world storage and network.

root@sun # ldm add-vds primary-vds primary
root@sun # ldm add-vswitch net-dev=igb2 switch-primary primary

You are free to choose whatever names you like for the virtual disk service and the virtual switch.  I strongly recommend that you choose names that make sense to you and describe the function of each service in the context of your implementation.  For the vswitch, for example, you could choose names like "admin-vswitch" or "production-network" etc.

This already concludes the configuration of the control domain.  We've freed up considerable amounts of CPU and RAM for guest systems and created the necessary infrastructure - console, vts and vswitch - so that guests systems can actually interact with the outside world.  The system is now ready to create guests, which I'll describe in the next section.

For further reading, here are some recommendable links:

Oracle VM Server for SPARC Demo Videos

I just stumbled across several well done demos for newer LDoms features.  Find them all in the youtube channel "Oracle VM Server for SPARC".  I'd like to recommend the ones about power management and cross CPU migration specifically :-)

Neuigkeiten, Tipps und Wissenswertes rund um SPARC, CMT, Performance und ihre Analyse sowie Erfahrungen mit Solaris auf dem Server und dem Laptop.

This is a bilingual blog (most of the time). Please select your prefered language:
The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.


« June 2012 »