Friday Jul 13, 2012

What's up with LDoms: Part 3 - A closer look at Disk Backend Choices

In this section, we'll have a closer look at virtual disk backends and the various choices available here.  As a little reminder, a disk backend, in LDoms speak, is the physical storage used when creating a virtual disk for a guest system.  In other virtualization solutions, these are sometimes called virtual disk images, a term that doesn't really fit all of the options available in LDoms.

In the previous example, we used a ZFS volume as a backend for the boot disk of mars.  But there are many other ways to store the data of virtual disks.  The relevant section in the Admin Guide lists all the available options:

  • Physical LUNs, in any variant that the Control Domain supports.  This of course includes SAN, iSCSI and SAS, as well as the internal disks of the host system.
  • Logical Volumes like ZFS Volumes, but also SVM or VxVM
  • Regular Files. These can be stored in any filesystem, as long as they're accessible by the LDoms subsystem. This includes storage on NFS.

Each of these backend types has its own set of characteristics that should be considered when deciding which one to use.  Let's look at them in a little more detail.

LUNs are the most generic option.  By assigning a virtual disk to a LUN backend, the guest essentially gains full access to the underlying storage device, whatever that might be.  It will see the volume label of the LUN, it can see and alter the partition table of the LUN, and it can read or set SCSI reservations on that device.  Depending on the way the LUN is connected to the host system, this very same LUN could also be attached to a second host and a guest residing on it, with the two guests sharing the data on that one LUN, or supporting live migration.  If there is a filesystem on the LUN, the guest will be able to mount that filesystem, just like any other system with access to that LUN, whether that access is virtualized or direct.  Bear in mind that most filesystems are non-shared filesystems.  This doesn't change here, either.  For the IO domain (that's the domain where the physical LUN is connected), LUNs mean the least possible amount of work.  All it has to do is pass data blocks up and down to and from the LUN; only a bare minimum of driver layers is involved.
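
To give a concrete idea, exporting a SAN LUN as an additional data disk for mars could look like the following sketch.  The device path and the volume name are made up for this example; you'd use the actual path of your LUN as seen in the control domain.

root@sun # ldm add-vdsdev /dev/dsk/c0t60060E80102E9F70511B4D4C00000022d0s2 \
           mars.data@primary-vds
root@sun # ldm add-vdisk data mars.data@primary-vds mars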

Flat files, on the other hand, are the simplest option, very similar in user experience to what one would do in a desktop hypervisor like VirtualBox.  The easiest way to create one is with the "mkfile" command.  For the guest, there is no real difference from LUNs.  The virtual disk will, just like in the LUN case, appear to be a full disk, partition table, label and all.  Of course, initially it'll be completely empty, so the first thing the guest usually needs to do is write a label to the disk.  The main difference from LUNs is in the way these image files are managed.  Since they are files in a filesystem, they can be copied, moved and deleted, all of which should be done with care, especially if the guest is still running.  They can be managed by the filesystem, which means attributes like compression, encryption or deduplication in ZFS can apply to them - fully transparent to the guest.  If the filesystem is a shared filesystem like NFS or SAM-FS, the file (and thus the disk image) can be shared by another LDom on another system, for example as a shared database disk or for live migration.  Their performance will be impacted by the filesystem, too.  The IO domain might cache some of the file, hoping to speed up operations.  If there are many such image files on a single filesystem, they might impact each other's performance.  These files, by the way, need not be empty initially.  A typical use case would be a Solaris iso image file.  Adding it to a guest as a virtual disk will allow that guest to boot (and install) off that iso image as if it were a physical CD drive.
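
Here's a small sketch of the file case - the paths, sizes and names are just examples, and the iso filename is hypothetical:

root@sun # mkfile 32g /guests/mars/data1.img
root@sun # ldm add-vdsdev /guests/mars/data1.img mars.data1@primary-vds
root@sun # ldm add-vdisk data1 mars.data1@primary-vds mars

root@sun # ldm add-vdsdev /isos/sol-11-sparc.iso mars.iso@primary-vds
root@sun # ldm add-vdisk iso mars.iso@primary-vds mars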

Finally, there are logical volumes, typically created with volume managers such as Solaris Volume Manager (SVM), Veritas Volume Manager (VxVM) or, of course, ZFS.  For the guest, again, these look just like ordinary disks, very much like files.  The difference from files is in the management layer: the logical volumes are created straight from the underlying storage, without a filesystem layer in between.  In the database world, we would call these "raw devices", and their device names in Solaris are very similar to those of physical LUNs.  We need different commands to find out how large these volumes are, or how much space is left on the storage devices underneath.  Other than that, however, they are very similar to files in many ways.  Sharing them between two host systems is likely to be more complex, as one would need the corresponding cluster volume managers, which typically only really work in combination with Solaris Cluster.  One type of volume that deserves special mention is the ZFS volume.  It offers all the features of a normal ZFS dataset: clones, snapshots, compression, encryption, deduplication, etc.  Especially with snapshots and clones, ZFS volumes lend themselves as the ideal backend for all use cases that make heavy use of these features.
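
As a sketch of how this might be used, here's how a new guest's boot disk could be cloned from a snapshot of an existing "golden" volume.  The dataset names and the guest "venus" are made up for this example.

root@sun # zfs snapshot rpool/guests/golden.bootdisk@ready
root@sun # zfs clone rpool/guests/golden.bootdisk@ready rpool/guests/venus.bootdisk
root@sun # ldm add-vdsdev /dev/zvol/dsk/rpool/guests/venus.bootdisk \
           venus.root@primary-vds
root@sun # ldm add-vdisk root venus.root@primary-vds venus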

For the sake of completeness, I'd like to mention that you can export all of these backends to a guest with or without the "slice" option, something that I consider less useful in most cases, which is why I'd like to refer you to the relevant section in the admin guide if you want to know more about this.

Lastly, you have the option of exporting these backends read-only, to prevent any changes from the guests.  Keep in mind that even mounting a UFS filesystem read-only would require a write operation to the virtual disk.  The most typical use case for this is probably an iso image, which can indeed be mounted read-only.  You can also export one backend to more than one guest.  In the physical world, this would correspond to using the same SAN LUN on several hosts, and the same restrictions with regard to shared filesystems etc. apply.
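
For example, an iso image could be exported read-only to two guests as sketched below.  The filename and the second guest "venus" are made up, and the details of the "options" and "-f" flags of "ldm add-vdsdev" are described in the admin guide.

root@sun # ldm add-vdsdev options=ro /isos/sol-11-sparc.iso mars.iso@primary-vds
root@sun # ldm add-vdisk iso mars.iso@primary-vds mars
root@sun # ldm add-vdsdev -f options=ro /isos/sol-11-sparc.iso venus.iso@primary-vds
root@sun # ldm add-vdisk iso venus.iso@primary-vds venus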

So now that we know about all these different options, when should we use which kind of backend?  The answer, as usual, is: It depends!

LUNs require a SAN (or iSCSI) infrastructure which we tend to associate with higher cost.  On the other hand, they can be shared between many hosts, are easily mapped from host to host and bring a rich feature set of storage management and redundancy with them.  I recommend LUNs (especially SAN) for both boot devices and data disks of guest systems in production environments.  My main reasons for this are:

  • They are very light-weight on the IO domain
  • They avoid any double buffering of data in the guest and in the IO domain because there is no filesystem layer involved in the IO domain.
  • Redundancy for the device and the data path is easy
  • They allow sharing between hosts, which in turn allows cluster implementations and live migration
  • All ZFS features can be implemented in the guest, if desired.

For test and development, my first choice is usually the ZFS volume.  Unlike VxVM, it comes free of charge, and its features like snapshots and clones meet the typical requirements of such environments: quickly creating, copying and destroying test environments.  I explicitly recommend against using ZFS snapshots/clones (files or volumes) over a longer period of time.  Since ZFS records the delta between the original image and the clones, the space overhead will eventually grow to a multiple of the initial size and may eventually even prevent further IO to the virtual disk if the zpool is full.  Also keep in mind that ZFS is not a shared filesystem.  This prevents guests that use ZFS files or volumes as virtual disks from doing live migration - which leads directly to the recommendation for files a little further below.
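
If you do work with clones for a while, it's worth keeping an eye on the space they consume.  A quick way to check, assuming the dataset layout used in these examples:

root@sun # zfs list -r -t all -o name,used,refer rpool/guests
root@sun # zpool list rpool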

I recommend files on NFS (or other shared filesystems) in all those cases where SAN LUNs are not available but shared access to disk images is required because of live migration (or because cluster software like Solaris Cluster or RAC is running in the guests).  The functionality is mostly the same as for LUNs, with the exception of SCSI reservations, which don't work with a file backend.  However, CPU requirements in the IO domain and the performance of NFS files as compared to SAN LUNs are likely to be different, which is why I strongly recommend using SAN LUNs for all production use cases.

Friday Jun 29, 2012

What's up with LDoms: Part 2 - Creating a first, simple guest

Welcome back!

In the first part, we discussed the basic concepts of LDoms and how to configure a simple control domain.  We saw how resources were put aside for guest systems and what infrastructure we need for them.  With that, we are now ready to create a first, very simple guest domain.  In this first example, we'll keep things very simple.  Later on, we'll have a detailed look at things like sizing, IO redundancy, other types of IO as well as security.

For now, let's start with this very simple guest.  It'll have one core's worth of CPU, one crypto unit, 8GB of RAM, a single boot disk and one network port.  (If this were a T4 system, we wouldn't have to assign the crypto units.  Since this is a T3, it makes lots of sense to do so.)  CPU and RAM are easy.  The network port we'll create by attaching a virtual network port to the vswitch we created in the primary domain.  This is very much like plugging a cable into a computer system on one end and a network switch on the other.  For the boot disk, we'll need two things: a physical piece of storage to hold the data - this is called the backend device in LDoms speak - and a mapping between that storage and the guest domain, giving it access to that virtual disk.  For this example, we'll use a ZFS volume for the backend.  We'll discuss what other options there are for this and how to choose the right one in a later article.  Here we go:

root@sun # ldm create mars

root@sun # ldm set-vcpu 8 mars 
root@sun # ldm set-mau 1 mars 
root@sun # ldm set-memory 8g mars

root@sun # zfs create rpool/guests
root@sun # zfs create -V 32g rpool/guests/mars.bootdisk
root@sun # ldm add-vdsdev /dev/zvol/dsk/rpool/guests/mars.bootdisk \
           mars.root@primary-vds
root@sun # ldm add-vdisk root mars.root@primary-vds mars

root@sun # ldm add-vnet net0 switch-primary mars

That's all, mars is now ready to power on.  There are just three commands between us and the OK prompt of mars:  We have to "bind" the domain, start it and connect to its console.  Binding is the process where the hypervisor actually puts all the pieces that we've configured together.  If we made a mistake, binding is where we'll be told (starting in version 2.1, a lot of sanity checking has been put into the config commands themselves, but binding will catch everything else).  Once bound, we can start (and of course later stop) the domain, which will trigger the boot process of OBP.  By default, the domain will then try to boot right away.  If we don't want that, we can set "auto-boot?" to false.  Finally, we'll use telnet to connect to the console of our newly created guest.  The output of "ldm list" shows us what port has been assigned to mars.  By default, the console service only listens on the loopback interface, so using telnet is not a large security concern here.

root@sun # ldm set-variable auto-boot\?=false mars
root@sun # ldm bind mars
root@sun # ldm start mars 

root@sun # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  UART    8     7680M    0.5%  1d 4h 30m
mars             active     -t----  5000    8     8G        12%  1s

root@sun # telnet localhost 5000

Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

~Connecting to console "mars" in group "mars" ....
Press ~? for control options ..

{0} ok banner

SPARC T3-4, No Keyboard
Copyright (c) 1998, 2011, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.33.1, 8192 MB memory available, Serial # 87203131.
Ethernet address 0:21:28:24:1b:50, Host ID: 85241b50.

{0} ok 

We're done, mars is ready to install Solaris, preferably using AI, of course ;-)  But before we do that, let's have a little look at the OBP environment to see how our virtual devices show up here:

{0} ok printenv auto-boot?
auto-boot? =            false

{0} ok printenv boot-device
boot-device =           disk net

{0} ok devalias
root                     /virtual-devices@100/channel-devices@200/disk@0
net0                     /virtual-devices@100/channel-devices@200/network@0
net                      /virtual-devices@100/channel-devices@200/network@0
disk                     /virtual-devices@100/channel-devices@200/disk@0
virtual-console          /virtual-devices/console@1
name                     aliases

We can see that setting the OBP variable "auto-boot?" to false with the ldm command worked.  Of course, we'd normally set this to "true" to allow Solaris to boot right away once the LDom guest is started.  The setting for "boot-device" is the default "disk net", which means OBP would try to boot off the devices pointed to by the aliases "disk" and "net" in that order, which usually means "disk" once Solaris is installed on the disk image.  The actual devices these aliases point to are shown with the command "devalias".  Here, we have one line each for "disk" and "net".  The device paths speak for themselves.  Note that each of these devices has a second alias: "net0" for the network device and "root" for the disk device.  These are the very same names we've given these devices in the control domain with the commands "ldm add-vnet" and "ldm add-vdisk".  Remember this, as it is very useful once you have several dozen disk devices...
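
Once Solaris is installed, you'd normally want the guest to come up on its own.  Just as a sketch, both settings could be made from the control domain; "root" here is the disk alias we created above, and "disk" would point to the same device.

root@sun # ldm set-variable auto-boot\?=true mars
root@sun # ldm set-variable boot-device=root mars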

To wrap this up, in this part we've created a simple guest domain, complete with CPU, memory, boot disk and network connectivity.  This should be enough to get you going.  I will cover all the more advanced features and a little more theoretical background in several follow-on articles.  For some background reading, I'd recommend the following links:

What's up with LDoms: Part 1 - Introduction & Basic Concepts

LDoms - the correct name is Oracle VM Server for SPARC - have been around for quite a while now.  But to my surprise, I get more and more requests to explain how they work or to give advice on how to make good use of them.  This made me think that writing up a few articles discussing the different features would be a good idea.  Now - I don't intend to rewrite the LDoms Admin Guide or to copy and reformat the (hopefully) well known "Beginners Guide to LDoms" by Tony Shoumack from 2007.  Those documents are highly recommended - especially the Beginners Guide, which, although based on LDoms 1.0, is still a good place to start.  However, LDoms have come a long way since then, and I hope to contribute to their adoption by discussing how they work and what features there are today.

 In this and the following posts, I will use the term "LDoms" as a common abbreviation for Oracle VM Server for SPARC, just because it's a lot shorter and easier to type (and presumably, read).

So, just to get everyone on the same baseline, let's briefly discuss the basic concepts of virtualization with LDoms.  LDoms make use of a hypervisor as a layer of abstraction between real, physical hardware and virtual hardware.  This virtual hardware is then used to create a number of guest systems which each behave very similarly to a system running on bare metal:  Each has its own OBP, each will install its own copy of the Solaris OS and each will see a certain amount of CPU, memory, disk and network resources available to it.  Unlike some other type 1 hypervisors running on x86 hardware, the SPARC hypervisor is embedded in the system firmware and makes use of both supporting functions in the sun4v SPARC instruction set and the overall CPU architecture to fulfill its function.

The CMT architecture of the supporting CPUs (T1 through T4) provides a large number of cores and threads to the OS.  For example, the current T4 CPU has eight cores, each running 8 threads, for a total of 64 threads per socket.  To the OS, this looks like 64 CPUs.
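
If you want to see this for yourself, a quick check from within Solaris on a (hypothetical) single-socket T4 system might look like this - the output shown is just illustrative:

root@sun # psrinfo -p
1
root@sun # psrinfo | wc -l
      64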

The SPARC hypervisor, when creating guest systems, simply assigns a certain number of these threads exclusively to one guest, thus avoiding the overhead of scheduling OS threads onto CPUs that typical x86 hypervisors incur.  The hypervisor only assigns CPUs and then steps aside.  It is not involved in the actual work being dispatched from the OS to the CPU; all it does is maintain isolation between different guests.

Likewise, memory is assigned exclusively to individual guests.  Here, the hypervisor provides generic mappings between the physical hardware addresses and the guest's view of memory.  Again, the hypervisor is not involved in the actual memory access, it only maintains isolation between guests.

During the initial setup of a system with LDoms, you start with one special domain, called the Control Domain.  Initially, this domain owns all the hardware available in the system, including all CPUs, all RAM and all IO resources.  If you were running the system un-virtualized, this is what you'd be working with.  To allow for guests, you first resize this initial domain (also called the primary domain in LDoms speak), assigning it a small amount of CPU and memory.  This frees up most of the available CPU and memory resources for guest domains.

IO is a little more complex, but still very straightforward.  When LDoms 1.0 first came out, the only way to provide IO to guest systems was to create virtual disk and network services and attach guests to these services.  In the meantime, several different ways to connect guest domains to IO have been developed, the most recent one being SR-IOV support for network devices, released in version 2.2 of Oracle VM Server for SPARC.  I will cover these more advanced features in detail later.  For now, let's have a short look at the initial way IO was virtualized in LDoms:

For virtualized IO, you create two services, one "Virtual Disk Service" or vds, and one "Virtual Switch" or vswitch.  You can, of course, also create more of these, but that's more advanced than I want to cover in this introduction.  These IO services connect real, physical IO resources like a disk LUN or a network port to the virtual devices that are assigned to guest domains.  For disk IO, the normal case would be to connect a physical LUN (or some other storage option that I'll discuss later) to one specific guest.  That guest would be assigned a virtual disk, which would appear to be just like a real LUN to the guest, while the IO is actually routed through the virtual disk service down to the physical device.  For network, the vswitch acts very much like a real, physical Ethernet switch - you connect one physical port to it for outside connectivity and define one or more connections per guest, just like you would plug cables between a real switch and a real system.  For completeness, there is another service that provides console access to guest domains, which mimics the behavior of serial terminal servers.

The connections between the virtual devices on the guest's side and the virtual IO services in the primary domain are created by the hypervisor.  It uses so-called "Logical Domain Channels" or LDCs to create point-to-point connections between all of these devices and services.  These LDCs work very similarly to high speed serial connections and are configured automatically whenever the Control Domain adds or removes virtual IO.
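
If you're curious, you can see these channels with "ldm list-bindings", which lists the virtual devices of a domain together with the LDC IDs they use, for example (output not shown here):

root@sun # ldm list-bindings primary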

To see all this in action, let's now look at a first example.  I will start with a newly installed machine and configure the control domain so that it's ready to create guest systems.

In a first step, after we've installed the software, let's start the virtual console service and downsize the primary domain. 

root@sun # ldm list
NAME     STATE    FLAGS   CONS  VCPU  MEMORY   UTIL  UPTIME
primary  active   -n-c--  UART  512   261632M  0.3%  2d 13h 58m

root@sun # ldm add-vconscon port-range=5000-5100 \
               primary-console primary
root@sun # svcadm enable vntsd
root@sun # svcs vntsd
STATE          STIME    FMRI
online          9:53:21 svc:/ldoms/vntsd:default

root@sun # ldm set-vcpu 16 primary
root@sun # ldm set-mau 1 primary
root@sun # ldm start-reconf primary
root@sun # ldm set-memory 7680m primary
root@sun # ldm add-config initial
root@sun # shutdown -y -g0 -i6 

So what have I done:

  • I've defined a range of ports (5000-5100) for the virtual network terminal service and then started that service.  The vntsd daemon will later provide console connections to guest systems, very much like serial network terminal servers do in the physical world.
  • Next, I assigned 16 vCPUs (on this platform, a T3-4, that's two cores) to the primary domain, freeing the rest up for future guest systems.  I also assigned one MAU to this domain.  A MAU is a crypto unit in the T3 CPU.  These need to be explicitly assigned to domains, just like CPU or memory.  (This is no longer the case with T4 systems, where crypto is always available everywhere.)
  • Before I reassigned the memory, I started what's called a "delayed reconfiguration" session.  That avoids actually doing the change right away, which would take a considerable amount of time in this case.  Instead, I'll need to reboot once I'm all done.  I've assigned 7680MB of RAM to the primary.  That's 8GB less the 512MB which the hypervisor uses for its own private purposes.  You can, depending on your needs, work with less.  I'll spend a dedicated article on sizing, discussing the pros and cons in detail.
  • Finally, just before the reboot, I saved my work on the ILOM, to make this configuration available after a power-cycle of the box.  (It'll always be available after a simple reboot, but the ILOM needs to know the configuration of the hypervisor after a power-cycle, before the primary domain is booted.)  A quick way to check the saved configurations is shown right after this list.
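
To check which configurations are stored on the ILOM, and which one is currently active, you can use "ldm list-config".  The output shown here is just a sketch of what to expect after saving the "initial" configuration:

root@sun # ldm list-config
factory-default
initial [current]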

Now, let's create a first disk service and a first virtual switch which is connected to the physical network device igb2.  We will later use these to connect virtual disks and virtual network ports of our guest systems to real world storage and network.

root@sun # ldm add-vds primary-vds
root@sun # ldm add-vswitch net-dev=igb2 switch-primary primary

You are free to choose whatever names you like for the virtual disk service and the virtual switch.  I strongly recommend that you choose names that make sense to you and describe the function of each service in the context of your implementation.  For the vswitch, for example, you could choose names like "admin-vswitch" or "production-network" etc.

This already concludes the configuration of the control domain.  We've freed up considerable amounts of CPU and RAM for guest systems and created the necessary infrastructure - console, vds and vswitch - so that guest systems can actually interact with the outside world.  The system is now ready to create guests, which I'll describe in the next section.

Wednesday Jan 26, 2011

Logical Domains - sure secure

LDoms - Oracle VM Server for SPARC - are being used far and wide.  And I've been asked several times how secure they actually are.  One customer especially wanted to be very, very sure, so we asked for independent expertise on the subject matter.  The results were quite pleasing, but not exactly nighttime literature, so I decided to add some generic deployment recommendations to the core results and came up with a whitepaper.  Publishing was delayed a bit due to the change of ownership, which resulted in a significant change in process.  The good thing about that is that it's now also up to date with the latest release of the software.  I am now happy and proud to present:


Secure Deployment of Oracle VM for SPARC


A big Thank You to Steffen Gundel of Cirosec, who laid the foundation for this paper with his study.


I do hope that it will be useful to some of you!


Enjoy!
