Monday Feb 24, 2014

What's up with LDoms: Part 8 - Physical IO

[Figure: Virtual IO Setup]

Finally finding some time to continue this blog series...  And starting the new year with a new chapter for which I hope to write several sections: Physical IO options for LDoms and what you can do with them.  In all previous sections, we talked about virtual IO and how to deal with it.  The diagram at the right shows the general architecture of such virtual IO configurations.  However, there's much more to IO than that.

From an architectural point of view, the primary task of the SPARC hypervisor is partitioning of the system.  The hypervisor isn't usually very active - all it does is assign ownership of some parts of the hardware (CPU, memory, IO resources) to a domain, build a virtual machine from these components and finally start OpenBoot in that virtual machine.  After that, the hypervisor essentially steps aside.  Only if the IO components are virtual do we need hypervisor support.  But those IO components could also be physical.  Actually, that is the more "natural" option, if you like.  So let's revisit the creation of a domain:

We always start by assigning CPU and memory in a few very simple steps:

root@sun:~# ldm create mars
root@sun:~# ldm set-memory 8g mars
root@sun:~# ldm set-core 8 mars

If we now bound and started the domain, we would have OpenBoot running and we could connect using the virtual console.  Of course, since this domain doesn't have any IO devices, we couldn't yet do anything particularly useful with it.  Since we want to add physical IO devices, where are they?

To begin with, all physical components are owned by the primary domain.  This is the same for IO devices, just like it is for CPU and memory.  So just like we need to remove some CPU and memory from the primary domain in order to assign these to other domains, we will have to remove some IO from the primary if we want to assign it to another domain.  A general inventory of available IO resources can be obtained with the "ldm ls-io" command:

root@sun:~# ldm ls-io
NAME                                      TYPE   BUS      DOMAIN   STATUS  
----                                      ----   ---      ------   ------  
pci_0                                     BUS    pci_0    primary          
pci_1                                     BUS    pci_1    primary          
pci_2                                     BUS    pci_2    primary          
pci_3                                     BUS    pci_3    primary          
/SYS/MB/PCIE1                             PCIE   pci_0    primary  EMP     
/SYS/MB/SASHBA0                           PCIE   pci_0    primary  OCC
/SYS/MB/NET0                              PCIE   pci_0    primary  OCC     
/SYS/MB/PCIE5                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE6                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE7                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE2                             PCIE   pci_2    primary  EMP     
/SYS/MB/PCIE3                             PCIE   pci_2    primary  OCC     
/SYS/MB/PCIE4                             PCIE   pci_2    primary  EMP     
/SYS/MB/PCIE8                             PCIE   pci_3    primary  EMP     
/SYS/MB/SASHBA1                           PCIE   pci_3    primary  OCC     
/SYS/MB/NET2                              PCIE   pci_3    primary  OCC     
/SYS/MB/NET0/IOVNET.PF0                   PF     pci_0    primary          
/SYS/MB/NET0/IOVNET.PF1                   PF     pci_0    primary          
/SYS/MB/NET2/IOVNET.PF0                   PF     pci_3    primary          
/SYS/MB/NET2/IOVNET.PF1                   PF     pci_3    primary

The output of this command will of course vary greatly, depending on the type of system you have.  The above example is from a T5-2.  As you can see, there are several types of IO resources.  Specifically, there are

  • BUS
    This is a whole PCI bus, which means everything controlled by a single PCI control unit, also called a PCI root complex.  It typically contains several PCI slots and possibly some end point devices like SAS or network controllers.
  • PCIE
    This is either a single PCIe slot or an endpoint device.  A slot's name corresponds to the slot number you will find imprinted on the system chassis.  It is controlled by the root complex listed in the "BUS" column.  In the above example, you can see that some slots are empty, while others are occupied.  An endpoint device is something like a SAS HBA or network controller, for example "/SYS/MB/SASHBA0" or "/SYS/MB/NET2".  Both of these typically control more than one actual device, so for example, SASHBA0 would control 4 internal disks and NET2 would control 2 internal network ports.
  • PF
    This is an SR-IOV Physical Function - usually an endpoint device like a network port which is capable of PCI virtualization.  We will cover SR-IOV in a later section of this blog.
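To see at a glance which slots and endpoint devices sit behind a given root complex, the listing can be filtered on the BUS column.  Here is a minimal sketch in plain shell and awk.  The function name devices_on_bus and the embedded sample (a shortened copy of the T5-2 listing above) are only for illustration and are not part of ldm.  On a live system you would pipe the real "ldm ls-io" output into the function instead.

```shell
#!/bin/sh
# Shortened copy of the "ldm ls-io" listing above (T5-2), used as test input.
sample_ls_io() {
cat <<'EOF'
NAME                                      TYPE   BUS      DOMAIN   STATUS
----                                      ----   ---      ------   ------
pci_0                                     BUS    pci_0    primary
pci_3                                     BUS    pci_3    primary
/SYS/MB/PCIE1                             PCIE   pci_0    primary  EMP
/SYS/MB/SASHBA0                           PCIE   pci_0    primary  OCC
/SYS/MB/NET0                              PCIE   pci_0    primary  OCC
/SYS/MB/PCIE8                             PCIE   pci_3    primary  EMP
/SYS/MB/SASHBA1                           PCIE   pci_3    primary  OCC
/SYS/MB/NET2                              PCIE   pci_3    primary  OCC
/SYS/MB/NET2/IOVNET.PF0                   PF     pci_3    primary
EOF
}

# devices_on_bus BUS
# Reads "ldm ls-io" style text on stdin, skips the two header lines and
# prints the name of every PCIE entry (slot or endpoint) behind BUS.
devices_on_bus() {
    awk -v bus="$1" 'NR > 2 && $2 == "PCIE" && $3 == bus { print $1 }'
}

sample_ls_io | devices_on_bus pci_3
```

For pci_3 this prints PCIE8, SASHBA1 and NET2, which is exactly the set of devices that moves along when the whole bus changes its owner.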

All of these devices are available for assignment.  Right now, they are all owned by the primary domain.  We will now release some of them from the primary domain and assign them to a different domain.  Unfortunately, this is not a dynamic operation, so we will have to reboot the control domain (more precisely, the affected domains) once to complete this.

root@sun:~# ldm start-reconf primary
root@sun:~# ldm rm-io pci_3 primary
root@sun:~# reboot
[ wait for the system to come back up ]
root@sun:~# ldm add-io pci_3 mars
root@sun:~# ldm bind mars

With the removal of pci_3, we also removed PCIE8, SASHBA1 and NET2 from the primary domain and added all three to mars.  Mars will now have direct, exclusive access to all the disks controlled by SASHBA1, all the network ports on NET2 and whatever we chose to install in PCIe slot 8.  Since in this particular example, mars has access to internal disk and network, it can boot and communicate using these internal devices.  It does not depend on the primary domain for any of this.  Once started, we could actually shut down the primary domain.  (Note that the primary is usually the home of vntsd, the console service.  While we don't need this for running or rebooting mars, we do need it in case mars falls back to OBP or single-user.)

[Figure: Root Domain Setup]

Mars now owns its own PCIe root complex.  Because of this, we call mars a root domain.  The diagram on the right shows the general architecture.  Compare this to the diagram above!  Root domains are truly independent partitions of a SPARC system, very similar in functionality to Dynamic System Domains in the E10k, E25k or M9000 times (or Physical Domains, as they're now called).  They own their own CPU, memory and physical IO.  They can be booted, run and rebooted independently of any other domain.  Any failure in another domain does not affect them.  Of course, we have plenty of shared components: A root domain might share a mainboard, a part of a CPU (mars, for example, only has 8 cores...), some memory modules, etc. with other domains.  Any failure in a shared component will of course affect all the domains sharing that component, which is different in Physical Domains because there are significantly fewer shared components.  But beyond this, root domains have a level of isolation very similar to that of Physical Domains.

Comparing root domains (which are the most general form of physical IO in LDoms) with virtual IO, here are some pros and cons:

Pros:

  • Root domains are fully independent of all other domains (with the exception of console access, but this is a minor limitation).
  • Root domains have zero overhead in IO - they have no virtualization overhead whatsoever.
  • Root domains, because they don't use virtual IO, are not limited to disk and network, but can also attach to tape, tape libraries or any other, generic IO device supported in their PCIe slots.

Cons:

  • Root domains are limited in number.  You can only create as many root domains as you have PCIe root complexes available.  In current T5 and M5/M6 systems, that's two per CPU socket.
  • Root domains cannot live migrate.  Because they own real IO hardware (with all these nasty little buffers, registers and FIFOs), they cannot be live migrated to another chassis.
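The limit on the number of root domains is simple arithmetic.  A minimal sketch, assuming the "two root complexes per CPU socket" figure given above for T5 and M5/M6 (the function name max_root_domains is made up for this example):

```shell
#!/bin/sh
# max_root_domains SOCKETS
# Upper bound on the number of root domains, assuming two PCIe root
# complexes per CPU socket.  The primary domain counts as one of them
# as long as it keeps a bus of its own.
max_root_domains() {
    echo $(( $1 * 2 ))
}

max_root_domains 2    # T5-2: two sockets
```

For the two-socket T5-2 this gives 4, which matches the four buses pci_0 to pci_3 in the "ldm ls-io" listing.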

Because of these different characteristics, root domains are typically used for applications that tend to be more static, have higher IO requirements and/or larger CPU and memory footprints.  Domains with virtual IO, on the other hand, are typically used for the mass of smaller applications with lower IO requirements.  Note that "higher" and "lower" are relative terms - LDoms virtual IO is quite powerful.

This is the end of the first part of the physical IO section.  I'll cover some additional options next time.

Monday Jan 14, 2013

LDoms IO Best Practices & T4 Red Crypto Stack

In November, I presented at DOAG Konferenz & Ausstellung 2012.  Now, almost two months later, I finally get around to posting the slides here...

  • In "LDoms IO Best Practices" I discuss different IO options for both disk and networking and give some recommendations on how to choose the right ones for your environment.  A couple of hints about performance are also included.

I hope the slides are useful!

Friday Jul 13, 2012

What's up with LDoms: Part 3 - A closer look at Disk Backend Choices

In this section, we'll have a closer look at virtual disk backends and the various choices available here.  As a little reminder, a disk backend, in LDoms speak, is the physical storage used when creating a virtual disk for a guest system.  In other virtualization solutions, these are sometimes called virtual disk images, a term that doesn't really fit for all possible options available in LDoms.

In the previous example, we used a ZFS volume as a backend for the boot disk of mars.  But there are many other ways to store the data of virtual disks.  The relevant section in the Admin Guide lists all the available options:

  • Physical LUNs, in any variant that the Control Domain supports.  This of course includes SAN, iSCSI and SAS, including the internal disks of the host system.
  • Logical Volumes like ZFS Volumes, but also SVM or VxVM
  • Regular Files. These can be stored in any filesystem, as long as they're accessible by the LDoms subsystem. This includes storage on NFS.

Each of these backend devices has its own set of characteristics that should be considered when deciding which backend type to use.  Let's look at them in a little more detail.

LUNs are the most generic option.  By assigning a virtual disk to a LUN backend, the guest essentially gains full access to the underlying storage device, whatever that might be.  It will see the volume label of the LUN, it can see and alter the partition table of the LUN, and it can also read or set SCSI reservations on that device.  Depending on the way the LUN is connected to the host system, this very same LUN could also be attached to a second host and a guest residing on it, with the two guests sharing the data on that one LUN, or supporting live migration.  If there is a filesystem on the LUN, the guest will be able to mount that filesystem, just like any other system with access to that LUN, be it virtualized or direct.  Bear in mind that most filesystems are non-shared filesystems.  This doesn't change here, either.  For the IO domain (that's the domain where the physical LUN is connected), LUNs mean the least possible amount of work.  All it has to do is pass data blocks to and from the LUN; only a bare minimum of driver layers is involved.

Flat files, on the other hand, are the simplest option, very similar in user experience to what one would do in a desktop hypervisor like VirtualBox.  The easiest way to create one is with the "mkfile" command.  For the guest, there is no real difference to LUNs.  The virtual disk will, just like in the LUN case, appear to be a full disk, partition table, label and all.  Of course, initially, it'll be all empty, so the first thing the guest usually needs to do is write a label to the disk.  The main difference to LUNs is in the way these image files are managed.  Since they are files in a filesystem, they can be copied, moved and deleted, all of which should be done with care, especially if the guest is still running.  They can be managed by the filesystem, which means attributes like compression, encryption or deduplication in ZFS could apply to them - fully transparent to the guest.  If the filesystem is a shared filesystem like NFS or SAM-FS, the file (and thus the disk image) could be shared by another LDom on another system, for example as a shared database disk or for live migration.  Their performance will be impacted by the filesystem, too.  The IO domain might cache some of the file, hoping to speed up operations.  If there are many such image files on a single filesystem, they might impact each other's performance.  These files, by the way, need not be empty initially.  A typical use case would be a Solaris iso image file.  Adding it to a guest as a virtual disk will allow that guest to boot (and install) off that iso image as if it were a physical CD drive.
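To make the file backend a bit more concrete, here is a minimal dry-run sketch that only prints the commands involved instead of running them.  The image path, the volume name and the disk service name "primary-vds0" are assumptions for illustration.  Use whatever your control domain actually has configured.

```shell
#!/bin/sh
# file_backend_plan IMAGE SIZE VOLUME DOMAIN
# Prints (does not execute) the steps to create a flat file with mkfile,
# register it with the virtual disk service and hand it to a guest.
# "primary-vds0" is an assumed service name, not taken from this post.
file_backend_plan() {
    img=$1 size=$2 vol=$3 dom=$4
    echo "mkfile ${size} ${img}"
    echo "ldm add-vdsdev ${img} ${vol}@primary-vds0"
    echo "ldm add-vdisk ${vol} ${vol}@primary-vds0 ${dom}"
}

# Hypothetical example values.
file_backend_plan /ldoms/mars/data1.img 10g data1 mars
```

Once the guest sees the new disk, it still has to write a label to it, as described above.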

Finally, there are logical volumes, typically created with volume managers such as Solaris Volume Manager (SVM) or Veritas Volume Manager (VxVM) or, of course, ZFS.  For the guest, again, these look just like ordinary disks, very much like files.  The difference to files is in the management layer: the logical volumes are created straight from the underlying storage, without a filesystem layer in between.  In the database world, we would call these "raw devices", and their device names in Solaris are very similar to those of physical LUNs.  We need different commands to find out how large these volumes are, or how much space is left on the storage devices underneath.  Other than that, however, they are very similar to files in many ways.  Sharing them between two host systems is likely to be more complex, as one would need the corresponding cluster volume managers, which typically only really work in combination with Solaris Cluster.  One type of volume that deserves special mention is the ZFS Volume.  It offers all the features of a normal ZFS dataset: clones, snapshots, compression, encryption, deduplication, etc.  Especially with snapshots and clones, ZFS volumes lend themselves as the ideal backend for all use cases that make heavy use of these features.
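The snapshot and clone workflow for ZFS volumes could look roughly like the following dry-run sketch, which again only prints the commands.  It assumes that a "golden" volume with an installed guest image already exists in the pool.  The pool, volume and snapshot names as well as the service name "primary-vds0" are hypothetical.

```shell
#!/bin/sh
# zvol_clone_plan POOL GOLDEN_VOLUME CLONE DOMAIN
# Prints (does not execute) the steps to snapshot an existing ZFS volume,
# clone it and export the clone to a guest as a virtual disk.
# The snapshot name "golden" and "primary-vds0" are assumptions.
zvol_clone_plan() {
    pool=$1 gold=$2 clone=$3 dom=$4
    echo "zfs snapshot ${pool}/${gold}@golden"
    echo "zfs clone ${pool}/${gold}@golden ${pool}/${clone}"
    echo "ldm add-vdsdev /dev/zvol/dsk/${pool}/${clone} ${clone}@primary-vds0"
    echo "ldm add-vdisk ${clone} ${clone}@primary-vds0 ${dom}"
}

# Hypothetical example values.
zvol_clone_plan ldompool s11gold mars-boot mars
```

Note that the backend is the raw zvol device under /dev/zvol/dsk, not a file in a mounted filesystem.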

For the sake of completeness, I'd like to mention that you can export all of these backends to a guest with or without the "slice" option.  I consider this less useful in most cases, so I'd like to refer you to the relevant section in the Admin Guide if you want to know more about it.

Lastly, you do have the option to export these backends read-only to prevent any changes from the guests.  Keep in mind that even mounting a UFS filesystem read-only would require a write operation to the virtual disk.  The most typical use case for this is probably an iso image, which can indeed be mounted read-only.  You can also export one backend to more than one guest.  In the physical world, this would correspond to using the same SAN LUN on several hosts, and the same restrictions with regard to shared filesystems etc. apply.

So now that we know about all these different options, when should we use which kind of backend?  The answer, as usual, is: It depends!

LUNs require a SAN (or iSCSI) infrastructure which we tend to associate with higher cost.  On the other hand, they can be shared between many hosts, are easily mapped from host to host and bring a rich feature set of storage management and redundancy with them.  I recommend LUNs (especially SAN) for both boot devices and data disks of guest systems in production environments.  My main reasons for this are:

  • They are very light-weight on the IO domain
  • They avoid any double buffering of data in the guest and in the IO domain because there is no filesystem layer involved in the IO domain.
  • Redundancy for the device and the data path is easy
  • They allow sharing between hosts, which in turn allows cluster implementations and live migration
  • All ZFS features can be implemented in the guest, if desired.

For test and development, my first choice is usually the ZFS volume.  Unlike VxVM, it comes free of charge, and its features like snapshots and clones meet the typical requirements of such environments to quickly create, copy and destroy test environments.  I explicitly recommend against using ZFS snapshots/clones (files or volumes) over a longer period of time.  Since ZFS records the delta between the original image and the clones, the space overhead will grow to a multiple of the initial size over time and eventually even prevent further IO to the virtual disk if the zpool is full.  Also keep in mind that ZFS is not a shared filesystem.  This prevents guests that use ZFS files or volumes as virtual disks from doing live migration.  Which leads directly to the recommendation for files:

I recommend files on NFS (or other shared filesystems) in all those cases where SAN LUNs are not available but shared access to disk images is required because of live migration (or because cluster software like Solaris Cluster or RAC is running in the guests).  The functionality is mostly the same as for LUNs, with the exception of SCSI reservations, which don't work with a file backend.  However, CPU requirements in the IO domain and performance of NFS files as compared to SAN LUNs are likely to be different, which is why I strongly recommend using SAN LUNs for all production use cases.
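The recommendations above can be condensed into a small decision helper.  This is only a rough sketch of the rules of thumb from this section, not a complete sizing guide.  The function name choose_backend and its three arguments are made up for this example.

```shell
#!/bin/sh
# choose_backend ENV SAN SHARED
#   ENV    : "prod" or "test"
#   SAN    : "yes" if SAN LUNs are available, anything else otherwise
#   SHARED : "yes" if shared access is needed (live migration, cluster)
# Prints the backend type suggested by the rules of thumb in this section.
choose_backend() {
    env=$1 san=$2 shared=$3
    if [ "$shared" = yes ] && [ "$san" != yes ]; then
        # Shared access needed but no SAN: files on a shared filesystem.
        echo "file on NFS"
    elif [ "$env" = prod ] || [ "$shared" = yes ]; then
        # Production, or shared access with SAN at hand.  For production
        # without any SAN this is still the recommendation, not a given.
        echo "SAN LUN"
    else
        # Test and development without live migration.
        echo "ZFS volume"
    fi
}

choose_backend prod yes no
choose_backend test no no
choose_backend test no yes
```

The three calls print "SAN LUN", "ZFS volume" and "file on NFS", in that order.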


Wednesday Dec 21, 2011

Which IO Option for which Server?

For those of you who always wanted to know what IO option cards were available for which server, there is now a new portal.  This wiki contains a full list of IO options, ordered by server, and maintained for all current systems.  Also included is the number of cards supported on each system.  The same information, for all current as well as for all older models, is available in the Systems Handbook, the ultimate answerbook for all hardware questions ;-)

(For those that have been around for a while: This service is the replacement for the previous "Cross Platform IO Wiki", which is no longer available.)


News, tips and things worth knowing about SPARC, CMT, performance and its analysis, as well as experiences with Solaris on the server and the laptop.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

