Mittwoch Dez 23, 2015

Getting Started with OVM Templates for SPARC - Part 3: Using Templates on SuperCluster

As with the previous parts of this mini-series, this is the blog version of my MOS Note 2065199.1, although this time I'll deviate a bit more from the "original".

In the previous section, we saw how to create and deploy templates on standalone systems.  While straightforward, it was all commandline work.  On SuperCluster, this is very different.  First of all, you can't create a template on SuperCluster.  That's because manually creating domains on SuperCluster is not supported, and the domains created by the tools don't use a file backend for their rpool.  Deployment, on the other hand, is even easier than on standalone systems, thanks to the very easy to use "IO Domain Creation Tool".

The "IO Domain Creation Tool" manages all aspects of the lifecycle of an IO domain - a new type of domain introduced in the 2.0 release of SuperCluster.  I will not go into the details of this tool and the SuperCluster IO domain here.  Have a look in the official documentation for details.  However, IO domains also support OVM templates as a possible installation method, and this is what I'll cover here.

On standalone systems, you're responsible for managing your own templates.  Typically, you'll have some shared storage with a collection of templates that you use.  On SuperCluster, this collection is formalized into a "Library" from which you can pick a template at install time.  So, to get started, the first step is to upload a template into that library.  To do that, the template file needs to be available on a filesystem visible in the primary domain of the first compute node.  In the example I'll use here, this is /net/192.168.30.1/export/ssc/templates/pae/template-demo.ova.  In the IO Domain Creation Tool, go to the "OVM Templates" tab, enter the full pathname of the OVA file and select "Add Template".
The template will now be uploaded to the library.  Since this includes unpacking the OVA file and uncompressing the disk images, it will take several minutes.  There is no progress display, but you can check the contents of the library from time to time.


The template will be displayed in the list of existing templates once the import process has completed.  Note that at this time, you can't delete or otherwise modify a template once it's in this library.  We're working on that...


Allocating Hardware and Selecting the Template

On SuperCluster, you can't just create a new domain using ldm commands like you would on a standalone system.  All hardware resources available for IO domains are managed by the IO domains subsystem.  You'll need to "register" the hardware you want for the new domain with that subsystem, using the IO Domain Creation Tool.  To do that, go to the "IO Domains" tab of the IO Domain Creation Tool.  It will display a list of existing IO Domains.  Select the "Add IO Domain" button.  On the following screen, select your template as the domain type and the desired hardware size as the domain recipe.  The size you choose here will override any resource requirements laid down in the template.  Then click "Next".


The next screen will ask you to provide values for various properties.  Some of these are hostnames for different network interfaces.  These are SuperCluster specific and will override any settings configured in the template.  Depending on the template, there might be other properties to be filled with values.  In our example, you will be asked to provide values for the three properties defined earlier.  Once you are satisfied with your entries, click "Allocate".  This will take you back to the list of IO domains.


At the top of the screen, some network details will be displayed.  The tool chooses IP addresses from a pool which was configured during SuperCluster Installation.  Make note of these so you can easily connect to your domain after deployment.

Deploying the Domain

The final step is very simple.  To deploy the domain with the configuration given in the previous steps, select the domain in the list of available domains and click on "Deploy".  After a final confirmation, the domain will be created and the template will be deployed.

Now wait for the Domain to be created.


Once ready, you can view some details.


And of course, you can connect to the application and see if it works.


Differences from Templates on Standalone Systems

Domains deployed using templates on standalone systems are configured based on the requirements defined in the template.  This means the number of CPUs, the amount of memory, the number of network interfaces, the type and location of the disk images are either taken from the template definition or are defined by the administrator during template deployment and configuration.  When deploying on SuperCluster, all of this is controlled by the IO domain creation utility:

  • The domain size is set by the utility based on the domain recipe chosen by the administrator.
  • Network adapters are created and configured to conform with the SuperCluster network environment, overriding any settings in the template.
  • Disk images are always provided as iSCSI LUNs from the central ZFS SA.  The image sizes are taken from the template.
  • /etc/hosts is populated with entries from the SuperCluster environment.
  • The Solaris publisher is set to the standard publisher in the SuperCluster environment.  This includes adding the exa-family publisher.
  • ssctuner is installed and activated.
  • The domain's root password is set to a default password.

The net effect of these differences is that the IO domain, although based on a template, is a normal SuperCluster IO domain that provides full connectivity to the SuperCluster environment and is fully supported, just like any other application domain.  This also means that any restrictions that apply to normal application domains also apply to IO domains deployed from a template.

For additional and background reading, please refer to the links I provided at the end of the first article in this mini-series.

Donnerstag Dez 03, 2015

Getting Started with OVM Templates for SPARC - Part 2: Creating a Template

The primary purpose of a template is to deliver an application "ready to run" in an environment that contains everything the application needs, but not more than necessary.  The template will be deployed multiple times and you will want to configure each deployed instance to give it a personality and perhaps connect it with other components running elsewhere.  For this to work, you will need to understand the configuration aspects of the application well.  You might need a script that does some or all of the configuration of the application once the template has been deployed and the domain boots for the first time.  Any configuration not done by such a script will need to be done manually by the user of the template after the domain has booted.  It is usually desirable to create templates in such a way that no further manual configuration is required.  I'll not cover how to create such first-boot scripts here.

(Note: This article is the blog version of the second part of MOS DocID 2063739.1 reproduced here for convenient access.) 

Here are the steps usually required to build a template:

  1. Create a new domain and install Solaris and the ovmt utilities.
  2. Define what properties will be needed to configure the application.
  3. Install and configure the application as much as possible.
  4. Write and test a first-boot script or SMF service that will read the values for these properties using the ovmt utilities and configure the application.
  5. Unconfigure Solaris in your domain and remove any temporary files, then shut it down.
  6. Create the template. 
  7. Test your template.  Go back to any of the previous steps to fix issues you find.

Before we look at each of these steps using an example, here is a little background about how properties work in this context.

Properties for Your Template

Most applications require some sort of configuration before they are ready.  Typical configuration items might be TCP ports on which to listen for requests, admin passwords for a web user interface or the IP address of a central administration server for that application.  These are usually set during manual installation of the application.  Since the whole idea of an application template is that once you deploy it, the application is ready to run, there needs to be a way to pass these configuration items to the application during template deployment.  This is where properties come in.  Any configuration item that must be passed to the application at deployment time is defined as a property of the template.  This happens in a properties definition file that is bundled with the actual OS image to build the template.  Note that this file only contains the definition of these properties, not their values!  Properties are populated with values at deployment time.

You have already seen an example of how to pass such values to a template during deployment in the first part of this mini-series, when Solaris properties were passed to the template.

Properties are defined in a small XML file, as they themselves have various properties.  They have a name, which can be fully qualified or simple.  They have a data type, like number, text, selection lists etc.  These are covered in detail in the Template Authoring Guide.  For the simple example in this article, we will use text properties only.

During deployment, you pass values for these properties to the target domain using the ovmtconfig utility.  It will use one of two methods to pass these values to the domain.  The more complex method is backmounting the target domain's disk images to the control domain and then running a script which will put configuration items right into the filesystem of the target domain.  This is how a Solaris AI profile is constructed and passed to the domain for first-boot configuration of Solaris itself.  The other method currently uses VM variables to pass properties and their values to the domain.  These can then be read at first boot using the ovmtprop utility.  This is why the ovmt utilities should be installed in the template.  In the example below, you will see how this is done.

Creating a Source Domain

The first step in developing a template is to create a source domain.  This is where you will install your application, configure it as much as possible and test the first-boot script which will do the rest of the configuration based on the template property values passed during deployment.  Here are a few hints and recommendations for this step (a minimal command sketch follows the list):

  • Use a standalone SPARC sun4v system to build your template.  Do not attempt to build a template on SuperCluster - this is not supported.
  • Create a simple domain:
    • Use only one network adapter if possible.  This makes it easier to deploy the template in many different environments.
    • Use only one disk image for the OS.  Keep it small.  Unnecessarily large disk images will make it more difficult to transport and deploy the template.
    • If required, you can use one or more additional disk images for the application.  Separating it from the OS is often a good idea.
    • Define CPU and memory resources only as needed.  They can always be increased after deployment if necessary.
    • Install only those OS packages required by your application.  You might want to start with "solaris-minimal-server" and then add any packages required.
  • You must use flat files for all disk images.  A later version of the utilities will also support other disk backends.
  • All your disk image files must have a filename ending in ".img".  While this is not a restriction for general use, it is currently a requirement if you intend to deploy your template on SuperCluster.
    • Use a sensible name for the disk images, as they are used in the volume descriptions within the template.
  • To speed up testing (create and deploy cycles) it is helpful to install pigz, a parallel implementation of gzip.  If present, it will be used to speed up compression and decompression of the disk images.
  • As you will most likely be using ovmtprop to read properties for your application, don't forget to install the ovmtutils package into the domain.
  • Apply all required configurations to make the application ready to run.  If this includes adding users, file systems, system configurations etc. to the domain, then do that.  All of these will be preserved during template creation.
  • Note that system properties like IP addresses, root passwords, timezone information etc. will not be preserved.
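
To make these recommendations more concrete, here is a minimal sketch of how such a source domain could be created on the control domain.  The names, sizes and paths (source, /templates/build, primary-vsw0, primary-vds0) are examples only and assume that the virtual switch and virtual disk service already exist; installing Solaris into the new domain (for example over the network) is a separate step.

# create a small flat-file disk backend; note the filename ending in ".img"
root@mars:~# mkfile 20g /templates/build/template-demo.img
# create the domain with modest resources
root@mars:~# ldm add-domain source
root@mars:~# ldm set-core 1 source
root@mars:~# ldm set-memory 8G source
# one network adapter and one disk only
root@mars:~# ldm add-vnet vnet0 primary-vsw0 source
root@mars:~# ldm add-vdsdev /templates/build/template-demo.img template-demo@primary-vds0
root@mars:~# ldm add-vdisk disk0 template-demo@primary-vds0 source
root@mars:~# ldm bind source
root@mars:~# ldm start source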

Defining the Properties

What properties you need in your template depends on your application.  So of course you should understand how to install and configure the application before you start building the template.  Here is an example for a very simple application:  An Apache webserver that serves a simple page displaying the properties passed to the domain.  In this example, there are three properties defined in the properties file: property1, property2 and property3.  Here is the full XML file defining these properties:

<ovf:ProductSection ovf:class="com.oracle.supercluster.client">
        <ovf:Product>Oracle SuperCluster</ovf:Product>
        <ovf:Category>OVM Templates</ovf:Category>
        <ovf:Version>1</ovf:Version>
        <ovf:Property ovf:key="property1" ovf:type="string" 
                      ovf:userConfigurable="true" ovf:value="Value1">
                <ovf:Description>Specifies property 1</ovf:Description>
        </ovf:Property>
        <ovf:Property ovf:key="property2" ovf:type="string"
                      ovf:userConfigurable="true" ovf:value="Value2">
                <ovf:Description>Specifies property 2</ovf:Description>
        </ovf:Property>
        <ovf:Property ovf:key="property3" ovf:type="string"
                      ovf:userConfigurable="true" ovf:value="Value3">
                <ovf:Description>Specifies property 3</ovf:Description>
        </ovf:Property>
</ovf:ProductSection> 

This file will be bundled with the template when you create the template.  Note that although there are values defined in this file, these values are not passed to the target domain during deployment!  A detailed description of all the options for template properties will be available in the Template Authoring Guide, which is currently being written.

Installing and Configuring the Application

Again, this step very much depends on the application you intend for your template.  In this simple example, there is very little to do.  Note that this step also shows a very simple first-boot script which does the configuration of the application:

root@source:~# pkg install ovmtutils

root@source:~# pkg install apache22

root@source:~# svcadm enable apache22

root@source:~# more /etc/rc3.d/S99template-demo 

#!/bin/sh

# simplest of all first-boot scripts
# just for demo purposes

OUTFILE=/var/apache2/2.2/htdocs/index.html
OVMTPROP=/opt/ovmtutils/bin/ovmtprop
BASEKEY=com.oracle.supercluster.client.property

if [ ! -f $OUTFILE ]
then
   cat >>$OUTFILE << EOF
<html><body>
<h1>OVM Template Demo</h1>
EOF
   for i in 1 2 3
   do
      $OVMTPROP -q get-prop -k $BASEKEY$i >> $OUTFILE
      echo "</br>" >> $OUTFILE
   done
cat >>$OUTFILE << EOF
</body></html>
EOF
fi

After you have completed the installation and are happy with how the first-boot scripts work, the final step is to unconfigure Solaris.  You do this using the command "sysconfig unconfigure --destructive --include-site-profile". This removes system identity like hostname, IP addresses, time zones and also root passwords from the system.  It is necessary so that after deployment, a new system identity can be passed to the target domain.  You might also want to remove other files like ssh keys, temp files in /var/tmp etc. which were created during configuration and testing but should not be left in the template.
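
As a rough sketch (exactly which files need cleaning up depends on what you did while building the domain), this final cleanup could look like the following.  The last command assumes you stop the domain from the control domain once the unconfiguration has completed.

# inside the source domain: remove artifacts left over from building and testing
root@source:~# rm -f /root/.ssh/known_hosts
root@source:~# rm -rf /var/tmp/*
# unconfigure Solaris; this removes hostname, network settings, root password etc.
root@source:~# sysconfig unconfigure --destructive --include-site-profile
# back on the control domain, once the source domain has halted
root@mars:~# ldm stop-domain source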

Creating the Template

After all the preparations are complete, creating the actual template is a very simple step.  You only need to call ovmtcreate, which will collect all the pieces and assemble the OVF container for you.

root@mars:~# ovmtcreate -d source -o /templates/apache-template.ova \
             -s "OVM S11.2 SPARC Apache 2.2 Demo" \
             -P /development/apache-template.properties.xml

This will collect all the disk images attached to the source domain as well as the property definition file and package them in OVF format.   This file can now be transferred to any target system and deployed there.  With the "-s" flag, a descriptive name is given to the template.  This name can be used to identify the template.  For example, it will be displayed in the library of templates available in SuperCluster.

To complete the example workflow, deployment and configuration with the three custom properties for the sample Apache configuration is shown next.

Deploying Your Template on a Standalone Server with Custom Properties

To prepare for deployment, you will need a small file containing values for the three custom properties defined above: 

root@mars:~# cat custom.values
com.oracle.supercluster.client.property1=Value1
com.oracle.supercluster.client.property2=Value2
com.oracle.supercluster.client.property3=Value3

Of course, Solaris itself will also need some configuration.  You have already seen this in Part 1.  This example will simply reuse that configuration.

To deploy and configure the target domain, you now only need two simple commands:

root@mars:~# ovmtdeploy -d target -o /domains \
    -s /templates/apache-template.ova
root@mars:~# ovmtconfig -d target  \
    -c /opt/ovmtutils/share/scripts/ovmt_s11_scprofile.sh \
    -P solaris_11.props.values,custom.values

After starting the domain, you should be able to point your browser to port 80 and see the simple html page created by the first-boot script.
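
Since the "-s" switch leaves the domain stopped after deployment, you start it first.  A quick check could then look like this sketch; the IP address is the one from the Solaris property file used in Part 1, and curl is simply used here instead of a browser:

root@mars:~# ldm start-domain target
# once Solaris has completed its first-boot configuration:
root@mars:~# curl http://192.168.1.2/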

Additional Notes

  • You can check the properties passed to a domain from the control domain.  They are passed as normal VM variables, so the command "ldm ls-variable <domain>" will display them.  Note that this also means that any sensitive information passed to a domain in this way will be visible to anyone on the control domain with privileges to execute the ldm command.  The same information is also available to any user logged into the target domain, as any user has read access to the ovmtprop utility.  So while this is a nice way to check if the right variables have been passed to the target domain during testing, you should not use template properties to pass clear text passwords or other sensitive information to the domain.  Use encrypted password strings instead, if possible.  If the application requires clear text sensitive information, you should at least remove these variables after the target domain is in production.  Use the command "ldm rm-variable" to do so (see the sketch after this list).  Triggering a change of passwords after first login is also a good idea.
  • In general, OVM Templates support Solaris Zones.  This means you can create a template domain that contains one or more Zones as part of the template.  However, you will need to care for the configuration (IP addresses, hostnames, etc) of each zone as part of the first-boot configuration, as this is not covered by the Solaris configuration utilities.  Also note that templates containing Zones are currently not supported on SuperCluster.
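
Referring back to the note about VM variables above, here is a minimal sketch of how this check could look on the control domain.  The domain name "target" and the property name are the ones from the example in this article:

# list the VM variables (and thus the template properties) passed to the domain
root@mars:~# ldm ls-variable target
# remove a variable containing sensitive data once the domain is in production
root@mars:~# ldm rm-variable com.oracle.supercluster.client.property1 target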

For additional and background reading, please refer to the links I provided at the end of the first article in this mini-series.

Getting Started with OVM Templates for SPARC - Part 1: Deploying a Template

OVM Templates have been around for a while now.  Documentation and various templates are readily available on OTN.  However, most of the documentation is centered around OVM Manager and most of the templates are built for Linux on x86 hardware.  But SPARC and Solaris are catching up.  A template for Weblogic 12.1.3 is already available.

With the release of Solaris 11.3, commandline tools to create, deploy and configure OVM Templates for SPARC are now available.  The tools are also available as a separate download on MOS as patch ID 21210110.  In this small series of blog entries, I will discuss how to deploy OVM Templates on SPARC and how to create your own.

(Note: This article is the blog version of the first part of MOS DocID 2063739.1 reproduced here for convenient access.)

Let's start with the easiest part: Deploying an existing template on a SPARC system.

Of course, the first step here is to get a template.  Today, there isn't very much choice - you can get a template for Solaris 10 and one for Solaris 11.  But more are being developed.  Go to Edelivery.oracle.com  to get them.  Here's a little screenshot to guide you in the right direction.  My notes are in red...

There is also a template for Weblogic 12.1.3 available here.

Once you've downloaded the template, you'll find a file called "sol-11_2-ovm-sparc.ova" or similar.  This is the template in Open Virtualization Format.  Since an OVA file is essentially a tar archive of the OVF package, you can actually extract and explore that file if you are curious.
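
For example, simply listing the contents of the archive with tar is a quick way to see the OVF descriptor and the compressed disk images inside:

root@mars:~# tar tvf sol-11_2-ovm-sparc.ova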

The second step in deploying this template is to download and install the OVM Template Toolkit in the control domain of your server.  Either update to Solaris 11.3 or download the toolkit as a separate patch.  Then install it - you will find the tools in /opt/ovmtutils.  The patch will also contain a README file and manpages for the utilities.  I recommend looking at them for additional details and commandline switches not covered here.

root@mars:/tmp/ovmt# unzip /tmp/p21210110_11010_SOLARIS64.zip 
Archive:  /tmp/p21210110_11010_SOLARIS64.zip
  inflating: README.txt              
   creating: man1m/
  inflating: man1m/ovmtconfig.1m     
  inflating: man1m/ovmtprop.1m       
  inflating: man1m/ovmtlibrary.1m    
  inflating: man1m/ovmtcreate.1m     
  inflating: man1m/ovmtdeploy.1m     
  inflating: ovmt-utils1.1.0.1.p5p   

root@mars:/tmp/ovmt# pkg install -vg ./ovmt-utils1.1.0.1.p5p ovmtutils

[....]

root@mars:~# ls /opt/ovmtutils/bin
agent        dist         ovmtconfig   ovmtdeploy   ovmtprop
bin          lib          ovmtcreate   ovmtlibrary

Now, before we deploy the template, let's have a short look at what the utilities find in this specific template:

root@mars:~# ovmtdeploy -l ./sol-11_2-ovm-sparc.ova 
 
Oracle Virtual Machine for SPARC Deployment Utility
ovmtdeploy Version 1.1.0.1.4
Copyright (c) 2014, 2015, Oracle and/or its affiliates. All rights reserved.

STAGE 1 - EXAMINING SYSTEM AND ENVIRONMENT
------------------------------------------
Checking user privilege
Performing platform & prerequisite checks
Checking for required services
Named resourced available

STAGE 2 - ANALYZING ARCHIVE & RESOURCE REQUIREMENTS
---------------------------------------------------
Checking .ova format and contents
Validating archive configuration
Listing archive configuration

Assembly
------------------------
Assembly name: sol-11_2-ovm-sparc.ovf
Gloabl settings: 
References: zdisk-ovm-template-s11_2 -> zdisk-ovm-template-s11_2.gz
Disks: zdisk-ovm-template-s11_2 -> zdisk-ovm-template-s11_2
Networks: primary-vsw0

Virtual machine 1
------------------------
Name: sol-11_2-ovm-sparc
Description: Oracle VM for SPARC Template with 8 vCPUs, 4G memory, 1 disk image(s)
vcpu Quantity: 8
Memory Quantity: 4G
Disk image 1: ovf:/disk/zdisk-ovm-template-s11_2 -> zdisk-ovm-template-s11_2
Network adapter 1: Ethernet_adapter_0 -> primary-vsw0
Oracle VM for SPARC Template
        root-password
        network.hostname
        network.bootproto.0
        network.ipaddr.0
        network.netmask.0
        network.gateway.0
        network.dns-servers.0
        network.dns-search-domains.0

We can see that the target domain will have:

  • 8 vCPUs and 4GB of RAM
  • One Ethernet adapter connected to "primary-vsw0"
  • One disk image

It also supports several properties that can be configured after deployment and before the domain is first started.  These are all Solaris properties that one would usually provide during initial system configuration - either manually on the system console or using an AI profile.  We will see later in this article how to populate these properties.

Before we actually go and deploy this, we should check the prerequisites on the platform:

  • Your control domain should be running Solaris 11.3 (or at least 11.2 if you're using the patch mentioned above).
  • The virtual console service must be configured and running.  If not, set it up now.
    (See here for an example.)
  • A virtual disk service must be available.  If none is there, create one now (see the sketch below).
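
If either of the last two items is missing, a minimal setup could look like the following sketch.  The service names and the port range are common conventions, not requirements:

# virtual console concentrator plus the vntsd service that serves the consoles
root@mars:~# ldm add-vcc port-range=5000-5100 primary-vcc0 primary
root@mars:~# svcadm enable vntsd
# virtual disk service that will serve the disk images to the guests
root@mars:~# ldm add-vds primary-vds0 primary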

But first, let's deploy the template without bothering about any details:

root@mars:~# ovmtdeploy -d solarisguest -o /localstore/domains \
             /incoming/sol-11_2-ovm-sparc.ova

In this very simple example, the domain we're creating and installing will be called "solarisguest".  Its disk images will be stored in /localstore/domains and it will be installed from the template found in /incoming/sol-11_2-ovm-sparc.ova.

There are a few things to note here:

  • ovmtdeploy will create the domain with resources as they are defined in the template.
  • It will, if necessary, create vswitches to connect network ports.  Make sure to check the result.
  • By default, ovmtdeploy will use flat files for disk images.
  • All of these settings can be overridden with commandline switches.  They allow very sophisticated domain configurations and are not covered here.
  • The domain is started right after the deployment.  Since we didn't populate the available properties, Solaris will boot in an unconfigured state and request configuration on the system console.

So let's do this again, this time providing values for these properties.  This will allow us to boot the domain in a configured state without ever logging in.  What we need for this is a small text file which contains values for these properties:

root@mars:~# more solaris_11.props.values 

# Default hostname
com.oracle.solaris.system.computer-name=solarisguest

# Root user account settings
com.oracle.solaris.root-password='password hash goes here'

# Administrator account settings 
com.oracle.solaris.user.name.0=admin
com.oracle.solaris.user.real-name.0="Administrator"
com.oracle.solaris.user.password.0='password hash goes here'

# Network settings for first network instance
# Domain network interface
com.oracle.solaris.system.ifname=net0

# IP Address
# if not set, use DHCP
com.oracle.solaris.network.ipaddr.0=192.168.1.2
com.oracle.solaris.network.netmask.0=24
com.oracle.solaris.network.gateway.0=192.168.1.1

# DNS settings
# (comma separated list of DNS servers)
com.oracle.solaris.network.dns-servers.0=192.168.1.1
# (comma separated list of domains)
com.oracle.solaris.network.dns-search-domains.0=example.com

# System default locale settings
com.oracle.solaris.system.time-zone=US/Pacific

It should be obvious how to populate this file with your own values.  With this file, deployment and configuration is a simple, two-step operation:

root@mars:~# ovmtdeploy -d solarisguest -o /localstore/domains \
             -s /incoming/sol-11_2-ovm-sparc.ova 

This deploys the guest, but doesn't start it.  That's what the "-s" commandline switch is for.

root@mars:~# ovmtconfig -d solarisguest \
    -c /opt/ovmtutils/share/scripts/ovmt_s11_scprofile.sh \
    -P solaris_11.props.values

The script "ovmt_s11_scprofile.sh" is part of the ovmtutils distribution.  ovmtconfig will mount the deployed disk image and call this script.  It will create a configuration profile using the property values given in "solaris_1.props.values".  This profile will be picked up by solaris at first boot to configure the domain.

With this, you have a Solaris guest domain up and running.  If you don't like it, un-deploy it using "ovmtdeploy -U solarisguest".  It cleans up nicely.  You could now use this domain as a starting point for developing your own template.  But this will be covered in the next part.

For the curious, here are some additional links and references:

Dienstag Jan 06, 2015

What's up with LDoms - Article Index

In the last few years - yes, it's actually years! - I wrote a series of articles about LDoms and their various features.  It's about time to publish a small index to all those articles:

I will update this index if and when I find time for a new article.

What's up with LDoms: Part 11 - IO Recommendations

In the last few articles, I discussed various different options for IO with LDoms.  Here's a very short summary:

The IO options covered in the previous articles:

  • SR-IOV
  • Direct IO
  • Root Domains
  • Virtual IO

In this article, I will discuss the pros and cons of each of these options and give some recommendations for their use.

In the case of physical IO, there are several options:  Root Domains, DirectIO and SR-IOV.  Let's start with SR-IOV.  The most recent addition to the LDom IO options, it is by far the most flexible and the most sophisticated PCI virtualization option available.  Please see the diagram on the right (from the Admin Guide) for an overview.  First introduced for Ethernet adapters, Oracle today supports SR-IOV for Ethernet, Infiniband and Fibre Channel.  Note that the exact features depend on the hardware capabilities and built-in support of the individual adapter.  SR-IOV is not a feature of a server but rather a feature of an individual IO card in a server platform that supports it.  Here are the advantages of this solution:

  • It is very fine grain, with between 7 and 63 Virtual Functions per adapter.  The exact number depends on adapter capabilities.  This means that you can create and use as many as 63 virtual devices in a single PCIe slot!
  • It provides bare metal performance (especially latency), although hardware resources like send and receive buffers, MAC slots and other resources are divided between VFs, which might lead to slight performance differences in some cases.
  • Particularly for Fibre Channel, there are no limitations to what end-point device (disk, tape, library, etc.) you attach to the fabric.  Since this is a virtual HBA, it is administered like one.
  • Unlike Root Domains and Direct IO, most SR-IOV configuration operations can be performed dynamically, if the adapters support it.  This is currently the case for Ethernet and Fibre Channel.  This means you can add or remove SR-IOV VFs to and from domains in a dynamic reconfiguration operation, without rebooting the domain.

Of course, there are also some drawbacks:

  • First of all, you have a hard dependency on the domain owning the root complex.  Here's a little more detail about this:
    As you can see in the diagram, the IO domain owns the physical IO card.  The physical Root Complex (pci_0 in the diagram) remains under the control of the root domain (the control domain in this example).  This means that if the root domain should reboot for whatever reason, it will reset the root complex as part of that reboot.  This reset will cascade down the PCI structures controlled by that root complex and eventually reset the PCI card in the slot and all the VFs given away to the IO domains.  Essentially, seen from the IO domain, its (virtual) IO card will perform an unexpected reset.  The most likely consequence is a panic of the IO domain, which is also the best way for it to respond.  Note that the Admin Guide says that the behaviour of the IO domain is unpredictable, which means that a panic is the best, but not the only possible outcome.  Please also take note of the recommended precautions (by configuring domain dependencies) documented in the same section of the Admin Guide.  Furthermore, you should be aware that this also means that any kind of multi-pathing on top of VFs is counter-productive.  While it is possible to create a configuration where one guest uses VFs from two different root domains (and thus from two different physical adapters), this does not increase the availability of the configuration.  While this might protect against external failures like link failures to a single adapter, it doubles the likelihood of a failure of the guest, because it now depends on two root domains instead of one.  I strongly recommend against any such configurations at this time.  (There is work going on to mitigate this dependency.)
  • Live Migration is not possible for domains that use VFs.  In the case of Ethernet, this can be worked around by creating an IPMP failover group consisting of one virtual network port and one Ethernet VF and manually removing the VF before initiating the migration as described by Raghuram here (a rough sketch of this follows after this list).  Note that this is not currently possible for Fibre Channel or IB.
  • Since you are actually sharing one adapter between many guests, these guests do share the IO bandwidth of this one adapter.  Depending on the adapter, there might be bandwidth management available, however, the side effects of sharing should be considered.
  • Not all PCIe adapters support SR-IOV.  Please consult MOS DocID 1325454.1 for details.
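
Here is a rough sketch of the Ethernet workaround mentioned above.  The interface names, addresses, domain name "guest" and target host are examples only; please refer to the blog entry mentioned above for the full procedure:

# inside the guest: combine the virtual network port (net0) and the Ethernet VF (net1)
# into an IPMP group so that connectivity survives removal of the VF
root@guest:~# ipadm create-ip net0
root@guest:~# ipadm create-ip net1
root@guest:~# ipadm create-ipmp -i net0 -i net1 ipmp0
root@guest:~# ipadm create-addr -T static -a 192.168.1.10/24 ipmp0/v4
# on the control domain: remove the VF before initiating the migration
root@sun:~# ldm rm-io /SYS/MB/NET2/IOVNET.PF1.VF0 guest
root@sun:~# ldm migrate guest root@target-host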

SR-IOV is a very flexible solution, especially if you need a larger number of virtual devices and yet don't want to buy into the slightly higher IO latencies of virtual IO.  Due to the limitations mentioned above, I cannot currently recommend SR-IOV or Direct IO for use in domains with highest availability requirements.  In all other situations, and definitely in test and development environments, it is an interesting alternative to virtual IO.  The performance gap between SR-IOV and virtual IO has been narrowed considerably with the latest improvements in virtual IO.  You will essentially have to weigh the availability, latency and manageability characteristics of SR-IOV against virtual IO to make your decision.

Next in line is Direct IO.  As described in an earlier post, you give one full PCI slot to the receiving domain.  The hypervisor will create a virtual PCIe infrastructure in the receiving guest and reconfigure the PCIe subsystem accordingly.  This is shown in an abstract view in the diagram (from the Admin Guide) at the right.  Here are the advantages:

  • Since Direct IO works on a per slot basis, it is a finer-grained solution compared to root domains.  For example, you have 16 slots in a T5-4, but only 8 root complexes.
  • The IO domain has full control over the adapter.
  • Like SR-IOV, it will provide bare-metal performance.
  • There is no sharing, and thus no cross-influencing from other domains.
  • It will support all kinds of IO devices, tape drives and tape libraries being the most popular example.

The disadvantages of Direct IO are:

  • There is a hard dependency on the domain owning the root complex.  The reason is the same as with SR-IOV, so there's no need to repeat this here.  Please make sure you understand this and read the recommendations in the Admin Guide on how to deal with this dependency.
  • Not all IO cards are supported with DirectIO.  They must not contain their own PCIe switch.  A list of supported cards is maintained in MOS DocID 1325454.1.
  • Like Root Domains, dynamic reconfiguration is not currently supported with DirectIO slots.  This means that you will need to reboot both the root domain and the receiving guest domain to change this configuration.
  • And of course, Live Migration is not possible with Direct IO devices.

DirectIO was introduced in an early release of the LDoms software.  At the time, systems like the T2000 only supported two Root Complexes.  The most common use case was to support tape devices in domains other than the control domain.  Today, with a much better ratio of slots per root complex, the need for this feature is diminishing and although it is fully supported, you should consider other alternatives first.

Finally there are Root Domains.  Again, a diagram you already know, just as a reminder.

The advantages of Root Domains are:

  • Highest Isolation of all domain types.  Since they own and control their own CPU, memory and one or more PCIe root complex, they are fully isolated from all other domains in the system.  This is very similar to Dynamic System Domains you might know from older SPARC systems, just that we now use a hypervisor instead of a crossbar.
  • This also means no sharing of any IO resources with other domains, and thus no cross-influence of any kind.
  • Bare metal performance.  Since there's no virtualization of any kind involved, there are no performance penalties anywhere.
  • Root Domains are fully independent of all other domains in all aspects.  The only exception is console access, which is usually provided by the control domain.  However, this is not a single point of failure, as the root domain will continue to operate and will be fully available over the network even if the control domain is unavailable.
  • They allow hot-swapping of IO cards under their control, if the chassis supports it.  Today, that is for T5-4 and above.

Of course, there are disadvantages, too:

  • Root Domains are not very flexible.  You can not add or remove PCIe root complexes without rebooting the domain.
  • You are limited in the number of Root Domains, mostly by the number of PCIe root complexes available in the system.
  • As with all physical IO, Live Migration is not possible.

Use Root Domains whenever you have an application that needs at least one socket worth of CPU and memory or more and has high IO requirements, but where you'd prefer to host it on a larger system to allow some flexibility in CPU and memory assignment.  Typically, Root Domains have a memory footprint and CPU activity which is too high to allow sensible live migration. They are typically used for high value applications that are secured with some kind of cluster framework. 

Having covered all the options for PCI virtualization, there is only virtual IO left to cover.  For easier reference, here's the diagram from previous posts that shows this basic setup.  This variant is probably the most widely used one.  It has been available from the very first version, and its performance has been significantly improved recently.  The advantages of this type of IO are mostly obvious:

  • Virtual IO allows live migration of guests.  In fact, only if all the IO of a guest is fully virtualized, can it be live migrated.
  • This type of IO is by far the most flexible from a platform point of view.  The number of virtual networks and the overall network architecture is only limited by the number of available LDCs (which has recently been increased to 1984 per domain).  There is a big choice of disk backends.  Providing disk and networking to a great number of guests can be achieved with a minimum of hardware.
  • Virtual IO fully supports dynamic reconfiguration - the adding and removing of virtual devices.
  • Virtual IO can be configured with redundant IO service domains, allowing a rolling upgrade of the IO service domains without disrupting the guest domains and without requiring live migration of the guests for this purpose.  Especially when running a large number of guests on one platform, this is a huge advantage.

Of course, there are also some drawbacks:

  • As with all virtual IO, there is a small overhead involved.  In the LDoms implementation, there is no limitation of physical bandwidth.  But there is a small amount of additional latency added to each data packet as it is processed through the stack.  Note that this additional latency, while measurable, is very small and not typically an issue for applications.
  • LDoms virtual IO currently supports virtual Ethernet and virtual disk.  While virtual Ethernet provides the same functionality as a physical Ethernet switch, the virtual disk interface works on a LUN by LUN basis.  This is different from other solutions that provide a virtual HBA and comes with some overhead in administration, since you have to add each virtual disk individually instead of just a single (virtual) HBA.  It also means that other SCSI devices like tapes or tape libraries can not be connected with virtual IO.
  • As is natural for virtual IO, the physical devices (and thus their resources) are shared between all consumers.  While recent releases of LDoms do support bandwidth limitations for network traffic, no such limits can currently be set on virtual disk devices.
  • You need to configure sufficient CPU and memory resources in the IO service domains.  The usual recommendation is one to two cores and 8-16 GB of memory.  While this doesn't strictly count as overhead for the CPU resources of the guests, these are still resources that are not directly available to the guests.

Some recommendations for virtual IO:

  • In general, use the latest version of LDoms, along with Solaris 11.
  • Other than general networking considerations, there are no specific tunables for networking, if you are using a recent version of LDoms.  Stick to the defaults.
  • The same is true for disk IO.  However, keep in mind what has been true for the last 20 years: More LUNs do more IOPS.  Just because you've virtualized your guest doesn't mean that a single, 10TB LUN would give you more IOPS than 10x1TB LUNs - quite the opposite!  (A small sketch follows after this list.)  In the special case of the Oracle database: Make sure the redo logs are on dedicated storage.  This has been a recommendation since the "bad old days", and it continues to be true, whether you virtualize or not.
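
As an illustration of the "more LUNs" point, presenting several smaller LUNs to a guest instead of one large one is just a matter of a few more ldm commands.  The device paths, volume and domain names below are purely examples:

# two separate backend LUNs exported through the virtual disk service
ldm add-vdsdev /dev/dsk/c0t600A0B800026D61Ad0s2 dbvol1@primary-vds0
ldm add-vdsdev /dev/dsk/c0t600A0B800026D61Bd0s2 dbvol2@primary-vds0
# ...and presented to the guest as two virtual disks
ldm add-vdisk dbdisk1 dbvol1@primary-vds0 guest
ldm add-vdisk dbdisk2 dbvol2@primary-vds0 guest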

Virtual IO is best used in consolidation scenarios, where you have many smaller systems to host on one chassis.  These smaller systems tend to be lightweight in most of their resource consumption, including IO.  Hence, they will definitely work well on virtual IO.  These are also the workloads that lend themselves best to Live Migration because of their smaller memory footprint and lower overall activity.  This is not to say that domains with moderate IO requirements wouldn't be well suited for virtual IO, they are.  However, larger domains with higher overall resource consumption (CPU, memory, IO) tend to benefit less from the advantages of Live Migration and the flexibility of virtual IO.

To finalize this article, here's an overview of the different options, with the pros, cons and the most important points to consider for each:

SR-IOV
  Pros:
    • Highest granularity of all PCIe-based IO solutions
    • Bare metal performance
    • Supports Ethernet, FC and IB
    • Dynamic reconfiguration
  Cons:
    • Depends on support by PCIe card
    • No Live Migration
    • Dependency on root domain
  When to use:
    • For larger numbers of guests that need bare metal latency and can do without live migration.
    • When administrating a great number of LUNs is a constant burden, consider FC SR-IOV.
    • When availability is not the top priority.

Direct IO
  Pros:
    • Dedicated slot, no hardware sharing
    • Bare metal performance
    • Supports Ethernet, FC and IB
  Cons:
    • Granularity limited by number of PCIe slots in the system
    • Not all PCIe cards supported
    • No Live Migration
    • No dynamic reconfiguration
    • Dependency on root domain
  When to use:
    • If you need a dedicated or special purpose IO card

Root Domains
  Pros:
    • Fully independent domains, similar to dynamic domains
    • Full bare metal performance, dedicated to each domain
    • All types of IO cards supported
  Cons:
    • Granularity limited by the number of Root Complexes in the system
    • No Live Migration
    • No dynamic reconfiguration
  When to use:
    • High value applications with high CPU, memory and IO requirements
    • Live Migration is not a requirement and/or not practical because of domain size and activity.

Virtual IO
  Pros:
    • Allows Live Migration
    • Most flexible, including full dynamic reconfiguration
    • No special hardware requirements
    • Almost no limit to the number of virtual devices
    • Allows fully redundant virtual IO configuration for HA deployments
  Cons:
    • Limited to Ethernet and virtual disk
    • Small performance overhead, mostly visible in additional latency
    • vDisk administration complexity
    • Sharing of IO hardware may have performance implications
  When to use:
    • Consolidation Scenarios
    • Many small guests
    • Live Migration is a requirement

There are already quite a few links for further reading spread throughout this article.  Here is just one more:

Montag Dez 15, 2014

What's up with LDoms: Part 10 - SR-IOV

Back after a long "break" filled with lots of interesting work...  In this article, I'll cover the most flexible solution in LDoms PCI virtualization: SR-IOV.

SR-IOV - Single Root IO Virtualization - is a PCI Express standard developed and published by the PCI-SIG.  The idea here is that each PCIe card capable of SR-IOV, also called a "physical function", can create multiple virtual copies or "virtual functions" of itself and present these to the PCIe bus.  There, they appear very similar to the original, physical card and can be assigned to a guest domain much like a whole slot in the case of DirectIO.  The domain then has direct hardware access to this virtual adapter.  Support for SR-IOV was first introduced to LDoms in version 2.2, quite a while ago.  Since SR-IOV very much depends on the capabilities of the PCIe adapters, support for the various communication protocols was added one by one, as the adapters started to support this.  Today, LDoms support SR-IOV for Ethernet, Infiniband and FibreChannel.  Creating, assigning or de-assigning virtual functions (with the exception of Infiniband) has been dynamic since LDoms version 3.1, which means you can do all of this without rebooting the affected domains.

All of this is well documented, not only in the LDoms Admin Guide, but also in various blog entries, most of them by Raghuram Kothakota, one of the chief developers for this feature.  However, I do want to give a short example on how this is configured, pointing to a few things to note as we go along.

Just like with DirectIO, the first thing you want to do is an inventory of what SR-IOV capable hardware you have in your system:

root@sun:~# ldm ls-io
NAME                                      TYPE   BUS      DOMAIN   STATUS   
----                                      ----   ---      ------   ------   
pci_0                                     BUS    pci_0    primary           
pci_1                                     BUS    pci_1    primary           
niu_0                                     NIU    niu_0    primary           
niu_1                                     NIU    niu_1    primary           
/SYS/MB/PCIE0                             PCIE   pci_0    primary  EMP      
/SYS/MB/PCIE2                             PCIE   pci_0    primary  OCC      
/SYS/MB/PCIE4                             PCIE   pci_0    primary  OCC      
/SYS/MB/PCIE6                             PCIE   pci_0    primary  EMP      
/SYS/MB/PCIE8                             PCIE   pci_0    primary  EMP      
/SYS/MB/SASHBA                            PCIE   pci_0    primary  OCC      
/SYS/MB/NET0                              PCIE   pci_0    primary  OCC      
/SYS/MB/PCIE1                             PCIE   pci_1    primary  EMP      
/SYS/MB/PCIE3                             PCIE   pci_1    primary  EMP      
/SYS/MB/PCIE5                             PCIE   pci_1    primary  OCC      
/SYS/MB/PCIE7                             PCIE   pci_1    primary  EMP      
/SYS/MB/PCIE9                             PCIE   pci_1    primary  EMP      
/SYS/MB/NET2                              PCIE   pci_1    primary  OCC      
/SYS/MB/NET0/IOVNET.PF0                   PF     pci_0    primary           
/SYS/MB/NET0/IOVNET.PF1                   PF     pci_0    primary           
/SYS/MB/NET2/IOVNET.PF0                   PF     pci_1    primary           
/SYS/MB/NET2/IOVNET.PF1                   PF     pci_1    primary           

We've discussed this example earlier, this time let's concentrate on the last four lines.  Those are physical functions (PF) of two network devices (/SYS/MB/NET0 and NET2).  Since there are two PFs for each device, we know that each device actually has two ports.  (These are the four internal ports of a T4-2 system.)  To dynamically create a virtual function of one of these ports, we first have to turn on IO Virtualization on the corresponding PCI bus.  Unfortunately, this is not (yet) a dynamic operation, so we have to reboot the domain owning that bus once.  But only once.  So let's do that now:

root@sun:~# ldm start-reconf primary
Initiating a delayed reconfiguration operation on the primary domain.
All configuration changes for other domains are disabled until the primary
domain reboots, at which time the new configuration for the primary domain
will also take effect.
root@sun:~# ldm set-io iov=on pci_0
------------------------------------------------------------------------------
Notice: The primary domain is in the process of a delayed reconfiguration.
Any changes made to the primary domain will only take effect after it reboots.
------------------------------------------------------------------------------
root@sun:~# reboot

Once the system comes back up, we can check that everything went well:

root@sun:~# ldm ls-io
NAME                                      TYPE   BUS      DOMAIN   STATUS   
----                                      ----   ---      ------   ------   
pci_0                                     BUS    pci_0    primary  IOV      
pci_1                                     BUS    pci_1    primary        
[...]
/SYS/MB/NET2/IOVNET.PF1                   PF     pci_1    primary      

As you can see, pci_0 now shows "IOV" in the Status column. We can use the "-d" option to ldm ls-io to learn a bit more about the capabilities of the PF we intend to use:

root@sun:~# ldm ls-io -d /SYS/MB/NET2/IOVNET.PF1
Device-specific Parameters
--------------------------
max-config-vfs
    Flags = PR
    Default = 7
    Descr = Max number of configurable VFs
max-vf-mtu
    Flags = VR
    Default = 9216
    Descr = Max MTU supported for a VF
max-vlans
    Flags = VR
    Default = 32
    Descr = Max number of VLAN filters supported
pvid-exclusive
    Flags = VR
    Default = 1
    Descr = Exclusive configuration of pvid required
unicast-slots
    Flags = PV
    Default = 0 Min = 0 Max = 32
    Descr = Number of unicast mac-address slots    

All of these capabilities depend on the type of adapter and the driver that supports it.  In this example case, we can see that we can create up to 7 VFs, the VFs support a maximum MTU of 9216 bytes and have hardware support for 32 VLANs and 32 MAC addresses.  Other adapters are likely to give you different values here.

Now we can create a virtual function (VF) and assign it to a guest domain.  We have to do this with a currently unused port - creating VFs doesn't work while there's traffic on the device.

root@sun:~# ldm create-vf /SYS/MB/NET2/IOVNET.PF1 
Created new vf: /SYS/MB/NET2/IOVNET.PF1.VF0
root@sun:~# ldm add-io /SYS/MB/NET2/IOVNET.PF1.VF0 mars
root@sun:~# ldm ls-io /SYS/MB/NET2/IOVNET.PF1    
NAME                                      TYPE   BUS      DOMAIN   STATUS   
----                                      ----   ---      ------   ------   
/SYS/MB/NET2/IOVNET.PF1                   PF     pci_1    primary           
/SYS/MB/NET2/IOVNET.PF1.VF0               VF     pci_1    mars             

The first command here tells the hypervisor, or actually, the NIC located at /SYS/MB/NET2/IOVNET.PF1, to create one virtual function.  The command returns and reports the name of that virtual function.  There is a different variant of this command to create multiple VFs in one go.  The second command then assigns this newly created VF to a domain called "mars".  This is an online operation - mars is already up and running Solaris at this point.  Finally, the third command just shows us that everything went well and mars now owns the VF.
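
As a side note, the variant that creates several VFs in one go uses the "-n" option of ldm create-vf.  A small sketch, with an arbitrary number of VFs:

root@sun:~# ldm create-vf -n 3 /SYS/MB/NET2/IOVNET.PF1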

Used with the "-l" option, the ldm command tells us some details about the device structure of the PF and VF:

root@sun:~# ldm ls-io -l /SYS/MB/NET2/IOVNET.PF1
NAME                                      TYPE   BUS      DOMAIN   STATUS   
----                                      ----   ---      ------   ------   
/SYS/MB/NET2/IOVNET.PF1                   PF     pci_1    primary           
[pci@500/pci@1/pci@0/pci@5/network@0,1]
    maxvfs = 7
/SYS/MB/NET2/IOVNET.PF1.VF0               VF     pci_1    mars             
[pci@500/pci@1/pci@0/pci@5/network@0,81]
    Class properties [NETWORK]
        mac-addr = 00:14:4f:f8:07:ad
        mtu = 1500

Of course, we also want to check if and how this shows up in mars:

root@mars:~# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         0      unknown   vnet0
net1              Ethernet             unknown    0      unknown   igbvf0
root@mars:~# grep network /etc/path_to_inst
"/virtual-devices@100/channel-devices@200/network@0" 0 "vnet"
"/pci@500/pci@1/pci@0/pci@5/network@0,81" 0 "igbvf"

As you can see, mars now has two network interfaces.  One, net0, is a more conventional, virtual network interface.  The other, net1, uses the VF driver for the underlying physical device, in our case igb.  Checking in /etc/path_to_inst (or, if you prefer, in /devices), we can now find an entry for this network interface that shows us the PCIe infrastructure now plumbed into mars to support this NIC. Of course, it's the same device path as in the root domain (sun).

So far, we've seen how to create a VF in the root domain, how to assign this to a guest and how it shows up there.  I've used Ethernet for this example, as it's readily available in all systems.  As I mentioned earlier, LDoms also support Infiniband and FibreChannel with SR-IOV, so you could also add a FC HBA's VF to a guest domain.  Note that this doesn't work with just any HBA.  The HBA itself has to support this functionality.  There is a list of supported cards maintained in MOS. 

There are a few more things to note with SR-IOV.  First, there's the VF's identity.  You might not have noticed it, but the VF created in the example above has its own identity - its own MAC address.  While this seems natural in the case of Ethernet, it is actually something that you should be aware of with FC and IB as well.  FC VFs use WWNs and NPIV to identify themselves in the attached fabric.  This means the fabric has to be NPIV capable and the guest domain using the VF can not layer further software NPIV-HBAs on top.  Likewise, IB VFs use HCAGUIDs to identify themselves.  While you can choose Ethernet MAC-addresses and FC WWNs if you prefer, IB VFs choose their HCAGUIDs automatically.  If you intend to run Solaris zones within a guest domain that uses a SR-IOV VF for Ethernet, remember to assign this VF additional MAC-addresses to be used by the anet devices of these zones.
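
One possible way to do this is the alt-mac-addrs property of the network VF.  The sketch below reuses the VF from the example above and lets the system pick two additional MAC addresses automatically; depending on your LDoms version, this may have to be done while the VF is not in active use:

root@sun:~# ldm set-io alt-mac-addrs=auto,auto /SYS/MB/NET2/IOVNET.PF1.VF0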

Finally I want to point out once more that while SR-IOV devices can be moved in and out of domains dynamically, and can be added from two different root domains to the same guest, they still depend on their respective root domains.  This is very similar to the restriction with DirectIO.  So if the root domain owning the PF reboots (for whatever reason), it will reset the PF, which will also reset all VFs and have unpredictable results in the guests using them.  Keep this in mind when deciding whether or not to use SR-IOV.  If you do, consider configuring explicit domain dependencies reflecting these physical dependencies.  You can find details about this in the Admin Guide.  Development in this area is continuing, so you may expect to see enhancements in this space in upcoming versions.
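
As a minimal sketch of such a dependency, the guest from the example above can be tied to the root domain that owns the PF (the primary domain in this case), so that it is reset rather than left in an undefined state when that root domain fails.  The chosen failure policy is just one possible choice:

# declare primary the master of mars and define what happens to its slaves if primary fails
root@sun:~# ldm set-domain master=primary mars
root@sun:~# ldm set-domain failure-policy=reset primary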

Since it is possible to work with multiple root domains and have each of those root domains create VFs of some of their devices, it is important to avoid cyclic dependencies between these root domains.  This is explicitly prevented by the ldm command, which does not allow a VF from one root domain to be assigned to another root domain.

We have now seen multiple ways of providing IO resources to logical domains: Virtual network and disk, PCIe root complexes, PCIe slots and finally SR-IOV.  Each of them have their own pros and cons and you will need to weigh them carefully to find the correct solution for a given task.  I will dedicate one of the next chapters of this series to a discussion of IO best practices and recommendations.  For now, here are some links for further reading about SR-IOV:

Mittwoch Aug 20, 2014

What's up with LDoms: Part 9 - Direct IO

In the last article of this series, we discussed the most general of all physical IO options available for LDoms, root domains.  Now, let's have a short look at the next level of granularity: Virtualizing individual PCIe slots.  In the LDoms terminology, this feature is called "Direct IO" or DIO.  It is very similar to root domains, but instead of reassigning ownership of a complete root complex, it only moves a single PCIe slot or endpoint device to a different domain.  Let's look again at hardware available to mars in the original configuration:

root@sun:~# ldm ls-io
NAME                                      TYPE   BUS      DOMAIN   STATUS  
----                                      ----   ---      ------   ------  
pci_0                                     BUS    pci_0    primary          
pci_1                                     BUS    pci_1    primary          
pci_2                                     BUS    pci_2    primary          
pci_3                                     BUS    pci_3    primary          
/SYS/MB/PCIE1                             PCIE   pci_0    primary  EMP     
/SYS/MB/SASHBA0                           PCIE   pci_0    primary  OCC
/SYS/MB/NET0                              PCIE   pci_0    primary  OCC     
/SYS/MB/PCIE5                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE6                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE7                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE2                             PCIE   pci_2    primary  EMP     
/SYS/MB/PCIE3                             PCIE   pci_2    primary  OCC     
/SYS/MB/PCIE4                             PCIE   pci_2    primary  EMP     
/SYS/MB/PCIE8                             PCIE   pci_3    primary  EMP     
/SYS/MB/SASHBA1                           PCIE   pci_3    primary  OCC     
/SYS/MB/NET2                              PCIE   pci_3    primary  OCC     
/SYS/MB/NET0/IOVNET.PF0                   PF     pci_0    primary          
/SYS/MB/NET0/IOVNET.PF1                   PF     pci_0    primary          
/SYS/MB/NET2/IOVNET.PF0                   PF     pci_3    primary          
/SYS/MB/NET2/IOVNET.PF1                   PF     pci_3    primary

All of the "PCIE" type devices are available for SDIO, with a few limitations.  If the device is a slot, the card in that slot must support the DIO feature.  The documentation lists all such cards.  Moving a slot to a different domain works just like moving a PCI root complex.  Again, this is not a dynamic process and includes reboots of the affected domains.  The resulting configuration is nicely shown in a diagram in the Admin Guide:

There are several important things to note and consider here:

  • The domain receiving the slot/endpoint device turns into an IO domain in LDoms terminology, because it now owns some physical IO hardware.
  • Solaris will create nodes for this hardware under /devices.  This includes entries for the virtual PCI root complex (pci_0 in the diagram) and anything between it and the actual endpoint device.  It is very important to understand that all of this PCIe infrastructure is virtual only!  Only the actual endpoint devices are true physical hardware.
  • There is an implicit dependency between the guest owning the endpoint device and the root domain owning the real PCIe infrastructure:
    • Only if the root domain is up and running will the guest domain have access to the endpoint device.
    • The root domain is still responsible for resetting and configuring the PCIe infrastructure (root complex, PCIe level configurations, error handling etc.) because it owns this part of the physical infrastructure.
    • This also means that if the root domain needs to reset the PCIe root complex for any reason (typically a reboot of the root domain), it will reset and thus disrupt the operation of the endpoint device owned by the guest domain.  The result in the guest is not predictable.  I recommend configuring the resulting behaviour of the guest using domain dependencies, as described in the Admin Guide in the chapter "Configuring Domain Dependencies".
  • Please consult the Admin Guide in Section "Creating an I/O Domain by Assigning PCIe Endpoint Devices" for all the details!

As you can see, there are several restrictions for this feature.  It was introduced in LDoms 2.0, mainly to allow the configuration of guest domains that need access to tape devices.  Today, with the higher number of PCIe root complexes and the availability of SR-IOV, the need for this feature is declining.  I personally do not recommend using it, mainly because of the drawbacks of the dependencies on the root domain and because it can be replaced with SR-IOV (although then with similar limitations).

This was a rather short entry, more for completeness.  I believe that DIO can usually be replaced by SR-IOV, which is much more flexible.  I will cover SR-IOV in the next section of this blog series.

Montag Feb 24, 2014

What's up with LDoms: Part 8 - Physical IO

[Diagram: Virtual IO Setup]
Finally finding some time to continue this blog series...  And starting the new year with a new chapter for which I hope to write several sections: Physical IO options for LDoms and what you can do with them.  In all previous sections, we talked about virtual IO and how to deal with it.  The diagram above shows the general architecture of such virtual IO configurations.  However, there's much more to IO than that.

From an architectural point of view, the primary task of the SPARC hypervisor is partitioning of the system.  The hypervisor isn't usually very active - all it does is assign ownership of some parts of the hardware (CPU, memory, IO resources) to a domain, build a virtual machine from these components and finally start OpenBoot in that virtual machine.  After that, the hypervisor essentially steps aside.  Only if the IO components are virtual do we need ongoing hypervisor support.  But those IO components could also be physical.  Actually, that is the more "natural" option, if you like.  So let's revisit the creation of a domain:

We always start by assigning CPU and memory in a few very simple steps:

root@sun:~# ldm create mars
root@sun:~# ldm set-memory 8g mars
root@sun:~# ldm set-core 8 mars

If we now bound and started the domain, we would have OpenBoot running and we could connect using the virtual console.  Of course, since this domain doesn't have any IO devices, we couldn't yet do anything particularly useful with it.  So, since we want to add physical IO devices, where are they?

To begin with, all physical components are owned by the primary domain.  This is the same for IO devices, just like it is for CPU and memory.  So just like we need to remove some CPU and memory from the primary domain in order to assign these to other domains, we will have to remove some IO from the primary if we want to assign it to another domain.  A general inventory of available IO resources can be obtained with the "ldm ls-io" command:

root@sun:~# ldm ls-io
NAME                                      TYPE   BUS      DOMAIN   STATUS  
----                                      ----   ---      ------   ------  
pci_0                                     BUS    pci_0    primary          
pci_1                                     BUS    pci_1    primary          
pci_2                                     BUS    pci_2    primary          
pci_3                                     BUS    pci_3    primary          
/SYS/MB/PCIE1                             PCIE   pci_0    primary  EMP     
/SYS/MB/SASHBA0                           PCIE   pci_0    primary  OCC
/SYS/MB/NET0                              PCIE   pci_0    primary  OCC     
/SYS/MB/PCIE5                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE6                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE7                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE2                             PCIE   pci_2    primary  EMP     
/SYS/MB/PCIE3                             PCIE   pci_2    primary  OCC     
/SYS/MB/PCIE4                             PCIE   pci_2    primary  EMP     
/SYS/MB/PCIE8                             PCIE   pci_3    primary  EMP     
/SYS/MB/SASHBA1                           PCIE   pci_3    primary  OCC     
/SYS/MB/NET2                              PCIE   pci_3    primary  OCC     
/SYS/MB/NET0/IOVNET.PF0                   PF     pci_0    primary          
/SYS/MB/NET0/IOVNET.PF1                   PF     pci_0    primary          
/SYS/MB/NET2/IOVNET.PF0                   PF     pci_3    primary          
/SYS/MB/NET2/IOVNET.PF1                   PF     pci_3    primary

The output of this command will of course vary greatly, depending on the type of system you have.  The above example is from a T5-2.  As you can see, there are several types of IO resources.  Specifically, there are

  • BUS
    This is a whole PCI bus, which means everything controlled by a single PCI control unit, also called a PCI root complex.  It typically contains several PCI slots and possibly some end point devices like SAS or network controllers.
  • PCIE
    This is either a single PCIe slot.  In that case, its name corresponds to the slot number you will find imprinted on the system chassis.  It is controlled by a root complex listed in the "BUS" column.  In the above example, you can see that some slots are empty, while others are occupied.  Or it is an endpoint device like a SAS HBA or network controller.  An example would be "/SYS/MB/SASHBA0" or "/SYS/MB/NET2".  Both of these typically control more than one actual device, so for example, SASHBA0 would control 4 internal disks and NET2 would control 2 internal network ports.
  • PF
    This is a SR-IOV Physical Function - usually an endpoint device like a network port which is capable of PCI virtualization.  We will cover SR-IOV in a later section of this blog.

All of these devices are available for assignment.  Right now, they are all owned by the primary domain.  We will now release some of them from the primary domain and assign them to a different domain.  Unfortunately, this is not a dynamic operation, so we will have to reboot the control domain (more precisely, the affected domains) once to complete this.

root@sun:~# ldm start-reconf primary
root@sun:~# ldm rm-io pci_3 primary
root@sun:~# reboot
[ wait for the system to come back up ]
root@sun:~# ldm add-io pci_3 mars
root@sun:~# ldm bind mars

With the removal of pci_3, we also removed PCIE8, SASHBA1 and NET2 from the primary domain and added all three to mars.  Mars will now have direct, exclusive access to all the disks controlled by SASHBA1, all the network ports on NET2 and whatever we chose to install in PCIe slot 8.  Since in this particular example mars has access to internal disk and network, it can boot and communicate using these internal devices.  It does not depend on the primary domain for any of this.  Once started, we could actually shut down the primary domain.  (Note that the primary is usually the home of vntsd, the console service.  While we don't need this for running or rebooting mars, we do need it in case mars falls back to OBP or single-user mode.)

[Diagram: Root Domain Setup]
Mars now owns its own PCIe root complex.  Because of this, we call mars a root domain.  The diagram above shows the general architecture - compare it to the virtual IO diagram further up!  Root domains are truly independent partitions of a SPARC system, very similar in functionality to the Dynamic System Domains of the E10k, E25k or M9000 days (or Physical Domains, as they're now called).  They own their own CPU, memory and physical IO.  They can be booted, run and rebooted independently of any other domain.  Any failure in another domain does not affect them.  Of course, we have plenty of shared components: a root domain might share a mainboard, a part of a CPU (mars, for example, only has 8 of the system's cores), some memory modules, etc. with other domains.  Any failure in a shared component will of course affect all the domains sharing that component, which is different from Physical Domains, where there are significantly fewer shared components.  But beyond this, root domains have a level of isolation very similar to that of Physical Domains.

Comparing root domains (which are the most general form of physical IO in LDoms) with virtual IO, here are some pros and cons:

Pros:

  • Root domains are fully independent of all other domains (with the exception of console access, but this is a minor limitation).
  • Root domains have zero overhead in IO - they have no virtualization overhead whatsoever.
  • Root domains, because they don't use virtual IO, are not limited to disk and network, but can also attach to tape, tape libraries or any other generic IO device supported in their PCIe slots.

Cons:

  • Root domains are limited in number.  You can only create as many root domains as you have PCIe root complexes available.  In current T5 and M5/6 systems, that's two per CPU socket.
  • Root domains can not live migrate.  Because they own real IO hardware (with all these nasty little buffers, registers and FIFOs), they can not be live migrated to another chassis.

Because of these different characteristics, root domains are typically used for applications that tend to be more static, have higher IO requirements and/or larger CPU and memory footprints.  Domains with virtual IO, on the other hand, are typically used for the mass of smaller applications with lower IO requirements.  Note that "higher" and "lower" are relative terms - LDoms virtual IO is quite powerful.

This is the end of the first part of the physical IO section, I'll cover some additional options next time.  Here are some links for further reading:

Donnerstag Jul 04, 2013

What's up with LDoms: Part 7 - Layered Virtual Networking

Back for another article about LDoms - today we'll cover some tricky networking options that come up if you want to run Solaris 11 zones in LDom guest systems.  So what's the problem?

[Diagram: MAC Tables in an LDom system]
Let's look at what happens with MAC addresses when you create a guest system with a single vnet network device.  By default, the LDoms Manager selects a MAC address for the new vnet device.  This MAC address is managed in the vswitch, and ethernet packets from and to that MAC address can flow between the vnet device, the vswitch and the outside world.  The ethernet switch on the outside will learn about this new MAC address, too.  Of course, if you assign a MAC address manually, this works the same way.  This situation is shown in the first diagram.  The important thing to note here is that the vnet device in the guest system will have exactly one MAC address, and no "spare slots" with additional addresses.

Add zones into the picture.  With Solaris 10, the situation is simple.  The default behaviour will be a "shared IP" zone, where traffic from the non-global zone will use the IP (and thus ethernet) stack from the global zone.  No additional MAC addresses required.  Since you don't have further "physical" interfaces, there's no temptation to use "exclusive IP" for that zone, except if you'd use a tagged VLAN interface.  But again, this wouldn't need another MAC address.


[Diagram: MAC Tables in previous versions]
With Solaris 11, this changes fundamentally.  Solaris 11, by default, will create a so-called "anet" device for any new zone.  This device is created using the new Solaris 11 network stack, and is simply a virtual NIC.  As such, it will have a MAC address.  The default behaviour is to generate a random MAC address.  However, this random MAC address will not be known to the vswitch in the IO domain or to the vnet device in the global zone, and starting such a zone will fail.


[Diagram: MAC Tables in version 3.0.0.2]
The solution is to allow the vnet device of the LDoms guest to provide more than one MAC address, similar to typical physical NICs which have support for numerous MAC addresses in "slots" that they manage.  This feature has been added to Oracle VM Server for SPARC in version 3.0.0.2.  Jeff Savit wrote about it in his blog, showing a nice example of how things fail without this feature, and how they work with it.  Of course, the same solution will also work if your global zone uses vnics for purposes other than zones.

To make this work, you need to do two things:

  1. Configure the vnet device to have more than one MAC address.  This is done using the new option "alt-mac-addrs" with either ldm add-vnet or ldm set-vnet.  You can either provide manually selected MAC addresses here, or rely on LDoms Manager and its MAC address selection algorithm to provide them.
  2. Configure the zone to use the "auto" option instead of "random" for selecting a MAC address.  This will cause the zone to query the NIC for available MAC addresses instead of coming up with one and making the NIC accept it.  (See the short sketch after this list.)
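Here's a minimal sketch of both steps; the vnet name "vnet0", the guest "mars" and the zone "zone1" are just example names, and depending on the LDoms version the guest may need to be rebound for the new addresses to take effect:

root@sun:~# ldm set-vnet alt-mac-addrs=auto,auto,auto vnet0 mars
[ then, inside the guest domain mars ]
root@mars:~# zonecfg -z zone1
zonecfg:zone1> select anet linkname=net0
zonecfg:zone1:anet> set mac-address=auto
zonecfg:zone1:anet> end
zonecfg:zone1> exit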

I will not go into the details of how this is configured, as this is very nicely covered by Jeff's blog entry already.  I do want to add that you might see similar issues with layered virtual networking in other virtualization solutions:  Running Solaris 11 vnics or zones with exclusive IP in VirtualBox, OVM x86 or VMware will show the very same behaviour.  I don't know if or when these technologies will provide a solution similar to what we now have with LDoms.

Montag Jan 14, 2013

LDoms IO Best Practices & T4 Red Crypto Stack

In November, I presented at DOAG Konferenz & Ausstellung 2012.  Now, almost two months later, I finally get around to posting the slides here...

  • In "LDoms IO Best Practices" I discuss different IO options for both disk and networking and give some recommens on how you to choose the right ones for your environment.  A couple hints about performance are also included.

I hope the slides are useful!

Freitag Dez 21, 2012

What's up with LDoms: Part 6 - Sizing the IO Domain

Before Christmas break, let's look at a topic that's one of the more frequently asked questions: Sizing of the Control Domain and IO Domain.

By now, we've seen how to create the basic setup, create a simple domain and configure networking and disk IO.  We know that for typical virtual IO, we use vswitches and virtual disk services to provide virtual network and disk services to the guests.  The question to address here is: How much CPU and memory is required in the Control and IO-domain (or in any additional IO domain) to provide these services without being a bottleneck?

The answer to this question can be very quick: LDoms Engineering usually recommends 1 or 2 cores for the Control Domain.

However, as always, one size doesn't fit all, and I'd like to look a little closer. 

Essentially, this is a sizing question just like any other system sizing.  So the first question to ask is: What services is the Control Domain providing that need CPU or memory resources?  We can then continue to estimate or measure exactly how much of each we will need. 

As for the services, the answer is straightforward:

  • The Control Domain usually provides
    • Console Services using vntsd
    • Dynamic Reconfiguration and other infrastructure services
    • Live Migration
  • Any IO Domain (either the Control Domain or an additional IO domain) provides
    • Disk Services configured through the vds
    • Network Services configured through the vswitch

For sizing, it is safe to assume that vntsd, ldmd (the actual LDoms Manager daemon), ldmad (the LDoms agent) and any other infrastructure tasks will require very little CPU and can be ignored.  Let's look at the remaining three services:

  • Disk Services
    Disk Services have two parts: data transfer from the IO domain to the backend devices, and data transfer from the IO domain to the guest.  Disk IO in the IO domain is relatively cheap; you don't need many CPU cycles to deal with it.  I have found 1-2 threads of a T2 CPU to be sufficient for about 15,000 IOPS.  Today we usually use T4...
    However, this also depends on the type of backend storage you use.  FC or SAS raw-device LUNs will have very little CPU overhead.  OTOH, if you use files hosted on NFS or ZFS, you are likely to see more CPU activity involved.  Here, your mileage will vary, depending on the configuration and usage pattern.  Also keep in mind that backends hosted on NFS or iSCSI also involve network traffic.
  • Network Services - vswitches
    There is a very old sizing rule that says that you need 1 GHz worth of CPU to saturate 1GBit worth of ethernet.  SAE has published a network encryption benchmark where a single T4 CPU at 2.85 GHz will transmit around 9 GBit at 20% utilization.  Converted into strands and cores, that would mean about 13 strands - less than 2 cores for 9GBit worth of traffic.  Encrypted, mind you.  Applying the mentioned old rule to this result, we would need just over 3 cores at 2.85 GHz to do 9 GBit - it seems we've made some progress in efficiency ;-)
    Applying all of this to IO Domain sizing, I would consider 2 cores an upper bound for typical installations, where you might very well get along with just one core, especially on smaller systems like the T4-1, where you're not likely to have several guest systems that each require  10GBit wirespeed networking.
  • Live Migration
    When considering Live Migration, we should understand that the Control Domains of the two involved systems are the ones actually doing all the work.  They encrypt, compress and send the source system's memory to the target system.  For this, they need quite a bit of CPU.  Of course, one could argue that Live Migration is something happening in the background, so it doesn't matter how fast it's actually done.  However, there's still the suspend phase, where the guest system is suspended and the remaining dirty memory pages are copied over to the other side.  This phase, while typically very short, significantly impacts the "live" experience of Live Migration.  And while other factors like guest activity level and memory size also play a role, there's also a direct connection between CPU power and the length of this suspend time.  The relation between Control Domain CPU configuration and suspend time has been studied and published in the whitepaper "Increasing Application Availability Using Oracle VM Server for SPARC (LDoms) An Oracle Database Example".  The conclusion: for minimum suspend times, configure 3 cores in the Control Domain.  I personally have had good experiences with 2 cores, measuring suspend times as low as 0.1 seconds with a very idle domain, so again, your mileage will vary.

    Another thought here: the Control Domain doesn't usually do Live Migration on a permanent basis.  So if a single core is sufficient for the IO domain role of the Control Domain, you are in good shape for everyday business with just one core.  When you need additional CPU for a quick Live Migration, why not borrow it from somewhere else, like the domain being migrated, or any other domain that isn't currently very busy?  CPU DR lends itself nicely to this purpose (a small sketch follows below)...
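A sketch of such a temporary loan, assuming whole-core constraints and borrowing a core from the not-so-busy domain venus from the examples above:

root@sun # ldm rm-core 1 venus
root@sun # ldm add-core 1 primary
[ perform the live migration ]
root@sun # ldm rm-core 1 primary
root@sun # ldm add-core 1 venus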

As you've seen, there are some rules and there is some experience, but still, there isn't one single answer.  In many cases, you should be OK with a single core on T4 for each IO domain.  If you use Live Migration a lot, you might want to add another core to the Control Domain.  On larger systems with higher networking demands, two cores for each IO Domain might be right.  If these recommendations are good enough for you, you're done.  If you want to dig deeper, simply check what's really going on in your IO Domains.  Use mpstat(1M) to study the utilization of your IO Domain's CPUs in times of high activity.  Perhaps record CPU utilization over a period of time, using your tool of choice.  (I recommend DimSTAT for that.)  With these results, you should be able to adjust the amount of CPU resources of your IO Domains to your exact needs.  However, when doing that, please remember those unannounced utilization peaks - don't be too stingy.  Saving one or two CPU strands won't buy you much, all things considered.

A few words about memory: this is much more straightforward.  If you're not using ZFS as a backing store for your virtual disks, you should be well in the green with 2-4 GB of RAM.  My current test system, running Solaris 11.0 in the Control Domain, needs less than 600 MB of virtual memory.  Remember that 1 GB is the supported minimum for Solaris 11 (and it has changed to 1.5 GB for Solaris 11.1).  If you do use ZFS, you might want to reserve a couple of GB for its ARC, so perhaps 8 GB are more appropriate.  On the Control Domain, which is the first domain to be bound, take 7680 MB, which adds up to 8 GB together with the hypervisor's own 512 MB, nicely fitting the 8 GB boundary favoured by the memory controllers.  Again, if you want to be precise, monitor memory usage in your IO domains.
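Putting the CPU and memory recommendations together, a control domain sizing for a typical T4 system might look like this sketch (adjust to your own measurements; the memory change requires a delayed reconfiguration and a reboot of the control domain):

root@sun # ldm start-reconf primary
root@sun # ldm set-core 2 primary
root@sun # ldm set-memory 7680m primary
root@sun # reboot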

Links:

Update: I just learned that the hypervisor doesn't always take exactly 512MB. So if you do want to align with the 8GB boundary, check the sizes using "ldm ls-devices -a mem". Everything bound to "sys" is owned by the hypervisor.

Mittwoch Nov 07, 2012

What's up with LDoms: Part 5 - A few Words about Consoles

Back again to look at a detail of LDom configuration that is often forgotten - the virtual console server.

Remember, LDoms are SPARC systems.  As such, each guest will have its own OBP running.  And to connect to that OBP, the administrator will need a console connection.  Since it's OBP, and not some x86 BIOS, this console will be very serial in nature ;-)  It's really very much like in the good old days, where we had a terminal concentrator that all those serial cables ended up in.  Just like with other components in LDoms, the virtualized solution looks very similar.

Every LDom guest requires exactly one console connection.  Envision this as similar to the RS-232 port on older SPARC systems.  The LDom framework provides one or more console services that provide access to these connections.  This would be the virtual equivalent of a network terminal server (NTS), where all those serial cables are plugged in.  In the physical world, we'd have a list somewhere that would tell us which TCP port of the NTS was connected to which server.  "ldm list" does just that:

root@sun # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  UART    16    7680M    0.4%  27d 8h 22m
jupiter          bound      ------  5002    20    8G             
mars             active     -n----  5000    2     8G       0.5%  55d 14h 10m
venus            active     -n----  5001    2     8G       0.5%  56d 40m
pluto            inactive   ------          4     4G             

The column marked "CONS" tells us, where to reach the console of each domain. In the case of the primary domain, this is actually a (more) physical connection - it's the console connection of the physical system, which is either reachable via the ILOM of that system, or directly via the serial console port on the chassis. All the other guests are reachable through the console service which we created during the inital setup of the system.  Note that pluto does not have a port assigned.  This is because pluto is not yet bound.  (Binding can be viewed very much as the assembly of computer parts - CPU, Memory, disks, network adapters and a serial console cable are all put together when binding the domain.)  Unless we set the port number explicitly, LDoms Manager will do this on a first come, first serve basis.  For just a few domains, this is fine.  For larger deployments, it might be a good idea to assign these port numbers manually using the "ldm set-vcons" command.  However, there is even better magic associated with virtual consoles.

You can group several domains into one console group, reachable through one TCP port of the console service.  This can be useful when several groups of administrators are to be given access to different domains, or for other grouping reasons.  Here's an example:

root@sun # ldm set-vcons group=planets service=console jupiter
root@sun # ldm set-vcons group=planets service=console pluto
root@sun # ldm bind jupiter 
root@sun # ldm bind pluto
root@sun # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  UART    16    7680M    6.1%  27d 8h 24m
jupiter          bound      ------  5002    200   8G             
mars             active     -n----  5000    2     8G       0.6%  55d 14h 12m
pluto            bound      ------  5002    4     4G             
venus            active     -n----  5001    2     8G       0.5%  56d 42m

root@sun # telnet localhost 5002
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

sun-vnts-planets: h, l, c{id}, n{name}, q:l
DOMAIN ID           DOMAIN NAME                   DOMAIN STATE             
2                   jupiter                       online                   
3                   pluto                         online                   

sun-vnts-planets: h, l, c{id}, n{name}, q:npluto
Connecting to console "pluto" in group "planets" ....
Press ~? for control options ..

What I did here was add the two domains pluto and jupiter to a new console group called "planets" on the service "console" running in the primary domain.  Simply using a group name will create such a group, if it doesn't already exist.  By default, each domain has its own group, using the domain name as the group name.  The group will be available on port 5002, chosen by LDoms Manager because I didn't specify it.  If I connect to that console group, I will now first be prompted to choose the domain I want to connect to from a little menu.

Finally, here's an example how to assign port numbers explicitly:

root@sun # ldm set-vcons port=5044 group=pluto service=console pluto
root@sun # ldm bind pluto
root@sun # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  UART    16    7680M    3.8%  27d 8h 54m
jupiter          active     -t----  5002    200   8G       0.5%  30m
mars             active     -n----  5000    2     8G       0.6%  55d 14h 43m
pluto            bound      ------  5044    4     4G             
venus            active     -n----  5001    2     8G       0.4%  56d 1h 13m

With this, pluto would always be reachable on port 5044 in its own exclusive console group, no matter in which order other domains are bound.

Now, you might be wondering why we always have to mention the console service name, "console", in all the examples here.  The simple answer is that there could be more than one such console service.  For all "normal" use, a single console service is absolutely sufficient.  But the system is flexible enough to allow more than that single one, should you need them.  In fact, you could even configure such a console service on a domain other than the primary (or control) domain, which would make that domain a real console server.  I actually have a customer who does just that - they want to separate console access from the control domain functionality.  But this is definitely a rather sophisticated setup.
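For the sake of illustration, creating such an additional console service could look like the following sketch; the service name, the port range and the domain "consoleserver" are made up for the example, and that domain would then have to run vntsd itself:

root@sun # ldm add-vcc port-range=5100-5200 console2 consoleserver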

Something I don't want to go into in this post is access control.  vntsd, which is the daemon providing all these console services, is fully RBAC-aware, and you can configure authorizations for individual users to connect to console groups or individual domain's consoles.  If you can't wait until I get around to security, check out the man page of vntsd.

Further reading:

  • The Admin Guide is rather reserved on this subject.  I do recommend checking out the Reference Manual.
  • The manpage for vntsd will discuss all the control sequences as well as the grouping and authorizations mentioned here.

Montag Sep 10, 2012

Secure Deployment of Oracle VM Server for SPARC - updated

Quite a while ago, I published a paper with recommendations for a secure deployment of LDoms.  Many things have happened in the meantime, and an update to that paper was due.  Besides some minor spelling corrections, many obsolete or changed links were updated.  However, the main reason for the update was the introduction of a second usage model for LDoms.  In short: with the success especially of the T4-4, many deployments make use of the hardware partitioning capabilities of that platform, assigning full PCIe root complexes to domains, mimicking dynamic system domains if you will.  This different way of using the hypervisor needed to be addressed in the paper.  You can find the updated version here:

Secure Deployment of Oracle VM Server for SPARC
Second Edition

I hope it'll be useful!

Freitag Jul 13, 2012

What's up with LDoms: Part 3 - A closer look at Disk Backend Choices

In this section, we'll have a closer look at virtual disk backends and the various choices available here.  As a little reminder, a disk backend, in LDoms speak, is the physical storage used when creating a virtual disk for a guest system.  In other virtualization solutions, these are sometimes called virtual disk images, a term that doesn't really fit all the possible options available in LDoms.

In the previous example, we used a ZFS volume as a backend for the boot disk of mars.  But there are many other ways to store the data of virtual disks.  The relevant section in the Admin Guide lists all the available options:

  • Physical LUNs, in any variant that the Control Domain supports.  This of course includes SAN, iSCSI and SAS, including the internal disks of the host system.
  • Logical Volumes like ZFS Volumes, but also SVM or VxVM
  • Regular Files. These can be stored in any filesystem, as long as they're accessible by the LDoms subsystem. This includes storage on NFS.

Each of these backend devices has its own set of characteristics that should be considered when deciding which backend type to use.  Let's look at them in a little more detail.

LUNs are the most generic option.  By assigning a virtual disk to a LUN backend, the guest essentially gains full access to the underlying storage device, whatever that might be.  It will see the volume label of the LUN, it can see and alter the partition table of the LUN, and it can also read or set SCSI reservations on that device.  Depending on the way the LUN is connected to the host system, this very same LUN could also be attached to a second host and a guest residing on it, with the two guests sharing the data on that one LUN, or supporting live migration.  If there is a filesystem on the LUN, the guest will be able to mount that filesystem, just like any other system with access to that LUN, be it virtualized or direct.  Bear in mind that most filesystems are non-shared filesystems.  This doesn't change here, either.  For the IO domain (that's the domain where the physical LUN is connected) LUNs mean the least possible amount of work.  All it has to do is pass data blocks up and down to and from the LUN; there is a bare minimum of driver layers involved.

Flat files, on the other hand, are the simplest option, very similar in user experience to what one would do in a desktop hypervisor like VirtualBox.  The easiest way to create one is with the "mkfile" command.  For the guest, there is no real difference to LUNs.  The virtual disk will, just like in the LUN case, appear to be a full disk, partition table, label and all.  Of course, initially it'll be all empty, so the first thing the guest usually needs to do is write a label to the disk.  The main difference to LUNs is in the way these image files are managed.  Since they are files in a filesystem, they can be copied, moved and deleted, all of which should be done with care, especially if the guest is still running.  They can be managed by the filesystem, which means attributes like compression, encryption or deduplication in ZFS could apply to them - fully transparent to the guest.  If the filesystem is a shared filesystem like NFS or SAM-FS, the file (and thus the disk image) could be shared by another LDom on another system, for example as a shared database disk or for live migration.  Their performance will be impacted by the filesystem, too.  The IO domain might cache some of the file, hoping to speed up operations.  If there are many such image files on a single filesystem, they might impact each other's performance.  These files, by the way, need not be empty initially.  A typical use case would be a Solaris iso image file.  Adding it to a guest as a virtual disk will allow that guest to boot (and install) off that iso image as if it were a physical CD drive.
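As a small sketch (file names, paths and the virtual disk service name "primary-vds" are invented for the example; use the vds you created during your initial setup), adding an empty 20 GB image and a read-only Solaris iso to mars could look like this:

root@sun # mkfile 20g /ldoms/mars-data.img
root@sun # ldm add-vdsdev /ldoms/mars-data.img mars-data@primary-vds
root@sun # ldm add-vdisk data0 mars-data@primary-vds mars
root@sun # ldm add-vdsdev options=ro /iso/sol-11-text-sparc.iso sol11-iso@primary-vds
root@sun # ldm add-vdisk sol11-cd sol11-iso@primary-vds mars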

Finally, there are logical volumes, typically created with volume managers such as Solaris Volume Manager (SVM), Veritas Volume Manager (VxVM) or, of course, ZFS.  For the guest, again, these look just like ordinary disks, very much like files.  The difference to files is in the management layer: the logical volumes are created straight from the underlying storage, without a filesystem layer in between.  In the database world, we would call these "raw devices", and their device names in Solaris are very similar to those of physical LUNs.  We need different commands to find out how large these volumes are, or how much space is left on the storage devices underneath.  Other than that, however, they are very similar to files in many ways.  Sharing them between two host systems is likely to be more complex, as one would need the corresponding cluster volume managers, which typically only really work in combination with Solaris Cluster.  One type of volume that deserves special mentioning is the ZFS volume.  It offers all the features of a normal ZFS dataset: clones, snapshots, compression, encryption, deduplication, etc.  Especially with snapshots and clones, it lends itself as the ideal backend for all use cases that make heavy use of these features.
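And a similar sketch for ZFS volumes, including a clone used as the disk of a second test domain (pool and dataset names are again just examples):

root@sun # zfs create -V 20g rpool/ldoms/mars-disk0
root@sun # ldm add-vdsdev /dev/zvol/dsk/rpool/ldoms/mars-disk0 mars-disk0@primary-vds
root@sun # ldm add-vdisk disk0 mars-disk0@primary-vds mars
[ later, after installing mars, clone its disk for a second test domain ]
root@sun # zfs snapshot rpool/ldoms/mars-disk0@golden
root@sun # zfs clone rpool/ldoms/mars-disk0@golden rpool/ldoms/venus-disk0
root@sun # ldm add-vdsdev /dev/zvol/dsk/rpool/ldoms/venus-disk0 venus-disk0@primary-vds
root@sun # ldm add-vdisk disk0 venus-disk0@primary-vds venus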

For the sake of completeness, I'd like to mention that you can export all of these backends to a guest with or without the "slice" option, something that I consider less useful in most cases, which is why I'd like to refer you to the relevant section in the Admin Guide if you want to know more about this.

Lastly, you have the option to export these backends read-only to prevent any changes from the guests.  Keep in mind that even mounting a UFS filesystem read-only would require a write operation to the virtual disk.  The most typical use case for this is probably an iso image, which can indeed be mounted read-only.  You can also export one backend to more than one guest.  In the physical world, this would correspond to using the same SAN LUN on several hosts, and the same restrictions with regard to shared filesystems etc. apply.

So now that we know about all these different options, when should we use which kind of backend ?  The answer, as usual, is: It depends!

LUNs require a SAN (or iSCSI) infrastructure which we tend to associate with higher cost.  On the other hand, they can be shared between many hosts, are easily mapped from host to host and bring a rich feature set of storage management and redundancy with them.  I recommend LUNs (especially SAN) for both boot devices and data disks of guest systems in production environments.  My main reasons for this are:

  • They are very light-weight on the IO domain
  • They avoid any double buffering of data in the guest and in the IO domain because there is no filesystem layer involved in the IO domain.
  • Redundancy for the device and the data path is easy
  • They allow sharing between hosts, which in turn allows cluster implementations and live migration
  • All ZFS features can be implemented in the guest, if desired.

For test and development, my first choice is usually the ZFS volume.  Unlike VxVM, it comes free of charge, and its features like snapshots and clones meet the typical requirements of such environments to quickly create, copy and destroy test environments.  I explicitly recommend against using ZFS snapshots/clones (files or volumes) over a longer period of time.  Since ZFS records the delta between the original image and the clones, the space overhead will eventually grow to a multiple of the initial size and may eventually even prevent further IO to the virtual disk if the zpool is full.  Also keep in mind that ZFS is not a shared filesystem.  This prevents guests that use ZFS files or volumes as virtual disks from doing live migration.  Which leads directly to the recommendation for files:

I recommend files on NFS (or other shared filesystems) in all those cases where SAN LUNs are not available but shared access to disk images is required because of live migration (or because cluster software like Solaris Cluster or RAC is running in the guests).  The functionality is mostly the same as for LUNs, with the exception of SCSI reservations, which don't work with a file backend.  However, CPU requirements in the IO domain and the performance of NFS files as compared to SAN LUNs are likely to be different, which is why I strongly recommend using SAN LUNs for all production use cases.

Further reading:

Freitag Jun 29, 2012

Oracle VM Server for SPARC Demo Videos

I just stumbled across several well done demos for newer LDoms features.  Find them all in the youtube channel "Oracle VM Server for SPARC".  I'd like to recommend the ones about power management and cross CPU migration specifically :-)