Freitag Jul 22, 2016

Setting up Owncloud on Solaris

I recently had this private little project to try out Owncloud and Nextcloud for personal use.  But since I tried it on Solaris, I thought I might as well share a short summary here for whoever might find it useful.

To deploy either Owncloud or Nextcloud on Solaris, you generally follow the commandline installation instructions, which are short and straightforward; use the Linux manual installation for guidance.  However, there are a few Solaris specifics, like package dependencies, which are not documented.  Here's what you'll need to do:

  • I installed in a non-global zone (with the goal of making it immutable once everything is up and running).  To resolve all the dependencies, you'll need to install these packages right after deploying the empty zone (not sure I need all those apache packages...):
  • Make sure your zone has internet access and DNS resolution.  It will need it to use the Owncloud/Nextcloud appstore.
  • It is easiest to install and run Owncloud/Nextcloud as webservd, since then you don't have to bother with tweaking apache into using a different user.
  • You'll need to enable a few extensions for php.  You do this in /etc/php/5.6/conf.d/extensions.ini.  Here are the ones I enabled; I'm not sure I need them all...
  • Create a config file for the mysql extension in /etc/php/5.6/conf.d/mysql.ini.  I took the example from the Admin Guide.
  • I wanted to have a separate ZFS dataset for the software, the data and the mysql database.  This would give me snapshot capability as well as write access to the data once the zone is immutable.
    • Delegate a ZFS dataset to the zone.
      zonecfg -z nextcloud info dataset
      	name: datapool/nextcloud
      	alias: nextcloud
    • Create some filesystems in the dataset to host software, data and database
      root@nextcloud:~# zfs list -r nextcloud
      nextcloud          243M  2.52T  38.6K  /nextcloud
      nextcloud/apache  38.0K  2.52T  38.0K  /nextcloud/apache
      nextcloud/data    17.5M  2.52T  17.5M  /nextcloud/server/nextcloud/data
      nextcloud/mysql    146M  2.52T   146M  /nextcloud/mysql
      nextcloud/server  79.2M  2.52T  79.2M  /nextcloud/server
    • Change the mysql default to point to the new location:
      svccfg -s mysql:version_56 setprop mysql/data=/nextcloud/mysql/data 
      svccfg -s mysql:version_56 refresh
  • Now just follow the Admin Guide to create the mysql database:
    svcadm enable mysql
    mysqladmin -u root password "secret"
    mysql -u root -p
    mysql> create user 'admin'@'localhost' identified by 'secret';
    Query OK, 0 rows affected (0.25 sec)
    mysql> create database if not exists nextcloud ;
    Query OK, 1 row affected (0.00 sec)
    mysql> GRANT ALL PRIVILEGES ON nextcloud.* TO 'admin'@'localhost' identified by 'secret';
    Query OK, 0 rows affected (0.00 sec)
  • And finally, perform the installation:
    php occ maintenance:install --database "mysql" --database-name "nextcloud" --database-user "root" --database-pass "secret"\
    --admin-user "admin" --admin-pass "secret"
  • The rest is no different to the Linux installation.  You'll need to configure apache to serve the application.  Don't forget to do this with SSL if you're actually running this on the internet!
  • Don't forget to tighten file security as described in the Admin Guide!
  • Once done, I turned my zone immutable for additional security.  For this to work, I had to redirect the apache logs to a writable directory, so I created another ZFS dataset in the nextcloud pool and had apache send its logs there.  To turn immutability on, just do
    zoneadm -z nextcloud halt
    zonecfg -z nextcloud set file-mac-profile=fixed-configuration
    zoneadm -z nextcloud boot
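
For reference, the two php config files mentioned above might look like the following.  The extension list and socket paths are illustrative guesses for a typical Nextcloud-on-MySQL setup, not the exact files from this installation:

```ini
; /etc/php/5.6/conf.d/extensions.ini (illustrative excerpt)
extension=curl.so
extension=gd.so
extension=mbstring.so
extension=mysql.so
extension=mysqli.so
extension=pdo_mysql.so
extension=zip.so

; /etc/php/5.6/conf.d/mysql.ini (illustrative)
mysql.default_socket=/tmp/mysql.sock
mysqli.default_socket=/tmp/mysql.sock
pdo_mysql.default_socket=/tmp/mysql.sock
```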

Have fun!

Dienstag Apr 26, 2016

Socket, Core, Strand - Where are my Zones?

Consolidation using Solaris Zones is widely adopted.  In many cases, people run all the zones on all available CPUs, which is great for overall utilization.  In such a case, Solaris does all the scheduling, taking care that the best CPU is chosen for each process and that all resources are distributed fairly amongst all applications.  However, there are cases where you would want to dedicate a certain set of CPUs to one or more zones.  For example to deal with license restrictions or to create a more strict separation between different workloads.  This separation is achieved either by using the "dedicated-cpu" setting in the zone's configuration, or by binding the zone to an existing resource pool, which in turn contains a processor set.  The technology in both cases is the same, since in the case of "dedicated-cpu", Solaris automatically creates a temporary resource pool when the zone is started.  The effect of using a processor set is that the CPUs assigned to it are available exclusively to the zones associated with this set.  This means that these zones can use exactly those CPUs - not more, not less.  Anything else running on the system (the global zone and any other zones) can no longer be executed on these CPUs.

In this article, I'll discuss (and hopefully answer) the question of which CPUs to include in such a processor set, and how to figure out which zones currently run on which CPUs.

To avoid unnecessary confusion, let me define a few terms first, since there are multiple names in use for the various concepts:

  • A CPU is a processor, consisting of one or more cores, cache and optionally some IO controllers and/or memory controllers.
  • A Core is one computation or execution unit on a CPU.  (Not to be confused with the pipelines that it contains.)
  • A Strand is an entry point into a core, which makes the core's services available to the operating system.

For example, a SPARC M7 CPU consists of 32 cores.  Each core provides 8 strands, so an M7 CPU provides 32*8=256 strands to the OS.  The OS treats each of these strands as a fully-fledged execution unit and therefore shows 256 "CPUs".

All modern multi-core CPUs include multiple levels of caches.  The L3 cache is usually shared by all cores.  L2 and L1 caches are closer to the cores.  They are smaller but faster and often dedicated to one or a small number of cores.  (The M7 CPU applies different strategies, but each core owns its own exclusive L1 cache.)  Now, if multiple strands of the same core are used by the same process (or application), this can lead to relatively high hit rates in these caches.  If, on the other hand, different processes use the same core, there will be competition for the little cache space, overwriting each other's entries.  We call this behavior "cache thrashing".  Solaris does a good job trying to prevent this.  However, when using many zones, it is common to assign different zones to different sets of cores.  Use whole cores (complete sets of 8 strands) to avoid sharing of cores between zones or applications.  This also makes the most sense with regards to license capping, since you usually license your application by the number of cores.

So how can you make sure that your zones are bound correctly to whole, exclusive cores?

Solaris knows about the relation between strands, cores and CPUs (as well as the memory hierarchy, which I'll not cover here).  You can query this relation using kstat.  For historical reasons (from the times when there were no multi-core or multi-strand CPUs), Solaris uses the term "CPU" for what we now call a strand:

root@mars:~# kstat -m cpu_info -s core_id -i 150
module: cpu_info                        instance: 150   
name:   cpu_info150                     class:    misc
        core_id                         18

root@mars:~# kstat -m cpu_info -s chip_id -i 150
module: cpu_info                        instance: 150   
name:   cpu_info150                     class:    misc
        chip_id                         1

In the above example, the "cpu" with id 150 is a strand of core 18, which belongs to CPU 1.  You can discover all available strands and CPUs like this.
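
To enumerate the complete strand-to-core mapping in one pass, you can feed the parseable kstat output through awk.  Here's a minimal sketch; the kstat_sample function stands in for the real command (shown in the comment), and the four sample lines are hypothetical:

```shell
#!/bin/sh
# Group strand IDs (kstat "cpu" instances) by core_id.
# On a live Solaris system, replace kstat_sample with:
#   kstat -p -m cpu_info -s core_id
# The sample mimics kstat -p's "module:instance:name:statistic<TAB>value" format.
kstat_sample() {
    printf 'cpu_info:0:cpu_info0:core_id\t0\n'
    printf 'cpu_info:1:cpu_info1:core_id\t0\n'
    printf 'cpu_info:2:cpu_info2:core_id\t1\n'
    printf 'cpu_info:3:cpu_info3:core_id\t1\n'
}

RESULT=$(kstat_sample | awk -F'[\t:]' '
{ s[$NF] = (s[$NF] == "" ? $2 : s[$NF] "," $2) }  # append strand ($2) to its core ($NF)
END { for (c in s) printf "core %s: strands %s\n", c, s[c] }
' | sort)
printf '%s\n' "$RESULT"
```

The same approach works with the chip_id statistic to map cores to sockets.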

Usually, when you configure a processor set for a resource pool, you just tell it the minimum and maximum number of strands it should contain (where min=max is quite common).  Optionally, you can also specify specific CPU IDs (strands) or, since Solaris 11.2, core IDs.  The commands to do this are "pooladm" and "poolcfg".  (There is also the command "psrset", but it only creates a processor set, not a resource pool, and its changes are not permanent, so it needs to be re-run after every reboot.)  I already described the use of these commands a while ago.  Now, to figure out which strands, cores or CPUs are assigned to a specific zone, you'd need to use kstat to find the association between strand IDs in your processor set and the corresponding cores and CPUs.  Done manually, that's a little painful, which is why I wrote a little script to do this for you:

root@mars:~# ./zonecores -h
usage: zonecores [-Sscl] 
       -S report whole Socket use
       -s report shared use
       -c report whole core use
       -l list cpu overview

With the "-l" commandline option, it will give you an overview of the available CPUs and which zones are running on them.  Here's an example from a SPARC system with two 16-core CPUs:

root@mars:~# ./zonecores -l
# Socket, Core, Strand and Zone Overview
Socket Core Strands Zones
0      0    0,1,2,3,4,5,6,7 db2,
0      1    8,9,10,11,12,13,14,15 db2,
0      2    16,17,18,19,20,21,22,23 none
0      3    24,25,26,27,28,29,30,31 db2,
0      4    32,33,34,35,36,37,38,39 db2,
0      5    40,41,42,43,44,45,46,47 db2,
0      6    48,49,50,51,52,53,54,55 db2,
0      7    56,57,58,59,60,61,62,63 coreshare,db1,
0      8    64,65,66,67,68,69,70,71 db2,
0      9    72,73,74,75,76,77,78,79 none
0     10    80,81,82,83,84,85,86,87 none
0     11    88,89,90,91,92,93,94,95 none
0     12    96,97,98,99,100,101,102,103 none
0     13    104,105,106,107,108,109,110,111 none
0     14    112,113,114,115,116,117,118,119 none
0     15    120,121,122,123,124,125,126,127 none
1     16    128,129,130,131,132,133,134,135 none
1     17    136,137,138,139,140,141,142,143 none
1     18    144,145,146,147,148,149,150,151 none
1     19    152,153,154,155,156,157,158,159 none
1     20    160,161,162,163,164,165,166,167 none
1     21    168,169,170,171,172,173,174,175 none
1     22    176,177,178,179,180,181,182,183 none
1     23    184,185,186,187,188,189,190,191 none
1     24    192,193,194,195,196,197,198,199 none
1     25    200,201,202,203,204,205,206,207 none
1     26    208,209,210,211,212,213,214,215 none
1     27    216,217,218,219,220,221,222,223 none
1     28    224,225,226,227,228,229,230,231 none
1     29    232,233,234,235,236,237,238,239 none
1     30    240,241,242,243,244,245,246,247 db2,
1     31    248,249,250,251,252,253,254,255 none

Using the options -S and -c, you can check whether your zones use whole sockets (-S) or whole cores (-c).   With -s you can check whether or not several zones share one or more cores, which can be intentional or not, depending on the use case.  Here's an example with various pools and zones on the same system as above:

root@mars:~# ./zonecores -Ssc
# Checking Socket Affinity (16 cores per socket)
INFO - Zone db2 using 2 sockets for 8 cores.
OK - Zone db1 using 1 sockets for 1 cores.
OK - Zone capped7 using default pool.
OK - Zone coreshare using 1 sockets for 1 cores.
# Checking Core Resource Sharing
OK - Core 0 used by only one zone.
OK - Core 1 used by only one zone.
OK - Core 3 used by only one zone.
OK - Core 30 used by only one zone.
OK - Core 4 used by only one zone.
OK - Core 5 used by only one zone.
OK - Core 6 used by only one zone.
INFO - Core 7 used by 2 zones!
-> coreshare
-> db1
OK - Core 8 used by only one zone.
# Checking Whole Core Assignments
OK - Zone db2 using all 8 strands of core 0.
OK - Zone db2 using all 8 strands of core 1.
OK - Zone db2 using all 8 strands of core 3.
OK - Zone db2 using all 8 strands of core 30.
OK - Zone db2 using all 8 strands of core 4.
OK - Zone db2 using all 8 strands of core 5.
FAIL - only 7 strands of core 6 in use for zone db2.
FAIL - only 1 strands of core 8 in use for zone db2.
OK - Zone db1 using all 8 strands of core 7.
OK - Zone coreshare using all 8 strands of core 7.

Info: 1 instances of core sharing found.
Info: 1 instances of socket spanning found.
Warning: 2 issues found with whole core assignments.

While this mostly speaks for itself, here are some comments:

  • Zone db1 uses a resource pool with 8 strands from one core.
  • Zone coreshare also uses that same pool.
  • Zone db2 uses a resource pool with 64 strands coming from cores on two different CPUs.  It only uses 7 of the 8 strands from core 6, while the 8th strand comes from core 8.  This is probably not intentional.  It would make more sense to use all 8 strands from the same core to avoid cache sharing and reduce the number of cores to license by one.  It might also be beneficial to use all 8 cores from the same CPU.  In this case, Solaris would attempt to allocate memory local to that CPU to avoid remote memory access.
  • Zone capped7 is configured with the option "capped-cpu: ncpus=7".  This is implemented with a CPU cap rather than a processor set: the zone runs on all available CPUs in the default pool, but its total CPU consumption is limited to the equivalent of 7 CPUs.
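
The whole-core check performed by the script boils down to counting, for every core a zone's processor set touches, how many of that core's strands are in the set.  Here's a toy sketch of that logic with hypothetical data (on a real system, the inputs would come from kstat and poolcfg):

```shell
#!/bin/sh
# Hypothetical data: strands 48-55 belong to core 6, strands 56-63 to core 7.
{ for s in 48 49 50 51 52 53 54 55; do echo "6 $s"; done
  for s in 56 57 58 59 60 61 62 63; do echo "7 $s"; done; } > /tmp/coremap
# The zone's pset holds all of core 6 but only one strand of core 7.
echo "48 49 50 51 52 53 54 55 56" > /tmp/pset

RESULT=$(awk '
NR == FNR { core[$2] = $1; total[$1]++; next }   # first file: strand -> core
{ for (i = 1; i <= NF; i++) used[core[$i]]++ }   # second file: count pset strands per core
END {
  for (c in used)
    if (used[c] == total[c])
      printf "OK - all %d strands of core %s in use\n", total[c], c
    else
      printf "FAIL - only %d strands of core %s in use\n", used[c], c
}' /tmp/coremap /tmp/pset | sort)
printf '%s\n' "$RESULT"
```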

The script is available for download here: zonecores

I also wrote a more detailed discussion of all of this, with examples of how to reconfigure your pool configuration, in MOS DocID 2116794.1.


Montag Apr 04, 2016

ASM Scoped Security - A Realistic Example

If you run multiple grid infrastructures (aka RAC clusters) on SuperCluster, which share the same set of Exadata Storage Servers (aka cells), adding ASM Scoped Security to the setup is a good idea.  Even if there are no security reasons like multi-tenancy, simply preventing accidental use of one cluster's diskgroups by another cluster should be reason enough to implement this simple precaution.

Of course, there is good documentation on this feature, available here.   However, as often the case, the devil is in the details, so here's a comprehensive example of how to do this:

  1. Shut down the cluster you want to modify.  Use "crsctl stop crs" on all cluster nodes.
  2. Create a key for this cluster
    On any storage cell, use "cellcli -e create key".  It will give you an ASCII string to use as a key.  Copy that string to a temporary place.  In this example, I'll use the key '9e9a606a461a1abc6af43626e85af3b7'
  3. Invent a unique name to use for this cluster.  In this example, I'll use "marsc1" to denote the first cluster running on mars.
  4. Create a name/key pair on all cells using this unique name and the key from above.  On all cells, execute this cellcli command:
    assign key for 'marsc1'='9e9a606a461a1abc6af43626e85af3b7'
  5. Here's the most difficult part.  We'll need to assign all griddisks that are used by our cluster to this unique name.  Cellcli's filters and wildcards don't help much here.  Here's how I did it:
    1. On all cells, create a list of all disks belonging to marsc1.  In cellcli, do:
      spool /tmp/disks
      list griddisk where asmdiskgroupname='DATAC1' attributes name
      list griddisk where asmdiskgroupname='RECOC1' attributes name
    2. In /tmp/disks on each cell, you will now have one line per griddisk, each containing just the griddisk name.
    3. Using your favorite file manipulation tools (I used awk and vi), use this file to create a command file that contains one "alter griddisk" command for each griddisk.  Mine looked like this afterwards:
      alter griddisk DATAC1_CD_00_marsceladm04 availableTo='marsc1'
      alter griddisk DATAC1_CD_01_marsceladm04 availableTo='marsc1'
      alter griddisk DATAC1_CD_02_marsceladm04 availableTo='marsc1'
      alter griddisk DATAC1_CD_03_marsceladm04 availableTo='marsc1'
    4. Run this command script on each cell.  Of course, each cell will have its own script.
      # cellcli < script
    5. Check that it worked using cellcli:
      list griddisk attributes name,availableTo
  6. Finally, enter the unique name and the key in a file called "cellkey.ora".  On Solaris, this file is located in /etc/oracle/cell/network-config.
    My file looks like this:
      key=9e9a606a461a1abc6af43626e85af3b7
      asm=marsc1
  7. Restart crs on all nodes:
    crsctl start crs

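The file-manipulation step (5.3) can be sketched with a single awk command.  The disk names below are hypothetical stand-ins for the contents of /tmp/disks as spooled in step 5.1, and 'marsc1' is the unique name assigned earlier:

```shell
#!/bin/sh
# Stand-in for the /tmp/disks file spooled by cellcli (names are examples).
cat << 'EOF' > /tmp/disks.sample
         DATAC1_CD_00_marsceladm04
         DATAC1_CD_01_marsceladm04
         RECOC1_CD_00_marsceladm04
EOF

# Build one "alter griddisk" command per line (\047 is the single quote).
awk 'NF { printf "alter griddisk %s availableTo=\047marsc1\047\n", $1 }' \
    /tmp/disks.sample > /tmp/gdscript
cat /tmp/gdscript
```

On each cell you would then feed the generated file to cellcli, as in step 5.4: cellcli < /tmp/gdscript.
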
That should be all.  You can easily verify that your other clusters can no longer see these diskgroups or disks from another cluster's ASM: "asmcmd lsdg --discovery"

Now, repeat this for all of your clusters.  The end result will be exclusive access to each cluster's disks, with no danger of intentional snooping or unintentional use.

One tool that comes in very handy for doing stuff on all cells at the same time is "cssh" - a one-to-many commandline tool included in recent versions of Solaris.

Mittwoch Dez 23, 2015

Getting Started with OVM Templates for SPARC - Part 3: Using Templates on SuperCluster

As with the previous parts of this mini-series, this is the blog version of my MOS Note 2065199.1, although this time I'll deviate a bit more from the "original".

In the previous section, we saw how to create and deploy templates on standalone systems.  While straightforward, it was all commandline work.  On SuperCluster, this is very different.  First of all, you can't create a template on SuperCluster.  That's because manually creating domains on SuperCluster is not supported, and the domains created by the tools don't use a file backend for their rpool.  Deployment, on the other hand, is made even easier than on standalone systems by the very easy-to-use "IO Domain Creation Tool".

The "IO Domain Creation Tool" manages all aspects of the lifecycle of an IO domain - a new type of domain introduced in the 2.0 release of SuperCluster.  I will not go into the details of this tool and the SuperCluster IO domain here.  Have a look in the official documentation for details.  However, IO domains also support OVM templates as a possible installation method, and this is what I'll cover here.

On standalone systems, you're responsible for managing your own templates.  Typically, you'll have some shared storage with a collection of templates that you use.  In SuperCluster, this collection is formalized into a "Library" from which you can pick a template at install time.  So, to get started, the first step is to upload a template into that library.  To do that, the template file will need to be available on a filesystem visible in the primary domain of the first node.  In the example I'll use here, this is /net/.  In the IO Domain Creation Tool, go to the "OVM Templates" tab, enter the full pathname of the OVA file and select "Add Template".
The template will now be uploaded to the library.  Since this includes unpacking the OVA file and uncompressing the disk images, it will take several minutes.  There is no progress display, but you can check the contents of the library from time to time.

The template will be displayed in the list of existing templates once the import process has completed.  Note that at this time, you can't delete or otherwise modify a template once it's in this library.  We're working on that...

Allocating Hardware and Selecting the Template

On SuperCluster, you can't just create a new domain using ldm commands like you would on a standalone system.  All hardware resources available for IO domains are managed by the IO domains subsystem.  You'll need to "register" the hardware you want for the new domain with that subsystem, using the IO Domain Creation Tool.  To do that, go to the "IO Domains" tab of the IO Domain Creation Tool.  It will display a list of existing IO Domains.  Select the "Add IO Domain" button.  On the following screen, select your template as the domain type and the desired hardware size as the domain recipe.  The size you choose here will override any resource requirements laid down in the template.  Then click "Next".

The next screen will ask you to provide values for various properties.  Some of these are hostnames for different network interfaces.  These are SuperCluster specific and will override any settings configured in the template.  Depending on the template, there might be other properties to be filled with values.  In our example, you will be asked to provide values for the three properties defined earlier.  Once you are satisfied with your entries, click "Allocate".  This will take you back to the list of IO domains.

At the top of the screen, some network details will be displayed.  The tool chooses IP addresses from a pool which was configured during SuperCluster Installation.  Make note of these so you can easily connect to your domain after deployment.

Deploying the Domain

The final step is very simple.  To deploy the domain with the configuration given in the previous steps, select the domain in the list of available domains and click on "Deploy".  After a final confirmation, the domain will be created and the template will be deployed.

Now wait for the Domain to be created.

Once ready, you can view some details.

And of course, you can connect to the application and see if it works.

Differences to Templates on Standalone Systems

Domains deployed using templates on standalone systems are configured based on the requirements defined in the template.  This means the number of CPUs, the amount of memory, the number of network interfaces, the type and location of the disk images are either taken from the template definition or are defined by the administrator during template deployment and configuration.  When deploying on SuperCluster, all of this is controlled by the IO domain creation utility:

  • The domain size is set by the utility based on the domain recipe chosen by the administrator.
  • Network adapters are created and configured to conform with the SuperCluster network environment, overriding any settings in the template.
  • Disk images are always provided as iSCSI LUNs from the central ZFS Storage Appliance.  The image sizes are taken from the template.
  • /etc/hosts is populated with entries from the SuperCluster environment.
  • The Solaris publisher is set to the standard publisher in the SuperCluster environment.  This includes adding the exa-family publisher.
  • ssctuner is installed and activated.
  • The domain's root password is set to a default password.

The net effect of these differences is that the IO domain, although based on a template, is a normal SuperCluster IO domain that provides full connectivity to the SuperCluster environment and is fully supported, just like any other application domain.  This also means that any restrictions that apply to normal application domains also apply to IO domains deployed from a template.

For additional and background reading, please refer to the links I provided at the end of the first article in this mini-series.

Donnerstag Dez 03, 2015

Getting Started with OVM Templates for SPARC - Part 2: Creating a Template

The primary purpose of a template is to deliver an application "ready to run" in an environment that contains everything the application needs, but not more than necessary.  The template will be deployed multiple times and you will want to configure each deployed instance to give it a personality and perhaps connect it with other components running elsewhere.  For this to work, you will need to understand the configuration aspects of the application well.  You might need a script that does some or all of the configuration of the application once the template has been deployed and the domain boots for the first time.  Any configuration not done by such a script will need to be done manually by the user of the template after the domain has booted.  It is usually desirable to create templates in such a way that no further manual configuration is required.  I'll not cover how to create such first-boot scripts here.

(Note: This article is the blog version of the second part of MOS DocID 2063739.1 reproduced here for convenient access.) 

Here are the steps usually required to build a template:

  1. Create a new domain and install Solaris and the ovmt utilities.
  2. Define what properties will be needed to configure the application.
  3. Install and configure the application as much as possible.
  4. Write and test a first-boot script or SMF service that will read the values for these properties using the ovmt utilities and configure the application.
  5. Unconfigure Solaris in your domain and remove any temporary files, then shut it down.
  6. Create the template. 
  7. Test your template.  Go back to any of the previous steps to fix issues you find.

Before we look at each of these steps using an example, here is a little background about how properties work in this context.

Properties for Your Template

Most applications require some sort of configuration before they are ready.  Typical configuration items might be TCP ports on which to listen for requests, admin passwords for a web user interface or the IP address of a central administration server for that application.  These are usually set during manual installation of the application.  Since the whole idea of an application template is that once you deploy it, the application is ready to run, there needs to be a way to pass these configuration items to the application during template deployment.  This is where properties come in.  Any configuration item that must be passed to the application at deployment time is defined as a property of the template.  This happens in a properties definition file that is bundled with the actual OS image to build the template.  Note that this file only contains the definition of these properties, not their values!  Properties are populated with values at deployment time.

You have already seen an example of how to pass such values to a template during deployment in the first part of this mini-series, when Solaris properties were passed to the template.

Properties are defined in a small XML file, as they themselves have various properties.  They have a name, which can be fully qualified or simple.  They have a data type, like number, text, selection lists etc.  These are covered in detail in the Template Authoring Guide.  For the simple example in this article, we will use text properties only.

During deployment, you pass values for these properties to the target domain using the ovmtconfig utility.  It will use one of two methods to pass these values to the domain.  The more complex method is backmounting the target domain's disk images to the control domain and then running a script which will put configuration items right into the filesystem of the target domain.  This is how a Solaris AI profile is constructed and passed to the domain for first-boot configuration of Solaris itself.  The other method currently uses VM variables to pass properties and their values to the domain.  These can then be read at first boot using the ovmtprop utility.  This is why the ovmt utilities should be installed in the template.  In the example below, you will see how this is done.

Creating a Source Domain

The first step in developing a template is to create a source domain.  This is where you will install your application, configure it as much as possible and test the first-boot script which will do the rest of the configuration based on the template property values passed during deployment.  Here are a few hints and recommendations for this step:

  • Use a standalone SPARC sun4v system to build your template.  Do not attempt to build a template on SuperCluster - this is not supported.
  • Create a simple domain:
    • Use only one network adapter if possible.  This makes it easier to deploy the template in many different environments.
    • Use only one disk image for the OS.  Keep it small.  Unnecessarily large disk images will make it more difficult to transport and deploy the template.
    • If required, you can use one or more additional disk images for the application.  Separating it from the OS is often a good idea.
    • Define CPU and memory resources only as needed.  They can always be increased after deployment if necessary.
    • Install only those OS packages required by your application.  You might want to start with "solaris-minimal-server" and then add any packages required.
  • You must use flat files for all disk images.  A later version of the utilities will also support other disk backends.
  • All your disk image files must have a filename ending in ".img".  While this is not a restriction for general use, it is currently a requirement if you intend to deploy your template on SuperCluster.
    • Use a sensible name for the disk images, as they are used in the volume descriptions within the template.
  • To speed up testing (create and deploy cycles) it is helpful to install pigz, a parallel implementation of gzip.  If present, it will be used to speed up compression and decompression of the disk images.
  • As you will most likely be using ovmtprop to read properties for your application, don't forget to install the ovmtutilities into the domain.
  • Apply all required configurations to make the application ready to run.  If this includes adding users, file systems, system configurations etc. to the domain, then do that.  All of these will be preserved during template creation.
  • Note that system properties like IP addresses, root passwords, timezone information etc. will not be preserved.

Defining the Properties

What properties you need in your template depends on your application.  So of course you should understand how to install and configure the application before you start building the template.  Here is an example for a very simple application:  An Apache webserver that shows a plain text page displaying the properties passed to the domain.  In this example, there are three properties defined in the properties file: property1, property2 and property3.  Here is the full XML file defining these properties:

<ovf:ProductSection ovf:class="">
        <ovf:Product>Oracle SuperCluster</ovf:Product>
        <ovf:Category>OVM Templates</ovf:Category>
        <ovf:Property ovf:key="property1" ovf:type="string"
                      ovf:userConfigurable="true" ovf:value="Value1">
                <ovf:Description>Specifies property 1</ovf:Description>
        </ovf:Property>
        <ovf:Property ovf:key="property2" ovf:type="string"
                      ovf:userConfigurable="true" ovf:value="Value2">
                <ovf:Description>Specifies property 2</ovf:Description>
        </ovf:Property>
        <ovf:Property ovf:key="property3" ovf:type="string"
                      ovf:userConfigurable="true" ovf:value="Value3">
                <ovf:Description>Specifies property 3</ovf:Description>
        </ovf:Property>
</ovf:ProductSection>

This file will be bundled with the template when you create the template.  Note that although there are values defined in this file, these values are not passed to the target domain during deployment!  A detailed description of all the options for template properties will be available in the Template Authoring Guide, which is currently being written.

Installing and Configuring the Application

Again, this step very much depends on the application you intend for your template.  In this simple example, there is very little to do.  Note that this step also shows a very simple first-boot script which does the configuration of the application:

root@source:~# pkg install ovmtutils

root@source:~# pkg install apache22

root@source:~# svcadm enable apache22

root@source:~# more /etc/rc3.d/S99template-demo 
#!/bin/sh
#
# simplest of all first-boot scripts
# just for demo purposes
#
# Note: the variable values below are examples; adjust the paths
# for your own system.

OUTFILE=/var/apache2/2.2/htdocs/index.html
OVMTPROP=/opt/ovmtutils/bin/ovmtprop
BASEKEY=property

if [ ! -f $OUTFILE ]
then
   cat >>$OUTFILE << EOF
<h1>OVM Template Demo</h1>
EOF
   for i in 1 2 3
   do
      $OVMTPROP -q get-prop -k $BASEKEY$i >> $OUTFILE
      echo "</br>" >> $OUTFILE
   done
fi
After you have completed the installation and are happy with how the first-boot scripts work, the final step is to unconfigure Solaris.  You do this using the command "sysconfig unconfigure --destructive --include-site-profile". This removes system identity like hostname, IP addresses, time zones and also root passwords from the system.  It is necessary so that after deployment, a new system identity can be passed to the target domain.  You might also want to remove other files like ssh keys, temp files in /var/tmp etc. which were created during configuration and testing but should not be left in the template.

Creating the Template

After all the preparations are complete, creating the actual template is a very simple step.  You only need to call ovmtcreate, which will collect all the pieces and assemble the OVF container for you.

root@mars:~# ovmtcreate -d source -o /templates/apache-template.ova \
             -s "OVM S11.2 SPARC Apache 2.2 Demo" \
             -P /development/

This will collect all the disk images attached to the source domain as well as the property definition file and package them in OVF format.   This file can now be transferred to any target system and deployed there.  With the "-s" flag, a descriptive name is given to the template.  This name can be used to identify the template.  For example, it will be displayed in the library of templates available in SuperCluster.

To complete the example workflow, deployment and configuration with the three custom properties for the sample Apache configuration is shown next.

Deploying Your Template on a Standalone Server with Custom Properties

To prepare for deployment, you will need a small file containing values for the three custom properties defined above: 

root@mars:~# cat custom.values
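Assuming the toolkit's simple key=value property format, such a file might look like this.  The key names come from the property definition file shown earlier; the values here are only examples:

```
property1=NewValue1
property2=NewValue2
property3=NewValue3
```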
Of course, Solaris itself will also need some configuration.  You have already seen this in Part 1.  This example will simply reuse that configuration.

To deploy and configure the target domain, you now only need two simple commands:

root@mars:~# ovmtdeploy -d target -o /domains \
    -s /templates/apache-template.ova
root@mars:~# ovmtconfig -d target  \
    -c /opt/ovmtutils/share/scripts/ \
    -P solaris_11.props.values,custom.values

After starting the domain, you should be able to point your browser to port 80 and see the simple html page created by the first-boot script.

Additional Notes

  • You can check the properties passed to a domain from the control domain.  They are passed as normal VM variables, so the command "ldm ls-variable <domain>" will display them.  Note that this also means that any sensitive information passed to a domain in this way will be visible to anyone on the control domain with privileges to execute the ldm command.  The same information is also available to any user logged into the target domain, as any user has read access to the ovmtprop utility.  So while this is a nice way to check if the right variables have been passed to the target domain during testing, you should not use template properties to pass clear text passwords or other sensitive information to the domain.  Use encrypted password strings instead, if possible.  If the application requires clear text sensitive information, you should at least remove these variables after the target domain is in production.  Use the command "ldm rm-variable" to do so.  Triggering a change of passwords after first login is also a good idea.
  • In general, OVM Templates support Solaris Zones.  This means you can create a template domain that contains one or more Zones as part of the template.  However, you will need to care for the configuration (IP addresses, hostnames, etc) of each zone as part of the first-boot configuration, as this is not covered by the Solaris configuration utilities.  Also note that templates containing Zones are currently not supported on SuperCluster.
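The property check described in the first note above looks like this on the control domain.  The domain name "target" is taken from the example; the variable name is hypothetical, so verify the actual names with ls-variable first:

```
root@mars:~# ldm ls-variable target
root@mars:~# ldm rm-variable property1 target
```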

For additional and background reading, please refer to the links I provided at the end of the first article in this mini-series.

Getting Started with OVM Templates for SPARC - Part 1: Deploying a Template

OVM Templates have been around for a while now.  Documentation and various templates are readily available on OTN.  However, most of the documentation is centered around OVM Manager and most of the templates are built for Linux on x86 hardware.  But SPARC and Solaris are catching up.  A template for Weblogic 12.1.3 is already available.

With the release of Solaris 11.3, commandline tools to create, deploy and configure OVM Templates for SPARC are now available.  The tools are also available as a separate download on MOS as patch ID 21210110.  In this small series of blog entries, I will discuss how to deploy OVM Templates on SPARC and how to create your own.

(Note: This article is the blog version of the first part of MOS DocID 2063739.1 reproduced here for convenient access.)

Let's start with the easiest part: Deploying an existing template on a SPARC system.

Of course, the first step here is to get a template.  Today, there isn't very much choice - you can get a template for Solaris 10 and one for Solaris 11.  But more are being developed.  Go to  to get them.  Here's a little screenshot to guide you in the right direction.  My notes are in red...

There is also a template for Weblogic 12.1.3 available here.

Once you've downloaded the template, you'll find a file called "sol-11_2-ovm-sparc.ova" or similar.  This is the template in Open Virtualization Format.  Since an OVA file is simply a tar archive of the OVF descriptor and the disk images, you can actually extract and explore it if you are curious.
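Since the container is tar-based, tar itself is all you need to peek inside.  Here's a tiny, self-contained sketch of the principle using a toy archive - on a real template you would simply run "tar tvf sol-11_2-ovm-sparc.ova":

```shell
# OVA containers are tar archives: build a toy one and list its contents.
cd "$(mktemp -d)"
printf '<Envelope/>\n' > demo.ovf   # stand-in for the OVF descriptor
tar cf demo.ova demo.ovf            # package it the way an OVA is packaged
tar tf demo.ova                     # lists the archive members
```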

The second step in deploying this template is to download and install the OVM Template Toolkit in the control domain of your server.  Either update to Solaris 11.3 or download the toolkit as a separate patch.  Then install it - you will find the tools in /opt/ovmtutils.  The patch will also contain a README file and manpages for the utilities.  I recommend looking at them for additional details and commandline switches not covered here.

root@mars:/tmp/ovmt# unzip /tmp/ 
Archive:  /tmp/
  inflating: README.txt              
   creating: man1m/
  inflating: man1m/ovmtconfig.1m     
  inflating: man1m/ovmtprop.1m       
  inflating: man1m/ovmtlibrary.1m    
  inflating: man1m/ovmtcreate.1m     
  inflating: man1m/ovmtdeploy.1m     
  inflating: ovmt-utils1.1.0.1.p5p   

root@mars:/tmp/ovmt# pkg install -vg ./ovmt-utils1.1.0.1.p5p ovmtutils


root@mars:~# ls /opt/ovmtutils/bin
agent        dist         ovmtconfig   ovmtdeploy   ovmtprop
bin          lib          ovmtcreate   ovmtlibrary

Now, before we deploy the template, let's have a short look at what the utilities find in this specific template:

root@mars:~# ovmtdeploy -l ./sol-11_2-ovm-sparc.ova 
Oracle Virtual Machine for SPARC Deployment Utility
ovmtdeploy Version
Copyright (c) 2014, 2015, Oracle and/or its affiliates. All rights reserved.

Checking user privilege
Performing platform & prerequisite checks
Checking for required services
Named resourced available

Checking .ova format and contents
Validating archive configuration
Listing archive configuration

Assembly name: sol-11_2-ovm-sparc.ovf
Gloabl settings: 
References: zdisk-ovm-template-s11_2 -> zdisk-ovm-template-s11_2.gz
Disks: zdisk-ovm-template-s11_2 -> zdisk-ovm-template-s11_2
Networks: primary-vsw0

Virtual machine 1
Name: sol-11_2-ovm-sparc
Description: Oracle VM for SPARC Template with 8 vCPUs, 4G memory, 1 disk image(s)
vcpu Quantity: 8
Memory Quantity: 4G
Disk image 1: ovf:/disk/zdisk-ovm-template-s11_2 -> zdisk-ovm-template-s11_2
Network adapter 1: Ethernet_adapter_0 -> primary-vsw0
Oracle VM for SPARC Template

We can see that the target domain will have:

  • 8 vCPUs and 4GB of RAM
  • One Ethernet adapter connected to "primary-vsw0"
  • One disk image

It also supports several properties that can be configured after deployment and before the domain is first started.  These are all Solaris properties that one would usually provide during initial system configuration - either manually on the system console or using an AI profile.  We will see later in this article how to populate these properties.

Before we actually go and deploy this, we should check the prerequisites on the platform:

  • Your control domain should be running Solaris 11.3 (or at least 11.2 if you're using the patch mentioned above).
  • The virtual console service must be configured and running.  If not, set it up now.
    (See here for an example.)
  • A virtual disk service must be available.  If none exists, create one now.
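If either service is missing, setting them up takes one command each.  The service names below are the common defaults, not a requirement:

```
root@mars:~# ldm add-vcc port-range=5000-5100 primary-vcc0 primary
root@mars:~# svcadm enable vntsd
root@mars:~# ldm add-vds primary-vds0 primary
```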

But first, let's deploy the template without bothering about any details:

root@mars:~# ovmtdeploy -d solarisguest -o /localstore/domains \
             /incoming/sol-11_2-ovm-sparc.ova

In this very simple example, the domain we're creating and installing will be called "solarisguest".  Its disk images will be stored in /localstore/domains and it will be installed from the template found in /incoming/sol-11_2-ovm-sparc.ova.

There are a few things to note here:

  • ovmtdeploy will create the domain with resources as they are defined in the template.
  • It will, if necessary, create vswitches to connect network ports.  Make sure to check the result.
  • By default, ovmtdeploy will use flat files for disk images.
  • All of these settings can be overridden with commandline switches.  They allow very sophisticated domain configurations and are not covered here.
  • The domain is started right after the deployment.  Since we didn't populate the available properties, Solaris will boot in an unconfigured state and request configuration on the system console.
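Before relying on these defaults, it is worth verifying what ovmtdeploy actually created:

```
root@mars:~# ldm list solarisguest
root@mars:~# ldm ls-services
```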

So let's do this again, this time providing values for these properties.  This will allow us to boot the domain in a configured state without ever logging in.  What we need for this is a small text file which contains values for these properties:

root@mars:~# more solaris_11.props.values 

# Default hostname

# Root user account settings
'password hash goes here'

# Administrator account settings
"Administrator"
'password hash goes here'

# Network settings for first network instance
# Domain network interface

# IP Address
# if not set, use DHCP

# DNS settings
# (comma separated list of DNS servers)
# (comma separated list of domains)

# System default locale settings

It should be obvious how to populate this file with your own values.  With this file, deployment and configuration become a simple two-step operation:

root@mars:~# ovmtdeploy -d solarisguest -o /localstore/domains \
             -s /incoming/sol-11_2-ovm-sparc.ova 

This deploys the guest, but doesn't start it.  That's what the "-s" commandline switch is for.

root@mars:~# ovmtconfig -d solarisguest \
    -c /opt/ovmtutils/share/scripts/ \
    -P solaris_11.props.values

The configuration script passed with "-c" is part of the ovmtutils distribution.  ovmtconfig will mount the deployed disk image and call this script.  It will create a configuration profile using the property values given in "solaris_11.props.values".  This profile will be picked up by Solaris at first boot to configure the domain.

With this, you have a Solaris guest domain up and running.  If you don't like it, un-deploy it using "ovmtdeploy -U solarisguest".  It cleans up nicely.  You could now use this domain as a starting point for developing your own template.  But this will be covered in the next part.

For the curious, here are some additional links and references:

Friday Aug 28, 2015

Open a MOS Document by its DocID

I've always been wondering if there wasn't an easier way to view a MOS document than to go to the portal and "search" for the DocID which I already know.  Using a Firefox search plugin had been an idea for a while.  Now I finally found a few minutes to try that.  It works - find the plugin below.  Simply save it as "DocID.xml" in the searchplugins directory of your firefox profile, restart firefox and enter the DocID in the search field of the browser.

<SearchPlugin xmlns="" xmlns:os="">
<os:ShortName>MOS DocID</os:ShortName>
<os:Description>MOS DocID Search</os:Description>
<os:Image width="16" height="16">data:image/x-icon;base64,
<os:Url type="text/html" method="GET" template="{searchTer

(The full text of the plugin might not be displayed, but copy & paste to a file will work.)

Tuesday Jan 06, 2015

What's up with LDoms - Article Index

In the last few years - yes, it's actually years! - I wrote a series of articles about LDoms and their various features.  It's about time to publish a small index to all those articles:

I will update this index if and when I find time for a new article.

What's up with LDoms: Part 11 - IO Recommendations

In the last few articles, I discussed various different options for IO with LDoms.  Here's a very short summary:

  • Direct IO
  • Root Domains
  • Virtual IO

In this article, I will discuss the pros and cons of each of these options and give some recommendations for their use.

In the case of physical IO, there are several options: Root Domains, DirectIO and SR-IOV.  Let's start with SR-IOV.  The most recent addition to the LDom IO options, it is by far the most flexible and most sophisticated PCI virtualization option available.  Please see the diagram on the right (from the Admin Guide) for an overview.  First introduced for Ethernet adapters, Oracle today supports SR-IOV for Ethernet, Infiniband and Fibre Channel.  Note that the exact features depend on the hardware capabilities and built-in support of the individual adapter.  SR-IOV is not a feature of a server but rather a feature of an individual IO card in a server platform that supports it.  Here are the advantages of this solution:

  • It is very fine-grained, with between 7 and 63 virtual functions per adapter.  The exact number depends on adapter capabilities.  This means that you can create and use as many as 63 virtual devices in a single PCIe slot!
  • It provides bare metal performance (especially latency), although hardware resources like send and receive buffers, MAC slots and other resources are divided between VFs, which might lead to slight performance differences in some cases.
  • Particularly for Fibre Channel, there are no limitations on what end-point device (disk, tape, library, etc.) you attach to the fabric.  Since this is a virtual HBA, it is administered like one.
  • Unlike Root Domains and Direct IO, most SR-IOV configuration operations can be performed dynamically, if the adapters support it.  This is currently the case for Ethernet and Fibre Channel.  This means you can add or remove SR-IOV VFs to and from domains in a dynamic reconfiguration operation, without rebooting the domain.

Of course, there are also some drawbacks:

  • First of all, you have a hard dependency on the domain owning the root complex.  Here's a little more detail about this:
    As you can see in the diagram, the IO domain owns the physical IO card.  The physical Root Complex (pci_0 in the diagram) remains under the control of the root domain (the control domain in this example).  This means that if the root domain should reboot for whatever reason, it will reset the root complex as part of that reboot.  This reset will cascade down the PCI structures controlled by that root complex and eventually reset the PCI card in the slot and all the VFs given away to the IO domains.  Essentially, seen from the IO domain, its (virtual) IO card will perform an unexpected reset.  The best way to respond to this is with a panic of the IO domain, which is the most likely consequence.  Note that the Admin Guide says that the behaviour of the IO domain is unpredictable, which means that a panic is the best, but not the only possible outcome.  Please also take note of the recommended precautions (by configuring domain dependencies) documented in the same section of the Admin Guide.  Furthermore, you should be aware that this also means that any kind of multi-pathing on top of VFs is counter-productive.  While it is possible to create a configuration where one guest uses VFs from two different root domains (and thus from two different physical adapters), this does not increase the availability of the configuration.  While this might protect against external failures like link failures to a single adapter, it doubles the likelyhood of a failure of the guest, because it now depends on two root domains instead of one.  I strongly recommend against any such configurations at this time.  (There is work going on to mitigate this dependency.)
  • Live Migration is not possible for domains that use VFs.  In the case of Ethernet, this can be worked around by creating an IPMP failover group consiting of one virtual network port and one Ethernet VF and manually removing the VF before initiating the migration as described by Raghuram here.  Note that this is not currently possible for Fibre Channel or IB.
  • Since you are actually sharing one adapter between many guests, these guests do share the IO bandwidth of this one adapter.  Depending on the adapter, there might be bandwidth management available, however, the side effects of sharing should be considered.
  • Not all PCIe adapters support SR-IOV.  Please consult MOS DocID 1325454.1 for details.
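The domain dependency precaution mentioned above is configured with two ldm commands.  A sketch, with "iodom1" as a hypothetical IO domain that consumes VFs from the control domain:

```
root@sun:~# ldm set-domain failure-policy=reset primary
root@sun:~# ldm set-domain master=primary iodom1
```

With this in place, iodom1 is reset in a controlled way whenever its master domain fails, instead of being left with a dead virtual adapter.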

SR-IOV is a very flexible solution, especially if you need a larger number of virtual devices and yet don't want to buy into the slightly higher IO latencies of virtual IO.  Due to the limitations mentioned above, I cannot currently recommend SR-IOV or Direct IO for use in domains with the highest availability requirements.  In all other situations, and definitely in test and development environments, it is an interesting alternative to virtual IO.  The performance gap between SR-IOV and virtual IO has narrowed considerably with the latest improvements in virtual IO.  You will essentially have to weigh the availability, latency and manageability characteristics of SR-IOV against virtual IO to make your decision.

Next in line is Direct IO.  As described in an earlier post, you give one full PCI slot to the receiving domain.  The hypervisor will create a virtual PCIe infrastructure in the receiving guest and reconfigure the PCIe subsystem accordingly.  This is shown in an abstract view in the diagram (from the Admin Guide) at the right.  Here are the advantages:

  • Since Direct IO works on a per-slot basis, it is a finer-grained solution compared to root domains.  For example, you have 16 slots in a T5-4, but only 8 root complexes.
  • The IO domain has full control over the adapter.
  • Like SR-IOV, it will provide bare-metal performance.
  • There is no sharing, and thus no cross-influencing from other domains.
  • It will support all kinds of IO devices, tape drives and tape libraries being the most popular example.

The disadvantages of Direct IO are:

  • There is a hard dependency on the domain owning the root complex.  The reason is the same as with SR-IOV, so there's no need to repeat this here.  Please make sure you understand this and read the recommendations in the Admin Guide on how to deal with this dependency.
  • Not all IO cards are supported with DirectIO.  They must not contain their own PCIe switch.  A list of supported cards is maintained in MOS DocID 1325454.1.
  • Like Root Domains, dynamic reconfiguration is not currently supported with DirectIO slots.  This means that you will need to reboot both the root domain and the receiving guest domain to change this configuration.
  • And of course, Live Migration is not possible with Direct IO devices.

DirectIO was introduced in an early release of the LDoms software.  At the time, systems like the T2000 only supported two root complexes.  The most common use case was to support tape devices in domains other than the control domain.  Today, with a much better ratio of slots per root complex, the need for this feature is diminishing, and although it is fully supported, you should consider other alternatives first.

Finally, there are Root Domains.  Again, a diagram you already know, just as a reminder.

The advantages of Root Domains are:

  • Highest Isolation of all domain types.  Since they own and control their own CPU, memory and one or more PCIe root complex, they are fully isolated from all other domains in the system.  This is very similar to Dynamic System Domains you might know from older SPARC systems, just that we now use a hypervisor instead of a crossbar.
  • This also means no sharing of any IO resources with other domains, and thus no cross-influence of any kind.
  • Bare metal performance.  Since there's no virtualization of any kind involved, there are no performance penalties anywhere.
  • Root Domains are fully independent of all other domains in all aspects.  The only exception is console access, which is usually provided by the control domain.  However, this is not a single point of failure, as the root domain will continue to operate and will be fully available over the network even if the control domain is unavailable.
  • They allow hot-swapping of IO cards under their control, if the chassis supports it.  Today, that is for T5-4 and above.

Of course, there are disadvantages, too:

  • Root Domains are not very flexible.  You cannot add or remove PCIe root complexes without rebooting the domain.
  • You are limited in the number of Root Domains, mostly by the number of PCIe root complexes available in the system.
  • As with all physical IO, Live Migration is not possible.

Use Root Domains whenever you have an application that needs at least one socket's worth of CPU and memory and has high IO requirements, but where you'd prefer to host it on a larger system to allow some flexibility in CPU and memory assignment.  Typically, Root Domains have a memory footprint and CPU activity too high to allow sensible live migration.  They are typically used for high value applications that are secured with some kind of cluster framework.

Having covered all the options for PCI virtualization, there is only virtual IO left to cover.  For easier reference, here's the diagram from previous posts that shows this basic setup.  This variant is probably the most widely used one.  It has been available from the very first version, and its performance has been significantly improved recently.  The advantages of this type of IO are mostly obvious:

  • Virtual IO allows live migration of guests.  In fact, a guest can only be live migrated if all of its IO is fully virtualized.
  • This type of IO is by far the most flexible from a platform point of view.  The number of virtual networks and the overall network architecture is only limited by the number of available LDCs (which has recently been increased to 1984 per domain).  There is a big choice of disk backends.  Providing disk and networking to a great number of guests can be achieved with a minimum of hardware.
  • Virtual IO fully supports dynamic reconfiguration - the adding and removing of virtual devices.
  • Virtual IO can be configured with redundant IO service domains, allowing a rolling upgrade of the IO service domains without disrupting the guest domains and without requiring live migration of the guests for this purpose.  Especially when running a large number of guests on one platform, this is a huge advantage.
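The redundant IO service domain setup from the last bullet relies on virtual disk multipathing.  A minimal sketch with assumed names and backend - the same LUN is exported by two service domains under one mpgroup, and the guest receives a single vdisk:

```
root@sun:~# ldm add-vdsdev mpgroup=data1 /dev/dsk/c2t1d0s2 data1@primary-vds0
root@sun:~# ldm add-vdsdev mpgroup=data1 /dev/dsk/c2t1d0s2 data1@alternate-vds0
root@sun:~# ldm add-vdisk data1 data1@primary-vds0 guest
```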

Of course, there are also some drawbacks:

  • As with all virtual IO, there is a small overhead involved.  In the LDoms implementation, there is no limitation of physical bandwidth.  But there is a small amount of additional latency added to each data packet as it is processed through the stack.  Note that this additional latency, while measurable, is very small and not typically an issue for applications.
  • LDoms virtual IO currently supports virtual Ethernet and virtual disk.  While virtual Ethernet provides the same functionality as a physical Ethernet switch, the virtual disk interface works on a LUN by LUN basis.  This is different to other solutions that provide a virtual HBA and comes with some overhead in administration, since you have to add each virtual disk individually instead of just a single (virtual) HBA.  It also means that other SCSI devices like tapes or tape libraries can not be connected with virtual IO.
  • As is natural for virtual IO, the physical devices (and thus their resources) are shared between all consumers.  While recent releases of LDoms do support bandwidth limitations for network traffic, no such limits can currently be set on virtual disk devices.
  • You need to configure sufficient CPU and memory resources in the IO service domains.  The usual recommendation is one to two cores and 8-16 GB of memory.  While this doesn't strictly count as overhead for the CPU resources of the guests, these are still resources that are not directly available to guests.

Some recommendations for virtual IO:

  • In general, use the latest version of LDoms, along with Solaris 11.
  • Other than general networking considerations, there are no specific tunables for networking, if you are using a recent version of LDoms.  Stick to the defaults.
  • The same is true for disk IO.  However, keep in mind what has been true for the last 20 years: More LUNs do more IOPS.  Just because you've virtualized your guest doesn't mean that a single, 10TB LUN would give you more IOPS than 10x1TB LUNs - quite the opposite!  In the special case of the Oracle database: Make sure the redo logs are on dedicated storage.  This has been a recommendation since the "bad old days", and it continues to be true, whether you virtualize or not.

Virtual IO is best used in consolidation scenarios, where you have many smaller systems to host on one chassis.  These smaller systems tend to be lightweight in most of their resource consumption, including IO.  Hence, they will definitely work well on virtual IO.  These are also the workloads that lend themselves best to Live Migration because of their smaller memory footprint and lower overall activity.  This is not to say that domains with moderate IO requirements wouldn't be well suited for virtual IO - they are.  However, larger domains with higher overall resource consumption (CPU, memory, IO) tend to benefit less from the advantages of Live Migration and the flexibility of virtual IO.

To finalize this article, here's a tabular overview of the different options and the most important points to consider:

SR-IOV

Pros:
  • Highest granularity of all PCIe-based IO solutions
  • Bare metal performance
  • Supports Ethernet, FC and IB
  • Dynamic reconfiguration
Cons:
  • Depends on support by PCIe card
  • No Live Migration
  • Dependency on root domain
When to use:
  • For larger numbers of guests that need bare metal latency and can do without live migration
  • When administering a great number of LUNs is a constant burden, consider FC SR-IOV
  • When availability is not the top priority

Direct IO

Pros:
  • Dedicated slot, no hardware sharing
  • Bare metal performance
  • Supports Ethernet, FC and IB
Cons:
  • Granularity limited by number of PCIe slots in the system
  • Not all PCIe cards supported
  • No Live Migration
  • No dynamic reconfiguration
  • Dependency on root domain
When to use:
  • If you need a dedicated or special purpose IO card

Root Domains

Pros:
  • Fully independent domains, similar to dynamic domains
  • Full bare metal performance, dedicated to each domain
  • All types of IO cards supported
Cons:
  • Granularity limited by the number of Root Complexes in the system
  • No Live Migration
  • No dynamic reconfiguration
When to use:
  • High value applications with high CPU, memory and IO requirements
  • When Live Migration is not a requirement and/or not practical because of domain size and activity

Virtual IO

Pros:
  • Allows Live Migration
  • Most flexible, including full dynamic reconfiguration
  • No special hardware requirements
  • Almost no limit to the number of virtual devices
  • Allows fully redundant virtual IO configuration for HA deployments
Cons:
  • Limited to Ethernet and virtual disk
  • Small performance overhead, mostly visible in additional latency
  • vDisk administration complexity
  • Sharing of IO hardware may have performance implications
When to use:
  • Consolidation scenarios
  • Many small guests
  • When Live Migration is a requirement

There are already quite a few links for further reading spread throughout this article.  Here is just one more:

Monday Dec 15, 2014

What's up with LDoms: Part 10 - SR-IOV

Back after a long "break" filled with lots of interesting work...  In this article, I'll cover the most flexible solution in LDoms PCI virtualization: SR-IOV.

SR-IOV - Single Root IO Virtualization - is a PCI Express standard developed and published by the PCI-SIG.  The idea here is that each PCIe card capable of SR-IOV, also called a "physical function", can create multiple virtual copies or "virtual functions" of itself and present these to the PCIe bus.  There, they appear very similar to the original, physical card and can be assigned to a guest domain much like a whole slot in the case of DirectIO.  The domain then has direct hardware access to this virtual adapter.  Support for SR-IOV was first introduced to LDoms in version 2.2, quite a while ago.  Since SR-IOV very much depends on the capabilities of the PCIe adapters, support for the various communication protocols was added one by one, as the adapters started to support it.  Today, LDoms support SR-IOV for Ethernet, Infiniband and FibreChannel.  Creating, assigning or de-assigning virtual functions (with the exception of Infiniband) has been dynamic since LDoms version 3.1, which means you can do all of this without rebooting the affected domains.

All of this is well documented, not only in the LDoms Admin Guide, but also in various blog entries, most of them by Raghuram Kothakota, one of the chief developers for this feature.  However, I do want to give a short example on how this is configured, pointing to a few things to note as we go along.

Just like with DirectIO, the first thing you want to do is an inventory of what SR-IOV capable hardware you have in your system:

root@sun:~# ldm ls-io
NAME                                      TYPE   BUS      DOMAIN   STATUS   
----                                      ----   ---      ------   ------   
pci_0                                     BUS    pci_0    primary           
pci_1                                     BUS    pci_1    primary           
niu_0                                     NIU    niu_0    primary           
niu_1                                     NIU    niu_1    primary           
/SYS/MB/PCIE0                             PCIE   pci_0    primary  EMP      
/SYS/MB/PCIE2                             PCIE   pci_0    primary  OCC      
/SYS/MB/PCIE4                             PCIE   pci_0    primary  OCC      
/SYS/MB/PCIE6                             PCIE   pci_0    primary  EMP      
/SYS/MB/PCIE8                             PCIE   pci_0    primary  EMP      
/SYS/MB/SASHBA                            PCIE   pci_0    primary  OCC      
/SYS/MB/NET0                              PCIE   pci_0    primary  OCC      
/SYS/MB/PCIE1                             PCIE   pci_1    primary  EMP      
/SYS/MB/PCIE3                             PCIE   pci_1    primary  EMP      
/SYS/MB/PCIE5                             PCIE   pci_1    primary  OCC      
/SYS/MB/PCIE7                             PCIE   pci_1    primary  EMP      
/SYS/MB/PCIE9                             PCIE   pci_1    primary  EMP      
/SYS/MB/NET2                              PCIE   pci_1    primary  OCC      
/SYS/MB/NET0/IOVNET.PF0                   PF     pci_0    primary           
/SYS/MB/NET0/IOVNET.PF1                   PF     pci_0    primary           
/SYS/MB/NET2/IOVNET.PF0                   PF     pci_1    primary           
/SYS/MB/NET2/IOVNET.PF1                   PF     pci_1    primary           

We've discussed this example earlier; this time, let's concentrate on the last four lines.  Those are physical functions (PF) of two network devices (/SYS/MB/NET0 and NET2).  Since there are two PFs for each device, we know that each device actually has two ports.  (These are the four internal ports of a T4-2 system.)  To dynamically create a virtual function on one of these ports, we first have to turn on IO virtualization on the corresponding PCI bus.  Unfortunately, this is not (yet) a dynamic operation, so we have to reboot the domain owning that bus once.  But only once.  So let's do that now:

root@sun:~# ldm start-reconf primary
Initiating a delayed reconfiguration operation on the primary domain.
All configuration changes for other domains are disabled until the primary
domain reboots, at which time the new configuration for the primary domain
will also take effect.
root@sun:~# ldm set-io iov=on pci_0
Notice: The primary domain is in the process of a delayed reconfiguration.
Any changes made to the primary domain will only take effect after it reboots.
root@sun:~# reboot

Once the system comes back up, we can check that everything went well:

root@sun:~# ldm ls-io
NAME                                      TYPE   BUS      DOMAIN   STATUS   
----                                      ----   ---      ------   ------   
pci_0                                     BUS    pci_0    primary  IOV      
pci_1                                     BUS    pci_1    primary        
/SYS/MB/NET2/IOVNET.PF1                   PF     pci_1    primary      

As you can see, pci_0 now shows "IOV" in the Status column. We can use the "-d" option to ldm ls-io to learn a bit more about the capabilities of the PF we intend to use:

root@sun:~# ldm ls-io -d /SYS/MB/NET2/IOVNET.PF1
Device-specific Parameters
    Flags = PR
    Default = 7
    Descr = Max number of configurable VFs
    Flags = VR
    Default = 9216
    Descr = Max MTU supported for a VF
    Flags = VR
    Default = 32
    Descr = Max number of VLAN filters supported
    Flags = VR
    Default = 1
    Descr = Exclusive configuration of pvid required
    Flags = PV
    Default = 0 Min = 0 Max = 32
    Descr = Number of unicast mac-address slots    

All of these capabilities depend on the type of adapter and the driver that supports it.  In this example, we can see that we can create up to 7 VFs, the VFs support a maximum MTU of 9216 bytes and have hardware support for 32 VLANs and 32 MAC addresses.  Other adapters are likely to give you different values here.

Now we can create a virtual function (VF) and assign it to a guest domain.  We have to do this with a currently unused port - creating VFs doesn't work while there's traffic on the device.

root@sun:~# ldm create-vf /SYS/MB/NET2/IOVNET.PF1 
Created new vf: /SYS/MB/NET2/IOVNET.PF1.VF0
root@sun:~# ldm add-io /SYS/MB/NET2/IOVNET.PF1.VF0 mars
root@sun:~# ldm ls-io /SYS/MB/NET2/IOVNET.PF1    
NAME                                      TYPE   BUS      DOMAIN   STATUS   
----                                      ----   ---      ------   ------   
/SYS/MB/NET2/IOVNET.PF1                   PF     pci_1    primary           
/SYS/MB/NET2/IOVNET.PF1.VF0               VF     pci_1    mars             

The first command here tells the hypervisor, or actually, the NIC located at /SYS/MB/NET2/IOVNET.PF1, to create one virtual function.  The command returns and reports the name of that virtual function.  There is a different variant of this command to create multiple VFs in one go.  The second command then assigns this newly created VF to a domain called "mars".  This is an online operation - mars is already up and running Solaris at this point.  Finally, the third command just shows us that everything went well and mars now owns the VF. 
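
As a sketch of the multi-VF variant mentioned above (syntax as I recall it, so please verify against the ldm man page): the create-vf subcommand accepts a count, so several VFs can be created in one go.  The PF name is the one from this example; the count of 3 and the domain name "venus" are made up for illustration.

```shell
# Create three VFs on the same PF in one go; "max" instead of a number
# would create as many VFs as the PF supports (7 in this example).
ldm create-vf -n 3 /SYS/MB/NET2/IOVNET.PF1

# Each VF can then be assigned to a different guest domain:
ldm add-io /SYS/MB/NET2/IOVNET.PF1.VF0 mars
ldm add-io /SYS/MB/NET2/IOVNET.PF1.VF1 venus
```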

Used with the "-l" option, the ldm command tells us some details about the device structure of the PF and VF:

root@sun:~# ldm ls-io -l /SYS/MB/NET2/IOVNET.PF1
NAME                                      TYPE   BUS      DOMAIN   STATUS   
----                                      ----   ---      ------   ------   
/SYS/MB/NET2/IOVNET.PF1                   PF     pci_1    primary           
    maxvfs = 7
/SYS/MB/NET2/IOVNET.PF1.VF0               VF     pci_1    mars             
    Class properties [NETWORK]
        mac-addr = 00:14:4f:f8:07:ad
        mtu = 1500

Of course, we also want to check if and how this shows up in mars:

root@mars:~# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         0      unknown   vnet0
net1              Ethernet             unknown    0      unknown   igbvf0
root@mars:~# grep network /etc/path_to_inst
"/virtual-devices@100/channel-devices@200/network@0" 0 "vnet"
"/pci@500/pci@1/pci@0/pci@5/network@0,81" 0 "igbvf"

As you can see, mars now has two network interfaces.  One, net0, is a more conventional, virtual network interface.  The other, net1, uses the VF driver for the underlying physical device, in our case igb.  Checking in /etc/path_to_inst (or, if you prefer, in /devices), we can now find an entry for this network interface that shows us the PCIe infrastructure now plumbed into mars to support this NIC. Of course, it's the same device path as in the root domain (sun).

So far, we've seen how to create a VF in the root domain, how to assign this to a guest and how it shows up there.  I've used Ethernet for this example, as it's readily available in all systems.  As I mentioned earlier, LDoms also support Infiniband and FibreChannel with SR-IOV, so you could also add an FC HBA's VF to a guest domain.  Note that this doesn't work with just any HBA.  The HBA itself has to support this functionality.  There is a list of supported cards maintained in MOS. 

There are a few more things to note with SR-IOV.  First, there's the VF's identity.  You might not have noticed it, but the VF created in the example above has its own identity - its own MAC address.  While this seems natural in the case of Ethernet, it is actually something that you should be aware of with FC and IB as well.  FC VFs use WWNs and NPIV to identify themselves in the attached fabric.  This means the fabric has to be NPIV capable and the guest domain using the VF can not layer further software NPIV-HBAs on top.  Likewise, IB VFs use HCAGUIDs to identify themselves.  While you can choose Ethernet MAC-addresses and FC WWNs if you prefer, IB VFs choose their HCAGUIDs automatically.  If you intend to run Solaris zones within a guest domain that uses a SR-IOV VF for Ethernet, remember to assign this VF additional MAC-addresses to be used by the anet devices of these zones.
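
As a sketch of that last point (please verify the exact syntax in the Admin Guide): alternate MAC addresses can be requested when creating the VF, or set later on a VF that is not in use, and are then available to the anet devices of zones inside the guest.

```shell
# Create a VF with two additional, automatically assigned MAC addresses
# for zones running inside the guest domain:
ldm create-vf alt-mac-addrs=auto,auto /SYS/MB/NET2/IOVNET.PF1

# Or add them to an existing VF (the VF must not be in use at this point):
ldm set-io alt-mac-addrs=auto,auto /SYS/MB/NET2/IOVNET.PF1.VF0
```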

Finally, I want to point out once more that while SR-IOV devices can be moved in and out of domains dynamically, and can be added from two different root domains to the same guest, they still depend on their respective root domains.  This is very similar to the restriction with DirectIO.  So if the root domain owning the PF reboots (for whatever reason), it will reset the PF, which will also reset all VFs and have unpredictable results in the guests using them.  Keep this in mind when deciding whether or not to use SR-IOV.  If you do, consider configuring explicit domain dependencies reflecting these physical dependencies.  You can find details about this in the Admin Guide. Development in this area is continuing, so you may expect to see enhancements in this space in upcoming versions. 
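
To sketch such an explicit dependency (see the Admin Guide chapter "Configuring Domain Dependencies" for the authoritative details): the guest is made a slave of the root domain owning the PF, and a failure policy on the master defines what happens to its slaves when it fails.

```shell
# Declare what happens to slave domains when the root domain fails;
# alternatives to "reset" are ignore, panic and stop.
ldm set-domain failure-policy=reset primary

# Make mars depend on the primary domain:
ldm set-domain master=primary mars
```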

Since it is possible to work with multiple root domains and have each of those root domains create VFs of some of their devices, it is important to avoid cyclic dependencies between these root domains.  This is explicitly prevented by the ldm command, which does not allow a VF from one root domain to be assigned to another root domain.

We have now seen multiple ways of providing IO resources to logical domains: Virtual network and disk, PCIe root complexes, PCIe slots and finally SR-IOV.  Each of these options has its own pros and cons and you will need to weigh them carefully to find the correct solution for a given task.  I will dedicate one of the next chapters of this series to a discussion of IO best practices and recommendations.  For now, here are some links for further reading about SR-IOV:

Wednesday Aug 20, 2014

What's up with LDoms: Part 9 - Direct IO

In the last article of this series, we discussed the most general of all physical IO options available for LDoms, root domains.  Now, let's have a short look at the next level of granularity: Virtualizing individual PCIe slots.  In the LDoms terminology, this feature is called "Direct IO" or DIO.  It is very similar to root domains, but instead of reassigning ownership of a complete root complex, it only moves a single PCIe slot or endpoint device to a different domain.  Let's look again at hardware available to mars in the original configuration:

root@sun:~# ldm ls-io
NAME                                      TYPE   BUS      DOMAIN   STATUS  
----                                      ----   ---      ------   ------  
pci_0                                     BUS    pci_0    primary          
pci_1                                     BUS    pci_1    primary          
pci_2                                     BUS    pci_2    primary          
pci_3                                     BUS    pci_3    primary          
/SYS/MB/PCIE1                             PCIE   pci_0    primary  EMP     
/SYS/MB/SASHBA0                           PCIE   pci_0    primary  OCC
/SYS/MB/NET0                              PCIE   pci_0    primary  OCC     
/SYS/MB/PCIE5                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE6                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE7                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE2                             PCIE   pci_2    primary  EMP     
/SYS/MB/PCIE3                             PCIE   pci_2    primary  OCC     
/SYS/MB/PCIE4                             PCIE   pci_2    primary  EMP     
/SYS/MB/PCIE8                             PCIE   pci_3    primary  EMP     
/SYS/MB/SASHBA1                           PCIE   pci_3    primary  OCC     
/SYS/MB/NET2                              PCIE   pci_3    primary  OCC     
/SYS/MB/NET0/IOVNET.PF0                   PF     pci_0    primary          
/SYS/MB/NET0/IOVNET.PF1                   PF     pci_0    primary          
/SYS/MB/NET2/IOVNET.PF0                   PF     pci_3    primary          
/SYS/MB/NET2/IOVNET.PF1                   PF     pci_3    primary

All of the "PCIE" type devices are available for DIO, with a few limitations.  If the device is a slot, the card in that slot must support the DIO feature.  The documentation lists all such cards.  Moving a slot to a different domain works just like moving a PCI root complex.  Again, this is not a dynamic process and includes reboots of the affected domains.  The resulting configuration is nicely shown in a diagram in the Admin Guide:
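
For illustration, moving a single slot follows the same pattern as moving a whole root complex; here sketched with the occupied slot PCIE3 from the listing above (assuming the card in it supports DIO):

```shell
# Remove the slot from the primary domain (delayed reconfiguration):
ldm start-reconf primary
ldm rm-io /SYS/MB/PCIE3 primary
reboot
# [ wait for the system to come back up ]
ldm add-io /SYS/MB/PCIE3 mars
```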

There are several important things to note and consider here:

  • The domain receiving the slot/endpoint device turns into an IO domain in LDoms terminology, because it now owns some physical IO hardware.
  • Solaris will create nodes for this hardware under /devices.  This includes entries for the virtual PCI root complex (pci_0 in the diagram) and anything between it and the actual endpoint device.  It is very important to understand that all of this PCIe infrastructure is virtual only!  Only the actual endpoint devices are true physical hardware.
  • There is an implicit dependency between the guest owning the endpoint device and the root domain owning the real PCIe infrastructure:
    • Only if the root domain is up and running, will the guest domain have access to the endpoint device.
    • The root domain is still responsible for resetting and configuring the PCIe infrastructure (root complex, PCIe level configurations, error handling etc.) because it owns this part of the physical infrastructure.
    • This also means that if the root domain needs to reset the PCIe root complex for any reason (typically a reboot of the root domain) it will reset and thus disrupt the operation of the endpoint device owned by the guest domain.  The result in the guest is not predictable.  I recommend configuring the resulting behaviour of the guest using domain dependencies as described in the Admin Guide in Chapter "Configuring Domain Dependencies".
  • Please consult the Admin Guide in Section "Creating an I/O Domain by Assigning PCIe Endpoint Devices" for all the details!

As you can see, there are several restrictions for this feature.  It was introduced in LDoms 2.0, mainly to allow the configuration of guest domains that need access to tape devices.  Today, with the higher number of PCIe root complexes and the availability of SR-IOV, the need to use this feature is declining.  I personally do not recommend using it, mainly because of the drawbacks of the dependencies on the root domain and because it can be replaced with SR-IOV (although then with similar limitations).

This was a rather short entry, more for completeness.  I believe that DIO can usually be replaced by SR-IOV, which is much more flexible.  I will cover SR-IOV in the next section of this blog series.

Tuesday May 20, 2014

Improved vDisk Performance for LDoms

In all the LDoms workshops I've been doing in the past years, I've always been cautioning customers to keep their expectations within reasonable limits when it comes to virtual IO.  And I'll not stop doing that today.  Virtual IO will always come at a certain cost, because of the additional work necessary to translate physical IOs to the virtual world.  Until we invent time travel, this will always need some additional time to be done.  But there's some good news about this, too:

First, in many cases the overhead involved in virtualizing IO isn't that much - the LDom implementation is very efficient.  And in many of these cases, it doesn't hurt, because the workload involved doesn't care and virtual IO is fast enough.

Second, there are good ways to configure virtual IO, and not so good ways.  If you stick to the good ways (which I previously discussed here), you'll increase the number of cases where virtual IO is more than just good enough. 

But of course, there are always those other cases where it just isn't.  But there's more good news, too:

For virtualized network, we introduced a new implementation utilizing large segment offload (LSO) and some other techniques to increase throughput and reduce latency to a point where virtual networking has gone away as a reason for performance issues.  That was in LDoms release 3.1.  Now we are introducing a similar enhancement for virtual disk.

When we talk about disk IO and performance, the most important configuration best practice is to spread IO load to multiple LUNs.  This has always been the case, long before we started to even think about virtualization.  The reason for this is the limited number of IOPS a single LUN will deliver.  Whether that LUN is a single physical disk or a volume in a more sophisticated disk array doesn't matter.  IOPS delivered by one LUN are limited, and IOs will queue up in this LUN's queue in a very sequential manner.  A single physical disk might deliver 150 IOPS, perhaps 300 IOPS.  A SAN LUN with a strong array in the backend might deliver 5000 IOPS or a little more.  But that isn't enough, and has never been.  Disk striping of any kind was invented to solve this problem.  And virtualization of both servers and storage doesn't change the overall picture.  Which means that in LDoms, the best practice has always been to configure several LUNs, which means several vdisks, into a single guest system.  This often provided the required IO performance, but there were quite a few cases where this just wasn't good enough and people had to move back to physical IO.  Of course, there are several ways to provide physical IO and still virtualize using LDoms, but the situation was not ideal. 
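
To illustrate this best practice, here is a sketch of spreading guest IO over several LUNs by exporting each LUN as its own vdisk.  The device paths are placeholders for real SAN LUNs, and "primary-vds0" is assumed to be an existing virtual disk service in the primary domain.

```shell
# LUNs to export - device paths are placeholders, substitute your own:
LUNS="/dev/dsk/c0t0d1s2 /dev/dsk/c0t0d2s2 /dev/dsk/c0t0d3s2 /dev/dsk/c0t0d4s2"

i=0
for lun in $LUNS; do
  ldm add-vdsdev $lun vol$i@primary-vds0        # export LUN via disk service
  ldm add-vdisk disk$i vol$i@primary-vds0 mars  # attach as vdisk to mars
  i=$((i+1))
done
```

The guest can then stripe across these vdisks, for example in a ZFS pool.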

With the release of Solaris 11.1 SRU 19 (and a Solaris 10 patch shortly afterwards) we are introducing a new implementation of the vdisk/vds software stack, which significantly improves both latency and throughput of virtual disk IO.  The improvement can best be seen in the graphs below.

This first graph shows the overall number of IOPS during a performance test, comparing bare metal with the old and the new vdisk implementation. As you can see, the new implementation delivers essentially the same performance as bare metal, with a variation that might as well be statistical deviation. Note that these tests were run on a total of 28 SAN LUNs, so please don't expect a single LUN to deliver 150k IOPS anytime soon :-) The improvement over the old implementation is significant, with differences of up to 55% in some cases. Again, note that running only a single stream of IOs against a single LUN will not show as much of an improvement as running multiple streams (denoted as threads in the graphs). This is due to the fact that parts of the new implementation have focused on de-serializing the IO infrastructure, something you'll not notice if you run single threaded IO streams. But then, most IO hungry applications issue multiple IOs.  Likewise, if your storage backend can't provide this kind of performance (perhaps because you're testing on a single, internal disk?), don't expect much change! 

So we know that throughput has been fixed (with 150k IOPS and 1.1 GB/sec virtual IO in this test, I believe I can safely say so). But what about IO latency? The next graph shows a similar improvement here:

Again, response time (or service time) with the new implementation is very similar to what you get from bare metal.  The maximum difference is in the 2 thread case with less than 4% difference between virtual IO and bare metal.  Close enough to actually start talking about zero overhead IO (at least as far as the IO performance is concerned).  Talking about overhead:  I sometimes call the overhead involved in virtualization the "Virtualization Tax" - the resources you invest in virtualization itself, or, in other words, the performance (or response time) you lose because of virtualization.  In the case of LDoms disk IO, we've just seen a significant reduction in virtualization taxes:

The last graph shows how much higher the response time for virtual disk IO was with the old implementation, and how much of that we've been given back by this charming piece of engineering in the new implementation. Where we paid up to 55% of virtualization tax before, we're now down to 4% or less. A big "Thank you!" to engineering!
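
To make the arithmetic behind the "tax" explicit: it is simply the extra response time relative to bare metal.  A tiny sketch with made-up numbers (not the measured values from the graphs):

```shell
# Virtualization tax: extra response time relative to bare metal, in percent.
tax() { echo $(( ($1 - $2) * 100 / $2 )); }   # tax <virt_time> <bare_time>

# Hypothetical service times in microseconds, for illustration only:
echo "old implementation: $(tax 155 100)% tax"   # prints 55% tax
echo "new implementation: $(tax 104 100)% tax"   # prints 4% tax
```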

Of course, there's always a little disclaimer involved:  Your mileage will vary.  The results I show here were obtained on 28 LUNs coming from some kind of FC infrastructure.  The tests were done using vdbench in a read/write mix of 60%/40% running from 2 to 20 threads doing random IO.  While this is quite a challenging load for any IO subsystem and represents the load pattern that showed the highest virtualization tax with the old implementation, this still means that real world benefits from this new implementation might not achieve the same improvements.  Although I am very optimistic that they will be similar.

In conclusion, with the new, improved virtual networking and virtual disk IO that are now available, the range of applications that can safely be run on fully virtualized IO has been expanded significantly.  This is in line with the expectations I often find in customer workshops, where high end performance is naturally expected from SPARC systems under all circumstances.

Before I close, here's how to use this new implementation:

  • Update to Solaris 11.1 SRU 19 in
    • all guest domains that want to use the new implementation, and
    • all IO domains that provide virtual disks to these guests.
  • This will also update LDoms Manager to 3.1.1.
  • If only one in the pair (guest|IO domain) is updated, virtual IO will continue to work using the old implementation.
  • A patch for Solaris 10 will be available shortly.

Update 2014-06-16: Patch 150400-13 has now been released for Solaris 10.  See the Readme for details.

Thursday Mar 27, 2014

A few Thoughts about Single Thread Performance


Monday Feb 24, 2014

What's up with LDoms: Part 8 - Physical IO

Finally finding some time to continue this blog series...  And starting the new year with a new chapter for which I hope to write several sections: Physical IO options for LDoms and what you can do with them.  In all previous sections, we talked about virtual IO and how to deal with it.  The diagram at the right ("Virtual IO Setup") shows the general architecture of such virtual IO configurations. However, there's much more to IO than that. 

From an architectural point of view, the primary task of the SPARC hypervisor is partitioning of the system.  The hypervisor isn't usually very active - all it does is assign ownership of some parts of the hardware (CPU, memory, IO resources) to a domain, build a virtual machine from these components and finally start OpenBoot in that virtual machine.  After that, the hypervisor essentially steps aside.  Only if the IO components are virtual components do we need hypervisor support.  But those IO components could also be physical.  Actually, that is the more "natural" option, if you like.  So let's revisit the creation of a domain:

We always start by assigning CPU and memory in a few very simple steps:

root@sun:~# ldm create mars
root@sun:~# ldm set-memory 8g mars
root@sun:~# ldm set-core 8 mars

If we now bound and started the domain, we would have OpenBoot running and we could connect using the virtual console.  Of course, since this domain doesn't have any IO devices, we couldn't yet do anything particularly useful with it.  We want to add physical IO devices - so where are they?
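
For completeness, binding and starting at this point would look like this; the console port (5000 here) is assigned by vntsd at bind time and may well differ on your system:

```shell
ldm bind mars                 # bind resources to the domain
ldm start mars                # start it - OpenBoot comes up
ldm ls mars                   # the CONS column shows the console port
telnet localhost 5000         # connect to the virtual console
```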

To begin with, all physical components are owned by the primary domain.  This is the same for IO devices, just like it is for CPU and memory.  So just like we need to remove some CPU and memory from the primary domain in order to assign these to other domains, we will have to remove some IO from the primary if we want to assign it to another domain.  A general inventory of available IO resources can be obtained with the "ldm ls-io" command:

root@sun:~# ldm ls-io
NAME                                      TYPE   BUS      DOMAIN   STATUS  
----                                      ----   ---      ------   ------  
pci_0                                     BUS    pci_0    primary          
pci_1                                     BUS    pci_1    primary          
pci_2                                     BUS    pci_2    primary          
pci_3                                     BUS    pci_3    primary          
/SYS/MB/PCIE1                             PCIE   pci_0    primary  EMP     
/SYS/MB/SASHBA0                           PCIE   pci_0    primary  OCC
/SYS/MB/NET0                              PCIE   pci_0    primary  OCC     
/SYS/MB/PCIE5                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE6                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE7                             PCIE   pci_1    primary  EMP     
/SYS/MB/PCIE2                             PCIE   pci_2    primary  EMP     
/SYS/MB/PCIE3                             PCIE   pci_2    primary  OCC     
/SYS/MB/PCIE4                             PCIE   pci_2    primary  EMP     
/SYS/MB/PCIE8                             PCIE   pci_3    primary  EMP     
/SYS/MB/SASHBA1                           PCIE   pci_3    primary  OCC     
/SYS/MB/NET2                              PCIE   pci_3    primary  OCC     
/SYS/MB/NET0/IOVNET.PF0                   PF     pci_0    primary          
/SYS/MB/NET0/IOVNET.PF1                   PF     pci_0    primary          
/SYS/MB/NET2/IOVNET.PF0                   PF     pci_3    primary          
/SYS/MB/NET2/IOVNET.PF1                   PF     pci_3    primary

The output of this command will of course vary greatly, depending on the type of system you have.  The above example is from a T5-2.  As you can see, there are several types of IO resources.  Specifically, there are

  • BUS
    This is a whole PCI bus, which means everything controlled by a single PCI control unit, also called a PCI root complex.  It typically contains several PCI slots and possibly some end point devices like SAS or network controllers.
  • PCIE
    This is either a single PCIe slot.  In that case, its name corresponds to the slot number you will find imprinted on the system chassis.  It is controlled by a root complex listed in the "BUS" column.  In the above example, you can see that some slots are empty, while others are occupied.  Or it is an endpoint device like a SAS HBA or network controller.  An example would be "/SYS/MB/SASHBA0" or "/SYS/MB/NET2".  Both of these typically control more than one actual device, so for example, SASHBA0 would control 4 internal disks and NET2 would control 2 internal network ports.
  • PF
    This is a SR-IOV Physical Function - usually an endpoint device like a network port which is capable of PCI virtualization.  We will cover SR-IOV in a later section of this blog.

All of these devices are available for assignment.  Right now, they are all owned by the primary domain.  We will now release some of them from the primary domain and assign them to a different domain.  Unfortunately, this is not a dynamic operation, so we will have to reboot the control domain (more precisely, the affected domains) once to complete this.

root@sun:~# ldm start-reconf primary
root@sun:~# ldm rm-io pci_3 primary
root@sun:~# reboot
[ wait for the system to come back up ]
root@sun:~# ldm add-io pci_3 mars
root@sun:~# ldm bind mars

With the removal of pci_3, we also removed PCIE8, SASHBA1 and NET2 from the primary domain and added all three to mars.  Mars will now have direct, exclusive access to all the disks controlled by SASHBA1, all the network ports on NET2 and whatever we chose to install in PCIe slot 8.  Since in this particular example, mars has access to internal disk and network, it can boot and communicate using these internal devices.  It does not depend on the primary domain for any of this.  Once started, we could actually shut down the primary domain.  (Note that the primary is usually the home of vntsd, the console service.  While we don't need this for running or rebooting mars, we do need it in case mars falls back to OBP or single-user.) 

Mars now owns its own PCIe root complex.  Because of this, we call mars a root domain.  The diagram on the right ("Root Domain Setup") shows the general architecture.  Compare this to the diagram above!  Root domains are truly independent partitions of a SPARC system, very similar in functionality to Dynamic System Domains in the E10k, E25k or M9000 times (or Physical Domains, as they're now called).  They own their own CPU, memory and physical IO.  They can be booted, run and rebooted independently of any other domain.  Any failure in another domain does not affect them.  Of course, we have plenty of shared components: A root domain might share a mainboard, a part of a CPU (mars, for example, owns only some of the cores), some memory modules, etc. with other domains.  Any failure in a shared component will of course affect all the domains sharing that component, which is different in Physical Domains because there are significantly fewer shared components.  But beyond this, root domains have a level of isolation very similar to that of Physical Domains.

Comparing root domains (which are the most general form of physical IO in LDoms) with virtual IO, here are some pros and cons:

Pros:

  • Root domains are fully independent of all other domains (with the exception of console access, but this is a minor limitation).
  • Root domains have zero overhead in IO - they have no virtualization overhead whatsoever.
  • Root domains, because they don't use virtual IO, are not limited to disk and network, but can also attach to tape, tape libraries or any other generic IO device supported in their PCIe slots.

Cons:

  • Root domains are limited in number.  You can only create as many root domains as you have PCIe root complexes available.  In current T5 and M5/6 systems, that's two per CPU socket.
  • Root domains can not live migrate.  Because they own real IO hardware (with all these nasty little buffers, registers and FIFOs), they can not be live migrated to another chassis.

Because of these different characteristics, root domains are typically used for applications that tend to be more static, have higher IO requirements and/or larger CPU and memory footprints.  Domains with virtual IO, on the other hand, are typically used for the mass of smaller applications with lower IO requirements.  Note that "higher" and "lower" are relative terms - LDoms virtual IO is quite powerful.

This is the end of the first part of the physical IO section, I'll cover some additional options next time.  Here are some links for further reading:

Tuesday Oct 01, 2013

CPU-DR for Zones

In my last entry, I described how to change the memory configuration of a running zone.  The natural next question is, of course, whether that also works with CPUs that have been assigned to a zone.  The answer, of course, is "yes".

You might wonder why that would be necessary in the first place.  After all, there's the Fair Share Scheduler, which is extremely capable of managing zones' CPU usage.  However, there are reasons to assign dedicated CPU resources to zones: licensing is one, SLAs with specified CPU requirements another.  In such cases, you configure a fixed number of CPUs (more precisely, strands) for a zone.  Being able to change this configuration on the fly then becomes desirable.  I'll show how to do that in this blog entry.

In general, there are two ways to assign exclusive CPUs to a zone.  The classic approach is by using a resource pool with an associated processor set.  One or more zones can then be bound to that pool.  The easier solution is to use the parameter "dedicated-cpu" directly when configuring the zone.  In this second case, Solaris will create a temporary pool to manage these resources.  So effectively, the implementation is the same in both cases.  Which makes it clear how to change the CPU configuration in both cases: By changing the pool.  If you do this in the classical approach, the change to the pool will be persistent.  If working with the temporary pool created for the zone, you will also need to change the zone's configuration if you want the change to survive a zone restart.
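
As a sketch of the second, easier variant, using standard zonecfg syntax:

```shell
# Give orazone one dedicated CPU; Solaris creates the temporary
# pool SUNWtmp_orazone from this when the zone boots.
zonecfg -z orazone <<EOF
add dedicated-cpu
set ncpus=1
end
commit
EOF
```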

If you configured your zone with "dedicated-cpu", the temporary pool (and also the temporary processor set that goes along with it) will usually be called "SUNWtmp_<zonename>".   If not, you'll know the name of the pool...  In both cases, everything else is the same:

Let's assume a zone called orazone, currently configured with 1 CPU.  It's to be assigned a second CPU.  The current pool configuration is like this:
root@benjaminchen:~# pooladm                

system default
	string	system.comment 
	int	system.version 1
	boolean	system.bind-default true
	string	system.poold.objectives wt-load

	pool pool_default
		int	pool.sys_id 0
		boolean true
		boolean	pool.default true
		int	pool.importance 1
		string	pool.comment 
		pset	pset_default

	pool SUNWtmp_orazone
		int	pool.sys_id 5
		boolean true
		boolean	pool.default false
		int	pool.importance 1
		string	pool.comment 
		boolean	pool.temporary true
		pset	SUNWtmp_orazone

	pset pset_default
		int	pset.sys_id -1
		boolean	pset.default true
		uint	pset.min 1
		uint	pset.max 65536
		string	pset.units population
		uint	pset.load 687
		uint	pset.size 3
		string	pset.comment 

			int	cpu.sys_id 1
			string	cpu.comment 
			string	cpu.status on-line

			int	cpu.sys_id 3
			string	cpu.comment 
			string	cpu.status on-line

			int	cpu.sys_id 2
			string	cpu.comment 
			string	cpu.status on-line

	pset SUNWtmp_orazone
		int	pset.sys_id 2
		boolean	pset.default false
		uint	pset.min 1
		uint	pset.max 1
		string	pset.units population
		uint	pset.load 478
		uint	pset.size 1
		string	pset.comment 
		boolean	pset.temporary true

			int	cpu.sys_id 0
			string	cpu.comment 
			string	cpu.status on-line

As we can see in the definition of pset SUNWtmp_orazone, it has been assigned CPU #0.  To add another CPU to this pool, you'll need these two commands:

root@benjaminchen:~# poolcfg -dc 'modify pset SUNWtmp_orazone \
                     (uint pset.max=2)' 
root@benjaminchen:~# poolcfg -dc 'transfer to pset \
                     SUNWtmp_orazone (cpu 1)'

To remove that CPU from the pool again, use these:

root@benjaminchen:~# poolcfg -dc 'transfer to pset pset_default \
                     (cpu 1)'
root@benjaminchen:~# poolcfg -dc 'modify pset SUNWtmp_orazone \
                     (uint pset.max=1)' 

That's it.  If you've used "dedicated-cpu" for your zone's configuration, you'll also need to update the zone configuration before the next reboot for the change to persist.  If not, you'd have to use the pool name you assigned to the zone.
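
To make the change survive a reboot in the "dedicated-cpu" case, the zone configuration has to be updated to match the new pool size, along these lines:

```shell
# Persist the new CPU count in the zone configuration:
zonecfg -z orazone "select dedicated-cpu; set ncpus=2; end; commit"
```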

Further details:



The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
