Tuesday Aug 26, 2014

Migratory Solaris Kernel Zones

The Introduction

Oracle Solaris 11.2 introduced Oracle Solaris Kernel Zones. Kernel Zones (KZs) offer a midpoint between traditional operating system virtualization and virtual machines. They exhibit the low overhead and low management effort of Solaris Zones, and add the best parts of the independence of virtual machines.

A Kernel Zone is a type of Solaris Zone that runs its own Solaris kernel. This gives each Kernel Zone complete independence of software packages, as well as other benefits.

One of the more interesting new abilities that Kernel Zones bring to Solaris Zones is the ability to "pause" a running KZ and "resume" it on a different computer - or the same computer, if you prefer.

Of what value is the ability to "pause" a zone? One potential use is moving a workload from a smaller computer (with too few CPUs, or insufficient RAM) to a larger one. Some workloads do not maintain much state, and can restart quickly, and so they wouldn't benefit from suspend/resume. Others, such as static (read-only) databases, may take 30 minutes to start and obtain good performance. The ability to suspend, but not stop, the workload and its operating system can be very valuable.

Another possible use of this ability is the staging of multiple KZs which have already booted and, perhaps, have started to run a workload. Instead of booting in a few minutes, the workload can continue from a known state in just a few seconds. Further, the suspended zone can be "unpaused" on the computer of your choice. Suspended kernel zones are like a nest of dozing ants, waiting to take action at a moment's notice.

This blog entry shows the steps to create and move a KZ, highlighting both the Solaris iSCSI implementation and kernel zones with their suspend/resume feature. Briefly, the steps are:

  1. Create shared storage
  2. Make shared storage available to both computers - the one that will run the zone, at first, as well as the computer on which the zone will be resumed.
  3. Configure the zone on each system.
  4. Install the zone on one system.
  5. "Warm migrate" the zone by pausing it, and then, on the other computer, resuming it.

Links to relevant documentation and blogs are provided at the bottom.

The Method

The Kernel Zones suspend/resume feature requires the use of storage accessible by multiple computers. However, neither Kernel Zones nor suspend/resume requires a specific type of shared storage. In Solaris 11.2 the only types of shared storage that support zones are iSCSI and Fibre Channel. This blog entry uses iSCSI.

The example below uses three computers. One is the iSCSI target, i.e. the storage server. The other two run the KZ, one at a time. All three systems run Solaris 11.2, although the iSCSI features below work on early updates to Solaris 11, or a ZFS Storage Appliance (the current family shares the brand name ZS3), or another type of iSCSI target.

In the commands shown below, the prompt "storage1#" indicates commands that would be entered into the iSCSI target. Similarly, "node1#" indicates commands that you would enter into the first computer that will run the kernel zone, and "node2#" indicates commands for the computer on which the zone will be resumed. The few commands preceded by the prompt "bothnodes#" must be run on both node1 and node2. The name of the kernel zone is "ant1".

For simplicity, the example below ignores security concerns. (More about security below.)

Finally, note that these commands should be run by a non-root user who prefaces each command with the pseudo-command "sudo". ;-)

Step 1. Provide shared storage for the kernel zone. The zone only needs one device for its zpool. Redundancy is provided by the zpool in the iSCSI target. (For a more detailed explanation, see the link to the COMSTAR documentation in the section "The Links" below.)

storage1# pkg install group/feature/storage-server           # Install necessary software.
storage1# svcadm enable stmf:default                         # Enable that software.
storage1# zfs create rpool/zvols                             # A dataset for the zvol.
storage1# zfs create -V 20g rpool/zvols/ant1                 # Create a zvol as backing store.
storage1# stmfadm create-lu /dev/zvol/rdsk/rpool/zvols/ant1  # Create a back-end LUN.
Logical unit created: 600144F068D1CD00000053ECD3D20001
storage1# stmfadm list-lu
LU Name: 600144F068D1CD00000053ECD3D20001
storage1# stmfadm add-view 600144F068D1CD00000053ECD3D20001
storage1# stmfadm list-view -l 600144F068D1CD00000053ECD3D20001
View Entry: 0
Host group : All
Target Group : All
LUN : Auto

storage1# svcadm enable -r svc:/network/iscsi/target:default # Enable the target service.
storage1# svcs -l iscsi/target
fmri svc:/network/iscsi/target:default
name iscsi target
enabled true
state online
next_state none
state_time August 10, 2014 03:58:50 PM EST
logfile /var/svc/log/network-iscsi-target:default.log
restarter svc:/system/svc/restarter:default
manifest /lib/svc/manifest/network/iscsi/iscsi-target.xml
dependency require_any/error svc:/milestone/network (online)
dependency require_all/none svc:/system/stmf:default (online)
storage1# itadm create-target
Target iqn.1986-03.com.sun:02:238d10b8-cca8-ef7a-e095-e1132d91c4a5
successfully created
storage1# itadm list-target -v
TARGET NAME                                                  STATE    SESSIONS
iqn.1986-03.com.sun:02:238d10b8-cca8-ef7a-e095-e1132d91c4a5  online   0
        alias:                  -
        auth:                   none (defaults)
        targetchapuser:         -
        targetchapsecret:       unset
        tpg-tags:               default
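
When scripting Step 1, the GUID printed by "stmfadm create-lu" has to be carried over to "stmfadm add-view". Here is a minimal sketch of one way to capture it, assuming the "Logical unit created:" line keeps the format shown above. The helper name lu_guid is hypothetical and is not part of Solaris.

```shell
# Hypothetical helper: read "stmfadm create-lu" output on stdin and
# print only the GUID from the "Logical unit created:" line.
lu_guid() {
    sed -n 's/^Logical unit created: *\([0-9A-Fa-f]*\).*$/\1/p'
}

# Example using the output captured earlier in this entry:
printf 'Logical unit created: 600144F068D1CD00000053ECD3D20001\n' | lu_guid
```

On the storage server this could be used as guid=$(stmfadm create-lu /dev/zvol/rdsk/rpool/zvols/ant1 | lu_guid), followed by stmfadm add-view "$guid".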

Step 2A. Configure initiators. Configuring iSCSI on the two iSCSI initiators uses exactly the same commands on each, so I'll just list them once.

bothnodes# svcadm enable network/iscsi/initiator
bothnodes# iscsiadm modify discovery --sendtargets enable # The simplest discovery method.
bothnodes# iscsiadm add discovery-address    # IP address of the storage server.
At this point, the initiator will automatically discover all of the iSCSI LUNs offered by that target. One way to view the list of them is with the format(1M) command.
bothnodes# format
Searching for disks...done

       1. c0t600144F068D1CD00000053ECD3D20001d0 

Step 2B. On each of the two computers that will host the zone, identify the Storage Uniform Resource Identifier ("SURI") of the LUN - see suri(5) for more information. The command below reports the SURI of that LUN in each of several formats. We'll need this SURI to specify the storage for the kernel zone.

bothnodes# suriadm lookup-uri c0t600144F068D1CD00000053ECD3D20001d0
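
The output of that command is not reproduced here. Judging from the device name above and the storage property used in Step 3, the "luname.naa" form of the SURI appears to be the LUN's GUID in lower case, prefixed by the host name of the iSCSI target. The sketch below illustrates that relationship, assuming the device name has the form cNt<GUID>dN. The function name naa_suri is hypothetical, and suriadm remains the authoritative source.

```shell
# Hypothetical helper: derive a luname.naa SURI from an iSCSI target host
# name and a device name such as c0t600144F068D1CD00000053ECD3D20001d0.
naa_suri() {
    host=$1
    dev=$2
    # Strip the leading "cNt" and the trailing "dN", then lower-case the GUID.
    guid=$(printf '%s\n' "$dev" |
        sed -e 's/^c[0-9]*t//' -e 's/d[0-9]*$//' |
        tr 'A-F' 'a-f')
    printf 'iscsi://%s/luname.naa.%s\n' "$host" "$guid"
}

naa_suri house c0t600144F068D1CD00000053ECD3D20001d0
```

The host name "house" is the one that appears in the SURI used in Step 3.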
Step 2C. When you suspend a kernel zone, its RAM pages must be stored temporarily in a file. In order to resume the zone on a different computer, the "suspend file" must be on storage that both computers can access. For this example, we'll use an NFS share. (Another iSCSI LUN could be used instead.) The method shown below is not particularly secure, although the suspended image is first encrypted. Secure methods would require the use of other Solaris features, but they are not the topic of this blog entry.
storage1# zfs create -p rpool/export/suspend
storage1# zfs set share.nfs=on rpool/export/suspend
That share must be made available on both nodes, with appropriate permissions.
node1# mkdir /mnt/suspend
node1# mount -F nfs storage1:/export/suspend /mnt/suspend
node2# mkdir /mnt/suspend
node2# mount -F nfs storage1:/export/suspend /mnt/suspend
Step 3. Configure a kernel zone, using the iSCSI LUN, the shared suspend file, and a system profile. You can configure a kernel zone very easily. The only required settings are the zone name and the kernel zone template, which is named SYSsolaris-kz. That template specifies a VNIC, 2GB of dedicated RAM, 1 virtual CPU, and local storage that will be configured automatically when the zone is installed. We need shared storage instead of local storage, so one of the first steps is to delete the local storage resource, which has an ID number of zero. After deleting that resource, we add the LUN, using the SURI determined earlier, and then add a suspend resource that points to the file on the NFS share.
node1# zonecfg -z ant1
  Use 'create' to begin configuring a new zone.
  zonecfg:ant1> create -t SYSsolaris-kz
  zonecfg:ant1> remove device id=0
  zonecfg:ant1> add device
  zonecfg:ant1:device> set storage=iscsi://house/luname.naa.600144f068d1cd00000053ecd3d20001
  zonecfg:ant1:device> set bootpri=0
  zonecfg:ant1:device> info
        match not specified
        storage: iscsi://house/luname.naa.600144f068d1cd00000053ecd3d20001
        id: 1
        bootpri: 0
  zonecfg:ant1:device> end
  zonecfg:ant1> add suspend
  zonecfg:ant1:suspend> set path=/mnt/suspend/ant1.sus
  zonecfg:ant1:suspend> end
  zonecfg:ant1> exit
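
The same configuration can be kept in a command file instead of being typed interactively, because zonecfg accepts a file of subcommands with its -f option (it is used that way in Step 5). The sketch below prints such a file; the helper name kz_commands is hypothetical, and the generated file has not been verified on every Solaris 11.2 update.

```shell
# Hypothetical helper: print zonecfg subcommands equivalent to the
# interactive session above. Arguments: SURI, path of the suspend file.
kz_commands() {
    suri=$1
    suspend_path=$2
    cat <<EOF
create -t SYSsolaris-kz
remove device id=0
add device
set storage=$suri
set bootpri=0
end
add suspend
set path=$suspend_path
end
EOF
}

kz_commands iscsi://house/luname.naa.600144f068d1cd00000053ecd3d20001 \
    /mnt/suspend/ant1.sus
```

Redirect the output to a file, for example ant1.zonecfg, and apply it with zonecfg -z ant1 -f ant1.zonecfg.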
We can create a reusable configuration profile.
node1# sysconfig create-profile  -o ant1
[The usual sysconfig conversation ensues...]

Step 4. Install and boot the kernel zone.

node1# zoneadm -z ant1 install -c ant1/sc_profile.xml
Progress being logged to /var/log/zones/zoneadm.20140815T155143Z.ant1.install
pkg cache: Using /var/pkg/publisher.
 Install Log: /system/volatile/install.15996/install_log
 AI Manifest: /tmp/zoneadm15390.W_a4NE/devel-ai-manifest.xml
  SC Profile: /usr/share/auto_install/sc_profiles/enable_sci.xml
Installation: Starting ...

        Creating IPS image
        Installing packages from:
                origin:  http://pkg.oracle.com/solaris/release/
        The following licenses have been accepted and not displayed.
        Please review the licenses for the following packages post-install:
        Package licenses may be viewed using the command:
          pkg info --license 

DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                            483/483   64276/64276  543.7/543.7    0B/s

PHASE                                          ITEMS
Installing new actions                   87530/87530
Updating package state database                 Done
Updating package cache                           0/0
Updating image state                            Done
Creating fast lookup database                   Done
Installation: Succeeded
        Done: Installation completed in 207.564 seconds.

node1# zoneadm -z ant1 boot

Step 5. With all of the hard work behind us, we can "warm migrate" the zone. The first step is preparation of the destination system - "node2" in our example - by applying the zone's configuration to the destination.

node1# zonecfg -z ant1 export -f /mnt/suspend/ant1.cfg
node2# zonecfg -z ant1 -f /mnt/suspend/ant1.cfg
Next, suspend the zone on node1 and detach it. The "detach" operation does not delete anything. It merely tells node1 to cease considering the zone to be usable.
node1# zoneadm -z ant1 suspend
node1# zoneadm -z ant1 detach

A separate "resume" sub-command for zoneadm was not necessary. The "boot" sub-command fulfills that purpose.

node2# zoneadm -z ant1 attach
node2# zoneadm -z ant1 boot
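
The order of the Step 5 commands matters: the configuration must exist on the destination before the attach, and the suspend must precede the detach. As a checklist, the sketch below prints the commands in that order for any zone and pair of nodes. The helper name warm_migrate_plan is hypothetical, and it only prints the commands; it does not run them.

```shell
# Hypothetical helper: print the warm-migration commands from Step 5 in the
# order they must run. Arguments: zone name, source node, destination node,
# and a directory that both nodes can access.
warm_migrate_plan() {
    zone=$1 src=$2 dst=$3 shared=$4
    cfg=$shared/$zone.cfg
    printf '%s# zonecfg -z %s export -f %s\n' "$src" "$zone" "$cfg"
    printf '%s# zonecfg -z %s -f %s\n' "$dst" "$zone" "$cfg"
    printf '%s# zoneadm -z %s suspend\n' "$src" "$zone"
    printf '%s# zoneadm -z %s detach\n' "$src" "$zone"
    printf '%s# zoneadm -z %s attach\n' "$dst" "$zone"
    printf '%s# zoneadm -z %s boot\n' "$dst" "$zone"
}

warm_migrate_plan ant1 node1 node2 /mnt/suspend
```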
Of course, "warm migration" is different from "live migration" in one important respect: the duration of the service outage. Live migration achieves a service outage that lasts a small fraction of a second. In one experiment, warm migration of a kernel zone created a service outage that lasted 30 seconds. It's not live migration, but is an important step forward, compared to other types of Solaris Zones.

The Notes

  1. This example used a zpool as back-end storage. That zpool provided data redundancy, so additional redundancy was not needed within the kernel zone. If unmirrored devices (e.g. physical disks) were specified in zonecfg, then data redundancy should be achieved within the zone. Fortunately, you can specify two devices in zonecfg, and "zoneadm ... install" will automatically mirror them.
  2. In a simple network configuration, the steps above create a kernel zone that has normal network access. More complicated networks may require additional steps, such as VLAN configuration, etc.
  3. Some steps regarding file permissions on the NFS mount were omitted for clarity. This is one of the security weaknesses of the steps shown above. All of the weaknesses can be addressed by using additional Solaris features. These include, but are not limited to, iSCSI features (iSNS, CHAP authentication, RADIUS, etc.), NFS security features (e.g. NFS ACLs, Kerberos, etc.), RBAC, etc.

The Links

Tuesday Apr 08, 2008

ZoiT: Solaris Zones on iSCSI Targets (aka NAC: Network-Attached Containers)


Solaris Containers have a 'zonepath' ('home') which can be a directory on the root file system or on a non-root file system. Until Solaris 10 8/07 was released, a local file system was required for this directory. Containers that are on non-root file systems have used UFS, ZFS, or VxFS. All of those are local file systems - putting Containers on NAS has not been possible. With Solaris 10 8/07, that has changed: a Container can now be placed on remote storage via iSCSI.


Solaris Containers (aka Zones) are Sun's operating system level virtualization technology. They allow a Solaris system (more accurately, an 'instance' of Solaris) to have multiple, independent, isolated application environments. A program running in a Container cannot detect or interact with a process in another Container.

Each Container has its own root directory. Although viewed as the root directory from within that Container, that directory is also a non-root directory in the global zone. For example, a Container's root directory might be called /zones/roots/myzone/root in the global zone.

The configuration of a Container includes something called its "zonepath." This is the directory which contains a Container's root directory (e.g. /zones/roots/myzone/root) and other directories used by Solaris. Therefore, the zonepath of myzone in the example above would be /zones/roots/myzone.
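
In other words, the zonepath is simply the parent of the Container's root directory as seen from the global zone. A tiny sketch of that relationship follows; the helper name zonepath_of is hypothetical.

```shell
# Hypothetical helper: given a Container's root directory as seen from the
# global zone, print its zonepath (the parent directory).
zonepath_of() {
    dirname "$1"
}

zonepath_of /zones/roots/myzone/root    # prints /zones/roots/myzone
```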

The global zone administrator can choose any directory to be a Container's zonepath. That directory could just be a directory on the root partition of Solaris, though in that case some mechanism should be used to prevent that Container from filling up the root partition. Another alternative is to use a separate partition for that Container, or one shared among multiple Containers. In the latter case, a quota should be used for each Container.

Local file systems have been used for zonepaths. However, many people have strongly expressed a desire for the ability to put Containers on remote storage. One significant advantage to placing Containers on NAS is the simplification of Container migration - moving a Container from one system to another. When using a local file system, the contents of the Container must be transmitted from the original host to the new host. For small, sparse zones this can take as little as a few seconds. For large, whole-root zones, this can take several minutes - a whole-root zone is an entire copy of Solaris, taking up as much as 3-5 GB. If remote storage can be used to store a zone, the zone's downtime can be as little as a second or two, during which time a file system is unmounted on one system and mounted on another.

Here are some significant advantages to iSCSI over SANs:

  1. the ability to use commodity Ethernet switching gear, which tends to be less expensive than SAN switching equipment
  2. the ability to manage storage bandwidth via standard, mature, commonly used IP QoS features
  3. iSCSI networks can be combined with non-iSCSI IP networks to reduce the hardware investment and consolidate network management. If that is not appropriate, the two networks can be separate but use the same type of equipment, reducing costs and the types of in-house infrastructure management expertise needed.

Unfortunately, a Container cannot 'live' on an NFS server, and it's not clear if or when that limitation will be removed.

iSCSI Basics

iSCSI is simply "SCSI communication over IP." In this case, SCSI commands and responses are sent between two iSCSI-capable devices, which can be general-purpose computers (Solaris, Windows, Linux, etc.) or specific-purpose storage devices (e.g. Sun StorageTek 5210 NAS, EMC Celerra NS40, etc.). There are two endpoints to iSCSI communications: the initiator (client) and the target (server). A target publicizes its existence. An initiator binds to a target.

The industry's design for iSCSI includes a large number of features, including security. Solaris implements many of those features.

In Solaris, the command iscsiadm(1M) configures an initiator, and the command iscsitadm(1M) configures a target.


This section demonstrates the installation of a Container onto a remote file system that uses iSCSI for its transport.

The target system is an LDom on a T2000, and looks like this:

System Configuration:  Sun Microsystems  sun4v
Memory size: 1024 Megabytes
SunOS ldg1 5.10 Generic_127111-07 sun4v sparc SUNW,Sun-Fire-T200
Solaris 10 8/07 s10s_u4wos_12b SPARC
The initiator system is another LDom on the same T2000 - although there is no requirement that LDoms be used, or that they be on the same computer if they are used.
System Configuration:  Sun Microsystems  sun4v
Memory size: 896 Megabytes
SunOS ldg4 5.11 snv_83 sun4v sparc SUNW,Sun-Fire-T200
Solaris Nevada snv_83a SPARC
The first configuration step is the creation of the storage underlying the iSCSI target. Although UFS could be used, let's improve the robustness of the Container's contents and put the target's storage under control of ZFS. I don't have extra disk devices to give to ZFS, so I'll make some and use them for a zpool - in real life you would use disk devices here:
Target# mkfile 150m /export/home/disk0
Target# mkfile 150m /export/home/disk1
Target# zpool create myscsi mirror /export/home/disk0 /export/home/disk1
Target# zpool status
  pool: myscsi
 state: ONLINE
 scrub: none requested

        NAME                  STATE     READ WRITE CKSUM
        myscsi                ONLINE       0     0     0
          /export/home/disk0  ONLINE       0     0     0
          /export/home/disk1  ONLINE       0     0     0
Now I can create a zvol - an emulation of a disk device:
Target# zfs list
myscsi   86K   258M  24.5K  /myscsi
Target# zfs create -V 200m myscsi/jvol0
Target# zfs list
myscsi         200M  57.9M  24.5K  /myscsi
myscsi/jvol0  22.5K   258M  22.5K  -
Creating an iSCSI target device from a zvol is easy:
Target# iscsitadm list target
Target# zfs set shareiscsi=on myscsi/jvol0
Target# iscsitadm list target
Target: myscsi/jvol0
    iSCSI Name: iqn.1986-03.com.sun:02:c8a82272-b354-c913-80f9-db9cb378a6f6
    Connections: 0
Target# iscsitadm list target -v
Target: myscsi/jvol0
    iSCSI Name: iqn.1986-03.com.sun:02:c8a82272-b354-c913-80f9-db9cb378a6f6
    Alias: myscsi/jvol0
    Connections: 0
    ACL list:
    TPGT list:
    LUN information:
        LUN: 0
            GUID: 0x0
            VID: SUN
            PID: SOLARIS
            Type: disk
            Size:  200M
            Backing store: /dev/zvol/rdsk/myscsi/jvol0
            Status: online

Configuring the iSCSI initiator takes a little more work. There are three methods to find targets. I will use a simple one. After telling Solaris to use that method, it only needs to know what the IP address of the target is.

Note that the example below uses "iscsiadm list ..." several times, without any output. The purpose is to show the difference in output before and after the command(s) between them.

First let's look at the disks available before configuring iSCSI on the initiator:

Initiator# ls /dev/dsk
c0d0s0  c0d0s2  c0d0s4  c0d0s6  c0d1s0  c0d1s2  c0d1s4  c0d1s6
c0d0s1  c0d0s3  c0d0s5  c0d0s7  c0d1s1  c0d1s3  c0d1s5  c0d1s7
We can view the currently enabled discovery methods, and enable the one we want to use:
Initiator# iscsiadm list discovery
        Static: disabled
        Send Targets: disabled
        iSNS: disabled
Initiator# iscsiadm list target
Initiator# iscsiadm modify discovery --sendtargets enable
Initiator# iscsiadm list discovery
        Static: disabled
        Send Targets: enabled
        iSNS: disabled
At this point we just need to tell Solaris which IP address we want to use as a target. It takes care of all the details, finding all disk targets on the target system. In this case, there is only one disk target.
Initiator# iscsiadm list target
Initiator# iscsiadm add discovery-address
Initiator# iscsiadm list target
Target: iqn.1986-03.com.sun:02:c8a82272-b354-c913-80f9-db9cb378a6f6
        Alias: myscsi/jvol0
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
Initiator# iscsiadm list target -v
Target: iqn.1986-03.com.sun:02:c8a82272-b354-c913-80f9-db9cb378a6f6
        Alias: myscsi/jvol0
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
                CID: 0
                  IP address (Local):
                  IP address (Peer):
                  Discovery Method: SendTargets
                  Login Parameters (Negotiated):
                        Data Sequence In Order: yes
                        Data PDU In Order: yes
                        Default Time To Retain: 20
                        Default Time To Wait: 2
                        Error Recovery Level: 0
                        First Burst Length: 65536
                        Immediate Data: yes
                        Initial Ready To Transfer (R2T): yes
                        Max Burst Length: 262144
                        Max Outstanding R2T: 1
                        Max Receive Data Segment Length: 8192
                        Max Connections: 1
                        Header Digest: NONE
                        Data Digest: NONE
The initiator automatically finds the iSCSI remote storage, but we need to turn this into a disk device. (Newer builds seem to not need this step, but it won't hurt. Looking in /devices/iscsi will help determine whether it's needed.)
Initiator# devfsadm -i iscsi
Initiator# ls /dev/dsk
c0d0s0    c0d0s3    c0d0s6    c0d1s1    c0d1s4    c0d1s7    c1t7d0s2  c1t7d0s5
c0d0s1    c0d0s4    c0d0s7    c0d1s2    c0d1s5    c1t7d0s0  c1t7d0s3  c1t7d0s6
c0d0s2    c0d0s5    c0d1s0    c0d1s3    c0d1s6    c1t7d0s1  c1t7d0s4  c1t7d0s7
Initiator# ls -l /dev/dsk/c1t7d0s0
lrwxrwxrwx   1 root     root         100 Mar 28 00:40 /dev/dsk/c1t7d0s0 ->
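
Comparing the two /dev/dsk listings by eye works for a handful of disks. For longer listings, the comparison can be scripted. The sketch below assumes the "before" and "after" listings were saved to two files; the helper names disk_names and new_disks are hypothetical.

```shell
# Hypothetical helper: read "ls /dev/dsk" output on stdin and print one
# disk name per line, with the slice suffix (sN) removed.
disk_names() {
    tr -s ' \t' '\n\n' | sed -e '/^$/d' -e 's/s[0-9]*$//' | sort -u
}

# Hypothetical helper: print the disks present in the second listing
# but absent from the first. Arguments: "before" file, "after" file.
new_disks() {
    tmp=$(mktemp) || return 1
    disk_names < "$1" > "$tmp"
    disk_names < "$2" | grep -v -x -F -f "$tmp"
    rm -f "$tmp"
}

# Usage, assuming the listings were saved before and after discovery:
#   ls /dev/dsk > /tmp/dsk.before
#   (configure iSCSI, run devfsadm -i iscsi)
#   ls /dev/dsk > /tmp/dsk.after
#   new_disks /tmp/dsk.before /tmp/dsk.after
```

With the two listings shown in this entry, the only new disk is c1t7d0.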

Now that the local device entry exists, we can do something useful with it. Installing a new file system requires the use of format(1M) to partition the "disk" but it is assumed that the reader knows how to do that. However, here is the first part of the format dialogue, to show that format lists the new disk device with its unique identifier - the same identifier listed in /devices/iscsi.
Initiator# format
Searching for disks...done

c1t7d0: configured with capacity of 199.98MB

       0. c0d0 
       1. c0d1 
       2. c1t7d0 
Specify disk (enter its number): 2
selecting c1t7d0
[disk formatted]
Disk not labeled.  Label it now? no

Let's jump to the end of the partitioning steps, after assigning all of the available disk space to partition 0:
partition> print
Current partition table (unnamed):
Total disk cylinders available: 16382 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm       0 - 16381      199.98MB    (16382/0/0) 409550
  1 unassigned    wu       0                0         (0/0/0)          0
  2     backup    wu       0 - 16381      199.98MB    (16382/0/0) 409550
  3 unassigned    wm       0                0         (0/0/0)          0
  4 unassigned    wm       0                0         (0/0/0)          0
  5 unassigned    wm       0                0         (0/0/0)          0
  6 unassigned    wm       0                0         (0/0/0)          0
  7 unassigned    wm       0                0         (0/0/0)          0

partition> label
Ready to label disk, continue? y

The new raw disk needs a file system.
Initiator# newfs /dev/rdsk/c1t7d0s0
newfs: construct a new file system /dev/rdsk/c1t7d0s0: (y/n)? y
/dev/rdsk/c1t7d0s0:     409550 sectors in 16382 cylinders of 5 tracks, 5 sectors
        200.0MB in 1024 cyl groups (16 c/g, 0.20MB/g, 128 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 448, 864, 1280, 1696, 2112, 2528, 2944, 3232, 3648,
Initializing cylinder groups:
super-block backups for last 10 cylinder groups at:
 405728, 406144, 406432, 406848, 407264, 407680, 408096, 408512, 408928, 409344

Back on the target:
Target# zfs list
myscsi         200M  57.9M  24.5K  /myscsi
myscsi/jvol0  32.7M   225M  32.7M  -
Finally, the initiator has a new file system, on which we can install a zone.
Initiator# mkdir /zones/newroots
Initiator# mount /dev/dsk/c1t7d0s0 /zones/newroots
Initiator# zonecfg -z iscuzone
iscuzone: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:iscuzone> create
zonecfg:iscuzone> set zonepath=/zones/newroots/iscuzone
zonecfg:iscuzone> add inherit-pkg-dir
zonecfg:iscuzone:inherit-pkg-dir> set dir=/opt
zonecfg:iscuzone:inherit-pkg-dir> end
zonecfg:iscuzone> exit
Initiator# zoneadm -z iscuzone install
Preparing to install zone <iscuzone>.
Creating list of files to copy from the global zone.
Copying <2762> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1162> packages on the zone.
Initialized <1162> packages on zone.
Zone <iscuzone> is initialized.
Installation of these packages generated warnings: 
The file  contains a log of the zone installation.

There it is: a Container on an iSCSI target on a ZFS zvol.

Zone Lifecycle, and Tech Support

There is more to management of Containers than creating them. When a Solaris instance is upgraded, all of its native Containers are upgraded as well. Some upgrade methods work better with certain system configurations than others. This is true for UFS, ZFS, other local file system types, and iSCSI targets that use any of them for underlying storage.

You can use Solaris Live Upgrade to patch or upgrade a system with Containers. If the Containers are on a traditional file system which uses UFS (e.g. /, /export/home) LU will automatically do the right thing. Further, if you create a UFS file system on an iSCSI target and install one or more Containers on it, the ABE will also need file space for its copy of those Containers. To mimic the layout of the original BE you could use another UFS file system on another iSCSI target. The lucreate command would look something like this:

# lucreate -m /:/dev/dsk/c0t0d0s0:ufs   -m /zones:/dev/dsk/c1t7d0s0:ufs -n newBE


If you want to put your Solaris Containers on NAS storage, Solaris 10 8/07 will help you get there, using iSCSI.


Jeff Victor writes this blog to help you understand Oracle's Solaris and virtualization technologies.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

