Thursday Jun 27, 2013

Improving Manageability of Virtual Environments

Boot Environments for Solaris 10 Branded Zones

Until recently, Solaris 10 Branded Zones on Solaris 11 suffered one notable regression: Live Upgrade did not work. The individual packaging and patching tools worked correctly, but you could not upgrade Solaris while the production workload continued to run. A recent Solaris 11 SRU (Solaris 11.1 SRU 6.4) restored most of that functionality, although with a slightly different concept, different commands, and without all of the feature details. This new method lets you create and manage multiple boot environments (BEs) for a Solaris 10 Branded Zone, and modify the active BE or any inactive BE, all while the production workload continues to run.

Background

In case you are new to Solaris: Solaris includes a set of features that enables you to create a bootable Solaris image, called a Boot Environment (BE). This newly created image can be modified while the original BE is still running your workload(s). There are many benefits, including improved uptime and the ability to reboot into (or downgrade to) an older BE if a newer one has a problem.

In Solaris 10 this set of features was named Live Upgrade. Solaris 11 applies the same basic concepts to the new packaging system (IPS) but there isn't a specific name for the feature set. The features are simply part of IPS. Solaris 11 Boot Environments are not discussed in this blog entry.

Although a Solaris 10 system can have multiple BEs, until recently a Solaris 10 Branded Zone (BZ) in a Solaris 11 system did not have this ability. This limitation was addressed recently, and that enhancement is the subject of this blog entry.

This new implementation uses two concepts. The first is the use of a ZFS clone for each BE. This makes it very easy to create a BE, or many BEs. This is a distinct advantage over the Live Upgrade feature set in Solaris 10, which had a practical limitation of two BEs on a system, when using UFS. The second new concept is a very simple mechanism to indicate the BE that should be booted: a ZFS property. The new ZFS property is named com.oracle.zones.solaris10:activebe (isn't that creative? ;-) ).

It's important to note that the property is inherited from the original BE's file system to any BEs you create. In other words, all BEs in one zone have the same value for that property. When the (Solaris 11) global zone boots the Solaris 10 BZ, it boots the BE that has the name that is stored in the activebe property.

Here is a quick summary of the actions you can use to manage these BEs:

To create a BE:

  • Create a ZFS clone of the zone's root dataset

To activate a BE:

  • Set the ZFS property of the root dataset to indicate the BE

To add a package or patch to an inactive BE:

  • Mount the inactive BE
  • Add packages or patches to it
  • Unmount the inactive BE

To list the available BEs:

  • Use the "zfs list" command.

To destroy a BE:

  • Use the "zfs destroy" command.

Preparation

Before you can use the new features, you will need a Solaris 10 BZ on a Solaris 11 system. You can use these three steps - on a real Solaris 11.1 server or in a VirtualBox guest running Solaris 11.1 - to create a Solaris 10 BZ. The Solaris 11.1 environment must be at SRU 6.4 or newer.

  1. Create a flash archive on the Solaris 10 system
    s10# flarcreate -n s10-system /net/zones/archives/s10-system.flar
  2. Configure the Solaris 10 BZ on the Solaris 11 system
    s11# zonecfg -z s10z
    Use 'create' to begin configuring a new zone.
    zonecfg:s10z> create -t SYSsolaris10
    zonecfg:s10z> set zonepath=/zones/s10z
    zonecfg:s10z> exit
    s11# zoneadm list -cv
      ID NAME             STATUS     PATH                           BRAND     IP    
       0 global           running    /                              solaris   shared
       - s10z             configured /zones/s10z                    solaris10 excl  
    
  3. Install the zone from the flash archive
    s11# zoneadm -z s10z install -a /net/zones/archives/s10-system.flar -p
    

You can find more information about the migration of Solaris 10 environments to Solaris 10 Branded Zones in the documentation.

The rest of this blog entry demonstrates the commands you can use to accomplish the aforementioned actions related to BEs.

New features in action

Note that the demonstration of the commands occurs in the Solaris 10 BZ, as indicated by the shell prompt "s10z# ". Many of these commands can be performed in the global zone instead, if you prefer. If you perform them in the global zone, you must change the ZFS file system names.

Create

The only complicated action is the creation of a BE. In the Solaris 10 BZ, create a new "boot environment" - a ZFS clone. You can assign any name to the final portion of the clone's name, as long as it meets the requirements for a ZFS file system name.

s10z# zfs snapshot rpool/ROOT/zbe-0@snap
s10z# zfs clone -o mountpoint=/ -o canmount=noauto rpool/ROOT/zbe-0@snap rpool/ROOT/newBE
cannot mount 'rpool/ROOT/newBE' on '/': directory is not empty
filesystem successfully created, but not mounted
You can safely ignore that message: we already know that / is not empty! We have merely told ZFS that the default mountpoint for the clone is the root directory.

(Note that a Solaris 10 BZ that has a separate /var file system requires additional steps. See the MOS document mentioned at the bottom of this blog entry.)
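
If you create BEs often, the two commands above can be wrapped in a small function. This is only a sketch: create_be is a hypothetical helper (not a Solaris command), the snapshot naming scheme is my own choice, and it assumes the zone has no separate /var file system. It prints the commands by default so you can review them before running anything.

```shell
# Dry-run by default: commands are printed, not executed.
# Set RUN="" inside the Solaris 10 Branded Zone to execute them.
RUN=${RUN-echo}

# create_be SOURCE_BE NEW_BE   (hypothetical helper, not a Solaris command)
# Snapshots the source BE and clones the snapshot as a new BE.
create_be() {
    src=$1
    new=$2
    snap="rpool/ROOT/${src}@${new}-snap"   # snapshot name is an assumption
    $RUN zfs snapshot "$snap" &&
    $RUN zfs clone -o mountpoint=/ -o canmount=noauto "$snap" "rpool/ROOT/${new}"
}

create_be zbe-0 newBE
```

When run for real, expect the same "directory is not empty" message from the clone step that is shown above.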

List the available BEs and active BE

Because each BE is represented by a clone of the rpool/ROOT dataset, listing the BEs is as simple as listing the clones.

s10z# zfs list -r rpool/ROOT
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT        3.55G  42.9G    31K  legacy
rpool/ROOT/zbe-0     1K  42.9G  3.55G  /
rpool/ROOT/newBE  3.55G  42.9G  3.55G  /
The output shows that two BEs exist. Their names are "zbe-0" and "newBE".

You can tell Solaris that one particular BE should be used when the zone next boots by using a ZFS property. Its name is com.oracle.zones.solaris10:activebe. The value of that property is the name of the clone that contains the BE that should be booted.

s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT
NAME        PROPERTY                             VALUE  SOURCE
rpool/ROOT  com.oracle.zones.solaris10:activebe  zbe-0  local
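
The two commands above can be combined to show the BEs with the active one flagged. The function below is a sketch with a made-up name (mark_active). It assumes a POSIX-style shell such as ksh or bash rather than the legacy /bin/sh, and it only processes text, so the demonstration line feeds it sample dataset names instead of calling zfs.

```shell
# mark_active ACTIVE_BE   (hypothetical helper)
# Reads dataset names on stdin, one per line, and prints the BE names,
# flagging the one that matches ACTIVE_BE.
mark_active() {
    active=$1
    while IFS= read -r ds; do
        be=${ds#rpool/ROOT/}
        [ "$be" = "$ds" ] && continue      # skip rpool/ROOT itself
        if [ "$be" = "$active" ]; then
            echo "$be  (active on next boot)"
        else
            echo "$be"
        fi
    done
}

# Demonstration with sample names. In the zone you would use, for example:
#   active=$(zfs get -H -o value com.oracle.zones.solaris10:activebe rpool/ROOT)
#   zfs list -H -o name -r rpool/ROOT | mark_active "$active"
printf '%s\n' rpool/ROOT rpool/ROOT/zbe-0 rpool/ROOT/newBE | mark_active zbe-0
```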

Change the active BE

When you want to change the BE that will be booted next time, you can just change the activebe property on the rpool/ROOT dataset.

s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT
NAME        PROPERTY                             VALUE  SOURCE
rpool/ROOT  com.oracle.zones.solaris10:activebe  zbe-0  local
s10z# zfs set com.oracle.zones.solaris10:activebe=newBE rpool/ROOT
s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT
NAME        PROPERTY                             VALUE  SOURCE
rpool/ROOT  com.oracle.zones.solaris10:activebe  newBE  local
s10z# shutdown -y -g0 -i6
After the zone has rebooted:
s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT
rpool/ROOT  com.oracle.zones.solaris10:activebe  newBE  local
s10z# zfs mount
rpool/ROOT/newBE                /
rpool/export                    /export
rpool/export/home               /export/home
rpool                           /rpool
Mount the original BE to see that it's still there.
s10z# zfs mount -o mountpoint=/mnt rpool/ROOT/zbe-0
s10z# ls /mnt
Desktop                         export                          platform
Documents                       export.backup.20130607T214951Z  proc
S10Flar                         home                            rpool
TT_DB                           kernel                          sbin
bin                             lib                             system
boot                            lost+found                      tmp
cdrom                           mnt                             usr
dev                             net                             var
etc                             opt

Patch an inactive BE

At this point, you can modify the original BE. If you would prefer to modify the new BE, you can restore the original value to the activebe property and reboot, and then mount the new BE to /mnt (or another empty directory) and modify it.

Let's mount the original BE so we can modify it. (The first command is only needed if you haven't already mounted that BE.)

s10z# zfs mount -o mountpoint=/mnt rpool/ROOT/zbe-0
s10z# patchadd -R /mnt -M /var/sadm/spool 104945-02
Note that the typical usage will be:
  1. Create a BE
  2. Mount the new (inactive) BE
  3. Use the package and patch tools to update the new BE
  4. Unmount the new BE
  5. Reboot
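
Steps 2 through 4 of that list, plus activation of the patched BE, might be scripted roughly as follows. Treat this as a sketch: patch_be is a hypothetical helper, /mnt is assumed to be an empty directory, and the activation step is my addition (the list above leaves it implicit before the reboot). The commands are printed rather than executed unless you set RUN to an empty string.

```shell
# Dry-run by default: commands are printed, not executed.
RUN=${RUN-echo}

# patch_be BE PATCH_DIR PATCH_ID   (hypothetical helper)
# Mounts an inactive BE, patches it, unmounts it, and marks it active.
patch_be() {
    be=$1
    dir=$2
    patch=$3
    $RUN zfs mount -o mountpoint=/mnt "rpool/ROOT/${be}" &&
    $RUN patchadd -R /mnt -M "$dir" "$patch" &&
    $RUN zfs unmount "rpool/ROOT/${be}" &&
    $RUN zfs set "com.oracle.zones.solaris10:activebe=${be}" rpool/ROOT
}

patch_be newBE /var/sadm/spool 104945-02
```

Because each command is chained with &&, a failed patchadd stops the sequence before the BE is activated. The reboot is left to you.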

Delete an inactive BE

ZFS clones are children of their parent file systems. In order to destroy the parent, you must first "promote" the child. This reverses the parent-child relationship. (For more information on this, see the documentation.)

The original BE's file system (rpool/ROOT/zbe-0 in this example) is the parent of the clones that you create as BEs. In order to destroy an earlier BE that is the parent of other BEs, you must first promote one of the child BEs to be the ZFS parent. Only then can you destroy the original BE.

Fortunately, this is easier to do than to explain:

s10z# zfs promote rpool/ROOT/newBE 
s10z# zfs destroy rpool/ROOT/zbe-0
s10z# zfs list -r rpool/ROOT
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT        3.56G   269G    31K  legacy
rpool/ROOT/newBE  3.56G   269G  3.55G  /
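
Destroying the wrong BE is not something you want to do by accident, so a wrapper with a simple guard can help. The sketch below uses a hypothetical helper name (destroy_be) and assumes that the BE you keep is a clone of the BE you destroy, as in the example above. It prints the commands unless RUN is set to an empty string.

```shell
# Dry-run by default: commands are printed, not executed.
RUN=${RUN-echo}

# destroy_be BE ACTIVE_BE KEEP_BE   (hypothetical helper)
# Promotes KEEP_BE, then destroys BE. Refuses if BE is the active BE.
destroy_be() {
    be=$1
    active=$2
    keep=$3
    if [ "$be" = "$active" ]; then
        echo "refusing to destroy the active BE: $be" >&2
        return 1
    fi
    $RUN zfs promote "rpool/ROOT/${keep}" &&
    $RUN zfs destroy "rpool/ROOT/${be}"
}

destroy_be zbe-0 newBE newBE
```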

Documentation

This feature is so new, it is not yet described in the Solaris 11 documentation. However, MOS note 1558773.1 offers some details.

Conclusion

With this new feature, you can add packages and patches to boot environments of a Solaris 10 Branded Zone. This ability improves the manageability of these zones, and makes their use more practical. It also means that you can use the existing P2V tools with earlier Solaris 10 updates, and modify the environments after they become Solaris 10 Branded Zones.

Wednesday Jun 12, 2013

Comparing Solaris 11 Zones to Solaris 10 Zones

Many people have asked whether Oracle Solaris 11 uses sparse-root zones or whole-root zones. I think the best answer is "both and neither, and more" - but that's a wee bit confusing. :-) This blog entry attempts to explain that answer.

First a recap: Solaris 10 introduced the Solaris Zones feature set, way back in 2005. Zones are a form of server virtualization called "OS (Operating System) Virtualization." They improve consolidation ratios by isolating processes from each other so that they cannot interact. Each zone has its own set of users, naming services, and other software components. One of the many advantages is that there is no need for a hypervisor, so zones avoid the performance overhead that a hypervisor adds. Many data centers run tens to hundreds of zones per server!

In Solaris 10, there are two models of package deployment for Solaris Zones. One model is called "sparse-root" and the other "whole-root." Each form has specific characteristics, abilities, and limitations.

A whole-root zone has its own copy of the Solaris packages. This allows the inclusion of other software in system directories - even though that practice has been discouraged for many years. Although it is also possible to modify the Solaris content in such a zone, e.g. patching a zone separately from the rest, this was highly frowned on. :-( (More importantly, modifying the Solaris content in a whole-root zone may lead to an unsupported configuration.)

The other model is called "sparse-root." In that form, instead of copying all of the Solaris packages into the zone, the directories containing Solaris binaries are re-mounted into the zone. This allows the zone's users to access them at their normal places in the directory tree. Those are read-only mounts, so a zone's root user cannot modify them. This improves security, and also reduces the amount of disk space used by the zone - 200MB instead of the usual 3-5 GB per zone. These loopback mounts also reduce the amount of RAM used by zones because Solaris stores in RAM only one copy of a program that is in use by several zones. This model also has disadvantages. One disadvantage is the inability to add software into system directories such as /usr. Also, although a sparse-root zone can be migrated to another Solaris 10 system, it cannot be moved to a Solaris 11 system as a "Solaris 10 Zone."

In addition to those contrasting characteristics, here are some characteristics of zones in Solaris 10 that are shared by both packaging models:

  • A zone can modify its own configuration files in /etc.
  • A zone can be configured so that it manages its own networking, or so that it cannot modify its network configuration.
  • It is difficult to give a non-root user in the global zone the ability to boot and stop a zone, without giving that user other abilities.
  • In a zone that can manage its own networking, the root user can do harmful things like spoof other IP addresses and MAC addresses.
  • It is difficult to assign network packet processing to the same CPUs that a zone uses. This can lead to unpredictable performance and performance troubleshooting challenges.
  • You cannot run a large number of zones in one system (e.g. 50) that each manage their own networking, because that would require more physical NICs than are available (e.g. 50).
  • Except when managed by Ops Center, zones cannot be safely stored on NAS.
  • Solaris 10 Zones cannot be NFS servers.
  • The fsstat command does not report statistics per zone.

Solaris 11 Zones use the new packaging system of Solaris 11. Their configuration does not offer a choice of packaging models, as Solaris 10 does. Instead, two (well, four) different models of "immutability" (changeability) are offered. The default model allows a privileged zone user to modify the zone's content. The other (three) limit the content which can be changed: none, or two overlapping sets of configuration files. (See "Configuring and Administering Immutable Zones".)
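
To make those four models concrete: as far as I recall from the zonecfg documentation, they are selected with the file-mac-profile property, whose values are none (the default, writable zone), strict, fixed-configuration, and flexible-configuration. The sketch below only builds the command line. immutable_cmd is a hypothetical helper, and you should confirm the value names against your release's zonecfg(1M) man page.

```shell
# immutable_cmd ZONE PROFILE   (hypothetical helper)
# Prints the zonecfg command that selects an immutability model,
# after checking the profile name against the four values I know of.
immutable_cmd() {
    case $2 in
        none|strict|fixed-configuration|flexible-configuration)
            echo "zonecfg -z $1 set file-mac-profile=$2" ;;
        *)
            echo "unknown file-mac-profile: $2" >&2
            return 1 ;;
    esac
}

immutable_cmd zone1 fixed-configuration
```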

Solaris 11 addresses many of those limitations. With the characteristics listed above in mind, the following table shows the similarities and differences between zones in Solaris 10 and in Solaris 11. (Cells in a row that are similar have the same background color.)

                                                                      Solaris 10   Solaris 10    Solaris 11   Solaris 11
Characteristic                                                        Whole-Root   Sparse-Root                Immutable Zones
Each zone has a copy of most Solaris packages                         Yes          No            Yes          Yes
Disk space used by a zone (typical)                                   3.5 GB       100 MB        500 MB       500 MB
A privileged zone user can add software to /usr                       Yes          No            Yes          No
A zone can modify its Solaris programs                                True         False         True         False
Each zone can modify its configuration files                          Yes          Yes           Yes          No
Delegated administration                                              No           No            Yes          Yes
A zone can be configured to manage its own networking                 Yes          Yes           Yes          Yes
A zone can be configured so that it cannot manage its own networking  Yes          Yes           Yes          Yes
A zone can be configured with resource controls                       Yes          Yes           Yes          Yes
Integrated tool to measure a zone's resource consumption (zonestat)   No           No            Yes          Yes
Network processing automatically happens on that zone's CPUs          No           No            Yes          Yes
Zones can be NFS servers                                              No           No            Yes          Yes
Per-zone fsstat data                                                  No           No            Yes          Yes

As you can see, the statement "Solaris 11 Zones are whole-root zones" is only true using the narrowest definition of whole-root zones: those zones which have their own copy of Solaris packaging content. But there are other valuable characteristics of sparse-root zones that are still available in Solaris 11 Zones. Also, some Solaris 11 Zones do not have some characteristics of whole-root zones.

For example, the table above shows that you can configure a Solaris 11 zone that has read-only Solaris content. And Solaris 11 takes that concept further, offering the ability to tailor that immutability. It also shows that Solaris 10 sparse-root and whole-root zones are more similar to each other than to Solaris 11 Zones.

Conclusion

Solaris 11 Zones are slightly different from Solaris 10 Zones. The former can achieve the goals of the latter, and they also offer features not found in Solaris 10 Zones. Solaris 11 Zones offer the best of Solaris 10 whole-root zones and sparse-root zones, and offer an array of new features that make Zones even more flexible and powerful.

Tuesday Nov 13, 2012

Oracle Solaris: Zones on Shared Storage

Oracle Solaris 11.1 has several new features. At oracle.com you can find a detailed list.

One of the significant new features, and the most significant new feature related to Oracle Solaris Zones, is casually called "Zones on Shared Storage" or simply ZOSS (rhymes with "moss"). ZOSS offers much more flexibility because you can store Solaris Zones on shared storage (surprise!) so that you can perform quick and easy migration of a zone from one system to another. This blog entry describes and demonstrates the use of ZOSS.

ZOSS provides complete support for a Solaris Zone that is stored on "shared storage." In this case, "shared storage" refers to fiber channel (FC) or iSCSI devices, although there is one lone exception that I will demonstrate soon. The primary intent is to enable you to store a zone on FC or iSCSI storage so that it can be migrated from one host computer to another much more easily and safely than in the past.

With this blog entry, I wanted to make it easy for you to try this yourself. I couldn't assume that you have a SAN available - which is a good thing, because neither do I! :-) What could I use, instead? [There he goes, foreshadowing again... -Ed.]

Developing this entry reinforced the lesson that the solution to every lab problem is VirtualBox. ;-) Oracle VM VirtualBox (its formal name) helps here in a couple of important ways. It offers the ability to easily install multiple copies of Solaris as guests on top of any popular system (Microsoft Windows, MacOS, Solaris, Oracle Linux (and other Linuxes) etc.). It also offers the ability to create a separate virtual disk drive (VDI) that appears as a local hard disk to a guest. This virtual disk can be moved very easily from one guest to another. In other words, you can follow the steps below on a laptop or larger x86 system.

Please note that the ability to use ZOSS to store a zone on a local disk is very useful for a lab environment, but not so useful for production. I do not suggest regularly moving disk drives among computers. [Update, 2013.01.28: Apparently the previous sentence caused some confusion. I do recommend the use of Zones on Shared Storage in production environments, when appropriate storage is used. "Appropriate storage" would include SAN or iSCSI at this point. I do not recommend using ZOSS with local disks in production because doing so would require moving the disks between computers.]

In the method I describe below, that virtual hard disk will contain the zone that will be migrated among the (virtual) hosts. In production, you would use FC or iSCSI LUNs instead. The zonecfg(1M) man page details the syntax for each of the three types of devices.

Why Migrate?

Why is the migration of virtual servers important? Some of the most common reasons are:
  • Moving a workload to a different computer so that the original computer can be turned off for extensive maintenance.
  • Moving a workload to a larger system because the workload has outgrown its original system.
  • If the workload runs in an environment (such as a Solaris Zone) that is stored on shared storage, you can restore the service of the workload on an alternate computer if the original computer has failed and will not reboot.
  • You can simplify lifecycle management of a workload by developing it on a laptop, migrating it to a test platform when it's ready, and finally moving it to a production system.

Concepts

For ZOSS, the important new concept is named "rootzpool". You can read about it in the zonecfg(1M) man page, but here's the short version: it's the backing store (hard disk(s), or LUN(s)) that will be used to make a ZFS zpool - the zpool that will hold the zone. This zpool:

  • contains the zone's Solaris content, i.e. the root file system
  • does not contain any content not related to the zone
  • can only be mounted by one Solaris instance at a time

Method Overview

Here is a brief list of the steps to create a zone on shared storage and migrate it. The next section shows the commands and output.
  1. You will need a host system with an x86 CPU (hopefully at least a couple of CPU cores), at least 2GB of RAM, and at least 25GB of free disk space. (The steps below will not actually use 25GB of disk space, but I don't want to lead you down a path that ends in a big sign that says "Your HDD is full. Good luck!")
  2. Configure the zone on both systems, specifying the rootzpool that both will use. The best way is to configure it on one system and then copy the output of "zonecfg export" to the other system to be used as input to zonecfg. This method reduces the chances of pilot error. (It is not necessary to configure the zone on both systems before creating it. You can configure this zone in multiple places, whenever you want, and migrate it to one of those places at any time - as long as those systems all have access to the shared storage.)
  3. Install the zone on one system, onto shared storage.
  4. Boot the zone.
  5. Provide system configuration information to the zone. (In the Real World(tm) you will usually automate this step.)
  6. Shut down the zone.
  7. Detach the zone from the original system.
  8. Attach the zone to its new "home" system.
  9. Boot the zone.
The zone can be used normally, and even migrated back, or to a different system.
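
For step 2, the exported configuration usually needs one change before it is loaded on the other system: the name of the rootzpool storage device, which can differ between hosts. The filter below is a sketch. retarget_storage is a hypothetical helper, the device names are examples, and the sample input is an abbreviated illustration of an exported configuration, not literal "zonecfg export" output.

```shell
# retarget_storage OLD_DEV NEW_DEV   (hypothetical helper)
# Filters an exported zone configuration on stdin, replacing the
# rootzpool storage device name with the one used on the target system.
retarget_storage() {
    sed "s|dev:dsk/$1|dev:dsk/$2|"
}

# Intended use (file names and copy method are up to you):
#   first#  zonecfg -z zone1 export -f /tmp/zone1.cfg
#   second# retarget_storage c7t2d0 c7t3d0 < /tmp/zone1.cfg > /tmp/zone1.new.cfg
#   second# zonecfg -z zone1 -f /tmp/zone1.new.cfg

# Demonstration with illustrative input:
printf '%s\n' 'add rootzpool' 'add storage dev:dsk/c7t2d0' 'end' |
    retarget_storage c7t2d0 c7t3d0
```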

Details

The rest of this entry shows the commands and output. The two hostnames are "sysA" and "sysB".

Note that each Solaris guest might use a different device name for the VDI that they share. I used the device names shown below, but you must discover the device name(s) after booting each guest. In a production environment you would also discover the device name first and then configure the zone with that name. Fortunately, you can use the command "zpool import" or "format" to discover the device on the "new" host for the zone.

The first steps create the VirtualBox guests and the shared disk drive. I describe the steps here without demonstrating them.

  1. Download VirtualBox and install it using a method normal for your host OS. You can read the complete instructions.
  2. Create two VirtualBox guests, each to run Solaris 11.1. Each will use its own VDI as its root disk.
  3. Install Solaris 11.1 in each guest. To do so, you can either download a pre-built VirtualBox guest and import it, or install Solaris 11.1 from the "text install" media. If you use the latter method, after booting you will not see a windowing system. To install the GUI and other important things, log in and run "pkg install solaris-desktop" and take a break while it installs those important things.
  4. Life is usually easier if you install the VirtualBox Guest Additions because then you can copy and paste between the host and guests, etc. You can find the guest additions in the folder matching the version of VirtualBox you are using. You can also read the instructions for installing the guest additions.
  5. To create the zone's shared VDI in VirtualBox, you can open the storage configuration for one of the two guests, select the SATA controller, and click on the "Add Hard Disk" icon nearby. Choose "Create New Disk" and specify an appropriate path name for the file that will contain the VDI. The shared VDI must be at least 1.5 GB. Note that the guest must be stopped to do this.
  6. Add that VDI to the other guest - using its Storage configuration - so that each can access it while running. The steps start out the same, except that you choose "Choose Existing Disk" instead of "Create New Disk." Because the disk is configured on both of them, VirtualBox prevents you from running both guests at the same time.
  7. Identify device names of that VDI, in each of the guests. Solaris chooses the name based on existing devices. The names may be the same, or may be different from each other. This step is shown below as "Step 1."

Assumptions

In the example shown below, I make these assumptions.
  • The guest that will own the zone at the beginning is named sysA.
  • The guest that will own the zone after the first migration is named sysB.
  • On sysA, the shared disk is named /dev/dsk/c7t2d0
  • On sysB, the shared disk is named /dev/dsk/c7t3d0

(Finally!) The Steps

Step 1) Determine the name of the disk that will move back and forth between the systems.
root@sysA:~# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c7t0d0 
          /pci@0,0/pci8086,2829@d/disk@0,0
       1. c7t2d0 
          /pci@0,0/pci8086,2829@d/disk@2,0
Specify disk (enter its number): ^D
Step 2) The first thing to do is partition and label the disk. The magic needed to write an EFI label is not overly complicated.
root@sysA:~# format -e c7t2d0
selecting c7t2d0
[disk formatted]

FORMAT MENU:
...
format> fdisk
No fdisk table exists. The default partition for the disk is:

  a 100% "SOLARIS System" partition

Type "y" to accept the default partition,  otherwise type "n" to edit the
 partition table. n
SELECT ONE OF THE FOLLOWING:
...
Enter Selection: 1
...
  G=EFI_SYS    0=Exit? f
SELECT ONE...
...
6

format> label
...
Specify Label type[1]: 1
Ready to label disk, continue? y

format> quit

root@sysA:~# ls /dev/dsk/c7t2d0
/dev/dsk/c7t2d0

Step 3) Configure zone1 on sysA.
root@sysA:~# zonecfg -z zone1
Use 'create' to begin configuring a new zone.
zonecfg:zone1> create
create: Using system default template 'SYSdefault'
zonecfg:zone1> set zonename=zone1
zonecfg:zone1> set zonepath=/zones/zone1
zonecfg:zone1> add rootzpool
zonecfg:zone1:rootzpool> add storage dev:dsk/c7t2d0
zonecfg:zone1:rootzpool> end
zonecfg:zone1> exit
root@sysA:~#
root@sysA:~# zonecfg -z zone1 info
zonename: zone1
zonepath: /zones/zone1
brand: solaris
autoboot: false
bootargs:
file-mac-profile:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
hostid:
fs-allowed:
anet:
...
rootzpool:
        storage: dev:dsk/c7t2d0
Step 4) Install the zone. This step takes the most time, but you can wander off for a snack or a few laps around the gym - or both! (Just not at the same time...)
root@sysA:~# zoneadm -z zone1 install
Created zone zpool: zone1_rpool
Progress being logged to /var/log/zones/zoneadm.20121022T163634Z.zone1.install
       Image: Preparing at /zones/zone1/root.

 AI Manifest: /tmp/manifest.xml.RXaycg
  SC Profile: /usr/share/auto_install/sc_profiles/enable_sci.xml
    Zonename: zone1
Installation: Starting ...

              Creating IPS image
Startup linked: 1/1 done
              Installing packages from:
                  solaris
                      origin:  http://pkg.us.oracle.com/support/
DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                            183/183   33556/33556  222.2/222.2  2.8M/s

PHASE                                          ITEMS
Installing new actions                   46825/46825
Updating package state database                 Done
Updating image state                            Done
Creating fast lookup database                   Done
Installation: Succeeded

        Note: Man pages can be obtained by installing pkg:/system/manual

 done.

        Done: Installation completed in 1696.847 seconds.


  Next Steps: Boot the zone, then log into the zone console (zlogin -C)

              to complete the configuration process.

Log saved in non-global zone as /zones/zone1/root/var/log/zones/zoneadm.20121022T163634Z.zone1.install
Step 5) Boot the Zone.
root@sysA:~# zoneadm -z zone1 boot
Step 6) Log in to the zone's console to complete the specification of system information.
root@sysA:~# zlogin -C zone1
Answer the usual questions and wait for a login prompt. Then you can end the console session with the usual "~." incantation.

Step 7) Shut down the zone so it can be "moved."

root@sysA:~# zoneadm -z zone1 shutdown
Step 8) Detach the zone so that the original global zone can't use it.
root@sysA:~# zoneadm list -cv
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              solaris  shared
   - zone1            installed  /zones/zone1                   solaris  excl
root@sysA:~# zpool list
NAME          SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool        17.6G  11.2G  6.47G  63%  1.00x  ONLINE  -
zone1_rpool  1.98G   484M  1.51G  23%  1.00x  ONLINE  -
root@sysA:~# zoneadm -z zone1 detach
Exported zone zpool: zone1_rpool
Step 9) Review the result and shut down sysA so that sysB can use the shared disk.
root@sysA:~# zpool list
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  17.6G  11.2G  6.47G  63%  1.00x  ONLINE  -
root@sysA:~# zoneadm list -cv
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              solaris  shared
   - zone1            configured /zones/zone1                   solaris  excl
root@sysA:~# init 0
Step 10) Now boot sysB and configure a zone with the parameters shown above in Step 3. (Again, the safest method is to use "zonecfg ... export" on sysA as described in section "Method Overview" above.) The one difference is the name of the rootzpool storage device, which was shown in the list of assumptions, and which you must determine by booting sysB and using the "format" or "zpool import" command.

When that is done, you should see the output shown next. (I used the same zonename - "zone1" - in this example, but you can choose any valid zonename you want.)

root@sysB:~# zoneadm list -cv
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              solaris  shared
   - zone1            configured /zones/zone1                   solaris  excl
root@sysB:~# zonecfg -z zone1 info
zonename: zone1
zonepath: /zones/zone1
brand: solaris
autoboot: false
bootargs:
file-mac-profile:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
hostid:
fs-allowed:
anet:
        linkname: net0
...
rootzpool:
        storage: dev:dsk/c7t3d0
Step 11) Attaching the zone automatically imports the zpool.
root@sysB:~# zoneadm -z zone1 attach
Imported zone zpool: zone1_rpool
Progress being logged to /var/log/zones/zoneadm.20121022T184034Z.zone1.attach
    Installing: Using existing zone boot environment
      Zone BE root dataset: zone1_rpool/rpool/ROOT/solaris
                     Cache: Using /var/pkg/publisher.
  Updating non-global zone: Linking to image /.
Processing linked: 1/1 done
  Updating non-global zone: Auditing packages.
No updates necessary for this image.

  Updating non-global zone: Zone updated.
                    Result: Attach Succeeded.
Log saved in non-global zone as /zones/zone1/root/var/log/zones/zoneadm.20121022T184034Z.zone1.attach

root@sysB:~# zoneadm -z zone1 boot
root@sysB:~# zlogin zone1
[Connected to zone 'zone1' pts/2]
Oracle Corporation      SunOS 5.11      11.1    September 2012
Step 12) Now let's migrate the zone back to sysA. Create a file in zone1 so we can verify it exists after we migrate the zone back, then begin migrating it back.
root@zone1:~# ls /opt
root@zone1:~# touch /opt/fileA
root@zone1:~# ls -l /opt/fileA
-rw-r--r--   1 root     root           0 Oct 22 14:47 /opt/fileA
root@zone1:~# exit
logout

[Connection to zone 'zone1' pts/2 closed]
root@sysB:~# zoneadm -z zone1 shutdown
root@sysB:~# zoneadm -z zone1 detach
Exported zone zpool: zone1_rpool
root@sysB:~# init 0
Step 13) Back on sysA, check the status.
Oracle Corporation      SunOS 5.11      11.1    September 2012
root@sysA:~# zoneadm list -cv
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              solaris  shared
   - zone1            configured /zones/zone1                   solaris  excl
root@sysA:~# zpool list
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  17.6G  11.2G  6.47G  63%  1.00x  ONLINE  -
Step 14) Re-attach the zone back to sysA.
root@sysA:~# zoneadm -z zone1 attach
Imported zone zpool: zone1_rpool
Progress being logged to /var/log/zones/zoneadm.20121022T190441Z.zone1.attach
    Installing: Using existing zone boot environment
      Zone BE root dataset: zone1_rpool/rpool/ROOT/solaris
                     Cache: Using /var/pkg/publisher.
  Updating non-global zone: Linking to image /.
Processing linked: 1/1 done
  Updating non-global zone: Auditing packages.
No updates necessary for this image.

  Updating non-global zone: Zone updated.
                    Result: Attach Succeeded.
Log saved in non-global zone as /zones/zone1/root/var/log/zones/zoneadm.20121022T190441Z.zone1.attach

root@sysA:~# zpool list
NAME          SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool        17.6G  11.2G  6.47G  63%  1.00x  ONLINE  -
zone1_rpool  1.98G   491M  1.51G  24%  1.00x  ONLINE  -
root@sysA:~# zoneadm -z zone1 boot
root@sysA:~# zlogin zone1
[Connected to zone 'zone1' pts/2]
Oracle Corporation      SunOS 5.11      11.1    September 2012
root@zone1:~# zpool list
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  1.98G   538M  1.46G  26%  1.00x  ONLINE  -
Step 15) Check for the file created on sysB, earlier.
root@zone1:~# ls -l /opt
total 1
-rw-r--r--   1 root     root           0 Oct 22 14:47 fileA

Next Steps

Here is a brief list of some of the fun things you can try next.
  • Add space to the zone by adding a second storage device to the rootzpool. Make sure that you add it to the configurations of both zones!
  • Create a new zone, specifying two disks in the rootzpool when you first configure the zone. When you install that zone, or clone it from another zone, zoneadm uses those two disks to create a mirrored pool. (Three disks will result in a three-way mirror, etc.)

Conclusion

Hopefully you have seen the ease with which you can now move Solaris Zones from one system to another.

Thursday Oct 25, 2012

Oracle Solaris 11.1

Oracle Solaris 11.1 was announced at Oracle OpenWorld recently. This release added 300 new performance and feature enhancements.

My favorite new features:

  • Solaris Zones on Shared Storage
  • Support for 32 TB (!) of RAM
  • Improved Oracle RAC lock latency
  • Dynamically resize the Oracle DB SGA
  • Industry-first support for FedFS
You can learn more from the press release or by attending the Solaris 11.1 webcast on November 7.

Tuesday Aug 02, 2011

Solaris Zones Optimize Real Workloads

Oracle published two Optimized Solutions last week that utilize Oracle Solaris Containers.

The first is for Oracle WebCenter Suite. The Optimized Solution shows how one server can support more than 1,000 users for WebCenter Spaces. The announcement includes links to business-focused and technical white papers.

The second is for Agile Product Lifecycle Management. The announcement includes links to business-focused and technical white papers.

Each optimized solution showcases the ability to use Oracle Solaris Containers to optimize performance of multiple workloads within one consolidated server.

Thursday Jul 14, 2011

Extreme Oracle Solaris Virtualization

There will be a live webcast today, explaining how to leverage Oracle Solaris' unmatched virtualization features. The webcast begins at 9 AM PT. Registration is required, at: Oracle.com.

Wednesday Jul 13, 2011

Solaris Zones help achieve World Record Benchmark Result

Maximizing performance of multi-node workloads can be challenging. Should I maximize CPU clock rate, or RAM size per node, or network bandwidth? And how do I analyze performance of each component while also measuring aggregate throughput? Solaris Zones provide characteristics that are useful for multi-node architectures:
  • Architectural flexibility: easily remove a network bottleneck between two components by running both in zones on one Solaris server - and move any of them to different servers later as processing needs change
  • Convenient, dynamic resource management: assign a workload to a set of CPUs for predictable, consistent performance, ensure that each workload component has access to sufficient hardware resources, etc.
These characteristics are displayed in the world record benchmark result for Oracle JD Edwards EnterpriseOne. It was achieved using Solaris Containers (Zones) to isolate individual software components, including the WebLogic-based applications and Web Tier Utilities.

Solaris Zones features enabled software isolation and resource management, making the process of fine-tuning resource assignment very easy. For more details, see:

Friday Jul 08, 2011

Downloadable Database in a Solaris Zone

To simplify the process of creating Oracle databases, Oracle has released two Solaris Zones with Oracle 11gR2 pre-installed. You can simply download the appropriate template and "attach" it to your x86 or SPARC system running Solaris 10 (update 10/09 or newer).

Links to the downloads are at oracle.com.

Of course, you must have a valid license to run 11gR2 on that computer.

Monday May 23, 2011

Oracle DB 11gR2 Certified for Solaris Containers

Just a short entry today: last week Oracle completed certification of Oracle RAC 11gR2 (with Clusterware) on Oracle Solaris 10 Containers ("Zones").

For details, see http://www.oracle.com/technetwork/database/virtualizationmatrix-172995.html .

A paper "Best Practices for Deploying Oracle RAC Inside Oracle Solaris Containers" is also available, but is not specific to 11gR2.

This extends the previous certifications of Oracle RAC (9iR2, 10gR2, 11gR1) on Solaris Containers.

Tuesday Mar 01, 2011

Virtual Network - Part 4

Resource Controls

This is the fourth part of a series of blog entries about Solaris network virtualization. Part 1 introduced network virtualization, Part 2 discussed network resource management capabilities available in Solaris 11 Express, and Part 3 demonstrated the use of virtual NICs and virtual switches.

This entry shows the use of a bandwidth cap on Virtual Network Elements (VNEs). This form of network resource control can effectively limit the amount of bandwidth consumed by a particular stream of packets. In our context, we will restrict the amount of bandwidth that a zone can use.

As a reminder, we have the following network topology, with three zones and three VNICs, one VNIC per zone.

All three VNICs were created on one ethernet interface in Part 3 of this series.

Capping VNIC Bandwidth

Using a T2000 server in a lab environment, we can measure network throughput with the new dlstat(1M) command. This command reports various statistics about data links, including the quantity of packets, bytes, interrupts, polls, drops, blocks, and other data. Because I am trying to illustrate the use of commands, not optimize performance, the network workload will be a simple file transfer using ftp(1). This method of measuring network bandwidth is reasonable for this purpose, but says nothing about the performance of this platform. For example, this method reads data from a disk. Some of that data may be cached, but disk performance may impact the network bandwidth measured here. However, we can still achieve the basic goal: demonstrating the effectiveness of a bandwidth cap.

With that background out of the way, first let's check the current status of our links.

GZ# dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
e1000g0     phys      1500   up       --         --
e1000g2     phys      1500   unknown  --         --
e1000g1     phys      1500   down     --         --
e1000g3     phys      1500   unknown  --         --
emp_web1    vnic      1500   up       --         e1000g0
emp_app1    vnic      1500   up       --         e1000g0
emp_db1     vnic      1500   up       --         e1000g0
GZ# dladm show-linkprop emp_app1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
emp_app1     autopush        rw   --             --             --
emp_app1     zone            rw   emp-app        --             --
emp_app1     state           r-   unknown        up             up,down
emp_app1     mtu             rw   1500           1500           1500
emp_app1     maxbw           rw   --             --             --
emp_app1     cpus            rw   --             --             --
emp_app1     cpus-effective  r-   1-9            --             --
emp_app1     pool            rw   SUNWtmp_emp-app --             --
emp_app1     pool-effective  r-   SUNWtmp_emp-app --             --
emp_app1     priority        rw   high           high           low,medium,high
emp_app1     tagmode         rw   vlanonly       vlanonly       normal,vlanonly
emp_app1     protection      rw   --             --             mac-nospoof,
                                                                restricted,
                                                                ip-nospoof,
                                                                dhcp-nospoof
<some lines deleted>
Before setting any bandwidth caps, let's determine the transfer rates between a zone on this system and a remote system.

It's easy to use dlstat to determine the data rate to my home system while transferring a file from a zone:

GZ# dlstat -i 10 emp_app1
           LINK    IPKTS   RBYTES    OPKTS   OBYTES
       emp_app1   27.99M    2.11G   54.18M   77.34G
       emp_app1       83    6.72K        0        0
       emp_app1      339   23.73K    1.36K    1.68M
       emp_app1    1.79K  120.09K    6.78K    8.38M
       emp_app1    2.27K  153.60K    8.49K   10.50M
       emp_app1    2.35K  156.27K    8.88K   10.98M
       emp_app1    2.65K  182.81K    5.09K    6.30M
       emp_app1      600   44.10K      935    1.15M
       emp_app1      112    8.43K        0        0
The OBYTES column is simply the number of bytes transferred during that data sample. I'll ignore the 1.68MB and 1.15MB data points because the file transfer began and ended during those samples. The average of the other values leads to a bandwidth of 7.6 Mbps (megabits per second), which is typical for my broadband connection.
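
The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration: the function names are mine, and it assumes that the K/M/G suffixes printed by dlstat are powers of 1024, which is the reading that reproduces the 7.6 Mbps quoted above.

```python
# Convert dlstat OBYTES samples into an average rate in Mbps.
# Assumption: the K/M/G suffixes are powers of 1024, which matches the
# 7.6 Mbps figure quoted in the text.

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

def to_bytes(value):
    """Turn a dlstat-style quantity such as '8.38M' into a byte count."""
    suffix = value[-1].upper()
    if suffix in UNITS:
        return float(value[:-1]) * UNITS[suffix]
    return float(value)

def average_mbps(samples, interval_seconds):
    """Average rate, in megabits per second (10**6 bits), over full samples."""
    total_bytes = sum(to_bytes(s) for s in samples)
    seconds = interval_seconds * len(samples)
    return total_bytes * 8 / seconds / 1e6

# The four full 10-second samples from the uncapped transfer above.
uncapped = ["8.38M", "10.50M", "10.98M", "6.30M"]
print(f"{average_mbps(uncapped, 10):.1f} Mbps")   # prints 7.6 Mbps
```

The partial samples (1.68M and 1.15M) are left out of the list for the reason given above: the transfer started and stopped inside those intervals.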

Let's pretend that we want to constrain the bandwidth consumed by that workload to 2 Mbps. Perhaps we want to leave all of the rest for a higher-priority workload. Perhaps we're an ISP and charge for different levels of available bandwidth. Regardless of the situation, capping bandwidth is easy:

GZ# dladm set-linkprop -p maxbw=2000k emp_app1
GZ# dladm show-linkprop -p maxbw emp_app1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
emp_app1     maxbw           rw       2          --             --
GZ# dlstat -i 20 emp_app1 
           LINK    IPKTS   RBYTES    OPKTS   OBYTES
       emp_app1   18.21M    1.43G   10.22M   14.56G
       emp_app1      186   13.98K        0        0
       emp_app1      613   51.98K    1.09K    1.34M
       emp_app1    1.51K  107.85K    3.94K    4.87M
       emp_app1    1.88K  131.19K    3.12K    3.86M
       emp_app1    2.07K  143.17K    3.65K    4.51M
       emp_app1    1.84K  136.03K    3.03K    3.75M
       emp_app1    2.10K  145.69K    3.70K    4.57M
       emp_app1    2.24K  154.95K    3.89K    4.81M
       emp_app1    2.43K  166.01K    4.33K    5.35M
       emp_app1    2.48K  168.63K    4.29K    5.30M
       emp_app1    2.36K  164.55K    4.32K    5.34M
       emp_app1      519   42.91K      643  793.01K
       emp_app1      200   18.59K        0        0
Note that for dladm, the default unit for maxbw is Mbps. The average of the full samples is 1.97 Mbps.
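
To make the unit handling concrete, here is a small Python sketch that normalizes a maxbw setting to Mbps and checks a measured rate against it. The helper names and the 5% tolerance are assumptions for illustration only. The sketch also assumes that the K, M, and G suffixes are powers of 1000, which is consistent with 2000k being displayed as 2 in the output above.

```python
# Normalize a dladm maxbw setting to Mbps and compare a measurement to it.
# Assumptions: suffixes K, M, and G mean Kbps, Mbps, and Gbps in powers of
# 1000, and a bare number is read as Mbps (the default unit noted above).

BITS_PER_SECOND = {"K": 1e3, "M": 1e6, "G": 1e9}

def maxbw_to_mbps(setting):
    """Return the bandwidth cap in Mbps for a value such as '2000k' or '25m'."""
    text = setting.strip().upper()
    if text and text[-1] in BITS_PER_SECOND:
        return float(text[:-1]) * BITS_PER_SECOND[text[-1]] / 1e6
    return float(text)

def within_cap(measured_mbps, setting, tolerance=0.05):
    """True when the measured rate exceeds the cap by no more than tolerance."""
    return measured_mbps <= maxbw_to_mbps(setting) * (1 + tolerance)

print(maxbw_to_mbps("2000k"))        # the cap set above, in Mbps
print(within_cap(1.97, "2000k"))     # the capped transfer
print(within_cap(7.6, "2000k"))      # the earlier, uncapped transfer
```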

Between zones, the uncapped data rate is higher:

GZ# dladm reset-linkprop -p maxbw emp_app1
GZ# dladm show-linkprop  -p maxbw emp_app1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
emp_app1     maxbw           rw   --             --             --
GZ# dlstat -i 20 emp_app1
           LINK    IPKTS   RBYTES    OPKTS   OBYTES
       emp_app1   20.80M    1.62G   23.36M   33.25G
       emp_app1      208   16.59K        0        0
       emp_app1   24.48K    1.63M  193.94K  277.50M
       emp_app1  265.68K   17.54M    2.05M    2.93G
       emp_app1  266.87K   17.62M    2.06M    2.94G
       emp_app1  255.78K   16.88M    1.98M    2.83G
       emp_app1  206.20K   13.62M    1.34M    1.92G
       emp_app1   18.87K    1.25M   79.81K  114.23M
       emp_app1      246   17.08K        0        0
This five-year-old T2000 can move at least 1.2 Gbps of data, internally, but that took five simultaneous ftp sessions. (A better measurement method, one that doesn't include the limits of disk drives, would yield better results, and newer systems, either x86 or SPARC, have higher internal bandwidth characteristics.) In any case, the maximum data rate is not interesting for our purpose, which is demonstration of the ability to cap that rate.

You can often resolve a network bottleneck while maintaining workload isolation, by moving two separate workloads onto the same system, within separate zones. However, you might choose to limit their bandwidth consumption. Fortunately, the NV tools in Solaris 11 Express enable you to accomplish that:

GZ# dladm set-linkprop -t -p maxbw=25m emp_app1
GZ# dladm show-linkprop -p maxbw emp_app1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
emp_app1     maxbw           rw      25          --             --
Note that the change to the bandwidth cap was made while the zone was running, potentially while network traffic was flowing. Also, changes made by dladm are persistent across reboots of Solaris unless you specify "-t" on the command line.

Data moves much more slowly now:

GZ# dlstat -i 20 emp_app1
           LINK    IPKTS   RBYTES    OPKTS   OBYTES
       emp_app1   23.84M    1.82G   46.44M   66.28G
       emp_app1      192   16.10K        0        0
       emp_app1    1.15K   79.21K    5.77K    8.24M
       emp_app1   18.16K    1.20M   40.24K   57.60M
       emp_app1   17.99K    1.20M   39.46K   56.48M
       emp_app1   17.85K    1.19M   39.11K   55.97M
       emp_app1   17.39K    1.15M   38.16K   54.62M
       emp_app1   18.02K    1.19M   39.53K   56.58M
       emp_app1   18.66K    1.24M   39.60K   56.68M
       emp_app1   18.56K    1.23M   39.24K   56.17M
<many lines deleted>
The data show an aggregate bandwidth of 24 Mbps.
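
If you would rather not do that arithmetic by hand, the dlstat output can be parsed directly. The following Python sketch is one way to do it, under two assumptions that the output itself does not state: that the first data row is a cumulative total rather than an interval sample, and that the K/M/G suffixes are powers of 1024. The 20 Mbps threshold used to pick out the steady-state samples is arbitrary and chosen only for this data.

```python
# Parse dlstat interval output and report the per-sample outbound rate.
# Assumptions: the first data row is a cumulative total, not an interval
# sample, and the K/M/G suffixes are powers of 1024.

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

def to_bytes(value):
    suffix = value[-1].upper()
    return float(value[:-1]) * UNITS[suffix] if suffix in UNITS else float(value)

def outbound_mbps(dlstat_text, interval_seconds):
    """Return one Mbps figure per interval sample, using the OBYTES column."""
    rows = [line.split() for line in dlstat_text.strip().splitlines()]
    obytes = rows[0].index("OBYTES")          # locate the column by its header
    samples = rows[2:]                        # skip the header and cumulative row
    return [to_bytes(r[obytes]) * 8 / interval_seconds / 1e6 for r in samples]

# The first few rows of the capped transfer shown above.
CAPPED_25M = """
           LINK    IPKTS   RBYTES    OPKTS   OBYTES
       emp_app1   23.84M    1.82G   46.44M   66.28G
       emp_app1      192   16.10K        0        0
       emp_app1    1.15K   79.21K    5.77K    8.24M
       emp_app1   18.16K    1.20M   40.24K   57.60M
       emp_app1   17.99K    1.20M   39.46K   56.48M
       emp_app1   17.85K    1.19M   39.11K   55.97M
"""

rates = outbound_mbps(CAPPED_25M, 20)
steady = [r for r in rates if r > 20]         # keep samples from the steady transfer
print(round(sum(steady) / len(steady), 1))
```

For these rows the average lands just under the 25 Mbps cap, in line with the figure above.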

Conclusion

The network virtualization tools in Solaris 11 Express include various resource controls. The simplest of these is the bandwidth cap, which you can use to effectively limit the amount of bandwidth that a workload can consume. Both physical NICs and virtual NICs may be capped by using this simple method. This also applies to workloads that are in Solaris Zones - both default zones and Solaris 10 Zones which mimic Solaris 10 systems.

Next time we'll explore some other virtual network architectures.

Tuesday Feb 08, 2011

Virtual Network - Part 3

This is the third in a series of blog entries that discuss the network virtualization features in Oracle Solaris 11 Express. Part 1 introduced the concept of network virtualization and listed the basic virtual network elements that Solaris 11 Express (S11E) provides. Part 2 expanded on the concepts and discussed the resource management features which can be applied to those virtual network elements (VNEs).

This blog entry assumes that you have some experience with Solaris Zones. If you don't, you can read my earlier blog entries, or buy the book "Oracle Solaris 10 System Virtualization Essentials" or read the documentation.

This entry will demonstrate the creation of some of these VNEs.

For today's examples, I will use an old Sun Fire T2000 that has one SPARC CMT (T1) chip and 32GB RAM. I will pretend that I am implementing a 3-tier architecture in this one system, where each tier is represented by one Solaris zone. The mythical example provides access to an employee database. The 3-tier service is named 'emp' and VNEs will use 'emp' in their names to reduce confusion regarding the dozens of VNEs we expect to create for the services this system will deliver.

The commands shown below use the prompt "GZ#" to indicate that the command is entered in the global zone by someone with sufficient privileges. Similarly, the prompt "emp-web1#" indicates a command which is entered in the zone "emp-web1" as a sufficiently privileged user.

Fortunately, Solaris network engineers gathered all of the actions regarding the management of network elements (virtual or physical) into one command: dladm(1M). You use dladm to create, destroy, and configure datalinks such as VNICs. You can also use it to list physical NICs:

GZ# dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
e1000g0     phys      1500   up       --         --
e1000g2     phys      1500   unknown  --         --
e1000g1     phys      1500   down     --         --
e1000g3     phys      1500   unknown  --         --
We need three VNICs for our three zones, one VNIC per zone. They will also have useful names - one for each of the tiers - and will share e1000g0:
GZ# dladm create-vnic -l e1000g0 emp_web1
GZ# dladm create-vnic -l e1000g0 emp_app1
GZ# dladm create-vnic -l e1000g0 emp_db1
GZ# dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
e1000g0     phys      1500   up       --         --
e1000g2     phys      1500   unknown  --         --
e1000g1     phys      1500   down     --         --
e1000g3     phys      1500   unknown  --         --
emp_web1    vnic      1500   up       --         e1000g0
emp_app1    vnic      1500   up       --         e1000g0
emp_db1     vnic      1500   up       --         e1000g0
GZ# dladm show-vnic
LINK         OVER         SPEED  MACADDRESS        MACADDRTYPE         VID
emp_web1     e1000g0      0      2:8:20:3a:43:c8   random              0
emp_app1     e1000g0      0      2:8:20:36:a1:17   random              0
emp_db1      e1000g0      0      2:8:20:b4:5b:d3   random              0

The system has four NICs and three VNICs. Note that the name of a VNIC may not include a hyphen (-) but may include an underscore (_).
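
Because zone names in this example use hyphens and VNIC names cannot, a simple naming convention helps. Here is a minimal Python sketch of the convention used in this entry. It models only the hyphen rule mentioned above; dladm may enforce other naming rules that are not checked here, and the helper names are hypothetical.

```python
# Derive a VNIC name from a zone name by swapping hyphens for underscores.
# Only the rule stated above is modeled; dladm may enforce further naming
# rules that are not checked here.

def vnic_name_for_zone(zone_name):
    """Map a zone name such as 'emp-web1' to a VNIC name such as 'emp_web1'."""
    return zone_name.replace("-", "_")

def is_hyphen_free(link_name):
    """True when the name satisfies the one rule described in the text."""
    return "-" not in link_name

# Print the create-vnic command for each tier's zone.
for zone in ("emp-web1", "emp-app1", "emp-db1"):
    vnic = vnic_name_for_zone(zone)
    assert is_hyphen_free(vnic)
    print(f"dladm create-vnic -l e1000g0 {vnic}")
```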

VNICs that share a NIC appear to be attached together via a virtual switch. That vSwitch is created automatically by Solaris. This diagram represents the NIC and VNEs we have created.

Now that these datalinks - the VNICs - exist, we can assign them to our zones. I'll assume that the zones already exist, and just need network assignment.

GZ# zonecfg -z emp-web1 info
zonename: emp-web1
zonepath: /zones/emp-web1
brand: ipkg
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
hostid:
fs-allowed:
GZ# zonecfg -z emp-web1
zonecfg:emp-web1> add net
zonecfg:emp-web1:net> set physical=emp_web1
zonecfg:emp-web1:net> end
zonecfg:emp-web1> exit

Those steps can be followed for the other two zones and matching VNICs. After those steps are completed, our earlier diagram would look like this:

Packets passing from one zone to another within a Solaris instance do not leave the computer, if they are in the same subnet and use the same datalink. This greatly improves network bandwidth and latency. Otherwise, the packets will head for the zone's default router.

Therefore, in the above diagram packets sent from emp-web1 destined for emp-app1 would traverse the virtual switch, but not pass through e1000g0.
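
The rule can be written down as a small predicate. This Python sketch models only the two conditions named above (same subnet, same underlying datalink); it is not how Solaris makes the decision internally. The address 10.140.205.83 for emp-app1 is a hypothetical value chosen for illustration.

```python
import ipaddress

def stays_on_vswitch(src_cidr, dst_ip, src_link, dst_link):
    """Model the rule in the text: same subnet and same underlying datalink."""
    network = ipaddress.ip_interface(src_cidr).network
    same_subnet = ipaddress.ip_address(dst_ip) in network
    return same_subnet and src_link == dst_link

# emp-web1 to emp-app1: both VNICs are built on e1000g0, same /24 subnet.
print(stays_on_vswitch("10.140.205.82/24", "10.140.205.83", "e1000g0", "e1000g0"))

# emp-web1 to an address in a different subnet: the packet heads for the router.
print(stays_on_vswitch("10.140.205.82/24", "10.140.204.69", "e1000g0", "e1000g0"))
```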

This zone is an "exclusive-IP" zone, meaning that it "owns" its own networking. What is its view of networking? That's easy to determine. The zlogin(1) command inserts a complete command-line into the zone. By default, the command is run as the root user.

GZ# zoneadm -z emp-web1 boot
GZ# zlogin emp-web1 dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
emp_web1    vnic      1500   up       --         ?
GZ# zlogin emp-web1 dladm show-vnic
LINK         OVER         SPEED  MACADDRESS        MACADDRTYPE         VID
emp_web1     ?            0      2:8:20:3a:43:c8   random              0

Notice that the zone sees its own VNEs, but cannot see NEs or VNEs in the global zone, or in any other zone.

The other important new networking command in Solaris 11 Express is ipadm(1M). That command creates IP address assignments, enables and disables them, displays IP address configuration information, and performs other actions.

The following example shows the global zone's view before configuring IP in the zone:

GZ# ipadm show-if
IFNAME     STATE    CURRENT      PERSISTENT
lo0        ok       -m-v------46 ---
e1000g0    ok       bm--------4- ---
GZ# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/?             static   ok           127.0.0.1/8
lo0/?             static   ok           127.0.0.1/8
lo0/?             static   ok           127.0.0.1/8
e1000g0/_a        static   ok           10.140.204.69/24
lo0/v6            static   ok           ::1/128
lo0/?             static   ok           ::1/128
lo0/?             static   ok           ::1/128
lo0/?             static   ok           ::1/128

At this point, not only does the zone know it has a datalink (which we saw above) but the IP tools show that it is there, ready for use. The next example shows this:

GZ# zlogin emp-web1 ipadm show-if
IFNAME     STATE    CURRENT      PERSISTENT
lo0        ok       -m-v------46 ---
GZ# zlogin emp-web1 ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
An ethernet datalink without an IP address isn't very useful, so let's configure an IP interface and apply an IP address to it:
GZ# zlogin emp-web1 ipadm show-if
IFNAME     STATE    CURRENT      PERSISTENT
lo0        ok       -m-v------46 ---
GZ# zlogin emp-web1 ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128

GZ# zlogin emp-web1 ipadm create-if emp_web1
GZ# zlogin emp-web1 ipadm show-if
IFNAME     STATE    CURRENT      PERSISTENT
lo0        ok       -m-v------46 ---
emp_web1   down     bm--------46 -46

GZ# zlogin emp-web1 ipadm create-addr -T static -a local=10.140.205.82/24 emp_web1/v4static
GZ# zlogin emp-web1 ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
emp_web1/v4static static   ok           10.140.205.82/24
lo0/v6            static   ok           ::1/128

GZ# zlogin emp-web1 ifconfig emp_web1
emp_web1: flags=1000843 mtu 1500 index 2
        inet 10.140.205.82 netmask ffffff00 broadcast 10.140.205.255
        ether 2:8:20:3a:43:c8

The last command above shows the "old" way of displaying IP address configuration. The command ifconfig(1M) is still there, but the new tools dladm and ipadm provide a more consistent interface, with well-defined separation between datalink management and IP management.

Of course, if you want the zone's outbound packets to be routed to other networks, you must use the route(1M) command, the /etc/defaultrouter file, or both.

Next time, I'll show a new network measurement tool and the ability to control the amount of network bandwidth consumed.

Thursday Jan 27, 2011

Virtual Networks - Part 2

This is the second in a series of blog entries that discuss the network virtualization features in Solaris 11 Express. The first entry discussed the basic concepts and the virtual network elements, including virtual NICs, VLANs, virtual switches, and InfiniBand datalinks.

This entry adds to that list the resource controls and security features that are necessary for a well-managed virtual network.

Virtual Networks, Real Resource Controls

In Oracle Solaris 11 Express, there are four main datalink resource controls:
  1. a bandwidth cap, which limits the amount of traffic passing through a datalink in a small amount of elapsed time
  2. assignment of packet processing tasks to a subset of the system's CPUs
  3. flows, which were introduced in the previous blog post
  4. rings, which are hardware or software resources that can be dedicated to a single purpose.
Let's take them one at a time. By default, datalinks such as VNICs can consume as much of the physical NIC's bandwidth as they want. That might be the desired behavior, but if it isn't you can apply the property "maxbw" to a datalink. The maximum permitted bandwidth can be specified in Kbps, Mbps or Gbps. This value can be changed dynamically, so if you set this value too low, you can change it without disrupting the traffic flowing over that link. Solaris will not allow traffic to flow over that datalink at a rate faster than you specify.

You can "over-subscribe" this bandwidth cap: the sum of the bandwidth caps on the VNICs assigned to a NIC can exceed the rated bandwidth of the NIC. If that happens, the bandwidth caps become less effective.
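
A quick way to spot over-subscription is to compare the sum of the caps with the NIC's rated speed. The Python sketch below does that. The 1000 Mbps NIC speed and the individual cap values are hypothetical numbers chosen for illustration.

```python
def oversubscription_ratio(nic_mbps, vnic_caps_mbps):
    """Sum of the VNIC caps divided by the NIC's rated bandwidth."""
    return sum(vnic_caps_mbps.values()) / nic_mbps

# Hypothetical caps, in Mbps, on three VNICs that share one 1000 Mbps NIC.
caps = {"emp_web1": 500, "emp_app1": 400, "emp_db1": 300}

ratio = oversubscription_ratio(1000, caps)
if ratio > 1:
    print(f"Over-subscribed: the caps total {ratio:.0%} of the NIC's bandwidth")
else:
    print(f"The caps total {ratio:.0%} of the NIC's bandwidth")
```

A ratio above 1 means the VNICs can collectively demand more than the NIC can deliver, so contention on the NIC, rather than the caps, may become the limiting factor.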

In addition to the bandwidth cap, packet processing computation can be constrained to the CPUs associated with a workload.

First some background. When Solaris boots, it assigns interrupt handler threads to the CPUs in the system. (See Solaris CPUs for an explanation of the meaning of "CPU".) Solaris attempts to spread the interrupt handlers out evenly so that one CPU does not become a bottleneck for interrupt handling.

If you create non-default CPU pools, the interrupt handlers will retain their CPU assignments. One unintended side effect of this is a situation where the CPUs intended for one workload will be handling interrupts caused by another workload. This can occur even with simple configurations of Solaris Zones. In extreme cases, network packet processing for one zone can severely impact the performance of another zone.

To prevent this behavior, Solaris 11 Express offers the ability to assign a datalink's interrupt handler to a set of CPUs or a pool of CPUs. To simplify this further, the obvious choice is made for you, by default, for a zone which is assigned its own resource pool. When such a zone boots, a resource pool is created for the zone, a sufficient quantity of CPUs is moved from the default pool to the zone's pool, and interrupt handlers for that zone's datalink(s) are automatically reassigned to that resource pool.

Network flows enable you to create multiple lanes of traffic. This allows the parallelization of network traffic. You can assign a bandwidth cap to a flow. Flows were introduced in the previous post and will be discussed further in future posts.

Finally, the newest high speed NICs support hardware rings: memory resources that can be dedicated to a particular set of network traffic. For inbound packets, this is the first resource control that separates network traffic based on packet information such as destination MAC address. By assigning one or more rings to a stream of traffic, you can commit sufficient hardware resources to it and ensure a greater relative priority for those packets, even if another stream of traffic on the same NIC would otherwise cause congestion and impact packet latency of all streams.

If you are using a NIC that does not support hardware rings, Solaris 11 Express supports software rings which cause a similar effect.

Virtual Networks, Real Security

In addition to resource controls, Solaris 11 Express offers datalink protection controls. These controls are intended to prevent a user from creating improper packets that would cause mischief on the network. The mac-nospoof property requires that outgoing packets have a MAC address which matches the link's MAC address. The ip-nospoof property implements a similar restriction, but for IP addresses. The dhcp-nospoof property prevents improper DHCP assignment.
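
The idea behind the first two checks can be sketched as a simple predicate. This Python fragment is a conceptual model only, not a description of how Solaris implements link protection. The packet is represented as a plain dictionary with hypothetical field names, and the MAC and IP addresses are example values.

```python
def normalize_mac(mac):
    """Pad each octet to two hex digits so '2:8:20' and '02:08:20' compare equal."""
    return ":".join(f"{int(octet, 16):02x}" for octet in mac.split(":"))

def passes_nospoof(packet, link_mac, allowed_ips):
    """Conceptual mac-nospoof and ip-nospoof checks for one outgoing packet."""
    mac_ok = normalize_mac(packet["src_mac"]) == normalize_mac(link_mac)
    ip_ok = packet["src_ip"] in allowed_ips
    return mac_ok and ip_ok

link_mac = "2:8:20:3a:43:c8"
allowed = {"10.140.205.82"}

honest = {"src_mac": "02:08:20:3a:43:c8", "src_ip": "10.140.205.82"}
spoofed = {"src_mac": "02:08:20:3a:43:c8", "src_ip": "10.140.205.99"}

print(passes_nospoof(honest, link_mac, allowed))
print(passes_nospoof(spoofed, link_mac, allowed))
```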

Summary (so far)

The network virtualization features in Solaris 11 Express enable the creation of virtual network devices, leading to the implementation of an entire network inside one Solaris system. Associated resource control features give you the ability to manage network bandwidth as a resource and reduce the potential for one workload to cause network performance problems for another workload. Finally, security features help you minimize the impact of an intruder.

With all of the introduction out of the way, next time I'll show some actual uses of these concepts.

Tuesday Dec 07, 2010

All New Zonestat!

Part 1

Recently I gave a brief overview of the enhancements available in Solaris 11 Express. I also hinted at more blog entries, mostly featuring Solaris Zones and network virtualization.

Before experimenting with new functions it's useful to have some tools to measure the results. With that in mind, this blog entry and its successor(s) will discuss new measurement tools that are in Solaris 11 Express: zonestat(1), flowstat(1M) and dlstat(1M). I will start with zonestat.

Zonestat Introduction

But first some history. For Solaris 10 I created an open-source tool I named "zonestat". That tool filled a need: one integrated view of the resource consumption and optional resource control settings of all running zones. The resources listed included CPUs, physical memory, virtual memory, and locked memory. Zonestat provided a "dashboard" that greatly eased the task of monitoring the resource usage of Solaris Zones.

That tool has these main drawbacks:

  1. It's written in Perl and uses a large set of existing Solaris commands to gather all of the data that it needs. Executing all of those commands for each data sample uses a significant amount of CPU time.
  2. It is a separate tool, not part of Solaris. It is not supported.
  3. It was originally intended as a prototype, a demonstration of what could be accomplished. I made a number of enhancements along the way, but for a while it wasn't clear whether it made sense to upgrade it for Solaris 11.
However, even with those shortcomings, that zonestat script was put into production at a number of data centers.

In 2009 Solaris Engineering decided to write a fully supported version of zonestat, as a new Solaris command. Instead of someone writing code in his spare time (me), a member of the Solaris Zones Engineering Team (Steve Lawrence) was assigned to write a comprehensive, efficient, fully featured tool that achieved many of the same goals as the original zonestat, and many more. Using the experience gained from the original zonestat script, a completely new program (also called "zonestat") had the potential to solve all of the problems of the open-source Perl script, and add new features which had been requested by users of Solaris Zones, and other features which the Zones Engineering Team knew would be useful.

And in Solaris 11 Express that potential was realized. Because the new zonestat performs almost all of the functions of the original zonestat script, and performs far more in addition, the rest of this blog entry (and the next one) will only discuss the new zonestat which is part of Solaris 11 Express.

The new zonestat(1) command has a plethora of options. These options allow the user to list data:

  • for each of the system's zones, including the global zone and data specific to kernel processing but not directly attributable to any one zone
  • for any subset of zones
  • regarding one or more types of resources, in absolute units or as a portion of available or capped resources
  • regarding one or more instances of resources (e.g. a particular processor set)
  • that has been sorted by one or more output columns
  • that is human-readable output or output that can be easily parsed by a script or other program
  • that includes timestamps, in one of several formats
  • that includes regular aggregations (called "summary reports"), such as "highest value during the interval" or "average value during the interval"
Zonestat has a variety of uses. The most obvious is monitoring resource usage of zones. Even if you don't use resource controls, zonestat will help you by telling you when a zone is using a significant portion (or all!) of the system's resources. Of course, zonestat really brings value to systems that are using resource controls, making it easy to determine which zones are near their caps - a sure sign that there is a problem with that zone's workload or that the zone's cap is too low.

In addition, you can use zonestat to determine proper values for resource controls. For example, you can deploy a workload in a zone and use zonestat to determine the maximum amount of CPU capacity it uses. That information will enable you to make better decisions about how many CPUs to assign to that zone, if you have decided that the zone's workload should have its own dedicated CPUs.
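As a rough sketch of that sizing exercise, the shell function below scans captured zonestat summary output and reports the highest CPU USED value seen for one zone. It assumes the default human-readable layout (zone name in the first column, CPU USED in the second). The function name max_cpu_used and the file name zonestat.out are made up for illustration and are not part of zonestat.

```shell
#!/bin/sh
# max_cpu_used ZONE: read zonestat summary output on stdin and print the
# largest CPU USED value reported for ZONE across all intervals.
# Assumes the default layout: column 1 = zone name, column 2 = CPU USED.
max_cpu_used() {
    awk -v zone="$1" '
        $1 == zone && $2 ~ /^[0-9.]+$/ {
            if (!seen || $2 + 0 > max) { max = $2 + 0; seen = 1 }
        }
        END { if (seen) printf "%.2f\n", max }
    '
}

# Example: capture 720 five-second samples (one hour), then ask for the peak.
#   zonestat 5 720 > zonestat.out
#   max_cpu_used zoneA < zonestat.out
```

The result is in units of CPUs, so a peak of, say, 3.40 would suggest that four dedicated CPUs is a reasonable starting point for that workload.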

If you are not familiar with the resource management controls offered by Solaris, you may wish to view the relevant documentation before, during or after reading the rest of this. The book "Oracle Solaris 10 System Virtualization Essentials" also describes all of the resource controls available for Solaris 10 Zones, and how they can be used to achieve various goals. Finally, the document "Understanding the Security Capabilities of Solaris" approaches the same content from a security perspective.

Now let's explore some of the interesting things you can do with zonestat.

Basics

The default output provides the data you would expect - basic information about the resource usage of all zones on the system. The command syntax can be simplified to this (omitting some features for now):

zonestat [options] interval [duration]
and the basic output looks like this:
GZ$ zonestat 5 2
Collecting data for first interval...
Interval: 1, Duration: 0:00:05
SUMMARY                  Cpus/Online: 32/32   Physical: 31.8G    Virtual: 47.8G
                    ----------CPU---------- ----PHYSICAL----- -----VIRTUAL-----
               ZONE  USED %PART  %CAP %SHRU  USED   PCT  %CAP  USED   PCT  %CAP
            [total]  0.10 0.31%     -     - 3109M 9.52%     - 7379M 15.0%     -
           [system]  0.01 0.04%     -     - 2797M 8.57%     - 7115M 14.5%     -
             global  0.08 0.51%     -     -  141M 0.43%     -  129M 0.26%     -
              zoneA  0.00 0.02%     -     - 43.7M 0.13%     - 35.4M 0.07%     -
              zoneB  0.00 0.02%     -     - 42.0M 0.12%     - 32.8M 0.06%     -
              zoneC  0.00 0.04%     -     - 42.0M 0.12%     - 32.8M 0.06%     -
              zoneD  0.00 0.02%     -     - 42.1M 0.12%     - 33.2M 0.06%     -

Interval: 2, Duration: 0:00:10
SUMMARY                  Cpus/Online: 32/32   Physical: 31.8G    Virtual: 47.8G
                    ----------CPU---------- ----PHYSICAL----- -----VIRTUAL-----
               ZONE  USED %PART  %CAP %SHRU  USED   PCT  %CAP  USED   PCT  %CAP
            [total]  0.09 0.30%     -     - 3109M 9.52%     - 7379M 15.0%     -
           [system]  0.01 0.03%     -     - 2797M 8.57%     - 7115M 14.5%     -
             global  0.08 0.51%     -     -  142M 0.43%     -  129M 0.26%     -
              zoneA  0.00 0.02%     -     - 43.7M 0.13%     - 35.4M 0.07%     -
              zoneB  0.00 0.02%     -     - 42.0M 0.12%     - 32.8M 0.06%     -
              zoneC  0.00 0.02%     -     - 42.0M 0.12%     - 32.8M 0.06%     -
              zoneD  0.00 0.02%     -     - 42.1M 0.12%     - 33.2M 0.06%     -

First, note that unlike other Solaris stat tools (e.g. vmstat) the first set of data is not a summary since the system booted. Instead, zonestat pauses for the time interval specified on the command line, and then displays data representing the first sample. (Zonestat doesn't actually collect the data. Its companion, zonestatd(1M), performs that service for all zonestat clients.)

Also, you probably noticed those two special lines, "[total]" and "[system]". The lines labeled "[total]" show the total consumption of each resource across the whole system. The lines labeled "[system]" show resource consumption by the kernel or by processes that aren't associated with any one zone.

Zonestat can produce a great deal of information - more than will fit on one line. Its various options allow you to view summary data - as provided in the default - or to focus on a zone, or on a particular type or instance of a resource, or a combination of those. Obviously, the header will be tailored to the output requested.

The summary header looks like this:

Interval: 1, Duration: 0:00:05
SUMMARY                  Cpus/Online: 32/32   Physical: 31.8G    Virtual: 47.8G
                    ----------CPU---------- ----PHYSICAL----- -----VIRTUAL-----
               ZONE  USED %PART  %CAP %SHRU  USED   PCT  %CAP  USED   PCT  %CAP
The first line of data, per sample, tells you the ordinal number of the sample - not very useful if you're just checking a few seconds of data, but pretty helpful when you're scanning through 3 days' worth of output. The Duration field is similar, but is a measurement of time since the command began.

The SUMMARY line shows the quantity of CPUs that exist in the system and how many of them are online. (I wrote an earlier blog entry about the method that Solaris uses to count CPUs.) That line also shows the system's amount of RAM ("Physical") and Virtual Memory (the size of RAM plus swap space on disk).

The ZONE column contains the name of the zone. The values in that row represent that zone's use of resources. The columns labeled USED show that zone's consumption of each resource. The unit depends on the resource. For CPUs, a value of 1 represents a "Solaris CPU." For memory, the unit is specified in the output.

Besides those generic header elements, some are specific to a resource type. %PART shows the CPU utilization, as a percentage of the compute capacity of the processor set in which the zone's processes run. %CAP is the percentage of the zone's CPU cap which has been used recently (if a cap has been applied to the zone). %SHRU indicates the amount of CPU used as a percentage of the shares assigned to the zone (if the Fair Share Scheduler is in use and shares have been assigned to this zone). The latter may occasionally show a surprising result: a value greater than 100%. I don't have space here to explain the Fair Share Scheduler, but the short version is "FSS enforces a minimum amount of available CPU capacity if there is contention for the CPUs, but it does not enforce a maximum. If there isn't contention, any process which wants to consume CPU cycles can do so - which can lead to a value greater than 100%."
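To see how a %SHRU value above 100% can arise, here is a back-of-the-envelope calculation. It is only a sketch: it assumes that a zone's entitlement is its fraction of the total shares multiplied by the number of CPUs in its processor set, and that %SHRU is CPU used divided by that entitlement. The function name shru and all of the numbers are hypothetical.

```shell
#!/bin/sh
# shru USED PSET_CPUS ZONE_SHARES TOTAL_SHARES
# Prints CPU used as a percentage of the zone's share-based entitlement.
# Assumption: entitlement = (zone shares / total shares) * CPUs in the pset.
shru() {
    awk -v used="$1" -v cpus="$2" -v shares="$3" -v total="$4" 'BEGIN {
        entitlement = shares / total * cpus
        printf "%.0f%%\n", used / entitlement * 100
    }'
}

# Hypothetical zone: 10 of 40 shares in a 4-CPU pset, currently using 2.0 CPUs.
shru 2.0 4 10 40    # entitlement is 1 CPU, so this prints 200%
```

With no other zone competing for those CPUs, nothing stops the zone from using twice its entitlement, and the percentage reflects that.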

The PHYSICAL section shows the amount of RAM used, the portion of the system's memory (PCT) represented by that amount of RAM, and the portion of the zone's RAM cap, if one has been set. The VIRTUAL section has similar fields.

Comparing Usage to Caps

To show the data you might see when a zone has a RAM cap, let's set one. We could do this in zonecfg(1M) for the next time the zone boots, but I don't feel like rebooting the zone, so let's add that cap while the zone runs. First, a quick check of the resource capping daemon. (In these examples, I am logged in as a user which is configured to use non-default administrative privileges. To temporarily gain those privileges, I will use the pfexec(1) command.)
GZ$ svcs rcap
STATE          STIME    FMRI
online         Oct_28   svc:/system/rcap:default
The service is online, and the cap is easy to set:
GZ$ pfexec rcapadm -z zoneB -m 512m
GZ$ zonestat -z zoneB 10 1
Collecting data for first interval...
Interval: 1, Duration: 0:00:10
SUMMARY                  Cpus/Online: 32/32   Physical: 31.8G    Virtual: 47.8G
                    ----------CPU---------- ----PHYSICAL----- -----VIRTUAL-----
               ZONE  USED %PART  %CAP %SHRU  USED   PCT  %CAP  USED   PCT  %CAP
            [total]  0.15 0.49%     -     - 3112M 9.53%     - 7382M 15.0%     -
           [system]  0.01 0.05%     -     - 2797M 8.57%     - 7115M 14.5%     -
              zoneB  0.00 0.10%     -     - 42.5M 0.13% 8.31% 33.3M 0.06%     -

ZoneB is using 42.5MB, which is 0.13% of the system's memory (31.8GB), and 8.31% of the 512MB cap that we set.

One of zonestat's most useful abilities is focusing on a small part of the data that it can display. The previous example limited the output to one zone. We can also limit the output to just one resource type, or "zoom in" further to one instance of a resource.

Let's limit our view to the RAM used by that zone:

GZ$ zonestat -r physical-memory -z zoneB 10 1
Collecting data for first interval...
Interval: 1, Duration: 0:00:10
PHYSICAL-MEMORY              SYSTEM MEMORY
mem_default                          31.8G
                                ZONE  USED   PCT   CAP  %CAP
                             [total] 3113M 9.53%     -     -
                            [system] 2797M 8.56%     -     -
                               zoneB 42.5M 0.13%  512M 8.31%
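Because the capped zones are the ones that show a number in the %CAP column, this output is easy to filter. The following is a minimal sketch that prints zones at or above a chosen percentage of their RAM cap. It assumes the five-column layout shown above (ZONE USED PCT CAP %CAP), and the function name near_cap is invented for this example.

```shell
#!/bin/sh
# near_cap THRESHOLD: read `zonestat -r physical-memory` output on stdin and
# print each zone whose %CAP value is at or above THRESHOLD (a percentage).
# Assumes the layout shown above: ZONE USED PCT CAP %CAP.
near_cap() {
    awk -v limit="$1" '
        NF == 5 && $5 ~ /^[0-9.]+%$/ {
            pct = $5
            sub(/%$/, "", pct)          # strip the trailing percent sign
            if (pct + 0 >= limit + 0) print $1, $5
        }
    '
}

# Example: list zones that are using 80% or more of their RAM cap.
#   zonestat -r physical-memory 10 1 | near_cap 80
```

Zones without a cap show "-" in that column and are skipped, as are the header lines.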
We can "zoom out" and look at all of the processor sets and their zone assignments (something that was difficult in Solaris 10):
GZ$ pfexec zoneadm -z zoneA boot
GZ$ pfexec zoneadm -z zoneC boot
GZ$ pfexec zoneadm -z zoneD boot
GZ$ pfexec zonestat -r psets 10 1
Collecting data for first interval...
Interval: 1, Duration: 0:00:10
PROCESSOR_SET                   TYPE  ONLINE/CPUS     MIN/MAX
pset_default            default-pset        16/16         1/-
                                ZONE  USED   PCT   CAP  %CAP   SHRS  %SHR %SHRU
                             [total]  0.17 1.10%     -     -      -     -     -
                            [system]  0.03 0.23%     -     -      -     -     -
                              global  0.13 0.86%     -     -      -     -     -

PROCESSOR_SET                   TYPE  ONLINE/CPUS     MIN/MAX
zoneD                  dedicated-cpu          4/4         4/4
                                ZONE  USED   PCT   CAP  %CAP   SHRS  %SHR %SHRU
                             [total]  1.10 27.6%     -     -      -     -     -
                            [system]  0.00 0.00%     -     -      -     -     -
                               zoneD  1.10 27.6%     -     -      -     -     -

PROCESSOR_SET                   TYPE  ONLINE/CPUS     MIN/MAX
zoneC                  dedicated-cpu          4/4         4/4
                                ZONE  USED   PCT   CAP  %CAP   SHRS  %SHR %SHRU
                             [total]  1.00 25.0%     -     -      -     -     -
                            [system]  0.08 2.00%     -     -      -     -     -
                               zoneC  0.92 23.0%     -     -      -     -     -

PROCESSOR_SET                   TYPE  ONLINE/CPUS     MIN/MAX
zoneB                  dedicated-cpu          4/4         4/4
                                ZONE  USED   PCT   CAP  %CAP   SHRS  %SHR %SHRU
                             [total]  0.00 0.14%     -     -      -     -     -
                            [system]  0.00 0.00%     -     -      -     -     -
                               zoneB  0.00 0.14%     -     -      -     -     -

PROCESSOR_SET                   TYPE  ONLINE/CPUS     MIN/MAX
zoneA                  dedicated-cpu          4/4         4/4
                                ZONE  USED   PCT   CAP  %CAP   SHRS  %SHR %SHRU
                             [total]  0.00 0.14%     -     -      -     -     -
                            [system]  0.00 0.00%     -     -      -     -     -
                               zoneA  0.00 0.14%     -     -      -     -     -
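That is a lot of output when all you want is a quick view of how busy each processor set is. As a sketch, the function below condenses the psets report to one line per processor set, showing the set's name and the PCT value from its [total] row. It assumes the layout shown above, where the line after each PROCESSOR_SET header begins with the set's name. The name pset_usage is hypothetical.

```shell
#!/bin/sh
# pset_usage: read `zonestat -r psets` output on stdin and print one line per
# processor set: the set name followed by the PCT value of its [total] row.
# Assumes the line after each PROCESSOR_SET header starts with the set name.
pset_usage() {
    awk '
        $1 == "PROCESSOR_SET" { getname = 1; next }   # header: name is next
        getname               { pset = $1; getname = 0; next }
        $1 == "[total]"       { print pset, $3 }
    '
}

# Example:
#   zonestat -r psets 10 1 | pset_usage
```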


With the basics out of the way, next time I will discuss some other options that display other data and organize the output in different ways.


Friday Apr 02, 2010

Solaris Virtualization Book

This blog has been pretty quiet lately, but that doesn't mean I haven't been busy! For the last 6 months I've been leading the writing of a book: _Oracle Solaris 10 System Virtualization Essentials_.

This book discusses all of the forms of server virtualization, not just hypervisors. It covers the forms of virtualization that Solaris 10 provides and those that it can use. These include Solaris Containers (also called Solaris Zones), VirtualBox, Oracle VM (x86 and SPARC; the latter was called Logical Domains), and Dynamic Domains.

One chapter is dedicated to the topic of choosing the best virtualization technology for a particular workload or set of workloads. Another chapter shows how to use each virtualization technology to achieve specific goals, including screenshots and command sequences. The last chapter of the book describes the need for virtualization management tools and then uses Oracle EM Ops Center as an example.

The book is available for pre-order at Amazon.com.

Wednesday Apr 08, 2009

Zonestat 1.4 Now Available

I have posted Zonestat v1.4 at the Zone Statistics project page (click on "Files" in the left navbar).

Zonestat is a 'dashboard' for Solaris Containers. It shows resource consumption of each Container (aka Zone) and a comparison of consumption against limits you have set.

Changes from v1.3:

  • BugFix: various failures if the pools service was not online. V1.4 checks for the existence of the pools packages, and behaves correctly whether they are installed and enabled, or not.
  • BugFix: various symptoms if the rcapd service was not online. V1.4 checks for the existence of the rcap packages, and behaves correctly whether they are installed and enabled, or not.
  • BugFix: mis-reported shared memory usage
  • BugFix: -lP produced human-readable, not machine-parseable output
  • Bug/RFE: detect and fail if zone != global or user != root
  • RFE: Prepare for S10 update numbers past U6
  • RFE: Add option to print entire name of zones with long names
  • RFE: Add timestamp to machine-consumable output
  • RFE: improve performance and correctness by collecting CPU% with DTrace instead of prstat

Note that the addition of a timestamp to -P output changes the output format for "machine-readable" output.

For most people, the most important change will be the use of DTrace to collect CPU% data. This has two effects. The first effect is improved correctness. The prstat command, used in V1.3 and earlier, can horribly underestimate CPU cycles consumed because it can miss many short-lived processes. The mpstat command has its own problems with mis-counting CPU usage. So I expanded on a solution Jim Fiori offered, which uses DTrace to answer the question "which zone is using a CPU right now?"

The other benefit to DTrace is the improvement in performance of Zonestat.

The less popular, but still interesting additions include:

  • -N expands the width of the zonename field to the length of the longest zone name. This preserves the entire zone name, for all zones, and also leaves the columns lined up. However, the length of the output lines will exceed 80 characters.
  • The new timestamp field in -P output makes it easier for tools like the "System Data Recorder" (SDR) to consume zonestat output. However, this was a change to the output format. If you have written a script which used -P and assumed a specific format for zonestat output, you must change your script to understand the new format.

Please send questions and requests to zones-discuss@opensolaris.org .

About

Jeff Victor writes this blog to help you understand Oracle's Solaris and virtualization technologies.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
