Thursday Jun 27, 2013

Improving Manageability of Virtual Environments

Boot Environments for Solaris 10 Branded Zones

Until recently, Solaris 10 Branded Zones on Solaris 11 suffered one notable regression: Live Upgrade did not work. The individual packaging and patching tools worked correctly, but there was no way to upgrade Solaris while the production workload continued to run. A recent Solaris 11 SRU (Solaris 11.1 SRU 6.4) restored most of that functionality, although with a slightly different concept, different commands, and without all of the feature details. This new method lets you create and manage multiple boot environments (BEs) for a Solaris 10 Branded Zone, and modify the active BE or any inactive BE, all while the production workload continues to run.

Background

In case you are new to Solaris: Solaris includes a set of features that enables you to create a bootable Solaris image, called a Boot Environment (BE). This newly created image can be modified while the original BE is still running your workload(s). There are many benefits, including improved uptime and the ability to reboot into (or downgrade to) an older BE if a newer one has a problem.

In Solaris 10 this set of features was named Live Upgrade. Solaris 11 applies the same basic concepts to the new packaging system (IPS) but there isn't a specific name for the feature set. The features are simply part of IPS. Solaris 11 Boot Environments are not discussed in this blog entry.

Although a Solaris 10 system can have multiple BEs, until recently a Solaris 10 Branded Zone (BZ) in a Solaris 11 system did not have this ability. This limitation was addressed recently, and that enhancement is the subject of this blog entry.

This new implementation uses two concepts. The first is the use of a ZFS clone for each BE. This makes it very easy to create a BE, or many BEs. This is a distinct advantage over the Live Upgrade feature set in Solaris 10, which had a practical limitation of two BEs on a system, when using UFS. The second new concept is a very simple mechanism to indicate the BE that should be booted: a ZFS property. The new ZFS property is named com.oracle.zones.solaris10:activebe (isn't that creative? ;-) ).

It's important to note that the property is inherited from the original BE's file system to any BEs you create. In other words, all BEs in one zone have the same value for that property. When the (Solaris 11) global zone boots the Solaris 10 BZ, it boots the BE that has the name that is stored in the activebe property.

Here is a quick summary of the actions you can use to manage these BEs:

To create a BE:

  • Create a ZFS clone of the zone's root dataset

To activate a BE:

  • Set the ZFS property of the root dataset to indicate the BE

To add a package or patch to an inactive BE:

  • Mount the inactive BE
  • Add packages or patches to it
  • Unmount the inactive BE

To list the available BEs:

  • Use the "zfs list" command.

To destroy a BE:

  • Use the "zfs destroy" command.

Preparation

Before you can use the new features, you will need a Solaris 10 BZ on a Solaris 11 system. You can use these three steps - on a real Solaris 11.1 server or in a VirtualBox guest running Solaris 11.1 - to create a Solaris 10 BZ. The Solaris 11.1 environment must be at SRU 6.4 or newer.

  1. Create a flash archive on the Solaris 10 system
    s10# flarcreate -n s10-system /net/zones/archives/s10-system.flar
  2. Configure the Solaris 10 BZ on the Solaris 11 system
    s11# zonecfg -z s10z
    Use 'create' to begin configuring a new zone.
    zonecfg:s10z> create -t SYSsolaris10
    zonecfg:s10z> set zonepath=/zones/s10z
    zonecfg:s10z> exit
    s11# zoneadm list -cv
      ID NAME             STATUS     PATH                           BRAND     IP    
       0 global           running    /                              solaris   shared
       - s10z             configured /zones/s10z                    solaris10 excl  
    
  3. Install the zone from the flash archive
    s11# zoneadm -z s10z install -a /net/zones/archives/s10-system.flar -p
    
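If you build such zones often, steps 2 and 3 can be wrapped in a small shell function. This is only a sketch: the function name make_s10_zone is made up for illustration, and it relies on zonecfg accepting its subcommands as a single quoted argument instead of an interactive session. Step 1 (flarcreate) still runs on the Solaris 10 system.

```shell
# make_s10_zone ZONE ZONEPATH FLAR  (hypothetical helper, run in the global zone)
# Configures a Solaris 10 Branded Zone and installs it from a flash archive.
make_s10_zone() {
    zone=$1 zonepath=$2 flar=$3
    [ -n "$zone" ] && [ -n "$zonepath" ] && [ -n "$flar" ] || {
        echo "usage: make_s10_zone ZONE ZONEPATH FLAR" >&2
        return 2
    }
    # Same subcommands as the interactive session in step 2,
    # passed on the command line and separated by a semicolon.
    zonecfg -z "$zone" "create -t SYSsolaris10; set zonepath=$zonepath" || return 1
    # Same install command as step 3: -a names the archive, -p is kept as shown there.
    zoneadm -z "$zone" install -a "$flar" -p
}
```

For the example above you would run: make_s10_zone s10z /zones/s10z /net/zones/archives/s10-system.flar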

You can find more information about the migration of Solaris 10 environments to Solaris 10 Branded Zones in the documentation.

The rest of this blog entry demonstrates the commands you can use to accomplish the aforementioned actions related to BEs.

New features in action

Note that the demonstration of the commands occurs in the Solaris 10 BZ, as indicated by the shell prompt "s10z# ". Many of these commands can be performed in the global zone instead, if you prefer. If you perform them in the global zone, you must change the ZFS file system names.

Create

The only complicated action is the creation of a BE. In the Solaris 10 BZ, create a new "boot environment" - a ZFS clone. You can assign any name to the final portion of the clone's name, as long as it meets the requirements for a ZFS file system name.

s10z# zfs snapshot rpool/ROOT/zbe-0@snap
s10z# zfs clone -o mountpoint=/ -o canmount=noauto rpool/ROOT/zbe-0@snap rpool/ROOT/newBE
cannot mount 'rpool/ROOT/newBE' on '/': directory is not empty
filesystem successfully created, but not mounted
You can safely ignore that message: we already know that / is not empty! We have merely told ZFS that the default mountpoint for the clone is the root directory.

(Note that a Solaris 10 BZ that has a separate /var file system requires additional steps. See the MOS document mentioned at the bottom of this blog entry.)
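The two commands above can be combined into one helper. The sketch below assumes a POSIX-style shell (ksh or bash) inside the zone and a zone without a separate /var file system. The names create_be and BE_ROOT are made up for illustration, and naming the snapshot after the new BE is my own convention, not a requirement.

```shell
BE_ROOT=rpool/ROOT   # parent dataset of the BEs, as seen from inside the zone

# create_be SOURCE_BE NEW_BE  (hypothetical helper)
# Snapshots SOURCE_BE and clones the snapshot as NEW_BE.
create_be() {
    src_be=$1 new_be=$2
    snap="$BE_ROOT/$src_be@$new_be"   # illustrative naming: snapshot named after the new BE
    zfs snapshot "$snap" || return 1
    # The clone prints "cannot mount ... directory is not empty", as shown above.
    # Its exit status is ignored here because that message may be reported as a failure.
    zfs clone -o mountpoint=/ -o canmount=noauto "$snap" "$BE_ROOT/$new_be" || :
    # Succeed only if the new dataset really exists.
    zfs list -H -o name "$BE_ROOT/$new_be" >/dev/null
}
```

With this in place, "create_be zbe-0 newBE" produces the same clone as the two commands shown earlier, apart from the snapshot name.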

List the available BEs and active BE

Because each BE is represented by a clone of the rpool/ROOT dataset, listing the BEs is as simple as listing the clones.

s10z# zfs list -r rpool/ROOT
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT        3.55G  42.9G    31K  legacy
rpool/ROOT/zbe-0     1K  42.9G  3.55G  /
rpool/ROOT/newBE  3.55G  42.9G  3.55G  /
The output shows that two BEs exist. Their names are "zbe-0" and "newBE".

You can tell Solaris that one particular BE should be used when the zone next boots by using a ZFS property. Its name is com.oracle.zones.solaris10:activebe. The value of that property is the name of the clone that contains the BE that should be booted.

s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT
NAME        PROPERTY                             VALUE  SOURCE
rpool/ROOT  com.oracle.zones.solaris10:activebe  zbe-0  local
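
If you want both pieces of information in one listing, the two commands can be combined. This is a sketch, assuming ksh or bash; list_bes is a made-up name. It uses the -H (no header) and -o (select columns) options of zfs list and zfs get to produce output that is easy to parse.

```shell
BE_ROOT=rpool/ROOT
BE_PROP=com.oracle.zones.solaris10:activebe

# list_bes  (hypothetical helper)
# Prints one BE name per line and marks the BE that will be booted next.
list_bes() {
    active=$(zfs get -H -o value "$BE_PROP" "$BE_ROOT") || return 1
    zfs list -H -o name -r "$BE_ROOT" | while read -r ds; do
        [ "$ds" = "$BE_ROOT" ] && continue     # skip the parent dataset itself
        be=${ds#"$BE_ROOT"/}                   # strip the "rpool/ROOT/" prefix
        if [ "$be" = "$active" ]; then
            echo "$be (active on next boot)"
        else
            echo "$be"
        fi
    done
}
```

In a zone with a separate /var file system the child datasets would also appear in this listing, so treat the output as a starting point.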

Change the active BE

When you want to change the BE that will be booted next time, you can just change the activebe property on the rpool/ROOT dataset.

s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT
NAME        PROPERTY                             VALUE  SOURCE
rpool/ROOT  com.oracle.zones.solaris10:activebe  zbe-0  local
s10z# zfs set com.oracle.zones.solaris10:activebe=newBE rpool/ROOT
s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT
NAME        PROPERTY                             VALUE  SOURCE
rpool/ROOT  com.oracle.zones.solaris10:activebe  newBE  local
s10z# shutdown -y -g0 -i6
After the zone has rebooted:
s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT
rpool/ROOT  com.oracle.zones.solaris10:activebe  newBE  local
s10z# zfs mount
rpool/ROOT/newBE                /
rpool/export                    /export
rpool/export/home               /export/home
rpool                           /rpool
Mount the original BE to see that it's still there.
s10z# zfs mount -o mountpoint=/mnt rpool/ROOT/zbe-0
s10z# ls /mnt
Desktop                         export                          platform
Documents                       export.backup.20130607T214951Z  proc
S10Flar                         home                            rpool
TT_DB                           kernel                          sbin
bin                             lib                             system
boot                            lost+found                      tmp
cdrom                           mnt                             usr
dev                             net                             var
etc                             opt
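
Because the activebe property is just a string, nothing in the commands above checks that it names a BE that exists. A small wrapper can add that check before the property is changed. This is a sketch, assuming ksh or bash; activate_be is a made-up name.

```shell
BE_ROOT=rpool/ROOT
BE_PROP=com.oracle.zones.solaris10:activebe

# activate_be BE_NAME  (hypothetical helper)
# Sets the activebe property, but only if the named BE exists.
activate_be() {
    be=$1
    # zfs list fails if the dataset does not exist.
    zfs list -H -o name "$BE_ROOT/$be" >/dev/null 2>&1 || {
        echo "no such BE: $be" >&2
        return 1
    }
    zfs set "$BE_PROP=$be" "$BE_ROOT" || return 1
    # Print the value that will be used at the next boot.
    zfs get -H -o value "$BE_PROP" "$BE_ROOT"
}
```

The change takes effect the next time the zone boots, exactly as in the manual sequence above.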

Patch an inactive BE

At this point, you can modify the original BE. If you would prefer to modify the new BE, you can restore the original value to the activebe property and reboot, and then mount the new BE to /mnt (or another empty directory) and modify it.

Let's mount the original BE so we can modify it. (The first command is only needed if you haven't already mounted that BE.)

s10z# zfs mount -o mountpoint=/mnt rpool/ROOT/zbe-0
s10z# patchadd -R /mnt -M /var/sadm/spool 104945-02
Note that the typical usage will be:
  1. Create a BE
  2. Mount the new (inactive) BE
  3. Use the package and patch tools to update the new BE
  4. Unmount the new BE
  5. Reboot
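
Steps 2 through 4 of that list lend themselves to a wrapper that always unmounts the BE, even when the patch fails. The following is a sketch, assuming ksh or bash; patch_inactive_be is a made-up name, and the zfs and patchadd invocations are the ones shown above.

```shell
BE_ROOT=rpool/ROOT

# patch_inactive_be BE_NAME PATCH_DIR PATCH_ID [MOUNTPOINT]  (hypothetical helper)
# Mounts an inactive BE, applies one patch to it, and unmounts it again.
patch_inactive_be() {
    be=$1 patch_dir=$2 patch_id=$3 mnt=${4:-/mnt}
    zfs mount -o mountpoint="$mnt" "$BE_ROOT/$be" || return 1
    rc=0
    patchadd -R "$mnt" -M "$patch_dir" "$patch_id" || rc=$?
    # Unmount even if patchadd failed, then report patchadd's exit status.
    zfs unmount "$BE_ROOT/$be"
    return $rc
}
```

For example, "patch_inactive_be zbe-0 /var/sadm/spool 104945-02" repeats the patchadd example above and leaves /mnt free afterwards.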

Delete an inactive BE

ZFS clones are children of their parent file systems. In order to destroy the parent, you must first "promote" the child. This reverses the parent-child relationship. (For more information on this, see the documentation.)

The original rpool/ROOT file system is the parent of the clones that you create as BEs. In order to destroy an earlier BE that is the parent of other BEs, you must first promote one of the child BEs to be the ZFS parent. Only then can you destroy the original BE.

Fortunately, this is easier to do than to explain:

s10z# zfs promote rpool/ROOT/newBE 
s10z# zfs destroy rpool/ROOT/zbe-0
s10z# zfs list -r rpool/ROOT
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT        3.56G   269G    31K  legacy
rpool/ROOT/newBE  3.56G   269G  3.55G  /
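
A wrapper can also guard against the obvious mistake: destroying the BE that the zone will boot next. The sketch below, assuming ksh or bash, refuses to do that. It promotes the BE you keep only when its ZFS "origin" property shows that it is a clone of the BE being destroyed. The name destroy_be is made up for illustration.

```shell
BE_ROOT=rpool/ROOT
BE_PROP=com.oracle.zones.solaris10:activebe

# destroy_be OLD_BE KEEP_BE  (hypothetical helper)
# Destroys OLD_BE, promoting KEEP_BE first if it is a clone of OLD_BE.
destroy_be() {
    old_be=$1 keep_be=$2
    active=$(zfs get -H -o value "$BE_PROP" "$BE_ROOT") || return 1
    if [ "$old_be" = "$active" ]; then
        echo "refusing to destroy $old_be: it is the active BE" >&2
        return 1
    fi
    # "origin" names the snapshot a clone was created from ("-" if not a clone).
    origin=$(zfs get -H -o value origin "$BE_ROOT/$keep_be") || return 1
    case "$origin" in
        "$BE_ROOT/$old_be"@*) zfs promote "$BE_ROOT/$keep_be" || return 1 ;;
    esac
    zfs destroy "$BE_ROOT/$old_be"
}
```

In the scenario above, "destroy_be zbe-0 newBE" runs the same promote and destroy commands, while "destroy_be newBE zbe-0" is rejected because newBE is the active BE.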

Documentation

This feature is so new, it is not yet described in the Solaris 11 documentation. However, MOS note 1558773.1 offers some details.

Conclusion

With this new feature, you can add packages and patches to boot environments of a Solaris 10 Branded Zone. This ability improves the manageability of these zones, and makes their use more practical. It also means that you can use the existing P2V tools with earlier Solaris 10 updates, and modify the environments after they become Solaris 10 Branded Zones.

Wednesday Jun 12, 2013

Comparing Solaris 11 Zones to Solaris 10 Zones

Many people have asked whether Oracle Solaris 11 uses sparse-root zones or whole-root zones. I think the best answer is "both and neither, and more" - but that's a wee bit confusing. :-) This blog entry attempts to explain that answer.

First a recap: Solaris 10 introduced the Solaris Zones feature set, way back in 2005. Zones are a form of server virtualization called "OS (Operating System) Virtualization." They improve consolidation ratios by isolating processes from each other so that they cannot interact. Each zone has its own set of users, naming services, and other software components. One of the many advantages is that there is no need for a hypervisor, so there is no performance overhead. Many data centers run tens to hundreds of zones per server!

In Solaris 10, there are two models of package deployment for Solaris Zones. One model is called "sparse-root" and the other "whole-root." Each form has specific characteristics, abilities, and limitations.

A whole-root zone has its own copy of the Solaris packages. This allows the inclusion of other software in system directories - even though that practice has been discouraged for many years. Although it is also possible to modify the Solaris content in such a zone, e.g. patching a zone separately from the rest, this was highly frowned on. :-( (More importantly, modifying the Solaris content in a whole-root zone may lead to an unsupported configuration.)

The other model is called "sparse-root." In that form, instead of copying all of the Solaris packages into the zone, the directories containing Solaris binaries are re-mounted into the zone. This allows the zone's users to access them at their normal places in the directory tree. Those are read-only mounts, so a zone's root user cannot modify them. This improves security, and also reduces the amount of disk space used by the zone - 200 MB instead of the usual 3-5 GB per zone. These loopback mounts also reduce the amount of RAM used by zones because Solaris stores in RAM only one copy of a program that is in use by several zones. This model also has disadvantages. One disadvantage is the inability to add software into system directories such as /usr. Also, although a sparse-root zone can be migrated to another Solaris 10 system, it cannot be moved to a Solaris 11 system as a "Solaris 10 Zone."

In addition to those contrasting characteristics, here are some characteristics of zones in Solaris 10 that are shared by both packaging models:

  • A zone can modify its own configuration files in /etc.
  • A zone can be configured so that it manages its own networking, or so that it cannot modify its network configuration.
  • It is difficult to give a non-root user in the global zone the ability to boot and stop a zone, without giving that user other abilities.
  • In a zone that can manage its own networking, the root user can do harmful things like spoof other IP addresses and MAC addresses.
  • It is difficult to assign network packet processing to the same CPUs that a zone uses. This can lead to unpredictable performance and performance troubleshooting challenges.
  • You cannot run a large number of zones in one system (e.g. 50) that each manage their own networking, because that would require assignment of more physical NICs than are available (e.g. 50).
  • Except when managed by Ops Center, zones cannot be safely stored on NAS.
  • Solaris 10 Zones cannot be NFS servers.
  • The fsstat command does not report statistics per zone.

Solaris 11 Zones use the new packaging system of Solaris 11. Their configuration does not offer a choice of packaging models, as Solaris 10 does. Instead, two (well, four) different models of "immutability" (changeability) are offered. The default model allows a privileged zone user to modify the zone's content. The other (three) limit the content which can be changed: none, or two overlapping sets of configuration files. (See "Configuring and Administering Immutable Zones".)
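
As an illustration of how that choice is made: in Solaris 11 the immutability model is a zonecfg property named file-mac-profile, with the values none (the default, writable model), strict, fixed-configuration, and flexible-configuration. The property name and values come from the Solaris 11 zones documentation, not from this entry, so verify them against your release. The helper below is a sketch with a made-up name, run in the global zone.

```shell
# set_zone_immutability ZONE PROFILE  (hypothetical helper)
# Sets the zone's file-mac-profile after checking the value.
set_zone_immutability() {
    zone=$1 profile=$2
    case "$profile" in
        none|strict|fixed-configuration|flexible-configuration) ;;
        *) echo "unknown file-mac-profile: $profile" >&2; return 2 ;;
    esac
    # zonecfg accepts a subcommand as a quoted argument.
    zonecfg -z "$zone" "set file-mac-profile=$profile"
}
```

For a zone named web1 (a made-up name), "set_zone_immutability web1 fixed-configuration" would select one of the restricted models. The new setting should apply the next time the zone boots.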

Solaris 11 addresses many of those limitations. With the characteristics listed above in mind, the following table shows the similarities and differences between zones in Solaris 10 and in Solaris 11. (Cells in a row that are similar have the same background color.)

Characteristic                                                        Solaris 10  Solaris 10   Solaris 11  Solaris 11
                                                                      Whole-Root  Sparse-Root              Immutable Zones
Each zone has a copy of most Solaris packages                         Yes         No           Yes         Yes
Disk space used by a zone (typical)                                   3.5 GB      100 MB       500 MB      500 MB
A privileged zone user can add software to /usr                       Yes         No           Yes         No
A zone can modify its Solaris programs                                True        False        True        False
Each zone can modify its configuration files                          Yes         Yes          Yes         No
Delegated administration                                              No          No           Yes         Yes
A zone can be configured to manage its own networking                 Yes         Yes          Yes         Yes
A zone can be configured so that it cannot manage its own networking  Yes         Yes          Yes         Yes
A zone can be configured with resource controls                       Yes         Yes          Yes         Yes
Integrated tool to measure a zone's resource consumption (zonestat)   No          No           Yes         Yes
Network processing automatically happens on that zone's CPUs          No          No           Yes         Yes
Zones can be NFS servers                                              No          No           Yes         Yes
Per-zone fsstat data                                                  No          No           Yes         Yes

As you can see, the statement "Solaris 11 Zones are whole-root zones" is only true using the narrowest definition of whole-root zones: those zones which have their own copy of Solaris packaging content. But there are other valuable characteristics of sparse-root zones that are still available in Solaris 11 Zones. Also, some Solaris 11 Zones do not have some characteristics of whole-root zones.

For example, the table above shows that you can configure a Solaris 11 zone that has read-only Solaris content. And Solaris 11 takes that concept further, offering the ability to tailor that immutability. It also shows that Solaris 10 sparse-root and whole-root zones are more similar to each other than to Solaris 11 Zones.

Conclusion

Solaris 11 Zones are slightly different from Solaris 10 Zones. The former can achieve the goals of the latter, and they also offer features not found in Solaris 10 Zones. Solaris 11 Zones offer the best of Solaris 10 whole-root zones and sparse-root zones, and offer an array of new features that make Zones even more flexible and powerful.

Tuesday Jan 31, 2012

(Solaris) Destination: Detroit

We will have another Solaris 11 Technology Forum soon. The two I hosted in New York City and Boston included over 150 attendees. Dozens more attended the session in Chicago hosted by my associate Scott Dickson. These sessions pack hundreds of details regarding the practical uses of Solaris 11 into 4 hours - plus lunch!

Next week - February 8, to be exact - I will host a session near Detroit, Michigan. You can attend by registering online. Two other Solaris experts - Dave Miner and Alex Barclay - will join me as we explain new Solaris 11 features such as the Image Packaging System and Automated Installer, network virtualization and resource controls, Immutable Zones, and ZFS Encryption and other new security features.

Registration is not required, but available seating is filling up quickly, so don't wait!

Thursday Nov 10, 2011

Solaris 11 Released!!

Oracle released Solaris 11 on November 9, 2011.

You can download Solaris 11.

You can also watch videos, participate in forums, and read white papers, data sheets and documentation.

Friday Oct 28, 2011

Oracle Solaris 11 Launch

Join Oracle executives Mark Hurd and John Fowler and key Oracle Solaris Engineers and Execs at the Oracle Solaris 11 launch event in New York City, at Gotham Hall on Broadway, November 9th and learn how you can build your infrastructure with Oracle Solaris 11 to:

  • Accelerate internal, public, and hybrid cloud applications
  • Optimize application deployment with built-in virtualization
  • Achieve top performance and cost advantages with Oracle Solaris 11–based engineered systems
The launch event will also feature exclusive content for our in-person audience including a session led by the VP of core Solaris development and his leads on Solaris 11 and a customer insights panel during lunch. We will also have a technology showcase featuring our latest systems and Solaris technologies. The Solaris executive team will also be there throughout the day to answer questions and give insights into future developments in Solaris.

Don't miss the Oracle Solaris 11 launch in New York on November 9. Register Today!

Tuesday Oct 18, 2011

What's New in Oracle Solaris 11

Oracle Solaris 11 adds new features to the #1 Enterprise OS, Solaris 10. Some of these features were in "preview form" in Solaris 11 Express. The feature sets introduced there have been greatly expanded in order to make Solaris 11 ready for your data center. Also, new features have been added that were not in Solaris 11 Express in any form.

The list of features below is not exhaustive. Complete documentation about changes to Solaris will be made available. To learn more, register for the Solaris 11 launch. You can attend in person, in New York City, or via webcast.

Software management features designed for cloud computing

The new package management system is far easier to use than previous versions of Solaris.
  • A completely new Solaris packaging system uses network-based repositories (located at our data centers or at yours) to modernize Solaris packaging.
  • A new version of Live Upgrade minimizes service downtime during package updates. It also provides the ability to simply reboot to a previous version of the software if necessary - without resorting to backup tapes.
  • The new Automated Installer replaces Solaris JumpStart and simplifies hands-off installation. AI also supports automatic creation of Solaris Zones.
  • Distro Constructor creates Solaris binary images that can be installed over the network, or copied to physical media.
  • The previous SVR4 (System V Release 4) packaging tools are included in Solaris 11 for installation of non-Solaris software packages.
  • All of this is integrated with ZFS. For example, the alternate boot environments (ABEs) created by the Live Upgrade tools are ZFS clones, minimizing the time to create them and the space they occupy.

Network virtualization and resource control features enable networks-in-a-box

Previewed in Solaris 11 Express, the network virtualization and resource control features in Oracle Solaris 11 enable you to create an entire network in a Solaris instance. This can include virtual switches, virtual routers, integrated firewall and load-balancing software, IP tunnels, and more. I described the relevant concepts in an earlier blog entry.

In addition to the significant improvements in flexibility compared to a physical network, network performance typically improves. Instead of traversing multiple physical network components (NICs, cables, switches and routers), packet transfers are accomplished by in-memory loads and stores. Packet latency shrinks dramatically, and aggregate bandwidth is no longer limited by NICs, but by memory link bandwidth.

But mimicking a network wasn't enough. The Solaris 11 network resource controls provide the ability to dynamically control the amount of network bandwidth that a particular workload can use. Another blog entry described these controls. (Note that some of the details may have changed between the Solaris 11 Express details described in that entry, and the details of Solaris 11.)

Easy, efficient data management

Solaris 11 expands on the award-winning ZFS file system, adding encryption and deduplication. Multiple encryption algorithms are available and can make use of encryption features included in the CPU, such as the SPARC T3 and T4 CPUs. An in-kernel CIFS server was also added, and the data is stored in a ZFS dataset. Ease-of-use is still a high-priority goal. Enabling CIFS service is as simple as enabling a dataset property.

Improved built-in computer virtualization

Along with ZFS, Oracle Solaris Zones continues to be a core feature set in use at many data centers. (The use of the word "Zones" will be preferred over the use of "Containers" to reduce confusion.) These features are enhanced in Solaris 11. I will detail these enhancements in a future blog entry, but here is a quick summary:
  • Greater flexibility for immutable zones - called "sparse-root zones" in Solaris 10. Multiple options are available in Solaris 11.
  • A zone can be an NFS server!
  • Administration of existing zones can be delegated to users in the global zone.
  • Zonestat(1) reports on resource consumption of zones. I blogged about the Solaris 11 Express version of this tool.
  • A P2V "pre-flight" checker verifies that a Solaris 10 or Solaris 11 system is configured correctly for migration (P2V) into a zone on Solaris 11.
  • To simplify the process of creating a zone, by default a zone gets a VNIC that is automatically configured on the most obvious physical NIC. Of course, you can manually configure a plethora of non-default network options.

Advanced protection

Long known as one of the most secure operating systems on the planet, Oracle Solaris 11 continues making advances, including:
  • CPU-speed network encryption means no compromises
  • Secure startup: by default, only the ssh service is enabled - a minimal attack surface reduces risk
  • Restricted root: by default, 'root' is a role, not a user - all actions are logged or audited by username
  • Anti-spoofing properties for data links
  • ...and more.
As you can guess, we're looking forward to releasing Oracle Solaris 11! Its new features provide you with the tools to simplify enterprise computing. To learn more about Solaris 11, register for the Solaris 11 launch.

Tuesday Oct 11, 2011

Solaris 11 is Coming!

Yes, the noise you hear is the pre-launch sequence for Solaris 11, which is getting closer every day!

The Launch Event will be held in New York City, at Gotham Hall on Broadway. Space is limited, so if you want to attend in person you should register online. A webcast of the event will also be available. Registration is available for both.

I have registered, so if you attend perhaps I will see you there!

Monday Sep 12, 2011

Oracle Solaris 11 Early Adopter Program

The next step in the release of Oracle Solaris 11 is here! Gold members of the Oracle Partner Network (OPN) may download the Early Adopter release to begin qualification of their applications on Oracle Solaris 11.

Oracle Solaris 11 includes the new major feature sets that are available in Solaris 11 Express, released in November 2010, and much more. You can find a complete description and download link of this Early Adopter release at oracle.com.

If you are not currently a member of the Oracle PartnerNetwork (OPN), you still have two choices:

  1. Learn more about OPN and register at http://www.oracle.com/partners/en/opn-program/index.html
  2. Begin to experience key new features by downloading Solaris 11 Express

We are looking forward to helping you learn all about these exciting new features and the benefits you will derive from them!

Tuesday Mar 01, 2011

Virtual Network - Part 4

Resource Controls

This is the fourth part of a series of blog entries about Solaris network virtualization. Part 1 introduced network virtualization, Part 2 discussed network resource management capabilities available in Solaris 11 Express, and Part 3 demonstrated the use of virtual NICs and virtual switches.

This entry shows the use of a bandwidth cap on Virtual Network Elements (VNEs). This form of network resource control can effectively limit the amount of bandwidth consumed by a particular stream of packets. In our context, we will restrict the amount of bandwidth that a zone can use.

As a reminder, we have the following network topology, with three zones and three VNICs, one VNIC per zone.

All three VNICs were created on one ethernet interface in Part 3 of this series.

Capping VNIC Bandwidth

Using a T2000 server in a lab environment, we can measure network throughput with the new dlstat(1) command. This command reports various statistics about data links, including the quantity of packets, bytes, interrupts, polls, drops, blocks, and other data. Because I am trying to illustrate the use of commands, not optimize performance, the network workload will be a simple file transfer using ftp(1). This method of measuring network bandwidth is reasonable for this purpose, but says nothing about the performance of this platform. For example, this method reads data from a disk. Some of that data may be cached, but disk performance may impact the network bandwidth measured here. However, we can still achieve the basic goal: demonstrating the effectiveness of a bandwidth cap.

With that background out of the way, first let's check the current status of our links.

GZ# dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
e1000g0     phys      1500   up       --         --
e1000g2     phys      1500   unknown  --         --
e1000g1     phys      1500   down     --         --
e1000g3     phys      1500   unknown  --         --
emp_web1    vnic      1500   up       --         e1000g0
emp_app1    vnic      1500   up       --         e1000g0
emp_db1     vnic      1500   up       --         e1000g0
GZ# dladm show-linkprop emp_app1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
emp_app1     autopush        rw   --             --             --
emp_app1     zone            rw   emp-app        --             --
emp_app1     state           r-   unknown        up             up,down
emp_app1     mtu             rw   1500           1500           1500
emp_app1     maxbw           rw   --             --             --
emp_app1     cpus            rw   --             --             --
emp_app1     cpus-effective  r-   1-9            --             --
emp_app1     pool            rw   SUNWtmp_emp-app --             --
emp_app1     pool-effective  r-   SUNWtmp_emp-app --             --
emp_app1     priority        rw   high           high           low,medium,high
emp_app1     tagmode         rw   vlanonly       vlanonly       normal,vlanonly
emp_app1     protection      rw   --             --             mac-nospoof,
                                                                restricted,
                                                                ip-nospoof,
                                                                dhcp-nospoof
<some lines deleted>
Before setting any bandwidth caps, let's determine the transfer rates between a zone on this system and a remote system.

It's easy to use dlstat to determine the data rate to my home system while transferring a file from a zone:

GZ# dlstat -i 10 emp_app1
           LINK    IPKTS   RBYTES    OPKTS   OBYTES
       emp_app1   27.99M    2.11G   54.18M   77.34G
       emp_app1       83    6.72K        0        0
       emp_app1      339   23.73K    1.36K    1.68M
       emp_app1    1.79K  120.09K    6.78K    8.38M
       emp_app1    2.27K  153.60K    8.49K   10.50M
       emp_app1    2.35K  156.27K    8.88K   10.98M
       emp_app1    2.65K  182.81K    5.09K    6.30M
       emp_app1      600   44.10K      935    1.15M
       emp_app1      112    8.43K        0        0
The OBYTES column is simply the number of bytes transferred during that data sample. I'll ignore the 1.68MB and 1.15MB data points because the file transfer began and ended during those samples. The average of the other values leads to a bandwidth of 7.6 Mbps (megabits per second), which is typical for my broadband connection.
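
The arithmetic behind that figure can be reproduced with a few lines of shell and awk. This is a sketch under one assumption that I have not verified: that dlstat's "M" suffix means 2^20 bytes, which is the interpretation that matches the 7.6 Mbps figure. The function name avg_mbps is made up. On Solaris you may need nawk or /usr/xpg4/bin/awk for the -v option.

```shell
# avg_mbps INTERVAL_SECONDS SAMPLE_MB...  (hypothetical helper)
# Averages OBYTES samples (in MB per interval) and prints megabits per second.
avg_mbps() {
    [ $# -ge 2 ] || { echo "usage: avg_mbps SECONDS SAMPLE_MB..." >&2; return 2; }
    interval=$1
    shift
    printf '%s\n' "$@" | awk -v secs="$interval" '
        { sum += $1; n++ }
        END {
            # Assumption: 1 MB = 2^20 bytes. 8 bits per byte, 10^6 bits per megabit.
            printf "%.2f\n", (sum / n) * 1048576 * 8 / secs / 1000000
        }'
}

# The four full 10-second samples from the uncapped transfer above.
avg_mbps 10 8.38 10.50 10.98 6.30     # prints 7.58
```

The same function applies to the later measurements in this entry if you pass 20 as the interval, because those samples were taken with "dlstat -i 20".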

Let's pretend that we want to constrain the bandwidth consumed by that workload to 2 Mbps. Perhaps we want to leave all of the rest for a higher-priority workload. Perhaps we're an ISP and charge for different levels of available bandwidth. Regardless of the situation, capping bandwidth is easy:

GZ# dladm set-linkprop -p maxbw=2000k emp_app1
GZ# dladm show-linkprop -p maxbw emp_app1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
emp_app1     maxbw           rw       2          --             --
GZ# dlstat -i 20 emp_app1 
           LINK    IPKTS   RBYTES    OPKTS   OBYTES
       emp_app1   18.21M    1.43G   10.22M   14.56G
       emp_app1      186   13.98K        0        0
       emp_app1      613   51.98K    1.09K    1.34M
       emp_app1    1.51K  107.85K    3.94K    4.87M
       emp_app1    1.88K  131.19K    3.12K    3.86M
       emp_app1    2.07K  143.17K    3.65K    4.51M
       emp_app1    1.84K  136.03K    3.03K    3.75M
       emp_app1    2.10K  145.69K    3.70K    4.57M
       emp_app1    2.24K  154.95K    3.89K    4.81M
       emp_app1    2.43K  166.01K    4.33K    5.35M
       emp_app1    2.48K  168.63K    4.29K    5.30M
       emp_app1    2.36K  164.55K    4.32K    5.34M
       emp_app1      519   42.91K      643  793.01K
       emp_app1      200   18.59K        0        0
Note that for dladm, the default unit for maxbw is Mbps. The average of the full samples is 1.97 Mbps.

Between zones, the uncapped data rate is higher:

GZ# dladm reset-linkprop -p maxbw emp_app1
GZ# dladm show-linkprop  -p maxbw emp_app1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
emp_app1     maxbw           rw   --             --             --
GZ# dlstat -i 20 emp_app1
           LINK    IPKTS   RBYTES    OPKTS   OBYTES
       emp_app1   20.80M    1.62G   23.36M   33.25G
       emp_app1      208   16.59K        0        0
       emp_app1   24.48K    1.63M  193.94K  277.50M
       emp_app1  265.68K   17.54M    2.05M    2.93G
       emp_app1  266.87K   17.62M    2.06M    2.94G
       emp_app1  255.78K   16.88M    1.98M    2.83G
       emp_app1  206.20K   13.62M    1.34M    1.92G
       emp_app1   18.87K    1.25M   79.81K  114.23M
       emp_app1      246   17.08K        0        0
This five-year-old T2000 can move at least 1.2 Gbps of data internally, but that took five simultaneous ftp sessions. (A better measurement method, one that doesn't include the limits of disk drives, would yield better results, and newer systems, either x86 or SPARC, have higher internal bandwidth characteristics.) In any case, the maximum data rate is not interesting for our purpose, which is demonstration of the ability to cap that rate.

You can often resolve a network bottleneck while maintaining workload isolation, by moving two separate workloads onto the same system, within separate zones. However, you might choose to limit their bandwidth consumption. Fortunately, the NV tools in Solaris 11 Express enable you to accomplish that:

GZ# dladm set-linkprop -t -p maxbw=25m emp_app1
GZ# dladm show-linkprop -p maxbw emp_app1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
emp_app1     maxbw           rw      25          --             --
Note that the change to the bandwidth cap was made while the zone was running, potentially while network traffic was flowing. Also, changes made by dladm are persistent across reboots of Solaris unless you specify "-t" on the command line.

Data moves much more slowly now:

GZ# dlstat -i 20 emp_app1
           LINK    IPKTS   RBYTES    OPKTS   OBYTES
       emp_app1   23.84M    1.82G   46.44M   66.28G
       emp_app1      192   16.10K        0        0
       emp_app1    1.15K   79.21K    5.77K    8.24M
       emp_app1   18.16K    1.20M   40.24K   57.60M
       emp_app1   17.99K    1.20M   39.46K   56.48M
       emp_app1   17.85K    1.19M   39.11K   55.97M
       emp_app1   17.39K    1.15M   38.16K   54.62M
       emp_app1   18.02K    1.19M   39.53K   56.58M
       emp_app1   18.66K    1.24M   39.60K   56.68M
       emp_app1   18.56K    1.23M   39.24K   56.17M
<many lines deleted>
The data show an aggregate bandwidth of 24 Mbps.
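
The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration: the byte counts are copied from the seven full 20-second samples above, and the assumption that dlstat's K/M/G suffixes are binary (1024-based) multiples is mine. The helper names are made up for this sketch.

```python
# Sketch: convert dlstat per-interval byte counts into Mbps.
# Assumption (not confirmed by the output above): K/M/G are 1024-based.
UNIT = {"K": 1024, "M": 1024**2, "G": 1024**3}

def to_bytes(field):
    """Convert a dlstat field such as '56.48M' or '0' into a byte count."""
    if field[-1] in UNIT:
        return float(field[:-1]) * UNIT[field[-1]]
    return float(field)

def mean_mbps(fields, interval_s):
    """Average rate in Mbps (10**6 bits/s) over equal-length intervals."""
    total_bits = sum(to_bytes(f) for f in fields) * 8
    return total_bits / (len(fields) * interval_s) / 1e6

# RBYTES and OBYTES from the seven full samples taken with "dlstat -i 20".
rbytes = ["1.20M", "1.20M", "1.19M", "1.15M", "1.19M", "1.24M", "1.23M"]
obytes = ["57.60M", "56.48M", "55.97M", "54.62M", "56.58M", "56.68M", "56.17M"]

inbound = mean_mbps(rbytes, 20)
outbound = mean_mbps(obytes, 20)
print(f"inbound {inbound:.1f} Mbps, outbound {outbound:.1f} Mbps, "
      f"total {inbound + outbound:.1f} Mbps")
```

Under those assumptions the total lands just under the 25 Mbps cap, which is consistent with the value quoted above.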

Conclusion

The network virtualization tools in Solaris 11 Express include various resource controls. The simplest of these is the bandwidth cap, which you can use to effectively limit the amount of bandwidth that a workload can consume. Both physical NICs and virtual NICs may be capped by using this simple method. This also applies to workloads that are in Solaris Zones - both default zones and Solaris 10 Zones which mimic Solaris 10 systems.

Next time we'll explore some other virtual network architectures.

Thursday Jan 27, 2011

Virtual Networks - Part 2

This is the second in a series of blog entries that discuss the network virtualization features in Solaris 11 Express. The first entry discussed the basic concepts and the virtual network elements, including virtual NICs, VLANs, virtual switches, and InfiniBand datalinks.

This entry adds to that list the resource controls and security features that are necessary for a well-managed virtual network.

Virtual Networks, Real Resource Controls

In Oracle Solaris 11 Express, there are four main datalink resource controls:
  1. a bandwidth cap, which limits the amount of traffic passing through a datalink in a small amount of elapsed time
  2. assignment of packet processing tasks to a subset of the system's CPUs
  3. flows, which were introduced in the previous blog post
  4. rings, which are hardware or software resources that can be dedicated to a single purpose.
Let's take them one at a time. By default, datalinks such as VNICs can consume as much of the physical NIC's bandwidth as they want. That might be the desired behavior, but if it isn't you can apply the property "maxbw" to a datalink. The maximum permitted bandwidth can be specified in Kbps, Mbps or Gbps. This value can be changed dynamically, so if you set this value too low, you can change it without affecting the traffic flowing over that link. Solaris will not allow traffic to flow over that datalink at a rate faster than you specify.

You can "over-subscribe" this bandwidth cap: the sum of the bandwidth caps on the VNICs assigned to a NIC can exceed the rated bandwidth of the NIC. If that happens, the bandwidth caps become less effective.
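
A quick way to see whether a set of caps over-subscribes a NIC is to add them up. The sketch below uses hypothetical VNIC names and cap values; nothing in it comes from a real configuration.

```python
# Hypothetical caps (Mbps) on three VNICs that share one 1 Gbps NIC.
NIC_MBPS = 1000
caps_mbps = {"vnic_web": 500, "vnic_app": 400, "vnic_db": 300}

total = sum(caps_mbps.values())
ratio = total / NIC_MBPS
print(f"caps total {total} Mbps on a {NIC_MBPS} Mbps NIC: ratio {ratio:.2f}")
if ratio > 1:
    # The VNICs cannot all reach their caps at the same time.
    print("over-subscribed")
```

A ratio above 1 means the caps alone no longer guarantee that every VNIC can reach its limit at the same time.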

In addition to the bandwidth cap, packet processing computation can be constrained to the CPUs associated with a workload.

First some background. When Solaris boots, it assigns interrupt handler threads to the CPUs in the system. (See Solaris CPUs for an explanation of the meaning of "CPU".) Solaris attempts to spread the interrupt handlers out evenly so that one CPU does not become a bottleneck for interrupt handling.

If you create non-default CPU pools, the interrupt handlers will retain their CPU assignments. One unintended side effect of this is a situation where the CPUs intended for one workload will be handling interrupts caused by another workload. This can occur even with simple configurations of Solaris Zones. In extreme cases, network packet processing for one zone can severely impact the performance of another zone.

To prevent this behavior, Solaris 11 Express offers the ability to assign a datalink's interrupt handler to a set of CPUs or a pool of CPUs. To simplify this further, the obvious choice is made for you, by default, for a zone which is assigned its own resource pool. When such a zone boots, a resource pool is created for the zone, a sufficient quantity of CPUs is moved from the default pool to the zone's pool, and interrupt handlers for that zone's datalink(s) are automatically reassigned to that resource pool.

Network flows enable you to create multiple lanes of traffic. This allows the parallelization of network traffic. You can assign a bandwidth cap to a flow. Flows were introduced in the previous post and will be discussed further in future posts.

Finally, the newest high speed NICs support hardware rings: memory resources that can be dedicated to a particular set of network traffic. For inbound packets, this is the first resource control that separates network traffic based on packet information such as destination MAC address. By assigning one or more rings to a stream of traffic, you can commit sufficient hardware resources to it and ensure a greater relative priority for those packets, even if another stream of traffic on the same NIC would otherwise cause congestion and impact packet latency of all streams.

If you are using a NIC that does not support hardware rings, Solaris 11 Express supports software rings which cause a similar effect.

Virtual Networks, Real Security

In addition to resource controls, Solaris 11 Express offers datalink protection controls. These controls are intended to prevent a user from creating improper packets that would cause mischief on the network. The mac-nospoof property requires that outgoing packets have a MAC address which matches the link's MAC address. The ip-nospoof property implements a similar restriction, but for IP addresses. The dhcp-nospoof property prevents improper DHCP assignment.

Summary (so far)

The network virtualization features in Solaris 11 Express enable the creation of virtual network devices, leading to the implementation of an entire network inside one Solaris system. Associated resource control features give you the ability to manage network bandwidth as a resource and reduce the potential for one workload to cause network performance problems for another workload. Finally, security features help you minimize the impact of an intruder.

With all of the introduction out of the way, next time I'll show some actual uses of these concepts.

Wednesday Jan 05, 2011

Virtual Networks

Network virtualization is one of the industry's hot topics. The potential to reduce cost while increasing network flexibility easily justifies the investment in time to understand the possibilities. This blog entry describes network virtualization and some concepts. Future entries will show the steps to create a virtual network.

Introduction to Network Virtualization

Network virtualization can be described as the process of creating a computer network which does not match the physical topology of a physical network. Usually this is achieved by using software tools of general-purpose computers or by using features of network hardware. A defining characteristic of a virtual network is the ability to re-configure the topology without manipulating any physical objects: devices or cables.

Such a virtual network mimics a physical network. Some types of virtual networks, for example virtual LANs (VLANs), can be implemented using features of network switches and computers. However, some other implementations do not require traditional network hardware such as routers and switches. All of the functionality of network hardware has been re-implemented in software, perhaps in the operating system.

Benefits of network virtualization (NV) include increased architectural flexibility, better bandwidth and latency characteristics, the ability to prioritize network traffic to meet desired performance goals, and lower cost from fewer devices, reduced total power consumption, etc.

The remainder of this blog entry will focus on a software-only implementation of NV.

A few years ago, networking engineers at Sun began working on a software project named "Crossbow." The goal was to create a comprehensive set of NV features within Solaris. Just like Solaris Zones, Crossbow would provide integrated features for creation and monitoring of general purpose virtual network elements that could be deployed in limitless configurations. Because these features are integrated into the operating system, they automatically take advantage of - and smoothly interoperate with - existing features. This is most noticeable in the integration of Solaris NV features and Solaris Zones. Also, because these NV features are a part of Solaris, future Solaris enhancements will be integrated with Solaris NV where appropriate.

The core NV features were first released in OpenSolaris 2009.06. Since then, those core features have matured and more details have been added. The result is the ability to re-implement entire networks as virtual networks using Solaris 11 Express. Here is an example of a virtual network architecture:

As you can guess from that example, you can create virtually :-) any network topology as a virtual network...

Oracle Solaris NV does more than is described here. This content focuses on the key features which might be used to consolidate workloads or entire networks into a Solaris system, using zones and NV features.

Virtual Network Elements

Solaris 11 Express implements the following virtual network elements.
  • NIC: OK, this isn't a virtual element, it's just on the list as a starting point.
    For a very long time, Solaris has managed Network Interface Cards (NICs). Solaris offers tools to manage NICs, including bringing them up and down, and assigning various characteristics to them, such as IP addresses, assignment to IP Multipathing (IPMP) groups, etc. Note that up through Solaris 10, most of those configuration tasks were accomplished with the ifconfig(1M) command, but in Solaris 11 Express the dladm(1M) and ipadm(1M) commands perform those tasks, and a few more. You can monitor the use of NICs with dlstat(1M). The term "datalink" is now used consistently to refer to NICs and things like NICs, such as...

  • A VNIC is a pseudo interface created on a datalink (a NIC or an etherstub, described next). Each VNIC has its own MAC address, which can be generated automatically, but can be specified manually. For almost all purposes, a VNIC can be managed like a NIC. The dladm command creates, lists, deletes, and modifies VNICs. The dlstat command displays statistics about VNICs. The ipadm(1M) command configures IP interfaces on VNICs.
    Like NICs, VNICs have a number of properties that can be modified with dladm. These include the ability to force network processing of a VNIC to a certain set of CPUs, setting a cap (maximum) on permitted bandwidth for a VNIC, the relative priority of this VNIC versus other VNICs on the same NIC, and other properties.

  • Etherstubs are pseudo NICs, making internal networks possible. For a general understanding, think of them as virtual switches. The command dladm manages etherstubs.

  • A flow is a stream of packets that share particular attributes such as source IP address or TCP port number. Once defined, a flow can be managed as an entity, including capping bandwidth usage, setting relative priorities, etc. The new flowadm(1M) command enables you to create and manage flows. Even if you don't set resource controls, flows will benefit from dedicated kernel resources and more predictable, consistent performance. Further, you can directly observe detailed statistics on each flow, improving your ability to understand these streams of packets and set proper resource controls. Flows are managed with flowadm(1M) and monitored with flowstat(1M).

  • VLANs (Virtual LANs) have been around for a long time. For consistency, the commands dladm, dlstat and ipadm now manage VLANs.

  • InfiniBand partitions are virtual networks that use an InfiniBand fabric. They are managed with the same commands as VNICs and VLANs: dladm, dlstat, ipadm and others.

Summary

Solaris 11 Express provides a complete set of virtual network components which can be used to deploy virtual networks within a Solaris instance. The next blog entry will describe network resource management and security. Future entries will provide some examples.

Thursday Dec 09, 2010

All New Zonestat - Part 2

Part 2

Recently I introduced zonestat(1), a new command offered in Solaris 11 Express that replaces the zonestat Perl script that I had open-sourced a couple of years ago. Today I will complete the description of the new zonestat.

Fair Share Scheduler

One of the many useful resource controls offered by Solaris is the Fair Share Scheduler (FSS(7)). You can use FSS to tell Solaris "make sure that zoneXYZ can use a specific portion of the compute capacity of a set of 'Solaris CPUs' at a minimum." (For more information on FSS, see its man page and the Solaris 10 or Solaris 11 Express documentation.) FSS only enforces those minima if there is CPU contention.

That terse explanation used the phrase "of a set of CPUs" because the minimum portions of compute capacity enforced by Solaris can be calculated across all of the CPUs in the system, or across a set of CPUs that you have configured into a processor set. For my purpose here, the point is that you can create a resource pool - including a processor set, assign multiple zones to that pool, and tell Solaris to use the FSS scheduling algorithm for the processes to be scheduled on those CPUs. (See the libpool(3LIB) man page for more information.)

Zonestat will show information relating to FSS, including data that answers the question "is there CPU contention in any of my psets, and if there is, which zone(s) are using more than I expected?"

The next example uses two new zones, and assumes that most of the zones used earlier have been turned off for now. A dynamic resource pool, sharedDB, has been created. It will be the set of CPUs used by two zones which run database software. (This method is called "capped Containers" in Oracle licensing documents and is considered hard partitioning. It can be used to limit software license costs.) One of those zones, zoneDB-2, is more important than the other, zoneDB-1. To meet its SLA, zoneDB-2 must always be able to use at least 4 of the 6 CPUs in that pset.

# zonestat -r psets 10 1
Collecting data for first interval...
Interval: 1, Duration: 0:00:10
PROCESSOR_SET                   TYPE  ONLINE/CPUS     MIN/MAX
pset_default            default-pset        22/22         1/-
                                ZONE  USED   PCT   CAP  %CAP   SHRS  %SHR %SHRU
                             [total]  0.12 0.56%     -     -      -     -     -
                            [system]  0.02 0.12%     -     -      -     -     -
                              global  0.09 0.44%     -     -      -     -     -

PROCESSOR_SET                   TYPE  ONLINE/CPUS     MIN/MAX
sharedDB                   pool-pset          6/6         6/6
                                ZONE  USED   PCT   CAP  %CAP   SHRS  %SHR %SHRU
                             [total]  4.01 66.8%     -     -    300     -     -
                            [system]  0.00 0.00%     -     -      -     -     -
                            zoneDB-2  3.00 50.1%     -     -    200 66.6% 75.1%
                            zoneDB-1  1.00 16.7%     -     -    100 33.3% 50.2%

PROCESSOR_SET                   TYPE  ONLINE/CPUS     MIN/MAX
zoneA                  dedicated-cpu          4/4         4/4
                                ZONE  USED   PCT   CAP  %CAP   SHRS  %SHR %SHRU
                             [total]  0.00 0.09%     -     -      -     -     -
                            [system]  0.00 0.00%     -     -      -     -     -
                               zoneA  0.00 0.09%     -     -      -     -     -
In the output above, zoneDB-1 is using the equivalent processing capacity of 1 Solaris CPU, and zoneDB-2 is using the equivalent of 3 Solaris CPUs. The PCT column indicates that zoneDB-2 is using 50% (3 out of 6) of the CPU capacity of the entire pset.

In addition, the SHRS column shows the number of FSS shares assigned to each of those zones, and %SHR is that zone's proportion of the total number of shares. In other words, zoneDB-2 is using 50% of the pset, but hasn't even used its enforced minimum of 66.6%.

This FSS configuration ensures that 200/300ths of 6 CPUs (i.e. 4 CPUs) are available to zoneDB-2. The %SHRU value of 75% tells us that the zone is using 75% of those 4 CPUs. Again, each of those two zones is allowed to use more than its share, as long as each zone can use its specified minimum.
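
The share arithmetic above can be reproduced with a short Python sketch. The share counts and CPU usage are taken from the sharedDB output above; the variable and function names are mine, and this is a model of the calculation, not of how zonestat computes its columns.

```python
# Values from the sharedDB pset in the zonestat output above.
PSET_CPUS = 6
zones = {                      # zone: (FSS shares, CPUs in use)
    "zoneDB-2": (200, 3.00),
    "zoneDB-1": (100, 1.00),
}
total_shares = sum(shares for shares, _ in zones.values())

def fss_view(shares, used):
    """Return (%SHR, CPUs guaranteed under contention, %SHRU)."""
    guaranteed = shares * PSET_CPUS / total_shares
    return 100 * shares / total_shares, guaranteed, 100 * used / guaranteed

for zone, (shares, used) in zones.items():
    pct_shr, guaranteed, pct_shru = fss_view(shares, used)
    print(f"{zone}: {pct_shr:.1f}% of shares, {guaranteed:.1f} CPUs "
          f"guaranteed, using {pct_shru:.1f}% of that")
```

The small differences from the %SHRU column (75.1% and 50.2%) are expected, because the USED values shown by zonestat are rounded to two decimal places.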

Sorting the Output

You may have noticed in earlier examples that the zones were not listed in alphabetical order. By default, they are sorted by the amount of the resource that zonestat was reporting. If a resource is not specified, the output is sorted on CPU%. The two rows showing total and system data are always listed first.

You can change the sort order with the -S option, for example:

GZ$ zonestat -r processes -S used 2 1
Collecting data for first interval...
Interval: 1, Duration: 0:00:02
PROCESSES                     SYSTEM LIMIT
system-limit                          292K
                                ZONE  USED   PCT   CAP  %CAP
                             [total]   110 0.36%     -     -
                            [system]     0 0.00%     -     -
                              global    61 0.20%     -     -
                               zoneD    26 0.08%     -     -
                               zoneA    23 0.07%     -     -

As the man page for zonestat(1) shows, many other resources can be monitored. I will not review each of them here.

Aggregated Data

But wait! There's more! ;-)

In addition to understanding resource usage over a short interval (e.g. 10 or 60 seconds) it is often necessary to understand the peak usage or average usage over a longer period of time. For that purpose, zonestat provides Summary Reports.

A simple summary report is one that is appended to the per-sample data shown earlier. Here is the output for two samples and one summary report:

GZ$ zonestat -r processes -R high -S used 10 2
Collecting data for first interval...
Interval: 1, Duration: 0:00:10
PROCESSES                     SYSTEM LIMIT
system-limit                          292K
                                ZONE  USED   PCT   CAP  %CAP
                             [total]   109 0.36%     -     -
                            [system]     0 0.00%     -     -
                              global    61 0.20%     -     -
                               zoneD    25 0.08%     -     -
                               zoneA    23 0.07%     -     -

Interval: 2, Duration: 0:00:20
PROCESSES                     SYSTEM LIMIT
system-limit                          292K
                                ZONE  USED   PCT   CAP  %CAP
                             [total]   109 0.36%     -     -
                            [system]     0 0.00%     -     -
                              global    61 0.20%     -     -
                               zoneD    25 0.08%     -     -
                               zoneA    23 0.07%     -     -

Report: High Usage
    Start: Thu Dec  2 21:58:54 EST 2010
      End: Thu Dec  2 21:59:14 EST 2010
    Intervals: 2, Duration: 0:00:20
PROCESSES                     SYSTEM LIMIT
system-limit                          292K
                                ZONE  USED   PCT   CAP  %CAP
                             [total]   109 0.36%     -     -
                            [system]     0 0.00%     -     -
                              global    61 0.20%     -     -
                               zoneD    25 0.08%     -     -
                               zoneA    23 0.07%     -     -

If you only need the summary report, -q will be useful: it suppresses the individual samples of data.
GZ$ zonestat -q -r physical-memory -R high -S used 10 2
Report: High Usage
    Start: Thu Dec  2 22:03:54 EST 2010
      End: Thu Dec  2 22:04:14 EST 2010
    Intervals: 2, Duration: 0:00:20
PHYSICAL-MEMORY              SYSTEM MEMORY
mem_default                          31.8G
                                ZONE  USED   PCT   CAP  %CAP
                             [total] 5205M 15.9%     -     -
                            [system] 2790M 8.54%     -     -
                               zoneD 2229M 6.83%     -     -
                              global  141M 0.43%     -     -
                               zoneA 44.5M 0.13%     -     -
Let's assume that, from the data above, we determine that zoneD will never need more than 3 GB of RAM when it is operating correctly. We can add a RAM cap so that Solaris enforces that limit in case something goes awry:
GZ$ pfexec rcapadm -z zoneD -m 3g
GZ$ zonestat -q -r physical-memory -R high -S used 10 2
Report: High Usage
    Start: Thu Dec  2 22:37:15 EST 2010
      End: Thu Dec  2 22:37:35 EST 2010
    Intervals: 2, Duration: 0:00:20
PHYSICAL-MEMORY              SYSTEM MEMORY
mem_default                          31.8G
                                ZONE  USED   PCT   CAP  %CAP
                             [total] 5204M 15.9%     -     -
                            [system] 2789M 8.54%     -     -
                               zoneD 2229M 6.83% 3072M 72.5%
                              global  141M 0.43%     -     -
                               zoneA 44.5M 0.13%     -     -
In addition to a summary report at the end of the output, zonestat is able to generate periodic summary reports based on the collection of individual samples. For example, you might want to know what the peak memory usage is for each zone, per hour.

(A quick side note: zonestat does not perform continuous data collection. It collects data at an interval you specify. Therefore, the peak values reported by zonestat are the peak values of the values which were collected. In other words, zonestat reports the peak observed values.)

The example below collects data every 10 seconds for 24 hours. It reports the peak observed values every hour.

GZ$ zonestat -q -r physical-memory -R high 10 24h 60m
Report: High Usage
    Start: Mon Dec  5 16:42:01 EST 2010
      End: Mon Dec  5 17:42:01 EST 2010
    Intervals: 360, Duration: 1:00:00
PHYSICAL-MEMORY              SYSTEM MEMORY
mem_default                          31.8G
                                ZONE  USED   PCT   CAP  %CAP
                             [total] 3015M 9.23%     -     -
                            [system] 2791M 8.55%     -     -
                              global  136M 0.41%     -     -
                               zoneA 44.5M 0.13%     -     -
                               zoneD 45.7M 0.13% 3072M 1.48%

Report: High Usage
    Start: Mon Dec  5 17:42:01 EST 2010
      End: Mon Dec  5 18:42:01 EST 2010
    Intervals: 10, Duration: 1:00:00
PHYSICAL-MEMORY              SYSTEM MEMORY
mem_default                          31.8G
                                ZONE  USED   PCT   CAP  %CAP
                             [total] 3015M 9.23%     -     -
                            [system] 2791M 8.55%     -     -
                              global  136M 0.41%     -     -
                               zoneA 44.5M 0.13%     -     -
                               zoneD 64.3M 0.19% 3072M 2.09%

Report: High Usage
    Start: Mon Dec  5 18:42:01 EST 2010
      End: Mon Dec  5 19:42:01 EST 2010
    Intervals: 15, Duration: 1:00:00
PHYSICAL-MEMORY              SYSTEM MEMORY
mem_default                          31.8G
                                ZONE  USED   PCT   CAP  %CAP
                             [total] 3015M 9.23%     -     -
                            [system] 2791M 8.55%     -     -
                              global  136M 0.41%     -     -
                               zoneA 44.5M 0.13%     -     -
                               zoneD 65.1M 0.19% 3072M 2.11%
[further output deleted]
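
The earlier side note about peak observed values can be illustrated with a toy model. The Python sketch below uses a made-up per-second memory trace and simple point-in-time sampling. It is a simplification for illustration, not a description of how zonestat gathers its data.

```python
# Hypothetical per-second memory use (MB) over one minute,
# with a short three-second spike starting at second 24.
trace = [100] * 60
trace[24:27] = [900, 950, 900]

def peak_observed(values, interval):
    """Peak of the values seen when sampling every `interval` seconds."""
    return max(values[::interval])

true_peak = max(trace)
print("true peak:", true_peak)
print("peak observed at 10 s:", peak_observed(trace, 10))
print("peak observed at 5 s:", peak_observed(trace, 5))
```

In this model a 10-second interval misses the spike entirely, while a 5-second interval happens to catch it. A shorter interval narrows the gap between observed and true peaks, at the cost of more samples.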

Parseable Data

At this point (especially after reading the sub-title of this section...) you might think that zonestat should have an option to generate output which is easy to parse. And you won't be disappointed: CSV (colon-separated-value) output is the result of the lower-case -p option:
GZ$ zonestat -p -q -r physical-memory -R high -S used 10 2
report-high:header:20101203T034655Z:20101203T034715Z:10:20
report-high:physical-memory:mem_default:[resource]:33423360K
report-high:physical-memory:mem_default:[total]:5329836K:15.94%:-:-
report-high:physical-memory:mem_default:[system]:2856556K:8.54%:-:-
report-high:physical-memory:mem_default:zoneD:2282968K:6.83%:3145728K:72.57%
report-high:physical-memory:mem_default:global:144608K:0.43%:-:-
report-high:physical-memory:mem_default:zoneA:45576K:0.13%:-:-
report-high:footer:20101203T034715Z10:20
Even a small number of options can generate a great deal of output:
GZ$ zonestat -p -r psets -R high 10 2
interval:header:since-last-interval:20101203T035201Z:1:10
interval:processor-set:default-pset:pset_default:[resource]:24:24:1:-:0-04-03.01
interval:processor-set:default-pset:pset_default:[total]:0.09:0.38%:-:-:-:-:-:0-00-00.92
interval:processor-set:default-pset:pset_default:[system]:0.01:0.06%:-:-:-:-:-:0-00-00.15
interval:processor-set:default-pset:pset_default:global:0.07:0.31%:-:-:-:-:-:0-00-00.77
interval:processor-set:dedicated-cpu:zoneD:[resource]:4:4:4:4:0-00-40.50
interval:processor-set:dedicated-cpu:zoneD:[total]:1.00:25.01%:-:-:-:-:-:0-00-10.13
interval:processor-set:dedicated-cpu:zoneD:[system]:0.00:0.05%:-:-:-:-:-:0-00-00.02
interval:processor-set:dedicated-cpu:zoneD:zoneD:0.99:24.95%:-:-:-:-:-:0-00-10.10
interval:processor-set:dedicated-cpu:zoneA:[resource]:4:4:4:4:0-00-40.50
interval:processor-set:dedicated-cpu:zoneA:[total]:0.00:0.01%:-:-:-:-:-:0-00-00.00
interval:processor-set:dedicated-cpu:zoneA:[system]:0.00:0.00%:-:-:-:-:-:0-00-00.00
interval:processor-set:dedicated-cpu:zoneA:zoneA:0.00:0.01%:-:-:-:-:-:0-00-00.00
interval:footer:20101203T035201Z10:10
interval:header:since-last-interval:20101203T035211Z:2:20
interval:processor-set:default-pset:pset_default:[resource]:24:24:1:-:0-04-03.08
interval:processor-set:default-pset:pset_default:[total]:0.07:0.32%:-:-:-:-:-:0-00-00.79
interval:processor-set:default-pset:pset_default:[system]:0.01:0.06%:-:-:-:-:-:0-00-00.15
interval:processor-set:default-pset:pset_default:global:0.06:0.26%:-:-:-:-:-:0-00-00.63
interval:processor-set:dedicated-cpu:zoneD:[resource]:4:4:4:4:0-00-40.51
interval:processor-set:dedicated-cpu:zoneD:[total]:1.00:25.01%:-:-:-:-:-:0-00-10.13
interval:processor-set:dedicated-cpu:zoneD:[system]:0.00:0.12%:-:-:-:-:-:0-00-00.04
interval:processor-set:dedicated-cpu:zoneD:zoneD:0.99:24.89%:-:-:-:-:-:0-00-10.08
interval:processor-set:dedicated-cpu:zoneA:[resource]:4:4:4:4:0-00-40.51
interval:processor-set:dedicated-cpu:zoneA:[total]:0.00:0.01%:-:-:-:-:-:0-00-00.00
interval:processor-set:dedicated-cpu:zoneA:[system]:0.00:0.00%:-:-:-:-:-:0-00-00.00
interval:processor-set:dedicated-cpu:zoneA:zoneA:0.00:0.01%:-:-:-:-:-:0-00-00.00
interval:footer:20101203T035211Z10:20
report-high:header:20101203T035151Z:20101203T035211Z:10:20
report-high:processor-set:default-pset:pset_default:[resource]:24:24:1:-:0-04-03.08
report-high:processor-set:default-pset:pset_default:[total]:0.09:0.38%:-:-:-:-:-:0-00-00.92
report-high:processor-set:default-pset:pset_default:[system]:0.01:0.06%:-:-:-:-:-:0-00-00.15
report-high:processor-set:default-pset:pset_default:global:0.07:0.31%:-:-:-:-:-:0-00-00.77
report-high:processor-set:dedicated-cpu:zoneD:[resource]:4:4:4:4:0-00-40.51
report-high:processor-set:dedicated-cpu:zoneD:[total]:1.00:25.07%:-:-:-:-:-:0-00-10.15
report-high:processor-set:dedicated-cpu:zoneD:[system]:0.00:0.12%:-:-:-:-:-:0-00-00.04
report-high:processor-set:dedicated-cpu:zoneD:zoneD:0.99:24.95%:-:-:-:-:-:0-00-10.10
report-high:processor-set:dedicated-cpu:zoneA:[resource]:4:4:4:4:0-00-40.51
report-high:processor-set:dedicated-cpu:zoneA:[total]:0.00:0.01%:-:-:-:-:-:0-00-00.00
report-high:processor-set:dedicated-cpu:zoneA:[system]:0.00:0.00%:-:-:-:-:-:0-00-00.00
report-high:processor-set:dedicated-cpu:zoneA:zoneA:0.00:0.01%:-:-:-:-:-:0-00-00.00
report-high:footer:20101203T035211Z10:20
With parseable output, you can easily write scripts that consume the output. Those scripts can further analyze the data, draw colorful graphs, and perform other data manipulation.
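
As a minimal sketch of such a script, the Python below parses the physical-memory report shown earlier. The sample text is copied from that output. The field names (used, pct, cap) are my own labels, inferred from the USED, PCT and CAP columns of the human-readable report, so treat the layout as an assumption and check it against the zonestat(1) man page.

```python
# Sample copied from the "zonestat -p -q -r physical-memory" output above.
SAMPLE = """\
report-high:header:20101203T034655Z:20101203T034715Z:10:20
report-high:physical-memory:mem_default:[resource]:33423360K
report-high:physical-memory:mem_default:[total]:5329836K:15.94%:-:-
report-high:physical-memory:mem_default:[system]:2856556K:8.54%:-:-
report-high:physical-memory:mem_default:zoneD:2282968K:6.83%:3145728K:72.57%
report-high:physical-memory:mem_default:global:144608K:0.43%:-:-
report-high:physical-memory:mem_default:zoneA:45576K:0.13%:-:-
report-high:footer:20101203T034715Z10:20
"""

def parse_memory(text):
    """Return {zone: {"used_kb", "pct", "cap_kb"}} for physical-memory rows."""
    rows = {}
    for line in text.splitlines():
        fields = line.split(":")
        # Header, footer and [resource] rows have fewer than 8 fields.
        if len(fields) != 8 or fields[1] != "physical-memory":
            continue
        _, _, _, zone, used, pct, cap, _ = fields
        rows[zone] = {
            "used_kb": int(used.rstrip("K")),
            "pct": float(pct.rstrip("%")),
            "cap_kb": None if cap == "-" else int(cap.rstrip("K")),
        }
    return rows

usage = parse_memory(SAMPLE)
for zone, row in usage.items():
    if row["cap_kb"] is not None:
        print(f'{zone}: {100 * row["used_kb"] / row["cap_kb"]:.2f}% of its cap')
```

Splitting on the colon is enough here because none of the fields in this sample contain one. A more careful script would also handle the other resource types, which have different numbers of fields.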

Miscellaneous Comments

A few parting notes:
  1. You can run zonestat in any zone. It will only receive data that should be visible to that zone. For example, when run in a non-global zone, zonestat will only display data about the processor set on which that zone's processes run.
  2. Zonestat gets all of the data from the zonestatd daemon, which is part of the service svc:/system/zones-monitoring:default. That service is enabled by default. Because that service is managed by SMF, if for any reason zonestatd stops, SMF will restart it.
  3. You can specify the interval at which zonestatd gathers data by setting the zones-monitoring:default service property config/sample_interval.

Summary

The new zonestat will provide hours of entertainment. It also provides answers to countless questions regarding your zones' use of system resources. Armed with that information, you can improve your understanding of the resource consumption of those zones, and improve the use of resource controls to ensure predictable performance of your workloads.

I hope that you have enjoyed learning about zonestat as much as I did. Check back during the first week of January for information on other new observability tools in Solaris 11 Express!

Tuesday Dec 07, 2010

All New Zonestat!

Part 1

Recently I gave a brief overview of the enhancements available in Solaris 11 Express. I also hinted at more blog entries, mostly featuring Solaris Zones and network virtualization.

Before experimenting with new functions it's useful to have some tools to measure the results. With that in mind, this blog entry and its successor(s) will discuss new measurement tools that are in Solaris 11 Express: zonestat(1), flowstat(1M) and dlstat(1M). I will start with zonestat.

Zonestat Introduction

But first some history. For Solaris 10 I created an open-source tool I named "zonestat". That tool filled a need: one integrated view of the resource consumption and optional resource control settings of all running zones. The resources listed included CPUs, physical memory, virtual memory, and locked memory. Zonestat provided a "dashboard" that greatly eased the task of monitoring the resource usage of Solaris Zones.

That tool has these main drawbacks:

  1. It's written in Perl and uses a large set of existing Solaris commands to gather all of the data that it needs. Executing all of those commands for each data sample uses a significant amount of CPU time.
  2. It is a separate tool, not part of Solaris. It is not supported.
  3. It was originally intended as a prototype, a demonstration of what could be accomplished. I made a number of enhancements along the way, but for a while it wasn't clear whether it made sense to upgrade it for Solaris 11.

However, even with those shortcomings, that zonestat script was put into production at a number of data centers.

In 2009 Solaris Engineering decided to write a fully supported version of zonestat, as a new Solaris command. Instead of relying on someone writing code in his spare time (me), a member of the Solaris Zones Engineering Team (Steve Lawrence) was assigned to write a comprehensive, efficient, fully featured tool that achieved the same goals as the original zonestat, and many more. Building on the experience gained from the original script, a completely new program (also called "zonestat") had the potential to solve all of the problems of the open-source Perl script, and to add both features that users of Solaris Zones had requested and features that the Zones Engineering Team knew would be useful.

And in Solaris 11 Express that potential was realized. Because the new zonestat performs almost all of the functions of the original zonestat script, and performs far more in addition, the rest of this blog entry (and the next one) will only discuss the new zonestat which is part of Solaris 11 Express.

The new zonestat(1) command has a plethora of options. These options allow the user to list data:

  • for each of the system's zones, including the global zone and data specific to kernel processing but not directly attributable to any one zone
  • for any subset of zones
  • regarding one or more types of resources, in absolute units or as a portion of available or capped resources
  • regarding one or more instances of resources (e.g. a particular processor set)
  • that has been sorted by one or more output columns
  • that is human-readable output or output that can be easily parsed by a script or other program
  • that includes timestamps, in one of several formats
  • that includes regular aggregations (called "summary reports"), such as "highest value during the interval" or "average value during the interval"

Zonestat has a variety of uses. The most obvious is monitoring resource usage of zones. Even if you don't use resource controls, zonestat will help you by telling you when a zone is using a significant portion (or all!) of the system's resources. Of course, zonestat really brings value to systems that are using resource controls, making it easy to determine which zones are near their caps - a sure sign that there is a problem with that zone's workload or that the zone's cap is too low.

In addition, you can use zonestat to determine proper values for resource controls. For example, you can deploy a workload in a zone and use zonestat to determine the maximum amount of CPU capacity it uses. That information will enable you to make better decisions about how many CPUs to assign to that zone, if you have decided that the workload should have its own dedicated CPUs.

If you are not familiar with the resource management controls offered by Solaris, you may wish to view the relevant documentation before, during or after reading the rest of this. The book "Oracle Solaris 10 System Virtualization Essentials" also describes all of the resource controls available for Solaris 10 Zones, and how they can be used to achieve various goals. Finally, the document "Understanding the Security Capabilities of Solaris" approaches the same content from a security perspective.

Now let's explore some of the interesting things you can do with zonestat.

Basics

The default output provides the data you would expect - basic information about the resource usage of all zones on the system. The command syntax can be simplified to this (omitting some features for now):

zonestat [options] interval [duration]

and the basic output looks like this:

GZ$ zonestat 5 2
Collecting data for first interval...
Interval: 1, Duration: 0:00:05
SUMMARY                  Cpus/Online: 32/32   Physical: 31.8G    Virtual: 47.8G
                    ----------CPU---------- ----PHYSICAL----- -----VIRTUAL-----
               ZONE  USED %PART  %CAP %SHRU  USED   PCT  %CAP  USED   PCT  %CAP
            [total]  0.10 0.31%     -     - 3109M 9.52%     - 7379M 15.0%     -
           [system]  0.01 0.04%     -     - 2797M 8.57%     - 7115M 14.5%     -
             global  0.08 0.51%     -     -  141M 0.43%     -  129M 0.26%     -
              zoneA  0.00 0.02%     -     - 43.7M 0.13%     - 35.4M 0.07%     -
              zoneB  0.00 0.02%     -     - 42.0M 0.12%     - 32.8M 0.06%     -
              zoneC  0.00 0.04%     -     - 42.0M 0.12%     - 32.8M 0.06%     -
              zoneD  0.00 0.02%     -     - 42.1M 0.12%     - 33.2M 0.06%     -

Interval: 2, Duration: 0:00:10
SUMMARY                  Cpus/Online: 32/32   Physical: 31.8G    Virtual: 47.8G
                    ----------CPU---------- ----PHYSICAL----- -----VIRTUAL-----
               ZONE  USED %PART  %CAP %SHRU  USED   PCT  %CAP  USED   PCT  %CAP
            [total]  0.09 0.30%     -     - 3109M 9.52%     - 7379M 15.0%     -
           [system]  0.01 0.03%     -     - 2797M 8.57%     - 7115M 14.5%     -
             global  0.08 0.51%     -     -  142M 0.43%     -  129M 0.26%     -
              zoneA  0.00 0.02%     -     - 43.7M 0.13%     - 35.4M 0.07%     -
              zoneB  0.00 0.02%     -     - 42.0M 0.12%     - 32.8M 0.06%     -
              zoneC  0.00 0.02%     -     - 42.0M 0.12%     - 32.8M 0.06%     -
              zoneD  0.00 0.02%     -     - 42.1M 0.12%     - 33.2M 0.06%     -

First, note that unlike other Solaris stat tools (e.g. vmstat) the first set of data is not a summary since the system booted. Instead, zonestat pauses for the time interval specified on the command line, at which point it displays data representing the first sample. (Zonestat doesn't actually collect the data. Its companion, zonestatd(1M), performs that service for all zonestat clients.)

Also, you probably noticed those two special lines, "[total]" and "[system]". The lines labeled "[total]" show the total consumption of each resource, across the whole system. The lines labeled "[system]" show resource consumption by the kernel or by processes that aren't associated with any one zone.

Zonestat can produce a great deal of information - more than will fit on one line. Its various options allow you to view summary data - as provided in the default - or to focus on a zone, or on a particular type or instance of a resource, or a combination of those. Obviously, the header will be tailored to the output requested.

The summary header looks like this:

Interval: 1, Duration: 0:00:05
SUMMARY                  Cpus/Online: 32/32   Physical: 31.8G    Virtual: 47.8G
                    ----------CPU---------- ----PHYSICAL----- -----VIRTUAL-----
               ZONE  USED %PART  %CAP %SHRU  USED   PCT  %CAP  USED   PCT  %CAP

The first line of data, per sample, tells you the ordinal number of the sample - not very useful if you're just checking a few seconds of data, but pretty helpful when you're scanning through 3 days' worth of output. The Duration field is similar, but is a measurement of time since the command began.

The SUMMARY line shows the quantity of CPUs that exist in the system and how many of them are online. (I wrote an earlier blog entry about the method that Solaris uses to count CPUs.) That line also shows the system's amount of RAM ("Physical") and Virtual Memory (the size of RAM plus swap space on disk).

The ZONE column contains the name of the zone. The values in that row represent that zone's use of resources. The columns labeled USED show that zone's consumption of each resource. The unit depends on the resource. For CPUs, a value of 1 represents a "Solaris CPU." For memory, the unit is specified in the output.

Besides those generic header elements, some are specific to a resource type. %PART shows the CPU utilization, as a percentage of the compute capacity of the processor set in which the zone's processes run. %CAP is the percentage of the zone's CPU cap which has been used recently (if a cap has been applied to the zone). %SHRU indicates the amount of CPU used as a percentage of the shares assigned to the zone (if the Fair Share Scheduler is in use and shares have been assigned to this zone). The latter may occasionally show a surprising result: a value greater than 100%. I don't have space here to explain the Fair Share Scheduler, but the short version is "FSS enforces a minimum amount of available CPU capacity if there is contention for the CPUs, but it does not enforce a maximum. If there isn't contention, any process which wants to consume CPU cycles can do so - which can lead to a value greater than 100%."
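
To see how a %SHRU value above 100% can arise, here is a rough sketch of the arithmetic. It assumes a simplified model that is not taken from the zonestat source: a zone's entitlement is its fraction of the shares in the processor set multiplied by the number of CPUs in that set, and %SHRU is the CPU used divided by that entitlement. The function name shru and all of the numbers are made up for illustration.

```shell
#!/bin/sh
# shru USED_CPUS ZONE_SHARES TOTAL_SHARES PSET_CPUS
# Simplified model (an assumption, not zonestat's actual code):
#   entitlement = ZONE_SHARES / TOTAL_SHARES * PSET_CPUS
#   %SHRU       = USED_CPUS / entitlement * 100
shru() {
    awk -v used="$1" -v shares="$2" -v total="$3" -v cpus="$4" 'BEGIN {
        entitled = shares / total * cpus
        printf "%.0f\n", used / entitled * 100
    }'
}

# Hypothetical zone holding 10 of 40 shares in a 4-CPU processor set,
# so it is entitled to 1 CPU when there is contention.
shru 0.5 10 40 4    # uses half a CPU
shru 2 10 40 4      # no contention: uses 2 CPUs, twice its entitlement
```

In this model the second call reports 200, because FSS lets the zone use idle CPU cycles beyond the minimum that its shares guarantee.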

The PHYSICAL section shows the amount of RAM used, the portion of the system's memory (PCT) represented by that amount of RAM, and the portion of the zone's RAM cap, if one has been set. The VIRTUAL section has similar fields.
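
Because the columns are whitespace-separated, the human-readable summary is easy to slice with awk. The sketch below pulls the zone name and the two PHYSICAL columns (USED and PCT) out of a few rows copied from the sample output above. The function name zone_memory is my own, and the field positions assume the default 11-column summary layout shown in this entry.

```shell
#!/bin/sh
# Rows copied from the default zonestat output shown earlier.
sample='            [total]  0.10 0.31%     -     - 3109M 9.52%     - 7379M 15.0%     -
           [system]  0.01 0.04%     -     - 2797M 8.57%     - 7115M 14.5%     -
             global  0.08 0.51%     -     -  141M 0.43%     -  129M 0.26%     -
              zoneA  0.00 0.02%     -     - 43.7M 0.13%     - 35.4M 0.07%     -
              zoneB  0.00 0.02%     -     - 42.0M 0.12%     - 32.8M 0.06%     -'

# zone_memory: read summary rows on stdin and print, for each real zone,
# its name, RAM used (field 6) and share of system RAM (field 7).
# The [total] and [system] rows and the column header are skipped.
zone_memory() {
    awk 'NF == 11 && $1 !~ /^\[/ && $1 != "ZONE" { print $1, $6, $7 }'
}

printf '%s\n' "$sample" | zone_memory
```

On a live system you would pipe the output of zonestat itself into the function instead of the saved sample.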

Comparing Usage to Caps

To show the data you might see when a zone has a RAM cap, let's set one. We could do this in zonecfg(1M) for the next time the zone boots, but I don't feel like rebooting the zone, so let's add that cap while the zone runs. First, a quick check of the resource capping daemon. (In these examples, I am logged in as a user who is configured to use non-default administrative privileges. To temporarily gain those privileges, I will use the pfexec(1) command.)
GZ$ svcs rcap
STATE          STIME    FMRI
online         Oct_28   svc:/system/rcap:default

The service is online, and the cap is easy to set:

GZ$ pfexec rcapadm -z zoneB -m 512m
GZ$ zonestat -z zoneB 10 1
Collecting data for first interval...
Interval: 1, Duration: 0:00:10
SUMMARY                  Cpus/Online: 32/32   Physical: 31.8G    Virtual: 47.8G
                    ----------CPU---------- ----PHYSICAL----- -----VIRTUAL-----
               ZONE  USED %PART  %CAP %SHRU  USED   PCT  %CAP  USED   PCT  %CAP
            [total]  0.15 0.49%     -     - 3112M 9.53%     - 7382M 15.0%     -
           [system]  0.01 0.05%     -     - 2797M 8.57%     - 7115M 14.5%     -
              zoneB  0.00 0.10%     -     - 42.5M 0.13% 8.31% 33.3M 0.06%     -

ZoneB is using 42.5MB, which is 0.13% of the system's memory (31.8GB), and 8.31% of the 512MB cap that we set.
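
The two percentages can be checked by hand. This small sketch redoes the arithmetic with awk; the helper name pct is my own, and the conversion of 31.8GB to megabytes assumes 1GB = 1024MB.

```shell
#!/bin/sh
# pct USED TOTAL: print USED as a percentage of TOTAL, to two decimals.
pct() {
    awk -v used="$1" -v total="$2" 'BEGIN { printf "%.2f\n", used / total * 100 }'
}

pct 42.5 512        # zoneB's RAM as a percentage of its 512MB cap
pct 42.5 32563.2    # zoneB's RAM as a percentage of 31.8GB (31.8 * 1024 MB)
```

The first result comes out a hundredth of a percent below the 8.31% that zonestat printed, presumably because zonestat works from exact values rather than the rounded 42.5M that it displays.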

One of zonestat's most useful abilities is focusing on a small part of the data that it can display. The previous example demonstrated its ability to limit the output to one zone. We can also limit the output to just one resource type, or "zoom in" further to one instance of a resource.

Let's limit our view to the RAM used by that zone:

GZ$ zonestat -r physical-memory -z zoneB 10 1
Collecting data for first interval...
Interval: 1, Duration: 0:00:10
PHYSICAL-MEMORY              SYSTEM MEMORY
mem_default                          31.8G
                                ZONE  USED   PCT   CAP  %CAP
                             [total] 3113M 9.53%     -     -
                            [system] 2797M 8.56%     -     -
                               zoneB 42.5M 0.13%  512M 8.31%
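
A report in this shape makes it simple to spot zones that are close to their RAM caps. The following sketch filters rows copied from the output above and prints any zone whose %CAP is at or above a threshold. The function name near_cap and the thresholds are illustrative choices of mine, and the field positions assume the five-column layout shown here.

```shell
#!/bin/sh
# Rows copied from the physical-memory report shown above.
report='                                ZONE  USED   PCT   CAP  %CAP
                             [total] 3113M 9.53%     -     -
                            [system] 2797M 8.56%     -     -
                               zoneB 42.5M 0.13%  512M 8.31%'

# near_cap THRESHOLD: read physical-memory rows on stdin and print the
# name and %CAP of every zone whose %CAP is >= THRESHOLD (in percent).
# Rows without a cap show "-" in the last field and are ignored.
near_cap() {
    awk -v limit="$1" '
        NF == 5 && $5 ~ /%$/ {
            value = $5
            sub(/%$/, "", value)
            if (value + 0 >= limit + 0) print $1, $5
        }'
}

printf '%s\n' "$report" | near_cap 80    # nothing is that close to its cap
printf '%s\n' "$report" | near_cap 5     # zoneB is above 5% of its cap
```
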

We can "zoom out" and look at all of the processor sets and their zone assignments (something that was difficult in Solaris 10):

GZ$ pfexec zoneadm -z zoneA boot
GZ$ pfexec zoneadm -z zoneC boot
GZ$ pfexec zoneadm -z zoneD boot
GZ$ pfexec zonestat -r psets 10 1
Collecting data for first interval...
Interval: 1, Duration: 0:00:10
PROCESSOR_SET                   TYPE  ONLINE/CPUS     MIN/MAX
pset_default            default-pset        16/16         1/-
                                ZONE  USED   PCT   CAP  %CAP   SHRS  %SHR %SHRU
                             [total]  0.17 1.10%     -     -      -     -     -
                            [system]  0.03 0.23%     -     -      -     -     -
                              global  0.13 0.86%     -     -      -     -     -

PROCESSOR_SET                   TYPE  ONLINE/CPUS     MIN/MAX
zoneD                  dedicated-cpu          4/4         4/4
                                ZONE  USED   PCT   CAP  %CAP   SHRS  %SHR %SHRU
                             [total]  1.10 27.6%     -     -      -     -     -
                            [system]  0.00 0.00%     -     -      -     -     -
                               zoneD  1.10 27.6%     -     -      -     -     -

PROCESSOR_SET                   TYPE  ONLINE/CPUS     MIN/MAX
zoneC                  dedicated-cpu          4/4         4/4
                                ZONE  USED   PCT   CAP  %CAP   SHRS  %SHR %SHRU
                             [total]  1.00 25.0%     -     -      -     -     -
                            [system]  0.08 2.00%     -     -      -     -     -
                               zoneC  0.92 23.0%     -     -      -     -     -

PROCESSOR_SET                   TYPE  ONLINE/CPUS     MIN/MAX
zoneB                  dedicated-cpu          4/4         4/4
                                ZONE  USED   PCT   CAP  %CAP   SHRS  %SHR %SHRU
                             [total]  0.00 0.14%     -     -      -     -     -
                            [system]  0.00 0.00%     -     -      -     -     -
                               zoneB  0.00 0.14%     -     -      -     -     -

PROCESSOR_SET                   TYPE  ONLINE/CPUS     MIN/MAX
zoneA                  dedicated-cpu          4/4         4/4
                                ZONE  USED   PCT   CAP  %CAP   SHRS  %SHR %SHRU
                             [total]  0.00 0.14%     -     -      -     -     -
                            [system]  0.00 0.00%     -     -      -     -     -
                               zoneA  0.00 0.14%     -     -      -     -     -
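
As a sanity check, the online CPUs in the processor sets should add up to the system total reported in the earlier SUMMARY lines (32). This sketch sums the ONLINE half of the ONLINE/CPUS column for the pset rows copied from the output above, grouped by pset type. The function name count_cpus is my own.

```shell
#!/bin/sh
# Processor-set rows copied from the output above.
psets='pset_default            default-pset        16/16         1/-
zoneD                  dedicated-cpu          4/4         4/4
zoneC                  dedicated-cpu          4/4         4/4
zoneB                  dedicated-cpu          4/4         4/4
zoneA                  dedicated-cpu          4/4         4/4'

# count_cpus: read pset rows on stdin and total the online CPUs,
# per pset type and overall.
count_cpus() {
    awk '
        NF == 4 && $3 ~ /^[0-9]+\/[0-9]+$/ {
            split($3, n, "/")        # n[1] = online CPUs, n[2] = CPUs in the set
            online[$2] += n[1]
            all += n[1]
        }
        END {
            printf "default-pset %d\n", online["default-pset"]
            printf "dedicated-cpu %d\n", online["dedicated-cpu"]
            printf "total %d\n", all
        }'
}

printf '%s\n' "$psets" | count_cpus
```

The 16 CPUs in the default set plus four dedicated sets of 4 CPUs each account for all 32 CPUs.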


With the basics out of the way, next time I will discuss some other options that display other data and organize the output in different ways.


Wednesday Dec 01, 2010

Solaris 11 Express Podcast

A brief podcast discusses some of the major enhancements in Solaris 11 Express.

Thursday Aug 12, 2010

First Light: Solaris 11

Recently, John Fowler (an Oracle Executive VP) announced some plans for Solaris 11. Plans include introduction of Solaris 11 in 2011. He also declared that Solaris 11 would be "as large a product release as Solaris 10 was." You can view the webcast and download the slide deck at: http://www.oracle.com/dm/11h1corp/53947_systems_strategy_webcast.html.
About

Jeff Victor writes this blog to help you understand Oracle's Solaris and virtualization technologies.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
