Thursday Jun 27, 2013

Improving Manageability of Virtual Environments

Boot Environments for Solaris 10 Branded Zones

Until recently, Solaris 10 Branded Zones on Solaris 11 suffered one notable regression: Live Upgrade did not work. The individual packaging and patching tools worked correctly, but the ability to upgrade Solaris while the production workload continued running did not exist. A recent Solaris 11 SRU (Solaris 11.1 SRU 6.4) restored most of that functionality, although with a slightly different concept, different commands, and without all of the feature details. This new method gives you the ability to create and manage multiple boot environments (BEs) for a Solaris 10 Branded Zone, to modify the active BE or any inactive BE, and to do so while the production workload continues to run.

Background

In case you are new to Solaris: Solaris includes a set of features that enables you to create a bootable Solaris image, called a Boot Environment (BE). This newly created image can be modified while the original BE is still running your workload(s). There are many benefits, including improved uptime and the ability to reboot into (or downgrade to) an older BE if a newer one has a problem.

In Solaris 10 this set of features was named Live Upgrade. Solaris 11 applies the same basic concepts to the new packaging system (IPS) but there isn't a specific name for the feature set. The features are simply part of IPS. Solaris 11 Boot Environments are not discussed in this blog entry.

Although a Solaris 10 system can have multiple BEs, until recently a Solaris 10 Branded Zone (BZ) in a Solaris 11 system did not have this ability. This limitation was addressed recently, and that enhancement is the subject of this blog entry.

This new implementation uses two concepts. The first is the use of a ZFS clone for each BE. This makes it very easy to create a BE, or many BEs. This is a distinct advantage over the Live Upgrade feature set in Solaris 10, which had a practical limitation of two BEs on a system, when using UFS. The second new concept is a very simple mechanism to indicate the BE that should be booted: a ZFS property. The new ZFS property is named com.oracle.zones.solaris10:activebe (isn't that creative? ;-) ).

It's important to note that the property is inherited from the original BE's file system to any BEs you create. In other words, all BEs in one zone have the same value for that property. When the (Solaris 11) global zone boots the Solaris 10 BZ, it boots the BE that has the name that is stored in the activebe property.

Here is a quick summary of the actions you can use to manage these BEs:

To create a BE:

  • Create a ZFS clone of the zone's root dataset

To activate a BE:

  • Set the ZFS property of the root dataset to indicate the BE

To add a package or patch to an inactive BE:

  • Mount the inactive BE
  • Add packages or patches to it
  • Unmount the inactive BE

To list the available BEs:

  • Use the "zfs list" command.

To destroy a BE:

  • Use the "zfs destroy" command.

Preparation

Before you can use the new features, you will need a Solaris 10 BZ on a Solaris 11 system. You can use these three steps - on a real Solaris 11.1 server or in a VirtualBox guest running Solaris 11.1 - to create a Solaris 10 BZ. The Solaris 11.1 environment must be at SRU 6.4 or newer.

  1. Create a flash archive on the Solaris 10 system
    s10# flarcreate -n s10-system /net/zones/archives/s10-system.flar
  2. Configure the Solaris 10 BZ on the Solaris 11 system
    s11# zonecfg -z s10z
    Use 'create' to begin configuring a new zone.
    zonecfg:s10z> create -t SYSsolaris10
    zonecfg:s10z> set zonepath=/zones/s10z
    zonecfg:s10z> exit
    s11# zoneadm list -cv
      ID NAME             STATUS     PATH                           BRAND     IP    
       0 global           running    /                              solaris   shared
       - s10z             configured /zones/s10z                    solaris10 excl  
    
  3. Install the zone from the flash archive
    s11# zoneadm -z s10z install -a /net/zones/archives/s10-system.flar -p
    

You can find more information about the migration of Solaris 10 environments to Solaris 10 Branded Zones in the documentation.

The rest of this blog entry demonstrates the commands you can use to accomplish the aforementioned actions related to BEs.

New features in action

Note that the demonstration of the commands occurs in the Solaris 10 BZ, as indicated by the shell prompt "s10z# ". Many of these commands can be performed in the global zone instead, if you prefer. If you perform them in the global zone, you must change the ZFS file system names.
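For example, if the zonepath /zones/s10z corresponds to a dataset named rpool/zones/s10z - an assumption about the pool layout, so verify the real names with "zfs list" first - the global-zone equivalents of the commands below would look roughly like this:

s11# zfs list -r rpool/zones/s10z/rpool/ROOT
s11# zfs get com.oracle.zones.solaris10:activebe rpool/zones/s10z/rpool/ROOT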

Create

The only complicated action is the creation of a BE. In the Solaris 10 BZ, create a new "boot environment" - a ZFS clone. You can assign any name to the final portion of the clone's name, as long as it meets the requirements for a ZFS file system name.

s10z# zfs snapshot rpool/ROOT/zbe-0@snap
s10z# zfs clone -o mountpoint=/ -o canmount=noauto rpool/ROOT/zbe-0@snap rpool/ROOT/newBE
cannot mount 'rpool/ROOT/newBE' on '/': directory is not empty
filesystem successfully created, but not mounted
You can safely ignore that message: we already know that / is not empty! We have merely told ZFS that the default mountpoint for the clone is the root directory.

(Note that a Solaris 10 BZ that has a separate /var file system requires additional steps. See the MOS document mentioned at the bottom of this blog entry.)

List the available BEs and active BE

Because each BE is represented by a clone of the rpool/ROOT dataset, listing the BEs is as simple as listing the clones.

s10z# zfs list -r rpool/ROOT
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT        3.55G  42.9G    31K  legacy
rpool/ROOT/zbe-0     1K  42.9G  3.55G  /
rpool/ROOT/newBE  3.55G  42.9G  3.55G  /
The output shows that two BEs exist. Their names are "zbe-0" and "newBE".

You can tell Solaris that one particular BE should be used when the zone next boots by using a ZFS property. Its name is com.oracle.zones.solaris10:activebe. The value of that property is the name of the clone that contains the BE that should be booted.

s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT
NAME        PROPERTY                             VALUE  SOURCE
rpool/ROOT  com.oracle.zones.solaris10:activebe  zbe-0  local

Change the active BE

When you want to change the BE that will be booted next time, you can just change the activebe property on the rpool/ROOT dataset.

s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT
NAME        PROPERTY                             VALUE  SOURCE
rpool/ROOT  com.oracle.zones.solaris10:activebe  zbe-0  local
s10z# zfs set com.oracle.zones.solaris10:activebe=newBE rpool/ROOT
s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT
NAME        PROPERTY                             VALUE  SOURCE
rpool/ROOT  com.oracle.zones.solaris10:activebe  newBE  local
s10z# shutdown -y -g0 -i6
After the zone has rebooted:
s10z# zfs get com.oracle.zones.solaris10:activebe rpool/ROOT
rpool/ROOT  com.oracle.zones.solaris10:activebe  newBE  local
s10z# zfs mount
rpool/ROOT/newBE                /
rpool/export                    /export
rpool/export/home               /export/home
rpool                           /rpool
Mount the original BE to see that it's still there.
s10z# zfs mount -o mountpoint=/mnt rpool/ROOT/zbe-0
s10z# ls /mnt
Desktop                         export                          platform
Documents                       export.backup.20130607T214951Z  proc
S10Flar                         home                            rpool
TT_DB                           kernel                          sbin
bin                             lib                             system
boot                            lost+found                      tmp
cdrom                           mnt                             usr
dev                             net                             var
etc                             opt

Patch an inactive BE

At this point, you can modify the original BE. If you would prefer to modify the new BE, you can restore the original value to the activebe property and reboot, and then mount the new BE to /mnt (or another empty directory) and modify it.
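If you take that route, the sequence is simply the mirror image of the commands shown above (a sketch; substitute your own BE names):

s10z# zfs set com.oracle.zones.solaris10:activebe=zbe-0 rpool/ROOT
s10z# shutdown -y -g0 -i6
After the zone has rebooted:
s10z# zfs mount -o mountpoint=/mnt rpool/ROOT/newBE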

Let's mount the original BE so we can modify it. (The first command is only needed if you haven't already mounted that BE.)

s10z# zfs mount -o mountpoint=/mnt rpool/ROOT/zbe-0
s10z# patchadd -R /mnt -M /var/sadm/spool 104945-02
Note that the typical usage, sketched after this list, will be:
  1. Create a BE
  2. Mount the new (inactive) BE
  3. Use the package and patch tools to update the new BE
  4. Unmount the new BE
  5. Reboot
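Putting those steps together, a consolidated sketch looks like this (the patch ID is a placeholder, and a zone with a separate /var file system needs the extra steps from the MOS note):

s10z# zfs snapshot rpool/ROOT/zbe-0@snap
s10z# zfs clone -o mountpoint=/ -o canmount=noauto rpool/ROOT/zbe-0@snap rpool/ROOT/newBE
s10z# zfs mount -o mountpoint=/mnt rpool/ROOT/newBE
s10z# patchadd -R /mnt -M /var/sadm/spool <patch-id>
s10z# zfs umount rpool/ROOT/newBE
s10z# zfs set com.oracle.zones.solaris10:activebe=newBE rpool/ROOT
s10z# shutdown -y -g0 -i6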

Delete an inactive BE

ZFS clones are children of their parent file systems. In order to destroy the parent, you must first "promote" the child. This reverses the parent-child relationship. (For more information on this, see the documentation.)

The original BE's file system (rpool/ROOT/zbe-0 in this example) is the clone origin of the BEs that you create from it. In order to destroy an earlier BE that is the origin of other BEs, you must first promote one of its child BEs to be the ZFS parent. Only then can you destroy the original BE.

Fortunately, this is easier to do than to explain:

s10z# zfs promote rpool/ROOT/newBE 
s10z# zfs destroy rpool/ROOT/zbe-0
s10z# zfs list -r rpool/ROOT
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT        3.56G   269G    31K  legacy
rpool/ROOT/newBE  3.56G   269G  3.55G  /

Documentation

This feature is so new, it is not yet described in the Solaris 11 documentation. However, MOS note 1558773.1 offers some details.

Conclusion

With this new feature, you can add packages and patches to the boot environments of a Solaris 10 Branded Zone. This ability improves the manageability of these zones and makes their use more practical. It also means that you can use the existing P2V tools with earlier Solaris 10 updates, and then modify the environments after they become Solaris 10 Branded Zones.

Tuesday Aug 02, 2011

Solaris Zones Optimize Real Workloads

Oracle published two Optimized Solutions last week that utilize Oracle Solaris Containers.

The first is for Oracle WebCenter Suite. The Optimized Solution shows how one server can support more than 1,000 users for WebCenter Spaces. The announcement includes links to business-focused and technical white papers.

The second is for Agile Product Lifecycle Management. The announcement includes links to business-focused and technical white papers.

Each optimized solution showcases the ability to use Oracle Solaris Containers to optimize performance of multiple workloads within one consolidated server.

Wednesday Jul 13, 2011

Solaris Zones help achieve World Record Benchmark Result

Maximizing performance of multi-node workloads can be challenging. Should I maximize CPU clock rate, or RAM size per node, or network bandwidth? And how do I analyze performance of each component while also measuring aggregate throughput? Solaris Zones provide characteristics that are useful for multi-node architectures:
  • Architectural flexibility: easily remove a network bottleneck between two components by running both in zones on one Solaris server - and move any of them to different servers later as processing needs change
  • Convenient, dynamic resource management: assign a workload to a set of CPUs for predictable, consistent performance, ensure that each workload component has access to sufficient hardware resources, and so on.
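As one illustration of that kind of control - the zone name and CPU count here are placeholders, not the benchmark's actual configuration - dedicating CPUs to a zone takes only a few lines of zonecfg:

global# zonecfg -z appzone
zonecfg:appzone> add dedicated-cpu
zonecfg:appzone:dedicated-cpu> set ncpus=8
zonecfg:appzone:dedicated-cpu> end
zonecfg:appzone> exit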
These characteristics are displayed in the world record benchmark result for Oracle JD Edwards EnterpriseOne. It was achieved using Solaris Containers (Zones) to isolate individual software components, including the WebLogic-based applications and Web Tier Utilities.

Solaris Zones features enabled software isolation and resource management, making the process of fine-tuning resource assignment very easy. For more details, see the announcement of the benchmark result.

Friday Jul 08, 2011

Downloadable Database in a Solaris Zone

To simplify the process of creating Oracle databases, Oracle has released two Solaris Zones with Oracle 11gR2 pre-installed. You can simply download the appropriate template and "attach" it to your x86 or SPARC system running Solaris 10 (update 10/09 or newer).

Links to the downloads are at oracle.com.

Of course, you must have a valid license to run 11gR2 on that computer.

Monday May 23, 2011

Oracle DB 11gR2 Certified for Solaris Containers

Just a short entry today: last week Oracle completed certification of Oracle RAC 11gR2 (with Clusterware) on Oracle Solaris 10 Containers ("Zones").

For details, see http://www.oracle.com/technetwork/database/virtualizationmatrix-172995.html .

A paper "Best Practices for Deploying Oracle RAC Inside Oracle Solaris Containers" is also available, but is not specific to 11gR2.

This extends the previous certifications of Oracle RAC (9iR2, 10gR2, 11gR1) on Solaris Containers.

Wednesday Dec 09, 2009

Virtual Overhead?

So you're wondering about operating system efficiency or the overhead of virtualization. How about a few data points?

SAP created benchmarks that measure transaction performance. One of them, the SAP SD, 2-Tier benchmark, behaves more like real-world workloads than most other benchmarks, because it exercises all of the parts of a system: CPUs, memory access, I/O and the operating system. The other factor that makes this benchmark very useful is the large number of results submitted by vendors. This large data set enables you to make educated performance comparisons between computers, or operating systems, or application software.

A couple of interesting comparisons can be made from this year's results. Many submissions use the same hardware configuration: two Nehalem (Xeon X5570) CPUs (8 cores total) running at 2.93 GHz, and 48GB RAM (or more). Submitters used several different operating systems: Windows Server 2008 EE, Solaris 10, and SuSE Linux Enterprise Server (SLES) 10. Also, two results were submitted using some form of virtualization: Solaris 10 Containers and SLES 10 on VMware ESX Server 4.0.

Operating System Comparison

The first interesting comparison is of different operating systems and database software, on the same hardware, with no virtualization. Using the hardware configuration listed above, the following results were submitted. The Solaris 10 and Windows results are the best results on each of those operating systems, on this hardware. The SLES 10 result is the best of any Linux distro, with any DB software, on the same hardware configuration.

Operating System          DB               Result (SAPS)
Solaris 10                Oracle 10g            21,000
Windows Server 2008 EE    SQL Server 2008       18,670
SLES 10                   MaxDB 7.8             17,380

(Note that all of the results submitted in 2009 cannot be compared against results from previous years because SAP changed the workload.)

With those data points, it's very easy to conclude that for transactional workloads, the combination of Solaris 10 and Oracle 10g is roughly 20% more powerful than Linux and MaxDB.

Virtualization Comparison

The virtualization comparison is also interesting. The same benchmark was run using Solaris 10 Containers and 8 vCPUs. It was also run using SLES 10 on VMware ESX, also using 8 vCPUs.

Operating System    Virtualization        DB           Result (SAPS)
Solaris 10          Solaris Containers    Oracle 10g        15,320
SLES 10             VMware ESX            MaxDB 7.8         11,230

Interpretation

Some of the 36% advantage of the Solaris Containers result is due to the operating systems and DB software, as we saw above. But the rest is due to the virtualization tools. The virtualized and non-virtualized results for each OS had only one difference: virtualization was used. For example, the two Solaris 10 results shown above used the same hardware, the same OS, the same DB software and the same workload. The only difference was the use of Containers and the limitation of 8 vCPUs.

If we assume that Solaris 10/Oracle 10g is consistently 21% more powerful than SLES 10/MaxDB on this benchmark, then it's easy to conclude that VMware ESX has 13% more overhead than Solaris Containers when running this workload.
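To make the arithmetic behind that conclusion explicit, using the SAPS numbers from the tables above:

21,000 / 17,380 ≈ 1.21  (non-virtualized advantage of Solaris 10/Oracle 10g over SLES 10/MaxDB)
15,320 / 11,230 ≈ 1.36  (virtualized advantage of Containers/Oracle 10g over VMware ESX/SLES 10/MaxDB)
   1.36 / 1.21  ≈ 1.13  (the remaining ~13%, attributable to the difference in virtualization overhead)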

However, the non-virtualized performance advantage of the Solaris 10 configuration over that of SLES 10 may be different with 8 vCPUs than with 8 cores. If Solaris' advantage is smaller at 8 vCPUs, then the overhead of VMware is even worse. If the Solaris advantage at 8 vCPUs is larger than the non-virtualized results suggest, then the real overhead of VMware is not quite that bad. Without more data, it's impossible to know.

But one of those three cases (same, less, more) is true. And the claims by some people that VMware ESX has "zero" or "almost no" overhead are clearly untrue, at least for transactional workloads. For compute-intensive workloads, like HPC, the overhead of software hypervisors like VMware ESX is typically much smaller.

What Does All That Mean?

What does that overhead mean for real applications? Extra overhead means longer response times for transactions or fewer users per workload, or both. It also means that fewer workloads (guests) can be configured per system.

In other words, response time should be better (or maximum number of users should be greater) if your transactional workload is running in a Solaris Container rather than in a VMware ESX guest. And when you want to add more workloads, Solaris Containers should support more of those workloads than VMware ESX, on the same hardware.

Qualification

Of course, the comparison shown above only applies to certain types of workloads. You should test your workload on different configurations before committing yourself to one.

Disclosure

For more detail, see the results for yourself.

SAP wants me to include the results:
Best result for Solaris 10 on 2-way X5570, 2.93GHz, 48GB:
Sun Fire X4270 (2 processors, 8 cores, 16 threads) 3,800 SAP SD Users, 21,000 SAPS, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, Oracle 10g, Solaris 10, Cert# 2009033.
Best result for any Linux distro on 2-way X5570, 2.93GHz, 48GB:
HP ProLiant DL380 G6 (2 processors, 8 cores, 16 threads) 3,171 SAP SD Users, 17,380 SAPS, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, MaxDB 7.8, SuSE Linux Enterprise Server 10, Cert# 2009006.
Result on Solaris 10 using Solaris Containers and 8 vCPUs:
Sun Fire X4270 (2 processors, 8 cores, 16 threads) run in 8 virtual cpu container, 2,800 SAP SD Users, 2x 2.93 GHz Intel Xeon X5570, 48 GB memory, Oracle 10g, Solaris 10, Cert# 2009034.
Result on SuSE Enterprise Linux as a VMware guest, using 8 vCPUs:
Fujitsu PRIMERGY Model RX300 S5 (2 processors, 8 cores, 16 threads) 2,056 SAP SD Users, 2x 2.93 GHz Intel Xeon X5570, 96 GB memory, MaxDB 7.8, SUSE Linux Enterprise Server 10 on VMware ESX Server 4.0, Cert# 2009029.
SAP and R/3 are registered trademarks of SAP AG in Germany and other countries.

Addendum, added December 10, 2009:

Today an associate reminded me that previous SAP SD 2-tier results demonstrated the overhead of Solaris Containers. Sun ran four copies of the benchmark on one system, simultaneously, one copy in each of four Solaris Containers. The system was a Sun Fire T2000, with a single 1.2GHz SPARC processor, running Solaris 10 and MaxDB 7.5:

  1. 2006029
  2. 2006030
  3. 2006031
  4. 2006032

The same hardware and software configuration - but without Containers - already had a submission:
2005047

The sum of the results for the four Containers can be compared to the single result for the configuration without Containers. The single system outpaced the four Containers by less than 1.7%.

Second Addendum, also added December 10, 2009:

Although this blog entry focused on a comparison of performance overhead, there are other good reasons to use Solaris Containers in SAP deployments. At least 10, in fact, as shown in this slide deck. One interesting reason is that Solaris Containers is the only server virtualization technology supported by both SAP and Oracle on x86 systems.

Monday May 04, 2009

Layered Virtualization

It's time for another guest blogger.

Solving a Corner Case

One of my former colleagues, Joe Yanushpolsky (josephy100 -AT- gmail.com), was recently involved in moving a latency-sensitive Linux application to Solaris as part of a platform consolidation. The code was old and required access to kernel routines not available under BrandZ. Using VirtualBox to provide a virtual x86 system made the task easier than expected.

Background

VirtualBox enables you to run multiple x86-based operating system "guests" on an x86 computer - desktop or server. Unlike other virtualization tools, like VMware ESX, VirtualBox allows you to keep your favorite operating system as the 'base' operating system. This is called a Type 2 hypervisor. For existing systems - especially desktops and laptops - this means you can keep your current setup and applications and maintain their current performance. Only the guests will have reduced performance - more on that later.

Here is Joe's report of his tests.

The goals included allowing many people to independently run this application while sharing a server. It would be important to isolate each user from other users. But the resource controls included with VirtualBox were not sufficiently granular for the overall purpose. Solaris Containers (zones) have a richer set of resource controls. Would it be possible to combine Containers and VirtualBox?

The answer was 'yes' - I tried two slightly different methods. Each method starts by installing VirtualBox in the global zone to set up a device entry and some of the software. Details are provided later. After that is complete, the two methods differ.

  1. Create a Container and install VirtualBox in it. This is the Master WinXP VirtualBox (MWVB) Container. If any configuration steps specific to a WinXP environment are needed, they can be done now. When a Windows XP environment is needed, clone the MWVB Container and install WinXP in the clone. Management of the Container can be delegated to the user of the WinXP environment if you want.
  2. Create a Container and install VirtualBox in it. This is the Master CentOS VirtualBox (MCVB) Container. Install CentOS in the Container. When a CentOS environment is needed, clone the MCVB - including the copy of CentOS that's already in the Container - to create a new Container. Management of the Container can be delegated to the user of the CentOS environment if you want.
In each case, resource controls can be applied to the Container to ensure that everyone gets a fair share of the system's resources like CPU, RAM, virtual memory, etc.

When the process is complete, you have a guest OS, shown here via X Windows.

CentOS picture

Not only did the code run well, but it did so in a sparse-root non-global zone.

Well that was easy! How about Windows?
Windows XP picture

Now, this is interesting. As long as the client VM is supported by VirtualBox, it can be installed and run in a Solaris/OpenSolaris Container. I immediately thought of several useful applications of this combination of virtualization technologies:

  • migrate existing applications that are deemed "unmovable" to latest eco-friendly x64 (64-bit x86) platforms
  • reduce network latency of distributed applications by collapsing the network onto a large memory system with zones, regardless of which OS the application components were originally written in
  • provision, on demand and as a service, an entire development environment for Linux or Windows developers. When using ZFS, this could be accomplished in seconds - is this a "poor man's" cloud or what?!
  • eliminate ISV support issues that are currently associated with BrandZ's lack of support for recent Linux kernels or the Solaris 8 and 9 kernels
  • what else can you create?
Best of all, Solaris, OpenSolaris and VirtualBox can be downloaded and used free of charge. Simple to build, easy to deploy, low overhead, free - I love it!

Performance

The advantage of having access to application code through Containers more than compensated for a 5% overhead (on a laptop) due to having a second kernel. The overall environment seems to be disk-sensitive (SSDs to the rescue!). Given that typical server load in a large IT shop is 15-20%, a number of such "foreign" zones could be added without impacting overall server performance.

Future Investigations

It would be interesting to evaluate scalability of the overall environment by testing different resource controls in Solaris Containers and in VirtualBox. I'd need a machine bigger than the laptop for that :-).

Installation Details

Here are the highlights of "How to install"; a consolidated zonecfg sketch follows the list. For more details, follow the instructions in the VirtualBox User manual.

  • Install VirtualBox on a Solaris x64 machine in the global zone so that the vboxdrv driver is available in the Solaris kernel.
  • Create a target zone with access to the vboxdrv device ("add device; set match=/dev/vboxdrv; end").
  • In the zone, clean up the artifacts of the VirtualBox installation that was performed in the global zone. All you need to do is uninstall the SUNWvbox package and remove references to the /opt/VirtualBox directory.
  • Install VirtualBox package in the zone.
  • Copy the OS distro into a file system in the global zone (e.g. /export/distros/centos.iso), and configure a loopback mount into the zone ("add fs; set dir=/mnt/images; set special=/export/distros; set type=lofs; end").
  • Start VirtualBox in the zone and install the client OS distro.
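Here is a consolidated sketch of the zonecfg portion of those steps; the zone name, zonepath, and paths are illustrative placeholders:

global# zonecfg -z vboxzone
zonecfg:vboxzone> create
zonecfg:vboxzone> set zonepath=/zones/vboxzone
zonecfg:vboxzone> add device
zonecfg:vboxzone:device> set match=/dev/vboxdrv
zonecfg:vboxzone:device> end
zonecfg:vboxzone> add fs
zonecfg:vboxzone:fs> set dir=/mnt/images
zonecfg:vboxzone:fs> set special=/export/distros
zonecfg:vboxzone:fs> set type=lofs
zonecfg:vboxzone:fs> end
zonecfg:vboxzone> exit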
What advantages does this model have over other virtualization solutions?
  • The Solaris kernel is the software layer closest to the hardware. With Solaris, you benefit from the industry-leading scalability of Solaris and all of its innovations, like:
    • ZFS for data protection - currently, neither Windows nor Linux distros have ZFS. You can greatly improve storage robustness of your Windows or Linux system by running it as a VirtualBox guest.
    • SMF/FMA, which allows the whole system to tolerate hardware problems
    • DTrace, which allows you to analyze system performance issues while the apps are running. Although you can use DTrace in the 'base' Solaris OS environment to determine which guest is causing the performance issue, and whether the problem is network I/O, disk I/O, or something else, DTrace will not be able to "see" into VirtualBox guests to help figure out which particular application is the culprit - unless the guest is running Solaris, in which case you run DTrace in the guest!
  • Cost: You can download and use Solaris and OpenSolaris without cost. You can download and use VirtualBox without cost. Some Linux distros are also free. What costs less than 'free?'
What can you do with this concept? Here are some more ideas:
  • Run almost any Linux apps on a Solaris system by running that Linux distro in VirtualBox - or a combination of different Linux distros.
  • Run multiple Windows apps - even on different versions of Windows - on Solaris.
Additional notes are available from the principal investigator, Joseph Yanushpolsky: josephy100 -AT- gmail.com .

Friday May 01, 2009

Zonestat 1.4 Bug Fixes

Attention Zonestat Fans: three bugs have been found in Zonestat 1.4. Two of them only happen if a zone boots or halts while using Zonestat. The third only happens if a zoneID is larger than 999.

I fixed all three bugs and posted v1.4.1 on the web site: http://opensolaris.org/os/project/zonestat/

Specifically:

  • Bug: if a zone with dedicated CPUs is booted between poolcfg and output, zonestat can get confused or halt.
  • Bug: if a zone is halted between "zoneadm list" and kstats, zonestat can get confused or halt.
  • Bug: zones with a zoneID number > 999 are ignored.

Wednesday Apr 08, 2009

Zonestat 1.4 Now Available

I have posted Zonestat v1.4 at the Zone Statistics project page (click on "Files" in the left navbar).

Zonestat is a 'dashboard' for Solaris Containers. It shows resource consumption of each Container (aka Zone) and a comparison of consumption against limits you have set.

Changes from v1.3:

  • BugFix: various failures if the pools service was not online. V1.4 checks for the existence of the pools packages, and behaves correctly whether they are installed and enabled, or not.
  • BugFix: various symptoms if the rcapd service was not online. V1.4 checks for the existence of the rcap packages, and behaves correctly whether they are installed and enabled, or not.
  • BugFix: mis-reported shared memory usage
  • BugFix: -lP produced human-readable, not machine-parseable output
  • Bug/RFE: detect and fail if zone != global or user != root
  • RFE: Prepare for S10 update numbers past U6
  • RFE: Add option to print entire name of zones with long names
  • RFE: Add timestamp to machine-consumable output
  • RFE: improve performance and correctness by collecting CPU% with DTrace instead of prstat

Note that the addition of a timestamp to -P output changes the output format for "machine-readable" output.

For most people, the most important change will be the use of DTrace to collect CPU% data. This has two effects. The first effect is improved correctness. The prstat command, used in V1.3 and earlier, can horribly underestimate CPU cycles consumed because it can miss many short-lived processes. The mpstat command has its own problems with miscounting CPU usage. So I expanded on a solution Jim Fiori offered, which uses DTrace to answer the question "which zone is using a CPU right now?"
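The idea, reduced to a naive one-liner, looks something like this. (This is a sketch of the approach, not the code zonestat actually ships; among other simplifications, it charges idle time to the global zone.)

global# dtrace -n 'profile-997 { @[zonename] = count(); } tick-10sec { printa(@); exit(0); }'

Sampling at 997Hz and aggregating on the built-in zonename variable gives a rough per-zone share of CPU time over the interval.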

The other benefit of using DTrace is the improvement in Zonestat's performance.

The less popular, but still interesting additions include:

  • -N expands the width of the zonename field to the length of the longest zone name. This preserves the entire zone name, for all zones, and also leaves the columns lined up. However, the length of the output lines will exceed 80 characters.
  • The new timestamp field in -P output makes it easier for tools like the "System Data Recorder" (SDR) to consume zonestat output. However, this was a change to the output format. If you have written a script which used -P and assumed a specific format for zonestat output, you must change your script to understand the new format.

Please send questions and requests to zones-discuss@opensolaris.org .

Wednesday Apr 01, 2009

Patching Zones Goes Zoom!

Accelerated Patching of Zoned Systems

Introduction

If you have patched a system with many zones, you have learned that it takes longer than patching a system without zones. The more zones there are, the longer it takes. In some cases, this can raise application downtime to an unacceptable duration.

Fortunately, there are a few methods which can be used to reduce application downtime. This document mentions many of them, and then describes the performance enhancements of two of them. But the bulk of this rather bulky entry is the description and results of my newest computer... "experiments."

Executive Summary, for the Attention-Span Challenged

It's important to distinguish between application downtime, service downtime, zone downtime, and platform downtime. 'Service' is the service being provided by an application or set of applications. To users, that's the most important measure. As long as they can access the service, they're happy. (Doesn't take much, does it?)

If a service depends on the proper operation of each of its component applications, planned or unplanned downtime of one application will result in downtime of the service. Some software, e.g. web server software, can be deployed in multiple, load-balanced systems so that the service will not experience downtime even if one of the software instances is down.

Applying an operating system patch may require service downtime, application downtime, zone downtime or platform downtime, depending on the patch and the entity being patched. Because in many cases patch application will require application downtime, the goal of the methods mentioned below, especially parallel patching of zones, is to minimize elapsed downtime to achieve a patched, running system.

Just Enough Choices to Confuse

Methods that people use - or will soon use - to improve the patching experience of zoned systems include:

  • Live Upgrade allows you to copy the existing Solaris instance into an "alternative boot environment," patch or upgrade the ABE, and then re-boot into it. Downtime of the service, application, or zone is limited to the amount of time it takes to re-boot the system. Further, if there's a problem, you can easily re-boot back into the original boot environment. Bob Netherton describes this in detail on his weblog. Maybe the software should have a secondary name: Live Patch.

  • You can detach all of the zones on the system, patch the system (which doesn't bother to patch the zones) and then re-attach the zones using the "update-on-attach" method which is also described here. This method can be used to reduce service downtime and application downtime, but not as much as the Live Upgrade / Live Patch method. Each zone (and its application(s)) will be down for the length of time to patch the system - and perhaps reboot it - plus the time to update/attach the zone.

  • You can apply the patch to another Solaris 10 system with contemporary or newer patches, and migrate the zones to that system. Optionally, you can patch the original system and migrate the zones back to it. Downtime of the zones is less than the previous solution because the new system is already patched and rebooted.

  • You can put the zones' zonepath (i.e. install the zones onto) very fast storage, e.g. an SSD or a storage array with battery-backed DRAM or NVRAM. The use of SSDs is described below. This method can be used in conjunction with any of the other solutions. It will speed up patching because patching is I/O intensive. However, this type of storage device is more expensive per MB, so this solution may not make fiscal sense in many situations.

  • Sun has developed an enhancement to the Solaris patching tools which is intended to significantly decrease the elapsed time of patching. It is currently being tested at a small number of sites. After it's released you can get the Zones Parallel Patching patch, described below. This solution decreases the elapsed time to patch a system. It can be combined with some of the solutions above, with varying benefits. For example, with Live Upgrade, parallel patching reduces the time to patch the ABE, but doesn't reduce service downtime. Also, ZPP offers little benefit for the detach/attach-on-upgrade method. However, as a stand-alone method, ZPP offers significant reduction in elapsed time without changing your current patching process. ZPP was mentioned by Gerry Haskins, Director of Software Patch Services.

Disclaimer 1: the "Zones Parallel Patching" patch ("ZPP") is still in testing and has not yet been released. It is expected to be released mid-CY2009. That may change. Further, the specific code changes may change, which may change the results described below.

Disclaimer 2: the experiment described below, and its results, are specific to one type of system (Sun Fire T2000) and one patch (120543-14 - "the Apache patch"). Other hardware and other patches will produce different results.

Yet More Background

I wanted to better understand two methods of accelerating the patching of zoned systems, especially when used in combination. Currently, a patch applied to the global zone will normally be applied to all non-global zones, one zone at a time. This is a conservative approach to the task of patching multiple zones, but doesn't take full advantage of the multi-tasking abilities of Solaris.

I learned that a proposed patch was created that enables the system administrator to apply a patch in the global zone which patches the global and then patches multiple zones at the same time. The parallelism (i.e. "the number of zones that are patched at one time") can be chosen before the patch is applied. If there are multiple "Solaris CPUs" in the system, multiple CPUs can be performing computational steps at the same time. Even if there aren't many CPUs, one zone's patching process can be using a CPU while another's is writing to a disk drive.

<tangent topic="Solaris vCPU"> I use the phrase "Solaris CPUs" to refer to the view that Solaris has of CPUs. In the old days, a CPU was a CPU - one chip, one computational entity, one ALU, one FPU, etc. Now there are many factors to consider - CPU sockets, CPU cores per socket, hardware threads per core, etc. Solaris now considers "vCPUs" - virtual processors - as the entities on which to schedule processes. Solaris considers each of these a vCPU:

  • x86/x64 systems: a CPU core (today, can be one to six per socket, with a maximum of 24 vCPUs per system, ignoring some exotic, custom high-scale x86 systems)
  • UltraSPARC-II, -III[+], -IV[+]: a CPU core, max of 144 in an E25K
  • SPARC64-VI, -VII: a CPU core: max of 256 in an M9000
  • SPARC CMT (SPARC-T1, -T2+): a hardware thread, maximum of 256 in a T5440
</tangent>
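To see how many vCPUs Solaris has enumerated on a particular box (an aside, not part of the experiment below), psrinfo shows both views:

# psrinfo -pv        (physical view: sockets, cores, hardware threads)
# psrinfo | wc -l    (one line per vCPU, i.e. per entity Solaris schedules on)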

Separately, I realized that one part of patching is disk-intensive. Many disk-intensive workloads benefit from writing to a solid-state disk (SSD) because of the performance benefit of those devices over spinning-rust disk drives (HDD).

So finally (hurrah!) the goal of this adventure: how much performance advantage would I achieve with the combination of parallel patching and an SSD, compared to sequential patching of zones on an HDD?

He Finally Begins to Get to the Point

I took advantage of an opportunity to test both of these methods to accelerate patching. The system was a Sun Fire T2000 with two HDDs and one SSD. The system had 32 vCPUs, was not using Logical Domains, and was running Solaris 10 10/08. Solaris was installed on the first HDD. Both HDDs were 72GB drives. The SSD was a 32GB device. (Thank you, Pat!)

For some of the tests I also applied the ZPP. (Thank you, Enda!) For some of the tests I used zones that had zonepaths on the SSD; the rest 'lived' on the second HDD.

As with all good journeys, this one had some surprises. And, as with all good research reports, this one has a table with real data. And graphs later on.

To get a general feel for the different performance of an HDD vs. an SSD, I created a zone on each - using the secondary HDD - and made clones of it. Some times I made just one clone at a time, other times I made ten clones simultaneously. The iostat(1) tool showed me the following performance numbers:

                      r/s  w/s  kr/s  kw/s  wait  actv  svc_t  %w  %b
clone x1 on HDD      56227833332306.423172
clone x1 on SSD      35379115616500016
clone x10 on HDD     35470182327431952462599
clone x10 on SSD     354295824133026241541034

At light load - just one clone at a time - the SSD performs better than the HDD, but at heavy load the SSD performs much much better, e.g. nine times the write throughput and 13x the write IOPS of the HDD, and the device driver and SSD still have room for more (34% busy vs. 99% busy).

Cloning a zone consists almost entirely of copying files. Patching has a higher proportion of computation, but those results gave me high hopes for patching. I wasn't disappointed. (Evidently, every good research report also includes foreshadowing.)

In addition to measuring the performance boost of the ZPP I wanted to know if that patch would help - or hurt - a system without using its parallelization feature. (I didn't have a particular reason to expect non-parallelized improvement, but occasionally I'm an optimist. Besides, if the performance with the ZPP was different without actually using parallelization, it would skew the parallelized numbers.) So before installing the patch, I measured the length of time to apply a patch. For all of my measurements, I used patch 120543-14 - the Apache patch. At 15MB, it's not a small patch, nor is it a very large patch. (The "Baby Bear" patch, perhaps? --Ed.) It's big enough to tax the system and allow reasonable measurements, but small enough that I could expect to gather enough data to draw useful conclusions, without investing a year of time...

#define TEST while(1) {patchadd; patchrm;}

So, before applying the ZPP, and without any zones on the system, I applied the Apache patch. I measured the elapsed time because our goal is to minimize elapsed time of patch application. Then I removed the Apache patch.

Then I added a zone to the system, on the secondary HDD, and, I re-applied the Apache patch to the system, which automatically applied it to the zone as well. I removed the patch, created two more zones, and applied the same patch yet again. Finally, I compared the elapsed time of all three measurements. Patching the global zone alone took about 120 seconds. Patching with one non-global zone took about 175 seconds: 120 for the global zone and 55 for the zone. Patching three zones took about 285 seconds: 120 seconds for the global zone and 55 seconds for each of the three zones.

Theoretically, the length of time to patch each zone should be consistent. Testing that theory, I created a total of 16 zones and then applied the Apache patch. No surprises: 55 seconds per zone.

To test non-parallel performance of the ZPP, I applied it, kept the default setting of "no parallelization," and then re-did those tests. Application of the Apache patch did not change in behavior nor in elapsed time per zone, from zero to 16 zones. (However, I had a faint feeling that Solaris was beginning to question my sanity. "Get inline," I told it...)

How about the SSD - would it improve patch performance with zero or more zones? I removed the HDD zones and installed a succession of zones - zero to 16 - on the SSD and applied the Apache patch each time. The SSD did not help at all - the patch still took 55 seconds per zone. Evidently this particular patch is not I/O bound, it is CPU bound.

But applying the ZPP does not, by default, parallelize anything. To tell the patch tools that you would like some level of parallelization, e.g. "patch four zones at the same time," you must edit a specific file in the /etc/patch directory and supply a number, e.g. '4'. After you have done that, if parallel patching is possible, it will happen automatically. Multiple zones (e.g. four) will be patched at the same time by a patchadd process running in each zone. Because that patchadd is running in a zone, it will use the CPUs that the zone is assigned to use - default or otherwise. This also means that the zone's patchadd process is subject to all of the other resource controls assigned to the zone, if any.
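As an illustration only - the file name and variable below are my assumptions, and the patch's README, once it ships, is the authoritative reference - the setting is reportedly a one-line assignment, something like:

# /etc/patch/pdo.conf (hypothetical name; check the patch README)
num_proc=4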

Changing the parallelization level to 8, I re-did all of those measurements, on the HDD zones and then the SSD zones. The performance impact was obvious right away. As the graph to the right shows, the elapsed time to patch the system with a specific number of zones was less with the ZPP. ('P' indicates the level of parallelization: 1, 8 or 16 zones patched simultaneously. The blue line shows no parallelization, the red line shows the patching of eight zones simultaneously.) Turning the numbers around, the patching "speed" improved by a factor of three.

How much more could I squeeze out of that configuration? I increased the level of parallelization to 16 and re-did everything. No additional performance, as the graph shows. Maybe it's a coincidence, but a T2000 has eight CPU cores - maybe that's a limiting factor.

At this point the SSD was feeling neglected, so I re-did all of the above with zones on the SSD. This graph shows the results: little benefit at low zone count, but significant improvement at higher zone counts - when the HDD was the bottleneck. Combining ZPP and SSD resulted in patching throughput improvement of 5x with 16 zones.

That seems like magic! What's the trick? A few paragraphs back, I mentioned the 'trick': using all of the scalability of Solaris and, in this case, CMT systems. Patching a system without ZPP - especially one without a running application - leaves plenty of throughput performance "on the table." Patching multiple zones simultaneously uses CPU cycles - presumably cycles that would have been idle. And it uses I/O channel and disk bandwidth - also, hopefully, available bandwidth. Essentially, ZPP is shortening the elapsed time by using more CPU cycles and I/O bandwidth now instead of using them later.

So the main caution is "make sure there is sufficient compute and I/O capacity to patch multiple zones at the same time."

But whenever multiple apps are running on the same system at the same time, the operating system must perform extra tasks to enable them to run safely. It doesn't matter if the "app" is a database server or 'patchadd.' So is ZPP using any "extra" CPU, i.e. is there any CPU overhead?

Along the way, I collected basic CPU statistics, including system and user time. The next graph shows that the amount of total CPU time (user+sys) increased slightly. The overhead was less than 10% for up to 8 zones. Another coincidence? I don't know, but at that point the overhead was roughly 1% per zone. The overhead increased faster beyond P=8, indicating that, perhaps, a good rule of thumb is P="number of unused cores." Of course, if the system is using Dynamic Resource Pools or dedicated-cpus, the rule might need to be changed accordingly. TANSTAAFL.

Conclusion

All good tales need a conclusion. The final graph shows the speedup - the increase in patching throughput - based on the number of zones, level of parallelization, and device type. Specific conclusions are:
  1. Parallel patching zones significantly reduces elapsed time if there are sufficient compute and I/O resources available.
  2. Solid-state disks significantly improve patching throughput for high-zone counts and similar levels of parallelization.
  3. The amount of work accomplished does not decrease - it's merely compressed into a shorter period of elapsed time.
  4. If patching while applications are running:
    • plan carefully in order to avoid impacting the responsiveness of your applications. Choose a level of parallelization commensurate with the amount of available compute capacity
    • use appropriate resource controls to maintain desired response times for your applications.

Getting the ZPP requires waiting until mid-year. Getting SSDs is easy - they're available for the Sun 7210 and 7410 Unified Storage Systems and for Sun systems.

Tuesday Feb 10, 2009

Zones to the Rescue

Recently, Thomson Reuters "demonstrated that RMDS [Reuters Market Data System software] performs better in a virtualized environment with Solaris Containers than it does with a number of individual Sun server machines."

This enabled Thomson Reuters to break the "million-messages-per-second barrier."

The performance improvement is probably due to the extremely high bandwidth, low latency characteristics of inter-Container network communications. Because all inter-Container network traffic is accomplished with memory transfers - using default settings - packets 'move' at computer memory speeds, which are much better than common 100Mbps or 1Gbps ethernet bandwidth. Further, that network performance is much more consistent without extra hardware - switches and routers - that can contribute to latency.

Articles can be found at: http://finance.yahoo.com/news/Sun-Microsystems-and-Thomson-bw-14306924.html

Thursday Jan 29, 2009

Group of Zones - Herd? Flock? Pod? Implausibility? Cluster!

Yesterday, Solaris Cluster (aka "Sun Cluster") 3.2 1/09 was released. This release has two new features which directly enhance support of Solaris Zones in Solaris Clusters.

The most significant new functionality is a feature called "Zone Clusters" which, at this point, 'merely' :-) provides support for Oracle RAC nodes in Zones. In other words, you can create an Oracle RAC cluster, using individual zones in a Solaris Cluster as RAC nodes.

Further, because a Solaris Cluster can contain multiple Zone Clusters, it can contain multiple Oracle RAC clusters. For details about configuring a zone cluster, see "Configuring a Zone Cluster" in the Sun Cluster Software Installation Guide and the clzonecluster(1CL) man page.

The second new feature is support for exclusive-IP zones. Note that this only applies to failover data services, not to scalable data services or to zone clusters.

Friday Jan 09, 2009

Zones and Solaris Security


An under-appreciated aspect of the isolation inherent in Solaris Zones (aka Solaris Containers) is their ability to use standard Solaris security features to enhance security of consolidated workloads. These features can be used alone or in combination to create an arbitrarily strong level of security. This includes DoD-strength security using Solaris Trusted Extensions - which use Solaris Zones to provide labeled, multi-level data classification. Trusted Extensions achieved one of the highest possible Common Criteria independent security certifications.

To shine some light on the topic of Zones and security, Glenn Brunette and I recently co-authored a new Sun BluePrint with an overly long name :-) - "Understanding the Security Capabilities of Solaris Zones Software." You can find it at http://www.sun.com/blueprints.

Monday Nov 24, 2008

Zonestat v1.3

It's - already - time for a zonestat update. I was never happy with the method that zonestat used to discover the mappings of zones to resource pools, but wanted to get v1.1 "out the door" before I had a chance to improve on its use of zonecfg(1M). The obvious problem, which at least one person stumbled over, was the fact that you can re-configure a zone while it's running. After doing that, the configuration information doesn't match the current mapping of zone to pool, and zonestat became confused.

Anyway, I found the time to replace the code in zonestat that discovered zone-to-pool mappings with a more sophisticated method. The new method uses ps(1) to learn the PID of each zone's [z]sched process. Then it uses "poolbind -q <PID>" to look up the pool for that process. The result is more accurate data, but the ps command does use more CPU cycles.
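In command form, the discovery looks roughly like this (the PID is just an example):

global# ps -e -o pid,zone,comm | grep '[z]sched'
global# poolbind -q 4242

The first command yields one zsched PID per running zone; the second reports the pool to which that PID is bound.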

While performing surgery on zonestat, I also:

  • began using the kstat module for Perl, reducing CPU time consumed by zonestat
  • fixed some bugs in CPU reporting
  • limited zonename output to 8 characters to improve readability
  • made some small performance improvements
  • added a $DEBUG variable which can be used to watch the commands that zonestat is using
With all of that, zonestat v1.3 provides better data, is roughly 30% faster, and is even smaller than the previous version! :-) You can find it at http://opensolaris.org/os/project/zonestat.

Tuesday Nov 18, 2008

Zonestat: How Big is a Zone?

Introduction

Recently an organization showed great interest in Solaris Containers, especially the resource management features, and then asked "what command do you use to compare the resource usage of Containers to the resource caps you have set?"

I began to list them: prstat(1M), poolstat(1M), ipcs(1), kstat(1M), rcapstat(1), prctl(1), ... Obviously it would be easier to monitor Containers if there was one 'dashboard' to view. Such a dashboard would enable zone administrators to easily review zones' usage of system resources and decide if further investigation is necessary. Also, if there is a system-wide shortage of a resource, this tool would be the first out of the toolbox, simplifying the task of finding the 'resource hog.'

Zonestat

In order to investigate the possibilities, I created a Perl script I call 'zonestat' which summarizes resource usage of Containers. I consider this script a prototype, not intended for production use. On the other hand, for a small number of zones, it seems to be pretty handy, and moderately robust.

Its output looks like this:

        |----Pool-----|------CPU-------|----------------Memory----------------|
        |---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
  global  0D  66K    2       0.1 100 25      986M      139K  18E 2.2M  18E 754M
    db01  0D  66K    2       0.1 200 50   1G 122M 536M  0.0 536M    0   1G 135M
   web02  0D  66K    2  0.4  0.0 100 25 100M  11M  20M  0.0  20M    0 268M   8M 
==TOTAL= --- ----    2 ----  0.2  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 897M
The 'Pool' columns provide information about the Dynamic Resource Pool in which the zone's processes are running. The two-character 'I' column displays the pool ID (number) and the 'T' column indicates the type of pool - 'D' for 'default', 'P' for 'private' (using the dedicated-cpu feature of zonecfg) or 'S' for 'shared.' The two 'Size' columns show the quantity of CPUs assigned to the pool in which the zone is running.

The 'CPU Pset' columns show each zone's CPU usage and any caps that have been set. The first two columns show CPU quantities - CPU cores for x86, SPARC64 and all UltraSPARC systems except CMT (T1, T2, T2+). On CMT systems, Solaris considers every hardware thread ('strand') to be a CPU, and calls them 'vCPUs.'

The last two CPU columns - 'Shr' and 'S%' - show the number of FSS shares assigned to the zone, and what percentage of the total number of shares in that zone's pool. In the example above, all the zones share the default pset, and the zone 'db01' has two shares, so it should receive 50% of the CPU power of the pool at a minimum.
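For reference, here is one way such shares get assigned; the zone name and values are placeholders, FSS must be the scheduler in effect for the shares to matter, and the prctl change to a running zone does not persist across a reboot:

global# zonecfg -z myzone
zonecfg:myzone> set cpu-shares=2
zonecfg:myzone> exit
global# prctl -n zone.cpu-shares -r -v 2 -i zone global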

The 'Memory' columns show the caps and usage for RAM, locked memory and virtual memory. Note that virtual memory is RAM plus swap space.

The syntax of zonestat is very similar to the other *stat tools:

   zonestat [-l] [interval [count]]
The output shown above is generated with the -l flag, which means "show the limits (caps) that have been set." Without -l, only usage columns are displayed.

Example of Usage

Here is more output, showing some of the conclusions that can be drawn from the data. I have added parenthetical numbers in the right-hand in order to refer to specific lines of output.

        |----Pool-----|------CPU-------|----------------Memory----------------|
        |---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
  global  0D  66K    2       0.1 100 HH      983M      139K  18E 2.2M  18E 752M
==TOTAL= --- ----    2 ----  0.1  -- -- 4.3G 983M 4.3G 139K 4.1G 2.2M 5.3G 752M
--------
  global  0D  66K    2       0.1 100 HH      983M      139K  18E   2M  18E 752M
==TOTAL= --- ----    2 ----  0.1  -- -- 4.3G 983M 4.3G 139K 4.1G 2.2M 5.3G 752M
Note that none of the non-global zones are running. Because the global zone is the only zone running in its pool, its 100 FSS shares represent 100% of the shares in its pool. To save a column of output, I indicate that with 'HH' instead of '100'.

The "==TOTAL=" line provides two types of information, depending on the column type. For usage information, the sum of the resource used is shown. For example, "RAM Use" shows the amount of RAM used by all zones, including the global zone. For resource controls, either the system's amount of the resource is shown, e.g. "RAM Cap", or hyphens are displayed.

Note that there is a maximum amount of RAM that can be locked in a Solaris system. This prevents all of memory from being locked down, which would prevent the virtual memory system from running. In the output above, this system will only allow 4.1GB of RAM to be locked.

Also note that the amount of VM used is less than the amount of RAM used. This is because the memory pages which contain a program's instructions are not backed by swap disk, but by the file system itself. Those 'text' pages take up RAM, but do not take up swap space.

        |----Pool-----|------CPU-------|----------------Memory----------------|
        |---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
  global  0D  66K    2       0.1 100 50      984M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.1 100 50 1.0G  30M 536M  0.0 536M  0.0 1.0G  27M
==TOTAL= --- ----    2 ----  0.2  -- -- 4.3G 1.0G 4.3G 139K 4.1G 2.2M 5.3G 780M
A zone has booted. It has caps for RAM, shared memory, locked memory, and VM. The default pool now has a total of 200 shares: 100 for each zone. Therefore, each zone has 50% of the shares in that pool. This provides a good reason to change the global zone's FSS value from its default of one share to a larger value as soon as you add the first zone to a system.
  global  0D  66K    2       0.1 100 50      984M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.3 100 50   1G  93M 536M  0.0 536M  0.0   1G  95M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 848M
--------
  global  0D  66K    2       0.1 100 50      981M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.4 100 50   1G 122M 536M  0.0 536M  0.0   1G 135M
==TOTAL= --- ----    2 ----  0.5  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M
The zone 'z3' is still booting, and is using 0.4 CPUs worth of CPU cycles.
  global  0D  66K    2       0.1 100 50      984M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.3 100 50   1G 122M 536M      536M  0.0   1G 135M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M
--------
  global  0D  66K    2       0.1 100 50      984M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.2 100 50   1G 122M 536M      536M  0.0   1G 135M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M
--------
  global  0D  66K    2       0.1 100 33      986M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.1 100 33   1G 122M 536M      536M  0.0   1G 135M
   web02  0D  66K    2  0.4  0.0 100 33 100M  11M  20M       20M  0.0 268M   8M
==TOTAL= --- ----    2 ----  0.2  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 897M
A third zone has booted. This zone has a CPU cap of 0.4 CPUs. It also has memory caps, including a RAM cap that is smaller than the amount of RAM that zone 'z3' is already using. If web02 tries to use as much memory as z3 does, it should begin paging before long. Let's see what happens...
--------
  global  0D  66K    2       0.1   1 33      985M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.1   1 33   1G 122M 536M      536M  0.0   1G 135M
   web02  0D  66K    2  0.4  0.1   1 33 100M  29M  20M       20M  0.0 268M  36M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 925M
--------
  global  0D  66K    2       0.1   1 33      984M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.1   1 33   1G 122M 536M      536M  0.0   1G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M  63M  20M       20M  0.0 268M 138M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
        |----Pool-----|------CPU-------|----------------Memory----------------|
        |---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
  global  0D  66K    2       0.1   1 33      985M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33   1G 122M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M  87M  20M       20M  0.0 268M 185M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.1G
--------
  global  0D  66K    2       0.1   1 33      985M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33   1G 122M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M 100M  20M       20M  0.0 268M 112M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
--------
  global  0D  66K    2       0.1   1 33      984M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33   1G 122M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.3   1 33 100M 112M  20M       20M  0.0 268M 117M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
As expected, web02 exceeds its RAM cap. Now rcapd should address the problem.
--------
  global  0D  66K    2       0.1   1 33      981M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33   1G 119M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.3   1 33 100M 111M  20M       20M  0.0 268M 127M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
One of two things has happened: either a process in web02 freed up memory, or rcapd caused pageouts. rcapstat(1M) will tell us which it is. Also, the increase in VM usage indicates that more memory was allocated than freed, so it's more likely that rcapd was active during this period.
--------
  global  0D  66K    2       0.1   1 33      981M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33 1.0G 119M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M 110M  20M       20M  0.0 268M 133M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
--------
  global  0D  66K    2       0.1   1 33      978M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33 1.0G 116M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M  91M  20M       20M  0.0 268M 133M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
At this point 'web02' is safely under its RAM cap. If this zone began to do 'real' work, it would continually be under memory pressure, and the value in 'Memory:RAM:Use' would fluctuate around 100M. When setting a RAM cap, it is very important to choose a reasonable value to avoid causing unnecessary paging.
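
To confirm whether rcapd is actually paging a zone out, rcapstat reports per-zone RAM-cap activity. A minimal sketch (the interval and count are illustrative):

   # rcapstat -z 5 12         # per-zone resource-cap statistics, 12 samples at 5-second intervals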

One final example, taken from a different configuration of zones:

        |----Pool-----|------CPU-------|----------------Memory----------------|
        |---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap|Used| Cap|Used| Cap|Used| Cap|Used
-------------------------------------------------------------------------------
  global  0D  66K    1  0.0  0.0 200 66      1.2G  18E 343K  18E 2.6M  18E 1.1G
      zB  0D  66K    1  0.2  0.0 100 33      124M  18E  0.0  18E  0.0  18E 138M
      zA  1P    1    1  0.0  0.1   1 HH       31M  18E  0.0  18E  0.0  18E  24M
==TOTAL= --- ----    2 ----  0.1 --- -- 4.3G 1.4G 4.3G 343K 4.1G 2.6M 5.3G 1.2G
The global zone and zone 'zB' share the default pool. Because the global zone has 200 FSS shares, compared to zB's 100 shares, global zone processes will get 2/3 (200 of the 300 total shares) of the processing power of the default pool if there is contention for its CPU. However, contention is unlikely, because zB is capped at 0.2 CPUs worth of compute time.

Zone 'zA' is in its own private resource pool. It has exclusive access to the one dedicated CPU in that pool.
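
Configurations like these are created with zonecfg. A minimal sketch using the zone names from this example (the values match the output above but are otherwise illustrative):

   # zonecfg -z zB
   zonecfg:zB> set cpu-shares=100         # FSS shares in the shared (default) pool
   zonecfg:zB> add capped-cpu
   zonecfg:zB:capped-cpu> set ncpus=0.2   # hard cap of 0.2 CPUs worth of compute time
   zonecfg:zB:capped-cpu> end
   zonecfg:zB> exit

   # zonecfg -z zA
   zonecfg:zA> add dedicated-cpu
   zonecfg:zA:dedicated-cpu> set ncpus=1  # a temporary private pool with one CPU
   zonecfg:zA:dedicated-cpu> end
   zonecfg:zA> exit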

Problems

CPU Hog

Zonestat's biggest problem stems from its brute-force nature: it runs a few commands for each running zone. This can consume many CPU cycles, and a single sample can take a few seconds to gather when many zones are running. Performance improvements to zonestat are underway.

Wrong / Misleading CPU Usage Data

Two commonly used methods to 'measure' CPU usage by processes and zones are prstat and mpstat. Each can produce inaccurate 'data' in certain situations.

With mpstat, it is not difficult to create surprising results. For example, on a CMT system, set a CPU cap on a zone in a pool and run a few CPU-bound processes in that zone: the "Pset Used" column will not reach the CPU cap. This is due to the method mpstat uses to calculate its data.

Prstat computes its data only at each sampling interval, ignoring anything that happened between samples. This leads to undercounting of CPU usage for zones with many short-lived processes.
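
For reference, per-zone and per-CPU usage can be observed directly with these tools; a sketch (intervals and counts are illustrative):

   # prstat -Z 5 2            # per-process list plus a per-zone summary, two 5-second samples
   # mpstat 5 2               # per-CPU utilization, two 5-second samples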

I wrote code to gather data from each, but prstat seemed more useful, so for now the output comes from prstat.

What's Next

I would like feedback on this tool, perhaps leading to minor modifications to improve its robustness and usability. What's missing? What's not clear?

The future of zonestat might include these:

  • I hope that this can be re-written in C or D. Either way, it might find its way into Solaris... If I can find the time, I would like to tackle this.
  • New features - best added to a 'real' version:
    1. -d: show disk usage
    2. -n: show network usage
    3. -i: also show installed zones that are not running
    4. -c: also show configured zones that are not installed
    5. -p: only show processor info, but add more fields, e.g. prstat's instantaneous CPU%, micro-state fields, and mpstat's CPU%
    6. -m: only show memory-related info, and add the paging columns of vmstat, output of "vmstat -p", free swap space
    7. -s: sort sample output by a field. Good example: sort by pool ID
    8. add a one-character column showing the default scheduler for each pool
    9. report state transitions like mpstat does, e.g. changes in zone state, changes in pool configuration
    10. improve robustness

The Code

You can find the Perl script in the Files page of the OpenSolaris project "Zone Statistics."
