Friday Apr 02, 2010

Solaris Virtualization Book

This blog has been pretty quiet lately, but that doesn't mean I haven't been busy! For the last 6 months I've been leading the writing of a book: _Oracle Solaris 10 System Virtualization Essentials_.

This book discusses all of the forms of server virtualization, not just hypervisors. It covers the forms of virtualization that Solaris 10 provides and those that it can use. These include Solaris Containers (also called Solaris Zones), VirtualBox, Oracle VM (x86 and SPARC; the latter was called Logical Domains), and Dynamic Domains.

One chapter is dedicated to the topic of choosing the best virtualization technology for a particular workload or set of workloads. Another chapter shows how to use each virtualization technology to achieve specific goals, including screenshots and command sequences. The last chapter of the book describes the need for virtualization management tools and then uses Oracle EM Ops Center as an example.

The book is available for pre-order at

Wednesday Apr 08, 2009

Zonestat 1.4 Now Available

I have posted Zonestat v1.4 at the Zone Statistics project page (click on "Files" in the left navbar).

Zonestat is a 'dashboard' for Solaris Containers. It shows resource consumption of each Container (aka Zone) and a comparison of consumption against limits you have set.

Changes from v1.3:

  • BugFix: various failures if the pools service was not online. V1.4 checks for the existence of the pools packages, and behaves correctly whether they are installed and enabled, or not.
  • BugFix: various symptoms if the rcapd service was not online. V1.4 checks for the existence of the rcap packages, and behaves correctly whether they are installed and enabled, or not.
  • BugFix: mis-reported shared memory usage
  • BugFix: -lP produced human-readable, not machine-parseable output
  • Bug/RFE: detect and fail if zone != global or user != root
  • RFE: Prepare for S10 update numbers past U6
  • RFE: Add option to print entire name of zones with long names
  • RFE: Add timestamp to machine-consumable output
  • RFE: improve performance and correctness by collecting CPU% with DTrace instead of prstat

Note that the addition of a timestamp to -P output changes the output format for "machine-readable" output.

For most people, the most important change will be the use of DTrace to collect CPU% data. This has two effects. The first effect is improved correctness. The prstat command, used in V1.3 and earlier, can horribly underestimate CPU cycles consumed because it can miss many short-lived processes. The mpstat command has its own problems with mis-counting CPU usage. So I expanded on a solution Jim Fiori offered, which uses DTrace to answer the question "which zone is using a CPU right now?"

The other benefit to DTrace is the improvement in performance of Zonestat.

The less popular, but still interesting, additions include:

  • -N expands the width of the zonename field to the length of the longest zone name. This preserves the entire zone name, for all zones, and also leaves the columns lined up. However, the length of the output lines will exceed 80 characters.
  • The new timestamp field in -P output makes it easier for tools like the "System Data Recorder" (SDR) to consume zonestat output. However, this was a change to the output format. If you have written a script which used -P and assumed a specific format for zonestat output, you must change your script to understand the new format.
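If you have a -P consumer to update, one defensive approach is to detect whether the first field of a record is the new timestamp before assigning the remaining fields. The colon-delimited layout below is purely illustrative, not zonestat's actual field order; only the "leading epoch timestamp was added" part comes from the release notes above:

```python
def split_record(line):
    """Split a hypothetical colon-delimited -P record, tolerating the
    new leading epoch-timestamp field added in v1.4."""
    fields = line.rstrip("\n").split(":")
    # An epoch timestamp is all digits and fairly long; zone names are not.
    if fields[0].isdigit() and len(fields[0]) >= 9:
        return int(fields[0]), fields[1:]
    return None, fields  # old v1.3-style record, no timestamp

# New-style record (timestamp first), then an old-style record:
ts, rest = split_record("1239148800:global:0.1:986M")
assert ts == 1239148800 and rest[0] == "global"
ts, rest = split_record("global:0.1:986M")
assert ts is None and rest[0] == "global"
```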

Please send questions and requests to .

Wednesday Apr 01, 2009

Patching Zones Goes Zoom!

Accelerated Patching of Zoned Systems


If you have patched a system with many zones, you have learned that it takes longer than patching a system without zones. The more zones there are, the longer it takes. In some cases, this can raise application downtime to an unacceptable duration.

Fortunately, there are a few methods which can be used to reduce application downtime. This document mentions many of them, and then describes the performance enhancements of two of them. But the bulk of this rather bulky entry is the description and results of my newest computer... "experiments."

Executive Summary, for the Attention-Span Challenged

It's important to distinguish between application downtime, service downtime, zone downtime, and platform downtime. 'Service' is the service being provided by an application or set of applications. To users, that's the most important measure. As long as they can access the service, they're happy. (Doesn't take much, does it?)

If a service depends on the proper operation of each of its component applications, planned or unplanned downtime of one application will result in downtime of the service. Some software, e.g. web server software, can be deployed in multiple, load-balanced systems so that the service will not experience downtime even if one of the software instances is down.

Applying an operating system patch may require service downtime, application downtime, zone downtime or platform downtime, depending on the patch and the entity being patched. Because in many cases patch application will require application downtime, the goal of the methods mentioned below, especially parallel patching of zones, is to minimize elapsed downtime to achieve a patched, running system.

Just Enough Choices to Confuse

Methods that people use - or will soon use - to improve the patching experience of zoned systems include:

  • Live Upgrade allows you to copy the existing Solaris instance into an "alternate boot environment," patch or upgrade the ABE, and then re-boot into it. Downtime of the service, application, or zone is limited to the amount of time it takes to re-boot the system. Further, if there's a problem, you can easily re-boot back into the original boot environment. Bob Netherton describes this in detail on his weblog. Maybe the software should have a secondary name: Live Patch.

  • You can detach all of the zones on the system, patch the system (which doesn't bother to patch the zones) and then re-attach the zones using the "update-on-attach" method which is also described here. This method can be used to reduce service downtime and application downtime, but not as much as the Live Upgrade / Live Patch method. Each zone (and its application(s)) will be down for the length of time to patch the system - and perhaps reboot it - plus the time to update/attach the zone.

  • You can apply the patch to another Solaris 10 system with contemporary or newer patches, and migrate the zones to that system. Optionally, you can patch the original system and migrate the zones back to it. Downtime of the zones is less than the previous solution because the new system is already patched and rebooted.

  • You can put the zones' zonepaths on (i.e. install the zones onto) very fast storage, e.g. an SSD or a storage array with battery-backed DRAM or NVRAM. The use of SSDs is described below. This method can be used in conjunction with any of the other solutions. It will speed up patching because patching is I/O intensive. However, this type of storage device is more expensive per MB, so this solution may not make fiscal sense in many situations.

  • Sun has developed an enhancement to the Solaris patching tools which is intended to significantly decrease the elapsed time of patching. It is currently being tested at a small number of sites. After it's released you can get the Zones Parallel Patching patch, described below. This solution decreases the elapsed time to patch a system. It can be combined with some of the solutions above, with varying benefits. For example, with Live Upgrade, parallel patching reduces the time to patch the ABE, but doesn't reduce service downtime. Also, ZPP offers little benefit for the detach/attach-on-upgrade method. However, as a stand-alone method, ZPP offers significant reduction in elapsed time without changing your current patching process. ZPP was mentioned by Gerry Haskins, Director of Software Patch Services.

Disclaimer 1: the "Zones Parallel Patching" patch ("ZPP") is still in testing and has not yet been released. It is expected to be released mid-CY2009. That may change. Further, the specific code changes may change, which may change the results described below.

Disclaimer 2: the experiment described below, and its results, are specific to one type of system (Sun Fire T2000) and one patch (120543-14 - "the Apache patch"). Performance improvements using other hardware and other patches will produce different results.

Yet More Background

I wanted to better understand two methods of accelerating the patching of zoned systems, especially when used in combination. Currently, a patch applied to the global zone will normally be applied to all non-global zones, one zone at a time. This is a conservative approach to the task of patching multiple zones, but doesn't take full advantage of the multi-tasking abilities of Solaris.

I learned that a proposed patch was created that enables the system administrator to apply a patch in the global zone which patches the global and then patches multiple zones at the same time. The parallelism (i.e. "the number of zones that are patched at one time") can be chosen before the patch is applied. If there are multiple "Solaris CPUs" in the system, multiple CPUs can be performing computational steps at the same time. Even if there aren't many CPUs, one zone's patching process can be using a CPU while another's is writing to a disk drive.

<tangent topic="Solaris vCPU"> I use the phrase "Solaris CPUs" to refer to the view that Solaris has of CPUs. In the old days, a CPU was a CPU - one chip, one computational entity, one ALU, one FPU, etc. Now there are many factors to consider - CPU sockets, CPU cores per socket, hardware threads per core, etc. Solaris now considers "vCPUs" - virtual processors - as entities on which to schedule processes. Solaris considers each of these a vCPU:

  • x86/x64 systems: a CPU core (today, can be one to six per socket, with a maximum of 24 vCPUs per system, ignoring some exotic, custom high-scale x86 systems)
  • UltraSPARC-II, -III[+], -IV[+]: a CPU core, max of 144 in an E25K
  • SPARC64-VI, -VII: a CPU core: max of 256 in an M9000
  • SPARC CMT (SPARC-T1, -T2+): a hardware thread, maximum of 256 in a T5440

</tangent>
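The maximums above follow directly from socket, core, and thread counts. A quick sanity check; the per-system socket and core counts used here are my assumptions from public system specs, not from the text:

```python
def vcpus(sockets, cores_per_socket, threads_per_core=1):
    """vCPUs as Solaris counts them: one per core on traditional SPARC/x86,
    one per hardware thread on CMT processors."""
    return sockets * cores_per_socket * threads_per_core

# x86/x64: 4 sockets x 6 cores -> the 24-vCPU maximum mentioned above
assert vcpus(4, 6) == 24
# E25K: 72 dual-core UltraSPARC-IV+ sockets -> 144 vCPUs
assert vcpus(72, 2) == 144
# M9000: 64 quad-core SPARC64-VII sockets -> 256 vCPUs
assert vcpus(64, 4) == 256
# T5440: 4 UltraSPARC-T2+ sockets, 8 cores, 8 threads per core -> 256 vCPUs
assert vcpus(4, 8, 8) == 256
```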

Separately, I realized that one part of patching is disk-intensive. Many disk-intensive workloads benefit from writing to a solid-state disk (SSD) because of the performance benefit of those devices over spinning-rust disk drives (HDD).

So finally (hurrah!) the goal of this adventure: how much performance advantage would I achieve with the combination of parallel patching and an SSD, compared to sequential patching of zones on an HDD?

He Finally Begins to Get to the Point

I took advantage of an opportunity to test both of these methods to accelerate patching. The system was a Sun Fire T2000 with two HDDs and one SSD. The system had 32 vCPUs, was not using Logical Domains, and was running Solaris 10 10/08. Solaris was installed on the first HDD. Both HDDs were 72GB drives. The SSD was a 32GB device. (Thank you, Pat!)

For some of the tests I also applied the ZPP. (Thank you, Enda!) For some of the tests I used zones that had zonepaths on the SSD; the rest 'lived' on the second HDD.

As with all good journeys, this one had some surprises. And, as with all good research reports, this one has a table with real data. And graphs later on.

To get a general feel for the different performance of an HDD vs. an SSD, I created a zone on each - using the secondary HDD - and made clones of it. Sometimes I made just one clone at a time; other times I made ten clones simultaneously. The iostat(1) tool showed me the following performance numbers:

(iostat comparison table: rows for one clone and ten simultaneous clones, on the HDD and on the SSD; the throughput, IOPS, and %busy columns were garbled in this copy of the entry.)

At light load - just one clone at a time - the SSD performs better than the HDD, but at heavy load the SSD performs much much better, e.g. nine times the write throughput and 13x the write IOPS of the HDD, and the device driver and SSD still have room for more (34% busy vs. 99% busy).

Cloning a zone consists almost entirely of copying files. Patching has a higher proportion of computation, but those results gave me high hopes for patching. I wasn't disappointed. (Evidently, every good research report also includes foreshadowing.)

In addition to measuring the performance boost of the ZPP I wanted to know if that patch would help - or hurt - a system without using its parallelization feature. (I didn't have a particular reason to expect non-parallelized improvement, but occasionally I'm an optimist. Besides, if the performance with the ZPP was different without actually using parallelization, it would skew the parallelized numbers.) So before installing the patch, I measured the length of time to apply a patch. For all of my measurements, I used patch 120543-14 - the Apache patch. At 15MB, it's not a small patch, nor is it a very large patch. (The "Baby Bear" patch, perhaps? --Ed.) It's big enough to tax the system and allow reasonable measurements, but small enough that I could expect to gather enough data to draw useful conclusions, without investing a year of time...

#define TEST while(1) {patchadd; patchrm;}

So, before applying the ZPP, and without any zones on the system, I applied the Apache patch. I measured the elapsed time because our goal is to minimize elapsed time of patch application. Then I removed the Apache patch.

Then I added a zone to the system, on the secondary HDD, and, I re-applied the Apache patch to the system, which automatically applied it to the zone as well. I removed the patch, created two more zones, and applied the same patch yet again. Finally, I compared the elapsed time of all three measurements. Patching the global zone alone took about 120 seconds. Patching with one non-global zone took about 175 seconds: 120 for the global zone and 55 for the zone. Patching three zones took about 285 seconds: 120 seconds for the global zone and 55 seconds for each of the three zones.

Theoretically, the length of time to patch each zone should be consistent. Testing that theory, I created a total of 16 zones and then applied the Apache patch. No surprises: 55 seconds per zone.
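The measurements fit a simple linear model: a fixed cost for the global zone plus a constant cost per non-global zone. A sketch; the 120-second and 55-second constants are the measured values above, and will of course differ for other patches and hardware:

```python
GLOBAL_SECS = 120   # measured: patching the global zone alone
PER_ZONE_SECS = 55  # measured: each non-global zone, patched sequentially

def serial_patch_time(n_zones):
    """Elapsed seconds to patch the global zone plus n_zones, one at a time."""
    return GLOBAL_SECS + PER_ZONE_SECS * n_zones

assert serial_patch_time(0) == 120   # global zone only
assert serial_patch_time(1) == 175   # one non-global zone
assert serial_patch_time(3) == 285   # three non-global zones
print(serial_patch_time(16))         # 16 zones -> 1000 seconds
```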

To test non-parallel performance of the ZPP, I applied it, kept the default setting of "no parallelization," and then re-did those tests. Application of the Apache patch did not change in behavior nor in elapsed time per zone, from zero to 16 zones. (However, I had a faint feeling that Solaris was beginning to question my sanity. "Get inline," I told it...)

How about the SSD - would it improve patch performance with zero or more zones? I removed the HDD zones and installed a succession of zones - zero to 16 - on the SSD and applied the Apache patch each time. The SSD did not help at all - the patch still took 55 seconds per zone. Evidently this particular patch is not I/O bound, it is CPU bound.

But applying the ZPP does not, by default, parallelize anything. To tell the patch tools that you would like some level of parallelization, e.g. "patch four zones at the same time," you must edit a specific file in the /etc/patch directory and supply a number, e.g. '4'. After you have done that, if parallel patching is possible, it will happen automatically. Multiple zones (e.g. four) will be patched at the same time by a patchadd process running in each zone. Because that patchadd is running in a zone, it will use the CPUs that the zone is assigned to use - default or otherwise. This also means that the zone's patchadd process is subject to all of the other resource controls assigned to the zone, if any.

Changing the parallelization level to 8, I re-did all of those measurements, on the HDD zones and then the SSD zones. The performance impact was obvious right away. As the graph to the right shows, the elapsed time to patch the system with a specific number of zones was less with the ZPP. ('P' indicates the level of parallelization: 1, 8 or 16 zones patched simultaneously. The blue line shows no parallelization, the red line shows the patching of eight zones simultaneously.) Turning the numbers around, the patching "speed" improved by a factor of three.

How much more could I squeeze out of that configuration? I increased the level of parallelization to 16 and re-did everything. No additional performance, as the graph shows. Maybe it's a coincidence, but a T2000 has eight CPU cores - maybe that's a limiting factor.
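An idealized model of parallel patching divides the per-zone work across P simultaneous patchadd runs, capped by the cores available - which is consistent with P=16 gaining nothing over P=8 on an eight-core T2000. This model is my own simplification; it gives an ideal upper bound, and the measured improvement (about 3x) was lower because concurrent zones contend for CPU and I/O:

```python
import math

GLOBAL_SECS = 120   # measured cost of patching the global zone
PER_ZONE_SECS = 55  # measured cost per non-global zone

def parallel_patch_time(n_zones, p, cores=8):
    """Ideal elapsed seconds with parallelization level p.
    Effective parallelism cannot exceed the zones to patch or,
    roughly, the CPU cores available."""
    if n_zones == 0:
        return GLOBAL_SECS
    eff = min(p, n_zones, cores)
    return GLOBAL_SECS + PER_ZONE_SECS * math.ceil(n_zones / eff)

# With P=16 capped by 8 cores, the model predicts the same time as P=8:
assert parallel_patch_time(16, 16) == parallel_patch_time(16, 8)
# P=1 degenerates to the sequential case: 120 + 16*55 seconds.
assert parallel_patch_time(16, 1) == 1000
```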

At this point the SSD was feeling neglected, so I re-did all of the above with zones on the SSD. This graph shows the results: little benefit at low zone count, but significant improvement at higher zone counts - when the HDD was the bottleneck. Combining ZPP and SSD resulted in patching throughput improvement of 5x with 16 zones.

That seems like magic! What's the trick? A few paragraphs back, I mentioned the 'trick': using all of the scalability of Solaris and, in this case, CMT systems. Patching a system without ZPP - especially one without a running application - leaves plenty of throughput performance "on the table." Patching multiple zones simultaneously uses CPU cycles - presumably cycles that would have been idle. And it uses I/O channel and disk bandwidth - also, hopefully, available bandwidth. Essentially, ZPP is shortening the elapsed time by using more CPU cycles and I/O bandwidth now instead of using them later.

So the main caution is "make sure there is sufficient compute and I/O capacity to patch multiple zones at the same time."

But whenever multiple apps are running on the same system at the same time, the operating system must perform extra tasks to enable them to run safely. It doesn't matter if the "app" is a database server or 'patchadd.' So is ZPP using any "extra" CPU, i.e. is there any CPU overhead?

Along the way, I collected basic CPU statistics, including system and user time. The next graph shows that the amount of total CPU time (user+sys) increased slightly. The overhead was less than 10% for up to 8 zones. Another coincidence? I don't know, but at that point the overhead was roughly 1% per zone. The overhead increased faster beyond P=8, indicating that, perhaps, a good rule of thumb is P="number of unused cores." Of course, if the system is using Dynamic Resource Pools or dedicated-cpus, the rule might need to be changed accordingly. TANSTAAFL.


All good tales need a conclusion. The final graph shows the speedup - the increase in patching throughput - based on the number of zones, level of parallelization, and device type. Specific conclusions are:
  1. Parallel patching zones significantly reduces elapsed time if there are sufficient compute and I/O resources available.
  2. Solid-state disks significantly improve patching throughput for high-zone counts and similar levels of parallelization.
  3. The amount of work accomplished does not decrease - it's merely compressed into a shorter period of elapsed time.
  4. If patching while applications are running:
    • plan carefully in order to avoid impacting the responsiveness of your applications. Choose a level of parallelization commensurate with the amount of available compute capacity
    • use appropriate resource controls to maintain desired response times for your applications.

Getting the ZPP requires waiting until mid-year. Getting SSDs is easy - they're available for the Sun 7210 and 7410 Unified Storage Systems and for Sun systems.


Tuesday Feb 10, 2009

Zones to the Rescue

Recently, Thomson Reuters "demonstrated that RMDS [Reuters Market Data System software] performs better in a virtualized environment with Solaris Containers than it does with a number of individual Sun server machines."

This enabled Thomson Reuters to break the "million-messages-per-second barrier."

The performance improvement is probably due to the extremely high bandwidth, low latency characteristics of inter-Container network communications. Because all inter-Container network traffic is accomplished with memory transfers - using default settings - packets 'move' at computer memory speeds, which are much better than common 100Mbps or 1Gbps ethernet bandwidth. Further, that network performance is much more consistent without extra hardware - switches and routers - that can contribute to latency.

Articles can be found at:

Thursday Jan 29, 2009

Group of Zones - Herd? Flock? Pod? Implausibility? Cluster!

Yesterday, Solaris Cluster (aka "Sun Cluster") 3.2 1/09 was released. This release has two new features which directly enhance support of Solaris Zones in Solaris Clusters.

The most significant new functionality is a feature called "Zone Clusters" which, at this point, 'merely' :-) provides support for Oracle RAC nodes in Zones. In other words, you can create an Oracle RAC cluster, using individual zones in a Solaris Cluster as RAC nodes.

Further, because a Solaris Cluster can contain multiple Zone Clusters, it can contain multiple Oracle RAC clusters. For details about configuring a zone cluster, see "Configuring a Zone Cluster" in the Sun Cluster Software Installation Guide and the clzonecluster(1CL) man page.

The second new feature is support for exclusive-IP zones. Note that this only applies to failover data services, not to scalable data services or zone clusters.

Friday Jan 09, 2009

Zones and Solaris Security

An under-appreciated aspect of the isolation inherent in Solaris Zones (aka Solaris Containers) is their ability to use standard Solaris security features to enhance security of consolidated workloads. These features can be used alone or in combination to create an arbitrarily strong level of security. This includes DoD-strength security using Solaris Trusted Extensions - which use Solaris Zones to provide labeled, multi-level data classification. Trusted Extensions achieved one of the highest possible Common Criteria independent security certifications.

To shine some light on the topic of Zones and security, Glenn Brunette and I recently co-authored a new Sun BluePrint with an overly long name :-) - "Understanding the Security Capabilities of Solaris Zones Software." You can find it at

Monday Nov 24, 2008

Zonestat v1.3

It's - already - time for a zonestat update. I was never happy with the method that zonestat used to discover the mappings of zones to resource pools, but wanted to get v1.1 "out the door" before I had a chance to improve on its use of zonecfg(1M). The obvious problem, which at least one person stumbled over, was the fact that you can re-configure a zone while it's running. After doing that, the configuration information doesn't match the current mapping of zone to pool, and zonestat became confused.

Anyway, I found the time to replace the code in zonestat which discovered zone-to-pool mappings with a more sophisticated method. The new method uses ps(1) to learn the PID of each zone's zsched process. Then it uses "poolbind -q <PID>" to look up the pool for that process. The result is more accurate data, but the ps command does use more CPU cycles.
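The zsched-discovery step can be sketched in a few lines. This parses output in the style of `ps -e -o pid,zone,comm`; the sample below is made up, and the actual `poolbind -q` lookup is omitted since it must run as root in the global zone:

```python
def zsched_pids(ps_output):
    """Map zone name -> PID of that zone's zsched process,
    given output from `ps -e -o pid,zone,comm`."""
    mapping = {}
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        pid, zone, comm = line.split(None, 2)
        if comm == "zsched":
            mapping[zone] = int(pid)
    return mapping

# Hypothetical ps output for a system with two running zones:
sample = """  PID     ZONE COMMAND
    1   global /sbin/init
 2231     db01 zsched
 2232     db01 /usr/sbin/sshd
 3410    web02 zsched
"""
assert zsched_pids(sample) == {"db01": 2231, "web02": 3410}
# zonestat then runs `poolbind -q <PID>` on each PID to find its pool.
```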

While performing surgery on zonestat, I also:

  • began using the kstat module for Perl, reducing CPU time consumed by zonestat
  • fixed some bugs in CPU reporting
  • limited zonename output to 8 characters to improve readability
  • made some small performance improvements
  • added a $DEBUG variable which can be used to watch the commands that zonestat is using

With all of that, zonestat v1.3 provides better data, is roughly 30% faster, and is even smaller than the previous version! :-) You can find it at http://

Tuesday Nov 18, 2008

Zonestat: How Big is a Zone?


Recently an organization showed great interest in Solaris Containers, especially the resource management features, and then asked "what command do you use to compare the resource usage of Containers to the resource caps you have set?"

I began to list them: prstat(1M), poolstat(1M), ipcs(1), kstat(1M), rcapstat(1), prctl(1), ... Obviously it would be easier to monitor Containers if there was one 'dashboard' to view. Such a dashboard would enable zone administrators to easily review zones' usage of system resources and decide if further investigation is necessary. Also, if there is a system-wide shortage of a resource, this tool would be the first out of the toolbox, simplifying the task of finding the 'resource hog.'


In order to investigate the possibilities, I created a Perl script I call 'zonestat' which summarizes resource usage of Containers. I consider this script a prototype, not intended for production use. On the other hand, for a small number of zones, it seems to be pretty handy, and moderately robust.

Its output looks like this:

Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
  global  0D  66K    2       0.1 100 25      986M      139K  18E 2.2M  18E 754M
    db01  0D  66K    2       0.1 200 50   1G 122M 536M  0.0 536M    0   1G 135M
   web02  0D  66K    2  0.4  0.0 100 25 100M  11M  20M  0.0  20M    0 268M   8M 
==TOTAL= --- ----    2 ----  0.2  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 897M

The 'Pool' columns provide information about the Dynamic Resource Pool in which the zone's processes are running. In the two-character 'IT' column, 'I' displays the pool ID (a number) and 'T' indicates the type of pool - 'D' for 'default', 'P' for 'private' (using the dedicated-cpu feature of zonecfg) or 'S' for 'shared.' The two 'Size' columns ('Max' and 'Cur') show the quantity of CPUs assigned to the pool in which the zone is running.

The 'CPU Pset' columns show each zone's CPU usage and any caps that have been set. The first two columns show CPU quantities - CPU cores for x86, SPARC64 and all UltraSPARC systems except CMT (T1, T2, T2+). On CMT systems, Solaris considers every hardware thread ('strand') to be a CPU, and calls them 'vCPUs.'

The last two CPU columns - 'Shr' and 'S%' - show the number of FSS shares assigned to the zone, and what percentage that represents of the total number of shares in that zone's pool. In the example above, all the zones share the default pset, and the zone 'db01' holds 200 of the pool's 400 shares, so it should receive at least 50% of the CPU power of the pool.
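The S% arithmetic is simply each zone's shares divided by the total shares of zones running in the same pool. A sketch using the figures from the sample output above:

```python
def fss_percent(zone_shares, all_shares_in_pool):
    """Minimum CPU entitlement under FSS: the zone's fraction of the
    active shares in its pool, as a percentage."""
    return 100 * zone_shares / sum(all_shares_in_pool)

# From the example: global=100, db01=200, web02=100 shares, one shared pool.
pool = [100, 200, 100]
assert fss_percent(200, pool) == 50.0   # db01's S% column
assert fss_percent(100, pool) == 25.0   # global's and web02's S% column
```

Note that these are minimum guarantees under contention; an FSS zone may use more CPU when the others are idle.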

The 'Memory' columns show the caps and usage for RAM, locked memory and virtual memory. Note that virtual memory is RAM plus swap space.

The syntax of zonestat is very similar to the other *stat tools:

   zonestat [-l] [interval [count]]

The output shown above is generated with the -l flag, which means "show the limits (caps) that have been set." Without -l, only usage columns are displayed.

Example of Usage

Here is more output, showing some of the conclusions that can be drawn from the data. I have added parenthetical numbers in the right-hand margin in order to refer to specific lines of output.

Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
  global  0D  66K    2       0.1 100 HH      983M      139K  18E 2.2M  18E 752M
==TOTAL= --- ----    2 ----  0.1  -- -- 4.3G 983M 4.3G 139K 4.1G 2.2M 5.3G 752M
  global  0D  66K    2       0.1 100 HH      983M      139K  18E   2M  18E 752M
==TOTAL= --- ----    2 ----  0.1  -- -- 4.3G 983M 4.3G 139K 4.1G 2.2M 5.3G 752M

Note that none of the non-global zones are running. Because the global zone is the only zone running in its pool, its 100 FSS shares represent 100% of the shares in its pool. To save a column of output, I indicate that with 'HH' instead of '100'.

The "==TOTAL=" line provides two types of information, depending on the column type. For usage information, the sum of the resource used is shown. For example, "RAM Use" shows the amount of RAM used by all zones, including the global zone. For resource controls, either the system's amount of the resource is shown, e.g. "RAM Cap", or hyphens are displayed.

Note that there is a maximum amount of RAM that can be locked in a Solaris system. This prevents all of memory from being locked down, which would prevent the virtual memory system from running. In the output above, this system will only allow 4.1GB of RAM to be locked.

Also note that the amount of VM used is less than the amount of RAM used. This is because the memory pages which contain a program's instructions are not backed by swap disk, but by the file system itself. Those 'text' pages take up RAM, but do not take up swap space.

Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
  global  0D  66K    2       0.1 100 50      984M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.1 100 50 1.0G  30M 536M  0.0 536M  0.0 1.0G  27M
==TOTAL= --- ----    2 ----  0.2  -- -- 4.3G 1.0G 4.3G 139K 4.1G 2.2M 5.3G 780M

A zone has booted. It has caps for RAM, shared memory, locked memory, and VM. The default pool now has a total of 200 shares: 100 for each zone. Therefore, each zone has 50% of the shares in that pool. This provides a good reason to change the global zone's FSS value from its default of one share to a larger value as soon as you add the first zone to a system.

  global  0D  66K    2       0.1 100 50      984M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.3 100 50   1G  93M 536M  0.0 536M  0.0   1G  95M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 848M
  global  0D  66K    2       0.1 100 50      981M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.4 100 50   1G 122M 536M  0.0 536M  0.0   1G 135M
==TOTAL= --- ----    2 ----  0.5  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M

The zone 'z3' is still booting, and is using 0.4 CPUs worth of CPU cycles.

  global  0D  66K    2       0.1 100 50      984M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.3 100 50   1G 122M 536M      536M  0.0   1G 135M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M
  global  0D  66K    2       0.1 100 50      984M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.2 100 50   1G 122M 536M      536M  0.0   1G 135M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M
  global  0D  66K    2       0.1 100 33      986M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.1 100 33   1G 122M 536M      536M  0.0   1G 135M
   web02  0D  66K    2  0.4  0.0 100 33 100M  11M  20M       20M  0.0 268M   8M
==TOTAL= --- ----    2 ----  0.2  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 897M

A third zone has booted. This zone has a CPU-cap of 0.4 CPUs. It also has memory caps, including a RAM cap that is less than the amount of memory that zone 'z3' is using. If the two zones need the same amount, web02 should begin paging before long. Let's see what happens...

  global  0D  66K    2       0.1   1 33      985M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.1   1 33   1G 122M 536M      536M  0.0   1G 135M
   web02  0D  66K    2  0.4  0.1   1 33 100M  29M  20M       20M  0.0 268M  36M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 925M
  global  0D  66K    2       0.1   1 33      984M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.1   1 33   1G 122M 536M      536M  0.0   1G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M  63M  20M       20M  0.0 268M 138M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
  global  0D  66K    2       0.1   1 33      985M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33   1G 122M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M  87M  20M       20M  0.0 268M 185M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.1G
  global  0D  66K    2       0.1   1 33      985M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33   1G 122M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M 100M  20M       20M  0.0 268M 112M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
  global  0D  66K    2       0.1   1 33      984M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33   1G 122M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.3   1 33 100M 112M  20M       20M  0.0 268M 117M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
As expected, web02 exceeds its RAM cap. Now rcapd should address the problem.
  global  0D  66K    2       0.1   1 33      981M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33   1G 119M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.3   1 33 100M 111M  20M       20M  0.0 268M 127M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
One of two things has happened: either a process in web02 freed up memory, or rcapd caused pageouts. rcapstat(1M) will tell us which it is. Also, the increase in VM usage indicates that more memory was allocated than freed, so it's more likely that rcapd was active during this period.
  global  0D  66K    2       0.1   1 33      981M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33 1.0G 119M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M 110M  20M       20M  0.0 268M 133M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
  global  0D  66K    2       0.1   1 33      978M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33 1.0G 116M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M  91M  20M       20M  0.0 268M 133M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
At this point 'web02' is safely under its RAM cap. If this zone began to do 'real' work, it would continually be under memory pressure, and the value in 'Memory:RAM:Use' would fluctuate around 100M. When setting a RAM cap, it is very important to choose a reasonable value to avoid causing unnecessary paging.
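The over-cap comparison illustrated above boils down to "used minus cap, per zone." Here is a toy sketch of that check in awk; the zone names and megabyte figures are sample values chosen to match the example, not real zonestat output:

```shell
# Flag zones whose RAM use exceeds their cap.
# Input fields: zone, RAM cap (MB), RAM used (MB) -- hypothetical sample data.
printf '%s\n' 'z3 1024 122' 'web02 100 112' |
awk '$3 > $2 { printf "%s is %dM over its RAM cap\n", $1, $3 - $2 }'
```

With the sample data above, only web02 is reported, because z3 is well under its 1G cap.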

One final example, taken from a different configuration of zones:

Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap|Used| Cap|Used| Cap|Used| Cap|Used
  global  0D  66K    1  0.0  0.0 200 66      1.2G  18E 343K  18E 2.6M  18E 1.1G
      zB  0D  66K    1  0.2  0.0 100 33      124M  18E  0.0  18E  0.0  18E 138M
      zA  1P    1    1  0.0  0.1   1 HH       31M  18E  0.0  18E  0.0  18E  24M
==TOTAL= --- ----    2 ----  0.1 --- -- 4.3G 1.4G 4.3G 343K 4.1G 2.6M 5.3G 1.2G
The global zone and zone 'zB' share the default pool. Because the global zone has 200 FSS shares, compared to zB's 100 shares, global zone processes will get 2/3 of the processing power of the default pool if there is contention for CPU time. However, contention is unlikely, because zB is capped at 0.2 CPUs worth of compute time.
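The 2/3 figure follows directly from FSS arithmetic: under contention, a zone's entitlement is its share count divided by the total shares in its pool. This sketch just restates that calculation in shell, using the share counts from the example:

```shell
# FSS entitlement under contention = zone's shares / total shares in the pool.
# Share counts mirror the example above: global=200, zB=100.
global_shares=200
zB_shares=100
total=$((global_shares + zB_shares))
echo "global: $((100 * global_shares / total))%  zB: $((100 * zB_shares / total))%"
```

Note that shares only matter under contention; an idle pool lets any zone use all available CPU, subject to its cap.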

Zone 'zA' is in its own private resource pool. It has exclusive access to the one dedicated CPU in that pool.



Zonestat's biggest weakness stems from its brute-force nature: it runs a few commands for each running zone. This can consume many CPU cycles, and with many zones it can take a few seconds to complete. Performance improvements to zonestat are underway.

Wrong / Misleading CPU Usage Data

Two commonly used methods to 'measure' CPU usage by processes and zones are prstat and mpstat. Each can produce inaccurate 'data' in certain situations.

With mpstat it is not difficult to create surprising results. For example, on a CMT system, set a CPU cap on a zone in a pool and run a few CPU-bound processes: the "Pset Used" column will not reach the CPU cap. This is an artifact of the method mpstat uses to calculate its data.

Prstat only computes its data occasionally, ignoring anything that happened between samples. This leads to undercounting CPU usage for zones with many short-lived processes.
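The sampling blind spot can be illustrated with a toy model (this is not prstat's actual algorithm, just an illustration of interval sampling): a sampler that inspects the process table only at fixed instants never sees a process whose entire lifetime falls between two samples.

```shell
# Toy model of interval sampling: samples taken at t=0, 5, 10 seconds.
# A process alive only from t=1 to t=3 is never present at a sample instant,
# so its CPU consumption is never counted.
start=1; end=3; observed=no
for t in 0 5 10; do
  if [ "$t" -ge "$start" ] && [ "$t" -le "$end" ]; then observed=yes; fi
done
echo "short-lived process observed: $observed"
```

A zone spawning hundreds of such processes per interval can consume substantial CPU while appearing nearly idle to a sampler.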

I wrote code to gather data from each, but prstat seemed more useful, so for now the output comes from prstat.

What's Next

I would like feedback on this tool, perhaps leading to minor modifications to improve its robustness and usability. What's missing? What's not clear?

The future of zonestat might include these:

  • I hope that this can be re-written in C or D. Either way, it might find its way into Solaris... If I can find the time, I would like to tackle this.
  • New features - best added to a 'real' version:
    1. -d: show disk usage
    2. -n: show network usage
    3. -i: also show installed zones that are not running
    4. -c: also show configured zones that are not installed
    5. -p: only show processor info, but add more fields, e.g. prstat's instantaneous CPU%, micro-state fields, and mpstat's CPU%
    6. -m: only show memory-related info, and add the paging columns of vmstat, output of "vmstat -p", free swap space
    7. -s : sort sample output by a field. Good example: sort by poolID
    8. add a one-character column showing the default scheduler for each pool
    9. report state transitions like mpstat does, e.g. changes in zone state, changes in pool configuration
    10. improve robustness

The Code

You can find the Perl script in the Files page of the OpenSolaris project "Zone Statistics."

Wednesday Apr 09, 2008

Effortless Upgrade: Solaris 8 System to Solaris 8 Container

[Update 23 Aug 2008: Solaris 9 Containers was released a few months back. Reply to John's comment: yes, the steps to use Solaris 9 Containers are the same as the steps for Solaris 8 Containers]

Since 2005, Solaris 10 has offered the Solaris Containers feature set, creating isolated virtual Solaris environments for Solaris 10 applications. Although almost all Solaris 8 applications run unmodified in Solaris 10 Containers, sometimes it would be better to just move an entire Solaris 8 system - all of its directories and files, configuration information, etc. - into a Solaris 10 Container. This has become very easy - just three commands.

Sun offers a Solaris Binary Compatibility Guarantee which demonstrates the significant effort that Sun invests in maintaining compatibility from one Solaris version to the next. Because of that effort, almost all applications written for Solaris 8 run unmodified on Solaris 10, either in a Solaris 10 Container or in the Solaris 10 global zone.

However, there are still some data centers with many Solaris 8 systems. In some situations it is not practical to re-test all of those applications on Solaris 10. It would be much easier to just move the entire contents of the Solaris 8 file systems into a Solaris Container and consolidate many Solaris 8 systems into a much smaller number of Solaris 10 systems.

For those types of situations, and some others, Sun now offers Solaris 8 Containers. These use the "Branded Zones" framework available in OpenSolaris and first released in Solaris 10 in August 2007. A Solaris 8 Container provides an isolated environment in which Solaris 8 binaries - applications and libraries - can run without modification. To a user logged in to the Container, or to an application running in the Container, there is very little evidence that this is not a Solaris 8 system.

The Solaris 8 Container technology rests on a very thin layer of software which performs system call translations - from Solaris 8 system calls to Solaris 10 system calls. This is not binary emulation, and the number of system calls with any difference is small, so the performance penalty is extremely small - typically less than 3%.

Not only is this technology efficient, it's very easy to use. There are five steps, but two of them can be combined into one:

  1. install Solaris 8 Containers packages on the Solaris 10 system
  2. patch systems if necessary
  3. archive the contents of the Solaris 8 system
  4. move the archive to the Solaris 10 system (if step 3 placed the archive on a file system accessible to the Solaris 10 system, e.g. via NFS, this step is unnecessary)
  5. configure and install the Solaris 8 Container, using the archive
The rest of this blog entry is a demonstration of Solaris 8 Containers, using a Sun Ultra 60 workstation for the Solaris 8 system and a Logical Domain on a Sun Fire T2000 for the Solaris 10 system. I chose those two only because they were available to me. Any two SPARC systems could be used as long as one can run Solaris 8 and one can run Solaris 10.

Almost any Solaris 8 revision or patch level will work, but Sun strongly recommends applying the most recent patches to that system. The Solaris 10 system must be running Solaris 10 8/07, and requires the following minimum patch levels:

  • Kernel patch: 127111-08 (-11 is now available) which needs
    • 125369-13
    • 125476-02
  • Solaris 8 Branded Zones patch: 128548-05
The first step is installation of the Solaris 8 Containers packages, which are available for download from Sun. Installing those packages on the Solaris 10 system is easy and takes 5-10 seconds:
s10-system# pkgadd -d . SUNWs8brandr  SUNWs8brandu  SUNWs8p2v
Now we can patch the Solaris 10 system, using the patches listed above.

After patches have been applied, it's time to archive the Solaris 8 system. In order to remove the "archive transfer" step, I'll turn the Solaris 10 system into an NFS server and mount its shared file system on the Solaris 8 system. The archive can then be created by the Solaris 8 system but stored on the Solaris 10 system. There are several tools which can be used to create the archive: Solaris flash archive tools, cpio, pax, etc. In this example I used flarcreate, which first became available in Solaris 8 2/04.

s10-system# share /export/home/s8-archives
s8-system# mount s10-system:/export/home/s8-archives /mnt
s8-system# flarcreate -S -n atl-sewr-s8 /mnt/atl-sewr-s8.flar
Creation of the archive takes longer than any other step - 15 minutes to an hour, or even more, depending on the size of the Solaris 8 file systems.

With the archive in place, we can configure and install the Solaris 8 Container. In this demonstration the Container was "sys-unconfig'd" by using the -u option. The opposite of that is -p, which preserves the system configuration information of the Solaris 8 system.

s10-system# zonecfg -z test8
zonecfg:test8> create -t SUNWsolaris8
zonecfg:test8> set zonepath=/zones/roots/test8
zonecfg:test8> add net
zonecfg:test8:net> set address=
zonecfg:test8:net> set physical=vnet0
zonecfg:test8:net> end
zonecfg:test8> exit
s10-system# zoneadm -z test8 install -u -a /export/home/s8-archives/atl-sewr-s8.flar
              Log File: /var/tmp/test8.install.995.log
            Source: /export/home/s8-archives/atl-sewr-s8.flar
        Installing: This may take several minutes...
    Postprocessing: This may take several minutes...

            Result: Installation completed successfully.
          Log File: /zones/roots/test8/root/var/log/test8.install.995.log
This step should take 5-10 minutes. After the Container has been installed, it can be booted.
s10-system# zoneadm -z test8 boot
s10-system# zlogin -C test8
At this point I was connected to the Container's console. It asked the usual system configuration questions, and then rebooted:
[NOTICE: Zone rebooting]

SunOS Release 5.8 Version Generic_Virtual 64-bit
Copyright 1983-2000 Sun Microsystems, Inc.  All rights reserved

Hostname: test8
The system is coming up.  Please wait.
starting rpc services: rpcbind done.
syslog service starting.
Print services started.
Apr  1 18:07:23 test8 sendmail[3344]: My unqualified host name (test8) unknown; sleeping for retry
The system is ready.

test8 console login: root
Apr  1 18:08:04 test8 login: ROOT LOGIN /dev/console
Last login: Tue Apr  1 10:47:56 from vpn-129-150-80-
Sun Microsystems Inc.   SunOS 5.8       Generic Patch   February 2004
# bash
bash-2.03# psrinfo
0       on-line   since 04/01/2008 03:56:38
1       on-line   since 04/01/2008 03:56:38
2       on-line   since 04/01/2008 03:56:38
3       on-line   since 04/01/2008 03:56:38
bash-2.03# ifconfig -a
lo0:1: flags=1000849 mtu 8232 index 1
        inet netmask ff000000
vnet0:1: flags=1000843 mtu 1500 index 2
        inet netmask ffffff00 broadcast

At this point the Solaris 8 Container exists. It's accessible on the local network, existing applications can be run in it, or new software can be added to it, or existing software can be patched.

To extend the example, here is the output from the commands I used to limit this Solaris 8 Container to only use a subset of the 32 virtual CPUs on that Sun Fire T2000 system.

s10-system# zonecfg -z test8
zonecfg:test8> add dedicated-cpu
zonecfg:test8:dedicated-cpu> set ncpus=2
zonecfg:test8:dedicated-cpu> end
zonecfg:test8> exit
bash-3.00# zoneadm -z test8 reboot
bash-3.00# zlogin -C test8

[NOTICE: Zone rebooting]

SunOS Release 5.8 Version Generic_Virtual 64-bit
Copyright 1983-2000 Sun Microsystems, Inc.  All rights reserved

Hostname: test8
The system is coming up.  Please wait.
starting rpc services: rpcbind done.
syslog service starting.
Print services started.
Apr  1 18:14:53 test8 sendmail[3733]: My unqualified host name (test8) unknown; sleeping for retry
The system is ready.
test8 console login: root
Apr  1 18:15:24 test8 login: ROOT LOGIN /dev/console
Last login: Tue Apr  1 18:08:04 on console
Sun Microsystems Inc.   SunOS 5.8       Generic Patch   February 2004
# psrinfo
0       on-line   since 04/01/2008 03:56:38
1       on-line   since 04/01/2008 03:56:38

Finally, for those who were counting: the "three commands" were, at a minimum, flarcreate, zonecfg and zoneadm.

Tuesday Apr 08, 2008

ZoiT: Solaris Zones on iSCSI Targets (aka NAC: Network-Attached Containers)


Solaris Containers have a 'zonepath' ('home') which can be a directory on the root file system or on a non-root file system. Until Solaris 10 8/07 was released, a local file system was required for this directory. Containers that are on non-root file systems have used UFS, ZFS, or VxFS. All of those are local file systems - putting Containers on NAS has not been possible. With Solaris 10 8/07, that has changed: a Container can now be placed on remote storage via iSCSI.


Solaris Containers (aka Zones) are Sun's operating system level virtualization technology. They allow a Solaris system (more accurately, an 'instance' of Solaris) to have multiple, independent, isolated application environments. A program running in a Container cannot detect or interact with a process in another Container.

Each Container has its own root directory. Although viewed as the root directory from within that Container, that directory is also a non-root directory in the global zone. For example, a Container's root directory might be called /zones/roots/myzone/root in the global zone.

The configuration of a Container includes something called its "zonepath." This is the directory which contains a Container's root directory (e.g. /zones/roots/myzone/root) and other directories used by Solaris. Therefore, the zonepath of myzone in the example above would be /zones/roots/myzone.

The global zone administrator can choose any directory to be a Container's zonepath. That directory could just be a directory on the root partition of Solaris, though in that case some mechanism should be used to prevent that Container from filling up the root partition. Another alternative is to use a separate partition for that Container, or one shared among multiple Containers. In the latter case, a quota should be used for each Container.

Local file systems have been used for zonepaths. However, many people have strongly expressed a desire for the ability to put Containers on remote storage. One significant advantage to placing Containers on NAS is the simplification of Container migration - moving a Container from one system to another. When using a local file system, the contents of the Container must be transmitted from the original host to the new host. For small, sparse zones this can take as little as a few seconds. For large, whole-root zones, this can take several minutes - a whole-root zone is an entire copy of Solaris, taking up as much as 3-5 GB. If remote storage can be used to store a zone, the zone's downtime can be as little as a second or two, during which time a file system is unmounted on one system and mounted on another.

Here are some significant advantages of iSCSI over Fibre Channel SANs:

  1. the ability to use commodity Ethernet switching gear, which tends to be less expensive than SAN switching equipment
  2. the ability to manage storage bandwidth via standard, mature, commonly used IP QoS features
  3. iSCSI networks can be combined with non-iSCSI IP networks to reduce the hardware investment and consolidate network management. If that is not appropriate, the two networks can be kept separate but use the same type of equipment, reducing costs and the range of in-house infrastructure management expertise required.

Unfortunately, a Container cannot 'live' on an NFS server, and it's not clear if or when that limitation will be removed.

iSCSI Basics

iSCSI is simply "SCSI communication over IP." In this case, SCSI commands and responses are sent between two iSCSI-capable devices, which can be general-purpose computers (Solaris, Windows, Linux, etc.) or specific-purpose storage devices (e.g. Sun StorageTek 5210 NAS, EMC Celerra NS40, etc.). There are two endpoints to iSCSI communications: the initiator (client) and the target (server). A target publicizes its existence. An initiator binds to a target.

The industry's design for iSCSI includes a large number of features, including security. Solaris implements many of those features; details can be found in the Solaris system administration documentation.

In Solaris, the command iscsiadm(1M) configures an initiator, and the command iscsitadm(1M) configures a target.


This section demonstrates the installation of a Container onto a remote file system that uses iSCSI for its transport.

The target system is an LDom on a T2000, and looks like this:

System Configuration:  Sun Microsystems  sun4v
Memory size: 1024 Megabytes
SunOS ldg1 5.10 Generic_127111-07 sun4v sparc SUNW,Sun-Fire-T200
Solaris 10 8/07 s10s_u4wos_12b SPARC
The initiator system is another LDom on the same T2000 - although there is no requirement that LDoms are used, or that they be on the same computer if they are used.
System Configuration:  Sun Microsystems  sun4v
Memory size: 896 Megabytes
SunOS ldg4 5.11 snv_83 sun4v sparc SUNW,Sun-Fire-T200
Solaris Nevada snv_83a SPARC
The first configuration step is the creation of the storage underlying the iSCSI target. Although UFS could be used, let's improve the robustness of the Container's contents and put the target's storage under control of ZFS. I don't have extra disk devices to give to ZFS, so I'll make some and use them for a zpool - in real life you would use disk devices here:
Target# mkfile 150m /export/home/disk0
Target# mkfile 150m /export/home/disk1
Target# zpool create myscsi mirror /export/home/disk0 /export/home/disk1
Target# zpool status
  pool: myscsi
 state: ONLINE
 scrub: none requested

        NAME                  STATE     READ WRITE CKSUM
        myscsi                ONLINE       0     0     0
          /export/home/disk0  ONLINE       0     0     0
          /export/home/disk1  ONLINE       0     0     0
Now I can create a zvol - an emulation of a disk device:
Target# zfs list
myscsi   86K   258M  24.5K  /myscsi
Target# zfs create -V 200m myscsi/jvol0
Target# zfs list
myscsi         200M  57.9M  24.5K  /myscsi
myscsi/jvol0  22.5K   258M  22.5K  -
Creating an iSCSI target device from a zvol is easy:
Target# iscsitadm list target
Target# zfs set shareiscsi=on myscsi/jvol0
Target# iscsitadm list target
Target: myscsi/jvol0
    iSCSI Name:
    Connections: 0
Target# iscsitadm list target -v
Target: myscsi/jvol0
    iSCSI Name:
    Alias: myscsi/jvol0
    Connections: 0
    ACL list:
    TPGT list:
    LUN information:
        LUN: 0
            GUID: 0x0
            VID: SUN
            PID: SOLARIS
            Type: disk
            Size:  200M
            Backing store: /dev/zvol/rdsk/myscsi/jvol0
            Status: online

Configuring the iSCSI initiator takes a little more work. There are three methods to find targets. I will use a simple one. After telling Solaris to use that method, it only needs to know what the IP address of the target is.

Note that the example below uses "iscsiadm list ..." several times, without any output. The purpose is to show the difference in output before and after the command(s) between them.

First let's look at the disks available before configuring iSCSI on the initiator:

Initiator# ls /dev/dsk
c0d0s0  c0d0s2  c0d0s4  c0d0s6  c0d1s0  c0d1s2  c0d1s4  c0d1s6
c0d0s1  c0d0s3  c0d0s5  c0d0s7  c0d1s1  c0d1s3  c0d1s5  c0d1s7
We can view the currently enabled discovery methods, and enable the one we want to use:
Initiator# iscsiadm list discovery
        Static: disabled
        Send Targets: disabled
        iSNS: disabled
Initiator# iscsiadm list target
Initiator# iscsiadm modify discovery --sendtargets enable
Initiator# iscsiadm list discovery
        Static: disabled
        Send Targets: enabled
        iSNS: disabled
At this point we just need to tell Solaris which IP address we want to use as a target. It takes care of all the details, finding all disk targets on the target system. In this case, there is only one disk target.
Initiator# iscsiadm list target
Initiator# iscsiadm add discovery-address
Initiator# iscsiadm list target
        Alias: myscsi/jvol0
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
Initiator# iscsiadm list target -v
        Alias: myscsi/jvol0
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
                CID: 0
                  IP address (Local):
                  IP address (Peer):
                  Discovery Method: SendTargets
                  Login Parameters (Negotiated):
                        Data Sequence In Order: yes
                        Data PDU In Order: yes
                        Default Time To Retain: 20
                        Default Time To Wait: 2
                        Error Recovery Level: 0
                        First Burst Length: 65536
                        Immediate Data: yes
                        Initial Ready To Transfer (R2T): yes
                        Max Burst Length: 262144
                        Max Outstanding R2T: 1
                        Max Receive Data Segment Length: 8192
                        Max Connections: 1
                        Header Digest: NONE
                        Data Digest: NONE
The initiator automatically finds the iSCSI remote storage, but we need to turn this into a disk device. (Newer builds seem to not need this step, but it won't hurt. Looking in /devices/iscsi will help determine whether it's needed.)
Initiator# devfsadm -i iscsi
Initiator# ls /dev/dsk
c0d0s0    c0d0s3    c0d0s6    c0d1s1    c0d1s4    c0d1s7    c1t7d0s2  c1t7d0s5
c0d0s1    c0d0s4    c0d0s7    c0d1s2    c0d1s5    c1t7d0s0  c1t7d0s3  c1t7d0s6
c0d0s2    c0d0s5    c0d1s0    c0d1s3    c0d1s6    c1t7d0s1  c1t7d0s4  c1t7d0s7
Initiator# ls -l /dev/dsk/c1t7d0s0
lrwxrwxrwx   1 root     root         100 Mar 28 00:40 /dev/dsk/c1t7d0s0 ->

Now that the local device entry exists, we can do something useful with it. Putting a file system on the device requires partitioning the "disk" with format(1M); it is assumed that the reader knows how to do that. However, here is the first part of the format dialogue, to show that format lists the new disk device with its unique identifier - the same identifier listed in /devices/iscsi.
Initiator# format
Searching for disks...done

c1t7d0: configured with capacity of 199.98MB

       0. c0d0 
       1. c0d1 
       2. c1t7d0 
Specify disk (enter its number): 2
selecting c1t7d0
[disk formatted]
Disk not labeled.  Label it now? no

Let's jump to the end of the partitioning steps, after assigning all of the available disk space to partition 0:
partition> print
Current partition table (unnamed):
Total disk cylinders available: 16382 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm       0 - 16381      199.98MB    (16382/0/0) 409550
  1 unassigned    wu       0                0         (0/0/0)          0
  2     backup    wu       0 - 16381      199.98MB    (16382/0/0) 409550
  3 unassigned    wm       0                0         (0/0/0)          0
  4 unassigned    wm       0                0         (0/0/0)          0
  5 unassigned    wm       0                0         (0/0/0)          0
  6 unassigned    wm       0                0         (0/0/0)          0
  7 unassigned    wm       0                0         (0/0/0)          0

partition> label
Ready to label disk, continue? y

The new raw disk needs a file system.
Initiator# newfs /dev/rdsk/c1t7d0s0
newfs: construct a new file system /dev/rdsk/c1t7d0s0: (y/n)? y
/dev/rdsk/c1t7d0s0:     409550 sectors in 16382 cylinders of 5 tracks, 5 sectors
        200.0MB in 1024 cyl groups (16 c/g, 0.20MB/g, 128 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 448, 864, 1280, 1696, 2112, 2528, 2944, 3232, 3648,
Initializing cylinder groups:
super-block backups for last 10 cylinder groups at:
 405728, 406144, 406432, 406848, 407264, 407680, 408096, 408512, 408928, 409344

Back on the target:
Target# zfs list
myscsi         200M  57.9M  24.5K  /myscsi
myscsi/jvol0  32.7M   225M  32.7M  -
Finally, the initiator has a new file system, on which we can install a zone.
Initiator# mkdir /zones/newroots
Initiator# mount /dev/dsk/c1t7d0s0 /zones/newroots
Initiator# zonecfg -z iscuzone
iscuzone: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:iscuzone> create
zonecfg:iscuzone> set zonepath=/zones/newroots/iscuzone
zonecfg:iscuzone> add inherit-pkg-dir
zonecfg:iscuzone:inherit-pkg-dir> set dir=/opt
zonecfg:iscuzone:inherit-pkg-dir> end
zonecfg:iscuzone> exit
Initiator# zoneadm -z iscuzone install
Preparing to install zone .
Creating list of files to copy from the global zone.
Copying <2762> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1162> packages on the zone.
Initialized <1162> packages on zone.
Zone  is initialized.
Installation of these packages generated warnings: 
The file  contains a log of the zone installation.

There it is: a Container on an iSCSI target on a ZFS zvol.

Zone Lifecycle, and Tech Support

There is more to management of Containers than creating them. When a Solaris instance is upgraded, all of its native Containers are upgraded as well. Some upgrade methods work better with certain system configurations than others. This is true for UFS, ZFS, other local file system types, and iSCSI targets that use any of them for underlying storage.

You can use Solaris Live Upgrade (LU) to patch or upgrade a system with Containers. If the Containers are on a traditional file system which uses UFS (e.g. /, /export/home), LU will automatically do the right thing. Further, if you create a UFS file system on an iSCSI target and install one or more Containers on it, the alternate boot environment (ABE) will also need file space for its copy of those Containers. To mimic the layout of the original boot environment you could use another UFS file system on another iSCSI target. The lucreate command would look something like this:

# lucreate -m /:/dev/dsk/c0t0d0s0:ufs   -m /zones:/dev/dsk/c1t7d0s0:ufs -n newBE


If you want to put your Solaris Containers on NAS storage, Solaris 10 8/07 will help you get there, using iSCSI.

Friday Mar 21, 2008

High-availability Networking for Solaris Containers

Here's another example of Containers that can manage their own affairs.

Sometimes you want to closely manage the devices that a Solaris Container uses. This is easy to do from the global zone: by default a Container does not have direct access to devices. It does have indirect access to some devices, e.g. via a file system that is available to the Container.

By default, zones use NICs that they share with the global zone, and perhaps with other zones. In the past these were just called "zones." Starting with Solaris 10 8/07, these are now referred to as "shared-IP zones." The global zone administrator manages all networking aspects of shared-IP zones.

Sometimes it would be easier to give direct control of a Container's devices to its owner. An excellent example of this is the option of allowing a Container to manage its own network interfaces. This enables it to configure IP Multipathing for itself, as well as IP Filter and other network features. Using IPMP increases the availability of the Container by creating redundant network paths to the Container. When configured correctly, this can prevent the failure of a network switch, network cable or NIC from blocking network access to the Container.

To use IP Multipathing, you must choose two network devices of the same type, e.g. two ethernet NICs. Those NICs are placed into an IPMP group through the use of the command ifconfig(1M). Usually this is done by placing the appropriate ifconfig parameters into files named /etc/hostname.<NIC-instance>, e.g. /etc/hostname.bge0.

An IPMP group is associated with an IP address. Packets leaving any NIC in the group have a source address of the IPMP group. Packets with a destination address of the IPMP group can enter through either NIC, depending on the state of the NICs in the group.

Delegating network configuration to a Container requires use of the new IP Instances feature. It's easy to create a zone that uses this feature, making this an "exclusive-IP zone." One new line in zonecfg(1M) will do it:

zonecfg:twilight> set ip-type=exclusive
Of course, you'll need at least two network devices in the IPMP group. Using IP Instances will dedicate these two NICs to this Container exclusively. Also, the Container will need direct access to the two network devices. Configuring all of that looks like this:
global# zonecfg -z twilight
zonecfg:twilight> create
zonecfg:twilight> set zonepath=/zones/roots/twilight
zonecfg:twilight> set ip-type=exclusive
zonecfg:twilight> add net
zonecfg:twilight:net> set physical=bge1
zonecfg:twilight:net> end
zonecfg:twilight> add net
zonecfg:twilight:net> set physical=bge2
zonecfg:twilight:net> end
zonecfg:twilight> add device
zonecfg:twilight:device> set match=/dev/net/bge1
zonecfg:twilight:device> end
zonecfg:twilight> add device
zonecfg:twilight:device> set match=/dev/net/bge2
zonecfg:twilight:device> end
zonecfg:twilight> exit
As usual, the Container must be installed and booted with zoneadm(1M):
global# zoneadm -z twilight install
global# zoneadm -z twilight boot
Now you can login to the Container's console and answer the usual configuration questions:
global# zlogin -C twilight
<answer questions>
<the zone automatically reboots>

After the Container reboots, you can configure IPMP. There are two methods. One uses link-based failure detection and one uses probe-based failure detection.

Link-based detection requires the use of a NIC which supports this feature. Some NICs that support this are hme, eri, ce, ge, bge, qfe and vnet (part of Sun's Logical Domains). They are able to detect failure of the link immediately and report that failure to Solaris. Solaris can then take appropriate steps to ensure that network traffic continues to flow on the remaining NIC(s).

Other NICs do not support this link-based failure detection, and must use probe-based detection. This method uses ICMP packets ("pings") from the NICs in the IPMP group to detect failure of a NIC. This requires one IP address per NIC, in addition to the IP address of the group.
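For probe-based detection, the hostname files carry an extra, non-failover test address for each NIC in addition to the group's data address. The fragment below is a sketch of what /etc/hostname.bge1 might contain; 'twilight-test1' is a hypothetical test address name that would also need an entry in /etc/inet/hosts:

```
twilight netmask + broadcast + group ipmp0 up \
addif twilight-test1 deprecated -failover netmask + broadcast + up
```

The matching /etc/hostname.bge2 would name a second hypothetical test address (e.g. 'twilight-test2') for the other NIC in the group; the deprecated and -failover flags keep the test addresses from being used as source addresses for data traffic or moved during a failover.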

Regardless of the method used, configuration can be accomplished manually or via files /etc/hostname.<NIC-instance>. First I'll describe the manual method.

Link-based Detection

Using link-based detection is easiest. The commands to configure IPMP look like these, whether they're run in an exclusive-IP zone for itself, or in the global zone, for its NICs and for NICs used by shared-IP Containers:
# ifconfig bge1 plumb
# ifconfig bge1 twilight group ipmp0 up
# ifconfig bge2 plumb
# ifconfig bge2 group ipmp0 up
Note that those commands only achieve the desired network configuration until the next time that Solaris boots. To configure Solaris to do the same thing when it next boots, you must put the same configuration information into configuration files. Inserting those parameters into configuration files is also easy:
/etc/hostname.bge1: twilight group ipmp0 up

/etc/hostname.bge2: group ipmp0 up
Those two files will be used to configure networking the next time that Solaris boots. Of course, an IP address entry for twilight is required in /etc/inet/hosts.

If you entered the ifconfig commands directly, you are finished. You can test your IPMP group with the if_mpadm command, which can be run in the global zone to test an IPMP group there, or in an exclusive-IP zone to test one of its own groups:

# ifconfig -a
bge1: flags=201000843 mtu 1500 index 4
        inet netmask ffff0000 broadcast
        groupname ipmp0
        ether 0:14:4f:f8:9:1d
bge2: flags=201000843 mtu 1500 index 5
        inet netmask ff000000
        groupname ipmp0
        ether 0:14:4f:fb:ca:b

# if_mpadm -d bge1
# ifconfig -a
bge1: flags=289000842 mtu 0 index 4
        inet netmask 0
        groupname ipmp0
        ether 0:14:4f:f8:9:1d
bge2: flags=201000843 mtu 1500 index 5
        inet netmask ff000000
        groupname ipmp0
        ether 0:14:4f:fb:ca:b
bge2:1: flags=201000843 mtu 1500 index 5
        inet netmask ffff0000 broadcast

# if_mpadm -r bge1
# ifconfig -a
bge1: flags=201000843 mtu 1500 index 4
        inet netmask ffff0000 broadcast
        groupname ipmp0
        ether 0:14:4f:f8:9:1d
bge2: flags=201000843 mtu 1500 index 5
        inet netmask ff000000
        groupname ipmp0
        ether 0:14:4f:fb:ca:b

If you are using link-based detection, that's all there is to it!

Probe-based Detection

As mentioned above, using probe-based detection requires more IP addresses:

/etc/hostname.bge1: twilight netmask + broadcast + group ipmp0 up addif twilight-test-bge1 \
deprecated -failover netmask + broadcast + up

/etc/hostname.bge2: twilight-test-bge2 deprecated -failover netmask + broadcast + group ipmp0 up
Three entries for hostname and IP address pairs will, of course, be needed in /etc/inet/hosts.
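As an illustration, those /etc/inet/hosts entries might look like this (the addresses below are hypothetical placeholders, not values from this configuration):

```
192.168.1.10    twilight             # data address of the IPMP group
192.168.1.11    twilight-test-bge1   # test address for bge1
192.168.1.12    twilight-test-bge2   # test address for bge2
```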

All that's left is a reboot of the Container. If a reboot is not practical at this time, you can accomplish the same effect by using ifconfig(1M) commands:

twilight# ifconfig bge1 plumb
twilight# ifconfig bge1 twilight netmask + broadcast + group ipmp0 up addif \
twilight-test-bge1 deprecated -failover netmask + broadcast + up
twilight# ifconfig bge2 plumb
twilight# ifconfig bge2 twilight-test-bge2 deprecated -failover netmask + \
broadcast + group ipmp0 up


Whether link-based failure detection or probe-based failure detection is used, we have a Container with these network properties:

  1. Two network interfaces
  2. Automatic failover between the two NICs
You can expand on this with more NICs to achieve even more resiliency or even greater bandwidth.

Wednesday Sep 05, 2007

New Zones Features

On September 4, 2007, Solaris 10 8/07 became available. You can download it - free of charge - from Sun's web site: just click on "Get Software" and follow the instructions.

This update to Solaris 10 has many new features. Of those, many enhance Solaris Containers either directly or indirectly. This update brings the most important changes to Containers since they were introduced in March of 2005. A brief introduction to them seems appropriate, but first a review of the previous update.

Solaris 10 11/06 added four features to Containers. One of them is called "configurable privileges" and allows the platform administrator to tailor the abilities of a Container to the needs of its application. I blogged about configurable privileges before, so I won't say any more here.

At least as important as that feature was the new ability to move (also called 'migrate') a Container from one Solaris 10 computer to another. This uses the 'detach' and 'attach' sub-commands to zoneadm(1M).

Other minor new features included:

  • rename a zone (i.e. Container)
  • move a zone to a different place in the file system on the same computer

New Features in Solaris 10 8/07 that Enhance Containers

New Resource Management Features

Solaris 10 8/07 has improved the resource management features of Containers. Some of these are new resource management features and some are improvements to the user interface. First I will describe three new "RM" features.

Earlier releases of Solaris 10 included the Resource Capping Daemon, rcapd. This tool enabled you to place a 'soft cap' on the amount of RAM (physical memory) that an application, user or group of users could use. When rcapd detected usage in excess of the cap, it paged out physical memory pages owned by that entity until memory usage fell below the cap.

Although it was possible to apply this tool to a zone, it was cumbersome and required cooperation from the administrator of the Container. In other words, the root user of a capped Container could change the cap. This made it inappropriate for potentially hostile environments, including service providers.

Solaris 10 8/07 enables the platform administrator to set a physical memory cap on a Container using an enhanced version of rcapd. Cooperation of the Container's administrator is not necessary - only the platform administrator can enable or disable this service or modify the caps. Further, usage has been greatly simplified to the following syntax:

global# zonecfg -z myzone
zonecfg:myzone> add capped-memory
zonecfg:myzone:capped-memory> set physical=500m
zonecfg:myzone:capped-memory> end
zonecfg:myzone> exit
The next time the Container boots, this cap (500MB of RAM) will be applied to it. The cap can also be modified while the Container is running, with:
global# rcapadm -z myzone -m 600m
Because this cap does not reserve RAM, you can over-subscribe RAM usage. The only drawback is the possibility of paging.

For more details, see the online documentation.

Virtual memory (i.e. swap space) can also be capped. This is a 'hard cap.' In a Container which has a swap cap, an attempt by a process to allocate more VM than is allowed will fail. (If you are familiar with system calls: malloc() will fail with ENOMEM.)

The syntax is very similar to the physical memory cap:

global# zonecfg -z myzone
zonecfg:myzone> add capped-memory
zonecfg:myzone:capped-memory> set swap=1g
zonecfg:myzone:capped-memory> end
zonecfg:myzone> exit
This limit can also be changed for a running Container:
global# prctl -n zone.max-swap -v 2g -t privileged -r -e deny -i zone myzone
Just as with the physical memory cap, if you want to change the setting for a running Container and for the next time it boots, you must use zonecfg and prctl or rcapadm.

The third new memory cap is locked memory. This is the amount of physical memory that a Container can lock down, i.e. prevent from being paged out. By default a Container now has the proc_lock_memory privilege, so it is wise to set this cap for all Containers.

Here is an example:

global# zonecfg -z myzone
zonecfg:myzone> add capped-memory
zonecfg:myzone:capped-memory> set locked=100m
zonecfg:myzone:capped-memory> end
zonecfg:myzone> exit

Simplified Resource Management Features

Dedicated CPUs

Many existing resource management features have a new, simplified user interface. For example, "dedicated-cpus" re-use the existing Dynamic Resource Pools features. But instead of needing many commands to configure them, configuration can be as simple as:

global# zonecfg -z myzone
zonecfg:myzone> add dedicated-cpu
zonecfg:myzone:dedicated-cpu> set ncpus=1-3
zonecfg:myzone:dedicated-cpu> end
zonecfg:myzone> exit
After using that command, when that Container boots, Solaris:
  1. removes a CPU from the default pool
  2. assigns that CPU to a newly created temporary pool
  3. associates that Container with that pool, i.e. only schedules that Container's processes on that CPU
Further, if the load on that CPU exceeds a default threshold and another CPU can be moved from another pool, Solaris will do that, up to the maximum configured amount of three CPUs. Finally, when the Container is stopped, the temporary pool is destroyed and its CPU(s) are placed back in the default pool.

Also, four existing project resource controls were applied to Containers:

global# zonecfg -z myzone
zonecfg:myzone> set max-shm-memory=100m
zonecfg:myzone> set max-shm-ids=100
zonecfg:myzone> set max-msg-ids=100
zonecfg:myzone> set max-sem-ids=100
zonecfg:myzone> exit
Fair Share Scheduler

A commonly used method to prevent "CPU hogs" from impacting other workloads is to assign a number of CPU shares to each workload, or to each zone. The relative number of shares assigned per zone guarantees a relative minimum amount of CPU power. This is less wasteful than dedicating a CPU to a Container that will not completely utilize the dedicated CPU(s).

Several steps were needed to configure this in the past. Solaris 10 8/07 simplifies this greatly: now just two steps are needed. First, the system must use FSS as the default scheduler. This command tells the system to use FSS as the default scheduler the next time it boots:

global# dispadmin -d FSS
Also, the Container must be assigned some shares:
global# zonecfg -z myzone
zonecfg:myzone> set cpu-shares=100
zonecfg:myzone> exit
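Shares are relative, not absolute: under full CPU contention, a zone's guaranteed minimum is its share count divided by the total shares of all active zones. A quick sketch of the arithmetic, with hypothetical share assignments:

```shell
# Hypothetical share assignments for two competing zones.
zoneA_shares=100
zoneB_shares=300
total=$((zoneA_shares + zoneB_shares))

# Under full contention, FSS guarantees each zone a fraction of the CPU
# proportional to its shares; idle cycles can still be used by any zone.
zoneA_min=$((100 * zoneA_shares / total))
echo "zoneA guaranteed minimum: ${zoneA_min}%"
```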
Shared Memory Accounting

One feature simplification is not a reduced number of commands, but reduced complexity in resource monitoring. Prior to Solaris 10 8/07, the accounting of shared memory pages had an unfortunate subtlety. If two processes in a Container shared some memory, per-Container summaries counted the shared memory usage once for every process that was sharing the memory. It would appear that a Container was using more memory than it really was.

This was changed in 8/07. Now, in the per-Container usage section of prstat and similar tools, shared memory pages are only counted once per Container.

Global Zone Resource Management

Solaris 10 8/07 adds the ability to persistently assign resource controls to the global zone and its processes. The following controls can be applied:
  • pool
  • cpu-shares
  • capped-memory: physical, swap, locked
  • dedicated-cpu: ncpus, importance
global# zonecfg -z global
zonecfg:global> set cpu-shares=100
zonecfg:global> set scheduling-class=FSS
zonecfg:global> exit
Use those features with caution. For example, assigning a physical memory cap of 100MB to the global zone will surely cause problems...

New Boot Arguments

The following boot arguments can now be used:

Argument or Option      Meaning
-s                      Boot to the single-user milestone
-m <milestone>          Boot to the specified milestone
-i </path/to/init>      Boot the specified program as 'init'. This is only useful with branded zones.

Allowed syntaxes include:

global# zoneadm -z myzone boot -- -s
global# zoneadm -z yourzone reboot -- -i /sbin/myinit
ozone# reboot -- -m verbose
In addition, these boot arguments can be stored with zonecfg, for later boots.
global# zonecfg -z myzone
zonecfg:myzone> set bootargs="-m verbose"
zonecfg:myzone> exit

Configurable Privileges

Of the existing three DTrace privileges, dtrace_proc and dtrace_user can now be assigned to a Container. This allows the use of DTrace from within a Container. Of course, even the root user in a Container is still not allowed to view or modify kernel data, but DTrace can be used in a Container to look at system call information and profiling data for user processes.

Also, the privilege proc_priocntl can be added to a Container to enable the root user of that Container to change the scheduling class of its processes.

IP Instances

This is a new feature that allows a Container to have exclusive access to one or more network interfaces. No other Container, even the global zone, can send or receive packets on that NIC.

This also allows a Container to control its own network configuration, including routing, IP Filter, the ability to be a DHCP client, and others. The syntax is simple:

global# zonecfg -z myzone
zonecfg:myzone> set ip-type=exclusive
zonecfg:myzone> add net
zonecfg:myzone:net> set physical=bge1
zonecfg:myzone:net> end
zonecfg:myzone> exit

IP Filter Improvements

Some network architectures call for two systems to communicate via a firewall box or other piece of network equipment. It is often desirable to create two Containers that communicate via an external device, for similar reasons. Unfortunately, prior to Solaris 10 8/07 that was not possible. In 8/07 the global zone administrator can configure such a network architecture with the existing IP Filter commands.

Upgrading and Patching Containers with Live Upgrade

Solaris 10 8/07 adds the ability to use Live Upgrade tools on a system with Containers. This makes it possible to apply an update to a zoned system, e.g. updating from Solaris 10 11/06 to Solaris 10 8/07. It also drastically reduces the downtime necessary to apply some patches.

The latter ability requires more explanation. An existing challenge in the maintenance of zones is patching - each zone must be patched when a patch is applied. If the patch must be applied while the system is down, the downtime can be significant.

Fortunately, Live Upgrade can create an Alternate Boot Environment (ABE) and the ABE can be patched while the Original Boot Environment (OBE) is still running its Containers and their applications. After the patches have been applied, the system can be re-booted into the ABE. Downtime is limited to the time it takes to re-boot the system.

An additional benefit can be seen if there is a problem with the patch and that particular application environment. Instead of backing out the patch, the system can be re-booted into the OBE while the problem is investigated.

Branded Zones (Branded Containers)

Sometimes it would be useful to run an application in a Container, but the application is not yet available for Solaris, or is not available for the version of Solaris that is being run. To run an application like that, perhaps a special Solaris environment could be created that only runs applications for that version of Solaris, or for that operating system.

Solaris 10 8/07 contains a new framework called Branded Zones. This framework enables the creation and installation of Containers that are not the default 'native' type of Containers, but have been tailored to run 'non-native' applications.

Solaris Containers for Linux Applications

The first brand integrated into Solaris 10 is called 'lx'. It is intended for x86 applications which run well on CentOS 3 or Red Hat Enterprise Linux 3, and is specific to x86 computers. The name of this feature is Solaris Containers for Linux Applications.


This was only a brief introduction to these many new and improved features. Details are available in the usual places, including the Solaris 10 documentation.

Thursday Jul 12, 2007

Shrink-Wrap Security


These steps were developed using OpenSolaris build 56 and may require modification to run on other builds of OpenSolaris or its distros, including Solaris 10. As with any security recommendation, these settings may not be appropriate for all sites - assess your situation and risk before implementing these recommendations.


Server virtualization is commonly used to consolidate multiple workloads onto one computer. What else is it good for?

Solaris Containers (aka Zones) are a virtualization tool with other powerful, but less well known, uses. These rely on a unique combination of features:

  • per-zone Solaris Service Manager configuration, allowing a zone to turn off unneeded services like telnet, ftp, even ssh, yet still allow secure access from the platform administrator's environment, called the global zone
  • a strict security boundary, preventing direct inter-zone interaction
  • configurable privileges, enabling a platform administrator to further restrict the abilities of a zone, or to selectively enhance the abilities of a zone
  • resource management controls, allowing the platform admin to limit the amount of resources that the zone can consume
How can this combination provide unique functionality?

By default, Solaris Containers are more secure than general-purpose operating systems in many ways. For example, even the root user of a Container with a default configuration cannot modify the Container's operating system programs. That limitation prevents trojan horse attacks which replace those programs. Also, a process running in a Container cannot directly modify any kernel data, nor can it modify kernel modules like device drivers. Glenn Brunette created an excellent slide deck that describes the multiple layers of security in Solaris 10, of which Containers can be one layer.

Even considering that level of security, the ability to selectively remove Solaris privileges can be used to further tighten a zone's security boundary. In addition, the ability to disable network services prevents almost all network-based attacks. This is very difficult to accomplish in most operating systems without making the system unusable or unmanageable.

The combination of those abilities and the resource controls that are part of Containers' functionality enables you to configure an application environment that can do little more than fulfill the role you choose for it.

This blog entry describes a method that can be used to slightly expand a Container's abilities, and then tighten the security boundary snugly around the Container's intended application.


Imagine that you want to run an application on a Solaris system, but the workload(s) running on this system should not be directly attached to the Internet. Further, imagine that the application needs an accurate sense of time. Today this can be done by properly configuring a firewall to allow the use of an NTP client. But now there's another way... (If this concept sounds familiar, it is because this idea has been mentioned before here and here.)

To achieve the same goal without a firewall, you could use two Solaris "virtual environments" (zones): one that has "normal" behavior, for the application, and one that has the ability to change the system's clock, but has been made extremely secure by meeting the following requirements:

  1. the zone can make outbound network requests and accept responses to those requests
  2. the zone does not allow any inbound network requests (even secure ones like ssh)
  3. the zone can set the system's time clock
  4. the zone has minimal abilities beside the ones it needs to perform its task (the principle of "least privilege")
A Solaris zone can be configured to meet that list of needs.

Any zone can be configured to have access to one or more network ports (NICs). Further, OpenSolaris build 57 and newer builds, and the next update to Solaris 10, enable a zone to have exclusive access to a NIC, further isolating network activity of different zones. This feature is called IP Instances and will be mentioned again a bit later. A zone has its own SSM (Solaris Services Manager). Most of the services managed by SSM can be disabled if you are limiting the abilities of a zone. The zone that will manage the time clock can be configured so that it does not respond to any network connection requests by disabling all non-essential services. Also, Solaris Configurable Privileges enables us to remove unnecessary privileges from the zone, and add the one non-default privilege it needs: sys_time. That privilege is needed in order to use the stime(2) system call.

Basic Steps

Here is an outline of the steps I used to accomplish the goals described above. It can be generalized to harden any service, not just an NTP client.
  1. Configure a sparse-root zone with a zonepath and zonename, but without network access yet (this prevents network attacks while we're hardening the zone)
  2. Install the zone
  3. Add an appropriate /etc/sysidcfg file to the zone
  4. Boot the zone - this automatically configures SSM with default services
  5. Disable unnecessary services
    1. In general, there are two possibilities: we're hardening an existing service managed by SSM, or we're hardening something else. In the former case, we can disable all services with "svcadm disable ..." and then automatically enable the desired service and the ones it needs with:
      svcadm enable -r <service-we-want>
      In the latter case, some experimentation will be necessary to determine the minimum set of services that will be needed to boot the zone and run the application. In the NTP example discussed here it might be possible to disable more services, but there was little advantage to that.
    2. Reboot the zone, check to see if it complains unreasonably
    3. Shut down the zone
  6. Limit the zone's ability to do undesired things:
    1. Discover the set of necessary privileges with 'privdebug' and remove unnecessary privileges with the 'set limitpriv' command in zonecfg(1M). This step must be performed after turning off unnecessary services, because some of the unnecessary services require privileges we don't want the zone to have. The zone may not boot if the services are still enabled and the privileges have already been removed.
    2. Add any non-default privileges to the zone that are needed for the specific purpose of the zone.
    3. Add the network interface(s) that this zone should use to run its application. When possible, this is performed after services and privileges have been removed to prevent someone from attacking the system while it is being hardened.
  7. Boot the zone
  8. Configure the zone to run the application


I won't list the steps to configure and install a zone. They are described here, here, and about a bazillion other places.

Here is the configuration for the zone when I initially created it:

zonecfg -z timelord
zonecfg:timelord> create
zonecfg:timelord> set zonepath=/zones/roots/timelord
zonecfg:timelord> exit

After the zone has been booted and halted once, disabling services in Solaris is easy - the svcadm(1M) command does that. Through experimentation I found that this script disabled all of the network services - and some non-network services, too - but left enough services running that the zone would boot and NTP client software would run. Note that this is less important starting with Solaris 10 11/06: new installations of Solaris 10 will offer the choice to install "Secure By Default" with almost all network services turned off.

To use that script, I booted the zone and logged into it from the global zone - something you can do with zlogin(1) even if the zone does not have access to a NIC. Then I copied the script from the global zone into the non-global zone. A secure method to do this is: as the root user of the global zone, create a directory in <zonepath>/root/tmp, change its permissions to prevent access by any user other than root, and then copy the script into that directory. All of that allowed the script to be run by the root user of the non-global zone. Those steps can be accomplished with these commands:

global# mkdir /zones/roots/timelord/root/tmp/ntpscript
global# chmod 700 /zones/roots/timelord/root/tmp/ntpscript
global# cp disable-services /zones/roots/timelord/root/tmp/ntpscript
global# zlogin timelord
timelord# chmod 700 /tmp/ntpscript/disable-services
timelord# /tmp/ntpscript/disable-services

Now we have a zone that only starts the services needed to boot the zone and run NTP. Incidentally, many other commands will still work, because they don't need any additional privileges.

The next step is to gather the minimum list of Solaris privileges needed by the reduced set of services. Fortunately, a tool has been developed that helps you determine the minimum necessary set of privileges: privdebug.

Here is a sample use of privdebug, which was started just before booting the zone, and stopped after the zone finished booting:

global# ./privdebug.pl -z timelord
USED sys_mount
USED sys_mount
USED sys_mount
USED sys_mount
USED sys_mount
USED proc_exec
USED proc_fork
USED proc_exec
USED proc_exec
USED proc_fork
USED contract_event
USED contract_event
<many lines deleted>
Running that output through sort(1) and uniq(1) summarizes the list of privileges needed to boot the zone and our minimal Solaris services. Limiting a zone to a small set of privileges requires using the zonecfg command:
global# zonecfg -z timelord
zonecfg:timelord> set limitpriv=file_chown,file_dac_read,file_dac_write,file_owner,proc_exec,proc_fork,proc_info,proc_session,proc_setid,proc_taskid,sys_admin,sys_mount,sys_resource
zonecfg:timelord> exit
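The sort(1)/uniq(1) summarization mentioned above can be sketched as a small pipeline (the captured file name and its sample contents are illustrative):

```shell
# Save some privdebug output to a file (sample lines for illustration).
cat > /tmp/privdebug.out <<'EOF'
USED sys_mount
USED sys_mount
USED proc_exec
USED proc_fork
USED proc_exec
USED contract_event
EOF

# Extract the privilege names, de-duplicate them, and join them with
# commas to form a value suitable for zonecfg's 'set limitpriv='.
limitpriv=$(awk '$1 == "USED" || $1 == "NEED" { print $2 }' /tmp/privdebug.out |
    sort -u | paste -sd, -)
echo "$limitpriv"    # contract_event,proc_exec,proc_fork,sys_mount
```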
At this point the zone is configured without unnecessary privileges and without network services. Next we must discover the privileges needed to run our application. Our first attempt to run the application may succeed. If that happens, there is no need to change the list of privileges that the zone has. If the attempt fails, we can determine the missing privilege(s) with privdebug.

For this example I will use ntpdate(1M) to synchronize the system's time clock with time servers on the Internet. In order for ntpdate to run, it needs network access, which must be enabled with zonecfg. When adding a network port, I increased zone isolation with a new feature in OpenSolaris called IP Instances. Use of this feature is not required, but it does improve network isolation and network configuration flexibility. You can choose to ignore this feature if you are using a version of Solaris 10 which does not offer it, or if you do not want to dedicate a NIC to this purpose.

To use IP Instances, I added the following parameters via zonecfg:

global# zonecfg -z timelord
zonecfg:timelord> set ip-type=exclusive
zonecfg:timelord> add net
zonecfg:timelord:net> set physical=bge1
zonecfg:timelord:net> end
zonecfg:timelord> exit
Setting ip-type=exclusive quietly adds the net_rawaccess privilege and the new sys_ip_config privilege to the zone's limit set. This happens whenever the zone boots. These privileges are required in exclusive-IP zones.

We can assign a static address to the zone with the usual methods of configuring IP addresses on Solaris systems. For example, you could boot the zone, log in to it, and enter the following command:

timelord# echo "" > /etc/hostname.bge1
However, because the root user of the global zone can access any of the zone's files, you can do the same thing without booting the zone by using this command instead:
global# echo "" > /zones/roots/timelord/root/etc/hostname.bge1

With network access in place, we can discover the list of privileges necessary to run the NTP client. First boot the zone:

global# zoneadm -z timelord boot
After the zone boots, in one window run the privdebug script, and then in another window run the NTP client in the NTP zone:

global# ./privdebug.pl -z timelord
USED proc_fork
USED proc_exec
USED proc_fork
USED proc_exec
NEED sys_time

global# zlogin timelord
timelord# ntpdate -u <list of NTP server IP addresses>
16 May 13:12:27 ntpdate[24560]: Can't adjust the time of day: Not owner

That output shows us that the privilege 'sys_time' is the only additional one needed to enable the zone to set the system time clock using ntpdate(1M).

Again we use zonecfg to modify the zone's privileges:

global# zonecfg -z timelord
zonecfg:timelord> set limitpriv=file_chown,file_dac_read,file_dac_write,file_owner,proc_exec,proc_fork,proc_info,proc_session,proc_setid,proc_taskid,sys_admin,sys_mount,sys_resource,sys_time
zonecfg:timelord> exit

While isolating the zone, why not also limit the amount of resources that it can consume? If the zone is operating normally the use of resource management features is unnecessary, but they are easy to configure and their use in this situation could be valuable. These limits could reduce or eliminate the effects of a hypothetical bug in ntpdate which might cause a memory leak or other unnecessary use of resources.

Capping the amount of resources which can be consumed by the zone is also another layer of security in this environment. Resource constraints can reduce or eliminate risks associated with a denial of service attack. Note that the use of these features is not necessary. Their use is shown for completeness, to demonstrate what is possible.

A few quick tests with rcapstat(1) showed that the zone needed less than 50MB of memory to do its job. A cap on locked memory further minimized the zone's abilities without causing a problem for NTP. As with IP Instances, these features are available in OpenSolaris and will be in the next update to Solaris 10.

global# zonecfg -z timelord
zonecfg:timelord> add capped-memory
zonecfg:timelord:capped-memory> set physical=50m
zonecfg:timelord:capped-memory> set swap=50m
zonecfg:timelord:capped-memory> set locked=20m
zonecfg:timelord:capped-memory> end
zonecfg:timelord> set scheduling-class=FSS
zonecfg:timelord> set cpu-shares=1
zonecfg:timelord> set max-lwps=200
zonecfg:timelord> exit

Assigning one share to the zone prevents the zone from using too much CPU power and impacting other workloads. It also guarantees that other workloads will not prevent this zone from getting access to the CPU. Capping the number of threads (lwps) limits the ability to use up a fixed resource: process table slots. That limit is probably not necessary given the strict memory caps, but it can't hurt.

Now that we have 'shrink-wrapped' the security boundary even more tightly than the default, we're ready to use this zone.

global# zoneadm -z timelord boot
global# zlogin timelord
timelord# ntpdate 
16 May 14:40:35 ntpdate[25070]: adjust time server  offset -0.394755 sec
The output of ntpdate shows that it was able to contact an NTP server and adjust this system's time clock by almost 0.4 seconds.

Experience with Solaris privileges can allow you to further tighten the security boundary. For example, if you want to prevent the zone from changing its own host name, you could remove the sys_admin privilege from the zone's limit set. Doing so, and then rebooting the zone, would allow you to demonstrate this:

timelord# hostname drwho
hostname: error in setting name: Not owner
What privilege is needed to use the hostname(1M) command?
timelord# ppriv -e -D hostname drwho
hostname[4231]: missing privilege "sys_admin" (euid = 0, syscall = 139) needed at systeminfo+0x139
hostname: error in setting name: Not owner

Security Analysis

Many attacks on computers require the ability to take advantage of a security weakness in software that listens to the network. The ability to turn off all such services greatly decreases the security risk. Disabling network listeners was easy with svcadm(1M).

Before disabling services, I ran "netstat -a" on another zone which had just been created. It showed a list of 13 ports to which services were listening, including ssh and sunrpc services. After hardening the zone 'timelord' by disabling unneeded services, "netstat -a" doesn't show any open ports.
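That before-and-after comparison is easy to script. A sketch using saved netstat output; the sample lines below are illustrative, not captured from the system described here:

```shell
# Saved 'netstat -a' output from before hardening (sample data).
cat > /tmp/netstat.before <<'EOF'
      *.sunrpc             *.*                0      0 49152      0 LISTEN
      *.ssh                *.*                0      0 49152      0 LISTEN
      *.32771              *.*                0      0 49152      0 LISTEN
EOF

# Count the TCP endpoints in the LISTEN state.
listeners=$(awk '$NF == "LISTEN"' /tmp/netstat.before | wc -l)
echo "listening ports before hardening: $listeners"
```

Running the same count against output captured after hardening should report zero.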

In order to further evaluate the security of the configuration described above, Nessus was used to evaluate the possible attack vectors. It did not find any security weaknesses.

Future Work

This method could be enhanced further by using tools like the Solaris Security Toolkit and by understanding the research done in implementing a secure NTP system. Network security could also be enhanced by using the Solaris IP Filter to prevent the NTP zone from communicating with any other computer except a specific list of IP addresses - the NTP servers which will be used.

What else can be secured using this method? Typical Unix services like sendmail and applications like databases are ideal candidates. What application do you want to secure?


The goal of this exercise was not merely to demonstrate a new method of securing an NTP client, but to describe a method of limiting the abilities of processes in a zone to only those abilities that are needed to provide a single function for the system. Solaris Zones (Containers) can be used to accomplish goals which other server virtualization technologies cannot.

Thanks to Glenn Brunette for assistance with security techniques and to Bob Bownes for providing a test platform and assistance with Nessus.

Wednesday Apr 11, 2007

Migratory FrankenZones

Sometimes I wonder too much. This is one of those times.

Solaris 10 11/06 introduced the ability to migrate a Solaris non-global zone from one computer to another. Use of this feature is supported for two computers which have the same CPU type and substantially similar package sets and patch levels. (Note that non-global zones are also called simply 'zones' or, more officially, 'Solaris Containers.')

But I wondered... what would happen if you migrated a zone from a SPARC system to an x86/x64 system (or vice versa)? Would it work?

Theoretically, it depends on how hardware-dependent Solaris is. With a few minor exceptions, Solaris has only one source-code base, compiled for each hardware architecture on which it runs. The exceptions are things like device drivers and other components which operate hardware directly. But none of those components are part of a zone... (eerie foreshadowing music plays in the background)

Of course, programs compiled for one architecture won't run on another one. If a zone contains a binary program and you move the zone to a computer with a different CPU type, that program will not run. I wondered: do zones include binary programs?

The answer to that question is "it depends." A sparse-root zone, which is the default type, does not include binary programs except for a few in /etc/fs, /etc/lp/alerts and /etc/security/lib which are no longer used and didn't belong there in the first place. In fact, when a zone is not running, it is just a bunch of configuration files of these types:

  • directory
  • symbolic link
  • empty file
  • FIFO
  • executable shell script

All of those file types are portable from one Solaris system to another, regardless of CPU type. So, theoretically, it might be possible to move a sparse-root zone from any Solaris 10 system to any other, without regard to CPU type.
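You can verify this yourself on a non-running sparse-root zone by summarizing the file types under its root. A hedged sketch, where /zones/roots/myzone is a placeholder zonepath:

```shell
# Summarize the file types found under a halted sparse-root zone's root.
find /zones/roots/myzone ! -type d ! -type l -exec file {} \; \
    | awk -F: '{print $2}' | sort | uniq -c
```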

In addition, when a zone is booted, a few loopback mounts (see lofs(7FS)) are created from the global zone into the non-global zone. They include directories like /usr and /sbin - the directories that actually contain the Solaris programs. Those loopback mounts make all of the operating system's programs available to a zone's processes when the zone is running.
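From inside a running sparse-root zone you can see those loopback mounts directly. A hedged sketch, using a placeholder zone name:

```shell
# Inside the zone: show which file systems are loopback (lofs) mounts.
myzone# df -n                 # prints the file system type of each mount
myzone# mount -v | grep lofs  # /usr, /lib, /sbin, /platform appear here
```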

Although the information about the mount points moves with a migrating (sparse-root) zone, the contents of those mount points don't move with the zone... (there's that music again)

On the other hand, a whole-root zone contains its own copy of almost all files in a Solaris instance, including all of the binary programs. Because of that, a whole-root zone cannot be moved to a system with a different CPU type.

To test the migration of a sparse-root zone across CPU types, I created a zone on a SPARC system and used the steps shown in my "How to Move a Container" guide to move it to an x86 system. Note that in step 2 of the section "Move the Container", pax(1) is used to ensure that there are no endian issues.
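Condensed, the source-side steps look roughly like this (a sketch; the archive path is a placeholder, and the referenced guide covers the details):

```shell
# On the SPARC source: halt and detach the zone, then archive its files
# with pax, which preserves attributes in an endian-neutral format.
sparc-global# zoneadm -z bennu halt
sparc-global# zoneadm -z bennu detach
sparc-global# cd /zones/roots/bennu
sparc-global# pax -w -f /tmp/bennu.pax -p e .
```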

The original zone had this configuration on an Ultra 30 workstation:

sparc-global# zonecfg -z bennu
zonecfg:bennu> create
zonecfg:bennu> set zonepath=/zones/roots/bennu
zonecfg:bennu> add net
zonecfg:bennu:net> set physical=hme0
zonecfg:bennu:net> set address=
zonecfg:bennu:net> end
zonecfg:bennu> exit

When configuring the new zone, you must specify any hardware differences. In my case, the NIC on the original system was hme0. On the destination system (a Toshiba Tecra M2) it was e1000g0. I chose to keep the same IP address for simplicity. After moving the archive file to the Tecra and unpacking it into /zones/roots/phoenix, it was time to configure the new zone. The zonecfg session for the new zone looked like this:

x86-global# zonecfg -z phoenix
zonecfg:phoenix> create -a /zones/roots/phoenix
zonecfg:phoenix> select net physical=hme0
zonecfg:phoenix:net> set physical=e1000g0
zonecfg:phoenix:net> end
zonecfg:phoenix> exit

Because the change in hardware was specified, zoneadm takes the appropriate actions when the zone boots.

The zoneadm(1M) command is used to attach a zone's detached files to their new computer as a new zone. When attaching a zone, zoneadm(1M) compares the zone's package and patch information - generated when the zone was detached - to the package and patch information of the zone's new host. Unfortunately (for this situation), patch numbers are different for SPARC and x86 systems. As you might guess, attaching a zone that was first created on a SPARC system to an x86 system caused zoneadm to emit numerous complaints, including:

These packages installed on the source system are inconsistent with this system:
(SPARC-specific packages)
These packages installed on this system were not installed on the source system:
(x86-specific packages)
These patches installed on the source system are inconsistent with this system:
        118367-04: not installed
(other SPARC-specific patches)
These patches installed on this system were not installed on the source system:
(other x86-specific patches)

If zoneadm detects sufficient differences in packages and patches, it does not attach the zone. Fortunately, for situations like this, when you know what you are doing (or pretend that you do...) and are willing to create a possibly unsupported configuration, the 'attach' sub-command to zoneadm has its own -F flag. The use of that flag tells zoneadm to attach the zone even if there are package and/or patch inconsistencies.
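In this case that meant forcing the attach. A hedged sketch of the final steps on the x86 host:

```shell
# On the x86 target: force the attach despite package/patch mismatches,
# then boot the zone.
x86-global# zoneadm -z phoenix attach -F
x86-global# zoneadm -z phoenix boot
x86-global# zlogin phoenix uname -p   # should now report i386
```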

After forcing the attachment, the zone boots correctly. It uses programs in the loopback-mounted file systems /usr, /lib and /sbin. Other applications could be loopback-mounted into /opt as long as that loopback mount is modified, if necessary, when the zone is attached to the new system.


This little experiment has shown that it is possible to move a Solaris zone from a computer with one CPU type to a Solaris computer that has a different CPU type. I have not shown that it is wise to do so, nor that an arbitrary application will run correctly.

My goals were:

  • depict the light weight of zones, especially showing that an unbooted sparse-root zone is merely a collection of hardware-independent configuration files
  • show how easy it is to create and migrate a zone
  • expand the boundaries of knowledge about zones
  • explore the possibility of designing an infrastructure that is not based on the assumption that a workload must be limited to one CPU architecture. Without that limitation, an application could "float" from system to system as needed, regardless of CPU type.

Tuesday Apr 03, 2007

YACUZ: Package-independent zones

Here is Yet Another Creative Use of Zones:

To overcome some obstacles that developers face when using Solaris Containers (aka Zones), Doug Scott documented a method of building a zone which will never be patched from the global zone. In other words, when a patch is applied to the global zone, it will not be applied to a zone built using this method, even if the patch is for a package which is marked ALLZONES=true.

Normally, a package marked ALLZONES=true must be installed in every zone and patched consistently in all of them. Branded zones, also called 'non-native zones,' are exempt from that rule. Branded zones allow you to create a zone which will run applications meant for another operating system or operating system version. The first official brand is 'lx'. An lx-branded zone can run most Linux applications.

Note that this method would not be supported by Sun for the following reasons:

  1. It uses the BrandZ framework, which is available via OpenSolaris, but not yet supported by Sun.
  2. It requires you to edit system files which you shouldn't edit; the syntax of those files can change.
  3. Eventually, a patch will modify the kernel and libc (or other kernel-dependent libs) in such a way that they will be incompatible with the cbe-branded zone. Some patches must be applied manually to keep this cbe-branded zone synchronized with the global zone.

Also, note that a zone built like that will no longer benefit from one of the key advantages of zones: management simplicity. You must figure out which patches must be applied to a cbe-branded zone.

However, if those don't bother you, or if you want to learn more about how zones really work, take a look:


Jeff Victor writes this blog to help you understand Oracle's Solaris and virtualization technologies.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

