Tuesday Nov 18, 2008

Zonestat: How Big is a Zone?

Introduction

Recently an organization showed great interest in Solaris Containers, especially the resource management features, and then asked "what command do you use to compare the resource usage of Containers to the resource caps you have set?"

I began to list them: prstat(1M), poolstat(1M), ipcs(1), kstat(1M), rcapstat(1), prctl(1), ... Obviously it would be easier to monitor Containers if there were one 'dashboard' to view. Such a dashboard would enable zone administrators to easily review zones' usage of system resources and decide whether further investigation is necessary. Also, if there is a system-wide shortage of a resource, this tool would be the first one out of the toolbox, simplifying the task of finding the 'resource hog.'

Zonestat

In order to investigate the possibilities, I created a Perl script I call 'zonestat', which summarizes the resource usage of Containers. I consider this script a prototype, not something intended for production use. On the other hand, for a small number of zones it seems to be pretty handy, and moderately robust.

Its output looks like this:

        |----Pool-----|------CPU-------|----------------Memory----------------|
        |---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
  global  0D  66K    2       0.1 100 25      986M      139K  18E 2.2M  18E 754M
    db01  0D  66K    2       0.1 200 50   1G 122M 536M  0.0 536M    0   1G 135M
   web02  0D  66K    2  0.4  0.0 100 25 100M  11M  20M  0.0  20M    0 268M   8M 
==TOTAL= --- ----    2 ----  0.2  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 897M
The 'Pool' columns provide information about the Dynamic Resource Pool in which the zone's processes are running. In the two-character 'IT' column, 'I' is the pool ID (a number) and 'T' indicates the type of pool: 'D' for 'default', 'P' for 'private' (created with the dedicated-cpu feature of zonecfg) or 'S' for 'shared.' The two 'Size' columns, 'Max' and 'Cur', show the maximum and current number of CPUs assigned to the pool in which the zone is running.

The 'CPU Pset' columns show each zone's CPU usage and any caps that have been set. The first two Pset columns, 'Cap' and 'Used', show CPU quantities: CPU cores for x86, SPARC64 and all UltraSPARC systems except CMT (T1, T2, T2+). On CMT systems, Solaris considers every hardware thread ('strand') to be a CPU, and calls them 'vCPUs.'

The last two CPU columns - 'Shr' and 'S%' - show the number of FSS shares assigned to the zone, and the percentage of the total shares in that zone's pool that those shares represent. In the example above, all the zones share the default pset, and the zone 'db01' holds 200 of the 400 shares assigned in that pool, so it should receive at least 50% of the pool's CPU power when there is contention for CPU.
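
To make the arithmetic explicit, here is a minimal Perl sketch of the 'S%' calculation, using the share values from the example above (this is an illustration, not code lifted from the script):

   #!/usr/bin/perl
   # Compute each zone's percentage of the FSS shares in its pool.
   use strict;
   use warnings;

   # zone.cpu-shares values for the zones sharing the default pool above
   my %shares = ( global => 100, db01 => 200, web02 => 100 );

   my $total = 0;
   $total += $_ for values %shares;

   for my $zone ( sort keys %shares ) {
       printf "%8s  Shr=%3d  S%%=%2d\n",
           $zone, $shares{$zone}, int( 100 * $shares{$zone} / $total + 0.5 );
   }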

The 'Memory' columns show the caps and usage for RAM, shared memory, locked memory, and virtual memory. Note that virtual memory is RAM plus swap space.

The syntax of zonestat is very similar to the other *stat tools:

   zonestat [-l] [interval [count]]
The output shown above is generated with the -l flag, which means "show the limits (caps) that have been set." Without -l, only usage columns are displayed.
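
For example, assuming the usual *stat interval and count semantics, something like the following would print the full table (with caps) every five seconds, ten times:

   zonestat -l 5 10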

Example of Usage

Here is more output, interleaved with commentary on some of the conclusions that can be drawn from the data.

        |----Pool-----|------CPU-------|----------------Memory----------------|
        |---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
  global  0D  66K    2       0.1 100 HH      983M      139K  18E 2.2M  18E 752M
==TOTAL= --- ----    2 ----  0.1  -- -- 4.3G 983M 4.3G 139K 4.1G 2.2M 5.3G 752M
--------
  global  0D  66K    2       0.1 100 HH      983M      139K  18E   2M  18E 752M
==TOTAL= --- ----    2 ----  0.1  -- -- 4.3G 983M 4.3G 139K 4.1G 2.2M 5.3G 752M
Note that none of the non-global zones is running. Because the global zone is the only zone running in its pool, its 100 FSS shares represent 100% of the shares in that pool. To save a column of output, I indicate 100% with 'HH' instead of '100'.

The "==TOTAL=" line provides two types of information, depending on the column type. For usage information, the sum of the resource used is shown. For example, "RAM Use" shows the amount of RAM used by all zones, including the global zone. For resource controls, either the system's amount of the resource is shown, e.g. "RAM Cap", or hyphens are displayed.

Note that there is a maximum amount of RAM that can be locked in a Solaris system. This prevents all of the system's memory from being locked down, which would prevent the virtual memory system from operating. In the output above, this system will only allow 4.1GB of RAM to be locked.

Also note that the amount of VM used is less than the amount of RAM used. This is because the memory pages which contain a program's instructions are not backed by swap disk, but by the file system itself. Those 'text' pages take up RAM, but do not take up swap space.

        |----Pool-----|------CPU-------|----------------Memory----------------|
        |---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
  global  0D  66K    2       0.1 100 50      984M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.1 100 50 1.0G  30M 536M  0.0 536M  0.0 1.0G  27M
==TOTAL= --- ----    2 ----  0.2  -- -- 4.3G 1.0G 4.3G 139K 4.1G 2.2M 5.3G 780M
A zone has booted. It has caps for RAM, shared memory, locked memory, and VM. The default pool now has a total of 200 shares: 100 for each zone. Therefore, each zone has 50% of the shares in that pool. This provides a good reason to change the global zone's FSS value from its default of one share to a larger value as soon as you add the first zone to a system.
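
As an aside, the number of shares is the zone.cpu-shares resource control. On a running system it can be adjusted with prctl(1); an illustrative command (not part of this demo's output) to give the global zone 100 shares would look like:

   prctl -n zone.cpu-shares -r -v 100 -i zone global
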
  global  0D  66K    2       0.1 100 50      984M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.3 100 50   1G  93M 536M  0.0 536M  0.0   1G  95M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 848M
--------
  global  0D  66K    2       0.1 100 50      981M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.4 100 50   1G 122M 536M  0.0 536M  0.0   1G 135M
==TOTAL= --- ----    2 ----  0.5  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M
The zone 'z3' is still booting, and is using 0.4 CPUs worth of CPU cycles.
  global  0D  66K    2       0.1 100 50      984M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.3 100 50   1G 122M 536M      536M  0.0   1G 135M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M
--------
  global  0D  66K    2       0.1 100 50      984M      139K  18E 2.2M  18E 753M
      z3  0D  66K    2       0.2 100 50   1G 122M 536M      536M  0.0   1G 135M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 888M
--------
  global  0D  66K    2       0.1 100 33      986M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.1 100 33   1G 122M 536M      536M  0.0   1G 135M
   web02  0D  66K    2  0.4  0.0 100 33 100M  11M  20M       20M  0.0 268M   8M
==TOTAL= --- ----    2 ----  0.2  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 897M
A third zone has booted. This zone has a CPU cap of 0.4 CPUs. It also has memory caps, including a RAM cap that is smaller than the amount of RAM that zone 'z3' is already using. If web02's workload needs as much memory as z3's does, web02 should begin paging before long. Let's see what happens...
--------
  global  0D  66K    2       0.1   1 33      985M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.1   1 33   1G 122M 536M      536M  0.0   1G 135M
   web02  0D  66K    2  0.4  0.1   1 33 100M  29M  20M       20M  0.0 268M  36M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.1G 4.3G 139K 4.1G 2.2M 5.3G 925M
--------
  global  0D  66K    2       0.1   1 33      984M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.1   1 33   1G 122M 536M      536M  0.0   1G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M  63M  20M       20M  0.0 268M 138M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
        |----Pool-----|------CPU-------|----------------Memory----------------|
        |---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
  global  0D  66K    2       0.1   1 33      985M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33   1G 122M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M  87M  20M       20M  0.0 268M 185M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.1G
--------
  global  0D  66K    2       0.1   1 33      985M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33   1G 122M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M 100M  20M       20M  0.0 268M 112M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
--------
  global  0D  66K    2       0.1   1 33      984M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33   1G 122M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.3   1 33 100M 112M  20M       20M  0.0 268M 117M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
As expected, web02 exceeds its RAM cap. Now rcapd should address the problem.
--------
  global  0D  66K    2       0.1   1 33      981M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33   1G 119M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.3   1 33 100M 111M  20M       20M  0.0 268M 127M
==TOTAL= --- ----    2 ----  0.4  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
One of two things has happened: either a process in web02 freed up memory, or rcapd caused pageouts. rcapstat(1) will tell us which it is. Also, the increase in VM usage indicates that more memory was allocated than freed, so it is more likely that rcapd was active during this period.
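
A quick way to check is to watch rcapd's per-zone statistics for a few intervals, e.g. (assuming the zone-reporting option of rcapstat available on this release):

   rcapstat -z 5
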
--------
  global  0D  66K    2       0.1   1 33      981M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33 1.0G 119M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M 110M  20M       20M  0.0 268M 133M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
--------
  global  0D  66K    2       0.1   1 33      978M      139K  18E 2.2M  18E 754M
      z3  0D  66K    2       0.0   1 33 1.0G 116M 536M      536M  0.0 1.0G 135M
   web02  0D  66K    2  0.4  0.2   1 33 100M  91M  20M       20M  0.0 268M 133M
==TOTAL= --- ----    2 ----  0.3  -- -- 4.3G 1.2G 4.3G 139K 4.1G 2.2M 5.3G 1.0G
At this point 'web02' is safely under its RAM cap. If this zone began to do 'real' work, it would continually be under memory pressure, and the value in 'Memory:RAM:Use' would fluctuate around 100M. When setting a RAM cap, it is very important to choose a reasonable value to avoid causing unnecessary paging.

One final example, taken from a different configuration of zones:

        |----Pool-----|------CPU-------|----------------Memory----------------|
        |---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap|Used| Cap|Used| Cap|Used| Cap|Used
-------------------------------------------------------------------------------
  global  0D  66K    1  0.0  0.0 200 66      1.2G  18E 343K  18E 2.6M  18E 1.1G
      zB  0D  66K    1  0.2  0.0 100 33      124M  18E  0.0  18E  0.0  18E 138M
      zA  1P    1    1  0.0  0.1   1 HH       31M  18E  0.0  18E  0.0  18E  24M
==TOTAL= --- ----    2 ----  0.1 --- -- 4.3G 1.4G 4.3G 343K 4.1G 2.6M 5.3G 1.2G
The global zone and zone 'zB' share the default pool. Because the global zone has 200 FSS shares, compared to zB's 100 shares, global zone processes will get at least 2/3 of the processing power of the default pool if there is contention for CPU in that pool. However, such contention is unlikely, because zB is capped at 0.2 CPUs worth of compute time.
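
In other words: 200 / (200 + 100) ≈ 67%, which matches the 'S%' value of 66 shown for the global zone above.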

Zone 'zA' is in its own private resource pool. It has exclusive access to the one dedicated CPU in that pool.

Problems

CPU Hog

Zonestat's biggest problem is its brute-force nature: it runs a few commands for each zone that is running. This can consume many CPU cycles, and with many zones a single sample can take a few seconds to gather. Performance improvements to zonestat are underway.
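
To show what 'brute force' means here, the sketch below takes the same general approach: list the running zones, then fork a command per zone and parse its output. This is an illustration of the approach, not the actual zonestat code, and the parsing is deliberately simplified.

   #!/usr/bin/perl
   # Illustrative only: gather one number (total RSS) per running zone
   # by forking external commands, the way a brute-force collector would.
   use strict;
   use warnings;

   # zoneadm list -p prints colon-delimited lines: id:name:state:path:...
   my @zones;
   for my $line ( `/usr/sbin/zoneadm list -p` ) {
       my ( undef, $name, $state ) = split /:/, $line;
       push @zones, $name if defined $state && $state eq 'running';
   }

   for my $zone (@zones) {
       # Another fork and exec for every zone: this adds up quickly.
       my $rss_kb = 0;
       for my $line ( `/usr/bin/ps -eo zone,rss` ) {
           my ( $z, $rss ) = split ' ', $line;
           next unless defined $rss && $rss =~ /^\d+$/;   # skip the header
           $rss_kb += $rss if $z eq $zone;
       }
       printf "%10s  RSS: %7.1f MB\n", $zone, $rss_kb / 1024;
   }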

Wrong / Misleading CPU Usage Data

Two commonly used methods to 'measure' CPU usage by processes and zones are prstat and mpstat. Each can produce inaccurate 'data' in certain situations.

With mpstat, it is not difficult to create surprising results. For example, on a CMT system, set a CPU cap on a zone in a pool and run a few CPU-bound processes in that zone: the "Pset Used" column will not reach the CPU cap. This is due to the method mpstat uses to calculate its data.

Prstat computes its data only at each sampling interval, ignoring anything that happened between samples, so processes that start and exit between two samples are not counted at all. This leads to undercounting CPU usage for zones with many short-lived processes.

I wrote code to gather data from each, but prstat seemed more useful, so for now the output comes from prstat.

What's Next

I would like feedback on this tool, perhaps leading to minor modifications to improve its robustness and usability. What's missing? What's not clear?

The future of zonestat might include these:

  • I hope that this can be re-written in C or D. Either way, it might find its way into Solaris... If I can find the time, I would like to tackle this.
  • New features - best added to a 'real' version:
    1. -d: show disk usage
    2. -n: show network usage
    3. -i: also show installed zones that are not running
    4. -c: also show configured zones that are not installed
    5. -p: only show processor info, but add more fields, e.g. prstat's instantaneous CPU%, micro-state fields, and mpstat's CPU%
    6. -m: only show memory-related info, and add the paging columns of vmstat, output of "vmstat -p", free swap space
    7. -s : sort sample output by a field. Good example: sort by poolID
    8. add a one-character column showing the default scheduler for each pool
    9. report state transitions like mpstat does, e.g. changes in zone state, changes in pool configuration
    10. improve robustness

The Code

You can find the Perl script in the Files page of the OpenSolaris project "Zone Statistics."

