Tuesday Apr 26, 2016

Socket, Core, Strand - Where are my Zones?

Consolidation using Solaris Zones is widely adopted.  In many cases, people run all the zones on all available CPUs, which is great for overall utilization.  In such a case, Solaris does all the scheduling, taking care that the best CPU is chosen for each process and that all resources are distributed fairly amongst all applications.  However, there are cases where you would want to dedicate a certain set of CPUs to one or more zones, for example to deal with license restrictions or to create a stricter separation between different workloads.  This separation is achieved either by using the "dedicated-cpu" setting in the zone's configuration, or by binding the zone to an existing resource pool, which in turn contains a processor set.  The technology in both cases is the same, since in the case of "dedicated-cpu", Solaris automatically creates a temporary resource pool when the zone is started.  The effect of using a processor set is that the CPUs assigned to it are available exclusively to the zones associated with this set.  This means that these zones can use exactly those CPUs - no more, no less.  Anything else running on the system (the global zone and any other zones) can no longer be executed on these CPUs.
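
As a quick illustration, the two variants might be configured like this (a minimal sketch; the zone and pool names are made up, and a zone uses one variant or the other, not both):

root@mars:~# zonecfg -z myzone 'add dedicated-cpu; set ncpus=4; end'   # temporary pool, created when the zone boots
root@mars:~# zonecfg -z myzone 'set pool=mypool'                       # or: bind to an existing resource pool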

In this article, I'll discuss (and hopefully answer) the question of which CPUs to include in such a processor set, and how to figure out which zones currently run on which CPUs.

To avoid unnecessary confusion, let me define a few terms first, since there are multiple names in use for the various concepts:

  • A CPU is a processor, consisting of one or more cores, cache and optionally some IO controllers and/or memory controllers.
  • A Core is one computation or execution unit on a CPU.  (Not to be confused with the pipelines that it contains.)
  • A Strand is an entry point into a core, which makes the core's services available to the operating system.

For example, a SPARC M7 CPU consists of 32 cores.  Each core provides 8 strands, so an M7 CPU provides 32*8=256 strands to the OS.  The OS treats each of these strands as a fully-fledged execution unit and therefore shows 256 "CPUs".
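
You can see this hierarchy with psrinfo.  On an M7-based system, the output of "psrinfo -pv" would look roughly like this (an abbreviated sketch; the exact wording varies with platform and Solaris release):

root@mars:~# psrinfo -pv
The physical processor has 32 cores and 256 virtual processors (0-255)
  The core has 8 virtual processors (0-7)
  ...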

All modern multi-core CPUs include multiple levels of caches.  The L3 cache is usually shared by all cores.  L2 and L1 caches are closer to the cores; they are smaller but faster and often dedicated to one or a small number of cores.  (The M7 CPU applies different strategies, but each core owns its own, exclusive L1 cache.)  Now, if multiple strands of the same core are used by the same process (or application), this can lead to relatively high hit rates in these caches.  If, on the other hand, different processes use the same core, they will compete for the limited cache space, overwriting each other's entries.  We call this behavior "cache thrashing".  Solaris does a good job trying to prevent this.  However, when using many zones, it is common to assign different zones to different sets of cores.  Use whole cores (complete sets of 8 strands) to avoid sharing cores between zones or applications.  This also makes the most sense with regard to license capping, since you usually license your application by the number of cores.

So how can you make sure that your zones are bound correctly to whole, exclusive cores?

Solaris knows about the relation between strands, cores and CPUs (as well as the memory hierarchy, which I won't cover here).  You can query this relation using kstat.  For historical reasons (from the times when there were no multi-core or multi-strand CPUs), Solaris uses the term "CPU" for what we now call a strand:

root@mars:~# kstat -m cpu_info -s core_id -i 150
module: cpu_info                        instance: 150   
name:   cpu_info150                     class:    misc
        core_id                         18

root@mars:~# kstat -m cpu_info -s chip_id -i 150
module: cpu_info                        instance: 150   
name:   cpu_info150                     class:    misc
        chip_id                         1

In the above example, the "CPU" with ID 150 is a strand of core 18, which belongs to CPU (chip) 1.  You can discover all available strands, cores and CPUs this way.
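
To dump the complete mapping in one go, kstat's parsable output mode (-p) is convenient.  On the same system, the first few lines might look like this:

root@mars:~# kstat -p -m cpu_info -s core_id | head -3
cpu_info:0:cpu_info0:core_id    0
cpu_info:1:cpu_info1:core_id    0
cpu_info:2:cpu_info2:core_id    0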

Usually, when you configure a processor set for a resource pool, you just tell it the minimum and maximum number of strands it should contain (min=max is quite common).  Optionally, you can also specify individual CPU IDs (strands) or, since Solaris 11.2, core IDs.  The commands to do this are "pooladm" and "poolcfg".  (There is also the command "psrset", but it only creates a processor set, not a resource pool, and its configuration is not persistent, so it needs to be recreated after every reboot.)  I already described the use of these commands a while ago.
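
For reference, here is a minimal sketch of how such a pool could be created (the names db-pset and db-pool are made up; see poolcfg(1M) for the full syntax, including the core-ID notation added in Solaris 11.2):

root@mars:~# pooladm -e      # enable the resource pools facility
root@mars:~# poolcfg -c 'create pset db-pset (uint pset.min = 16; uint pset.max = 16)'
root@mars:~# poolcfg -c 'create pool db-pool'
root@mars:~# poolcfg -c 'associate pool db-pool (pset db-pset)'
root@mars:~# pooladm -c      # activate the configuration

The zone is then bound to db-pool with "set pool=db-pool" in its configuration, as shown earlier.  Now, to figure out which strands, cores or CPUs are assigned to a specific zone, you'd need to use kstat to find the association between the strand IDs in your processor set and the corresponding cores and CPUs.  Done manually, that's a little painful, which is why I wrote a little script to do it for you: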

root@mars:~# ./zonecores -h
usage: zonecores [-Sscl] 
       -S report whole Socket use
       -s report shared use
       -c report whole core use
       -l list cpu overview

With the "-l" command-line option, it gives you an overview of the available CPUs and the zones currently running on them.  Here's an example from a SPARC system with two 16-core CPUs:

root@mars:~# ./zonecores -l
#
# Socket, Core, Strand and Zone Overview
#
Socket Core Strands Zones
0      0    0,1,2,3,4,5,6,7 db2,
0      1    8,9,10,11,12,13,14,15 db2,
0      2    16,17,18,19,20,21,22,23 none
0      3    24,25,26,27,28,29,30,31 db2,
0      4    32,33,34,35,36,37,38,39 db2,
0      5    40,41,42,43,44,45,46,47 db2,
0      6    48,49,50,51,52,53,54,55 db2,
0      7    56,57,58,59,60,61,62,63 coreshare,db1,
0      8    64,65,66,67,68,69,70,71 db2,
0      9    72,73,74,75,76,77,78,79 none
0     10    80,81,82,83,84,85,86,87 none
0     11    88,89,90,91,92,93,94,95 none
0     12    96,97,98,99,100,101,102,103 none
0     13    104,105,106,107,108,109,110,111 none
0     14    112,113,114,115,116,117,118,119 none
0     15    120,121,122,123,124,125,126,127 none
1     16    128,129,130,131,132,133,134,135 none
1     17    136,137,138,139,140,141,142,143 none
1     18    144,145,146,147,148,149,150,151 none
1     19    152,153,154,155,156,157,158,159 none
1     20    160,161,162,163,164,165,166,167 none
1     21    168,169,170,171,172,173,174,175 none
1     22    176,177,178,179,180,181,182,183 none
1     23    184,185,186,187,188,189,190,191 none
1     24    192,193,194,195,196,197,198,199 none
1     25    200,201,202,203,204,205,206,207 none
1     26    208,209,210,211,212,213,214,215 none
1     27    216,217,218,219,220,221,222,223 none
1     28    224,225,226,227,228,229,230,231 none
1     29    232,233,234,235,236,237,238,239 none
1     30    240,241,242,243,244,245,246,247 db2,
1     31    248,249,250,251,252,253,254,255 none

Using the options -S and -c, you can check whether your zones use whole sockets (-S) or whole cores (-c).  With -s, you can check whether several zones share one or more cores, which may or may not be intentional, depending on the use case.  Here's an example with various pools and zones on the same system as above:

root@mars:~# ./zonecores -Ssc
#
# Checking Socket Affinity (16 cores per socket)
#
INFO - Zone db2 using 2 sockets for 8 cores.
OK - Zone db1 using 1 sockets for 1 cores.
OK - Zone capped7 using default pool.
OK - Zone coreshare using 1 sockets for 1 cores.
#
# Checking Core Resource Sharing
#
OK - Core 0 used by only one zone.
OK - Core 1 used by only one zone.
OK - Core 3 used by only one zone.
OK - Core 30 used by only one zone.
OK - Core 4 used by only one zone.
OK - Core 5 used by only one zone.
OK - Core 6 used by only one zone.
INFO - Core 7 used by 2 zones!
-> coreshare
-> db1
OK - Core 8 used by only one zone.
#
# Checking Whole Core Assignments
#
OK - Zone db2 using all 8 strands of core 0.
OK - Zone db2 using all 8 strands of core 1.
OK - Zone db2 using all 8 strands of core 3.
OK - Zone db2 using all 8 strands of core 30.
OK - Zone db2 using all 8 strands of core 4.
OK - Zone db2 using all 8 strands of core 5.
FAIL - only 7 strands of core 6 in use for zone db2.
FAIL - only 1 strands of core 8 in use for zone db2.
OK - Zone db1 using all 8 strands of core 7.
OK - Zone coreshare using all 8 strands of core 7.

Info: 1 instances of core sharing found.
Info: 1 instances of socket spanning found.
Warning: 2 issues found with whole core assignments.

While this mostly speaks for itself, here are some comments:

  • Zone db1 uses a resource pool with 8 strands from one core.
  • Zone coreshare also uses that same pool.
  • Zone db2 uses a resource pool with 64 strands, coming from cores on two different CPUs.  It uses only 7 of the 8 strands of core 6, while the 8th strand comes from core 8.  This is probably not intentional.  It would make more sense to use all 8 strands from the same core, to avoid cache sharing and to reduce the number of cores to license by one.  It might also be beneficial to use all 8 cores from the same CPU; in that case, Solaris would attempt to allocate memory local to that CPU to avoid remote memory access.
  • Zone capped7 is configured with the option "capped-cpu: ncpus=7".  This is implemented with the resource control zone.cpu-cap, and the zone's processes run on all available CPUs in the default pool (see the sketch below).
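
For comparison, the zonecfg settings behind these two models might look like this (a sketch with made-up zone names; a zone uses one model or the other):

root@mars:~# zonecfg -z exclzone 'add dedicated-cpu; set ncpus=8; end'   # exclusive strands via a temporary pset
root@mars:~# zonecfg -z capped7 'add capped-cpu; set ncpus=7; end'       # cap enforced on the default pool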

The script is available for download here: zonecores

I also wrote a more detailed discussion of all of this, with examples of how to reconfigure your pools, in MOS DocID 2116794.1.


Tuesday Oct 01, 2013

CPU-DR for Zones

In my last entry, I described how to change the memory configuration of a running zone.  The natural next question is whether that also works with CPUs that have been assigned to a zone.  The answer, of course, is "yes".

You might wonder why that would be necessary in the first place.  After all, there's the Fair Share Scheduler, which is extremely capable of managing zones' CPU usage.  However, there are reasons to assign dedicated CPU resources to zones; licensing is one, SLAs with specified CPU requirements another.  In such cases, you configure a fixed number of CPUs (more precisely, strands) for a zone.  Being able to change this configuration on the fly then becomes desirable.  I'll show how to do that in this blog entry.

In general, there are two ways to assign exclusive CPUs to a zone.  The classic approach is to use a resource pool with an associated processor set; one or more zones can then be bound to that pool.  The easier solution is to use the parameter "dedicated-cpu" directly when configuring the zone.  In this second case, Solaris creates a temporary pool to manage these resources.  Effectively, the implementation is the same in both cases, which also makes it clear how to change the CPU configuration: by changing the pool.  If you do this in the classic approach, the change to the pool will be persistent.  If you work with the temporary pool created for the zone, you will also need to change the zone's configuration if you want the change to survive a zone restart.

If you configured your zone with "dedicated-cpu", the temporary pool (and the temporary processor set that goes with it) will usually be called "SUNWtmp_<zonename>".  If not, you'll know the name of the pool, since you created it yourself.  In both cases, everything else is the same:

Let's assume a zone called orazone, currently configured with 1 CPU, which is to be assigned a second CPU.  The current pool configuration looks like this:

root@benjaminchen:~# pooladm

system default
	string	system.comment 
	int	system.version 1
	boolean	system.bind-default true
	string	system.poold.objectives wt-load

	pool pool_default
		int	pool.sys_id 0
		boolean	pool.active true
		boolean	pool.default true
		int	pool.importance 1
		string	pool.comment 
		pset	pset_default

	pool SUNWtmp_orazone
		int	pool.sys_id 5
		boolean	pool.active true
		boolean	pool.default false
		int	pool.importance 1
		string	pool.comment 
		boolean	pool.temporary true
		pset	SUNWtmp_orazone

	pset pset_default
		int	pset.sys_id -1
		boolean	pset.default true
		uint	pset.min 1
		uint	pset.max 65536
		string	pset.units population
		uint	pset.load 687
		uint	pset.size 3
		string	pset.comment 

		cpu
			int	cpu.sys_id 1
			string	cpu.comment 
			string	cpu.status on-line

		cpu
			int	cpu.sys_id 3
			string	cpu.comment 
			string	cpu.status on-line

		cpu
			int	cpu.sys_id 2
			string	cpu.comment 
			string	cpu.status on-line

	pset SUNWtmp_orazone
		int	pset.sys_id 2
		boolean	pset.default false
		uint	pset.min 1
		uint	pset.max 1
		string	pset.units population
		uint	pset.load 478
		uint	pset.size 1
		string	pset.comment 
		boolean	pset.temporary true

		cpu
			int	cpu.sys_id 0
			string	cpu.comment 
			string	cpu.status on-line

As we can see in the definition of pset SUNWtmp_orazone, it has been assigned CPU #0.  To add another CPU to this pool, you'll need these two commands:

root@benjaminchen:~# poolcfg -dc 'modify pset SUNWtmp_orazone \
                     (uint pset.max=2)'
root@benjaminchen:~# poolcfg -dc 'transfer to pset \
                     SUNWtmp_orazone (cpu 1)'

To remove that CPU from the pool again, use these:

root@benjaminchen:~# poolcfg -dc 'transfer to pset pset_default \
                     (cpu 1)'
root@benjaminchen:~# poolcfg -dc 'modify pset SUNWtmp_orazone \
                     (uint pset.max=1)'

That's it.  If you've used "dedicated-cpu" in your zone's configuration, you'll also need to update that setting before the next reboot to make the change permanent.  If not, you'd simply use the name of the pool you assigned to the zone in the commands above.
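
For example, to make the second CPU permanent for a zone configured with "dedicated-cpu", something like this should do (a sketch):

root@benjaminchen:~# zonecfg -z orazone 'select dedicated-cpu; set ncpus=2; end'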


Monday Aug 19, 2013

Memory-DR for Zones

Zones allow you to limit their memory consumption.  The usual way to configure this is with the zone parameter "capped-memory" and its three sub-values "physical", "swap" and "locked".  "Physical" corresponds to the resource control "zone.max-rss", which is actual main memory.  "Swap" corresponds to "zone.max-swap", which is swap space, and "locked" corresponds to "zone.max-locked-memory", which is non-pageable memory, typically shared memory segments.  Swap and locked memory are hard limits that can't be exceeded.  RSS (physical memory) is not quite as hard a limit, being enforced by rcapd.  This daemon tries to page out those memory pages that exceed the allowed amount of memory and are least active.  Depending on the activity of the processes in question, this is more or less successful, but it will always result in paging activity, which slows down the memory-hungry processes in that zone.
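
To inspect the currently effective limits of a running zone from the global zone, you can query the resource controls with prctl and watch rcapd's per-zone capping with rcapstat, roughly like this:

root@benjaminchen:~# prctl -n zone.max-swap -i zone orazone
root@benjaminchen:~# prctl -n zone.max-locked-memory -i zone orazone
root@benjaminchen:~# rcapstat -z 5 5    # zone caps, 5 second interval, 5 samples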

If you change any of these values using zonecfg, the changes will only take effect after a reboot of the zone.  This is not as dynamic as one might be used to from the LDoms world.  But it can be, as I'd like to show in a small example:

Let's assume a little zone with a memory configuration like this:

root@benjaminchen:~# zonecfg -z orazone info capped-memory
capped-memory:
    physical: 512M
    [swap: 256M]
    [locked: 512M]

To change these values while the zone is running, you need to interact with two different subsystems.  For physical memory, we need to talk to rcapd; for swap and locked memory, we use prctl and the normal resource controls.  So, if I wanted to double all three limits for my zone, I'd need these commands:

root@benjaminchen:~# prctl -n zone.max-swap -v 512m -r -i zone orazone
root@benjaminchen:~# prctl -n zone.max-locked-memory -v 1g -r -i zone orazone
root@benjaminchen:~# rcapadm -z orazone -m 1g

These new values take effect immediately - for rcapd, after the next reconfiguration interval, which you can also change with rcapadm.  Note that these changes are not persistent: if you reboot the zone, it falls back to whatever was configured with zonecfg.  So to get both persistent changes and immediate effect, you'll need to use both tools.
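
The persistent counterpart via zonecfg, matching the doubled values above, would look like this:

root@benjaminchen:~# zonecfg -z orazone 'select capped-memory; set physical=1g; set swap=512m; set locked=1g; end'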

Links:

  • Solaris Admin Guide:
    http://docs.oracle.com/cd/E19683-01/817-1592/rm.rcapd-1/index.html

Tuesday Apr 17, 2012

Solaris Zones: Virtualization that Speeds up Benchmarks

One of the first questions that typically comes up when I talk to customers about virtualization is the overhead involved.  Now, we all know that virtualization with hypervisors comes with an overhead of some sort.  We should also all know that exactly how big that overhead is depends on the type of workload as much as on the hypervisor used.  While there have been attempts to create standard benchmarks for this, quantifying hypervisor overhead is still mostly hidden in the mists of marketing and benchmark uncertainty.  However, what always raises eyebrows is when I come to Solaris Zones (called Containers in Solaris 10) as an alternative to hypervisor virtualization.  Since Zones are, greatly simplified, nothing more than a group of Unix processes contained by a set of rules enforced by the Solaris kernel, it is quite evident that there can't be much overhead involved.  Nevertheless, since many people think in hypervisor terms, there is almost always some doubt about this claim of zero overhead.  And as much as I find the technical explanation compelling, I also understand that seeing is so much better than believing.  So - look and see:

The Oracle benchmark teams are so convinced of the advantages of Solaris Zones that they actually use them in the configurations for public benchmarking.  Solaris resource management also works in a non-Zones environment, but Zones make it just so much easier to handle, especially with some of the more complex benchmark configurations.  There are numerous benchmark publications available using Solaris Containers, dating back to the days of the T5440.  Some recent examples, all of them world records, are:

The use of Solaris Zones is documented in all of these benchmark publications.

The benchmarking team also published a blog entry detailing how they use resource management with Solaris Zones to actually increase application performance.  That almost calls for the term "negative overhead", if it weren't somewhat misleading.

So, if you ever need to substantiate why Solaris Zones have no virtualization overhead, point to these (and probably some more) published benchmarks.

About

News, tips and other interesting information about SPARC, CMT, performance and its analysis, as well as experiences with Solaris on servers and laptops.

This is a bilingual blog (most of the time).
The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
