Critical Threads Optimization

Background

One of the more common issues we've been seeing in the field is the growing difficulty in optimizing the performance of multi-threaded applications. A good portion of this difficulty is due to the increasing complexity of modern processors, which present various degrees of sharing relationships between hardware components. Take any current CMT processor and you'll find any number of CPUs sharing execution pipelines, floating-point units, caches, and so on. Consequently, applying the traditional recipe of one software thread for each CPU will have varying degrees of success, depending on the layout of the underlying hardware.

On top of this increasing complexity, we've also seen processors with features that aim to dynamically resource software threads according to their utilization. Intel's Turbo Boost allows processors to increase their operating frequency if there is enough thermal headroom available and the processor isn't fully utilized. More recently, the SPARC T4 processor introduced dynamic threading, allowing each core to dynamically allocate more resources to its active CPUs. Both cases in essence recognize that current processors will be running a wide mix of workloads: some designed for throughput, others for low latency. The hardware provides mechanisms to dynamically resource threads according to their runtime behavior.

We're very aware of these challenges in Solaris, and have been working to provide the best out-of-box performance while also providing mechanisms to further optimize applications when necessary. The Critical Threads Optimization was introduced in Solaris 10 8/11 and Solaris 11 as one such mechanism, allowing customers both to address issues caused by contention over shared hardware resources and to explicitly take advantage of features such as T4's dynamic threading.

What it is

The basic idea is to allow performance-critical threads to execute with more exclusive access to hardware resources. For example, when deploying an application that implements a producer/consumer model, it'll likely be advantageous to give the producer more exclusive access to the hardware instead of having it compete for resources with all the consumers. In the case of a T4-based system, we may want to have a producer running by itself on a single core and create one consumer for each of the remaining CPUs.

With the Critical Threads Optimization we're extending the semantics of scheduling priorities (which thread should run first) to include priority over shared resources (which thread should have more "space"). Now the scheduler will not only run higher priority threads first: it will also provide them with more exclusive access to hardware resources when those resources are available.

How does it work?

Using the previous example, on Solaris 11 all you'd have to do is place the producer in the Fixed Priority (FX) scheduling class at priority 60, or in the Real Time (RT) class at any priority, and Solaris will try to give it more "hardware space". On both Solaris 10 8/11 and Solaris 11 this can be achieved through the existing priocntl(1,2) and priocntlset(2) interfaces. If your application already assigns these priorities to its performance-critical threads, there's no additional step you need to take.
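For instance, here is a minimal sketch using the priocntl(1) command (the ./producer binary and the pid 1234 below are hypothetical placeholders):

    # Launch the producer directly in the FX class at priority 60
    # (-m sets the user priority limit, -p the priority itself):
    priocntl -e -c FX -m 60 -p 60 ./producer

    # Or move an already-running process into FX at priority 60:
    priocntl -s -c FX -m 60 -p 60 -i pid 1234

    # Alternatively, place it in the RT class at any priority:
    priocntl -s -c RT -i pid 1234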

One important aspect of this optimization is that it requires some level of idleness in the system, either as a result of sizing the application beforehand or through periods of transient idleness during runtime. If the system is fully committed, the scheduler will put all the available CPUs to work.

Best practices

If you're an application developer, we encourage you to look into assigning the right priorities for the different threads in your application. Solaris provides different scheduling classes (Time Share, Interactive, Fair Share, Fixed Priority and Real Time) that offer different policies and behaviors. It is not always simple to figure out which threads are critical to the performance of a workload, and it may not always be feasible to take advantage of this optimization, but we believe that this can be correctly (and safely) done during development.
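If you're unsure what's available or where things currently run, two standard commands are a good starting point (a quick sketch; exact output varies by release and configuration):

    # List the scheduling classes configured on this system:
    priocntl -l

    # Show the class and priority of each running process:
    ps -e -o pid,class,pri,args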

Overall, the out-of-box performance in Solaris should meet your workload's requirements. If you are looking for that extra bit of performance, then the Critical Threads Optimization may be what you're looking for.

Comments:

I read that the new LDOM release for Solaris 11 has two options to utilize critical threading: an LDOM can be set to max-throughput mode or max-ipc mode. This is a very good feature, but does a similar thing exist for Solaris zones?
For instance, when I create a processor set, assign it to a resource pool, and assign that pool to a zone (whole-core containers, of course), is there a way to define that the core is in max-ipc mode? It becomes relevant if I want to run databases inside the zone and make sure any single-threaded process doesn't become a bottleneck.
I understand I can put the process in FX priority and assign priority 60 via priocntl, but I would like more zone-level control for max-ipc mode.

thanks
Martin Francis K

Posted by martin francis k on March 12, 2012 at 09:09 PM GMT #

To configure a zone to use an entire core, you must use resource pools, and use the "poolcfg -dc" command to transfer the specific CPU IDs that make up the core into the pool. This is done in the global zone, and must be scripted, as this sort of pool configuration is not persistent across reboots of the global zone.

Then set the zone's pool property to this pool so that the zone is bound to it when it boots. You can also bind a running zone using poolbind(1M).
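As a rough sketch of these two steps (the names db_pset, db_pool and db_zone and CPU IDs 8-15 are hypothetical; on a T4, IDs 8-15 would be the eight hardware strands of one core):

    pooladm -e    # enable the resource pools facility
    poolcfg -dc 'create pset db_pset (uint pset.min = 1; uint pset.max = 8)'
    poolcfg -dc 'create pool db_pool'
    poolcfg -dc 'associate pool db_pool (pset db_pset)'
    poolcfg -dc 'transfer to pset db_pset (cpu 8; cpu 9; cpu 10; cpu 11; cpu 12; cpu 13; cpu 14; cpu 15)'

    zonecfg -z db_zone 'set pool=db_pool'    # bound at next zone boot
    poolbind -p db_pool -i zoneid db_zone    # or bind the running zone now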

If you want to vary throughput versus IPC for this pool, simply offline some number of the CPUs in the pool using psradm(1M) in the global zone. This must be scripted as well, as CPU online/offline settings are not persistent across reboot.
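For example, continuing the hypothetical core above, offlining six of its eight strands shifts the pool toward single-thread performance, and onlining them again restores throughput:

    psradm -f 10 11 12 13 14 15    # offline six strands; their cycles go to the remaining two
    psradm -n 10 11 12 13 14 15    # bring them back online for maximum throughput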

A "cpu" in Solaris maps to a hardware thread. In general, an offline cpu in a pool will be in the halted state. This means that the hardware thread will not be scheduled on the core, giving more cycles to the remaining hardware threads.

Posted by Stephen Lawrence on April 06, 2012 at 01:20 PM GMT #

When an LDOM is set to max-ipc mode, is it essentially the same as off-lining the strands (threads) on the core using psradm, or p_online programmatically?
If so, why is the max-ipc mode of LDOMs published as a main feature in the latest LDOM software release, if customers could do the same thing by writing a simple script to offline threads using psradm from inside an LDOM in previous releases of LDOM (VM for SPARC), even with older UltraSPARC T processors (T1-T3)?

Based on the SPARC T4 datasheet, the critical threads optimization (CTO) is not the same as turning threads off using psradm or programmatically using p_online.

CTO is essentially the scheduler deciding to run a thread with a high IPC value (the producer thread) all by itself on a core, by not scheduling any other thread on the same core. The strands are not disabled, but rather halted by means of a V9 instruction, in which case the number of CPUs listed in psrinfo output will also count parked strands as online (I haven't tested it, so it's just a hypothesis). On the other hand, when strands are turned off using psradm, they show as offline, i.e. disabled.

Halted threads are targets for scheduling LWPs if the number of runnable threads is greater than the number of online (non-parked) threads available to the OS.

Below is a snippet from the UltraSPARC instruction supplement manual:

"The operation of the Halt pseudo-instruction is as follows. The virtual processor can be parked, disabled, running, or halted. Once halted, the virtual processor remains halted until an interrupt arrives. When the interrupt arrives, the processor transitions back to running. It resumes execution at the NPC of the Halt instruction. When halted, the virtual processor consumes no execution resource. It is similar to the parked state except that it awakens upon an interrupt."

Posted by martin francis k on April 07, 2012 at 09:11 AM GMT #
