UltraSPARC T1 utilization explained



With the introduction of the Sun Fire T2000/T1000 servers, built on the UltraSPARC T1 processor, Sun has taken a radically different approach to building scalable servers. The UltraSPARC T1 processor is best perceived as a system on a chip. To understand the performance of any system, we need to start by understanding its CPU utilization. Let us see how software and hardware thread scheduling is done on UltraSPARC T1, why conventional tools like mpstat don't show the complete picture, and what CPU utilization really means for the T1 processor. While thinking about this issue, I wrote "corestat", a new tool to monitor the core utilization of the T1 processor, and I will discuss the use of this tool too.

Let us start with an overview of the basic concepts, which will help explain why CPU utilization needs to be addressed separately for UltraSPARC T1's CMT architecture.

CMT and UltraSPARC T1 at a glance :

The UltraSPARC T1 processor combines Chip Multiprocessing with Chip Multithreading. The processor consists of eight cores with four hardware threads per core. Each core has one integer pipeline, and the four threads within a core share that pipeline. There are two types of shared resources on the processor. The four threads of a core share that core's Level 1 (L1) instruction and data caches as well as its Translation Lookaside Buffer (TLB), and all the cores share the on-chip Level 2 (L2) cache. The L2 cache is a 12-way set associative unified (instruction and data combined) cache.

Thread scheduling on UltraSPARC T1 :

The Solaris Operating System kernel treats each hardware thread of a core as a separate CPU, which makes a T1 processor look like a 32 CPU system. In reality it is a single physical processor with 32 virtual processors. Conventional tools like mpstat and prtdiag report 32 CPUs on T1. The Solaris Operating System schedules software threads onto these virtual processors (hardware threads) much as it would on a conventional SMP system. At any instant there is a one to one mapping of software threads onto hardware threads: a software thread stays on one hardware thread until its time quantum expires or it is pre-empted by a higher priority software thread.
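The relationship between the 32 virtual CPUs and the 8 cores can be sketched in a few lines of Python. This is only an illustration: it assumes the numbering described in the comments further down (CPU ids 0 to 3 in core 0, CPU ids 4 to 7 in core 1, and so on), and the names `core_of` and `cpus_in_core` are hypothetical helpers, not part of any Solaris tool.

```python
# Sketch only: assumes virtual CPU ids are assigned to cores in
# consecutive groups of four, as described in the comments below.
CORES = 8
THREADS_PER_CORE = 4


def core_of(cpu_id):
    """Return the core that a virtual CPU (hardware thread) belongs to."""
    if not 0 <= cpu_id < CORES * THREADS_PER_CORE:
        raise ValueError("cpu_id out of range for an 8-core T1")
    return cpu_id // THREADS_PER_CORE


def cpus_in_core(core_id):
    """Return the virtual CPU ids that share the pipeline of one core."""
    if not 0 <= core_id < CORES:
        raise ValueError("core_id out of range for an 8-core T1")
    start = core_id * THREADS_PER_CORE
    return list(range(start, start + THREADS_PER_CORE))
```

With this numbering, the mpstat rows for CPUs 4 to 7 all describe threads that compete for the single pipeline of core 1.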

The hardware scheduler decides how the hardware threads sharing a core use its pipeline. Every cycle the hardware thread scheduler switches threads within a core, so that each hardware thread gets to run at least every fourth cycle. There are two situations in which a hardware thread can run for more than one cycle in four consecutive cycles: when another hardware thread in the same core becomes idle, or when one gets stalled.
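A toy model may help make the cycle sharing concrete. The sketch below is not the actual T1 thread-select logic. It assumes that each hardware thread keeps a fixed state ("ready", "idle" or "stalled") for the whole window and that the pipeline is handed round-robin to the ready threads only. The function name `issue_slots` and the state labels are made up for this illustration.

```python
# Simplified model of one T1 core: each cycle goes to the next "ready"
# hardware thread in round-robin order. Idle (parked) and stalled threads
# are skipped, so their slots go to the remaining threads.
def issue_slots(states, cycles):
    """Return how many cycles each hardware thread gets in the window."""
    counts = [0] * len(states)
    ready = [i for i, state in enumerate(states) if state == "ready"]
    if not ready:
        return counts  # no runnable thread: the core itself is idle
    for cycle in range(cycles):
        counts[ready[cycle % len(ready)]] += 1
    return counts
```

Under these assumptions, four ready threads each get every fourth cycle, while `issue_slots(["ready", "idle", "stalled", "ready"], 8)` gives the two ready threads four cycles each: the pipeline stays fully used even though two of the four threads are not running.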

Let us look into these cases more closely :

What an “idle” hardware thread means on UltraSPARC T1 :

Conventionally, the kernel considers a processor idle when there is no runnable thread in the system that can be scheduled on that processor. On previous generation SPARC processors, the idle state corresponded to the processor's pipeline remaining unused. On a CMT processor like T1, if there are not enough runnable threads in the system, one or more hardware threads in a core remain idle.

The main differences in behavior between an idle virtual processor (hardware thread) on T1 and an idle CPU in a conventional SMP system are :
  • A hardware thread becoming idle doesn't mean the entire core becomes idle. The processor core continues to execute instructions on behalf of the other threads in the core.
  • The Solaris kernel has been optimized for the T1 processor so that when a hardware thread becomes idle, it is parked. A parked thread is taken out of the mix of threads available in a core for scheduling. Its time slice is allocated to the next runnable thread from the same core.
  • An idle (parked) thread doesn't consume any cycles on UltraSPARC T1. On a non-CMT SPARC processor based system, an idle processor executes an idle loop in the kernel.
  • A hardware thread becoming idle doesn't necessarily reduce core utilization. It also doesn't slow down other threads sharing the same core. Core utilization depends on how efficiently a thread can execute its instructions.
  • mpstat on Sun Fire T2000 reports an idle thread in the same way as it reports an idle CPU in a conventional SMP system.
  • On a conventional system, the idleness of a processor is inherently linked to the idleness of the system. On UltraSPARC T1, one or more hardware threads can be idle while the processor is still executing instructions at reasonable capacity. These two aspects are not directly related.
  • Only when all four threads of the same core become idle does that core become idle and its utilization drop to zero.

What a “stalled” thread means on UltraSPARC T1 :

On a T1 processor, when a thread stalls due to a long latency instruction (such as a load missing in the cache), it is taken out of the mix of schedulable threads, allowing the next ready-to-run thread from the same core to use its time slice. As on conventional processors, a stalled thread on T1 is reported as busy by mpstat. On conventional processors a stalled (e.g. on a cache miss) thread occupies the pipeline and hence results in low system utilization. On T1 the core can still be utilized by other non-stalled runnable threads.

Understanding processor utilization :

On a T1 processor, a thread being idle and a core being idle are two different things and hence need to be understood separately. Here are some commonly asked questions in this regard :
  • There is already vmstat and mpstat so why do we need to think about anything else ?

On UltraSPARC T1, Solaris tools like mpstat report only the state of a hardware thread and don't show core utilization. Conventionally, a processor that is not idle is considered busy. A stalled processor is also conventionally considered busy, because on non-CMT processors the pipeline of a stalled processor is not available to other runnable threads in the system. On a T1 processor, however, a stalled thread doesn't mean a stalled pipeline. On T1, vmstat and mpstat output should really be interpreted as a report of pipeline occupancy by software threads. On non-CMT processors, the idle time reported by mpstat or vmstat can be used to decide whether to add more load to the system. On a CMT processor like T1, we also need to look at core utilization before making the same decision.
  • How can we understand core utilization on UltraSPARC T1 if mpstat doesn't show it ?

Core utilization of a T1 corresponds to the number of instructions executed by that core. cpustat is a tool available on Solaris for monitoring system behavior using hardware performance counters. The T1 processor has two hardware performance counters per thread (there are no core-specific counters). One of the counters always reports the instruction count, and the other can be programmed to measure other events such as cache misses and TLB misses. A typical cpustat command looks like :

cpustat -c pic0=L2_dmiss_ld,pic1=Instr_cnt 1

which reports, at 1 second intervals, the data misses in the L2 cache and the instructions executed in user mode by all the enabled threads.

I wrote a new tool, “corestat”, for online monitoring of core utilization. Core utilization is reported for all the available cores by aggregating the instructions executed by all the threads in each core. It is a Perl script which forks the cpustat command at run time and then aggregates the instruction counts to derive the core utilization. A T1 core can execute at most 1 instruction per cycle, so the maximum instruction rate of a core is directly proportional to the frequency of the processor.
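The arithmetic behind such a utilization figure can be sketched as follows. This is not corestat's source code, only a reconstruction from the description above: the instruction counts of the four threads in a core are summed and divided by the most instructions the core could have executed in the interval, assuming one instruction per cycle. The function name `core_utilization` and its parameters are hypothetical.

```python
# Sketch of the calculation described above, not the corestat script itself.
# Assumption: a core can execute at most one instruction per cycle, so its
# ceiling for an interval is frequency (Hz) * interval (seconds).
def core_utilization(instr_counts, interval_sec, freq_mhz=1200):
    """Return core utilization in percent.

    instr_counts: instruction counts of the hardware threads in one core,
                  sampled over interval_sec seconds.
    """
    max_instructions = freq_mhz * 1_000_000 * interval_sec
    return 100.0 * sum(instr_counts) / max_instructions
```

For example, if each of the four threads in a core executes 120 million instructions in one second on a 1200 MHz part, the core executes 480 million of a possible 1200 million instructions, i.e. 40% utilization. Passing the wrong frequency scales the result proportionally, so the value given for the frequency needs to match the machine.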

Corestat can be used in two modes :
  • Online monitoring, which requires root privileges. This is the default mode of operation. The default reporting interval is 10 sec, and the tool assumes a frequency of 1200 MHz.
  • Reporting core utilization by post-processing already sampled cpustat data.

    Usage :
       $ corestat
         corestat : Permission denied. Needs root privilege...

         Usage : corestat [-f <infile>] [-i <interval>] [-r <freq>]

                           -f infile   : Filename containing sampled cpustat data
                           -i interval : Reporting interval in sec (default = 10)
                           -r freq     : CPU frequency in MHz (default = 1200)
       # corestat
                 Core Utilization
         CoreId     %Usr     %Sys     %Total
         ------     -----    -----    ------
         0          16.23    18.56    34.80
         1          26.09    13.42    39.52
         2          28.97    11.47    40.44
         3          28.63    11.74    40.38
         4          29.18    12.95    42.13
         5          29.25    11.31    40.56
         6          29.10    15.96    45.06
         7          23.97    12.55    36.51
         ------     -----    -----    ------
         Avg        26.43    13.50    39.92

    mpstat data for the same period from the same system looks like :

    CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
      0    2   0 4191  7150 6955 1392   93  374  573   14  1433   78  22   0   0
      1    2   0  179 11081 10956 1180  132  302 1092   13  1043   79  21   0   0
      2    1   0  159  9524 9388 1085  141  261 1249   14   897   79  21   0   0
      3    0   0 3710 10540 10466  621  231  116 1753    2   215   70  29   0   0
      4    5   0   28   355    1 2485  284  456  447   30  2263   77  23   0   0
      5    5   0   25   350    1 2541  280  534  445   26  2315   78  22   0   0
      6    3   0   26   331    0 2501  267  545  450   28  2319   78  22   0   0
      7    2   0   30   292    1 2390  232  534  475   23  2244   77  22   0   0
      8    4   0   22   265    1 2188  220  499  429   26  2118   75  25   0   0
      9    2   0   28   319    1 2348  258  513  440   26  2161   76  24   0   0
     10    4   0   23   308    0 2384  259  514  430   22  2220   76  24   0   0
     11    4   0   27   292    0 2366  237  518  438   30  2209   77  23   0   0
     12   11   0   31   314    0 2446  253  530  458   27  2290   78  22   0   0
     13    4   0   31   273    1 2334  223  523  428   25  2261   79  21   0   0
     14   12   0   29   298    1 2405  247  521  435   25  2286   78  22   0   0
     15    4   0   32   330    1 2445  272  526  450   24  2248   77  22   0   0
     16    5   0   28   271    0 2311  219  528  406   29  2188   76  23   0   0
     17    4   0   23   309    1 2387  253  537  442   25  2234   78  22   0   0
     18    3   0   25   312    1 2412  257  534  449   26  2216   78  22   0   0
     19    3   0   29   321    1 2479  262  545  462   31  2287   78  22   0   0
     20   14   0   29   347    0 2474  289  541  457   24  2253   78  22   0   0
     21    4   0   29   315    1 2406  259  534  469   24  2240   77  22   0   0
     22    4   0   27   290    1 2406  243  531  480   25  2258   77  22   0   0
     23    4   0   27   286    1 2344  235  531  445   26  2240   77  22   0   0
     24    3   0   30   279    0 2292  228  518  442   22  2160   77  23   0   0
     25    3   0   26   275    1 2340  227  538  448   25  2224   76  23   0   0
     26    4   0   22   294    1 2349  247  529  479   26  2197   77  23   0   0
     27    4   0   27   324    1 2459  270  544  476   25  2256   77  23   0   0
     28    4   0   25   300    1 2426  249  549  461   27  2253   77  23   0   0
     29    5   0   27   323    1 2463  269  541  447   23  2277   77  22   0   0
     30    2   0   27   289    1 2386  239  535  463   26  2222   77  23   0   0
     31    3   0   29   363    1 2528  304  525  446   26  2251   76  23   0   0

Here we can see that each core is executing at roughly 35% to 45% of its maximum capacity (about 40% on average). Interestingly, mpstat output for the same period shows that all the virtual CPUs are 100% busy. Together, the two show that in this particular case even 100% busy threads cannot drive any core to its maximum capacity, because of stalls.

From corestat data we can get an idea of how much of a core's absolute capacity remains available for more work. A higher percentage of core usage means the core is closer to saturation and has less headroom for processing more load. It also means that the pipeline is being used more efficiently. However, lower core utilization doesn't simply mean there is room for more load: all the virtual CPUs can be 100% busy while core utilization is still low.
  • How to use core utilization data along with conventional stat tools ?

Core utilization (as reported by corestat above) and mpstat or vmstat output need to be used together to make decisions about system utilization.

Here is some explanation of a few commonly observed scenarios :

vmstat reports 75% idle and core utilization is only 20% :

Since vmstat reports a large idle time and core usage is also low, there is headroom for applying more load. Any performance gain from increasing the load will depend on the characteristics of the application.

vmstat reports 100% busy and core utilization is 50% :

Since vmstat reports all threads as 100% busy, there is no headroom left to schedule any more software threads, so the system is at its peak load. The low (i.e. 50%) core utilization indicates that the application is using each core at only 50% of its capacity and the cores are not saturated.

vmstat reports 75% idle but core utilization is 50% :

Since core utilization (50%) is higher than the thread utilization reported by vmstat (25% busy), the processor can get saturated with fewer software threads than there are hardware threads. It is also an indication of a low CPI application. In this case, scalability will be limited by core saturation, and adding more load beyond a certain point will not deliver any more performance.
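The three scenarios can be condensed into a rough rule of thumb. The sketch below is only one way to encode them: the function name `interpret`, the labels it returns, and the 95% threshold for treating the virtual CPUs as saturated are all assumptions made for illustration, not values taken from corestat or from Solaris.

```python
# Rough encoding of the three scenarios above. The threshold and the
# returned labels are illustrative assumptions, not tool output.
def interpret(vmstat_idle_pct, core_util_pct, saturated_busy_pct=95.0):
    """Classify a (vmstat idle %, core utilization %) pair."""
    thread_busy_pct = 100.0 - vmstat_idle_pct
    if thread_busy_pct >= saturated_busy_pct:
        # All virtual CPUs are busy: no more software threads can be
        # scheduled, whatever the core utilization says.
        return "threads saturated"
    if core_util_pct > thread_busy_pct:
        # Cores fill up faster than threads: a low CPI application that
        # may saturate the cores before the virtual CPUs.
        return "core saturation likely"
    return "headroom available"
```

Applied to the scenarios above, `interpret(75, 20)` gives "headroom available", `interpret(0, 50)` gives "threads saturated", and `interpret(75, 50)` gives "core saturation likely". Real decisions should of course look at per-core and per-thread figures rather than a single pair of averages.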

As with any other system, on Sun Fire T2000 more threads become busy and core utilization goes up as the load increases. Since thread saturation (i.e. virtual CPU saturation) and core saturation are two different aspects of system utilization, we need to monitor both simultaneously in order to determine whether an application is likely to saturate a core while using fewer threads than the core provides. In that case, applying additional load to the system will not deliver any more throughput. On the other hand, if all the threads are saturated but core utilization shows more headroom, the application has stalls and is a high CPI application. Application level tuning, partitioning of resources using processor sets (psrset(1M)), or binding of LWPs (pbind(1M)) are some techniques that could improve performance in such cases.

Holy crap...wow!! This is a lot of really great detailed information. I'm gonna have to print this out and read it a couple of times to let it all sink in. Thanks Ravi! -Moazam

Posted by Moazam on December 07, 2005 at 04:59 AM IST #

It seems to me that requiring the user to specify the CPU frequency on the corestat command line instead of having it auto-detected is a bad design. It's already clear that Sun will be selling machines with different clock frequencies, and the user should not be allowed to get this wrong by mistake.

Posted by Glenn on December 13, 2005 at 07:05 PM IST #

(1) How long is a hardware thread typically stalled in a typical instruction mix?
(2) How does the kernel schedule software threads across cores and hardware threads? Does it just use some random allocation? Does it pile the SW threads onto the HW threads of the first core, then the second core, and so forth? Does it distribute the SW threads evenly across cores, insofar as possible? In a situation where all the cores have at least one HW thread active, does the kernel try to cluster threads on cores according to the Solaris process in which they are executing (keeping threads from the same process on the same core, as much as possible), to avoid or limit the amount of L1 cache and TLB thrashing, especially since the cache is unified? For the sake of argument, consider a system with 32 active SW threads, including 8 active threads in my single-process application. Will my own 8 threads be allocated 1 to a core, 2 to a core, 4 to a core, or some mixture of these? Then consider the same question when my 8 threads are the only threads on the machine.
(3) How large is each cache line? How much cache thrashing can we expect? Don't all the HW threads in a core tend to overlay each other's data in the L1 cache, reducing performance?
(4) What does all of this say about the best way to write an application for best performance on a T1? Sun sometimes says no restructuring is needed, but this is not believable. It seems that one would need to dramatically increase the number of threads that can be used in parallel by your application. Two threads would probably get you back to where you started (given the lower operating frequency of each core), and then you can climb from there if you can find appropriate parallelism in your application.
(5) How does 12 (-way set associativity of the cache) relate to 8 (cores) or 32 (HW threads)? 12 seems like a very strange number (not a power of two).
(6) This may be a dumb question, but ... are the L1 and L2 caches physical-memory or virtual-memory caches?

Posted by Glenn on December 13, 2005 at 07:48 PM IST #

This is a great tool, Ravi! What do you think about releasing this utility on the OpenSolaris Performance community web page? I would be happy to help you with setting up a page, etc.

Posted by Andrei Dorofeev on December 18, 2005 at 04:06 AM IST #

About Glenn's first comment on having -r option to corestat. Yes, I can look into making it auto detect at run time. That will eliminate the need for specifying the frequency when it is used for online monitoring. However, corestat can also be used for post processing sampled cpustat data and for that purpose it would still need a mechanism to specify the frequency and hence -r option is needed.

Posted by Ravindra Talashikar on December 19, 2005 at 04:59 AM IST #

Andrei, I definitely plan on releasing corestat to external community. I'll contact you.

Posted by Ravindra Talashikar on December 19, 2005 at 05:08 AM IST #

Hi there Ravindra, Thanks for posting this information. I am currently testing a T2000 for various applications at my work. I'm interested to know though; How did you get the instruction count for both user and system processes at the same time?

Playing around with cpustat, I can get one or the other but not both, and as far as I'm aware you can't run 2 instances of cpustat at the same time? cpustat refers to the T1 manual, but that manual is not available at the given URL.

Would you mind posting the cpustat command you're running?

thanks, Richard

Posted by Richard Gray on January 22, 2006 at 11:53 AM IST #

Richard, I didn't get instruction counts for user and system at the same time. Performance counters can be programmed to measure the events in only one of the three modes at a time. Output is an average over time. I used the following command :

cpustat -n -c pic0=L2_dmiss_ld,pic1=Instr_cnt -c pic0=L2_dmiss_ld,pic1=Instr_cnt,nouser,sys 1

Posted by Ravindra Talashikar on February 02, 2006 at 09:39 AM IST #

Ravindra, have you made the corestat utility available at this time? Thanks.

Posted by Robert Halloran on February 03, 2006 at 01:03 PM IST #

Hi, Is there any command in Solaris to find the number of cores in the processor. In case of T2000, it has 8 cores and each core has 4 threads. psrinfo -pv is showing 32 virtual processors and one physical processor. Is there a way to find the number of cores and threads per core.

Posted by Balaji on April 12, 2006 at 07:14 AM IST #

Hi, Is there anywhere where we can download the source or a binary of corestat ? Thanks.

Posted by Mark Round on April 24, 2006 at 08:40 AM IST #

Hi, I am trying to figure out how to get the number of cores in the processor, and also the number of hardware threads, programmatically. Can you help ? Also, is the number of H/W threads per core fixed at 4 ?

Posted by Niranjan B on May 02, 2006 at 05:52 PM IST #

Well, there was no easy way to figure out the number of cores programmatically. One can do so by traversing the chip_t structure in the kernel, but that requires mdb and is an undocumented way of finding this information. Based on growing demand for this information, in the upcoming release of Solaris (actually in Nevada build 32) a new member has been added to the cpu_info kstat to report the core_id. Use kstat -m cpu_info | grep core_id to get the relationship between each virtual processor (i.e. hardware thread) and its core_id. The number of available hardware threads is fixed at 4 per core. One can use the psradm(1M) command to online/offline a hardware thread. Also, it's possible to control the available resources as an administrator using the System Controller (SC).

Posted by Ravindra Talashikar on June 04, 2006 at 11:40 AM IST #

Hi, I used the kstat command, but all 32 cpus are reported to be of chip_id 0. There is no distinguishing factor in the core_id. How do I figure out the thread to core mapping? Also, is the corestat publicly available? thanks, Kamal

Posted by Kamal Srinivasan on June 06, 2006 at 12:14 AM IST #

Ravindra: I found the information that you posted extremely helpful. Although you expressed the ideas clearly, I am interested in these particular topics: 1. As developers, can we have access to the way that Solaris schedules software threads onto hardware threads? In other words, can I develop an application that schedules the threads itself on a T1 platform? 2. Does Solaris have a dynamic load balancing tool? I know that the answer may be long. If you do not have enough time, I will be more than happy if you give me the links to the corresponding documentation. Thank you, Javier.

Posted by Javier Iparraguirre on June 12, 2006 at 06:41 PM IST #

Answering to Kamal's observation. I mentioned that in build 32 of Nevada (i.e. next major Solaris release) a new kstat is introduced for core_id. This build is not available externally yet. Currently the simplest way to map hardware threads to cores is that cpu id 0 to 3 are in core 0, cpu ids 4 to 7 in core 1 and so on... I have released corestat through my latest blog and am eager to see how you find it useful in performance analysis.. Do share your feedback.

Posted by Ravindra Talashikar on June 13, 2006 at 09:39 AM IST #

Commenting on some of the questions raised by Javier. Basically, the Solaris Operating System kernel treats hardware threads as CPUs in a conventional system. A software thread has a priority and a time quantum associated with it. The scheduler takes these aspects into account, as it does for non-CMT processors anyway. One can influence thread scheduling in many ways, like binding an LWP to a virtual processor (pbind(1M)), running LWPs in a processor set (psrset(1M)), changing the scheduling class (priocntl(1M)) etc. About load balancing, it is the fundamental job of any Operating System kernel, hence Solaris always does it. No need for any special tool there...

Posted by Ravindra Talashikar on June 13, 2006 at 09:55 AM IST #

Hi Ravindra, I have downloaded the corestat utility and tried to run it on my Solaris machine, but it gives the following error.
# ./corestat
Argument "pic0" isn't numeric in array element at ./corestat line 152, <fd_in> line 1.
Argument "[-c" isn't numeric in array element at ./corestat line 152, <fd_in> line 3.
Argument "measure" isn't numeric in numeric eq (==) at ./corestat line 152, <fd_in> line 3.
Argument "events" isn't numeric in array element at ./corestat line 152, <fd_in> line 5.
Argument "[-p" isn't numeric in numeric eq (==) at ./corestat line 152, <fd_in> line 5.
Argument "period" isn't numeric in array element at ./corestat line 152, <fd_in> line 7.
Argument "processor" isn't numeric in numeric eq (==) at ./corestat line 152, <fd_in> line 7.
Argument "run" isn't numeric in array element at ./corestat line 152, <fd_in> line 8.
Argument "through" isn't numeric in addition (+) at ./corestat line 205, <fd_in> line 13.
The uname output of my machine is : SunOS romeo 5.10 Generic_118833-03 sun4u sparc SUNW,Ultra-80
Could you please let me know why it is giving this error ? Thanks and regards, Shailesh

Posted by Shailesh on September 19, 2006 at 09:05 AM IST #

Hi, Thanks for your explanations, this is very good. But as I know Sun Fire 890 server has CMT processors too and they do not scale as it is said here. I have a program with 20 threads. I executed it on a sun 440 with 4 cpus it took 60 seconds with 0 idle sar output. When I executed it on 890, it took 95 seconds with 30 percent idle time. OS is solaris 9 on both systems. Is there a way to use all the cpu power on 890. More interestingly I executed 2 instances of the same program simultaneously and both of them took 75 seconds. This is faster than before and idle was 0. Is this normal behavior.

Posted by goker canitezer on November 18, 2006 at 10:23 PM IST #

Ravi, This looks like a real useful utility and it has helped explain why we are getting invalid CPU utilization alerts from the monitoring agent on Quest Foglight. We are starting to make use of resource pools and I see from the README that corestat is processor set aware. How would you recommend running corestat to collect the info for each of the pools? I looked for a command line option to set which pool/processor set is to be monitored but there isn't one there. Thanks.

Posted by Phil Freund on January 17, 2007 at 06:13 PM IST #

Hi Ravi, corestat is great, but I'm having the same problem as @Ravindra on a T2000:
Argument "Can't" isn't numeric in array element at /usr/local/bin/corestat line 152, <fd_in> line 1.
But corestat works perfect on a T1000; both run with 1000Mhz. What could be the reason? I'm using v1.0 -- Kind regards, Nick

Posted by Niki Kraus on July 05, 2007 at 06:53 AM IST #

Are there any plans to make this sort of data available via an API such as kstat?? While we are interested in getting access to get this data, running a perl script as root and parsing the output is not something I am prepared to do. What I am looking for is a more programmatic solution where I do not need to fork off a process nor do I need to know the frequency. Thanks. John

Posted by John Tavares on August 02, 2007 at 01:37 PM IST #

Hi Ravi, corestat is a great tool, and I am able to run it successfully. But I am not able to redirect the corestat output to a file.
Commands like pipe or redirection are not working with corestat.
Could you please provide a solution to this?
I run corestat using the "sudo corestat" command.
Thank you

Posted by Prashant on September 26, 2007 at 04:39 AM IST #

I also want to redirect the output to a file and further analyse. But I also found that redirection doesn't work. Anyone can give me some hints?

Posted by Karen Law on February 19, 2008 at 12:18 AM IST #


Is it possible to get this tool for T2+ processors?

Posted by PB on October 24, 2008 at 07:55 AM IST #

Is there any tool to load different cores in a CMT processor? For example, I want to load say 4 (out of 8 cores available) cores to 100% utilization.

Posted by Pramod on October 02, 2009 at 06:17 PM IST #


I've downloaded corestat. Can you tell me exactly how to install it on my T5220?

Posted by Marvin Hecht on December 11, 2009 at 06:39 PM IST #

Hi Ravi,
The information has not grown old. Found it really useful for my understanding.

Posted by Achintya on March 02, 2011 at 03:39 PM IST #

Hi Ravi

Great script. It works fine on a server without the pool facility activated.
I would very much appreciate your help. I am working on a server with 32 cores and 256 threads, distributed over some pools that are assigned to the same number of containers. Running corestat in the global zone gives me statistics for the global zone only (pool default). Trying to execute cpustat on a processor set, I got the message "psrset: cannot exec in processor set 1: Operation not supported" (because the pool facility is active). Trying to run cpustat in a local zone (container), I got "cpustat: cpu100 - No such file or directory" for each thread in that local zone. Could you help me with this problem? Is it possible to run cpustat in containers??

Posted by Wagner Pozzani on March 17, 2011 at 06:34 PM IST #
