Monday Nov 19, 2007

Corestat for UltraSPARC T2/T2+


With the launch of UltraSPARC T2+ processor based servers, corestat needed an update. The updated version of corestat is now available from the link on this blog. Also note that the same version (V1.2.3) should work on both T5220 and T5240 servers.

Understanding processor utilization is important for performance analysis and capacity planning. With the launch of UltraSPARC T2 based servers I would like to revisit the topic of core utilization.

As we have seen earlier, for a Chip Multi Threaded (CMT) processor, like UltraSPARC T1, CPU utilization reported by conventional tools like mpstat/vmstat and core utilization reported using hardware performance counters in the processor are different metrics and both are equally important in performance analysis and tuning.

Before discussing the details of core utilization on UltraSPARC T2 and the details of corestat, let us take a quick look at what a core on UltraSPARC T2 looks like. UltraSPARC T2 extends the CMT architecture of T1. It consists of eight cores where each core has eight hardware threads. Hardware threads within a core are grouped into two sets of four threads each. There are two integer pipelines within a core and each set of four threads shares one integer pipeline. In this sense, the resources available for computation within a core are doubled compared to UltraSPARC T1. It is worth understanding that threads within a core do not switch pipelines; the assignment of threads to a pipeline is fixed and hardwired.
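To relate this layout to the corestat output shown later (which reports per core, per integer pipe), here is a minimal Perl sketch of how a virtual CPU id might map to a (core, pipe) pair. The numbering scheme assumed here (CPUs 0-63, eight per core, the first four threads of a core on pipe 0 and the next four on pipe 1) is an illustration only, not taken from the corestat source.

    #!/usr/bin/perl
    # Hypothetical illustration : map a Solaris virtual CPU id to the
    # (core, integer pipe) pairs reported by corestat on UltraSPARC T2.
    # Assumes CPUs are numbered 0-63, eight per core, with the first four
    # threads of a core on pipe 0 and the next four on pipe 1.
    sub core_and_pipe {
        my ($cpu) = @_;
        my $core = int($cpu / 8);
        my $pipe = int(($cpu % 8) / 4);
        return ($core, $pipe);
    }

    for my $cpu (0, 3, 4, 12, 63) {
        my ($core, $pipe) = core_and_pipe($cpu);
        print "cpu $cpu -> core $core, int-pipe $pipe\n";
    }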

Another important addition to the compute resources within a core is a Floating Point Unit (FPU). Each core of T2 includes an FPU shared by all eight threads of that core. Other shared resources within a core include the Level-1 Instruction (I) and Data (D) caches and the Translation Lookaside Buffers (TLBs), i.e. the I-TLB and D-TLB. All cores share a 4 MB Level-2 (L2) cache. These are among the key features that make both single thread and multi thread performance of UltraSPARC T2 better than that of T1.

A quick look at the UltraSPARC T2 architecture shows the following enhancements which benefit single thread performance :
  • Increased frequency - 1417 MHz
  • Lower instruction latencies
  • Better Floating Point performance
  • Hardware TLB miss handling for I-TLB and D-TLB
  • Larger D-TLB size (128 entries v/s 64 entries)
  • Larger L2 cache (4 MB v/s 3 MB)
  • Full support of VIS 2.0 instruction set. No kernel emulation
Similarly, the following are some of the features of UltraSPARC T2 that benefit multi thread performance :
  • Two integer pipelines per core
  • Twice the number of hardware threads (64 v/s 32)
  • Higher L2 cache set associativity. 16 way compared to 12 way
  • Instruction cache being 8 way associative compared to 4 way
  • Dedicated Floating point unit per core shared by all 8 strands, improved FP throughput
  • Memory interface supports FBDIMMs for higher capacity and bandwidth
  • Support for shared context feature where multiple contexts share the same entry in the TLB for mappings to the same address segment
  • Streaming Processing Unit (SPU) per core for on chip encryption/decryption support
Now, let us look at the topic of core utilization. All the important concepts, like thread scheduling, idle hardware threads, stalled threads etc., were introduced in my earlier blog on T1. Those concepts generally hold good for T2 as well; however, there are subtle differences, such as that on T2 an integer pipeline remaining idle doesn't mean a full core remains idle. Both pipelines within a core can concurrently execute one instruction per cycle each, hence at 1417 MHz a core can execute a maximum of 2 x 1417 x 1000 x 1000 instructions/second.
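As a concrete illustration, the following sketch (not part of corestat; the sampled value is made up) converts a per-core instruction count collected over one second into a utilization percentage against that peak :

    #!/usr/bin/perl
    # Minimal sketch (not part of corestat) : convert a per-core
    # instruction count sampled over one second into a utilization
    # percentage on UltraSPARC T2. The sampled value below is made up.
    my $freq_mhz  = 1417;
    my $max_ips   = 2 * $freq_mhz * 1000 * 1000;  # two integer pipes per core
    my $instr_cnt = 850_000_000;                  # instructions retired in 1 sec
    printf "Core utilization = %.1f %%\n",
           100 * $instr_cnt / $max_ips;           # ~30 % of peak here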

Considering the differences described above, corestat for UltraSPARC T2/T2+ has been enhanced and can be downloaded from here. The main enhancements are :
  1. It now reports the utilization of each of the two integer pipelines within a core separately. By default only the integer pipeline utilization is reported.
  2. A new command line option "-g" reports the FPU utilization along with the integer pipeline utilization (see the sketch after this list).
  3. Corestat detects the frequency of the target system at run time.
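The FPU figures come from the Instr_FGU_arithmetic counter used in the cpustat command line further below. As a minimal sketch, assuming the FPU retires at most one arithmetic instruction per cycle, per-core FPU utilization could be derived as follows (an illustration of the idea, not the actual corestat source) :

    #!/usr/bin/perl
    # Hypothetical derivation of FPU utilization for one core, assuming
    # the FPU retires at most one arithmetic instruction per cycle and
    # that $fgu_instr holds the core's Instr_FGU_arithmetic count
    # aggregated over one second. The sample value is made up.
    my $freq_mhz  = 1417;
    my $fgu_instr = 425_000;
    printf "FPU utilization = %.2f %%\n",
           100 * $fgu_instr / ($freq_mhz * 1e6);  # ~0.03 % here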

While the usage remains the same, corestat for UltraSPARC T2 can be used in two modes :
  1. For online monitoring, it requires root privileges. This is the default mode of operation. The default reporting interval is 10 sec and the default frequency is 1417 MHz.
  2. It can report core utilization by post processing already sampled cpustat data. The data should be collected using the following command line, and the resulting file can then be fed to corestat with the -f (and, if needed, -r) options shown in the usage message below :
cpustat -n -c pic0=Instr_cnt,pic1=Instr_FGU_arithmetic \
           -c pic0=Instr_cnt,pic1=Instr_FGU_arithmetic,nouser,sys 1

$ corestat
   Frequency = 1050 MHz
             corestat : Permission denied. Needs root privilege...

Usage : corestat [-g] [-v] [[-f <infile>] [-i <interval>] [-r <freq>]]

                 Default mode : Report Integer Pipeline Utilization
                 -g          : Report FPU usage
                 -v          : Report version number
                 -f infile   : Filename containing sampled cpustat data
                 -i interval : Reporting interval in sec (default = 10 sec)
                 -r freq     : Processor frequency in MHz (default = 1417 MHz)

        # corestat -g

           Core Utilization for Integer pipeline
     Core,Int-pipe     %Usr     %Sys     %Usr+Sys
     -------------     ----     ----     --------
          0,0          0.00     0.19      0.20
          0,1          0.00     0.01      0.01
          1,0          0.00     0.03      0.03
          1,1          0.00     0.01      0.01
          2,0          1.15     0.02      1.16
          2,1          0.00     0.01      0.01
          3,0          0.02     0.02      0.04
          3,1          0.00     0.01      0.01
          4,0          0.00     0.02      0.03
          4,1          0.00     0.01      0.01
          5,0          0.02     0.01      0.03
          5,1          0.00     0.01      0.01
          6,0          0.05     0.03      0.08
          6,1          0.00     0.01      0.01
          7,0          0.00     0.03      0.03
          7,1          0.00     0.01      0.01
     -------------     ----     ----     --------
          Avg          0.08     0.03      0.10

                  FPU Utilization
          Core     %Usr     %Sys     %Usr+Sys
          ----     ----     ----     --------
           0       0.02     0.01      0.03
           1       0.02     0.01      0.03
           2       0.01     0.01      0.03
           3       0.01     0.01      0.03
           4       0.02     0.01      0.04
           5       0.02     0.02      0.04
           6       0.02     0.02      0.04
           7       0.02     0.02      0.04
          ----     ----     ----     --------
          Avg      0.02     0.02      0.04

As far as interpretation of corestat data is concerned, all the points mentioned in the earlier blog with respect to T1 hold good. Since core saturation (measured using corestat) and virtual CPU saturation (measured using vmstat/mpstat) are two different aspects, we need to monitor both simultaneously in order to determine whether an application is likely to saturate the core while using fewer application threads. In such cases, increasing the workload (e.g. by increasing the number of threads) may not yield any more performance. On the other hand, most often we will see applications with a high Cycles Per Instruction (CPI) ratio, which therefore cannot saturate the cores fully before reaching 100% CPU utilization.

While I make this new version of corestat available here, we are already looking at a number of RFEs received as comments on my earlier blog and via e-mails to me. Some of those points are being considered for future versions. Stay tuned !!


Friday Jun 09, 2006

Corestat


Corestat : Core Utilization reporting tool for UltraSPARC T1

It's been a while since I last talked to you. Last time we saw what is meant by core utilization of UltraSPARC T1 and why it is important to understand it separately from the processor utilization reported by conventional tools like vmstat and mpstat. I had promised to make Corestat, the tool I developed for monitoring core utilization, available to you. Corestat is now released and can be downloaded from here.

    Usage :
       $ corestat
         corestat : Permission denied. Needs root privilege...

         Usage : corestat [-v] [-f <infile>] [-i <interval>] [-r <freq>]

                          -v          : Report version number
                          -f infile   : Filename containing sampled cpustat data
                          -i interval : Reporting interval in sec (default = 10 sec)
                          -r freq     : CPU frequency in MHz (default = 1200 MHz)

  # corestat
                 Core Utilization
         CoreId     %Usr     %Sys     %Total
         ------     -----    -----    ------
         0          16.23    18.56    34.80
         1          26.09    13.42    39.52
         2          28.97    11.47    40.44
         3          28.63    11.74    40.38
         4          29.18    12.95    42.13
         5          29.25    11.31    40.56
         6          29.10    15.96    45.06
         7          23.97    12.55    36.51
         ------     -----    -----    ------
         Avg        26.43    13.50    39.92

Wednesday Dec 07, 2005

Database scaling on Sun Fire T2000


Database scalability on Sun Fire T2000

With the launch of the Sun Fire T2000 server I would like to share with you three aspects of database performance on these CoolThreads servers. First let us see what researchers have found about the CPU utilization of DBMSs, then what we observed from our own tests, and finally I will share my analysis.

1. Where does time go ?

Researchers from the University of Wisconsin, in their analysis of DBMS performance on modern processors, have looked at the CPU utilization of a DBMS by asking where the time goes, i.e. how processor cycles actually get used. Their conclusion clearly states that for OLTP workloads 60% to 80% of the time is spent in memory related stalls, and the breakdown of those stalls shows dominance of data and instruction stalls at the L2 cache level. Because of this high amount of memory stalls, OLTP workloads exhibit a high CPI (Cycles Per Instruction). As stalls increase, processor core utilization drops and overall efficiency is lowered.

The UltraSPARC T1 processor with CoolThreads technology is fundamentally designed to take advantage of the stall component in the workload. UltraSPARC T1 hides memory stalls in one thread by allowing other threads from the same core to use the pipeline. Where a thread on a conventional processor would stall and still occupy the pipeline, UltraSPARC T1 has hardware threads which can continue to execute even if one or more threads are stalled. This greatly improves core efficiency.

2. What did we observe ?

Soon after the arrival of early prototypes of Sun Fire servers based on the UltraSPARC T1 processor, we were curious to know how CMT works for databases, how the shared L2 cache behaves for OLTP, and how commercial databases benefit from all the large page performance projects in Solaris.

So, we configured a database of about 1.5 TB using a commercial DBMS on a Sun Fire T2000 with 32 GB memory and ran a number of performance tests. Let us see what we found :

Scaling characteristics :


Initially we sized the database scale to control the amount of i/o activity, simply to understand the scaling behavior with an increasing number of hardware threads. We noticed excellent scaling :

  # of hardware threads/core   # of Virtual processors   Relative Performance (Throughput)
  --------------------------   -----------------------   ---------------------------------
              1                           8                           1.0
              2                          16                           1.95
              3                          24                           2.98
              4                          32                           3.9


We did two sets of experiments to further understand scaling. By appropriately sizing the database cache and the database scale, we kept the disk i/o per transaction more or less the same throughout these tests. For these tests, threads within a core were disabled as needed using the psradm(1M) command of Solaris (psradm -f <cpu_id> takes a virtual processor offline and psradm -n brings it back).

  • Scaling across cores (always using all 4 threads/core)

# of Cores   # of Virtual processors   Relative Performance (Throughput)
----------   -----------------------   ---------------------------------
    2                   8                            1.0
    4                  16                            1.91
    6                  24                            2.91
    8                  32                            3.68 *


  • Scaling the number of hardware threads per core 

# of hardware Threads/core   # of Virtual Processors   Relative Performance (Throughput)
--------------------------   -----------------------   ---------------------------------
            1                           8                            1.0
            2                          16                            1.84
            3                          24                            2.47
            4                          32                            3.10 *

     *  We observed ~10% idle time for this config

As shown above, this commercial DBMS could scale quite well in both dimensions. Due to the high amount of i/o we saw idle time at 32 threads.

There are two ways in which we can select hardware threads from the cores of UltraSPARC T1. For example, if we want to use 8 hardware threads, we can use 4 threads in each of 2 cores or 1 thread in each of the 8 cores.

Comparison of the throughput results shows that for the same number of hardware threads it is beneficial to use more cores. The performance gap closes as we increase the number of threads per core.


Configuration                               Performance Difference
-----------------------------------------   ----------------------
1 thread/core  v/s 4 threads in 2 cores             33 %
2 threads/core v/s 4 threads in 4 cores             16 %
3 threads/core v/s 4 threads in 6 cores              2 %

This shows that for the same number of hardware threads, DBMS performance benefits from using more cores. Certain resources, like the Level 1 caches and TLBs, are available per core, and using more cores allows the software to use more of these resources. However, around 24 threads the difference between choosing all 8 cores over selecting only 6 cores almost vanishes.

DBMS and large page support in Solaris :

We also characterized the large pagesize selection features in Solaris 10, specifically developed for UltraSPARC T1.

The UltraSPARC T1 processor has 64 entry Instruction and Data TLBs per core which support 8k, 64k, 4M and 256M page sizes. The Solaris 10 kernel on Sun Fire T2000 has been optimized to make use of large pages for various segments in the address space of a process. Solaris provides an optimum pagesize selection algorithm out of the box and requires no special tuning (pagesize -a lists the page sizes supported on a system).

Individual feature tests showed the following results :

Large Page (LP) feature                Performance gain (%)
------------------------------------   --------------------
LP for text and libraries                      6.4 %
LP for ISM (database shared memory)            9.4 %
LP for Kernel heap                             6.8 %
LP for heap, stack and anon                    1.8 %

We have seen OLTP performance improvements of up to 30% due to the combined effect of all the large page projects in Solaris. While running a commercial DBMS on Sun Fire T2000, we see most of the database cache being allocated using 256 MB pages, text getting allocated on 4 MB pages, and heap, stack and anonymous memory segments getting allocated using 64 KB pages.

All of this works just out of the box !

Other observations :

  • We also tried different scheduling classes, FX as well as RT, but noticed that even at high throughput the default TS scheduling class performs the best.

  • Understanding processor utilization on CMT systems can be tricky. Low CPI applications tend to saturate the core using fewer threads, whereas high CPI applications tend to run out of available threads before they can saturate the core. Since the CPI of the OLTP workload on Sun Fire T2000 is quite high, it doesn't saturate the core capacity of 1.2 billion instructions/sec/core (at 1200 MHz frequency), which means we can use vmstat and mpstat to get a true idea of the head room available. Also, scaling tests with varying load on the system have shown the performance variation pattern following the variation in CPU utilization closely.

  • As the throughput scales, we did notice slight increase in response time. It was reasonably low and within acceptable limits.

3. So why does database OLTP performance scale really well on Sun Fire T2000 ?

A number of factors contribute to the overall good performance and scaling. Basically, CoolThreads technology is really working well. [We have validated this by analyzing hardware performance counter data collected using cpustat.]
  • We have used cpustat to analyze the cache misses per transaction and to analyze the code path. Cache misses increase only marginally as we scale up, which validates that the 12 way associativity of the L2 cache is working well. We also noticed that the code path, i.e. the instructions executed per transaction, remains almost constant even as we increase the number of threads. This also shows good software scaling.
  • Floating point usage is extremely low in the case of a DBMS. During the tests, cpustat data showed that floating point unit (FPU) usage is less than 0.01% of instructions.
  • All the large page features in Solaris help reduce the number of TLB misses. They require no special /etc/system tuning.
  • Along with all these, low memory latency, ample i/o connectivity with 3 PCI-E and 2 PCI-X slots, and 4 on-board GigE ports providing good network connectivity make Sun Fire T2000 a balanced server architecture for databases.

UltraSPARC T1 utilization explained


With the introduction of the Sun Fire T2000/T1000 servers using the UltraSPARC T1 processor, Sun has taken a radically different approach to building scalable servers. The UltraSPARC T1 processor is best perceived as a system on a chip. In order to understand the performance of any system we need to start with understanding its CPU utilization. Let us see how software and hardware thread scheduling is done on UltraSPARC T1, why conventional tools like mpstat don't show the complete picture, and what CPU utilization really means for the T1 processor. While thinking about this issue, I wrote "corestat", a new tool to monitor the core utilization of the T1 processor, and I will discuss the use of this tool too.

Let us start with an overview of the basic concepts which will help in understanding the rationale for addressing CPU utilization separately for UltraSPARC T1's CMT architecture.

CMT and UltraSPARC T1 at a glance :

The UltraSPARC T1 processor combines Chip Multiprocessing with Chip Multi threading. The processor architecture consists of eight cores with four hardware threads per core. Each core has one integer pipeline, and the four threads within a core share that pipeline. There are two levels of shared resources on the processor: within each core, the threads share the Level 1 (L1) Instruction and Data caches as well as the Translation Lookaside Buffer (TLB), and all the cores share the on-chip Level 2 (L2) cache. The L2 cache is a 12 way set associative unified (instruction and data combined) cache.

Thread scheduling on UltraSPARC T1 :

The Solaris Operating System kernel treats each hardware thread of a core as a separate CPU, which makes the T1 processor look like a 32 CPU system. In reality it is a single physical processor with 32 virtual processors. Conventional tools like mpstat and prtdiag report 32 CPUs on T1. The Solaris Operating System schedules software threads onto these virtual processors (hardware threads) very much like on a conventional SMP system. There is a one to one mapping of software threads onto hardware threads, and a software thread stays scheduled on one hardware thread until its time quantum expires or it is pre-empted by a higher priority software thread.

A hardware scheduler decides how the hardware threads sharing a core use its pipeline. Every cycle the hardware thread scheduler switches threads within a core, allowing each hardware thread to run at least every 4th cycle. There are two specific situations in which a hardware thread can get to run for more than one cycle in four consecutive cycles: when another hardware thread becomes idle or gets stalled.
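As a toy illustration of this policy (not a cycle-accurate model of the chip), the following Perl sketch round-robins one integer pipeline among four hardware threads, skipping threads that are parked or stalled, so the remaining ready threads pick up the spare cycles :

    #!/usr/bin/perl
    # Toy model (not cycle-accurate) : one integer pipeline shared by four
    # hardware threads. Parked or stalled threads are skipped, so the
    # ready threads inherit the spare cycles instead of running only
    # every 4th cycle.
    my @ready = (1, 0, 0, 1);   # threads 1 and 2 are parked or stalled
    my $next  = 0;              # round-robin pointer
    for my $cycle (0 .. 7) {
        my $issued = '-';
        for my $i (0 .. 3) {
            my $t = ($next + $i) % 4;
            if ($ready[$t]) { $issued = $t; $next = ($t + 1) % 4; last; }
        }
        print "cycle $cycle : thread $issued issues\n";
    }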

Let us look into these cases more closely :

What is meant by an “idle” hardware thread on UltraSPARC T1 :

Conventionally, a processor is considered idle by the kernel when there is no runnable thread in the system that can be scheduled on that processor. On previous generation SPARC processors, an idle state meant the pipeline of the processor remained unused. For a CMT processor like T1, if there are not enough runnable threads in the system then one or more hardware threads in a core remain idle.

The main differences in behavior between an idle virtual processor (hardware thread) of T1 and an idle CPU in a conventional SMP are :
  • A hardware thread becoming idle doesn't mean an entire core becomes idle. The processor core still continues to execute instructions on behalf of the other threads in the core.
  • Solaris kernel has been optimized for T1 processor so that when a hardware thread becomes idle, it is parked. A parked thread is taken out of the mix of threads available in a core for scheduling. Its time slice is allocated to the next runnable thread from the same core.
  • An idle (parked) thread doesn't consume any cycles on UltraSPARC T1. On a non CMT SPARC processor based system an idle processor executes an idle loop in the kernel.
  • A hardware thread becoming idle doesn't necessarily reduce core utilization. It also doesn't slow down other threads sharing the same core. Core utilization depends on how efficiently a thread can execute its instructions.
  • Mpstat on Sun Fire T2000 reports an idle thread in the same way as it reports an idle CPU in conventional SMP system.
  • On a conventional system the idleness of a processor is inherently linked to the idleness of the system. On UltraSPARC T1, one or more hardware threads can be idle while the processor is still executing instructions at reasonable capacity. These two aspects are not directly related.
  • Only when all four threads from the same core become idle does that core become idle and its utilization drop to zero.
What is meant by a “stalled” thread on UltraSPARC T1 :
On a T1 processor, when a thread stalls on a long latency instruction (such as a load missing in the cache), it is taken out of the mix of schedulable threads, allowing the next ready-to-run thread from the same core to use its time slice. As on conventional processors, a stalled thread on T1 is reported as busy by mpstat. On conventional processors a stalled thread (e.g. on a cache miss) occupies the pipeline and hence results in low system utilization. On T1 the core can still be utilized by the other non-stalled runnable threads.

Understanding processor utilization :

For a T1 processor a thread being idle and a core becoming idle are two different things and hence need to be understood separately. Here are some commonly asked questions in this regard :
  • There are already vmstat and mpstat, so why do we need to think about anything else ?

On UltraSPARC T1, Solaris tools like mpstat only report the state of a hardware thread and don't show the core utilization. Conventionally, if a processor is not idle it is considered busy. A stalled processor is also conventionally considered busy because, for non CMT processors, the pipeline of a stalled processor is not available to other runnable threads in the system. However, on a T1 processor a stalled thread doesn't mean a stalled pipeline. On the T1 processor, vmstat and mpstat output should really be interpreted as a report of pipeline occupancy by software threads. For non CMT processors, the idle time reported by mpstat or vmstat can be used to decide on adding more load to the system. On a CMT processor like T1, we also need to look at the core utilization before making the same decision.
  • How can we understand core utilization on UltraSPARC T1 if mpstat doesn't show it ?
The core utilization of a T1 corresponds to the number of instructions executed by that core. Cpustat is a tool available on Solaris to monitor system behavior using hardware performance counters. The T1 processor has two hardware performance counters per thread (there are no core specific counters). One of the performance counters always reports the instruction count and the other can be programmed to measure other events such as cache misses, TLB misses etc. A typical cpustat command looks like :

cpustat -c pic0=L2_dmiss_ld,pic1=Instr_cnt 1

which will report the L2 data cache load misses and the instructions executed in user mode, at a 1 second interval, for all the enabled threads.

I wrote a new tool, “Corestat”, for online monitoring of core utilization. Core utilization is reported for all the available cores by aggregating the instructions executed by all the threads in each core. It is a Perl script which forks a cpustat command at run time and then aggregates the instruction counts to derive the core utilization. A T1 core can execute at most 1 instruction/cycle, hence the maximum core utilization is directly proportional to the frequency of the processor.
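The aggregation idea can be sketched in a few lines of Perl. This is an illustration only, not the actual corestat source; it assumes cpustat's usual "time cpu event pic0 pic1" line layout, Instr_cnt in pic1, and one second worth of samples :

    #!/usr/bin/perl
    # Sketch of the aggregation idea behind corestat (illustration only,
    # not the actual source). Assumes cpustat's usual
    # "time cpu event pic0 pic1" line layout, Instr_cnt in pic1, and one
    # second worth of samples on a 1200 MHz T1 (4 threads per core).
    my $freq_mhz = 1200;
    my %instr;
    while (<>) {
        my ($time, $cpu, $event, $pic0, $pic1) = split;
        next unless defined $event && $event eq 'tick';
        $instr{ int($cpu / 4) } += $pic1;   # aggregate threads into cores
    }
    for my $core (sort { $a <=> $b } keys %instr) {
        printf "Core %d : %6.2f %%\n", $core,
               100 * $instr{$core} / ($freq_mhz * 1e6);
    }

Fed a file of sampled cpustat output, this prints one utilization line per core, which is essentially what the post processing mode of corestat produces.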

Corestat can be used in two modes :
  • For online monitoring purpose, it requires root privileges. This is the default mode of operation. Default reporting interval is 10 sec and it assumes the frequency of 1200 MHz.
  • It can be used to report core utilization by post processing already sampled cpustat data.

    Usage :
       $ corestat
         corestat : Permission denied. Needs root privilege...

         Usage : corestat [-f <infile>] [-i <interval>] [-r <freq>]

                          -f infile   : Filename containing sampled cpustat data
                          -i interval : Reporting interval in sec (default = 10 sec)
                          -r freq     : CPU frequency in MHz (default = 1200 MHz)
       # corestat
                 Core Utilization
         CoreId     %Usr     %Sys     %Total
         ------     -----    -----    ------
         0          16.23    18.56    34.80
         1          26.09    13.42    39.52
         2          28.97    11.47    40.44
         3          28.63    11.74    40.38
         4          29.18    12.95    42.13
         5          29.25    11.31    40.56
         6          29.10    15.96    45.06
         7          23.97    12.55    36.51
         ------     -----    -----    ------
         Avg        26.43    13.50    39.92

    mpstat data for the same period from the same system looks like :

    CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
     0    2   0 4191  7150 6955 1392   93  374  573   14  1433   78  22   0   0
      1    2   0  179 11081 10956 1180  132  302 1092   13  1043   79  21   0   0
      2    1   0  159  9524 9388 1085  141  261 1249   14   897   79  21   0   0
      3    0   0 3710 10540 10466  621  231  116 1753    2   215   70  29   0   0
      4    5   0   28   355    1 2485  284  456  447   30  2263   77  23   0   0
      5    5   0   25   350    1 2541  280  534  445   26  2315   78  22   0   0
      6    3   0   26   331    0 2501  267  545  450   28  2319   78  22   0   0
      7    2   0   30   292    1 2390  232  534  475   23  2244   77  22   0   0
      8    4   0   22   265    1 2188  220  499  429   26  2118   75  25   0   0
      9    2   0   28   319    1 2348  258  513  440   26  2161   76  24   0   0
     10    4   0   23   308    0 2384  259  514  430   22  2220   76  24   0   0
     11    4   0   27   292    0 2366  237  518  438   30  2209   77  23   0   0
     12   11   0   31   314    0 2446  253  530  458   27  2290   78  22   0   0
     13    4   0   31   273    1 2334  223  523  428   25  2261   79  21   0   0
     14   12   0   29   298    1 2405  247  521  435   25  2286   78  22   0   0
     15    4   0   32   330    1 2445  272  526  450   24  2248   77  22   0   0
     16    5   0   28   271    0 2311  219  528  406   29  2188   76  23   0   0
     17    4   0   23   309    1 2387  253  537  442   25  2234   78  22   0   0
     18    3   0   25   312    1 2412  257  534  449   26  2216   78  22   0   0
     19    3   0   29   321    1 2479  262  545  462   31  2287   78  22   0   0
     20   14   0   29   347    0 2474  289  541  457   24  2253   78  22   0   0
     21    4   0   29   315    1 2406  259  534  469   24  2240   77  22   0   0
     22    4   0   27   290    1 2406  243  531  480   25  2258   77  22   0   0
     23    4   0   27   286    1 2344  235  531  445   26  2240   77  22   0   0
     24    3   0   30   279    0 2292  228  518  442   22  2160   77  23   0   0
     25    3   0   26   275    1 2340  227  538  448   25  2224   76  23   0   0
     26    4   0   22   294    1 2349  247  529  479   26  2197   77  23   0   0
     27    4   0   27   324    1 2459  270  544  476   25  2256   77  23   0   0
     28    4   0   25   300    1 2426  249  549  461   27  2253   77  23   0   0
     29    5   0   27   323    1 2463  269  541  447   23  2277   77  22   0   0
     30    2   0   27   289    1 2386  239  535  463   26  2222   77  23   0   0
     31    3   0   29   363    1 2528  304  525  446   26  2251   76  23   0   0


Here we can see that each core is executing at 39% of its max capacity. Interestingly, the mpstat output for the same period shows that all the virtual CPUs are 100% busy. Together this shows that, in this particular case, even 100% busy threads cannot utilize any of the cores to their max capacity, due to the stalls.
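As a back-of-the-envelope illustration, the following sketch (assuming a 1200 MHz T1 core that retires at most one instruction per cycle) converts the ~39% core utilization above into an effective core-level CPI :

    #!/usr/bin/perl
    # Back-of-the-envelope sketch : relate the ~39% core utilization seen
    # above to an effective core-level CPI, assuming a 1200 MHz T1 core
    # that retires at most one instruction per cycle.
    my $freq_hz = 1200 * 1e6;
    my $util    = 0.39;                  # from the corestat sample above
    my $ips     = $util * $freq_hz;      # instructions actually retired/sec
    printf "Effective CPI ~= %.1f\n",
           $freq_hz / $ips;              # ~2.6 : a heavy stall component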

From corestat data we can get an idea of the absolute capacity of the core available for more work. The higher the core usage, the closer the core is to saturation and the less head room it has for processing more load; it also means the pipeline is being used more efficiently. However, low core utilization doesn't simply mean more room for applying more load: all the virtual CPUs can be 100% busy while the core utilization remains low.
  • How to use core utilization data along with conventional stat tools ?
Core utilization (as seen above from corestat) and mpstat or vmstat need to be used together to make decisions about system utilization.

Here is some explanation of a few commonly observed scenarios :

Vmstat reports 75% idle and core utilization is only 20% :

Since vmstat reports a large amount of idle time and the core usage is also low, there is head room for applying more load. Any performance gain from increasing the load will depend on the characteristics of the application.

Vmstat reports 100% busy and core utilization is 50% :

Since vmstat reports all threads as 100% busy, there is really no more head room to schedule any more software threads, hence the system is at its peak load. The low (i.e. 50%) core utilization indicates that the application is only utilizing each core to 50% of its capacity and the cores are not saturated.

Vmstat reports 75% idle but core utilization is 50% :

Since core utilization is higher than that reported by vmstat, this is an indication that the processor can get saturated by having fewer software threads than the available hardware threads. It is also an indication of a low CPI application. In this case, scalability will be limited by core saturation and adding more load after a certain point will not help achieve any more performance.

As with any other system, on Sun Fire T2000 as the load increases, more threads become busy and core utilization also goes up. Since thread saturation (i.e. virtual CPU saturation) and core saturation are two different aspects of system utilization, we need to monitor both simultaneously in order to determine whether an application is likely to saturate a core while using fewer threads. In that case, applying additional load on the system will not deliver any more throughput. On the other hand, if all the threads get saturated but core utilization shows more head room, then the application has stalls and is a high CPI application. Application level tuning, partitioning of resources using processor sets (psrset(1M)), or binding of LWPs (pbind(1M)) are some techniques to improve performance in such cases.

Introduction

This is my first blog and I would like to introduce myself. I joined Sun Microsystems a little over 9 years ago and have been working with the performance group since then. I have spent most of my time on database performance analysis and improvement, and I love to analyze software performance from a system architecture point of view. In the past I have worked on Memory Placement Optimization related projects. My current passion is CMT. With the launch of the T1 processor I think we are going to see a big shift in conventional thinking about performance. It should open new avenues for learning, and let's hope that's going to be a lot of fun!! I will be discussing CMT performance and hope to share with you some interesting performance stuff.