Monday Oct 13, 2008

Solaris for the T5440

Sun announced the Sun SPARC Enterprise T5440 server today, the largest and most capable in a line of servers based on CoolThreads technology. It has four sockets holding up to four UltraSPARC T2 Plus processors, for up to 256 hardware execution strands, and the Solaris Operating System represents each strand as a CPU. How much work did it take to scale Solaris to handle 256 CPUs? Not much -- it just works. We did the heavy lifting for previous servers in this line, when we added support for the performance features of the T2 and T2 Plus processors, and when we optimized scalability for the 2-socket predecessor of the T5440. These features were designed to scale automatically with server size, and we are reaping the benefits of those designs with the T5440. To refresh your memory, here are some of the unique combinations of hardware and software that continue to deliver performance on the T5440:
  • Core Pipeline and Thread Scheduling.
    The Solaris scheduler spreads software threads across instruction pipelines, cores, and sockets for load balancing and optimal usage of processor resources.
  • Cache Associativity.
    The kernel automatically uses both hardware and software page coloring to improve effective L2 cache associativity and reduce cache conflicts.
  • Virtual Memory.
    The kernel automatically uses large memory pages and enables hardware tablewalk to reduce the cost of virtual to physical memory translation. Virtual to physical mappings are shared in hardware for processes that share memory.
  • Block Copy.
    Solaris provides optimized versions of library functions to initialize and copy large blocks of memory.
  • Crypto Acceleration.
    The T2 Plus processor has one hardware crypto acceleration engine per core; with four 8-core processors, the T5440 has 32 crypto engines.
  • Locking Primitives.
    The kernel mutex is efficient for all system sizes, and it handles excessive contention well, using an exponential backoff algorithm that is parametrized based on system size. The algorithm automatically scales to newer, larger systems. The kernel reader-writer lock and atomic operations such as atomic_add now have exponential backoff as well (new since my last posting).
  • Multi-threaded Resource Accounting.
    Solaris performs resource accounting and limit enforcement in the clock function at regular intervals. The function is multi-threaded, and the number of threads is automatically scaled based on system size.
  • Memory Placement Optimization (MPO).
    Solaris allocates memory physically near a thread to minimize memory access latency. We added MPO support for the CMT line starting with the T5240 server, which has two sockets and two latency groups (collections of resources near each other). The T5440 has four sockets and four latency groups. The firmware provides the CPU and memory topology to Solaris in the form of an abstract graph, so no changes were required in the Solaris kernel to support MPO on the T5440.
  • Performance Counters.
    The cpustat command provides access to a variety of hardware performance counters that give visibility into the lowest level execution characteristics of a thread.
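
To make the Locking Primitives item in the list above a little more concrete, here is a minimal Python sketch of exponential backoff whose upper bound grows with the CPU count. It is not the Solaris kernel code. The names backoff_delays, base, and cap_per_cpu, and the rule that the cap is simply proportional to the number of CPUs, are assumptions chosen for illustration only.

```python
def backoff_delays(ncpus, attempts, base=1, cap_per_cpu=4):
    """Return the delay (in arbitrary spin units) used before each retry.

    Assumption: the maximum delay is proportional to the CPU count, so a
    larger system is allowed to back off for longer under contention.
    """
    cap = cap_per_cpu * ncpus
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, cap)  # double after every failed attempt
    return delays


if __name__ == "__main__":
    # Same algorithm, two system sizes: only the cap differs.
    print(backoff_delays(ncpus=32, attempts=10))
    print(backoff_delays(ncpus=256, attempts=12))
```

With these assumed parameters, the delay sequence is identical on both systems until the smaller system reaches its cap. That is one way an algorithm can scale to a bigger machine without being retuned by hand.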
If you want more detail on the above, see my previous postings:

What do you see when logged into a T5440?

Here is the list of 256 CPUs. I have deleted some output for brevity:

% mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  420  1232    4  885   60  255  100    0  2862    2  88   0  11
  1    0   0  589   456   33  366   30   96   90    0   668    0  95   0   4
  2    0   0   34  2566    6 2199  128  594  221    1  7061    4  65   0  31
  ... 250 lines deleted ...
253    0   0  666  2111 1862  518   23   79   75    0  1011    1  54   0  45
254    0   0  788  1631 1521  200    6   17   48    0   139    0  87   0  13
255    0   0  589   138   33  173    4   14   48    0   130    0  87   0  13
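
With 256 rows, it is often easier to summarize mpstat output with a small script than to read it by eye. The following Python sketch parses rows in the format shown above and averages a column. The function names parse_mpstat and mean_column are mine, not part of any Solaris tool, and the sample text is just the first three rows from the listing.

```python
SAMPLE = """\
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  420  1232    4  885   60  255  100    0  2862    2  88   0  11
  1    0   0  589   456   33  366   30   96   90    0   668    0  95   0   4
  2    0   0   34  2566    6 2199  128  594  221    1  7061    4  65   0  31
"""


def parse_mpstat(text):
    """Turn mpstat text into a list of {column name: integer value} dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    parsed = []
    for fields in rows:
        # Skip anything that is not a full numeric row, such as an
        # "... lines deleted ..." marker or a repeated header.
        if len(fields) != len(header) or not all(f.isdigit() for f in fields):
            continue
        parsed.append(dict(zip(header, map(int, fields))))
    return parsed


def mean_column(rows, name):
    """Average one column over all parsed rows."""
    return sum(row[name] for row in rows) / len(rows)


if __name__ == "__main__":
    rows = parse_mpstat(SAMPLE)
    print(len(rows), "CPUs parsed")
    print("mean sys:", mean_column(rows, "sys"))
    print("mean idl:", mean_column(rows, "idl"))
```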

Here is the grouping of those CPUs into sockets and cores. A virtual processor is a CPU. Again, I deleted some output.

% psrinfo -pv
The physical processor has 8 cores and 64 virtual processors (0-63)
  The core has 8 virtual processors (0-7)
  The core has 8 virtual processors (8-15)
  The core has 8 virtual processors (16-23)
  The core has 8 virtual processors (24-31)
  The core has 8 virtual processors (32-39)
  The core has 8 virtual processors (40-47)
  The core has 8 virtual processors (48-55)
  The core has 8 virtual processors (56-63)
    UltraSPARC-T2+ (clock 1414 MHz)

The physical processor has 8 cores and 64 virtual processors (64-127)
  The core has 8 virtual processors (64-71)
  ...
  The core has 8 virtual processors (120-127)
    UltraSPARC-T2+ (clock 1414 MHz)

The physical processor has 8 cores and 64 virtual processors (128-191)
  The core has 8 virtual processors (128-135)
  ...
  The core has 8 virtual processors (184-191)
    UltraSPARC-T2+ (clock 1414 MHz)

The physical processor has 8 cores and 64 virtual processors (192-255)
  The core has 8 virtual processors (192-199)
  ...
  The core has 8 virtual processors (248-255)
    UltraSPARC-T2+ (clock 1414 MHz)
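
In this particular output the CPU ids are contiguous, with 8 strands per core and 8 cores per socket, so a CPU id can be mapped to its socket, core, and strand with simple arithmetic. The Python sketch below does that. The function name locate is hypothetical, and the mapping is an assumption that holds for the listing above; it is not something Solaris guarantees for every configuration (for example, a domain that has been given only a subset of the CPUs).

```python
STRANDS_PER_CORE = 8   # from "The core has 8 virtual processors"
CORES_PER_SOCKET = 8   # from "The physical processor has 8 cores"


def locate(cpu_id):
    """Return (socket, core, strand) for a CPU id, all counted from 0.

    Assumes ids are assigned contiguously, as in the psrinfo output above.
    """
    strands_per_socket = STRANDS_PER_CORE * CORES_PER_SOCKET
    socket, rest = divmod(cpu_id, strands_per_socket)
    core, strand = divmod(rest, STRANDS_PER_CORE)
    return socket, core, strand


if __name__ == "__main__":
    for cpu in (0, 63, 64, 135, 255):
        print(cpu, locate(cpu))
```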

Here is the lgroup configuration (again edited):

% lgrpinfo
lgroup 0 (root):
        Children: 1-4
        CPUs: 0-255
        Memory: installed 128G, allocated 3.9G, free 124G
lgroup 1 (leaf):
        Children: none, Parent: 0
        CPUs: 0-63
        Memory: installed 32G, allocated 774M, free 31G
lgroup 2 (leaf):
        Children: none, Parent: 0
        CPUs: 64-127
        Memory: installed 33G, allocated 1.1G, free 31G
lgroup 3 (leaf):
        Children: none, Parent: 0
        CPUs: 128-191
        Memory: installed 33G, allocated 1.2G, free 31G
lgroup 4 (leaf):
        Children: none, Parent: 0
        CPUs: 192-255
        Memory: installed 33G, allocated 918M, free 32G

256 is a Big Number.

Here is a visual way to grasp the capacity offered by 256 CPUs. Look at the output of mpstat, which shows execution statistics for all CPUs on a system. Compare the capacity of the T5440 to the original T2000 server, which was revolutionary for the amount of throughput it delivered. I use teeny-tiny font so we can see both servers on the same page.

Here is the T2000 mpstat output:



% mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  434  2588    9 2156  127  577  113    0  5845    3  70   0  27
  1    0   0  510   152   34   80    7   20   88    0   130    0  99   0   1
  2    0   0   31  2283    9 2071  111  523  107    0  5588    3  66   0  31
  3    0   0  344  1342   25 1122   62  273   86    0  2479    1  76   0  23
  4    0   0 1073     4    3    0    0    0    1    0     0    0 100   0   0
  5    0   0    0     1    0    0    0    0    0    0     0    0 100   0   0
  6    0   0  248  1371   23 1215   62  297  132    0  2913    2  71   0  27
  7    0   0  505    97   36   22    1    4   49    0    26    0  99   0   1
  8    0   0  462   471   29  446   30  112   80    0   761    0  92   0   8
  9    0   0  285   930   25  963   47  237   71    0  2031    1  77   0  22
 10    0   0  450   388   30  359   18   78   65    0   561    0  90   0   9
 11    0   0  504   129   40   54    3   11   64    0    71    0  98   0   2
 12    0   0  406   516   28  511   33  124   71    0   939    1  89   0  11
 13    0   0  263   734   21  777   38  180   75    0  1648    1  78   0  21
 14    0   0  206  1322   26 1331   59  280   80    0  2906    2  53   0  45
 15    0   0  506   220   33  194   11   30   74    0   226    0  93   0   7
 16    0   0  518   138   37   90    6   19   59    0   122    0  98   0   2
 17    0   0  174   296   14  358   18   80   48    0   750    2  88   0  10
 18    0   0  442   276   30  305   15   61   72    0   438    0  90   0  10
 19    0   0  396   335   28  392   17   74   47    0   603    0  86   0  14
 20    0   0  235   401   20  464   20   96   46    0   914    0  83   0  16
 21    0   0  454   118   27  106    3   15   57    0    94    0  96   0   4
 22    0   0  505   128   35   86    4   15   61    0   107    0  97   0   3
 23    0   0   17  1115    8 1279   41  244   64    0  3442    2  36   0  62
 24    0   0  505   140   35  119    5   23   52    0   186    0  96   0   4
 25    0   0  415   171   27  193    8   40   83    0   304    0  93   0   7
 26    0   0  232   485   26  669   26  132   47    0  1241    1  69   0  31
 27    0   0  155   275   15  367   14   67   35    0   638    0  78   0  22
 28    0   0  126   189   12  263    8   45   25    0   439    0  85   0  15
 29    0   0  464   123   29  114    5   15   50    0    89    0  95   0   5
 30    0   0  448   230   26  264    9   46   88    0   386    0  83   0  17
 31    0   0  447   158   26  178    6   24   57    0   182    0  90   0  10
 32    0   0  212   387   23  615   24  132   50    0  1143    1  79   0  21


Here is the T5440 mpstat output (I cut the 256 lines into 4 columns):



% mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  420  1232    4  885   60  255  100    0  2862    2  88   0  11             64    0   0  292   927   18 2055  102  478  354    0  6580    3  54   0  43            128    0   0  131   924    5 2074  102  480  264    0  6616    4  46   0  51            192    0   0  494   270   23  504   44  147   89    0  1009    1  93   0   6
  1    0   0  589   456   33  366   30   96   90    0   668    0  95   0   4             65    0   0   32   570    5 1345   42  252  174    0  4740    2  27   0  71            129    0   0  473   515   12 1133   44  190  234    0  2970    2  45   0  54            193    0   0  101   474   12  990   62  283   69    0  2812    2  84   0  14
  2    0   0   34  2566    6 2199  128  594  221    1  7061    4  65   0  31             66    0   0  130   308    7  699   21  110  130    0  2135    1  21   0  78            130    0   0  273   289    8  646   23   88  142    0  1742    1  28   0  71            194    0   0  143   866   14 1863  131  503  125    2  4848    3  68   0  29
  3    0   0  196  1865   21 1525   91  385  151    1  4178    2  69   0  29             67    0   0   24    91    7  191    3   33   61    0   780    0   6   0  94            131    0   0  342   104    8  203    7   26   87    0   548    0  23   0  76            195    0   0  294   613   19 1299   81  312  128    1  3085    2  68   0  30
  4    0   0 1508   757   31  671   41  165   84    0  1531    1  85   0  14             68    0   0   32   802    8 1800   66  354  173    0  5662    3  33   0  64            132    0   0  205   726   15 1571   69  316  185    0  4529    2  36   0  61            196    0   0  449   655   25 1353   96  338  140    2  2997    2  73   0  25
  5    0   0  247  1461   26 1246   71  317  126    0  3222    2  70   0  28             69    0   0    1    34    0   74    1   11    9    0   242    0   1   0  98            133    0   0  307   280    8  611   21   77  117    0  1357    1  26   0  73            197    0   0 1334  2044 1930  185   17   51   87    0   377    0  95   0   5
  6    0   0  144   757   14  630   38  158  177    0  1754    1  85   0  14             70    0   0    0     2    0    2    0    0    0    0     5    0   0   0 100            134    0   0  202    50    6   88    3    7   72    0    95    0  15   0  84            198    0   0  653  1467 1290  431   28  101   66    0   763    1  86   0  13
  7    0   0  550   294   29  213   12   43  311    0   347    0  94   0   6             71    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100            135    0   0   61    25    2   48    2    3   26    0    14    0   5   0  95            199    0   0  900  2202 1989  439   28   97   82    0   860    1  81   0  18
  8    0   0  452   512   25  520   36  121  542    0   935    1  88   0  11             72    0   0   13   869    2 1955   74  393  166    0  5862    3  35   0  62            136    0   0  194   849   12 1847   90  362  231    0  4604    2  41   0  57            200    0   0  632  1440 1142  632   46  157   94    0  1275    1  86   0  14
  9    0   0  386  1037   24 1040   55  258  773    0  2317    1  73   0  25             73    0   0    9   347    2  795   17  127   77    0  2526    1  14   0  84            137    0   0   92   422    6  937   28  158  108    0  2744    1  23   0  76            201    0   0 1060  1506 1433   74    5   14   83    0    90    0  98   0   2
 10    0   0  564   362   27  335   20   69  790    0   452    0  92   0   7             74    0   0   85    74   12  112    2   14   74    0   224    0   6   0  94            138    0   0  115   194    5  415   13   50   78    0   861    0  14   0  86            202    0   0  597  1802 1447  766   49  188   76    0  1624    1  77   0  22
 11    0   0  500   196   25  120    8   24  810    0   150    0  96   0   4             75    0   0    3     7    1    9    0    1    1    0    20    0   0   0 100            139    0   0  187    48    5   85    3    7   59    0    74    0  15   0  85            203    0   0  723  1788 1599  383   25   84   92    0   711    0  87   0  13
 12    0   0  460   429   25  418   23   92  828    0   734    1  87   0  12             76    0   0   27   558    6 1263   31  193  105    0  3742    2  23   0  75            140    0   0  242   512    7 1134   42  184  136    0  3121    2  38   0  61            204    0   0  876  1508 1347  297   19   67   91    0   530    0  90   0   9
 13    0   0  222  1292   23 1301   64  303  865    1  3286    2  57   0  41             77    0   0    0    13    0   30    0    4    3    0    90    0   1   0  99            141    0   0  310   237   10  493   19   44  169    0   586    0  24   0  76            205    0   0  260   162   16  303   16   59   46    0   451    0  88   0  12
 14    0   0  292   961   22  990   48  205  838    1  2220    1  62   0  37             78    0   0    0     1    0    1    0    0    0    0     0    0   0   0 100            142    0   0    0     2    0    2    0    0    0    0     5    0   0   0 100            206    0   0  214   497   23 1026   51  201   68    0  2176    1  53   0  46
 15    0   0  569   470   30  497   26   77  855    0   693    0  81   0  19             79    0   0    0     2    0    1    0    0    0    0     0    0   0   0 100            143    0   0   16     9    2   13    0    1    9    0    20    0   1   0  99            207    0   0    4   187    0  426   14   73   23    0  1221    1  69   0  30
 16    0   0  555   300   30  353   18   64  721    0   461    0  91   0   9             80    0   0   18   800    5 1803   57  350  153    0  5148    2  33   0  65            144    0   0  306   667    8 1480   59  261  197    0  3744    2  43   0  55            208    0   0  377   639   21 1329   84  309   94    1  3117    2  65   0  33
 17    0   0  189  1019   27 1266   56  281  574    2  2930    2  53   0  46             81    0   0    6   247    1  577   11   78   49    0  1714    1  10   0  89            145    0   0   90   403    4  888   30  123  143    0  2038    1  19   0  80            209    0   0  177   454   19  942   51  197   69    0  2165    1  63   0  36
 18    0   0  549   246   27  288   13   46  368    0   341    0  89   0  11             82    0   0    0     6    0   11    0    2    0    0    24    0   0   0 100            146    0   0    0    39    0   89    1   12    7    0   266    0   2   0  98            210    0   0  454   294   25  550   29   88   72    0   703    0  73   0  26
 19    0   0  453   375   27  440   18   69  133    1   596    0  77   0  23             83    0   0    0     2    0    1    0    0    0    0     0    0   0   0 100            147    0   0  168    54    5   99    5    7   49    0   115    0  11   0  89            211    0   0  475   226   25  407   18   60   91    0   512    0  78   0  22
 20    0   0  212   236   15  300   14   59   39    0   518    0  87   0  12             84    0   0    5   441    0 1014   26  151   86    0  2916    1  18   0  80            148    0   0  159   346    6  770   25  119  112    0  1872    1  24   0  75            212    0   0    9   450    0 1029   42  203   70    0  3063    2  48   0  50
 21    0   0  493   255   27  353   16   55   63    0   443    0  88   0  12             85    0   0    0    87    0  198    3   28   11    0   510    0   3   0  96            149    0   0  391   195   11  383   17   30  174    0   300    0  29   0  71            213    0   0  237   289   19  575   21   84   50    0   963    1  55   0  44
 22    0   0  599   123   31  101    4   14   55    0   110    0  96   0   4             86    0   0    0     2    0    2    0    0    0    0     2    0   0   0 100            150    0   0   97    31    3   55    3    4   43    0    23    0   7   0  93            214    0   0  577   124   30  152    5   13   65    0    89    0  92   0   7
 23    0   0    9   766    4  899   29  179   98    0  2915    2  25   0  74             87    0   0    0     2    0    1    0    0    0    0     0    0   0   0 100            151    0   0  305    61    8  108    5    8   77    0    46    0  21   0  79            215    0   0 1554  2411 2332  114    6   11  287    0    72    0  94   0   6
 24    0   0  514   203   31  302   12   58   65    0   483    0  88   0  11             88    0   0   14   819    3 1846   55  339  171    1  5556    2  34   0  64            152    0   0   18   772    5 1721   62  326  120    0  4856    2  32   0  66            216    0   0 1314  1971 1757  422   27   91  210    0   827    1  85   0  15
 25    0   0  531   239   27  338   15   66   70    2   580    0  85   0  15             89    0   0    2   205    0  476    8   72   50    0  1545    1   9   0  90            153    0   0  275   340    8  744   23   92  151    0  1668    1  27   0  73            217    0   0 1347  1847 1710  268   16   49  201    0   422    0  88   0  12
 26    0   0  152   284   15  427   20   90   40    0   864    1  81   0  18             90    0   0    0     5    0   10    0    2    1    0    30    0   0   0 100            154    0   0  106    43    2   87    2   10   28    0   242    0   8   0  92            218    0   0 1499  2019 1945   98    7   16  209    0   118    0  96   0   4
 27    0   0  223   466   23  698   26  126   61    1  1358    1  48   0  51             91    0   0    0     5    0   10    0    1    1    0    27    0   0   0 100            155    0   0   10    18    4   19    0    1    2    0    22    0   1   0  99            219    0   0 1397  1953 1792  327   18   55  220    1   521    0  81   0  19
 28    0   0  191   312   18  501   16   77   47    2   839    1  65   0  34             92    0   0    6   502    0 1149   29  171   94    0  3293    1  21   0  78            156    0   0    6   343    1  781   20  128   55    0  2324    1  14   0  85            220    0   0 1610  1891 1836   33    3    4  214    0    32    0  99   0   1
 29    0   0  517   168   26  225    9   27   53    0   213    0  87   0  13             93    0   0    1     8    0   15    0    2    0    0    35    0   0   0 100            157    0   0  425   465   31  953   30   62  174    1   586    0  27   0  73            221    0   0 2014  2318 2220  152    8   21  329    0   151    0  94   0   6
 30    0   0  523   145   25  183    5   24   80    0   232    0  86   0  13             94    0   0    0     1    0    1    0    0    0    0     0    0   0   0 100            158    0   0  331   569   23 1223   42   78  164    2   620    0  23   0  76            222    0   0 1324  2011 1824  390   17   53  195    0   498    0  67   0  33
 31    0   0  435   254   24  414   16   46   99    0   376    0  74   0  26             95    0   0    7     5    1    3    0    0    9    0     0    0   0   0 100            159    0   0  196   303   20  614   18   42   89    2   517    0  15   0  85            223    0   0  638    99   36   66    2    5   78    0    27    0  96   0   4
 32    0   0  250   392   24  706   27  145   69    1  1323    1  74   0  25             96    0   0   14   796    2 1808   56  341  140    0  5092    2  33   0  65            160    0   0  394   861   30 1831   94  290  210    2  2836    2  40   0  59            224    0   0  530   234   26  408   28   76  141    0   535    0  90   0   9
 33    0   0  340   353   27  617   25  111   66    0   919    1  74   0  26             97    0   0    7   181    1  420    6   57   47    0  1342    1   8   0  91            161    0   0  366   492   27 1017   39  115  206    1  1182    1  31   0  69            225    0   0  554   214   29  358   20   75   65    0   572    0  89   0  11
 34    0   0  448   271   25  472   20   70   64    0   549    0  81   0  19             98    0   0   19    36    6   53    1    5   26    0   110    0   2   0  98            162    0   0  383   426   27  874   27   62  114    2   662    0  23   0  76            226    0   0  258   444   21  913   39  178   58    0  1762    1  56   0  43
 35    0   0   11   407    6  712   18  125   63    1  2087    1  29   0  70             99    0   0    1     2    0    1    0    0    0    0     0    0   0   0 100            163    0   0    5    11    1   21    0    3    5    0    51    0   1   0  99            227    0   0  427   259   25  476   19   66   95    0   570    0  70   0  29
 36    0   0  565   162   31  225    8   30   72    0   203    0  90   0  10            100    0   0   25   503    7 1116   27  163   89    0  3018    1  20   0  78            164    0   0  249   590   29 1222   48  149  130    2  1832    1  25   0  74            228    0   0  429   249   24  461   22   76   64    0   632    0  80   0  19
 37    0   0  507   226   30  363   12   44   75    0   338    0  82   0  18            101    0   0    1     5    0    8    0    1    1    0    17    0   0   0 100            165    0   0  705   657   34 1421   56   94  415    2   569    0  44   0  56            229    0   0  457   161   23  266   13   40   67    0   318    0  89   0  11
 38    0   0  501   164   25  254    8   29   73    0   193    0  86   0  14            102    0   0   13     5    1    3    0    0   26    0     1    0   1   0  99            166    0   0    0     4    0    7    0    1    0    0    16    0   0   0 100            230    0   0    5   281    0  647   20  104   39    0  1882    1  45   0  54
 39    0   0  249   292   22  485   13   52   48    0   613    0  43   0  56            103    0   0    0     1    0    1    0    0    0    0     0    0   0   0 100            167    0   0    0     1    0    1    0    0    0    0     0    0   0   0 100            231    0   0  380   257   27  465   15   42   76    0   405    0  66   0  34
 40    0   0  249   404   26  792   36  176   69    1  1506    1  77   0  22            104    0   0   34   785    3 1761   57  332  189    0  4973    2  33   0  65            168    0   0  411   722   28 1528   73  201  204    4  1850    1  35   0  64            232    0   0  624   219   29  350   27   60  161    0   376    0  93   0   7
 41    0   0   28   607   15 1279   39  272   98    3  3387    2  49   0  50            105    0   0    2   191    0  436    8   62   53    0  1487    1   9   0  91            169    0   0  627  2105 1787  704   27   66  146    1   650    0  40   0  60            233    0   0  239   542   23 1124   53  250   80    0  2559    2  60   0  39
 42    0   0    7   197    6  419   10   73   27    0  1110    1  77   0  23            106    0   0    1     7    0   14    0    1    1    0    36    0   0   0 100            170    0   0  423  1924 1709  467   15   35   93    0   414    0  34   0  66            234    0   0  408   320   28  604   26   97   72    0   856    1  72   0  28
 43    0   0  212   356   22  691   21  104   63    0  1321    1  43   0  56            107    0   0    0     2    0    1    0    0    0    0     3    0   0   0 100            171    0   0 1174  1733 1486  536   18   29  346    2   256    0  73   0  26            235    0   0  475   209   27  358   14   50   81    0   427    0  81   0  19
 44    0   0   19   242    8  504   20  108   38    2  1194    1  80   0  19            108    0   0  541   467    0 1075   23  153   81    0  2948    2  21   0  76            172    0   0  554  1747 1326  918   34   97  152    1  1122    1  39   0  61            236    0   0  172   423   19  871   38  157   60    0  1888    1  58   0  40
 45    0   0   17   333    9  696   27  126   46    3  1731    1  69   0  30            109    0   0    0    10    0   21    0    2    3    0    77    0   1   0  99            173    0   0  494  1421 1133  607   18   43  114    1   495    0  29   0  71            237    0   0    4   379    0  856   27  125   51    0  2516    1  41   0  58
 46    0   0   18   320   13  670   17  115   43    2  1639    1  45   0  54            110    0   0    0     1    0    1    0    0    0    0     0    0   0   0 100            174    0   0  633  2471 2275  426   13   25  104    1   312    0  41   0  58            238    0   0  361   241   24  455   15   46   69    0   510    0  57   0  42
 47    0   0   13   168   14  298    6   44   24    0   745    0   6   0  94            111    0   0    1     2    0    2    0    0    0    0     1    0   0   0 100            175    0   0  310  1630 1614   22    0    1   13    0    33    0  19   0  81            239    0   0  488   158   26  252    6   20   68    0   203    0  81   0  19
 48    0   0   11   178    6  379   13   83   23    0   917    0  89   0  11            112    0   0   46   687    9 1513   49  273  134    0  4309    2  28   0  70            176    0   0  132  1806 1273 1194   45  247  123    0  3560    2  38   0  60            240    0   0  261   230   19  434   25   93   55    0   817    1  86   0  13
 49    0   0    8   164    6  345   11   67   27    1   861    0  86   0  13            113    0   0    2   131    0  309    7   45   29    0   981    0   6   0  94            177    0   0   94   270   21  523   15   64   63    1  1011    1  10   0  89            241    0   0  160   280   13  573   27  100   47    0  1116    1  76   0  24
 50    0   0    7   170    5  371   10   64   28    0   985    1  81   0  19            114    0   0   10     4    1    4    0    1    5    0     8    0   1   0  99            178    0   0  373   483   26 1018   32   66  163    3   618    0  25   0  75            242    0   0  350   364   23  709   34  118  101    0  1279    1  66   0  33
 51    0   0   18   295   10  634   15  142   52    1  1718    1  44   0  55            115    0   0   98    51    3   94    4    7   24    0    37    0   6   0  94            179    0   0  525   478   27  993   32   65  185    1   598    0  31   0  69            243    0   0  186   350   18  706   25  104   81    0  1434    1  37   0  62
 52    0   0   62   316    6  691   32  138   53    1  1786    1  78   0  21            116    0   0  361   385   10  830   28  106  212    0  1694    1  33   0  66            180    1   0  295   621   30 1269   52  141  149    3  1579    2  27   0  71            244    0   0    1    42    0   90    4   17    4    0   230    0  96   0   4
 53    0   0   78   372   15  785   25  138   53    0  2021    1  59   0  40            117    0   0    0     2    0    2    0    0    0    0     2    0   0   0 100            181    0   0  333   360   35  665   19   43  127    2   510    0  20   0  79            245    0   0   22    69    3  141    7   21   12    0   253    0  93   0   7
 54    0   0   15   307   12  648   18  102   49    1  1703    1  36   0  63            118    0   0  226    84    6  160    8   12   66    0    44    0  15   0  85            182    0   0  367   526   25 1119   39   72  171    2   583    0  25   0  75            246    0   0  300   238   28  439   11   55   69    0   633    0  43   0  56
 55    0   0   10   104    9  200    4   28   19    0   474    0   3   0  97            119    0   0   13     9    2    9    0    1   14    0     9    0   1   0  99            183    0   0  237   297   22  605   16   39   95    1   429    0  14   0  86            247    0   0 1054  1619 1562   53    2    3  112    1    23    0  98   0   2
 56    0   0   68   298   13  611   25  146   54    1  1534    1  83   0  16            120    0   0  159   512   12 1125   34  205  137    0  3040    1  27   0  72            184    0   0  114   730   25 1513   64  262  117    1  3579    2  32   0  66            248    0   0  756  1902 1667  475   29  104   82    0   961    1  84   0  16
 57    0   0   21   507   14 1084   44  226   74    0  2811    1  65   0  34            121    0   0  293   161   11  318    9   38  110    0   571    0  23   0  77            185    0   0  223   551   22 1153   44  110  174    3  1201    1  24   0  76            249    0   0  817  1758 1559  380   22   69   91    1   528    0  87   0  13
 58    0   0   17   410   14  883   26  166   73    0  2430    1  51   0  48            122    0   0  207    33    5   54    2    5   54    0    76    0  14   0  86            186    0   0  384   384   23  808   23   58  129    1   688    0  24   0  76            250    0   0  916  1939 1738  396   21   72   88    0   639    0  83   0  16
 59    0   0   65   284   13  594   11  100   48    2  1663    1  32   0  68            123    0   0  301    70    7  135    6    9   70    0    36    0  18   0  82            187    0   0   93   260   12  529   15   32   78    1   336    0   9   0  91            251    0   0  922  1518 1292  464   20   66   95    1   635    0  76   0  24
 60    0   0  131   348   15  724   27  141   59    3  1707    1  70   0  29            124    0   0   81   268    3  609   15   90   89    0  1665    1  18   0  81            188    0   0  226   534   26 1101   42  127  107    0  1462    1  22   0  77            252    0   0  689  1684 1557  248   13   42   48    0   365    0  87   0  13
 61    0   0   86   376   17  777   26  140   64    2  1869    1  52   0  48            125    0   0  112    22    2   39    1    2   23    0    43    0   7   0  93            189    0   0   73   138   20  226    6   17   29    1   291    0   5   0  94            253    0   0  666  2111 1862  518   23   79   75    0  1011    1  54   0  45
 62    0   0  169   278   19  541   18   71   78    2  1000    1  41   0  58            126    0   0    0     2    0    2    0    0    0    0     1    0   0   0 100            190    0   0  281   455   22  955   33   59  142    4   580    0  19   0  81            254    0   0  788  1631 1521  200    6   17   48    0   139    0  87   0  13
 63    0   0  197   119   18  186    6   20   30    0   192    0  32   0  68            127    0   0  197    66    6  127    6    9   83    0    35    0  14   0  86            191    0   0  594   472   24  983   34   59  299    1   442    0  44   0  56            255    0   0  589   138   33  173    4   14   48    0   130    0  87   0  13


Wow! Another leap forward in capacity.

For more information on the T5440, see Allan Packer's blog index.

Wednesday Apr 09, 2008

Scaling Solaris on Large CMT Systems

The Solaris Operating System is very effective at managing systems with large numbers of CPUs. Traditionally, these have been SMPs such as the Sun Fire(TM) E25K server, but these days it is CMT systems that are pushing the limits of Solaris scalability. The Sun SPARC(R) Enterprise T5140/T5240 Server, with 128 hardware strands that each behave as an independent CPU, is a good example. We continue to optimize Solaris to handle ever larger CPU counts, and in this posting I discuss a number of recent optimizations that enable Solaris to scale well on the T5140 and other large systems.

The Clock Thread

Clock is a kernel function that by default runs 100 times per second on the lowest numbered CPU in a domain and performs various housekeeping activities. This includes time adjustment, processing pending timeouts, traversing the CPU list to find currently running threads, and performing resource accounting and limit enforcement for the running threads. On a system with more CPUs, the CPU list traversal takes longer, and can exceed 10 ms, in which case clock falls behind, timeout processing is delayed, and the system becomes less responsive. When this happens, the mpstat command will show sys time approaching 100% on CPU 0. This is more likely for memory-intensive workloads on CMT systems with a shared L2$, as the increased L2$ miss rate further slows the clock thread.

We fixed this by multi-threading the clock function. Clock still runs at 100 Hz, but it divides the CPU list into sets, and cross-calls a helper CPU to perform resource accounting for each set. The helpers are rotated so that over time the load is finely and evenly distributed over all CPUs; thus, what had been, for example, a 70% load on CPU 0 becomes a load of less than 1% on each of the 128 CPUs in a T5140 system. CPU 0 will still have a somewhat higher %sys load than the other CPUs, because it is solely responsible for some functions such as timeout processing.
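To make the rotation concrete, here is a minimal sketch of the idea, not the kernel code. The set size of 16 and the function names partition_cpus and pick_helpers are assumptions chosen for illustration; the real set size and rotation policy are not stated here.

```python
def partition_cpus(cpu_ids, set_size):
    """Split the CPU list into consecutive sets of at most set_size CPUs."""
    return [cpu_ids[i:i + set_size] for i in range(0, len(cpu_ids), set_size)]


def pick_helpers(cpu_sets, tick):
    """Pick one helper CPU per set, rotating the choice on every clock tick."""
    return [cpus[tick % len(cpus)] for cpus in cpu_sets]


# 128 CPUs as on a T5140; a set size of 16 is a hypothetical value.
cpu_sets = partition_cpus(list(range(128)), 16)
helpers_tick0 = pick_helpers(cpu_sets, 0)
helpers_tick1 = pick_helpers(cpu_sets, 1)
```

In this model each tick hands the accounting for a set to a different member of that set, so after 16 ticks every CPU has served as a helper exactly once.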

Memory Placement Optimization (MPO)

The T5140 server in its default configuration has a NUMA characteristic, which is a common architectural strategy for building larger systems. Each server has two physical UltraSPARC(R) T2 Plus processors, and each processor has 64 hardware strands (CPUs). The 64 CPUs on a processor access memory controlled by that processor at a lower latency than memory controlled by the other processor. The physical address space is interleaved across the two processors at a 1 GB granularity. Thus, an operating system that is aware of CPU and memory locality can arrange that software threads allocate memory near the CPU on which they run, minimizing latency.

Solaris does exactly that, and has done so on various platforms since Solaris 9, using the Memory Placement Optimization framework, aka MPO. However, enabling the framework on the T5140 was non-trivial due to the virtualization of CPUs and memory in the sun4v architecture. We extended the hypervisor layer by adding locality arcs in the physical resource graph, and ensured that these arcs were preserved when a subset of the graph was extracted, virtualized, and passed to the Solaris guest at Solaris boot time.

Here are a few details on the MPO framework itself. Each set of CPUs and "near" memory is called a locality group, or lgroup; this corresponds to a single T2 Plus processor on the T5140. When a thread is created, it is assigned to a home lgroup, and the Solaris scheduler tries to run the thread on a CPU in its home lgroup whenever possible. Thread-private memory (e.g., stack, heap, anon) is allocated from the home lgroup whenever possible. Shared memory (e.g., SysV shm) is striped across lgroups at page granularity. For more details on Solaris MPO, including commands to control and observe lgroups and local memory, such as lgrpinfo, pmap -L, liblgrp, and memadvise, see the man pages and this presentation.

If an application is dominated by stall time due to memory references that miss in cache, then MPO can theoretically improve performance by as much as the ratio of remote to local memory latency, which is about 1.5 : 1 on the T5140. The STREAM benchmark is a good example; our early experiments with MPO yielded a 50% improvement in STREAM performance. See Brian's blog for the latest optimized results. Similarly, if an application is limited by global coherency bandwidth, then MPO can improve performance by reducing global coherency traffic, though this is unlikely on the T5140 because the local memory bandwidth and the global coherency bandwidth are well balanced.
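The upper bound above can be illustrated with a simple Amdahl-style estimate. This is a simplified model of my own for illustration, not a measured result: it assumes that a fraction of run time is spent stalled on remote memory, and that MPO turns all of those references into local ones. The function name mpo_speedup and its parameters are hypothetical.

```python
def mpo_speedup(stall_fraction, latency_ratio=1.5):
    """Estimated speedup when remote-memory stalls become local-memory stalls.

    stall_fraction: share of run time stalled on remote memory (0.0 to 1.0).
    latency_ratio:  remote latency divided by local latency (about 1.5 on the T5140).
    """
    if not 0.0 <= stall_fraction <= 1.0:
        raise ValueError("stall_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - stall_fraction) + stall_fraction / latency_ratio)


fully_stalled = mpo_speedup(1.0)   # approaches the latency ratio
half_stalled = mpo_speedup(0.5)
no_stalls = mpo_speedup(0.0)       # nothing to gain
```

Under these assumptions, an application that is entirely stall-bound gains the full 1.5X, while one that stalls half the time gains about 1.2X.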

Thread Scheduling

In my posting on the UltraSPARC T2 processor, I described how Solaris threads are spread across cores and pipelines to balance the load and maximize hardware resource usage. Since the T2 Plus is identical to the T2 in this area, these scheduling heuristics continue to be used for the T5140, but are augmented by scheduling at the lgroup level. Thus, independent software threads are first spread across processors, then across cores within a processor, then across pipelines within a core.
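The spreading order can be sketched as a least-loaded choice at each level of the hierarchy. This is an illustrative model only, assuming a T5140-like layout of 2 processors, 8 cores per processor, and 2 pipelines per core; the function place_thread and the nested-list load table are hypothetical and do not reflect the dispatcher's actual data structures.

```python
def place_thread(load):
    """Place one thread on the least-loaded processor, core, then pipeline.

    load[processor][core][pipeline] holds the number of running threads.
    Ties are broken in favor of the lowest index.
    """
    s = min(range(len(load)), key=lambda i: sum(map(sum, load[i])))
    c = min(range(len(load[s])), key=lambda i: sum(load[s][i]))
    p = min(range(len(load[s][c])), key=lambda i: load[s][c][i])
    load[s][c][p] += 1
    return (s, c, p)


# 2 processors x 8 cores x 2 pipelines, all idle at the start.
load = [[[0, 0] for _ in range(8)] for _ in range(2)]
placements = [place_thread(load) for _ in range(4)]
```

The first four threads land on alternating processors and on distinct cores; a second pipeline in any core is used only once every core in the system already has a thread.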

The Kernel Adaptive Mutex

The mutex is the basic locking primitive in Solaris. We have optimized the mutex for large CMT systems in several ways.

The implementation of the mutex in the kernel is adaptive, in that a waiter will busy-wait if the software thread that owns the mutex is running, on the supposition that the owner will release it soon. The waiter will yield the CPU and sleep if the owner is not running. To determine if a mutex owner is running, the code previously traversed all CPUs looking for the owner thread, as opposed to simply examining the owner thread's state, to avoid a race with threads being freed. This O(NCPU) algorithm was costly on large systems, and we replaced it with a constant time algorithm that is safe with respect to threads being freed.
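The adaptive decision itself is simple, and can be sketched as follows. This is a toy model with hypothetical names (Thread, Mutex, waiter_action); it shows only the constant-time check of the owner's state, and does not model how the kernel makes that check safe against threads being freed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thread:
    name: str
    on_cpu: bool          # True while the thread is running on some CPU


@dataclass
class Mutex:
    owner: Optional[Thread] = None


def waiter_action(mutex):
    """Decide what a waiter does, in constant time, from the owner's state."""
    if mutex.owner is None:
        return "acquire"  # lock is free
    if mutex.owner.on_cpu:
        return "spin"     # owner is running and will likely release soon
    return "sleep"        # owner is off-CPU, so spinning would waste cycles
```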

Waiters attempt to acquire a mutex using a compare-and-swap (cas) operation. If many waiters continuously attempt cas on the same mutex, then a queue of requests builds in the memory subsystem, and the latency of each cas becomes proportional to the number of requests. This dramatically reduces the rate at which the mutex can be acquired and released, and causes negative scaling for the higher level code which is using the mutex. The fix is to space out the cas requests over time, such that a queue never builds up, by forcing the waiters to busy-wait for a fixed period after a cas failure. The period increases exponentially after repeated failures, up to a maximum which is proportional to the number of CPUs, which is the upper bound on the number of actively waiting threads. Further, in the busy-wait loop, we use long-latency, low-impact operations, so the busy CPU consumes very little of the execution pipeline, leaving more cycles available to other strands sharing the pipeline.
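The shape of the backoff schedule can be sketched in a few lines. The base delay, the per-CPU multiplier, and the name backoff_delays are assumptions for illustration; the actual kernel parameters are not given here, and the delays below are in abstract units rather than cycles.

```python
def backoff_delays(failures, ncpus, base=1, per_cpu_cap=4):
    """Delay to busy-wait after each successive cas failure.

    The delay doubles after every failure and is capped at a maximum
    proportional to the number of CPUs (per_cpu_cap * ncpus).
    """
    cap = per_cpu_cap * ncpus
    return [min(base << n, cap) for n in range(failures)]


# 128 CPUs: the delay doubles until it reaches the cap of 512 units.
delays_128 = backoff_delays(12, 128)
```

Because the cap grows with the CPU count, a larger system lets waiters back off further, which is what allows the same algorithm to scale to newer, larger machines without retuning.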

To be clear, any application which causes many waiters to desire the same mutex has an inherent scalability bottleneck, and ultimately needs to be restructured for optimal scaling on large servers. However, the mutex optimizations above allow such apps to scale to perhaps 2X or 3X as many CPUs as they otherwise would, and to degrade gracefully under load rather than tip over into negative scaling.

Availability

All of the enhancements described herein are available in OpenSolaris, and will be available soon in updates to Solaris 10. The MPO and scheduling enhancements for the T5140 will be available in Solaris 10 4/08, and the clock and mutex enhancements will be released soon after in a KU patch.

Tuesday Oct 09, 2007

The UltraSPARC T2 Processor and the Solaris Operating System

The UltraSPARC T2 processor used in the Sun SPARC Enterprise T5x20 server family implements novel features for achieving high performance, which require equally novel support in the Solaris Operating System. A few areas I highlight here are:

  • Core Pipeline and Thread Scheduling
  • Cache Associativity
  • Virtual Memory
  • Block Copy
  • Performance Counters

Unless otherwise noted, the Solaris enhancements I describe are available in the OpenSolaris repository, and in the Solaris 10 8/07 release that is pre-installed on the T5120 and T5220 servers. No special tuning or programming is required for applications to benefit from these enhancements; they are applied automatically by the operating system.

To simplify the technical explanations, I do not distinguish between Physical Memory and Real Memory, but use Physical Memory everywhere. If you want to learn about the distinction, study the OpenSPARC hyper-privileged architecture.

Core Pipeline and Thread Scheduling

The Solaris thread scheduler spreads running threads across as many hardware resources as possible, rather than packing them onto as few resources as possible (consider cores as the resource, for example). By utilizing more resources, this heuristic yields maximum performance at lower thread counts. Do not confuse the kernel-level software thread scheduling I describe here with hardware strand scheduling, which is done on a cycle-by-cycle granularity.

The hierarchy of shared resources is deeper on the T2 processor than it is on the T1 processor, and the scheduler heuristic was extended accordingly. On the T2 processor, 4 strands share an integer pipeline, 2 such pipelines plus other goodies comprise a core, and 8 cores fit on the die. The scheduler spreads threads first across cores, 1 thread per core until every core has one, then 2 threads per core until every core has two, and so on. Within each core, the kernel scheduler balances the threads across the core's 2 integer pipelines.

Hardware resources and their relationships are described in the Processor Group data structure, which is initialized in platform specific code, but accessed in common dispatcher code that implements the hierarchical load balancing heuristic.

Cache Associativity

The T2 L2 cache is 4MB, 16-way associative, 64B line size, and shared by 64 hardware strands. If more than 16 threads frequently access data that maps to the same index in the cache, then they will evict each other's data, and increase the cache miss rate. This is known as a conflict miss, and is more likely in multi-process workloads with the same executable image (the same binary).

The Solaris implementation minimizes conflict misses by applying a technique known as page coloring. In general, a page is mapped into the cache by using higher order physical address bits as an index into the cache way. The T2 L2 way size is 256KB (4MB divided by 16 ways), so an 8KB page will fit into the 256KB way at 32 different locations, known as colors. The Solaris VM system organizes free physical memory based on color. When mapping virtual addresses to physical pages, it chooses the pages so that all colors in the cache are covered, and to minimize the probability that the same virtual address in different processes maps to the same color (and hence cache index).
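The color arithmetic above can be checked with a short sketch. It assumes simple linear indexing (no hashing) and 8KB pages; the names page_color and build_free_lists are hypothetical and are not the Solaris VM interfaces.

```python
from collections import defaultdict, deque

PAGE_SIZE = 8 * 1024                  # 8KB base page
L2_SIZE = 4 * 1024 * 1024             # 4MB L2 cache
L2_WAYS = 16
WAY_SIZE = L2_SIZE // L2_WAYS         # 256KB per way
NUM_COLORS = WAY_SIZE // PAGE_SIZE    # 32 colors


def page_color(pa):
    """Color of the 8KB page that contains physical address pa."""
    return (pa // PAGE_SIZE) % NUM_COLORS


def build_free_lists(free_pages):
    """Organize free physical pages into one queue per color."""
    lists = defaultdict(deque)
    for pa in free_pages:
        lists[page_color(pa)].append(pa)
    return lists


free_lists = build_free_lists(range(0, 64 * PAGE_SIZE, PAGE_SIZE))
```

With free memory organized this way, an allocator can ask for a page of a specific color and so control where the page lands in the cache.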

However, page coloring is not applicable for a large page whose size exceeds the way size, such as 4MB and 256MB pages, because such a page has only one color. To remedy this problem, the T2 processor implements hashed cache indexing, in which higher order physical address bits are XOR'd together to yield the index into the cache. Specifically:

index = PA[32:28] ^ PA[17:13] . PA[19:18] ^ PA[12:11] . PA[10:6]

The effect is that the mapping of a page's contents onto the cache is non-linear, and many permutations are possible, governed by address bits that are larger than the page size. This hardware feature alone would reduce cache conflicts between multiple processes using large pages. However, we also modified Solaris VM to be aware of the hardware hash calculation, and allocate physical pages to maintain an even distribution across permutations, which further minimizes conflicts.
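The hash is easy to experiment with in software. The sketch below assumes that "." in the formula means bit-field concatenation with the most significant field first, which yields a 12-bit index (4096 sets, matching 4MB / 16 ways / 64B lines). The helper names bits and l2_hashed_index are hypothetical.

```python
def bits(value, hi, lo):
    """Extract bits hi..lo (inclusive) of value as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def l2_hashed_index(pa):
    """12-bit L2 index computed from a physical address using the XOR hash."""
    upper = bits(pa, 32, 28) ^ bits(pa, 17, 13)    # 5 bits
    middle = bits(pa, 19, 18) ^ bits(pa, 12, 11)   # 2 bits
    lower = bits(pa, 10, 6)                        # 5 bits, not hashed
    return (upper << 7) | (middle << 5) | lower


# Offset 0 of two large pages that are 256MB apart.
index_a = l2_hashed_index(0)
index_b = l2_hashed_index(1 << 28)
```

With linear indexing, both addresses would map to index 0, because their low-order bits are identical. With the hash, the second address maps to index 128, so the two pages are laid out differently in the cache.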

The L1 data cache is 8KB, 4-way associative, 16B line size, and shared by 8 hardware strands. The way size is thus 2KB, which is smaller than the base page size of 8KB, so page coloring techniques cannot be applied to reduce potential conflicts. Instead, Solaris biases the start of each process stack by adding a small pseudo-random amount to the stack pointer at process creation time. Thus, the same stack variable in different processes will have different addresses, which will map to different indices in the L1 data cache, avoiding conflict evictions if that variable is heavily accessed by many processes.
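The effect of the stack bias can be seen in a small model of the L1 data cache index. The cache geometry comes from the text above; the bias range of 2KB, the 16-byte alignment, the stack base address, and the names l1d_set and biased_stack_top are assumptions for illustration, not the values Solaris uses.

```python
import random

L1D_LINE = 16                               # bytes per line
L1D_SETS = 8 * 1024 // (4 * L1D_LINE)       # 8KB, 4-way: 128 sets


def l1d_set(addr):
    """L1 data cache set that an address maps to."""
    return (addr // L1D_LINE) % L1D_SETS


def biased_stack_top(base, rng, max_bias=2048, align=16):
    """Lower the initial stack pointer by a pseudo-random, aligned amount."""
    return base - rng.randrange(0, max_bias, align)


rng = random.Random(0)                      # fixed seed for repeatability
base = 0x7FFFF000                           # hypothetical stack base
unbiased = {l1d_set(base) for _ in range(64)}
biased = {l1d_set(biased_stack_top(base, rng)) for _ in range(64)}
```

Without the bias, the same stack variable in all 64 processes maps to a single set, which a 4-way cache cannot hold. With the bias, the addresses spread over many sets.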

The L1 instruction cache is 16KB, 8-way associative, 32B line size, and shared by 8 hardware strands. This level of associativity is adequate to avoid excessive conflicts and evictions between strands.

Virtual Memory

The T2 processor has a number of interesting features that accelerate the translation of virtual to physical addresses. It supports the same page sizes that are available on the T1 processor: 8KB, 64KB, 4MB, and 256MB. However, because the hashed cache indexing lowers the L2 conflict miss rate for large pages, the Solaris kernel on servers with the T2 processor automatically allocates 4MB pages for process heap, stack, and anonymous memory. The default size for these segments on T1 processors is 64KB. Larger pages are better because they reduce the TLB miss rate.

Further, the T2 processor also supports Hardware Tablewalk (HWTW) on TLB misses, and is the first SPARC processor to do so. The TLB is a hardware cache of VA to PA translations. Each T2 core has a 128-entry Data TLB, and a 64-entry Instruction TLB, both fully associative. When a translation is not found in the TLB, the TSB is probed. The TSB is a much larger cache that is created by the kernel, lives in main memory, and is directly indexed by virtual address. In other SPARC implementations, a TLB miss causes a trap to the kernel, and the trap handler probes the TSB. With HWTW, the processor issues a load and probes the TSB directly, avoiding the overhead of a software trap handler, which saves valuable cycles per TLB miss. Each T2 hardware strand can be programmed to probe up to four separate TSB regions. The Solaris kernel programs the regions at kernel thread context switch time. (Again, do not confuse this with hardware strand scheduling.)
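The lookup order (TLB first, then a directly indexed TSB) can be modeled in a few lines. This is a toy sketch with hypothetical names (TSB, translate); it assumes 8KB pages, a single TSB region, and a TLB modeled as an unbounded dictionary, and it ignores context identifiers and the real TSB entry format.

```python
PAGE_SHIFT = 13                          # 8KB base page
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class TSB:
    """Direct-mapped translation cache indexed by virtual page number."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.slots = [None] * num_entries

    def _index(self, va):
        return (va >> PAGE_SHIFT) % self.num_entries

    def insert(self, va, pa):
        # Store (virtual page number as tag, physical page number).
        self.slots[self._index(va)] = (va >> PAGE_SHIFT, pa >> PAGE_SHIFT)

    def probe(self, va):
        slot = self.slots[self._index(va)]
        if slot is not None and slot[0] == va >> PAGE_SHIFT:
            return (slot[1] << PAGE_SHIFT) | (va & PAGE_MASK)
        return None                      # tag mismatch or empty slot


def translate(va, tlb, tsb):
    """Return (where the translation was found, physical address)."""
    vpn = va >> PAGE_SHIFT
    if vpn in tlb:
        return "tlb", (tlb[vpn] << PAGE_SHIFT) | (va & PAGE_MASK)
    pa = tsb.probe(va)                   # the step that HWTW does in hardware
    if pa is not None:
        tlb[vpn] = pa >> PAGE_SHIFT      # refill the TLB
        return "tsb", pa
    return "miss", None                  # would fall back to the kernel
```

In this model, the first access to a page is satisfied by the TSB probe, and later accesses hit in the TLB. A virtual address that collides with an existing TSB slot but has a different tag is reported as a miss.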

Lastly, processes can share memory more efficiently on the T2 processor using a new feature called the shared context. In previous SPARC implementations, even when processes share physical memory, they still have private translations from process virtual addresses to shared physical addresses, so the processes compete for space in the TLB. Using the shared context feature, processes can use each other's translations that are cached in the TLB, as long as the shared memory is mapped at the same virtual address in each process. This is done safely: the Solaris VM system manages private and shared context identifiers, assigns them to processes and process sharing groups, and programs hardware context registers at thread context switch time. The hardware allows sharing only amongst processes that have the same shared context identifier. In addition, the Solaris VM system arranges that shared translations are backed by a shared TSB, which is accessed via HWTW, further boosting efficiency. Processes that map the same ISM/DISM segments and have the same executable image share translations in this manner, for both the shared memory and for the main text segment.

In total, these VM enhancements provide a nice boost in performance to workloads with a high TLB miss rate, such as OLTP.

Block Copy

Routines that initialize or copy large blocks of memory are heavily used by applications and by the Solaris kernel, so these routines are heavily optimized, often using processor-specific assembly code. On most SPARC implementations, large block move is done using 64-byte loads and stores as supported by the VIS instruction set. These instructions store intermediate results in banks of floating point registers, and so are not the best choice on the T1 processor, in which all cores share a single floating point unit. However, the T2 processor has one FP unit per core, so VIS is once again the best choice.

The Solaris OS selects the optimal block copy routines at boot time based on the processor type. This applies to the memcpy family of functions in libc, to the kernel bcopy routine, and to routines which copy data in and out of the kernel during system calls.

The new block copy implementation for the T2 is also tuned to handle special cases for various combinations of source alignment, destination alignment, and length. The result is that the new routines are approximately 2X faster than the previous routines. The changes are available now in the OpenSolaris repository, and will be available later in a patch following Solaris 10 8/07.

If you enjoy reading SPARC assembly code (and who does not!), see the file niagara_copy.s.

Performance Counters

The T2 processor offers a rich set of performance counters for counting hardware events such as cache misses, TLB misses, crypto operations, FP operations, loads, stores, etc. These are accessed using the cpustat command. cpustat -h shows the list of available counters, which are documented in the "Performance Instrumentation" chapter of the OpenSPARC T2 Supplement to the UltraSPARC Architecture 2007.

In addition, the T2 processor implements a "trap on counter overflow" mechanism which is compatible with the mechanism on other UltraSPARC processors (unlike the T1 processor), which means that you can use Sun Studio Performance Analyzer to profile hardware events and map them to your application.

The corestat utility in the Cool Tools suite has been updated to use the new counters in its per-core utilization calculations. It is bundled with the Cool Tools software and is pre-installed on T2-based servers. For more on corestat, see Ravi Talashikar's blog.

Other

The list above is not exhaustive. There are other cool performance features in the T2 processor, such as per-core crypto acceleration, and on-chip 10 GbE networking. For more information on the T5120/T5220 servers and their new processor, see Allan Packer's index.

About

Steve Sistare
