The UltraSPARC T2 processor used in the Sun SPARC Enterprise T5x20
server family implements novel features for achieving high performance,
which require equally novel support in the Solaris Operating System.
A few areas I highlight here are:
- Core Pipeline and Thread Scheduling
- Cache Associativity
- Virtual Memory
- Block Copy
- Performance Counters
Unless otherwise noted, the Solaris enhancements I describe
are available in the OpenSolaris repository, and in the Solaris
10 8/07 release that is pre-installed on the T5120 and T5220 servers.
No special tuning or programming is required for applications to
benefit from these enhancements; they are applied automatically by
the operating system.
To simplify the technical explanations, I do not distinguish between
Physical Memory and Real Memory, but use Physical Memory everywhere.
If you want to learn about the distinction, study the
OpenSPARC hyper-privileged architecture.
Core Pipeline and Thread Scheduling
The Solaris thread scheduler spreads running threads across as many
hardware resources as possible, rather than packing them onto as
few resources as possible (consider cores as the resource, for example).
By utilizing more resources, this heuristic yields maximum performance
at lower thread counts. Do not confuse the kernel-level software
thread scheduling I describe here with hardware strand scheduling,
which is done on a cycle-by-cycle granularity.
The hierarchy of shared resources is deeper on the T2 processor than
it is on the T1 processor, and the scheduler heuristic was extended
accordingly. On the T2 processor, 4 strands share an integer
pipeline, 2 such pipelines plus other goodies comprise a core, and 8
cores fit on the die. The scheduler spreads threads first across
cores, 1 thread per core until every core has one, then 2 threads per
core until every core has two, and so on. Within each core, the
kernel scheduler balances the threads across the core's 2 integer
pipelines.
Hardware resources and their relationships are described in the
Processor Group data structure, which is initialized in platform
specific code, but accessed in common dispatcher code that implements
the hierarchical load balancing heuristic.
Cache Associativity
The T2 L2 cache is 4MB, 16-way associative, 64B line size, and shared
by 64 hardware strands. If more than 16 threads frequently access
data that maps to the same index in the cache, then they will evict
each other's data, and increase the cache miss rate. This is known
as a conflict miss, and is more likely in multi-process workloads
with the same executable image (the same binary).
The Solaris implementation minimizes conflict misses by applying a
technique known as page coloring. In general, a page is mapped into
the cache by using higher order physical address bits as an index
into the cache way. The T2 L2 way size is 256KB (4MB divided by 16
ways), so an 8KB page will fit into the 256KB way at 32 different
locations, known as colors. The Solaris VM system organizes free
physical memory based on color. When mapping virtual addresses to
physical pages, it chooses the pages so that all colors in the cache
are covered, and to minimize the probability that the same virtual
address in different processes maps to the same color (and hence
the same cache index).
However, page coloring is not applicable for a large page whose size
exceeds the way size, such as 4MB and 256MB pages, because such a page
has only one color. To remedy this problem, the T2 processor implements
hashed cache indexing, in which higher order physical address bits are
XOR'd together to yield the index into the cache. Specifically:
index = (PA[32:28] ^ PA[17:13]) . (PA[19:18] ^ PA[12:11]) . PA[10:6]
where ^ denotes XOR and . denotes bit concatenation.
The effect is that the mapping of a page's contents onto the cache is
non-linear, and many permutations are possible, governed by address
bits that are larger than the page size. This hardware feature alone
would reduce cache conflicts between multiple processes using large
pages. However, we also modified Solaris VM to be aware of the
hardware hash calculation, and allocate physical pages to maintain an
even distribution across permutations, which further minimizes
conflict misses.
The L1 data cache is 8KB, 4-way associative, 16B line size, and
shared by 8 hardware strands. The way size is thus 2KB, which is
smaller than the base page size of 8KB, so page coloring techniques
cannot be applied to reduce potential conflicts. Instead, Solaris
biases the start of each process stack by adding a small
pseudo-random amount to the stack pointer at process creation time.
Thus, the same stack variable in different processes will have
different addresses, which will map to different indices in the L1
data cache, avoiding conflict evictions if that variable is heavily
accessed by many processes.
The L1 instruction cache is 16KB, 8-way associative, 32B line size,
and shared by 8 hardware strands. This level of associativity is
adequate to avoid excessive conflicts and evictions between strands.
Virtual Memory
The T2 processor has a number of interesting features that accelerate
the translation of virtual to physical addresses. It supports the
same page sizes that are available on the T1 processor: 8KB,
64KB, 4MB, and 256MB. However, because the hashed cache indexing
lowers the L2 conflict miss rate for large pages, the Solaris kernel
on servers with the T2 processor automatically allocates 4MB pages
for process heap, stack, and anonymous memory. The default size for
these segments on T1 processors is 64KB. Larger pages are better
because they reduce the TLB miss rate.
Further, the T2 processor supports Hardware Tablewalk (HWTW) on
TLB misses, and is the first SPARC processor to do so. The TLB is a
hardware cache of VA to PA translations. Each T2 core has a
128-entry Data TLB, and a 64-entry Instruction TLB, both fully
associative. When a translation is not found in the TLB, the TSB is
probed. The TSB is a much larger cache that is created by the
kernel, lives in main memory, and is directly indexed by virtual
address. In other SPARC implementations, a TLB miss causes a trap to
the kernel, and the trap handler probes the TSB. With HWTW, the
processor issues a load and probes the TSB directly, avoiding the
overhead of a software trap handler, which saves valuable cycles per
TLB miss. Each T2 hardware strand can be programmed to probe up to
four separate TSB regions. The Solaris kernel programs the regions at
kernel thread context switch time. (Again, do not confuse this with
hardware strand scheduling).
Lastly, processes can share memory more efficiently on the T2
processor using a new feature called the shared context. In previous
SPARC implementations, even when processes share physical memory,
they still have private translations from process virtual addresses
to shared physical addresses, so the processes compete for space in
the TLB. Using the shared context feature, processes can use each
other's translations that are cached in the TLB, as long as the
shared memory is mapped at the same virtual address in each process.
This is done safely - the Solaris VM system manages private and shared
context identifiers, assigns them to processes and process sharing
groups, and programs hardware context registers at thread context
switch time. The hardware allows sharing only amongst processes that
have the same shared context identifier. In addition, the Solaris VM
system arranges that shared translations are backed by a shared TSB, which
is accessed via HWTW, further boosting efficiency. Processes that map
the same ISM/DISM segments and have the same executable image share
translations in this manner, for both the shared memory and for the
main text segment.
In total, these VM enhancements provide a nice boost in performance to
workloads with a high TLB miss rate, such as OLTP.
Block Copy
Routines that initialize or copy large blocks of memory are heavily
used by applications and by the Solaris kernel, so these routines
are heavily optimized, often using processor-specific assembly code.
On most SPARC implementations, large block move is done using
64-byte loads and stores as supported by the VIS instruction set.
These instructions store intermediate results in banks of floating point
registers, and so are not the best choice on the T1 processor, in which
all cores share a single floating point unit. However, the T2 processor
has one FP unit per core, so VIS is once again the best choice.
The Solaris OS selects the optimal block copy routines at boot
time based on the processor type. This applies to the memcpy family
of functions in libc, to the kernel bcopy routine, and to routines
which copy data in and out of the kernel during system calls.
The new block copy implementation for the T2 is also tuned to handle
special cases for various combinations of source alignment, destination
alignment, and length. The result is that the new routines are
approximately 2X faster than the previous routines. The changes are
available now in the OpenSolaris repository, and will be available
later in a patch following Solaris 10 8/07.
If you enjoy reading SPARC assembly code (and who does not!), see the file
Performance Counters
The T2 processor offers a rich set of performance counters for
counting hardware events such as cache misses, TLB misses, crypto
operations, FP operations, loads, stores, etc. These are accessed
using the cpustat command. cpustat -h shows the list of available
counters, which are documented in the "Performance Instrumentation"
chapter of the
OpenSPARC T2 Supplement to the UltraSPARC Architecture 2007.
In addition, the T2 processor implements a "trap on counter overflow"
mechanism that is compatible with the mechanism on other UltraSPARC
processors (unlike the T1 processor). This means that you can use the
Sun Studio Performance Analyzer to profile hardware events and
map them to your application.
The corestat utility in the Cool Tools suite has been updated to use
the new counters in its per-core utilization calculations. It is
bundled with the Cool Tools software and is pre-installed on
T2-based servers. For more on corestat, see
Ravi Talashikar's blog.
The list above is not exhaustive. There are other cool performance
features in the T2 processor, such as per-core crypto acceleration,
and on-chip 10 GbE networking. For more information on the
T5120/T5220 servers and their new processor, see
Allan Packer's index.