By Steve Sistare on Oct 09, 2007
The UltraSPARC T2 processor used in the Sun SPARC Enterprise T5x20 server family implements novel features for achieving high performance, which require equally novel support in the Solaris Operating System. A few areas I highlight here are:
- Core Pipeline and Thread Scheduling
- Cache Associativity
- Virtual Memory
- Block Copy
- Performance Counters
Unless otherwise noted, the Solaris enhancements I describe are available in the OpenSolaris repository, and in the Solaris 10 8/07 release that is pre-installed on the T5120 and T5220 servers. No special tuning or programming is required for applications to benefit from these enhancements; they are applied automatically by the operating system.
To simplify the technical explanations, I do not distinguish between Physical Memory and Real Memory, but use Physical Memory everywhere. If you want to learn about the distinction, study the OpenSPARC hyper-privileged architecture.
Core Pipeline and Thread Scheduling
The Solaris thread scheduler spreads running threads across as many hardware resources as possible, rather than packing them onto as few resources as possible (consider cores as the resource, for example). By utilizing more resources, this heuristic yields maximum performance at lower thread counts. Do not confuse the kernel-level software thread scheduling I describe here with hardware strand scheduling, which is done on a cycle-by-cycle granularity.
The hierarchy of shared resources is deeper on the T2 processor than it is on the T1 processor, and the scheduler heuristic was extended accordingly. On the T2 processor, 4 strands share an integer pipeline, 2 such pipelines plus other goodies comprise a core, and 8 cores fit on the die. The scheduler spreads threads first across cores, 1 thread per core until every core has one, then 2 threads per core until every core has two, and so on. Within each core, the kernel scheduler balances the threads across the core's 2 integer pipelines.
Hardware resources and their relationships are described in the Processor Group data structure, which is initialized in platform specific code, but accessed in common dispatcher code that implements the hierarchical load balancing heuristic.
The T2 L2 cache is 4MB, 16-way associative, 64B line size, and shared by 64 hardware strands. If more than 16 threads frequently access data that maps to the same index in the cache, then they will evict each other's data, and increase the cache miss rate. This is known as a conflict miss, and is more likely in multi-process workloads with the same executable image (the same binary).
The Solaris implementation minimizes conflict misses by applying a technique known as page coloring. In general, a page is mapped into the cache by using higher order physical address bits as an index into the cache way. The T2 L2 way size is 256KB (4MB divided by 16 ways), so an 8KB page will fit into the 256KB way at 32 different locations, known as colors. The Solaris VM system organizes free physical memory based on color. When mapping virtual addresses to physical pages, it chooses the pages so that all colors in the cache are covered, and to minimize the probability that the same virtual address in different processes maps to the same color (and hence cache index).
However, page coloring is not applicable for a large page whose size exceeds the way size, such as 4MB and 256MB pages, because such a page has only one color. To remedy this problem, the T2 processor implements hashed cache indexing, in which higher order physical address bits are XOR'd together to yield the index into the cache. Specifically:
index = PA[32:28] \^ PA[17:13] . PA[19:18] \^ PA[12:11] . PA[10:6]
The effect is that the mapping of a page's contents onto the cache is non-linear, and many permutations are possible, governed by address bits that are larger than the page size. This hardware feature alone would reduce cache conflicts between multiple processes using large pages. However, we also modified Solaris VM to be aware of the hardware hash calculation, and allocate physical pages to maintain an even distribution across permutations, which further minimizes conflicts.
The L1 data cache is 8KB, 4-way associative, 16B line size, and shared by 8 hardware strands. The way size is thus 2KB, which is smaller than the base page size of 8KB, so page coloring techniques cannot be applied to reduce potential conflicts. Instead, Solaris biases the start of each process stack by adding a small pseudo-random amount to the stack pointer at process creation time. Thus, the same stack variable in different processes will have different addresses, which will map to different indices in the L1 data cache, avoiding conflict evictions if that variable is heavily accessed by many processes.
The L1 instruction cache is 16KB, 8-way associative, 32B line size, and shared by 8 hardware strands. This level of associativity is adequate to avoid excessive conflicts and evictions between strands.
The T2 processor has a number of interesting features that accelerate the translation of virtual to physical addresses. It supports the the same page sizes that are available on the T1 processor: 8KB, 64KB, 4MB, and 256MB. However, because the hashed cache indexing lowers the L2 conflict miss rate for large pages, the Solaris kernel on servers with the T2 processor automatically allocates 4MB pages for process heap, stack, and anonymous memory. The default size for these segments on T1 processors is 64KB. Larger pages are better because they reduce the TLB miss rate.
Further, the T2 processor also supports Hardware Tablewalk (HWTW) on TLB misses, and is the first SPARC processor to do so. The TLB is a hardware cache of VA to PA translations. Each T2 core has an 128-entry Data TLB, and a 64-entry Instruction TLB, both fully associative. When a translation is not found in the TLB, the TSB is probed. The TSB is a much larger cache that is created by the kernel, lives in main memory, and is directly indexed by virtual address. In other SPARC implementations, a TLB miss causes a trap to the kernel, and the trap handler probes the TSB. With HWTW, the processor issues a load and probes the TSB directly, avoiding the overhead of a software trap handler, which saves valuable cycles per TLB miss. Each T2 hardware strand can be programmed to probe up to four separate TSB regions. The Solaris kernel programs the regions at kernel thread context switch time. (Again, do not confuse this with hardware strand scheduling).
Lastly, processes can share memory more efficiently on the T2
processor using a new feature called the shared context. In previous
SPARC implementations, even when processes share physical memory,
they still have private translations from process virtual addresses
to shared physical addresses, so the processes compete for space in
the TLB. Using the shared context feature, processes can use each
other's translations that are cached in the TLB, as long as the
shared memory is mapped at the same virtual address in each process.
This is done safely - the Solaris VM system manages private and shared
context identifiers, assigns them to processes and process sharing
groups, and programs hardware context registers at thread context
switch time. The hardware allows sharing only amongst processes that
have the same shared context identifier. In addition, the Solaris VM
system arranges that shared translations are backed by a shared TSB, which
is accessed via HWTW, further boosting efficiency. Processes that map the same ISM/DISM segments and have the same executable image share translations in this manner, for both the shared memory and for the main text segment.
In total, these VM enhancements provide a nice boost in performance to workloads with a high TLB miss rate, such as OLTP.
Routines that initialize or copy large blocks of memory are heavily used by applications and by the Solaris kernel, so these routines are heavily optimized, often using processor-specific assembly code. On most SPARC implementations, large block move is done using 64-byte loads and stores as supported by the VIS instruction set. These instructions store intermediate results in banks of floating point registers, and so are not the best choice on the T1 processor, in which all cores share a single floating point unit. However, the T2 processor has one FP unit per core, so VIS is once again the best choice.
The Solaris OS selects the optimal block copy routines at boot time based on the processor type. This applies to the memcpy family of functions in libc, to the kernel bcopy routine, and to routines which copy data in and out of the kernel during system calls.
The new block copy implementation for the T2 is also tuned to handle special cases for various combinations of source alignment, destination alignment, and length. The result is that the new routines are approximately 2X faster than the previous routines. The changes are available now in the OpenSolaris repository, and will be available later in a patch following Solaris 10 8/07.
If you enjoy reading SPARC assembly code (and who does not !), see the file niagara_copy.s
The T2 processor offers a rich set of performance counters for counting hardware events such as cache misses, TLB misses, crypto operations, FP operations, loads, stores, etc. These are accessed using the cpustat command. cpustat -h shows the list of available counters, which are documented in the "Performance Instrumentation" chapter of the OpenSPARC T2 Supplement to the UltraSPARC Architecture 2007.
In addition, the T2 processor implements a "trap on counter overflow" mechanism which is compatible with the mechanism on other UltraSPARC processors (unlike the T1 processor), which means that you can use Sun Studio Performance Analyzer to profile hardware events and map them to your application.
The corestat utility in the Cool Tools suite has been updated to use the new counters in its per-core utilization calculations. It is bundled with the Cool Tools software and is pre-installed on T2-based servers. For more on corestat, see Ravi Talashikar's blog.
The list above is not exhaustive. There are other cool performance features in the T2 processor, such as per-core crypto acceleration, and on-chip 10 GbE networking. For more information on the T5120/T5220 servers and their new processor, see Allan Packer's index.