Wednesday Apr 10, 2013

Massive Solaris Scalability for the T5-8 and M5-32, Part 3

Today I conclude this series on M5-32 scalability [Part 1, Part 2] with enhancements we made in the Scheduler, Devices, Tools, and Reboot areas of Solaris.

Scheduler

The Solaris thread scheduler is little changed, as the architecture of balancing runnable threads across levels in the processor resource hierarchy, which I described when the T2 processor was introduced, has scaled well. However, we have continued to optimize the clock function of the scheduler. Clock is responsible for quanta expiration, timeout processing, resource accounting for every CPU, and miscellaneous housekeeping functions. Previously, we parallelized quanta expiration and timeout expiration (aka callouts). In Solaris 11, we eliminated the need to acquire the process and thread locks in most cases during quanta expiration and accounting, and we eliminated or reduced the impact of several smallish O(N) calculations that had become significant at 1536 CPUs. The net result is that all functionality associated with clock scales nicely, and CPU 0 does not accumulate noticeable %sys CPU time due to clock processing.

Devices

SPARC systems use an IOMMU to map PCI-E virtual addresses to physical memory. The PCI VA space is a limited resource with high demand. The VA span is only 2GB to maintain binary compatibility with traditional DDI functions, and many drivers pre-map large DMA buffer pools so that mapping is not on the critical path for transport operations. Every CPU can post concurrent DMA requests, thus demand increases with scale. Managing these conflicting demands is a challenge. We reimplemented DVMA allocation using the Solaris kmem_cache and vmem facilities, with object size and quanta chosen to match common DMA transfer sizes. This provides a good balance between contention-free per-CPU caching, and redistribution of free space in the back end magazine and slab layers. We also modified drivers to use DMA pools more efficiently, and we modified the IOMMU code so that 2GB of VA is available per PCI function, rather than per PCI root port.
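As a rough illustration of the structure (a sketch only, with invented names and sizes, not the actual IOMMU code; kmem_cache_create() with a backing vmem arena is a kernel-internal interface), a DVMA arena layered on vmem and kmem caches looks something like this:

    #include <sys/vmem.h>
    #include <sys/kmem.h>

    /*
     * Sketch: carve the 2GB PCI VA span into a vmem arena, then front it
     * with an object cache sized to a common DMA transfer size, so most
     * allocations and frees hit contention-free per-CPU magazines while
     * the vmem back end redistributes free space.  Names, the 8K size,
     * and the init details are illustrative, not the real driver code.
     */
    static vmem_t *dvma_arena;
    static kmem_cache_t *dvma_cache_8k;

    void
    dvma_demo_init(void *va_base, size_t va_span)
    {
        dvma_arena = vmem_create("dvma_demo", va_base, va_span,
            0x2000, NULL, NULL, NULL, 0, VM_SLEEP);

        dvma_cache_8k = kmem_cache_create("dvma_demo_8k", 0x2000,
            0, NULL, NULL, NULL, NULL, dvma_arena, 0);
    }

    void *
    dvma_demo_alloc_8k(void)
    {
        return (kmem_cache_alloc(dvma_cache_8k, KM_SLEEP));
    }

    void
    dvma_demo_free_8k(void *va)
    {
        kmem_cache_free(dvma_cache_8k, va);
    }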

The net result for the end user is higher device throughput and/or lower CPU utilization per unit of throughput on larger systems.

Tools

The very tools we use to analyze scalability may exhibit problems themselves, because they must collect data for all the entities on a system. We noticed that mpstat was consuming so much CPU time on large systems that it could not sample at 1 second intervals and was falling behind. mpstat collects data for all CPUs in every interval, but 1536 CPUs is not a large number to handle in 1 second, so something was amiss. Profiling showed the time was spent searching for per-cpu kstats (see kstat(3KSTAT)), and every lookup searched the entire kc_chain linked list of all kstats. Since the number of kstats grows with NCPU, the overall algorithm takes time O(NCPU^2), which explodes on the larger systems. We modified the kstat library to build a hash table when kstats are opened, and re-implemented kstat_lookup() on that. This reduced CPU consumption by 8X on our "small" 512-CPU test system, and improves the performance of all tools that are based on libkstat, including mpstat, vmstat, iostat, and sar.
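The fix itself lives inside libkstat, but the idea is easy to show. Here is a minimal user-level sketch of the same technique, hashing kc_chain by (module, instance, name) so each lookup touches one bucket instead of the whole chain (the helper names are mine, not the library's):

    #include <sys/types.h>
    #include <kstat.h>
    #include <stdlib.h>
    #include <string.h>

    #define KS_HASH_BUCKETS 4096

    typedef struct ks_ent {
        kstat_t *ke_ksp;
        struct ks_ent *ke_next;
    } ks_ent_t;

    static ks_ent_t *ks_hash[KS_HASH_BUCKETS];

    /* Hash a (module, instance, name) triple to a bucket. */
    static uint_t
    ks_hash_index(const char *module, int instance, const char *name)
    {
        uint_t h = (uint_t)instance;
        const char *p;

        for (p = module; *p != '\0'; p++)
            h = h * 31 + (uint_t)*p;
        for (p = name; *p != '\0'; p++)
            h = h * 31 + (uint_t)*p;
        return (h % KS_HASH_BUCKETS);
    }

    /*
     * Build the hash once after kstat_open(), instead of walking
     * kc_chain linearly on every lookup.
     */
    void
    ks_hash_build(kstat_ctl_t *kc)
    {
        kstat_t *ksp;

        for (ksp = kc->kc_chain; ksp != NULL; ksp = ksp->ks_next) {
            uint_t i = ks_hash_index(ksp->ks_module,
                ksp->ks_instance, ksp->ks_name);
            ks_ent_t *e = malloc(sizeof (*e));

            e->ke_ksp = ksp;
            e->ke_next = ks_hash[i];
            ks_hash[i] = e;
        }
    }

    /* Expected constant-time replacement for a linear search. */
    kstat_t *
    ks_hash_lookup(const char *module, int instance, const char *name)
    {
        ks_ent_t *e;

        for (e = ks_hash[ks_hash_index(module, instance, name)];
            e != NULL; e = e->ke_next) {
            if (e->ke_ksp->ks_instance == instance &&
                strcmp(e->ke_ksp->ks_module, module) == 0 &&
                strcmp(e->ke_ksp->ks_name, name) == 0)
                return (e->ke_ksp);
        }
        return (NULL);
    }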

Even DTrace is not immune. When a script starts, dtrace allocates multi-megabyte trace buffers for every CPU in the domain using a single thread, and frees the buffers on script termination, again using a single thread. On a T3-4 with 512 CPUs, it took 30 seconds to run a null D script. Even worse, the allocation is done while holding the global cpu_lock, which serializes the startup of other D scripts and causes long pauses in the output of some stat commands that briefly take cpu_lock while sampling. We fixed this in Solaris 11.1 by allocating and freeing the trace buffers in parallel using vmtasks, and by hoisting allocation out of the cpu_lock critical path.

Large scale can impact the usability of a tool. Some stat tools produce a row of output per CPU in every sampling interval, making it hard to spot important clues in the torrent of data. In Solaris 11.1, we provide new aggregation and sorting options for the mpstat, cpustat, and trapstat commands that allow the user to make sense of the data. For example, the command

  mpstat -k intr -A 4 -m 10 5
sorts CPUs by the interrupts metric, partitions them into quartiles, and aggregates each quartile into a single row by computing the mean of each column within the quartile. See the man pages for details.

Reboot

Large servers take longer to reboot than small servers. Why? They must initialize more CPUs, memory, and devices, but much of the shutdown and startup code in firmware and the kernel is single threaded. We are addressing that. On shutdown, Solaris now scans memory in parallel to look for dirty pages that must be flushed to disk. The sun4v hypervisor zeros a domain's memory in parallel, using CPUs that are physically closest to memory for maximum bandwidth. On startup, Solaris VM initializes per-page metadata using SPARC cache initializing block stores, which speeds metadata initialization by more than 2X. We also fixed an O(NCPU^2) algorithm in bringing CPUs online, and an O(NCPU) algorithm in reclaiming memory from firmware. In total, we have reduced the reboot time for M5-32 systems by many minutes, and we continue to work on optimizations in this area.

In these few short posts, I have summarized the work of many people over a period of years that has pushed Solaris to new heights of scalability, and I look forward to seeing what our customers will do with their massive T5-8 and M5-32 systems. However, if you have seen the SPARC processor roadmap, you know that our work is not done. Onward and upward!

Friday Apr 05, 2013

Massive Solaris Scalability for the T5-8 and M5-32, Part 2

Last time, I outlined the general issues that must be addressed to achieve operating system scalability. Next I will provide more detail on what we modified in Solaris to reach the M5-32 scalability level. We worked in most of the major areas of Solaris, including Virtual Memory, Resources, Scheduler, Devices, Tools, and Reboot. Today I cover VM and resources.

Virtual Memory

When a page of virtual memory is freed, the virtual to physical address translation must be deleted from the MMU of all CPUs which may have accessed the page. On Solaris, this is implemented by posting a software interrupt known as an xcall to each target CPU. This "TLB shootdown" operation poses one of the thorniest scalability challenges in the VM area, as a single-threaded process may have migrated and run on all the CPUs in a domain, and a multi-threaded process may run threads on all CPUs concurrently. This is a frequent cause of sub-optimal scaling when porting an application from a small to a large server, for a wide variety of systems and vendors.

The T5 and M5 processors provide hardware acceleration for this operation. A single PIO write (an ASI write in SPARC parlance) can demap a VA in all cores of a single socket. Solaris need only send an xcall to one CPU per socket, rather than sending an xcall to every CPU. This achieves a 48X reduction in xcalls on M5-32, and a 128X reduction in xcalls on T5-8, for mappings such as kernel pages that are used on every CPU. For user page mappings, one xcall is sent to each socket on which the process runs. The net result is that the cost of demap operations in dynamic memory workloads is not measurably higher on large T5 and M5 systems than on small ones.
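In pseudocode, the change is from one xcall per CPU to one xcall per socket (a hypothetical sketch; cpu_to_socket(), xcall_one(), and demap_socket_va() are invented for illustration and are not the real HAT layer interfaces):

    #include <sys/types.h>

    /* Illustrative stand-ins, not real kernel interfaces. */
    #define NSOCKETS 32                   /* e.g., M5-32 */
    extern int  ncpus;
    extern int  cpu_to_socket(int cpu);
    extern void xcall_one(int cpu, void (*func)(uintptr_t), uintptr_t arg);
    extern void demap_socket_va(uintptr_t va);

    void
    demap_va_per_socket(uintptr_t va)
    {
        int socket_done[NSOCKETS] = { 0 };
        int cpu;

        for (cpu = 0; cpu < ncpus; cpu++) {
            int s = cpu_to_socket(cpu);

            if (socket_done[s])
                continue;
            socket_done[s] = 1;
            /*
             * One xcall per socket; the handler's ASI write demaps
             * the VA in every core of that socket.  For user mappings,
             * sockets on which the process never ran would be skipped.
             */
            xcall_one(cpu, demap_socket_va, va);
        }
    }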

The VM2 project re-implemented the physical page management layer in Solaris 11.1, and offers several scalability benefits. It manages a large page as a single unit, rather than as a collection of constituent small pages, which reduces the cost of allocating and freeing large pages. It predicts the demand for large pages and proactively defragments physical memory to build more, reducing delays when an application page faults and needs a large page. These enhancements make it practical for Solaris to use a range of large page sizes, in every segment type, which maximizes run-time efficiency of large memory applications. VM2 also allows kernel memory to be allocated near any socket. Previously, kernel memory was confined to a single physically contiguous region, the "kernel cage", which often fit in the memory connected to a single socket and could become a memory hot spot for kernel-intensive workloads. Spreading kernel memory reduces hot spots, and also allows kernel data such as DMA buffers to be allocated near threads or devices for lower latency and higher bandwidth.

The VM system manages certain resources on a per-domain basis, in units of pages. These include swap space, locked memory, and reserved memory, among others. These quantities are adjusted when a page is allocated, freed, locked, or unlocked. Each is represented by a global counter protected by a global lock. The lock hold times are small, but at some CPU count they become bottlenecks. How does one scale a global counter? Using a new data structure I call the Credit Tree, which provides O(K * log(NCPU)) allocation performance with a very small constant K. I will describe it in a future posting. We replaced the VM system's global counters with credit trees in S11.1, and achieved a 45X speedup on an mmap() microbenchmark on T4-4 with 256 CPUs. This is good for the Oracle database, because it uses mmap() and munmap() to dynamically allocate space for its per-process PGA memory.
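To see why this matters for the database, consider a simple microbenchmark of my own construction (illustrative only, not the benchmark behind the 45X figure) in which many threads repeatedly map and unmap anonymous memory, the same pattern a process uses for dynamic PGA allocation; every call adjusts the global swap and reserved-memory counters described above:

    #include <sys/mman.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAP_SIZE   (8 * 1024 * 1024)   /* 8 MB per mapping */
    #define ITERATIONS 10000

    /* Each thread hammers mmap()/munmap() of anonymous memory. */
    static void *
    worker(void *arg)
    {
        for (int i = 0; i < ITERATIONS; i++) {
            void *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANON, -1, 0);
            if (p == MAP_FAILED) {
                perror("mmap");
                exit(1);
            }
            munmap(p, MAP_SIZE);
        }
        return (NULL);
    }

    int
    main(int argc, char **argv)
    {
        int nthreads = (argc > 1) ? atoi(argv[1]) : 64;
        pthread_t *tids = malloc(nthreads * sizeof (pthread_t));

        for (int i = 0; i < nthreads; i++)
            pthread_create(&tids[i], NULL, worker, NULL);
        for (int i = 0; i < nthreads; i++)
            pthread_join(tids[i], NULL);
        return (0);
    }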

The virtual address space is a finite resource that must be partitioned carefully to support large memory systems. 64 bits of VA is sufficient, but we had to adjust the kernel's VA ranges to support a larger heap and more physical memory pages, and adjust process VA ranges to support larger shared memory segments (e.g., for the Oracle SGA).

Lastly, we reduced contention on various locks by increasing lock array sizes and improving the object-to-lock hash functions.

Resource Limits

Solaris limits the number of processes that can be created to prevent metadata such as the process table and the proc_t structures from consuming too much kernel memory. This is enforced by the tunables maxusers, max_nprocs, and pidmax. The default for the latter was 30000, which is too small for M5-32 with 1536 CPUs, allowing only about 20 processes per CPU. As of Solaris 11.1, the defaults for these tunables automatically scale up with CPU count and memory size, to a maximum of 999999 processes. You should rarely if ever need to change these tunables in /etc/system, though that is still allowed.

Similarly, Solaris limits the number of threads that can be created, by limiting the space reserved for kernel thread stacks with the segkpsize tunable, whose default allowed approximately 64K threads. In Solaris 11.1, the default scales with CPU and memory to a maximum of 1.6M threads.

Next time: Scheduler, Devices, Tools, and Reboot.

Tuesday Apr 02, 2013

Massive Solaris Scalability for the T5-8 and M5-32, Part 1

How do you scale a general purpose operating system to handle a single system image with thousands of CPUs and tens of terabytes of memory? You start with the scalable Solaris foundation. You use superior tools such as DTrace to expose issues, quantify them, and extrapolate to the future. You pay careful attention to computer science, data structures, and algorithms when designing fixes. You implement fixes that automatically scale with system size, so that once exposed, an issue never recurs in future systems, and the set of issues you must fix in each larger generation steadily shrinks.

The T5-8 has 8 sockets, each containing 16 cores of 8 hardware strands each, which Solaris sees as 1024 CPUs to manage. The M5-32 has 1536 CPUs and 32 TB of memory. Both are many times larger than the previous generation of Oracle T-class and M-class servers. Solaris scales well on that generation, but every leap in size exposes previously benign O(N) and O(N^2) algorithms that explode into prominence on the larger system, consuming excessive CPU time, memory, and other resources, and limiting scalability. To find these, knowing what to look for helps. Most OS scaling issues can be categorized as CPU issues, memory issues, device issues, or resource shortage issues.

CPU scaling issues include:

  • increased lock contention at higher thread counts
  • O(NCPU) and worse algorithms
Lock contention is addressed using fine-grained locking based on domain decomposition or hashed lock arrays, and the number of locks is automatically scaled with NCPU for a future-proof solution. O(NCPU^2) algorithms are often the result of naive data structures, or of interactions between sub-systems each of which does O(N) work, and once recognized can be recoded easily enough with an adequate supply of caffeine. O(NCPU) algorithms are often the result of a single thread managing resources that grow with machine size, and the solution is to apply parallelism. A good example is the use of vmtasks for shared memory allocation.
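For instance, a hashed lock array sized from NCPU looks roughly like the sketch below (a generic illustration of the technique, not any particular Solaris lock array; max_ncpus is the kernel's CPU count variable, the rest of the names are mine):

    #include <sys/types.h>
    #include <sys/ksynch.h>
    #include <sys/kmem.h>
    #include <sys/cpuvar.h>

    /*
     * Illustrative hashed lock array: one pad-aligned mutex per bucket,
     * with the bucket count scaled from NCPU at boot so the expected
     * contention per lock stays constant as systems grow.  The padding
     * assumes kmutex_t is smaller than a 64-byte cache line.
     */
    typedef struct {
        kmutex_t hl_lock;
        char     hl_pad[64 - sizeof (kmutex_t)];   /* avoid false sharing */
    } hashed_lock_t;

    static hashed_lock_t *hash_locks;
    static size_t nhash_locks;

    void
    hashed_locks_init(void)
    {
        size_t i;

        /* e.g. 4 locks per CPU, rounded up to a power of two */
        nhash_locks = 1;
        while (nhash_locks < 4 * (size_t)max_ncpus)
            nhash_locks <<= 1;

        hash_locks = kmem_zalloc(nhash_locks * sizeof (hashed_lock_t),
            KM_SLEEP);
        for (i = 0; i < nhash_locks; i++)
            mutex_init(&hash_locks[i].hl_lock, NULL, MUTEX_DEFAULT, NULL);
    }

    /* Map an object to its lock by hashing the object's address. */
    kmutex_t *
    object_lock(void *obj)
    {
        size_t i = ((uintptr_t)obj >> 6) & (nhash_locks - 1);

        return (&hash_locks[i].hl_lock);
    }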

Memory scaling issues include:

  • working sets that exceed VA translation caches
  • unmapping translations in all CPUs that access a memory page
  • O(memory) algorithms
  • memory hotspots
Virtual to physical address translations are cached at multiple levels in hardware and software, from TLB through TSB and HME on SPARC. A miss in the smaller lower level caches requires a more costly lookup at the higher level(s). Solaris maximizes the span of each cache and minimizes misses by supporting shared MMU contexts, a range of hardware page sizes up to 2 GB, and the ability to use large pages in every type of memory segment: user, kernel, text, data, private, shared. Solaris uses a novel hardware feature of the T5 and M5 processors to unmap memory on a large number of CPUs efficiently. O(memory) algorithms are fixed using parallelism. Memory hotspots are fixed by avoiding false sharing and spreading data structures across caches and memory controllers.

Device scaling issues include:

  • O(Ndevice) and worse algorithms
  • system bandwidth limitations
  • lock contention in interrupt threads and service threads
The O(N) algorithms tend to be hit during administrative actions such as system boot and hot plug, and are fixed with parallelism and improved data structures. System bandwidth is maximized by spreading devices across PCI roots and system boards, by spreading DMA buffers across memory controllers, and by co-locating DMA buffers with either the producer or consumer of the data. Lock contention is a CPU scaling issue.

Resource shortages occur when too many CPUs compete for a finite set of resources. Sometimes the resource limit is artificial and defined by software, such as for the maximum process and thread count, in which case the fix is to scale the limit automatically with NCPU. Sometimes the limit is imposed by hardware, such as for the number of MMU contexts, and the fix requires more clever resource management in software.

Next time I will provide more details on new Solaris improvements in all of these areas that enable superior performance and scaling on T5 and M5 systems. Stay tuned.

Wednesday Jul 08, 2009

Lies, Damned Lies, and Stack Traces

The kernel stack trace is a critical piece of information for diagnosing kernel bugs, but it can be tricky to interpret due to quirks in the processor architecture and in optimized code. Some of these are well known: tail calls and leaf functions obscure frames, function arguments may live in registers that have been modified since entry, and so on. These quirks can cause you to waste time chasing the wrong problem if you are not careful.

Here is a less well known example to be wary of that is specific to SPARC kernel stacks. Use mdb to examine the panic thread in a kernel crash dump:

    > *panic_thread::findstack
    stack pointer for thread 30014adaf60: 2a10c548671
      000002a10c548721 die+0x98()
      000002a10c548801 trap+0x768()
      000002a10c548961 ktl0+0x64()
      000002a10c548ab1 hat_unload_callback+0x358()
      000002a10c548f21 segvn_unmap+0x2a8()
      000002a10c549021 as_free+0xf4()
      000002a10c5490d1 relvm+0x234()
      000002a10c549181 proc_exit+0x490()
      000002a10c549231 exit+8()
      000002a10c5492e1 syscall_trap+0xac()
    

This says that the thread did something bad at hat_unload_callback+0x358, which caused a trap and panic. But what does panicinfo show?

    > ::panicinfo
                 cpu              195
              thread      30014adaf60
             message BAD TRAP: type=30 rp=2a10c549210 addr=0 mmu_fsr=9
                  pc          1031360
    

The pc symbolizes to this:

    > 1031360::dis -n 0
    hat_unload_callback+0x3f8:      ldx       [%l4 + 0x10], %o3
    

Hmm, that is not the same offset that was shown in the call stack: 3f8 versus 358. Which one should you believe?

panicinfo is correct, and the call stack lies -- it is an artifact of the conventional interpretation of the o7 register in the SPARC architecture, plus a discontinuity caused by the trap. In the standard calling sequence, the pc is saved in the o7 register, the destination address is written to the pc, and the destination executes a save instruction that slides the register window and renames the o registers to i registers. A stack walker interprets the value of i7 in each window as the pc.

However, a SPARC trap uses a different mechanism for saving the pc, and does not modify o7. When the trap handler executes a save instruction, the o7 register contains the pc of the most recent call instruction. This is marginally interesting, but totally unrelated to the pc at which the trap was taken. The stack walker later extracts this value of o7 from the window and shows it as the frame's pc, which is wrong.

This particular stack lie only occurs after a trap, so you can recognize it by the presence of the Solaris trap function ktl0() on the stack. You can find the correct pc in a "struct regs" that the trap handler pushes on the stack at address sp+7ff-a0 (hex constants; the 7ff is the sparcv9 stack bias), where sp is the stack pointer for the frame prior to ktl0(). From the example above, use the sp value to the left of hat_unload_callback:

    > 000002a10c548ab1+7ff-a0::print struct regs r_pc
    r_pc = 0x1031360
    

This works for any thread. If you are examining the panic thread, the ::panicinfo command performs the calculation for you and shows the correct pc.

Thursday Jan 08, 2009

CPU to core mapping

A frequently asked question among users of CMT platforms is "How do I know which CPUs share a core?". For most users, the best answer is, "don't worry about it", because Solaris does a good job of assigning software threads to CPUs and spreading them across cores such that the utilization of hardware resources is maximized. However, knowledge of the mapping is helpful to users who want to explicitly manage the assignment of threads to CPUs and cores, to squeeze out more performance, using techniques such as processor set binding and interrupt fencing.

For some processors and configurations, the core can be computed as a static function of the CPU ID, but this is not a general or easy-to-use solution. Instead, Solaris exposes this in a portable way via the "psrinfo -pv" command, as shown in this example on an M5000 server:

    % psrinfo -pv
    
    The physical processor has 2 cores and 4 virtual processors (0-3)
      The core has 2 virtual processors (0 1)
      The core has 2 virtual processors (2 3)
        SPARC64-VI (portid 1024 impl 0x6 ver 0x90 clock 2150 MHz)
    The physical processor has 2 cores and 4 virtual processors (40-43)
      The core has 2 virtual processors (40 41)
      The core has 2 virtual processors (42 43)
        SPARC64-VI (portid 1064 impl 0x6 ver 0x90 clock 2150 MHz)
    

The numbers in parentheses are the CPU IDs, as known to Solaris and used in commands such as mpstat, psradm, etc. At this time, there are no supported programmatic interfaces to get this information.

Now for the confusing part. Unfortunately, "psrinfo -pv" only prints the core information on systems running OpenSolaris or Solaris Express, because psrinfo was enhanced by this CR:

    6316187 Need interface to determine core sharing by CPUs
which was never backported to a Solaris 10 update. I cannot predict when or whether this will be done. However, on Solaris 10, you can see core groupings using the unstable and less friendly kstat interface. Try this script, which I have named showcores:
    #!/bin/ksh
    kstat cpu_info | \
        egrep "cpu_info |core_id" | \
        awk \
            'BEGIN { printf "%4s %4s", "CPU", "core" } \
             /module/ { printf "\n%4s", $4 } \
             /core_id/ { printf "%4s", $2} \
             END { printf "\n" }'
    

    % showcores
     CPU core
       0   0
       1   0
       2   2
       3   2
      40  40
      41  40
      42  42
      43  42
    

The core_id extracted from the kstats is arbitrary, but CPUs with the same core_id share a physical core. Beware that the name and semantics of kstats such as core_id are unstable interfaces, which means they are not documented, not supported, and are subject to change.
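If you would rather get the mapping from C, the same unstable cpu_info:core_id kstat can be read with libkstat; here is a small example (link with -lkstat; treat the data type of core_id as an assumption on my part):

    #include <kstat.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Print CPU id and core_id pairs, like the showcores script, by
     * walking the cpu_info kstats.  core_id is read as a named kstat
     * value and printed as a long.
     */
    int
    main(void)
    {
        kstat_ctl_t *kc = kstat_open();
        kstat_t *ksp;
        kstat_named_t *kn;

        if (kc == NULL) {
            perror("kstat_open");
            return (1);
        }
        printf("%4s %4s\n", "CPU", "core");
        for (ksp = kc->kc_chain; ksp != NULL; ksp = ksp->ks_next) {
            if (strcmp(ksp->ks_module, "cpu_info") != 0)
                continue;
            if (kstat_read(kc, ksp, NULL) == -1)
                continue;
            kn = (kstat_named_t *)kstat_data_lookup(ksp, "core_id");
            if (kn != NULL)
                printf("%4d %4ld\n", ksp->ks_instance,
                    (long)kn->value.l);
        }
        kstat_close(kc);
        return (0);
    }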

Wednesday Apr 09, 2008

Scaling Solaris on Large CMT Systems

The Solaris Operating System is very effective at managing systems with large numbers of CPUs. Traditionally, these have been SMPs such as the Sun Fire(TM) E25K server, but these days it is CMT systems that are pushing the limits of Solaris scalability. The Sun SPARC(R) Enterprise T5140/T5240 Server, with 128 hardware strands that each behave as an independent CPU, is a good example. We continue to optimize Solaris to handle ever larger CPU counts, and in this posting I discuss a number of recent optimizations that enable Solaris to scale well on the T5140 and other large systems.

The Clock Thread

Clock is a kernel function that by default runs 100 times per second on the lowest numbered CPU in a domain and performs various housekeeping activities. This includes time adjustment, processing pending timeouts, traversing the CPU list to find currently running threads, and performing resource accounting and limit enforcement for the running threads. On a system with more CPUs, the CPU list traversal takes longer, and can exceed 10 ms, in which case clock falls behind, timeout processing is delayed, and the system becomes less responsive. When this happens, the mpstat command will show sys time approaching 100% on CPU 0. This is more likely for memory-intensive workloads on CMT systems with a shared L2$, as the increased L2$ miss rate further slows the clock thread.

We fixed this by multi-threading the clock function. Clock still runs at 100 Hz, but it divides the CPU list into sets, and cross calls a helper CPU to perform resource accounting for each set. The helpers are rotated so that over time the load is finely and evenly distributed over all CPUs; thus, what had been, for example, a 70% load on CPU 0 becomes a less than 1% load on each of 128 CPUs in a T5140 system. CPU 0 will still have a somewhat higher %sys load than the other CPUs, because it is solely responsible for some functions such as timeout processing.
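In outline, the multi-threaded clock looks like the sketch below (simplified, with invented helper names; the real logic lives in the clock and cross-call code):

    /* Illustrative stand-ins for the real clock helpers. */
    extern void do_time_adjustment(void);
    extern void process_pending_timeouts(void);
    extern void do_accounting_for_set(int base, int count);
    extern void xcall_one(int cpu, void (*func)(int, int), int a1, int a2);
    extern int  ncpus;

    #define CPUS_PER_SET 16

    static int next_helper;         /* rotates so the load spreads evenly */

    void
    clock_tick(void)
    {
        int base;

        /* CPU 0 keeps the inherently serial work... */
        do_time_adjustment();
        process_pending_timeouts();

        /* ...and farms out per-CPU accounting in sets of CPUS_PER_SET. */
        for (base = 0; base < ncpus; base += CPUS_PER_SET) {
            int helper = next_helper;

            next_helper = (next_helper + 1) % ncpus;
            /* cross call: helper accounts for CPUs [base, base+set) */
            xcall_one(helper, do_accounting_for_set, base, CPUS_PER_SET);
        }
    }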

Memory Placement Optimization (MPO)

The T5140 server in its default configuration has a NUMA characteristic, which is a common architectural strategy for building larger systems. Each server has two physical UltraSPARC(R) T2 Plus processors, and each processor has 64 hardware strands (CPUs). The 64 CPUs on a processor access memory controlled by that processor at a lower latency than memory controlled by the other processor. The physical address space is interleaved across the two processors at a 1 GB granularity. Thus, an operating system that is aware of CPU and memory locality can arrange that software threads allocate memory near the CPU on which they run, minimizing latency.

Solaris does exactly that, and has done so on various platforms since Solaris 9, using the Memory Placement Optimization framework, aka MPO. However, enabling the framework on the T5140 was non-trivial due to the virtualization of CPUs and memory in the sun4v architecture. We extended the hypervisor layer by adding locality arcs in the physical resource graph, and ensured that these arcs were preserved when a subset of the graph was extracted, virtualized, and passed to the Solaris guest at Solaris boot time.

Here are a few details on the MPO framework itself. Each set of CPUs and "near" memory is called a locality group, or lgroup; this corresponds to a single T2 Plus processor on the T5140. When a thread is created, it is assigned to a home lgroup, and the Solaris scheduler tries to run the thread on a CPU in its home lgroup whenever possible. Thread private memory (e.g., stack, heap, anon) is allocated from the home lgroup whenever possible. Shared memory (e.g., SysV shm) is striped across lgroups at page granularity. For more details on Solaris MPO, including commands to control and observe lgroups and local memory, such as lgrpinfo, pmap -L, liblgrp, and memadvise, see the man pages and this presentation.

If an application is dominated by stall time due to memory references that miss in cache, then MPO can theoretically improve performance by as much as the ratio of remote to local memory latency, which is about 1.5 : 1 on the T5140. The STREAM benchmark is a good example; our early experiments with MPO yielded a 50% improvement in STREAM performance. See Brian's blog for the latest optimized results. Similarly, if an application is limited by global coherency bandwidth, then MPO can improve performance by reducing global coherency traffic, though this is unlikely on the T5140 because the local memory bandwidth and the global coherency bandwidth are well balanced.

Thread Scheduling

In my posting on the UltraSPARC T2 processor, I described how Solaris threads are spread across cores and pipelines to balance the load and maximize hardware resource usage. Since the T2 Plus is identical to the T2 in this area, these scheduling heuristics continue to be used for the T5140, but are augmented by scheduling at the lgroup level. Thus, independent software threads are first spread across processors, then across cores within a processor, then across pipelines within a core.

The Kernel Adaptive Mutex

The mutex is the basic locking primitive in Solaris. We have optimized the mutex for large CMT systems in several ways.

The implementation of the mutex in the kernel is adaptive, in that a waiter will busy-wait if the software thread that owns the mutex is running, on the supposition that the owner will release it soon. The waiter will yield the CPU and sleep if the owner is not running. To determine if a mutex owner is running, the code previously traversed all CPUs looking for the owner thread, rather than simply examining the owner thread's state, in order to avoid a race against threads being freed. This O(NCPU) algorithm was costly on large systems, and we replaced it with a constant time algorithm that is safe with respect to threads being freed.

Waiters attempt to acquire a mutex using a compare-and-swap (cas) operation. If many waiters continuously attempt cas on the same mutex, then a queue of requests builds in the memory subsystem, and the latency of each cas becomes proportional to the number of requests. This dramatically reduces the rate at which the mutex can be acquired and released, and causes negative scaling for the higher level code that is using the mutex. The fix is to space out the cas requests over time, such that a queue never builds up, by forcing the waiters to busy-wait for a fixed period after a cas failure. The period increases exponentially after repeated failures, up to a maximum which is proportional to the number of CPUs, which is the upper bound on the number of actively waiting threads. Further, in the busy-wait loop, we use long-latency, low-impact operations, so the busy CPU consumes very little of the execution pipeline, leaving more cycles available to other strands sharing the pipeline.
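The shape of the waiter's loop is roughly the following (a generic sketch of capped exponential backoff, not the literal mutex code; cpu_pause() and mutex_owner_running() stand in for the low-impact delay and the constant-time owner check mentioned above):

    #include <sys/types.h>
    #include <sys/atomic.h>

    /* Illustrative stand-ins, not real kernel interfaces. */
    extern void cpu_pause(void);
    extern int  mutex_owner_running(volatile uintptr_t *lock);
    extern int  ncpus;                      /* kernel CPU count */

    #define BACKOFF_BASE 8

    int
    mutex_try_enter_backoff(volatile uintptr_t *lock, uintptr_t self)
    {
        uint_t backoff = BACKOFF_BASE;
        uint_t backoff_cap = BACKOFF_BASE * ncpus;  /* bound on waiters */
        uint_t i;

        for (;;) {
            if (atomic_cas_ulong((volatile ulong_t *)lock, 0, self) == 0)
                return (1);                 /* acquired */

            /*
             * Failed: spin for 'backoff' iterations using low-impact,
             * long-latency operations so sibling strands keep most of
             * the pipeline.
             */
            for (i = 0; i < backoff; i++)
                cpu_pause();

            if (backoff < backoff_cap)
                backoff <<= 1;              /* exponential growth, capped */

            if (!mutex_owner_running(lock))
                return (0);                 /* caller blocks instead */
        }
    }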

To be clear, any application which causes many waiters to desire the same mutex has an inherent scalability bottleneck, and ultimately needs to be restructured for optimal scaling on large servers. However, the mutex optimizations above allow such apps to scale to perhaps 2X or 3X as many CPUs as they otherwise would, and to degrade gracefully under load rather than tip over into negative scaling.

Availability

All of the enhancements described herein are available in OpenSolaris, and will be available soon in updates to Solaris 10. The MPO and scheduling enhancements for the T5140 will be available in Solaris 10 4/08, and the clock and mutex enhancements will be released soon after in a KU patch.

Tuesday Oct 09, 2007

The UltraSPARC T2 Processor and the Solaris Operating System

The UltraSPARC T2 processor used in the Sun SPARC Enterprise T5x20 server family implements novel features for achieving high performance, which require equally novel support in the Solaris Operating System. A few areas I highlight here are:

  • Core Pipeline and Thread Scheduling
  • Cache Associativity
  • Virtual Memory
  • Block Copy
  • Performance Counters

Unless otherwise noted, the Solaris enhancements I describe are available in the OpenSolaris repository, and in the Solaris 10 8/07 release that is pre-installed on the T5120 and T5220 servers. No special tuning or programming is required for applications to benefit from these enhancements; they are applied automatically by the operating system.

To simplify the technical explanations, I do not distinguish between Physical Memory and Real Memory, but use Physical Memory everywhere. If you want to learn about the distinction, study the OpenSPARC hyper-privileged architecture.

Core Pipeline and Thread Scheduling

The Solaris thread scheduler spreads running threads across as many hardware resources as possible, rather than packing them onto as few resources as possible (consider cores as the resource, for example). By utilizing more resources, this heuristic yields maximum performance at lower thread counts. Do not confuse the kernel-level software thread scheduling I describe here with hardware strand scheduling, which is done on a cycle-by-cycle granularity.

The hierarchy of shared resources is deeper on the T2 processor than it is on the T1 processor, and the scheduler heuristic was extended accordingly. On the T2 processor, 4 strands share an integer pipeline, 2 such pipelines plus other goodies comprise a core, and 8 cores fit on the die. The scheduler spreads threads first across cores, 1 thread per core until every core has one, then 2 threads per core until every core has two, and so on. Within each core, the kernel scheduler balances the threads across the core's 2 integer pipelines.

Hardware resources and their relationships are described in the Processor Group data structure, which is initialized in platform specific code, but accessed in common dispatcher code that implements the hierarchical load balancing heuristic.

Cache Associativity

The T2 L2 cache is 4MB, 16-way associative, 64B line size, and shared by 64 hardware strands. If more than 16 threads frequently access data that maps to the same index in the cache, then they will evict each other's data, and increase the cache miss rate. This is known as a conflict miss, and is more likely in multi-process workloads with the same executable image (the same binary).

The Solaris implementation minimizes conflict misses by applying a technique known as page coloring. In general, a page is mapped into the cache by using higher order physical address bits as an index into the cache way. The T2 L2 way size is 256KB (4MB divided by 16 ways), so an 8KB page will fit into the 256KB way at 32 different locations, known as colors. The Solaris VM system organizes free physical memory based on color. When mapping virtual addresses to physical pages, it chooses the pages so that all colors in the cache are covered, and to minimize the probability that the same virtual address in different processes maps to the same color (and hence cache index).
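Concretely, with a 256KB way and 8KB pages there are 32 colors, and the arithmetic is simple (an illustrative sketch of the policy just described, not the Solaris page freelist code; the per-process offset is my stand-in for how different processes get spread across colors):

    #include <stdint.h>

    #define L2_WAY_SIZE   (256 * 1024)              /* 4MB / 16 ways */
    #define PAGE_SIZE_8K  (8 * 1024)
    #define NCOLORS       (L2_WAY_SIZE / PAGE_SIZE_8K)  /* 32 colors */

    /* The color of an 8K page: where it lands within a 256KB cache way. */
    static inline uint32_t
    page_color(uint64_t pa)
    {
        return ((uint32_t)((pa / PAGE_SIZE_8K) % NCOLORS));
    }

    /*
     * Preferred color for a virtual address: a per-process offset, chosen
     * at process creation, shifts the VA-to-color mapping so the same VA
     * in different processes tends to land on different colors while all
     * 32 colors stay covered.
     */
    static inline uint32_t
    preferred_color(uint64_t va, uint32_t proc_color_offset)
    {
        return ((uint32_t)(((va / PAGE_SIZE_8K) + proc_color_offset)
            % NCOLORS));
    }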

However, page coloring is not applicable for a large page whose size exceeds the way size, such as 4MB and 256MB pages, because such a page has only one color. To remedy this problem, the T2 processor implements hashed cache indexing, in which higher order physical address bits are XOR'd together to yield the index into the cache. Specifically:

index = PA[32:28] ^ PA[17:13] . PA[19:18] ^ PA[12:11] . PA[10:6]

The effect is that the mapping of a page's contents onto the cache is non-linear, and many permutations are possible, governed by address bits that are larger than the page size. This hardware feature alone would reduce cache conflicts between multiple processes using large pages. However, we also modified Solaris VM to be aware of the hardware hash calculation, and allocate physical pages to maintain an even distribution across permutations, which further minimizes conflicts.
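Written out in C, the hashed index computation is (a direct transcription of the formula above; the BITS macro is just notation):

    #include <stdint.h>

    /* Extract bits hi..lo (inclusive) of a physical address. */
    #define BITS(pa, hi, lo) \
        (((pa) >> (lo)) & ((1ULL << ((hi) - (lo) + 1)) - 1))

    /* index = PA[32:28]^PA[17:13] . PA[19:18]^PA[12:11] . PA[10:6] */
    static inline uint32_t
    t2_l2_index(uint64_t pa)
    {
        uint64_t hi5  = BITS(pa, 32, 28) ^ BITS(pa, 17, 13);  /* 5 bits */
        uint64_t mid2 = BITS(pa, 19, 18) ^ BITS(pa, 12, 11);  /* 2 bits */
        uint64_t lo5  = BITS(pa, 10, 6);                      /* 5 bits */

        return ((uint32_t)((hi5 << 7) | (mid2 << 5) | lo5));
    }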

The L1 data cache is 8KB, 4-way associative, 16B line size, and shared by 8 hardware strands. The way size is thus 2KB, which is smaller than the base page size of 8KB, so page coloring techniques cannot be applied to reduce potential conflicts. Instead, Solaris biases the start of each process stack by adding a small pseudo-random amount to the stack pointer at process creation time. Thus, the same stack variable in different processes will have different addresses, which will map to different indices in the L1 data cache, avoiding conflict evictions if that variable is heavily accessed by many processes.
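The bias itself is tiny; something along these lines (illustrative only, with made-up constants for the maximum bias and alignment) captures the idea:

    #include <stdint.h>

    /*
     * Sketch: subtract a small, aligned, pseudo-random amount from the
     * initial stack pointer at process creation, so the same stack
     * variable in different processes maps to different L1 D-cache
     * indices.  The 2KB bound matches the L1 D-cache way size; the
     * 64-byte alignment is an assumption for illustration.
     */
    static uintptr_t
    bias_initial_stack(uintptr_t sp, uint32_t pseudo_random)
    {
        uintptr_t bias = (pseudo_random % 2048) & ~(uintptr_t)63;

        return (sp - bias);
    }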

The L1 instruction cache is 16KB, 8-way associative, 32B line size, and shared by 8 hardware strands. This level of associativity is adequate to avoid excessive conflicts and evictions between strands.

Virtual Memory

The T2 processor has a number of interesting features that accelerate the translation of virtual to physical addresses. It supports the same page sizes that are available on the T1 processor: 8KB, 64KB, 4MB, and 256MB. However, because the hashed cache indexing lowers the L2 conflict miss rate for large pages, the Solaris kernel on servers with the T2 processor automatically allocates 4MB pages for process heap, stack, and anonymous memory. The default size for these segments on T1 processors is 64KB. Larger pages are better because they reduce the TLB miss rate.

Further, the T2 processor also supports Hardware Tablewalk (HWTW) on TLB misses, and is the first SPARC processor to do so. The TLB is a hardware cache of VA to PA translations. Each T2 core has a 128-entry Data TLB and a 64-entry Instruction TLB, both fully associative. When a translation is not found in the TLB, the TSB is probed. The TSB is a much larger cache that is created by the kernel, lives in main memory, and is directly indexed by virtual address. In other SPARC implementations, a TLB miss causes a trap to the kernel, and the trap handler probes the TSB. With HWTW, the processor issues a load and probes the TSB directly, avoiding the overhead of a software trap handler, which saves valuable cycles per TLB miss. Each T2 hardware strand can be programmed to probe up to four separate TSB regions. The Solaris kernel programs the regions at kernel thread context switch time. (Again, do not confuse this with hardware strand scheduling).

Lastly, processes can share memory more efficiently on the T2 processor using a new feature called the shared context. In previous SPARC implementations, even when processes share physical memory, they still have private translations from process virtual addresses to shared physical addresses, so the processes compete for space in the TLB. Using the shared context feature, processes can use each other's translations that are cached in the TLB, as long as the shared memory is mapped at the same virtual address in each process. This is done safely - the Solaris VM system manages private and shared context identifiers, assigns them to processes and process sharing groups, and programs hardware context registers at thread context switch time. The hardware allows sharing only amongst processes that have the same shared context identifier. In addition, the Solaris VM system arranges that shared translations are backed by a shared TSB, which is accessed via HWTW, further boosting efficiency. Processes that map the same ISM/DISM segments and have the same executable image share translations in this manner, for both the shared memory and for the main text segment.

In total, these VM enhancements provide a nice boost in performance to workloads with a high TLB miss rate, such as OLTP.

Block Copy

Routines that initialize or copy large blocks of memory are heavily used by applications and by the Solaris kernel, so these routines are heavily optimized, often using processor-specific assembly code. On most SPARC implementations, large block move is done using 64-byte loads and stores as supported by the VIS instruction set. These instructions store intermediate results in banks of floating point registers, and so are not the best choice on the T1 processor, in which all cores share a single floating point unit. However, the T2 processor has one FP unit per core, so VIS is once again the best choice.

The Solaris OS selects the optimal block copy routines at boot time based on the processor type. This applies to the memcpy family of functions in libc, to the kernel bcopy routine, and to routines which copy data in and out of the kernel during system calls.

The new block copy implementation for the T2 is also tuned to handle special cases for various combinations of source alignment, destination alignment, and length. The result is that the new routines are approximately 2X faster than the previous routines. The changes are available now in the OpenSolaris repository, and will be available later in a patch following Solaris 10 8/07.

If you enjoy reading SPARC assembly code (and who does not!), see the file niagara_copy.s.

Performance Counters

The T2 processor offers a rich set of performance counters for counting hardware events such as cache misses, TLB misses, crypto operations, FP operations, loads, stores, etc. These are accessed using the cpustat command. cpustat -h shows the list of available counters, which are documented in the "Performance Instrumentation" chapter of the OpenSPARC T2 Supplement to the UltraSPARC Architecture 2007.

In addition, the T2 processor implements a "trap on counter overflow" mechanism which is compatible with the mechanism on other UltraSPARC processors (unlike the T1 processor), which means that you can use Sun Studio Performance Analyzer to profile hardware events and map them to your application.

The corestat utility in the Cool Tools suite has been updated to use the new counters in its per-core utilization calculations. It is bundled with the Cool Tools software and is pre-installed on T2-based servers. For more on corestat, see Ravi Talashikar's blog.

Other

The list above is not exhaustive. There are other cool performance features in the T2 processor, such as per-core crypto acceleration, and on-chip 10 GbE networking. For more information on the T5120/T5220 servers and their new processor, see Allan Packer's index.

Monday Oct 08, 2007

Overview

In future postings in this space, I will describe features, issues, and tools relating to performance and scalability of the Solaris Operating System. This is my bread and butter - I work on kernel projects that improve performance. Some of these projects enable new features in Sun's UltraSPARC processors and servers, and some address cross-platform issues.