Wednesday Apr 10, 2013

Massive Solaris Scalability for the T5-8 and M5-32, Part 3

Today I conclude this series on M5-32 scalability [Part 1, Part 2] with enhancements we made in the Scheduler, Devices, Tools, and Reboot areas of Solaris.

Scheduler

The Solaris thread scheduler is little changed, as the architecture of balancing runnable threads across levels in the processor resource hierarchy, which I described when the T2 processor was introduced, has scaled well. However, we have continued to optimize the clock function of the scheduler. Clock is responsible for quanta expiration, timeout processing, resource accounting for every CPU, and miscellaneous housekeeping functions. Previously, we parallelized quanta expiration and timeout expiration (aka callouts). In Solaris 11, we eliminated the need to acquire the process and thread locks in most cases during quanta expiration and accounting, and we eliminated or reduced the impact of several smallish O(N) calculations that had become significant at 1536 CPUs. The net result is that all functionality associated with clock scales nicely, and CPU 0 does not accumulate noticeable %sys CPU time due to clock processing.

Devices

SPARC systems use an IOMMU to map PCI-E virtual addresses to physical memory. The PCI VA space is a limited resource with high demand. The VA span is only 2GB to maintain binary compatibility with traditional DDI functions, and many drivers pre-map large DMA buffer pools so that mapping is not on the critical path for transport operations. Every CPU can post concurrent DMA requests, thus demand increases with scale. Managing these conflicting demands is a challenge. We reimplemented DVMA allocation using the Solaris kmem_cache and vmem facilities, with object size and quanta chosen to match common DMA transfer sizes. This provides a good balance between contention-free per-CPU caching, and redistribution of free space in the back end magazine and slab layers. We also modified drivers to use DMA pools more efficiently, and we modified the IOMMU code so that 2GB of VA is available per PCI function, rather than per PCI root port.
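
To make this concrete, here is a minimal sketch of a DVMA arena built on vmem with quantum caching, in the spirit of the rework described above; the base address, span, page size, qcache threshold, and function names are illustrative assumptions, not the actual Solaris IOMMU code.

    /* Hypothetical sketch; not the actual Solaris IOMMU driver. */
    #include <sys/vmem.h>

    #define DVMA_BASE       ((void *)0x80000000UL)  /* assumed 2GB VA span */
    #define DVMA_SIZE       (2UL << 30)
    #define IOMMU_PGSIZE    8192                    /* assumed IOMMU page size */

    static vmem_t *dvma_arena;                      /* one per PCI function */

    void
    dvma_init(void)
    {
            /*
             * qcache_max directs small allocations to contention-free
             * per-CPU magazine caches; the slab and arena layers
             * redistribute free space in the back end.
             */
            dvma_arena = vmem_create("dvma", DVMA_BASE, DVMA_SIZE,
                IOMMU_PGSIZE, NULL, NULL, NULL,
                8 * IOMMU_PGSIZE /* qcache_max */, VM_SLEEP);
    }

    /* Common DMA transfer sizes hit the per-CPU caches. */
    void *
    dvma_alloc(size_t len)
    {
            return (vmem_alloc(dvma_arena, len, VM_SLEEP));
    }

    void
    dvma_free(void *dvma, size_t len)
    {
            vmem_free(dvma_arena, dvma, len);
    }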

The net result for the end user is higher device throughput and/or lower CPU utilization per unit of throughput on larger systems.

Tools

The very tools we use to analyze scalability may exhibit problems themselves, because they must collect data for all the entities on a system. We noticed that mpstat was consuming so much CPU time on large systems that it could not sample at 1 second intervals and was falling behind. mpstat collects data for all CPUs in every interval, but 1536 CPUs is not a large number to handle in 1 second, so something was amiss. Profiling showed the time was spent searching for per-CPU kstats (see kstat(3KSTAT)), and every lookup searched the entire kc_chain linked list of all kstats. Since the number of kstats grows with NCPU, the overall algorithm takes time O(NCPU^2), which explodes on the larger systems. We modified the kstat library to build a hash table when kstats are opened, and re-implemented kstat_lookup() on top of it. This reduced CPU consumption by 8X on our "small" 512-CPU test system, and improves the performance of all tools that are based on libkstat, including mpstat, vmstat, iostat, and sar.
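
To illustrate the fix, the sketch below builds a hash table over kc_chain once after kstat_open(), turning each lookup from an O(n) list walk into an O(1) expected-time probe. The table size and hash function are my own assumptions, not the shipped libkstat code.

    /* Sketch of hashing the kstat chain; not the shipped libkstat fix. */
    #include <kstat.h>
    #include <stdlib.h>
    #include <string.h>

    #define KS_NBUCKETS 4096

    typedef struct ks_hent {
            kstat_t *ksp;
            struct ks_hent *next;
    } ks_hent_t;

    static ks_hent_t *ks_hash[KS_NBUCKETS];

    static unsigned int
    ks_hashfn(const char *module, int instance, const char *name)
    {
            unsigned int h = (unsigned int)instance;

            while (*module != '\0')
                    h = h * 31 + (unsigned char)*module++;
            while (*name != '\0')
                    h = h * 31 + (unsigned char)*name++;
            return (h % KS_NBUCKETS);
    }

    /* Build the table once, right after kstat_open(). */
    void
    ks_hash_build(kstat_ctl_t *kc)
    {
            for (kstat_t *ksp = kc->kc_chain; ksp != NULL; ksp = ksp->ks_next) {
                    unsigned int h = ks_hashfn(ksp->ks_module,
                        ksp->ks_instance, ksp->ks_name);
                    ks_hent_t *e = malloc(sizeof (*e));

                    e->ksp = ksp;
                    e->next = ks_hash[h];
                    ks_hash[h] = e;
            }
    }

    /* Expected O(1), replacing the O(n) walk of kc_chain per lookup. */
    kstat_t *
    ks_hash_lookup(const char *module, int instance, const char *name)
    {
            unsigned int h = ks_hashfn(module, instance, name);

            for (ks_hent_t *e = ks_hash[h]; e != NULL; e = e->next) {
                    if (e->ksp->ks_instance == instance &&
                        strcmp(e->ksp->ks_module, module) == 0 &&
                        strcmp(e->ksp->ks_name, name) == 0)
                            return (e->ksp);
            }
            return (NULL);
    }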

Even DTrace is not immune. When a script starts, dtrace allocates multi-megabyte trace buffers for every CPU in the domain, using a single thread, and frees the buffers on script termination using a single thread. On a T3-4 with 512 CPUs, it took 30 seconds to run a null D script. Even worse, the allocation is done while holding the global cpu_lock, which serializes the startup of other D scripts, and causes long pauses in the output of some stat commands that briefly take cpu_lock while sampling. We fixed this in Solaris 11.1 by allocating and freeing the trace buffers in parallel using vmtasks, and by hoisting allocation out of the cpu_lock critical path.

Large scale can impact the usability of a tool. Some stat tools produce a row of output per CPU in every sampling interval, making it hard to spot important clues in the torrent of data. In Solaris 11.1, we provide new aggregation and sorting options for the mpstat, cpustat, and trapstat commands that allow the user to make sense of the data. For example, the command

  mpstat -k intr -A 4 -m 10 5
sorts CPUs by the interrupts metric, partitions them into quartiles, and aggregates each quartile into a single row by computing the mean column values within each. See the man pages for details.

Reboot

Large servers take longer to reboot than small servers. Why? They must initialize more CPUs, memory, and devices, but much of the shutdown and startup code in firmware and the kernel is single threaded. We are addressing that. On shutdown, Solaris now scans memory in parallel to look for dirty pages that must be flushed to disk. The sun4v hypervisor zeroes a domain's memory in parallel, using CPUs that are physically closest to memory for maximum bandwidth. On startup, Solaris VM initializes per-page metadata using SPARC cache-initializing block stores, which speeds metadata initialization by more than 2X. We also fixed an O(NCPU^2) algorithm in bringing CPUs online, and an O(NCPU) algorithm in reclaiming memory from firmware. In total, we have reduced the reboot time for M5-32 systems by many minutes, and we continue to work on optimizations in this area.

In these few short posts, I have summarized the work of many people over a period of years that has pushed Solaris to new heights of scalability, and I look forward to seeing what our customers will do with their massive T5-8 and M5-32 systems. However, if you have seen the SPARC processor roadmap, you know that our work is not done. Onward and upward!

Friday Apr 05, 2013

Massive Solaris Scalability for the T5-8 and M5-32, Part 2

Last time, I outlined the general issues that must be addressed to achieve operating system scalability. Next I will provide more detail on what we modified in Solaris to reach the M5-32 scalability level. We worked in most of the major areas of Solaris, including Virtual Memory, Resources, Scheduler, Devices, Tools, and Reboot. Today I cover VM and resources.

Virtual Memory

When a page of virtual memory is freed, the virtual to physical address translation must be deleted from the MMU of all CPUs which may have accessed the page. On Solaris, this is implemented by posting a software interrupt known as an xcall to each target CPU. This "TLB shootdown" operation poses one of the thorniest scalability challenges in the VM area, as a single-threaded process may have migrated and run on all the CPUs in a domain, and a multi-threaded process may run threads on all CPUs concurrently. This is a frequent cause of sub-optimal scaling when porting an application from a small to a large server, for a wide variety of systems and vendors.

The T5 and M5 processors provide hardware acceleration for this operation. A single PIO write (an ASI write in SPARC parlance) can demap a VA in all cores of a single socket. Solaris need only send an xcall to one CPU per socket, rather than sending an xcall to every CPU. This achieves a 48X reduction in xcalls on M5-32, and a 128X reduction in xcalls on T5-8, for mappings such as kernel pages that are used on every CPU. For user page mappings, one xcall is sent to each socket on which the process runs. The net result is that the cost of demap operations in dynamic memory workloads is not measurably higher on large T5 and M5 systems than on small.
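
In outline, the dispatch logic looks something like the following pseudocode; every name here is hypothetical, and the real xcall and ASI interfaces differ.

    /* Illustrative pseudocode only; not the Solaris HAT source. */
    void
    demap_va(uintptr_t va, cpuset_t cpus_used)
    {
            for (int s = 0; s < nsockets; s++) {
                    cpuset_t socket_cpus = socket_cpu_set(s);

                    if (!cpuset_intersects(&cpus_used, &socket_cpus))
                            continue;       /* process never ran on this socket */
                    /*
                     * One xcall per socket: the handler issues a single
                     * ASI (PIO) write that demaps the VA in every core
                     * of the socket, instead of one xcall per CPU.
                     */
                    xcall_one(cpuset_first(&socket_cpus), demap_all_cores, va);
            }
    }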

The VM2 project re-implemented the physical page management layer in Solaris 11.1, and offers several scalability benefits. It manages a large page as a single unit, rather than as a collection of contained small pages, which reduces the cost of allocating and freeing large pages. It predicts the demand for large pages and proactively defragments physical memory to build more, reducing delays when an application page faults and needs a large page. These enhancements make it practical for Solaris to use a range of large page sizes, in every segment type, which maximizes run-time efficiency of large memory applications. VM2 also allows kernel memory to be allocated near any socket. Previously, kernel memory was confined to a single physically contiguous "kernel cage", which often fit in the memory connected to a single socket, and that socket could become a memory hot spot for kernel-intensive workloads. Spreading reduces hot spots, and also allows kernel data such as DMA buffers to be allocated near threads or devices for lower latency and higher bandwidth.

The VM system manages certain resources on a per-domain basis, in units of pages. These include swap space, locked memory, and reserved memory, among others. These quantities are adjusted when a page is allocated, freed, locked, and unlocked. Each is represented by a global counter protected by a global lock. The lock hold times are small, but at some CPU count they become bottlenecks. How does one scale a global counter? Using a new data structure I call the Credit Tree, which provides O(K * log(NCPU)) allocation performance with a very small constant K. I will describe it in a future posting. We replaced the VM system's global counters with credit trees in S11.1, and achieved a 45X speedup on an mmap() microbenchmark on T4-4 with 256 CPUs. This is good for the Oracle database, because it uses mmap() and munmap() to dynamically allocate space for its per-process PGA memory.
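
The Credit Tree itself is the subject of that future posting, so I will not describe it here; the sketch below shows only the general batched-counter idea in a two-level form (a per-CPU leaf refilled from a global root), which the tree generalizes to log(NCPU) levels. All names and the batch size are illustrative.

    /* Generic two-level sketch; NOT the Credit Tree implementation. */
    #define NCPU            1536    /* illustrative */
    #define CREDIT_BATCH    1024    /* pages moved per refill */

    typedef struct credit {
            kmutex_t lock;
            long avail;             /* pad to a cache line in practice */
    } credit_t;

    static credit_t root;           /* global pool, rarely touched */
    static credit_t leaf[NCPU];     /* per-CPU credit */

    int
    page_reserve(int cpu, long npages)
    {
            credit_t *l = &leaf[cpu];

            mutex_enter(&l->lock);  /* usually uncontended */
            if (l->avail < npages) {
                    /* Slow path, amortized over many fast paths. */
                    long batch = CREDIT_BATCH + npages;

                    mutex_enter(&root.lock);
                    if (batch > root.avail)
                            batch = root.avail;
                    root.avail -= batch;
                    mutex_exit(&root.lock);
                    l->avail += batch;
            }
            if (l->avail < npages) {
                    mutex_exit(&l->lock);
                    return (ENOMEM);        /* limit truly exhausted */
            }
            l->avail -= npages;     /* the free path, not shown, returns credit */
            mutex_exit(&l->lock);
            return (0);
    }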

The virtual address space is a finite resource that must be partitioned carefully to support large memory systems. 64 bits of VA is sufficient, but we had to adjust the kernel's VA ranges to support a larger heap and more physical memory pages, and adjust process VA ranges to support larger shared memory segments (e.g., for the Oracle SGA).

Lastly, we reduced contention on various locks by increasing lock array sizes and improving the object-to-lock hash functions.

Resource Limits

Solaris limits the number of processes that can be created to prevent metadata such as the process table and the proc_t structures from consuming too much kernel memory. This is enforced by the tunables maxusers, max_nprocs, and pidmax. The default for the latter was 30000, which is too small for M5-32 with 1536 CPUs, allowing only 20 processes per CPU. As of Solaris 11.1, the default for these tunables automatically scales up with CPU count and memory size, to a maximum of 999999 processes. You should rarely if ever need to change these tunables in /etc/system, though that is still allowed.

Similarly, Solaris limits the number of threads that can be created, by limiting the space reserved for kernel thread stacks with the segkpsize tunable, whose default allowed approximately 64K threads. In Solaris 11.1, the default scales with CPU and memory to a maximum of 1.6M threads.

Next time: Scheduler, Devices, Tools, and Reboot.

Tuesday Apr 02, 2013

Massive Solaris Scalability for the T5-8 and M5-32, Part 1

How do you scale a general purpose operating system to handle a single system image with 1000's of CPUs and 10's of terabytes of memory? You start with the scalable Solaris foundation. You use superior tools such as DTrace to expose issues, quantify them, and extrapolate to the future. You pay careful attention to computer science, data structures, and algorithms when designing fixes. You implement fixes that automatically scale with system size, so that once exposed, an issue never recurs in future systems, and the set of issues you must fix in each larger generation steadily shrinks.

The T5-8 has 8 sockets, each containing 16 cores of 8 hardware strands each, which Solaris sees as 1024 CPUs to manage. The M5-32 has 1536 CPUs and 32 TB of memory. Both are many times larger than the previous generation of Oracle T-class and M-class servers. Solaris scales well on that generation, but every leap in size exposes previously benign O(N) and O(N^2) algorithms that explode into prominence on the larger system, consuming excessive CPU time, memory, and other resources, and limiting scalability. To find these, knowing what to look for helps. Most OS scaling issues can be categorized as CPU issues, memory issues, device issues, or resource shortage issues.

CPU scaling issues include:

  • increased lock contention at higher thread counts
  • O(NCPU) and worse algorithms
Lock contention is addressed using fine grained locking based on domain decomposition or hashed lock arrays, and the number of locks is automatically scaled with NCPU for a future-proof solution, as sketched below. O(NCPU^2) algorithms are often the result of naive data structures, or interactions between sub-systems, each of which does O(N) work, and once recognized can be recoded easily enough with an adequate supply of caffeine. O(NCPU) algorithms are often the result of a single thread managing resources that grow with machine size, and the solution is to apply parallelism. A good example is the use of vmtasks for shared memory allocation.
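
Here is a minimal sketch of such a hashed lock array, sized as a function of NCPU and padded to avoid false sharing; the sizing factor and hash function are illustrative assumptions.

    /* Sketch of an NCPU-scaled hashed lock array; names are illustrative. */
    #include <sys/kmem.h>
    #include <sys/mutex.h>

    typedef struct pad_mutex {
            kmutex_t lock;
            char pad[64 - sizeof (kmutex_t)];       /* one lock per cache line */
    } pad_mutex_t;

    static pad_mutex_t *obj_locks;
    static size_t obj_nlocks;                       /* power of two */

    void
    obj_locks_init(int ncpus)
    {
            /* Round 4 * ncpus up to a power of two for cheap masking. */
            obj_nlocks = 1;
            while (obj_nlocks < 4 * (size_t)ncpus)
                    obj_nlocks <<= 1;
            obj_locks = kmem_zalloc(obj_nlocks * sizeof (pad_mutex_t), KM_SLEEP);
            for (size_t i = 0; i < obj_nlocks; i++)
                    mutex_init(&obj_locks[i].lock, NULL, MUTEX_DEFAULT, NULL);
    }

    /* Hash an object address to its lock; shift off the low zero bits. */
    kmutex_t *
    obj_lock(void *obj)
    {
            uintptr_t h = (uintptr_t)obj >> 6;

            h ^= h >> 16;
            return (&obj_locks[h & (obj_nlocks - 1)].lock);
    }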

Memory scaling issues include:

  • working sets that exceed VA translation caches
  • unmapping translations in all CPUs that access a memory page
  • O(memory) algorithms
  • memory hotspots
Virtual to physical address translations are cached at multiple levels in hardware and software, from TLB through TSB and HME on SPARC. A miss in the smaller lower level caches requires a more costly lookup at the higher level(s). Solaris maximizes the span of each cache and minimizes misses by supporting shared MMU contexts, a range of hardware page sizes up to 2 GB, and the ability to use large pages in every type of memory segment: user, kernel, text, data, private, shared. Solaris uses a novel hardware feature of the T5 and M5 processors to unmap memory on a large number of CPUs efficiently. O(memory) algorithms are fixed using parallelism. Memory hotspots are fixed by avoiding false sharing and spreading data structures across caches and memory controllers.

Device scaling issues include:

  • O(Ndevice) and worse algorithms
  • system bandwidth limitations
  • lock contention in interrupt threads and service threads
The O(N) algorithms tend to be hit during administrative actions such as system boot and hot plug, and are fixed with parallelism and improved data structures. System bandwidth is maximized by spreading devices across PCI roots and system boards, by spreading DMA buffers across memory controllers, and by co-locating DMA buffers with either the producer or consumer of the data. Lock contention is a CPU scaling issue.

Resource shortages occur when too many CPUs compete for a finite set of resources. Sometimes the resource limit is artificial and defined by software, such as for the maximum process and thread count, in which case the fix is to scale the limit automatically with NCPU. Sometimes the limit is imposed by hardware, such as for the number of MMU contexts, and the fix requires more clever resource management in software.

Next time I will provide more details on new Solaris improvements in all of these areas that enable superior performance and scaling on T5 and M5 systems. Stay tuned.

Thursday Nov 08, 2012

Faster Memory Allocation Using vmtasks

You may have noticed a new system process called "vmtasks" on Solaris 11 systems:

    % pgrep vmtasks
    8
    % prstat -p 8
       PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
         8 root        0K    0K sleep   99  -20   9:10:59 0.0% vmtasks/32
    

What is vmtasks, and why should you care? In a nutshell, vmtasks accelerates creation, locking, and destruction of pages in shared memory segments. This is particularly helpful for locked memory, as creating a page of physical memory is much more expensive than creating a page of virtual memory. For example, an ISM segment (shmflg & SHM_SHARE_MMU) is locked in memory on the first shmat() call, and a DISM segment (shmflg & SHM_PAGEABLE) is locked using mlock() or memcntl(). Segment operations such as creation and locking are typically single threaded, performed by the thread making the system call. In many applications, the size of a shared memory segment is a large fraction of total physical memory, and the single-threaded initialization is a scalability bottleneck which increases application startup time.
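
For reference, here is what the ISM case looks like from user level; a minimal example with abbreviated error handling. SHM_SHARE_MMU is Solaris-specific.

    /* Create and attach a 1 GB ISM segment. */
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>

    int
    main(void)
    {
            size_t size = 1UL << 30;        /* 1 GB */
            int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
            void *addr;

            if (id == -1) {
                    perror("shmget");
                    return (1);
            }
            /*
             * The first ISM attach creates and locks physical memory for
             * the entire segment; this is the single-threaded step that
             * vmtasks parallelizes.
             */
            addr = shmat(id, NULL, SHM_SHARE_MMU);
            if (addr == (void *)-1) {
                    perror("shmat");
                    return (1);
            }
            printf("ISM segment attached at %p\n", addr);
            (void) shmdt(addr);
            (void) shmctl(id, IPC_RMID, NULL);
            return (0);
    }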

To break the bottleneck, we apply parallel processing, harnessing the power of the additional CPUs that are always present on modern platforms. For sufficiently large segments, as many as 16 vmtasks threads are employed to assist an application thread during creation, locking, and destruction operations. The segment is implicitly divided at page boundaries, and each thread is given a chunk of pages to process. The per-page processing time can vary, so for dynamic load balancing, the number of chunks is greater than the number of threads, and threads grab chunks dynamically as they finish their work. Because the threads modify a single application address space in a compressed time interval, contention on the locks protecting VM data structures was a problem, and we had to re-scale a number of VM locks to get good parallel efficiency. The vmtasks process has 1 thread per CPU and may accelerate multiple segment operations simultaneously, but each operation gets at most 16 helper threads to avoid monopolizing CPU resources. We may reconsider this limit in the future.
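
A sketch of that chunking scheme appears below; the chunk size and names are illustrative, not the vmtasks source. Each helper claims the next unclaimed chunk with an atomic increment, so faster threads naturally process more chunks than slower ones.

    /* Illustrative sketch of dynamic chunk claiming; not the vmtasks source. */
    #include <sys/atomic.h>
    #include <sys/sysmacros.h>              /* MIN */

    #define CHUNK_PAGES     256             /* pages per unit of work */

    typedef struct vmtask {
            volatile uint64_t next;         /* next chunk index to claim */
            uint64_t nchunks;               /* more chunks than threads */
            uint64_t npages;                /* pages in the whole segment */
            struct seg *seg;                /* hypothetical segment handle */
    } vmtask_t;

    /* Each of up to 16 helper threads runs this until the work is gone. */
    void
    vmtask_helper(vmtask_t *t)
    {
            uint64_t c;

            /* atomic_inc_64_nv returns the incremented value. */
            while ((c = atomic_inc_64_nv(&t->next) - 1) < t->nchunks) {
                    uint64_t first = c * CHUNK_PAGES;
                    uint64_t last = MIN(first + CHUNK_PAGES, t->npages);

                    /* Create/lock/destroy pages [first, last); hypothetical. */
                    process_pages(t->seg, first, last);
            }
    }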

Acceleration using vmtasks is enabled out of the box, with no tuning required, and works for all Solaris platform architectures (SPARC sun4u, SPARC sun4v, x86).

The following tables show the time to create + lock + destroy a large segment, normalized as milliseconds per gigabyte, before and after the introduction of vmtasks:

    ISM
        system     ncpu    before      after   speedup 
        ------     ----    ------      -----   -------
        x4600      32      1386        245     6X
        X7560      64      1016        153     7X
        M9000      512     1196        206     6X
        T5240      128     2506        234     11X
        T4-2       128     1197        107     11X
    
    DISM
        system     ncpu    before      after   speedup 
        ------     ----    ------      -----   -------
        x4600      32      1582        265     6X
        X7560      64      1116        158     7X
        M9000      512     1165        152     8X
        T5240      128     2796        198     14X
    

(I am missing the data for T4 DISM, for no good reason; it works fine).

The following table separates the creation and destruction times:

    ISM, T4-2
                  before    after  
                  ------    -----
        create    702       64
        destroy   495       43
    

To put this in perspective, consider creating a 512 GB ISM segment on T4-2. At 702 msec/GB, creating the segment would take 6 minutes with the old code; at 64 msec/GB, it takes only 33 seconds with the new. If this is your Oracle SGA, you save over 5 minutes when starting the database, and you also save when shutting it down prior to a restart. Those minutes go directly to your bottom line for service availability.

Wednesday Oct 31, 2012

High Resolution Timeouts

The default resolution of application timers and timeouts is now 1 msec in Solaris 11.1, down from 10 msec in previous releases. This improves out-of-the-box performance of polling and event based applications, such as ticker applications, and even the Oracle RDBMS log writer. More on that in a moment.

As a simple example, the poll() system call takes a timeout argument in units of msec:


System Calls                                              poll(2)
NAME
     poll - input/output multiplexing
SYNOPSIS
     int poll(struct pollfd fds[], nfds_t nfds, int timeout);

In Solaris 11, a call to poll(NULL,0,1) returns in 10 msec, because even though a 1 msec interval is requested, the implementation rounds to the system clock resolution of 10 msec. In Solaris 11.1, this call returns in 1 msec.
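
You can observe the change with a few lines of C; a minimal sketch that times repeated poll(NULL,0,1) calls and prints the average, which should be roughly 10 msec on Solaris 11 and roughly 1 msec on Solaris 11.1:

    #include <poll.h>
    #include <stdio.h>
    #include <time.h>

    int
    main(void)
    {
            struct timespec t0, t1;
            int iters = 100;
            double ms;

            (void) clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int i = 0; i < iters; i++)
                    (void) poll(NULL, 0, 1);        /* 1 msec timeout */
            (void) clock_gettime(CLOCK_MONOTONIC, &t1);

            ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
            printf("%.2f msec per poll(NULL,0,1)\n", ms / iters);
            return (0);
    }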

In specification lawyer terms, the resolution of CLOCK_REALTIME, introduced by the POSIX.1b real time extensions, is now 1 msec. The function clock_getres(CLOCK_REALTIME,&res) returns 1 msec, and any library call whose man page explicitly mentions CLOCK_REALTIME, such as nanosleep(), is subject to the new resolution. Additionally, many legacy functions that pre-date POSIX.1b and do not explicitly mention a clock domain, such as poll(), are subject to the new resolution. Here is a fairly comprehensive list:

      nanosleep
      pthread_mutex_timedlock pthread_mutex_reltimedlock_np
      pthread_rwlock_timedrdlock pthread_rwlock_reltimedrdlock_np
      pthread_rwlock_timedwrlock pthread_rwlock_reltimedwrlock_np
      mq_timedreceive mq_reltimedreceive_np
      mq_timedsend mq_reltimedsend_np
      sem_timedwait sem_reltimedwait_np
      poll select pselect
      _lwp_cond_timedwait _lwp_cond_reltimedwait
      semtimedop sigtimedwait
      aiowait aio_waitn aio_suspend
      port_get port_getn
      cond_timedwait cond_reltimedwait
      setitimer (ITIMER_REAL)
      misc rpc calls, misc ldap calls

This change in resolution was made feasible because we made the implementation of timeouts more efficient a few years back when we re-architected the callout subsystem of Solaris. Previously, timeouts were tested and expired by the kernel's clock thread which ran 100 times per second, yielding a resolution of 10 msec. This did not scale, as timeouts could be posted by every CPU, but were expired by only a single thread. The resolution could be changed by setting hires_tick=1 in /etc/system, but this caused the clock thread to run at 1000 Hz, which made the potential scalability problem worse. Given enough CPUs posting enough timeouts, the clock thread could be a performance bottleneck. We fixed that by re-implementing the timeout as a per-CPU timer interrupt (using the cyclic subsystem, for those familiar with Solaris internals). This decoupled the clock thread frequency from timeout resolution, and allowed us to improve default timeout resolution without adding CPU overhead in the clock thread.

Here are some exceptions for which the default resolution is still 10 msec.

  • The thread scheduler's time quantum is 10 msec by default, because preemption is driven by the clock thread (plus helper threads for scalability). See for example dispadmin, priocntl, fx_dptbl, rt_dptbl, and ts_dptbl. This may be changed using hires_tick.
  • The resolution of the clock_t data type, primarily used in DDI functions, is 10 msec. It may be changed using hires_tick. These functions are only used by developers writing kernel modules.
  • A few functions that pre-date POSIX CLOCK_REALTIME mention _SC_CLK_TCK, CLK_TCK, "system clock", or no clock domain. These functions are still driven by the clock thread, and their resolution is 10 msec. They include alarm, pcsample, times, clock, and setitimer for ITIMER_VIRTUAL and ITIMER_PROF. Their resolution may be changed using hires_tick.

Now back to the database. How does this help the Oracle log writer? Foreground processes post a redo record to the log writer, which releases them after the redo has committed. When a large number of foregrounds are waiting, the release step can slow down the log writer, so under heavy load, the foregrounds switch to a mode where they poll for completion. This scales better because every foreground can poll independently, but at the cost of waiting the minimum polling interval. That was 10 msec, but is now 1 msec in Solaris 11.1, so the foregrounds process transactions faster under load. Pretty cool.

Friday Oct 26, 2012

Solaris 11.1 Performance

Solaris 11.1 has many improvements in performance and scalability, some of which I worked on and can finally talk about.  Check this space over the next few weeks for new postings.

Tuesday Jul 19, 2011

Fast Crash Dump

You may have noticed the new system crash dump file vmdump.N that was introduced in Solaris 10 9/10. However, you perhaps did not notice that the crash dump is generated much more quickly than before, reducing your down time by many minutes on large memory systems, by harnessing parallelism and high compression at dump time. In this entry, I describe the Fast Crash optimizations that Dave Plauger and I added to Solaris.

In the previous implementation, if a system panics, the panic thread freezes all other CPUs and proceeds to copy system memory to the dump device. By default, only kernel pages are saved, which is usually sufficient to diagnose a kernel bug, but that can be changed with the dumpadm(1M) command. I/O is the bottleneck, so the panic thread compresses pages on the fly to reduce the data to be written. It uses lzjb compression, which provides decent compression at a reasonable CPU utilization. When the system reboots, the single-threaded savecore(1M) process reads the dump device, uncompresses the data, and creates the crash dump files vmcore.N and unix.N, for a small integer N.

Even with lzjb compression, writing a crash dump on systems with gigabytes to terabytes of memory takes a long time. What if we use stronger compression to further reduce the amount of data to write? The following chart compares the compression ratio of lzjb vs bzip2 for 42 crash dumps picked at random from our internal support site.

    [Chart: compression ratio of lzjb vs bzip2 for the 42 crash dumps]

bzip2 compresses 2X more than lzjb in most cases, and in the extreme case, bzip2 achieves a 39X compression vs 9X for lzjb. (We also tested gzip levels 1 through 9, and they fall in between the two.) Thus we could reduce the disk I/O time for crash dump by using bzip2. The catch is that bzip2 requires significantly more CPU time than lzjb per byte compressed, some 20X to 40X more on the SPARC and x86 CPUs we tested, so introducing bzip2 in a single-threaded dump would be a net loss. However, we hijack the frozen CPUs to compress different ranges of physical memory in parallel. The panic thread traverses physical memory in 4 MB chunks, mapping each chunk and passing its address to a helper CPU. The helper compresses the chunk to an output buffer, and passes the result back to the panic thread, which writes it to disk. This is implemented in a pipelined, dataflow fashion such that the helper CPUs are kept busy compressing the next batch of data while the panic thread writes the previous batch of data to disk.
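
Schematically, the pipeline looks like the sketch below; every name is hypothetical, since the real code must run on frozen CPUs under the constraints described next.

    /* Schematic of the dump pipeline; all names are hypothetical. */
    #define DUMP_CHUNK      (4UL << 20)     /* 4 MB of physical memory */

    /* Helper CPU: compress assigned chunks into private output buffers. */
    void
    dump_helper(helper_t *hp)
    {
            chunk_t *c;

            while ((c = chunk_get(hp)) != NULL) {   /* spin for work */
                    c->outlen = bzip2_compress(c->out, c->va, DUMP_CHUNK);
                    chunk_done(hp, c);      /* hand result to the panic thread */
            }
    }

    /* Panic thread: map chunks, feed helpers, overlap I/O with compression. */
    void
    dump_master(void)
    {
            chunk_t *c;

            for (pfn_t pfn = 0; pfn < physmax; pfn += DUMP_CHUNK / PAGESIZE) {
                    c = chunk_alloc();
                    c->va = map_phys(pfn, DUMP_CHUNK);      /* temporary mapping */
                    chunk_put(next_helper(), c);

                    /* Drain finished chunks while helpers keep compressing. */
                    while ((c = chunk_reap()) != NULL)
                            dump_write(c->out, c->outlen);
            }
            /* ... drain remaining chunks and write the dump trailer ... */
    }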

We dealt with several practical problems to make this work. Each helper CPU needs several MB of buffer space to run the bzip2 algorithm, which really adds up for hundreds of CPUs, and we did not want to statically reserve that much memory per domain. Instead, we scavenge memory that is not included in the dump, such as userland pages in a kernel-only dump. Also, during a crash dump, only the panic thread is allowed to use kernel services, because the state of kernel data structures is suspect and concurrency is not safe. Thus the panic thread and helper CPUs must communicate using shared memory and spin locks only.

The speedup of parallel crash dump versus the serial dump depends on compression factor, CPU speed, and disk speed, but here are a few examples. These are "kernel only" dumps, and the dumpsize column below is the uncompressed kernel size. The disk is either a raw disk or a simple ZFS zvol, with no striping. Before is the time for a serial dump, and after is the time for a parallel dump, measured from the "halt -d" command to the last write to the dump device.

    system  NCPU  disk  dumpsize  compression  before  after  speedup
                          (GB)                 mm:ss   mm:ss
    ------  ----  ----  --------  -----------  -----   -----  -------
    M9000   512   zvol    90.6      42.0       28:30    2:03    13.8X
    T5440   256   zvol    19.4       7.2       27:21    4:29     6.1X
    x4450    16   raw      4.9       6.7        0:22    0:07     3.2X
    x4600    32   raw     14.6       3.1        3:47    1:46     2.1X
    

The higher compression, performed concurrently with I/O, gives a significant speedup, but we are still I/O limited, and future speedup will depend on improvements in I/O. For example, a striped zvol is not supported as a dump device, but that would help. You can use hardware raid to configure a faster dump device.

We also optimized live dumps, which are crash dumps that are generated with the "savecore -L" command, without stopping the system. This is useful for diagnosing systems that are misbehaving in some way but are still performing adequately, without interrupting service. We fixed a bug (CR 6878030) in which live dump writes were broken into 8K physical transfers, giving terrible I/O throughput. The live dumps could take hours on large systems, making them practically unusable. We obtained these speedups for various dump sizes:

    system disk  before   after  speedup
                 h:mm:ss  mm:ss
    ------ ----  -------  -----  -------
    M9000  zvol  3:28:10  15:06  13.8X
    T5440  zvol    23:29   1:19  18.6X
    T5440  raw      9:17   0:55  10.5X
    

Lastly, we optimized savecore, which runs when the system boots after a panic, and copies data from the dump device to a persistent file, such as /var/crash/hostname/vmdump.N. savecore is 3X to 5X faster because the dump device contents are more compressed, so there are fewer bytes to copy, and because it no longer uncompresses the dump by default. Leaving the dump compressed makes more sense if you need to send the vmdump across the internet. To uncompress by default, use "dumpadm -z off". To uncompress a vmdump.N file and produce vmcore.N and unix.N, which are required for using tools such as mdb, run the "savecore -f" command. We multi-threaded savecore to uncompress in parallel.

Having said all that, I sincerely hope you never see a crash dump on Solaris! The best way to reduce downtime is for the system to stay up. The Sun Ray server I am working on now has been up for 160 days since its previous planned downtime.

Monday Mar 14, 2011

Migrate

This blog is still active and I am adding this entry so my blog will be migrated.

Tuesday Aug 11, 2009

Wicked fast memstat

The memstat command is commonly used to diagnose memory issues on Solaris systems. It characterizes every physical page on a system and provides a high-level summary of memory usage, such as the following:

    # mdb -k
    > ::memstat
    Page Summary                Pages                MB  %Tot
    ------------     ----------------  ----------------  ----
    Kernel                     428086              3344    3%
    ZFS File Data               25006               195    0%
    Anon                     13992767            109318   85%
    Exec and libs                 652                 5    0%
    Page cache                  24979               195    0%
    Free (cachelist)             1809                14    0%
    Free (freelist)           1979424             15464   12%
    Total                    16452723            128536
    

However, memstat is horribly slow on large systems. Its running time grows as O(physmem * NCPU), and can take an hour or more on the largest systems, which makes it practically unusable. I have recently worked with Pavel Tatashin to optimize memstat, and if you use memstat, you will like the results.

memstat is an mdb command; see its source code in the file usr/src/cmd/mdb/common/modules/genunix/memory.c. For every page that memstat examines, it reads the page_t structure describing the page, and reads the vnode_t structure describing the page's identity. Each read of a kernel data structure is expensive - it is a system call; specifically, a pread() from the special file /dev/kmem. Max Bruning in his blog suggested the first optimization: rather than finding non-free pages through the page_hash[] and reading them one at a time, memstat should read dense arrays of page_t's from the memsegs. These include free pages which must be ignored, but it reduces the number of system calls and is a net win. Max reports more than a 2X speedup. This is a good start, but is just the tip of the iceberg.

The next big cost is reading the vnode_t per page. The key observation is that many pages point to the same vnode_t; thus, if we save the vnode_t in mdb when we first read it, we can avoid subsequent reads of the same vnode_t. In practice, there are too many vnode_t's on a production system to save every one, as this would greatly increase the memory consumption of mdb, so we implement a cache of up to 10000 vnode_t's, with LRU replacement, organized in a hash table for rapid lookup by vnode_t address. Also, we retain only the vn_flag field of each vnode_t to save space, since only the flag is needed to characterize a page's identity. The cache eliminates most vnode_t related reads, gaining another 2X in performance.
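
The cache is simple in shape; here is a sketch of the lookup side with illustrative structure names. Entries are keyed by the kernel address of the vnode_t, store only vn_flag, and are capped at 10000 with LRU eviction.

    /* Sketch of the vnode cache; not the actual mdb source. */
    #define VN_CACHE_MAX    10000
    #define VN_NBUCKETS     8192

    typedef struct vn_ent {
            uintptr_t addr;                 /* kernel address of the vnode_t */
            int vn_flag;                    /* all we keep per vnode */
            struct vn_ent *hnext;           /* hash chain */
            struct vn_ent *lprev, *lnext;   /* LRU list */
    } vn_ent_t;

    static vn_ent_t *vn_hash[VN_NBUCKETS];

    int
    vn_flag_lookup(uintptr_t addr, int *flagp)
    {
            vn_ent_t *e;

            for (e = vn_hash[(addr >> 8) % VN_NBUCKETS]; e != NULL; e = e->hnext) {
                    if (e->addr == addr) {
                            lru_touch(e);   /* move to MRU end; hypothetical */
                            *flagp = e->vn_flag;
                            return (1);     /* hit: no pread() needed */
                    }
            }
            /*
             * Miss: the caller preads the vnode_t from /dev/kmem, then
             * inserts (addr, vn_flag), evicting the LRU entry if the
             * cache already holds VN_CACHE_MAX entries (not shown).
             */
            return (0);
    }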

The next cost is a redundant traversal of the pages. memstat traverses and reads the pages twice, performing a slightly different accounting on the second traversal. We eliminated the second traversal and did all accounting on the first pass, gaining another 2X in performance.

The last big cost relates to virtual memory management, and is the reason that the running time grows as O(NCPU). The pread system call jumps to the kernel module for /dev/kmem, whose source code is in usr/src/uts/common/io/mem.c. For each read request, the code determines the physical address (PA), creates a temporary virtual address (VA) mapping to this address, copies the data from kernel to user space, and unmaps the VA. The unmap operation must be broadcast to all CPUs to make sure no CPU has the stale VA to PA translation in its TLB. To avoid this cost, we extended and leveraged a Solaris capability called Kernel Physical Mapping (KPM), in which all of physical memory is pre-assigned to a dedicated range of kernel virtual memory that is never mapped for any other purpose. Thus a KPM mapping never needs to be purged from the CPU TLB's, and the memstat running time is no longer a function of NCPU. This optimization yields an additional 10X or more speedup on large CPU count systems.

Finally, the punchline: the combined speedup from all optimizations is almost 500X in the best case, and memstat completes in seconds to minutes. Here are the memstat run times before versus after on various systems:

    platform  memory  NCPU   before   after   speedup
               (GB)           (sec)   (sec)
               ----   ---     -----   ----    -----
    X4600       32     32        19    1.5     13 X
    T5240       32    128       490    4.5    109 X
    T5440      128    256      3929    9.5    414 X
    M9000     4096    512     34143   70.5    484 X
    E25K       864    144      2612  181.5     14 X
    

(The E25K speedup is "only" 14X because it does not support our KPM optimization; KPM is more complicated on UltraSPARC IV+ and older processors due to possible VA conflicts in their L1 cache).

As a bonus, all mdb -k commands are somewhat faster on large CPU count systems due to the KPM optimization. For example, on a T5440 running 10000 threads, an mdb pipeline to walk all threads and print their stacks took 64 seconds before, and 27 seconds after.

But wait, there's more! Thanks to a suggestion from Jonathan Adams, we exposed the fast method of traversing pages via memsegs with a new mdb walker which you can use:

    > ::walk allpages
    

These optimizations are coming soon to a Solaris near you, tracked by the following CR:
6708183 poor scalability of mdb memstat with increasing CPU count
They are available now in OpenSolaris developer build 118, and will be in OpenSolaris 2010.02. They will also be in Solaris 10 Update 8, which is patch 141444-08 for SPARC and 141445-08 for x86.

Wednesday Jul 08, 2009

Lies, Damned Lies, and Stack Traces

The kernel stack trace is a critical piece of information for diagnosing kernel bugs, but it can be tricky to interpret due to quirks in the processor architecture and in optimized code. Some of these are well known: tail calls and leaf functions obscure frames, function arguments may live in registers that have been modified since entry, and so on. These quirks can cause you to waste time chasing the wrong problem if you are not careful.

Here is a less well known example to be wary of that is specific to SPARC kernel stacks. Use mdb to examine the panic thread in a kernel crash dump:

    > *panic_thread::findstack
    stack pointer for thread 30014adaf60: 2a10c548671
      000002a10c548721 die+0x98()
      000002a10c548801 trap+0x768()
      000002a10c548961 ktl0+0x64()
      000002a10c548ab1 hat_unload_callback+0x358()
      000002a10c548f21 segvn_unmap+0x2a8()
      000002a10c549021 as_free+0xf4()
      000002a10c5490d1 relvm+0x234()
      000002a10c549181 proc_exit+0x490()
      000002a10c549231 exit+8()
      000002a10c5492e1 syscall_trap+0xac()
    

This says that the thread did something bad at hat_unload_callback+0x358, which caused a trap and panic. But what does panicinfo show?

    > ::panicinfo
                 cpu              195
              thread      30014adaf60
             message BAD TRAP: type=30 rp=2a10c549210 addr=0 mmu_fsr=9
                  pc          1031360
    

The pc symbolizes to this:

    > 1031360::dis -n 0
    hat_unload_callback+0x3f8:      ldx       [%l4 + 0x10], %o3
    

Hmm, that is not the same offset that was shown in the call stack: 3f8 versus 358. Which one should you believe?

panicinfo is correct, and the call stack lies -- it is an artifact of the conventional interpretation of the o7 register in the SPARC architecture, plus a discontinuity caused by the trap. In the standard calling sequence, the pc is saved in the o7 register, the destination address is written to the pc, and the destination executes a save instruction that slides the register window and renames the o registers to i registers. A stack walker interprets the value of i7 in each window as the pc.

However, a SPARC trap uses a different mechanism for saving the pc, and does not modify o7. When the trap handler executes a save instruction, the o7 register contains the pc of the most recent call instruction. This is marginally interesting, but totally unrelated to the pc at which the trap was taken. The stack walker later extracts this value of o7 from the window and shows it as the frame's pc, which is wrong.

This particular stack lie only occurs after a trap, so you can recognize it by the presence of the Solaris trap function ktl0() on the stack. You can find the correct pc in a "struct regs" that the trap handler pushes on the stack at address sp+7ff-a0, where sp is the stack pointer for the frame prior to the ktl0(). From the example above, use the sp value to the left of hat_unload_callback:

    > 000002a10c548ab1+7ff-a0::print struct regs r_pc
    r_pc = 0x1031360
    

This works for any thread. If you are examining the panic thread, then the ::panicinfo command performs the calculation for you and shows the correct pc.

Thursday Jan 08, 2009

CPU to core mapping

A frequently asked question among users of CMT platforms is "How do I know which CPUs share a core?". For most users, the best answer is, "don't worry about it", because Solaris does a good job of assigning software threads to CPUs and spreading them across cores such that the utilization of hardware resources is maximized. However, knowledge of the mapping is helpful to users who want to explicitly manage the assignment of threads to CPUs and cores, to squeeze out more performance, using techniques such as processor set binding and interrupt fencing.

For some processors and configurations, the core can be computed as a static function of the CPU ID, but this is not a general or easy-to-use solution. Instead, Solaris exposes this in a portable way via the "psrinfo -pv" command, as shown in this example on an M5000 server:

    % psrinfo -pv
    
    The physical processor has 2 cores and 4 virtual processors (0-3)
      The core has 2 virtual processors (0 1)
      The core has 2 virtual processors (2 3)
        SPARC64-VI (portid 1024 impl 0x6 ver 0x90 clock 2150 MHz)
    The physical processor has 2 cores and 4 virtual processors (40-43)
      The core has 2 virtual processors (40 41)
      The core has 2 virtual processors (42 43)
        SPARC64-VI (portid 1064 impl 0x6 ver 0x90 clock 2150 MHz)
    

The numbers in parentheses are the CPU IDs, as known to Solaris and used in commands such as mpstat, psradm, etc. At this time, there are no supported programmatic interfaces to get this information.

Now for the confusing part. Unfortunately, "psrinfo -pv" only prints the core information on systems running OpenSolaris or Solaris Express, because psrinfo was enhanced by this CR:

    6316187 Need interface to determine core sharing by CPUs
which was never backported to a Solaris 10 update. I cannot predict when or whether this will be done. However, on Solaris 10, you can see core groupings using the unstable and less friendly kstat interface. Try this script, which I have named showcores:
    #!/bin/ksh
    kstat cpu_info | \
        egrep "cpu_info |core_id" | \
        awk \
            'BEGIN { printf "%4s %4s", "CPU", "core" } \
             /module/ { printf "\n%4s", $4 } \
             /core_id/ { printf "%4s", $2 } \
             END { printf "\n" }'
    

    % showcores
     CPU core
       0   0
       1   0
       2   2
       3   2
      40  40
      41  40
      42  42
      43  42
    

The core_id extracted from the kstats is arbitrary, but CPUs with the same core_id share a physical core. Beware that the name and semantics of kstats such as core_id are unstable interfaces, which means they are not documented, not supported, and are subject to change.

Sunday Nov 16, 2008

Measuring lock spin utilization

Lockstat(1M) is the Solaris tool of choice for identifying contended kernel locks that need to be optimized to improve system efficiency and scalability. However, until recently, you could not easily use lockstat to quantify the impact of hot locks on system utilization, or to state with certainty that a hot lock is limiting your performance. This could lead to wasted effort spent fixing a lock, with no performance benefit. For example, here is the output of lockstat showing a hot lock:

    # lockstat sleep 10
    Adaptive mutex spin: 9227825 events in 10.302 seconds (895755 events/sec)
    Count indv cuml rcnt     spin Lock                   Caller
    -------------------------------------------------------------------------
    569235   6%   6% 0.00       17 0x6009a065000          timeout_common+0x4
    ...
    

This says that some thread(s) could not immediately acquire the mutex 569235 times, so they had to busy-wait, and took an average of 17 retries (spins) to acquire the mutex. In absolute terms, is this bad? You cannot tell. Higher spin is worse than lower spin, but you cannot easily map this to time, and further, not all spins are created equal because of the exponential backoff algorithm in the mutex implementation.
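
For readers unfamiliar with backoff, the spin loop has roughly this shape; a simplified sketch, not the Solaris adaptive mutex source, which also spins only while the lock owner is running and scales the backoff cap with system size.

    #include <sys/atomic.h>

    #define SPIN_PAUSE()    /* cpu-relax hint; placeholder macro */

    int
    spin_acquire(volatile uint_t *lock)
    {
            int backoff = 1;
            int spins = 0;

            while (atomic_cas_uint(lock, 0, 1) != 0) {
                    spins++;                /* one "spin" as lockstat counts it */
                    /* The delay doubles on each failed retry, up to a cap. */
                    for (int i = 0; i < backoff; i++)
                            SPIN_PAUSE();
                    if (backoff < 4096)
                            backoff <<= 1;
            }
            return (spins);
    }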

The good news is that we have enhanced lockstat to show the spin time in Solaris 10 10/08 and OpenSolaris 2008.11. Now the output of lockstat looks like this:

    # lockstat sleep 10
    Adaptive mutex spin: 9181733 events in 10.135 seconds (905943 events/sec)
    Count indv cuml rcnt     nsec Lock                   Caller
    -------------------------------------------------------------------------
    557331   6%   6% 0.00    88454 0x60033047000          timeout_common+0x4
    ...
    

The new nsec column shows the average time spent busy-waiting and retrying to acquire a mutex. From this data, you can compute the CPU utilization spent busy-waiting using the formula:

    CPU_util = (Count * nsec) / (elapsed * ncpu * 1e9)

The elapsed time is shown in the header above, and you need to know the number of CPUs on your platform. In this example, ncpu=256, so we have:

    CPU_util = (557331 * 88454) / (10.135 * 256 * 1e9) = .019

Thus, 1.9% of all CPU time on the system was spent busy-waiting to acquire this lock, and if we eliminate this lock, we can recover 1.9% of the CPU cycles and perhaps increase our throughput by that much.

I have written a handy script called lockutil to do this calculation for you. lockutil reads a file produced by lockstat, and computes the CPU utilization spent busy-waiting for every lock in the file, including adaptive mutexes, spin locks, and thread locks. It can parse output produced by all of lockstat's myriad command line options, except for the -I profiling mode, which is not relevant. You must supply the CPU count. Example:

    # ... Start running a test ...
    # lockstat -kWP sleep 30 > lockstat_kWP.out
    # mpstat | wc -l
    513                       # includes one header line
    # cat lockstat_kWP.out
    Adaptive mutex spin: 1713116 events in 30.214 seconds (56700 events/sec)
    
    Count indv cuml rcnt     nsec Hottest Lock           Caller
    -------------------------------------------------------------------------------
    1170899  68%  68% 0.00 11444141 pcf                  page_create_wait
    202679  12%  80% 0.00   332996 pcf+0x380             page_free
    86959   5%  85% 0.00  8840242 pcf                    page_reclaim
    64565   4%  89% 0.00    25301 0x604ec13c650          ufs_lockfs_begin_getpage
    55914   3%  92% 0.00    24010 0x604ec13c650          ufs_lockfs_end
    29505   2%  94% 0.00     1284 0x70467650040          page_trylock
    24687   1%  95% 0.00     1845 0x70467650040          page_unlock
    ...
    #
    # lockutil -n 512 lockstat_kWP.out
     CPU-util  Lock                 Caller
        0.866  pcf                  page_create_wait
        0.004  pcf+0x380            page_free
        0.050  pcf                  page_reclaim
        0.000  0x604ec13c650        ufs_lockfs_begin_getpage
        0.000  0x604ec13c650        ufs_lockfs_end
        0.000  0x70467650040        page_trylock
        0.000  0x70467650040        page_unlock
        0.920  TOTAL                TOTAL
    

Ouch, busy-waiting for the pcf lock consumes 92% of the system! There's a lock that needs fixing if ever I saw one. We fixed it.

This raises the question: what can you do if you find a hot lock? If you are a kernel developer and the lock is in your code, then change your locking granularity or algorithm. If you are an application developer, then study the results carefully, because hot kernel locks sometimes indicate a non-scalable application architecture, such as too many processes using an IPC primitive such as a message queue or a fifo. Use the lockstat -s option to see the callstack for each hot lock and trace back to your system calls, or use the dtrace lockstat provider to aggregate lock spin time by system call made by your application. The lockstat provider has also been modified to report spin time rather than spin count in its probe arguments.

If you are an end-user, then search for the lock in the support databases, to see if a fix or tuning advice is available. A few resources are
http://sunsolve.sun.com
http://bugs.opensolaris.org
and public search engines. For example a google search for "lockstat timeout_common" finds this:
http://bugs.opensolaris.org/view_bug.do?bug_id=6311743
6311743 callout table lock contention in timeout and untimeout
which my group recently fixed, and will appear in coming Solaris releases. Advice to folks who file CRs: to make these searches more successful, include lockstat snippets in the publicly readable sections of a CR, such as "Description" and "Public Comments".

Monday Nov 03, 2008

Faster Firmware

What a difference firmware can make! We take it for granted, and as administrators we probably do not update our system's firmware as often as we should, but I was recently involved in a performance investigation where it made a huge difference.

On a 128 CPU T5240 server, the throughput of an application peaked around 90 processes, but declined as more processes were added, until at 128 processes the throughput was just 25% of its peak value. Classic and severe anti-scaling. The puzzling part was that the usual suspects were innocent. mpstat showed that 99% of the time was spent in usr mode, so no kernel issues; plockstat did not show any contended userland mutexes; cpustat did not show increases in cache misses, TLB misses, or any other counter per process; and a collector/analyzer profile did not show hot atomic functions or CAS operations. It did show a marked increase in the cost of the SPARC save instruction at function entry as the process count was raised. Curious.

We eventually upgraded the firmware, and the application scaled nicely up to 128 processes. If you want some advice and do not care about gory details, skip the next two paragraphs :)

It turns out that the hypervisor had a global lock that was limiting scalability, and the lock was eliminated by a firmware upgrade. Normally very little time is spent executing code in hyper-privileged mode on the Sun CMT servers. However, the hypervisor is responsible for maintaining "permanent" VA->PA mappings in the TLB. These mappings are used for the Solaris kernel nucleus, one 4MB mapping for text, and one 4MB mapping for data. Solaris cannot handle an MMU miss for these mappings, so when the processor traps to hypervisor for the miss, the hypervisor finds the mapping, stuffs it into the hardware TLB, and returns from the trap, so Solaris never sees the miss.

The above hypervisor action was protected in the old firmware by a single global lock. The application had a high TLB miss rate exceeding 200K/CPU/sec, so the permanent mappings were being continuously evicted - not an issue if we don't use the kernel much. But, the app also had a deep function stack, so it generated lots of spill/fill traps. These trap to kernel text, which is backed by a permanent mapping, which has been evicted, which causes a hypervisor trap, which hits the global lock. A perfect storm limiting scalability! Spill/fill is a lightweight trap that does not change accounting mode to sys, hence I did not see high sys time; instead, I saw high save time. In hindsight, I could have directly observed the instruction and stall cycles spent in hypervisor mode using:

# cpustat -s -c pic0=Instr_cnt,pic1=Idle_strands,hpriv,nouser 10 10

Should you care? This issue is specific to firmware in the CMT server line, and it depends on your model:

  • T5140,T5240 (2-socket 128-CPU): Definitely verify and upgrade your firmware if needed; get the latest version of patch 136936.
  • T5120,T5220 (1-socket, 64-CPU): I have not observed the scalability bottleneck on this smaller system, but you may get a small performance boost by upgrading; get the latest version of patch 136932.
  • T1000, T2000 (1-socket, 32-CPU) - probably not an issue, the system is too small.
  • T5440 (4-socket, 256-CPU): not an issue, as the first units shipped already contained a later version of the firmware containing the fix.

The CR is: 6669222 lock in mmu_miss can be eliminated to reduce contention
It was fixed in Sun System Firmware version 7.1.3.d.
To show the version of firmware installed on your system, log in to the service processor and verify you have version 7.1.3.d or later:

sc> showhost
Sun System Firmware 7.1.3.d 2008/07/11 08:55
...

To upgrade your firmware:

  1. Go to http://sunsolve.sun.com
  2. Click on the Patches and Updates link
  3. Type the patch number in the PatchFinder form (e.g., 136936 for the T5140 or T5240)
  4. Push the Find Patch button
  5. Click on the "Download Patch" link near the top.
  6. Unzip the download and refer to the Install.info file for instructions

If you have never upgraded firmware before, read the documentation and be careful!

Monday Oct 13, 2008

Solaris for the T5440

Sun announced the Sun SPARC Enterprise T5440 server today, the largest and most capable in a line of servers based on CoolThreads technology. It has four sockets with up to four UltraSPARC T2 Plus processors and 256 hardware execution strands, where each strand is represented as a CPU by the Solaris Operating System. How much work did it take to scale Solaris to handle 256 CPUs? Not much -- it just works. We did the heavy lifting for previous servers in this line, when we added support for the performance features of the T2 and T2 Plus processors, and when we optimized scalability for the 2-socket predecessor of the T5440. These features were designed to automatically scale with server size, and we are reaping the benefits of those designs with the T5440. To refresh your memory, here are some of the unique combinations of hardware and software that continue to deliver performance on the T5440:
  • Core Pipeline and Thread Scheduling.
    The Solaris scheduler spreads software threads across instruction pipelines, cores, and sockets for load balancing and optimal usage of processor resources.
  • Cache Associativity.
    The kernel automatically uses both hardware and software page coloring to improve effective L2 cache associativity and reduce cache conflicts.
  • Virtual Memory.
    The kernel automatically uses large memory pages and enables hardware tablewalk to reduce the cost of virtual to physical memory translation. Virtual to physical mappings are shared in hardware for processes that share memory.
  • Block Copy
    Solaris provides optimized versions of library functions to initialize and copy large blocks of memory.
  • Crypto Acceleration
    The T2 plus processor has a hardware crypto acceleration engine per core, thus the T5440 has 32 crypto engines.
  • Locking Primitives
    The kernel mutex is efficient for all system sizes, and it handles excessive contention well, using an exponential backoff algorithm that is parametrized based on system size. The algorithm automatically scales to newer, large systems. The kernel reader-writer lock and atomic operations such as atomic_add now have exponential backoff as well (new since my last posting).
  • Multi-threaded Resource Accounting
    Solaris performs resource accounting and limit enforcement in the clock function at regular intervals. The function is multi-threaded, and the number of threads is automatically scaled based on system size.
  • Memory Placement Optimization (MPO)
    Solaris allocates memory physically near a thread to minimize memory access latency. We added MPO support for the CMT line starting with the T5240 server, which has two sockets and two latency groups (collections of resources near each other). The T5440 has four sockets and four latency groups. The CPU and memory topology is provided to Solaris by the firmware in the form of an abstract graph, thus no changes were required in the Solaris kernel to support MPO on the T5440.
  • Performance Counters
    The cpustat command provides access to a variety of hardware performance counters that give visibility into the lowest level execution characteristics of a thread.
If you want more detail on the above, see my previous postings.

What do you see when logged into a T5440?

Here is the list of 256 CPUs. I have deleted some output for brevity:

% mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  420  1232    4  885   60  255  100    0  2862    2  88   0  11
  1    0   0  589   456   33  366   30   96   90    0   668    0  95   0   4
  2    0   0   34  2566    6 2199  128  594  221    1  7061    4  65   0  31
  ... 250 lines deleted ...
253    0   0  666  2111 1862  518   23   79   75    0  1011    1  54   0  45
254    0   0  788  1631 1521  200    6   17   48    0   139    0  87   0  13
255    0   0  589   138   33  173    4   14   48    0   130    0  87   0  13

Here is the grouping of those CPUs into sockets and cores. A virtual processor is a CPU. Again, I deleted some output.

% psrinfo -pv
The physical processor has 8 cores and 64 virtual processors (0-63)
  The core has 8 virtual processors (0-7)
  The core has 8 virtual processors (8-15)
  The core has 8 virtual processors (16-23)
  The core has 8 virtual processors (24-31)
  The core has 8 virtual processors (32-39)
  The core has 8 virtual processors (40-47)
  The core has 8 virtual processors (48-55)
  The core has 8 virtual processors (56-63)
    UltraSPARC-T2+ (clock 1414 MHz)

The physical processor has 8 cores and 64 virtual processors (64-127)
  The core has 8 virtual processors (64-71)
  ...
  The core has 8 virtual processors (120-127)
    UltraSPARC-T2+ (clock 1414 MHz)

The physical processor has 8 cores and 64 virtual processors (128-191)
  The core has 8 virtual processors (128-135)
  ...
  The core has 8 virtual processors (184-191)
    UltraSPARC-T2+ (clock 1414 MHz)

The physical processor has 8 cores and 64 virtual processors (192-255)
  The core has 8 virtual processors (192-199)
  ...
  The core has 8 virtual processors (248-255)
    UltraSPARC-T2+ (clock 1414 MHz)

Here is the lgroup configuration (again edited):

% lgrpinfo
lgroup 0 (root):
        Children: 1-4
        CPUs: 0-255
        Memory: installed 128G, allocated 3.9G, free 124G
lgroup 1 (leaf):
        Children: none, Parent: 0
        CPUs: 0-63
        Memory: installed 32G, allocated 774M, free 31G
lgroup 2 (leaf):
        Children: none, Parent: 0
        CPUs: 64-127
        Memory: installed 33G, allocated 1.1G, free 31G
lgroup 3 (leaf):
        Children: none, Parent: 0
        CPUs: 128-191
        Memory: installed 33G, allocated 1.2G, free 31G
lgroup 4 (leaf):
        Children: none, Parent: 0
        CPUs: 192-255
        Memory: installed 33G, allocated 918M, free 32G

256 is a Big Number.

Here is a visual way to grasp the capacity offered by 256 CPUs. Look at the output of mpstat, which shows execution statistics for every CPU on the system, and compare the capacity of the T5440 to that of the original T2000 server, which was revolutionary for the amount of throughput it delivered. I use a teeny-tiny font so we can see both servers on the same page.

Here is the T2000 mpstat output:



% mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  434  2588    9 2156  127  577  113    0  5845    3  70   0  27
  1    0   0  510   152   34   80    7   20   88    0   130    0  99   0   1
  2    0   0   31  2283    9 2071  111  523  107    0  5588    3  66   0  31
  3    0   0  344  1342   25 1122   62  273   86    0  2479    1  76   0  23
  4    0   0 1073     4    3    0    0    0    1    0     0    0 100   0   0
  5    0   0    0     1    0    0    0    0    0    0     0    0 100   0   0
  6    0   0  248  1371   23 1215   62  297  132    0  2913    2  71   0  27
  7    0   0  505    97   36   22    1    4   49    0    26    0  99   0   1
  8    0   0  462   471   29  446   30  112   80    0   761    0  92   0   8
  9    0   0  285   930   25  963   47  237   71    0  2031    1  77   0  22
 10    0   0  450   388   30  359   18   78   65    0   561    0  90   0   9
 11    0   0  504   129   40   54    3   11   64    0    71    0  98   0   2
 12    0   0  406   516   28  511   33  124   71    0   939    1  89   0  11
 13    0   0  263   734   21  777   38  180   75    0  1648    1  78   0  21
 14    0   0  206  1322   26 1331   59  280   80    0  2906    2  53   0  45
 15    0   0  506   220   33  194   11   30   74    0   226    0  93   0   7
 16    0   0  518   138   37   90    6   19   59    0   122    0  98   0   2
 17    0   0  174   296   14  358   18   80   48    0   750    2  88   0  10
 18    0   0  442   276   30  305   15   61   72    0   438    0  90   0  10
 19    0   0  396   335   28  392   17   74   47    0   603    0  86   0  14
 20    0   0  235   401   20  464   20   96   46    0   914    0  83   0  16
 21    0   0  454   118   27  106    3   15   57    0    94    0  96   0   4
 22    0   0  505   128   35   86    4   15   61    0   107    0  97   0   3
 23    0   0   17  1115    8 1279   41  244   64    0  3442    2  36   0  62
 24    0   0  505   140   35  119    5   23   52    0   186    0  96   0   4
 25    0   0  415   171   27  193    8   40   83    0   304    0  93   0   7
 26    0   0  232   485   26  669   26  132   47    0  1241    1  69   0  31
 27    0   0  155   275   15  367   14   67   35    0   638    0  78   0  22
 28    0   0  126   189   12  263    8   45   25    0   439    0  85   0  15
 29    0   0  464   123   29  114    5   15   50    0    89    0  95   0   5
 30    0   0  448   230   26  264    9   46   88    0   386    0  83   0  17
 31    0   0  447   158   26  178    6   24   57    0   182    0  90   0  10
 32    0   0  212   387   23  615   24  132   50    0  1143    1  79   0  21


Here is the T5440 mpstat output (I cut the 256 lines into 4 columns):



% mpstat
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  420  1232    4  885   60  255  100    0  2862    2  88   0  11             64    0   0  292   927   18 2055  102  478  354    0  6580    3  54   0  43            128    0   0  131   924    5 2074  102  480  264    0  6616    4  46   0  51            192    0   0  494   270   23  504   44  147   89    0  1009    1  93   0   6
  1    0   0  589   456   33  366   30   96   90    0   668    0  95   0   4             65    0   0   32   570    5 1345   42  252  174    0  4740    2  27   0  71            129    0   0  473   515   12 1133   44  190  234    0  2970    2  45   0  54            193    0   0  101   474   12  990   62  283   69    0  2812    2  84   0  14
  2    0   0   34  2566    6 2199  128  594  221    1  7061    4  65   0  31             66    0   0  130   308    7  699   21  110  130    0  2135    1  21   0  78            130    0   0  273   289    8  646   23   88  142    0  1742    1  28   0  71            194    0   0  143   866   14 1863  131  503  125    2  4848    3  68   0  29
  3    0   0  196  1865   21 1525   91  385  151    1  4178    2  69   0  29             67    0   0   24    91    7  191    3   33   61    0   780    0   6   0  94            131    0   0  342   104    8  203    7   26   87    0   548    0  23   0  76            195    0   0  294   613   19 1299   81  312  128    1  3085    2  68   0  30
  4    0   0 1508   757   31  671   41  165   84    0  1531    1  85   0  14             68    0   0   32   802    8 1800   66  354  173    0  5662    3  33   0  64            132    0   0  205   726   15 1571   69  316  185    0  4529    2  36   0  61            196    0   0  449   655   25 1353   96  338  140    2  2997    2  73   0  25
  5    0   0  247  1461   26 1246   71  317  126    0  3222    2  70   0  28             69    0   0    1    34    0   74    1   11    9    0   242    0   1   0  98            133    0   0  307   280    8  611   21   77  117    0  1357    1  26   0  73            197    0   0 1334  2044 1930  185   17   51   87    0   377    0  95   0   5
  6    0   0  144   757   14  630   38  158  177    0  1754    1  85   0  14             70    0   0    0     2    0    2    0    0    0    0     5    0   0   0 100            134    0   0  202    50    6   88    3    7   72    0    95    0  15   0  84            198    0   0  653  1467 1290  431   28  101   66    0   763    1  86   0  13
  7    0   0  550   294   29  213   12   43  311    0   347    0  94   0   6             71    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100            135    0   0   61    25    2   48    2    3   26    0    14    0   5   0  95            199    0   0  900  2202 1989  439   28   97   82    0   860    1  81   0  18
  8    0   0  452   512   25  520   36  121  542    0   935    1  88   0  11             72    0   0   13   869    2 1955   74  393  166    0  5862    3  35   0  62            136    0   0  194   849   12 1847   90  362  231    0  4604    2  41   0  57            200    0   0  632  1440 1142  632   46  157   94    0  1275    1  86   0  14
  9    0   0  386  1037   24 1040   55  258  773    0  2317    1  73   0  25             73    0   0    9   347    2  795   17  127   77    0  2526    1  14   0  84            137    0   0   92   422    6  937   28  158  108    0  2744    1  23   0  76            201    0   0 1060  1506 1433   74    5   14   83    0    90    0  98   0   2
 10    0   0  564   362   27  335   20   69  790    0   452    0  92   0   7             74    0   0   85    74   12  112    2   14   74    0   224    0   6   0  94            138    0   0  115   194    5  415   13   50   78    0   861    0  14   0  86            202    0   0  597  1802 1447  766   49  188   76    0  1624    1  77   0  22
 11    0   0  500   196   25  120    8   24  810    0   150    0  96   0   4             75    0   0    3     7    1    9    0    1    1    0    20    0   0   0 100            139    0   0  187    48    5   85    3    7   59    0    74    0  15   0  85            203    0   0  723  1788 1599  383   25   84   92    0   711    0  87   0  13
 12    0   0  460   429   25  418   23   92  828    0   734    1  87   0  12             76    0   0   27   558    6 1263   31  193  105    0  3742    2  23   0  75            140    0   0  242   512    7 1134   42  184  136    0  3121    2  38   0  61            204    0   0  876  1508 1347  297   19   67   91    0   530    0  90   0   9
 13    0   0  222  1292   23 1301   64  303  865    1  3286    2  57   0  41             77    0   0    0    13    0   30    0    4    3    0    90    0   1   0  99            141    0   0  310   237   10  493   19   44  169    0   586    0  24   0  76            205    0   0  260   162   16  303   16   59   46    0   451    0  88   0  12
 14    0   0  292   961   22  990   48  205  838    1  2220    1  62   0  37             78    0   0    0     1    0    1    0    0    0    0     0    0   0   0 100            142    0   0    0     2    0    2    0    0    0    0     5    0   0   0 100            206    0   0  214   497   23 1026   51  201   68    0  2176    1  53   0  46
 15    0   0  569   470   30  497   26   77  855    0   693    0  81   0  19             79    0   0    0     2    0    1    0    0    0    0     0    0   0   0 100            143    0   0   16     9    2   13    0    1    9    0    20    0   1   0  99            207    0   0    4   187    0  426   14   73   23    0  1221    1  69   0  30
 16    0   0  555   300   30  353   18   64  721    0   461    0  91   0   9             80    0   0   18   800    5 1803   57  350  153    0  5148    2  33   0  65            144    0   0  306   667    8 1480   59  261  197    0  3744    2  43   0  55            208    0   0  377   639   21 1329   84  309   94    1  3117    2  65   0  33
 17    0   0  189  1019   27 1266   56  281  574    2  2930    2  53   0  46             81    0   0    6   247    1  577   11   78   49    0  1714    1  10   0  89            145    0   0   90   403    4  888   30  123  143    0  2038    1  19   0  80            209    0   0  177   454   19  942   51  197   69    0  2165    1  63   0  36
 18    0   0  549   246   27  288   13   46  368    0   341    0  89   0  11             82    0   0    0     6    0   11    0    2    0    0    24    0   0   0 100            146    0   0    0    39    0   89    1   12    7    0   266    0   2   0  98            210    0   0  454   294   25  550   29   88   72    0   703    0  73   0  26
 19    0   0  453   375   27  440   18   69  133    1   596    0  77   0  23             83    0   0    0     2    0    1    0    0    0    0     0    0   0   0 100            147    0   0  168    54    5   99    5    7   49    0   115    0  11   0  89            211    0   0  475   226   25  407   18   60   91    0   512    0  78   0  22
 20    0   0  212   236   15  300   14   59   39    0   518    0  87   0  12             84    0   0    5   441    0 1014   26  151   86    0  2916    1  18   0  80            148    0   0  159   346    6  770   25  119  112    0  1872    1  24   0  75            212    0   0    9   450    0 1029   42  203   70    0  3063    2  48   0  50
 21    0   0  493   255   27  353   16   55   63    0   443    0  88   0  12             85    0   0    0    87    0  198    3   28   11    0   510    0   3   0  96            149    0   0  391   195   11  383   17   30  174    0   300    0  29   0  71            213    0   0  237   289   19  575   21   84   50    0   963    1  55   0  44
 22    0   0  599   123   31  101    4   14   55    0   110    0  96   0   4             86    0   0    0     2    0    2    0    0    0    0     2    0   0   0 100            150    0   0   97    31    3   55    3    4   43    0    23    0   7   0  93            214    0   0  577   124   30  152    5   13   65    0    89    0  92   0   7
 23    0   0    9   766    4  899   29  179   98    0  2915    2  25   0  74             87    0   0    0     2    0    1    0    0    0    0     0    0   0   0 100            151    0   0  305    61    8  108    5    8   77    0    46    0  21   0  79            215    0   0 1554  2411 2332  114    6   11  287    0    72    0  94   0   6
 24    0   0  514   203   31  302   12   58   65    0   483    0  88   0  11             88    0   0   14   819    3 1846   55  339  171    1  5556    2  34   0  64            152    0   0   18   772    5 1721   62  326  120    0  4856    2  32   0  66            216    0   0 1314  1971 1757  422   27   91  210    0   827    1  85   0  15
 25    0   0  531   239   27  338   15   66   70    2   580    0  85   0  15             89    0   0    2   205    0  476    8   72   50    0  1545    1   9   0  90            153    0   0  275   340    8  744   23   92  151    0  1668    1  27   0  73            217    0   0 1347  1847 1710  268   16   49  201    0   422    0  88   0  12
 26    0   0  152   284   15  427   20   90   40    0   864    1  81   0  18             90    0   0    0     5    0   10    0    2    1    0    30    0   0   0 100            154    0   0  106    43    2   87    2   10   28    0   242    0   8   0  92            218    0   0 1499  2019 1945   98    7   16  209    0   118    0  96   0   4
 27    0   0  223   466   23  698   26  126   61    1  1358    1  48   0  51             91    0   0    0     5    0   10    0    1    1    0    27    0   0   0 100            155    0   0   10    18    4   19    0    1    2    0    22    0   1   0  99            219    0   0 1397  1953 1792  327   18   55  220    1   521    0  81   0  19
 28    0   0  191   312   18  501   16   77   47    2   839    1  65   0  34             92    0   0    6   502    0 1149   29  171   94    0  3293    1  21   0  78            156    0   0    6   343    1  781   20  128   55    0  2324    1  14   0  85            220    0   0 1610  1891 1836   33    3    4  214    0    32    0  99   0   1
 29    0   0  517   168   26  225    9   27   53    0   213    0  87   0  13             93    0   0    1     8    0   15    0    2    0    0    35    0   0   0 100            157    0   0  425   465   31  953   30   62  174    1   586    0  27   0  73            221    0   0 2014  2318 2220  152    8   21  329    0   151    0  94   0   6
 30    0   0  523   145   25  183    5   24   80    0   232    0  86   0  13             94    0   0    0     1    0    1    0    0    0    0     0    0   0   0 100            158    0   0  331   569   23 1223   42   78  164    2   620    0  23   0  76            222    0   0 1324  2011 1824  390   17   53  195    0   498    0  67   0  33
 31    0   0  435   254   24  414   16   46   99    0   376    0  74   0  26             95    0   0    7     5    1    3    0    0    9    0     0    0   0   0 100            159    0   0  196   303   20  614   18   42   89    2   517    0  15   0  85            223    0   0  638    99   36   66    2    5   78    0    27    0  96   0   4
 32    0   0  250   392   24  706   27  145   69    1  1323    1  74   0  25             96    0   0   14   796    2 1808   56  341  140    0  5092    2  33   0  65            160    0   0  394   861   30 1831   94  290  210    2  2836    2  40   0  59            224    0   0  530   234   26  408   28   76  141    0   535    0  90   0   9
 33    0   0  340   353   27  617   25  111   66    0   919    1  74   0  26             97    0   0    7   181    1  420    6   57   47    0  1342    1   8   0  91            161    0   0  366   492   27 1017   39  115  206    1  1182    1  31   0  69            225    0   0  554   214   29  358   20   75   65    0   572    0  89   0  11
 34    0   0  448   271   25  472   20   70   64    0   549    0  81   0  19             98    0   0   19    36    6   53    1    5   26    0   110    0   2   0  98            162    0   0  383   426   27  874   27   62  114    2   662    0  23   0  76            226    0   0  258   444   21  913   39  178   58    0  1762    1  56   0  43
 35    0   0   11   407    6  712   18  125   63    1  2087    1  29   0  70             99    0   0    1     2    0    1    0    0    0    0     0    0   0   0 100            163    0   0    5    11    1   21    0    3    5    0    51    0   1   0  99            227    0   0  427   259   25  476   19   66   95    0   570    0  70   0  29
 36    0   0  565   162   31  225    8   30   72    0   203    0  90   0  10            100    0   0   25   503    7 1116   27  163   89    0  3018    1  20   0  78            164    0   0  249   590   29 1222   48  149  130    2  1832    1  25   0  74            228    0   0  429   249   24  461   22   76   64    0   632    0  80   0  19
 37    0   0  507   226   30  363   12   44   75    0   338    0  82   0  18            101    0   0    1     5    0    8    0    1    1    0    17    0   0   0 100            165    0   0  705   657   34 1421   56   94  415    2   569    0  44   0  56            229    0   0  457   161   23  266   13   40   67    0   318    0  89   0  11
 38    0   0  501   164   25  254    8   29   73    0   193    0  86   0  14            102    0   0   13     5    1    3    0    0   26    0     1    0   1   0  99            166    0   0    0     4    0    7    0    1    0    0    16    0   0   0 100            230    0   0    5   281    0  647   20  104   39    0  1882    1  45   0  54
 39    0   0  249   292   22  485   13   52   48    0   613    0  43   0  56            103    0   0    0     1    0    1    0    0    0    0     0    0   0   0 100            167    0   0    0     1    0    1    0    0    0    0     0    0   0   0 100            231    0   0  380   257   27  465   15   42   76    0   405    0  66   0  34
 40    0   0  249   404   26  792   36  176   69    1  1506    1  77   0  22            104    0   0   34   785    3 1761   57  332  189    0  4973    2  33   0  65            168    0   0  411   722   28 1528   73  201  204    4  1850    1  35   0  64            232    0   0  624   219   29  350   27   60  161    0   376    0  93   0   7
 41    0   0   28   607   15 1279   39  272   98    3  3387    2  49   0  50            105    0   0    2   191    0  436    8   62   53    0  1487    1   9   0  91            169    0   0  627  2105 1787  704   27   66  146    1   650    0  40   0  60            233    0   0  239   542   23 1124   53  250   80    0  2559    2  60   0  39
 42    0   0    7   197    6  419   10   73   27    0  1110    1  77   0  23            106    0   0    1     7    0   14    0    1    1    0    36    0   0   0 100            170    0   0  423  1924 1709  467   15   35   93    0   414    0  34   0  66            234    0   0  408   320   28  604   26   97   72    0   856    1  72   0  28
 43    0   0  212   356   22  691   21  104   63    0  1321    1  43   0  56            107    0   0    0     2    0    1    0    0    0    0     3    0   0   0 100            171    0   0 1174  1733 1486  536   18   29  346    2   256    0  73   0  26            235    0   0  475   209   27  358   14   50   81    0   427    0  81   0  19
 44    0   0   19   242    8  504   20  108   38    2  1194    1  80   0  19            108    0   0  541   467    0 1075   23  153   81    0  2948    2  21   0  76            172    0   0  554  1747 1326  918   34   97  152    1  1122    1  39   0  61            236    0   0  172   423   19  871   38  157   60    0  1888    1  58   0  40
 45    0   0   17   333    9  696   27  126   46    3  1731    1  69   0  30            109    0   0    0    10    0   21    0    2    3    0    77    0   1   0  99            173    0   0  494  1421 1133  607   18   43  114    1   495    0  29   0  71            237    0   0    4   379    0  856   27  125   51    0  2516    1  41   0  58
 46    0   0   18   320   13  670   17  115   43    2  1639    1  45   0  54            110    0   0    0     1    0    1    0    0    0    0     0    0   0   0 100            174    0   0  633  2471 2275  426   13   25  104    1   312    0  41   0  58            238    0   0  361   241   24  455   15   46   69    0   510    0  57   0  42
 47    0   0   13   168   14  298    6   44   24    0   745    0   6   0  94            111    0   0    1     2    0    2    0    0    0    0     1    0   0   0 100            175    0   0  310  1630 1614   22    0    1   13    0    33    0  19   0  81            239    0   0  488   158   26  252    6   20   68    0   203    0  81   0  19
 48    0   0   11   178    6  379   13   83   23    0   917    0  89   0  11            112    0   0   46   687    9 1513   49  273  134    0  4309    2  28   0  70            176    0   0  132  1806 1273 1194   45  247  123    0  3560    2  38   0  60            240    0   0  261   230   19  434   25   93   55    0   817    1  86   0  13
 49    0   0    8   164    6  345   11   67   27    1   861    0  86   0  13            113    0   0    2   131    0  309    7   45   29    0   981    0   6   0  94            177    0   0   94   270   21  523   15   64   63    1  1011    1  10   0  89            241    0   0  160   280   13  573   27  100   47    0  1116    1  76   0  24
 50    0   0    7   170    5  371   10   64   28    0   985    1  81   0  19            114    0   0   10     4    1    4    0    1    5    0     8    0   1   0  99            178    0   0  373   483   26 1018   32   66  163    3   618    0  25   0  75            242    0   0  350   364   23  709   34  118  101    0  1279    1  66   0  33
 51    0   0   18   295   10  634   15  142   52    1  1718    1  44   0  55            115    0   0   98    51    3   94    4    7   24    0    37    0   6   0  94            179    0   0  525   478   27  993   32   65  185    1   598    0  31   0  69            243    0   0  186   350   18  706   25  104   81    0  1434    1  37   0  62
 52    0   0   62   316    6  691   32  138   53    1  1786    1  78   0  21            116    0   0  361   385   10  830   28  106  212    0  1694    1  33   0  66            180    1   0  295   621   30 1269   52  141  149    3  1579    2  27   0  71            244    0   0    1    42    0   90    4   17    4    0   230    0  96   0   4
 53    0   0   78   372   15  785   25  138   53    0  2021    1  59   0  40            117    0   0    0     2    0    2    0    0    0    0     2    0   0   0 100            181    0   0  333   360   35  665   19   43  127    2   510    0  20   0  79            245    0   0   22    69    3  141    7   21   12    0   253    0  93   0   7
 54    0   0   15   307   12  648   18  102   49    1  1703    1  36   0  63            118    0   0  226    84    6  160    8   12   66    0    44    0  15   0  85            182    0   0  367   526   25 1119   39   72  171    2   583    0  25   0  75            246    0   0  300   238   28  439   11   55   69    0   633    0  43   0  56
 55    0   0   10   104    9  200    4   28   19    0   474    0   3   0  97            119    0   0   13     9    2    9    0    1   14    0     9    0   1   0  99            183    0   0  237   297   22  605   16   39   95    1   429    0  14   0  86            247    0   0 1054  1619 1562   53    2    3  112    1    23    0  98   0   2
 56    0   0   68   298   13  611   25  146   54    1  1534    1  83   0  16            120    0   0  159   512   12 1125   34  205  137    0  3040    1  27   0  72            184    0   0  114   730   25 1513   64  262  117    1  3579    2  32   0  66            248    0   0  756  1902 1667  475   29  104   82    0   961    1  84   0  16
 57    0   0   21   507   14 1084   44  226   74    0  2811    1  65   0  34            121    0   0  293   161   11  318    9   38  110    0   571    0  23   0  77            185    0   0  223   551   22 1153   44  110  174    3  1201    1  24   0  76            249    0   0  817  1758 1559  380   22   69   91    1   528    0  87   0  13
 58    0   0   17   410   14  883   26  166   73    0  2430    1  51   0  48            122    0   0  207    33    5   54    2    5   54    0    76    0  14   0  86            186    0   0  384   384   23  808   23   58  129    1   688    0  24   0  76            250    0   0  916  1939 1738  396   21   72   88    0   639    0  83   0  16
 59    0   0   65   284   13  594   11  100   48    2  1663    1  32   0  68            123    0   0  301    70    7  135    6    9   70    0    36    0  18   0  82            187    0   0   93   260   12  529   15   32   78    1   336    0   9   0  91            251    0   0  922  1518 1292  464   20   66   95    1   635    0  76   0  24
 60    0   0  131   348   15  724   27  141   59    3  1707    1  70   0  29            124    0   0   81   268    3  609   15   90   89    0  1665    1  18   0  81            188    0   0  226   534   26 1101   42  127  107    0  1462    1  22   0  77            252    0   0  689  1684 1557  248   13   42   48    0   365    0  87   0  13
 61    0   0   86   376   17  777   26  140   64    2  1869    1  52   0  48            125    0   0  112    22    2   39    1    2   23    0    43    0   7   0  93            189    0   0   73   138   20  226    6   17   29    1   291    0   5   0  94            253    0   0  666  2111 1862  518   23   79   75    0  1011    1  54   0  45
 62    0   0  169   278   19  541   18   71   78    2  1000    1  41   0  58            126    0   0    0     2    0    2    0    0    0    0     1    0   0   0 100            190    0   0  281   455   22  955   33   59  142    4   580    0  19   0  81            254    0   0  788  1631 1521  200    6   17   48    0   139    0  87   0  13
 63    0   0  197   119   18  186    6   20   30    0   192    0  32   0  68            127    0   0  197    66    6  127    6    9   83    0    35    0  14   0  86            191    0   0  594   472   24  983   34   59  299    1   442    0  44   0  56            255    0   0  589   138   33  173    4   14   48    0   130    0  87   0  13


Wow! Another leap forward in capacity.

For more information on the T5440, see Allan Packer's blog index.

Wednesday Apr 09, 2008

Scaling Solaris on Large CMT Systems

The Solaris Operating System is very effective at managing systems with large numbers of CPUs. Traditionally, these have been SMPs such as the Sun Fire(TM) E25K server, but these days it is CMT systems that are pushing the limits of Solaris scalability. The Sun SPARC(R) Enterprise T5140/T5240 Server, with 128 hardware strands that each behave as an independent CPU, is a good example. We continue to optimize Solaris to handle ever larger CPU counts, and in this posting I discuss a number of recent optimizations that enable Solaris to scale well on the T5140 and other large systems.

The Clock Thread

Clock is a kernel function that by default runs 100 times per second on the lowest numbered CPU in a domain and performs various housekeeping activities. This includes time adjustment, processing pending timeouts, traversing the CPU list to find currently running threads, and performing resource accounting and limit enforcement for the running threads. On a system with more CPUs, the CPU list traversal takes longer, and can exceed 10 ms, in which case clock falls behind, timeout processing is delayed, and the system becomes less responsive. When this happens, the mpstat command will show sys time approaching 100% on CPU 0. This is more likely for memory-intensive workloads on CMT systems with a shared L2$, as the increased L2$ miss rate further slows the clock thread.

We fixed this by multi-threading the clock function. Clock still runs at 100 Hz, but it divides the CPU list into sets, and cross calls a helper CPU to perform resource accounting for each set. The helpers are rotated so that over time the load is finely and evenly distributed over all CPUs; thus, what had been, for example, a 70% load on CPU 0 becomes a less than 1% load on each of 128 CPUs in a T5140 system. CPU 0 will still have a somewhat higher %sys load than the other CPUs, because it is solely responsible for some functions such as timeout processing.
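
To illustrate the shape of the fix, here is a hedged user-level sketch of the partitioning idea, using pthreads in place of kernel cross calls (compile with -lpthread). The set count, CPU count, and function names are invented for the example.

#include <pthread.h>
#include <stdio.h>

#define NCPU    128     /* CPUs in the domain (illustrative) */
#define NSETS   8       /* helper threads per tick (illustrative) */

/* Stand-in for quanta expiration and resource accounting for one CPU. */
static void
account_cpu(int cpu)
{
        (void)cpu;
}

/* Each helper handles one contiguous set of CPUs. */
static void *
clock_helper(void *arg)
{
        int set = (int)(long)arg;
        int per_set = NCPU / NSETS;

        for (int cpu = set * per_set; cpu < (set + 1) * per_set; cpu++)
                account_cpu(cpu);
        return (NULL);
}

int
main(void)
{
        pthread_t tid[NSETS];

        /* One "tick": fan the accounting out across NSETS helpers.
         * Rotating which CPUs run the helpers on successive ticks
         * spreads the residual load evenly over the whole machine. */
        for (long s = 0; s < NSETS; s++)
                pthread_create(&tid[s], NULL, clock_helper, (void *)s);
        for (int s = 0; s < NSETS; s++)
                pthread_join(tid[s], NULL);
        printf("tick complete\n");
        return (0);
}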

Memory Placement Optimization (MPO)

The T5140 server in its default configuration is a NUMA (non-uniform memory access) system, a common architectural strategy for building larger systems. Each server has two physical UltraSPARC(R) T2 Plus processors, and each processor has 64 hardware strands (CPUs). The 64 CPUs on a processor access memory controlled by that processor at a lower latency than memory controlled by the other processor. The physical address space is interleaved across the two processors at a 1 GB granularity. Thus, an operating system that is aware of CPU and memory locality can arrange for software threads to allocate memory near the CPU on which they run, minimizing latency.

Solaris does exactly that, and has done so on various platforms since Solaris 9, using the Memory Placement Optimization framework, aka MPO. However, enabling the framework on the T5140 was non-trivial due to the virtualization of CPUs and memory in the sun4v architecture. We extended the hypervisor layer by adding locality arcs in the physical resource graph, and ensured that these arcs were preserved when a subset of the graph was extracted, virtualized, and passed to the Solaris guest at Solaris boot time.

Here are a few details on the MPO framework itself. Each set of CPUs and "near" memory is called a locality group, or lgroup; this corresponds to a single T2 Plus processor on the T5140. When a thread is created, it is assigned a home lgroup, and the Solaris scheduler tries to run the thread on a CPU in its home lgroup whenever possible. Thread-private memory (e.g., stack, heap, anonymous pages) is allocated from the home lgroup whenever possible. Shared memory (e.g., SysV shared memory) is striped across lgroups at page granularity. For more details on Solaris MPO, including commands and interfaces to control and observe lgroups and local memory, such as lgrpinfo, pmap -L, liblgrp, and madvise, see the man pages and this presentation.
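
As a small illustration of the observability interfaces, here is a program that uses liblgrp to report how many lgroups are visible and which lgroup is the calling thread's home. This is a minimal sketch assuming the standard lgrp_init(3LGRP) and lgrp_home(3LGRP) interfaces; compile with -llgrp on Solaris.

#include <sys/lgrp_user.h>
#include <sys/procset.h>
#include <stdio.h>

int
main(void)
{
        /* Ask for the lgroup hierarchy as seen by this caller. */
        lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_CALLER);
        if (cookie == LGRP_COOKIE_NONE) {
                perror("lgrp_init");
                return (1);
        }

        printf("lgroups visible: %d\n", lgrp_nlgrps(cookie));
        printf("my home lgroup:  %d\n",
            (int)lgrp_home(P_LWPID, P_MYID));

        lgrp_fini(cookie);
        return (0);
}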

If an application is dominated by stall time due to memory references that miss in cache, then MPO can theoretically improve performance by as much as the ratio of remote to local memory latency, which is about 1.5 : 1 on the T5140. The STREAM benchmark is a good example; our early experiments with MPO yielded a 50% improvement in STREAM performance. See Brian's blog for the latest optimized results. Similarly, if an application is limited by global coherency bandwidth, then MPO can improve performance by reducing global coherency traffic, though this is unlikely on the T5140 because the local memory bandwidth and the global coherency bandwidth are well balanced.

Thread Scheduling

In my posting on the UltraSPARC T2 processor, I described how Solaris threads are spread across cores and pipelines to balance the load and maximize hardware resource usage. Since the T2 Plus is identical to the T2 in this area, these scheduling heuristics continue to be used for the T5140, but are augmented by scheduling at the lgroup level. Thus, independent software threads are first spread across processors, then across cores within a processor, then across pipelines within a core.
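
Here is an illustrative sketch of that hierarchy: choose the least-loaded processor (lgroup), then the least-loaded core within it, then the least-loaded pipeline within that core. The structures and the load metric are invented for the example; the real dispatcher considers much more, such as a thread's home lgroup and cache affinity.

#include <stddef.h>

#define NPROC   2       /* sockets on a T5140 */
#define NCORE   8       /* cores per socket */
#define NPIPE   2       /* integer pipelines per core */

/* Invented load metric: runnable threads per pipeline. */
static int load[NPROC][NCORE][NPIPE];

static int
least_loaded(const int *v, int n)
{
        int best = 0;
        for (int i = 1; i < n; i++)
                if (v[i] < v[best])
                        best = i;
        return (best);
}

/* Place a new thread by walking down the resource hierarchy. */
void
place_thread(int *proc, int *core, int *pipe)
{
        int psum[NPROC] = { 0 }, csum[NCORE] = { 0 };

        for (int p = 0; p < NPROC; p++)
                for (int c = 0; c < NCORE; c++)
                        for (int q = 0; q < NPIPE; q++)
                                psum[p] += load[p][c][q];
        *proc = least_loaded(psum, NPROC);

        for (int c = 0; c < NCORE; c++)
                for (int q = 0; q < NPIPE; q++)
                        csum[c] += load[*proc][c][q];
        *core = least_loaded(csum, NCORE);

        *pipe = least_loaded(load[*proc][*core], NPIPE);
        load[*proc][*core][*pipe]++;    /* account for the new thread */
}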

The Kernel Adaptive Mutex

The mutex is the basic locking primitive in Solaris. We have optimized the mutex for large CMT systems in several ways.

The kernel mutex implementation is adaptive: a waiter will busy-wait if the software thread that owns the mutex is running, on the supposition that the owner will release it soon, and will yield the CPU and sleep if the owner is not running. To determine whether a mutex owner is running, the code previously traversed all CPUs looking for the owner thread, rather than simply examining the owner thread's state, to avoid a race with threads being freed. This O(NCPU) algorithm was costly on large systems, and we replaced it with a constant-time algorithm that is safe with respect to threads being freed.

Waiters attempt to acquire a mutex using a compare-and-swap (cas) operation. If many waiters continuously attempt cas on the same mutex, then a queue of requests builds in the memory subsystem, and the latency of each cas becomes proportional to the number of requests. This dramatically reduces the rate at which the mutex can be acquired and released, and causes negative scaling for the higher level code that uses the mutex. The fix is to space out the cas requests over time, such that a queue never builds, by forcing the waiters to busy-wait for a fixed period after a cas failure. The period increases exponentially after repeated failures, up to a maximum that is proportional to the number of CPUs, which is the upper bound on the number of actively waiting threads. Further, in the busy-wait loop, we use long-latency, low-impact operations, so the busy CPU consumes very little of the execution pipeline, leaving more cycles available to other strands sharing the pipeline.
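
Here is a hedged sketch of how the adaptive decision and the backoff combine in a lock acquire path, again using C11 atomics. The owner-running check is a stub standing in for the constant-time test described above, and the tunables are illustrative, not the kernel's.

#include <stdatomic.h>
#include <sched.h>

typedef struct {
        _Atomic(void *) owner;          /* NULL when the lock is free */
} amutex_t;

/* Stub for the constant-time owner-running test; always spin here. */
static int
owner_is_running(void *owner)
{
        (void)owner;
        return (1);
}

void
amutex_enter(amutex_t *m, void *self)
{
        unsigned backoff = 1;
        const unsigned backoff_max = 4096;      /* would scale with NCPU */

        for (;;) {
                void *expected = NULL;
                if (atomic_compare_exchange_strong(&m->owner, &expected,
                    self))
                        return;                 /* acquired */

                /* Space out cas retries so requests never queue up in
                 * the memory subsystem. */
                for (volatile unsigned i = 0; i < backoff; i++)
                        ;
                if (backoff < backoff_max)
                        backoff <<= 1;

                /* Adaptive part: spin only while the owner is running;
                 * otherwise give up the CPU (the kernel blocks on a
                 * turnstile instead of yielding). */
                if (!owner_is_running(expected))
                        sched_yield();
        }
}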

To be clear, any application that causes many waiters to contend for the same mutex has an inherent scalability bottleneck, and ultimately needs to be restructured for optimal scaling on large servers. However, the mutex optimizations above allow such apps to scale to perhaps 2X or 3X as many CPUs as they otherwise would, and to degrade gracefully under load rather than tip over into negative scaling.

Availability

All of the enhancements described herein are available in OpenSolaris, and will be available soon in updates to Solaris 10. The MPO and scheduling enhancements for the T5140 will be available in Solaris 10 4/08, and the clock and mutex enhancements will be released soon after in a KU patch.
