Tuesday Apr 02, 2013

Massive Solaris Scalability for the T5-8 and M5-32, Part 1

How do you scale a general-purpose operating system to handle a single system image with thousands of CPUs and tens of terabytes of memory? You start with the scalable Solaris foundation. You use superior tools such as DTrace to expose issues, quantify them, and extrapolate to the future. You pay careful attention to computer science, data structures, and algorithms when designing fixes. You implement fixes that automatically scale with system size, so that once exposed, an issue never recurs in future systems, and the set of issues you must fix in each larger generation steadily shrinks.

The T5-8 has 8 sockets, each containing 16 cores of 8 hardware strands each, which Solaris sees as 1024 CPUs to manage. The M5-32 has 1536 CPUs and 32 TB of memory. Both are many times larger than the previous generation of Oracle T-class and M-class servers. Solaris scales well on that generation, but every leap in size exposes previously benign O(N) and O(N^2) algorithms that explode into prominence on the larger system, consuming excessive CPU time, memory, and other resources, and limiting scalability. To find these, knowing what to look for helps. Most OS scaling issues can be categorized as CPU issues, memory issues, device issues, or resource shortage issues.
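
To make the "explode into prominence" point concrete, here is a small back-of-the-envelope sketch. The T5-8 figures come from the paragraph above; the 256-CPU baseline is a hypothetical smaller system chosen only for illustration, and the function names are made up for this example.

```python
# Strand counts for the T5-8 as described above.
SOCKETS, CORES_PER_SOCKET, STRANDS_PER_CORE = 8, 16, 8


def cpu_count(sockets, cores, strands):
    """Number of CPUs the OS must manage."""
    return sockets * cores * strands


def relative_cost(n_old, n_new, exponent):
    """Growth factor of an O(N**exponent) algorithm when N grows from n_old to n_new."""
    return (n_new / n_old) ** exponent


T5_8_CPUS = cpu_count(SOCKETS, CORES_PER_SOCKET, STRANDS_PER_CORE)
BASELINE_CPUS = 256  # ASSUMPTION: hypothetical smaller system, for illustration only

if __name__ == "__main__":
    print(T5_8_CPUS)                                    # 1024
    print(relative_cost(BASELINE_CPUS, T5_8_CPUS, 1))   # 4.0
    print(relative_cost(BASELINE_CPUS, T5_8_CPUS, 2))   # 16.0
```

Under these assumptions, a 4x jump in CPU count makes an O(N) path 4x more expensive and an O(N^2) path 16x more expensive, which is how code that was harmless on the smaller machine can show up at the top of a profile on the larger one.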

CPU scaling issues include:

  • increased lock contention at higher thread counts
  • O(NCPU) and worse algorithms

Lock contention is addressed using fine-grained locking based on domain decomposition or hashed lock arrays, and the number of locks is automatically scaled with NCPU for a future-proof solution. O(NCPU^2) algorithms are often the result of naive data structures, or of interactions between subsystems each of which does O(N) work, and once recognized can be recoded easily enough with an adequate supply of caffeine. O(NCPU) algorithms are often the result of a single thread managing resources that grow with machine size, and the solution is to apply parallelism. A good example is the use of vmtasks for shared memory allocation.
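
The hashed lock array idea can be sketched in a few lines. This is a user-level Python illustration, not Solaris kernel code: the class name, the `locks_per_cpu` ratio, and the counter table it protects are all assumptions made for the example. The point is only that the number of locks is derived from the CPU count instead of being a fixed constant.

```python
import os
import threading


class StripedCounterTable:
    """Counters protected by a hashed array of locks rather than one global lock."""

    def __init__(self, ncpu=None, locks_per_cpu=4):
        # ASSUMPTION: locks_per_cpu=4 is an arbitrary illustrative ratio.
        ncpu = ncpu or os.cpu_count() or 1
        self.nlocks = ncpu * locks_per_cpu  # lock count scales with NCPU
        self._locks = [threading.Lock() for _ in range(self.nlocks)]
        self._buckets = [{} for _ in range(self.nlocks)]

    def _index(self, key):
        # Hash the key to pick which lock (and bucket) covers it.
        return hash(key) % self.nlocks

    def increment(self, key, amount=1):
        i = self._index(key)
        with self._locks[i]:  # contend only with keys hashing to the same slot
            bucket = self._buckets[i]
            bucket[key] = bucket.get(key, 0) + amount

    def get(self, key):
        i = self._index(key)
        with self._locks[i]:
            return self._buckets[i].get(key, 0)
```

Two threads updating different keys will usually hash to different locks and not contend at all, and because `nlocks` grows with the CPU count, the expected contention per lock stays roughly flat as the machine gets bigger.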

Memory scaling issues include:

  • working sets that exceed VA translation caches
  • unmapping translations in all CPUs that access a memory page
  • O(memory) algorithms
  • memory hotspots

Virtual-to-physical address translations are cached at multiple levels in hardware and software, from TLB through TSB and HME on SPARC. A miss in the smaller lower-level caches requires a more costly lookup at the higher level(s). Solaris maximizes the span of each cache and minimizes misses by supporting shared MMU contexts, a range of hardware page sizes up to 2 GB, and the ability to use large pages in every type of memory segment: user, kernel, text, data, private, shared. Solaris uses a novel hardware feature of the T5 and M5 processors to unmap memory on a large number of CPUs efficiently. O(memory) algorithms are fixed using parallelism. Memory hotspots are fixed by avoiding false sharing and spreading data structures across caches and memory controllers.
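
A quick calculation shows why page size matters so much for the span of a translation cache. The sketch below compares the 8 KB SPARC base page with the 2 GB large page mentioned above. The 128-entry cache size is a hypothetical figure picked for illustration, not the size of any real T5 or M5 structure, and the helper names are invented for this example.

```python
KB, MB, GB, TB = 1 << 10, 1 << 20, 1 << 30, 1 << 40


def cache_span(entries, page_size):
    """Bytes of virtual address space covered by a translation cache."""
    return entries * page_size


def pages_needed(working_set, page_size):
    """Translations required to map a working set (rounded up)."""
    return -(-working_set // page_size)


ENTRIES = 128  # ASSUMPTION: hypothetical translation-cache size, illustration only

if __name__ == "__main__":
    for page_size in (8 * KB, 2 * GB):
        print(page_size,
              cache_span(ENTRIES, page_size),
              pages_needed(32 * TB, page_size))
```

With these numbers, 128 entries of 8 KB pages span only 1 MB, while 128 entries of 2 GB pages span 256 GB. Mapping all 32 TB of an M5-32 takes about 4.3 billion 8 KB translations but only 16,384 at 2 GB.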

Device scaling issues include:

  • O(Ndevice) and worse algorithms
  • system bandwidth limitations
  • lock contention in interrupt threads and service threads

The O(N) algorithms tend to be hit during administrative actions such as system boot and hot plug, and are fixed with parallelism and improved data structures. System bandwidth is maximized by spreading devices across PCI roots and system boards, by spreading DMA buffers across memory controllers, and by co-locating DMA buffers with either the producer or consumer of the data. Lock contention is a CPU scaling issue.
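
The "fix it with parallelism" pattern that recurs above usually means splitting an O(N) loop into contiguous chunks and handing one chunk to each worker. The following is a minimal sketch of that decomposition, assuming a hypothetical per-item `init_one` callback. It is not how Solaris implements vmtasks or device attach, and Python threads show the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(n_items, n_workers):
    """Split range(n_items) into at most n_workers contiguous, near-equal ranges."""
    base, extra = divmod(n_items, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # first `extra` chunks get one more
        if size:
            ranges.append((start, start + size))
        start += size
    return ranges


def parallel_init(items, init_one, n_workers):
    """Apply init_one to every item, one contiguous chunk per worker."""
    def run(bounds):
        lo, hi = bounds
        return [init_one(items[i]) for i in range(lo, hi)]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        chunks = list(pool.map(run, chunk_ranges(len(items), n_workers)))
    # pool.map preserves chunk order, so results line up with the input.
    return [result for chunk in chunks for result in chunk]
```

If `n_workers` is itself derived from the CPU count, the elapsed time of the loop stays roughly constant as both the item count and the machine grow together, which is the property that keeps the issue from coming back on the next, larger system.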

Resource shortages occur when too many CPUs compete for a finite set of resources. Sometimes the resource limit is artificial and defined by software, such as for the maximum process and thread count, in which case the fix is to scale the limit automatically with NCPU. Sometimes the limit is imposed by hardware, such as for the number of MMU contexts, and the fix requires more clever resource management in software.
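
For the software-defined case, scaling a limit with NCPU can be as simple as the sketch below. The function name and the constants are illustrative assumptions, not actual Solaris tunables or defaults.

```python
def scaled_limit(ncpu, per_cpu, floor, ceiling=None):
    """Derive a software limit from the CPU count instead of hard-coding it."""
    limit = max(floor, ncpu * per_cpu)  # never drop below the historical minimum
    if ceiling is not None:
        limit = min(limit, ceiling)     # optional hard cap, e.g. an ID-space size
    return limit


# ASSUMPTION: illustrative constants only, not Solaris tunables.
PER_CPU_THREADS = 512
MIN_THREADS = 30000

if __name__ == "__main__":
    print(scaled_limit(16, PER_CPU_THREADS, MIN_THREADS))    # 30000
    print(scaled_limit(1024, PER_CPU_THREADS, MIN_THREADS))  # 524288
```

Small systems keep the old floor, while a 1024-CPU system gets a proportionally larger limit without anyone having to retune it.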

Next time I will provide more details on new Solaris improvements in all of these areas that enable superior performance and scaling on T5 and M5 systems. Stay tuned.

About

Steve Sistare
