Massive Solaris Scalability for the T5-8 and M5-32, Part 2
By Steve Sistare on Apr 05, 2013
Last time, I outlined the general issues that must be addressed to achieve operating system scalability. Next I will provide more detail on what we modified in Solaris to reach the M5-32 scalability level. We worked in most of the major areas of Solaris, including Virtual Memory, Resources, Scheduler, Devices, Tools, and Reboot. Today I cover VM and resources.
When a page of virtual memory is freed, the virtual-to-physical address translation must be deleted from the MMU of every CPU that may have accessed the page. On Solaris, this is implemented by posting a software interrupt known as an xcall to each target CPU. This "TLB shootdown" operation poses one of the thorniest scalability challenges in the VM area, because a single-threaded process may have migrated and run on all the CPUs in a domain, and a multi-threaded process may run threads on all CPUs concurrently. This is a frequent cause of sub-optimal scaling when porting an application from a small to a large server, for a wide variety of systems and vendors.
The T5 and M5 processors provide hardware acceleration for this operation. A single PIO write (an ASI write in SPARC parlance) can demap a VA in all cores of a single socket. Solaris need only send an xcall to one CPU per socket, rather than sending an xcall to every CPU. This achieves a 48X reduction in xcalls on M5-32, and a 128X reduction in xcalls on T5-8, for mappings such as kernel pages that are used on every CPU. For user page mappings, one xcall is sent to each socket on which the process runs. The net result is that the cost of demap operations in dynamic memory workloads is not measurably higher on large T5 and M5 systems than on small ones.
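To make the arithmetic concrete, here is a small sketch that only counts xcalls under the two schemes. It does not model the hardware demap itself. The function names are mine, and it assumes CPU IDs are numbered contiguously within each socket, with 48 CPUs per M5 socket and 128 per T5 socket.

```python
# Illustrative model only: counts xcalls, does not model the hardware demap.
# Assumption: CPU IDs are numbered contiguously within each socket.

def xcalls_per_cpu(cpus_used):
    """One xcall for every CPU that may hold the translation."""
    return len(set(cpus_used))

def xcalls_per_socket(cpus_used, cpus_per_socket):
    """One xcall for every socket containing at least one such CPU."""
    return len({cpu // cpus_per_socket for cpu in cpus_used})

# Kernel mapping used on every CPU of an M5-32 (32 sockets x 48 CPUs).
m5_cpus = range(32 * 48)
m5_ratio = xcalls_per_cpu(m5_cpus) // xcalls_per_socket(m5_cpus, 48)

# The same mapping on a T5-8 (8 sockets x 128 CPUs).
t5_cpus = range(8 * 128)
t5_ratio = xcalls_per_cpu(t5_cpus) // xcalls_per_socket(t5_cpus, 128)

# User mapping for a process that has run on CPUs 0, 1 and 50 of an M5-32:
# CPUs 0 and 1 share socket 0, CPU 50 is on socket 1.
user_xcalls = xcalls_per_socket([0, 1, 50], 48)

print(m5_ratio, t5_ratio, user_xcalls)
```

The reduction factor for an all-CPU mapping is simply the number of CPUs per socket, which is why the T5-8 sees a larger factor than the M5-32 even though it is the smaller machine.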
The VM2 project re-implemented the physical page management layer in Solaris 11.1, and offers several scalability benefits. It manages a large page as a single unit, rather than as a collection of contained small pages, which reduces the cost of allocating and freeing large pages. It predicts the demand for large pages and proactively defragments physical memory to build more, reducing delays when an application page faults and needs a large page. These enhancements make it practical for Solaris to use a range of large page sizes, in every segment type, which maximizes run-time efficiency of large memory applications. VM2 also allows kernel memory to be allocated near any socket. Previously, kernel memory was confined to a single "kernel cage" occupying one physically contiguous region. That region often fit in the memory connected to a single socket, which could then become a memory hot spot for kernel-intensive workloads. Spreading kernel memory across sockets reduces hot spots, and also allows kernel data such as DMA buffers to be allocated near threads or devices for lower latency and higher bandwidth.
The VM system manages certain resources on a per-domain basis, in units of pages. These include swap space, locked memory, and reserved memory, among others. These quantities are adjusted whenever a page is allocated, freed, locked, or unlocked. Each is represented by a global counter protected by a global lock. The lock hold times are small, but at some CPU count the locks become bottlenecks. How does one scale a global counter? Using a new data structure I call the Credit Tree, which provides O(K * log(NCPU)) allocation performance with a very small constant K. I will describe it in a future posting. We replaced the VM system's global counters with credit trees in Solaris 11.1, and achieved a 45X speedup on an mmap() microbenchmark on T4-4 with 256 CPUs. This is good for the Oracle database, because it uses mmap() and munmap() to dynamically allocate space for its per-process PGA memory.
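To show why a single locked counter is the problem, here is a minimal sketch of a much simpler, well-known alternative: caching credits per CPU and going to the global pool only in batches. This is not the Credit Tree, whose design is not described in this post. The class, its parameters, and the batch policy are all hypothetical and exist only for illustration.

```python
import threading

class CachedCreditCounter:
    """Global resource limit with per-CPU credit caches (hypothetical sketch,
    not the Credit Tree)."""

    def __init__(self, total, ncpu, batch):
        self.global_lock = threading.Lock()
        self.global_free = total          # credits not held by any CPU
        self.batch = batch                # credits moved per global access
        self.local = [0] * ncpu           # credits cached on each CPU
        self.local_locks = [threading.Lock() for _ in range(ncpu)]
        self.global_lock_acquisitions = 0 # instrumentation for the demo

    def reserve(self, cpu, n=1):
        """Take n credits on behalf of `cpu`; return False if unavailable."""
        with self.local_locks[cpu]:
            if self.local[cpu] < n:
                want = max(self.batch, n - self.local[cpu])
                with self.global_lock:
                    self.global_lock_acquisitions += 1
                    grant = min(want, self.global_free)
                    self.global_free -= grant
                self.local[cpu] += grant
            if self.local[cpu] < n:
                return False
            self.local[cpu] -= n
            return True

    def release(self, cpu, n=1):
        """Return n credits; spill to the global pool only when the cache is large."""
        with self.local_locks[cpu]:
            self.local[cpu] += n
            excess = self.local[cpu] - 2 * self.batch
            if excess > 0:
                self.local[cpu] -= excess
                with self.global_lock:
                    self.global_lock_acquisitions += 1
                    self.global_free += excess

counter = CachedCreditCounter(total=1000, ncpu=4, batch=100)
results = [counter.reserve(cpu=0) for _ in range(100)]
print(all(results), counter.global_lock_acquisitions, counter.global_free)
```

In this run, 100 reservations on one CPU touch the global lock once instead of 100 times. The sketch has an obvious weakness: credits cached on idle CPUs are stranded, so a reservation can fail even though the domain as a whole has capacity. A production design has to reclaim or rebalance those credits.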
The virtual address space is a finite resource that must be partitioned carefully to support large memory systems. 64 bits of VA is sufficient, but we had to adjust the kernel's VAs to support a larger heap and more physical memory pages, and adjust process VAs to support larger shared memory segments (e.g., for the Oracle SGA).
Lastly, we reduced contention on various locks by increasing lock array sizes and improving the object-to-lock hash functions.
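As a generic illustration of why the object-to-lock hash matters, the sketch below maps object addresses to slots in a lock array in two ways. It does not reproduce any actual Solaris hash function. The class name, the 64-byte alignment, and the bit-mixing steps are assumptions chosen for the example.

```python
import threading

class LockArray:
    """Array of locks indexed by a hash of an object's address (illustrative)."""

    def __init__(self, nlocks):
        # Require a power of two so the index can be taken with a mask.
        assert nlocks > 0 and nlocks & (nlocks - 1) == 0
        self.locks = [threading.Lock() for _ in range(nlocks)]
        self.mask = nlocks - 1

    def index_naive(self, addr):
        """Use the low address bits directly."""
        return addr & self.mask

    def index_mixed(self, addr):
        """Drop the alignment bits, then fold in higher bits."""
        h = addr >> 6          # assumption: objects are 64-byte aligned
        h ^= h >> 16
        return h & self.mask

    def lock_for(self, addr):
        return self.locks[self.index_mixed(addr)]

array = LockArray(64)
addrs = [i * 64 for i in range(64)]   # 64 objects, each 64-byte aligned

naive_slots = {array.index_naive(a) for a in addrs}
mixed_slots = {array.index_mixed(a) for a in addrs}
print(len(naive_slots), len(mixed_slots))
```

With aligned objects the low bits are always zero, so the naive hash sends every object to the same lock and the array size is irrelevant. Discarding the alignment bits spreads the same objects over all 64 locks, and only then does a larger array reduce contention further.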
Solaris limits the number of processes that can be created to prevent metadata such as the process table and the proc_t structures from consuming too much kernel memory. This is enforced by the tunables maxusers, max_nprocs, and pidmax. The default for pidmax was 30000, which is too small for M5-32 with 1536 CPUs, allowing only about 20 processes per CPU. As of Solaris 11.1, the defaults for these tunables automatically scale up with CPU count and memory size, to a maximum of 999999 processes. You should rarely if ever need to change these tunables in /etc/system, though that is still allowed.
Similarly, Solaris limits the number of threads that can be created, by limiting the space reserved for kernel thread stacks with the segkpsize tunable, whose default allowed approximately 64K threads. In Solaris 11.1, the default scales with CPU and memory to a maximum of 1.6M threads.
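For a sense of scale, this short calculation divides the old and new limits by the M5-32 CPU count. It only restates the numbers above. I take "approximately 64K" as 65536 and "1.6M" as 1,600,000, which are my readings and not exact kernel values.

```python
def per_cpu(limit, ncpu):
    """Average number of processes or threads available per CPU."""
    return limit / ncpu

NCPU_M5_32 = 1536

old_procs = per_cpu(30000, NCPU_M5_32)        # old pidmax default
new_procs = per_cpu(999999, NCPU_M5_32)       # Solaris 11.1 maximum
old_threads = per_cpu(64 * 1024, NCPU_M5_32)  # "approximately 64K", taken as 65536
new_threads = per_cpu(1_600_000, NCPU_M5_32)  # "1.6M", taken as 1,600,000

print(f"processes per CPU: {old_procs:.1f} -> {new_procs:.1f}")
print(f"threads per CPU:   {old_threads:.1f} -> {new_threads:.1f}")
```

The old limits leave a few dozen processes or threads per CPU on the largest system, while the new maximums leave several hundred to about a thousand.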
Next time: Scheduler, Devices, Tools, and Reboot.