Faster Memory Allocation Using vmtasks
By Steve Sistare on Nov 08, 2012
You may have noticed a new system process called "vmtasks" on Solaris 11 systems:
% pgrep vmtasks 8 % prstat -p 8 PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP 8 root 0K 0K sleep 99 -20 9:10:59 0.0% vmtasks/32
What is vmtasks, and why should you care? In a nutshell, vmtasks accelerates creation, locking, and destruction of pages in shared memory segments. This is particularly helpful for locked memory, as creating a page of physical memory is much more expensive than creating a page of virtual memory. For example, an ISM segment (shmflag & SHM_SHARE_MMU) is locked in memory on the first shmat() call, and a DISM segment (shmflg & SHM_PAGEABLE) is locked using mlock() or memcntl(). Segment operations such as creation and locking are typically single threaded, performed by the thread making the system call. In many applications, the size of a shared memory segment is a large fraction of total physical memory, and the single-threaded initialization is a scalability bottleneck which increases application startup time.
To break the bottleneck, we apply parallel processing, harnessing the power of the additional CPUs that are always present on modern platforms. For sufficiently large segments, as many of 16 threads of vmtasks are employed to assist an application thread during creation, locking, and destruction operations. The segment is implicitly divided at page boundaries, and each thread is given a chunk of pages to process. The per-page processing time can vary, so for dynamic load balancing, the number of chunks is greater than the number of threads, and threads grab chunks dynamically as they finish their work. Because the threads modify a single application address space in compressed time interval, contention on locks protecting VM data structures locks was a problem, and we had to re-scale a number of VM locks to get good parallel efficiency. The vmtasks process has 1 thread per CPU and may accelerate multiple segment operations simultaneously, but each operation gets at most 16 helper threads to avoid monopolizing CPU resources. We may reconsider this limit in the future.
Acceleration using vmtasks is enabled out of the box, with no tuning required, and works for all Solaris platform architectures (SPARC sun4u, SPARC sun4v, x86).
The following tables show the time to create + lock + destroy a large segment, normalized as milliseconds per gigabyte, before and after the introduction of vmtasks:
ISM system ncpu before after speedup ------ ---- ------ ----- ------- x4600 32 1386 245 6X X7560 64 1016 153 7X M9000 512 1196 206 6X T5240 128 2506 234 11X T4-2 128 1197 107 11x DISM system ncpu before after speedup ------ ---- ------ ----- ------- x4600 32 1582 265 6X X7560 64 1116 158 7X M9000 512 1165 152 8X T5240 128 2796 198 14X
(I am missing the data for T4 DISM, for no good reason; it works fine).
The following table separates the creation and destruction times:
ISM, T4-2 before after ------ ----- create 702 64 destroy 495 43
To put this in perspective, consider creating a 512 GB ISM segment on T4-2. Creating the segment would take 6 minutes with the old code, and only 33 seconds with the new. If this is your Oracle SGA, you save over 5 minutes when starting the database, and you also save when shutting it down prior to a restart. Those minutes go directly to your bottom line for service availability.