Thursday Nov 08, 2012

Faster Memory Allocation Using vmtasks

You may have noticed a new system process called "vmtasks" on Solaris 11 systems:

    % pgrep vmtasks
    8
    % prstat -p 8
       PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
         8 root        0K    0K sleep   99  -20   9:10:59 0.0% vmtasks/32
    

What is vmtasks, and why should you care? In a nutshell, vmtasks accelerates creation, locking, and destruction of pages in shared memory segments. This is particularly helpful for locked memory, as creating a page of physical memory is much more expensive than creating a page of virtual memory. For example, an ISM segment (shmflg & SHM_SHARE_MMU) is locked in memory on the first shmat() call, and a DISM segment (shmflg & SHM_PAGEABLE) is locked using mlock() or memcntl(). Segment operations such as creation and locking are typically single-threaded, performed by the thread making the system call. In many applications, the size of a shared memory segment is a large fraction of total physical memory, and the single-threaded initialization is a scalability bottleneck which increases application startup time.
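
For the curious, here is a minimal sketch of how a segment ends up on each path. It is illustrative only, not code from any particular application: the 1 GB size is arbitrary, error handling is abbreviated, and locking memory requires the appropriate privileges.

    /*
     * Minimal sketch: create a 1 GB System V shared memory segment and
     * attach it either as ISM, which is locked as a side effect of shmat(),
     * or as DISM, which stays pageable until locked with mlock() or
     * memcntl().  SHM_SHARE_MMU and SHM_PAGEABLE are the Solaris flags
     * named above.
     */
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/mman.h>
    #include <stdio.h>

    int
    main(int argc, char **argv)
    {
        size_t size = 1024UL * 1024 * 1024;     /* 1 GB, for example */
        int use_ism = (argc > 1);               /* any argument selects ISM */
        int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);

        if (id == -1) {
            perror("shmget");
            return (1);
        }

        if (use_ism) {
            /* ISM: locked in physical memory by this shmat() call. */
            if (shmat(id, NULL, SHM_SHARE_MMU) == (void *)-1)
                perror("shmat ISM");
        } else {
            /* DISM: attach pageable, then lock explicitly. */
            void *p = shmat(id, NULL, SHM_PAGEABLE);
            if (p == (void *)-1)
                perror("shmat DISM");
            else if (mlock(p, size) != 0)
                perror("mlock");
        }
        return (0);
    }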

To break the bottleneck, we apply parallel processing, harnessing the power of the additional CPUs that are always present on modern platforms. For sufficiently large segments, as many as 16 vmtasks threads are employed to assist an application thread during creation, locking, and destruction operations. The segment is implicitly divided at page boundaries, and each thread is given a chunk of pages to process. The per-page processing time can vary, so for dynamic load balancing, the number of chunks is greater than the number of threads, and threads grab chunks dynamically as they finish their work. Because the threads modify a single application address space in a compressed time interval, contention on the locks protecting VM data structures was a problem, and we had to re-scale a number of VM locks to get good parallel efficiency. The vmtasks process has 1 thread per CPU and may accelerate multiple segment operations simultaneously, but each operation gets at most 16 helper threads to avoid monopolizing CPU resources. We may reconsider this limit in the future.
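
To illustrate the chunking idea, here is a small userland sketch, not the vmtasks kernel code; the names NTHREADS, NCHUNKS, NPAGES, and process_chunk are invented for the example. Helper threads pull chunks from a shared cursor until the work runs out, so a thread that finishes a cheap chunk early simply grabs another, which is the dynamic load balancing described above.

    /*
     * Conceptual userland sketch of dynamic chunking: the region is split
     * into more chunks than threads, and each helper thread claims the
     * next unclaimed chunk until none remain.
     */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS  16
    #define NCHUNKS   (NTHREADS * 8)   /* more chunks than threads */
    #define NPAGES    (1L << 20)       /* pages in the segment (example) */

    static atomic_int next_chunk;      /* shared work cursor */

    static void
    process_chunk(long start, long end)
    {
        /* stand-in for the per-page work: create, lock, or free pages */
        (void) start;
        (void) end;
    }

    static void *
    helper(void *arg)
    {
        long pages_per_chunk = NPAGES / NCHUNKS;
        int c;

        /* Grab the next unclaimed chunk until none remain. */
        while ((c = atomic_fetch_add(&next_chunk, 1)) < NCHUNKS) {
            long start = (long)c * pages_per_chunk;
            long end = (c == NCHUNKS - 1) ? NPAGES : start + pages_per_chunk;
            process_chunk(start, end);
        }
        return (arg);
    }

    int
    main(void)
    {
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, helper, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        printf("all %d chunks processed\n", NCHUNKS);
        return (0);
    }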

Acceleration using vmtasks is enabled out of the box, with no tuning required, and works for all Solaris platform architectures (SPARC sun4u, SPARC sun4v, x86).

The following tables show the time to create + lock + destroy a large segment, normalized as milliseconds per gigabyte, before and after the introduction of vmtasks:

    ISM
        system     ncpu    before      after   speedup 
        ------     ----    ------      -----   -------
        x4600      32      1386        245     6X
        X7560      64      1016        153     7X
        M9000      512     1196        206     6X
        T5240      128     2506        234     11X
        T4-2       128     1197        107     11X
    
    DISM
        system     ncpu    before      after   speedup 
        ------     ----    ------      -----   -------
        x4600      32      1582        265     6X
        X7560      64      1116        158     7X
        M9000      512     1165        152     8X
        T5240      128     2796        198     14X
    

(I am missing the data for T4 DISM, for no good reason; it works fine).

The following table separates the creation and destruction times (again normalized as milliseconds per gigabyte):

    ISM, T4-2
                  before    after  
                  ------    -----
        create    702       64
        destroy   495       43
    

To put this in perspective, consider creating a 512 GB ISM segment on a T4-2. At 702 msec per GB, creating the segment would take about 6 minutes with the old code (512 x 702 msec ≈ 359 seconds), and only 33 seconds with the new (512 x 64 msec). If this is your Oracle SGA, you save over 5 minutes when starting the database, and you also save when shutting it down prior to a restart. Those minutes go directly to your bottom line for service availability.

Wednesday Oct 31, 2012

High Resolution Timeouts

The default resolution of application timers and timeouts is now 1 msec in Solaris 11.1, down from 10 msec in previous releases. This improves out-of-the-box performance of polling and event-based applications, such as ticker applications, and even the Oracle RDBMS log writer. More on that in a moment.

As a simple example, the poll() system call takes a timeout argument in units of msec:


    System Calls                                              poll(2)

    NAME
         poll - input/output multiplexing

    SYNOPSIS
         int poll(struct pollfd fds[], nfds_t nfds, int timeout);

In Solaris 11, a call to poll(NULL,0,1) returns in 10 msec, because even though a 1 msec interval is requested, the implementation rounds to the system clock resolution of 10 msec. In Solaris 11.1, this call returns in 1 msec.
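
As a quick check, a small timing sketch like the following (illustrative only) averages the latency of poll(NULL,0,1); expect roughly 10 msec per call on Solaris 11 and roughly 1 msec on Solaris 11.1.

    /*
     * Illustrative timing sketch: average the latency of poll(NULL,0,1).
     * A 1 msec timeout is requested each time; the measured average shows
     * the effective timeout resolution of the running system.
     */
    #include <poll.h>
    #include <stdio.h>
    #include <sys/time.h>

    int
    main(void)
    {
        struct timeval t0, t1;
        double ms;
        int i, iters = 100;

        gettimeofday(&t0, NULL);
        for (i = 0; i < iters; i++)
            (void) poll(NULL, 0, 1);    /* 1 msec timeout, no fds */
        gettimeofday(&t1, NULL);

        ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
            (t1.tv_usec - t0.tv_usec) / 1000.0;
        printf("average poll(NULL,0,1) latency: %.2f msec\n", ms / iters);
        return (0);
    }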

In specification lawyer terms, the resolution of CLOCK_REALTIME, introduced by the POSIX.1b realtime extensions, is now 1 msec. The function clock_getres(CLOCK_REALTIME, &res) returns 1 msec, and any library calls whose man pages explicitly mention CLOCK_REALTIME, such as nanosleep(), are subject to the new resolution. Additionally, many legacy functions that pre-date POSIX.1b and do not explicitly mention a clock domain, such as poll(), are subject to the new resolution. Here is a fairly comprehensive list, followed by a small sketch that checks the new resolution:

      nanosleep
      pthread_mutex_timedlock pthread_mutex_reltimedlock_np
      pthread_rwlock_timedrdlock pthread_rwlock_reltimedrdlock_np
      pthread_rwlock_timedwrlock pthread_rwlock_reltimedwrlock_np
      mq_timedreceive mq_reltimedreceive_np
      mq_timedsend mq_reltimedsend_np
      sem_timedwait sem_reltimedwait_np
      poll select pselect
      _lwp_cond_timedwait _lwp_cond_reltimedwait
      semtimedop sigtimedwait
      aiowait aio_waitn aio_suspend
      port_get port_getn
      cond_timedwait cond_reltimedwait
      setitimer (ITIMER_REAL)
      misc rpc calls, misc ldap calls
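
And here is the small sketch mentioned above, illustrative only: it queries the CLOCK_REALTIME resolution with clock_getres() and issues a 1 msec nanosleep(), which is subject to that resolution. Depending on the release, you may need -lrt when linking.

    /*
     * Illustrative sketch: report the CLOCK_REALTIME resolution and issue
     * a 1 msec sleep, which is rounded to that resolution.
     */
    #include <stdio.h>
    #include <time.h>

    int
    main(void)
    {
        struct timespec res;
        struct timespec req = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 msec */

        if (clock_getres(CLOCK_REALTIME, &res) == 0)
            printf("CLOCK_REALTIME resolution: %ld nsec\n", (long)res.tv_nsec);

        (void) nanosleep(&req, NULL);   /* subject to the resolution above */
        return (0);
    }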

This change in resolution was made feasible because we made the implementation of timeouts more efficient a few years back when we re-architected the callout subsystem of Solaris. Previously, timeouts were tested and expired by the kernel's clock thread which ran 100 times per second, yielding a resolution of 10 msec. This did not scale, as timeouts could be posted by every CPU, but were expired by only a single thread. The resolution could be changed by setting hires_tick=1 in /etc/system, but this caused the clock thread to run at 1000 Hz, which made the potential scalability problem worse. Given enough CPUs posting enough timeouts, the clock thread could be a performance bottleneck. We fixed that by re-implementing the timeout as a per-CPU timer interrupt (using the cyclic subsystem, for those familiar with Solaris internals). This decoupled the clock thread frequency from timeout resolution, and allowed us to improve default timeout resolution without adding CPU overhead in the clock thread.
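
To make that concrete, here is a highly simplified, hypothetical sketch of the per-CPU idea; it is not the actual callout or cyclic code, and every name in it is invented. Each CPU keeps its own list of pending timeouts and programs a one-shot timer for the earliest deadline, so expiry work is spread across CPUs and resolution is no longer tied to a periodic tick.

    /*
     * Hypothetical sketch only -- NOT the Solaris callout or cyclic code.
     * Each CPU tracks its own pending timeouts and programs a one-shot
     * timer for the earliest deadline, instead of one clock thread
     * scanning everything on every 10 msec tick.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_TIMEOUTS  64

    typedef struct cpu_callouts {
        uint64_t deadlines[MAX_TIMEOUTS];  /* absolute expiry times, nsec */
        int      ntimeouts;
    } cpu_callouts_t;

    /* Stand-in for programming the CPU-local one-shot timer hardware. */
    static void
    program_timer(int cpu, uint64_t deadline)
    {
        printf("cpu %d: next timer interrupt at %llu nsec\n",
            cpu, (unsigned long long)deadline);
    }

    /* Post a timeout on this CPU and (re)program the timer for the earliest. */
    static void
    post_timeout(int cpu, cpu_callouts_t *c, uint64_t deadline)
    {
        uint64_t earliest = deadline;
        int i;

        c->deadlines[c->ntimeouts++] = deadline;
        for (i = 0; i < c->ntimeouts; i++)
            if (c->deadlines[i] < earliest)
                earliest = c->deadlines[i];
        program_timer(cpu, earliest);
    }

    int
    main(void)
    {
        cpu_callouts_t cpu0 = { { 0 }, 0 };

        post_timeout(0, &cpu0, 5000000);   /* expires in 5 msec */
        post_timeout(0, &cpu0, 1000000);   /* 1 msec: rearms the timer earlier */
        return (0);
    }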

Here are some exceptions for which the default resolution is still 10 msec.

  • The thread scheduler's time quantum is 10 msec by default, because preemption is driven by the clock thread (plus helper threads for scalability). See for example dispadmin, priocntl, fx_dptbl, rt_dptbl, and ts_dptbl. This may be changed using hires_tick.
  • The resolution of the clock_t data type, primarily used in DDI functions, is 10 msec. It may be changed using hires_tick. These functions are only used by developers writing kernel modules.
  • A few functions that pre-date POSIX CLOCK_REALTIME mention _SC_CLK_TCK, CLK_TCK, "system clock", or no clock domain. These functions are still driven by the clock thread, and their resolution is 10 msec. They include alarm, pcsample, times, clock, and setitimer for ITIMER_VIRTUAL and ITIMER_PROF. Their resolution may be changed using hires_tick (see the snippet after this list).
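
For reference, the hires_tick tunable mentioned in the items above is set in /etc/system, and the change takes effect after a reboot:

    set hires_tick=1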

Now back to the database. How does this help the Oracle log writer? Foreground processes post a redo record to the log writer, which releases them after the redo has committed. When a large number of foregrounds are waiting, the release step can slow down the log writer, so under heavy load, the foregrounds switch to a mode where they poll for completion. This scales better because every foreground can poll independently, but at the cost of waiting the minimum polling interval. That was 10 msec, but is now 1 msec in Solaris 11.1, so the foregrounds process transactions faster under load. Pretty cool.
