High Resolution Timeouts
By Steve Sistare-Oracle on Oct 31, 2012
The default resolution of application timers and timeouts is now 1 msec in Solaris 11.1, down from 10 msec in previous releases. This improves out-of-the-box performance of polling and event based applications, such as ticker applications, and even the Oracle rdbms log writer. More on that in a moment.
As a simple example, the poll() system call takes a timeout argument in units of msec:
System Calls poll(2) NAME poll - input/output multiplexing SYNOPSIS int poll(struct pollfd fds, nfds_t nfds, int timeout);
In Solaris 11, a call to poll(NULL,0,1) returns in 10 msec, because even though a 1 msec interval is requested, the implementation rounds to the system clock resolution of 10 msec. In Solaris 11.1, this call returns in 1 msec.
In specification lawyer terms, the resolution of CLOCK_REALTIME, introduced by POSIX.1b real time extensions, is now 1 msec. The function clock_getres(CLOCK_REALTIME,&res) returns 1 msec, and any library calls whose man page explicitly mention CLOCK_REALTIME, such as nanosleep(), are subject to the new resolution. Additionally, many legacy functions that pre-date POSIX.1b and do not explicitly mention a clock domain, such as poll(), are subject to the new resolution. Here is a fairly comprehensive list:
nanosleep pthread_mutex_timedlock pthread_mutex_reltimedlock_np pthread_rwlock_timedrdlock pthread_rwlock_reltimedrdlock_np pthread_rwlock_timedwrlock pthread_rwlock_reltimedwrlock_np mq_timedreceive mq_reltimedreceive_np mq_timedsend mq_reltimedsend_np sem_timedwait sem_reltimedwait_np poll select pselect _lwp_cond_timedwait _lwp_cond_reltimedwait semtimedop sigtimedwait aiowait aio_waitn aio_suspend port_get port_getn cond_timedwait cond_reltimedwait setitimer (ITIMER_REAL) misc rpc calls, misc ldap calls
This change in resolution was made feasible because we made the implementation of timeouts more efficient a few years back when we re-architected the callout subsystem of Solaris. Previously, timeouts were tested and expired by the kernel's clock thread which ran 100 times per second, yielding a resolution of 10 msec. This did not scale, as timeouts could be posted by every CPU, but were expired by only a single thread. The resolution could be changed by setting hires_tick=1 in /etc/system, but this caused the clock thread to run at 1000 Hz, which made the potential scalability problem worse. Given enough CPUs posting enough timeouts, the clock thread could be a performance bottleneck. We fixed that by re-implementing the timeout as a per-CPU timer interrupt (using the cyclic subsystem, for those familiar with Solaris internals). This decoupled the clock thread frequency from timeout resolution, and allowed us to improve default timeout resolution without adding CPU overhead in the clock thread.
Here are some exceptions for which the default resolution is still 10 msec.
- The thread scheduler's time quantum is 10 msec by default, because preemption is driven by the clock thread (plus helper threads for scalability). See for example dispadmin, priocntl, fx_dptbl, rt_dptbl, and ts_dptbl. This may be changed using hires_tick.
- The resolution of the clock_t data type, primarily used in DDI functions, is 10 msec. It may be changed using hires_tick. These functions are only used by developers writing kernel modules.
- A few functions that pre-date POSIX CLOCK_REALTIME mention _SC_CLK_TCK, CLK_TCK, "system clock", or no clock domain. These functions are still driven by the clock thread, and their resolution is 10 msec. They include alarm, pcsample, times, clock, and setitimer for ITIMER_VIRTUAL and ITIMER_PROF. Their resolution may be changed using hires_tick.
Now back to the database. How does this help the Oracle log writer? Foreground processes post a redo record to the log writer, which releases them after the redo has committed. When a large number of foregrounds are waiting, the release step can slow down the log writer, so under heavy load, the foregrounds switch to a mode where they poll for completion. This scales better because every foreground can poll independently, but at the cost of waiting the minimum polling interval. That was 10 msec, but is now 1 msec in Solaris 11.1, so the foregrounds process transactions faster under load. Pretty cool.