Sunday Nov 16, 2008

Measuring lock spin utilization

Lockstat(1M) is the Solaris tool of choice for identifying contended kernel locks that need to be optimized to improve system efficiency and scalability. However, until recently, you could not easily use lockstat to quantify the impact of hot locks on system utilization, or to state with certainty that a hot lock is limiting your performance. This could lead to wasted effort spent fixing a lock, with no performance benefit. For example, here is the output of lockstat showing a hot lock:

    # lockstat sleep 10
    Adaptive mutex spin: 9227825 events in 10.302 seconds (895755 events/sec)
    Count indv cuml rcnt     spin Lock                   Caller
    569235   6%   6% 0.00       17 0x6009a065000          timeout_common+0x4

This says that some thread(s) could not immediately acquire the mutex 569235 times, so they had to busy-wait, and took an average of 17 retries (spins) to acquire the mutex. In absolute terms, is this bad? You cannot tell. Higher spin is worse than lower spin, but you cannot easily map this to time, and further, not all spins are created equal because of the exponential backoff algorithm in the mutex implementation.
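To see why spin count maps poorly to time, consider a toy model of exponential backoff (an illustration only, not the Solaris mutex implementation): each failed retry pauses roughly twice as long as the previous one, so the total wait grows geometrically with the number of spins.

```python
# Toy model of exponential backoff -- NOT the Solaris mutex code.
# Each retry pauses about twice as long as the one before it.
def total_backoff_units(spins, base=1):
    """Total pause, in arbitrary units, for `spins` failed retries."""
    return sum(base * (2 ** i) for i in range(spins))

# 17 spins cost vastly more than ~4x the wait of 4 spins:
print(total_backoff_units(4))   # 15 units
print(total_backoff_units(17))  # 131071 units
```

Under this model, a lock averaging 17 spins waits thousands of times longer than one averaging 4, not ~4x longer, which is why a time-based measure is needed.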

The good news is that we have enhanced lockstat to show the spin time in Solaris 10 10/08 and OpenSolaris 2008.11. Now the output of lockstat looks like this:

    # lockstat sleep 10
    Adaptive mutex spin: 9181733 events in 10.135 seconds (905943 events/sec)
    Count indv cuml rcnt     nsec Lock                   Caller
    557331   6%   6% 0.00    88454 0x60033047000          timeout_common+0x4

The new nsec column shows the average time spent busy-waiting and retrying to acquire a mutex. From this data, you can compute the CPU utilization spent busy-waiting using the formula:

    CPU_util = (Count * nsec) / (elapsed * ncpu * 1e9)

The elapsed time is shown in the header above, and you need to know the number of CPUs on your platform. In this example, ncpu=256, so we have:

    CPU_util = (557331 * 88454) / (10.135 * 256 * 1e9) = .019

Thus, 1.9% of all CPU time on the system was spent busy-waiting to acquire this lock, and if we eliminate this lock, we can recover 1.9% of the CPU cycles and perhaps increase our throughput by that much.
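The same arithmetic is easy to check with a few lines of Python, plugging in the values from the lockstat output above:

```python
# Plug the lockstat columns into the utilization formula above.
count, nsec = 557331, 88454     # Count and nsec columns for the hot lock
elapsed, ncpu = 10.135, 256     # elapsed seconds (lockstat header) and CPU count

cpu_util = (count * nsec) / (elapsed * ncpu * 1e9)
print(round(cpu_util, 3))       # 0.019, i.e. 1.9% of all CPU time
```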

I have written a handy script called lockutil to do this calculation for you. lockutil reads a file produced by lockstat, and computes the CPU utilization spent busy-waiting for every lock in the file, including adaptive mutexes, spin locks, and thread locks. It can parse output produced by all of lockstat's myriad command line options, except for the -I profiling mode, which is not relevant. You must supply the CPU count. Example:

    # ... Start running a test ...
    # lockstat -kWP sleep 30 > lockstat_kWP.out
    # mpstat | wc -l
    513                       # includes one header line
    # cat lockstat_kWP.out
    Adaptive mutex spin: 1713116 events in 30.214 seconds (56700 events/sec)
    Count indv cuml rcnt     nsec Hottest Lock           Caller
    1170899  68%  68% 0.00 11444141 pcf                  page_create_wait
    202679  12%  80% 0.00   332996 pcf+0x380             page_free
    86959   5%  85% 0.00  8840242 pcf                    page_reclaim
    64565   4%  89% 0.00    25301 0x604ec13c650          ufs_lockfs_begin_getpage
    55914   3%  92% 0.00    24010 0x604ec13c650          ufs_lockfs_end
    29505   2%  94% 0.00     1284 0x70467650040          page_trylock
    24687   1%  95% 0.00     1845 0x70467650040          page_unlock
    # lockutil -n 512 lockstat_kWP.out
     CPU-util  Lock                 Caller
        0.866  pcf                  page_create_wait
        0.004  pcf+0x380            page_free
        0.050  pcf                  page_reclaim
        0.000  0x604ec13c650        ufs_lockfs_begin_getpage
        0.000  0x604ec13c650        ufs_lockfs_end
        0.000  0x70467650040        page_trylock
        0.000  0x70467650040        page_unlock
        0.920  TOTAL                TOTAL

Ouch, busy-waiting for the pcf lock consumes 92% of the system's CPU time! There's a lock that needs fixing if ever I saw one. We fixed it.
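For reference, the per-lock calculation lockutil performs could be sketched like this (a hypothetical reimplementation of its core arithmetic only; the actual script also parses the various lockstat output formats):

```python
def lock_cpu_util(rows, elapsed, ncpu):
    """rows: (count, nsec, lock, caller) tuples from one lockstat section.
    Returns per-lock CPU utilization plus the section total."""
    utils = [((count * nsec) / (elapsed * ncpu * 1e9), lock, caller)
             for count, nsec, lock, caller in rows]
    total = sum(u for u, _, _ in utils)
    return utils, total

# First two rows of the lockstat_kWP.out example above:
rows = [(1170899, 11444141, "pcf", "page_create_wait"),
        (202679, 332996, "pcf+0x380", "page_free")]
utils, total = lock_cpu_util(rows, elapsed=30.214, ncpu=512)
for u, lock, caller in utils:
    print(f"{u:9.3f}  {lock:<20} {caller}")   # 0.866 and 0.004, as above
```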

This raises the question: what can you do if you find a hot lock? If you are a kernel developer and the lock is in your code, then change your locking granularity or algorithm. If you are an application developer, then study the results carefully, because hot kernel locks sometimes indicate a non-scalable application architecture, such as too many processes using an IPC primitive like a message queue or a FIFO. Use the lockstat -s option to see the call stack for each hot lock and trace back to your system calls, or use the DTrace lockstat provider to aggregate lock spin time by the system calls your application makes. The lockstat provider has also been modified to report spin time rather than spin count in its probe arguments.

If you are an end-user, then search the support databases to see whether a fix or tuning advice is available; public search engines are also useful. For example, a Google search for "lockstat timeout_common" finds this:

    6311743 callout table lock contention in timeout and untimeout

which my group recently fixed; the fix will appear in upcoming Solaris releases. Advice to folks who file CRs: to make these searches more successful, include lockstat snippets in the publicly readable sections of a CR, such as "Description" and "Public Comments".


Steve Sistare
