Tuesday Aug 11, 2009

Wicked fast memstat

The memstat command is commonly used to diagnose memory issues on Solaris systems. It characterizes every physical page on a system and provides a high-level summary of memory usage, such as the following:

    # mdb -k
    > ::memstat
    Page Summary                Pages                MB  %Tot
    ------------     ----------------  ----------------  ----
    Kernel                     428086              3344    3%
    ZFS File Data               25006               195    0%
    Anon                     13992767            109318   85%
    Exec and libs                 652                 5    0%
    Page cache                  24979               195    0%
    Free (cachelist)             1809                14    0%
    Free (freelist)           1979424             15464   12%
    Total                    16452723            128536

However, memstat is horribly slow on large systems. Its running time grows as O(physmem * NCPU), and can take an hour or more on the largest systems, which makes it practically unusable. I have recently worked with Pavel Tatashin to optimize memstat, and if you use memstat, you will like the results.

memstat is an mdb command; see its source code in the file usr/src/cmd/mdb/common/modules/genunix/memory.c. For every page that memstat examines, it reads the page_t structure describing the page, and reads the vnode_t structure describing the page's identity. Each read of a kernel data structure is expensive: it is a system call, specifically a pread() from the special file /dev/kmem. Max Bruning suggested the first optimization in his blog: rather than finding non-free pages through the page_hash[] and reading them one at a time, memstat should read dense arrays of page_t's from the memsegs. These include free pages, which must be ignored, but the change reduces the number of system calls and is a net win. Max reports more than a 2X speedup. This is a good start, but it is just the tip of the iceberg.

The next big cost is reading the vnode_t per page. The key observation is that many pages point to the same vnode_t; thus, if we save the vnode_t in mdb when we first read it, we can avoid subsequent reads of the same vnode_t. In practice, there are too many vnode_t's on a production system to save every one, as this would greatly increase the memory consumption of mdb, so we implement a cache of up to 10000 vnode_t's, with LRU replacement, organized in a hash table for rapid lookup by vnode_t address. Also, we only save the vn_flag field of the vnode_t object to save space, since only the flag is needed to characterize a page's identity. The cache eliminates most vnode_t related reads, gaining another 2X in performance.

The next cost is a redundant traversal of the pages. memstat traverses and reads the pages twice, performing a slightly different accounting on the second traversal. We eliminated the second traversal and did all accounting on the first pass, gaining another 2X in performance.

The last big cost relates to virtual memory management, and is the reason that the running time grows as O(NCPU). The pread system call jumps to the kernel module for /dev/kmem, whose source code is in usr/src/uts/common/io/mem.c. For each read request, the code determines the physical address (PA), creates a temporary virtual address (VA) mapping to that address, copies the data from kernel to user space, and unmaps the VA. The unmap operation must be broadcast to all CPUs to make sure no CPU retains the stale VA-to-PA translation in its TLB. To avoid this cost, we extended and leveraged a Solaris capability called Kernel Physical Mapping (KPM), in which all of physical memory is pre-assigned to a dedicated range of kernel virtual memory that is never mapped for any other purpose. Thus a KPM mapping never needs to be purged from the CPU TLBs, and the memstat running time is no longer a function of NCPU. This optimization yields an additional 10X or more speedup on large CPU count systems.

Finally, the punchline: the combined speedup from all optimizations is almost 500X in the best case, and memstat completes in seconds to minutes. Here are the memstat run times before versus after on various systems:

    platform   memory   NCPU   before    after   speedup
                 (GB)           (sec)    (sec)
    --------   ------   ----   ------   ------   -------
    X4600          32     32       19      1.5      13 X
    T5240          32    128      490      4.5     109 X
    T5440         128    256     3929      9.5     414 X
    M9000        4096    512    34143     70.5     484 X
    E25K          864    144     2612    181.5      14 X

(The E25K speedup is "only" 14X because it does not support our KPM optimization; KPM is more complicated on UltraSPARC IV+ and older processors due to possible VA conflicts in their L1 cache).

As a bonus, all mdb -k commands are somewhat faster on large CPU count systems due to the KPM optimization. For example, on a T5440 running 10000 threads, an mdb pipeline to walk all threads and print their stacks took 64 seconds before, and 27 seconds after.

But wait, there's more! Thanks to a suggestion from Jonathan Adams, we exposed the fast method of traversing pages via memsegs with a new mdb walker which you can use:

    > ::walk allpages

These optimizations are coming soon to a Solaris near you, tracked by the following CR:

    6708183 poor scalability of mdb memstat with increasing CPU count

They are available now in OpenSolaris developer build 118, and will be in OpenSolaris 2010.02. They will also be in Solaris 10 Update 8, which is patch 141444-08 for SPARC and 141445-08 for x86.

Wednesday Jul 08, 2009

Lies, Damned Lies, and Stack Traces

The kernel stack trace is a critical piece of information for diagnosing kernel bugs, but it can be tricky to interpret due to quirks in the processor architecture and in optimized code. Some of these are well known: tail calls and leaf functions obscure frames, function arguments may live in registers that have been modified since entry, and so on. These quirks can cause you to waste time chasing the wrong problem if you are not careful.

Here is a less well known example to be wary of that is specific to SPARC kernel stacks. Use mdb to examine the panic thread in a kernel crash dump:

    > *panic_thread::findstack
    stack pointer for thread 30014adaf60: 2a10c548671
      000002a10c548721 die+0x98()
      000002a10c548801 trap+0x768()
      000002a10c548961 ktl0+0x64()
      000002a10c548ab1 hat_unload_callback+0x358()
      000002a10c548f21 segvn_unmap+0x2a8()
      000002a10c549021 as_free+0xf4()
      000002a10c5490d1 relvm+0x234()
      000002a10c549181 proc_exit+0x490()
      000002a10c549231 exit+8()
      000002a10c5492e1 syscall_trap+0xac()

This says that the thread did something bad at hat_unload_callback+0x358, which caused a trap and panic. But what does panicinfo show?

    > ::panicinfo
                 cpu              195
              thread      30014adaf60
             message BAD TRAP: type=30 rp=2a10c549210 addr=0 mmu_fsr=9
                  pc          1031360

The pc symbolizes to this:

    > 1031360::dis -n 0
    hat_unload_callback+0x3f8:      ldx       [%l4 + 0x10], %o3

Hmm, that is not the same offset that was shown in the call stack: 3f8 versus 358. Which one should you believe?

panicinfo is correct, and the call stack lies -- it is an artifact of the conventional interpretation of the o7 register in the SPARC architecture, plus a discontinuity caused by the trap. In the standard calling sequence, the call instruction saves the pc in the o7 register, the destination address is written to the pc, and the destination executes a save instruction that slides the register window and renames the o registers to i registers. A stack walker interprets the value of i7 in each window as that frame's pc.

However, a SPARC trap uses a different mechanism for saving the pc, and does not modify o7. When the trap handler executes a save instruction, the o7 register contains the pc of the most recent call instruction. This is marginally interesting, but totally unrelated to the pc at which the trap was taken. The stack walker later extracts this value of o7 from the window and shows it as the frame's pc, which is wrong.

This particular stack lie only occurs after a trap, so you can recognize it by the presence of the Solaris trap function ktl0() on the stack. You can find the correct pc in a "struct regs" that the trap handler pushes on the stack at address sp+7ff-a0, where sp is the stack pointer for the frame prior to ktl0() (the 7ff term removes the SPARC V9 stack bias). From the example above, use the sp value to the left of hat_unload_callback:

    > 000002a10c548ab1+7ff-a0::print struct regs r_pc
    r_pc = 0x1031360

This works for any thread. If you are examining the panic thread, then the ::panicinfo command performs the calculation for you and shows the correct pc.


Steve Sistare

