Wicked fast memstat

The memstat command is commonly used to diagnose memory issues on Solaris systems. It characterizes every physical page on a system and provides a a high level summary of memory usage, such as the following:

    # mdb -k
    > ::memstat
    Page Summary                Pages                MB  %Tot
    ------------     ----------------  ----------------  ----
    Kernel                     428086              3344    3%
    ZFS File Data               25006               195    0%
    Anon                     13992767            109318   85%
    Exec and libs                 652                 5    0%
    Page cache                  24979               195    0%
    Free (cachelist)             1809                14    0%
    Free (freelist)           1979424             15464   12%
    Total                    16452723            128536

However, memstat is horribly slow on large systems. Its running time grows as O(physmem \* NCPU), and can take an hour or more on the largest systems, which makes it practically unusable. I have recently worked with Pavel Tatashin to optimize memstat, and if you use memstat, you will like the results.

memstat is an mdb command; see its soucre code in the file usr/src/cmd/mdb/common/modules/genunix/memory.c. For every page that memstat examines, it reads the page_t structure describing the page, and reads the vnode_t structure describing the page's identity. Each read of a kernel data structure is expensive - it is a system call; specifically, a pread() from the special file /dev/kmem. Max Bruning in his blog suggested the first optimization: rather than finding non-free pages through the page_hash[] and reading them one at a time, memstat should read dense arrays of page_t's from the memsegs. These include free pages which must be ignored, but it reduces the number of system calls and is a net win. Max reports more than a 2X speedup. This is a good start, but is just the tip of the iceberg.

The next big cost is reading the vnode_t per page. The key observation is that many pages point to the same vnode_t; thus, if we save the vnode_t in mdb when we first read it, we can avoid subsequent reads of the same vnode_t. In practice, there are too many vnode_t's on a production system to save every one, as this would greatly increase the memory consumption of mdb, so we implement a cache of up to 10000 vnode_t's, with LRU replacement, organized in a hash table for rapid lookup by vnode_t address. Also, we only save the vn_flag field of the vnode_t object to save space, since only the flag is needed to characterize a page's identity. The cache eliminates most vnode_t related reads, gaining another 2X in performance.

The next cost is a redundant traversal of the pages. memstat traverses and reads the pages twice, performing a slightly different accounting on the second traversal. We eliminated the second traversal and did all accounting on the first pass, gaining another 2X in performance.

The last big cost relates to virtual memory management, and is the reason that the running time grows as O(NCPU). The pread system call jumps to the kernel module for /dev/kmem, whose source code is in usr/src/uts/common/io/mem.c. For each read request, the code determines the physical address (PA), creates a temporary virtual address (VA) mapping to this address, copies the data from kernel to user space, and unmaps the VA. The unmap operation must be broadcast all CPUs to make sure no CPU has the stale VA to PA translation in its TLB. To avoid this cost, we extended and leveraged a Solaris capability called Kernel Physical Mapping (KPM), in which all of physical memory is pre-assigned to a dedicated range of kernel virtual memory that is never mapped for any other purpose. Thus a KPM mapping never needs to be purged from the CPU TLB's, and the memstat running time is no longer a function of NCPU. This optimization yields an additional 10X or more speedup on large CPU count systems.

Finally, the punchline: the combined speedup from all optimizations is almost 500X in the best case, and memstat completes in seconds to minutes. Here are the memstat run times before versus after on various systems:

    platform  memory  NCPU   before   after   speedup
               (GB)           (sec)   (sec)
               ----   ---     -----   ----    -----
    X4600       32     32        19    1.5     13 X
    T5240       32    128       490    4.5    109 X
    T5440      128    256      3929    9.5    414 X
    M9000     4096    512     34143   70.5    484 X
    E25K       864    144      2612  181.5     14 X

(The E25K speedup is "only" 14X because it does not support our KPM optimization; KPM is more complicated on UltraSPARC IV+ and older processors due to possible VA conflicts in their L1 cache).

As a bonus, all mdb -k commands are somewhat faster on large CPU count systems due to the KPM optimization. For example, on a T5440 running 10000 threads, an mdb pipeline to walk all threads and print their stacks took 64 seconds before, and 27 seconds after.

But wait, there's more! Thanks to a suggestion from Jonathan Adams, we exposed the fast method of traversing pages via memsegs with a new mdb walker which you can use:

    > ::walk allpages

These optimizations are coming soon to a Solaris near you, tracked by the following CR:
6708183 poor scalability of mdb memstat with increasing CPU count
They are available now in Open Solaris developer build 118, and will be in OpenSolaris 2010.02. They will also be in Solaris 10 Update 8, which is patch 141444-08 for SPARC and 141445-08 for x86.


Sweet! I look forward to this hitting the M9000's I work with.


Posted by Chad Mynhier on August 11, 2009 at 02:25 AM EDT #

So Cool..Good job.

Posted by bonghwan kim on August 11, 2009 at 03:02 PM EDT #

Great job. !!!

Posted by Ilho Kim on August 11, 2009 at 04:51 PM EDT #

Wicked. Nice description of the optimizations, too.

Posted by Tim Cook on August 17, 2009 at 09:06 AM EDT #

Post a Comment:
  • HTML Syntax: NOT allowed

Steve Sistare


« June 2016