Wicked fast memstat
By Steve Sistare on Aug 11, 2009
The memstat command is commonly used to diagnose memory issues on Solaris systems. It characterizes every physical page on a system and provides a a high level summary of memory usage, such as the following:
# mdb -k > ::memstat Page Summary Pages MB %Tot ------------ ---------------- ---------------- ---- Kernel 428086 3344 3% ZFS File Data 25006 195 0% Anon 13992767 109318 85% Exec and libs 652 5 0% Page cache 24979 195 0% Free (cachelist) 1809 14 0% Free (freelist) 1979424 15464 12% Total 16452723 128536
However, memstat is horribly slow on large systems. Its running time grows as O(physmem \* NCPU), and can take an hour or more on the largest systems, which makes it practically unusable. I have recently worked with Pavel Tatashin to optimize memstat, and if you use memstat, you will like the results.
memstat is an mdb command; see its soucre code in the file usr/src/cmd/mdb/common/modules/genunix/memory.c. For every page that memstat examines, it reads the page_t structure describing the page, and reads the vnode_t structure describing the page's identity. Each read of a kernel data structure is expensive - it is a system call; specifically, a pread() from the special file /dev/kmem. Max Bruning in his blog suggested the first optimization: rather than finding non-free pages through the page_hash and reading them one at a time, memstat should read dense arrays of page_t's from the memsegs. These include free pages which must be ignored, but it reduces the number of system calls and is a net win. Max reports more than a 2X speedup. This is a good start, but is just the tip of the iceberg.
The next big cost is reading the vnode_t per page. The key observation is that many pages point to the same vnode_t; thus, if we save the vnode_t in mdb when we first read it, we can avoid subsequent reads of the same vnode_t. In practice, there are too many vnode_t's on a production system to save every one, as this would greatly increase the memory consumption of mdb, so we implement a cache of up to 10000 vnode_t's, with LRU replacement, organized in a hash table for rapid lookup by vnode_t address. Also, we only save the vn_flag field of the vnode_t object to save space, since only the flag is needed to characterize a page's identity. The cache eliminates most vnode_t related reads, gaining another 2X in performance.
The next cost is a redundant traversal of the pages. memstat traverses and reads the pages twice, performing a slightly different accounting on the second traversal. We eliminated the second traversal and did all accounting on the first pass, gaining another 2X in performance.
The last big cost relates to virtual memory management, and is the reason that the running time grows as O(NCPU). The pread system call jumps to the kernel module for /dev/kmem, whose source code is in usr/src/uts/common/io/mem.c. For each read request, the code determines the physical address (PA), creates a temporary virtual address (VA) mapping to this address, copies the data from kernel to user space, and unmaps the VA. The unmap operation must be broadcast all CPUs to make sure no CPU has the stale VA to PA translation in its TLB. To avoid this cost, we extended and leveraged a Solaris capability called Kernel Physical Mapping (KPM), in which all of physical memory is pre-assigned to a dedicated range of kernel virtual memory that is never mapped for any other purpose. Thus a KPM mapping never needs to be purged from the CPU TLB's, and the memstat running time is no longer a function of NCPU. This optimization yields an additional 10X or more speedup on large CPU count systems.
Finally, the punchline: the combined speedup from all optimizations is almost 500X in the best case, and memstat completes in seconds to minutes. Here are the memstat run times before versus after on various systems:
platform memory NCPU before after speedup (GB) (sec) (sec) ---- --- ----- ---- ----- X4600 32 32 19 1.5 13 X T5240 32 128 490 4.5 109 X T5440 128 256 3929 9.5 414 X M9000 4096 512 34143 70.5 484 X E25K 864 144 2612 181.5 14 X
(The E25K speedup is "only" 14X because it does not support our KPM optimization; KPM is more complicated on UltraSPARC IV+ and older processors due to possible VA conflicts in their L1 cache).
As a bonus, all mdb -k commands are somewhat faster on large CPU count systems due to the KPM optimization. For example, on a T5440 running 10000 threads, an mdb pipeline to walk all threads and print their stacks took 64 seconds before, and 27 seconds after.
But wait, there's more! Thanks to a suggestion from Jonathan Adams, we exposed the fast method of traversing pages via memsegs with a new mdb walker which you can use:
> ::walk allpages
These optimizations are coming soon to a Solaris near you, tracked
by the following CR:
6708183 poor scalability of mdb memstat with increasing CPU count
They are available now in Open Solaris developer build 118, and will be in OpenSolaris 2010.02. They will also be in Solaris 10 Update 8, which is patch 141444-08 for SPARC and 141445-08 for x86.