Tuesday Jul 07, 2015

Solaris 11.3: Optimized Shared Memory (OSM)

One of the new features in Solaris 11.3 is Optimized Shared Memory (OSM), which is a new mode for System V shared memory designed to marry the best parts of Intimate Shared Memory (ISM) and Dynamic Intimate Shared Memory (DISM). OSM has been a private feature in Solaris since Solaris 11 and Solaris 10 update 11, but the interfaces were not documented outside of Oracle. The only consumer has been the Oracle Database, which uses OSM instead of DISM from Oracle 12c onwards.

In 11.3, we are now documenting and supporting the use of OSM for a wider class of consumers. The OSM kernel support has also been substantially improved.

[Read More]

Monday Mar 14, 2011


Over a decade ago in 2000, David Powell used his intimate knowledge of sed(1) to write a shell script called munges, which entered the muscle memory of most of the Solaris Kernel Development staff. Short for munge stacks, it takes the output of ::walk thread | ::findstack, and groups the stacks by stack trace:
> ::walk thread | ::findstack !/home/dep/bin/munges
743     ##################################  tp: 2a100007c80

182     ##################################  tp: 30003c60020

65      ##################################  tp: 30003c617c0
1       ##################################  tp: 2a100077c80 (TS_FREE)

1       ##################################  tp: 2a10003fc80

1       ##################################  tp: 2a100017c80

1       ##################################  tp: 180e000
While very handy (even after a decade) it has a few drawbacks:
  • Since it is a shell script, it cannot be used directly when debugging with kmdb(1). Copying and pasting thousands of thread stacks from serial-console output was never convenient.
  • You have to use advanced egrep(1) (or less(1)) technology to search for particular threads of interest.
  • Since ::findstack only displays limited thread state information (i.e. whether the thread is free), munges can't uniquify based upon the thread state.
After sitting on my backburner for a long time, I finally implemented:
6799290 need to uniquify stacks in kmdb

Running ::stacks by itself gives output very similar to the above pipeline:
> ::stacks
THREAD           STATE    SOBJ                COUNT
2a10000fc80      SLEEP    CV                    743

30003dc43c0      SLEEP    SHUTTLE               182
2a10055dc80      ONPROC   <NONE>                  1

3004a4d0740      ONPROC   <NONE>                  1

180e000          STOPPED  <NONE>                  1

After my initial putback (in 2009), Greg Price added 6802742 (adding module filtering), and Bryan Cantrill added 6935550 (which added ::stacks support to userland debugging).

Here's the help message for ::stacks, which has a lot of details on its use:

> ::help stacks

  stacks - print unique kernel thread stacks

  [ addr ] ::stacks [-afiv] [-c func] [-C func] [-m module] [-M module] 
      [-s sobj | -S sobj] [-t tstate | -T tstate]

  ::stacks processes all of the thread stacks on the system, grouping
  together threads which have the same:
    \* Thread state,
    \* Sync object type, and
    \* PCs in their stack trace.
  The default output (no address or options) is just a dump of the thread
  groups in the system.  For a view of active threads, use "::stacks -i",
  which filters out FREE threads (interrupt threads which are currently
  inactive) and threads sleeping on a CV. (Note that those threads may still
  be noteworthy; this is just for a first glance.)  More general filtering
  options are described below, in the "FILTERS" section.
  ::stacks can be used in a pipeline.  The input to ::stacks is one or more
  thread pointers.  For example, to get a summary of threads in a process,
  you can do:
    procp::walk thread | ::stacks
  When output into a pipe, ::stacks prints all of the threads input,
  filtered by the given filtering options.  This means that multiple
  ::stacks invocations can be piped together to achieve more complicated
  filters.  For example, to get threads which have both 'fop_read' and
  'cv_wait_sig_swap' in their stack trace, you could do:
    ::stacks -c fop_read | ::stacks -c cv_wait_sig_swap_core
  To get the full list of threads in each group, use the '-a' flag:
    ::stacks -a
    -a    Print all of the grouped threads, instead of just a count.
    -f    Force a re-run of the thread stack gathering.
    -v    Be verbose about thread stack gathering.
    -i    Show active threads; equivalent to '-S CV -T FREE'.
    -c func[+offset]
          Only print threads whose stacks contain func/func+offset.
    -C func[+offset]
          Only print threads whose stacks do not contain func/func+offset.
    -m module
          Only print threads whose stacks contain functions from module.
    -M module
          Only print threads whose stacks do not contain functions from
    -s {type | ALL}
          Only print threads which are on a 'type' synchronization object
    -S {type | ALL}
          Only print threads which are not on a 'type' SOBJ.
    -t tstate
          Only print threads which are in thread state 'tstate'.
    -T tstate
          Only print threads which are not in thread state 'tstate'.
     SOBJ types: mutex rwlock cv sema user user_pi shuttle
  Thread states: free sleep run onproc zomb stopped wait panic


  Target: kvm
  Module: genunix
  Interface Stability: Unstable

Wednesday Jun 14, 2006

Debugging with libumem and MDB

In celebration of OpenSolaris's birthday, I thought I would do some more blogging about libumem, one of my favorite parts of it. In particular, I'll cover some of its debugging features, which borrow heavily from the kmem dcmds and walkers written by Bryan, Dan, and others.

Much of the debugging power of libumem comes from its mdb(1M) debugger module. Anyone familiar with the kmem dcmds used for kernel debugging will see a lot of similarities (modulo some 'u's where 'k's used to be). You can see a list of everything it provides by doing:

> ::dmods -l libumem.so.1

  dcmd allocdby             - given a thread, print its allocated buffers
  dcmd bufctl               - print or filter a bufctl
  dcmd bufctl_audit         - print a bufctl_audit
  dcmd findleaks            - search for potential memory leaks
  dcmd freedby              - given a thread, print its freed buffers
  dcmd ugrep                - search user address space for a pointer
  dcmd umalog               - display umem transaction log and stack traces
  dcmd umastat              - umem allocator stats
  dcmd umausers             - display current medium and large users of the
                              umem allocator
  dcmd umem_cache           - print a umem cache
  dcmd umem_debug           - toggle umem dcmd/walk debugging
  dcmd umem_log             - dump umem transaction log
  dcmd umem_malloc_dist     - report distribution of outstanding malloc()s
  dcmd umem_malloc_info     - report information about malloc()s by cache
  dcmd umem_status          - Print umem status and message buffer
  dcmd umem_verify          - check integrity of umem-managed memory
  dcmd vmem                 - print a vmem_t
  dcmd vmem_seg             - print or filter a vmem_seg
  dcmd whatis               - given an address, return information
  walk allocdby             - given a thread, walk its allocated bufctls
  walk bufctl               - walk a umem cache's bufctls
  walk bufctl_history       - walk the available history of a bufctl
  walk freectl              - walk a umem cache's free bufctls
  walk freedby              - given a thread, walk its freed bufctls
  walk freemem              - walk a umem cache's free memory
  walk leak                 - given a leak ctl, walk other leaks w/ that
  walk leakbuf              - given a leak ctl, walk addr of leaks w/ that
  walk umem                 - walk a umem cache
  walk umem_alloc_112       - walk the umem_alloc_112 cache
... more umem_alloc_\* caches ...
  walk umem_bufctl_audit_cache - walk the umem_bufctl_audit_cache cache
  walk umem_bufctl_cache    - walk the umem_bufctl_cache cache
  walk umem_cache           - walk list of umem caches
  walk umem_cpu             - walk the umem CPU structures
  walk umem_cpu_cache       - given a umem cache, walk its per-CPU caches
  walk umem_hash            - given a umem cache, walk its allocated hash table
  walk umem_log             - walk the umem transaction log
  walk umem_magazine_1      - walk the umem_magazine_1 cache
... more umem_magazine_\* caches ...
  walk umem_slab            - given a umem cache, walk its slabs
  walk umem_slab_cache      - walk the umem_slab_cache cache
  walk umem_slab_partial    - given a umem cache, walk its partially allocated
                              slabs (min 1)
  walk vmem                 - walk vmem structures in pre-fix, depth-first order
  walk vmem_alloc           - given a vmem_t, walk its allocated vmem_segs
  walk vmem_free            - given a vmem_t, walk its free vmem_segs
  walk vmem_postfix         - walk vmem structures in post-fix, depth-first
  walk vmem_seg             - given a vmem_t, walk all of its vmem_segs
  walk vmem_span            - given a vmem_t, walk its spanning vmem_segs
There's a lot of meat here, but I'll start by focusing on the most important dcmds and walkers.

Important dcmds

addr::whatis Reports information about a given buffer. At the moment, the support for this is kind of anemic; it will only give information about buffers under libumem's control. There's an RFE to do better:

4706502 enhance ::whatis for libumem

But even as it is, it is still quite handy for debugging. In Solaris 10 and later, ::whatis will automatically provide the bufctl or vmem_seg address for buffers with debugging information attached to them:
without UMEM_DEBUG set
> f3da8::whatis      
f3da8 is f3da8+0, allocated from umem_alloc_32

with UMEM_DEBUG=default
> f3da8::whatis      
f3da8 is f3da8+0, bufctl fd150 allocated from umem_alloc_32
> fd150::bufctl -v
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
           fd150            f3da8    d8c3823401f60                1
                            e2788            82920                0
This allows for a quick answer to the question "what is this buffer, and who was the last one to allocate/free it". See the descriptions of ::bufctl and ::vmem_seg, below, for more information on their use. (In Solaris 9, you need to use '-b' to get the bufctl address)

addr::ugrep Searches the entire address space for a particular pointer value. The value must be properly aligned. There are options to loosen the search; -d dist searches for [addr, addr + dist) instead of an exact match, -m mask only compares the bits selected in mask, etc.

addr::bufctl Display's information from a 'umem_bufctl_audit_t' pointer, which includes the buffer address, timestamp, thread, and caller. With the '-v' switch ("verbose" mode), it also includes the cache, transaction log pointer, contents log pointer, and stack trace.

This dcmd can also be used in a pipeline to filter the bufctls it prints, by address (-a addr), function or function+offset in the stack trace (-c caller), timestamp (-e earliest / -l latest), or thread (-t thread)

In Solaris Nevada, you can also get the full history for a bufctl by using the -h flag.

On Solaris 9, the equivalent of ::bufctl -v is addr::bufctl_audit.

addr::vmem_seg Display's information from a 'vmem_seg_t' pointer, which includes the type, start address and end addresses of the segment. For ALLoCated segments, it also includes the top stack from the stacktrace. With the '-v' switch, ("verbose" mode), it also includes (for ALLoCated segments only) the thread, timestamp, and stack trace recorded at allocation time.

::findleaks Does a conservative garbage-collection of the entire process in order to find memory leaks. The memory leaks are then grouped by stack trace and either (umem cache) or (allocation size). To dump all of the leak stack traces, you can use the -d flag.

::umastat Prints a report of all of the umem-managed memory in the system, grouped by umem cache and vmem arena. This can be used to see which allocation sizes are chewing up the most memory.

[addr]::umem_verify Verifies the consistancy of the umem heap. If debugging is enabled, this will find instances of modified free buffers and writes past the end of the buffer. To get a detailed report of corrupted buffer addresses, take the cache pointer from a line with "n corrupted buffers", and do addr::umem_verify.

::umem_cache Lists all of the umem caches in the system, in tabular form. This is often the easiest way to get a cache's address.

That's all for now, but I'll follow up with more descriptions and examples later.

Tags: [ , , , ]

Tuesday Feb 21, 2006

Some block comments about libumem

One of the projects I've been working on recently is a wad covering the following bugs:
4720206 ::findleaks shouldn't cache results across state changes
4743353 libumem's module fails to load on idle targets 
6304072 libumem seems to use more heap than it needs
6336202 d4fc7824::typegraph made mdb crash
As part of it, I made some ASCII-art comments describing the layout of a umem buffer and slab, which I thought might be of interest more generally. Here are the block comments:
 * Each slab in a given cache is the same size, and has the same
 * number of chunks in it;  we read in the first slab on the
 * slab list to get the number of chunks for all slabs.  To
 * compute the per-slab overhead, we just subtract the chunk usage
 * from the slabsize:
 * +------------+-------+-------+ ... --+-------+-------+-------+
 * |////////////|       |       | ...   |       |///////|///////|
 * |////color///| chunk | chunk | ...   | chunk |/color/|/slab//|
 * |////////////|       |       | ...   |       |///////|///////|
 * +------------+-------+-------+ ... --+-------+-------+-------+
 * |            \_______chunksize * chunks_____/                |
 * \__________________________slabsize__________________________/
 * For UMF_HASH caches, there is an additional source of overhead;
 * the external umem_slab_t and per-chunk bufctl structures.  We
 * include those in our per-slab overhead.
 * Once we have a number for the per-slab overhead, we estimate
 * the actual overhead by treating the malloc()ed buffers as if
 * they were densely packed:
 *      additional overhead = (# mallocs) * (per-slab) / (chunks);
 * carefully ordering the multiply before the divide, to avoid
 * round-off error.
 * A malloc()ed buffer looks like:
 *      <----------- mi.malloc_size --->
 *      <----------- cp.cache_bufsize ------------------>
 *      <----------- cp.cache_chunksize -------------------------------->
 *      +-------+-----------------------+---------------+---------------+
 *      |/tag///| mallocsz              |/round-off/////|/debug info////|
 *      +-------+---------------------------------------+---------------+
 *              <-- usable space ------>
 * mallocsz is the argument to malloc(3C).
 * mi.malloc_size is the actual size passed to umem_alloc(), which
 * is rounded up to the smallest available cache size, which is
 * cache_bufsize.  If there is debugging or alignment overhead in
 * the cache, that is reflected in a larger cache_chunksize.
 * The tag at the beginning of the buffer is either 8-bytes or 16-bytes,
 * depending upon the ISA's alignment requirements.  For 32-bit allocations,
 * it is always a 8-byte tag.  For 64-bit allocations larger than 8 bytes,
 * the tag has 8 bytes of padding before it.
 * 32-byte, 64-byte buffers <= 8 bytes:
 *      +-------+-------+--------- ...
 *      |/size//|/stat//| mallocsz ...
 *      +-------+-------+--------- ...
 *                      ^
 *                      pointer returned from malloc(3C)
 * 64-byte buffers > 8 bytes:
 *      +---------------+-------+-------+--------- ...
 *      |/padding///////|/size//|/stat//| mallocsz ...
 *      +---------------+-------+-------+--------- ...
 *                                      ^
 *                                      pointer returned from malloc(3C)
 * The "size" field is "malloc_size", which is mallocsz + the padding.
 * The "stat" field is derived from malloc_size, and functions as a
 * validation that this buffer is actually from malloc(3C).
For more details on how umem works, you can look at the kmem and vmem papers:

The Slab Allocator: An Object-Caching Kernel Memory Allocator, Summer USENIX 1994
Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources, USENIX 2001

Tags: [ , , , ]

Wednesday Nov 23, 2005

Coverage testing

A couple years back, I wrote up a description of how to use the Sun Studio compiler's coverage testing features to test userland code. Now that OpenSolaris is here, I thought it might come in handy for a larger audience. Here's goes:

How do I do coverage analysis on my user-level code?

The Sun Workshop compilers we use have some pretty good profiling and test analysis tools built in to them. One of the more useful for user-space code is Coverage Analysis, which gives you a measure of how complete your testing is.

Coverage analysis annotates each "block" of straight-line code with a count of the number of times it has executed. For testing, what is usually more interesting is which lines were never executed, and the "Coverage", or percentage of blocks in your program or library that were exercised in your testing. For more information, see tcov(1), in /opt/SUNWspro/man.

Compilation and Linking
Coverage analysis requires a special compilation of your program or library. Each .c file needs to be compiled with -xprofile=tcov, and the final link (either to executable or shared library) also needs -xprofile=tcov.

CFLAGS += -xprofile=tcov
CFLAGS64 += -xprofile=tcov
DYNFLAGS += -xprofile=tcov (shared libraries only)
in the appropriate Makefiles, then make clean; make install is sufficient.

Generating Profile Data
The -xprofile=tcov version of your binary will generate profile information every time the executable is run (or, in the case of a shared library, any executable which links against it is run) and exits normally. The output is placed (by default) in ./progname.profile/, which will build up data from all executions as they exit. It will even join up 32-bit and 64-bit data sets.

The tcov output location is controlled by two environment variables, SUN_PROFDATA_DIR (default '.'), and SUN_PROFDATA (default 'progname.profile'). So if you are testing libfoo.so, and want to join the data from a bunch of executions into /tmp/libfoo.profile, you would set:
% SUN_PROFDATA=libfoo.profile
% setenv SUN_PROFDATA_DIR /tmp
% setenv SUN_PROFDATA libfoo.profile
before your runs.

Processing the profile data
Once you have finished gathering data, you can use the tcov(1) command, located in /opt/SUNWspro/bin (or wherever you keep your compilers) to analyze it. It's syntax is pretty straightforward:
% tcov -x profile_dir sourcefile...
For example, to analyze the previous libfoo example, you might: (here I use a seperate directory for my tcov analysis)
% cd usr/src/lib/libfoo
% mkdir tcov
% cd tcov
% tcov -x /tmp/libfoo.profile ../common/\*.c ../sparc/\*.c ../sparcv9/\*.c
Analyzing the data
Nota Bene: The counts tcov uses to generate its output are updated without holding locks. For multi-threaded programs only, this means that some counts may be lower than expected. Nevertheless, if a block has been executed at least once, its count will be non-zero.

For each source file you pass in on the command line, tcov will generate a .tcov file (for example, ../common/foo.c -> foo.c.tcov). Each file contains the original source, annotated with execution counts. Each line that starts a "basic block" is prefixed with either '##### ->', indicating that it has not been executed, or 'count ->', indicating how many times it was executed.

After the annotated source, there is a summary of the file, including things like total blocks, number executed, % coverage, average executions per block, etc.

I've written a tool, tcov_summarize, which takes the tcov files in the current directory and displays a summary of the current state. The columns are "total blocks", "executed blocks", and "% executed" (or % coverage).

Command example: cpio
% cd usr/src/cmd/cpio
% grep tcov Makefile
CFLAGS += -xprofile=tcov
% make
... (made cpio) ...
% mkdir tcov
% cd tcov
% ../cpio
cpio: One of -i, -o or -p must be specified.
        cpio -i[bcdfkmrstuv@BSV6] [-C size] [-E file] [-H hdr] [-I file [-M msg]] [-R id] [patterns]
        cpio -o[acv@ABLV] [-C size] [-H hdr] [-O file [-M msg]]
        cpio -p[adlmuv@LV] [-R id] directory
% ls
% tcov -x cpio.profile ../\*.c
% ls
cpio.c.tcov       cpio.profile/     cpiostat.c.tcov
% tcov_summarize
 1818    32   1.76 cpio.c
    2     0   0.00 cpiostat.c
 1820    32   1.76 total
% find . | ../cpio -ocB > /dev/null
590 blocks
% tcov -x cpio.profile ../\*.c
% tcov_summarize
 1818   326  17.93 cpio.c
    2     0   0.00 cpiostat.c
 1820   326  17.91 total
Library example: libumem
% cd usr/src/lib/libumem   
% grep tcov Makefile.com   
CFLAGS +=       -v $(LOCFLAGS) -I$(CMNDIR) -xprofile=tcov
CFLAGS64 +=     -v $(LOCFLAGS) -I$(CMNDIR) -xprofile=tcov
DYNFLAGS +=     -M $(MAPFILE) -z interpose -xprofile=tcov
% make
... (made libumem) ...
% mkdir tcov   
% cd tcov   
% SUN_PROFDATA=libumem.profile   
% LD_PRELOAD=../sparc/libumem.so.1 LD_PRELOAD_64=../sparcv9/libumem.so.1
% export LD_PRELOAD LD_PRELOAD_64   
% ls   
% ls   
% tcov -x libumem.profile ../common/\*.c ../sparc/\*.c   
% /home/jwadams/bin/tcov_summarize   
   75    44  58.67 envvar.c
   10     7  70.00 getpcstack.c
   72    22  30.56 malloc.c
   78    27  34.62 misc.c
  592   255  43.07 umem.c
    1     0   0.00 umem_agent_support.c
  315   167  53.02 vmem.c
   13    10  76.92 vmem_base.c
   20     0   0.00 vmem_mmap.c
   35    17  48.57 vmem_sbrk.c
 1211   549  45.33 total
% tcov -x libumem.profile ../common/\*.c ../sparc/\*.c   
% /home/jwadams/bin/tcov_summarize   
   77    45  58.44 envvar.c
   10     7  70.00 getpcstack.c
   72    28  38.89 malloc.c
   78    27  34.62 misc.c
  592   314  53.04 umem.c
    1     0   0.00 umem_agent_support.c
  315   192  60.95 vmem.c
   13    10  76.92 vmem_base.c
   20     0   0.00 vmem_mmap.c
   35    17  48.57 vmem_sbrk.c
 1213   640  52.76 total
(Note that running tcov gave us more coverage, since the library is being preloaded underneath it)

Tags: [ , ]

Wednesday Nov 16, 2005

An initial encounter with ZFS

After ZFS became available in onnv_27, I immediately upgraded my desktop system to the newly minted bits. After some initial setup, I've been happily using ZFS for all of my non-root, non-NFSed data. I'm getting about 1.7x my storage due to ZFS's compression, and have new-found safety, since my data is now mirrored.

During the initial setup, my intent was to use only the slices I'd already set up to do the transfer. What I did not plan for was the fact that ZFS does not currently allow you to remove a non-redundant slice from a storage pool without destroying the pool; here's what I did, as well as what I should have done:

My setup

Before I began, my system layout was fairly simple:


Total disk cylinders available: 24620 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm    1452 -  7259        8.00GB    (5808/0/0)  16779312
  1 unassigned    wm       0                0         (0/0/0)            0
  2     backup    wm       0 - 24619       33.92GB    (24620/0/0) 71127180
  3       swap    wu       0 -  1451        2.00GB    (1452/0/0)   4194828
  4 unassigned    wm       0                0         (0/0/0)            0
  5 unassigned    wm       0                0         (0/0/0)            0
  6 unassigned    wm       0                0         (0/0/0)            0
  7       aux0    wm    7260 - 24619       23.91GB    (17360/0/0) 50153040

Total disk cylinders available: 24620 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0    altroot    wm       0 -  5807        8.00GB    (5808/0/0)  16779312
  1 unassigned    wm       0                0         (0/0/0)            0
  2     backup    wm       0 - 24619       33.92GB    (24620/0/0) 71127180
  3       swap    wu    5808 -  7259        2.00GB    (1452/0/0)   4194828
  4 unassigned    wm       0                0         (0/0/0)            0
  5 unassigned    wm       0                0         (0/0/0)            0
  6 unassigned    wm       0                0         (0/0/0)            0
  7       aux1    wm    7260 - 24619       23.91GB    (17360/0/0) 50153040
That is, two 34GB hard disks, partitioned identically. There are four slices of major interest:
c0t0d0s0    /            8G    root filesystem
c0t0d0s7    /aux0       24G    data needing preserving
c0t1d0s0    /altroot     8G    alternate root, currently empty
c0t1d0s7    /aux1       24G    some data (/opt) which needed preserving
My goal was to create a 24Gig mirrored ZFS pool, using the underlying slices of /aux0 and /aux1. /altroot would be my initial stepping stone.

The process

Without any prior experience setting up a ZFS pool, I did the following steps:

... remove /altroot from /etc/vfstab ...
# zpool create mypool c0t1d0s0
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c0t1d0s0 contains a ufs filesystem
# zpool create -f mypool c0t1d0s0
# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
mypool                 7.94G   32.5K   7.94G     0%  ONLINE     -
# zfs set compression=yes mypool
# zfs create pool/opt
# zfs set mountpoint=/new_opt mypool/opt
... copy data from /aux1/opt to /new_opt, clear out /aux1 ...
... remove /aux1 from vfstab, and remove the /opt symlink ...
# zfs set mountpoint=/opt mypool/opt
# df -h /opt
Filesystem             size   used  avail capacity  Mounted on
mypool/opt             7.9G   560M   7.4G     6%    /opt
I now had all of the data I needed off of /aux1, and I wanted to add it to the storage pool. This is where I made a mistake; zfs, in its initial release, cannot remove a non-redundant device from a pool (this is being worked on). I did:
# zpool add -f mypool c0t1d0s7 (\*MISTAKE\*)
# zpool status mypool
  pool: mypool
 state: ONLINE
 scrub: none requested

        NAME        STATE     READ WRITE CKSUM
        mypool      ONLINE       0     0     0
          c0t1d0s0  ONLINE       0     0     0
          c0t1d0s7  ONLINE       0     0     0
# zfs create mypool/aux0
# zfs create mypool/aux1
# zfs set mountpoint=/new_aux0 mypool/aux0
# zfs set mountpoint=/aux1 mypool/aux1
... move data from /aux0 to /new_aux0 ...
... remove /aux0 from /etc/vfstab ...
# zfs set mountpoint=/aux0 mypool/aux0
And now I was stuck; I wanted to end up with a configuration of:
but there was no way to get there without removing c0t1d0s0 from the pool, which ZFS doesn't allow you to do directly. I ended up creating a new pool, "pool" with c0t0d0s7 in it, copying all of my data \*again\*, destroying "mypool", then mirroring c0t0d0s7 by doing:
# zpool attach pool c0t0d0s7 c0t1d0s7

The right way

If I'd planned this all better, the right way to build the pool I wanted would have been to do:
# zpool create pool c0t1d0s0
... create /opt_new, move data to it ...
# zpool replace pool c0t1d0s0 c0t1d0s7
... create /aux0_new, move data to it ...
# zpool attach pool c0t1d0s7 c0t0d0s7
... clean up attributes, wait for sync to complete ...
# zpool iostat -v
                 capacity     operations    bandwidth
pool           used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
pool          15.0G  8.83G      0      1  1.47K  6.99K
  mirror      15.0G  8.83G      0      1  1.47K  6.99K
    c0t1d0s7      -      -      0      1  2.30K  7.10K
    c0t0d0s7      -      -      0      1  2.21K  7.10K
------------  -----  -----  -----  -----  -----  -----
The real lesson here is to do a little planning if you have to juggle slices around; missteps can take some work to undo. Read about "zpool replace" and "zpool attach"; they are very useful for this kind of juggling.

Once I got everything set up, everything just works; despite having cut my available storage in /aux0 and /aux1 in half (48GB -> 24GB, due to mirroring), compression is giving me back a substantial fraction of the loss (~70%, give or take, and assuming the ratio holds steady):

# zfs get -r compressratio pool
NAME             PROPERTY       VALUE                      SOURCE
pool             compressratio  1.70x                      -                
pool/aux0        compressratio  1.54x                      -                
pool/aux1        compressratio  2.01x                      -                
pool/opt         compressratio  1.68x                      -                
and ZFS itself is quite zippy. I hope this is a useful lesson, and that you enjoy ZFS!

Tags: [ , ]

Wednesday Sep 28, 2005

Brokenness Hides Itself

When engineers get together and talk, one of the things they like to bring out and share is war stories; tales of debugging daring-do and the amazing brokenness that can be found in the process. I recently went through an experience that makes good war story material, and I thought I'd share it.

A couple weeks ago, there were multiple reports of svc.configd(1M) failing repeatedly with one of:

svc.configd: Fatal error: invalid integer "10" in field "id"
svc.configd: Fatal error: invalid integer in database

Since I'm one of the main developers of svc.configd(1M), I started to investigate. I first had the people hitting it send me their repositories, but they all checked out as having no problems. The problem was only being seen on prototype Niagara machines and some Netra T1s; the first reproducible machine I got console access to was a Niagara box.

Figuring out what happened

Unfortunately, the box was running old firmware, which significantly restrained its usability; I spent more time fighting the machine than working on tracking down the problem. I finally boot neted the machine, mounted the root filesystem, and added a line:

sulg::sysinit:/sbin/sulogin </dev/console 2<>/dev/console >&2
to /etc/inittab to get a shell very early in boot. After booting and logging into that shell, I was able attach mdb(1) to svc.configd(1M):
# mdb -p `pgrep svc.configd`
Loading modules: [ svc.configd ld.so.1 libumem.so.1 libuutil.so.1
libc.so.1 ]
Now, I knew that svc.configd uses a utility function, uu_strtouint(), to do all of its integer-to-number conversions, and that the most likely cause of the failure seen was some kind of unprintable character in its first argument. So I set a breakpoint there and started it running:
> uu_strtouint::bp
> :c
mdb: stop at libuutil.so.1`uu_strtouint
mdb: target stopped at:
uu_strtouint:     save      %sp, -0x68, %sp
> ::step[1]
mdb: target stopped at:
uu_strtouint+4:   ld        [%fp + 0x60], %l6
> $C 2
feefb968 libuutil.so`uu_strtouint+4(1cb4e4, feefba9c)
feefb9d0 string_to_id+0x24(1cb4e4, feefba9c)
feefba38 fill_child_callback+0x20(feefbbe4, 2)
feefbaa0 sqlite_exec+0xd8(13de08, 2)
feefbb18 backend_run+0x74(89b48, 169848)
feefbb80 scope_fill_children+0x38(5e9f00, 1000)
So the first argument to uu_strtouint() is 1cb4e4. I attempted to print out the string there:
> 1cb4e4/s
mdb: failed to read data from target: no mapping for address
"failed to read data from target"? A reported address of 0x2e instead of 0x1cb4e4? Clearly, the problem is affecting mdb(1) as well. Some additional tests in mdb(1) revealed the magnitude of the problem:
> 1=J
> 10=J
> 11=J
> 111=J
> 101=J
i.e. the results were completely busted, and had a lot of high-word bits set unexpectedly. Interestingly, it looked like only multiple-digit numbers were affected.

So I proceeded to investigate uu_strtouint(). The first step was to see how the function was failing; there are a number of different ways to get to uu_set_error(), which sets up libuutil's equivalent of errno. A simple breakpoint led to the following code segment:

   269         if (strtoint(s, &val, base, 0) == -1)
   270                 return (-1);
   272         if (val < min) {
   273                 uu_set_error(UU_ERROR_UNDERFLOW);
   274                 return (-1);
   275         } else if (val > max) {
   276                 uu_set_error(UU_ERROR_OVERFLOW);
   277                 return (-1);
   278         }
The failure was occurring at line 276; i.e. an overflow. The value (for an input string of "10") was 0x10000000a, that is, 1\*2\^32 + 10. So something was going terribly wrong in strtoint(). Since mdb(1) was also failing, the problem was probably in some shared library routine. Looking at the disassembly, there are very few external calls:
> strtoint::dis ! grep call | sed 's/libuutil.so.1`//g'
strtoint+0xc:     call      +8            <strtoint+0x14>
strtoint+0x204:   call      +0x12a30      <PLT:__udiv64>
strtoint+0x2b8:   call      +0x12988      <PLT:__umul64>
strtoint+0x404:   call      -0xd30        <uu_set_error>
strtoint+0x414:   call      -0xd40        <uu_set_error>
strtoint+0x424:   call      -0xd50        <uu_set_error>
strtoint+0x440:   call      -0xd6c        <uu_set_error>
strtoint+0x450:   call      -0xd7c        <uu_set_error>
strtoint+0x460:   call      -0xd8c        <uu_set_error>
The first call is part of the PIC[2] calling sequence, and uu_set_error() is only called in the failure paths, which we knew weren't being hit. So __umul64() and __udiv64() are the next suspects. These are runtime support routines in libc, which the compiler inserts calls to when it wants to do 64-bit multiplications and divisions. The code for strtoint() only has one multiply and one divide, so it's easy to see where they occur:
    103         multmax = (uint64_t)UINT64_MAX / (uint64_t)base;
    105         for (c = \*++s; c != '\\0'; c = \*++s) {
    116                 if (val > multmax)
    117                         overflow = 1;
    119                 val \*= base;
    120                 if ((uint64_t)UINT64_MAX - val < (uint64_t)i)
    121                         overflow = 1;
    123                 val += i;
    124         }
The division always occurs, so I looked at the multiply routine first; disassembling it showed the following suspicious section:
> __umul64::dis ! sed 's/libc.so.1`//g'
__umul64+0x38:        cmp       %l7, 0
__umul64+0x3c:        call      +0xc95e4      <PLT:.umul>
__umul64+0x40:        mov       %i3, %i0
__umul64+0x44:        mov       %l6, %i1
For a function call, %o0-%o6 hold the arguments to the function, and afterwards, %o0 and %o1 hold the results. But here, there's no manipulation of the %os. In fact, the function doesn't reference them anywhere:
> __umul64::dis ! grep '%o'
Here's the smoking gun; we've hit some kind of compiler bug. The relevant source code is:
36 extern unsigned long long __umul32x32to64(unsigned, unsigned);
70 unsigned long long
71 __umul64(unsigned long long i, unsigned long long j)
72 {
81                 if (i1)
82                         result = __umul32x32to64(i1, j1);
88         return (result);
89 }

     29         .inline __umul32x32to64,8
     30         call    .umul,2
     31         nop
     32         mov     %o0, %o2
     33         mov     %o1, %o0
     34         mov     %o2, %o1
     35         .end
From previous experience with the compiler's inlining (I codereviewed the fix for another inliner-related bug, 6225876) I knew that there is an optimization stage after the inline code is generated; from the looks of it, that stage thought that the call had no effect on register state, and so optimized away the argument set up. As a result, the final return value of __umul64 is just some junk register state. So I updated the original bug, and sent it over to the compiler people:

6323803 compiler bug causes __\*mul64 failure; svc.configd dies

After some investigation, the compiler folks accepted the bug, and noted that it only effects call statements in inline assembly. Checking the rest of the '.il' files in OS/Net, I verified that this was the only place where we used call.

I still needed to figure out why this was only effecting certain platforms, and how we were going to deal with the problem in the short-term, so that we weren't waiting on a compiler patch.

Why only sun4v and Netras?

I first wrote a small C program, to test:

% cat > math_test.c <<EOF
#include <stdio.h>

main(int argc, char \*argv[])
        unsigned long long x, y, z;

        x = 10;
        y = 1;
        z = x \* y;
        printf("%llx\\n", z);

        return (0);
% cc -o math_test math_test.c
% ./math_test
% truss -t \\!all -u '::__\*mul64' ./math_test
/1@1:   -> libc_psr:__mul64(0x0, 0xa, 0x0, 0x1)
/1@1:   <- libc_psr:__mul64() = 0
... (moving over to niagara machine) ...
% truss -t \\!all -u '::__\*mul64' ./math_test
/1@1:   -> libc:__mul64(0x0, 0xa, 0x0, 0x1)
/1@1:   <- libc:__mul64() = 1
A hah! For sun4u, libc_psr is overriding the libc definitions. So sun4v won't work for two reasons:
  1. none of the sun4v libc_psrs override __\*mul64
  2. even if they did, sun4v has a wacky "mount the correct libc_psr over /platform/sun4v/lib/libc_psr.so.1 which wouldn't occur until later in boot anyway.
It ends up that the reason __\*mul64 are overriden in libc_psr is a hold-over from Solaris 9 and earlier, where 32-bit Solaris was supported on pre-UltraSPARC chips. In fact, the call to .umul is a hold-over from the days of sparcv7, where the umul instruction wasn't guaranteed to be there. The sun4u libc_psr version has sparcv8plus versions of __mul64, which take advantage of the 64-bit processing that sparcv9 adds, in a form compatible with 32-bit code.

The fact that these weren't included in the sun4v libc_psr is an oversight, but #2 means that it wouldn't have mattered if they did. The Netra T1's running into this problem are explained by the fact that there are a set of missing /platform/\*/lib symlinks for the following platforms:

Which are all various "Netra T1" varieties. Since the link to the sun4u libc_psr is missing, they exhibit the same problem as the sun4v does. I filed 6324790 to cover adding the missing links, since it is a performance problem. (programs will not be able to take advantage of the faster mem{cpy,cmp,move,set} versions contained there)

The final question is "Why now?". What changed to make this a problem? Four days before the first reported incident, 6316914 was putback, which switched the build from the Studio 8 to the Studio 10 compilers. Because of the libc_psr masking, no-one noticed the problem until they put the bits on the (much rarer) platforms with the bug.

The Fix

To fix this, you simply move the __{u,}{mul,div}64 functions from libc_psr back into libc, using the v8plus versions that were in libc_psr. libc's assembly files are already being compiled in v8plus mode due to atomic_ops(3C), so it just required shuffling around some code, removing stuff from makefiles, and deleting the old, out of date code. This was done under bugid 6324631, integrated in the same build as the compiler switch, so only a limited number of OS/Net developers were effected. Life was all better. Well, almost.

The follow-on

In testing my fix, I did a full build, bfued a machine, and just dumped fixed binaries on other machines. The one thing I didn't test was BFUing from broken bits to fixed bits. And, of course, there was an unforeseen problem (bugid 6327152). To understand what went wrong, I'm going to have to do some background on how BFU works.

bfu is a power-tool whose essential job is to dump a full set of binaries over a running system, even if the new bits are incompatible with the ones running the system. To do this, it copies binaries, libraries, and the dynamic linker into a set of subdirectories of /tmp: /tmp/bfubin, /tmp/bfulib, and /tmp/bl. It then uses a tool called "bfuld" to re-write the "interepreter" information for the binaries in /tmp/bfubin, to point at the copied ld.so.1(1). It then sets LD_LIBRARY_PATH in the environment, to re-direct any executed programs to the copied libraries, and sets PATH=/tmp/bfubin. This gives BFU a protected environment to run in.

The problem is that auxiliary filters (like libc_psr) were not disabled, so programs running in the BFU environment were picking up the libc_psr from /platform. Once the \*new\* libc_psr was extracted, programs were no longer protected from the broken __\*mul64() routines. Since things like scanf(3C) use __mul64 internally, this caused breakage all over the place, most noticeably in cpio(1).

The fix for this is reasonably simple; set LD_NOAUXFLTR=1 in the environment to prevent auxiliary filters from being used,[3] make a copy of libc_psr.so.1 into /tmp/bfulib, and use LD_PRELOAD=/tmp/bfulib/libc_psr.so.1 to override the bad libc functions. The latter part of this can be removed once we're sure no broken libcs are running around.


I hope you've enjoyed this. The bug ended up being surprisingly subtle (as many compiler bugs are), but luckily the fix was relatively simple. The Law of Unintended Consequences applies, as always.


[1] ::steping over the "save" instruction is a standard SPARC debugging trick; it makes the arguments to the function and the stack trace correct.
[2] Position-Independent Code, which is how shared libraries are compiled.
[3] Ironically, if we had done this \*before\* the compiler switch was done, BFU would have immediately failed when run on the broken bits, and the whole problem would have been noticed much more quickly.

[ Technorati ]

Thursday Jun 30, 2005

The FBT provider, opensolaris source, and fun with the environment

The FBT provider, opensolaris source, and fun with the environment

Now that opensolaris is out there, it's quite a bit easier for folks to use DTrace's FBT provider. FBT provides "function boundary tracing", i.e. it has probes for the entry and exit for almost every function in the kernel. This is amazingly powerful and flexible, but it leads to it being hard to use: with over 20,000 functions on a typical Solaris box, it's very hard to know where to start, especially without access to the source code.

With OpenSolaris, the source code is available. So to illustrate how you can use this newly available information to get something done, I thought I'd use a classic question: How can I examine and muck with the environment?

To start off, we need a plan of attack; since dtrace doesn't have any way of looping over datastructures, typically if you want to walk some datastructure, you just find a place where the kernel is already doing so, and use probes there to sneak a peak at the data as it goes by. The initial environment for a process is set up in the exec(2) system call, so lets see if we can find where we read in the data from the user process.

Looking at the source, exece() calls exec_common(), which is the main workhorse. There doesn't seem to be any direct mucking of the environment, but there is:

    222 	ua.fname = fname;
    223 	ua.argp = argp;
    224 	ua.envp = envp;
    226 	if ((error = gexec(&vp, &ua, &args, NULL, 0, &execsz,
    227 	    exec_file, p->p_cred)) != 0) {
Now we could continue to track all of this down, but it's much easier to just search usr/src/uts for references to the envp symbol. That leads us to stk_copyin(), the routine responsible for copying in everything which is needed on the stack. Here's the environment processing loop:
   1303 	if (envp != NULL) {
   1304 		for (;;) {
   1305 			if (stk_getptr(args, envp, &sp))
   1306 				return (EFAULT);
   1307 			if (sp == NULL)
   1308 				break;
   1309 			if ((error = stk_add(args, sp, UIO_USERSPACE)) != 0)
   1310 				return (error);
   1311 			envp += ptrsize;
   1312 		}
   1313 	}
Without even knowing what stk_getptr() and stk_add() do, we now have enough information to write a D script to get the environment of a process as it exec()s. Here's the basic outline:
  1. First, use fbt::stk_copyin:entry to stash a copy of the envp pointer.
  2. Second, use fbt::stk_getptr:entry to watch for reads from the stored envp address.
  3. Third, use fbt::stk_add:entry and fbt::stk_add:return to print out the environment.
  4. Lastly, use fbt::stk_copyin:return to clean up.
And here's our first script:
#!/usr/sbin/dtrace -s

#pragma D option quiet

	self->envp = (uintptr_t)args[0]->envp;

/ self->envp != 0 /
	/\* check if we're looking at envp or envp+1 \*/
	self->on = ((arg1 - self->envp) <= sizeof (uint64_t));

	/\* update envp if we're on \*/
	self->envp = self->on ? arg1 : self->envp;

/ self->on && args[2] == UIO_USERSPACE /
	self->ptr = arg1;

/ self->ptr != 0 /
	printf("%.79s\\n", copyinstr(self->ptr));
	self->ptr = 0;
	self->on = 0;

	self->envp = 0;
	self->on = 0;
Note that we delay the copyinstr of stk_add()'s second argument until fbt::stk_add:return. This is due to the fact that dtrace(1M) cannot fault in pages; so if a probe tries to copyinstr an address which has not yet been touched, you'll get a runtime error like:
dtrace: error on enabled probe ID 4 (ID 12535: fbt:genunix:stk_add:entry):
invalid address (0x1818000) in action #1 at DIF offset 28
By waiting until the return probe, we avoid this problem; we know that the kernel just touched the page to read in its copy.

Now, looking at the environment is fun, but it would be even more interesting to change the environment of a process while it is being execed. This requires a bit more work, and access to the destructive action copyout(). I'm going to start with a script which requires a recent version of Solaris (snv_15+, or the OpenSolaris release), because Bryan introduced some nice string handling stuff recently. We'll adapt the script to S10 afterwards.

Lets start by saying we want to change the name of the environment variable "FOO" to "BAR", but leave the value the same. The basic idea is simple; copyin() the string in the fbt::stk_add:entry probe, and if it's the one we want to change, copyout() the changes. The kernel will then proceed to copyin() the changed string, and use it for the environment of the process. The complication is the same as before; what if the page hasn't yet been touched, or the copyout() operation fails (for example, if the string isn't writable)?

There's no simple solution, so I'm just going to check \*afterwards\* that we didn't miss changing it, and kill -9 the process if we did. It's vile, but effective. Here's the script:

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option destructive

self uintptr_t ptr;

inline int interesting = (strtok(copyinstr(self->ptr), "=") == "FOO");

	self->envp = (uintptr_t)args[0]->envp;

/ self->envp != 0 /
	/\* check if we're looking at envp or envp+1 \*/
	self->on = ((arg1 - self->envp) <= sizeof (uint64_t));

	/\* update envp if we're on \*/
	self->envp = self->on ? arg1 : self->envp;
	self->ptr = 0;

/ self->on && args[2] == UIO_USERSPACE /
	self->ptr = arg1;
	self->didit = 0;

/ self->ptr != 0 && interesting /
	printf("%d: %s: changed env \\"%s\\"\\n",
	    pid, execname, copyinstr(self->ptr));
	copyout("BAR", self->ptr, 3);		/\* 3 == strlen("BAR") \*/
	self->didit = 1;

/ self->ptr != 0 && interesting && !self->didit /
	printf("%d: %s: killed, env \\"%s\\" couldn't be changed\\n",
	    pid, execname, copyinstr(self->ptr));

	self->envp = 0;
	self->on = 0;
	self->ptr = 0;
The above works great on Solaris Nevada and OpenSolaris, but doesn't work on Solaris 10, because it uses "strtok". So to use it on Solaris 10, we'll have to do things slightly more manually. The only thing that needs to change is the definition of the "interesting" inline, and some more cleanup in fbt::stk_copyin:return:
inline int interesting =
    ((self->str = copyinstr(self->ptr), "=")[0] == 'F' &&
    self->str[1] == 'O' &&
    self->str[2] == 'O' &&
    self->str[3] == '=');
	self->envp = 0;
	self->on = 0;
	self->ptr = 0;
	self->str = 0;
A final note is on stability; we're using private implementation details of Solaris to make this all work, and they are subject to change without notice at any time. This particular part of solaris isn't likely to change much, but you never know. A reasonable RFE would be for more Stable probe-points in the exec(3C) family, so that people can write things like this more stably.

Technorati Tag:
Technorati Tag:
Technorati Tag:

Wednesday Jun 15, 2005

macros and powers of two

Unlike many userland applications, there are large portions of the kernel that need to do bit-level manipulations. These generally fall into a few distinct areas:

  • Drivers communicating with hardware
  • Hash functions, bitmasks, and other high-performance algorithms
  • Memory allocators, page manipulators, etc.
In this note, I'm going to peek into the last group, which typically needs to do a lot of power-of-2 based alignment work, and look at a set of macros used to abstract away some of the bit-fiddling.

The main low-level allocators in Solaris are kmem and vmem; kmem handles your standard page-backed memory allocation, while vmem handles the more abstract concept of allocating from spaces of integers. kmem actually layers on top of vmem, but the details are beyond the scope of this blog entry.

While implementing vmem, Jeff Bonwick introduced a set of handy macros for dealing with things relative to powers of two. The macros live in uts/common/sys/sysmacros.h, and are defined thusly:

 \* Macros for various sorts of alignment and rounding when the alignment
 \* is known to be a power of 2.
#define	P2ALIGN(x, align)		((x) & -(align))
#define	P2PHASE(x, align)		((x) & ((align) - 1))
#define	P2NPHASE(x, align)		(-(x) & ((align) - 1))
#define	P2ROUNDUP(x, align)		(-(-(x) & -(align)))
#define	P2END(x, align)			(-(~(x) & -(align)))
#define	P2PHASEUP(x, align, phase)	((phase) - (((phase) - (x)) & -(align)))
#define	P2CROSS(x, y, align)		(((x) \^ (y)) > (align) - 1)
 \* Determine whether two numbers have the same high-order bit.
#define	P2SAMEHIGHBIT(x, y)		(((x) \^ (y)) < ((x) & (y)))
The purpose of this blog entry is to talk about what they do and how they work; if you don't want the challenge spoiled, you should stop reading now.

Alright, now that they've all left, we can dive in. First, remember the basic equation of two's complement arithmetic:
-x = ~x + 1
That is, to negate a number, complement all of its bits and add one. If I assume that the number has a lowest set bit,[1] and capital Xs represent complements of the corresponding xs, here are some relations for any non-zero binary number:
        x = ...xxxxx10000...
      x-1 = ...xxxxx01111...
       ~x = ...XXXXX01111...
-x = ~x+1 = ...XXXXX10000...
Now, for powers of two, there is only one set bit, so all of the xs are 0. This means that (x-1) is a mask for the offset modulo x, and -x is a mask which clears the offset. In addition, all numbers are unsigned, align is a power-of-2, and offset is less than align. These relations in mind, we can look at what each of the above macros do:
P2ALIGN x & -align clears all bits below align, rounding x down to the next lower multiple of align
P2PHASE x & (align - 1) clears all bits above or equal to align, getting (x % align), the phase of x with regards to align
P2NPHASE (-x) & (align - 1) This is the first tricky formula. We negate x, and get its phase. The result is the same as (align - P2PHASE(x, align)) % align, but is much faster to compute.
P2ROUNDUP -((-x) & -align) This can be re-written as -P2ALIGN(-x, align), which shows how the operation works; "negate, round down, negate" is the same as "round up". So we round x up to the next align boundary.
P2END -((~x) & -align) This can be re-written as -((-x-1) & -align) = P2ROUNDUP(x+1, align); round up (x+1) to the next align boundary. This gets the end address of the current align block.
P2PHASEUP (phase -
 ((phase - x) &
This can be re-written as (phase + P2ROUNDUP(x - phase, align)). So what we've done is rounded up x to have a specific phase with regards to align. (If you've got some linear algebra under your belt, this is similar to transforming f(y) = P2ROUNDUP(y, align) into a coordinate system with x = y + phase)
P2CROSS (x \^ y) > (align-1) (x \^ y) has bits set wherever x and y differ. This just tests to see if they differ in any position above the alignment, which will only be the case if [x, y] crosses an align boundary.
P2SAMEHIGHBIT (x \^ y) < (x & y) Similar to P2CROSS; here, we're checking that the first bit that differs is less than the highest set bit the two have in common.

From the above, here are some trivial relations between the various functions:

P2ALIGN(x, align)   = x - P2PHASE(x, align)
P2ROUNDUP(x, align) = x + P2NPHASE(x, align)
P2END(x, align)     = P2ALIGN(x, align) + align

Finally, the one drawback of these macros is the fact that they are macros. This means that (as usual) the argument parsing can leave much to be desired. The biggest problem is that there is no type enforcement; if x, y, or align are not unsigned integers of the same width, bad results can occur. Hence the (recently added) macros, P2\*_TYPED, defined underneath the originals, which allow an explicit type to be specified.

Despite this drawback, these macros tend to be much clearer than the same things done by hand in the code. They probably aren't used enough, in fact.


[1] Of course, zero doesn't have a lowest set bit, but you can treat it as being 2bits, and get the same answer.

Technorati Tag:
Technorati Tag:

Tuesday Jun 14, 2005

The implementation of ::findleaks

Now that OpenSolaris is available, it's time to explain how it all works. I thought I'd start with some of the mdb(1) dcmds I've worked on, since they are relatively self-contained, and have very strong interface boundaries.

I'll start with my favorite dcmd, ::findleaks, a memory leak detector originally written by Bryan Cantrill, which I re-factored substantially late in Solaris 10. There were a few reasons the refactoring was necessary:

  • When I'd done the original ::findleaks for libumem implementation, I simply copied the original leaky.c and re-wrote it into submission. This worked fine, but was an additional maintenance burden; two dissimilar copies of the same code is to be avoided if possible.
  • The original code reported oversize leaks in an obscure way:
    findleaks: Oversize leak at seg 0x12345678!
    which, unless you are very familiar with the way the allocator works internally, was unlikely to lead you to the stack trace or size of the leaked buffer. (for the curious, 12345678$<vmem_seg was the necessary incantation in the old days)
  • The original ::findleaks algorithm was designed with running in a user process in mind. In particular, it assumed it had a lot of memory to play with. Unfortunately, with the coming of kmdb(1), memory space for dcmds was getting very tight. For ::findleaks to work under kmdb(1), it needed substantial belt-tightening.
  • There were some enhancements I wanted to make; in particular, there was no simple way to dump all of the leaked stack traces. Internally, the "leakbot" tool (also written by Bryan) automatically extracted the set of stack traces, but this didn't help people without access to that tool.
So I started a side project to re-work ::findleaks. The following bugs were part of it:
  4840780 ::findleaks oversized leak reporting needs overhaul
  4849669 ::findleaks' cleanup code needs work
  4931271 kmem walkers and ::findleaks use large amounts of memory
  5030758 ::findleaks should have an option to display the stack traces
but this note only covers the generic ::findleaks implementation.

The files of ::findleaks

Let's start with a lay of the land. The following files encompass the implementation of ::findleaks (to shorten things, I'm assuming a prefix of usr/src/cmd/mdb/common for all file references):

.../modules/genunix/leaky.h The dcmd, dcmd help, and walker function declarations, used by .../modules/genunix/genunix.c and .../modules/libumem/libumem.c to create the dcmd linkage for mdb(1).
.../modules/genunix/leaky_impl.h An implementation header, which defines (and documents, via a large block comment) the interface between the main engine, .../modules/genunix/leaky.c and the two target implementations.
.../modules/genunix/leaky.c The common ::findleaks engine.
The two target implementations, one for the kernel, one for libumem(3lib).

I'll use the term "target" when talking about the other side of ::findleaks's implementation. The non-static target interfaces all start with leaky_subr_.

The target interface is a link-time, functional binding; during the compilation process, alternate versions of the target routines are linked against essentially the same "generic" framework. This is less flexible than doing some kind of dynamic binding (i.e. using function pointers), but is sufficiently general as long as only one target interfaces is required for a given dmod.

The public interface

The top layer of interface consists of the dcmds (D commands) and walkers exported by the genunix and libumem dmods. In the genunix dmod, these are defined in .../modules/genunix/genunix.c:

static const mdb_dcmd_t dcmds[] = {
	/\* from leaky.c + leaky_subr.c \*/
	{ "findleaks", FINDLEAKS_USAGE,
	    "search for potential kernel memory leaks", findleaks,
	    findleaks_help },
static const mdb_walker_t walkers[] = {
	/\* from leaky.c + leaky_subr.c \*/
	{ "leak", "given a leaked bufctl or vmem_seg, find leaks w/ same "
	    "stack trace",
		leaky_walk_init, leaky_walk_step, leaky_walk_fini },
	{ "leakbuf", "given a leaked bufctl or vmem_seg, walk buffers for "
	    "leaks w/ same stack trace",
		leaky_walk_init, leaky_buf_walk_step, leaky_walk_fini },
(the structures used are part of the Debugger Module Linkage, defined in the Solaris Modular Debugger Guide)
The walkers are straightforward; they just walk the cached leak table. ::findleaks is the main event.

First up: initialize, estimate, slurp in the buffers

When ::findleaks is invoked, mdb(1) calls findleaks(), which validates its arguments, calls leaky_cleanup() to clean up any interrupted state, then checks if there is cached data. If not, we start the run in earnest.

The first call into the target interface is here:

	if ((ret = leaky_subr_estimate(&est)) != DCMD_OK)
		return (ret);
This calls over to here or here, which first check that finding memory leaks is possible (i.e. savecore -L dumps are not a consistent snapshot, so we refuse to run on them), then calculate an upper bound on the number of allocated segments in the system.

Shiny new estimate in hand, ::findleaks allocates a (possibly huge) array of leak_mtab_ts, calls into the target to fill it, and sorts the result:

	lk_mtab = mdb_zalloc(est \* sizeof (leak_mtab_t), UM_SLEEP | UM_GC);
	lmp = lk_mtab;

	if ((ret = leaky_subr_fill(&lmp)) != DCMD_OK)
		return (ret);

	lk_nbuffers = lmp - lk_mtab;

	qsort(lk_mtab, lk_nbuffers, sizeof (leak_mtab_t), leaky_mtabcmp);
leaky_subr_fill(), defined here and here, fills in the array with the details of every allocated buffer in the system. Each entry has a base, limit, and control pointer, the last being completely target-defined. The qsort(3C) call sorts the array by base address.

Stage two: find everything reachable

::findleaks is, at its heart, a simulation of a conservative garbage collection run. All of the lk_mtab entries start in the "unmarked" state; that is, they are not known to be reachable. We call into the target:

	if ((ret = leaky_subr_run()) != DCMD_OK)
		return (ret);
( here and here) to scan the "root set"; that is, all statically defined objects. For each such object, the target implementation calls leaky_grep() or one of its variants. These look for pointers, and mark each lk_mtab entry they find, along with any (recursively) reachable entries. When the run is finished, any unmarked buffers must be unreachable leaks.

While the interface for leaky_grep() has a simple description, it invites a spectrum of implementation possibilities, each with different trade-offs in speed, memory usage, and maintainability.[1] Any findleaks run spends virtually all of its time and energy in leaky_grep(), so it is worth the effort of optimizing. The most straightforward implementation is something like:

leaky_grep(addr, size)
	allocate a buffer of size bytes
	read [addr, size) into the buffer
	for each pointer in buffer {
		look up mtab entry
		if it is marked, continue
		mark it
		leaky_grep(mtab's base, mtab's size);
	free the buffer
The immediate problem with this is the recursive depth (hundreds of thousands of call frames are typical in a normal dump.)[2] The next is memory usage; each recursive step requires additional storage for the entire contents of a memory buffer; with hundreds of thousands of recursive invocations occurring at once, that quickly adds up to tens of megabytes of peak storage.

Instead of storing a buffer for each recursive step, the current algorithm uses a single small (16 kbyte) buffer. The following comment describes how we use the buffer:

		 \* The buffer looks like:  ('+'s are unscanned data)
		 \* -----------------------------++++++++++++++++
		 \* |				|		|
		 \* buf				cur		end
		 \* cur scans forward.  When we encounter a new buffer, and
		 \* it will fit behind "cur", we read it in and back up cur,
		 \* processing it immediately.
The idea is that we read in a chunk of data at the end of the buffer, then start scanning for pointers. When we find a pointer to an unmarked mtab entry, we mark it, then either read it in to the buffer immediately, or put it on a stack of unfinished work. Since each stack entry only has to name an entry in the lk_mtab array, the stack is quite compact. Small buffers will almost always be processed in line, and the overall memory usage is quite small; the only large allocation is the lk_mtab array itself.

Stage three: collect the leaks

Once leaky_subr_run() returns successfully, we're on the home stretch. First, we scan the lk_mtab array for unmarked buffers, and report them to the target implementation:

	for (i = 0; i < lk_nbuffers; i++) {
		if (LK_MARKED(lk_mtab[i].lkm_base))
leaky_subr_add_leak(), defined here and here, reads in the available debugging information about the leak, and invokes leaky_add_leak() with the details. leaky_add_leak() adds each leak to a global hash table, coalescing any duplicates. Each entry in the table is a leak_bufctl_t, which is used later for leak reporting. Once all of the leaks have been processed, the final processing is done:
which groups all of the representative leaks by type, and sorts each type's list of buffers using another target interface, leaky_subr_bufctl_cmp(), which defines a "display" order for the leaks.

Final stage: report the leaks

Once leaky_sort has completed, the results have been stored and cached in static data. leaky_dump() dumps the data, using the algorithm explained in .../modules/genunix/leaky_impl.h:

 \* void leaky_subr_dump_start(type)
 \* void leaky_subr_dump(lkb, verbose)
 \* void leaky_subr_dump_end(type)
 \*	Used to dump the table of discovered leaks.  invoked as:
 \*		for i in 0 .. LK_NUM_TYPES
 \*			leaky_subr_dump_start(i)
 \*			for lkb in (possibly a subset of) the type i leaks
 \*				leaky_subr_dump(lkb, 0)
 \*			leaky_subr_dump_end(i)
 \*	leaky_subr_dump_start()/end() are always invoked for each type, even
 \*	if there are no leaks of that type.  If '-d' was passed to
 \*	::findleaks, then we loop through the sequence of lkbs:
 \*		for i in 0 .. LK_NUM_TYPES
 \*			for lkb of type i, same subset and order as above
 \*				leaky_subr_dump(lkb, 1)
 \*	leaky_subr_dump() can use the leaks chained off of lkb_next to
 \*	access coalesced leaks.  lkb_dups is the length of the dup list.
The targets use these invocations to produce tabular output.

Conclusion and Final Thoughts

The generic findleaks code is compact (913 lines), and for the most part without external dependency.[3] It is a useful shell for running the leak processing, but knows nothing of how to discover allocated buffers, root sets, or any other implementation details. All such external dependencies are covered by the target implementations. (It should be possible to do a target implementation for any allocator which has enough state to track all allocated buffers.)

One interesting aspect of this re-working is that the "target" interface places a huge set of requirements on the generic framework; given just the contents of .../modules/genunix/leaky_impl.h, someone writing a clean-room implementation of leaky.c would be constrained to writing something extremely similar to what is there today. On the other hand, the target implementation is less constrained by the target interface. Instead, it is mostly determined by how it must retrieve the needed data from the process, memory image, or core file mdb(1) is running against. This trade off of flexibility on two sided of a software interface is a common pattern, and interesting to see on a smaller scale.


[1] Interestingly, speed is pretty low on the priority list, especially in the post-mortem case: ::findleaks, when run against a dump, is almost entirely I/O bound.
[2] Which will blow out the stack, causing a core dump. Simple recursion removal techniques lead you to the original leaky_grep() implementation: an iterative form of the same algorithm, with the stack of pending work stored on the heap. This implementation served ::findleaks well, until kmdb's memory constraints came along.
[3] Ignoring the "verbose" output, the only external calls not part of the target interface is the call to mdb_vread() in leaky_grep(), which reads from the program's address space, and mdb_getopts(), used to parse the arguments to ::findleaks.

Technorati Tag:
Technorati Tag:
Technorati Tag:

Thursday May 19, 2005

Debugging smf(5)-managed processes

Recently, I've seen a couple questions (internally and externally) about the best way to debug something controlled by smf(5); since the process isn't started up directly, the usual ways of running things under a debugger aren't effective. If you don't need to debug the initialization process, it's also easy; just attach the debugger after the daemon is running.

If you need to debug the initialization, however, you need some way to stop the process before it does so. The easiest way is to use dtrace(1M). The following script:

# cat > stop_daemon.d <<\\EOF
#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option destructive


/execname == $$1/
        printf("stopped %d\\n", pid);
# chmod +x stop_daemon.d
Will let you grab the \*daemon\* process (i.e. at the time of the second fork()). If you need to grab the parent process, change proc:::start to proc:::exec-success. It takes the "execname" of the daemon (i.e. the first 16 characters of the executable name, which you can get using "pgrep -l daemon") as its argument. For example, if you wanted to debug the fault manager, you can do:
# ./stop_daemon fmd &
svcadm restart fmd
# stopped 5678
mdb -p 5678
Loading modules: [ fmd ld.so.1 ... ]
Any debugger which can attach to processes will work; mdb(1), dbx(1), gdb(1), etc.

Technorati Tag:
Technorati Tag:

Thursday Oct 07, 2004

mdb(1) background, intro, and cheatsheet

In the Solaris kernel group, we take our crash dumps seriously. Historically, the two main tools for analyzing crash dumps on UNIX were adb(1) and crash(1M). More recently, mdb(1) has replaced them both as the debugger of choice in Solaris.


adb(1) is a venerable tool in the UNIX tool chest -- 7th Edition UNIX (from 1979) had a version of it. It's syntax is quite quirky (as you'd expect from such an old tool), and one thing to keep in mind is that adb(1) is an assembly-level debugger. Generally, it deals directly with register values and assembly instructions -- the only symbolic information it gives you is access to the symbol table. That said, it has a reasonably powerful macro/scripting facility.

During the development of SunOS, a large number of adb macros were written to dump out various bits of kernel state. In SunOS 3.5 (released in 1988), kadb(1) (an interactive kernel debugger version of adb(1)) already existed, as did 50-odd adb scripts, mostly generated with adbgen(1M). Solaris 2.0/SunOS 5.x continued the tradition, and by Solaris 9, there are over 890 scripts in /usr/lib/adb/sparcv9 (compared to 507 in Solaris 8).1


crash(1M) is a bit more recent; it appeared sometime between 7th Edition UNIX and SysV R3, and while SunOS 3.5 did not have it, SunOS 4.x did. While adb(1) is a reasonably generic debugger with scripting facilities, crash(1M) takes an almost diametrically opposed approach: it uses compiled C code which knows how to traverse and understand various structures in the kernel to dump out information of interest. This makes crash(1M) much more powerful than adb(1M) (since you can do complicated things like virtual-to-physical address translation), while simultaneously making it much less flexible (if it wasn't already written into crash(1M), you're going to have to write it yourself, or do without).

This means that adb(1) and crash(1M) were quite complimentary. During any given debugging session, each might be used for its different strengths.2


mdb(1), the Solaris "Modular Debugger", is the brain-child of Michael Shapiro and Bryan Cantrill. Upon their arrival in the Solaris Kernel Group, they took one look at adb and crash, and decided that they were both exceedingly long in the tooth. Together, they created mdb(1M) to replace them. It's designed to embody their best features, while introducing a new framework for building debugging support, live and post-mortem.

Because of the sheer number of existent adb macros, and the finger-memory of hundreds of people, mdb(1) is almost completely backwards compatible with adb(1).3

mdb(1) allows for extensibility in the form of "Debugger Modules" (dmods) which can provide "debugger commands" (dcmds) and "walkers". dcmds are similar to the commands of crash, while walkers walk a particular dataset. Both dcmds and walkers are written in C using the interface defined by the MDB module API, which is documented in the Modular Debugger Guide.

Using the mdb module API, kernel engineers (Mike, Bryan and Dan, and others) have built up a huge library of dcmds and walkers to explore Solaris -- my desktop (a Solaris 10 system) has 196 generic walkers and 368 dcmds defined. (there are ~200 auto-generated walkers for the various kmem caches on the system, which I'm not counting here)

The neat thing about walkers and dcmds is that they can build on each other, combining to do more powerful things. This occurs both in their implementation and by the user's explicit action of placing them into an mdb "pipeline".

To give you a taste of the power of pipelines, here's an example, running against the live kernel on my desktop: the ::pgrep dcmd allows you to find all processes matching a pattern, the thread walker walks all of the threads in a process, and the ::findstack dcmd gets a stack trace for a given thread. Connecting them into a pipeline, you get:

# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba
s1394 fctl nca audiosup logindmux ptm cpc random fcip nfs lofs ipc ]
> ::pgrep sshd
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R 100174      1 100174 100174      0 0x42000000 0000030009216790 sshd
R 276948 100174 100174 100174      0 0x42010000 000003002d9a9860 sshd
R 276617 100174 100174 100174      0 0x42010000 0000030013943010 sshd
> ::pgrep sshd | ::walk thread
> ::pgrep sshd | ::walk thread | ::findstack
stack pointer for thread 3000c4f0c80: 2a10099d071
[ 000002a10099d071 cv_wait_sig_swap+0x130() ]
  000002a10099d121 poll_common+0x530()
  000002a10099d211 pollsys+0xf8()
  000002a10099d2f1 syscall_trap32+0x1e8()
stack pointer for thread 311967e9660: 2a100897071
[ 000002a100897071 cv_wait_sig_swap+0x130() ]
stack pointer for thread 30f2ff2c340: 2a100693071
[ 000002a100693071 cv_wait_sig_swap+0x130() ]
  000002a100693121 poll_common+0x530()
  000002a100693211 pollsys+0xf8()
  000002a1006932f1 syscall_trap32+0x1e8()
Yielding the stack traces of all sshd threads on the system (note that the middle one is swapped out). mdb pipelines are quite similar to standard UNIX pipelines, and allow those using the debugger a similar level of power and flexibility.

An mdb(1) cheat sheet

Because of its backwards compatibility with adb, mdb can have a bit of a learning curve. A while back, I put together an mdb(1) cheatsheet [ps pdf] to reference during late-night post-mortem debugging sessions, and it has become a pretty popular reference in the Kernel Group. It's designed to print out double-sided; the front covers the full mdb syntax, while the back is a set of commonly-used kernel dcmds and walkers, with short descriptions.

That's it for a quick history and tour -- I should be talking more about mdb later, along with libumem(3lib) (my current claim to fame), smf(5), and userland and kernel debugging in general.

1  The introduction of mdb(1) in Solaris 8, and CTF (compact ANSI-C type format) in Solaris 9 has started to slow down this trend significantly -- Solaris 10 will only have about 16 new adb scripts over Solaris 9.
2  I have little direct experience with crash(1M) -- by the time I joined Sun, it had been EOLed.
3  invoking adb on Solaris 9 and later just invokes mdb in backwards-compatibility mode.

Technorati Tag:
Technorati Tag:




Top Tags
« December 2016