Recent Posts



Over a decade ago, in 2000, David Powell used his intimate knowledge of sed(1) to write a shell script called munges, which entered the muscle memory of most of the Solaris Kernel Development staff. Short for "munge stacks", it takes the output of ::walk thread | ::findstack and groups the stacks by stack trace:

    > ::walk thread | ::findstack !/home/dep/bin/munges
    743 ##################################  tp: 2a100007c80
            taskq_thread_wait+0x38()
            taskq_thread+0x350()
            thread_start+4()
    182 ##################################  tp: 30003c60020
            syscall_trap32+0xcc()
    65 ##################################  tp: 30003c617c0
            poll_common+0x448()
            pollsys+0xf0()
            syscall_trap32+0xcc()
    ...
    1 ##################################  tp: 2a100077c80 (TS_FREE)
            clock+0x508()
            cyclic_softint+0xbc()
            cbe_level10+8()
            intr_thread+0x168()
            callout_list_expire+0x5c()
            callout_expire+0x14()
            callout_realtime+0x14()
    1 ##################################  tp: 2a10003fc80
            kmdbmod`kctl_wr_thread+0x80()
            thread_start+4()
    1 ##################################  tp: 2a100017c80
            thread_reaper+0xa8()
            thread_start+4()
    1 ##################################  tp: 180e000
            _start+0x108()

While very handy (even after a decade), it has a few drawbacks:

- Since it is a shell script, it cannot be used directly when debugging with kmdb(1). Copying and pasting thousands of thread stacks from serial-console output was never convenient.
- You have to use advanced egrep(1) (or less(1)) technology to search for particular threads of interest.
- Since ::findstack only displays limited thread state information (i.e. whether the thread is free), munges can't uniquify based upon the thread state.

After sitting on my backburner for a long time, I finally implemented:

    6799290 need to uniquify stacks in kmdb

Running ::stacks by itself gives output very similar to the above pipeline:

    > ::stacks
    THREAD           STATE    SOBJ                COUNT
    2a10000fc80      SLEEP    CV                    743
                     taskq_thread_wait+0x38
                     taskq_thread+0x350
                     thread_start+4

    30003dc43c0      SLEEP    SHUTTLE               182
                     syscall_trap32+0xcc
    ...
    2a10055dc80      ONPROC   <NONE>                  1
                     idle+0x120
                     thread_start+4

    3004a4d0740      ONPROC   <NONE>                  1
                     trap+0x1b30
                     user_rtt+0x20

    180e000          STOPPED  <NONE>                  1
                     _start+0x108

After my initial putback (in 2009), Greg Price added 6802742 (adding module filtering), and Bryan Cantrill added 6935550 (which added ::stacks support to userland debugging).

Here's the help message for ::stacks, which has a lot of details on its use:

    > ::help stacks

    NAME
      stacks - print unique kernel thread stacks

    SYNOPSIS
      [ addr ] ::stacks [-afiv] [-c func] [-C func] [-m module] [-M module]
        [-s sobj | -S sobj] [-t tstate | -T tstate]

    DESCRIPTION
      ::stacks processes all of the thread stacks on the system, grouping
      together threads which have the same:

        * Thread state,
        * Sync object type, and
        * PCs in their stack trace.

      The default output (no address or options) is just a dump of the thread
      groups in the system.  For a view of active threads, use "::stacks -i",
      which filters out FREE threads (interrupt threads which are currently
      inactive) and threads sleeping on a CV. (Note that those threads may still
      be noteworthy; this is just for a first glance.)  More general filtering
      options are described below, in the "FILTERS" section.

      ::stacks can be used in a pipeline.  The input to ::stacks is one or more
      thread pointers.  For example, to get a summary of threads in a process,
      you can do:

        procp::walk thread | ::stacks

      When output into a pipe, ::stacks prints all of the threads input,
      filtered by the given filtering options.  This means that multiple
      ::stacks invocations can be piped together to achieve more complicated
      filters.  For example, to get threads which have both 'fop_read' and
      'cv_wait_sig_swap' in their stack trace, you could do:

        ::stacks -c fop_read | ::stacks -c cv_wait_sig_swap_core

      To get the full list of threads in each group, use the '-a' flag:

        ::stacks -a

    OPTIONS
      -a    Print all of the grouped threads, instead of just a count.
      -f    Force a re-run of the thread stack gathering.
      -v    Be verbose about thread stack gathering.

    FILTERS
      -i    Show active threads; equivalent to '-S CV -T FREE'.
      -c func[+offset]
            Only print threads whose stacks contain func/func+offset.
      -C func[+offset]
            Only print threads whose stacks do not contain func/func+offset.
      -m module
            Only print threads whose stacks contain functions from module.
      -M module
            Only print threads whose stacks do not contain functions from
            module.
      -s {type | ALL}
            Only print threads which are on a 'type' synchronization object
            (SOBJ).
      -S {type | ALL}
            Only print threads which are not on a 'type' SOBJ.
      -t tstate
            Only print threads which are in thread state 'tstate'.
      -T tstate
            Only print threads which are not in thread state 'tstate'.

       SOBJ types: mutex rwlock cv sema user user_pi shuttle
    Thread states: free sleep run onproc zomb stopped wait panic

    ATTRIBUTES
      Target: kvm
      Module: genunix
      Interface Stability: Unstable
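To make the grouping rule concrete, here is a small Python sketch of the idea: threads are bucketed by (thread state, sync object type, stack frames) and the buckets are printed largest first. This is only an illustration, not the mdb implementation; the Thread tuple, the helper names, and the sample data are all made up for the example, and the containing() helper is just a rough analogue of the '-c func' filter.

```python
from collections import defaultdict
from typing import NamedTuple


class Thread(NamedTuple):
    addr: int       # thread pointer
    state: str      # e.g. "SLEEP", "ONPROC"
    sobj: str       # sync object type, e.g. "CV", "<NONE>"
    frames: tuple   # stack trace as "func+offset" strings, innermost first


def group_stacks(threads):
    """Group threads sharing state, sobj type and stack; largest group first."""
    groups = defaultdict(list)
    for t in threads:
        groups[(t.state, t.sobj, t.frames)].append(t.addr)
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)


def containing(threads, func):
    """Rough analogue of '-c func': keep threads with func in their stack."""
    return [t for t in threads
            if any(f.split("+")[0] == func for f in t.frames)]


# Made-up sample data, loosely modelled on the output above.
SAMPLE = [
    Thread(0x2a10000fc80, "SLEEP", "CV",
           ("taskq_thread_wait+0x38", "taskq_thread+0x350", "thread_start+4")),
    Thread(0x2a100017c80, "SLEEP", "CV",
           ("taskq_thread_wait+0x38", "taskq_thread+0x350", "thread_start+4")),
    Thread(0x30003dc43c0, "SLEEP", "SHUTTLE", ("syscall_trap32+0xcc",)),
    Thread(0x180e000, "STOPPED", "<NONE>", ("_start+0x108",)),
]

if __name__ == "__main__":
    for (state, sobj, frames), addrs in group_stacks(SAMPLE):
        # One header line per group, showing a representative thread.
        print(f"{addrs[0]:<16x} {state:<8} {sobj:<8} {len(addrs):>5}")
        for frame in frames:
            print(f"{'':17}{frame}")
```

Because each filter takes a list of threads and returns a list of threads, filters compose the same way the piped ::stacks invocations do: containing(containing(threads, "fop_read"), "cv_wait_sig_swap_core") keeps only threads with both functions on their stack.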



Debugging with libumem and MDB

In celebration of OpenSolaris's birthday, I thought I would do some more blogging about libumem, one of my favorite parts of it. In particular, I'll cover some of its debugging features, which borrow heavily from the kmem dcmds and walkers written by Bryan, Dan, and others.

Much of the debugging power of libumem comes from its mdb(1M) debugger module. Anyone familiar with the kmem dcmds used for kernel debugging will see a lot of similarities (modulo some 'u's where 'k's used to be). You can see a list of everything it provides by doing:

    > ::dmods -l libumem.so.1

    libumem.so.1
      dcmd allocdby                 - given a thread, print its allocated buffers
      dcmd bufctl                   - print or filter a bufctl
      dcmd bufctl_audit             - print a bufctl_audit
      dcmd findleaks                - search for potential memory leaks
      dcmd freedby                  - given a thread, print its freed buffers
      dcmd ugrep                    - search user address space for a pointer
      dcmd umalog                   - display umem transaction log and stack traces
      dcmd umastat                  - umem allocator stats
      dcmd umausers                 - display current medium and large users of the umem allocator
      dcmd umem_cache               - print a umem cache
      dcmd umem_debug               - toggle umem dcmd/walk debugging
      dcmd umem_log                 - dump umem transaction log
      dcmd umem_malloc_dist         - report distribution of outstanding malloc()s
      dcmd umem_malloc_info         - report information about malloc()s by cache
      dcmd umem_status              - Print umem status and message buffer
      dcmd umem_verify              - check integrity of umem-managed memory
      dcmd vmem                     - print a vmem_t
      dcmd vmem_seg                 - print or filter a vmem_seg
      dcmd whatis                   - given an address, return information
      walk allocdby                 - given a thread, walk its allocated bufctls
      walk bufctl                   - walk a umem cache's bufctls
      walk bufctl_history           - walk the available history of a bufctl
      walk freectl                  - walk a umem cache's free bufctls
      walk freedby                  - given a thread, walk its freed bufctls
      walk freemem                  - walk a umem cache's free memory
      walk leak                     - given a leak ctl, walk other leaks w/ that stacktrace
      walk leakbuf                  - given a leak ctl, walk addr of leaks w/ that stacktrace
      walk umem                     - walk a umem cache
      walk umem_alloc_112           - walk the umem_alloc_112 cache
    ... more umem_alloc_* caches ...
      walk umem_bufctl_audit_cache  - walk the umem_bufctl_audit_cache cache
      walk umem_bufctl_cache        - walk the umem_bufctl_cache cache
      walk umem_cache               - walk list of umem caches
      walk umem_cpu                 - walk the umem CPU structures
      walk umem_cpu_cache           - given a umem cache, walk its per-CPU caches
      walk umem_hash                - given a umem cache, walk its allocated hash table
      walk umem_log                 - walk the umem transaction log
      walk umem_magazine_1          - walk the umem_magazine_1 cache
    ... more umem_magazine_* caches ...
      walk umem_slab                - given a umem cache, walk its slabs
      walk umem_slab_cache          - walk the umem_slab_cache cache
      walk umem_slab_partial        - given a umem cache, walk its partially allocated slabs (min 1)
      walk vmem                     - walk vmem structures in pre-fix, depth-first order
      walk vmem_alloc               - given a vmem_t, walk its allocated vmem_segs
      walk vmem_free                - given a vmem_t, walk its free vmem_segs
      walk vmem_postfix             - walk vmem structures in post-fix, depth-first order
      walk vmem_seg                 - given a vmem_t, walk all of its vmem_segs
      walk vmem_span                - given a vmem_t, walk its spanning vmem_segs

There's a lot of meat here, but I'll start by focusing on the most important dcmds and walkers.

Important dcmds

addr::whatis

Reports information about a given buffer. At the moment, the support for this is kind of anemic; it will only give information about buffers under libumem's control. There's an RFE to do better:

    4706502 enhance ::whatis for libumem

But even as it is, it is still quite handy for debugging. In Solaris 10 and later, ::whatis will automatically provide the bufctl or vmem_seg address for buffers with debugging information attached to them:

    without UMEM_DEBUG set
    > f3da8::whatis
    f3da8 is f3da8+0, allocated from umem_alloc_32

    with UMEM_DEBUG=default
    > f3da8::whatis
    f3da8 is f3da8+0, bufctl fd150 allocated from umem_alloc_32
    > fd150::bufctl -v
                ADDR          BUFADDR        TIMESTAMP           THREAD
                                CACHE          LASTLOG         CONTENTS
               fd150            f3da8    d8c3823401f60                1
                                e2788            82920                0
                     libumem.so.1`umem_cache_alloc+0x218
                     libumem.so.1`umem_alloc+0x58
                     libumem.so.1`malloc+0x28
                     nam_putval+0x3dc
                     nam_fputval+0x1c
                     env_namset+0x228
                     env_init+0xb4
                     main+0xa8
                     _start+0x108

This allows for a quick answer to the question "what is this buffer, and who was the last one to allocate/free it". See the descriptions of ::bufctl and ::vmem_seg, below, for more information on their use. (In Solaris 9, you need to use '-b' to get the bufctl address.)

addr::ugrep

Searches the entire address space for a particular pointer value. The value must be properly aligned. There are options to loosen the search; -d dist searches for [addr, addr + dist) instead of an exact match, -m mask only compares the bits selected in mask, etc.

addr::bufctl

Displays information from a 'umem_bufctl_audit_t' pointer, which includes the buffer address, timestamp, thread, and caller. With the '-v' switch ("verbose" mode), it also includes the cache, transaction log pointer, contents log pointer, and stack trace. This dcmd can also be used in a pipeline to filter the bufctls it prints, by address (-a addr), function or function+offset in the stack trace (-c caller), timestamp (-e earliest / -l latest), or thread (-t thread). In Solaris Nevada, you can also get the full history for a bufctl by using the -h flag. On Solaris 9, the equivalent of ::bufctl -v is addr::bufctl_audit.

addr::vmem_seg

Displays information from a 'vmem_seg_t' pointer, which includes the type, start address, and end address of the segment. For ALLoCated segments, it also includes the top stack frame from the stack trace. With the '-v' switch ("verbose" mode), it also includes (for ALLoCated segments only) the thread, timestamp, and stack trace recorded at allocation time.

::findleaks

Does a conservative garbage-collection of the entire process in order to find memory leaks. The memory leaks are then grouped by stack trace and either (umem cache) or (allocation size). To dump all of the leak stack traces, you can use the -d flag.

::umastat

Prints a report of all of the umem-managed memory in the system, grouped by umem cache and vmem arena. This can be used to see which allocation sizes are chewing up the most memory.

[addr]::umem_verify

Verifies the consistency of the umem heap. If debugging is enabled, this will find instances of modified free buffers and writes past the end of the buffer. To get a detailed report of corrupted buffer addresses, take the cache pointer from a line with "n corrupted buffers", and do addr::umem_verify.

::umem_cache

Lists all of the umem caches in the system, in tabular form. This is often the easiest way to get a cache's address.

That's all for now, but I'll follow up with more descriptions and examples later.

Tags: [libumem, MDB, OpenSolaris, Solaris]
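As an illustration of the ::ugrep matching rules described above, here is a hedged Python sketch that scans a byte buffer for aligned, pointer-sized words. It is not the mdb implementation: the function name, its parameters, and the choice to treat the range (-d) and mask (-m) modes as mutually exclusive are all assumptions made for the example, and the buffer stands in for the target's address space.

```python
import struct


def ugrep(memory, base, target, dist=1, mask=None, ptr_size=4, big_endian=True):
    """Return addresses of aligned pointer-sized words that match target.

    Without a mask, a word matches if it lies in [target, target + dist).
    With a mask, a word matches if it agrees with target on the masked bits.
    `base` is the (aligned) address of memory[0].
    """
    fmt = (">" if big_endian else "<") + {4: "I", 8: "Q"}[ptr_size]
    hits = []
    # Only aligned offsets are examined, mirroring the alignment requirement.
    for off in range(0, len(memory) - ptr_size + 1, ptr_size):
        (word,) = struct.unpack_from(fmt, memory, off)
        if mask is not None:
            match = (word & mask) == (target & mask)
        else:
            match = target <= word < target + dist
        if match:
            hits.append(base + off)
    return hits


if __name__ == "__main__":
    # Four made-up 32-bit big-endian words at addresses 0x1000..0x100c.
    mem = struct.pack(">4I", 0xF3DA8, 0, 0xF3DAC, 0xFFF3DA8)
    print([hex(a) for a in ugrep(mem, 0x1000, 0xF3DA8)])
    print([hex(a) for a in ugrep(mem, 0x1000, 0xF3DA8, dist=8)])
    print([hex(a) for a in ugrep(mem, 0x1000, 0xF3DA8, mask=0xFFFFF)])
```

The range form is handy for finding pointers into the middle of a buffer, while the mask form finds values that share only some bits with the target.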



Some block comments about libumem

One of the projects I've been working on recently is a wad covering the following bugs:

    4720206 ::findleaks shouldn't cache results across state changes
    4743353 libumem's module fails to load on idle targets
    6304072 libumem seems to use more heap than it needs
    6336202 d4fc7824::typegraph made mdb crash

As part of it, I made some ASCII-art comments describing the layout of a umem buffer and slab, which I thought might be of interest more generally. Here are the block comments:

    /*
     * Each slab in a given cache is the same size, and has the same
     * number of chunks in it;  we read in the first slab on the
     * slab list to get the number of chunks for all slabs.  To
     * compute the per-slab overhead, we just subtract the chunk usage
     * from the slabsize:
     *
     * +------------+-------+-------+ ... --+-------+-------+-------+
     * |////////////|       |       | ...   |       |///////|///////|
     * |////color///| chunk | chunk | ...   | chunk |/color/|/slab//|
     * |////////////|       |       | ...   |       |///////|///////|
     * +------------+-------+-------+ ... --+-------+-------+-------+
     * |            \_______chunksize * chunks_____/                |
     * \__________________________slabsize__________________________/
     *
     * For UMF_HASH caches, there is an additional source of overhead;
     * the external umem_slab_t and per-chunk bufctl structures.  We
     * include those in our per-slab overhead.
     *
     * Once we have a number for the per-slab overhead, we estimate
     * the actual overhead by treating the malloc()ed buffers as if
     * they were densely packed:
     *
     *      additional overhead = (# mallocs) * (per-slab) / (chunks);
     *
     * carefully ordering the multiply before the divide, to avoid
     * round-off error.
     */
    ...
    /*
     * A malloc()ed buffer looks like:
     *
     *      <----------- mi.malloc_size --->
     *      <----------- cp.cache_bufsize ------------------>
     *      <----------- cp.cache_chunksize -------------------------------->
     *      +-------+-----------------------+---------------+---------------+
     *      |/tag///| mallocsz              |/round-off/////|/debug info////|
     *      +-------+---------------------------------------+---------------+
     *              <-- usable space ------>
     *
     * mallocsz is the argument to malloc(3C).
     * mi.malloc_size is the actual size passed to umem_alloc(), which
     * is rounded up to the smallest available cache size, which is
     * cache_bufsize.  If there is debugging or alignment overhead in
     * the cache, that is reflected in a larger cache_chunksize.
     *
     * The tag at the beginning of the buffer is either 8-bytes or 16-bytes,
     * depending upon the ISA's alignment requirements.  For 32-bit allocations,
     * it is always a 8-byte tag.  For 64-bit allocations larger than 8 bytes,
     * the tag has 8 bytes of padding before it.
     *
     * 32-byte, 64-byte buffers <= 8 bytes:
     *      +-------+-------+--------- ...
     *      |/size//|/stat//| mallocsz ...
     *      +-------+-------+--------- ...
     *                      ^
     *                      pointer returned from malloc(3C)
     *
     * 64-byte buffers > 8 bytes:
     *      +---------------+-------+-------+--------- ...
     *      |/padding///////|/size//|/stat//| mallocsz ...
     *      +---------------+-------+-------+--------- ...
     *                                      ^
     *                                      pointer returned from malloc(3C)
     *
     * The "size" field is "malloc_size", which is mallocsz + the padding.
     * The "stat" field is derived from malloc_size, and functions as a
     * validation that this buffer is actually from malloc(3C).
     */

For more details on how umem works, you can look at the kmem and vmem papers:

    The Slab Allocator: An Object-Caching Kernel Memory Allocator, Summer USENIX 1994
    Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources, USENIX 2001

Tags: [libumem, MDB, OpenSolaris, Solaris]
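The overhead estimate in the first block comment is easy to try out with integers. The Python sketch below follows the formulas in the comment for the simple (non-UMF_HASH) case; the function names and the cache geometry in the example (4096-byte slabs holding 84 chunks of 48 bytes) are hypothetical and chosen only to make the truncation visible. It is not the code from the mdb module.

```python
def per_slab_overhead(slabsize, chunksize, chunks):
    """Bytes in a slab not occupied by chunks (color plus slab structure)."""
    return slabsize - chunksize * chunks


def additional_overhead(mallocs, per_slab, chunks):
    """Estimated overhead for `mallocs` buffers, as if densely packed.

    The multiply comes before the divide, so integer division
    truncates only once, at the very end.
    """
    return mallocs * per_slab // chunks


def additional_overhead_divide_first(mallocs, per_slab, chunks):
    """Same formula with the divide first; shown only to expose the error."""
    return mallocs // chunks * per_slab


if __name__ == "__main__":
    # Hypothetical cache: 4096-byte slabs holding 84 chunks of 48 bytes.
    per_slab = per_slab_overhead(4096, 48, 84)
    print(per_slab)                                             # 64
    print(additional_overhead(100, per_slab, 84))               # 76
    print(additional_overhead_divide_first(100, per_slab, 84))  # 64
```

With 100 outstanding mallocs, dividing first rounds 100/84 down to one whole slab and reports 64 bytes, while multiplying first keeps the fractional slab and reports 76.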



Coverage testing

A couple years back, I wrote up a description of how to use the Sun Studio compiler's coverage testing features to test userland code. Now that OpenSolaris is here, I thought it might come in handy for a larger audience. Here goes:

How do I do coverage analysis on my user-level code?

The Sun Workshop compilers we use have some pretty good profiling and test analysis tools built in to them. One of the more useful for user-space code is Coverage Analysis, which gives you a measure of how complete your testing is.

Coverage analysis annotates each "block" of straight-line code with a count of the number of times it has executed. For testing, what is usually more interesting is which lines were never executed, and the "Coverage", or percentage of blocks in your program or library that were exercised in your testing. For more information, see tcov(1), in /opt/SUNWspro/man.

Compilation and Linking

Coverage analysis requires a special compilation of your program or library. Each .c file needs to be compiled with -xprofile=tcov, and the final link (either to executable or shared library) also needs -xprofile=tcov. Setting:

    CFLAGS += -xprofile=tcov
    CFLAGS64 += -xprofile=tcov
    DYNFLAGS += -xprofile=tcov      (shared libraries only)

in the appropriate Makefiles, then make clean; make install is sufficient.

Generating Profile Data

The -xprofile=tcov version of your binary will generate profile information every time the executable is run (or, in the case of a shared library, any executable which links against it is run) and exits normally. The output is placed (by default) in ./progname.profile/, which will build up data from all executions as they exit. It will even join up 32-bit and 64-bit data sets.

The tcov output location is controlled by two environment variables, SUN_PROFDATA_DIR (default '.'), and SUN_PROFDATA (default 'progname.profile'). So if you are testing libfoo.so, and want to join the data from a bunch of executions into /tmp/libfoo.profile, you would set:

    sh:
    % SUN_PROFDATA_DIR=/tmp
    % SUN_PROFDATA=libfoo.profile
    % export SUN_PROFDATA_DIR SUN_PROFDATA

    csh:
    % setenv SUN_PROFDATA_DIR /tmp
    % setenv SUN_PROFDATA libfoo.profile

before your runs.

Processing the profile data

Once you have finished gathering data, you can use the tcov(1) command, located in /opt/SUNWspro/bin (or wherever you keep your compilers), to analyze it. Its syntax is pretty straightforward:

    % tcov -x profile_dir sourcefile...

For example, to analyze the previous libfoo example, you might do the following (here I use a separate directory for my tcov analysis):

    % cd usr/src/lib/libfoo
    % mkdir tcov
    % cd tcov
    % tcov -x /tmp/libfoo.profile ../common/*.c ../sparc/*.c ../sparcv9/*.c

Analyzing the data

Nota Bene: The counts tcov uses to generate its output are updated without holding locks. For multi-threaded programs only, this means that some counts may be lower than expected. Nevertheless, if a block has been executed at least once, its count will be non-zero.

For each source file you pass in on the command line, tcov will generate a .tcov file (for example, ../common/foo.c -> foo.c.tcov). Each file contains the original source, annotated with execution counts. Each line that starts a "basic block" is prefixed with either '##### ->', indicating that it has not been executed, or 'count ->', indicating how many times it was executed.

After the annotated source, there is a summary of the file, including things like total blocks, number executed, % coverage, average executions per block, etc.

I've written a tool, tcov_summarize, which takes the tcov files in the current directory and displays a summary of the current state. The columns are "total blocks", "executed blocks", and "% executed" (or % coverage).

Command example: cpio

    % cd usr/src/cmd/cpio
    % grep tcov Makefile
    CFLAGS += -xprofile=tcov
    % make
    ... (made cpio) ...
    % mkdir tcov
    % cd tcov
    % ../cpio
    cpio: One of -i, -o or -p must be specified.
    USAGE:
            cpio -i[bcdfkmrstuv@BSV6] [-C size] [-E file] [-H hdr] [-I file [-M msg]] [-R id] [patterns]
            cpio -o[acv@ABLV] [-C size] [-H hdr] [-O file [-M msg]]
            cpio -p[adlmuv@LV] [-R id] directory
    % ls
    cpio.profile/
    % tcov -x cpio.profile ../*.c
    % ls
    cpio.c.tcov      cpio.profile/    cpiostat.c.tcov
    % tcov_summarize
      1818     32   1.76 cpio.c
         2      0   0.00 cpiostat.c
      1820     32   1.76 total
    % find . | ../cpio -ocB > /dev/null
    590 blocks
    % tcov -x cpio.profile ../*.c
    % tcov_summarize
      1818    326  17.93 cpio.c
         2      0   0.00 cpiostat.c
      1820    326  17.91 total
    %

Library example: libumem

    % cd usr/src/lib/libumem
    % grep tcov Makefile.com
    CFLAGS += -v $(LOCFLAGS) -I$(CMNDIR) -xprofile=tcov
    CFLAGS64 += -v $(LOCFLAGS) -I$(CMNDIR) -xprofile=tcov
    DYNFLAGS += -M $(MAPFILE) -z interpose -xprofile=tcov
    % make
    ... (made libumem) ...
    % mkdir tcov
    % cd tcov
    % SUN_PROFDATA_DIR=`pwd`
    % SUN_PROFDATA=libumem.profile
    % export SUN_PROFDATA_DIR SUN_PROFDATA
    % LD_PRELOAD=../sparc/libumem.so.1 LD_PRELOAD_64=../sparcv9/libumem.so.1
    % export LD_PRELOAD LD_PRELOAD_64
    % ls
    % ls
    libumem.profile/
    % tcov -x libumem.profile ../common/*.c ../sparc/*.c
    % /home/jwadams/bin/tcov_summarize
        75     44  58.67 envvar.c
        10      7  70.00 getpcstack.c
        72     22  30.56 malloc.c
        78     27  34.62 misc.c
       592    255  43.07 umem.c
         1      0   0.00 umem_agent_support.c
       315    167  53.02 vmem.c
        13     10  76.92 vmem_base.c
        20      0   0.00 vmem_mmap.c
        35     17  48.57 vmem_sbrk.c
      1211    549  45.33 total
    % tcov -x libumem.profile ../common/*.c ../sparc/*.c
    % /home/jwadams/bin/tcov_summarize
        77     45  58.44 envvar.c
        10      7  70.00 getpcstack.c
        72     28  38.89 malloc.c
        78     27  34.62 misc.c
       592    314  53.04 umem.c
         1      0   0.00 umem_agent_support.c
       315    192  60.95 vmem.c
        13     10  76.92 vmem_base.c
        20      0   0.00 vmem_mmap.c
        35     17  48.57 vmem_sbrk.c
      1213    640  52.76 total
    %

(Note that running tcov gave us more coverage, since the library is being preloaded underneath it.)

Tags: [OpenSolaris, Solaris]
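The three columns in those summaries are simple to derive once the per-file block counts are known. The Python sketch below shows the arithmetic only; it is not the tcov_summarize tool used above, it does not parse .tcov files, and the function name and input format (a list of name, total blocks, executed blocks) are assumptions made for the example. The sample counts are the cpio numbers from the first run shown above.

```python
def summarize(counts):
    """Build summary rows from (filename, total_blocks, executed_blocks).

    Returns rows of (total, executed, percent, name), with a final
    "total" row computed from the summed counts rather than by
    averaging the per-file percentages.
    """
    def pct(executed, total):
        return round(100.0 * executed / total, 2) if total else 0.0

    rows = [(total, executed, pct(executed, total), name)
            for name, total, executed in counts]
    all_total = sum(r[0] for r in rows)
    all_exec = sum(r[1] for r in rows)
    rows.append((all_total, all_exec, pct(all_exec, all_total), "total"))
    return rows


if __name__ == "__main__":
    # Block counts taken from the first cpio run above.
    for total, executed, percent, name in summarize(
            [("cpio.c", 1818, 32), ("cpiostat.c", 2, 0)]):
        print(f"{total:6d} {executed:6d} {percent:6.2f} {name}")
```

Computing the total row from summed block counts matters: a tiny file with 0% coverage, like cpiostat.c here, barely moves the overall figure, whereas averaging percentages would halve it.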



An initial encounter with ZFS

After ZFS became available in onnv_27, I immediately upgraded my desktop system to the newly minted bits. After some initial setup, I've been happily using ZFS for all of my non-root, non-NFSed data. I'm getting about 1.7x my storage due to ZFS's compression, and have new-found safety, since my data is now mirrored.

During the initial setup, my intent was to use only the slices I'd already set up to do the transfer. What I did not plan for was the fact that ZFS does not currently allow you to remove a non-redundant slice from a storage pool without destroying the pool; here's what I did, as well as what I should have done:

My setup

Before I began, my system layout was fairly simple:

    c0t0d0:
    Total disk cylinders available: 24620 + 2 (reserved cylinders)

    Part      Tag    Flag     Cylinders         Size            Blocks
      0       root    wm    1452 -  7259        8.00GB    (5808/0/0)  16779312
      1 unassigned    wm       0                0         (0/0/0)            0
      2     backup    wm       0 - 24619       33.92GB    (24620/0/0) 71127180
      3       swap    wu       0 -  1451        2.00GB    (1452/0/0)   4194828
      4 unassigned    wm       0                0         (0/0/0)            0
      5 unassigned    wm       0                0         (0/0/0)            0
      6 unassigned    wm       0                0         (0/0/0)            0
      7       aux0    wm    7260 - 24619       23.91GB    (17360/0/0) 50153040

    c0t1d0:
    Total disk cylinders available: 24620 + 2 (reserved cylinders)

    Part      Tag    Flag     Cylinders         Size            Blocks
      0    altroot    wm       0 -  5807        8.00GB    (5808/0/0)  16779312
      1 unassigned    wm       0                0         (0/0/0)            0
      2     backup    wm       0 - 24619       33.92GB    (24620/0/0) 71127180
      3       swap    wu    5808 -  7259        2.00GB    (1452/0/0)   4194828
      4 unassigned    wm       0                0         (0/0/0)            0
      5 unassigned    wm       0                0         (0/0/0)            0
      6 unassigned    wm       0                0         (0/0/0)            0
      7       aux1    wm    7260 - 24619       23.91GB    (17360/0/0) 50153040

That is, two 34GB hard disks, partitioned identically. There are four slices of major interest:

    c0t0d0s0   /          8G   root filesystem
    c0t0d0s7   /aux0     24G   data needing preserving
    c0t1d0s0   /altroot   8G   alternate root, currently empty
    c0t1d0s7   /aux1     24G   some data (/opt) which needed preserving

My goal was to create a 24GB mirrored ZFS pool, using the underlying slices of /aux0 and /aux1. /altroot would be my initial stepping stone.

The process

Without any prior experience setting up a ZFS pool, I did the following steps:

    ... remove /altroot from /etc/vfstab ...
    # zpool create mypool c0t1d0s0
    invalid vdev specification
    use '-f' to override the following errors:
    /dev/dsk/c0t1d0s0 contains a ufs filesystem
    # zpool create -f mypool c0t1d0s0
    # zpool list
    NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
    mypool                 7.94G   32.5K   7.94G     0%  ONLINE     -
    # zfs set compression=yes mypool
    # zfs create mypool/opt
    # zfs set mountpoint=/new_opt mypool/opt
    ... copy data from /aux1/opt to /new_opt, clear out /aux1 ...
    ... remove /aux1 from vfstab, and remove the /opt symlink ...
    # zfs set mountpoint=/opt mypool/opt
    # df -h /opt
    Filesystem             size   used  avail capacity  Mounted on
    mypool/opt             7.9G   560M   7.4G     6%    /opt
    #

I now had all of the data I needed off of /aux1, and I wanted to add it to the storage pool. This is where I made a mistake; ZFS, in its initial release, cannot remove a non-redundant device from a pool (this is being worked on). I did:

    # zpool add -f mypool c0t1d0s7          (*MISTAKE*)
    # zpool status mypool
      pool: mypool
     state: ONLINE
     scrub: none requested
    config:

            NAME        STATE     READ WRITE CKSUM
            mypool      ONLINE       0     0     0
              c0t1d0s0  ONLINE       0     0     0
              c0t1d0s7  ONLINE       0     0     0
    # zfs create mypool/aux0
    # zfs create mypool/aux1
    # zfs set mountpoint=/new_aux0 mypool/aux0
    # zfs set mountpoint=/aux1 mypool/aux1
    ... move data from /aux0 to /new_aux0 ...
    ... remove /aux0 from /etc/vfstab ...
    # zfs set mountpoint=/aux0 mypool/aux0

And now I was stuck; I wanted to end up with a configuration of:

    mypool
      mirror
        c0t0d0s7
        c0t1d0s7

but there was no way to get there without removing c0t1d0s0 from the pool, which ZFS doesn't allow you to do directly. I ended up creating a new pool, "pool", with c0t0d0s7 in it, copying all of my data *again*, destroying "mypool", then mirroring c0t0d0s7 by doing:

    # zpool attach pool c0t0d0s7 c0t1d0s7
    #

The right way

If I'd planned this all better, the right way to build the pool I wanted would have been to do:

    # zpool create pool c0t1d0s0
    ... create /opt_new, move data to it ...
    # zpool replace pool c0t1d0s0 c0t1d0s7
    ... create /aux0_new, move data to it ...
    # zpool attach pool c0t1d0s7 c0t0d0s7
    ... clean up attributes, wait for sync to complete ...
    # zpool iostat -v
                     capacity     operations    bandwidth
    pool           used  avail   read  write   read  write
    ------------  -----  -----  -----  -----  -----  -----
    pool          15.0G  8.83G      0      1  1.47K  6.99K
      mirror      15.0G  8.83G      0      1  1.47K  6.99K
        c0t1d0s7      -      -      0      1  2.30K  7.10K
        c0t0d0s7      -      -      0      1  2.21K  7.10K
    ------------  -----  -----  -----  -----  -----  -----
    #

The real lesson here is to do a little planning if you have to juggle slices around; missteps can take some work to undo. Read about "zpool replace" and "zpool attach"; they are very useful for this kind of juggling.

Once I got everything set up, everything just works; despite having cut my available storage in /aux0 and /aux1 in half (48GB -> 24GB, due to mirroring), compression is giving me back a substantial fraction of the loss (~70%, give or take, and assuming the ratio holds steady):

    # zfs get -r compressratio pool
    NAME             PROPERTY       VALUE                      SOURCE
    pool             compressratio  1.70x                      -
    pool/aux0        compressratio  1.54x                      -
    pool/aux1        compressratio  2.01x                      -
    pool/opt         compressratio  1.68x                      -
    #

and ZFS itself is quite zippy. I hope this is a useful lesson, and that you enjoy ZFS!

Tags: [OpenSolaris, ZFS]
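For anyone who wants to check the "~70%" figure, the arithmetic is short enough to write down. The Python sketch below is just that back-of-the-envelope calculation, using the round numbers from the post (48GB raw, 24GB after mirroring, a pool-wide compressratio of 1.70x) and assuming the ratio applies uniformly to everything stored; the function names are made up for the example.

```python
def effective_capacity(usable_gb, compressratio):
    """Logical data that fits in usable_gb at the given compression ratio."""
    return usable_gb * compressratio


def fraction_of_loss_recovered(raw_gb, usable_gb, compressratio):
    """How much of the space given up to mirroring compression wins back."""
    lost = raw_gb - usable_gb
    gained = effective_capacity(usable_gb, compressratio) - usable_gb
    return gained / lost


if __name__ == "__main__":
    # Round numbers from the post: 48GB raw, 24GB mirrored, 1.70x ratio.
    print(f"{effective_capacity(24, 1.70):.1f} GB effective")
    print(f"{fraction_of_loss_recovered(48, 24, 1.70):.0%} of the loss recovered")
```

In other words, 24GB of mirrored storage holds about 40.8GB of data at 1.70x, which is 16.8GB more than its raw size, or roughly 70% of the 24GB given up to mirroring.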



Brokenness Hides Itself

When engineers get together and talk, one of the things they like tobring out and share is war stories; tales of debugging daring-do andthe amazing brokenness that can be found in the process. I recentlywent through an experience that makes good war story material, and Ithought I'd share it.A couple weeks ago, there were multiple reports of svc.configd(1M)failing repeatedly with one of:svc.configd: Fatal error: invalid integer "10" in field "id"svc.configd: Fatal error: invalid integer in databaseSince I'm one of the main developers of svc.configd(1M), I started toinvestigate. I first had the people hitting it send me their repositories,but they all checked out as having no problems. The problem was only beingseen on prototype Niagara machines and some Netra T1s; the firstreproducible machine I got console access to was a Niagara box.Figuring out what happenedUnfortunately, the box was running old firmware, which significantlyrestrained its usability; I spent more time fighting the machine thanworking on tracking down the problem. I finally boot neted themachine, mounted the root filesystem, and added a line:sulg::sysinit:/sbin/sulogin </dev/console 2<>/dev/console >&2to /etc/inittab to get a shell very early in boot. After bootingand logging into that shell, I was able attach mdb(1) tosvc.configd(1M):# mdb -p `pgrep svc.configd`Loading modules: [ svc.configd ld.so.1 libumem.so.1 libuutil.so.1libc.so.1 ]> Now, I knew that svc.configd uses a utility function,uu_strtouint(),to do all of its integer-to-number conversions, and that the most likely causeof the failure seen was some kind of unprintable character in its firstargument. 
So I set a breakpoint there and started it running:> uu_strtouint::bp> :cmdb: stop at libuutil.so.1`uu_strtouintmdb: target stopped at:uu_strtouint: save %sp, -0x68, %sp> ::step[1]mdb: target stopped at:uu_strtouint+4: ld [%fp + 0x60], %l6> $C 2feefb968 libuutil.so`uu_strtouint+4(1cb4e4, feefba9c)feefb9d0 string_to_id+0x24(1cb4e4, feefba9c)feefba38 fill_child_callback+0x20(feefbbe4, 2)feefbaa0 sqlite_exec+0xd8(13de08, 2)feefbb18 backend_run+0x74(89b48, 169848)feefbb80 scope_fill_children+0x38(5e9f00, 1000)...> So the first argument to uu_strtouint() is 1cb4e4. I attemptedto print out the string there:> 1cb4e4/smdb: failed to read data from target: no mapping for address0x2e: > "failed to read data from target"? A reported address of 0x2e insteadof 0x1cb4e4? Clearly, the problem is affecting mdb(1) as well.Some additional tests in mdb(1) revealed the magnitude of the problem:> 1=J 1 > 10=J 1000000001 > 11=J 1000000002 > 111=J 11000000003 > 101=J 11000000002 i.e. the results were completely busted, and had a lot of high-word bits setunexpectedly. Interestingly, it looked like only multiple-digit numbers wereaffected.So I proceeded to investigateuu_strtouint().The first step was to see howthe function was failing; there are a number of different ways to getto uu_set_error(), which sets up libuutil's equivalentof errno. A simple breakpoint led to the following codesegment:269 if (strtoint(s, &val, base, 0) == -1)270 return (-1);271 272 if (val < min) {273 uu_set_error(UU_ERROR_UNDERFLOW);274 return (-1);275 } else if (val > max) {276 uu_set_error(UU_ERROR_OVERFLOW);277 return (-1);278 }The failure was occurring at line 276; i.e. an overflow. The value (foran input string of "10") was 0x10000000a, that is, 1\*2\^32+ 10. So something was going terribly wrong in strtoint(). Since mdb(1) was also failing, the problem was probably in some sharedlibrary routine. Looking at the disassembly, there are very few externalcalls:> strtoint::dis ! 
grep call | sed 's/libuutil.so.1`//g'strtoint+0xc: call +8 <strtoint+0x14>strtoint+0x204: call +0x12a30 <PLT:__udiv64>strtoint+0x2b8: call +0x12988 <PLT:__umul64>strtoint+0x404: call -0xd30 <uu_set_error>strtoint+0x414: call -0xd40 <uu_set_error>strtoint+0x424: call -0xd50 <uu_set_error>strtoint+0x440: call -0xd6c <uu_set_error>strtoint+0x450: call -0xd7c <uu_set_error>strtoint+0x460: call -0xd8c <uu_set_error>> The first call is part of the PIC[2]calling sequence, and uu_set_error() is only called in thefailure paths, which we knew weren't being hit. So __umul64()and __udiv64() are the next suspects. Theseare runtime support routines in libc, which the compiler inserts callsto when it wants to do 64-bit multiplications and divisions. The codefor strtoint() only has one multiply and one divide, so it's easy tosee where they occur:103 multmax = (uint64_t)UINT64_MAX / (uint64_t)base;104 105 for (c = \*++s; c != '\\0'; c = \*++s) {...116 if (val > multmax)117 overflow = 1;118 119 val \*= base;120 if ((uint64_t)UINT64_MAX - val < (uint64_t)i)121 overflow = 1;122 123 val += i;124 }The division always occurs, so I looked at the multiply routine first;disassembling it showed the following suspicious section:> __umul64::dis ! sed 's/libc.so.1`//g'...__umul64+0x38: cmp %l7, 0__umul64+0x3c: call +0xc95e4 <PLT:.umul>__umul64+0x40: mov %i3, %i0__umul64+0x44: mov %l6, %i1...For a function call, %o0-%o6 hold the arguments tothe function, and afterwards, %o0 and %o1 hold theresults. But here, there's no manipulation of the %os. Infact, the function doesn't reference them anywhere:> __umul64::dis ! grep '%o'> Here's the smoking gun; we've hit some kind of compiler bug. 
The relevant source code is:

usr/src/lib/libc/sparc/crt/mul64.c:

    extern unsigned long long __umul32x32to64(unsigned, unsigned);
    ...
    unsigned long long
    __umul64(unsigned long long i, unsigned long long j)
    {
    ...
            if (i1)
                    result = __umul32x32to64(i1, j1);
    ...
            return (result);
    }

usr/src/lib/libc/sparc/crt/muldiv64.il:

    .inline __umul32x32to64,8
    call    .umul,2
    nop
    mov     %o0, %o2
    mov     %o1, %o0
    mov     %o2, %o1
    .end

From previous experience with the compiler's inlining (I code-reviewed the fix for another inliner-related bug, 6225876) I knew that there is an optimization stage after the inline code is generated; from the looks of it, that stage thought that the call had no effect on register state, and so optimized away the argument setup. As a result, the final return value of __umul64 is just some junk register state. So I updated the original bug, and sent it over to the compiler people:

    6323803 compiler bug causes __*mul64 failure; svc.configd dies

After some investigation, the compiler folks accepted the bug, and noted that it only affects call statements in inline assembly. Checking the rest of the '.il' files in OS/Net, I verified that this was the only place where we used call.

I still needed to figure out why this was only affecting certain platforms, and how we were going to deal with the problem in the short term, so that we weren't waiting on a compiler patch.

Why only sun4v and Netras?

I first wrote a small C program, to test:

    % cat > math_test.c <<EOF
    #include <stdio.h>

    int
    main(int argc, char *argv[])
    {
            unsigned long long x, y, z;

            x = 10;
            y = 1;
            z = x * y;
            printf("%llx\n", z);
            return (0);
    }
    EOF
    % cc -o math_test math_test.c
    % ./math_test
    a
    % truss -t \!all -u '::__*mul64' ./math_test
    /1@1:   -> libc_psr:__mul64(0x0, 0xa, 0x0, 0x1)
    /1@1:   <- libc_psr:__mul64() = 0
    a
    %
    ... (moving over to niagara machine) ...
    % truss -t \!all -u '::__*mul64' ./math_test
    /1@1:   -> libc:__mul64(0x0, 0xa, 0x0, 0x1)
    /1@1:   <- libc:__mul64() = 1
    10000000a
    %

Aha!
For sun4u, libc_psr is overriding the libc definitions. So sun4v won't work for two reasons:

1. none of the sun4v libc_psrs override __*mul64
2. even if they did, sun4v has a wacky "mount the correct libc_psr over /platform/sun4v/lib/libc_psr.so.1" scheme, which wouldn't occur until later in boot anyway.

It ends up that the reason __*mul64 are overridden in libc_psr is a hold-over from Solaris 9 and earlier, where 32-bit Solaris was supported on pre-UltraSPARC chips. In fact, the call to .umul is a hold-over from the days of sparcv7, where the umul instruction wasn't guaranteed to be there. The sun4u libc_psr version has sparcv8plus versions of __mul64, which take advantage of the 64-bit processing that sparcv9 adds, in a form compatible with 32-bit code.

The fact that these weren't included in the sun4v libc_psr is an oversight, but #2 means that it wouldn't have mattered if they had been. The Netra T1s running into this problem are explained by the fact that there is a set of missing /platform/*/lib symlinks for the following platforms:

    SUNW,Ultra-1-Engine
    SUNW,UltraAX-MP
    SUNW,UltraAX-e
    SUNW,UltraAX-e2
    SUNW,UltraSPARC-IIi-Engine
    SUNW,UltraSPARC-IIi-cEngine
    SUNW,UltraSPARCengine_CP-20
    SUNW,UltraSPARCengine_CP-40
    SUNW,UltraSPARCengine_CP-60
    SUNW,UltraSPARCengine_CP-80

which are all various "Netra T1" varieties. Since the link to the sun4u libc_psr is missing, they exhibit the same problem as the sun4v does. I filed 6324790 to cover adding the missing links, since it is a performance problem (programs will not be able to take advantage of the faster mem{cpy,cmp,move,set} versions contained there).

The final question is "Why now?". What changed to make this a problem? Four days before the first reported incident, 6316914 was putback, which switched the build from the Studio 8 to the Studio 10 compilers.
Because of the libc_psr masking, no one noticed the problem until they put the bits on the (much rarer) platforms with the bug.

The Fix

To fix this, you simply move the __{u,}{mul,div}64 functions from libc_psr back into libc, using the v8plus versions that were in libc_psr. libc's assembly files are already being compiled in v8plus mode due to atomic_ops(3C), so it just required shuffling around some code, removing stuff from makefiles, and deleting the old, out-of-date code. This was done under bugid 6324631, integrated in the same build as the compiler switch, so only a limited number of OS/Net developers were affected. Life was all better.

Well, almost.

The follow-on

In testing my fix, I did a full build, BFUed a machine, and just dumped fixed binaries on other machines. The one thing I didn't test was BFUing from broken bits to fixed bits. And, of course, there was an unforeseen problem (bugid 6327152). To understand what went wrong, I'm going to have to give some background on how BFU works.

bfu is a power-tool whose essential job is to dump a full set of binaries over a running system, even if the new bits are incompatible with the ones running the system. To do this, it copies binaries, libraries, and the dynamic linker into a set of subdirectories of /tmp: /tmp/bfubin, /tmp/bfulib, and /tmp/bl. It then uses a tool called "bfuld" to re-write the "interpreter" information for the binaries in /tmp/bfubin, to point at the copied ld.so.1(1). It then sets LD_LIBRARY_PATH in the environment, to re-direct any executed programs to the copied libraries, and sets PATH=/tmp/bfubin. This gives BFU a protected environment to run in.

The problem is that auxiliary filters (like libc_psr) were not disabled, so programs running in the BFU environment were picking up the libc_psr from /platform. Once the *new* libc_psr was extracted, programs were no longer protected from the broken __*mul64() routines. Since things like scanf(3C) use __mul64 internally, this caused breakage all over the place, most noticeably in cpio(1).

The fix for this is reasonably simple; set LD_NOAUXFLTR=1 in the environment to prevent auxiliary filters from being used,[3] make a copy of libc_psr.so.1 into /tmp/bfulib, and use LD_PRELOAD=/tmp/bfulib/libc_psr.so.1 to override the bad libc functions. The latter part of this can be removed once we're sure no broken libcs are running around.

Conclusion

I hope you've enjoyed this. The bug ended up being surprisingly subtle (as many compiler bugs are), but luckily the fix was relatively simple. The Law of Unintended Consequences applies, as always.

Footnotes:

[1] ::stepping over the "save" instruction is a standard SPARC debugging trick; it makes the arguments to the function and the stack trace correct.

[2] Position-Independent Code, which is how shared libraries are compiled.

[3] Ironically, if we had done this *before* the compiler switch was done, BFU would have immediately failed when run on the broken bits, and the whole problem would have been noticed much more quickly.
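To recap the failure path from earlier in this post, here is a rough sketch of a digit-by-digit parser with the same overflow checks as the strtoint() excerpt, followed by a range check like the one in uu_strtouint(). The status codes and the names parse_u64() and check_range() are hypothetical; the real library reports errors through uu_set_error(). The point is only that a value of 0x10000000a, handed back by a broken multiply, sails through the parser and then trips the "val > max" test.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical status codes, for illustration only. */
enum status { ST_OK, ST_INVALID, ST_OVERFLOW, ST_UNDERFLOW };

/* Map one character to its digit value, or -1 if it is not a digit. */
static int
digit_value(char c)
{
	if (c >= '0' && c <= '9')
		return (c - '0');
	if (c >= 'a' && c <= 'z')
		return (c - 'a' + 10);
	if (c >= 'A' && c <= 'Z')
		return (c - 'A' + 10);
	return (-1);
}

/* Parse an unsigned number, detecting overflow before it can wrap. */
enum status
parse_u64(const char *s, unsigned base, uint64_t *out)
{
	assert(base >= 2 && base <= 36);
	uint64_t multmax = UINT64_MAX / base;	/* the one division */
	uint64_t val = 0;

	if (*s == '\0')
		return (ST_INVALID);
	for (; *s != '\0'; s++) {
		int d = digit_value(*s);
		if (d < 0 || (unsigned)d >= base)
			return (ST_INVALID);
		if (val > multmax)
			return (ST_OVERFLOW);
		val *= base;			/* the one multiply */
		if (UINT64_MAX - val < (uint64_t)d)
			return (ST_OVERFLOW);
		val += (uint64_t)d;
	}
	*out = val;
	return (ST_OK);
}

/* Range check applied by the caller after a successful parse. */
enum status
check_range(uint64_t val, uint64_t min, uint64_t max)
{
	if (val < min)
		return (ST_UNDERFLOW);
	if (val > max)
		return (ST_OVERFLOW);
	return (ST_OK);
}
```

For example, check_range(0x10000000aULL, 0, UINT32_MAX) reports an overflow, while the correctly parsed value 10 passes.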



The FBT provider, opensolaris source, and fun with the environment

Now that OpenSolaris is out there, it's quite a bit easier for folks to use DTrace's FBT provider. FBT provides "function boundary tracing", i.e. it has probes for the entry and exit of almost every function in the kernel. This is amazingly powerful and flexible, but that also makes it hard to use: with over 20,000 functions on a typical Solaris box, it's very hard to know where to start, especially without access to the source code.

With OpenSolaris, the source code is available. So to illustrate how you can use this newly available information to get something done, I thought I'd use a classic question: How can I examine and muck with the environment?

To start off, we need a plan of attack; since dtrace doesn't have any way of looping over data structures, typically if you want to walk some data structure, you just find a place where the kernel is already doing so, and use probes there to sneak a peek at the data as it goes by. The initial environment for a process is set up in the exec(2) system call, so let's see if we can find where we read in the data from the user process.

Looking at the source, exece() calls exec_common(), which is the main workhorse. There doesn't seem to be any direct mucking with the environment, but there is:

    ua.fname = fname;
    ua.argp = argp;
    ua.envp = envp;

    if ((error = gexec(&vp, &ua, &args, NULL, 0, &execsz,
        exec_file, p->p_cred)) != 0) {

Now we could continue to track all of this down, but it's much easier to just search usr/src/uts for references to the envp symbol. That leads us to stk_copyin(), the routine responsible for copying in everything which is needed on the stack.
Here's the environment processing loop:

    if (envp != NULL) {
            for (;;) {
                    if (stk_getptr(args, envp, &sp))
                            return (EFAULT);
                    if (sp == NULL)
                            break;
                    if ((error = stk_add(args, sp, UIO_USERSPACE)) != 0)
                            return (error);
                    envp += ptrsize;
            }
    }

Without even knowing what stk_getptr() and stk_add() do, we now have enough information to write a D script to get the environment of a process as it exec()s. Here's the basic outline:

First, use fbt::stk_copyin:entry to stash a copy of the envp pointer.
Second, use fbt::stk_getptr:entry to watch for reads from the stored envp address.
Third, use fbt::stk_add:entry and fbt::stk_add:return to print out the environment.
Lastly, use fbt::stk_copyin:return to clean up.

And here's our first script:

    #!/usr/sbin/dtrace -s

    #pragma D option quiet

    fbt::stk_copyin:entry
    {
            self->envp = (uintptr_t)args[0]->envp;
    }

    fbt::stk_getptr:entry
    / self->envp != 0 /
    {
            /* check if we're looking at envp or envp+1 */
            self->on = ((arg1 - self->envp) <= sizeof (uint64_t));

            /* update envp if we're on */
            self->envp = self->on ? arg1 : self->envp;
    }

    fbt::stk_add:entry
    / self->on && args[2] == UIO_USERSPACE /
    {
            self->ptr = arg1;
    }

    fbt::stk_add:return
    / self->ptr != 0 /
    {
            printf("%.79s\n", copyinstr(self->ptr));
            self->ptr = 0;
            self->on = 0;
    }

    fbt::stk_copyin:return
    {
            self->envp = 0;
            self->on = 0;
    }

Note that we delay the copyinstr of stk_add()'s second argument until fbt::stk_add:return. This is due to the fact that dtrace(1M) cannot fault in pages; so if a probe tries to copyinstr an address which has not yet been touched, you'll get a runtime error like:

    dtrace: error on enabled probe ID 4 (ID 12535: fbt:genunix:stk_add:entry):
    invalid address (0x1818000) in action #1 at DIF offset 28

By waiting until the return probe, we avoid this problem; we know that the kernel just touched the page to read in its copy.

Now, looking at the environment is fun, but it would be even more interesting to change the environment of a process while it is being execed.
This requires a bit more work, and access to the destructive action copyout(). I'm going to start with a script which requires a recent version of Solaris (snv_15+, or the OpenSolaris release), because Bryan introduced some nice string handling stuff recently. We'll adapt the script to S10 afterwards.

Let's start by saying we want to change the name of the environment variable "FOO" to "BAR", but leave the value the same. The basic idea is simple; copyin() the string in the fbt::stk_add:entry probe, and if it's the one we want to change, copyout() the changes. The kernel will then proceed to copyin() the changed string, and use it for the environment of the process. The complication is the same as before; what if the page hasn't yet been touched, or the copyout() operation fails (for example, if the string isn't writable)?

There's no simple solution, so I'm just going to check *afterwards* that we didn't miss changing it, and kill -9 the process if we did. It's vile, but effective. Here's the script:

    #!/usr/sbin/dtrace -s

    #pragma D option quiet
    #pragma D option destructive

    self uintptr_t ptr;

    inline int interesting = (strtok(copyinstr(self->ptr), "=") == "FOO");

    fbt::stk_copyin:entry
    {
            self->envp = (uintptr_t)args[0]->envp;
    }

    fbt::stk_getptr:entry
    / self->envp != 0 /
    {
            /* check if we're looking at envp or envp+1 */
            self->on = ((arg1 - self->envp) <= sizeof (uint64_t));

            /* update envp if we're on */
            self->envp = self->on ? arg1 : self->envp;
            self->ptr = 0;
    }

    fbt::stk_add:entry
    / self->on && args[2] == UIO_USERSPACE /
    {
            self->ptr = arg1;
            self->didit = 0;
    }

    fbt::stk_add:entry
    / self->ptr != 0 && interesting /
    {
            printf("%d: %s: changed env \"%s\"\n",
                pid, execname, copyinstr(self->ptr));
            copyout("BAR", self->ptr, 3);   /* 3 == strlen("BAR") */
            self->didit = 1;
    }

    fbt::stk_add:return
    / self->ptr != 0 && interesting && !self->didit /
    {
            printf("%d: %s: killed, env \"%s\" couldn't be changed\n",
                pid, execname, copyinstr(self->ptr));
            raise(9);
    }

    fbt::stk_copyin:return
    {
            self->envp = 0;
            self->on = 0;
            self->ptr = 0;
    }

The above works great on Solaris Nevada and OpenSolaris, but doesn't work on Solaris 10, because it uses "strtok". So to use it on Solaris 10, we'll have to do things slightly more manually. The only things that need to change are the definition of the "interesting" inline, and some more cleanup in fbt::stk_copyin:return:

    inline int interesting =
        ((self->str = copyinstr(self->ptr))[0] == 'F' &&
        self->str[1] == 'O' && self->str[2] == 'O' && self->str[3] == '=');
    ...
    fbt::stk_copyin:return
    {
            self->envp = 0;
            self->on = 0;
            self->ptr = 0;
            self->str = 0;
    }

A final note is on stability; we're using private implementation details of Solaris to make this all work, and they are subject to change without notice at any time. This particular part of Solaris isn't likely to change much, but you never know. A reasonable RFE would be for more Stable probe-points in the exec(3C) family, so that people can write things like this more stably.
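For comparison, here is what the same "is this FOO=...? then overwrite the name" logic looks like as ordinary userland C, walking a NULL-terminated envp-style array. This is only a sketch to make the string handling concrete: env_entry_matches() and rename_env_var() are made-up names, and nothing here runs in the kernel or touches DTrace. Like the copyout() of exactly 3 bytes in the D script, it relies on the old and new names having the same length, so the value after the '=' is left untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return nonzero if entry has the form "name=..." for exactly this name. */
int
env_entry_matches(const char *entry, const char *name)
{
	size_t n = strlen(name);

	return (strncmp(entry, name, n) == 0 && entry[n] == '=');
}

/*
 * Walk a NULL-terminated environment array and rename variable `from`
 * to `to` in place.  Returns the number of entries changed.
 */
size_t
rename_env_var(char **envp, const char *from, const char *to)
{
	size_t changed = 0;

	/* in-place overwrite only works for equal-length names */
	assert(strlen(from) == strlen(to));

	for (char **ep = envp; *ep != NULL; ep++) {
		if (env_entry_matches(*ep, from)) {
			/* overwrite just the name; the "=value" part stays */
			memcpy(*ep, to, strlen(to));
			changed++;
		}
	}
	return (changed);
}
```

Note that the match insists on the '=' right after the name, just as the Solaris 10 version of the inline checks the fourth character, so "FOOD=2" is not mistaken for "FOO".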



The implementation of ::findleaks

Now that OpenSolaris is available, it's time to explain how it all works. I thought I'd start with some of the mdb(1) dcmds I've worked on, since they are relatively self-contained, and have very strong interface boundaries.

I'll start with my favorite dcmd, ::findleaks, a memory leak detector originally written by Bryan Cantrill, which I refactored substantially late in Solaris 10. There were a few reasons the refactoring was necessary:

- When I'd done the original ::findleaks implementation for libumem, I simply copied the original leaky.c and re-wrote it into submission. This worked fine, but was an additional maintenance burden; two dissimilar copies of the same code are to be avoided if possible.

- The original code reported oversize leaks in an obscure way:

      findleaks: Oversize leak at seg 0x12345678!

  which, unless you are very familiar with the way the allocator works internally, was unlikely to lead you to the stack trace or size of the leaked buffer. (For the curious, 12345678$<vmem_seg was the necessary incantation in the old days.)

- The original ::findleaks algorithm was designed with running in a user process in mind. In particular, it assumed it had a lot of memory to play with. Unfortunately, with the coming of kmdb(1), memory space for dcmds was getting very tight. For ::findleaks to work under kmdb(1), it needed substantial belt-tightening.

- There were some enhancements I wanted to make; in particular, there was no simple way to dump all of the leaked stack traces. Internally, the "leakbot" tool (also written by Bryan) automatically extracted the set of stack traces, but this didn't help people without access to that tool.

So I started a side project to re-work ::findleaks.
The following bugs were part of it:

    4840780 ::findleaks oversized leak reporting needs overhaul
    4849669 ::findleaks' cleanup code needs work
    4931271 kmem walkers and ::findleaks use large amounts of memory
    5030758 ::findleaks should have an option to display the stack traces

but this note only covers the generic ::findleaks implementation.

The files of ::findleaks

Let's start with a lay of the land. The following files encompass the implementation of ::findleaks (to shorten things, I'm assuming a prefix of usr/src/cmd/mdb/common for all file references):

.../modules/genunix/leaky.h
    The dcmd, dcmd help, and walker function declarations, used by .../modules/genunix/genunix.c and .../modules/libumem/libumem.c to create the dcmd linkage for mdb(1).

.../modules/genunix/leaky_impl.h
    An implementation header, which defines (and documents, via a large block comment) the interface between the main engine, .../modules/genunix/leaky.c, and the two target implementations.

.../modules/genunix/leaky.c
    The common ::findleaks engine.

.../modules/genunix/leaky_subr.c
.../modules/libumem/leaky_subr.c
    The two target implementations, one for the kernel, one for libumem(3lib).

I'll use the term "target" when talking about the other side of ::findleaks's implementation. The non-static target interfaces all start with leaky_subr_.

The target interface is a link-time, functional binding; during the compilation process, alternate versions of the target routines are linked against essentially the same "generic" framework. This is less flexible than doing some kind of dynamic binding (i.e. using function pointers), but is sufficiently general as long as only one target interface is required for a given dmod.

The public interface

The top layer of interface consists of the dcmds (debugger commands) and walkers exported by the genunix and libumem dmods.
In the genunix dmod, these are defined in .../modules/genunix/genunix.c:

    static const mdb_dcmd_t dcmds[] = {
    ...
            /* from leaky.c + leaky_subr.c */
            { "findleaks", FINDLEAKS_USAGE,
                "search for potential kernel memory leaks", findleaks,
                findleaks_help },
    ...
    static const mdb_walker_t walkers[] = {
    ...
            /* from leaky.c + leaky_subr.c */
            { "leak", "given a leaked bufctl or vmem_seg, find leaks w/ same "
                "stack trace",
                    leaky_walk_init, leaky_walk_step, leaky_walk_fini },
            { "leakbuf", "given a leaked bufctl or vmem_seg, walk buffers for "
                "leaks w/ same stack trace",
                    leaky_walk_init, leaky_buf_walk_step, leaky_walk_fini },

(the structures used are part of the Debugger Module Linkage, defined in the Solaris Modular Debugger Guide)

The walkers are straightforward; they just walk the cached leak table. ::findleaks is the main event.

First up: initialize, estimate, slurp in the buffers

When ::findleaks is invoked, mdb(1) calls findleaks(), which validates its arguments, calls leaky_cleanup() to clean up any interrupted state, then checks if there is cached data. If not, we start the run in earnest.

The first call into the target interface is here:

    if ((ret = leaky_subr_estimate(&est)) != DCMD_OK)
            return (ret);

This calls over to the kernel or the libumem implementation, which first checks that finding memory leaks is possible (i.e. savecore -L dumps are not a consistent snapshot, so we refuse to run on them), then calculates an upper bound on the number of allocated segments in the system.

Shiny new estimate in hand, ::findleaks allocates a (possibly huge) array of
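The link-time binding between the generic engine and a target, described earlier in this post, can be illustrated with a toy example. Everything here is hypothetical: target_estimate() merely stands in for a leaky_subr_* routine, engine_run() for the generic engine, and the three pieces are squeezed into one file so the sketch compiles on its own. In a real build each piece would live in its own source file, and the choice of target would be made by which file is linked in, not by a function pointer.

```c
#include <assert.h>
#include <stddef.h>

/*
 * "Implementation header": the interface the generic engine expects
 * every target to provide.  (Hypothetical name and signature.)
 */
int target_estimate(size_t *estp);

/*
 * "Generic engine": written once, knows nothing about which target it
 * is linked against.  Returns 0 on success, nonzero on failure.
 */
int
engine_run(size_t *nsegs_out)
{
	size_t est = 0;
	int ret;

	assert(nsegs_out != NULL);

	/* first call into the target: can we run, and how big is it? */
	if ((ret = target_estimate(&est)) != 0)
		return (ret);

	*nsegs_out = est;	/* a real engine would size its tables here */
	return (0);
}

/*
 * "Target implementation": exactly one definition of each target
 * routine is linked into a given module.  A second target would define
 * the same symbol in a different file, built into a different module.
 */
int
target_estimate(size_t *estp)
{
	*estp = 1024;		/* made-up upper bound for illustration */
	return (0);
}
```

The trade-off is the one noted above: the engine calls the target directly with no indirection, at the cost of supporting only one target per module.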



Debugging smf(5)-managed processes

Recently, I've seen a couple of questions (internally and externally) about the best way to debug something controlled by smf(5); since the process isn't started up directly, the usual ways of running things under a debugger aren't effective. If you don't need to debug the initialization process, it's easy; just attach the debugger after the daemon is running.

If you need to debug the initialization, however, you need some way to stop the process before it does so. The easiest way is to use dtrace(1M). The following script:

    # cat > stop_daemon.d <<\EOF
    #!/usr/sbin/dtrace -s

    #pragma D option quiet
    #pragma D option destructive

    BEGIN
    {
            printf("READY\n");
    }

    proc:::start
    /execname == $$1/
    {
            stop();
            printf("stopped %d\n", pid);
            exit(0);
    }
    EOF
    # chmod +x stop_daemon.d

will let you grab the *daemon* process (i.e. at the time of the second fork()). If you need to grab the parent process, change proc:::start to proc:::exec-success. It takes the "execname" of the daemon (i.e. the first 16 characters of the executable name, which you can get using "pgrep -l daemon") as its argument. For example, if you wanted to debug the fault manager, you can do:

    # ./stop_daemon.d fmd &
    # READY
    svcadm restart fmd
    # stopped 5678
    mdb -p 5678
    Loading modules: [ fmd ld.so.1 ... ]
    >

Any debugger which can attach to processes will work; mdb(1), dbx(1), gdb(1), etc.



mdb(1) background, intro, and cheatsheet

In the Solaris kernel group, we take our crash dumps seriously. Historically, the two main tools for analyzing crash dumps on UNIX were adb(1) and crash(1M). More recently, mdb(1) has replaced them both as the debugger of choice in Solaris.

adb(1)

adb(1) is a venerable tool in the UNIX tool chest -- 7th Edition UNIX (from 1979) had a version of it. Its syntax is quite quirky (as you'd expect from such an old tool), and one thing to keep in mind is that adb(1) is an assembly-level debugger. Generally, it deals directly with register values and assembly instructions -- the only symbolic information it gives you is access to the symbol table.

That said, it has a reasonably powerful macro/scripting facility. During the development of SunOS, a large number of adb macros were written to dump out various bits of kernel state. In SunOS 3.5 (released in 1988), kadb(1) (an interactive kernel debugger version of adb(1)) already existed, as did 50-odd adb scripts, mostly generated with adbgen(1M). Solaris 2.0/SunOS 5.x continued the tradition, and by Solaris 9, there are over 890 scripts in /usr/lib/adb/sparcv9 (compared to 507 in Solaris 8).[1]

crash(1M)

crash(1M) is a bit more recent; it appeared sometime between 7th Edition UNIX and SysV R3, and while SunOS 3.5 did not have it, SunOS 4.x did. While adb(1) is a reasonably generic debugger with scripting facilities, crash(1M) takes an almost diametrically opposed approach: it uses compiled C code which knows how to traverse and understand various structures in the kernel to dump out information of interest.

This makes crash(1M) much more powerful than adb(1) (since you can do complicated things like virtual-to-physical address translation), while simultaneously making it much less flexible (if it wasn't already written into crash(1M), you're going to have to write it yourself, or do without).

This means that adb(1) and crash(1M) were quite complementary.
During any given debugging session, each might be used for its different strengths.[2]

mdb(1)

mdb(1), the Solaris "Modular Debugger", is the brain-child of Michael Shapiro and Bryan Cantrill. Upon their arrival in the Solaris Kernel Group, they took one look at adb and crash, and decided that they were both exceedingly long in the tooth. Together, they created mdb(1) to replace them. It's designed to embody their best features, while introducing a new framework for building debugging support, live and post-mortem. Because of the sheer number of existing adb macros, and the finger-memory of hundreds of people, mdb(1) is almost completely backwards compatible with adb(1).[3]

mdb(1) allows for extensibility in the form of "Debugger Modules" (dmods) which can provide "debugger commands" (dcmds) and "walkers". dcmds are similar to the commands of crash, while walkers walk a particular dataset. Both dcmds and walkers are written in C using the interface defined by the MDB module API, which is documented in the Modular Debugger Guide.

Using the mdb module API, kernel engineers (Mike, Bryan, Dan, and others) have built up a huge library of dcmds and walkers to explore Solaris -- my desktop (a Solaris 10 system) has 196 generic walkers and 368 dcmds defined. (There are ~200 auto-generated walkers for the various kmem caches on the system, which I'm not counting here.)

The neat thing about walkers and dcmds is that they can build on each other, combining to do more powerful things. This occurs both in their implementation and by the user's explicit action of placing them into an mdb "pipeline".

To give you a taste of the power of pipelines, here's an example, running against the live kernel on my desktop: the ::pgrep dcmd allows you to find all processes matching a pattern, the thread walker walks all of the threads in a process, and the ::findstack dcmd gets a stack trace for a given thread.
Connecting them into a pipeline, you get:

    # mdb -k
    Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba
    s1394 fctl nca audiosup logindmux ptm cpc random fcip nfs lofs ipc ]
    > ::pgrep sshd
    S    PID   PPID   PGID    SID UID      FLAGS             ADDR NAME
    R 100174      1 100174 100174   0 0x42000000 0000030009216790 sshd
    R 276948 100174 100174 100174   0 0x42010000 000003002d9a9860 sshd
    R 276617 100174 100174 100174   0 0x42010000 0000030013943010 sshd
    > ::pgrep sshd | ::walk thread
    3000c4f0c80
    311967e9660
    30f2ff2c340
    > ::pgrep sshd | ::walk thread | ::findstack
    stack pointer for thread 3000c4f0c80: 2a10099d071
    [ 000002a10099d071 cv_wait_sig_swap+0x130() ]
      000002a10099d121 poll_common+0x530()
      000002a10099d211 pollsys+0xf8()
      000002a10099d2f1 syscall_trap32+0x1e8()
    stack pointer for thread 311967e9660: 2a100897071
    [ 000002a100897071 cv_wait_sig_swap+0x130() ]
    stack pointer for thread 30f2ff2c340: 2a100693071
    [ 000002a100693071 cv_wait_sig_swap+0x130() ]
      000002a100693121 poll_common+0x530()
      000002a100693211 pollsys+0xf8()
      000002a1006932f1 syscall_trap32+0x1e8()
    >

yielding the stack traces of all sshd threads on the system (note that the middle one is swapped out). mdb pipelines are quite similar to standard UNIX pipelines, and allow those using the debugger a similar level of power and flexibility.

An mdb(1) cheat sheet

Because of its backwards compatibility with adb, mdb can have a bit of a learning curve. A while back, I put together an mdb(1) cheatsheet [ps, pdf] to reference during late-night post-mortem debugging sessions, and it has become a pretty popular reference in the Kernel Group.
It's designed to print out double-sided; the front covers the full mdb syntax, while the back is a set of commonly-used kernel dcmds and walkers, with short descriptions.

That's it for a quick history and tour -- I should be talking more about mdb later, along with libumem(3lib) (my current claim to fame), smf(5), and userland and kernel debugging in general.

Footnotes:

[1] The introduction of mdb(1) in Solaris 8, and CTF (compact ANSI-C type format) in Solaris 9 has started to slow down this trend significantly -- Solaris 10 will only have about 16 new adb scripts over Solaris 9.

[2] I have little direct experience with crash(1M) -- by the time I joined Sun, it had been EOLed.

[3] Invoking adb on Solaris 9 and later just invokes mdb in backwards-compatibility mode.
