Sunday Jun 19, 2005

Adding a kernel module to OpenSolaris

On opening day, I chose to post an entry on adding a system call to OpenSolaris. Considering the feedback, I thought I'd continue with brief "How-To add to OpenSolaris" documents for a while. There's a lot to choose from here, so I'll just pick them off as quickly as I can. Today's topic is adding a new kernel module to OpenSolaris.

For the sake of discussion, we will be adding a new module that does nothing apart from print a message on load and unload. It will be architecture-neutral, and be distributed as part of a separate package (to give you a taste of our packaging system). We'll continue my narcissistic tradition and name this the "schrock" module.

1. Adding source

To begin, you must put your source somewhere in the tree. It must go somewhere under usr/src/uts/common, but exactly where depends on the type of module. Just about the only firm rule is that filesystems go in the "fs" directory; beyond that, placement is largely convention. The bulk of the modules live in the "io" directory, since the majority of modules are drivers of some kind. For now, we'll put 'schrock.c' in the "io" directory:

#include <sys/modctl.h>
#include <sys/cmn_err.h>

static struct modlmisc modlmisc = {
	&mod_miscops,
	"schrock module %I%"
};

static struct modlinkage modlinkage = {
	MODREV_1, (void *)&modlmisc, NULL
};

int
_init(void)
{
	cmn_err(CE_WARN, "OpenSolaris has arrived");
	return (mod_install(&modlinkage));
}

int
_fini(void)
{
	cmn_err(CE_WARN, "OpenSolaris has left the building");
	return (mod_remove(&modlinkage));
}

int
_info(struct modinfo *modinfop)
{
	return (mod_info(&modlinkage, modinfop));
}

The code is pretty simple, and is basically the minimum needed to add a module to the system. Notice that we use 'mod_miscops' in our modlmisc linkage; if we were adding a device driver or filesystem, we would use a different set of linkage structures.

2. Creating Makefiles

We must add two Makefiles to get this building:

usr/src/uts/intel/schrock/Makefile
usr/src/uts/sparc/schrock/Makefile

With contents similar to the following:

UTSBASE = ../..

MODULE          = schrock
OBJECTS         = $(SCHROCK_OBJS:%=$(OBJS_DIR)/%)
LINTS           = $(SCHROCK_OBJS:%.o=$(LINTS_DIR)/%.ln)
ROOTMODULE      = $(ROOT_MISC_DIR)/$(MODULE)

include $(UTSBASE)/intel/Makefile.intel

ALL_TARGET      = $(BINARY)
LINT_TARGET     = $(MODULE).lint
INSTALL_TARGET  = $(BINARY) $(ROOTMODULE)

CFLAGS          += $(CCVERBOSE)

.KEEP_STATE:


def:            $(DEF_DEPS)

all:            $(ALL_DEPS)

clean:          $(CLEAN_DEPS)

clobber:        $(CLOBBER_DEPS)

lint:           $(LINT_DEPS)

modlintlib:     $(MODLINTLIB_DEPS)

clean.lint:     $(CLEAN_LINT_DEPS)

install:        $(INSTALL_DEPS)

include $(UTSBASE)/intel/Makefile.targ

3. Modifying existing Makefiles

There are two remaining Makefile chores before we can continue. First, we have to add the set of files to usr/src/uts/common/Makefile.files:

KMDB_OBJS += kdrv.o

SCHROCK_OBJS += schrock.o

BGE_OBJS += bge_main.o bge_chip.o bge_kstats.o bge_log.o bge_ndd.o \
                bge_atomic.o bge_mii.o bge_send.o bge_recv.o

If you had created a subdirectory for your module instead of placing it in "io", you would also have to add a set of rules to usr/src/uts/common/Makefile.rules. If you need to do this, make sure you get both the object targets and the lint targets, or you'll get build failures if you try to run lint.
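If your module had lived in its own subdirectory, the Makefile.rules entries would look roughly like this. This is a sketch following the existing patterns in that file for a hypothetical usr/src/uts/common/schrock directory; one rule builds the object (with CTF conversion), the other builds the lint target:

```make
#
# Hypothetical Makefile.rules entries for usr/src/uts/common/schrock
#
$(OBJS_DIR)/%.o:		$(UTSBASE)/common/schrock/%.c
	$(COMPILE.c) -o $@ $<
	$(CTFCONVERT_O)

$(LINTS_DIR)/%.ln:		$(UTSBASE)/common/schrock/%.c
	@($(LHEAD) $(LINT.c) $< $(LTAIL))
```

Forgetting the second rule is what causes the lint-only build failures mentioned above.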

You'll also need to modify the usr/src/uts/intel/Makefile.intel file, as well as the corresponding SPARC version:

MISC_KMODS      += usba usba10
MISC_KMODS      += zmod
MISC_KMODS      += schrock

#       Software Cryptographic Providers (/kernel/crypto):

4. Creating a package

As mentioned previously, we want this module to live in its own package. We start by creating usr/src/pkgdefs/SUNWschrock and adding it to the list of COMMON_SUBDIRS in usr/src/pkgdefs/Makefile:

        SUNWsasnm \
        SUNWsbp2 \
        SUNWschrock \
        SUNWscpr  \
        SUNWscpu  \

Next, we have to add a skeleton package system. Since we're only adding a miscellaneous module and not a full-blown driver, we only need a simple skeleton. First, there's the Makefile:

include ../Makefile.com

.KEEP_STATE:

all: $(FILES)
install: all pkg

include ../Makefile.targ

A 'pkginfo.tmpl' file:

NAME="Sample kernel module"
VENDOR="Sun Microsystems, Inc."
DESC="Sample kernel module"
HOTLINE="Please contact your local service provider"

And 'prototype_com', 'prototype_i386', and 'prototype_sparc' (elided) files:

# prototype_i386
!include prototype_com

d none kernel/misc/amd64 755 root sys
f none kernel/misc/amd64/schrock 755 root sys
# prototype_com
i pkginfo

d none kernel 755 root sys
d none kernel/misc 755 root sys
f none kernel/misc/schrock 755 root sys

5. Putting it all together

If we pkgadd our package, or BFU to the resulting archives, we can see our module in action:

halcyon# modload /kernel/misc/schrock
Jun 19 12:43:35 halcyon schrock: WARNING: OpenSolaris has arrived
halcyon# modunload -i 197
Jun 19 12:43:50 halcyon schrock: WARNING: OpenSolaris has left the building

This process is common to all kernel modules (though packaging is simpler for those combined in SUNWckr, for example). Things get a little more complicated and a little more specific when you begin to talk about drivers or filesystems in particular. I'll try to create some simple howtos for those as well.


Friday Jun 17, 2005

Observability in OpenSolaris

Just a heads up that we've formed a new OpenSolaris Observability community. There's not much there right now, but I encourage you to head over and check out what OpenSolaris has to offer. Or come to the discussion forum and gripe about what features we're still missing. Topics covered include process, system, hardware, and post-mortem observability. We'll be adding much more content as soon as we can.


GDB to MDB Migration, Part Two

So talking to Ben last night convinced me I needed to finish up the GDB to MDB reference that I started last month. So here's part two.


Program Stack

backtrace n             ::stack                 Display a stack backtrace for the current thread.
-                       thread::findstack -v    Display the stack for a given thread. In the kernel,
                                                thread is the address of the kthread_t; in userland,
                                                it's the thread identifier.
info frame              -                       Display information about the current frame. MDB
                                                doesn't support the debugging data necessary to
                                                maintain the frame abstraction.

Execution Control

continue                :c                      Continue the target.
stepi                   ::step                  Step to the next machine instruction. MDB does not
                                                support stepping by source lines.
nexti                   ::step over             Step over the next machine instruction, skipping
                                                any function calls.
finish                  ::step out              Continue until returning from the current frame.
jump *address           address>reg             Jump to the given location. In MDB, reg depends on
                                                your platform: for SPARC it's 'pc', for i386 it's
                                                'eip', and for amd64 it's 'rip'.

Displaying Data

print expr              addr::print expr        Print the given expression. In GDB you can specify
                                                variable names as well as addresses. In MDB, you
                                                give a particular address and then specify the type
                                                to display (which can include dereferencing of
                                                members, etc.).
print /f                addr/f                  Print data in a precise format. See ::formats for a
                                                list of MDB formats.
disassemble addr        addr::dis               Disassemble text at the given address, or the
                                                current PC if no address is specified.

This is just a primer. Both programs support a wide variety of additional options. Running 'mdb -k', you can quickly see just how many commands are out there:

> ::dcmds ! wc -l
> ::walkers ! wc -l

One helpful trick is ::dcmds ! grep thing, which searches the description of each command. Good luck, and join the discussion over at the OpenSolaris MDB community if you have any questions or tips of your own.


Tuesday Jun 14, 2005

How to add a system call to OpenSolaris

When I first started in the Solaris group, I was faced with two equally difficult tasks: learning the development model, and understanding the source code. For both these tasks, the recommended method is usually picking a small bug and working through the process. For the curious, the first bug I putback to ON was 4912227 (ptree call returns zero on failure), a simple bug with near zero risk. It was the first step down a very long road.

As another first step, someone suggested adding a very simple system call to the kernel. This turned out to be a whole lot harder than one would expect, and has so many subtle aspects that experienced Solaris engineers (myself included) still miss some of the necessary changes. With that in mind, I thought a reasonable first OpenSolaris blog entry would be describing exactly how to add a new system call to the kernel.

For the purposes of this post, we will assume that it's a simple system call that lives in the generic kernel code, and we'll put the code into an existing file to avoid having to deal with Makefiles. The goal is to print an arbitrary message to the console whenever the system call is issued.

1. Picking a syscall number

Before writing any real code, we first have to pick a number that will represent our system call. The main source of documentation here is syscall.h, which describes all the available system call numbers, as well as which ones are reserved. The maximum number of syscalls is currently 256 (NSYSCALL), which doesn't leave much space for new ones. This could theoretically be extended - I believe the hard limit is in the size of sysset_t, whose 16 integers must be able to represent a complete bitmask of all system calls. This puts our actual limit at 16*32, or 512, system calls. But for the purposes of our tutorial, we'll pick system call number 56, which is currently unused. For my own amusement, we'll name our (my?) system call 'schrock'. So first we add the following line to syscall.h:

#define SYS_uadmin      55
#define SYS_schrock     56
#define SYS_utssys      57
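The sysset_t capacity argument above is easy to check with a small userland sketch. The names here (fake_sysset_t and friends) are hypothetical, not the real kernel types; the point is just that 16 32-bit words give a hard ceiling of 512 mask bits:

```c
#include <assert.h>

/* Hypothetical stand-in for the sysset_t arithmetic described above. */
#define	FAKE_WORDS	16
#define	FAKE_WORDBITS	32
#define	FAKE_MAXSYS	(FAKE_WORDS * FAKE_WORDBITS)	/* 16*32 = 512 */

typedef struct {
	unsigned int word[FAKE_WORDS];
} fake_sysset_t;

/* Mark a syscall number as a member of the set. */
static void
fake_sysset_add(fake_sysset_t *sp, int sysnum)
{
	sp->word[sysnum / FAKE_WORDBITS] |= 1u << (sysnum % FAKE_WORDBITS);
}

/* Test whether a syscall number is in the set. */
static int
fake_sysset_member(const fake_sysset_t *sp, int sysnum)
{
	return ((sp->word[sysnum / FAKE_WORDBITS] >>
	    (sysnum % FAKE_WORDBITS)) & 1);
}
```

Growing the table past 512 entries would mean widening this mask, which is why it's a hard limit rather than a tunable.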

2. Writing the syscall handler

Next, we have to actually add the function that will get called when we invoke the system call. What we should really do is add a new file schrock.c to usr/src/uts/common/syscall, but I'm trying to avoid Makefiles. Instead, we'll just stick it in getpid.c:

#include <sys/cmn_err.h>

int
schrock(void *arg)
{
	char	buf[1024];
	size_t	len;

	if (copyinstr(arg, buf, sizeof (buf), &len) != 0)
		return (set_errno(EFAULT));

	cmn_err(CE_WARN, "%s", buf);

	return (0);
}
Note that declaring a buffer of 1024 bytes on the stack is a very bad thing to do in the kernel. We have limited stack space, and a stack overflow will result in a panic. We also don't check that the length of the string was less than our scratch space. But this will suffice for illustrative purposes. The cmn_err() function is the simplest way to display messages from the kernel.
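The contract being relied on here can be mocked up in userland. This is an illustrative sketch with a hypothetical name, not the kernel routine: copy a NUL-terminated string into a fixed-size buffer, report the copied length including the NUL, and fail distinctly when the source doesn't fit, which is exactly the case the handler above glosses over:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Userland mock of the copyinstr() contract; illustrative only. */
static int
mock_copyinstr(const char *src, char *dst, size_t max, size_t *lenp)
{
	size_t i;

	for (i = 0; i < max; i++) {
		dst[i] = src[i];
		if (src[i] == '\0') {
			if (lenp != NULL)
				*lenp = i + 1;	/* length includes the NUL */
			return (0);
		}
	}
	return (ENAMETOOLONG);	/* source didn't fit in the scratch space */
}
```

A more careful handler would distinguish this does-not-fit error from a genuine bad-pointer fault instead of mapping every failure to EFAULT.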

3. Adding an entry to the syscall table

We need to place an entry in the system call table. This table lives in sysent.c, and makes heavy use of macros to simplify the source. Our system call takes a single argument and returns an integer, so we'll need to use the SYSENT_CI macro. We need to add a prototype for our syscall, and add an entry to the sysent and sysent32 tables:

int     rename();
void    rexit();
int     schrock();
int     semsys();
int     setgid();

/* ... */

        /* 54 */ SYSENT_CI("ioctl",             ioctl,          3),
        /* 55 */ SYSENT_CI("uadmin",            uadmin,         3),
        /* 56 */ SYSENT_CI("schrock",           schrock,        1),
        /* 57 */ IF_LP64(
                        SYSENT_2CI("utssys",    utssys64,       4),
                        SYSENT_2CI("utssys",    utssys32,       4)),

/* ... */

        /* 54 */ SYSENT_CI("ioctl",             ioctl,          3),
        /* 55 */ SYSENT_CI("uadmin",            uadmin,         3),
        /* 56 */ SYSENT_CI("schrock",           schrock,        1),
        /* 57 */ SYSENT_2CI("utssys",           utssys32,       4),

4. /etc/name_to_sysnum

At this point, we could write a program to invoke our system call, but the point here is to illustrate everything that needs to be done to integrate a system call, so we can't ignore the little things. One of these little things is /etc/name_to_sysnum, which provides a mapping between system call names and numbers, and is used by dtrace(1M), truss(1), and friends. Of course, there is one version for x86 and one for SPARC, so you will have to add the following lines to both the intel and SPARC versions:

ioctl                   54
uadmin                  55
schrock                 56
utssys                  57
fdsync                  58

5. truss(1)

Truss does fancy decoding of system call arguments. In order to do this, we need to maintain a table in truss that describes the type of each argument for every syscall. This table is found in systable.c. Since our syscall takes a single string, we add the following entry:

{"ioctl",       3, DEC, NOV, DEC, IOC, IOA},                    /*  54 */
{"uadmin",      3, DEC, NOV, DEC, DEC, DEC},                    /*  55 */
{"schrock",     1, DEC, NOV, STG},                              /*  56 */
{"utssys",      4, DEC, NOV, HEX, DEC, UTS, HEX},               /*  57 */
{"fdsync",      2, DEC, NOV, DEC, FFG},                         /*  58 */

Don't worry too much about the different constants. But be sure to read up on the truss source code if you're adding a complicated system call.

6. proc_names.c

This is the file that gets missed the most often when adding a new syscall. Libproc uses the table in proc_names.c to translate between system call numbers and names. Why it doesn't make use of /etc/name_to_sysnum is anybody's guess, but for now you have to update the systable array in this file:

        "ioctl",                /* 54 */
        "uadmin",               /* 55 */
        "schrock",              /* 56 */
        "utssys",               /* 57 */
        "fdsync",               /* 58 */

7. Putting it all together

Finally, everything is in place. We can test our system call with a simple program:

#include <sys/syscall.h>

int
main(int argc, char **argv)
{
	syscall(SYS_schrock, "OpenSolaris Rules!");
	return (0);
}
If we run this on our system, we'll see the following output on the console:

Jun 14 13:42:21 halcyon genunix: WARNING: OpenSolaris Rules!

Because we did all the extra work, we can actually observe the behavior using truss(1), mdb(1), or dtrace(1M). As you can see, adding a system call is not as easy as it should be. One of the ideas that has been floating around for a while is the Grand Unified Syscall(tm) project, which would centralize all this information as well as provide type information for the DTrace syscall provider. But until that happens, we'll have to deal with this process.


Saturday Jun 04, 2005

FISL Final Day

The last day of FISL has come and gone, thankfully. I'm completely drained, both physically and mentally. As you can probably tell from the comments on yesterday's blog entry, we had quite a night out last night in Porto Alegre. I didn't stay out quite as late as some of the Brazil guys, but Ken and I made it back in time to catch about 4 hours of sleep before heading off to the conference. Thankfully I remembered to set my alarm, otherwise I probably would have ended up in bed until the early afternoon. The full details of the night are better told in person...

This last day was significantly quieter than previous days. With the conference winding down, I assume that many people took off early. Most of our presentations today were to an audience of 2 or 3 people, and we even had to cancel some of the early ones as no one was there. I managed to give presentations for Performance, Zones, and DTrace, despite my complete lack of sleep. The DTrace presentation was particularly rough because it's primarily demo-driven, with no set plan. This turns out to be rather difficult after a night of no sleep and a few too many caipirinhas.

The highlight of the day was when a woman (stunningly beautiful, of course) came up to me while I was sitting in one of the chairs and asked to take a picture with me. We didn't talk at all, and I didn't know who she was, but she seemed psyched to be getting her picture taken with someone from Sun. I just keep telling myself that it was my stunning good looks that resulted in the picture, not my badge saying "Sun Microsystems". I can dream, can't I?

Tomorrow begins the 24 hours of travelling to get me back home. I can't wait to get back to my own apartment and a normal lifestyle.

Friday Jun 03, 2005

FISL Day 3

The exhaustion continues to increase. Today I did 3 presentations: DTrace, Zones, and FMA (which turned into OpenSolaris). Every one took up the full hour allotted. And tomorrow I'm going to add a Solaris performance presentation, to bring the grand total to 4 hours of presentations. Given how bad the acoustics are on the exposition floor, my goal is to lose my voice by the end of the night. So far, I've settled into a schedule: wake up around 7:00, check email, work on slides, eat breakfast, then get to the conference around 8:45. After a full day of talking and giving presentations, I get back to the hotel around 7:45 and do about an hour of work/email before going out to dinner. We get back from dinner around 11:30, at which point I get to blogging and finishing up some work. Eventually I get to sleep around 1:00, at which point I have to do the whole thing the next day. Thank god tomorrow is the end; I don't know how much more I can take.

Today's highlight was when Dimas (from Sun Brazil) began an impromptu Looking Glass demo towards the end of the day. He ended up overflowing our booth with at least 40 people for a solid hour before the commotion started to die down. Those of us sitting in the corner were worried we'd have to leave to make room. Our Solaris presentations hit 25 or so people, but never so many for so long. The combination of cool eye candy and a native Portuguese speaker really helped out (though most people probably couldn't hear him anyway).

Other highlights included hanging out with the folks at CodeBreakers, who really seem to dig Solaris (Thiago had S10 installed on his laptop within half a day). We took some pictures with them (which Dave should post soon), and are going out for barbeque and drinks tonight with them and 100+ other open source Brazil folks. I also helped a few other people get Solaris 10 installed on their laptops (mostly just the "disable USB legacy support" problem). It's unbelievably cool to see the results of handing out Solaris 10 DVDs before even leaving the conference. The top Solaris presentations were understandably DTrace and Zones, though the booth was pretty well packed all day.

Let's hope the last day is as good as the rest. Here's to Software Livre!

Thursday Jun 02, 2005

FISL Day 2

Another day at FISL, another day full of presentations. Today we did mini-presentations every hour on the hour, most of which were very well attended. When we overlapped with the major keynote sessions, turnout tended to be low, but other than that it was very successful. We covered OpenSolaris, DTrace, FMA, SMF, Security, as well as a Java presentation (by Charlie, not Dave or myself). As usual, lots of great questions from the highly technical audience.

The highlight today was a great conversation with a group of folks very interested in starting an OpenSolaris users group in Brazil. Extremely nice group of guys, very interested in technology and helping OpenSolaris build a greater presence in Brazil (both through user groups and Solaris attendance at conferences). I have to say that after experiencing this conference and seeing the enthusiasm that everyone has for exciting technology and open source, I have to agree that Brazil is a great place to focus our OpenSolaris presence. Hopefully we'll see user groups pop up here as well as the rest of the world. We'll be doing everything we can to help from within Sun.

The other, more amusing, highlight of the day was during my DTrace demonstration. I needed an interesting java application to demonstrate the jstack() DTrace action, so I started up the only java application (apart from some internal Sun tools) that I use on a regular basis: Yahoo! Sports Fantasy Baseball StatTracker (the classic version, not the new flash one). I tried to explain that maybe I was trying to debug why the app was lying to me about Tejada going 0-2 so far in the Sox/Orioles game; really he should have hit two homers and I should be dominating this week's scores1. I was rather amused, but I think the cultural divide was a little too wide. Not only baseball, but fantasy baseball: I don't blame the audience at all.


1 This is clearly a lie. Despite any dreams of fantasy baseball domination, I would never root for my players in a game over the Red Sox. In the end, Ryan's 40.5 ERA was worth the bottom of the ninth comeback capped by Ortiz's 3-run shot.

Tuesday May 31, 2005

Live from Brazil

Dave Powell and myself have arrived at FISL, an open source conference in Brazil, along with a crowd of other Sun folks. Dave and I (with introduction from Sun VP Tom Goguen) will be hosting a 4 hour OpenSolaris pre-event tomorrow, June 1st. We'll be talking about all the cool features available in OpenSolaris, as well as how Solaris development works today and how we hope it will work in the future. If you're attending the conference, be sure to stop by to learn about OpenSolaris, and what makes Solaris (and Solaris developers) tick. We'll also be hanging around the Sun booth during the rest of the conference, giving mini-presentations, demos, answering questions, and chatting with anyone who will listen. We're happy to talk about OpenSolaris, Solaris, Sun, or your favorite scenes from Monty Python and the Holy Grail. Oh yeah, there will be lots of T-shirts and Solaris 10 DVDs as well.

Thursday May 12, 2005

In the News(.com)

So it looks like my blog made it over to the front page of News.com in this article about slipping Solaris 10 features. Don't get your hopes up - I'm not going to refute Genn's claims; we certainly are not scheduled for a specific update at the moment. But pay attention to the details: ZFS and Janus will be available in an earlier Solaris Express release. I also find it encouraging that engineers like myself have a voice that actually gets picked up by the regular press (without being blown out of proportion or slashdotted).

I would like to point out that I putback the last major chunk of command redesign to the ZFS gate yesterday ;-) There are certainly some features left to implement, but the fact that I re-whacked all of the userland components (within six weeks, no less) should not be interpreted as any statement of schedule plans. Hopefully I can get into some of the details of what we're doing but I don't want to be seen as promoting vaporware (even though we have many happy beta customers) or exposing unfinished interfaces which are subject to change.

I also happen to be involved with the ongoing Janus work, but that's another story altogether. I swear there's no connection between myself and slipping products (at least not one where I'm the cause).

Update: So much for not getting blown out of proportion. Leave it to the second tier news sites to turn "not scheduled for an update" into "delayed indefinitely over deficiencies". Honestly, rewriting 5% of the code should hardly be interpreted as "delayed indefinitely" - so much for legitimate journalism. Please keep in mind that all features will hit Software Express before a S10 Update, and OpenSolaris even sooner.

Tuesday May 10, 2005

GDB to MDB Migration, Part One

In past comments, it has been pointed out that a transition guide between GDB and MDB would be useful to some developers out there. A full comparison would also cover dbx(1), but I'll defer this to a later point. Given the number of available commands, I'll be dividing up this post into at least two pieces.

Before diving into too much detail, it should be noted that MDB and GDB have slightly different design goals. MDB (and KMDB) replaced the aging adb(1) and crash(1M), and was designed primarily for post-mortem analysis and live kernel analysis. To this end, MDB presents the same interface when debugging a crash dump as when examining a live kernel. Solaris corefiles have been enhanced so that all the information for the process (including library text and type information) is present in the corefile. MDB can examine and run live processes, but lacks some of the features (source level debugging, STABS/DWARF support, conditional breakpoints, scripting language) that are standard for developer-centric tools like GDB (or dbx). GDB was designed for interactive process debugging. While you can use GDB on corefiles (and even LKCD crash dumps or Linux kernels - locally and remotely), you often need the original object files to take advantage of GDB's features.

Before going too far into MDB, be sure to check out Jonathan's MDB Cheatsheet as a useful quick reference guide, with some examples of stringing together commands into pipelines. Seeing as how I'm not the most accomplished GDB user in the world, I'll be basing this comparison off the equivalent GDB reference card.


Starting Up

gdb program             mdb path                Start debugging a command or running process. GDB
                        mdb -p pid              will treat numeric arguments as pids, while mdb
                                                explicitly requires the '-p' option.
gdb program core        mdb [ program ] core    Debug a corefile associated with 'program'. For
                                                MDB, the program is optional and is generally
                                                unnecessary given the corefile enhancements made
                                                during Solaris 10.
quit                    ::quit                  Both programs also exit on Ctrl-D.

Getting Help

help command            ::help dcmd             Get help on a specific command. In MDB, you can
                                                also list all the available walkers or dcmds.
                                                Another useful trick is ::dmods -l module, which
                                                lists the walkers and dcmds provided by a specific
                                                module.

Running Programs

run arglist             ::run arglist           Run the program with the given arguments. If the
                                                target is currently running, or is a corefile, MDB
                                                will restart the program if possible.
kill                    ::kill                  Forcibly kill and release the target.
show env                ::getenv                Display the current environment.
set env var string      ::setenv var=string     Set an environment variable.
show env var            ::getenv var            Get a specific environment variable.

Shell Commands

shell cmd               ! cmd                   Execute the given shell command.

Breakpoints and Watchpoints

break func              addr::bp                Set a breakpoint at the given address or function.
break *addr
break file:line         -                       Break at the given line of the file. MDB does not
                                                support source level debugging.
break ... if expr       -                       Set a conditional breakpoint. MDB doesn't support
                                                conditional breakpoints, though you can get a close
                                                approximation via the -c option (though it's
                                                complicated enough to warrant its own post).
watch expr              addr::wp -rwx [-L size] Set a watchpoint on the given region of memory.
info break              ::events                Display active watchpoints and breakpoints. MDB
info watch                                      will show you signal events as well.
delete [n]              ::delete n              Delete the given breakpoint or watchpoint.

I think that's enough for now; hopefully the table is at least readable. More to come in a future post.

Sunday Apr 03, 2005

Bug of the week

There are many bugs out there that are interesting, either because of an implementation detail or the debugging necessary to root cause the problem. As you may have noticed, I like to publicly expound upon the most interesting ones I've fixed (as long as it's not a security vulnerability). This week turned up a rather interesting specimen:

6198523 dirfindvp() can erroneously return ENOENT

This bug was first spotted by Casper back in November last year while trying to do some builds on ZFS. The basic pathology is that at some point during the build, we'd get error messages like:

sh: cannot determine current directory

Some ideas were kicked around by the ZFS team, and after the problem seemed to go away, the team believed that some recent mass of changes had also fixed the problem. Five months later, Jonathan hit the same bug on another build machine running ZFS. As I wrote the getcwd() code, I was determined to root cause the problem this time around.

Back in build 56 of S10, I moved getcwd(3c) into the kernel, along with changes to store pathnames with vnodes (which is used by the DTrace I/O provider as well as pfiles(1)). Basically, we first try to do a forward lookup on the stored pathname; if that works, then we simply return the resolved path1. If this fails (vnode paths are never guaranteed to be correct), then we have to fall back into the slow path. This slow path involves looking up the parent, finding the current vnode in parent, prepending path, and repeat. Once we reach the root of the filesystem, we have a complete path.
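That slow path can be sketched in userland with ordinary readdir() and stat() calls. This is an illustrative toy (hypothetical name, minimal error handling), not the kernel's dirtopath(): starting from a directory, find its entry in the parent by matching inode and device numbers, prepend the name, and repeat until the root:

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Userland sketch of the getcwd() slow path described above. */
static int
slow_getcwd(const char *start, char *out, size_t outlen)
{
	char dir[1024], parent[1032], comp[256], tmp[1024];
	struct stat cur, up, ent;
	struct dirent *dp;
	DIR *d;

	(void) snprintf(dir, sizeof (dir), "%s", start);
	out[0] = '\0';

	for (;;) {
		if (stat(dir, &cur) != 0)
			return (-1);
		(void) snprintf(parent, sizeof (parent), "%s/..", dir);
		if (stat(parent, &up) != 0)
			return (-1);
		if (up.st_ino == cur.st_ino && up.st_dev == cur.st_dev)
			break;			/* reached the root */
		if ((d = opendir(parent)) == NULL)
			return (-1);
		comp[0] = '\0';
		while ((dp = readdir(d)) != NULL) {
			char path[1400];

			if (strcmp(dp->d_name, ".") == 0 ||
			    strcmp(dp->d_name, "..") == 0)
				continue;
			(void) snprintf(path, sizeof (path), "%s/%s",
			    parent, dp->d_name);
			/* find the current dir's entry in its parent */
			if (stat(path, &ent) == 0 &&
			    ent.st_ino == cur.st_ino &&
			    ent.st_dev == cur.st_dev) {
				(void) snprintf(comp, sizeof (comp), "%s",
				    dp->d_name);
				break;
			}
		}
		(void) closedir(d);
		if (comp[0] == '\0')
			return (-1);	/* the dirfindvp() analogue failed */
		/* prepend "/name" to the path built so far */
		(void) snprintf(tmp, sizeof (tmp), "/%s%s", comp, out);
		(void) snprintf(out, outlen, "%s", tmp);
		(void) snprintf(dir, sizeof (dir), "%s", parent);
	}
	if (out[0] == '\0')
		(void) snprintf(out, outlen, "/");
	return (0);
}
```

The "find the current directory in its parent" step is where dirfindvp() comes in, and where the bug discussed below lives.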

To debug this problem, I used this D script to track the behavior of dirtopath(), the function that performs the dirty work of the slow path. Running this for a while produced a tasty bit of information:

dirtopath       /export/ws/build/usr/src/cmd/sgs/ld
lookup(/export/ws/build/usr/src/cmd, .make.dependency.8309dfdc.234596.166) failed (2)
dirfindvp(/export/ws/build/usr/src/cmd,/export/ws/build/usr/src/cmd/sgs) failed (2)
dirtopath() returned 2

Looking at this, it was clear that dirfindvp() (which finds a given vnode in its parent) was inappropriately failing. In particular, after a failed lookup for a temporary make file, we bail out of the loop and report failure, despite the fact that "sgs" is still sitting there in the directory. A long look at the code revealed the problem. Without revealing too much of the code (OpenSolaris, where are you?), it's essentially structured like so:

        while (!err && !eof) {

                /* ... */

                while ((intptr_t)dp < (intptr_t)dbuf + dbuflen) {

                        /* ... */

                        /*
                         * We only want to bail out if there was an error other
                         * than ENOENT.  Otherwise, it could be that someone
                         * just removed an entry since the readdir() call, and
                         * the entry we want is further on in the directory.
                         */
                        if (err != ENOENT) {

The code is trying to avoid exactly our situation: we fail to do a lookup of a file we just saw because the contents are rapidly changing. The bug is that the while loop has a check for !err && !eof. If we fail to look up an entry, and it's the last entry in the chunk we just read, then we'll prematurely bail out of the enclosing while loop, returning ENOENT when we shouldn't. Using this test program, it's easy to reproduce on both ZFS and UFS. There are several noteworthy aspects of this bug:

  • The bug had been in the gate for over a year, and there hadn't been a single reported build failure.

  • It only happens when the cached vnode value is invalid, which is rare2.

  • It is a race condition between readdir, lookup, and remove.

  • On UFS, inodes are marked as deleted but can still be looked up until the delete queue is processed at a later point. ZFS deletes entries immediately, so this was much more apparent on ZFS.

  • Because of the above, it was incredibly transient. It would have taken an order of magnitude more time to root cause if not for DTrace, which excels at solving these transient phenomena.

A three line change and the bug was fixed, and will make it back to S10 in time for Update 1. If it hadn't been for those among us willing to run our builds on top of ZFS, this problem would not have been found until ZFS integrated, or a customer escalation cost the company a whole bunch of money.
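The premature exit is easiest to see in a toy model of that loop structure. Everything here is hypothetical and deliberately simplified (not the real dirfindvp() code): each "chunk" stands in for one readdir() buffer, and looking up the name "gone" fails with ENOENT as if the entry had just been removed. With the buggy condition, an ENOENT on the last entry of a chunk ends the whole scan even though the wanted entry sits in the next chunk:

```c
#include <assert.h>
#include <string.h>

#define	TOY_ENOENT	2

/* Pretend "gone" was removed between readdir() and lookup. */
static int
toy_lookup(const char *name)
{
	return (strcmp(name, "gone") == 0 ? TOY_ENOENT : 0);
}

/*
 * Scan chunks of two names each for 'want'.  When buggy is set, the
 * outer loop's (!err && !eof) test aborts the scan if ENOENT happened
 * on the last entry of a chunk; the fix clears ENOENT and keeps going.
 */
static int
find_entry(const char *chunks[][2], int nchunks, const char *want,
    int buggy)
{
	int err = 0, eof = 0, chunk = 0;

	while (!err && !eof) {
		int i;

		for (i = 0; i < 2; i++) {
			const char *name = chunks[chunk][i];

			err = toy_lookup(name);
			if (err == 0 && strcmp(name, want) == 0)
				return (0);		/* found it */
			if (err != 0 && err != TOY_ENOENT)
				break;			/* real error */
			if (!buggy)
				err = 0;	/* the fix: ignore ENOENT */
		}
		if (++chunk == nchunks)
			eof = 1;
	}
	return (err ? err : TOY_ENOENT);
}
```

With a chunk of { "a", "gone" } followed by { "sgs", "b" }, the buggy version returns ENOENT while the fixed version finds "sgs", which mirrors the build failure above.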

1 There are many more subtleties here relating to Zones, and verifying that the path hasn't been changed to refer to another file. The curious among you will have to wait for OpenSolaris.

2 I haven't yet investigated why we ended up in the slow path in this case. First things first.

Friday Mar 18, 2005

Google coredumper

In the last few days you may have noticed that Google released a site filled with Open Source applications and interfaces. First off, kudos to the Google guys for putting this together. It's always great to see a company open sourcing their tools, as well as encouraging open standards to take advantage of their services.

That being said, I found the google coredumper particularly amusing. From the google page:

coredumper: Gives you the ability to dump cores from programs when it was previously not possible.

Being very close to the debugging tools on Solaris, I was a little taken aback by this statement. On Solaris, the gcore(1) command has always been a supported tool for generating standard Solaris core files readable by any debugger. Seeing as how I can't imagine a UNIX system without this tool, I went looking in some old source trees to find out when it was originally written. While the current Solaris version has been re-written over the course of time, I did find this comment buried in the old SunOS 3.5 source:

     * gcore - get core images of running processes 
     * Author: Eric Cooper 
     * Written: Fall 1981. 
     * Inspired by a version 6 program by Len Levin, 1978. 
     * Several pieces of code lifted from Bill Joy's 4BSD ps. 

So this tool has been a standard part of UNIX since 1981, and based on sources as old as 1978. This is why the statement that it was "previously not possible" on Linux seemed shocking to me. Just to be sure, I logged into one of our machines running Linux and tried poking around:

    $ find /usr/bin -name "*core*"

No luck. Intrigued, I took a look at the google project. From the included README:

The coredumper library can be compiled into applications to create core dumps of the running program, without having to terminate them. It supports both single- and multi-threaded core dumps, even if the kernel does not have native support for multi-threaded core files.

So the design goal appears to be slightly different: being able to dump core from within the program itself. On Solaris, I would just fork/exec a copy of gcore(1), or use the (unfortunately private) libproc interface to do so. I find it hard to believe that there are kernels out there without support for multi-threaded core files, though. A quick Google search for 'gcore linux' turned up a few mailing list articles here, here, and here. I went and downloaded the latest GDB sources, and sure enough there is a "gcore" command. I went back to our lab machine and tested it out with gdb 5.1, and sure enough it worked. But reading back the file was not as successful:

    # gdb -p `pgrep nscd`
    (gdb) info threads
      7 Thread 5126 (LWP 1018)  0x420e7fc2 in accept () from /lib/i686/
      6 Thread 4101 (LWP 1017)  0x420e7fc2 in accept () from /lib/i686/
      5 Thread 3076 (LWP 1016)  0x420e7fc2 in accept () from /lib/i686/
      4 Thread 2051 (LWP 1015)  0x420e0037 in poll () from /lib/i686/
      3 Thread 1026 (LWP 1014)  0x420e7fc2 in accept () from /lib/i686/
      2 Thread 2049 (LWP 1013)  0x420e0037 in poll () from /lib/i686/
      1 Thread 1024 (LWP 1007)  0x420e7fc2 in accept () from /lib/i686/
    (gdb) bt
    #0  0x420e7fc2 in accept () from /lib/i686/
    #1  0x40034603 in accept () from /lib/i686/
    #2  0x0804acd5 in geteuid ()
    #3  0x4002ffef in pthread_start_thread () from /lib/i686/
    (gdb) gcore
    Saved corefile core.1014
    (gdb) quit
    The program is running.  Quit anyway (and detach it)? (y or n) y
    # gdb core.1014.
    "/tmp/core.1014": not in executable format: File format not recognized
    (gdb) quit
    # gdb /usr/sbin/nscd core.1014
    Core was generated by `/usr/sbin/nscd'.
    Program terminated with signal 17, Child status changed.
    #0  0x420e0037 in poll () from /lib/i686/
    (gdb) info threads
      7 process 67109878  0x420e7fc2 in accept () from /lib/i686/
      6 process 134284278  0x420e0037 in poll () from /lib/i686/
      5 process 67240950  0x420e7fc2 in accept () from /lib/i686/
      4 process 134415350  0x420e7fc2 in accept () from /lib/i686/
      3 process 201589750  0x420e7fc2 in accept () from /lib/i686/
      2 process 268764150  0x420e7fc2 in accept () from /lib/i686/
    * 1 process 335938550  0x420e0037 in poll () from /lib/i686/
    (gdb) bt
    #0  0x420e0037 in poll () from /lib/i686/
    #1  0x0804aca8 in geteuid ()
    #2  0x4002ffef in pthread_start_thread () from /lib/i686/
    (gdb) quit

This whole exercise was rather distressing, and brought me straight back to college, when I had to deal with gdb on a regular basis (Brown moved to Linux my senior year, and I was responsible, together with Rob, for porting the Brown Simulator and Weenix OS from Solaris). Everything seemed fine when first attaching to the process, and the gcore command appeared to work. But when reading the corefile back, gdb can't understand a lone corefile, the process/thread IDs have been completely garbled, and I've lost floating point state (not shown above). It makes me glad that we have MDB and configurable corefile content in Solaris 10.

This is likely an unfair comparison, since it uses GDB version 5.1 when the latest is 6.3, but at least it validates the existence of the google library. I always pay attention to debugging tools around the industry, but it seems I need a little more hands-on experience to really gauge the current state of affairs. I'll have to get access to a system running a more recent version of GDB to see if it is any better before drawing any definitive conclusions. Then again, Solaris has had a working gcore(1) and mdb(1)/adb(1) since the SunOS days back in the 80s, so I don't see why I should have to lower my expectations just because it's GNU/Linux.

Thursday Mar 17, 2005

Another interesting bug

I know it's been a long time since my last entry: I've either been too busy, out of the country, working on (not yet public) projects, or fixing relatively uninteresting bugs. But last week I finally nailed down a nasty bug that had been haunting me for several weeks, so I thought I'd share some of the experience. I apologize if this post gets a little too technical or incomprehensible, but I found it to be an interesting exercise, and hopefully sharing it will get me back into the blogging spirit.

First a little background. In Solaris, we have a set of kernel functions known as 'copyops' used to transfer data between the kernel and userland. In order to support watchpoints and SunCluster, we maintain a backup vector of functions to use when one of these copy routines fails. For example, if you have a piece of data on a watched page, we keep that page entirely unmapped. If the kernel tries to read data from this page, the copyin() function will initially fail before falling back on watch_copyin(), which temporarily maps in the page, does the copy (triggering a watchpoint if necessary), and then unmaps the page. In this way, the average kernel consumer has no idea that there was a watched area on the page.

Clustering uses this facility in their pxfs (proxy filesystem) implementation. In order to support ioctl() calls that access an unspecified amount of memory, they use the copyops vector to translate any reads or writes into over-the-wire requests for the necessary data. These requests are always done from kernel threads, with no attached user space, so any attempt to access userland should fault before vectoring off to their copyops vector.

OK, on to the bug. During testing, SunCluster folks found that they were getting essentially random memory corruption during some ioctl() calls over pxfs on SPARC machines. After trying in vain to understand the crash dumps, the Clustering folks were able to reproduce the problem on DEBUG bits. In addition to getting traptrace output (a black-box style record of OS traps), the kernel failed an ASSERT() deep in the sfmmu HAT (Spitfire Memory Management Unit Hardware Address Translation) layer during a copyin() call. This smoking gun pointed straight to the copyops. We expect a kernel thread accessing userland to generate a T_DATA_EXCEPTION trap, but instead we were getting a T_DATA_MMU_MISS trap, which the HAT was not prepared to handle (nor should it have to).

I spent nearly a week enhancing my copyops test suite, and following several wrong paths deep into SPARC trap tables and the HAT layer. But no amount of testing could reproduce the problem. Finally, I noticed that we had reached the sfmmu assertion as a kernel thread, but our secondary ASI was set to INVALID_CONTEXT instead of KCONTEXT. On SPARC, all addresses are implicitly tagged with an ASI (address space identifier) that lets us refer to kernel addresses and user addresses without having to share the address space like we do on x86. All kernel threads are supposed to use KCONTEXT (0) as their secondary ASI. INVALID_CONTEXT (1) is reserved for userland threads in various invalid states. Needless to say, this was confusing.

I knew that somehow we were setting the secondary ASI improperly, or forgetting to set it when we should. I began adding some ASSERTs to a custom kernel and quickly ruled out the former. Finally I booted a kernel with some debug code added to resume(), and panicked almost instantly. It was clear that we were coming out of resume() as a kernel thread, but with INVALID_CONTEXT as our secondary ASI. Many hours of debugging later, I finally found my culprit in resume_from_zombie(), which is used when resuming from an exited thread. When a user thread is exiting, we re-parent to p0 (the kernel 'sched' process) and set our secondary ASI to INVALID_CONTEXT. If, in resume(), we switch from one of these threads to another kernel thread, we see that they both belong to the same process (p0) and don't bother to re-initialize the secondary ASI. We even have a function, hat_thread_exit(), designed to do exactly this, only it was a no-op on SPARC. I added a call to sfmmu_setctx_sec() to this function, and the problem disappeared. Technically, this has been a bug since the dawn of time, but it had no ill side effects until I changed the way the copyops were used, and SunCluster began testing on S10.

Besides the sheer amount of debugging effort, this bug was interesting for several reasons:

  • It was impossible to root cause on a non-DEBUG kernel. While we try to make the normal kernel as debuggable as possible, memory corruption (especially due to corruption in the HAT layer) is one of those problems that needs to be caught as close to the source as possible. Solaris has a huge amount of debug code, as well as facilities like traptrace that can only be enabled on a debug kernel due to performance overhead.
  • The cause of the problem was separated from the symptom by an arbitrary period of time. Once we switched to a kernel thread with a bad ASI, we could harmlessly switch between any number of kernel threads before we find one that actually tries to access userland.
  • It was completely unobservable in constrained test scenarios. We not only needed to create kernel threads that accessed userland, but we needed to have a userland thread exit and then switch immediately to one of these threads. Needless to say, this is not easy to reproduce, especially when you don't understand exactly what's going on.
  • This would have been nearly unsolvable on most other OSes. Without a kernel debugger, post mortem crash dump analysis, and tools like DTrace and traptrace records, I doubt I could have ever solved this problem. This is one of those situations where a stack trace and a bunch of printf() calls would never have solved the problem.

While this wasn't the most difficult problem I've ever had to debug, it certainly ranks up there in recent memory.

Wednesday Feb 02, 2005

::whatthread and MDB modules

A long time ago I described a debugging problem where it was necessary to determine which threads owned a reader lock. In particular, I used the heuristic that if the address of the rwlock is in a particular thread's stack, then it is most likely held by the thread (which can be verified by examining the thread's stack). This works 99% of the time, because you typically have the following:

    rw_enter(lock, RW_READER);
    /* ... do something ... */
    rw_exit(lock);


The compiler has to preserve the address of the lock across all the junk in the middle, so it almost always ends up getting pushed on the stack. As described in the previous post, this means a combination of ::kgrep and ::whatis, plus some hand-pruning, to get the threads in question. At the time, I mentioned how nice it would be to have a dedicated command to do this dirty work. Now that Solaris 10 has shipped, I finally sat down and gave it a try. In a testament to MDB's well-designed interfaces, I was able to write the entire command in under 5 minutes with just 50 lines of code. On top of that, it runs in a fraction of the time: rather than searching the entire address space, we only have to look at the stack for each thread. For example:

    > c8d45bb6::kgrep | ::whatis
    c8d45ae4 is c8d45aa0+44, allocated as a thread structure
    cae92ed8 is in thread c8d45aa0's stack
    cae92ee4 is in thread c8d45aa0's stack
    cae92ef8 is in thread c8d45aa0's stack
    cae92f24 is in thread c8d45aa0's stack
    > c8d45bb6::whatthread

The simple output allows it to be piped to ::findstack to quickly locate questionable threads. There have been discussions about maintaining a very small set of held reader locks in the thread structure, but it's a difficult problem to solve definitively (without introducing massive performance regressions).

This demonstrates an oft-overlooked benefit of MDB. Though few developers outside of the Solaris group write MDB modules, developing them is extremely simple and powerful (there are more than 500 commands and walkers in MDB today). Over time, I think I've almost managed to suppress all the painful GDB memories from my college years...

Tuesday Feb 01, 2005

DTrace and customer service

Today, I thought I'd share a real-world experience that might portray DTrace in a slightly different light than you're used to. The other week, I was helping a customer with the following question:

Why is automountd constantly taking up 1.2% of CPU time?

The first thought that came to mind was a broken automountd. But if that were the case, you'd be more likely to see it spinning and stealing 100% of the CPU. Just to be safe, I asked the customer to send truss -u a.out:: output for the automountd process. As expected, I saw automountd chugging away, happily servicing each request as it came in. Automountd was doing nothing wrong - some process was indirectly sending millions of requests a day to the automounter. Taking a brief look at the kernel code, I responded with the following D script:

    #!/usr/sbin/dtrace -s

    /* Count automounter lookups by requesting process and path */
    fbt::auto_lookup:entry
    {
            @lookups[execname, stringof(args[0]->fi_path)] = count();
    }

The customer gave it a shot, and found a misbehaving program that was continuously restarting and causing loads of automount activity. Without any further help from me, the customer could easily see exactly which application was the source of the problem, and quickly fixed the misconfiguration.

Afterwards, I reflected on how simple this exchange was, and how difficult it would have been in the pre-Solaris 10 days. Now, I don't expect customers to be able to come up with the above D script on their own (though industrious admins will soon be able to wade through OpenSolaris code). But I was able to resolve their problem in just 2 emails. I was reminded of the infamous gtik2_applet2 fiasco described in the DTrace USENIX paper - automountd was just a symptom of an underlying problem, part of an interaction that was prohibitively difficult to trace to its source. One could turn on automountd debug output, but you'd still only see the request itself, not where it came from. To top it off, the offending processes were so short-lived that they never showed up in prstat(1) output, hiding from traditional system-wide tools.

After a little thought, I imagined a few Solaris 9 scenarios where I'd either set a kernel breakpoint via kadb, or set a user breakpoint in automountd and use mdb -k to see which threads were waiting for a response. But these (and all other solutions I came up with) were:

  • Disruptive to the running system
  • Not guaranteed to isolate the particular problem
  • Difficult for the customer to understand and execute

It really makes me feel the pain our customer support staff must go through now to support Solaris 8 and Solaris 9. DTrace is such a fundamental change in the debugging and observability paradigm that it changes not only the way we kernel engineers work, but also the way people develop applications, administer machines, and support customers. Too bad we can't EOL Solaris 8 and Solaris 9 next week for the benefit of Sun support...
