Monday Aug 23, 2004

Solaris 10 top 11-20 number 15: kmdb

go to the Solaris 10 top 11-20 list for more

Getting back to the business of the Solaris 10 top 11-20, Eric Schrock has written up a great piece on kmdb, the new kernel-mode debugger that is now available in Solaris Express 8/04. Check it out.

Thursday Aug 19, 2004

Inside Solaris Express

Since a few people in various forums have been asking about it, I thought I'd explain a little about how Solaris Express works. I know the story best from the kernel side, but keep in mind there are other parts of Solaris -- Java, the X server, etc. -- that have slightly different processes.

In kernel development we cut a build of Solaris 10 every two weeks; these are numbered s10_XX (for example, Solaris Express 7/04 is s10_60). Those take a week or two to coagulate into the WOS (Wad Of Stuff), which combines the kernel with the latest cut of the X server, GNOME, etc. We spend another week or three making sure there's nothing too toxic in that build and release it in the form of a Solaris Express build. The lag time between when the build cuts and when it hits the streets in a Solaris Express build is usually about 4-6 weeks. We're about to release Solaris Express 8/04 (s10_63) and we just cut s10_66 on Monday. Note that Solaris Express isn't some release which we spend extensive time polishing; unless there's some truly tragic problem, you're using the same bits that I'm using on my desktop. Since we cut a build every two weeks, we choose the best, most stable of the two or three builds since the last Solaris Express release, but usually it's the latest stuff. It can be pretty daunting to know that once you integrate a change into Solaris there's very little time to make sure it's right -- we take a lot of pride in making sure Solaris is stable not just for every release of Solaris Express, but for every numbered build and, in fact, every nightly build.

As far as what to expect in future releases, I have some hints for DTrace here, but other than that, I think you just have to bite your nails and wait for the release notes. I will tell you that SX 9/04 is going to be exciting -- check out Stephen's weblog for why.

As I mentioned, SX 8/04 will be out very soon. Check out my DTrace Solaris Express decoder ring to see what new DTrace features are in this release (hint: -c and -p are way cool). Dan Price has written up a great description of all the stuff that's new in Solaris Express 8/04.

Wednesday Aug 11, 2004

Assuaging OpenSolaris fears

While trawling through b.s.c., a comment caught my eye in this post from Glenn's weblog:

As a shareholder, I do NOT want you to "open source" solaris in its entirety (ESPECIALLY DTrace!). I want you to keep the good stuff completely sun-only, accessible only under NDA.

Certainly, this echoes some of the same concern I had when I started hearing rumblings about OpenSolaris -- we in Solaris have spent years of our lives making these innovations (ESPECIALLY DTrace!), and we don't want to see them robbed. I'm also a shareholder and I don't want to see my investments of time, effort, and -- forgive me -- money go to waste.

Now that I know more about OpenSolaris and open source in general, I'm confident that Sun isn't selling out Solaris or giving away the company's crown jewel; rather, we're going to make Solaris better and more widely used. That sounds a little Rah Rah Solaris, but let's look more closely at OpenSolaris and what it might mean for Solaris and for Sun (and for the author of the comment, a shareholder).

Open source is an interesting dichotomy: on one side there are the developers and the community with the spirit of the free trade of software and ideas, and on the other side there are the Linux vendors selling service contracts to fat cat customers. The former is clearly the benefit of OpenSolaris -- a larger community of developers and users will improve Solaris and grow its audience. The latter is the potential risk -- we're concerned that other companies might directly steal Sun's customers by using Sun's technology. The specifics of the OpenSolaris license haven't been finalized, so it's possible that the license and patents will prevent Linux vendors from selling technologies developed in Solaris outright. Regardless, Solaris isn't just a bunch of code; it's the support and service and documentation and us, the Solaris developers.

When a Sun customer pays for Solaris, they're paying for someone over here to answer the phone when they call and for me and others in Solaris kernel development to do the things they need. Even when the source code is available, customers will still want to tap into the origins of that code and talk to the people who made it. If there are problems they'll want to be able to rely on the experts to fix them.

What about documentation? The Solaris Dynamic Tracing Guide is still going to be free but only as in beer -- we're not open sourcing our documentation (at least as far as I know). So let's say dtrace.c was dropped into Linux; would they then rewrite the entire answer book (400 pages and counting) from scratch? Maybe this wouldn't matter much to ordinary users, but if you're giving some Linux vendor a big sweaty wad of cash to support DTrace on Linux you expect some documentation! The Solaris docs would be close enough for some users, but not for customers shelling out the big big dollars for a service contract.

Even if Linux were able to replicate DTrace and document it and a Linux vendor were able to support it, I'm confident the existing and growing Solaris community could keep innovating and push Solaris ahead. On a more personal note, I'm also excited about OpenSolaris because it means if I were ever to leave Sun, I could still work on DTrace, mdb, nohup, and the other parts of Solaris that I consider my own.

OpenSolaris can only help Sun. If it succeeds, there will be a larger community of Solaris developers making it work with more platforms and devices, fixing more problems, and improving the quality of life on Solaris, which will spawn an even larger community of Solaris users, both individuals and paying customers; if OpenSolaris fails, then it won't help to create those communities, and I think that's the only consequence.

Friday Aug 06, 2004

Number 19 of 20: per-thread p-tools

p-tools

Since Solaris 7 we've included a bunch of process observability tools -- the so-called "p-tools". Some of them inspect aspects of the process as a whole. For example, the pmap(1) command shows you information about a process's mappings, their location and ancillary information (the associated file, shmid, etc.). pldd(1) is another example; it shows which shared objects a process has opened.

Other p-tools apply to the threads in a process. The pstack(1) utility shows the call stacks for each thread in a process. New in Solaris 10, Eric and Andrei have modified the p-tools that apply to threads so that you can specify the threads you're interested in rather than having to sift through all of them.

pstack(1)

Developers and administrators often use pstack(1) to see what a process is doing and if it's making progress. You'll often turn to pstack(1) after prstat(1) or top(1) shows a process consuming a bunch of CPU time -- what's that guy up to? Complex processes can have many, many threads; fortunately prstat(1)'s -L flag will split out each thread in a process as its own row so you can quickly see that thread 5, say, is the one that's hammering the processor. Now rather than sifting through all 100 threads to find thread 5, you can just do this:

$ pstack 107/5
100225: /usr/sbin/nscd
-----------------  lwp# 5 / thread# 5  --------------------
 c2a0314c nanosleep (c25edfb0, c25edfb8)
 08056a96 gethost_revalidate (0) + 4b
 c2a02d10 _thr_setup (c2949000) + 50
 c2a02ed0 _lwp_start (c2949000, 0, 0, c25edff8, c2a02ed0, c2949000)

Alternatively, you can specify a range of threads (5-7 or 11-) or a combination of ranges (5-7,11-), giving us something like this:

$ pstack 107/5-7,11-
100225: /usr/sbin/nscd
-----------------  lwp# 5 / thread# 5  --------------------
 c2a0314c nanosleep (c25edfb0, c25edfb8)
 08056a96 gethost_revalidate (0) + 4b
 c2a02d10 _thr_setup (c2949000) + 50
 c2a02ed0 _lwp_start (c2949000, 0, 0, c25edff8, c2a02ed0, c2949000)
-----------------  lwp# 6 / thread# 6  --------------------
 c2a0314c nanosleep (c24edfb0, c24edfb8)
 080577d6 getnode_revalidate (0) + 4b
 c2a02d10 _thr_setup (c2949400) + 50
 c2a02ed0 _lwp_start (c2949400, 0, 0, c24edff8, c2a02ed0, c2949400)
-----------------  lwp# 7 / thread# 7  --------------------
 c2a0314c nanosleep (c23edfb0, c23edfb8)
 08055f56 getgr_revalidate (0) + 4b
 c2a02d10 _thr_setup (c2949800) + 50
 c2a02ed0 _lwp_start (c2949800, 0, 0, c23edff8, c2a02ed0, c2949800)
-----------------  lwp# 11 / thread# 11  --------------------
 c2a0314c nanosleep (c1fcdf60, c1fcdf68)
 0805887d reap_hash (80ca918, 8081140, 807f2f8, 259) + ed
 0805292a nsc_reaper (807f92c, 80ca918, 8081140, 807f2f8, c1fcdfec, c2a02d10) + 6d
 08055ded getpw_uid_reaper (0) + 1d
 c2a02d10 _thr_setup (c20d0800) + 50
 c2a02ed0 _lwp_start (c20d0800, 0, 0, c1fcdff8, c2a02ed0, c20d0800)
...
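
The range syntax used above (a single thread, a closed range, an open-ended range, or a comma-separated mix) is simple enough to sketch in a few lines. What follows is only an illustration of how such a specification could be interpreted, not the code the p-tools actually use; the names parse_lwp_spec and selects are made up for this example, and the error handling is an assumption.

```python
from typing import List, Optional, Tuple

# (low, high) bounds, inclusive; high is None for an open-ended range like "11-".
LwpRange = Tuple[int, Optional[int]]


def parse_lwp_spec(spec: str) -> List[LwpRange]:
    """Parse a thread specification such as "5", "5-7" or "5-7,11-".

    Hypothetical helper: it mirrors the syntax shown in the post, nothing more.
    """
    ranges: List[LwpRange] = []
    for part in spec.split(","):
        low_text, dash, high_text = part.strip().partition("-")
        low = int(low_text)  # raises ValueError on empty or non-numeric input
        if not dash:
            high = low                 # single thread: "5"
        elif high_text:
            high = int(high_text)      # closed range: "5-7"
        else:
            high = None                # open-ended range: "11-"
        if high is not None and high < low:
            raise ValueError(f"backwards range: {part!r}")
        ranges.append((low, high))
    return ranges


def selects(ranges: List[LwpRange], lwpid: int) -> bool:
    """Return True if lwpid falls inside any of the parsed ranges."""
    return any(low <= lwpid and (high is None or lwpid <= high)
               for low, high in ranges)
```

With this sketch, parse_lwp_spec("5-7,11-") yields [(5, 7), (11, None)], which selects threads 5, 6, 7 and everything from 11 up, matching the pstack output above.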

The thread specification syntax also works for core files if you're just trying to drill down on, say, the thread that caused the fatal problem:

$ pstack core/2
core 'core/2' of 100225:        /usr/sbin/nscd
-----------------  lwp# 2 / thread# 2  --------------------
 c2a04888 door     (c28fbdc0, 74, 0, 0, c28fde00, 4)
 080540bd ???????? (deadbeee, c28fddec, 11, 0, 0, 8053d33)
 c2a0491c _door_return () + bc

truss(1)

The truss(1) utility is the mother of all p-tools. It lets you trace a process's system calls, faults, and signals as well as user-land function calls. In addition to consuming pretty much every lower- and upper-case command line option, truss(1) now also supports the thread specification syntax. Now you can follow just the threads that are doing something interesting:

$ truss -p 107/5
openat(-3041965, ".", O_RDONLY|O_NDELAY|O_LARGEFILE) = 3
fcntl(3, F_SETFD, 0x00000001)                   = 0
fstat64(3, 0x08047800)                          = 0
getdents64(3, 0xC2ABE000, 8192)                 = 8184
brk(0x080721C8)                                 = 0
...

pbind(1)

The pbind(1) utility isn't an observability tool; rather, this p-tool binds a process to a particular CPU so that it will only run on that CPU (except in some unusual circumstances; see the man page for details). For multi-threaded processes, the process is clearly not the right granularity for this kind of activity -- you want to be able to bind this thread to that CPU, and those threads to some other CPU. In Solaris 10, that's a snap:

$ pbind -b 1 107/2
lwp id 107/2: was not bound, now 1
$ pbind -b 0 107/2-5
lwp id 107/2: was 1, now 0
lwp id 107/3: was not bound, now 0
lwp id 107/4: was not bound, now 0
lwp id 107/5: was not bound, now 0

These are perfect examples of Solaris responding to requests from users: there was no easy way to solve these problems, and that was causing our users pain, so we fixed it. After the BOF at OSCON, a Solaris user had a laundry list of problems and requests, and was skeptical about our interest in fixing them. I convinced him that we do care, but we need to hear about them. So let's hear about your gripes and wish lists for Solaris. Many of the usability features (the p-tools for example) came out of our own use of Solaris in kernel development -- once OpenSolaris lets everyone be a Solaris kernel developer, I'm sure we'll be stumbling onto many more quality of life tools like pstack(1), truss(1), and pbind(1).

Friday Jul 30, 2004

Linux, Solaris, and Open Source

This past week at OSCON I've spent my time trying to understand open source processes, talking about Solaris, and trying to figure out what OpenSolaris is going to look like.

Learning from Linux

I attended a talk by Greg Kroah-Hartman about Linux kernel development. As we work towards open sourcing Solaris, we're trying to figure out how to do it right -- source control, process, licenses, community etc. As I didn't know much about how Linux development works, I was hoping to learn from a largely successful open source operating system.

Linux development is built around fiefdoms maintained by folks like Greg. Ordinary folks can contribute to the repositories they maintain (either directly or by proxy based on some sort of Linux-street-cred it seems). Those repositories are then fed up to a combined unstable repository, and from there Linus himself ordains the patches and welcomes them into the circle of Linux 2.6.x. This all seemed to make some sense and work alright. That is, until someone asked about firewire support. The answer: "I wouldn't run firewire -- it should build [laughter], but I wouldn't run it." The discussion then led to Linux testing, which it seems is highly ad-hoc and unreliable. IBM and Novell are working on nightly testing runs, but very little exists today in terms of quality control tests or general tests that developers themselves can run before they integrate their changes.

In Solaris, testing can be arduous. Some changes are obvious and can be tested on just one architecture, but others require extensive tests on a variety of SPARC and x86 platforms. And Linux supports so many more platforms! I have no idea how a developer working on his x86 box can ever be sure that some seemingly innocuous change hasn't broken 64-bit PPC (or whatever). Clearly this is something we have to solve for OpenSolaris -- reliability and testing are at the core of our DNA in the Solaris kernel group, and we need to export not only that idea to the community, but also some subset of facilities so that contributors can adhere to the same levels of quality.

OpenSolaris

Later in the day a bunch of us from Solaris met with some open source leaders (I don't know quite how one earns that title, but that's what our liaison told us they were). We first told them where we were: Yes, we really are going to open source Solaris; no, we don't know the license yet; no, we don't know if it's going to be GPL compatible; no, we aren't planning on moving the cool stuff in Solaris over to Linux ourselves; and, no, we do not know what the license is going to be, but we promise to tell you when we do.

We got a lot of helpful suggestions from folks involved with Apache and other projects. "Bite sized bugs" sound like a great way to get new people involved with Solaris and contributing code without a huge investment of effort. Documentation, partitioning and documenting that partitioning will all be much more important than we had previously anticipated. We get the message: OpenSolaris will be easy to download, build, and install, and we'll make sure it's as easy as possible to get started with development.

The one thing that disappointed me was the lack of knowledge about Solaris 10 -- some comments suggested that the members of the panel think Solaris doesn't really have anything interesting. Fortunately we had the BOF that night...

The Solaris Community

Andy, Bart, Eric and I held a BOF session last night to talk about OpenSolaris and Solaris 10. After satiating the crowd's curiosity about open sourcing Solaris (no, we don't know what the license is going to be), we gave some Solaris 10 demonstrations.

Since we were tight on time, I buzzed through the DTrace demo in about 15 minutes touching on the syscall provider, aggregations, the ustack() action, user-land tracing with the pid provider and kernel tracing with the fbt provider. Whew. Then came the questions -- some of the audience members had used DTrace, others had heard of it; almost everyone had a question. When I demo DTrace, there's always this great moment of epiphany that people go through. I can see it on their faces. After the initial demo they look like people who just got off a roller coaster -- windblown and trying to understand what just happened, but there's always this moment, this ah-ha moment, when something -- the answer to a question, an additional example, an anecdote -- sparks them into understanding. It's great to see someone suddenly sit up in her chair and start nodding vigorously at every new sight I point out in the DTrace guided tour.

My personal favorite piece of input about OpenSolaris was someone's claim that six months after Solaris goes open source there will be a port to PowerBook hardware. If that's true then everyone in the Solaris kernel group is going to have PowerBooks in 6 months plus a day.

Before the BOF, I was worried that we might not find our community for open source Solaris. Not only was there a good crowd at the BOF, but they were interested and impressed. That's our community, and those are the people who are going to be contributing to OpenSolaris. And I can't wait for it to happen.

Wednesday Jul 28, 2004

team ZFS enters the fray

Two members of the ZFS team have joined the blogging fray. Check out Matt Ahrens's and Val Henson's weblogs.

For the uninitiated, ZFS is the brand new file system that's going to be in Solaris 10. ZFS is incredibly fast, reliable, and easy to manage. I recently moved my home directory from our UFS file server to an experimental ZFS file server. Opening my (extensive) mail spool went from 20 seconds to 3; doing an ls(1) sped up by more than a factor of two in my home directory; and the repository of crash dumps I keep went from 40G to 4G. This is really cool technology both under the hood and from the point of view of users and administrators -- stay tuned to their weblogs for all the details.

Tuesday Jul 27, 2004

off to the Open Source Convention...

This afternoon, I'm leaving for OSCON (easily confused with the bi-mon-sci-fi-con). Here in Solaris Kernel Development we've been talking a bunch about the impending open sourcing of Solaris and what that's going to look like. I'm very excited about OpenSolaris itself, and I'm looking forward to talking to folks at OSCON to hear what they think.

The part they're going to find most surprising is that this is for real. Within a year or two, there are going to be people from outside of Sun contributing to Solaris. Period. It's going to be a little scary, but I'm excited about the possibilities of what this might mean for Solaris (my dream of Solaris on my PowerBook might even come true).

We're holding a BOF session on Thursday at 9pm. So come by if you want to talk to Andy, Eric, Bart or me about the cool stuff in Solaris 10 or about open sourcing Solaris. Stay for the free (as in beer) beer.

Thursday Jul 22, 2004

Linker alien spotting (part II)

Another linker alien has joined the b.s.c. fray. Mike Walker already has some useful stuff about shared libraries that you should check out. If you have linker questions, do what I do: ask Mike.

Wednesday Jul 21, 2004

Number 20 of 20: event ports

Bart Smaalders has written some great stuff about event ports including an extensive coding example. Event ports provide a single API for tying together disparate sources of events. We had baby steps in the past with poll(2) and select(3C), but event ports let you have the file descriptor and timer monitoring as well as dealing with asynchronous I/O and your own custom events.

Tuesday Jul 20, 2004

Number 17 of 20: java stack traces

Here's a little secret about software development: different groups usually aren't that good at working with one another. That's probably not such a shocker for most of you, but the effects can be seen everywhere, and that's why tight integration can be such a distinguishing feature for a collection of software.

About a year and a half ago, we had the DTrace prototype working on much of the system: from kernel functions, through system calls, to every user-land function and instruction. But we were focused completely on C and C++ based applications and this java thing seemed to be catching on. In a radical move, we worked with some of the java guys to take the first baby step in making DTrace and Solaris's other observability tools begin to work with java.

ustack() action for java

One of the most powerful features of DTrace is its ability to correlate low level events in the kernel -- disk I/O, scheduler events, networking, etc. -- with user-land activity. What application is generating all this I/O to this disk? DTrace makes answering that a snap. But what about when you want to dive deeper? What is that application actually doing to generate all that kernel activity? The ustack() action records the user-land stack backtrace so even in that prototype over a year ago, you could home in on the problem.

Java, however, was still a mystery. Stacks in C and C++ are fairly easy to record, but in java, some methods are interpreted and just-in-time (JIT) compilation means that other methods can move around in the java virtual machine's (JVM) address space. DTrace needed help from the JVM. Working with the java guys, we built a facility where the JVM actually contains a little bit of D (DTrace's C-like language) machinery that knows how to interpret java stacks. We enhanced the ustack() action to take an optional second argument for the number of bytes to record (we've also recently added the jstack() action; see the DTrace Solaris Express Schedule for when it will be available) so when we use the ustack() action in the kernel on a thread in the JVM, that embedded machinery takes over and fills in those bytes with the symbolic interpretation for those methods. Either Bryan or I will give a more complete (and comprehensible) description in the future, but an example should speak volumes:

# dtrace -n profile-100'/execname == "java"/{ @[ustack(50, 512)] = count() }'
    ...
              java/security/AccessController.doPrivileged
              java/net/URLClassLoader.findClass
              java/lang/ClassLoader.loadClass
              sun/misc/Launcher$AppClassLoader.loadClass
              java/lang/ClassLoader.loadClass
              java/lang/ClassLoader.loadClassInternal
              StubRoutines (1)
    ...

It seems simple, but there's a lot of machinery behind this simple view, and this is actually an incredibly powerful and unique view of the system. Maybe you've had a java application that generated a lot of I/O or had some unexpected latency -- using DTrace and its java-enabled ustack() action, you can finally track the problem down.

pstack(1) for java

While we had the java guys in the room, we couldn't pass up the opportunity to collaborate on getting stacks working in another observability tool: pstack(1). The pstack(1) utility can print out the stack traces of all the threads in a live process or a core file. We implemented it slightly differently than DTrace's ustack() action, but pstack(1) now works on java processes and java core files.

Collaboration is a great thing, and I hope you find the fruits of collaborative effort useful. These are just the first steps -- we have much more planned for integrating Solaris and DTrace with java.

Sunday Jul 18, 2004

Number 16 of 20: improved watchpoints

Eric Schrock has an overview of watchpoints as well as a discussion of the cool improvements he's made to watchpoints in Solaris 10. Watchpoints have, in the past, been a bit dodgy -- they were only vaguely compatible with C++, multi-threaded code, and x86 stacks. Now they're way more robust and much faster.

Saturday Jul 17, 2004

Number 18 of 20: pmap(1) improvements

pmap(1)

For the uninitiated, pmap(1) is a tool that lets you observe the mappings in a process. Here's some typical output:

311981: /usr/bin/sh
08046000       8K rw---    [ stack ]
08050000      80K r-x--  /sbin/sh
08074000       4K rwx--  /sbin/sh
08075000      16K rwx--    [ heap ]
C2AB0000      64K rwx--    [ anon ]
C2AD0000     752K r-x--  /lib/libc.so.1
C2B9C000      28K rwx--  /lib/libc.so.1
C2BA3000      16K rwx--  /lib/libc.so.1
C2BB1000       4K rwxs-    [ anon ]
C2BC0000     132K r-x--  /lib/ld.so.1
C2BF1000       4K rwx--  /lib/ld.so.1
C2BF2000       8K rwx--  /lib/ld.so.1
 total      1116K

You can use this to understand various addresses you might see from a debugger, or you can use other modes of pmap(1) to see the page sizes being used for various mappings, how much of the mappings have actually been faulted in, the attached ISM, DISM or System V shared memory segments, etc. In Solaris 10, pmap(1) has some cool new features -- after a little more thought, I'm not sure that this really belongs on the top 11-20 list, but this is a very cool tool and gets some pretty slick new features; anyways the web affords me the chance for some revisionist history if I feel like updating the list...
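
To make the "what does this address belong to?" use concrete, here is a small sketch that parses the default pmap(1) listing shown above and looks up the mapping containing a given address. It assumes the column layout of that listing (hex address, size in K, flags, optional name); the Mapping type and both function names are invented for illustration and are not part of any Solaris tool.

```python
from typing import List, NamedTuple, Optional


class Mapping(NamedTuple):
    start: int   # first address of the mapping
    size: int    # length in bytes
    perms: str   # flags column, e.g. "r-x--"
    name: str    # file path or bracketed label; may be empty


def parse_pmap(text: str) -> List[Mapping]:
    """Parse default pmap(1) output into Mapping records (hypothetical helper)."""
    mappings = []
    for line in text.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 3 or not fields[1].endswith("K"):
            continue  # skips the "pid: command" header and the "total" line
        try:
            start = int(fields[0], 16)
            size = int(fields[1][:-1]) * 1024
        except ValueError:
            continue
        name = fields[3].strip() if len(fields) > 3 else ""
        mappings.append(Mapping(start, size, fields[2], name))
    return mappings


def find_mapping(mappings: List[Mapping], address: int) -> Optional[Mapping]:
    """Return the mapping that contains address, or None if it is unmapped."""
    for m in mappings:
        if m.start <= address < m.start + m.size:
            return m
    return None


# The listing from the post, copied verbatim.
SAMPLE = """\
311981: /usr/bin/sh
08046000       8K rw---    [ stack ]
08050000      80K r-x--  /sbin/sh
08074000       4K rwx--  /sbin/sh
08075000      16K rwx--    [ heap ]
C2AB0000      64K rwx--    [ anon ]
C2AD0000     752K r-x--  /lib/libc.so.1
C2B9C000      28K rwx--  /lib/libc.so.1
C2BA3000      16K rwx--  /lib/libc.so.1
C2BB1000       4K rwxs-    [ anon ]
C2BC0000     132K r-x--  /lib/ld.so.1
C2BF1000       4K rwx--  /lib/ld.so.1
C2BF2000       8K rwx--  /lib/ld.so.1
 total      1116K
"""
```

For example, find_mapping(parse_pmap(SAMPLE), 0xC2AD1234) lands in the libc text mapping, and the parsed sizes add up to the 1116K total that pmap(1) reports.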

thread and signal stacks

When a process creates a new thread, that thread needs a stack. By default, that stack comes from an anonymous mapping. Before Solaris 10, those mappings just appeared as [ anon ] -- undifferentiated from other anonymous mappings; now we label them as thread stacks:

311992: ./mtpause.x86 2
08046000       8K rwx--    [ stack ]
08050000       4K r-x--  /home/ahl/src/tests/mtpause/mtpause.x86
08060000       4K rwx--  /home/ahl/src/tests/mtpause/mtpause.x86
C294D000       4K rwx-R    [ stack tid=3 ]
C2951000       4K rwxs-    [ anon ]
C2A5D000       4K rwx-R    [ stack tid=2 ]
    ...

That can be pretty useful if you're trying to figure out what some address means in a debugger; before you could tell that it was from some anonymous mapping, but what the heck was that mapping all about? Now you can tell at a glance that it's the stack for a particular thread.

Another kind of stack is the alternate signal stack. Alternate signal stacks let threads handle signals like SIGSEGV which might arise due to a stack overflow of the main stack (leaving no room on that stack for the signal handler). You can establish an alternate signal stack using the sigaltstack(2) interface. If you allocate the stack by creating an anonymous mapping using mmap(2), pmap(1) can now identify the per-thread alternate signal stacks:

    ...
     FEBFA000       8K rwx-R    [ stack tid=8 ]
     FEFFA000       8K rwx-R    [ stack tid=4 ]
     FF200000      64K rw---    [ altstack tid=8 ]
     FF220000      64K rw---    [ altstack tid=4 ]
    ...

core file content

Core files have always contained a partial snapshot of a process's memory mappings. Now that you can manually adjust the content of a core file (see my previous entry), some p-tools will give you warnings like this:

pargs: core 'core' has insufficient content

So what's in that core file? pmap(1) now lets you see that easily; mappings whose data is missing from the core file are marked with a *:

$ coreadm -P heap+stack+data+anon
$ cat
^\Quit - core dumped
$ pmap core
core 'core' of 312077:  cat
08046000       8K rw---    [ stack ]
08050000       8K r-x--* /usr/bin/cat
08062000       4K rwx--  /usr/bin/cat
08063000      40K rwx--    [ heap ]
C2AB0000      64K rwx--
C2AD0000     752K r-x--* /lib/libc.so.1
C2B9C000      28K rwx--  /lib/libc.so.1
C2BA3000      16K rwx--  /lib/libc.so.1
C2BC0000     132K r-x--* /lib/ld.so.1
C2BF1000       4K rwx--  /lib/ld.so.1
C2BF2000       8K rwx--  /lib/ld.so.1
 total      1064K

If you're looking at a core file from an earlier release or from a customer in the field, you can quickly tell if you're going to be able to get the data you need out of the core file or if the core file can only be interpreted on the original machine or whatever.
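
If you want to script that check, the asterisk marker is easy to pick out of the listing. The sketch below is only an illustration built on the output format shown above (a "*" appended to the flags column); missing_from_core is a made-up name, not an existing utility.

```python
from typing import List


def missing_from_core(pmap_output: str) -> List[str]:
    """List the mappings whose contents were not saved in the core file.

    Hypothetical helper: it assumes lines of the form
    ADDRESS SIZE FLAGS [NAME], with "*" appended to FLAGS when data is missing.
    """
    missing = []
    for line in pmap_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 3 or not fields[1].endswith("K"):
            continue  # header or "total" line
        if fields[2].endswith("*"):
            missing.append(fields[3].strip() if len(fields) > 3 else "")
    return missing


# The core file listing from the post.
CORE_SAMPLE = """\
core 'core' of 312077:  cat
08046000       8K rw---    [ stack ]
08050000       8K r-x--* /usr/bin/cat
08062000       4K rwx--  /usr/bin/cat
08063000      40K rwx--    [ heap ]
C2AB0000      64K rwx--
C2AD0000     752K r-x--* /lib/libc.so.1
C2B9C000      28K rwx--  /lib/libc.so.1
C2BA3000      16K rwx--  /lib/libc.so.1
C2BC0000     132K r-x--* /lib/ld.so.1
C2BF1000       4K rwx--  /lib/ld.so.1
C2BF2000       8K rwx--  /lib/ld.so.1
 total      1064K
"""
```

Run against the listing above, it reports the three read-only text mappings (cat, libc and ld.so.1), which is exactly what you would expect after asking coreadm for heap+stack+data+anon only.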

Thursday Jul 15, 2004

Number 13 of 20: Core file improvements

core files

Core files are snapshots of a process's state. They contain some of the memory segments (e.g. the stack and heap) as well as some of the in-kernel state associated with the process (e.g. the signal masks and register values). When a process gets certain signals, the kernel, by default, kills the process and produces a core file. You can also create core files from running processes -- without altering the process -- using Solaris's gcore(1) utility.

So when your application crashed in the field, you could just take the core file and debug it, right? Well, not exactly. Core files contained a partial snapshot of the process's memory mappings -- in particular they omitted the read-only segments which contained the program text (instructions). As a result you would have to exactly recreate the environment from the machine where the core file was produced -- identical versions of the libraries, application binary and loadable modules. Consequently, core files were mostly useful for developers in development (and even then, an old core file could be useless after a recompilation). And this isn't just Solaris -- every OS I've ever worked with has omitted program text from core files, making those core files of marginal utility once they've left the machine that produced them.

coreadm(1M)

In Solaris 7 we introduced coreadm(1M) to let users and system administrators control the location and name of core files. Previously, core files had always been named "core" and resided in the current working directory of the process that dumped the core. With coreadm(1M) you can name core files whatever you want, including meta characters that expand when the core is created; for example, "core.%f.%n" would expand to "core.staroffice.dels" if staroffice were to dump core on my desktop (named dels). System administrators can also set up a global repository for all cores produced on the system to keep an eye on programs unexpectedly dumping core (naturally in Solaris 10, zone administrators can set up per-zone core file repositories).
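
The expansion itself is plain token substitution. Here is a rough sketch of the idea, not the kernel's implementation: only %f and %n come from the example above; %p (process ID) and %% (a literal percent sign) are included from my recollection of the coreadm(1M) man page, so check it before relying on them. Unknown tokens are deliberately left untouched, and the function name is made up.

```python
import re


def expand_core_pattern(pattern: str, fname: str, nodename: str, pid: int) -> str:
    """Expand a coreadm-style core file name pattern (illustrative sketch only)."""
    values = {
        "f": fname,      # executable file name
        "n": nodename,   # system node name
        "p": str(pid),   # process ID (assumed token; see the man page)
        "%": "%",        # literal percent sign (assumed token)
    }

    def substitute(match: "re.Match[str]") -> str:
        # Leave tokens this sketch does not know about exactly as written.
        return values.get(match.group(1), match.group(0))

    return re.sub(r"%(.)", substitute, pattern)
```

So expand_core_pattern("core.%f.%n", "staroffice", "dels", 1234) produces "core.staroffice.dels", the same name as in the example above (the PID 1234 is just a placeholder).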

In Solaris 10, coreadm(1M) becomes an even more powerful tool. Now you can specify which parts of the process's image go into the core file. Program text is there by default, and you can also choose to omit or include the stack, heap, anonymous data, mapped files, System V shared memory segments, ISM, DISM, etc. Let's say you've got some multi-processed database that contains a big DISM segment; rather than having each process include the shared segment in its core file, you can set up just one of the processes (or none of them) to include the segment in the core file.

debugging core files from the field

Now that program text is included by default, core files from failures in the field can be useful without the incredibly arduous task of exactly replicating the original environment. The program text also includes a partial symbol table -- the dynsym -- so you can get accurate stack back traces, and correctly disassemble functions in your favorite post-mortem debugger. If the dynsym doesn't cut it, you can use coreadm(1M) to configure your process to include the full symbol table in its core dumps as well -- so don't strip those binaries!

Also new to Solaris 10, we've started building many libraries with embedded type information in a compressed format. This is more of a teaser, since we're not quite ready to ship the tools to generate that type information, but that type information is included in core files by default. So now not only can we in Solaris actually make headway on core files we get from customers, but we can make progress much more quickly.

If you've installed Solaris Express, go check out the man page for coreadm(1M) and figure out how to get the right content in your core files. Once you get your first core file from a Solaris 10 machine in the field I hope you'll appreciate how much easier it was to debug.

Tuesday Jul 13, 2004

Number 12 of 20: file names in pfiles(1)

Eric Schrock has tagged in to talk about file names in pfiles(1). This is something we've wanted for forever; here's a teaser:

bash-2.05# pfiles 100354
100354: /usr/lib/nfs/mountd
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0666 dev:267,0 ino:6815752 uid:0 gid:3 rdev:13,2
      O_RDONLY
      /devices/pseudo/mm@0:null
   1: S_IFCHR mode:0666 dev:267,0 ino:6815752 uid:0 gid:3 rdev:13,2
      O_WRONLY
      /devices/pseudo/mm@0:null
  ...
  11: S_IFCHR mode:0000 dev:267,0 ino:33950 uid:0 gid:0 rdev:105,45
      O_RDWR
      /devices/pseudo/tl@0:ticots
  12: S_IFREG mode:0644 dev:32,0 ino:583850 uid:0 gid:1 size:364
      O_RDWR|O_CREAT|O_TRUNC
      /etc/rmtab

Number 11 of 20: libumem

go to the Solaris 10 top 11-20 list for more

libumem

In Solaris 2.4 we replaced the old buddy allocator [1] with the slab allocator [2] invented by Jeff Bonwick. The slab allocator is covered in pretty much every operating systems textbook -- and that's because most operating systems are now using it. In Solaris 10 [3], Jonathan Adams brought the slab allocator to user-land in the form of libumem [4].

Getting started with libumem is easy; just do the linker trick of setting LD_PRELOAD to "libumem.so" and any program you execute will use libumem's malloc(3C) and free(3C) (or new and delete if you're into that sort of thing). Alternatively, if you like what you see, you can start linking your programs against libumem by passing -lumem to your compiler or linker. But I'm getting ahead of myself; why is libumem so great?
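If you'd rather not export LD_PRELOAD into your whole shell session, a small wrapper can scope the preload to a single command. This is just a sketch: `with_umem` is a hypothetical name, and it relies only on the standard env(1) utility to set the variable for the one program it launches.

```shell
#!/bin/sh
# with_umem CMD [ARGS...]: run one command with libumem preloaded,
# leaving LD_PRELOAD unset in the calling shell.
# "with_umem" is a hypothetical helper name, not part of Solaris.
with_umem() {
    env LD_PRELOAD=libumem.so "$@"
}

# Usage (the program name is a placeholder):
#   with_umem ./my_program arg1 arg2
```

On a system without libumem the runtime linker will typically complain about the missing library and run the program with its usual allocator.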

Scalability

The slab allocator is designed for systems with many threads and many CPUs. Memory allocation with naive allocators can be a serious bottleneck (in fact we recently used DTrace to find such a bottleneck; using libumem got us a 50% improvement). There are other highly scalable allocators out there, but libumem is about the same or better in terms of performance, has compelling debugging features, and it's free and fully supported by Sun.

Debugging

The scalability and performance are impressive, but not unique to libumem; where libumem really sets itself apart is in debugging. If you've ever spent more than 20 seconds debugging heap corruption or chasing down a memory leak, you need libumem. Once you've used libumem it's hard to imagine debugging this sort of problem without it.

You can use libumem to find double-frees, use-after-free, and many other problems, but my favorite is memory leaks. Memory leaks can really be a pain, especially in large systems; libumem makes leaks easy to detect and easy to diagnose. Here's a simple example:

$ LD_PRELOAD=libumem.so
$ export LD_PRELOAD
$ UMEM_DEBUG=default
$ export UMEM_DEBUG
$ /usr/bin/mdb ./my_leaky_program
> ::sysbp _exit
> ::run
mdb: stop on entry to _exit
mdb: target stopped at:
libc.so.1`exit+0x14:    ta        8
mdb: You've got symbols!
mdb: You've got symbols!
Loading modules: [ ld.so.1 libumem.so.1 libc.so.1 ]
> ::findleaks
CACHE     LEAKED   BUFCTL CALLER
0002c508       1 00040000 main+4
----------------------------------------------------------------------
   Total       1 buffer, 24 bytes
> 00040000::bufctl_audit
    ADDR  BUFADDR    TIMESTAMP THR  LASTLOG CONTENTS    CACHE     SLAB     NEXT DEPTH
00040000 00039fc0 3e34b337e08ef   1 00000000 00000000 0002c508 0003bfb0 00000000     5
         libumem.so.1`umem_cache_alloc+0x13c
         libumem.so.1`umem_alloc+0x60
         libumem.so.1`malloc+0x28
         main+4
         _start+0x108

Obviously, this is a toy leak, but you get the idea, and it's really that simple to find memory leaks. Other utilities exist for debugging memory leaks, but they dramatically impact performance (to the point where it's difficult to actually run the thing you're trying to debug), and can omit or incorrectly identify leaks. Do you have a memory leak today? Go download Solaris Express, slap your app on it and run it under libumem. I'm sure it will be well worth the time spent.

You can use other mdb dcmds like ::umem_verify to look for corruption. The kernel versions of these dcmds are described in the Solaris Modular Debugger Guide today; we'll be updating the documentation for Solaris 10 to describe all the libumem debugging commands.

Programmatic Interface

In addition to offering the well-known malloc() and free(), libumem also has a programmatic interface for creating your own object caches backed by the heap or memory mapped files or whatever. This offers additional flexibility and precision and allows you to further optimize your application around libumem. Check out the man pages for umem_alloc() and umem_cache_alloc() for all the details.

Summary

Libumem is a hugely important feature in Solaris 10 that just slipped off the top 10 list, but I doubt there's a Solaris user (or soon-to-be Solaris user) that won't fall in love with it. I've only just touched on what you can do with libumem, but Jonathan Adams (libumem's author) will soon be joining the ranks of blogs.sun.com to tell you more. Libumem is fast, it makes debugging a snap, it's easy to use, and you can get down and dirty with its expanded API -- what else could anyone ask for in an allocator?

1. Jeff's USENIX paper is definitely worth a read
2. For more about Solaris history and the internals of the slab allocator, check out Solaris Internals
3. Actually, Jonathan slipped libumem into Solaris 9 Update 3 so you might have had libumem all this time and not known...
4. Jeff and Jonathan wrote a USENIX paper about some additions to the allocator and its extension to user-land in the form of libumem
About

Adam Leventhal, Fishworks engineer