Number 19 of 20: per-thread p-tools

go to the Solaris 10 top 11-20 list for more

p-tools

Since Solaris 7 we've included a bunch of process observability tools -- the so-called "p-tools". Some of them inspect aspects of the process as a whole. For example, the pmap(1) command shows you information about a process's mappings: their location and ancillary information (the associated file, shmid, etc.). pldd(1) is another example; it shows which shared objects a process has opened.

Other p-tools apply to the threads in a process. The pstack(1) utility shows the call stacks for each thread in a process. New in Solaris 10, Eric and Andrei have modified the p-tools that apply to threads so that you can specify the threads you're interested in rather than having to sift through all of them.

pstack(1)

Developers and administrators often use pstack(1) to see what a process is doing and whether it's making progress. You'll often turn to pstack(1) after prstat(1) or top(1) shows a process consuming a bunch of CPU time -- what's that guy up to? Complex processes can have many, many threads; fortunately prstat(1)'s -L flag will split out each thread in a process as its own row so you can quickly see that thread 5, say, is the one that's hammering the processor. Now rather than sifting through all 100 threads to find thread 5, you can just do this:

$ pstack 107/5
100225: /usr/sbin/nscd
-----------------  lwp# 5 / thread# 5  --------------------
 c2a0314c nanosleep (c25edfb0, c25edfb8)
 08056a96 gethost_revalidate (0) + 4b
 c2a02d10 _thr_setup (c2949000) + 50
 c2a02ed0 _lwp_start (c2949000, 0, 0, c25edff8, c2a02ed0, c2949000)
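
If you find yourself doing the prstat-then-pstack dance a lot, it can be scripted. Here's a minimal sketch, not anything that ships with Solaris: busiest_lwp is a hypothetical helper, and it assumes that prstat -L labels its last column PROCESS/LWPID (with entries like nscd/5) and lists the busiest thread first.

```shell
# Hypothetical helper: read `prstat -L` output on stdin and print the LWP id
# from the first data row. Assumes the last column looks like "nscd/5" and
# that rows are sorted by CPU usage, busiest first.
busiest_lwp() {
    awk '
        /PROCESS\/LWPID/ { seen_header = 1; next }   # skip the column header
        seen_header && NF > 0 && $1 != "Total:" {
            n = split($NF, parts, "/")               # split "nscd/5" on "/"
            print parts[n]                           # keep the LWP id
            exit
        }
    '
}
```

With that defined, something like pstack 107/$(prstat -L -p 107 1 1 | busiest_lwp) should take you straight to the hot thread's stack; treat the exact prstat arguments as an assumption and check the man page.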

Alternatively, you can specify a range of threads (5-7 or 11-) or a combination of ranges (5-7,11-), giving us something like this:

$ pstack 107/5-7,11-
100225: /usr/sbin/nscd
-----------------  lwp# 5 / thread# 5  --------------------
 c2a0314c nanosleep (c25edfb0, c25edfb8)
 08056a96 gethost_revalidate (0) + 4b
 c2a02d10 _thr_setup (c2949000) + 50
 c2a02ed0 _lwp_start (c2949000, 0, 0, c25edff8, c2a02ed0, c2949000)
-----------------  lwp# 6 / thread# 6  --------------------
 c2a0314c nanosleep (c24edfb0, c24edfb8)
 080577d6 getnode_revalidate (0) + 4b
 c2a02d10 _thr_setup (c2949400) + 50
 c2a02ed0 _lwp_start (c2949400, 0, 0, c24edff8, c2a02ed0, c2949400)
-----------------  lwp# 7 / thread# 7  --------------------
 c2a0314c nanosleep (c23edfb0, c23edfb8)
 08055f56 getgr_revalidate (0) + 4b
 c2a02d10 _thr_setup (c2949800) + 50
 c2a02ed0 _lwp_start (c2949800, 0, 0, c23edff8, c2a02ed0, c2949800)
-----------------  lwp# 11 / thread# 11  --------------------
 c2a0314c nanosleep (c1fcdf60, c1fcdf68)
 0805887d reap_hash (80ca918, 8081140, 807f2f8, 259) + ed
 0805292a nsc_reaper (807f92c, 80ca918, 8081140, 807f2f8, c1fcdfec, c2a02d10) + 6d
 08055ded getpw_uid_reaper (0) + 1d
 c2a02d10 _thr_setup (c20d0800) + 50
 c2a02ed0 _lwp_start (c20d0800, 0, 0, c1fcdff8, c2a02ed0, c20d0800)
...
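
To make the range syntax concrete, here's a rough sketch of how a specification like 5-7,11- expands into individual thread ids. The expand_lwp_spec function is hypothetical and only illustrates the three forms shown above (a single id, a closed range, and an open-ended range); it is not how the p-tools themselves parse the argument, and the second parameter (the highest thread id in the process) is something you'd have to supply yourself.

```shell
# Hypothetical helper, not part of Solaris: list the LWP ids in 1..max that a
# spec such as "5-7,11-" selects. Handles "N", "N-M", and "N-" only.
expand_lwp_spec() {
    spec=$1
    max=$2
    old_ifs=$IFS
    IFS=,
    set -- $spec                  # split the spec on commas
    IFS=$old_ifs
    for part in "$@"; do
        case $part in
            *-*) lo=${part%%-*}   # text before the dash
                 hi=${part#*-}    # text after the dash (empty for "N-")
                 [ -n "$hi" ] || hi=$max ;;
            *)   lo=$part
                 hi=$part ;;
        esac
        [ "$hi" -le "$max" ] || hi=$max   # never run past the last thread
        i=$lo
        while [ "$i" -le "$hi" ]; do
            echo "$i"
            i=$((i + 1))
        done
    done
}

expand_lwp_spec 5-7,11- 13
```

For a process whose highest thread id is 13, that prints 5, 6, 7, 11, 12, and 13, one per line.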

The thread specification syntax also works for core files if you're just trying to drill down on, say, the thread that caused the fatal problem:

$ pstack core/2
core 'core/2' of 100225:        /usr/sbin/nscd
-----------------  lwp# 2 / thread# 2  --------------------
 c2a04888 door     (c28fbdc0, 74, 0, 0, c28fde00, 4)
 080540bd ???????? (deadbeee, c28fddec, 11, 0, 0, 8053d33)
 c2a0491c _door_return () + bc

truss(1)

The truss(1) utility is the mother of all p-tools. It lets you trace a process's system calls, faults, and signals as well as user-land function calls. In addition to consuming pretty much every lower- and upper-case command line option, truss(1) now also supports the thread specification syntax. Now you can follow just the threads that are doing something interesting:

$ truss -p 107/5
openat(-3041965, ".", O_RDONLY|O_NDELAY|O_LARGEFILE) = 3
fcntl(3, F_SETFD, 0x00000001)                   = 0
fstat64(3, 0x08047800)                          = 0
getdents64(3, 0xC2ABE000, 8192)                 = 8184
brk(0x080721C8)                                 = 0
...

pbind(1)

The pbind(1) utility isn't an observability tool; rather, this p-tool binds a process to a particular CPU so that it will only run on that CPU (except in some unusual circumstances; see the man page for details). For multi-threaded processes, the process is clearly not the right granularity for this kind of activity -- you want to be able to bind this thread to that CPU, and those threads to some other CPU. In Solaris 10, that's a snap:

$ pbind -b 1 107/2
lwp id 107/2: was not bound, now 1
$ pbind -b 0 107/2-5
lwp id 107/2: was 1, now 0
lwp id 107/3: was not bound, now 0
lwp id 107/4: was not bound, now 0
lwp id 107/5: was not bound, now 0
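
If you want to spread a run of threads across several CPUs rather than pile them onto one, a small loop can generate the pbind commands for you. This is just a sketch: plan_binds is a hypothetical helper, it only prints the commands rather than running them, and it assumes the CPU ids you want are 0 through ncpus-1, which may not match the processors actually online on your machine.

```shell
# Hypothetical helper: print (not run) one pbind command per LWP, dealing
# LWPs first..last of a process across CPUs 0..ncpus-1 in round-robin order.
plan_binds() {
    pid=$1
    first=$2
    last=$3
    ncpus=$4
    lwp=$first
    while [ "$lwp" -le "$last" ]; do
        cpu=$(( (lwp - first) % ncpus ))   # cycle through the CPU ids
        echo "pbind -b $cpu $pid/$lwp"
        lwp=$((lwp + 1))
    done
}

plan_binds 107 2 5 2
```

That prints four commands, alternating threads 2 through 5 of process 107 between CPU 0 and CPU 1. Look them over, and pipe the output to sh once you're happy with the plan.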

These are perfect examples of Solaris responding to requests from users: there was no easy way to solve these problems, and that was causing our users pain, so we fixed it. After the BOF at OSCON, a Solaris user had a laundry list of problems and requests and was skeptical about our interest in fixing them; I convinced him that we do care, but that we need to hear about them. So let's hear about your gripes and wish lists for Solaris. Many of the usability features (the p-tools for example) came out of our own use of Solaris in kernel development -- once OpenSolaris lets everyone be a Solaris kernel developer, I'm sure we'll be stumbling onto many more quality of life tools like pstack(1), truss(1), and pbind(1).

Comments:

This is all fantastic! Just yesterday ( or the day before, they all blur for me ) I ran into something that I had not seen in a very very long time; I saw a Lotus Domino server freeze all its server threads on Solaris 9. I was stunned. I have been working with Lotus Notes and Lotus Domino since my days at Lotus over a decade ago now and I have seen it all quite frankly. Or so I think. In the past we could generate BRB ( big red box ) errors that would point to some null handle problem or a nasty exception not caught within the Windows world. The Solaris client was pulled off the shelf a few years ago, let's not talk about why. The Domino server product for Solaris is alive and well, on Sparc that is. If you are living in the Domino version 5 world then most people know that it is rock solid. Super stable. At least until the most recent patch from IBM called version 5.0.13a that is. Oh I know that we are in the version 6.5 world and 7 beta is out but a lot of organizations ( banks and insurance companies ) are living just fine with V5 because it works. I made it crash. I have a software agent that is about 200 pages of code and it is a web agent. Part of my inventory and order processing system that is pure web based for any standard browser. I watched as I entered a line that would return the base class name of an object and then when my browser requested the service, boom. The Domino server stopped with a message about "freezing all server threads". This cannot be true? Can it? This is just too stable and I did not ask for anything complex. I weaved my way through all of the code and created a test code component with only three lines and a few class objects defined. It crashed the server over and over. I trussed the PID of the httpd server. I gathered the output but could not determine the real problem yet. What I think that I shall do is to move the server to the global zone on my Solaris 10 build 60 server and then use these new tools to trace the problem.
I may just have something to report back to IBM about their Domino server patch.

Posted by Dennis Clarke on August 06, 2004 at 01:47 PM PDT #

The p-tools are some of the best tools in Solaris and definitely make it stand out from the other Unices. I started using them in S9 when I was working for Sun and I remember pfiles and pargs helping me out a lot with Sun Cluster. I've also used the tools extensively in my last job, which was a large Oracle/BEA shop. I'm really glad that the ptools can be used on threads now, it definitely helps out with java apps and wacky things like apache or qmail. I always find it interesting to see what other sysadmins think after I show them the ptools. It's always "Wow.. that's the coolest thing I've seen" or "That's better than lsof..". So these tools add a lot of value to Solaris.

Posted by Octave Orgeron on August 09, 2004 at 07:03 AM PDT #

Dennis, great to hear that this is going to be helpful with your problem. If you think this stuff is exciting, try throwing DTrace at the problem and see how much more quickly you figure it out.

Posted by Adam Leventhal on August 11, 2004 at 04:37 AM PDT #

Octave, thanks so much for the praise. The p-tools grew largely out of our own needs and our own use in Solaris kernel development -- developers know what developers want.

If people's eyes pop out when you show them p-tools, wait until you show them this stuff (in particular tracing a thread through user-land into the kernel).

Posted by Adam Leventhal on August 11, 2004 at 04:41 AM PDT #

About

Adam Leventhal, Fishworks engineer
