Monday Apr 04, 2005

The simple things in life are still cool

I've been messing around with DTrace for some time now and, as with most things in life, a certain level of intimacy means that I've began take the old boy a bit for granted. What was incredibly cool at first becomes normal after a while even though it is still mind boggling. A simple example that came up today reminded me of how the simple things are still incredibly cool.

A colleague of mine was wanting to see the http interaction in a web infrastructure he's developing. On a Solaris 10 system he has multiple zones with portal server, access manager and various other JES pieces. He had never touched DTrace before but being a bright chap he went and got the manual and knocked some stuff up to just look at the write() calls being made. He wanted to just look for http GETs at first but was having a slight problem getting his predicates right. To look for all http GETs being issued by firefox (for example) he just needed this:

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option strsize=2048

/execname == "firefox-bin" && stringof(copyin(arg1, 3)) == "GET"/
	printf("%s\\n\\n", copyinstr(arg1));

All we are saying here is that when a write system call is made, print out the write buffer if firefox made the write and if the first 3 characters being written are "GET". Really simple but a great demonstration of the ability to prune data at source provided by the predicate mechanism. Firefox makes thousands of write calls but I can specify exactly which ones I'm interested in with no post processing at all. Still cool.

For those that like to see output, the above script produces the following sample bumph when loading up

User-Agent: Mozilla/5.0 (X11; U; SunOS i86pc; en-US; rv:1.7) Gecko/20041221
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,\*/\*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,\*;q=0.7
Keep-Alive: 300
Proxy-Connection: keep-alive
Cookie: ...

User-Agent: Mozilla/5.0 (X11; U; SunOS i86pc; en-US; rv:1.7) Gecko/20041221
Accept: image/png,\*/\*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,\*;q=0.7
Keep-Alive: 300
Proxy-Connection: keep-alive
Cookie: ...

Monday Mar 21, 2005

uname lies ...

Well, it's been a while to say the least but I think it's about time to put fingers to keys again and see what comes forth ...

I've mentioned before some of the great things that can be achieved with destructive actions. Indeed many good examples are popping up - check out this cool example from Chris Gerhards blog. Today we'll just take a quick look at another one (I actually have a bunch which I'll try and write-up over the next few weeks).

In the UK we have run a series of events called Make-It-Fly which I've been involved with. Last week I did a session on DTrace which I hope everyone who was present enjoyed (I did anyway!). At the events I usually do quite a bit of hands on demo with most of it being hand cranked. However, one of the scripts I use that always gets a laugh is the following one which I don't hand crank as I can never remember the offsets ... If you've ever wanted to get uname(1) to return something different to normal then this is what you need:

#!/usr/sbin/dtrace -s

#pragma D option destructive

        self->addr = arg0;

        copyoutstr("SunOS", self->addr, 257);
        copyoutstr("PowerPC", self->addr+257, 257);
        copyoutstr("5.5.1", self->addr+(257\*2), 257);
        copyoutstr("gate:1996-12-01", self->addr+(257\*3), 257);
        copyoutstr("PPC", self->addr+(257\*4), 257);
Before we have:
# uname -a
SunOS homer 5.10 SunOS_Development sun4u sparc SUNW,Ultra-5_10
and like magic we morph into something else when the above script is ran:
# uname -a
SunOS PowerPC 5.5.1 gate:1996-12-01 PPC sparc SUNW,Ultra-5_10
Here at Sun we often test pieces of software on versions of Solaris that return something different to that which the software is expecting. Previously I would LD_PRELOAD a library in with my own uname hack. Now I can not only do this without bothering the application but I can present different uname information to different applications/users/whatever as I can predicate accordingly!

Note, that the above script isn't quite complete as it returns the incorrect ISA information. This is me being idle and a bit of twiddling with sysinfo() is all that's needed. Maybe another day.

Friday Sep 10, 2004

Who's at the Door?

Doors are an IPC facility that exist within Solaris which allow one process to execute functions within another process. A process that exports a function using the door_create() interface is termed a door server and a process that invokes this function with the door_call() interface is known as a client (suprisingly enough).

I was on the train the other day just tinkering with a few DTrace scripts that I've been developing (see Alans Blog for other DTrace train tinkering) when I happened to notice a reasonable amount of door_call() calls being made by my X server (the Xsun process). Nowadays it's really easy to see what process the door_call() is being made to as the pfiles(1) utility even tells us what process is the door server for this door. Taking a quick look at my X server we see:
# pgrep -fl Xsun
  497 /usr/openwin/bin/Xsun :0 -nobanner -defdepth 24 -auth /var/dt/A:0-PDaa9a
# pfiles 497 | grep -i door
   4: S_IFDOOR mode:0777 dev:214,0 ino:0 uid:0 gid:0 size:0
      O_RDWR  door to svc.configd[9]
  26: S_IFDOOR mode:0777 dev:214,0 ino:0 uid:0 gid:0 size:0
      O_RDWR  door to stfontserverd[695]
How spoilt we are nowadays - it's all so easy to see what is happening! However, I really want to see a systemic view a what door calls are being made so inbetween Surbiton and Waterloo I put the following simple script together:
#!/usr/sbin/dtrace -s

#pragma D option quiet

        self->in = 1;

        self->called = stringof(args[1]->door_target->p_user.u_comm);

        @[execname, self->called] = count();
        self->in = 0;
        self->called = 0;

        printa("%s called door in %s %@d times\\n", @);
The above script just takes advantage of the fact that we do a door_lookup() from within a door_call(). We can then extract the name of the door server from the proc structure that is stored in the struct door_node that door_lookup() returns (see sys/door.h).
# ./door.d
kcfd called door in nscd 1 times
zoneadm called door in zoneadmd 1 times
devfsadm called door in devfsadm 1 times
zoneadmd called door in devfsadm 1 times
snmpd called door in nscd 2 times
sshd called door in kcfd 2 times
sshd called door in nscd 2 times
syslogd called door in nscd 2 times
syseventd called door in devfsadm 2 times
chown called door in nscd 2 times
devfsadm called door in syseventd 2 times
ttymon called door in nscd 3 times
dtlogin called door in nscd 3 times
id called door in nscd 4 times
deallocate called door in nscd 4 times
snmpXdmid called door in nscd 4 times
lpstat called door in syslogd 4 times
cron called door in nscd 5 times
svc.startd called door in nscd 6 times
inetd called door in nscd 16 times
svcprop called door in svc.configd 26 times
svccfg called door in svc.configd 54 times
rpcbind called door in svc.configd 117 times
lsvcrun called door in svc.configd 550 times
svc.configd called door in nscd 2034 times
mfstscan called door in svc.configd 2708 times
Xsun called door in stfontserverd 5046 times
svcs called door in svc.configd 7918 times
inetd called door in svc.configd 28908 times
svc.startd called door in svc.configd 37110 times
You'll have to excuse the poor formatting as it's always a weakness of mine but isn't it cool to see how easy we can see the various interactions? That's quite a few doors being opened in the smf(5) world as I was doing a bit of svcs work at the time (see Stephen Hahn's Blog for smf(5) goodness).

Thursday Aug 26, 2004

DTrace and Visualisation

As I talk to people about DTrace one of the most frequent requests that arises is for scripts and tools. On the scripts front people want cookbook type stuff as it is always easier to learn a new technology such as DTrace by "re-using" and building upon pre-prepared material. This is an excellent request that, hopefully, Sun and the ever growing band of DTrace addicts will fulfill in the near future.

On the tools front a request that is made is for a DTrace GUI. In the early days of DTrace (about a year or so ago...) my usual base reaction to such a request was to reply that, whilst I never say never, I doubt that this is high on the priority list but I'll feed it back ... However, age and thought has softened me tremendously and I now feel that the union of DTrace and Visualisation techniques is a source of terrific innovation and will be of immense benefit to all.

With DTrace we have the infrastructure to ask almost arbitrary questions about the behaviour of our software stacks from the top of an application through to the bottom of a bus nexus driver. However, some of the data that these complex subsystems generate can be rich in nature. This is not a size issue particularly as DTrace has great primitives for data reduction but more referring to the relationships that exist within the data. Sometimes a bit of visual stimulation is required to get our synapses firing and to force our grey cells into seeing patterns and relationships in the data. I'm not knocking DTrace here; on the contrary I'm saying that DTrace is the nearly infinitely rich architecture with which we can feed the visualisation engines. I'm convinced that in the near future we will see some real innovation in the tools around systemic and application observability as the data is so easily at hand.

For example, take something that a colleague ( Simon Ritter ) and and I have been working on in our copious spare time, the area of call graphs. With DTrace we can trivially discover what functions a given function is calling and in it's simplest form the script looks like:
#usr/sbin/dtrace -s

#pragma D option flowindent

    self->in = 1;

/self->in && probefunc != "$1"/

    self->in = 0;
So, a trivial example would be to look at what activity the clock routine generates:
# ./fbt.d clock
dtrace: script './fbt.d' matched 31751 probes
CPU FUNCTION                                 
  0  -> clock                                 
  0    -> set_anoninfo                        
  0    <- set_anoninfo                        
  0    -> set_freemem                         
  0    <- set_freemem                         
  0    -> wakeup_pcgs                         
  0    <- wakeup_pcgs                         
  0    -> clock_tick                          
  0      -> sys_nullsys                       
  0      <- sys_nullsys                       
  0      -> mutex_owned                       
  0      <- mutex_owned                       
  0      -> rctl_test                         
  0      <- rctl_test                         
  0      -> rctl_test_entity                  
  0        -> mutex_owned                     
  0        <- mutex_owned                     
  0        -> rctl_set_find                   
  0          -> mutex_owned                   
  0          <- mutex_owned                   
  0        <- rctl_set_find                   
  0      <- rctl_test_entity                  
  0      -> rctl_test                         
  0      <- rctl_test                         
  0      -> rctl_test_entity                  
  0        -> mutex_owned                     
  0        <- mutex_owned                     
  0        -> rctl_set_find                   
  0          -> mutex_owned                   
  0          <- mutex_owned                   
  0        <- rctl_set_find                   
  0      <- rctl_test_entity                  
  0      -> rm_asrss                          
  0      <- rm_asrss                          
  0    <- clock_tick                          
  0    -> callout_schedule                    
  0      -> callout_schedule_1                
  0      <- callout_schedule_1                
  0      -> softcall                          
  0        -> mutex_vector_enter              
  0        <- mutex_vector_enter              
  0        -> mutex_vector_exit               
  0        <- mutex_vector_exit               
  0      <- softcall                          
  0      -> siron                             
  0      <- siron                             
  0      -> setsoftint                        
  0      <- setsoftint                        
  0      -> callout_schedule_1                
  0      <- callout_schedule_1                
  0    <- callout_schedule                    
  0   | clock:return                          
  0  <- clock                                 

However, as great as it is to see the above, the shape of the call sequence isn't very apparent and that is a simple routine. Using something such as a ufs_write() you soon see how this gets very complex. It is much better to draw a directed graph of the calls and you can immediately see what is being called by what. An example of a single ufs_write() call can be seen in this image . The image was generated using DTrace to feed a public domain piece of software called JzGraph which is a great Java implementation of the wonderful Graphviz algorithms.

However, the above is very static in its nature and we wanted to generate something much more dynamic. What we have done is to create a Java application which is a dtrace consumer (i.e it links a JNI library which uses libdtrace(3LIB)). Therefore we can interact with DTrace at will and be much more dynamic in the way we present and manipulate the data. The current version of our application allows you to dynamically generate graphs for any kernel function, step through the call sequence, zoom in and out of graphs, collapse and expand nodes, display timing data, display source code for the nodes in a graph (amongst other things).

We have many other ideas of what to do and hopefully we'll get around to them soon. Sun is developing some Java bindings for DTrace and they're looking really good at the moment (see here for Bryans comment on this).

I for one am looking forward to a future in which complex systemic interactions will be visible to all as the data is now readily available. It has to be seen to be believed.

Monday Aug 23, 2004

It's all about size

Well, I must say that last week was possibly the worst week back from a holiday that I've ever had. Tragedy, chaos and trauma; it had the lot. Still, as my dear, dear old granny says - "Stop whinging, shut your yap and get on with it". So without further yap let's get it on ...

Someone posted a question the other day on an internal alias:- What's the best/cheapest way to monitor the size of a file? Well, there are several possible ways to do this but our boy DTrace gives us the best way in my view. The following D script uses the ufs function wrip() to inform us when a given files size changes.
#!/usr/sbin/dtrace -s

#pragma D option quiet

        size = 0;

/\* wrip does the real work for a ufs write. The first argument is a struct inode\* so
  we can extract everything we need (vnode, size) from this \*/
/(self->file = stringof(args[0]->i_vnode->v_path)) == $$1 /
        self->inode = args[0];

/\* If the file in question has changed size tell us \*/
/self->inode && size != self->inode->i_ic.ic_lsize/ 
        printf("File %s changed size: size = %d\\n",
            self->file, self->inode->i_ic.ic_lsize);
        size = self->inode->i_ic.ic_lsize;
        self->inode = 0;
        self->file = 0;

/self->inode && size == self->inode->i_ic.ic_lsize/
        self->inode = 0;
        self->file = 0;
So, an example run looks like:
# ./file.d /var/tmp/bigfile
File /var/tmp/bigfile changed size: size = 132120576
File /var/tmp/bigfile changed size: size = 133169152
File /var/tmp/bigfile changed size: size = 134217728
File /var/tmp/bigfile changed size: size = 134712950
Marvellous. Yet again showing that the only tool you'll ever need is DTrace (and maybe a potato peeler). The above is only good for ufs but it would be fairly trivial to add other filesystems into this. However, if anybody has a more general purpose solution I'm all ears.

Friday Jul 30, 2004

Light Sabers, Holidays and Least Privs

This morning I went and bought three Light Sabers , the weapon of choice for any young Jedi. No, this isn't because I'm some sad Star Wars junkie but because my three little boys will be 5 next week when we are on holiday (yes, they are triplets). They can beat eachother stupid with these things all week long (they'll probably be wondering why the don't actually cut hands off like in the film ...).

I'm off to catch some rays for a couple of weeks. Laptops and computers of any kind have been banned some I'm reduced to books which is probably a good thing.

Well, before I finally go I'll just mention that I was looking into the Least Privilege work that Casper Dik has introduced into Solaris 10 . This work provides a mechanism for fine grained control over the action of processes so that a processes privilige can be elevated to carry out a set of privileged operations (just create a device special file for example and nothing else). It's no longer all (uid == 0) or nothing (uid != 0).

This stuff is really cool. I was on a plane the other day writing some D scripts to show me when we'd failed a privilege check and what exactly had failed. I has the script working and I was quite pleased with myself when I noticed two things.
  • There is a DTrace sdt probe for failed permissions (priv-err) which I could use.
  • There is a kernel variable (priv_debug) which can be set to 1 to log all failures to /var/adm/messages.
Oh well, I think my script has got legs for more detailed failure analysis (such as what file or directory did I just try to operate on when a 'file_dac_write' was logged). I don't think the logging framework has that kind of thing at the moment but I may be wrong.

Very cool stuff though. There's even a developers guide for writing privilege aware applications which you can find here . Guess this should be some of my beach reading ...

Tuesday Jul 27, 2004

fsflush revisited in D

The other day, Erik O'Shaughnessy wrote a nice blog on extracting some new statistics that fsflush makes available to us in Solaris 10 . Any new insight that can be gathered into the operation of the mysterious fsflush has to be a good thing and goodness abounds with Eriks fsfstat tool.

However, it wouldn't be right if we didn't knock a quick version up in D now would it? The script below just takes advantage of the fact that we can reference symbols in the kernel using the backquote character (`) which is the symbol scoping operator.
#!/usr/sbin/dtrace -s

#pragma D option quiet

    lexam = 0; lscan = 0; llock = 0; lmod = 0; lcoal = 0; lrel = 0; ltime = 0;
    printf("%10s %10s %10s %10s %10s %10s %10s\\n", "SCANNED", "EXAMINED",

    printf("%10d %10d %10d %10d %10d %10d %10d\\n", `fsf_total.fsf_scan - lscan,
        `fsf_total.fsf_examined - lexam, `fsf_total.fsf_locked - llock,
        `fsf_total.fsf_modified - lmod, `fsf_total.fsf_coalesce - lcoal,
        `fsf_total.fsf_releases - lrel, `fsf_total.fsf_time - ltime);
    lexam = `fsf_total.fsf_examined;
    lscan = `fsf_total.fsf_scan;
    llock = `fsf_total.fsf_locked;
    lmod = `fsf_total.fsf_modified;
    lcoal = `fsf_total.fsf_coalesce;
    lrel = `fsf_total.fsf_releases;
    ltime = `fsf_total.fsf_time;

 \* First time through

    lexam = `fsf_total.fsf_examined;
    lscan = `fsf_total.fsf_scan;
    llock = `fsf_total.fsf_locked;
    lmod = `fsf_total.fsf_modified;
    lcoal = `fsf_total.fsf_coalesce;
    ltime = `fsf_total.fsf_time;
    lrel = `fsf_total.fsf_releases;
On my lowly laptop with 768MB of memory things look like:
      6355       6356        222          2          0          0     576918
      6355       6356        506          1          0          0     593640
      6355       6356        946          0          0          0     635839
      6355       6356        560          2          0          0     621283
      6355       6356        348          5          0          0     612733
      6355       6356        636          3          0          0     642727
      6355       6356       1005          3          0          0     673983
      6355       6356        783          4          0          0     674637
      6355       5138        309          0          0          0     492221
      6355       5333          0          0          0          0     441155
      6355       6140          2          0          0          0     491863
      6355          7          0          0          0          0       4859
      6355          7          0          0          0          0       4618
      6355          7          1          0          0          0       4588
      6355       5333        619          0          0          0     508046
      6355       6356        350          1          0          0     685573
      6355       6356        343          0          0          0     585538
      6355       6356          0          0          0          0     506531
      6355       6356         18          0          0          0     508688
      6355       6356        415          1          0          0     711370
      6355       6356        542          0          0          0     619117
      6355       6356        377          0          0          0     578586
Sure, Eriks looks nicer and I do like his percentage column but for 15 minutes of my time I'm not complaining!

Monday Jul 26, 2004

Spotting async-unsafe usage

Over the years (now I sound like an old man) I've worked in various situations where the root cause of the issue turned out to be code that uses calls that are not async-signal-safe from within a signal handler. In case you don't know, an async-signal-safe call is any function that can be called from within a signal handler (i.e a function that catches signals). In fact the Single UNIX specification Version 3 defines any function that isn't explicitly listed to be async-signal-safe as one that is unsafe (i.e don't use it from a signal handler you bad man!). A basic list of async-signal-safe functions can be found in section 2.4.1 of the Single UNIX specification.

What makes a function unsafe is the fact that you can't reenter it. Think about malloc(3C) for a second ... we're half way through carrying out a malloc, just going about our own business manipulating some pointers and BANG, someone drops a signal on us. Off we trot to execute the handler for this signal. However, let's say the signal handler itself executes a malloc call. The code we've just jumped from hasn't finished the malloc and we have an indeterminate state. The signal handlers malloc will either go pear shaped now (if you're lucky) or someone will trip over the carnage in a while. The malloc() library call is not designed to be reentrant and so shouldn't be called from a signal handler unless you can guarantee that the signal that has the naughty handler attached can't be called when in any malloc (nigh on impossible to do).

Prior to DTrace it was very difficult indeed to look at what was being called from a signal handler. Now it's easier than childs play as there's less emotion involved! The D below just looks for either printf or malloc (two examples of async-signal-unsafe) when we are in the process of handling a signal (when libc's sigacthandler() has been called).
#!/usr/sbin/dtrace -s

#pragma D option quiet

/\* Just add your asynch unsafe signals below. Not fullproof as
the user could have maked signals whilst in their async-unsafe
calls but this is unlikely. We could check for this.

Usage: ./sigtrace.d 'pid' where 'pid' is the process under review \*/

        name = "";
        sigs[0] = "SIGEXIT"; sigs[1] = "SIGHUP"; sigs[2] = "SIGINT";
        sigs[3] = "SIGQUIT"; sigs[4] = "SIGILL"; sigs[5] = "SIGTRAP";
        sigs[6] = "SIGABRT"; sigs[7] = "SIGEMT"; sigs[8] = "SIGFPE";
        sigs[9] = "SIGKILL"; sigs[10] = "SIGBUS"; sigs[11] = "SIGSEGV";
        sigs[12] = "SIGSYS"; sigs[13] = "SIGPIPE"; sigs[14] = "SIGALRM";
        sigs[15] = "SIGTERM"; sigs[16] = "SIGUSR1"; sigs[17] = "SIGUSR2";
        sigs[18] = "SIGCLD"; sigs[19] = "SIGPWR"; sigs[20] = "SIGWINCH";
        sigs[21] = "SIGURG"; sigs[22] = "SIGPOLL"; sigs[23] = "SIGSTOP";
        sigs[24] = "SIGTSTP"; sigs[25] = "SIGCONT"; sigs[26] = "SIGTTIN";
        sigs[27] = "SIGTTOU"; sigs[28] = "SIGVTALRM"; sigs[29] = "SIGPROF";
        sigs[30] = "SIGXCPU"; sigs[31] = "SIGXFSZ"; sigs[32] = "SIGWAITING";
        sigs[33] = "SIGLWP"; sigs[34] = "SIGFREEZE"; sigs[35] = "SIGTHAW";
        sigs[36] = "SIGCANCEL"; sigs[37] = "SIGLOST"; sigs[38] = "SIGXRES";
        sigs[39] = "SIGJVM1"; sigs[40] = "SIGJVM2"; sigs[41] = "SIGRTMIN";
        sigs[42] = "SIGRTMIN+1"; sigs[43] = "SIGRTMIN+2"; sigs[44] = "SIGRTMIN+3
        sigs[45] = "SIGRTMAX-2"; sigs[46] = "SIGRTMAX-1"; sigs[47] = "SIGRTMAX";


        name = execname;
        self->signo = arg0;
        printf("%s: %s being handled\\n", execname, sigs[self->signo]);


/ self->inhandler /
        @a[sigs[self->signo], "malloc"] = count();

/ self->inhandler /
        @a[sigs[self->signo], "printf"] = count();

        printf("----------- %s --------------\\n", name);
        printa("%s handler called %s %@d times\\n", @a);

# ./sigtrace.d `pgrep inetd`
inetd: SIGHUP being handled
----------- inetd --------------
SIGHUP handler called malloc 2101 times

You can now go and tell your favourite ISV exactly what they shouldn't be doing when and where! You'll be glad to hear the age old code for reconfiguring inetd (demonstrated above with dropping a `kill -1` on inetd!) has been rewritten as part of the new Service Management Facility that will be coming your way in Septembers Solaris Express . Also check out Stephen Hahns blog for more details on smf.

Friday Jul 23, 2004

Friday afternoon japes

A certain Alec Muffet reminded me at lunch that I should post a few more "normal" DTrace scripts. What is meant by "normal" in this context are scripts that are simple and old but are good for learning the basics of D; good for morphing into something specific that you want to do. After all, when I learnt Perl I did it via ripping other peoples examples around to do what I wanted to do and not by reading a man page.

Alec, I'll try to do better in the future but now my big tummy is a rumblin' for some tea so I'll leave you with something that you wouldn't really run as you'd stop people logging in! However, it's good for a laugh and I'm sure you can use the principle in many other areas ...
#cat read.d
#!/usr/sbin/dtrace -ws

/ stringof(args[0]->v_path) == $$1 /
        printf("File %s read by %d\\n", $$1, curpsinfo->pr_uid);
In another window we do:
# more /etc/passwd
The read.d script output looks like this:
# ./read.d /etc/passwd
dtrace: script './read.d' matched 1 probe
dtrace: allowing destructive actions
CPU     ID                    FUNCTION:NAME
  0  15625                   ufs_read:entry File /etc/passwd read by 0

Monday Jul 19, 2004

DTrace - TCP by volume

In Solaris 10 build 63 there is a new DTrace provider called the mib provider. It allows us to drill down and look into aspects of the networking subsystems. For example, how cool and simple is the following script which tells you which processes are generating what volume of tcp traffic:
#!/usr/sbin/dtrace -s

        @[execname] = sum(args[0]);

# ./mib.d
dtrace: script './mib.d' matched 3 probes

  xterm                                                          9696
  java                                                          11660
  sshd                                                         197720
I love it! All this and much, much, much more in the August edition of Software Express for Solaris

Saturday Jul 17, 2004

Friday nights make Jon a lazy boy

Doing things on a friday night can sometimes be a bad thing for a couple of reasons.
  • I can't usually be over bothered with checking my facts.
  • It often annoys my long suffering wife!
My last post was no different to this as I managed to definitely fulfill the first point above and most likely the second. The cool new thread modes that are now in pstack(1) and truss(1) came to us from the world of Eric Schrock . Amongst other things Eric has given to us really cool features like the new vnode pathname work, meaning we can all see the names of the files we are working with when using tools such as pfiles(1) (and obviously DTrace). Keep on bringin' it Eric!

Next time I'll check the SCCS history and not just the architecture case. Lazy boy.

Friday Jul 16, 2004

Sexy truss and pstack additions

Now I know I'm becoming a bit of an old man and my definition of sexy my be getting a tad sad, but Adam Leventhal has recently made some changes in build 63 to truss(1) and pstack(1) that have got me strangely excited for a friday night.

Now I know that DTrace meets our every need but it's nice to see some of the old tools getting some nice new additions. Good old truss(1) and pstack(1) have now been made to operate on individual threads. Hurraahh!! If you've ever tried to use these tools on processes with 100's or even 1000's of threads then you'll be cheering to. Let's see them in action on the mildly threaded nscd(1M).
# pstack `pgrep nscd` | grep -c "lwp#"
# pstack `pgrep nscd`/13-14
100203: /usr/sbin/nscd
-----------------  lwp# 13 / thread# 13  --------------------
 ff1be628 nanosleep (fe57bed8, fe57bed0)
 00019334 reap_hash (79a30, 30258, 2e648, ad, 0, 0) + 11c
 00012400 nsc_reaper (2f0d0, 79a30, 30258, 2e648, 0, 2e400) + 48
 ff1be3ec _lwp_start (0, 0, 0, 0, 0, 0)
-----------------  lwp# 14 / thread# 14  --------------------
 ff1be628 nanosleep (fe47bed8, fe47bed0)
 00019334 reap_hash (79a48, 30258, 2e648, ad, 0, 0) + 11c
 00012400 nsc_reaper (2f0d8, 79a48, 30258, 2e648, 0, 2e400) + 48
 ff1be3ec _lwp_start (0, 0, 0, 0, 0, 0)

# truss -p `pgrep nscd`/13-14
/13:    nanosleep(0xFE57BED8, 0xFE57BED0) (sleeping...)
/14:    nanosleep(0xFE47BED8, 0xFE47BED0) (sleeping...)
Marvellous. So we have 24 threads in total but just look at two of them, 13 and 14. Never again will you have to wade through screenfuls of output trying to correlate the output of individual threads. Like I said before though - you'd obviously do it in DTrace but just for one of your off days you've got a nice backup.

Additions like this may seem small but they add to the quality of life no end. Check out Adams blog for more discussions on other nice additions coming your way in Solaris 10 . The build 63 additions I've mentioned here will be hitting the streets as part of the Software Express for Solaris program in August.

Wednesday Jul 14, 2004

DTrace - when will the dream(s) end?

Just recently in the DTrace Forum on Bigadmin , Jarod Jenson spoke of his 12 steps to DTrace addiction. Well, until the Vulcan mind-meld is taught in undergraduate studies we will have to bypass number 12 in the list (assuming that Team DTrace are not Ferengi in which case we have no hope of number 12 ever being a reality... ). However, it's number 11 that has bothered me more recently, dreaming D.

Over the last 5 or so days I've been in my sick bed with a case of Tonsillitis . I'm a bit of an old hand at the tonsillitis game really as I usually get it once every year or so, so if I see it coming I head it off at the pass with a good dose of Penicillin. However, this time I did have some pretty high temperatures for a few days and along with the temperatures come the very, very strange dreams and this brings me back to point 11 in Jarods list. I've been dreaming D for a while now but things took a turn for the worse in those fevered and odd nights. My recollections of the dreams were fuzzy but always disturbed with a recollection of being pursued by a jeering, baying crowd and all I woke up with were confused and disturbed thoughts of DTrace and Linkers! It was incredibly odd.

Still, I'm fealing a bit better now and one good did come of it; I had an idea for a script which would be useful for cutting down on the amount of work i had to do to find which file a symbol is being resolved from. More about that another time as I'm off for some R&R.

Thursday Jul 08, 2004

I have the power! Speculations and instruction level tracing.

As my old friend He-Man often tells me, I have the Power. Now I may have had a few too many cartoons recently (I'm blaming the children) but when it comes to debugging application issues with DTrace then I never cease to be amazed at the power available to me. A problem I've been looking at today reminded me of this and deserves reference here.

An application, that shall remain nameless for the good of the people, has been giving me serious grief recently. Basically, I have a library which bolts into this application. This library ends up using the excellent compression library zlib - no problem so far. However, when used the library throws out errors relating to incompatible versions of zlib being used. Odd as I thought that my library was the only source of zlib use. Ha-Ha.

I know that the function failing is inflateInit2_() and it is returning the error code of -6 (meaning "Incompatible Version"). The code that is failing looks like this:
int ZEXPORT inflateInit2_(z, w, version, stream_size)
z_streamp z;
int w;
const char \*version;
int stream_size;
  if (version == Z_NULL || version[0] != ZLIB_VERSION[0] ||
      stream_size != sizeof(z_stream))
      return Z_VERSION_ERROR;
However, which of the three conditions are failing? It's easy to see whether or not 'version == Z_NULL' but the other two are not so easy (especially not the third one). However, with DTrace all things are possible. With the pid provider we can keep a track of every instruction that is executed and this is what I'll do here to see what instructions are executed when inflateInit2_() returns -6. From that I'll be able to locate which condition is failing. Awesome!

We'll need to use a facility in DTrace called speculations to do this. Speculations give you the ability to record a sequence of events and only keep them if they are deemed to be interesting at some time in the future (i.e a certain return code generated from a function as is here). The script looks like this:
        self->spec = speculation();
        printf("%s: version = %s, stream size = %d\\n",
                 probefunc, stringof(copyinstr(arg2)), arg3);


/self->spec && arg1 != -6/
        self->spec = 0;

/self->spec && arg1 == -6/
        self->spec = 0;
All we do is to instrument every instruction in the inflateInit2_() function and, if -6 is returned, commit the instructions we executed getting there or, if the return is not -6 then discard them. Sample output of a run looks like;
  0  32854              inflateInit2_:entry inflateInit2_: version = 1.1.4, stream size = 112
  0  32854              inflateInit2_:entry
  0  32857                  inflateInit2_:4
  0  32858                  inflateInit2_:8
  0  32859                  inflateInit2_:c
  0  32860                 inflateInit2_:10
  0  32861                 inflateInit2_:14
  0  32862                 inflateInit2_:18
  0  32863                 inflateInit2_:1c
  0  32864                 inflateInit2_:20
  0  32865                 inflateInit2_:24
  0  32866                 inflateInit2_:28
  0  32867                 inflateInit2_:2c
  0  32868                 inflateInit2_:30
  0  32869                 inflateInit2_:34
  0  32870                 inflateInit2_:38
  0  32871                 inflateInit2_:3c
  0  32872                 inflateInit2_:40
  0  32873                 inflateInit2_:44
  0  32874                 inflateInit2_:48
  0  32875                 inflateInit2_:4c
  0  33020                inflateInit2_:290
  0  33021                inflateInit2_:294
  0  33022                inflateInit2_:298
  0  33023                inflateInit2_:29c
  0  32855             inflateInit2_:return
So, the above are instructions that are executed for a call that returned -6. Next we need to tie the instructions above to the code so we'll disassemble the relevant instructions:
inflateInit2_:  save      %sp, -0xb0, %sp
inflateInit2_+4:call      +8            
inflateInit2_+8:sethi     %hi(0x106000), %g3
inflateInit2_+0xc:              sethi     %hi(0x800), %g2
inflateInit2_+0x10:             add       %g3, 0x344, %g3
inflateInit2_+0x14:             add       %g2, 0x10, %g2
inflateInit2_+0x18:             add       %g3, %o7, %l1
inflateInit2_+0x1c:             orcc      %g0, %i2, %g3
inflateInit2_+0x20:             be,pn     %xcc, +0x270  < inflateInit2_+0x290>
inflateInit2_+0x24:             sra       %i1, 0, %l0
inflateInit2_+0x28:             ldx       [%l1 + %g2], %g2
inflateInit2_+0x2c:             ldsb      [%i2], %g3
inflateInit2_+0x30:             ldsb      [%g2], %g2
inflateInit2_+0x34:             cmp       %g3, %g2
inflateInit2_+0x38:             bne,pt    %icc, +0x25c  < inflateInit2_+0x294>
inflateInit2_+0x3c:             mov       -6, %g2
inflateInit2_+0x40:             sra       %i3, 0, %g2
inflateInit2_+0x44:             cmp       %g2, 0x58
inflateInit2_+0x48:             bne,pn    %xcc, +0x248  < inflateInit2_+0x290>
inflateInit2_+0x4c:             cmp       %i0, 0
 - chopped here -
inflateInit2_+0x290:            mov       -6, %g2
inflateInit2_+0x294:            sra       %g2, 0, %i0
inflateInit2_+0x298:            ret
inflateInit2_+0x29c:            restore
Cutting to the chase it can be seen that the cmp on inflateInit2_+0x44 is what is failing. Tying this back to the source (as %i3 is the fouth argument) we can see this is the stream_size != sizeof(z_stream) condition that is failing. However, what are the values being compared here. Well, as we captured the stream size at the start (112 bytes) we can see that the cmp fails because 112 != 88 (0x58 is 88 in hex). However, we can just verify that this is absolutely what is happening by looking at the register contents when we execute this line.
        printf("reg g2 = %d\\n", uregs[R_G2]);
So, when we hit inflateInit2_+0x44 we print out register %g2. How cool is that! Running the above produces output like:
  0  32854                 inflateInit2_:44 reg g2 = 112

  0  32854                 inflateInit2_:44 reg g2 = 112
So, we know for sure the exact conditions that cause the error but the question is now why. Well this is a whole other story that boils down to the application in question having it's own local versions of these zlib routines which were being linked instead of the ones everything else is using ...

Sorting that out is going to be a bit more difficult. For more details on Speculations and user-level traing with the pid provider check out the fabulous DTrace answerbook.

Tuesday Jul 06, 2004

Vive la Revolution!

I've just come across a few thoughts by Paul Murphy in which he uses the old maxim of "you get what you measure". In this article he basically surmises that people generally maximise performance of a system (system in the broadest sense) with respect to whatever parameter they can measure. How very, very sadly true this is with respect to computer systems.

For years people have been hamstrung by the extremely restrictive data that is presented to them. Most literature they have been presented with has encouraged them to use this data in a very bottom-up manner to try and figure out what is happening on their systems. For example, they look at mpstat(1M) and worry about numbers which look high to them and automatically assume that this is the source of their poor application performance issues. They are conditioned into what can be termed a pattern matching approach to problem solving. By this I mean that they look for certain values or sets of values coming out of the standard tools and have a pre-defined set of problem causes that fit each pattern (this set is usually populated from theirs and their peers experiences).

However their are a few very serious and fundamental problems with the above approach.
  • Firstly, a "system" is just an N-tier application with the kernel being the bottom layers. When you use the standard tools (mpstat(1M), vmstat(1M), sar(1) you are viewing just a small piece of the stack and, even then, a very restricted set off data for that part of the stack.
  • Tied very much to the above point, remember that applications drive system characteristic. On it's own a kernel doesn't do too much. If you want to understand why your system is performing like it is then you need to up level and start with the applications.
  • John Donne said 'No man is an Island" and the same goes for your system(s). Thinking holistically is one of the keys to success when analysing the behaviour of a system. The inter and intra application interaction that occurs on a system is staggering and when married with kernel interaction becomes frightening. Software systems are complex and should be viewed as a whole to understand them.
  • Pattern matching has to be replaced with a scientific approach to problem solving. What I mean by that is that the hypotheses we make should be backed by data at all times. Make no assumptions and trust no-one. Presuppositions are everyones worst enemy when investigating any form of issue.
With discipline, all of the above could be adhered to previously apart from one and it was the last point that was the real problem. Extracting good data from the beast that is your N-tier application has historically been extraordinarily time consuming and prone to error. Instrumenting the varying layers in the stack, harvesting the data and then analysing the data is what costs organisations around the world small fortunes. The iterative cycle that a problem solver follows goes like this:
hypothesis -> instrumentation -> data harvesting -> data analysis ->hypothesis
So we see that we use data to refine our hypothesis. The miracle of DTrace is that we can spend much, much more time on what we should be doing (making hypothesis about system behaviour) and much, much less time and money on the truly mundane (instrumenting and analysing).

It's a brave new world indeed. Vive la Revolution!



  • General
« April 2014