What is that system call doing?

Getting back on topic, here is a nice short bit of DTrace.

Sometimes, by the time I get to see an issue, the “where on the object” question is already well defined, and in two recent cases it came down to “Why is system call X slow?”. The system calls were not the same in the two cases, but the bit of D that found the answer was almost identical in both.

Faced with a system call that is taking a long time, you have to understand the three possible reasons this can happen:

  1. It has to do a lot of processing to achieve its results.

  2. It blocks for a long time waiting for an asynchronous event to occur.

  3. It blocks for a short time but many times waiting for asynchronous events to occur.

So it would be really nice to be able to see where a system call is spending all its time. The starting point for such an investigation is that while in the system call there are only two important states: the thread is either running on a CPU or it is not, and typically when it is not it is because it is blocked for some reason. So the script uses the DTrace sched provider's on-cpu and off-cpu probes to see how much time the system call spends blocked, and then prints out a stack if it was blocked for more than a given amount of time.
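
The core of that idea is only four clauses: toggle a thread-local flag in the system call's entry and return probes, and accumulate the time between sched:::off-cpu and sched:::on-cpu while the flag is set. Here is a minimal sketch of just that part (the names self->insyscall and @blocked are mine for illustration; the full script below adds the distributions, stack traces and totals):

syscall::rename:entry
/ pid == $target /
{
	self->insyscall = 1;
}
sched:::off-cpu
/ self->insyscall /
{
	/* the thread is leaving the CPU, remember when */
	self->off = timestamp;
}
sched:::on-cpu
/ self->insyscall && self->off /
{
	/* back on a CPU, add the time spent blocked to the aggregation */
	@blocked = sum(timestamp - self->off);
	self->off = 0;
}
syscall::rename:return
/ self->insyscall /
{
	self->insyscall = 0;
	printa("time spent off CPU during rename: %@d ns\n", @blocked);
}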

Here is the full script running against a simple mv(1) command:

$ pfexec /usr/sbin/dtrace -s syscall-time.d -c "mv .d .x"
dtrace: script 'syscall-time.d' matched 17 probes
dtrace: pid 26118 has exited
CPU     ID                    FUNCTION:NAME
  3  79751                     rename:entry rename(.d, .x)
  3  21381                    resume:on-cpu Off cpu for: 1980302
              genunix`cv_timedwait_sig+0x1c6
              rpcmod`clnt_cots_kcallit+0x55d
              nfs`nfs4_rfscall+0x3a9
              nfs`rfs4call+0xb7
              nfs`nfs4rename_persistent_fh+0x1eb
              nfs`nfs4rename+0x482
              nfs`nfs4_rename+0x89
              genunix`fop_rename+0xc2
              genunix`vn_renameat+0x2ab
              genunix`vn_rename+0x2b

  3  79752                    rename:return 
  on-cpu                                            
           value  ------------- Distribution ------------- count    
           16384 |                                         0        
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           3        
           65536 |@@@@@@@@@@                               1        
          131072 |                                         0        

  off-cpu                                           
           value  ------------- Distribution ------------- count    
          131072 |                                         0        
          262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2        
          524288 |                                         0        
         1048576 |@@@@@@@@@@@@@                            1        
         2097152 |                                         0        

rename times on: 205680 off: 2625604 total: 2831284


$ 

From the aggregations at the bottom of the output you can see that the system call went off-cpu three times, and on one of those occasions it was off CPU for long enough that my limit of 1,000,000 nanoseconds (one millisecond) was exceeded, so a stack trace was printed. The totals confirm it: 205,680 ns on CPU plus 2,625,604 ns off CPU make up the 2,831,284 ns the call took, so well over 90% of its time was spent blocked. It also becomes pretty clear where that system call spent all its time. It was a “rename” system call and I'm on an NFS file system, so it has to wait for the server to respond, and that server is going to have to make sure it has updated some non-volatile storage.


Here is the script:


#!/usr/sbin/dtrace -s
/* run using dtrace -p or dtrace -c */
syscall::rename:entry
/ pid == $target /
{
	self->traceme = 1;
	self->ts = timestamp;
	self->on_cpu = timestamp;
	self->total_on = 0;
	self->total_off = 0;
	printf("rename(%s, %s)", copyinstr(arg0), copyinstr(arg1));
}
sched:::off-cpu
/ self->traceme == 1 /
{
	self->off_cpu = timestamp;
	self->total_on += self->off_cpu - self->on_cpu;
}
sched:::off-cpu
/ self->traceme == 1 /
{
	 @["on-cpu"] = quantize(self->off_cpu - self->on_cpu);
}
sched:::on-cpu
/ self->traceme == 1 /
{
	self->on_cpu = timestamp;
	 @["off-cpu"] = quantize(self->on_cpu - self->off_cpu);
	 self->total_off += self->on_cpu - self->off_cpu;
}
/* if off cpu for more than a millisecond print a stack */
sched:::on-cpu
/ self->traceme == 1 && timestamp - self->off_cpu > 1000*1000 /
{
	printf("Off cpu for: %d", self->on_cpu - self->off_cpu);
	stack(10);
}
sched:::off-cpu
/ self->traceme == 1 && timestamp - self->on_cpu > 1000*1000 /
{
	printf("On cpu for: %d", self->off_cpu - self->on_cpu);
	stack(10);
}
syscall::rename:return
/self->traceme/
{
	self->traceme = 0;
	self->total_on += timestamp - self->on_cpu;
	@["on-cpu"] = quantize(timestamp - self->on_cpu);
	printa(@);
	printf("%s times on: %d off: %d total: %d\n", probefunc, self->total_on,
	self->total_off, timestamp-self->ts);
	self->on_cpu = 0;
	self->off_cpu = 0;
	self->total_on = 0;
	self->total_off = 0;
}
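
Nothing in the sched:::on-cpu and sched:::off-cpu clauses is specific to rename, so pointing the script at another system call only means changing the probe function in the entry and return clauses and adjusting the printf() of the arguments. As a hypothetical sketch (open(2) is my example here, not something from the script above), the entry clause would become:

/* hypothetical adaptation: time open(2) instead of rename(2).  The
   sched clauses stay exactly as above and the return clause only needs
   its probe name changed to syscall::open:return. */
syscall::open:entry
/ pid == $target /
{
	self->traceme = 1;
	self->ts = timestamp;
	self->on_cpu = timestamp;
	self->total_on = 0;
	self->total_off = 0;
	/* open(2)'s first argument is also a pathname, so copyinstr() still works */
	printf("open(%s)", copyinstr(arg0));
}

As the comment at the top of the script says, it can also be pointed at an already running process with pfexec /usr/sbin/dtrace -s syscall-time.d -p followed by the process ID, rather than having dtrace start the command with -c.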

11/9/2009 Updated script to correct the time in the comment.

Comments:

Nice one. Is there a reason for having 2 instances of sched:::off-cpu probe with the same predicate but different bodies ?

Posted by Vladimir Kotal on October 03, 2008 at 08:38 AM BST #

No there is not. It is just the way the script evolved.

Posted by Chris Gerhard on October 03, 2008 at 08:49 AM BST #

Great, this really helped give a definitive answer to a serious customer problem. For me the mental leap required to get from "examine potential code paths, instrument code paths, run experiment, repeat" to "get a stack at just the right point in time." was a big one.

There is a lesson to be learned for newbs like me in that: when you are in a position where you think the only way to get the answer with dtrace will be to install probes everywhere it is wiser to try to recast the problem to "how do I take a reading at the critical *time*". Many thanks for that lesson, my troubleshooting will be better for it.

Posted by Bond on January 21, 2009 at 05:28 AM GMT #
