Monday May 12, 2008

first step: using dtrace to look at i/o system calls.

Assuming you believe that your application is having i/o related problems dtrace provides a very powerful way to slice and dice the software stack to find the cause of your delays. I always start by observing the i/o related system calls that a process is making. A few lines of dtrace allow you to get a graphical display of the system call response times - which is really all the application cares about. If you see no signs of delay here then there is no need to probe further down into the i/o stack.

I have chosen probe points in the context of the process being observed in order to try and eliminate any scheduling type issues.. The attached dtrace program io.d uses aggregations and the quantize() function to produce neat graphs of system call response times for aioread/aiowrite, read/write and pread/pwrite system calls as these are the main ways application do i/o.

Read/write and pread/pwrite are quite easy as the i/o happens inside the system call ( well the disk i/o might happen later if doing i/o to a filesystem without O_DSYNC set) as far as the application is concerned, so we just use the time delta between entry and return probes.

For aioread/aiowrite to raw devices there are two system calls involved, one call to kaio() to issue the read of the write and another call to kaio() to reap the return notification. I have therefore used probe points that allow me to observe the sending of the aio_request structure and the reaping of it in the second system call and I have used the time delta between these two events to provide a kernel timing as close to what the userland application would see if it used gethrtime() to time this sequence of systems calls.

Here in this output I am using the program to watch a program generating aiowrites() to two slices on the same disk, ( you also see writes to the tty as the program emits some timing information of its own..).. The vertical index (value) is in millisecs.

ksh: /usr/sbin/dtrace -qs /var/tmp/aio.d $(pgrep aio)
------- 1648930 value axis is in millisecs -------

  write against /devices/pseudo/pts@0:1             
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 82       
               1 |                                         0        

  kaio writes to /devices/pci@0/pci@0/pci@2/scsi@0/sd@1,0:a,raw
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@                     299      
               8 |@@@@@@@@@@@@@@@@@@@@                     298      
              16 |                                         1        
              32 |                                         0        

  kaio writes to /devices/pci@0/pci@0/pci@2/scsi@0/sd@1,0:b,raw
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@                     295      
               8 |@@@@@@@@@@@@@@@@@@@@                     295      
              16 |@                                        9        
              32 |                                         0        

------- 1648940 value axis is in millisecs -------

  write against /devices/pseudo/pts@0:1             
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 114      
               1 |                                         0        

  kaio writes to /devices/pci@0/pci@0/pci@2/scsi@0/sd@1,0:a,raw
           value  ------------- Distribution ------------- count    
               1 |                                         0        
               2 |                                         2        
               4 |@@@@@@@@@@@@@@@@@@@@                     328      
               8 |@@@@@@@@@@@@@@@@@@@@                     334      
              16 |                                         1        
              32 |                                         0        

  kaio writes to /devices/pci@0/pci@0/pci@2/scsi@0/sd@1,0:b,raw
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@                     328      
               8 |@@@@@@@@@@@@@@@@@@@@                     329      
              16 |                                         7        
              32 |                                         0        


I am surprised that companies have not started producing dtrace based i/o analysis tools allowing health checks and event monitoring with an accuracy and ease we have never had before. Even this script would allow you to look for abnormal i/o responses or produce reports on throughtput and response times with small changes to the script.

Thursday Nov 15, 2007

Things I have learnt in the last few weeks.

1) Customer's system suffered from i/o slowdowns of the order of seconds. The cause was a load of unused targets in the scsi related driver .conf files like ( ses.conf, st.conf, sgen.conf, sd.conf). A home made monitoring application was triggering driver loading, as part of that process the framework tried to probe all the targets in these .conf files. On a copper scsi bus each probe involves sending inquiry commands, obviously with no target to respond to the selection you get the 0.25 second selection timeout which holds the scsi bus. Get enough drivers and unused targets and you can get bus freezes of many seconds. This was resolved by using scsi.d through a sharedshell session that allowed me to login and root cause it- luckily the customer was running Solaris 10.
2) When using Oracle RAC you need to partition your tables to avoid both nodes wanting to access the same block. If that happens the block is read from disk into one node and then when it is free it is passed across the network interconnect to the other node..
3) don't ignore scsi messages in /var/adm/messages about targets reducing sync rate or even worse going async.. A copper scsi drive in async mode canonly transfer about 5MB/second. If you do send it 5 Mbytes in big transfers then it will occupy the bus transferring data slowly holding off other targets. Currently you have to reboot to clear this reduced speed.
4) if you have a storedge 3310 JBOD you must set the scsi option in mpt.conf to reduce the speed to ultra 160 even though the drives can do ultra 320. If you don't you get odd scsi warnings and change lots of hardware that did not need changing.
5) if you get a problem on a pci bus eg a device ends a transaction with a target abort then you can get two distinct behaviors on a sparc system. If that transaction is a write then the host bridge sets a suitable bit in a status register and generates an interrupt that is serviced by pbm_error_interrupt(). This can go to a different cpu than the one issuing the pci write and can take a while to process. This delay explain why crash dumps of "panic : pci fatal error" rarely capture the stack trace of the offending pci write. If the transaction is a read you get a Berr ( bus error) trap on the faulting store instruction and the resultant crash dump shows you the offending stack trace.

I have a new boss and job, I am part of a small team that is hoping to form a diagnostic community within System-TSC ( the part of services I work in). The goal is to speed up the problem resolution of difficult and complex issues by engaging with and offering advice/mentoring/training to other engineers to grow their skill sets. The other folks in this team are all exceptionally talented so this maybe a bit of a challenge.

Administering my little web site and non-technical blog at isn't taking too much time.



« April 2014

No bookmarks in folder


No bookmarks in folder