Monday May 12, 2008

first step: using dtrace to look at i/o system calls.

Assuming you believe that your application is having i/o related problems dtrace provides a very powerful way to slice and dice the software stack to find the cause of your delays. I always start by observing the i/o related system calls that a process is making. A few lines of dtrace allow you to get a graphical display of the system call response times - which is really all the application cares about. If you see no signs of delay here then there is no need to probe further down into the i/o stack.

I have chosen probe points in the context of the process being observed in order to try and eliminate any scheduling type issues.. The attached dtrace program io.d uses aggregations and the quantize() function to produce neat graphs of system call response times for aioread/aiowrite, read/write and pread/pwrite system calls as these are the main ways application do i/o.

Read/write and pread/pwrite are quite easy as the i/o happens inside the system call ( well the disk i/o might happen later if doing i/o to a filesystem without O_DSYNC set) as far as the application is concerned, so we just use the time delta between entry and return probes.

For aioread/aiowrite to raw devices there are two system calls involved, one call to kaio() to issue the read of the write and another call to kaio() to reap the return notification. I have therefore used probe points that allow me to observe the sending of the aio_request structure and the reaping of it in the second system call and I have used the time delta between these two events to provide a kernel timing as close to what the userland application would see if it used gethrtime() to time this sequence of systems calls.

Here in this output I am using the program to watch a program generating aiowrites() to two slices on the same disk, ( you also see writes to the tty as the program emits some timing information of its own..).. The vertical index (value) is in millisecs.

ksh: /usr/sbin/dtrace -qs /var/tmp/aio.d $(pgrep aio)
------- 1648930 value axis is in millisecs -------

  write against /devices/pseudo/pts@0:1             
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 82       
               1 |                                         0        

  kaio writes to /devices/pci@0/pci@0/pci@2/scsi@0/sd@1,0:a,raw
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@                     299      
               8 |@@@@@@@@@@@@@@@@@@@@                     298      
              16 |                                         1        
              32 |                                         0        

  kaio writes to /devices/pci@0/pci@0/pci@2/scsi@0/sd@1,0:b,raw
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@                     295      
               8 |@@@@@@@@@@@@@@@@@@@@                     295      
              16 |@                                        9        
              32 |                                         0        

------- 1648940 value axis is in millisecs -------

  write against /devices/pseudo/pts@0:1             
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 114      
               1 |                                         0        

  kaio writes to /devices/pci@0/pci@0/pci@2/scsi@0/sd@1,0:a,raw
           value  ------------- Distribution ------------- count    
               1 |                                         0        
               2 |                                         2        
               4 |@@@@@@@@@@@@@@@@@@@@                     328      
               8 |@@@@@@@@@@@@@@@@@@@@                     334      
              16 |                                         1        
              32 |                                         0        

  kaio writes to /devices/pci@0/pci@0/pci@2/scsi@0/sd@1,0:b,raw
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@                     328      
               8 |@@@@@@@@@@@@@@@@@@@@                     329      
              16 |                                         7        
              32 |                                         0        


I am surprised that companies have not started producing dtrace based i/o analysis tools allowing health checks and event monitoring with an accuracy and ease we have never had before. Even this script would allow you to look for abnormal i/o responses or produce reports on throughtput and response times with small changes to the script.

Thursday May 08, 2008

dtrace looking for slow i/o.

I have spent the last 6 months helping support folks in Sun deal with performance escalations. A recent one has changed the way I think about using dtrace.

Its very easy to just turn on all the fbt probes and then log start and end times of every function, and if you add in a few lines of code you can get deltas timings for function execution, and if you predicate the functions on the i/o generator process you can get all the function timings that are called in the context of the user process you are predicating on. But does this help with the task of finding the cause of the slow i/o ?

In the case at hand the answer is NO.

In this case the graph of number of i/os versus completion time in millisecs had a bonus extra peak at about the 400 millisec mark and of a magnitude of about 10% of all i/o.

Previous engineers had done the massive tracing of all function timings in the context of the oracle process and once they had waded through the 100s of megabytes of numbers had not been able to find delays that matched what the oracle internal stats were showing so they had started looking for very esoteric scheduling type problems..

By applying SGR methodology and using the problem statement "where in the lifecycle of the i/os can we observe the 40 millisec delay", you start to treat the i/o as an object not a code flow, and that is the important paradigm. You then ask yourself "where in the software stack that the object ( i/o) has to traverse on its way to the disk drive and back to the application can I observe the object", dtrace allows you to perform the measurement using aggregations.

The first thing to do as the Oracle process used async i/o against raw device was to observe the distribution of the time taken for the i/o to pass into the kernel via aioread()/aiowrite() and for the corresponding aiowait() to reap the returning i/o completion. Here the i/o is represented by the aio_req_t object, created in the aioread/aiowrite system calls and destroyed in the corresponding aiowait() system call. This is exactly what the application will be using to do its timing in userland. the quantize() aggragating function is ideal for producing graphs and in this case it showed an identical graph to the oracle application own internal statistics.

the lifecyle of the i/o was..

application ( aioread/aiowrite)
kaio (aioread/aiowrite)
a volume manager
ssd  target driver
mpxio - multipathing
leadville - san stack
qlc - hba
qlc interrupt handler
leadville completion routines
ssd interrupt handler
the volume managers completion routines
kaio (aiowait)
application ( aiowait)

So our first test that involved enabling only two fbt probes had allowed us to say that the problem lay in the top kaio and the bottom kaio and that the application statistics were accurate. What we needed was some other probe points to allow us to chop up this stack and find our delay.

The problem with chasing i/o through a volume manager is that the entering i/o ( represented by a buf structure) is often broken up into smaller buf structures or otherwise changed in a manner that does not allow easy following of the objects, so we jump down to the ssd routines, again using just two probes we can graph the time distribution of buf objects entering the ssd driver and the interrupt returning the same object. This showed no i/o taking more than a millisec.

That ruled out issues with the storage itself. So now we have a couple of places in the lifecycle where our delay could be introduced. The next test was to time the pci interrupt thread from its creation to its death again with two probes to see if the interrupt routines was blocking anywhere above sd - it wasn't. This could have been a mistake if the interrupt thread had handed our returning buf off to a taskq thread and that had delayed, so we aggregated the stack of the callers of the kaio completion routine and 100% of the callers were interrupt threads, so now we know where the problem is and where it isn't.

Attention was focused on to the volume manager and it quickly became obvious that the delay was caused by the i/o object being passed off to a worker thread which was delaying the i/o. This would explain why the previous engineers approach had not worked - when an i/o was delayed the thread returned much quicker as it did less work - they should have looked for a function taking less time not one taking more time!

So the way to work these issues is - to understand the software stack, to understand what software object represents your i/o at each of the layers, and to have access to the source code ( to be able to find that object as it passes down through each layer and when it passes back up again.

my next few posts will include the dtrace to chop up the i/o stack at different levels, up near the application will alow you to verify if you have an i/o delay issue and some scripts lower down to try and chop up the stack to speed diagnosis.




« August 2016

No bookmarks in folder


No bookmarks in folder