DTrace: looking for slow i/o.
By timatworkhomeandinbetween on May 08, 2008
I have spent the last six months helping support folks in Sun deal with performance escalations. A recent one has changed the way I think about using DTrace.
It's very easy to just turn on all the fbt probes and log the start and end times of every function; add a few lines of code and you get delta timings for each function execution, and if you predicate on the i/o-generating process you get the timings of every function called in the context of that user process. But does this help with the task of finding the cause of the slow i/o?
In the case at hand the answer is NO.
In this case the graph of the number of i/os versus completion time in milliseconds had a bonus extra peak at about the 400 millisecond mark, with a magnitude of about 10% of all i/o.
Previous engineers had done the massive tracing of all function timings in the context of the oracle process, and once they had waded through the hundreds of megabytes of numbers they had not been able to find delays that matched what the Oracle internal stats were showing, so they had started looking for very esoteric scheduling-type problems.
By applying the SGR methodology and using the problem statement "where in the lifecycle of the i/os can we observe the 40 millisecond delay?", you start to treat the i/o as an object, not a code flow, and that is the important paradigm shift. You then ask yourself "where in the software stack that the object (the i/o) has to traverse on its way to the disk drive and back to the application can I observe it?" DTrace lets you perform that measurement using aggregations.
The first thing to do, as the Oracle process used async i/o against raw devices, was to observe the distribution of the time taken for the i/o to pass into the kernel via aioread()/aiowrite() and for the corresponding aiowait() to reap the returning i/o completion. Here the i/o is represented by the aio_req_t object, created in the aioread/aiowrite system calls and destroyed in the corresponding aiowait() system call. This is exactly what the application will be using to do its timing in userland. The quantize() aggregating function is ideal for producing graphs, and in this case it showed an identical graph to the Oracle application's own internal statistics.
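A minimal sketch of that first measurement: key a timestamp on the aio_req_t pointer when the request is created, and quantize the delta when it is reaped. The fbt function names here (aio_req_alloc, aio_req_free) are illustrative assumptions, not necessarily the exact symbols on your release; check what is actually available with something like `dtrace -l -m kaio`.

```d
#!/usr/sbin/dtrace -s
/*
 * Sketch: time each aio_req_t from creation (aioread/aiowrite path)
 * to destruction (aiowait path).  Two probes, one aggregation.
 * NOTE: aio_req_alloc/aio_req_free are assumed names for illustration.
 */
fbt::aio_req_alloc:return
/execname == "oracle"/
{
	/* arg1 of a return probe is the return value: the aio_req_t pointer */
	start[arg1] = timestamp;
}

fbt::aio_req_free:entry
/start[arg0]/
{
	@["aio_req_t lifetime (ns)"] = quantize(timestamp - start[arg0]);
	start[arg0] = 0;
}
```

Run it, let the workload settle, then Ctrl-C: the quantize() histogram should mirror the application's own latency view if the delay really is below the kaio boundary.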
The lifecycle of the i/o was:

application (aioread/aiowrite)
libaio
kaio (aioread/aiowrite)
the volume manager
ssd target driver
mpxio (multipathing)
leadville (the SAN stack)
qlc (the HBA driver)
--- disk ---
qlc interrupt handler
leadville completion routines
ssd interrupt handler
the volume manager's completion routines
kaio (aiowait)
libaio
application (aiowait)
So our first test, which involved enabling only two fbt probes, had allowed us to say that the problem lay between the top kaio layer and the bottom kaio layer, and that the application's statistics were accurate. What we needed was some other probe points that would let us chop up this stack and find our delay.
The problem with chasing i/o through a volume manager is that the entering i/o (represented by a buf structure) is often broken up into smaller buf structures or otherwise changed in a manner that does not allow easy following of the object, so we jumped down to the ssd routines. Again using just two probes, we could graph the time distribution between a buf object entering the ssd driver and the interrupt returning the same object. This showed no i/o taking more than a millisecond.
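The same pattern works here: key on the buf pointer at the top of the target driver and match it on completion. Rather than guessing at fbt symbols inside ssd, this sketch uses the stable io provider, which fires close to the strategy/biodone boundary; it is a stand-in for the two ssd probes used in the escalation, not the exact script.

```d
#!/usr/sbin/dtrace -s
/*
 * Sketch: time each buf from dispatch to the target driver until its
 * completion.  arg0 in the io provider is the buf pointer, so the same
 * object keys both ends of the measurement.
 */
io:::start
{
	bstart[arg0] = timestamp;
}

io:::done
/bstart[arg0]/
{
	@["buf time at/below the target driver (ns)"] =
	    quantize(timestamp - bstart[arg0]);
	bstart[arg0] = 0;
}
```

If this histogram stays under a millisecond while the kaio-level one shows the extra peak, the delay has to live in the layers between the two measurements.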
That ruled out issues with the storage itself, so now we had a couple of places in the lifecycle where our delay could be introduced. The next test was to time the pci interrupt thread from its creation to its death, again with two probes, to see if the interrupt routine was blocking anywhere above sd; it wasn't. This could have been a mistake if the interrupt thread had handed our returning buf off to a taskq thread which had then delayed it, so we aggregated the stacks of the callers of the kaio completion routine: 100% of the callers were interrupt threads. Now we knew where the problem was and where it wasn't.
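That caller check is a one-liner. The completion function name below (aio_done) is an assumption for illustration; substitute whatever the kaio completion routine is called on your release.

```d
#!/usr/sbin/dtrace -s
/*
 * Sketch: who calls the kaio completion routine?  If anything other
 * than interrupt threads shows up (e.g. taskq threads), the hand-off
 * itself becomes a suspect.  NOTE: aio_done is an assumed name.
 */
fbt::aio_done:entry
{
	@callers[stack()] = count();
}
```

Each distinct kernel stack is printed with its count on exit; in the case at hand every stack was rooted in an interrupt thread, which eliminated a whole class of hand-off theories in one run.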
Attention then focused on the volume manager, and it quickly became obvious that the delay was caused by the i/o object being handed off to a worker thread which was delaying it. This also explains why the previous engineers' approach had not worked: when an i/o was delayed, the dispatching thread returned much more quickly because it did less work, so they should have been looking for a function taking less time, not one taking more!
So the way to work these issues is: understand the software stack, understand what software object represents your i/o at each of the layers, and have access to the source code (opensolaris.org) so you can find that object as it passes down through each layer and back up again.
My next few posts will include the DTrace to chop up the i/o stack at different levels: scripts up near the application will allow you to verify whether you actually have an i/o delay issue, and some scripts lower down will chop up the stack to speed diagnosis.