News, tips, partners, and perspectives for the Oracle Linux operating system and upstream Linux kernel work

How long does your IO take ?

Guest Author
Oracle Linux engineer Rajan Shanmugavelu illustrates how to analyse disk IO latency using Dtrace.


There are times despite having a Highly Available and Fault Tolerant architected storage environment, the disk IO takes an abnormally longer time to complete, potentially causing outages at different levels in a data center. This becomes critical in a Cluster with multiple nodes that are using disk heartbeat to do health checks.

There are utilities like iostat and sar that provide some information at a higher level. The service time 'svctm' column in both iostat and sar shows the latency from the host, the amount of time spent on the wire between the HBA port and the target/lun. Sometimes the service time 'svctm' provided by Linux iostat and sar are not reliable.

Dynamic Tracing (DTrace) allows one to measure latency at a more granular level like measuring elapsed time at adapter driver layer. With this we can find out where in the whole driver stack more cycles are spent. This can be done on the running Linux kernel without having to install an instrumented driver or requiring a reboot of the system.

Below is an example of DTrace measuring latency in the QLogic FC qla2xxx driver from the time the SCSI command is queued until it is completed from the target. This would measure every IO that is sent down the FC channel. We can also filter for commands that are taking abnormally longer time to complete. We would be interested in SCSI commands that have taken more than 25 milliseconds. Anything less than 15 milliseconds is normal per the SCSI specification standard for a spindle Disk.

The DTrace utility on Oracle Linux can be installed by following the instructions on Getting Started With DTrace, the kernel DTrace provider modules can be dynamically loaded.

#!/usr/sbin/dtrace -s 
 * qla.d DTrace script to measure latency of IO in the FC adapter qla2xxx layer.
 * Requires 'fbt' provider to be loaded (run 'sudo modprobe fbt').
        qla_cmnd = (struct scsi_cmnd *)arg1;
        qla_starttime = (int64_t)timestamp;
/qla_cmnd == ((srb_t *)arg0)->u.scmd.cmd && 
 (this->elapsed = ((int64_t)timestamp - qla_starttime)/1000000) > 25/ 
        this->scsi_device = (struct scsi_device *)qla_cmnd->device;
        this->device = (struct device) this->scsi_device->sdev_gendev;
        this->dev_name = stringof(this->device.kobj.name);
        this->disk_name = stringof(qla_cmnd->request->rq_disk->disk_name);
        printf("%Y cmnd 0x%p opcode %xh on [%s] %s took %3dms",  
                walltimestamp, qla_cmnd, qla_cmnd->cmnd[0],
                this->dev_name, this->disk_name, this->elapsed);
        qla_cmnd = NULL;
        qla_starttime = 0;

The DTrace qla.d script has 2 fbt (function boundary trace) probes. The 'qla2xxx_queuecommand:entry' probe stores the SCSI command, which is the second argument in this routine. We also store the nanosecond timestamp as qla_starttime.

The probe 'qla2x00_sp_compl:entry' is fired upon SCSI command completion, which gets called for every IO. This is a place to insert the DTrace predicate to filter the code block to execute only when it meets the condition specified in the predicate. Here we check for the SCSI command that matches with the one set at qla2xxx_queuecommand and if the elapsed time is greater than 25 milliseconds.

To match the SCSI command with the one set at qla2xxx_queuecommand we need to get the SCSI command upon entering the qla2x00_sp_comp routine whose function's first argument is a pointer of type 'srb_t'. The SCSI command resides in a union member u.scmd.cmd of the srb_t structure. DTrace allows one to traverse the structure member of the argument with ((srb_t *)arg0)->u.scmd.cmd. Then we calculate the elapsed time as the built-in nanosecond timer variable timestamp minus the start time set at qla2xxx_queuecommand. If the elapsed time is greater than 25ms and the SCSI commands match, then we execute the code block to show details. Below is output from a system with simulated slow disk:

# dtrace -s qla.d
dtrace: script 'qla.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
 18  71529           qla2x00_sp_compl:entry 2020 Dec  3 20:17:33 cmnd 0xffff881e33796800 opcode 28h on [1:0:0:1] sdc took 214ms
 18  71529           qla2x00_sp_compl:entry 2020 Dec  3 20:17:33 cmnd 0xffff881e337971c0 opcode 28h on [1:0:0:1] sdc took  55ms
 18  71529           qla2x00_sp_compl:entry 2020 Dec  3 20:20:45 cmnd 0xffff881e33cea700 opcode 28h on [1:0:0:245] sddw took 438ms
  6  71529           qla2x00_sp_compl:entry 2020 Dec  3 20:20:52 cmnd 0xffff881e35959040 opcode 28h on [2:0:1:259] sduh took 268ms
  6  71529           qla2x00_sp_compl:entry 2020 Dec  3 20:21:18 cmnd 0xffff881cec68b740 opcode 28h on [2:0:1:243] sdtz took  31ms
  6  71529           qla2x00_sp_compl:entry 2020 Dec  3 20:22:26 cmnd 0xffff881cdf960000 opcode 2ah on [2:0:1:103] sdsf took  53ms
  6  71529           qla2x00_sp_compl:entry 2020 Dec  3 20:23:16 cmnd 0xffff881e30737500 opcode 28h on [2:0:1:193] sdta took 525ms
 18  71529           qla2x00_sp_compl:entry 2020 Dec  3 20:24:11 cmnd 0xffff881def666180 opcode 2ah on [1:0:0:175] sdh took 118ms
 18  71529           qla2x00_sp_compl:entry 2020 Dec  3 20:24:50 cmnd 0xffff881e2629db00 opcode 28h on [1:0:0:253] sdea took 190ms
  6  71529           qla2x00_sp_compl:entry 2020 Dec  3 20:25:01 cmnd 0xffff88006b78de40 opcode 28h on [2:0:1:135] sdrx took 559ms
 18  71529           qla2x00_sp_compl:entry 2020 Dec  3 20:25:26 cmnd 0xffff881cbc718000 opcode 28h on [1:0:0:73] sdao took 133ms

The DTrace command takes a -s option on the command line to specify the location of the D program source file to be compiled. There are other options that can be specified along with the -s option (see dtrace(1) man page).

The output shows the probe being fired when the predicate conditions are met. The printf statement in the DTrace code block shows the timestamp in human readable form, the SCSI command address, and how long it took to complete the IO. The Device name shows details like on which Nexus the delay occurred. It shows if the delay is due to a specific device/lun or a specific target or is it channel wide or across all channels. We can also find which SCSI command was on the wire for such a long time.

DTrace allows one to access structure members with type casting, which enables to read the values of the member. They are not writable. The dev_name or Nexus shows the SCSI host number, this is also the FibreChannel port, Target, Lun number. The disk name is the name of the device under the /dev/ directory. The opcode shows if it is read (28h), write (2ah) or Test Unit Ready (0h) or some other SCSI command opcode that got delayed and the time it took to complete the IO.

# dtrace -qs qla.d
2020 Dec  8 19:22:55 cmnd 0xffff881cd08b71c0 opcode 28h on [1:0:0:213] sddg took  35ms
2020 Dec  8 19:26:37 cmnd 0xffff881e24e93400 opcode 28h on [2:0:0:189] sdno took 462ms
2020 Dec  8 19:27:53 cmnd 0xffff881cd8f93a80 opcode 28h on [2:0:0:219] sdod took  29ms
2020 Dec  8 21:04:39 cmnd 0xffff881c5c0c0340 opcode 2ah on [2:0:0:41] sdks took  44ms
2020 Dec  8 22:16:08 cmnd 0xffff881f3a96e4c0 opcode 28h on [1:0:0:235] sddr took 529ms
2020 Dec  8 22:27:47 cmnd 0xffff881e2de109c0 opcode 28h on [2:0:0:91] sdlr took 578ms
2020 Dec  9 00:33:38 cmnd 0xffff881e00a5aa40 opcode 28h on [1:0:0:43] sdz took  61ms
2020 Dec  9 00:54:48 cmnd 0xffff881dc7846800 opcode 28h on [1:0:0:195] sdcx took 408ms

The DTrace script running with '-q' option is for quite mode, it displays what is specified in the printf() statement. The non-quiet mode shows the number of probes being traced, which CPU ran the probe routine, PID and the probe function name.

Join the discussion

Comments ( 1 )
  • Pattabhi Wednesday, December 16, 2020
    Excellent, great work & contribution to make IO Latency in Highly Available and Fault Tolerant architecture to be more effective.
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.