Interested in diving into block layer performance? Developers frequently look for details about the ongoing I/O load, such as:

  • Processes initiating R/W requests
  • Size of the requests
  • I/O latency
  • Disks inundated with requests
  • Sector reads/writes that take the most time
  • I/O time spent in the queue

This article focuses on how bcc-tools can be leveraged for block layer analysis (bcc here stands for BPF Compiler Collection). These tools provide useful insight into the system without any coding: they are off-the-shelf utilities for identifying performance issues. Why reinvent the wheel? Though if one does need to invent, BCC provides a simple interface for creating custom tooling to analyse the kernel. These tools also provide a great reference and starting point for your learning/debugging experience.
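
As a taste of that interface, the following is a minimal sketch of a custom tool built on BCC's Python bindings. This is an illustrative example, not one of the shipped tools: it counts block I/O requests per process using the block:block_rq_issue tracepoint, and the PID captured at issue time may be a kernel worker rather than the originating process (one reason the shipped tools are more elaborate).

#!/usr/bin/env python
# Illustrative sketch only: count block I/O requests issued, per PID.
from bcc import BPF
import time

b = BPF(text="""
BPF_HASH(counts, u32, u64);

TRACEPOINT_PROBE(block, block_rq_issue) {
    // PID of the task on-CPU when the request is issued; this may be
    // a kernel worker rather than the process that created the I/O.
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
""")

print("Counting block I/O issues per PID... Hit Ctrl-C to end.")
try:
    time.sleep(99999999)
except KeyboardInterrupt:
    pass

# Print the per-PID totals, busiest first.
for pid, count in sorted(b["counts"].items(),
                         key=lambda kv: kv[1].value, reverse=True):
    print("PID %-8d I/Os %d" % (pid.value, count.value))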

Tools:

To access and start playing with the tools, just install the packages bcc-tools and kernel-uek-devel. Once installed, the scripts can be found at /usr/share/bcc/tools. The following sections describe some of these tools, how they can help in block layer diagnosability, and how they accelerate the understanding of system performance. We will use the fio tool to generate the I/O (an example invocation follows the list below) and see how the following tools will help in achieving our goals.

  • biotop
  • biosnoop
  • biolatency
  • bitesize
  • biolatpcts
  • tplist
  • trace
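
The I/O in the examples below was generated with fio. A hypothetical invocation along the following lines (the device path and all parameters here are purely illustrative) produces the kind of small random read/write load seen in the outputs later in this article:

# fio --name=randrw --filename=/dev/dm-3 --direct=1 --rw=randrw --bs=4k --iodepth=16 --numjobs=4 --runtime=60 --time_based --group_reporting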

biotop

biotop gives us an overview of which processes are generating I/O on the system, so we will use it first to examine the system at a high level.

It is similar to the top command, but for disks, with a default display interval of one second. It displays the processes performing I/O, sorted by the size of the I/O (Kbytes column), and acts like a snapshot of the processes doing the largest read/write operations along with their average latency.

usage: biotop.py [-h] [-C] [-r MAXROWS] [interval] [count]
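
For example, using the interval and count arguments from the usage above, the following prints ten one-second summaries without clearing the screen (-C, as also used below):

# /usr/share/bcc/tools/biotop -C 1 10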

The first line contains the time and load average from the file /proc/loadavg.

Fields displayed:

  1. PID: Process Id
  2. COMM: Process name
  3. D: Operation – Read/Write
  4. MAJ: Major number of disk
  5. MIN: Minor number of disk
  6. DISK: I/O to this disk
  7. I/O: The number of I/Os done in the current interval
  8. Kbytes: The total Kbytes completed
  9. AVGms: Average I/O time in milliseconds

biotop generates the following output, which tells us that the current fio processes are reading and writing from/to a couple of dm disks with average latencies of roughly 1000ms and above. This output alone is enough to signal whether our system is slow in processing I/Os.

# /usr/share/bcc/tools/biotop -C
Tracing... Output every 1 secs. Hit Ctrl-C to end
12:55:01 loadavg: 0.11 0.28 0.36 6/511 11790

PID     COMM            D    MAJ   MIN   DISK     I/O   Kbytes   AVGms
31917   fio             W    252   3     dm-3     13      52     1843.11
31917   fio             R    252   3     dm-3     12      48     869.73
31915   fio             W    252   3     dm-3     11      44     1124.25
1610    kworker/19:1H   R    131   240   sdgj     11      44     3288.80
31917   fio             W    252   42    dm-42    9       36     946.63
31914   fio             W    252   3     dm-3     9       36     1562.52
31915   fio             R    252   3     dm-3     8       32     1718.64
31914   fio             R    252   44    dm-44    7       28     2633.36

biosnoop

Next, let us look at latency at the level of individual I/Os. biosnoop traces and prints one line for each I/O request, showing the latency between the request being issued and its completion. The -Q option additionally prints the time each request spent in the block queue.

usage: biosnoop.py [-h] [-Q]

Fields displayed:

  1. T: Operation: Read/Write
  2. LAT: Latency in milliseconds from the request being issued to the device until its completion
  3. QUE: Time spent in milliseconds in kernel queue
# /usr/share/bcc/tools/biosnoop -Q
TIME(s)     COMM  PID    DISK    T  SECTOR     BYTES     QUE(ms)     LAT(ms)
103.809904  fio   31916  dm-44   W  35688      4096      0.01        21964.53
122.904032  fio   31917  sdhr    R  2385000    4096      0.00        20113.50
122.904040  fio   31917  dm-3    R  2385000    4096      0.00        20113.51
122.904144  fio   31916  sdas    R  241776     4096      0.00        20173.90
122.904159  fio   31916  dm-42   R  241776     4096      0.00        20173.92
122.904280  fio   31915  sdas    R  2034336    4096      0.00        20173.72
122.904288  fio   31915  dm-42   R  2034336    4096      0.00        20173.74
122.947834  fio   31917  sdjp    R  25591480   4096      31170.24    8898.60
122.947840  fio   31917  dm-3    R  25591480   4096      0.00        40068.85

Outliers, such as the I/Os that take the longest, can also be filtered out; these are generally the cause of performance concerns.
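
For example, the biosnoop -Q output can first be captured to a file and then filtered on the last column (LAT):

# /usr/share/bcc/tools/biosnoop -Q > biosnoop.out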

awk '$NF > 20000' biosnoop.out

The above output shows the fio R/W operations that are taking more than 20 seconds, while for most of them the time spent in the block queue is negligible. Such latency is considered huge for any I/O and points to performance issues below the block layer.

biolatency

biolatency is another highly valuable tool for when process information is not necessary and one merely wants to check the time taken by I/Os, irrespective of their details. Since it supports per-disk latencies (the -D option), it offers insight into which disk might have performance issues. It prints a histogram of I/O latencies, with the number of I/Os displayed for each latency bucket.

usage: biolatency.py [-h] [-T] [-Q] [-m] [-D] [-F] [-j] [interval] [count]
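
For example, combining the -D and -m options from the usage above prints a single ten-second per-disk summary in milliseconds:

# /usr/share/bcc/tools/biolatency -D -m 10 1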

Fields displayed:

  1. usecs: I/O latency bucket in microseconds
  2. count: Number of I/Os within the latency bucket
  3. distribution: Bar chart to visualize the count
# /usr/share/bcc/tools/biolatency
Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 1        |                                        |
        64 -> 127        : 9391913  |****************************            |
       128 -> 255        : 13087567 |****************************************|
       256 -> 511        : 2535900  |*******                                 |
       512 -> 1023       : 155152   |                                        |
      1024 -> 2047       : 5770     |                                        |
      2048 -> 4095       : 6686     |                                        |
      4096 -> 8191       : 2836     |                                        |
      8192 -> 16383      : 2143     |                                        |
     16384 -> 32767      : 60606    |                                        |
     32768 -> 65535      : 1275     |                                        |

Outliers can be spotted: in this case only 1275 I/Os fall in the highest latency bucket, whereas the majority of the I/Os complete in 64 to 511 usecs. Unlike biosnoop, which displays latency per I/O, biolatency can be used for a quick glance across the I/O performance of the system.

bitesize

bitesize is a great tool to look at the block sizes of each process's I/O. It displays, per process, the number of I/Os in each size bucket. One can also correlate the latency outcomes with different block sizes. It is helpful to know which processes are doing large or small I/Os: extreme I/O sizes could be a cause of performance issues, and this tool helps identify the processes generating such I/Os.

usage: ./bitesize

Fields displayed:

  1. Kbytes: I/O size range in Kb
  2. count: Number of I/Os within the Kbytes size bucket
  3. distribution: Bar chart to visualize the count
# /usr/share/bcc/tools/bitesize
Tracing block I/O... Hit Ctrl-C to end.
^C
Process Name = fio
Kbytes              : count     distribution
    0 -> 1          : 1        |                                        |
    2 -> 3          : 0        |                                        |
    4 -> 7          : 30144068 |****************************************|
    8 -> 15         : 3        |                                        |

Here the fio command is doing 30144068 I/Os with an I/O size between 4 and 7 Kbytes.

biolatpcts

This tool further breaks down the latency by percentile and by the type of I/O: it prints I/O latency percentiles for each kind of request. By default the latency measured is the time the I/O spent on the device. This can be modified with the from-rq-alloc and after-rq-alloc options, which measure the latency starting from the attempt to allocate the request or from the request allocation itself, respectively.

usage: biolatpcts.py [-h] [-i INTERVAL] [-w {from-rq-alloc,after-rq-alloc,on-device}] [-p PCT,...] [-j] [--verbose] DEV
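
For example, to measure latency from request allocation onwards and report only selected percentiles, the -w and -p options from the usage above can be combined:

# /usr/share/bcc/tools/biolatpcts -w from-rq-alloc -p 50,90,99 sda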

Fields displayed:

  1. DEV: Device name being traced
  2. read: Read latency for the respective percentile
  3. write: Write latency for the respective percentile
  4. discard: Discard latency for the respective percentile
  5. flush: Flush latency for the respective percentile
  6. Percentile columns: 1-100 percentile columns (can be customized)
# /usr/share/bcc/tools/biolatpcts sda
sda        1    5      10      16      25      50     75     84     90     95     99     100
read       5us  5us    465us   505us   535us   615us  78ms   86ms   94ms   98ms   100ms  150ms
write      5us  495us  515us   535us   555us   655us  2.5ms  3.5ms  5.5ms  8.5ms  12ms   150ms
discard    -    -      -       -       -        -     -      -      -      -      -      -
flush      -    -      -       -       -        -     -      -      -      -      -      -

From the above table, 100 percent of the read requests on disk sda take 150ms or less to complete, while 1 percent of the reads on sda take 5us or less.

The next two tools are not specific to the block layer but provide a deep look into the kernel (block layer included).

tplist

This tool displays the tracepoints and probes available within the kernel, including the ones that can be used for tracing the block layer.

usage: tplist.py [-h] [-p PID] [-l LIB] [-v] [filter]

The following prints the tracepoints within the block layer; these tracepoints can be used with other tools to gain further insight.

# /usr/share/bcc/tools/tplist block*
block:block_touch_buffer
block:block_dirty_buffer
block:block_rq_requeue
block:block_rq_complete
block:block_rq_insert
block:block_rq_issue
block:block_bio_bounce
block:block_bio_complete
block:block_bio_backmerge
block:block_bio_frontmerge
block:block_bio_queue
block:block_getrq
block:block_sleeprq
block:block_plug
block:block_unplug
block:block_split
block:block_bio_remap
block:block_rq_remap
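
Adding the -v option from the usage above prints each tracepoint's arguments, which is handy when building filters for the trace tool described next (for example, the args->sector field used below):

# /usr/share/bcc/tools/tplist -v 'block:block_rq_insert'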

trace

This tool traces the entry/return of functions and can print the stack trace (-K) leading up to the entry/return. It also supports an optional filter condition on the trace event and prints the requested parameter values as well. Multiple functions/probes can be traced at a time.

usage: trace [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [-v] [-Z STRING_SIZE] [-S] [-M MAX_EVENTS] [-t] [-T] [-K] [-U] [-a] [-I header] probe [probe ...]

Fields displayed:

  1. PID: Process ID
  2. TID: Thread ID
  3. COMM: Process name
  4. FUNC: function name/tracepoint being traced
  5. Last column displays the parameter values requested.

For instance, the following output displays the process and the sector number of the I/O.

# /usr/share/bcc/tools/trace 't:block:block_rq_insert "sector=%d", args->sector'
PID      TID      COMM   FUNC             -
1492319  1492319  fio    block_rq_insert  sector=80075364
1492319  1492319  fio    block_rq_insert  sector=80075365
1492319  1492319  fio    block_rq_insert  sector=80075366
1492319  1492319  fio    block_rq_insert  sector=80075367
1492319  1492319  fio    block_rq_insert  sector=80075368
1492319  1492319  fio    block_rq_insert  sector=80075369
1492319  1492319  fio    block_rq_insert  sector=80075370
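
To also see the kernel code path leading to each insert, the -K option from the usage above can be added:

# /usr/share/bcc/tools/trace -K 't:block:block_rq_insert'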

Conclusion

Be sure to check the --help output for each tool, as it displays various options which provide further insight into the block layer. Beyond understanding the block layer, the above tools act as a jumping-off point to create your own tools under bcc-tools. Bear in mind that bcc-tools also contains tools for other kernel areas, which are worth exploring.
