Interested in diving into block layer performance? Developers frequently look for details about the ongoing I/O load, such as:

  • Processes initiating R/W requests
  • Size of the requests
  • I/O latency
  • Disks inundated with requests
  • Sector reads/writes that take the most time
  • I/O time spent in the queue

This article focuses on how bcc-tools can be leveraged for block layer analysis (bcc here stands for BPF Compiler Collection). These tools provide useful insight into the system without any coding: they are off-the-shelf utilities for identifying performance issues. Why reinvent the wheel? Though if one does need to invent, BCC provides a simple interface for creating custom tooling to analyse the kernel. These tools also provide a great reference and starting point for your learning/debugging experience.
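
As a taste of that interface, the following is a minimal sketch of a custom tool built on BCC's Python bindings. This is an illustrative example, not one of the shipped tools: it counts block I/O requests per process using the block:block_rq_issue tracepoint, and the PID captured at issue time may be a kernel worker rather than the originating process (one reason the shipped tools are more elaborate).

#!/usr/bin/env python
# Illustrative sketch only: count block I/O requests issued, per PID.
from bcc import BPF
import time

b = BPF(text="""
BPF_HASH(counts, u32, u64);

TRACEPOINT_PROBE(block, block_rq_issue) {
    // PID of the task on-CPU when the request is issued; this may be
    // a kernel worker rather than the process that created the I/O.
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
""")

print("Counting block I/O issues per PID... Hit Ctrl-C to end.")
try:
    time.sleep(99999999)
except KeyboardInterrupt:
    pass

# Print the per-PID totals, busiest first.
for pid, count in sorted(b["counts"].items(),
                         key=lambda kv: kv[1].value, reverse=True):
    print("PID %-8d I/Os %d" % (pid.value, count.value))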

Tools:

To access and start playing with the tools, just install the packages bcc-tools and kernel-uek-devel. Once installed, the scripts can be found at /usr/share/bcc/tools. The following sections describe some of these tools, how they can help in block layer diagnosability, and how they accelerate the understanding of system performance. We will use the fio tool to generate the I/O (an example invocation follows the list below) and see how the following tools will help in achieving our goals.

  • biotop
  • biosnoop
  • biolatency
  • bitesize
  • biolatpcts
  • tplist
  • trace
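
The I/O in the examples below was generated with fio. A hypothetical invocation along the following lines (the device path and all parameters here are purely illustrative) produces the kind of small random read/write load seen in the outputs later in this article:

# fio --name=randrw --filename=/dev/dm-3 --direct=1 --rw=randrw --bs=4k --iodepth=16 --numjobs=4 --runtime=60 --time_based --group_reporting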

biotop

biotop gives us an overview of which processes are generating I/O on the system, so we will use it first to examine the system at a high level.

It is similar to the top command, but for disks, with a default display interval of one second. It displays the processes performing I/O, sorted by the size of the I/O (Kbytes column), and acts like a snapshot of the processes doing the largest read/write operations along with their average latency.

usage: biotop.py [-h] [-C] [-r MAXROWS] [interval] [count]
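
For example, using the interval and count arguments from the usage above, the following prints ten one-second summaries without clearing the screen (-C, as also used below):

# /usr/share/bcc/tools/biotop -C 1 10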

The first line contains the time and load average from the file /proc/loadavg.

Fields displayed:

  1. PID: Process Id
  2. COMM: Process name
  3. D: Operation – Read/Write
  4. MAJ: Major number of disk
  5. MIN: Minor number of disk
  6. DISK: I/O to this disk
  7. I/O: The number of I/Os done in the current interval
  8. Kbytes: The total Kbytes completed
  9. AVGms: Average I/O time in milliseconds

biotop generates the following output, which tells us that the current fio processes are reading and writing from/to a couple of dm disks with average latencies of roughly 1000ms and above. This output alone is enough to signal whether our system is slow in processing I/Os.

# /usr/share/bcc/tools/biotop -C
Tracing... Output every 1 secs. Hit Ctrl-C to end
12:55:01 loadavg: 0.11 0.28 0.36 6/511 11790

PID     COMM            D    MAJ   MIN   DISK     I/O   Kbytes   AVGms
31917   fio             W    252   3     dm-3     13      52     1843.11
31917   fio             R    252   3     dm-3     12      48     869.73
31915   fio             W    252   3     dm-3     11      44     1124.25
1610    kworker/19:1H   R    131   240   sdgj     11      44     3288.80
31917   fio             W    252   42    dm-42    9       36     946.63
31914   fio             W    252   3     dm-3     9       36     1562.52
31915   fio             R    252   3     dm-3     8       32     1718.64
31914   fio             R    252   44    dm-44    7       28     2633.36

biosnoop

Next, let us look at latency at the level of individual I/Os. biosnoop traces and prints one line for each I/O request, showing the latency between the request being issued and its completion. The -Q option additionally prints the time each request spent in the block queue.

usage: biosnoop.py [-h] [-Q]

Fields displayed:

  1. T: Operation: Read/Write
  2. LAT: Latency in milliseconds from the request being issued to the device until its completion
  3. QUE: Time spent in milliseconds in kernel queue
# /usr/share/bcc/tools/biosnoop -Q
TIME(s)     COMM  PID    DISK    T  SECTOR     BYTES     QUE(ms)     LAT(ms)
103.809904  fio   31916  dm-44   W  35688      4096      0.01        21964.53
122.904032  fio   31917  sdhr    R  2385000    4096      0.00        20113.50
122.904040  fio   31917  dm-3    R  2385000    4096      0.00        20113.51
122.904144  fio   31916  sdas    R  241776     4096      0.00        20173.90
122.904159  fio   31916  dm-42   R  241776     4096      0.00        20173.92
122.904280  fio   31915  sdas    R  2034336    4096      0.00        20173.72
122.904288  fio   31915  dm-42   R  2034336    4096      0.00        20173.74
122.947834  fio   31917  sdjp    R  25591480   4096      31170.24    8898.60
122.947840  fio   31917  dm-3    R  25591480   4096      0.00        40068.85

Outliers, such as the I/Os that take the longest, can also be filtered out; these are generally the cause of performance concerns.
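
For example, the biosnoop -Q output can first be captured to a file and then filtered on the last column (LAT):

# /usr/share/bcc/tools/biosnoop -Q > biosnoop.out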

awk '$NF > 20000' biosnoop.out

The above output shows the fio R/W operations that are taking more than 20 seconds, while for most of them the time spent in the block queue is negligible. Such latency is considered huge for any I/O and points to performance issues below the block layer.

biolatency

biolatency is another highly valuable tool for when process information is not necessary and one merely wants to check the time taken by I/Os, irrespective of their details. Since it supports per-disk latencies (the -D option), it offers insight into which disk might have performance issues. It prints a histogram of I/O latencies, with the number of I/Os displayed for each latency bucket.

usage: biolatency.py [-h] [-T] [-Q] [-m] [-D] [-F] [-j] [interval] [count]
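
For example, combining the -D and -m options from the usage above prints a single ten-second per-disk summary in milliseconds:

# /usr/share/bcc/tools/biolatency -D -m 10 1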

Fields displayed:

  1. usecs: I/O latency bucket in microseconds
  2. count: Number of I/Os within the latency bucket
  3. distribution: Bar chart to visualize the count
# /usr/share/bcc/tools/biolatency
Tracing block device I/O... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 1        |                                        |
        64 -> 127        : 9391913  |****************************            |
       128 -> 255        : 13087567 |****************************************|
       256 -> 511        : 2535900  |*******                                 |
       512 -> 1023       : 155152   |                                        |
      1024 -> 2047       : 5770     |                                        |
      2048 -> 4095       : 6686     |                                        |
      4096 -> 8191       : 2836     |                                        |
      8192 -> 16383      : 2143     |                                        |
     16384 -> 32767      : 60606    |                                        |
     32768 -> 65535      : 1275     |                                        |

Outliers can be spotted: in this case only 1275 I/Os fall in the highest latency bucket, whereas the majority of the I/Os complete in 64 to 511 usecs. Unlike biosnoop, which displays latency per I/O, biolatency can be used for a quick glance across the I/O performance of the system.

bitesize

bitesize is a great tool to look at the block sizes of each process's I/O. It displays, per process, the number of I/Os in each size bucket. One can also correlate the latency outcomes with different block sizes. It is helpful to know which processes are doing large or small I/Os: extreme I/O sizes could be a cause of performance issues, and this tool helps identify the processes generating such I/Os.

usage: ./bitesize

Fields displayed:

  1. Kbytes: I/O size range in Kb
  2. count: Number of I/Os within the Kbytes size bucket
  3. distribution: Bar chart to visualize the count
# /usr/share/bcc/tools/bitesize
Tracing block I/O... Hit Ctrl-C to end.
^C
Process Name = fio
Kbytes              : count     distribution
    0 -> 1          : 1        |                                        |
    2 -> 3          : 0        |                                        |
    4 -> 7          : 30144068 |****************************************|
    8 -> 15         : 3        |                                        |

Here the fio command is doing 30144068 I/Os with an I/O size between 4 and 7 Kbytes.

biolatpcts

This tool further breaks down the latency by percentile and by the type of I/O: it prints I/O latency percentiles for each kind of request. By default the latency measured is the time the I/O spent on the device. This can be modified with the from-rq-alloc and after-rq-alloc options, which measure the latency starting from the attempt to allocate the request or from the request allocation itself, respectively.

usage: biolatpcts.py [-h] [-i INTERVAL] [-w {from-rq-alloc,after-rq-alloc,on-device}] [-p PCT,...] [-j] [--verbose] DEV
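
For example, to measure latency from request allocation onwards and report only selected percentiles, the -w and -p options from the usage above can be combined:

# /usr/share/bcc/tools/biolatpcts -w from-rq-alloc -p 50,90,99 sda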

Fields displayed:

  1. DEV: Device name being traced
  2. read: Read latency for the respective percentile
  3. write: Write latency for the respective percentile
  4. discard: Discard latency for the respective percentile
  5. flush: Flush latency for the respective percentile
  6. Percentile columns: 1-100 percentile columns (can be customized)
# /usr/share/bcc/tools/biolatpcts sda
sda        1    5      10      16      25      50     75     84     90     95     99     100
read       5us  5us    465us   505us   535us   615us  78ms   86ms   94ms   98ms   100ms  150ms
write      5us  495us  515us   535us   555us   655us  2.5ms  3.5ms  5.5ms  8.5ms  12ms   150ms
discard    -    -      -       -       -        -     -      -      -      -      -      -
flush      -    -      -       -       -        -     -      -      -      -      -      -

From the above table, 100 percent of the read requests on disk sda take 150ms or less to complete, while 1 percent of the reads on sda take 5us or less.

The next two tools are not specific to the block layer but provide a deep look into the kernel (block layer included).

tplist

This tool displays the tracepoints and probes available within the kernel, including the ones that can be used for tracing the block layer.

usage: tplist.py [-h] [-p PID] [-l LIB] [-v] [filter]

The following prints the tracepoints within the block layer; these tracepoints can be used with other tools to gain further insight.

# /usr/share/bcc/tools/tplist block*
block:block_touch_buffer
block:block_dirty_buffer
block:block_rq_requeue
block:block_rq_complete
block:block_rq_insert
block:block_rq_issue
block:block_bio_bounce
block:block_bio_complete
block:block_bio_backmerge
block:block_bio_frontmerge
block:block_bio_queue
block:block_getrq
block:block_sleeprq
block:block_plug
block:block_unplug
block:block_split
block:block_bio_remap
block:block_rq_remap
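
Adding the -v option from the usage above prints each tracepoint's arguments, which is handy when building filters for the trace tool described next (for example, the args->sector field used below):

# /usr/share/bcc/tools/tplist -v 'block:block_rq_insert'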

trace

This tool traces the entry/return of functions and can print the stack trace (-K) leading up to the entry/return. It also supports an optional filter condition on the trace event and prints the requested parameter values as well. Multiple functions/probes can be traced at a time.

usage: trace [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [-v] [-Z STRING_SIZE] [-S] [-M MAX_EVENTS] [-t] [-T] [-K] [-U] [-a] [-I header] probe [probe ...]

Fields displayed:

  1. PID: Process ID
  2. TID: Thread ID
  3. COMM: Process name
  4. FUNC: function name/tracepoint being traced
  5. Last column displays the parameter values requested.

For instance, the following output displays the process and the sector number of the I/O.

# /usr/share/bcc/tools/trace 't:block:block_rq_insert "sector=%d", args->sector'
PID      TID      COMM   FUNC             -
1492319  1492319  fio    block_rq_insert  sector=80075364
1492319  1492319  fio    block_rq_insert  sector=80075365
1492319  1492319  fio    block_rq_insert  sector=80075366
1492319  1492319  fio    block_rq_insert  sector=80075367
1492319  1492319  fio    block_rq_insert  sector=80075368
1492319  1492319  fio    block_rq_insert  sector=80075369
1492319  1492319  fio    block_rq_insert  sector=80075370
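
To also see the kernel code path leading to each insert, the -K option from the usage above can be added:

# /usr/share/bcc/tools/trace -K 't:block:block_rq_insert'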

Conclusion

Be sure to check the --help output for each tool, as it displays various options which provide further insight into the block layer. Beyond understanding the block layer, the above tools act as a jumping-off point to create your own tools under bcc-tools. Bear in mind that bcc-tools also contains tools for other kernel areas, which are worth exploring.
