In this blog post, Oracle Linux performance engineer Jesse Gordon presents an alternate approach to comparing the performance of a workload when measured in two different scenarios. This improves on the traditional “perf diff” method. The benefits of this approach are as follows:
- ability to compare based on either inclusive time (time spent in a given method and all the methods it calls) or exclusive time (time spent only in a given method) fields in perf output
- can be applied to any two experiments that have common function names
- more readable output
Comparing Perf Output from Different Kernels
You’ve just updated your Oracle Linux kernel (or had it updated autonomously) and you notice that the performance of your key workload has changed. How do you figure out what is responsible for the difference?
The basic tool for this task is the perf profiler [1], which can be used to generate traces of the workload on the two kernels. Once you have the two perf outputs, the current Linux approach is to use “perf diff” [2] to compare the resulting traces. The problem with this approach is that “perf diff” output is neither easy to read nor easy to use. Here is an example:
#
# Baseline  Delta Abs  Shared Object                        Symbol                                       
# ........  .........  ...................................  ..............................................
#
               +3.38%  [unknown]                            [k] 0xfffffe0000006000
    29.46%     +0.98%  [kernel.kallsyms]                    [k] __fget
     8.42%     +0.91%  [kernel.kallsyms]                    [k] fput
               +0.88%  [kernel.kallsyms]                    [k] entry_SYSCALL_64_after_hwframe
               +0.68%  [kernel.kallsyms]                    [k] syscall_trace_enter
     2.98%     -0.67%  [kernel.kallsyms]                    [k] _raw_spin_lock
               +0.55%  [kernel.kallsyms]                    [k] do_syscall_64
     0.40%     -0.34%  syscall                              [.] main 
In this blog, we outline an alternate approach that produces output that is easier to read and to use. Here is what the above comparison looks like using this approach:
Command              Symbol                         Before#  After#   Delta
-------------------- ------------------------------ ------- ------- -------
syscall              __fget                           29.46   30.43    0.97
syscall              fput                              8.41    9.33    0.92
syscall              entry_SYSCALL_64_after_hwframe    0.00    0.88    0.88
syscall              syscall_trace_enter               0.00    0.68    0.68
syscall              _raw_spin_lock                    2.98    2.31   -0.67
syscall              do_syscall_64                     0.00    0.55    0.55
syscall              main                              0.40    0.06   -0.34
Furthermore, this alternate approach extends the comparison options, allowing one to compare based on any of the numeric fields in the perf report output: Children, Self, or Samples.
In the remainder of this blog, we detail the steps involved in producing such output.
Step 1: Generate the perf traces
Taking a trace involves running the workload while invoking perf. In this article, we chose the syscall workload from the UnixBench [3] suite; a typical sequence would be:
$ perf record -a -g -c 1000001 <PATH-TO>/byte-unixbench-master/UnixBench/Run syscall -i 1 -c 48
where:
- -a asks perf to monitor all online CPUs;
- -g asks perf to collect data so call graphs (stack traces) may be generated;
- -c 1000001 asks perf to collect a sample once every 1000001 cycles.
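If you want a quick check that samples were actually collected before moving on, you can summarize the resulting perf.data file; for example:
$ perf report -i perf.data --stdio | head -20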
Step 2: Post-process the trace data
Samples collected by perf record are saved in a binary file named, by default, perf.data. The “perf report” command reads this file and generates a concise execution profile, with the functions that accumulated the most samples listed first. To post-process the perf.data file generated in step 1:
$ mv perf.data perf.data.KERNEL
$ perf report -i perf.data.KERNEL -n > perf.report.KERNEL
Here, -n adds the Samples column to the report, and renaming perf.data ensures that the data collected on the first kernel is not overwritten when the workload is rerun on the second kernel.
Step 3: Compare the traces
To compare the two traces, first place the two reports in a common directory on the system; so we would have, for example, perf.report.KERNEL1 and perf.report.KERNEL2. This is what one such trace profile looks like for UnixBench syscall:
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 1476340476339
#
# Children      Self       Samples  Command          Shared Object                       Symbol                                              
# ........  ........  ............  ...............  ..................................  ....................................................
#
    98.60%     0.00%             0  syscall          [unknown]                           [.] 0x7564207325203a65                             
                    |
                    ---0x7564207325203a65
    85.91%     0.24%          3538  syscall          [kernel.kallsyms]                   [k] system_call_fastpath                           
                    |
                    ---system_call_fastpath
                       |         
                       |--60.76%-- __GI___libc_close
                       |          0x7564207325203a65
                       |         
                       |--37.72%-- __GI___dup
                       |          0x7564207325203a65
                       |         
                       |--1.30%-- __GI___umask
                       |          0x7564207325203a65
                        --0.21%-- [...] 
Listing 1: example perf trace profile
The columns of interest shown are as follows:
- Children — the percent of time spent in this method and all the methods that it calls, also referred to as inclusive time
- Self — the percent of time spent in this method only, also referred to as exclusive time
- Samples — the number of trace samples that fell in this method only
- Command — the process name
- Shared Object — the library
- Symbol — the method (or function) name
For example, in Listing 1 system_call_fastpath shows 85.91% in the Children column but only 0.24% in the Self column: almost all of its time is spent in the functions it calls (such as __GI___libc_close), not in its own code.
Now, we can use perf diff as follows:
$ perf diff perf.data.KERNEL1 perf.data.KERNEL2 > perf.diff.KERNEL1.vs.KERNEL2
Here is what the resulting output looks like:
#
# Baseline  Delta Abs  Shared Object                        Symbol                                       
# ........  .........  ...................................  ..............................................
#
               +3.38%  [unknown]                            [k] 0xfffffe0000006000
    29.46%     +0.98%  [kernel.kallsyms]                    [k] __fget
     8.42%     +0.91%  [kernel.kallsyms]                    [k] fput
               +0.88%  [kernel.kallsyms]                    [k] entry_SYSCALL_64_after_hwframe
               +0.68%  [kernel.kallsyms]                    [k] syscall_trace_enter
     2.98%     -0.67%  [kernel.kallsyms]                    [k] _raw_spin_lock
               +0.55%  [kernel.kallsyms]                    [k] do_syscall_64
     0.40%     -0.34%  syscall                              [.] main 
Listing 2: example perf diff profile trace
By default, this output compares using the “Self” column, that is, the time spent in just this one method. This can be useful, but it is often insufficient as part of a performance analysis. We next present an approach to comparing using the “Children” column, i.e., the time spent in a method and all of its children.
Step 4: Generate comparison using the “Children” column
To perform the comparison, we first extract all of the lines that have entries in all six columns, i.e., the lines at the top of each call graph, where every field is present.
You can find the allfields.delta.py program that we use to render these results on GitHub at https://github.com/oracle/linux-blog-sample-code/tree/comparing-workload-performance/allfields.delta.py
$ grep "\[" perf.report.DESCRIPTOR | grep -v "|" | grep -v "\-\-" > perf.report.DESCRIPTOR.allfields
where DESCRIPTOR is KERNEL1 or KERNEL2 in our example. The extracted lines look as follows:
    98.60%     0.00%             0  syscall          [unknown]                           [.] 0x7564207325203a65
    85.91%     0.24%          3538  syscall          [kernel.kallsyms]                   [k] system_call_fastpath                           
    55.20%     0.16%          2403  syscall          libc-2.17.so                        [.] __GI___libc_close                              
    52.27%     0.14%          2020  syscall          [kernel.kallsyms]                   [k] sys_close                                       
    52.11%     0.08%          1207  syscall          [kernel.kallsyms]                   [k] __close_fd                                     
    50.39%    21.98%        324434  syscall          [kernel.kallsyms]                   [k] filp_close                                     
    35.44%     0.13%          1958  syscall          libc-2.17.so                        [.] __GI___dup                                     
    32.39%     0.15%          2181  syscall          [kernel.kallsyms]                   [k] sys_dup                                        
    29.46%    29.46%        434902  syscall          [kernel.kallsyms]                   [k] __fget                                         
    19.92%    19.92%        294070  syscall          [kernel.kallsyms]                   [k] dnotify_flush              
Listing 3: perf output showing lines with all fields present
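As an aside, if you would rather do this extraction in Python than with the grep chain above, a minimal sketch (illustrative only, and not part of the published allfields.delta.py; the script name and the keep_line test are assumptions) might look like this:
#!/usr/bin/env python3
# extract_allfields.py -- illustrative sketch only, not the published allfields.delta.py.
# Keeps the report lines that carry all six columns:
#   Children%  Self%  Samples  Command  Shared Object  Symbol
import sys

def keep_line(line):
    # Drop comment/header lines and call-graph continuation lines.
    if line.startswith("#") or "|" in line or "--" in line:
        return False
    fields = line.split()
    # An "all fields" line starts with two percentages and a sample count.
    return (len(fields) >= 6
            and fields[0].endswith("%")
            and fields[1].endswith("%")
            and fields[2].isdigit())

if __name__ == "__main__":
    # Usage: ./extract_allfields.py perf.report.KERNEL1 > perf.report.KERNEL1.allfields
    with open(sys.argv[1]) as report:
        for line in report:
            if keep_line(line):
                sys.stdout.write(line)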
Now we compare the two “allfields” files using a Python script that reads them in and compares lines for which the combination of Shared Object + Symbol is the same. This script allows the user to compare based on any of the three left-hand columns (Children, Self, or Samples) and would be run as follows:
$ allfields.delta.py -b perf.report.KERNEL1.allfields -a perf.report.KERNEL2.allfields -d children > allfields.delta.children.KERNEL1.vs.KERNEL2
For the UnixBench syscall workload, comparing two distinct kernels, the output would look like this:
perf report allfields delta report
before file name == perf.report.KERNEL1.allfields
after file name == perf.report.KERNEL2.allfields
delta type == children

Command              Symbol                         Before#  After#   Delta
-------------------- ------------------------------ ------- ------- -------
syscall              0x7564207325203a65               98.60   99.81    1.21
syscall              system_call_fastpath             85.91    0.00  -85.91
syscall              __GI___libc_close                55.20   56.73    1.53
syscall              sys_close                        52.27   53.69    1.42
syscall              __close_fd                       52.11   53.62    1.51
[...]
Listing 4: example output from script, comparing using “children” field
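For readers who want to see roughly how such a comparison can be implemented, here is a minimal Python sketch in the spirit of allfields.delta.py. It is illustrative only: the parsing, option handling, and output formatting below are simplified assumptions, and the script published on GitHub (linked above) remains the authoritative version.
#!/usr/bin/env python3
# delta_sketch.py -- illustrative sketch of an allfields.delta.py-style comparison;
# not the published script (see the GitHub link above for that).
import argparse

# Index of each comparable value in the tuples stored below
# (index 0 holds the command name and is used only for display).
COLUMNS = {"children": 1, "self": 2, "samples": 3}

def load_allfields(path):
    """Read an allfields file into {(shared_object, symbol): (command, children, self, samples)}."""
    table = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 6:
                continue
            children = float(fields[0].rstrip("%"))
            self_pct = float(fields[1].rstrip("%"))
            samples = float(fields[2])
            command = fields[3]
            shared_object = fields[4]
            symbol = fields[-1]  # simplification: assumes the symbol contains no spaces
            table[(shared_object, symbol)] = (command, children, self_pct, samples)
    return table

def main():
    parser = argparse.ArgumentParser(description="compare two perf 'allfields' files")
    parser.add_argument("-b", "--before", required=True)
    parser.add_argument("-a", "--after", required=True)
    parser.add_argument("-d", "--delta", choices=COLUMNS, default="children")
    args = parser.parse_args()

    before = load_allfields(args.before)
    after = load_allfields(args.after)
    col = COLUMNS[args.delta]

    print("%-20s %-30s %7s %7s %7s" % ("Command", "Symbol", "Before#", "After#", "Delta"))
    print("%-20s %-30s %7s %7s %7s" % ("-" * 20, "-" * 30, "-" * 7, "-" * 7, "-" * 7))
    # Report every (shared object, symbol) pair seen in either file; a pair
    # missing from one file is shown as 0.00 for that file.
    for key in sorted(set(before) | set(after)):
        b_entry = before.get(key)
        a_entry = after.get(key)
        command = (b_entry or a_entry)[0]
        b = b_entry[col] if b_entry else 0.0
        a = a_entry[col] if a_entry else 0.0
        print("%-20s %-30s %7.2f %7.2f %7.2f" % (command, key[1], b, a, a - b))

if __name__ == "__main__":
    main()
Matching on Shared Object + Symbol means that a function present in only one kernel (such as system_call_fastpath above) still appears in the output, with 0.00 in the column for the kernel that lacks it.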
Lastly, we can sort this output to highlight the largest differences in each direction, as follows:
$ sort -rn -k 5 allfields.delta.children.KERNEL1.vs.KERNEL2 | less
where the head of the file shows the methods where more time was spent in KERNEL2 (the “after” kernel) and the tail of the file shows the methods where more time was spent in KERNEL1 (the “before” kernel):
syscall              entry_SYSCALL_64_after_hwframe    0.00   92.18   92.18
syscall              do_syscall_64                      0.00   91.07   91.07
syscall              filp_close                        50.39   52.70    2.31
syscall              syscall_slow_exit_work             0.00    1.67    1.67
syscall              __GI___libc_close                 55.20   56.73    1.53
[...]
syscall              tracesys                           1.18    0.00   -1.18
syscall              syscall_trace_leave                1.70    0.00   -1.70
syscall              int_very_careful                   1.83    0.00   -1.83
syscall              system_call_after_swapgs           2.42    0.00   -2.42
syscall              system_call                        3.47    0.00   -3.47
syscall              system_call_fastpath              85.91    0.00  -85.91
Listing 5: sorted output of script
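If you prefer to capture just the extremes rather than paging through the whole file, something like the following also works:
$ sort -rn -k 5 allfields.delta.children.KERNEL1.vs.KERNEL2 | head -10
$ sort -rn -k 5 allfields.delta.children.KERNEL1.vs.KERNEL2 | tail -10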
We can now use the methods at the top and bottom of this sorted list as starting points for root-causing the performance differences observed when executing the workload on the two kernels.
Summary
We have presented an alternate approach to comparing the performance of a workload when measured in two different scenarios. This method can be applied to any two experiments that have common function names. The benefits of this approach are as follows:
- ability to compare based on either inclusive time (Children) or exclusive time (Self) fields in perf output
- more readable output
Please try it out.
