News, tips, partners, and perspectives for the Oracle Linux operating system and upstream Linux kernel work

  • January 14, 2020

Comparing Workload Performance

In this blog post, Oracle Linux performance engineer Jesse Gordon presents an alternate approach to comparing the performance of a workload when measured in two different scenarios.  This improves on the traditional "perf diff" method. The benefits of this approach are as follows:

  • ability to compare based on either inclusive time (time spent in a given method and all the methods it calls) or exclusive time (time spent only in a given method) fields in perf output
  • can be applied to any two experiments that have common function names
  • more readable output

Comparing Perf Output from Different Kernels

You’ve just updated your Oracle Linux kernel, or had it updated autonomously, and you notice that the performance of your key workload has changed.  How do you figure out what is responsible for the difference?

The basic tool for this task is the perf profiler [1], which can be used to generate traces of the workload on the two kernels.  Once you have the two perf outputs, the current Linux approach is to use "perf diff" [2] to compare the resulting traces.  The problem with this approach is that "perf diff" output is neither easy to read nor easy to use.  Here is an example:

#
# Baseline  Delta Abs  Shared Object                        Symbol                                       
# ........  .........  ...................................  ..............................................
#
               +3.38%  [unknown]                            [k] 0xfffffe0000006000
    29.46%     +0.98%  [kernel.kallsyms]                    [k] __fget
     8.42%     +0.91%  [kernel.kallsyms]                    [k] fput
               +0.88%  [kernel.kallsyms]                    [k] entry_SYSCALL_64_after_hwframe
               +0.68%  [kernel.kallsyms]                    [k] syscall_trace_enter
     2.98%     -0.67%  [kernel.kallsyms]                    [k] _raw_spin_lock
               +0.55%  [kernel.kallsyms]                    [k] do_syscall_64
     0.40%     -0.34%  syscall                              [.] main

In this blog, we outline an alternate approach which produces output that is easier to read and use.  Here is what the above output looks like using this approach:

Command              Symbol                         Before#  After#   Delta  
-------------------- ------------------------------ -------  -------  -------
syscall              __fget                           29.46    30.43     0.97
syscall              fput                              8.41     9.33     0.92
syscall              entry_SYSCALL_64_after_hwframe    0.00     0.88     0.88
syscall              syscall_trace_enter               0.00     0.68     0.68
syscall              _raw_spin_lock                    2.98     2.31    -0.67
syscall              do_syscall_64                     0.00     0.55     0.55
syscall              main                              0.40     0.06    -0.34

Furthermore, this alternate approach extends the comparison options, allowing one to compare based on any of the fields in the perf output report. 

In the remainder of this blog, we detail the steps involved in producing such output.  

Step 1: Generate the perf traces

Taking a trace involves running the workload while invoking perf.  In this article, we use the syscall workload from the UnixBench [3] suite; a typical sequence would be:

$ perf record -a -g -c 1000001 <PATH-TO>/byte-unixbench-master/UnixBench/Run syscall -i 1 -c 48

where:

  • -a asks perf to monitor all online CPUs;
  • -g asks perf to collect data so call graphs (stack traces) may be generated;
  • -c 1000001 asks perf to collect a sample once every 1000001 cycles.

Step 2: Post-process the trace data

Samples collected by perf record are saved into a binary file called, by default, perf.data. The "perf report" command reads this file and generates a concise execution profile. By default, samples are sorted by functions with the most samples first.  To post-process the perf.data file generated in step 1:

$ mv perf.data perf.data.KERNEL 
$ perf report -i perf.data.KERNEL -n > perf.report.KERNEL

Step 3: Compare the traces

To be able to compare the two traces, first ensure that they are in a common directory on the system.  So, we would have, for example, perf.report.KERNEL1 and perf.report.KERNEL2.  This is what one such trace profile looks like for UnixBench syscall:

#
# Samples: 1M of event 'cycles'
# Event count (approx.): 1476340476339
#
# Children      Self       Samples  Command          Shared Object                       Symbol                                              
# ........  ........  ............  ...............  ..................................  ....................................................
#
    98.60%     0.00%             0  syscall          [unknown]                           [.] 0x7564207325203a65                             
                    |
                    ---0x7564207325203a65
    85.91%     0.24%          3538  syscall          [kernel.kallsyms]                   [k] system_call_fastpath                           
                    |
                    ---system_call_fastpath
                       |         
                       |--60.76%-- __GI___libc_close
                       |          0x7564207325203a65
                       |         
                       |--37.72%-- __GI___dup
                       |          0x7564207325203a65
                       |         
                       |--1.30%-- __GI___umask
                       |          0x7564207325203a65
                        --0.21%-- [...]

Listing 1: example perf trace profile

The columns of interest shown are as follows:

  • Children -- the percent of time spent in this method and all the methods that it calls, also referred to as inclusive time
  • Self -- the percent of time spent in this method only, also referred to as exclusive time
  • Samples -- the number of trace samples that fell in this method only
  • Command -- the process name
  • Shared Object -- the library
  • Symbol -- the method (or function) name
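To make the column layout concrete, a profile line can be split into these six fields programmatically.  The following is a minimal sketch, not part of allfields.delta.py; the helper name parse_report_line is hypothetical, and exact field widths vary between perf versions:

```python
def parse_report_line(line):
    """Split one 'perf report -n' summary line into its six columns:
    Children%, Self%, Samples, Command, Shared Object, Symbol."""
    # maxsplit=5 keeps the "[k]/[.] symbol" remainder together
    parts = line.split(None, 5)
    children = float(parts[0].rstrip("%"))
    self_pct = float(parts[1].rstrip("%"))
    samples = int(parts[2])
    command = parts[3]
    shared_object = parts[4]
    # the remainder is the privilege marker ([k] or [.]) plus the symbol name
    marker, symbol = parts[5].split(None, 1)
    return children, self_pct, samples, command, shared_object, symbol.strip()

line = "    29.46%    29.46%        434902  syscall  [kernel.kallsyms]  [k] __fget"
print(parse_report_line(line))
```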

Now, we can use perf diff as follows:

$ perf diff perf.data.KERNEL1 perf.data.KERNEL2 > perf.diff.KERNEL1.vs.KERNEL2  

Here is what the resulting output looks like:

#
# Baseline  Delta Abs  Shared Object                        Symbol                                       
# ........  .........  ...................................  ..............................................
#
               +3.38%  [unknown]                            [k] 0xfffffe0000006000
    29.46%     +0.98%  [kernel.kallsyms]                    [k] __fget
     8.42%     +0.91%  [kernel.kallsyms]                    [k] fput
               +0.88%  [kernel.kallsyms]                    [k] entry_SYSCALL_64_after_hwframe
               +0.68%  [kernel.kallsyms]                    [k] syscall_trace_enter
     2.98%     -0.67%  [kernel.kallsyms]                    [k] _raw_spin_lock
               +0.55%  [kernel.kallsyms]                    [k] do_syscall_64
     0.40%     -0.34%  syscall                              [.] main

Listing 2: example perf diff profile trace

This output, by default, has done the comparison using the “Self” column, or time spent in just this one method.  This can be useful output, but is often insufficient as part of a performance analysis.  We next present an approach to comparing using the “Children” column, for time spent in this method and all its children. 

Step 4: Generate comparison using the “Children” column

To perform the comparison, we first extract all of the lines that have entries in all six columns, i.e., lines where all fields are present.  These are the lines at the top of each call graph.

You can find the allfields.delta.py program that we use to render these results on GitHub at https://github.com/oracle/linux-blog-sample-code/tree/comparing-workload-performance/allfields.delta.py

$ grep "\[" perf.report.DESCRIPTOR | grep -v "|" | grep -v "\-\-" > perf.report.DESCRIPTOR.allfields

The output of this command looks as follows:

    98.60%     0.00%             0  syscall          [unknown]                           [.] 0x7564207325203a65
    85.91%     0.24%          3538  syscall          [kernel.kallsyms]                   [k] system_call_fastpath                           
    55.20%     0.16%          2403  syscall          libc-2.17.so                        [.] __GI___libc_close                              
    52.27%     0.14%          2020  syscall          [kernel.kallsyms]                   [k] sys_close                                       
    52.11%     0.08%          1207  syscall          [kernel.kallsyms]                   [k] __close_fd                                     
    50.39%    21.98%        324434  syscall          [kernel.kallsyms]                   [k] filp_close                                     
    35.44%     0.13%          1958  syscall          libc-2.17.so                        [.] __GI___dup                                     
    32.39%     0.15%          2181  syscall          [kernel.kallsyms]                   [k] sys_dup                                        
    29.46%    29.46%        434902  syscall          [kernel.kallsyms]                   [k] __fget                                         
    19.92%    19.92%        294070  syscall          [kernel.kallsyms]                   [k] dnotify_flush             

Listing 3: perf output showing lines with all fields present
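The grep pipeline above keeps lines that name a shared object (they contain a "[") while discarding the call-graph scaffolding (lines containing "|" or "--").  The same filtering could be done in Python; this is an illustrative sketch, and keep_allfields is a hypothetical helper, not taken from allfields.delta.py:

```python
def keep_allfields(lines):
    r"""Replicate the shell pipeline
        grep "\[" | grep -v "|" | grep -v "\-\-"
    i.e. keep only the summary lines where all six fields are present."""
    return [l for l in lines if "[" in l and "|" not in l and "--" not in l]

report = [
    "    85.91%     0.24%          3538  syscall  [kernel.kallsyms]  [k] system_call_fastpath",
    "                    |",
    "                    ---system_call_fastpath",
    "                       |--60.76%-- __GI___libc_close",
    "    29.46%    29.46%        434902  syscall  [kernel.kallsyms]  [k] __fget",
]
for line in keep_allfields(report):
    print(line)
```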

Now we compare two "allfields" files using a Python script, which reads the two files and compares lines for which the combination of Shared Object + Symbol is the same.  The script allows the user to compare based on any of the three leftmost columns (children, self, or samples) and is run as follows:

$ allfields.delta.py -b perf.report.KERNEL1.allfields -a perf.report.KERNEL2.allfields -d children > allfields.delta.children.KERNEL1.vs.KERNEL2
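The core of the comparison can be sketched as follows.  This is a simplified stand-in for allfields.delta.py (the real script is at the GitHub link above), shown here only to illustrate the keying on Shared Object + Symbol and the treatment of symbols missing from one file as 0.00:

```python
def load_allfields(lines):
    """Build {(shared_object, symbol): children%} from 'allfields' lines."""
    table = {}
    for line in lines:
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # skip anything that is not a full six-field line
        children = float(parts[0].rstrip("%"))
        shared_object = parts[4]
        symbol = parts[5].split(None, 1)[1].strip()  # drop the [k]/[.] marker
        table[(shared_object, symbol)] = children
    return table

def children_delta(before, after):
    """One row per SharedObject+Symbol seen in either file; absent == 0.00."""
    rows = []
    for key in set(before) | set(after):
        b, a = before.get(key, 0.0), after.get(key, 0.0)
        rows.append((key[1], b, a, round(a - b, 2)))
    rows.sort(key=lambda r: abs(r[3]), reverse=True)  # biggest movers first
    return rows
```

The real script additionally supports selecting the self or samples column via -d and prints the formatted Before#/After#/Delta report shown below.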

For the UnixBench syscall workload, comparing two distinct kernels, the output would look like this:

perf report allfields delta report
   before file name == perf.report.KERNEL1.allfields
   after  file name == perf.report.KERNEL2.allfields
   delta  type      == children
 
Command              Symbol                         Before#  After#   Delta 
-------------------- ------------------------------ -------  -------  -------
syscall              0x7564207325203a65               98.60    99.81     1.21
syscall              system_call_fastpath             85.91     0.00   -85.91
syscall              __GI___libc_close                55.20    56.73     1.53
syscall              sys_close                        52.27    53.69     1.42
syscall              __close_fd                       52.11    53.62     1.51
[...]

Listing 4: example output from script, comparing using "children" field

Lastly, we can sort this output to highlight the largest differences in each direction, as follows:

$ sort -rn -k 5 allfields.delta.children.KERNEL1.vs.KERNEL2 | less

where the head of the file shows those methods where more time was spent in KERNEL2 and the tail of the file shows those methods where more time was spent in KERNEL1:

syscall              entry_SYSCALL_64_after_hwframe    0.00    92.18    92.18
syscall              do_syscall_64                     0.00    91.07    91.07
syscall              filp_close                       50.39    52.70     2.31
syscall              syscall_slow_exit_work            0.00     1.67     1.67
syscall              __GI___libc_close                55.20    56.73     1.53
[...]
syscall              tracesys                          1.18     0.00    -1.18
syscall              syscall_trace_leave               1.70     0.00    -1.70
syscall              int_very_careful                  1.83     0.00    -1.83
syscall              system_call_after_swapgs          2.42     0.00    -2.42
syscall              system_call                       3.47     0.00    -3.47
syscall              system_call_fastpath             85.91     0.00   -85.91

Listing 5: sorted output of script
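Equivalently, the ranking can be done in Python instead of sort(1).  A sketch over rows taken from Listing 5, assuming each row has been parsed into a (symbol, before, after, delta) tuple:

```python
# (symbol, before%, after%, delta) rows taken from Listing 5 above.
rows = [
    ("entry_SYSCALL_64_after_hwframe",  0.00, 92.18,  92.18),
    ("filp_close",                     50.39, 52.70,   2.31),
    ("system_call",                     3.47,  0.00,  -3.47),
    ("system_call_fastpath",           85.91,  0.00, -85.91),
]

# Descending by delta, like `sort -rn -k 5`: the head now holds the methods
# that grew on KERNEL2, the tail the methods that grew on KERNEL1.
rows.sort(key=lambda r: r[3], reverse=True)
print(rows[0][0])   # -> entry_SYSCALL_64_after_hwframe
print(rows[-1][0])  # -> system_call_fastpath
```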

We may now use these top and bottom methods as starting points for root-causing the performance differences observed when executing the workload on the two kernels.

Summary

We have presented an alternate approach to comparing the performance of a workload when measured in two different scenarios.  This method can be applied to any two experiments that have common function names.  The benefits of this approach are as follows:

  • ability to compare based on either inclusive time (Children) or exclusive time (Self) fields in perf output
  • more readable output

Please try it out.
