gprofng: Java Profiling

January 19, 2024 | 14 minute read

This blog entry was contributed by: Ruud van der Pas, Kurt Goebel, Vladimir Mezentsev. They work in the Oracle Linux Toolchain Team and are involved with gprofng on a daily basis.


Introduction

The GNU gprofng profiling tool can be used to profile applications written in C, C++, Java, Scala, and Fortran. Although many of the features are the same across these languages, this blog specifically focuses on how to make a profile of a Java application.

To begin, the steps needed to get a first function-level profile on the screen are covered. Next, some additional features of gprofng are demonstrated. In particular, how to find the most time-consuming source code lines and the most expensive instructions in the application. The blog concludes with an example in which two profiles are compared.

No prior knowledge of, or experience with, gprofng is needed. Every command and feature used will be explained.

Note that although gprofng often uses the word "function" in its terminology, the features that apply to functions are equally applicable to Java methods.

The requirements

It’s assumed that gprofng has been installed on the system used. There are different ways to install gprofng. For Oracle Linux there are RPMs, but the code can also be built from a tar file, or from the GNU binutils git repo. For example, this blog includes a video with installation instructions for Oracle Linux, as well as links to more documentation.

As this blog focuses on profiling Java applications, it’s assumed that Java has been installed. The minimum Java version required to use gprofng is Java 8.

The example program

The program used in testing is called MxV. It is a Java application that implements the matrix-matrix multiplication algorithm. This blog will not go into the details of this program, but will demonstrate how gprofng can be used to provide insights into the execution of this application and identify potential opportunities to improve the performance.

This test program can be downloaded here.

This program takes several options, but the examples here only set the matrix dimensions m and n.
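Before profiling, it helps to have a mental model of the kernel. Below is a minimal, self-contained sketch of such a multiplication kernel, not the actual MxV source: the method names oneCell and multiplyAdd mirror those that show up in the profiles, while the multiply driver loop and the data setup are assumptions made for illustration.

```java
// Minimal sketch of the multiplication kernel; NOT the real MxV source.
// The oneCell/multiplyAdd names mirror the methods seen in the profiles;
// the rest is an assumed, simplified driver.
class MxVSketch {

    // Accumulate one product term: partial + a * b.
    static long multiplyAdd(long partial, long a, long b) {
        return partial + a * b;
    }

    // Compute a single element of the result matrix.
    static void oneCell(long[][] left, long[][] right, long[][] product,
                        int row, int column) {
        long partial = 0;
        for (int i = 0; i < right.length; i += 1) {
            partial = multiplyAdd(partial, left[row][i], right[i][column]);
        }
        product[row][column] = partial;
    }

    // Assumed driver: fill every element of the product matrix.
    static long[][] multiply(long[][] left, long[][] right) {
        long[][] product = new long[left.length][right[0].length];
        for (int row = 0; row < product.length; row++) {
            for (int col = 0; col < product[0].length; col++) {
                oneCell(left, right, product, row, col);
            }
        }
        return product;
    }

    public static void main(String[] args) {
        long[][] p = multiply(new long[][]{{1, 2}, {3, 4}},
                              new long[][]{{5, 6}, {7, 8}});
        System.out.println(p[0][0] + " " + p[0][1] + " "
                           + p[1][0] + " " + p[1][1]);
    }
}
```

The inner loop over i in oneCell is where the bulk of the arithmetic happens, which is exactly where the profiles shown later in this blog point.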

This is how the application was compiled with javac and executed:

$ javac MxV.java
$ java -Xmx16G MxV -m 5000 -n 4000
MxV: rows: 5000  columns: 4000
$

Note that the application output consists of a single line only and echoes the matrix sizes m and n used.

The use of the -Xmx option to the JVM to set the maximum heap size to 16 GB is purely for demonstration purposes. Below you'll see how to pass such options on to the JVM when using gprofng.

A first profile

It takes two steps to create a profile with gprofng.

In the first step, the "gprofng collect app" tool is used to generate the performance data. This information is stored in an experiment directory. This is a regular Linux directory that contains all the information that gprofng needs to display the performance results. Next, the performance data stored in the experiment directory is analyzed with the "gprofng display text" tool, which has many options to produce various views into the results.

It is important to note that the application to be profiled is executed "as is": no special libraries, environment variables, etc. are needed to create the data. All that needs to be done is to execute the program as usual, but now under control of gprofng.

In the example below the "gprofng collect app" tool is used to generate the performance data. The program is run as shown before, but now under control of gprofng:

$ gprofng collect app -O perf-small.er -j on java -Xmx16G MxV -m 5000 -n 4000
Creating experiment directory perf-small.er (Process ID: 2597465) ...
Picked up JAVA_TOOL_OPTIONS: -agentlib:gp-collector
MxV: rows: 5000  columns: 4000
$

Two options that are specific to the gprofng collect app tool are used here.

  • The -O option gives the experiment directory the name perf-small.er.

    Note that the alternative -o option may also be used to name the experiment directory. The difference is that -o does not allow an existing experiment directory with the same name to be overwritten, whereas -O does. If neither the -O nor the -o option is specified, the experiment directory gets a default name.

  • The -j on option is specific to profiling a Java application and enables Java profiling.

    Technically this option is not needed, because it is the default. It is shown here, since it may also be used to point to a different JVM, or to explicitly disable Java profiling.

This is the general command syntax to collect performance data for a Java class. Three different types of options are supported:

$ gprofng collect app [collect app option(s)] java [JVM option(s)] classname [application arg(s)]

In the example above, -O and -j are options for gprofng collect app, -Xmx16G is a JVM option, and both -m, as well as -n, are arguments for the MxV class.

In case multiple options need to be passed on to the JVM, all the settings need to be passed as a single string enclosed in double quotes ("). Within this string, the options are separated by a blank space or a tab, and each option is passed as a separate argument to the JVM. Note that most arguments to the JVM begin with a "-" character.

Just as before, the application prints a single line with the matrix sizes, but prior to this, there are two additional lines in the output. The first line is generated by gprofng and confirms that an experiment directory with the name perf-small.er has been created to store the results. The output also shows the process id of the job that has been profiled. The second line comes from the JVM: gprofng sets the environment variable JAVA_TOOL_OPTIONS and the JVM echoes this setting.

At this point, there is an experiment directory called perf-small.er that contains all the performance information of interest to gprofng. The next step is to display and analyze this information.

Where the time is spent

The "gprofng display text" tool reads the experiment data; through the commands specified, the user controls what is shown. In the remainder of this section, various options are used to control what is displayed. Although the commands are presented individually, it is possible to combine multiple options in one single invocation of "gprofng display text" to generate all the results.

The function overview

To start with, let's look at the five most time-consuming methods. The -functions option shows the list of the functions, or methods in Java, that have been executed, plus their respective CPU times. Through the -limit 5 option, the number of methods listed is limited to the top five:

$ gprofng display text -limit 5 -functions perf-small.er
Print limit set to 5
Functions sorted by metric: Exclusive Total CPU Time

Excl. Total   Incl. Total    Name
CPU           CPU
 sec.      %   sec.      %
0.781 100.00  0.781 100.00   <Total>
0.450  57.69  0.610  78.21   MxV.oneCell(long[][], long[][], long[][], int, int)
0.160  20.51  0.160  20.51   MxV.multiplyAdd(long, long, long)
0.100  12.82  0.100  12.82   <JVM-System>
0.020   2.56  0.030   3.85   MxV.allocate(int, int, int)
$

In the table shown above, there are five columns. The first four columns contain the CPU times and their respective percentages. The fifth column shows the method names. The first entry in this list is always present. It is a pseudo function called <Total> that is created by gprofng. It shows the total CPU time accumulated over all the functions, or methods. The percentages displayed have been normalized to this total time.

There are two CPU time related metrics. The inclusive CPU time is all the CPU time accumulated while the method was on the call stack; it is the total time for the dynamic call tree underneath the entry. The exclusive CPU time is the time spent in the method itself, excluding the time spent in the methods it calls.

Both metrics are useful to consider and are shown by default. A relatively high value for the inclusive time points at an expensive branch in the application. The exclusive time then shows how much time was spent in the individual functions, or methods.

In the overview above, with 0.610 seconds, MxV.oneCell is the most time-consuming method. It is responsible for 78% of the total time. By itself, it took 0.450 seconds. That is over 57% of the time spent in all the methods. These numbers also show that 0.610-0.450=0.160 CPU seconds were spent executing other methods. With this information it is immediately clear that any optimization effort(s) should be directed towards MxV.oneCell. It is responsible for the majority of the CPU time spent.
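The relation between the two metrics can be stated precisely: the exclusive time of a method is its inclusive time minus the inclusive times of its direct callees. A small sketch, plugging in the numbers for MxV.oneCell from the table above; the helper name is made up for this illustration.

```java
// Toy model of the inclusive/exclusive relationship; the helper
// name "exclusive" is made up for this illustration.
class Metrics {

    // Exclusive time = inclusive time minus the inclusive times of
    // the methods called from it (its dynamic call tree children).
    static double exclusive(double inclusive, double... calleeInclusive) {
        double excl = inclusive;
        for (double c : calleeInclusive) {
            excl -= c;
        }
        return excl;
    }

    public static void main(String[] args) {
        // Numbers from the function overview: MxV.oneCell has an
        // inclusive time of 0.610 s and its only callee,
        // MxV.multiplyAdd, an inclusive time of 0.160 s.
        System.out.printf("exclusive time of oneCell: %.3f s%n",
                          exclusive(0.610, 0.160));
    }
}
```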

Now that it is clear which method is the most expensive one, the next question is how the CPU times are distributed over the source lines of MxV.oneCell. Again the "gprofng display text" tool is used, but this time with the -source option. It prints an annotated source listing of the method specified, which is MxV.oneCell in this case. The desired metrics are also defined. The -metrics e.totalcpu option selects the exclusive total CPU time as the metric to be displayed:

$ gprofng display text -metrics e.totalcpu -source MxV.oneCell perf-small.er

The output of this command shows the entire source listing with the exclusive CPU times in the first column. The relevant part of this output is displayed below. It shows that all the 0.450 seconds of CPU time are spent in source lines 20-21:

          <Function: MxV.oneCell(long[][], long[][], long[][], int, int)>
   0.     18.      final int rightRows = rowCount(right);
   0.     19.      long partial = 0;
   0.190  20.      for (int i = 0; i < rightRows; i += 1) {
## 0.260  21.          partial = multiplyAdd(partial, left[row][i], right[i][column]);
          22.      }
   0.     23.      product[row][column] = partial;
   0.     24.  }

The most time-consuming source lines are marked with ##. This makes it easier to find such lines in case the source file is large.

Often the source level performance information is sufficient to identify possible room for improvements, but in more complicated situations, information at a lower level is needed. This is why the -disasm option is also supported. It shows the disassembly, annotated with the CPU time spent in each instruction. The command is very similar to the source line level command that was just seen:

$ gprofng display text -metrics e.totalcpu -disasm MxV.oneCell perf-small.er

Below, the relevant snippet with the time-consuming instructions is shown. The timings are given in the first column. The source line number the instruction corresponds to is shown between square brackets ([ and ]). This is followed by the address and the instruction:

   0.                   [ 21] 0000001a: aaload
   0.                   [ 21] 0000001b: iload 9
## 0.200                [ 21] 0000001d: laload
   0.                   [ 21] 0000001e: aload_3
   0.                   [ 21] 0000001f: iload 9
   0.050                [ 21] 00000021: aaload
   0.                   [ 21] 00000022: iload 5
   0.                   [ 21] 00000024: laload
   0.010                [ 21] 00000025: invokevirtual multiplyAdd()
   0.                   [ 21] 00000028: lstore 7
## 0.190                [ 20] 0000002a: iinc 9 1
   0.                   [ 20] 0000002d: goto 0xd

Just as with the source listing, the most time-consuming instructions are marked with the ## symbol. As is clear from this output, the total CPU time is distributed over just four instructions.

The dynamic call tree

Another useful feature is the dynamic call tree. This can be obtained with the "gprofng display text" tool by using the -calltree option. Also in this case, the -limit option is used to limit the output. Here it's used to select the first 15 lines only:

$ gprofng display text -limit 15 -calltree perf-small.er
Print limit set to 15
Functions Call Tree. Metric: Attributed Total CPU Time

Attr. Total    Name
CPU
 sec.      %
0.781 100.00   +-
0.660  84.62     +-MxV.main(java.lang.String[])
0.660  84.62     |  +-MxV.test(int, int, int, int)
0.620  79.49     |    +-MxV.timeMultiply(long[][], long[][], long[][], long[][])
0.620  79.49     |    |  +-MxV.multiply(long[][], long[][], long[][])
0.610  78.21     |    |    +-MxV.oneCell(long[][], long[][], long[][], int, int)
0.160  20.51     |    |      +-MxV.multiplyAdd(long, long, long)
0.030   3.85     |    +-MxV.allocate(int, int, int)
0.010   1.28     |    |  +-MxV.initialize(long[][], int)
0.010   1.28     |    +-java.io.PrintStream.printf(java.lang.String, java.lang.Object[])
0.010   1.28     |      +-java.io.PrintStream.format(java.lang.String, java.lang.Object[])
0.010   1.28     |        +-java.util.Formatter.<init>()
0.010   1.28     |          +-java.util.regex.Pattern.compile(java.lang.String)
0.010   1.28     |            +-java.util.regex.Pattern.<init>(java.lang.String, int)
0.010   1.28     |              +-java.util.regex.Pattern.compile()
$

This call tree clearly reveals the execution structure and how the total CPU time is distributed over the various methods.

Comparing profiles

Very often, profiling is about comparing results. For example, to find a regression, to see how a certain optimization feature impacts the execution time, or to analyze the performance as a function of the number of threads used.

This is why the "gprofng display text" tool has a built-in feature to compare profiles. Multiple experiments can be specified when invoking this tool. By default, the performance data is aggregated over the experiments specified, but with the -compare option, the data is compared across the experiments. By default, in comparison mode, a side-by-side overview is displayed, but it is also possible to look at deltas and ratios.

In the overview shown below, two runs are compared. The run examined so far is compared with a run in which both matrix dimensions have been doubled. The results of this second run have been stored in an experiment directory called perf-large.er. The command below was used to generate the comparison. The -compare delta option displays the timings as a difference, relative to the first experiment:

$ gprofng display text -limit 5 -compare delta -functions perf-small.er perf-large.er
Print limit set to 5
Functions sorted by metric: Exclusive Total CPU Time

perf-small.er  perf-large.er  perf-small.er  perf-large.er
Excl. Total    Excl. Total    Incl. Total    Incl. Total     Name
CPU            CPU            CPU            CPU
 sec.      %    delta      %   sec.      %    delta      %
0.781 100.00   +2.332 100.00  0.781 100.00   +2.332 100.00   <Total>
0.450  57.69   +1.251  54.66  0.610  78.21   +1.791  77.17   MxV.oneCell(long[][], long[][], long[][], int, int)
0.160  20.51   +0.540  22.51  0.160  20.51   +0.540  22.51   MxV.multiplyAdd(long, long, long)
0.100  12.82   +0.490  18.97  0.100  12.82   +0.490  18.97   <JVM-System>
0.020   2.56   +0.020   1.29  0.030   3.85   +0.030   1.93   MxV.allocate(int, int, int)

The results speak for themselves and it is immediately clear how much longer it took to compute the larger matrix multiplication. For example, the inclusive time for MxV.oneCell increased by 1.791 CPU seconds.
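Because the deltas are relative to the first experiment, the absolute times of the second run can be recovered by simple addition. A small sketch, using the MxV.oneCell inclusive time from the overview above; the helper name is made up for this illustration.

```java
// Recover absolute times in the second experiment from the deltas:
// absolute = time in the first experiment + reported delta. The
// helper name is made up for this illustration.
class CompareDelta {

    static double absoluteTime(double base, double delta) {
        return base + delta;
    }

    public static void main(String[] args) {
        // MxV.oneCell: 0.610 s inclusive in perf-small.er, with a
        // +1.791 s delta reported for perf-large.er.
        System.out.printf("oneCell inclusive, large run: %.3f s%n",
                          absoluteTime(0.610, 1.791));
    }
}
```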

The view mode

By default, "gprofng display text" displays the performance information at the user level, but in case of a Java application, it may be useful to view the data from other angles. This is why the -viewmode option is supported on "gprofng display text". It takes a single keyword as an argument. This keyword defines what will be shown.

The following keywords are supported:

  • user - This is the default and shows the Java call stacks for Java threads. No call stacks for any housekeeping threads are shown. The function list contains a function <JVM-System> that represents the aggregated time from non-Java threads. When the JVM software does not report a Java call stack, time is reported against the function <no Java callstack recorded>.
  • expert - Shows the Java call stacks for Java threads while user Java code is executed, and the machine call stacks when JVM code is executed, or when the JVM software does not report a Java call stack. It also shows the machine call stacks for housekeeping threads.
  • machine - Shows the actual native call stacks for all threads.

What was not covered

We've only scratched the surface. There are many more features of "gprofng display text" that are worth exploring further, as well as others that haven't even been mentioned here. There are also other tools in the gprofng suite that were not covered.

Additionally, we recommend exploring the features available to analyze a multithreaded application and support for hardware event counters.

Extensive coverage of all the tools and features can be found in the documentation. There is a user guide and there are man pages for the various tools. A link to a copy of the gprofng user guide as included in GNU binutils, can be found in the "Documentation" section of this binutils page. There is also a wiki page that is expanded and updated on a regular basis.

What happens under the hood of a Java program?

In the blog titled What Does the Java Virtual Machine Do All Day?, it is demonstrated how gprofng can be used to analyze what happens while a Java application is running. Since the execution is under control of the JVM, several background tasks are executed, but this is transparent to the user. Examples are Garbage Collection (GC) and the HotSpot compiler.

The blog shows how gprofng can be used to uncover these activities and the impact the runtime Java compiler has on the performance. It also explains how several JVM options were used to improve the performance of the application.

Acknowledgements

The authors of this blog would like to thank Peter Kessler at Ampere Computing for making his MxV test program available.

Elena Zannoni

Elena Zannoni is a Senior Director in Oracle, responsible for Linux Toolchain and tracing development.

