This blog entry was contributed by: Ruud van der Pas, Kurt Goebel, Vladimir Mezentsev. They work in the Oracle Linux Toolchain Team and are involved with gprofng on a daily basis.
Gprofng is a next-generation application profiling tool. It supports the profiling of programs written in C, C++, Java, or Scala, running on systems using processors from Intel, AMD, Arm, or compatible vendors. The extent of the support is processor dependent, but the basic views are always available.
Two distinct steps are needed to produce a profile. In the first step, the performance data is collected. This information is stored in a directory called the experiment directory. There are several tools available to display and analyze the information stored in this directory.
These are some of the features:
Before we describe gprofng in more detail, we would like to present a brief history of this tool.
The origin of gprofng lies in the Performance Analyzer tool that is included in the Oracle Developer Studio tool suite.
As part of the work to turn this technology into an open source tool for Linux, the performance analysis tool was isolated from the Studio environment and turned into a standalone tool. The user interface was completely redesigned to make it more intuitive, easier to use, and better prepared for future extensions. Commonly utilized GNU and Linux utilities were leveraged, and support for Arm/aarch64 was added.
In August 2022, gprofng was made available as part of the open source GNU binutils tool suite. The binutils tools, including gprofng, can be downloaded from the GNU website.
Several tools (or commands) are part of the gprofng suite.
All commands start with gprofng, followed by a qualifier to describe the functionality. In some cases, this is followed by a keyword to further specify what needs to be done, and finally by the target. Examples of a target are the executable to be profiled, or the experiment directory to be analyzed. As an example, this is the command to collect performance data on an executable with the name a.out:
$ gprofng collect app ./a.out
These are the gprofng tools/commands and their functionality:

- gprofng collect app - collect the performance data for the target executable
- gprofng display text - display the performance data in plain text format
- gprofng display html - generate an HTML structure to browse the results
- gprofng display src - display the source code of the target, interleaved with the instructions
- gprofng archive - archive an experiment directory for sharing, or for analysis on another system
The demo code used in the example below implements a matrix-vector multiplication algorithm. It is written in C and has been parallelized with the Pthreads parallel programming model.
This is an example of how to multiply a 5000x4000 matrix with a vector of length 4000, using 2 threads:
$ ./mxv-pthreads.exe -m 5000 -n 4000 -t 2
mxv: error check passed - rows = 5000 columns = 4000 threads = 2
$
The program performs an internal check for correctness and, if this test passes, prints a one-line message to indicate successful execution. The algorithm parameters are printed as well.
There are two steps needed to get a basic profile on the screen. In the first step, the performance data is collected. The results are stored in what is called an experiment directory. In the second step, the performance results are displayed using the data in this directory.
In this case we use a single experiment only, but multiple experiments may be loaded simultaneously. The results can either be aggregated or compared.
It is important to note that nothing special needs to be done in order to collect the performance data. There is no need to recompile the code, set environment variables, and/or use special libraries. This is also true for multithreaded applications.
All that needs to be done is to execute the program under control of gprofng. Since we want to collect the performance data, we use the gprofng collect app command to do so:
$ gprofng collect app ./mxv-pthreads.exe -m 5000 -n 4000 -t 2
Creating experiment directory test.1.er (Process ID: 2607338) ...
mxv: error check passed - rows = 5000 columns = 4000 threads = 2
$
The second line in the output is from gprofng. It echoes the process ID and confirms that we are indeed gathering performance data. The results will be stored in an experiment directory with the name test.1.er.
This is a default name. Since by default existing experiment directories are not overwritten, a subsequent data collection experiment will use test.2.er to store the results in. And so on for additional experiments.
The -o option can be used to give the experiment a more meaningful name. Even then, an existing experiment with the same name will not be overwritten. To force an overwrite, the -O option may be used instead.
The job ran to completion and the data in the newly created experiment directory can be analyzed using one of the display tools. Note that the data can also be viewed while the profiling job is still running.
The command to display the performance data on the screen is gprofng display text. It takes one or multiple experiment directories as input.
If it is invoked without any additional options, the tool is started in interactive, or "interpreter" mode:
$ gprofng display text test.1.er
Warning: History and command editing is not supported on this system.
(gp-display-text) quit
$
In this environment, commands can be issued interactively. Although there is nothing wrong with using it like this, most users specify the display-related commands on the command line. An alternative is to add the commands to a script file and have these executed.
Here we show how to use commands on the command line. We are actually only going to use a single command. The -functions command shows how much time is spent in each function that has been executed:
$ gprofng display text -functions test.1.er
Functions sorted by metric: Exclusive Total CPU Time

Excl. Total   Incl. Total    Name
CPU           CPU
 sec.      %   sec.      %
5.554 100.00  5.554 100.00   <Total>
5.274  94.95  5.274  94.95   mxv_core
0.140   2.52  0.270   4.86   init_data
0.090   1.62  0.110   1.98   erand48_r
0.020   0.36  0.020   0.36   __drand48_iterate
0.020   0.36  0.130   2.34   drand48
0.010   0.18  0.010   0.18   _int_malloc
0.        0.  0.280   5.05   __libc_start_main
0.        0.  0.010   0.18   allocate_data
0.        0.  5.274  94.95   collector_root
0.        0.  5.274  94.95   driver_mxv
0.        0.  0.280   5.05   main
0.        0.  0.010   0.18   malloc
0.        0.  5.274  94.95   start_thread
$
The first line in the output shows that the function list is sorted with respect to the exclusive total CPU time. It is one of the things the user can change by explicitly defining the metric to be used to sort the function list.
There are 5 columns in the table that follows next. The first two columns show the exclusive total CPU time as a number and a percentage. The next two columns contain the inclusive total CPU time as a number and a percentage. The last column shows the name of the function.
These two different timings require some explanation. The inclusive total CPU time includes all the CPU time spent when executing this function and all of its children. It is the total time spent in this part of the execution path.
The exclusive total CPU time for a function is the time spent in this function, outside of calling other functions. It is the "pure" time spent in this function.
Both metrics are meaningful. The function with the highest inclusive value points at the part of the call tree that is most expensive. A high exclusive time means that the corresponding function itself is expensive in terms of CPU time.
The first function in the table is called <Total>. This is a pseudo function created by gprofng. It contains the total value for the various metrics, e.g. the CPU time or cache misses, that have been recorded. This number is also the reference for the percentages shown.
From the table it is clear that, with 5.274 seconds out of a total of 5.554 seconds of CPU time, function mxv_core is responsible for about 95% of the total CPU time. We also see that the inclusive and exclusive times are the same for this function. This means that it is a leaf function, that is, a function that does not call any other functions.
Although insignificant for the total performance, we also see that function init_data has an inclusive time of 0.270 seconds. With 0.140 seconds the exclusive time is about half of that. This means that this function is calling one or more other functions and collectively these take 0.130 seconds.
Through features like the call tree, and the callers-callees view, one can then drill deeper into this part of the profile. This is however beyond the scope of this introductory blog. We refer to section Where to learn more about gprofng for pointers to get more information.
Now that we have seen a first example, one may wonder how this all works.
When the executable is launched through the gprofng collect app command, a preload mechanism is used to execute the program under the control of gprofng, which is then in charge of the execution and of the data collection.
The underlying data collection technology used by gprofng is called statistical call stack sampling. With this technique, the execution of the application is interrupted periodically. When it is interrupted, the relevant information is recorded: for example, the execution path (or "call stack") and the thread ID, along with several other items. Once this has been done, program execution resumes until the next interruption.
In other words, the data is sampled with a certain frequency. Although in many cases the default sampling frequency works well, it is one of the things the user can set explicitly as part of the data collection process.
Thanks to the preload mechanism and the sampling strategy, the application does not need to be recompiled, nor linked against special libraries. All the views are available, but to get information at the source-line level, the compiler needs to include debug information. With the gcc compiler, for example, the -g option does this.
Applications written in C, C++, Java, or Scala are supported.
Any executable in the ELF (Executable and Linkable Format) object format can be used for profiling with gprofng. If debug information is available, gprofng can provide more details, but this is not a requirement.
Java 1.8 and later is supported. For OpenJDK, this is version 8 and later.
Oracle Linux 7 and later is supported, as are RHEL 7 and later releases. There is also a gprofng package for Fedora, and several users have reported that gprofng works fine on Ubuntu as well.
Aside from fixing bugs, these are some of the enhancements that are currently planned:
We also plan to write more blog articles on gprofng. In particular, we will provide updates on new features (for example, when the GUI comes out) and show more examples and use cases.
Keep an eye on this blog space to watch for these future articles!
The gprofng profiler is a new addition to the GNU binutils tool suite. It supports the profiling of applications written in C, C++, Java, and Scala. Processors from Intel/AMD (x86_64) and Arm (aarch64) are supported.
After the profile information has been collected, there are various ways to display and analyze the data. Performance information can be obtained at the function, source line, and instruction level.
Now that the basic framework is in place, several enhancements are in the pipeline. This includes a graphical interface to easily navigate through the performance data; additional features in the currently available tools are planned as well.
Elena Zannoni is a Senior Director in Oracle, responsible for Linux Toolchain and tracing development.