Last year I wrote two articles on code layout optimisations. These optimisations change the way that the code is laid out in memory to do at least one of the following:
- Move the code around so that the hot routines, or even the hot blocks of code, are close together in memory. This improves performance because it reduces the code's footprint in the caches, and fewer ITLB entries are needed to map the useful part of the application.
- Make the fall-through case the normal execution path for branches. This improves performance because the processor executes longer straight-line runs of code, and spends less time jumping around.
- Remove calls to remote routines by inlining the routines into the places where they are most frequently called. Performance improves because a call to a subroutine is avoided, but also because the inlining can lead to further opportunities to perform optimisations.
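To make the last two points concrete, here is a minimal C sketch. The function names (clamp_to_byte, handle_bad_input, sum_clamped) are hypothetical and do not come from the articles. The source is written so that the common case runs straight through and the rare case sits off to one side. The compiler still makes the final decision on layout and inlining, so this shows the intent rather than guaranteeing the generated code.

```c
#include <stddef.h>

/* Hypothetical helper: small and called from a hot loop, so it is a
 * natural candidate for the compiler to inline at the call site. */
static inline int clamp_to_byte(int v)
{
    if (v < 0) {
        return 0;
    }
    if (v > 255) {
        return 255;
    }
    return v;
}

/* Hypothetical rare path, kept in its own routine so that its code
 * does not have to sit in the middle of the hot loop. */
static long handle_bad_input(void)
{
    return -1;
}

/* Sums the clamped values. The NULL check is the unusual case; the
 * loop is the path that normally executes. */
long sum_clamped(const int *values, size_t count)
{
    long total = 0;

    if (values == NULL) {
        return handle_bad_input();          /* rarely taken */
    }
    for (size_t i = 0; i < count; i++) {
        total += clamp_to_byte(values[i]);  /* hot, straight-line work */
    }
    return total;
}
```

If the helper is inlined, the loop body has no call at all, and the compiler can then optimise the clamping code together with the addition around it.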
The first article is an overview and covers three techniques in light detail:
- Mapfiles. These tell the linker how to lay out the code for an application or library. They are easy to use, but work at the routine level, so are likely to be most useful for large applications that encounter a lot of ITLB misses.
- Profile feedback. This technique involves two compiles of the application; the first produces an instrumented binary. This instrumented binary is run with a 'training' workload and data is gathered to indicate how the code is used. This data is then fed into a second pass of the compiler which lays out the code appropriately - doing optimisations like inlining and improving the layout of branch instructions.
- Link-time optimisation. This technique builds on the profile feedback data by looking at all the information that is available at link time and using it to do further optimisation. The advantage of doing it at this stage is that all the code has been generated and so more is known about the exact size and location of all routines in memory.
Of these three approaches, mapfiles are probably the easiest to use - once a mapfile has been put together, it only requires extra flags to be passed to the compiler. However, mapfiles have the smallest potential for performance gains. Profile feedback, particularly in combination with link-time optimisation, has a much greater chance of delivering performance gains. Hence the second of the two articles covers profile feedback in much more detail.
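As a rough illustration of the two-compile profile feedback workflow described above, here is a small C routine whose best layout depends on the data it sees. The function name count_above is hypothetical. The build commands in the comment assume a GCC-style toolchain, which is an assumption: the articles may well use a different compiler with different flag names, so treat the commands as a sketch of the steps rather than the exact recipe.

```c
#include <stddef.h>

/*
 * Sketch of a profile feedback build, ASSUMING GCC-style flags
 * (other compilers use different flag names):
 *
 *   1. gcc -O2 -fprofile-generate app.c -o app   (instrumented build)
 *   2. ./app < training-input                    (run training workload)
 *   3. gcc -O2 -fprofile-use app.c -o app        (rebuild using profile)
 *
 * Adding -flto to the compile and link steps would, under the same
 * assumption, enable link-time optimisation as well.
 */

/* Counts the values above a threshold. Whether the branch is usually
 * taken depends entirely on the input data, which is exactly what the
 * training run records for the second compile. */
size_t count_above(const int *values, size_t count, int threshold)
{
    size_t hits = 0;

    for (size_t i = 0; i < count; i++) {
        if (values[i] > threshold) {
            hits++;
        }
    }
    return hits;
}
```

The training workload matters here: if it is not representative of real use, the compiler may lay out the code for the wrong common case.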