Wednesday May 20, 2009

Libraries (5) - Runtime costs - TLBs

The next consideration when using libraries is that each library gets mapped onto its own virtual pages of memory, as shown in this pmap output:

% pmap 60500
60500:  a.out
00010000       8K r-x--  /libraries/a.out
00020000       8K rwx--  /libraries/a.out
FEEC0000      24K rwx--    [ anon ]
FEED0000       8K r-x--  /libraries/lib1_26.so
FEEE0000       8K rwx--  /libraries/lib1_26.so
FEEF0000       8K r-x--  /libraries/lib1_25.so
FEF00000       8K rwx--  /libraries/lib1_25.so
FEF10000       8K r-x--  /libraries/lib1_24.so
FEF20000       8K rwx--  /libraries/lib1_24.so
FEF30000       8K r-x--  /libraries/lib1_23.so
FEF40000       8K rwx--  /libraries/lib1_23.so
FEF50000       8K rwx--    [ anon ]
FEF60000       8K r-x--  /libraries/lib1_22.so
FEF70000       8K rwx--  /libraries/lib1_22.so
FEF80000       8K r-x--  /libraries/lib1_21.so
FEF90000       8K rwx--  /libraries/lib1_21.so
FEFA0000       8K r-x--  /libraries/lib1_20.so
FEFB0000       8K rwx--  /libraries/lib1_20.so
FEFC0000       8K r-x--  /libraries/lib1_19.so
....

There are a finite number of TLB entries on a chip. If each library takes an entry, and the code jumps around between libraries, then a single application can use quite a few TLB entries. Take a CMT system where there are multiple applications (or copies of the same application) running, and there can be a lot of pressure on the TLB.

One of the enhancements in Solaris to support CMT processors is Shared Context. When multiple applications map the same library at the same address, they can share a single context to map that library. This can lead to a significant reduction in TLB pressure. Shared context only works for libraries that are loaded at the same memory locations in the different contexts, so it can be defeated if the libraries are loaded in different orders, or by any other mechanism that scrambles their locations in memory.

If each library is mapped using a different TLB entry, then every call into a new library needs a new ITLB entry, together with a jump through the PLT and the normal register spill/fill overhead. This can add up to quite a significant chunk of overhead.

To round this off, let's look at some figures from an artificial code run on an UltraSPARC T1 system that was hanging around here.

Experiment and runtime:

  • Application that jumps between 26 different routines a->b->c->...->z, with all the routines included in the same executable: 3s
  • The same application, with the routines provided as a single library, so the calls are routed through the PLT: 6s
  • The same application, with the routines provided as a single library, but all the routines declared static except for the initial one called by main, so the calls within the library avoid the PLT: 3s
  • The same application, with each routine defined in its own library, so calls to the routines go through the PLT and also require a new ITLB entry: 60s
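The routines in these experiments are essentially just a chain of calls. A cut-down sketch of the kind of code involved (three routines rather than 26; the actual source is not shown in this post) might look like:

/* chain.c: cut-down sketch of the chained-call test.  Built into a
   single executable the calls are direct; built as a shared library
   they go through the PLT; built as one library per routine each call
   can also need a new ITLB entry. */
void routine_c(void) { }
void routine_b(void) { routine_c(); }
void routine_a(void) { routine_b(); }

int main(void)
{
  long i;
  for (i = 0; i < 100000000; i++)  /* enough iterations for the call overhead to dominate */
  {
    routine_a();
  }
  return 0;
}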

Since the routines in this test code don't actually do anything, the overhead of calling through the PLT shows up clearly as a doubling of the runtime. However, even that is insignificant when compared with the cost of calling into separate libraries, which is about a further 10x slower.

Moving the experiment to look at the impact on CMT systems:

Experiment and runtime:

  • One copy of this executable per core of an UltraSPARC T1 processor: 1 minute
  • Two copies of this executable per core: 5 minutes
  • Four copies of this executable per core (fully loaded system): 8 minutes

Running multiple copies of the application has a significant impact on performance. The performance counters show very few instructions being executed, and much time being lost to ITLB misses. Now this performance is from a system without the shared context changes - so I would expect much better scaling on a system with these improvements (if I find one I'll rerun the experiment).

The conclusion is that care needs to be taken when deciding to split application code into libraries.

Thursday Apr 02, 2009

Strong prefetch

Looked at an interesting problem yesterday. One of the hot routines in an application was showing a lot of system time. The overall impact on the application was minor, but any unexpected system time is worth investigating.

Profiling under the performance analyzer indicated a pair of instructions, a prefetch followed by a load. The load had all the time attributed to it, but it's usual for the performance analyzer to attribute time to the instruction following the one that is slow. Something like (this is synthetic data, measured data later):

User  System
 0.5     0.5     prefetch [%o0],23
10.0    20.0     ld [%i1],%i2

So three things sprang to mind:

  • Misaligned memory accesses. This would probably be the most common cause of system time on memory operations. However, it didn't look like a perfect fit, since for a misaligned access the system time would usually be reported on the instruction afterwards, whereas here it was reported on the load itself.
  • TSB misses. The TLB is the on-chip structure that holds virtual to physical mappings. If a mapping is not in the TLB, then it gets fetched from a software-managed structure called the TSB; this is a TLB miss. However, the TSB has a capacity limit too, and it is possible to take a TSB miss, where the mapping has to be fetched from memory and placed into the TSB. TLB misses are handled by fast traps and don't get recorded as system time, but TSB misses do get accounted as system time. This was quite a strong candidate; however, trapstat -t didn't show much activity.
  • The third option was the prefetch instruction.

A prefetch instruction requests data from memory in advance of the data being used. For many codes this results in a large gain in performance. There are a number of variants of prefetch, and their exact operation depends on the processor. The SPARC Architecture 2007 specification, on pages 300-303, gives an outline of the variants: prefetch to read, prefetch to write, prefetch for single use, and prefetch for multiple use. The processor has the option to do something different depending on which particular prefetch was used to request the data.
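On Solaris these variants are exposed to C code through the <sun_prefetch.h> header that appears later in this post. A minimal sketch of the SPARC variants as I remember them from the header (the write-many call is the one used in the test case below; treat the other names, and the routine warm itself, as illustrative):

#include <sun_prefetch.h>

void warm(int *p)
{
  sparc_prefetch_read_once(p);    /* prefetch to read, single use */
  sparc_prefetch_read_many(p);    /* prefetch to read, multiple use */
  sparc_prefetch_write_once(p);   /* prefetch to write, single use */
  sparc_prefetch_write_many(p);   /* prefetch to write, multiple use */
}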

The system I was running on was an UltraSPARC IV+. This introduced the strong prefetch variant.

The idea of the strong prefetch variant is that it will cause a TLB miss to be taken if the mapping of the requested address is not present in the TLB. This is useful for situations where TLB misses are frequently encountered and the prefetches need to be able to deal with them. To explain why this is important, consider this example. I have a loop which strides through a large range of memory; the loop optimally takes only a few cycles to execute - so long as the data is resident in cache. It takes a couple of hundred cycles to fetch data from memory, so I end up prefetching eight iterations ahead (eight is the maximum number of outstanding prefetches on that processor). If the data resides on a new page, and prefetches do not cause TLB misses, then eight iterations will complete before a load or store touches the new page of memory and brings the mapping into the TLB. The eight prefetches that were issued for those iterations will have been dropped, and all of those iterations will see the full memory latency (a total of 200*8=1600 cycles of cost). However, if the prefetches are strong prefetches, then the first strong prefetch will bring the mapping into the TLB, and the subsequent prefetches will hit in the TLB.
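As a sketch of that pattern (the routine, array, and prefetch-ahead distance here are illustrative, not from the original application):

#include <sun_prefetch.h>

#define AHEAD 8   /* matches the eight outstanding prefetches discussed above */

void fill(double *a, long n, double v)
{
  long i;
  for (i = 0; i < n; i++)
  {
    if (i + AHEAD < n)
    {
      sparc_prefetch_write_many(&a[i + AHEAD]);  /* request the data early */
    }
    a[i] = v;  /* by the time this store executes, the data should be in cache */
  }
}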

However, there's a corner case. A strong prefetch of an address that is not mapped at all will still take the TLB miss, and the corresponding system time is spent while the kernel works out that the page does not exist and drops the access.

This was my suspicion. The code had a loop, and each iteration of the loop would prefetch the next item in the linked list. If the code had reached the end of the list, then it would be issuing a prefetch for address 0, which (of course) is not mapped.
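The pattern in the application would have looked something like this sketch (the structure and routine are hypothetical; the application's actual source is not shown here):

#include <sun_prefetch.h>

struct node
{
  struct node *next;
  int          value;
};

void update_list(struct node *p)
{
  while (p != 0)
  {
    sparc_prefetch_write_many(p->next);  /* on the last node, next is 0, so this is a prefetch of address 0 */
    p->value++;
    p = p->next;
  }
}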

It's relatively easy to provide a test case for this: just loop around issuing prefetches for address zero and see how long the code takes to run. Prefetches are defined in the header file <sun_prefetch.h>, so the first version of the code looks like:

#include <sun_prefetch.h>

void main()
{
  for (int i=0; i<1000000; i++)
  {
   sparc_prefetch_write_many(0);
  }
}

Compiling and running this gives:

$ cc pref.c
$ timex a.out

real           2.59
user           1.24
sys            1.34

So the code has substantial system time - which fits with the profile of the application. Next thing to look at is the location of that system time:

$ collect a.out
$ er_print -metrics e.user:e.system -dis main test.1.er
...
   Excl.     Excl.
   User CPU  Sys. CPU
    sec.      sec.
                                
...
   0.        0.                 [?]    10b8c:  bset        576, %o1 ! 0xf4240
   0.        0.                 [?]    10b90:  prefetch    [0], #n_writes_strong
## 1.571     1.181              [?]    10b94:  inc         %i5
   0.        0.                 [?]    10b98:  cmp         %i5, %o1
...

So the system time is recorded on the instruction following the prefetch.

The next step is to investigate ways to solve it. The <sun_prefetch.h> header file does not define any variants of weak prefetch. So it's necessary to write an inline template to provide the support:

.inline weak,4
  prefetch [%o0],2
.end
.inline strong,4
  prefetch [%o0],22
.end

The identifier for strong prefetch for many writes is 22 (#n_writes_strong); for weak prefetch for many writes it is 2 (#n_writes). The initial code uses strong, since I want to validate that I get the same behaviour with my new code:

void weak(int *);
void strong(int *);

void main()
{
  for (int i=0; i<1000000; i++)
  {
   strong(0);
  }
}

Running this gives:

$ cc pref.c pref.il
$ collect a.out
$ er_print -metrics e.user:e.system -dis main test.1.er
...
   Excl.     Excl.
   User CPU  Sys. CPU
    sec.      sec.
                                
 ...
   0.        0.                 [?]    10b90:  clr         %o0
   0.        0.                 [?]    10b94:  nop
   0.        0.                 [?]    10b98:  prefetch    [%o0], #n_writes_strong
## 1.761     1.051              [?]    10b9c:  inc         %i5
...

So that looks good. Replacing the strong prefetch with a weak prefetch:

void weak(int *);
void strong(int *);

void main()
{
  for (int i=0; i<1000000; i++)
  {
   weak(0);
  }
}

And then repeat the profiling experiment:

   Excl.     Excl.
   User CPU  Sys. CPU
    sec.      sec.
                                
...
   0.        0.                 [?]    10b90:  clr         %o0
   0.        0.                 [?]    10b94:  nop
   0.        0.                 [?]    10b98:  prefetch    [%o0], #n_writes
   0.        0.                 [?]    10b9c:  inc         %i5
...

Running the code under timex:

$ timex a.out

real           0.01
user           0.00
sys            0.00

So that got rid of the system time. Of course, with this change the prefetches will now be dropped if there is a TLB miss, and it is possible that this could cause a slowdown for the loads in the loop - that is the trade-off. However, examining the application using pmap -xs <pid> indicated that the data resided on 4MB pages, so the number of TLB misses should be pretty low (this was confirmed by the trapstat data I gathered earlier).

One final comment is that this behaviour is platform dependent. Running the same strong prefetch code on an UltraSPARC T2, for example, shows no system time.

Thursday Feb 07, 2008

Page size and memory layout

Support for large pages has been available since Solaris 9, and I've previously talked about the various ways that an application can be coaxed into using them. However, I wanted to quickly write up how large pages are laid out in memory. Take the following code, which allocates a large chunk of memory and then iterates over it for long enough to run pmap -xs on the process:

#include <stdlib.h>

void main()
{
  int x,y;
  char *c;
  c=(char*)malloc(sizeof(char)*300000000);
  for (y=0; y<100; y++)   /* repeat enough times to keep the process alive while pmap is run */
  for (x=0; x<300000000; x++) { c[x]=c[x]+y;}
}

Compiling this code to use 4MB pages and then running the resulting executable produces a pmap output like:

% cc -xpagesize=4M t.c
% a.out&
[1] 15501
% pmap -xs 15501
15501:  a.out
 Address  Kbytes     RSS    Anon  Locked Pgsz Mode   Mapped File
00010000       8       8       -       -   8K r-x--  a.out
00020000       8       8       8       -   8K rwx--  a.out
00022000    3960    3960    3960       -   8K rwx--    [ heap ]
00400000  290816  290816  290816       -   4M rwx--    [ heap ]
...

Notice that the heap starts out on 8KB pages, and uses these until the addresses reach a 4MB boundary, at which point it switches to 4MB pages. In this case that means nearly 4MB of memory - the 3960KB mapped at 0x22000, just below the 4MB boundary at 0x400000 - is not using 4MB pages. If this happens to be where the majority of the program's active data resides, then there will still be plenty of TLB misses.

Fortunately, it is possible to tell the linker where to start the heap. There are some mapfiles provided in /usr/lib/ld/ for various scenarios; the one we need here is map.bssalign. Recompiling with this mapfile produces the following memory layout:

% cc -M /usr/lib/ld/map.bssalign -xpagesize=4M t.c
% a.out&
[1] 19077
% pmap -xs 19077
19077:  a.out
 Address  Kbytes     RSS    Anon  Locked Pgsz Mode   Mapped File
00010000       8       8       -       -   8K r-x--  a.out
00020000       8       8       8       -   8K rwx--  a.out
00400000  294912  294912  294912       -   4M rwx--    [ heap ]

With this change the heap now starts on a 4MB boundary and is entirely mapped with 4MB pages.

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
