Thursday Apr 02, 2009

Strong prefetch

Looked at an interesting problem yesterday. One of the hot routines in an application was showing a lot of system time. The overall impact on the application was minor, but any unexpected system time is worth investigating.

Profiling under the performance analyzer indicated a pair of instructions: a prefetch followed by a load. The load had all the time attributed to it, but it's usual for the performance analyzer to attribute time to the instruction following the one that is actually slow. Something like this (synthetic data; measured data comes later):

User  System
 0.5     0.5     prefetch [%o0],23
10.0    20.0     ld [%i1],%i2

So three things sprang to mind:

  • Misaligned memory accesses. This would probably be the most common cause of system time on memory operations. However, it didn't look like a perfect fit: if the load were misaligned, the system time would usually be reported on the instruction after the load, rather than on the load itself.
  • TSB misses. The TLB is the on-chip structure that holds virtual to physical mappings. If the mapping is not in the TLB, then the mapping gets fetched from a software-managed structure called the TSB. This is a TLB miss. However, the TSB has a capacity limit too, and it is possible for there to be a TSB miss, where the mapping is fetched from memory and placed into the TSB. TLB misses are handled by fast traps and don't get recorded under system time; TSB misses, however, do cause a switch into the kernel, so they show up as system time. This was quite a strong candidate; however, trapstat -t didn't show much activity.
  • The third option was the prefetch instruction.

A prefetch instruction requests data from memory in advance of the data being used. For many codes this results in a large gain in performance. There are a number of variants of prefetch, and their exact operation depends on the processor. The SPARC Architecture 2007 specification gives an outline of the variants on pages 300-303. There are prefetches to read, prefetches to write, prefetches for single use, and prefetches for multiple use. The processor has the option to do something different depending on which particular prefetch was used to get the data.
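
For reference, these variants are exposed to C code through the <sun_prefetch.h> header that appears later in this post. Here is a minimal sketch of how they map onto the read/write and single/multiple-use cases (only sparc_prefetch_write_many is used later in this post; check the header on your system for the exact set of names it provides):

#include <sun_prefetch.h>

void variants(int *data)
{
  sparc_prefetch_read_once(data);   /* prefetch for a single read     */
  sparc_prefetch_read_many(data);   /* prefetch for multiple reads    */
  sparc_prefetch_write_once(data);  /* prefetch for a single write    */
  sparc_prefetch_write_many(data);  /* prefetch for multiple writes   */
}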

The system I was running on was an UltraSPARC IV+. This processor introduced the strong prefetch variant.

The idea of the strong prefetch variant is that it will cause a TLB miss to be taken if the mapping of the requested address is not present in the TLB. This is useful for situations where TLB misses are frequently encountered and the prefetch needs to be able to deal with them. To explain why this is important, consider this example. I have a loop which strides through a large range of memory; the loop optimally takes only a few cycles to execute, so long as the data is resident in cache. It takes a couple of hundred cycles to fetch data from memory, so I end up fetching eight iterations ahead (eight is the maximum number of outstanding prefetches on that processor). If the data resides on a new page, and prefetches do not cause TLB misses, then eight iterations will complete before a load or store touches the new page of memory and brings the mapping into the TLB. The eight prefetches that were issued for those iterations will have been dropped, and all of those iterations will see the full memory latency (a total of 200*8=1600 cycles of cost). However, if the prefetches are strong prefetches, then the first strong prefetch will cause the mapping to be brought into the TLB, and the subsequent prefetches will hit in the TLB.
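
As a concrete sketch of the kind of loop described above (the array, the cache-line stride, and the use of sparc_prefetch_write_many are illustrative assumptions rather than the actual application code):

#include <sun_prefetch.h>

#define LINE   8    /* doubles per 64-byte cache line */
#define AHEAD  8    /* prefetch eight iterations ahead */

void scale(double *a, int n)  /* n is a multiple of LINE */
{
  for (int i = 0; i < n; i += LINE)
  {
    /* Request the line needed eight iterations from now. A weak prefetch
       is dropped if the target page is not in the TLB; a strong prefetch
       takes the TLB miss here, so by the time the loads and stores reach
       that page the mapping is already present. */
    if (i + AHEAD * LINE < n) { sparc_prefetch_write_many(&a[i + AHEAD * LINE]); }
    for (int j = 0; j < LINE; j++) { a[i + j] *= 2.0; }
  }
}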

However, there's a corner case. A strong prefetch of a page that is not mapped will still cause a TLB miss, together with the corresponding system time while the kernel works out that it should drop the access to the non-existent page.

This was my suspicion. The code had a loop, and each iteration of the loop would prefetch the next item in the linked list. If the code had reached the end of the list, then it would be issuing a prefetch for address 0, which (of course) is not mapped.
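
The pattern was along these lines (a sketch, not the actual application code; the node layout and the particular prefetch variant are assumptions for illustration):

#include <sun_prefetch.h>

struct node { struct node *next; int value; };

int total(struct node *head)
{
  int sum = 0;
  while (head != 0)
  {
    /* Prefetch the next item while processing the current one. On the
       final iteration head->next is 0, so this issues a prefetch for
       address zero, which is never mapped. */
    sparc_prefetch_write_many(head->next);
    sum += head->value;
    head = head->next;
  }
  return sum;
}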

It's relatively easy to provide a test case for this: just loop around issuing prefetches for address zero and see how long the code takes to run. Prefetches are defined in the header file <sun_prefetch.h>. So the first version of the code looks like:

#include <sun_prefetch.h>

void main()
{
  for (int i=0; i<1000000; i++)
  {
   sparc_prefetch_write_many(0);
  }
}

Compiling and running this gives:

$ cc pref.c
$ timex a.out

real           2.59
user           1.24
sys            1.34

So the code has substantial system time, which fits with the profile of the application. The next thing to look at is where that system time is recorded:

$ collect a.out
$ er_print -metrics e.user:e.system -dis main test.1.er
...
   Excl.     Excl.
   User CPU  Sys. CPU
    sec.      sec.
                                
...
   0.        0.                 [?]    10b8c:  bset        576, %o1 ! 0xf4240
   0.        0.                 [?]    10b90:  prefetch    [0], #n_writes_strong
## 1.571     1.181              [?]    10b94:  inc         %i5
   0.        0.                 [?]    10b98:  cmp         %i5, %o1
...

So the system time is recorded on the instruction following the prefetch.

The next step is to investigate ways to solve it. The <sun_prefetch.h> header file does not define any variants of weak prefetch. So it's necessary to write an inline template to provide the support:

.inline weak,4
  prefetch [%o0],2    ! fcn 2  = #n_writes (weak)
.end
.inline strong,4
  prefetch [%o0],22   ! fcn 22 = #n_writes_strong
.end

The prefetch function code for many-writes strong is 22; for many-writes weak it is 2. The initial code uses the strong variant, since I want to validate that I get the same behaviour with my new code:

void weak(int *);
void strong(int *);

void main()
{
  for (int i=0; i<1000000; i++)
  {
   strong(0);
  }
}

Running this gives:

$ cc pref.c pref.il
$ collect a.out
$ er_print -metrics e.user:e.system -dis main test.1.er
...
   Excl.     Excl.
   User CPU  Sys. CPU
    sec.      sec.
                                
 ...
   0.        0.                 [?]    10b90:  clr         %o0
   0.        0.                 [?]    10b94:  nop
   0.        0.                 [?]    10b98:  prefetch    [%o0], #n_writes_strong
## 1.761     1.051              [?]    10b9c:  inc         %i5
...

So that looks good. Replacing the strong prefetch with a weak prefetch:

void weak(int *);
void strong(int *);

void main()
{
  for (int i=0; i<1000000; i++)
  {
   weak(0);
  }
}

And then repeat the profiling experiment:

   Excl.     Excl.
   User CPU  Sys. CPU
    sec.      sec.
                                
...
   0.        0.                 [?]    10b90:  clr         %o0
   0.        0.                 [?]    10b94:  nop
   0.        0.                 [?]    10b98:  prefetch    [%o0], #n_writes
   0.        0.                 [?]    10b9c:  inc         %i5
...

Running the code under timex:

$ timex a.out

real           0.01
user           0.00
sys            0.00

So that got rid of the system time. Of course, with this change the prefetches issued will now be dropped if there is a TLB miss, and it is possible that this could cause a slowdown for the loads in the loop, which is a trade-off. However, examining the application using pmap -xs <pid> indicated that the data resided on 4MB pages, so the number of TLB misses should be pretty low (this was confirmed by the trapstat data I gathered earlier).

One final comment is that this behaviour is platform dependent. Running the same strong prefetch code on an UltraSPARC T2, for example, shows no system time.
