Thursday Apr 19, 2007

Locating DTLB misses using the Performance Analyzer

DTLB misses typically appear in the Performance Analyzer as loads with significant user time. The following code strides through memory in steps of 8192 bytes, and so encounters many DTLB misses:

#include <stdlib.h>
int main(void)
{
  double *a;
  double total=0;
  int i;
  int j;
  a=(double*)calloc(10*1024*1024+10001,sizeof(double)); /* zero-filled array of roughly 80MB */
  for (i=0;i<10000;i++)
   for(j=0;j<10*1024*1024;j+=1024) /* stride of 1024 doubles = 8192 bytes */
    total+=a[j+i];
  return 0;
}

A profile can be gathered as follows:

cc -g -O -xbinopt=prepare -o tlb tlb.c
collect tlb

Viewing the profile for the main loop using er_print produces the following snippet:

   Excl.     
   User CPU  
    sec. 
...
   0.230                [11]    10c70:  prefetch    [%i5 + %i1], #n_reads
   0.                   [11]    10c74:  faddd       %f4, %f2, %f8
   3.693                [11]    10c78:  ldd         [%i5], %f12
## 7.685                [10]    10c7c:  add         %i5, %i3, %o3
   0.                   [11]    10c80:  faddd       %f8, %f0, %f10
   4.183                [11]    10c84:  ldd         [%o3], %f2
## 6.935                [10]    10c88:  add         %o3, %i3, %o1
   0.                   [11]    10c8c:  faddd       %f10, %f12, %f4
   0.                   [10]    10c90:  inc         3072, %i4
   4.123                [11]    10c94:  ldd         [%o1], %f0
## 7.065                [10]    10c98:  cmp         %i4, %i0
   0.                   [10]    10c9c:  ble,pt      %icc,0x10c70
   0.                   [10]    10ca0:  add         %o1, %i3, %i5

This can be compared with the situation where mpss.so.1 has been preloaded to enable the application to get large pages:

   0.                   [11]    10c70:  prefetch    [%i5 + %i1], #n_reads
   0.                   [11]    10c74:  faddd       %f4, %f2, %f8
   0.                   [11]    10c78:  ldd         [%i5], %f12
## 7.445                [10]    10c7c:  add         %i5, %i3, %o3
   0.                   [11]    10c80:  faddd       %f8, %f0, %f10
   0.220                [11]    10c84:  ldd         [%o3], %f2
## 6.955                [10]    10c88:  add         %o3, %i3, %o1
   0.                   [11]    10c8c:  faddd       %f10, %f12, %f4
   0.                   [10]    10c90:  inc         3072, %i4
   0.340                [11]    10c94:  ldd         [%o1], %f0
   0.                   [10]    10c98:  cmp         %i4, %i0
   0.                   [10]    10c9c:  ble,pt      %icc,0x10c70
   0.                   [10]    10ca0:  add         %o1, %i3, %i5

The difference between the two profiles is that, without large pages, user time is attributed directly to the load instructions, and not only to the instruction after each load, which is where the time normally appears.

It is possible to confirm that these are DTLB misses using the Performance Analyzer's ability to profile an application using the hardware performance counters:

collect -h DTLB_miss tlb
...
   Excl.      
   DTLB_miss  
   Events   
...
          0             [11]    10c70:  prefetch    [%i5 + %i1], #n_reads
          0             [11]    10c74:  faddd       %f4, %f2, %f8
   30000090             [11]    10c78:  ldd         [%i5], %f12
          0             [10]    10c7c:  add         %i5, %i3, %o3
          0             [11]    10c80:  faddd       %f8, %f0, %f10
## 42000126             [11]    10c84:  ldd         [%o3], %f2
          0             [10]    10c88:  add         %o3, %i3, %o1
          0             [11]    10c8c:  faddd       %f10, %f12, %f4
          0             [10]    10c90:  inc         3072, %i4
   30000090             [11]    10c94:  ldd         [%o1], %f0
          0             [10]    10c98:  cmp         %i4, %i0
          0             [10]    10c9c:  ble,pt      %icc,0x10c70
          0             [10]    10ca0:  add         %o1, %i3, %i5

The events are reported on the load instructions that are causing the DTLB misses.

Wednesday Apr 18, 2007

Detecting data TLB misses

There are a couple of easy ways to test whether an application is encountering DTLB misses.

  • One way is to use the trapstat command. This command requires administrator privileges to run, and it either reports trap activity on a system-wide basis or can be used to follow the traps that a single process encounters.
  • The second way is to use cputrack to track the events recorded by the hardware performance counters on the processor. The particular counters will depend on the hardware. An example using an UltraSPARC III is:
cputrack -c pic0=Instr_cnt,pic1=DTLB_miss -p <pid>

Using large DTLB page sizes

The TLB is a structure on the chip that handles the mapping of virtual memory addresses (used by the application) into physical memory addresses (used by the hardware). It is a list of such mappings, and each mapping describes a range of memory whose size is called the page size. The default page size on SPARC is 8KB, but it can be configured up to impressively large sizes (e.g. 256MB for UltraSPARC T1). The command to display the page sizes that the hardware supports is pagesize:

pagesize -a

If the application requests a virtual to physical translation that is not mapped in the TLB, then there's a TLB miss. On UltraSPARC III/IV the process of fetching a TLB entry takes about a hundred cycles.

Using a larger page size will reduce the number of TLB misses, because each TLB entry then maps a larger range of memory. Of course, a large page size requires a large chunk of contiguous physical memory, and it's not always possible to get this.

An application can request large pages in one of three ways:

  • Using the ppgsz command to set the preferred page sizes.
  • Using the compiler flag -xpagesize= to set the preferred page size at compile time.
  • Preloading the mpss.so.1 library and using the MPSSHEAP and MPSSSTACK environment variables to specify the page sizes.

When an application is running it is possible to inspect the page sizes of the allocated memory using the command:

pmap -xs <pid>
About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
