Locating DTLB misses using the Performance Analyzer

DTLB misses typically appear in the Performance Analyzer as load instructions with significant user time attributed to them. The following code strides through memory in steps of 8192 bytes (1024 doubles), and so encounters many DTLB misses:

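#include <stdlib.h>
int main()
{
  double *a;
  double total=0;
  int i;
  int j;
  a=(double*)calloc(10*1024*1024+10001,sizeof(double)); /* zeroed array; calloc takes (count, size) */
  for (i=0;i<10000;i++)
   for(j=0;j<10*1024*1024;j+=1024)  /* step of 1024 doubles = 8192 bytes */
    total+=a[j+i];                  /* each load lands 8192 bytes after the previous one */
  return 0;
}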

A profile can be gathered as follows:

cc -g -O -xbinopt=prepare -o tlb tlb.c
collect tlb

Viewing the profile for the main loop using er_print produces the following snippet:

   Excl.     
   User CPU  
    sec. 
...
   0.230                [11]    10c70:  prefetch    [%i5 + %i1], #n_reads
   0.                   [11]    10c74:  faddd       %f4, %f2, %f8
   3.693                [11]    10c78:  ldd         [%i5], %f12
## 7.685                [10]    10c7c:  add         %i5, %i3, %o3
   0.                   [11]    10c80:  faddd       %f8, %f0, %f10
   4.183                [11]    10c84:  ldd         [%o3], %f2
## 6.935                [10]    10c88:  add         %o3, %i3, %o1
   0.                   [11]    10c8c:  faddd       %f10, %f12, %f4
   0.                   [10]    10c90:  inc         3072, %i4
   4.123                [11]    10c94:  ldd         [%o1], %f0
## 7.065                [10]    10c98:  cmp         %i4, %i0
   0.                   [10]    10c9c:  ble,pt      %icc,0x10c70
   0.                   [10]    10ca0:  add         %o1, %i3, %i5

This can be compared with the situation where mpss.so.1 has been preloaded to enable the application to get large pages:

   0.                   [11]    10c70:  prefetch    [%i5 + %i1], #n_reads
   0.                   [11]    10c74:  faddd       %f4, %f2, %f8
   0.                   [11]    10c78:  ldd         [%i5], %f12
## 7.445                [10]    10c7c:  add         %i5, %i3, %o3
   0.                   [11]    10c80:  faddd       %f8, %f0, %f10
   0.220                [11]    10c84:  ldd         [%o3], %f2
## 6.955                [10]    10c88:  add         %o3, %i3, %o1
   0.                   [11]    10c8c:  faddd       %f10, %f12, %f4
   0.                   [10]    10c90:  inc         3072, %i4
   0.340                [11]    10c94:  ldd         [%o1], %f0
   0.                   [10]    10c98:  cmp         %i4, %i0
   0.                   [10]    10c9c:  ble,pt      %icc,0x10c70
   0.                   [10]    10ca0:  add         %o1, %i3, %i5

The difference between the two profiles is that, without large pages, user time is attributed directly to the load instructions, rather than only to the instruction after each load, which is where the time for a load normally appears.

It is possible to confirm that these are DTLB misses using the Performance Analyzer's ability to profile an application using the hardware performance counters:

collect -h DTLB_miss tlb
...
   Excl.      
   DTLB_miss  
   Events   
...
          0             [11]    10c70:  prefetch    [%i5 + %i1], #n_reads
          0             [11]    10c74:  faddd       %f4, %f2, %f8
   30000090             [11]    10c78:  ldd         [%i5], %f12
          0             [10]    10c7c:  add         %i5, %i3, %o3
          0             [11]    10c80:  faddd       %f8, %f0, %f10
## 42000126             [11]    10c84:  ldd         [%o3], %f2
          0             [10]    10c88:  add         %o3, %i3, %o1
          0             [11]    10c8c:  faddd       %f10, %f12, %f4
          0             [10]    10c90:  inc         3072, %i4
   30000090             [11]    10c94:  ldd         [%o1], %f0
          0             [10]    10c98:  cmp         %i4, %i0
          0             [10]    10c9c:  ble,pt      %icc,0x10c70
          0             [10]    10ca0:  add         %o1, %i3, %i5

The events are reported on the load instructions that are causing the DTLB misses.

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
