Reading the %tick counter

It's often necessary to time the duration of a piece of code. To do this I often use gethrtime(), which is typically a quick call. Obviously, the faster the call more accurate I can expect my timing to be. Sometimes the call to gethrtime is too long (or perhaps too costly, because it's done so frequently). The next thing to try is to read the %tick register.

The %tick register is a 64-bit register that gets incremented on every cycle. Since it's stored on the processor reading the value of the register is a low cost operation. The only complexity is that being a 64-bit value it needs special handling under 32-bit codes where the a 64-bit return value from a function is passed in the %o0 (upper bits) and %o1 (lower bits) registers.

The inline template to read the %tick register in a 64-bit code is very simple

.inline tk,0
   rd %tick,%o0
.end

The 32-bit version requires two more instructions to get the return value into the appropriate registers:

.inline tk,0
   rd %tick,%o2
   srlx %o2,32,%o0
   sra %o2,0,%o1
.end

Here's an example code which uses the %tick register to get the current value plus an estimate of the cost of reading the %tick register:

#include 

long long tk();

void main()
{
  printf("value=%llu duration=%llu\\n",tk(),-tk()+tk());
}

The compile line is:

$ cc -O tk.c tk.il

The output should be something like:

$ a.out
value=4974674895272956 duration=32

Indicating that 32 cycles (in this instance) elapsed between the two reads of the %tick register. Looking at the disassembly, there are certainly a number of cycles of overhead:

        10bbc:  95 41 00 00  rd         %tick, %o2
        10bc0:  91 32 b0 20  srlx       %o2, 32, %o0
        10bc4:  93 3a a0 00  sra        %o2, 0, %o1
        10bc8:  97 32 60 00  srl        %o1, 0, %o3 
        10bcc:  99 2a 30 20  sllx       %o0, 32, %o4
        10bd0:  88 12 c0 0c  or         %o3, %o4, %g4
        10bd4:  c8 73 a0 60  stx        %g4, [%sp + 96]
        10bd8:  95 41 00 00  rd         %tick, %o2

This overhead can be reduced by treating the %tick register as a 32-bit read, and effectively ignoring the upper bits. For very short duration codes this is probably acceptable, but is unsuitable for longer running code blocks. With this (inelegant hack) the following code is generated:

        10644:  91 41 00 00  rd         %tick, %o0
        10648:  b0 07 62 7c  add        %i5, 636, %i0
        1064c:  b8 10 00 08  mov        %o0, %i4
        10650:  91 41 00 00  rd         %tick, %o0

Which returns usually a value of 8 cycles on the same platform.

Comments:

Post a Comment:
Comments are closed for this entry.
About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
Free Download

Search

Categories
Archives
« April 2015
SunMonTueWedThuFriSat
   
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
  
       
Today
Bookmarks
The Developer's Edge
Solaris Application Programming
Publications
Webcasts
Presentations
OpenSPARC Book
Multicore Application Programming
Docs