Friday Aug 13, 2004

Perl+DTrace=Instruction Tracing

As promised, an ugly perl script that can be used to generate a custom DTrace script which will generate an instruction trace.

For instance:

trogdor:utrace -> ls
hello\*      hello.c     utrace.pl\*  utrace.txt\* 
trogdor:utrace -> utrace.pl -i hello
trogdor:utrace -> sudo dtrace -Cs hello.t -c hello
...

For the interested, here is the text of the traced program, the DTrace script, and the output generated.

The perl script is no where near "feature complete", nor is it an example of good perl programming practice ( parts of it are legible ). But, say it with me, it scratched an itch :).

-ejo

Thursday Jul 29, 2004

DTrace Instruction Tracing Continues..

Wow, this has me bugged. Using this script ( called itrc.d ):

pid$target:a.out::
{
        printf("%d %016llx",vtimestamp,uregs[R_PC]);
}

# dtrace -s itrc.d -c hello3 > rtrc.hello3

a custom SPARC disassembler, and a really awful perl script, I produced:

Delta NS Address          Instruction   (disassembled)
       0 0000000100000740 bc102000      or      %g0, 0x0, %fp
   12960 0000000100000744 e05ba87f      ldx     [%sp+0x87f], %l0
    7120 0000000100000748 1b000000      sethi   0x0, %o5
    5040 000000010000074c 9a136001      or      %o5, 0x1, %o5
    4960 0000000100000750 9b2b7020      sllx    %o5, 0x1020, %o5
    4640 0000000100000754 13000403      sethi   0x100c00, %o1
    3920 0000000100000758 9212400d      or      %o1, %o5, %o1
    4240 000000010000075c e0226208      stw     %l0, [%o1+ 0x208]
    4240 0000000100000760 a203a887      add     %sp, 0x887, %l1
 ...

I think I'm almost done.

-ejo

DTrace Instruction Tracing

Adam Leventhal pointed out that the DTrace pid provider can instrument user-land instructions, so my interest was piqued enough to give it a whirl last night. The script I ended up with is just stupidly short, I feel like I'm cheating.

pid$target:a.out::
{
        printf("pc %p: %x",uregs[R_PC],\*(uint )copyin(uregs[R_PC],4));
}

Using a new DTrace feature, I start the trace and get some output:

# dtrace -c ./hello2 -s itrc.d 
CPU     ID                    FUNCTION:NAME
  0  38695                     _start:entry pc 1000006a0: 91d02038
  0  38696                         _start:0 pc 1000006a0: 91d02038
  0  38697                         _start:4 pc 1000006a4: 91d02038
  0  38698                         _start:8 pc 1000006a8: 91d02038
...

That doesn't look right. The PC's look good, but those aren't the instructions at those addresses. Or are they? I was a bit foggy last night when I started messing with this, so I assumed that I wasn't using copyin() right. Somehow I was getting the address of the DTrace buffer that copyin returns or something. In the cold light of morning things look different. It occurs to me that 91d02038 looks familiar. Yes I know, it's numbers. The crawling itch in my lizard hacker brain was trying to tell me something:

91d02038 <=> ta      %icc, 0x38

What? Your lizard hacker brain doesn't tell you stuff like that? Lucky you. So I was using copyin correctly. I am fetching the instruction at the PC, just not the original instruction. As everybody knows, 0x38 is the DTRACE pid provider trap, so this instruction is the DTrace instrumentation that makes it all work.

I'm feeling moderately clever for having figured that out on only two cups of coffee. :)

-ejo

Wednesday Jul 28, 2004

Adventures in Instruction Tracing: Part Duex!

Yesterday I began talking about cobbling together a user-land instruction trace tool using a variety of UltraSPARC and Solaris features. Today I'm going to cop out alittle and present the comments for the user trap handler I wrote which is the heart of the tool. For the truly bored, I've provided a hello.c and associated (post-processeed) trace.

-ejo

#if 0

illtrap_hndlr - Btrace3 ILLTRAP utrap handler

erik.oshaughnessy@sun.com


Btrace3 instruments the code of a program by substituting control
transfer instructions (CTIs) with specially prepared ILLTRAP
instructions.  When the code executes, the ILLTRAP instructions cause
the utrap handler to execute.  This allows us to observe the actual
path of execution by noting which ILLTRAP instructions are executed.

It probably is worth mentioning that the ILLTRAP instruction is a
sneaky way of hiding 22 bit constant data in your code.  According to
_The SPARC Architecture Manual_ Version 9, David Weaver and Tom
Germond, page 168:

"The ILLTRAP instruction causes a illegal_instruction exception. The
 const22 value is ignored by the hardware; specifically, this field is
 NOT reserved by the architecure for any future use."

I have taken advantage of this by crafting the ILLTRAP instructions so
that they serve as indices into the array of saved original
instructions. Used in a flat array, a const22 value gives us 2\^22-1 or
roughly 4 million points where code can be instrumented.  For the
current generation of software this number of probe points appears to
suffice, but it is not difficult to conceive of future applications
where the number of probe points exceeds this flat space.

The first implementation of Btrace played a dangerous shell game,
storing the original instruction back into the executing stream of
instructions and re-inserting the ILLTRAP when handling the next
probed CTI.  Despite the odd corner case, this technique worked
surprisingly well in a single- threaded application.  It falls apart
when attempting to trace multi- threaded applications since the
complexity of managing multiple stores into the instruction stream and
guranteeing correctness of the instrumented program proved simply
overwhelming.

Btrace3 sidesteps the self-modifying code aspects of Btrace1 and 2 by
emulating the CTI instruction instead of storing the probed
instruction back into the instruction stream.  This is possible since
the utrap handler can choose where the trapping context begins
executing after the utrap handler returns.  Typical utrap handlers
either skip the trapping instruction or restart the trapping
instruction after modifying the state of the trapping context.  A CTI
instruction can be emulated by calculating the target address if the
branch is taken and then determining if the CTI is taken.  After
evaluating any conditions predicated by the type of instruction being
emulated, it is a simple matter to return with arbitrary PC and nPC
addresses.

Since the instruction stream remains instrumented during the lifetime
of of the process, the synchronization complexity of the tracing code
is greatly reduced.  The only remaining synchronization point is trace
buffer management, which is a relatively well understood problem.

The first Btrace implementation also uncovered the importance of
preserving the state of the trapping context before proceeding with
tracing activities.  The two macros UTRAP_PROLOG and UTRAP_EPILOG
preserve and restore the following registers: %ccr, %g1 - %g7.
Additionally, UTRAP_PROLOG writes ASI_PNF to %asi as required by the
__sparc_utrap_install(2) man page before calls are made to ABI
conforming functions.  Values are preserved on the stack, making the
utrap handler reasonably re-entrant.  The stack also has room for
sundry other values; %tick, %fsr, %l6 and %l7 (PC and nPC) from
trapping context, and the ILLTRAP instruction at [%l6].  The macro
UTRAP_EPILOG restores the state of %ccr, %g1 - %g7 which have may have
been corrupted by the utrap handler activites.

The following is a pseudo pseudocode description of how illtrap_hndlr works:

illtrap_hndlr{

   sampled_tick_register = gethrtime()

   save trapping context

   original instruction = ProbeTable[ trapping ILLTRAP instruction ]

   tid = thr_self();     /\* re-implemented to avoid linking libthread \*/

   cpuid = getcpuid()    /\* new in Solaris 9 \*/

   switch( original instruction )
   
     case CALL:
        emulate CALL instruction

     case BPr:
        emulate BPr instruction	
   	
     case FBfcc:
        emulate FBfcc instruction	
   
     case FBPfcc:
        emulate FBPfcc instruction	
   
     case Bicc:
        emulate Bicc instruction	

     case BPcc:
        emulate BPcc instruction	

     case JMPL:
        emulate JMPL instruction	
     
     default:   /\* getting to default is an error condition \*/
        
        unregister illtrap_hndlr
        restore trapping context
        restart trapping instruction, this time without the handler to
          catch it when it falls.  the program should die.
     
   /\* end of switch \*/

   pick PC and nPC values based on contents of FLG register which
   indicates if the branch was taken and if the delay slot instruction was
   annulled.

   if( !TraceOn )
     restore trapping context
     start executing at PC and nPC computed during instruction emulation
   fi
   
   lock(TrcLock,tid)
   			
   if( cTrcSlot > TRCMAX )
     write(btrc->fd,TraceBuffer,TRCBUFSZ)
     cTrcSlot = 0;
   fi

   store btfrec_t built in emulate_\* to TraceBuffer[cTrcSlot]

   cTrcSlot++

   unlock(TrcLock)

   restore trapping conext

   start executing at PC and nPC computed during instruction emulation

}

Of course there's alittle bit of hand waving, but that should give the
reader a basic idea of what's being done and in what order. 

#endif

Tuesday Jul 27, 2004

Anything You Can Do, D Can Do Better!

Not sour grapes honest! Jon Haslam saw my fsflush post and couldn't just bask in the glory that was old school hacking. He had to throw the new hotness at the problem. I, for one, welcome our new DTrace overlords!

-ejo

Adventures in Instruction Tracing

One thing that performance people like to do is actually know what instructions are executed by a CPU. Crazy, I know. You can disassemble the program to know what specific machine instructions compose the program, but you still don't know what instructions specifically are executed at run-time due to branching in the code induced by the input to the program.

So way back when, clever programmers found ways to instrument their code ( or more often, somebody else's code ) in order to determine what instructions a CPU was executing. There are lots of techniques for doing this, but I'd like to talk about the one I implemented and was disclosed in a short paper that was published in the May 2004 issue of Research Disclosure Journal. I'm happy to have been published, but the two page limit did sort of cramp my style.

Branch Tracing

Some computer architectures provide a spiffy feature, often referred to as "trap on branch". It is possible to build a pretty efficient instruction tracing tool using trap on branch since it interrupts the code under observation only at the end of a basic block, rather than on every instruction executed. On average a SPARCv9 basic block consists of 5 instrutions, so using trap on branch to trace would trap 80% less than trapping on every instruction. Wait for it. Unfortunately, SPARCv9 does not define a trap on branch. Oh if only.

Work With What You've Got

Despite the lack of branch on trap in SPARCv9, there are some other architectural goodies defined which the intrepid hacker could exploit. The first thing I looked at was the Instruction Breakpoint Register( Ok, technically not defined by SPARCv9, it's a USIII implementation defined extension ), which caused a breakpoint trap to fire when an instruction matched a user definable mask. Originally intended as a means of hot-patching broken instructions in the field, this feature never really worked as designed and was disabled in USIII+.

Another cool feature that doesn't get alot of love is the user trap handler facility, described in __sparc_utrap_install(2) and in the SPARC architecture manual (verson 9). It allows a user to define and install a custom trap handler for a specific (but diverse) set of exceptions. Now that is pretty cool!

You Got Your Debugger In My Instruction Tracer!

Skipping ahead alittle bit, I had the idea of combining an old debugger technique for breakpointing with SPARC user trap handlers to create a poor-man's trap on branch. Many debuggers implement breakpointing by inserting an illegal instruction at the specified address in the program and catching the illegal instruction trap generated when the client program executes it. The debugger then patches up the replaced instruction or emulates it, whichever seemed easier to do at the time. An instruction tracing tool could patch up the victim target application, replacing all it's control transfer instructions (CTIs) with ILLTRAP instructions ( SPARC instructions whose specific job is cause an illegal instruction trap ). Then when the target runs, every time it tries to branch it causes a trap which we can intercept with a user trap handler. The handler could write a record to a trace file and then fix up the broken branch and restart the target. Simplicity itself!

It's Never That Easy

Where to start. First, user trap handling for ILLTRAP was broken as detailed in bug 4650095. Second, there's the small matter of introducing ILLTRAPs into a program. Third, what to do with the CTIs that you've replaced with ILLTRAPs and all the trace output you will generate?

I discovered that an LD_AUDIT shared object would be an excellent way to introduce the trace instrumentation into a target application. The runtime linker audit interfaces are described adequately in rtld_audit(3ext) and in excellent detail in the Linker and Libraries Guide. The audit interfaces give the determined hacker a great deal of freedom to mess with an executable before control is passed to it from the runtime linker. Another advantage of this technique is the instrumenting code and the target code all share the same address space, so it's very easy to find and replace all the CTIs in the target code with ILLTRAPs.

A Commercial Plug And Then A Pause For Breath

As an engineer for Sun, I have found these three books to be invaluable:

  1. The SPARC Architecture Manual, version 9
  2. Solaris Interals
  3. Linker And Libraries Guide

The Linker and Library Guide in particular is an amazing piece of work, especially when you learn that it's written by the same guys who write the linker code. Those guys rock! All that's missing is a blog from Siezo!

Tomorrow, I'll post some condensed notes on the design of a user-land instruction tracing tool built on user traps called Btrace3, and why it ultimately was a flop.

-ejo

Friday Jul 23, 2004

fsflush: what is it doing?

Yesterday I was reading performance related bug reports, trolling for work, when I stumbled upon 5024143. I don't know alot about fsflush, so I did some reading and found that on10's fsflush is keeping some internal statistics concerning it's operation that it isn't sharing ( a kstat would be a good place to start ). So being a curious sort, I whipped up a libkvm(3LIB) consumer to expose fsflush's stats to userland. fsfstat is born! Caveat: on10 required.

Some Example Output

Here is sample output from fsfstat run on a 12-way 4800 with 28G of real memory, mostly idle ( Burninator! ).

trogdor:fsfstat -> sudo ./fsfstat 1
#3594320   SCANNED   PCT EXAMINED   LOCKED MODIFIED COALESCE RELEASES   TIME(ns)
15:08:44    119810   3.3      235        0        0        0        0     340320
15:08:45    119810   6.7      235        0        0        0        0     324880
15:08:46    119810  10.0      235        0        0        0        0     324800
15:08:47    119810  13.3      235        0        0        0        0     318000
15:08:48    119810  16.7      634        2        0        0        0     501840
15:08:49    119810  20.0    13802      773        6        7        0    4269440
15:08:50    119810  23.3      235        0        0        0        0     335520
15:08:51    119810  26.7      235        0        0        0        0     320800
15:08:52    119810  30.0      235        0        0        0        0     317840
15:08:53    119810  33.3      235        0        0        0        0     317360
15:08:54    119810  36.7      235        0        0        0        0     320640
15:08:55    119810  40.0      235        0        0        0        0     323680
15:08:56    119810  43.3      235        0        0        0        0     320400
15:08:57    119810  46.7      235        0        0        0        0     315440
15:08:58    119810  50.0      235        0        0        0        0     318800
15:08:59    119810  53.3      235        0        0        0        0     319440
15:09:00    119810  56.7     6367     3124        0        0        0    7615280
15:09:01    119810  60.0   110739    54692        0        6        0  118389680
15:09:03    119810  63.3    33125    14333      374       10        0   31190320
15:09:04    239620  70.0      470        0        0        0        0     663520
15:09:05    119810  73.3      235        0        0        0        0     324240
15:09:06    119810  76.7      235        0        0        0        0     320080
15:09:07    119810  80.0      235        0        0        0        0     317280
15:09:08    119810  83.3      235        0        0        0        0     319680
15:09:09    119810  86.7      235        0        0        0        0     318640
#3594320   SCANNED   PCT EXAMINED   LOCKED MODIFIED COALESCE RELEASES   TIME(ns)
15:09:10    119810  90.0     6572      773       18        0        0    3049280
15:09:11    119810  93.3   113238      617        4        0        0   20368080
15:09:12    119810  96.7   103493        0        0       21        0   17344240
15:09:13    119810 100.0      235        0        0        0        0     346960
15:09:14    119810   3.3      235        0        0        0        0     321040
15:09:15    119810   6.7      235        0        0        0        0     324240
15:09:16    119810  10.0      235        0        0        0        0     322960
15:09:17    119810  13.3      235        0        0        0        0     318400
15:09:18    119810  16.7     2139       96        0        0        0    1095680
15:09:19    119810  20.0    12249      679        6        0        0    3521120
15:09:20    119810  23.3      235        0        0        0        0     334400
15:09:21    119810  26.7      235        0        0        0        0     326960
15:09:22    119810  30.0      235        0        0        0        0     325840
15:09:23    119810  33.3      235        0        0        0        0     320240
15:09:24    119810  36.7      235        0        0        0        0     316720
15:09:25    119810  40.0      235        0        0        0        0     317040
15:09:26    119810  43.3      235        0        0        0        0     316640
15:09:27    119810  46.7      235        0        0        0        0     317120
15:09:28    119810  50.0      235        0        0        0        0     317040
15:09:29    119810  53.3      235        0        0        0        0     317600
15:09:30    119810  56.7    13010     6448        0        0        0   15070240
15:09:31    119810  60.0   115114    56225      109        0        0  121045120
15:09:32    119810  63.3    21994     9476      265        0        0   20595040
15:09:33    119810  66.7      235        0        0        0        0     339840
15:09:34    119810  70.0      235        0        0        0        0     319760

The first column is a timestamp, the second column is number of 8k pages scanned. The fourth column is number of pages examined, which can be a smaller number due to larger pages being composed of sets of smaller pages. Locked, modified, coalesce and release I'm still foggy on ( I need to read and understand fsflush.c some more ). The final column is the time in nanoseconds that fsflush ran over the previous interval. I've saved the third column for last because I thought I was being rather clever. The third column tracks how far thru memory fsflush has scanned since fsfstat has been observing it. In this way, we can tell when fsflush has made a complete sweep thru physical memory and is starting over again.

Some Interesting Observations

Using the locality information gained by the percentage progress thru memory, we can see a pattern emerging in the sample data. Starting at the 20%, 56% and 90% range, we see three second bursts of activity that recurr. I don't have a definite explanation for this activity, and I wouldn't mind hearing thoughts about what's going on here.

Anyway, today is a double-two-fer :) I'm just happy with how this hack turned out.

-ejo

Tuesday Jul 20, 2004

LD_PRELOAD, or how I added timestamps to the output of anything.

A pet peeve of mine forever has been the fact that the \*stat family of tools don't timestamp their output. Oh sure, iostat has been granted a timestamp but vmstat, mpstat and the rest were left out in the cold. Yes, you could write a wrapper script which time stamps the beginning and the end of a ream of vmstat output but you still have to work to figure out when an event occurred in the middle of a 11,111,111 line capture of performance data.

Given my need to explore the odd nooks and crannies for what is possible, I thought it might be fun to write an LD_PRELOAD shared object which interjected a timestamp at the beginning of a line of output. All I'd need to do is provide a new printf and I'd be done right? Wrong!

After some messing about, it turned out that write(2) and _write() were the culprits that needed to be overridden to accomplish the mission. It also turned out that I needed to do some line buffering since I wanted the timestamp to be nice and tidy in column 1 of the output. For those of you with access to the internal network, you can find the fruit of my labor here. Those of you who don't can contact me and if my manager approves I could make it more widely available.

An example using venerable vmstat:

schwa:stamp-1.1 -> vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr f0 s0 s1 s6   in   sy   cs us sy id
 0 0 0 1729512 1154760 0 116 1  0  0  0  0  0  0  1  0  634 1594  717  1  1 97
 0 0 0 1648560 1030352 0 21  0  0  0  0  0  0  0  0  0  633 1323  466  0  0 100
 0 0 0 1648560 1030352 0 12  0  0  0  0  0  0  0  0  0  585 1157  425  0  0 100
 0 0 0 1648560 1030352 0 15  0  0  0  0  0  0  0 52  0  756 1143  492  0 21 79
 0 0 0 1648560 1030352 0 15  0  0  0  0  0  0  0  0  0  722 1658  509  0  0 100
\^C
schwa:stamp-1.1 -> setenv LD_PRELOAD stamp.so.1
schwa:stamp-1.1 -> vmstat 1
14:21:28  kthr      memory            page            disk          faults      cpu
14:21:28  r b w   swap  free  re  mf pi po fr de sr f0 s0 s1 s6   in   sy   cs us sy id
14:21:28  0 0 0 1729504 1154744 0 116 1  0  0  0  0  0  0  1  0  634 1594  717  1  1 97
14:21:29  0 0 0 1648480 1030272 0 23  0  0  0  0  0  0  0  0  0  591 1211  415  0  0 99
14:21:30  0 0 0 1648480 1030272 0 15  0  0  0  0  0  0  0  0  0  577 1179  413  0  1 99
14:21:31  0 0 0 1648480 1030272 0 12  0  0  0  0  0  0  0  0  0  582 1240  438  0  0 100
14:21:32  0 0 0 1648480 1030272 0 15  0  0  0  0  0  0  0  0  0  584 1150  404  0  0 100
14:21:33  0 0 0 1648480 1030272 0 15  0  0  0  0  0  0  0  0  0  644 1543  492  1  0 99
14:21:34  0 0 0 1648480 1030272 0 15  0  0  0  0  0  0  0  0  0  677 1353  477  0  0 100
14:21:35  0 0 0 1648480 1030272 0 18  0  0  0  0  0  0  0 27  0  843 1644  552  0  1 99
14:21:36  0 0 0 1647280 1030032 0 325 0  0  0  0  0  0  0  0  0  777 2056  687  8  0 92
14:21:37  0 0 0 1648480 1030272 0 16  0  0  0  0  0  0  0  0  0  705 1422  526  2  0 98
14:21:38  0 0 0 1648480 1030272 0 40  0  0  0  0  0  0  0  0  0  772 2260  723  3  0 97
\^C

Admittedly, not horribly clever but it scratched an itch.

-ejo

About

ejo

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today
Bookmarks