Fast interpreter using gcc's computed goto

Writing interpreter could be fun, but many people end up doing them in assembly, for performance reasons. Gcc provides neat "computed goto" feature, very useful in writing interpreters. Consider following simple bytecode, containing 3 instructions:
0x1 - takes following byte in insn stream and prints "first <byte>"
0x2 - prints "second"
0x3 - terminates program
We can write an interpreter for such a language in following manner:
#include <stdio.h>

void interp(char\* start) {
  void\* table[4] = {NULL, &&inst1, &&inst2, &&inst3};
#define DISPATCH(n) start += n; goto \*(table[\*start]);
  DISPATCH(0);
  do {    
    inst1:
      printf("first %d\\n", \*(start+1));
      DISPATCH(2);
    inst2:
      printf("second\\n");
      DISPATCH(1);
    inst3:
      break;
  } while (1);
}

int main() {
  char test[] = {0x1, 0x5, 0x1, 0x2, 0x2, 0x3};
  interp(test);
  return;
}
For x86 gcc -O4 -fomit-frame-pointer generates following machine code
 80483c6:       0f be 03                movsbl (%ebx),%eax
 80483c9:       8b 04 84                mov    (%esp,%eax,4),%eax
 80483cc:       ff e0                   jmp    \*%eax
 80483ce:       89 f6                   mov    %esi,%esi
 80483d0:       83 ec 08                sub    $0x8,%esp
 80483d3:       0f be 43 01             movsbl 0x1(%ebx),%eax
 80483d7:       50                      push   %eax
 80483d8:       68 0c 85 04 08          push   $0x804850c
 80483dd:       e8 0a ff ff ff          call   80482ec 
 80483e2:       83 c3 02                add    $0x2,%ebx
 80483e5:       0f be 03                movsbl (%ebx),%eax
 80483e8:       8b 44 84 10             mov    0x10(%esp,%eax,4),%eax
 80483ec:       83 c4 10                add    $0x10,%esp
 80483ef:       ff e0                   jmp    \*%eax
 80483f1:       8d 76 00                lea    0x0(%esi),%esi
 80483f4:       83 ec 0c                sub    $0xc,%esp
 80483f7:       68 16 85 04 08          push   $0x8048516
 80483fc:       e8 cb fe ff ff          call   80482cc 
 8048401:       43                      inc    %ebx
 8048402:       0f be 03                movsbl (%ebx),%eax
 8048405:       8b 44 84 10             mov    0x10(%esp,%eax,4),%eax
 8048409:       83 c4 10                add    $0x10,%esp
 804840c:       ff e0                   jmp    \*%eax
 804840e:       89 f6                   mov    %esi,%esi
 8048410:       83 c4 10                add    $0x10,%esp
 8048413:       5b                      pop    %ebx
 8048414:       5e                      pop    %esi
 8048415:       5f                      pop    %edi
 8048416:       c3                      ret    
 8048417:       90                      nop    
which is pretty effective.
Comments:

What's the use of "&&" ? I tried "&" which gives compiler error, as it is not able to resolve labels. So the only way is to declare labels before use (or prototype them, but how to prototype labels ?) or use "&&" operator. Does that instruct compiler to resolve labels later, hmm... ?

Posted by Ashish Shukla on June 07, 2007 at 10:52 AM MSD #

Goto considered harmful, computed goto considered ???? Is assembler or a computed goto really necessary for an efficient interpreter? Sun Studio (and I think GCC) can or will convert a more easily understood switch statement into a lookup table - effectively doing the same as your code.

Posted by Andrew on June 07, 2007 at 10:58 AM MSD #

&& stands for "get label address".

Posted by nike on June 07, 2007 at 11:02 AM MSD #

"computed goto really necessary for an efficient interpreter"? Yes, it is. No compiler I know of is smart enough to create dispatch table like in shown example. If you could show counterexample - you're welcome.

Posted by nike on June 07, 2007 at 11:02 AM MSD #

@nike: Thanks

Posted by Ashish Shukla on June 07, 2007 at 11:18 AM MSD #

The Sun Studio Compilers also support the computed goto in C. Last I saw they didn't for C++. My experiments with the C++ interpreter in hotspot shows that using the computed goto doubles the speed of the dispatch which is a large amount of the overhead in an interpreter. Part of that comes from being able to schedule the load from the table at the beginning of the bytecode so that by the time you branch the load is sure to complete.

Posted by fatcatair on June 07, 2007 at 03:02 PM MSD #

This trick is good. Before reading this post, I never read about computed gotos. Thanks.

Posted by DP on June 08, 2007 at 01:31 PM MSD #

Goto considered harmful was harmful. See Steele's Lambda: the ultimate goto (or something like that).

Posted by Nico on June 08, 2007 at 09:11 PM MSD #

Does Sun Studio also support gcc's local functions?

Posted by Nico on June 08, 2007 at 09:11 PM MSD #

"Goto considered harmful was harmful." - absolutely agree. Goto is just yet another tool, and Dejkstra wrote that in times of spaghetti-like Fortran programs. In proper place (exceptions, fast non-structural control flow change) it's very useful.

Posted by nike on June 09, 2007 at 02:58 AM MSD #

"No compiler I know of is smart enough to create dispatch table like in shown example." I've looked at this sort of thing before with respect to the core for Fuse; in that case, gcc (even without optimisation) reduces the main Z80 opcode dispatch (see z80/z80_ops.c in the source) to one indirect jump...
[...]
        movl    -1176(%ebp), %edx
        movl    .L333(,%edx,4), %eax
        jmp     \*%eax
        .section        .rodata
        .align 4
        .align 4
.L333:
        .long   .L41
        .long   .L78
[ ... ]
Or am I missing something here?

Posted by Philip Kendall on July 30, 2007 at 10:26 AM MSD #

I have the same problem as Philip. With G++ the compiler seems to go though incredible contortions to preserve a single indirect jump. Even going so far as to combine jumps from separate jump tables - with a series of direct jumps. This seems utterly bewildering behaviour as it specially breaks the performance gain having a jmp \*%eax for each interpreter leg.

The performance gain comes bout because the branch prediction logic has many separate jmp instructions - not one. So there is a far better chance of a good predict. Funnelling the execution pretty much guarantees a branch mispredict, and the dire penalty that comes with it.

Posted by Francis Vaughan on September 07, 2007 at 08:45 AM MSD #

Post a Comment:
  • HTML Syntax: NOT allowed
About

nike

Search

Categories
Archives
« July 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
  
       
Today