Friday Dec 05, 2008

See how your java source file is parsed

I've written a new visual tool showing how a Java file is parsed using the Java grammar from Compiler Grammar project.

For more details, see here .

Monday Oct 20, 2008

Generating java byte code by building AST trees

There are several ways to generate byte code to run on JVM. You can write a .java file then compile it with javac, or write ASM similar code directly then compile it with tools like JASML, or with tools like BCEL it is even possible to generate your own class in runtime. Aside from these, a quite interesting approach is to construct AST nodes representing the structure of the java code, then generate byte code from that. Actually, this is what the javac does.

When javac compiles .java files into .class files, there is a two step process involved.
First is parsing. Javac reads in the source code, parses it, builds a tree structure representing the source code.
Second is code generation. The code generator takes the tree, acts upon it, produces the .class file.

These two steps are quite independent from each other, which makes it possible to replace either of the two without affecting the other.
So, to achieve our goal, we can create a tree ourself, then hand it to the code generator to generate code.
This is actually not a difficult task. The javac is very decently implemented, with a very clear separation between the two steps.

The javac source is located on the OpenJDK langtools repository, which hosts a series of tools like javadoc, javah etc.. Go to this link for more detail about langtools. If you have Mercurial installed, check out the code from, or if not, you can download an archived copy from this link as well. After you get the source, try to build and run it to make sure it works properly. Refer to this link for how to do this.

After getting the code, let's try to make a very simple javac tree for this file []

public class Test{
    public static void main(String[] args){

The code we are interested in are located located at:

  • src/share/classes/com/sun/tools/javac/parser/, which contains the parser related code.
  • src/share/classes/com/sun/tools/javac/main/, which contains main controller that calls and integrates the two steps.
  • src/share/classes/com/sun/tools/javac/tree/, which contains all the tree related classes.

There are four major types of node for our file:

  • .java file, which may contain more than one classes, that is parsed as
  • class, that is parsed as
  • method, this is parsed as
  • And the System.out.println("Hello!") method call, which is a combination of and

These classes all represent complicated data structures, to create instance of these them you need an instance of

For example, to create an instance of JCClassDecl, call this method

    public JCClassDecl ClassDef(JCModifiers mods,
                                Name name,
                                List<JCTypeParameter> typarams,
                                JCTree extending,
                                List<JCExpression> implementing,
                                List<JCTree> defs)

The names of the parameters are quite self explanatory. And most of them are not really needed here, like the third one- the type parameters. We are not using generics, so just leave it to be a blank list.

Two things worth noting here are:

First, the List here is not of java.util.List. It's instance of

Second, the parameter name is a special class for storing identifiers string in the parser. Refer to the attached source file for how to constructing it. For now, just think it as a String representing the name of the class.

So, as you can see, it is pretty easy to builds a tree for our simple class. I won't dwell on how to make the rest of the tree nodes, refer to for details, which creates the tree matches

Then after building the tree, next thing to do is to make the code generator to generate code for us.

However, javac is not designed to let you do this and there is no easy way to achieve this without some inelegant hacking.

What we want to do is to add an -XDxtest=true option, so when you invoke the javac against any file, it will still verify the existence of the file, but not try to parse it, rather, it takes the tree we build, thinks it as the product of the parser, hands it to the code generator, then write the class file to the disk.

All this can be done by switching two lines of code in


Find the method    

protected JCCompilationUnit parse(JavaFileObject filename, CharSequence content)

Then look for the following two lines:

    Parser parser = parserFactory.newParser(content, keepComments(), genEndPos, lineDebugInfo);
    tree = parser.parseCompilationUni

That's the only place the compiler interacts with the parser. All we need to do is to trick the compiler to take our tree in the second line.
Replace the two lines with

    if (Options.instance(context).get("xtest") != null) {
          DummyTreeMaker maker = new DummyTreeMaker(parserFactory);
          tree = maker.getTree();
    } else {
          Parser parser = parserFactory.newParser(content, keepComments(), genEndPos, lineDebugInfo);
          tree = parser.parseCompilationUnit();

Then build the workspace, create a blank file called, try to compile it with the javac you just built, like

javac -XDxtest=true

Run the generated file and you'll see "Hello!" printed on the screen.
As you can see, javac takes a blank java file, but uses our tree to generate code.

This is a pretty simple example, and is pretty much how javac parses your source file, although the real process is a little more complicated because the javac parser also has to generate line-info, process javadoc, do error report etc..

Now imagine we have a grammar, like this one in the Java Language Specification, and we embed java code into the grammar calling different method in TreeMaker to build different AST nodes as the grammar recognize different constructs. Then we have an automated parser doing the same thing as the javac. This is how the Compiler Grammar project works -- with the help of Antlr, a automatically grammar-generated parser building the same kind of AST trees as javac does. Refer to my previous post on how to build and run the Compiler Grammar project. What's more interesting, once we have this grammar, it can be used more than building trees -- code formatting, code translation etc. all made possible, just use some imagination :)

Download the source files:  This need to be put under src/share/classes/com/sun/tools/javac/parser/  Replace this one with the file with the same name located under src/com/sun/tools/javac/main. This file may not compile in the future, as the langtool source code is changed very often. If so, just locate the file and make changes as decribed above.

Saturday Oct 11, 2008

How to build and run the Openjdk Compiler Grammar workspace

The Compiler Grammar workspace is a branch from the main Openjdk workspace. It integrates an Antlr parser which builds the same AST trees as javac does.  

To build the workspace, you need the following prerequisites:


    Antlr-3.1.1+ is needed. At the time this document is written, 3.1.1 is the latest version.
    Click this link to download the file directly. Or go to the Antlr website to get the latest version. You need the “Complete ANTLR 3.1.1 jar”, which contains the runtime, and the ANTLR tools.

    You must have Ant installed and properly set up so you can execute the command “ant” under command line.
    Go to the Ant website to download.

    Mercurial is needed to get the latest source code. Or you can download the source archive from here.
    You can download Mercurial from here.
    Refer to this page for how to install and setup Mercurial.

Once you have everything installed and downloaded, you can start to build the workspace by following the steps listed bellow. It assumes that you are working on a linux or similar OS, although windows users should be able to build the workspace same way.

1) Create a work directory, for example "/workspace".

2) Place the downloaded antlr-3.1.1.jar under a “lib” directory, for example "/home/lib"
    NOTE: At the time this document is written, antlr-3.1.1 is the latest version and the build file is configured to search for a file with the name “antlr-3.1.1.jar”. You will have to change the file name accordingly in the build.xml if new antlr version comes out.
    ALSO NOTE that the antlr-3.1.1.jar must be placed under a “lib” directory. You have to set the parent directory of that "lib" directory to the antlr.home property. In the case of “/home/lib”, the antlr.home has to be set to “/home”

3) Go to "/workspace", run

    hg clone

    this will download all the source into "/workspace/langtools"

    Or optionally, if you don't have Mercurial installed, you can download an archived version of the workspace from here.

4) Create a new file "/workspace/langtools/" with the following content = /absolute_path_to_your_jdk_home
   antlr.home = /absolute_path_to_your_antlr_home 

5) Go to /workspace/langtools/make, run


    This will build the workspace.

6) Run the new javac. Go to "/workspace/langtools/", run

    dist/bootstrap/bin/javac -XDantlrdebug=true -XDparser=antlr  /path_to_any_dot_java_file

    You should see “Parsing with antlr” printed out as a result of the -XDantlrdebug option. Also, try to set "-XDparser=default" to use the default javac parser.

Sunday Jun 15, 2008

Stack based vs Recursive descent

For expression like  "a && b || c == d", javac parses int a stack based way, with while loops, pushes and pops.

With the current antlr java grammar, this is parsed by making various sub-rules, which are reflected into varous method calls in the parser code:

        ID ( '&&' ID )\*

        AndExpression( '||' AndExpression)\*

        OrExpression '==' OrExpression

This handles operator precedence very well and is easy to read and code, but with each expression (even a single identifier) being parsed by making many method calls, speed is going to be a problem.  

Optimization is possible by writing a grammar like
    expression :
            ID (('&&' | '||' | '==') ID)\*

This doesn't handle precedence very well, but this way it is possible to parse the expression with stacks, thus not that many method calls, with more space requirement though :)
And I'm also expecting the new antlr feature could help here, which gives operators precedence by the sequence they are listed.
Only I'm not sure if it is still been translated into many sub-rules, or if it is implemented in a stack based way?


Yang Jiang


« April 2014