I am learning about the new NetBeans Lexer API
and asked Mila Metelka, the main engineer responsible for this API, a few questions about it. Below are summaries of his responses, together with some of my own research.
- What is a "lexer"? A lexer "lexes" (also known as "tokenizing" or "scanning") textual input and turns it into tokens. The textual input currently supported by the new Lexer API can come from java.lang.CharSequence and java.io.Reader. javax.swing.text.Document is also supported, by wrapping it in a CharSequence. Character sequences whose content can be modified (such as the Swing Document) benefit most from the main feature of the new Lexer API: incremental updates of the token list based on modifications made to the textual input.
In the context of the NetBeans Platform, the Lexer API is useful for lexing text in an editor so that groups of related characters can be recognized as 'tokens', which can then be bound to a color. For example, for the syntax highlighting (also known as "syntax coloring") to work in the screenshot below, the document had to be lexed so that everything before the colon is recognized as one token, everything after the colon (up to the linebreak) is recognized as another token, and the colon itself is recognized as a token too. Then colors are bound to the tokens, with this result:
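To make the manifest example concrete, here is a minimal standalone sketch of such a lexer in plain Java. This is not the NetBeans Lexer API; the class name, token IDs, and record type are invented purely for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Toy lexer for a single manifest-style "Name: Value" line: everything before
// the colon is one token, the colon is a token, and the rest is a third token.
public class ManifestLineLexer {

    public enum TokenId { NAME, COLON, VALUE }

    public record Token(TokenId id, String text) { }

    public static List<Token> lex(CharSequence line) {
        List<Token> tokens = new ArrayList<>();
        String text = line.toString();
        int colon = text.indexOf(':');
        if (colon < 0) {
            // No colon: the whole line is a single NAME token.
            tokens.add(new Token(TokenId.NAME, text));
            return tokens;
        }
        tokens.add(new Token(TokenId.NAME, text.substring(0, colon)));
        tokens.add(new Token(TokenId.COLON, ":"));
        if (colon + 1 < text.length()) {
            tokens.add(new Token(TokenId.VALUE, text.substring(colon + 1)));
        }
        return tokens;
    }
}
```

An editor module would then map each TokenId to a color; the lexer itself knows nothing about colors.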
Without NetBeans APIs that provide lexing support, the NetBeans editors would each need an implementation of a DrawLayer (or a HighlightLayer, in terms of the new Highlighting SPI) performing the syntax coloring in its own manner. In 6.0, with a backport to 5.5, the NetBeans APIs provide a new and improved approach to lexing. Basically, the new Lexer API makes it easy to provide incremental lexing, which is otherwise a lot of work.
- Where does the new Lexer API come from? The algorithm that updates the token list incrementally is based on "General Incremental Lexical Analysis", a PhD thesis written by Tim A. Wagner and Susan L. Graham, from the University of California, Berkeley. It is available online here: twagner-lexing.pdf.
The Lexer API does not expose the algorithm in any way, because the TokenListUpdater is hidden in the implementation. Although the lexer module has existed since 05/2002, its API was completely rewritten for NetBeans 6.0, with additional support for language embedding.
- What's the crux of the thesis in relation to the NetBeans Platform? Well, this paragraph in the thesis sums it up: "In some situations, such as in a software development environment (SDE), a series of character streams are repeatedly analyzed with few differences (relative to the total number of tokens in the program) from one application of lexical analysis to the next. Since tokens typically have very limited contextual dependencies, the resulting differences in the token stream are typically limited to the area immediately surrounding modification sites. In this setting it makes sense to retain the token stream as a persistent data structure and use it to decrease the time required for subsequent analyses. We will refer to this as incremental lexing and the tool that performs it as an incremental lexer. Our approach exploits existing technology for generating batch lexers by using the output these tools produce in conjunction with an incremental run-time system."
- Is the Lexer API implemented exactly as outlined in the thesis? No. The NetBeans Lexer API is a simplification of the algorithm proposed by the thesis. The thesis proposes a very generic approach; in NetBeans, only the relevant parts are implemented. For example, "lookbacks" (discussed in section 3.6 of the thesis) are calculated on the fly rather than stored. A lookback is the number of tokens to look back from the current token in order to work out the boundaries of subsequent tokens. Storing lookbacks would mean an extra value to maintain for every token, and typical languages do not need it, so NetBeans computes them on demand instead. Furthermore, the thesis caters for lexers that cannot restart at the beginning of every token, while the NetBeans Lexer API assumes each token is restartable. For further details, read the extensive comments at the start of the source of org.netbeans.lib.lexer.inc.TokenListUpdater.java.
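The retained-token-stream idea can be sketched in a few lines of plain Java. This is a deliberately simplified illustration of incremental relexing, not the NetBeans implementation, and the toy "language" (runs of letters versus runs of other characters) is invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalLexerSketch {

    public record Token(String text) { }

    // Batch lexing: scan the whole text from the beginning. The toy language
    // has two token kinds: runs of letters and runs of anything else.
    public static List<Token> lex(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int start = i;
            boolean word = Character.isLetter(text.charAt(i));
            while (i < text.length() && Character.isLetter(text.charAt(i)) == word) i++;
            tokens.add(new Token(text.substring(start, i)));
        }
        return tokens;
    }

    // Incremental update: relex only around the edit, then reconnect with the
    // untouched tail of the old token list as soon as token boundaries realign.
    public static List<Token> update(List<Token> old, String newText,
                                     int editOffset, int removed, int inserted) {
        int delta = inserted - removed;

        // Start offsets of the old tokens, plus the old end-of-text offset.
        List<Integer> oldStarts = new ArrayList<>();
        int off = 0;
        for (Token t : old) { oldStarts.add(off); off += t.text().length(); }
        oldStarts.add(off);

        // 1. Find the token around the edit and back up one token, because an
        //    edit can merge with the token that precedes it.
        int first = 0;
        while (first + 1 < old.size() && oldStarts.get(first + 1) < editOffset) first++;
        if (first > 0) first--;

        // 2. Relex forward from that token's start; once past the edited
        //    region, stop as soon as a new boundary matches an old one.
        List<Token> result = new ArrayList<>(old.subList(0, first));
        int i = oldStarts.get(first);
        while (i < newText.length()) {
            int start = i;
            boolean word = Character.isLetter(newText.charAt(i));
            while (i < newText.length() && Character.isLetter(newText.charAt(i)) == word) i++;
            result.add(new Token(newText.substring(start, i)));
            int oldBoundary = i - delta;   // where this boundary sat before the edit
            if (i >= editOffset + inserted && oldBoundary > editOffset + removed
                    && oldStarts.contains(oldBoundary)) {
                // Reuse the unmodified tail of the old token list unchanged.
                result.addAll(old.subList(oldStarts.indexOf(oldBoundary), old.size()));
                return result;
            }
        }
        return result;
    }
}
```

Only the tokens around the modification site are recreated; the tokens after the reconnection point are the very same instances as before, which is what makes retaining the token stream cheap.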
- What's so much better about this new API? Before the Lexer API, NetBeans used a different approach, based on the NetBeans API Syntax class. Compared to that approach, the Lexer API brings the following improvements:
- Permanent token instances. The original support was designed in the early days, when NetBeans engineers didn't want to create permanent token instances, because object creation was expensive on JDK 1.1. So there were no permanent, long-lived tokens, and no backward scanning was possible: to iterate back through the tokens, the infrastructure had to create batches of tokens and then walk backward through them, which made backward iteration very awkward. In contrast, the new Lexer API maintains a permanent token list, so you can iterate both forward and backward.
- Smaller token updates. With the new Lexer API, updates after document modifications are typically very small. The old support relexed at least the whole line. That is not a problem during regular typing, but it became problematic when you reformatted a chunk of code, or the whole source, because bookmarks, other annotations, and positions all had to be preserved across the relexing. Reformatting now produces many small incremental changes throughout the document, and when there are many such changes, the amount of update information matters. The Lexer API is based on the algorithm from Tim Wagner's dissertation, which includes a proof that it relexes only the minimal necessary number of tokens.
- Triggering. The old lexing support was triggered by each request: every request that needed tokens, such as painting or brace matching, triggered lexing. In contrast, the new Lexer API is triggered only once, after a modification, to fix the token list. This also helps with debugging your Lexer implementation. Originally, if you put any debugging code into your syntax code, it was triggered many, many times and you didn't know what was going on; there was a lot of debugging output, because there were many requests for lexical scanning. Now there is only one invocation of the Lexer API per modification, which makes problems much easier to find.
- Testing. There's now much better support for testing of lexical correctness. There are several ways of testing.
- TokenDumpTest. In this test, batch lexing is done. You have some text input that you can run the Lexer API on and check whether the created tokens match your expectations.
- ManifestLexerRandomTest. In this test, you specify characters (or strings), each with a certain probability of insertion. The test builds the global list of probabilities and randomly inserts/removes the data in a Swing text document. Each modification triggers the Lexer API to fix the token list, and the fixed token list is compared with another token list, created by taking the text of the whole document after the modification and batch-scanning it from beginning to end. The two token lists must match, otherwise the test fails.
So, you can reuse the testing infrastructure that is provided by the Lexer API, for your own lexer implementation.
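The shape of such a random-modification test can be sketched as follows. This is not the actual NetBeans test class: it uses a toy run-based lexer, and where the real test compares the incrementally updated token list against a batch-lexed one, this sketch checks the simpler invariant that the token texts always concatenate back to the exact document text:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomLexTestSketch {

    // Toy run-based lexer: runs of letters vs. runs of anything else.
    static List<String> lex(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int start = i;
            boolean word = Character.isLetter(text.charAt(i));
            while (i < text.length() && Character.isLetter(text.charAt(i)) == word) i++;
            tokens.add(text.substring(start, i));
        }
        return tokens;
    }

    public static boolean run(long seed, int steps) {
        Random rnd = new Random(seed);     // seeded, so failures are reproducible
        String alphabet = "ab :\n";        // characters eligible for random insertion
        StringBuilder doc = new StringBuilder("Main-Class: org.foo.Main");
        for (int s = 0; s < steps; s++) {
            if (doc.length() > 0 && rnd.nextBoolean()) {
                doc.deleteCharAt(rnd.nextInt(doc.length()));          // random removal
            } else {
                doc.insert(rnd.nextInt(doc.length() + 1),             // random insertion
                           alphabet.charAt(rnd.nextInt(alphabet.length())));
            }
            // Consistency check after every single modification.
            String joined = String.join("", lex(doc.toString()));
            if (!joined.equals(doc.toString())) return false;
        }
        return true;
    }
}
```

The point of the pattern is that hundreds of random edits exercise far more lexer states than hand-picked inputs would, while a fixed seed keeps any failure reproducible.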
- Embedded languages. The old support was very awkward in relation to language embedding. Token IDs had to have ordinals, and the ordinals of the embedded languages had to be shifted. The old support could only see the source as a flat list of tokens; switching between the syntaxes, and the other complexities of embedding, were the responsibility of the implementer. There was an infamous NetBeans class called JSPMultiSyntax: it was very large, and it was very easy to introduce an error that was very hard to find. So, language embedding was very difficult.
Now, lexers can be reused for embedded languages. You specify the top-level language and the tokens that can contain embeddings, either directly in your language hierarchy class (in a method called "embedding") or declaratively in the XML layer. You only specify the MIME type or the language, and the infrastructure does the rest: it creates the physical tokens and updates them. It should be easy, especially compared to how it was before.
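The underlying data-structure idea can be illustrated with a toy model in plain Java (this is not the NetBeans embedding API; the token IDs and the JSF-like sample text are invented): a token of the outer language can carry a nested token list produced by an embedded language's lexer, so the token list becomes a hierarchy instead of a flat list.

```java
import java.util.List;

public class EmbeddingSketch {

    // Each token may carry a nested token list for an embedded language.
    public record Token(String id, String text, List<Token> embedded) {
        Token(String id, String text) { this(id, text, List.of()); }
    }

    // Outer "language": tag and attribute-value tokens; the attribute value is
    // lexed again with an imaginary embedded expression language.
    public static List<Token> example() {
        List<Token> expression = List.of(
                new Token("IDENT", "user"),
                new Token("DOT", "."),
                new Token("IDENT", "name"));
        return List.of(
                new Token("TAG", "<h:outputText value="),
                new Token("ATTR_VALUE", "\"#{user.name}\"", expression),
                new Token("TAG", "/>"));
    }

    // Depth-first walk over the token hierarchy, as an editor would do when
    // coloring: embedded tokens take precedence over their host token.
    public static void dump(List<Token> tokens, int depth) {
        for (Token t : tokens) {
            System.out.println("  ".repeat(depth) + t.id() + " '" + t.text() + "'");
            dump(t.embedded(), depth + 1);
        }
    }
}
```

In the real API you never build this hierarchy by hand; you only declare which tokens can host an embedding, and the infrastructure runs the embedded lexer and maintains the nested tokens for you.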
- Simplification of hand coded lexers. No editor kit class is needed, because you reuse the LexerEditorKit, registered declaratively in the XML layer. No settings initializer is needed, because default font colors are made available declaratively through the XML layer. As a side effect of removing the editor kit, no Options window extension is needed either. (The effect of this is that the Manifest Editor will not be listed in the Advanced section of the Options window, but the colorings will be configurable in the main part of the Options window, which should be sufficient.) And because you're not using a settings initializer, you don't need to call it from a module installer either; so, unless you need a module installer for other reasons, you don't need one here. (Not having a module installer is a good thing, because even though they're necessary in various cases, they slow down the startup process.) For details on the changes to make, see the Lexer Migration Guide.
Clearly, there are far fewer classes, much more is done declaratively, and much more is reused from the API itself without requiring you to implement anything. As evidence, in the screenshots below, the first shows the files needed for the old way of providing syntax highlighting for manifest files, while the second shows how much less is needed with the new Lexer API:
- What are the steps for creating an implementation of the Lexer API? It depends on whether you want to use a lexer generator or hand code the implementation. For simple lexers, you may want to hand code the implementation yourself. You need to implement org.netbeans.api.lexer.TokenId, typically using an enum to define the token IDs. Next, you need to provide a lexical scanner, implemented by Lexer<T extends TokenId>, among other API classes, which together take the input, turn it into tokens, and bind them to colors. The colors come from an XML file, registered in the XML layer for the applicable MIME type. The XML file contains the default colors, mapped to the names of the token IDs. (The user can modify the default colors in the Options window.)
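The first step, defining the token IDs with an enum, can be sketched like this. To keep the snippet compilable on its own, a minimal local stand-in for org.netbeans.api.lexer.TokenId is declared (the real interface declares the same methods, with name() and ordinal() supplied automatically by java.lang.Enum); the category names are invented for illustration:

```java
// Local stand-in for org.netbeans.api.lexer.TokenId, so this sketch compiles
// without the NetBeans libraries on the classpath.
interface TokenId {
    String name();
    int ordinal();
    String primaryCategory();
}

// Token IDs for the manifest-file example: an enum gives each ID a stable
// name and ordinal for free.
public enum ManifestTokenId implements TokenId {

    NAME("name"),
    COLON("operator"),
    VALUE("value");

    private final String primaryCategory;

    ManifestTokenId(String primaryCategory) {
        this.primaryCategory = primaryCategory;
    }

    @Override
    public String primaryCategory() {
        // Categories group token IDs so several IDs can share default colors.
        return primaryCategory;
    }
}
```

The enum constant names are what the coloring XML file refers to, which is why stable, descriptive names matter here.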
- Will the new Lexer API, once stabilized, be compatible with the old support for syntax highlighting in NetBeans? No. The old support will be deprecated. By the time 6.0 comes out, all NetBeans sources will be rewritten to use the new Lexer API.
- Can I migrate my existing hand coded syntax highlighting implementation to create a hand coded implementation of the Lexer API instead? Yes! The Lexer Migration Guide provides a quick journey through the NetBeans Manifest File Syntax Highlighting Module Tutorial, telling you what to change where and, in some cases, why. If the migration guide proves useful, it will be expanded to include more detailed descriptive explanations.
- Where can I find out more about this API? The current homepage for this API is http://lexer.netbeans.org/, though it will probably move to the NetBeans Wiki soon. On the homepage, you'll find tips, a FAQ, a source description, lexer generator tools, and a variety of use cases. Everything you need to know about the new Lexer API should be available from there. In addition, make sure to read the Lexer API Javadoc.
- What do I need to get started? The modules that provide the Lexer support will be included out of the box in 6.0. For 5.5, you will (soon) be able to get the modules from the update center. The modules are as follows:
- Lexer module. The module that provides the infrastructure that contains the APIs and implementation.
- Lexer/editorbridge. The module that provides for the binding of the Lexer module to NetBeans editors and provides syntax highlighting.
- Lexer/nbbridge. The module that provides for finding and instantiating the languages registered declaratively in the XML layer.
The Lexer support is already available in trunk builds, although work continues to be done to improve it further. (See Mila's list of issues here, which includes Lexer-related issues.) However, everything currently works as described in the migration document listed above.
Thanks to Mila Metelka and Tor Norbye (who blogs about his Ruby Lexer, based on the new NetBeans Lexer API, here) for feedback on this blog entry. Any errors or misinterpretations are purely my own fault.