Geertjan's Blog

  • August 27, 2007

How to Write a Groovy Editor (Part 4)

Geertjan Wielenga
Product Manager
This is going to be one of those 'thinking out loud' blog entries, written in Notepad as the thoughts occur (not unlike James Joyce, but then different) and then pasted into my blog. (Only that way does it make sense, in the sense of being helpful: it won't be watered down or summarized, but recorded as things really happen.) Basically, what I want to get done, finally, absolutely, and completely, is the Groovy tokens. Without them, no grammar rules and therefore no reliable Groovy editor. An ad-hoc, on-the-fly editor won't cut it. And I do, despite earlier comments on this, think that the more religiously one sticks to the original ANTLR, the better. So, the time's come to pull out that bottle of merlot and figure all this out for real.

The template that I'm starting with, after considering the javascript.nbs, is the java.nbs. Not sure why, just instinctively. Maybe the javascript.nbs is closer to what the groovy.nbs should be than the java.nbs, but it seems an additional step removed from the original language (i.e., Java), so let's take the existing Java tokens as our starting point and compare them to whatever comes up in the GroovyLexer.html, which defines the Groovy tokens. So: take the GroovyLexer.html, look at each of its tokens, map each one to the tokens in the java.nbs, and see what comes of it. Here's what comes of it, in the order that it comes:

The GroovyLexer.html begins with a long list of individual tokens (i.e., not tokens defining groups of characters, but just single characters). Sixty of them, in fact. For example, two of them are as follows:

mSEMI
    :   ';'
    ;
mDOLLAR
    :   '$'
    ;

Jim Clarke's draft conversion of the above (a draft because we didn't, and still don't, know the fully correct mapping between ANTLR and NBS, if there is such a thing) is as follows:

TOKEN:SEMI: ( ';'  )
TOKEN:DOLLAR: ( '$' )

That definitely resembles what I expect NBS to be, and, in fact, there are no errors when the above is pasted into an NBS file. Plus, as can be seen in yesterday's blog entry, I was able to assign a color to the 'DOLLAR' token (otherwise the dollar character would not have been red in yesterday's blog entry).

So, how do tokens such as the above appear in java.nbs, assuming (which they logically must) that they appear there at all? Like so:

TOKEN:operator: (
"==" | "!=" | "<=" | ">=" | "?" | ":" | "<" | ">" | "/" | "*" | "-" |
"+" | "." | "," | "=" | "(" | ")" | "[" | "]" | "!" | "@" | "#" | "$" |
"%" | "^" | "&" | "~" | "|" | "\\"
)

And, for good measure, here's the javascript.nbs version:

TOKEN:js_operator: (
"==" | "!=" | "<<" | ">>" | ">>>" | ">=" | "<=" | "++" | "--" |
"+=" | "-=" | "*=" | "/=" | "%=" | "<<=" | ">>=" | ">>>=" | "&=" |
"^=" | "|=" | "&&" | "||" | "===" | "!==" |
"?" | ":" | "<" | ">" | "*" | "-" | "+" | "." | "," | "=" |
"(" | ")" | "[" | "]" | "!" | "@" | "#" | "%" | "^" | "&" |
"~" | "|" | "\\"
)

Let's be rigorous and look at each individual token in GroovyLexer.html, to see whether it is covered by the grouped TOKENs (for want of a better word) above. Where two answers are given, the first is for java.nbs and the second is for javascript.nbs. Question mark - yes. Left parenthesis - yes. Right parenthesis - yes. Left bracket - yes. Right bracket - yes. Left curly - no. Right curly - no. Colon - yes. Comma - yes. Dot - yes. Assign - yes. Compare to (<=>) - no. (Cool! This is known as the spaceship operator.) Equal - yes. Exclamation mark - yes. Tilde - yes. Not equal - yes. Forward slash (i.e., DIV) - yes for java.nbs, but no for javascript.nbs. Forward slash + assign (DIV_ASSIGN) - no for java.nbs, yes for javascript.nbs. Plus - yes. Plus assign - no for java.nbs, yes for javascript.nbs. Plus plus - no for java.nbs, yes for javascript.nbs. Minus - yes. Minus assign - no for java.nbs, yes for javascript.nbs. Double minus - no for java.nbs, yes for javascript.nbs. Star - yes. Star assign - no for java.nbs, yes for javascript.nbs. Modulus - yes. Modulus assign - no for java.nbs, yes for javascript.nbs. Two right arrows - no for java.nbs, yes for javascript.nbs. Two right arrows assign - no, yes. Three right arrows - no, yes. Three right arrows assign - no, yes. One right arrow assign (>=) - yes. Greater than - yes. Two left arrows - no, yes. Two left arrows assign - no, yes. One left arrow assign (<=) - yes. Less than - yes. Hat - yes. Hat assign - no, yes. Or - yes. Or assign - no, yes. Double or - no, yes. Ampersand - yes. Ampersand assign - no, yes. Double ampersand - no, yes. Semicolon - no. Dollar - yes, no. Inclusive range - no. Exclusive range - no. Triple dot - no. Spread dot - no. Optional dot - no. Member pointer - no. Regex find - no. Regex match - no. Star star - no. Star star assign - no. Closure op - no. At - yes.

Right. Time for the next glass of merlot before drawing any conclusions here. OK. So, the first thing to say is that there seem to be a bunch of operators unique to Groovy (at least, when compared to Java and JavaScript):

<=>
..
..<
...
*.
?.
.&
=~
==~
**
**=
->

Here's some info on some of the above. And here's some more. In addition, the Groovy definition of tokens includes the semi-colon, the left curly brace, and the right curly brace. That's kind of interesting.

And what tokens, from the java.nbs and javascript.nbs, are NOT included in the individual characters that make up the tokens of the GroovyLexer.html?

\\
#
!==
===

So if we were to create a single TOKEN for all the single characters that are assigned to tokens in the GroovyLexer.html, it would be as follows:

TOKEN:OPERATOR: (
"==" | "!=" | "<<" | ">>" | ">>>" | ">=" | "<=" | "++" | "--" |
"+=" | "-=" | "*=" | "/=" | "%=" | "<<=" | ">>=" | ">>>=" | "&=" |
"^=" | "|=" | "&&" | "||" | "?" | ":" | "<" | ">" | "*" | "-" | "+" |
"." | "," | "=" | "(" | ")" | "[" | "]" | "!" | "@" | "%" | "^" |
"&" | "~" | "|" | "<=>" | ".." | "..<" | "..." | "*." | "?." | ".&" |
"=~" | "==~" | "**" | "**=" | "->" | "{" | "}" | ";"
)
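One thing a combined token like this raises is overlap: ">" is a prefix of ">>", which is a prefix of ">>>=", and so on. I don't know how Schliemann resolves that internally, so the following is only a Python model of the issue, using a hand-picked subset of the operators. In a plain regex alternation the first alternative that matches wins, so the longer operators have to be tried first.

```python
import re

# A subset of the combined OPERATOR token, chosen because the entries overlap.
OPERATORS = [">", ">=", ">>", ">>=", ">>>", ">>>=", "=", "==", "==~", "=~",
             "<", "<=", "<=>", ".", "..", "..<", "...", "*", "**", "**=", "*."]

# Naive alternation: alternatives are tried in list order, so ">" beats ">>>=".
NAIVE_RE = re.compile("|".join(re.escape(op) for op in OPERATORS))

# Longest-first alternation: sort by length so longer operators are tried first.
LONGEST_RE = re.compile("|".join(
    re.escape(op) for op in sorted(OPERATORS, key=len, reverse=True)))


def split_operators(text, pattern=LONGEST_RE):
    """Split a run of operator characters into individual operators."""
    return pattern.findall(text)


print(split_operators(">>>=", NAIVE_RE))  # falls apart into single characters
print(split_operators(">>>="))            # stays in one piece
```

If the editor ever colors "..<" as three separate operators, this kind of ordering would be the first thing I'd check.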

I'm not sure if we'd want to make the above token, because possibly we want some of those characters to have unique colors (maybe the "->", for example, should have its own distinct color). However, the above is the token for Groovy that most closely matches the 'operator' token for Java and JavaScript. Also interesting to see that the result is closer to JavaScript than Java. However, that could simply mean that the java.nbs isn't very complete, which is even more true when you look at this document, which seems to imply that there are more 1:1 operator mappings between Groovy and Java than one would assume from the above.

Next, the first multi-character token in GroovyLexer.html is this one:

mWS
    :   (   ' '
        |   '\t'
        |   '\f'
        |   '\\' mONE_NL
        )+
    ;

Jim Clarke converted that to this for NBS:

TOKEN:WS: ( ( ' '
| '\t'
| '\f'
| '\\' ONE_NL )+
)

Here, we're dealing with white space, hence the capitalized WS, i.e., standing for "White Space". In the javascript.nbs, we find the following:

TOKEN:js_whitespace: ([" " "\t" "\n" "\r"]+)

The same is true for the java.nbs:

TOKEN:whitespace: ([" " "\t" "\n" "\r"]+)

Let's adopt this for our groovy.nbs, because even sticking in the converted WS produces errors. So now we have this:

TOKEN:WS: ([" " "\t" "\n" "\r"]+)

Look, white space is white space. Let's not get hung up about it. I could be very wrong, but I doubt Groovy is all that different when it comes to white space. In addition to white space, there's a new line token in the GroovyLexer.html. In the conversion, it is like this:

TOKEN:ONE_NL: ( ( "\r\n" | '\r'
| '\n' )
)

So, we'll turn it into this:

TOKEN:ONE_NL: ("\r\n" | "\r" | "\n")

Next, as explained in this wonderfully commented document, there is a token called NLS, which "groups any number of newlines (with comments and whitespace) into a single token". The conversion gives me this:

TOKEN:NLS: ( ONE_NL 
( ( ONE_NL
| WS
| SL_COMMENT
| ML_COMMENT )+ )
)

Above, the arguments refer to 'one new line', 'white space', 'single line comment', and 'multiline comment', respectively. Let's not get too excited about any of these. Simply:

TOKEN:ML_COMMENT: ("/*" - "*/")
TOKEN:SL_COMMENT: ("//"[^"\n""\r""\uffff"]*)
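For anyone who thinks in regular expressions, here is how I read the two comment tokens, as a rough Python approximation. It assumes that the minus sign in the ML_COMMENT token means 'everything from the opener up to the first closer'; that is my reading, not something I've verified against the Schliemann implementation.

```python
import re

# ML_COMMENT: from "/*" up to the first "*/", line breaks included.
ML_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

# SL_COMMENT: "//" followed by anything except newline, carriage return, or \uffff.
SL_COMMENT = re.compile(r"//[^\n\r\uffff]*")

source = "a = 1 // set a\n/* first\n   block */ b = 2 /* second */"

print(SL_COMMENT.findall(source))
print(ML_COMMENT.findall(source))
```

The non-greedy `.*?` is what stops the first block comment at its own closer instead of running on to the last one in the file.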

So, based on the converted token (in case you've forgotten by now, the NLS definition above is Jim Clarke's conversion from the original GroovyLexer.html), we should now have this:

TOKEN:NLS: ( ONE_NL ( ( ONE_NL | WS | SL_COMMENT | ML_COMMENT  )+  )  )

It seems, though, that tokens in NBS cannot refer to other tokens. However, if we turn it into a grammar rule, instead of a token, we might end up doing ourselves a favor. To do so, we need to write the statement like this:

nls=(<ONE_NL> ( ( <ONE_NL> | <WS> | <SL_COMMENT> | <ML_COMMENT> )+  ) );

Notice that the grammar rule ends with a semicolon, that each reference to a token is surrounded by angle brackets, that there is no 'TOKEN' text in front of it, and that an equals sign is used instead of a colon.
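To convince myself of what the nls rule would do to a token stream, here is a small Python sketch. It is not a parser for NBS; it just assumes greedy matching and works on a list of token type names, and the function name group_nls is made up for the illustration.

```python
# Token types that may follow the first ONE_NL inside an nls group.
FILLER = {"ONE_NL", "WS", "SL_COMMENT", "ML_COMMENT"}


def group_nls(token_types):
    """Collapse ONE_NL followed by one or more filler tokens into a single 'nls'."""
    result = []
    i = 0
    while i < len(token_types):
        if token_types[i] == "ONE_NL":
            j = i + 1
            # Greedily consume every filler token after the first newline.
            while j < len(token_types) and token_types[j] in FILLER:
                j += 1
            if j > i + 1:  # the rule's "+" requires at least one filler token
                result.append("nls")
                i = j
                continue
        result.append(token_types[i])
        i += 1
    return result


stream = ["IDENTIFIER", "ONE_NL", "WS", "SL_COMMENT", "ONE_NL", "IDENTIFIER"]
print(group_nls(stream))
```

Note that, as written, the rule does not match a lone newline: the "+" demands at least one more newline, whitespace, or comment after the first ONE_NL.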

The absence of square brackets is a bit disconcerting at this point. Somehow there are no square brackets, which imply optional statements, in the original GroovyLexer.html. Hmmm. Ah, well. Problem for later.

Looking back, I find the clunkiness of this interesting:

TOKEN:ML_COMMENT: ( "/*"
( '*'
| ONE_NL
| ( '*'
| '\n'
| '\r'
| '\uffff' ) )*
"*/" )

Why couldn't that have been expressed like this:

TOKEN:ML_COMMENT: ("/*" - "*/")

That's how Schliemann does it. Everything from the "/*" to the "*/" is a comment and, I guess because it can skip whitespace, it swallows up everything between the beginning point and the end. No need for all those line breaks and uffffs... :-) OK, so what we have right now has been hard won, that's for sure. No time like the present, then, to try it out a little by adding a declaration of a code fold and a declaration of a color to our NBS file:

FOLD:ML_COMMENT:"comment"
COLOR:SL_COMMENT: {
foreground_color: "orange";
font_type:"bold";
}

And now, you should see this when you run your editor, assuming you type the same text as me:

That's encouraging. When you collapse one of the folds, you should see the word 'comment' as the fold's label. If you want a different label there, change the fold's definition above. Also, hovering the mouse over a collapsed fold presents you with a popup showing the content of the fold, just like any other code fold in the IDE.

Let's move on now... Next up, there's a TOKEN for the script header, which is like this in the original GroovyLexer.html:

// Script-header comments
SH_COMMENT
options {
paraphrase="a script header";
}
: {getLine() == 1 && getColumn() == 1}? "#!"
(
options { greedy = true; }:
// '\uffff' means the EOF character.
// This will fix the issue GROOVY-766 (infinite loop).
~('\n'|'\r'|'\uffff')
)*
{ if (!whitespaceIncluded) $setType(Token.SKIP); }
//ONE_NL //Never a significant newline, but might as well separate it.
;

From the above, I'm guessing the tilde symbol is the equivalent of the hat symbol in regular expressions, for indicating the characters that should be excluded, so that we should end up like this:

TOKEN:SH_COMMENT: ("#!"[^"\n""\r""\uffff"]*)
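One thing the ANTLR rule has that a plain token doesn't is the predicate: the "#!" only counts on line 1, column 1. Here is a hedged Python sketch of that behavior, with the tilde read as a negated character class. The helper name script_header is invented for the example.

```python
import re

# "#!" followed by anything except \n, \r, or \uffff (the EOF marker in the ANTLR rule).
SH_COMMENT = re.compile(r"#![^\n\r\uffff]*")


def script_header(source):
    """Return the script header if the source starts with one, else None."""
    # match() only tries position 0, which stands in for line 1, column 1.
    m = SH_COMMENT.match(source)
    return m.group(0) if m else None


print(script_header("#!/usr/bin/env groovy\nprintln 'hi'"))
print(script_header("println 1\n#!not a header"))
```

Whether NBS can express the 'only at the very start of the file' condition is something I haven't figured out yet.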

For the rest, there's a bunch of string tokens and regular expression tokens, as well as an identifier token. Instead of those, I'll simply use these two, until everything blows up in my face:

TOKEN:IDENTIFIER: (
["a"-"z" "A"-"Z"]
[^" " "\t" "\n" "\r" "?" ":" "<" ">" "/" "*" "-" "+" "." "," "=" "{" "}"
"(" ")" "[" "]" "!" "@" "#" "$" "%" "^" "&" "~" "|" "\\" ";"
]*
)
TOKEN:STRING: (
"\""
[^ "\"" "\n" "\r"] *
"\""
)

So the first is simply for texts that do not start with quotation marks, while the second must begin with quotation marks. Let's test that, as follows:
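To see what these two tokens would and wouldn't accept, here is a Python regex rendering of them. It is an approximation under my own reading of the NBS character classes, so treat the results as hints rather than as a statement about Schliemann's behavior.

```python
import re

# IDENTIFIER: one ASCII letter, then any run of characters that are not
# whitespace and not one of the listed operator characters.
IDENTIFIER = re.compile(r"[a-zA-Z][^ \t\n\r?:<>/*\-+.,={}()\[\]!@#$%^&~|\\;]*")

# STRING: a double quote, any run without double quotes or line breaks,
# and a closing double quote.
STRING = re.compile(r'"[^"\n\r]*"')


def first_match(pattern, text):
    """Return what the pattern matches at the start of text, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


print(first_match(IDENTIFIER, "println(x)"))
print(first_match(IDENTIFIER, "_private"))
print(first_match(STRING, '"hello world" + x'))
```

Two things stand out in this reading: an identifier can't start with an underscore, and because the double quote is not in the excluded set, something like it"s would count as a single identifier.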

COLOR:STRING: {
foreground_color: "magenta";
font_type:"bold";
}
COLOR:IDENTIFIER: {
foreground_color: "red";
font_type:"bold";
}

And that gives us this color scheme:

Since I didn't actually put the Groovy OPERATOR token into the NBS file, the exclamation mark is unrecognized and is therefore underlined in squiggly red. Obviously it doesn't have a color either: not only did I not assign one, there is also no declared token to assign it to.

This time, we'll include the Groovy operator and assign it a color:

COLOR:OPERATOR: {
foreground_color: "green";
font_type:"bold";
}

Plus, we'll include numbers, loosely interpreted from the GroovyLexer.html:

TOKEN:NUMBER: (["0"-"9"] ["0"-"9" "."]*)
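'Loosely interpreted' deserves a quick check. In a Python regex rendering of this token (an approximation, and the helper name number_prefix is mine), the definition happily accepts more than real numbers:

```python
import re

# NUMBER: one digit, then any run of digits and dots.
NUMBER = re.compile(r"[0-9][0-9.]*")


def number_prefix(text):
    """Return the leading NUMBER match in text, or None if there is none."""
    m = NUMBER.match(text)
    return m.group(0) if m else None


for sample in ["42", "3.14)", "1.2.3", "1..5", ".5"]:
    print(sample, "->", number_prefix(sample))
```

In this model, at least, a Groovy range such as 1..5 is swallowed whole as one NUMBER, so the ".." operator never gets a look in. That's worth remembering when the grammar rules arrive.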

And that's it. We'll leave it there. Clearly, we've abandoned all claims to being religious now. We have a rather truncated interpretation of the original GroovyLexer.html, but that's okay. So long as we're aware of what we've done, we'll know that when things are going wrong, we need to go back to our GroovyLexer.html and see if the problem might be because of our flexible interpretation. Just need to remember that when the grammar rules are added, a variety of tokens will need to be replaced by the more relaxed ones discussed here. Also, it's probably better to keep it simple the first time round anyway. Now, we'll assign colors to our operators and our numbers:

COLOR:OPERATOR: {
foreground_color: "green";
font_type:"bold";
}
COLOR:NUMBER: {
foreground_color: "black";
font_type:"bold";
}

And now, when we run the editor again, we see that our operators are green and our numbers are black:

So, now we've pretty much negotiated our way to a set of tokens to use for Groovy. This is all of them:

TOKEN:OPERATOR: (
"==" | "!=" | "<<" | ">>" | ">>>" | ">=" | "<=" | "++" | "--" |
"+=" | "-=" | "*=" | "/=" | "%=" | "<<=" | ">>=" | ">>>=" | "&=" |
"^=" | "|=" | "&&" | "||" |
"?" | ":" | "<" | ">" | "*" | "-" | "+" | "." | "," | "=" |
"(" | ")" | "[" | "]" | "!" | "@" | "%" | "^" | "&" |
"~" | "|" | "<=>" | ".." | "..<" | "..." | "*." | "?." | ".&" |
"=~" | "==~" | "**" | "**=" | "->" | "{" | "}" | ";"
)
TOKEN:WS: (" " | "\t" | "\\f" | "\\"+)
TOKEN:ONE_NL: ("\r\n" | "\r" | "\n")
TOKEN:ML_COMMENT: ("/*" - "*/")
TOKEN:SL_COMMENT: ("//"[^"\n""\r""\uffff"]*)
TOKEN:SH_COMMENT: ("#!"[^"\n""\r""\uffff"]*)
TOKEN:STRING: (
"\""
[^ "\"" "\n" "\r"] *
"\""
)
TOKEN:IDENTIFIER: (
["a"-"z" "A"-"Z"]
[^" " "\t" "\n" "\r" "?" ":" "<" ">" "/" "*" "-" "+" "." "," "=" "{" "}"
"(" ")" "[" "]" "!" "@" "#" "$" "%" "^" "&" "~" "|" "\\" ";"
]*
)
TOKEN:NUMBER: (["0"-"9"] ["0"-"9" "."]*)

Admittedly, not all of it is crystal clear to me. Parts of it come from the original GroovyLexer.html, other bits come from the java.nbs and/or the javascript.nbs, exactly as discussed above. What's especially cool is that we've been really rigorous with the operators, so at least the nice shiny Groovy operators will be included from the start. As for our strings and other tokens, they're not brilliant, but they'll do for the first cut (and will potentially be sufficient right the way through). So, that's tokens. (Above, they're all capitalized, but that's not a requirement at all, by the way.) Don't know if any of this is useful to anyone, but I imagine that the considerations I went through above are going to be common to anyone creating an editor on the Schliemann framework. Time for a celebratory glass of merlot. And next time... grammar rules!
