Software Secret Weapons™


 
JJTree Tutorial for Advanced Java Parsing
by Pavel Simakov on 2008-06-11 18:51:39 under Meta-Modeling, view comments
Bookmark and Share
 


The Problem

JJTree is a part of JavaCC is a parser/scanner generator for Java. JJTree is a preprocessor for JavaCC that inserts parse tree building actions at various places in the JavaCC source. To follow along you need to understand the core concepts of parsing. Also review basic JJTree documentation and samples provided in JavaCC distribution (version 4.0).

JJTree is magically powerful, but it is as complex. We used it quite successfully at my startup www.moola.com. After some the basic research into the grammar rules, lookaheads, node annotations and prototyping I felt quite comfortable with the tool. However, just recently when I had to use JJTree again I hit the same steep learning curve as if I have never seen JJTree before.

How to write a tutorial that gets you back in shape quickly without forcing the full relearning?

The Solution

Here I capture my notes in a specific form that I do not have to face that same learning curve again in the future. You can think my approach as layered improvement to a grammar that follows these steps:

  • get lexer
  • complete grammar
  • optimize produced AST
  • define custom node
  • define actions
  • write evaluator

I always start simple and need to go more complex - this is exactly how I will document it. In each example I start with a trivial portion of grammar and then add some more to it to force specific behavior. New code is always in green. Let's hope this save all of us the relearning.

Reorder tokens from more specific to less specific

The token in TOKEN section can be declared in any order. But you have to pay very close attention to the order because the matching of tokens starts from the top and down the list until first matching token is found. For example notice how "interface" or "exception" are defined before STRING_LITERAL. If we had defined "interface" after STRING_LITERAL "interface" would never get matched,  STRING_LITERAL would. 

TOKEN : {
	  <INTERFACE: "interface" >
	| < EXCEPTION: "exception" >
	| < ENUM: "enum" >
	| < STRUCT: "struct" >
	
	| < STRING_LITERAL: "'" (~["'","\n","\r"])* "'" >
	| < TERM: <LETTER> (<LETTER>|<DIGIT>)* >
	
	| < NUMBER: <INTEGER> | <FLOAT> > 
	| < INTEGER: ["0"-"9"] (["0"-"9"])* >
	| < FLOAT: (["0"-"9"])+ "." (["0"-"9"])* >
	| < DIGIT: ["0"-"9"] >
	| < LETTER: ["_","a"-"z","A"-"Z"] >
} 

The ordering is the same reason why we can't just use "interface" inline in the definition of productions. The STRING_LITERAL will always match first.

Remove some nodes from final AST

Some nodes do not have any special meaning and should be excluded from the final AST.  This is done by using #void like this:

void InterfaceDecl() #void : {
}{ 
	ExceptionClause()
	| 
	EnumClause()
	|
	StructClause()
	|
	MethodDecl()
}

Add action to a production

You will definitely need to add actions to the production for your parser to be useful. Here I capture the text of the current token (t.image) and put it into jjThis node that will resolve to my custom node class TypeDecl. You bind a variable "t" to a token using "="; the action itself is in curly braces right after the production and can refer to current token as "t" and current AST node as "jjtThis".

void TypeDecl() : {
	Token t;
}
{
	<VOID>
	|
	t=<TERM> { jjtThis.name = t.image; } ("[]")?}
}

Here I further set isArray property to true only if "[]" is found after the <TERM>:

void TypeDecl() : {
	Token t;
}
{
	<VOID>
	|
	t=<TERM> { jjtThis.name = t.image; } ("[]" { jjtThis.isArray = true; } )?}
}

Multiple actions inside one production rule

Just as we have seen earlier you can access values of multiple token in one production rule. Notice how I declare two separate tokens "t" and "n". Here:

void ConstDecl() : {
	Token t;
	Token n;
}
{
	LOOKAHEAD(2)
	
	t=<TERM> { jjtThis.name = t.image; } "=" n=<NUMBER> { jjtThis.value = Integer.valueOf(n.image); }
	|
	<TERM>
}

Lookaheads

There are certain points in complex  grammars that might not get parsed unambiguously using just one token look ahead. If you are writing high performance parser you might need to rewrite grammar. But if do not care about performance you can force lookahead for more that one symbol.

JJTree generator will give you a warning about ambiguities. Go the the rule it refers to and set lookahead of 2 or more like this:

void EnumDeclItem() : {}
{
	LOOKAHEAD(2)
	
	<TERM> "=" <NUMBER>
	|
	<TERM>
}

Node return values

It is possible to return nodes from the productions, just like function return values. Here I am declaring the ASTTypeDecl will be returned.

ASTTypeDecl TypeDecl() : {
	Token t;
}
{
	<VOID>
	|
	t=<TERM> { jjtThis.name = t.image; } ("[]" { jjtThis.isArray = true; } )?}

	{ return jjtThis; }
}

Once you start having a lot of expressions in one production it is better to group them together so return statement applies to all of them. The above example will actually result in a bug due to a fact that the return statement is attached to one branch of "|" production and not to both branches. We can easily fix the issue using parenthesis to force order of precendence:

ASTTypeDecl TypeDecl() : {
	Token t;
}
{
	(
		<VOID>
		|
		t=<TERM> { jjtThis.name = t.image; } ("[]" { jjtThis.isArray = true; } )?}
	)

	{ return jjtThis; }
}

Build abstract syntax tree as you go

After you have all production return values you can build AST tree on the fly while parsing. Just provide found overloaded add() methods in the ASTInterfaceDecl class and call them like this:

void InterfaceDecl() #void : {
	ASTExceptionClause ex;
	ASTEnumClause en;
	ASTStructClause st;
	ASTMethodDecl me;
}
	ex=ExceptionClause() { jjtThis.add(ex); }
	| 
	en=EnumClause() { jjtThis.add(en); }
	|
	st=StructClause() { jjtThis.add(st); }
	|
	me=MethodDecl() { jjtThis.add(me); }
}

Use <EOF>

Quite often you can get your grammar written and start celebration when you notice that part of the file is not being parsed... This happens because you did not tell the parser to read all content till the end of file and it feels free to stop parsing at will. Force parsing to reach end of file by demanding <EOF> token at the top most production:

void InterfaceDecl() #void : {
}{ 
	ExceptionClause()
	| 
	EnumClause()
	|
	StructClause()
	|
	MethodDecl()
	|
	<EOF>
}

The Final Word

JJTree works incredibly well. No excuse to regex parsing no more... Don't even try to convince me!

Drop me a line if you need help with JJTree - will be glad to share the experiences with you.

References

  1. The JavaCC FAQ by Theodore S. Norvell

Comments (13)

  • Comment by Zviki — June 15, 2008 @ 7:33 am

    Do you have any experience with ANTLR?
    Is so, can you compare?

  • Comment by Andrei — June 16, 2008 @ 12:11 am

    I have no experience with ANTLR, though I considered it in my recent investigation on this question.

    Most of similar software I have tried has one essential disadvantage – it does not provide automatic tools for semantic analysis (probably because the theory for semantic analysis could not be elaborated in such easy way as syntactic). But all of them make perfectly main task – parser generation.

    There are a couple of dozens of such software tools:
    http://catalog.compilertools.net/java.html (+ many other spread over internet)

    ANTLR was rejected because of high complexity of learning. There are many other tools that you can learn and use right away. Nevertheless ANTLR is the most supported and fast growing product.

    My criteria was: clear EBNF syntax and ability to apply semantic actions easily. Target compiler language is Java (obligatory) and C++ (not necessary but a plus). I choose 3 best tools:

    JavaCC
    + Semantic actions inside the grammar. Together with jjTree or JTB allows to generate AST and a set of classes with support of visitor design pattern. Simple implementation of LL(k) lookaheads and hence simple grammar conflicts resolution.
    - has the most complex EBNF grammar syntax.

    SableCC
    + very clear EBNF syntax. Generates AST and a full set of tree-walker classes using visitor design pattern.
    - non informative messages when compiler face conflicts in LALR grammar. Semantic actions are not allowed in grammar (they should be implemented inside generated classes).

    Coco/R
    + the most clear EBNF syntax. Attributed EBNF and comprehensible syntax of semantics inside the grammar. Produce only 2 easily readable and editable parser files.
    - no automatic AST generation (however Attributed EBNF allows to implement it). LL(1) conflicts oblige to write resolvers in target (java) language.

    Outline.
    When developing huge and real project (such as java translator with full syntactic and semantic analysis) then the defects of SableCC and Coco/R become more pronounced. Therefore JavaCC
    achieve a leading position …

    (this observation is only my opinion)

  • Comment by Zviki — June 16, 2008 @ 5:54 am

    Thank you for the elaborate response.

  • Comment by anaand — November 22, 2008 @ 9:45 am

    Hi
    Is there a way to traverse the generated parse tree
    and track changes to the value of variables in the tree

  • Comment by ali — April 15, 2009 @ 2:08 am

    hi iam ali.i am student of accp from aptech .i need a some notice for advance java .

  • Comment by sambhav — May 1, 2009 @ 2:37 pm

    hi
    nice work out there.
    i am trying to code a SQL parser using jjtree. is there any way i can get grammar for SQL , as it will be very tedious to write even the small part of it. any help would be appreciated
    Thanks

  • Comment by sambhav — May 7, 2009 @ 4:28 am

    can anyone elaborate how can i Get syntax tree similiar to one that i make on paper

  • Comment by juntao.qiu — June 5, 2009 @ 11:38 pm

    Thanks, it’s very useful. I’m try to write a doc about jjTree, so I can refer to this.

  • Comment by govind — March 12, 2010 @ 1:50 am

    Is there a way to visualize the jjtree like swing jtree

  • Comment by lyqin — April 17, 2010 @ 9:05 am

    hi!
    when I use {jjtThis.name=t.image;} as you say to capture the text of the current token (t.image) and put it into jjThis node,but after add this line,it will has a error in this line-”jjtn000.name=t.image;“ in another java file!

  • Comment by kaze — April 24, 2010 @ 3:02 pm

    @lyqin: He is using a custom node to build the tree (e.g. ASTTypeDecl) and he most probably has name as a public attribute.

    The only location that is available to the default node (SimpleNode.java) is the 'value' attribute which one can access using jjtSetValue and jjtGetValue

    @govind Not that I know of, but the dump() method may prove useful for trees that are not too large

    @sambav, take a look at the grammars provided for JavaCC at https://javacc.dev.java.net/servlets/ProjectDocumentList?folderID=110&expandFolder=110&folderID=0

    SQL statements are a subset of the PL/SQL one (look for selectstatement())

  • Comment by lyqin — April 24, 2010 @ 9:25 pm

    @kaze:Thanks,it had been resolved!

  • Comment by lyqin — May 2, 2010 @ 1:13 am

    Hi,
    I have printed out each statement,and would like to generate a xml of the source code's parent and child structure.if there any ideas or any help that could help me.Really i need a help!


Leave a comment


 
Dog Emotional 2010 Calendar Dog Emotional Mousepad Dog Fashionable 2010 Calendar Dog Fashionable Mousepad

Copyright © 2004-2010 by Pavel Simakov
any conclusions, recommendations, ideas, thoughts or the source code presented on this site are my own and do not reflect a official opinion of my current or past employers, partners or clients
SourceForge.net Logo