Software Secret Weapons™

 
From LEXYACC To AST
by Pavel Simakov on 2006-03-24 00:10:43 under Code Linguine, view comments
Bookmark and Share
 

About  •  Contact  •  Articles  •  Projects  •  My Links  •  My Bookshelf  •  Past And Present

The grammars in Code Linguine package were built using modification to Version 4.1 of TPLY (Turbo Pascal Lex and Yacc), a compiler generator for Turbo Pascal and compatibles.

Normally Lex/Yacc produces a list of tokens for an input file. However, we needed to produce an abstract syntax tree (AST) for the input file, not just a list. The key difference is that TPLY does not produce trees.

Our modifications to TPLY convert list of tokens into trees with the help of frames. A "frame" is a container for a group of tokens. Frames can be directly converted into AST trees. A simple example follows to illustrate conversion of token list into frames followed by conversion of frames into AST tree.

Here is the Pascal source file:


    unit Test;
    interface
    uses A,B;
    implementation
    end.

Here is list of non-whitespace tokens normally produced by TPLY:


    0=unit
    1=Test
    2=;
    3=interface
    4=uses
    5=A
    6=,
    7=B
    8=;
    9=implementation
    10=end
    11=.

Here is list of rule completion frames. Each frame specifies start and end token for the corresponding YACC rule followed by the rule id:


    [1, 1]53
    [0, 2]52
    [5, 5]20
    [5, 5]19
    [7, 7]20
    [5, 7]18
    [4, 8]17
    [4, 4]66
    [3, 4]54
    [9, 9]16
    [9, 9]68
    [9, 9]55
    [10, 10]56
    [0, 11]51
    [0, 0]2

Here is the token enclosure frames derived from the frames.


    0----|
    1]-53|-52-----------------------|
    2----|                          |
    3-------------------------|     |
    4----------------|        |-54  |
    5]-20]-19----|   |        |     |
    6            |-18|-17]-66-|     |-51]-2
    7]-20--------|   |              |
    8----------------|              |
    9]-16]-68                       |
   10]-56                           |
   11-------------------------------|

Our modifications were quite complex algorithmically, but overall did not slow down parsing. Parsing is still being done in a single pass in LLR(0) and LLR(1) style with no backtracking. It turns out that for any completed YACC rule around the token (X, Y) the frame must be formed not around tokens X and Y. It must be formed around all the tokens that are enclosed by any YACC rule that encloses X and Y. The frame containment must be resolved while parsing is in progress as YACC enclosures are being found in the source file.

The obtained AST trees can be saved as XML or in any other format as desired. XML format gives an advantage of XSLT query and ease of parsing. For the complete details and full source code please look inside Code Linguine.

No comments yet


Leave a comment


  Copyright © 2004-2014 by Pavel Simakov
any conclusions, recommendations, ideas, thoughts or the source code presented on this site are my own and do not reflect a official opinion of my current or past employers, partners or clients