
ANTLR Grammar to Preprocess Source Files While Preserving WhiteSpace Formatting

I am trying to preprocess my C++ source files with ANTLR. I would like the output to preserve all the whitespace formatting of the original source file while inserting some new source code of my own at the appropriate locations.

I know preserving WS requires this lexer rule:

WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};

With this, a parser rule's $text attribute contains all the hidden WS between its tokens. The problem is that, for any parser rule, $text only includes the input text starting from the position of the rule's first token. For example, if this is my input (note the formatting WS before and in between the tokens):

    line   1;     line   2;

And if I have 2 separate parser rules that match

"line   1;"

and

"line   2;"

separately, but no rule that matches the whole line:

"    line   1;     line   2;"

then the leading WS and the WS between "line   1;" and "line   2;" are lost (not accessible from any of my rules).

What should I do to preserve all the whitespace while still letting my parser rules decide where to add new code?

EDIT

Let's say that whenever my code contains a call to function(1), with 1 as the parameter and nothing else, a call to extraFunction() is added before it:

void myFunction() {
   function();
   function(1);
}

Becomes:

void myFunction() {
   function();
   extraFunction();
   function(1);
}

The preprocessed output should remain human-readable, as people will continue coding on it. For this simple example a text editor could handle it, but there are more complicated cases that justify the use of ANTLR.


Another solution, though maybe not a very practical one: you can collect all the whitespace backwards from a rule's first token, something like this untested sketch:

grammar T;

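@members {
    // Prints the hidden-channel tokens (whitespace) that directly precede
    // `start`, in their original source order.
    public void printWhitespaceBetweenRules(Token start) {
        int index = start.getTokenIndex() - 1;
        StringBuilder ws = new StringBuilder();

        while(index >= 0) {
            Token token = input.get(index);
            if(token.getChannel() != Token.HIDDEN_CHANNEL) break;
            ws.insert(0, token.getText()); // walking backwards, so prepend
            index--;
        }
        System.out.print(ws);
    }
}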

line1: 'line' '1' {printWhitespaceBetweenRules($start); };
line2: 'line' '2' {printWhitespaceBetweenRules($start); };
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};

But you would still need to change every rule.
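
The backward walk itself can be tried without ANTLR. The following is only a minimal sketch: Tok is a hypothetical stand-in for ANTLR's Token (just text plus a hidden flag), and hiddenBefore is an illustrative name, not part of any ANTLR API. It prepends each hidden token so that the collected whitespace comes out in source order.

```java
import java.util.ArrayList;
import java.util.List;

class HiddenBefore {
    // Hypothetical stand-in for an ANTLR token: text plus a "hidden channel" flag.
    static final class Tok {
        final String text;
        final boolean hidden;
        Tok(String text, boolean hidden) { this.text = text; this.hidden = hidden; }
    }

    // Returns the hidden text that directly precedes tokens.get(start), in source order.
    static String hiddenBefore(List<Tok> tokens, int start) {
        StringBuilder sb = new StringBuilder();
        for (int i = start - 1; i >= 0 && tokens.get(i).hidden; i--) {
            sb.insert(0, tokens.get(i).text); // walking backwards, so prepend
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Token list for the input "    line   1;     line"
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("    ", true));
        tokens.add(new Tok("line", false));
        tokens.add(new Tok("   ", true));
        tokens.add(new Tok("1", false));
        tokens.add(new Tok(";", false));
        tokens.add(new Tok("     ", true));
        tokens.add(new Tok("line", false));
        System.out.println("[" + hiddenBefore(tokens, 1) + "]"); // the 4 leading spaces
        System.out.println("[" + hiddenBefore(tokens, 6) + "]"); // the 5 spaces after ";"
    }
}
```

Each rule only recovers the whitespace in front of its own first token, which is why every rule would need the extra action.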


I guess one solution is to keep the WS tokens on the default channel by removing the {$channel=HIDDEN;} action. Your parser then sees the WS tokens like any other token and can read their text.
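
To illustrate what a token stream looks like when whitespace is not hidden, here is a small plain-Java sketch. The tokenize method is a hypothetical, deliberately crude splitter (runs of whitespace versus runs of anything else), not ANTLR's lexer. The point is only that nothing is discarded, so joining the tokens reproduces the input exactly.

```java
import java.util.ArrayList;
import java.util.List;

class KeepWhitespace {
    // Splits the input into alternating whitespace and non-whitespace runs.
    // No character is dropped, so concatenating the tokens restores the input.
    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            boolean ws = Character.isWhitespace(src.charAt(i));
            int j = i;
            while (j < src.length() && Character.isWhitespace(src.charAt(j)) == ws) {
                j++;
            }
            out.add(src.substring(i, j));
            i = j;
        }
        return out;
    }

    public static void main(String[] args) {
        String src = "    line   1;     line   2;";
        List<String> tokens = tokenize(src);
        System.out.println(tokens.size());                       // prints 8
        System.out.println(String.join("", tokens).equals(src)); // prints true
    }
}
```

The cost in an ANTLR grammar is that every parser rule then has to allow for WS explicitly wherever whitespace may appear between tokens.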


Here's another way to solve it (at least the example you posted).

So you want to replace ...function(1) with ...extraFunction();\n...function(1), where the dots are the indent and \n is a line break.

What you could do is match:

Function1
  :  Spaces 'function' Spaces '(' Spaces '1' Spaces ')' 
  ;

fragment Spaces
  :  (' ' | '\t')*
  ;

and replace that with the text it matches, but pre-pended with your extra method. However, the lexer will now complain when it stumbles upon input like:

'function()'

(without the 1 as a parameter)

or:

'    x...'

(indents not followed by the f from function)

So, you'll need to "branch out" in your Function1 rule and make sure you only replace the proper occurrence.

You also must take care of occurrences of function(1) inside string literals and comments, assuming you don't want them to be pre-pended with extraFunction();\n.

A little demo:

grammar T;

parse
  :  (t=. {System.out.print($t.text);})* EOF
  ;

Function1
  :  indent=Spaces 
     ( 'function' Spaces '(' Spaces ( '1' Spaces ')' {setText($indent.text + "extraFunction();\n" + $text);}
                                    | ~'1' // do nothing if something other than `1` occurs
                                    )
     | '"' ~('"' | '\r' | '\n')* '"'       // do nothing in case of a string literal
     | '/*' .* '*/'                        // do nothing in case of a multi-line comment
     | '//' ~('\r' | '\n')*                // do nothing in case of a single-line comment
     | ~'f'                                // do nothing if a char other than 'f' is seen
     )
  ;

OtherChar
  :  . // a "fall-through" rule: it will match anything if none of the above matched
  ;

fragment Spaces
  :  (' ' | '\t')* // fragment rules are only used inside other lexer rules
  ;

You can test it with the following class:

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source = 
        "/*                      \n" +
        "  function(1)           \n" +
        "*/                      \n" +
        "void myFunction() {     \n" +
        "   s = \"function(1)\"; \n" + 
        "   function();          \n" + 
        "   function(1);         \n" + 
        "}                       \n";
    System.out.println(source);
    System.out.println("---------------------------------");
    TLexer lexer = new TLexer(new ANTLRStringStream(source));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}

And if you run this Main class, you will see the following being printed to the console:

bart@hades:~/Programming/ANTLR/Demos/T$ java -cp antlr-3.3.jar org.antlr.Tool T.g
bart@hades:~/Programming/ANTLR/Demos/T$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp .:antlr-3.3.jar Main

/*                      
  function(1)           
*/                      
void myFunction() {     
   s = "function(1)"; 
   function();          
   function(1);         
}                       

---------------------------------
/*                      
  function(1)           
*/                      
void myFunction() {     
   s = "function(1)"; 
   function();          
   extraFunction();
   function(1);         
}                       

I'm sure it's not foolproof (I didn't account for char literals, for one), but this could be a start to solve this, IMO.
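
The same "match the things to skip first" idea can also be sketched with java.util.regex instead of an ANTLR lexer. This is a swapped-in technique for illustration, not the grammar above: the class name InsertExtraCall and the method rewrite are hypothetical, and unlike the lexer rule it assumes the call starts its line (anchored with ^). Comments and string literals are listed before the call in the alternation, so a function(1) inside them is consumed and copied unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InsertExtraCall {
    // Alternatives are tried left to right at each position: comments and
    // string literals come first, the call to rewrite comes last.
    private static final Pattern P = Pattern.compile(
          "/\\*.*?\\*/"                                      // multi-line comment
        + "|//[^\\r\\n]*"                                    // single-line comment
        + "|\"[^\"\\r\\n]*\""                                // string literal (escapes not handled)
        + "|^([ \\t]*)function[ \\t]*\\([ \\t]*1[ \\t]*\\)", // the call, indent in group 1
        Pattern.DOTALL | Pattern.MULTILINE);

    static String rewrite(String src) {
        Matcher m = P.matcher(src);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String indent = m.group(1); // null unless the call branch matched
            String text = (indent == null)
                ? m.group()                                   // comment or string: keep as is
                : indent + "extraFunction();\n" + m.group();  // prepend the extra call
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String source =
            "/*\n" +
            "  function(1)\n" +
            "*/\n" +
            "void myFunction() {\n" +
            "   s = \"function(1)\";\n" +
            "   function();\n" +
            "   function(1);\n" +
            "}\n";
        System.out.print(rewrite(source));
    }
}
```

Like the grammar, this sketch ignores char literals and escaped quotes inside strings, so it is a starting point rather than a complete preprocessor.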
