ANTLR Grammar to Preprocess Source Files While Preserving WhiteSpace Formatting

2023-04-05 03:28 问答作者：

I am trying to preprocess my C++ source files by ANTLR. I would like to output an input file preserving all the whitespace formatting of the original source file while inserting some new source codes of my own at the appropriate locations.

I know preserving WS requires this lexer rule:

WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};

With this my parser rules would have a $text attribute containing all the hidden WS. But the problem is, for any parser rule, its $text attribute开发者_开发技巧 only include those input text starting from the position that matches the first token of the rule. For example, if this is my input (note the formatting WS before and in between the tokens):

line   1;     line   2;

And, if I have 2 separate parser rules matching

"line   1;"

and

"line   2;"

above separately but not the whole line:

"    line   1;     line   2;"

, then the leading WS and those WS in between "line 1" and "line 2" are lost (not accessible by any of my rules).

What should I do to preserve ALL THE WHITESPACEs while allowing my parser rules to determine when to add new codes at the appropriate locations?

EDIT

Let's say whenever my code contains a call to function(1) using 1 as the parameter but not something else, it adds an extraFunction() before it:

void myFunction() {
   function();
   function(1);
}

Becomes:

void myFunction() {
   function();
   extraFunction();
   function(1);
}

This preprocessed output should remain human readable as people would continue coding on it. For this simple example, text editor can handle it. But there are more complicated cases that justify the use of ANTLR.

Another solution, but maybe also not very practical (?): You can collect all Whitespaces backwards, something like this untested pseudocode:

grammar T;

@members {
    public printWhitespaceBetweenRules(Token start) {
        int index = start.getTokenIndex() - 1;

        while(index >= 0) {
            Token token = input.get(index);
            if(token.getChannel() != Token.HIDDEN_CHANNEL) break;
            System.out.print(token.getText());
            index--;
        }
    }
}

line1: 'line' '1' {printWhitespaceBetweenRules($start); };
line2: 'line' '2' {printWhitespaceBetweenRules($start); };
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};

But you would still need to change every rule.

I guess one solution is to keep the WS tokens in the same channel by removing the $channel = HIDDEN;. This will allow you to get access to the information of a WS token in your parser.

Here's another way to solve it (at least the example you posted).

So you want to replace ...function(1) with ...extraFunction();\nfunction(1), where the dots are indents, and \n a line break.

What you could do is match:

Function1
  :  Spaces 'function' Spaces '(' Spaces '1' Spaces ')' 
  ;

fragment Spaces
  :  (' ' | '\t')*
  ;

and replace that with the text it matches, but pre-pended with your extra method. However, the lexer will now complain when it stumbles upon input like:

'function()'

(without the 1 as a parameter)

or:

'    x...'

(indents not followed by the f from function)

So, you'll need to "branch out" in your Function1 rule and make sure you only replace the proper occurrence.

You also must take care of occurrences of function(1) inside string literals and comments, assuming you don't want them to be pre-pended with extraFunction();\n.

A little demo:

grammar T;

parse
  :  (t=. {System.out.print($t.text);})* EOF
  ;

Function1
  :  indent=Spaces 
     ( 'function' Spaces '(' Spaces ( '1' Spaces ')' {setText($indent.text + "extraFunction();\n" + $text);}
                                    | ~'1' // do nothing if something other than `1` occurs
                                    )
     | '"' ~('"' | '\r' | '\n')* '"'       // do nothing in case of a string literal
     | '/*' .* '*/'                        // do nothing in case of a multi-line comment
     | '//' ~('\r' | '\n')*                // do nothing in case of a single-line comment
     | ~'f'                                // do nothing in case of a char other than 'f' is seen
     )
  ;

OtherChar
  :  . // a "fall-through" rule: it will match anything if none of the above matched
  ;

fragment Spaces
  :  (' ' | '\t')* // fragment rules are only used inside other lexer rules
  ;

You can test it with the following class:

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source = 
        "/*                      \n" +
        "  function(1)           \n" +
        "*/                      \n" +
        "void myFunction() {     \n" +
        "   s = \"function(1)\"; \n" + 
        "   function();          \n" + 
        "   function(1);         \n" + 
        "}                       \n";
    System.out.println(source);
    System.out.println("---------------------------------");
    TLexer lexer = new TLexer(new ANTLRStringStream(source));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}

And if you run this Main class, you will see the following being printed to the console:

bart@hades:~/Programming/ANTLR/Demos/T$ java -cp antlr-3.3.jar org.antlr.Tool T.g
bart@hades:~/Programming/ANTLR/Demos/T$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp .:antlr-3.3.jar Main

/*                      
  function(1)           
*/                      
void myFunction() {     
   s = "function(1)"; 
   function();          
   function(1);         
}                       

---------------------------------
/*                      
  function(1)           
*/                      
void myFunction() {     
   s = "function(1)"; 
   function();          
   extraFunction();
   function(1);         
}

I'm sure it's not fool-proof (I did't account for char-literals, for one), but this could be a start to solve this, IMO.

继续阅读：antlr compiler-construction grammar parsing

ANTLR Grammar to Preprocess Source Files While Preserving WhiteSpace Formatting

EDIT

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？

EDIT

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生 新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？