开发者

Characters Matching Multiple Lexer Rules in ANTLR

I've defined multiple lexer rules that potentially matches the same character sequence. For example:

LBRACE:  '{' ;
RBRACE: '}' ;
LPARENT: '(' ;
RPARENT: ')' ;
LBRACKET: '[' ;
RBRACKET: ']' ;
SEMICOLON: ';' ;
ASTERISK: '*'  ;
AMPERSAND: '&'  ;

IGNORED_SYMBOLS:   ('!' | '#' | '%' | '^' | '-' | '+' | '=' | 
                    '\\'| '|' | ':' | '"' | '\''| '<' | '>' | ',' | '.' |'?' | '/'  ) ;


// WS comments*****************************
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
ML_COMMENT: '/*' .* '*/' {$channel=HIDDEN;};
SL_COMMENT: '//' .* '\r'? '\n' {$channel=HIDDEN;};

STRING_LITERAL:  '"' (STR_ESC | ~( '"' ))* '"'; 
fragment STR_ESC:  '\\'  '"'  ; 

CHAR_LITERAL :  '\'' (CH_ESC | ~( '\'' )) '\''  ;  
fragment CH_ESC :  '\\' '\''; 

My IGNORED_SYMBOLS and ASTERISK match /, " and * respectively. Since they're placed (unintentionally) before my comment and string literal rules which also match /* and ", I expect the comment and string literal rules would be disabled (unintentionally) . But surprisely, the ML_COMMENT, SL_COMMENT and STRING_LITERAL rules still work correctly.

This is somewhat confusing. Isn't that a /, whether it is part of /* or just a standalone /, will always be matched and consumed by the IGNORED_SYMBOLS first before it has any chance to be matched by the ML_COMMENT?

What is the way the lexer decides which rules to apply if 开发者_开发百科the characters match more than one rule?


What is the way the lexer decides which rules to apply if the characters match more than one rule?

Lexer rules are matched from top to bottom. In case two (or more) rules match the same number of characters, the one that is defined first has precedence over the one(s) later defined in the grammar. In case a rule matches N number of characters and a later rule matches the same N characters plus 1 or more characters, then the later rule is matched (greedy match).

Take the following rules for example:

DO : 'do';
ID : 'a'..'z'+;

The input "do" would obviously be matched by the rule DO.

And input like: "done" would be greedily matched by ID. It is not tokenized as the 2 tokens: [DO:"do"] followed by [ID:"ne"].

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜