How do I distinguish a very keyword-like token from a keyword using ANTLR?
I am having trouble distinguishing a keyword from a non-keyword when a grammar allows the non-keyword to have a similar "look" to the keyword.
Here's the grammar:
grammar Query;
options {
output = AST;
backtrack = true;
}
tokens {
DefaultBooleanNode;
}
// Parser
startExpression : expression EOF ;
expression : withinExpression ;
withinExpression
: defaultBooleanExpression
(WSLASH^ NUMBER defaultBooleanExpression)*
;
defaultBooleanExpression
: (queryFragment -> queryFragment)
(e=queryFragment -> ^(DefaultBooleanNode $defaultBooleanExpression $e))*
;
queryFragment : unquotedQuery ;
unquotedQuery : UNQUOTED | NUMBER ;
// Lexer
WSLASH : ('W'|'w') '/';
NUMBER : Digit+ ('.' Digit+)? ;
UNQUOTED : UnquotedStartChar UnquotedChar* ;
fragment UnquotedStartChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '/' | '(' | ')' | '[' | ']'
| '{' | '}' | '-' | '+' | '~' | '&' | '|'
| '!' | '^' | '?' | '*' )
;
fragment UnquotedChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '(' | ')' | '[' | ']' | '{'
| '}' | '~' | '&' | '|' | '!' | '^' | '?'
| '*' )
;
fragment EscapeSequence
: '\\'
( 'u' HexDigit HexDigit HexDigit HexDigit
| ~( 'u' )
)
;
fragment Digit : ('0'..'9') ;
fragment HexDigit : ('0'..'9' | 'a'..'f' | 'A'..'F') ;
WHITESPACE : ( ' ' | '\r' | '\t' | '\u000C' | '\n' ) { skip(); };
I have simplified it enough to get rid of the distractions but I think removing any more would remove the problem.
- A slash is permitted in the middle of an unquoted query fragment.
- Boolean queries in particular have no required keyword.
- A new syntax (e.g. W/3) is being introduced but I'm trying not to affect existing queries which happen to look similar (e.g. X/Y)
- Due to '/' being valid as part of a word, ANTLR appears to be giving me "W/3" as a single token of type UNQUOTED instead of it being a WSLASH followed by a NUMBER.
- Due to the above, I end up with a tree like: DefaultBooleanNode(DefaultBooleanNode(~first clause~, "W/3"), ~second clause~), whereas what I really wanted was WSLASH(~first clause~, "3", ~second clause~).
What I would like to do is somehow write the UNQUOTED rule as "what I have now, but not matching ~~~~", but I'm at a loss for how to do that.
I realise that I could spell it out in full, e.g.:
- Any character from UnquotedStartChar except 'w', followed by the rest of the rule
- 'w' followed by any character from UnquotedChar except '/', followed by the rest of the rule
- 'w/' followed by any character from UnquotedChar except digits
- ...
However, that would look awful. :)
When a lexer generated by ANTLR "sees" that some input can be matched by more than one rule, it chooses the longest match. If you want a shorter match to take precedence, you need to merge the similar rules into one and then use a gated semantic predicate to check whether the shorter match is ahead. If it is, you change the type of the token.
A demo:
Query.g
grammar Query;
tokens {
WSlash;
}
@lexer::members {
private boolean ahead(String text) {
for(int i = 0; i < text.length(); i++) {
if(input.LA(i + 1) != text.charAt(i)) {
return false;
}
}
return true;
}
}
parse
: (t=. {System.out.printf("\%-10s \%s\n", tokenNames[$t.type], $t.text);} )* EOF
;
NUMBER
: Digit+ ('.' Digit+)?
;
UNQUOTED
: {ahead("W/")}?=> 'W/' { $type=WSlash; /* change the type of the token */ }
| {ahead("w/")}?=> 'w/' { $type=WSlash; /* change the type of the token */ }
| UnquotedStartChar UnquotedChar*
;
fragment UnquotedStartChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '/' | '(' | ')' | '[' | ']'
| '{' | '}' | '-' | '+' | '~' | '&' | '|'
| '!' | '^' | '?' | '*' )
;
fragment UnquotedChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '(' | ')' | '[' | ']' | '{'
| '}' | '~' | '&' | '|' | '!' | '^' | '?'
| '*' )
;
fragment EscapeSequence
: '\\'
( 'u' HexDigit HexDigit HexDigit HexDigit
| ~'u'
)
;
fragment Digit : '0'..'9';
fragment HexDigit : '0'..'9' | 'a'..'f' | 'A'..'F';
WHITESPACE : (' ' | '\r' | '\t' | '\u000C' | '\n') { skip(); };
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
QueryLexer lexer = new QueryLexer(new ANTLRStringStream("P/3 W/3"));
QueryParser parser = new QueryParser(new CommonTokenStream(lexer));
parser.parse();
}
}
To run the demo on *nix/MacOS:
java -cp antlr-3.3.jar org.antlr.Tool Query.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
or on Windows:
java -cp antlr-3.3.jar org.antlr.Tool Query.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main
which will print the following:
UNQUOTED P/3
WSlash W/
NUMBER 3
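If it helps to see the decision outside of ANTLR, here is a rough plain-Java imitation of what the merged rule does. It is only a sketch: the class TokenSketch and its tokenize method are made up for illustration, whitespace is the only delimiter it knows about (the real UnquotedChar set is much stricter), and it is not the code ANTLR generates.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSketch {
    // Hypothetical counterpart of the lexer's ahead(), working on a plain String.
    static boolean ahead(String input, int pos, String text) {
        return input.startsWith(text, pos);
    }

    // Simplified imitation of the merged UNQUOTED rule (assumption: only
    // whitespace ends a token; escapes and special characters are ignored).
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++; // skipped, like the WHITESPACE rule
                continue;
            }
            // The "predicate": checked only at the start of a token.
            if (ahead(input, pos, "W/") || ahead(input, pos, "w/")) {
                tokens.add("WSlash " + input.substring(pos, pos + 2));
                pos += 2;
                continue;
            }
            // Otherwise take the longest run of non-whitespace characters.
            int end = pos;
            while (end < input.length() && !Character.isWhitespace(input.charAt(end))) {
                end++;
            }
            String text = input.substring(pos, end);
            String type = text.matches("\\d+(\\.\\d+)?") ? "NUMBER" : "UNQUOTED";
            tokens.add(type + " " + text);
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("P/3 W/3")) {
            System.out.println(token);
        }
    }
}
```

For "P/3 W/3" this produces the same three tokens as the demo. The point is that the check runs only where a token begins, so a slash in the middle of a word such as P/3 is still consumed as part of the longer match.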
EDIT
To eliminate the warning when using the WSlash token in a parser rule, simply add an empty fragment rule to your grammar:
fragment WSlash : /* empty */ ;
It's a bit of a hack, but that's how it's done. No more warnings.