Seeming non-determinism in ANTLR parse
If I have an ANTLR g开发者_开发问答rammar as follows:
grammar Test;
options {
language = Java;
}
rule : (foo | bar);
foo : FOO ',' FOO;
bar : BAR;
FOO: ('0'..'9')+;
BAR: ('a'..'z' | 'A'..'Z' | '0'..'9' | ' ')+;
WHITESPACE: (' ' | '\t')+ { $channel=HIDDEN; };
And I use a test string:
12abc3
this (I believe) is a BAR
token which satisfies a bar
rule and is parsed as such. Bravo.
However, if I have this string:
12
I receive line 1:2 mismatched input '' expecting ','
This seems rather non-deterministic although I'm sure it's not. I understand I'm already in trouble by having two tokens: FOO
and BAR
that accept digits. But if the parser is going to succeed or fail it should succeed or fail consistently. In other words, in the first case the first character is a 1 and apparently is being evaluated as a member of the BAR
token and thus the parser heads down a successful path. In the second case, the SAME first character is being evaluated as a FOO
token and thus the path is doomed to fail despite the fact that the string COULD be a successful bar
parse. Why the inconsistency? Or am I missing something more fundamental about ANTLR and/or parsing?
ANTLR doesn't determine the token type until it sees the first character for the next token(or EOF). ANTLR will also attempt a longest match, which is why you see '12abc3' as BAR and not as FOO BAR. In the second case ANTLR will use FOO for '12' because it is listed first in the grammar.
ANTLR basics
ANTLR lexers
In addition to Adam answer, you must realize that the lexer and parser, although defined in the same grammar, are being constructed at different times. First the input source is being tokenized and when that has happened, only then the parser operates on these tokens. The tokens are not created while the parser goes through the source (character stream) to favor a complete match (ie. tokenize "12"
as BAR
). The fact that "12"
is being tokenized as FOO
is because FOO
comes before the BAR
rule and has therefor a higher precedence in case of an equal long match.
In short: ANTLR grammars are not PEG's.
精彩评论