Why this error happened? - "The following alternatives can never be matched"
I'm interested in making compiler, parser, and parser generator, but I don't know much about them.
After reading an answer of this question, I t开发者_JAVA技巧ried to make a 'very' simple LaTeX parser.
This is the code:
grammar Latex;
latex : ITEM*;
ITEM : CMD|LAWTEXT;
CMD : CHEAD ARGS;
CHEAD : '\\' LETTER(LETTER|DIGIT)*;
LETTER : 'A'..'Z'|'a'..'z';
DIGIT : '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
ARGS : '{' ITEM* '}';
LAWTEXT : (LETTER|DIGIT|WHITESPACE|PUNC)*;
WHITESPACE
: ' '|'\t'|'\n'|'\r';
PUNC : '!'|'^';
(There's only two char in PUNC, for test purpose)
And this is the error message:
[18:39:09] warning(200): C:\Users\***\Documents\Latex.g:9:12: Decision can match input such as "{'\t'..'\n', '\r', ' '..'!', '0'..'9', 'A'..'Z', '\\', '^', 'a'..'z', '}'}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[18:39:09] error(201): C:\Users\***\Documents\Latex.g:9:12: The following alternatives can never be matched: 2
[18:39:09] error(211): C:\Users\***\Documents\Latex.g:1:8: [fatal] rule Tokens has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
I found that this error happens because there's an ambiguity, a code can be interpreted in more than two ways, but I have no idea how this ambiguity is generated.
And this is the diagram and two ways something can be interpreted (maybe).
... but how \
and }
can be confused?
JiminP wrote:
I'm interested in making compiler, parser, and parser generator, but I don't know much about them.
ANTLR creates a lexer and parser for you based on the grammar you write. ANTLR itself is the parser generator, so you don't write the parser generator itself (luckily!). A compiler is an application that takes the tree your parser generates and translates the input into some other form: this is something you need to do yourself. So, to emphasize: ANTLR only helps you create a parser for your language, the rest is up to you.
Now, the problem(s).
Your grammar contains almost only lexer rules. Lexer rules start with a capital letter and are used to tokenize your input source. So rules like these:
LETTER : 'A'..'Z'|'a'..'z';
...
LAWTEXT : (LETTER|DIGIT|WHITESPACE|PUNC)*;
might cause the lexer to create a LETTER
token on its own. If you always want a lower- or uppercase ascii letter to become a LAWTEXT
token, then you need to make LETTER
a fragment rule like this:
fragment LETTER : 'A'..'Z'|'a'..'z';
...
LAWTEXT : (LETTER|DIGIT|WHITESPACE|PUNC)+;
And as you can see, I ended the LAWTEXT
rule with a +
instead of a *
: you don't want to create tokens that contain nothing (empty string).
Also, args
, item
and cmd
are no suitable candidates for lexer rule: they should be parser rules instead.
Here's a grammar that generates a lexer and parser without any errors:
grammar Latex;
latex
: item* EOF
;
item
: cmd
| LAWTEXT
;
cmd
: CHEAD args
;
args
: '{' item* '}'
;
CHEAD
: '\\' LETTER (LETTER | DIGIT)*
;
LAWTEXT
: (LETTER | DIGIT | WHITESPACE | PUNC)+
;
fragment
WHITESPACE
: ' ' | '\t' | '\n' | '\r'
;
fragment
PUNC
: '!' | '^'
;
fragment
LETTER
: 'A'..'Z' | 'a'..'z'
;
fragment
DIGIT
: '0'..'9'
;
EDIT
As I already mentioned: lexer rules start with a capital letter while parser rules start with a lower case letter. A lexer, which is sometimes called a tokenizer or scanner, is responsible for chopping up the input source. The input source starts of as being just a stream of characters. These characters are then grouped together by the lexer. So, given the following lexer rules:
Identifier
: (Letter | '_') (Letter | '_' | Digit)*
;
Assign
: '='
;
Number
: Digit+ ('.' Digit+)?
;
fragment Digit
: '0'..'9'
;
fragment Letter
: 'a'..'z' | 'A'..'Z'
;
Spaces
: (' ' | '\t' | '\r' | '\n') {skip();}
;
could take input source like:
foo = 12.34
which the lexer sees as:
'f', 'o', 'o', ' ', '=', ' ', '1', '2', '.', '3', '4', EOF
and will create the following tokens:
Identifier "foo"
Assign "="
Number "12.34"
(note that there are no tokens being created from the white spaces: I skipped these!)
After the lexer has created tokens from your input source, the parser is passed these tokens. An assignment parser rule could then look like:
assignment
: Identifier Assign Number
;
It is important to keep in mind that the input source is first tokenized by the lexer and only after that process, the parser rules come into play.
精彩评论