ANTLRv3 not reading options
I am very new to ANTLR and am trying to understand how the Lexer and Parser rules work. I'm experiencing issues with a grammar I've written that seem to be related to lexer tokens with multiple characters being seen as "matches" even when only the first few characters actually match. To demonstrate this, I have written a simple ANTLR 3 Grammar:
grammar test;
options {
k=3;
}
@lexer::header { package test;}
@header {package test;}
sentence : (CHARACTER)*;
CHARACTER : 'a'..'z'|' ';
SPECIAL : 'special';
I'm using AntlrWorks to parse the following test input:
apple basic say sponsor speeds speckled specific wonder
The output I get 开发者_StackOverflow社区is:
apple basic say nsor ds led ic wonder
It seems to me that the LEXER is using k=1 and therefore matching my SPECIAL token with anything that includes the two letters 'sp'. Once it encounters the letters 'sp', it then matches sucessive characters within the SPECIAL literal until the actual input fails to match the expected token - at which point it throws an error (consuming that character) and then continues with the rest of the sentence. Each error is of the form:
line 1:18 mismatched chracter 'o' expecting 'e'
However, this isn't the behaviour I'm trying to create. I wish to create a lexer token that matches the keyword ('special') - for use in other parser rules not included in this test example. However, I don't want other rules/input that just happens to include the same initial characters to be affected
To summarize:
- How do I actually set antlr 3 options (such as k=2 or k=3 etc)? It seems to me, at least, that the options I'm trying to use here aren't being set.
- Is there a better way to create parser or lexer rules to match a particular keyword in my input, without affecting processing of other parts of the input that don't contain a full match?
The k
in the options { ... }
section defines the look ahead of the parser, not the lexer.
Note that the grammar
CHARACTER : 'a'..'z'|' ';
SPECIAL : 'special';
is ambiguous: your 'special'
could also be considered as 7 'a'..'z'
's. Normally, it'd be lexed as follows:
grammar Test;
sentence : (special | word | space)+ EOF;
special : SPECIAL;
word : WORD;
space : SPACE;
SPECIAL : 'special';
WORD : 'a'..'z'+;
SPACE : ' ';
which will parse the input:
specia special specials
as follows:
I.e. it gets (more or less) tokenized as a combination of LL(1) and "longest-matched". Sorry, I realize that's a bit vague, but the Definitive ANTLR Reference does not clarify this exactly (at least, I can't find it...). But I realize that this might not be what you're looking for.
AFAIK, the only way to produce single char-tokens and define keywords that are made up from these single char-tokens, is done by merging these two tokens in a single rule, and use predicates and manual look-ahead to see if they conform to a key word, and if not, change the type of the token in a "fall through" sub rule. A demo:
grammar test;
tokens {
LETTER;
}
@lexer::members {
// manual look ahead
private boolean ahead(String text) {
for(int i = 0; i < text.length(); i++) {
if(input.LA(i+1) != text.charAt(i)) {
return false;
}
}
return true;
}
}
sentence
: (t=. {System.out.printf("\%-7s :: '\%s'\n", tokenNames[$t.type], $t.text);})+ EOF
;
SPECIAL
: {ahead("special")}?=> 'special'
| {ahead("keyword")}?=> 'keyword'
| 'a'..'z' {$type = LETTER;} // Last option and no keyword is found:
// change the type of this token
;
SPACE
: ' '
;
The parser generated from the above grammar can be tested with the class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream("apple basic special speckled keyword keywor");
testLexer lexer = new testLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
testParser parser = new testParser(tokens);
parser.sentence();
}
}
As you can see, when parsing the input:
apple basic special speckled keyword keywor
the following output is generated:
LETTER :: 'a'
LETTER :: 'p'
LETTER :: 'p'
LETTER :: 'l'
LETTER :: 'e'
SPACE :: ' '
LETTER :: 'b'
LETTER :: 'a'
LETTER :: 's'
LETTER :: 'i'
LETTER :: 'c'
SPACE :: ' '
SPECIAL :: 'special'
SPACE :: ' '
LETTER :: 's'
LETTER :: 'p'
LETTER :: 'e'
LETTER :: 'c'
LETTER :: 'k'
LETTER :: 'l'
LETTER :: 'e'
LETTER :: 'd'
SPACE :: ' '
SPECIAL :: 'keyword'
SPACE :: ' '
LETTER :: 'k'
LETTER :: 'e'
LETTER :: 'y'
LETTER :: 'w'
LETTER :: 'o'
LETTER :: 'r'
See the Q&A What is a 'semantic predicate' in ANTLR? to learn more about predicates in ANTLR.
精彩评论