Tracking down problems with text being ignored by ANTLR parser
I'm working on a parser that will split a string containing a person's full name into components (first, middle, last, title, suffix, ...). When I try a basic example "J. A. Doe" in ANTLRWorks, it matches the fname and lname rules, but ignores the "A.". How do I troubleshoot this type of problem?
Stripped-down grammar:
grammar PersonNamesMinimal;
fullname returns [Name name]
: (directory_style[name] | standard[name] | proper_initials[name]);
fullname_only returns [Name name]: f=fullname EOF;
standard[Name name]
: fname[name] ' ' (mname[name] ' ')* lname[name] ;
proper_initials[Name name]: a=INITIAL ' '? b=INITIAL lname[name];
sep: ',' | ', ' | ' ';
dir_sep: ',' | ', ' | ' , ';
directory_style[Name name]
: lname[name] dir_sep fname[name] (' ' mname[name])*;
fname[Name name] : (f=NAME | f=INITIAL);
mname[Name name] : (m=NAME | m=INITIAL); // Weird bug when mname is "F."
lname[Name name] : a=single_lname (b='-' c=single_lname)?;
single_lname returns [String s]
: (p=LNAME_PREFIX r=NAME)
| r=NAME;
LNAME_PREFIX : (V O N | V A N ' ' D E R | V A N ' ' D E N | V A N | D E ' ' L A | D E | B I N) ' ';
O_APOS: ('O'|'o') '\'';
NAME: (O_APOS? LETTER LETTER+) | LETTER;
INITIAL: LETTER '.';
AND: ( ' '+ A N D ' '+ ) | (' '* '&' ' '*);
fragment WORD : LETTER+;
COMMA : ',';
//WS : ( '\t' | ' ' );
fragment DIGIT : '0' .. '9';
fragment LETTER : 'A' .. 'Z' | 'a' .. 'z';
//{{{ fragments for each letter of alphabet
fragment A : 'A' | 'a';
fragment B : 'B' | 'b';
fragment C : 'C' | 'c';
fragment D : 'D' | 'd';
fragment E : 'E' | 'e';
fragment F : 'F' | 'f';
fragment G : 'G' | 'g';
fragment H : 'H' | 'h';
fragment I : 'I' | 'i';
fragment J : 'J' | 'j';
fragment K : 'K' | 'k';
fragment L : 'L' | 'l';
fragment M : 'M' | 'm';
fragment N : 'N' | 'n';
fragment O : 'O' | 'o';
fragment P : 'P' | 'p';
fragment Q : 'Q' | 'q';
fragment R : 'R' | 'r';
fragment S : 'S' | 's';
fragment T : 'T' | 't';
fragment U : 'U' | 'u';
fragment V : 'V' | 'v';
fragment W : 'W' | 'w';
fragment X : 'X' | 'x';
fragment Y : 'Y' | 'y';
fragment Z : 'Z' | 'z';
//}}}
In creating this stripped-down version I discovered that removing either the directory_style rule or the LNAME_PREFIX rule causes the mname rule to work as expected, but I'm not sure why.
The problem you're facing at the moment is not in your parser rules: something is going wrong in the lexer.
The initial A. from the input "J. A. Doe" is not being tokenized as an INITIAL. Instead, the lexer tries to create an AND token from it (note the space before the 'A'!). You can see this by first parsing the input "J. X. Doe" with this even more trimmed grammar:
grammar PersonNamesMinimal;
// just parse zero or more tokens (no matter what) and print their type and text
parse
: (t=. {System.out.printf("\%-25s \%s\n", tokenNames[$t.type], $t.text);})* EOF
;
LNAME_PREFIX : (V O N | V A N ' ' D E R | V A N ' ' D E N | V A N | D E ' ' L A | D E | B I N) ' ';
O_APOS : ('O'|'o') '\'';
NAME : (O_APOS? LETTER LETTER+) | LETTER;
INITIAL : LETTER '.';
AND : ( ' '+ A N D ' '+ ) | (' '* '&' ' '*);
COMMA : ',';
fragment LETTER : 'A' .. 'Z' | 'a' .. 'z';
fragment A : 'A' | 'a';
fragment B : 'B' | 'b';
fragment C : 'C' | 'c';
fragment D : 'D' | 'd';
fragment E : 'E' | 'e';
fragment F : 'F' | 'f';
fragment G : 'G' | 'g';
fragment H : 'H' | 'h';
fragment I : 'I' | 'i';
fragment J : 'J' | 'j';
fragment K : 'K' | 'k';
fragment L : 'L' | 'l';
fragment M : 'M' | 'm';
fragment N : 'N' | 'n';
fragment O : 'O' | 'o';
fragment P : 'P' | 'p';
fragment Q : 'Q' | 'q';
fragment R : 'R' | 'r';
fragment S : 'S' | 's';
fragment T : 'T' | 't';
fragment U : 'U' | 'u';
fragment V : 'V' | 'v';
fragment W : 'W' | 'w';
fragment X : 'X' | 'x';
fragment Y : 'Y' | 'y';
fragment Z : 'Z' | 'z';
SPACE : ' ';
with the class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
PersonNamesMinimalLexer lexer = new PersonNamesMinimalLexer(new ANTLRStringStream(args[0]));
PersonNamesMinimalParser parser = new PersonNamesMinimalParser(new CommonTokenStream(lexer));
parser.parse();
}
}
Then generate a lexer and parser, compile everything, and run Main with "J. X. Doe" as a command line parameter:
java -cp antlr-3.3.jar org.antlr.Tool PersonNamesMinimal.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main "J. X. Doe"
which prints the following on your console:
INITIAL J.
SPACE
INITIAL X.
SPACE
NAME Doe
(i.e. the expected output)
But now provide "J. A. Doe":
java -cp .:antlr-3.3.jar Main "J. A. Doe"
and the following output is produced:
line 1:4 mismatched character '.' expecting set null
INITIAL J.
SPACE
NAME Doe
If you now comment out the AND rule in your lexer:
...
INITIAL : LETTER '.';
//AND : ( ' '+ A N D ' '+ ) | (' '* '&' ' '*);
COMMA : ',';
...
and test "J. A. Doe" again:
java -cp antlr-3.3.jar org.antlr.Tool PersonNamesMinimal.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main "J. A. Doe"
you will see this:
INITIAL J.
SPACE
INITIAL A.
SPACE
NAME Doe
(i.e. all goes well!)
How to fix it? If I were you, I'd first make the lexer much cleaner by removing all the literal spaces from the other rules and putting whitespace on the HIDDEN channel, so you won't have to account for it inside other parser and lexer rules:
SPACE
: (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;}
;
That will at least solve this current problem you're facing. But there will probably be more...
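To illustrate what the hidden channel buys you, here is a minimal plain-Java sketch that does not use the ANTLR runtime. The class `HiddenChannelDemo`, the `Tok` type, and the channel constants are all hypothetical names invented for this illustration. The idea it shows is only this: whitespace tokens are still produced, but they carry a different channel number, and the consumer looks at the default channel alone.

```java
import java.util.ArrayList;
import java.util.List;

public class HiddenChannelDemo {
    // Hypothetical channel numbers chosen for this sketch only.
    static final int DEFAULT = 0;
    static final int HIDDEN = 1;

    // A tiny stand-in for a token: a type name, its text, and a channel.
    static final class Tok {
        final String type;
        final String text;
        final int channel;

        Tok(String type, String text, int channel) {
            this.type = type;
            this.text = text;
            this.channel = channel;
        }
    }

    // Split the input into runs of whitespace (hidden) and non-whitespace (default).
    public static List<Tok> lex(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            boolean space = Character.isWhitespace(input.charAt(i));
            int start = i;
            while (i < input.length() && Character.isWhitespace(input.charAt(i)) == space) {
                i++;
            }
            out.add(new Tok(space ? "SPACE" : "WORD", input.substring(start, i),
                            space ? HIDDEN : DEFAULT));
        }
        return out;
    }

    // What a parser reading only the default channel would see.
    public static List<String> visibleText(List<Tok> tokens) {
        List<String> visible = new ArrayList<>();
        for (Tok t : tokens) {
            if (t.channel == DEFAULT) {
                visible.add(t.text);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        System.out.println(visibleText(lex("J. A. Doe")));
    }
}
```

For "J. A. Doe" the sketch produces five tokens, of which only "J.", "A." and "Doe" are on the default channel, so rules such as standard would no longer need ' ' between their parts.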
EDIT
bemace wrote:
How would I modify the AND rule then so that it only matches whole words and not things like "stand"?
You don't need to do anything special for that to happen. As long as you have a rule that matches "stand", or even "andre", they will not be tokenized as AND. In your case, NAME will match both of them, and because NAME matches more characters than AND for the input "stand" and "andre", they will become NAME tokens.
This is how ANTLR's lexer works: the longest match is chosen, and if two rules match the same number of characters, the rule that is defined first takes precedence over the other.
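The rule just described (longest match wins, ties go to the rule defined first) can be sketched in plain Java with regular expressions. This is a toy model under stated assumptions, not ANTLR's generated code: the class `LongestMatchDemo` and its rule table are hypothetical, and it does not model how the generated lexer predicts which rule to try, so it will not reproduce the `mismatched character` error shown earlier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchDemo {
    // Hypothetical rule table; insertion order stands in for definition order in a grammar.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("AND", Pattern.compile("(?i)and|&"));
        RULES.put("INITIAL", Pattern.compile("[A-Za-z]\\."));
        RULES.put("NAME", Pattern.compile("[A-Za-z]+"));
        RULES.put("SPACE", Pattern.compile("[ \\t]"));
    }

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly longer matches replace the current best, so on a tie
                // the rule that was defined first keeps its place.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at position " + pos);
            }
            // Whitespace is skipped here, much like a token on a hidden channel.
            if (!bestRule.equals("SPACE")) {
                tokens.add(bestRule + ":" + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [NAME:Andre, AND:and, NAME:stand, INITIAL:J., INITIAL:A., NAME:Doe]
        System.out.println(tokenize("Andre and stand J. A. Doe"));
    }
}
```

In this sketch "and" is matched by both AND and NAME with the same length, and AND wins only because it comes first in the table, while "Andre" and "stand" go to NAME because it matches more characters.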
A small test:
grammar PersonNamesMinimal;
parse
: (t=. {System.out.printf("\%-25s \%s\n", tokenNames[$t.type], $t.text);})* EOF
;
AND
: A N D
| '&'
;
LNAME_PREFIX
: V O N
| V A N SPACES D E R
| V A N SPACES D E N
| V A N
| D E SPACES L A
| D E
| B I N
;
INITIAL
: LETTER '.'
;
NAME
: (O '\'')? LETTER+
;
COMMA
: ','
;
SPACE
: (' ' | '\t') {$channel=HIDDEN;}
;
fragment LETTER : 'A' .. 'Z' | 'a' .. 'z';
fragment SPACES : (' ' | '\t')+;
fragment A : 'A' | 'a';
fragment B : 'B' | 'b';
fragment C : 'C' | 'c';
fragment D : 'D' | 'd';
fragment E : 'E' | 'e';
fragment F : 'F' | 'f';
fragment G : 'G' | 'g';
fragment H : 'H' | 'h';
fragment I : 'I' | 'i';
fragment J : 'J' | 'j';
fragment K : 'K' | 'k';
fragment L : 'L' | 'l';
fragment M : 'M' | 'm';
fragment N : 'N' | 'n';
fragment O : 'O' | 'o';
fragment P : 'P' | 'p';
fragment Q : 'Q' | 'q';
fragment R : 'R' | 'r';
fragment S : 'S' | 's';
fragment T : 'T' | 't';
fragment U : 'U' | 'u';
fragment V : 'V' | 'v';
fragment W : 'W' | 'w';
fragment X : 'X' | 'x';
fragment Y : 'Y' | 'y';
fragment Z : 'Z' | 'z';
And if you now parse the input "Andre and stand van der", you will see the expected tokens being created:
java -cp .:antlr-3.3.jar Main "Andre and stand van der"
NAME Andre
AND and
NAME stand
LNAME_PREFIX van der