Tracking down problems with text being ignored by ANTLR parser

2023-04-02 17:30 Q&A author:

I'm working on a parser that will split a string containing a person's full name into components (first, middle, last, title, suffix, ...). When I try a basic example "J. A. Doe" in ANTLRWorks, it matches the fname and lname rules, but ignores the "A.". How do I troubleshoot this type of problem?

Stripped-down grammar:

grammar PersonNamesMinimal;

fullname returns [Name name]
 : (directory_style[name] | standard[name] | proper_initials[name]);

fullname_only returns [Name name]: f=fullname EOF;

standard[Name name]
 : fname[name] ' ' (mname[name] ' ')* lname[name] ;

proper_initials[Name name]: a=INITIAL ' '? b=INITIAL lname[name];

sep: ',' | ', ' | ' ';
dir_sep: ',' | ', ' | ' , ';

directory_style[Name name]
 : lname[name] dir_sep fname[name] (' ' mname[name])*;

fname[Name name] : (f=NAME | f=INITIAL);

mname[Name name] : (m=NAME | m=INITIAL); // Weird bug when mname is "F."

lname[Name name] : a=single_lname (b='-' c=single_lname)?;
single_lname returns [String s]
 : (p=LNAME_PREFIX r=NAME)
 | r=NAME;
LNAME_PREFIX : (V O N | V A N ' ' D E R | V A N ' ' D E N | V A N | D E ' ' L A | D E | B I N) ' ';

O_APOS: ('O'|'o') '\'';
NAME: (O_APOS? LETTER LETTER+) | LETTER;
INITIAL: LETTER '.';

AND: ( ' '+ A N D ' '+ ) | (' '* '&' ' '*);
fragment WORD : LETTER+;
COMMA : ',';
//WS : ( '\t' | ' ' );

fragment DIGIT : '0' .. '9';
fragment LETTER : 'A' .. 'Z' | 'a' .. 'z';

//{{{ fragments for each letter of alphabet
fragment A : 'A' | 'a';
fragment B : 'B' | 'b';
fragment C : 'C' | 'c';
fragment D : 'D' | 'd';
fragment E : 'E' | 'e';
fragment F : 'F' | 'f';
fragment G : 'G' | 'g';
fragment H : 'H' | 'h';
fragment I : 'I' | 'i';
fragment J : 'J' | 'j';
fragment K : 'K' | 'k';
fragment L : 'L' | 'l';
fragment M : 'M' | 'm';
fragment N : 'N' | 'n';
fragment O : 'O' | 'o';
fragment P : 'P' | 'p';
fragment Q : 'Q' | 'q';
fragment R : 'R' | 'r';
fragment S : 'S' | 's';
fragment T : 'T' | 't';
fragment U : 'U' | 'u';
fragment V : 'V' | 'v';
fragment W : 'W' | 'w';
fragment X : 'X' | 'x';
fragment Y : 'Y' | 'y';
fragment Z : 'Z' | 'z';
//}}}

In creating this stripped-down version I discovered that removing either the directory_style rule or the LNAME_PREFIX rule causes the mname rule to work as expected, but I'm not sure why.

The problem is not with your parser rules, at least, not the problem you're facing at the moment... :). There's something going wrong in the lexer.

The "A." from the input "J. A. Doe" is not being tokenized as an INITIAL. Instead, the lexer tries to create an AND token starting at the space before the 'A' (note that space!). You can see this by first parsing the input "J. X. Doe" with this even more trimmed grammar:

grammar PersonNamesMinimal;

// just parse zero or more tokens (no matter what) and print their type and text
parse
  :  (t=. {System.out.printf("\%-25s \%s\n", tokenNames[$t.type], $t.text);})* EOF
  ;


LNAME_PREFIX : (V O N | V A N ' ' D E R | V A N ' ' D E N | V A N | D E ' ' L A | D E | B I N) ' ';
O_APOS       : ('O'|'o') '\'';
NAME         : (O_APOS? LETTER LETTER+) | LETTER;
INITIAL      : LETTER '.';
AND          : ( ' '+ A N D ' '+ ) | (' '* '&' ' '*);
COMMA        : ',';

fragment LETTER : 'A' .. 'Z' | 'a' .. 'z';

fragment A : 'A' | 'a';
fragment B : 'B' | 'b';
fragment C : 'C' | 'c';
fragment D : 'D' | 'd';
fragment E : 'E' | 'e';
fragment F : 'F' | 'f';
fragment G : 'G' | 'g';
fragment H : 'H' | 'h';
fragment I : 'I' | 'i';
fragment J : 'J' | 'j';
fragment K : 'K' | 'k';
fragment L : 'L' | 'l';
fragment M : 'M' | 'm';
fragment N : 'N' | 'n';
fragment O : 'O' | 'o';
fragment P : 'P' | 'p';
fragment Q : 'Q' | 'q';
fragment R : 'R' | 'r';
fragment S : 'S' | 's';
fragment T : 'T' | 't';
fragment U : 'U' | 'u';
fragment V : 'V' | 'v';
fragment W : 'W' | 'w';
fragment X : 'X' | 'x';
fragment Y : 'Y' | 'y';
fragment Z : 'Z' | 'z';

SPACE : ' ';

with the class:

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    PersonNamesMinimalLexer lexer = new PersonNamesMinimalLexer(new ANTLRStringStream(args[0]));
    PersonNamesMinimalParser parser = new PersonNamesMinimalParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}

Then generate the lexer and parser, compile everything, and run Main with "J. X. Doe" as a command-line argument:

java -cp antlr-3.3.jar org.antlr.Tool PersonNamesMinimal.g 
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main "J. X. Doe"

which prints the following on your console:

INITIAL                   J.
SPACE                      
INITIAL                   X.
SPACE                      
NAME                      Doe

(i.e. the expected output)

But now provide "J. A. Doe":

java -cp .:antlr-3.3.jar Main "J. A. Doe"

and the following output is produced:

line 1:4 mismatched character '.' expecting set null
INITIAL                   J.
SPACE                      
NAME                      Doe
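
To make the failure easier to picture, here is a rough sketch in plain Java of a lexer that picks a rule from a little lookahead and then commits to it. It is not ANTLR code and does not reproduce ANTLR's generated DFA: the class CommitDemo, its methods, and the amount of lookahead are all assumptions made for illustration. It only mirrors the first alternative of AND (' '+ A N D ' '+).

```java
// Toy model of "predict a rule from limited lookahead, then commit to it".
// All names here are hypothetical; this is not how ANTLR is implemented.
class CommitDemo {

    // Assumed prediction: one or more spaces followed by 'A'/'a' look like
    // the start of AND; spaces followed by anything else are a plain SPACE.
    static String predict(String in, int pos) {
        int i = pos;
        while (i < in.length() && in.charAt(i) == ' ') i++;
        if (i == pos) return "OTHER"; // no space at pos, not modeled here
        boolean nextIsA = i < in.length() && Character.toLowerCase(in.charAt(i)) == 'a';
        return nextIsA ? "AND" : "SPACE";
    }

    // Matches ' '+ A N D ' '+ starting at pos.
    // Returns -1 on success, otherwise the index of the first character that does not fit.
    static int matchAnd(String in, int pos) {
        int i = pos;
        if (i >= in.length() || in.charAt(i) != ' ') return i;
        while (i < in.length() && in.charAt(i) == ' ') i++;
        for (char expected : new char[] {'a', 'n', 'd'}) {
            if (i >= in.length() || Character.toLowerCase(in.charAt(i)) != expected) return i;
            i++;
        }
        if (i >= in.length() || in.charAt(i) != ' ') return i;
        return -1;
    }

    // Once a rule is predicted there is no falling back to another rule.
    static String describe(String in, int pos) {
        String rule = predict(in, pos);
        if (!rule.equals("AND")) return rule;
        int failedAt = matchAnd(in, pos);
        if (failedAt < 0) return "AND";
        String found = failedAt < in.length() ? "'" + in.charAt(failedAt) + "'" : "end of input";
        return "AND fails at index " + failedAt + " on " + found;
    }

    public static void main(String[] args) {
        System.out.println(describe("J. X. Doe", 2)); // SPACE
        System.out.println(describe("J. A. Doe", 2)); // AND fails at index 4 on '.'
    }
}
```

In this model the space at index 2 of "J. A. Doe" is followed by an 'A', so AND is chosen, and the match then breaks on the '.' at index 4, the same position the error message above reports (line 1:4).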

If you now comment out the AND rule in your lexer:

...
INITIAL      : LETTER '.';
//AND          : ( ' '+ A N D ' '+ ) | (' '* '&' ' '*);
COMMA        : ',';
...

and test "J. A. Doe" again:

java -cp antlr-3.3.jar org.antlr.Tool PersonNamesMinimal.g 
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main "J. A. Doe"

you will see this:

INITIAL                   J.
SPACE                      
INITIAL                   A.
SPACE                      
NAME                      Doe

(i.e. all goes well!)

How to fix it? If I were you, I'd first make the lexer much cleaner by removing all the literal spaces from the rules and putting whitespace on the HIDDEN channel, so you don't have to account for it inside other parser and lexer rules:

SPACE
  :  (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;}
  ;

That will at least solve this current problem you're facing. But there will probably be more...

EDIT

bemace wrote:

How would I modify the AND rule then so that it only matches whole words and not things like "stand"?

You don't need to do anything special for that to happen. As long as you have a rule that matches "stand", or even "andre", then they will not be tokenized as AND. In your case, NAME will match both of them, and because NAME matches more characters than AND for the input "stand" and "andre", they will become NAME tokens.

This is how ANTLR's lexer works: the longest match is chosen, and if two rules match the same number of characters, the rule that is defined first takes precedence over the other.
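
The rule of thumb just described (longest match wins, ties go to the rule defined first) can be sketched in plain Java with java.util.regex. This is only a model of the rule as stated, not the lexer ANTLR generates: the class LongestMatchDemo and its regular expressions are assumptions for illustration, and LNAME_PREFIX is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of "longest match wins, ties go to the rule defined first".
// Hypothetical class; it only illustrates the rule of thumb.
class LongestMatchDemo {
    // Insertion order stands in for the order of the rules in the grammar.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("AND", Pattern.compile("(?i)and|&"));
        RULES.put("INITIAL", Pattern.compile("[A-Za-z]\\."));
        RULES.put("NAME", Pattern.compile("([Oo]')?[A-Za-z]+"));
        RULES.put("COMMA", Pattern.compile(","));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            // Whitespace is simply skipped, much like a hidden-channel SPACE rule.
            if (c == ' ' || c == '\t') { pos++; continue; }
            String bestRule = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly greater: on a tie the earlier rule keeps the token.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestRule = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            tokens.add(bestRule + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [NAME:Andre, AND:and, NAME:stand]
        System.out.println(tokenize("Andre and stand"));
    }
}
```

For "and", both AND and NAME match three characters, so AND wins because it comes first. For "Andre" and "stand", NAME matches more characters than AND could, so they become NAME tokens.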

A small test:

grammar PersonNamesMinimal;

parse
  :  (t=. {System.out.printf("\%-25s \%s\n", tokenNames[$t.type], $t.text);})* EOF
  ;

AND
  :  A N D
  |  '&'
  ;

LNAME_PREFIX 
  :  V O N 
  |  V A N SPACES D E R 
  |  V A N SPACES D E N 
  |  V A N 
  |  D E SPACES L A 
  |  D E 
  |  B I N
  ;

INITIAL
  :  LETTER '.'
  ;

NAME
  :  (O '\'')? LETTER+ 
  ;

COMMA
  :  ','
  ;

SPACE 
  :  (' ' | '\t') {$channel=HIDDEN;}
  ;

fragment LETTER : 'A' .. 'Z' | 'a' .. 'z';
fragment SPACES : (' ' | '\t')+;
fragment A : 'A' | 'a';
fragment B : 'B' | 'b';
fragment C : 'C' | 'c';
fragment D : 'D' | 'd';
fragment E : 'E' | 'e';
fragment F : 'F' | 'f';
fragment G : 'G' | 'g';
fragment H : 'H' | 'h';
fragment I : 'I' | 'i';
fragment J : 'J' | 'j';
fragment K : 'K' | 'k';
fragment L : 'L' | 'l';
fragment M : 'M' | 'm';
fragment N : 'N' | 'n';
fragment O : 'O' | 'o';
fragment P : 'P' | 'p';
fragment Q : 'Q' | 'q';
fragment R : 'R' | 'r';
fragment S : 'S' | 's';
fragment T : 'T' | 't';
fragment U : 'U' | 'u';
fragment V : 'V' | 'v';
fragment W : 'W' | 'w';
fragment X : 'X' | 'x';
fragment Y : 'Y' | 'y';
fragment Z : 'Z' | 'z';

And if you now parse the input:

"Andre and stand van     der"

you will see the expected tokens being created:

java -cp .:antlr-3.3.jar Main "Andre and stand van     der"

NAME                      Andre
AND                       and
NAME                      stand
LNAME_PREFIX              van     der
