
Parse sentences with different word types

I'm looking for a grammar to parse two types of sentences, where a sentence is a sequence of words separated by whitespace:

  1. ID1: sentences whose words do not begin with a digit
  2. ID2: sentences whose words either do not begin with a digit or are plain numbers

Basically, the structure of the grammar should look like this:

ID1 separator ID2  

ID1: Word can contain number like Var1234 but not start with a number  

ID2: Same as above but 1234 is allowed  

separator: e.g. '='

@Bart

I just tried to add two tokens, '_' and '"', as a lexer rule Special for later use in the lexer rule Word. Even though I haven't used Special anywhere in the following grammar yet, I get the following error in ANTLRWorks 1.4.2:

The following token definitions can never be matched because prior tokens match the same input: Special

But when I add fragment before Special, I don't get that error. Why?

grammar Sentence1b1;

tokens
{
  TCUnderscore  = '_' ;
  TCQuote       = '"' ;
}

assignment
  :  id1 '=' id2
  ;

id1
  :  Word+
  ;

id2
  :  ( Word | Int )+
  ;

Int
  :  Digit+
  ;

// A word must start with a letter
Word
  :  ( 'a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit )*
  ;

Special
  : ( TCUnderscore | TCQuote )
  ;

Space
  :  ( ' ' | '\t' | '\r' | '\n' ) { $channel = HIDDEN; }
  ;

fragment Digit
  :  '0'..'9'
  ;

The lexer rule Special is then meant to be used in the lexer rule Word:

Word
  :  ( 'a'..'z' | 'A'..'Z' | Special ) ('a'..'z' | 'A'..'Z' | Special | Digit )*
  ;


I'd go for something like this:

grammar Sentence;

assignment
  :  id1 '=' id2
  ;

id1
  :  Word+
  ;

id2
  :  (Word | Int)+
  ;

Int
  :  Digit+
  ;

// A word must start with a letter
Word
  :  ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit)*
  ;

Space
  :  (' ' | '\t' | '\r' | '\n') {skip();}
  ;

fragment Digit
  :  '0'..'9'
  ;

which will parse the input:

Word can contain number like Var1234 but not start with a number = Same as above but 1234 is allowed

as follows:

[Image: parse tree for the input above]
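Since the parse tree is easier to follow with something you can run, here is a rough Python sketch that mimics the grammar above. It is not generated by ANTLR and not how ANTLR works internally; the names `tokenize` and `parse_assignment` are made up for illustration, and the regular expression is only my approximation of the Word, Int and '=' rules.

```python
import re

# Toy stand-in for the lexer rules Word, Int and '='.
# Leading whitespace is consumed and dropped, like the Space rule.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<Word>[A-Za-z][A-Za-z0-9]*)|(?P<Int>[0-9]+)|(?P<Eq>=))"
)

def tokenize(text):
    """Return a list of (type, text) pairs; raise ValueError on bad input."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input near offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def parse_assignment(text):
    """assignment : id1 '=' id2 ;  id1 : Word+ ;  id2 : (Word | Int)+ ;"""
    types = [kind for kind, _ in tokenize(text)]
    if types.count("Eq") != 1:
        return False
    split = types.index("Eq")
    id1, id2 = types[:split], types[split + 1:]
    return (
        bool(id1)
        and bool(id2)
        and all(kind == "Word" for kind in id1)
        and all(kind in ("Word", "Int") for kind in id2)
    )
```

With this sketch, `parse_assignment` accepts the example sentence above and rejects an input such as `1234 = abc`, because an Int is not allowed on the left-hand side.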

EDIT

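To keep the lexer rules nicely packed together, I'd put them all at the bottom of the grammar instead of partly in the tokens { ... } block, which I only use for defining "imaginary tokens" (used in AST creation). A literal assigned in the tokens { ... } block behaves like a lexer rule of its own, so consider these rules: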

// wrong!
Special      : (TCUnderscore | TCQuote);
TCUnderscore : '_';
TCQuote      : '"';

Now, with the rules above, TCUnderscore and TCQuote can never become tokens, because whenever the lexer stumbles upon a _ or ", a Special token is created. Or in this case:

// wrong!
TCUnderscore : '_';
TCQuote      : '"';
Special      : (TCUnderscore | TCQuote);

the Special token can never be created because the lexer would first create TCUnderscore and TCQuote tokens. Hence the error:

The following token definitions can never be matched because prior tokens match the same input: ...

If you make TCUnderscore and TCQuote fragment rules, you don't have that problem, because fragment rules only "serve" other lexer rules. So this works:

// good!
Special               : (TCUnderscore | TCQuote);
fragment TCUnderscore : '_';
fragment TCQuote      : '"';

Also, fragment rules can therefore never be "visible" in any of your parser rules (the lexer will never create a TCUnderscore or TCQuote token!).

// wrong!
parse : TCUnderscore;

Special               : (TCUnderscore | TCQuote);
fragment TCUnderscore : '_';
fragment TCQuote      : '"';
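To make the shadowing easier to see, here is a toy Python model of "the first rule that matches the input wins". This is only an illustration under that simplifying assumption, not ANTLR's actual lexing algorithm, and the names `first_match` and `reachable` are hypothetical helpers of mine.

```python
import re

def first_match(rules, text):
    """Return the name of the first rule (in listed order) matching all of text."""
    for name, pattern in rules:
        if re.fullmatch(pattern, text):
            return name
    return None

# Fragments are modelled as plain helper patterns: other rules reuse them,
# but they never appear in a rule table, so they cannot produce a token.
TC_UNDERSCORE = r"_"
TC_QUOTE = r'"'

# "wrong" version: all three are token rules, so Special is shadowed.
shadowed = [
    ("TCUnderscore", TC_UNDERSCORE),
    ("TCQuote", TC_QUOTE),
    ("Special", f"(?:{TC_UNDERSCORE}|{TC_QUOTE})"),
]

# "good" version: only Special is a token rule, the other two are fragments.
with_fragments = [
    ("Special", f"(?:{TC_UNDERSCORE}|{TC_QUOTE})"),
]

def reachable(rules, samples):
    """Collect the rule names that actually produce a token for the samples."""
    return {first_match(rules, sample) for sample in samples}
```

In the `shadowed` table no input ever yields Special, which is the situation the error message complains about. In `with_fragments`, both `_` and `"` yield Special, and there is no TCUnderscore or TCQuote token a parser rule could refer to.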


I'm not sure if that fits your needs, but with Bart's help in my post "ANTLR - identifier with whitespace" I came to this grammar:

grammar PropertyAssignment;

assignment
    : id_nodigitstart '=' id_digitstart EOF
    ;

id_nodigitstart
    :   ID_NODIGITSTART+
    ;

id_digitstart
    :   (ID_DIGITSTART|ID_NODIGITSTART)+
    ;

ID_NODIGITSTART
    :   ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9')*
    ;           

ID_DIGITSTART
    :   ('0'..'9'|'a'..'z'|'A'..'Z')+
    ;

WS  :   (' ')+ {skip();}
    ;

"a name = my 4value" works, while "4a name = my 4value" causes an exception: 4a is lexed as ID_DIGITSTART, which id_nodigitstart does not accept.
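If you want to check a few inputs without generating a parser, the following Python sketch approximates the grammar above. It is an assumption-laden stand-in, not ANTLR output: `classify` and `accepts` are hypothetical names, words are simply split on spaces, and when both lexer rules match a word I assume the one listed first (ID_NODIGITSTART) is chosen.

```python
import re

NODIGITSTART = re.compile(r"[A-Za-z][A-Za-z0-9]*")
DIGITSTART = re.compile(r"[0-9A-Za-z]+")

def classify(word):
    """Return the token type for one word, preferring the rule listed first."""
    if NODIGITSTART.fullmatch(word):
        return "ID_NODIGITSTART"
    if DIGITSTART.fullmatch(word):
        return "ID_DIGITSTART"
    raise ValueError(f"no lexer rule matches {word!r}")

def accepts(text):
    """assignment : id_nodigitstart '=' id_digitstart EOF ;"""
    left, sep, right = text.partition("=")
    if not sep:
        return False
    try:
        # Only spaces separate words, as in the WS rule.
        lhs = [classify(w) for w in left.split(" ") if w]
        rhs = [classify(w) for w in right.split(" ") if w]
    except ValueError:
        return False
    # The left side may only contain ID_NODIGITSTART; the right side takes both.
    return bool(lhs) and bool(rhs) and all(k == "ID_NODIGITSTART" for k in lhs)
```

Here `accepts("a name = my 4value")` is true and `accepts("4a name = my 4value")` is false, matching the behaviour described above.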
