How do I match Unicode characters in ANTLR?

I am trying to pick out all the tokens in a text and need to match all ASCII and Unicode characters, so here is how I have laid them out:

fragment CHAR     :  ('A'..'Z') | ('a'..'z');
fragment DIGIT    :  ('0'..'9');
fragment UNICODE  :  '\u0000'..'\u00FF';

Now if I write my token rule as:

TOKEN  :  (CHAR|DIGIT|UNICODE)+;

I get these warnings:

Decision can match input such as "'A'..'Z'" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input

Decision can match input such as "'0'..'9'" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input

And nothing gets matched. I also tried writing it as:

TOKEN  :  (UNICODE)+;

Still nothing gets matched.

Is there a way of doing this?


One other thing to consider if you are planning on using Unicode: you should set the charVocabulary option to say that you want to allow any character in the Unicode range 0 through FFFE:

options
{
charVocabulary='\u0000'..'\uFFFE';
}

The default you'll usually see in the examples is:

options
{
charVocabulary = '\3'..'\377';
}

To cover the point made above: generally, if you need both the ASCII character range 'A'..'Z' and the Unicode range, you'd make the Unicode lexer rule cover only the characters above ASCII, so that the ranges do not overlap, like: '\u0080'..'\ufffe'
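
A minimal sketch of that non-overlapping layout, assuming ANTLR 3 lexer syntax as used in the question. The rule names are the question's own; the only change is the lower bound of UNICODE, which is an assumption based on the range suggested above:

```antlr
// Letters and digits keep their own ASCII ranges.
fragment CHAR     :  ('A'..'Z') | ('a'..'z');
fragment DIGIT    :  ('0'..'9');
// UNICODE now starts above ASCII, so it no longer overlaps CHAR or DIGIT.
fragment UNICODE  :  '\u0080'..'\uFFFE';

TOKEN  :  (CHAR|DIGIT|UNICODE)+;
```

With disjoint ranges, each input character fits only one alternative, so the "multiple alternatives" warnings should no longer appear. ASCII whitespace and punctuation fall in none of these ranges, so they would need lexer rules of their own.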


Practically speaking, TOKEN: (UNICODE)+ is completely useless.

Since everything is a token character, if you try to use such a rule to match a Java program, say, it will simply match the whole program and return it to you as one big token.

You really do need to break your characters down into different groups if you want to split your input apart into meaningful fragments.

It might help you to take a look at how the "pros" have done it. Here is a BNF grammar for Java, and here is BNF for an identifier, which shows how they took the trouble to separate the characters into groups:

identifier 
  ::= "a..z,$,_" { "a..z,$,_,0..9,unicode character over 00C0" } 
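
As a rough ANTLR 3 translation of that identifier rule, something like the following might work. It is only a sketch: the rule names IDENTIFIER, ID_START and ID_PART are hypothetical, uppercase letters are added as an assumption, and "unicode character over 00C0" is read here as the range '\u00C0'..'\uFFFE'.

```antlr
// First character: a letter, '$' or '_'.
fragment ID_START  :  'a'..'z' | 'A'..'Z' | '$' | '_';
// Later characters: also digits and characters from U+00C0 upward (assumed range).
fragment ID_PART   :  ID_START | '0'..'9' | '\u00C0'..'\uFFFE';

IDENTIFIER  :  ID_START ID_PART*;
```

Because whitespace and operators are not part of ID_PART, an identifier token ends at the first such character instead of swallowing the rest of the input.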