Recognize Identifiers in Chinese characters by using Lex/Yacc

How can I use Lex/Yacc to recognize identifiers in Chinese characters?


I think you mean Lex (the lexer generator). Yacc is the parser generator.

According to "What's the complete range for Chinese characters in Unicode?", most CJK characters fall in the U+3400-U+9FFF range.

According to http://dinosaur.compilertools.net/lex/index.html

Arbitrary character. To match almost any character, the operator character . is the class of all characters except newline. Escaping into octal is possible although non-portable:

                             [\40-\176]

matches all printable characters in the ASCII character set, from octal 40 (blank) to octal 176 (tilde).

So I would assume what you need is something like [\32000-\117777] (octal 32000 and 117777 being the octal equivalents of U+3400 and U+9FFF).
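
As a concrete illustration of the octal-escape syntax the manual describes, here is a minimal flex sketch built around the quoted [\40-\176] class; the printf action and standalone main are for demonstration only and are not from the original answer. Whether a class spanning values above \377, such as the range suggested above, is accepted at all depends on the lex implementation, since classic lex character classes operate on single bytes.

    %{
    /* Minimal sketch: tokenize runs of printable ASCII characters
       using the octal character class quoted from the lex manual.
       The printf action and standalone main are illustrative only. */
    #include <stdio.h>
    %}

    %%
    [\40-\176]+   { printf("printable: %s\n", yytext); }
    \n            { /* skip newlines */ }
    %%

    int yywrap(void) { return 1; }

    int main(void) { return yylex(); }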


Yacc does not care about Chinese characters, but lex does: it is responsible for analyzing the input bytes (and characters) to recognize tokens. However, Chinese characters are encoded as multiple bytes (three bytes each in UTF-8), which lex does not handle natively. There are lex-like programs that may support this, but they are not lex itself. The topic has been discussed several times.

Further reading:

  • Adding utf-8 Encoding to Lex

    The standard lexical tokenizer, lex (or flex), does not accept multi-byte characters and is thus impractical for many modern languages. This document describes a mapping from regular expressions describing UTF-8 multi-byte characters to regular expressions of single bytes.

  • Flex(lexer) support for unicode (2012/3/8)

    Answers point out how you can work around the limitation by using special cases of UTF-8 byte patterns; one such sketch follows this list.

  • Unicode Support in Flex (2009/4/26)

    Essentially the same as the previous question, but earlier, and a possible source for those answers.

  • How do I lex unicode characters in C?

    An answer lists some alternative implementations which may do what was asked here.
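
Tying those references together: the usual workaround is to spell out each CJK code point as its UTF-8 byte sequence and let flex match plain bytes. The sketch below is one such mapping under stated assumptions: the input is UTF-8, CJK identifiers are runs of ideographs in roughly U+3400-U+9FFF (the range cited above) with an ordinary ASCII identifier rule alongside for contrast, and the definition names and printf actions are made up for illustration rather than taken from any of the linked answers.

    %{
    /* Sketch: recognize identifiers made of CJK ideographs in roughly
       U+3400-U+9FFF by matching their UTF-8 byte sequences directly.
       Assumes UTF-8 input; names and actions are illustrative only. */
    #include <stdio.h>
    %}

    /* U+3400-U+3FFF: lead byte E3, first continuation byte 90-BF */
    CJK_LOW     \xE3[\x90-\xBF][\x80-\xBF]
    /* U+4000-U+9FFF: lead bytes E4-E9, continuation bytes 80-BF  */
    CJK_HIGH    [\xE4-\xE9][\x80-\xBF][\x80-\xBF]
    HAN         ({CJK_LOW}|{CJK_HIGH})

    %%
    {HAN}+                    { printf("CJK identifier: %s\n", yytext); }
    [A-Za-z_][A-Za-z0-9_]*    { printf("ASCII identifier: %s\n", yytext); }
    [ \t\r\n]+                { /* skip whitespace */ }
    .                         { /* ignore anything else in this sketch */ }
    %%

    int yywrap(void) { return 1; }

    int main(void) { return yylex(); }

The scanner never decodes UTF-8; it simply treats each three-byte sequence as an opaque unit, which is why yacc above the lexer needs no changes. With flex it should build with something like "flex cjk_id.l && cc lex.yy.c" (the .l file name here is made up).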
