Lexers/tokenizers and character sets

When constructing a lexer/tokenizer, is it a mistake to rely on functions (in C) such as isdigit/isalpha/...? They are dependent on locale as far as I know. Should I pick a character set and concentrate on it and make a character mapping myself from which I look up classifications? Then the problem becomes being able to lex multiple character sets. Do I produce one lexer/tokenizer for each character set, or do I try to code the one I wrote so that the only thing I have to do is change the character mapping? What are common practices?


For now, I would concentrate on getting the lexer working with the plain ASCII character set first. Once the lexer works, add mapping support for other character types, such as UTF-16, and for locales.
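As a rough illustration of the "character mapping" idea from the question, here is a minimal sketch of a table-driven classifier for plain ASCII. All names (lex_table, lex_table_init_ascii, lex_is, the LEX_* flags) are made up for this example, and the range loops assume an ASCII-compatible execution character set.

```c
#include <stddef.h>

/* Hypothetical classification flags; the names are illustrative only. */
enum {
    LEX_DIGIT       = 1 << 0,
    LEX_ALPHA       = 1 << 1,
    LEX_SPACE       = 1 << 2,
    LEX_IDENT_START = 1 << 3   /* letters and '_' */
};

/* One entry per possible byte value; bytes >= 128 stay unclassified (0). */
static unsigned char lex_table[256];

/* Fill the table for plain ASCII. Call once before lexing.
   Assumes the letters and digits are contiguous, as they are in ASCII. */
static void lex_table_init_ascii(void)
{
    const char *ws = " \t\n\r\f\v";
    const char *p;
    int c;

    for (c = '0'; c <= '9'; c++) lex_table[c] |= LEX_DIGIT;
    for (c = 'a'; c <= 'z'; c++) lex_table[c] |= LEX_ALPHA | LEX_IDENT_START;
    for (c = 'A'; c <= 'Z'; c++) lex_table[c] |= LEX_ALPHA | LEX_IDENT_START;
    lex_table['_'] |= LEX_IDENT_START;
    for (p = ws; *p != '\0'; p++) lex_table[(unsigned char)*p] |= LEX_SPACE;
}

/* Non-zero if byte c has any of the given flags. Independent of setlocale(). */
static int lex_is(unsigned char c, unsigned flags)
{
    return (lex_table[c] & flags) != 0;
}
```

With this layout the lexer only ever calls lex_is(), so supporting another single-byte character set would mean writing another init function, not another lexer.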

And no, it is not a mistake to rely on the ctype.h functions such as isdigit, isalpha and so on...

At a later stage you may want wide characters: standard C provides an equivalent of ctype.h for them, wctype.h. So it might be in your best interests to define a macro now, so that later on you can transparently switch the code to handle different character sets...

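/* LEX_ISDIGIT is an illustrative name; wrapping avoids redefining isdigit itself. */
#ifdef LEX_WIDECHARS
#include <wctype.h>
#define LEX_ISDIGIT(c)  iswdigit(c)
#else
#include <ctype.h>
/* The cast avoids undefined behaviour for negative char values. */
#define LEX_ISDIGIT(c)  isdigit((unsigned char)(c))
#endif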

It would be defined something like that in that context...

Hope this helps, Best regards, Tom.


The ctype.h functions are not very usable for chars that contain anything but ASCII. At program startup the locale is "C" (essentially the same as ASCII on most machines), no matter what the system locale is. Even if you use setlocale to change the locale, chances are that the system uses a multi-byte encoding (e.g. UTF-8), where one character can span several chars, in which case you cannot tell anything useful from a single char.
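To make that concrete, here is a small sketch (count_alpha_bytes is a made-up helper name) that counts how many bytes of a string isalpha accepts. It assumes the program has not called setlocale, so the "C" locale is in effect.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the bytes of s for which isalpha() is true.
   The cast to unsigned char matters: passing a negative char value
   to a ctype.h function is undefined behaviour. */
static size_t count_alpha_bytes(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```

For the UTF-8 string "caf\xC3\xA9" (the word "café"), only the three ASCII letters are counted in the "C" locale. The two bytes that encode the accented letter are each rejected, because neither byte is a character on its own.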

Wide chars handle more cases properly, but even they fail too often.

So, if you want to support non-ASCII isspace reliably, you have to do it yourself (or possibly use an existing library).
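A minimal sketch of what "doing it yourself" could look like for UTF-8 input: decode one code point, then test it against the code points that Unicode lists with the White_Space property. The function names utf8_decode and is_unicode_space are invented for this example, and the whitespace list is hard-coded rather than taken from a library, so treat it as a starting point only.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (len bytes available).
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 for malformed, overlong, or truncated input. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n, i;
    uint32_t c, min;

    if (len == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                      /* stray continuation or invalid lead byte */

    if (len < n) return 0;              /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return n;
}

/* Non-zero for code points with the Unicode White_Space property. */
static int is_unicode_space(uint32_t cp)
{
    return (cp >= 0x09 && cp <= 0x0D) || cp == 0x20 || cp == 0x85 ||
           cp == 0xA0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A) ||
           cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F ||
           cp == 0x3000;
}
```

A lexer would call utf8_decode in its main loop and classify the resulting code point, instead of classifying individual bytes.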

Note: ASCII only has character codes 0-127 (of which 32-126 are printable), and what some call 8 bit ASCII is actually some other character set (commonly CP437, CP1252 or ISO-8859-1, and often something else again).


You are not likely to get very far in trying to build a locale-sensitive parser; it will drive you mad. ASCII works fine for most parsing needs, so don't fight it :D

If you do want to fight it and use some of the character classifications, you should look at the ICU library, which implements Unicode religiously.


Generally you need to ask yourself:

  • What exactly do you want to do, and what kind of parsing?
  • What languages do you want to support: a wide range, or Western European only?
  • What encoding do you want to use: UTF-8 or a localized 8-bit encoding?
  • What OS are you using?

Let's start: if you work with Western languages in a localized 8-bit encoding, then probably yes, you may rely on is*, provided the locales are installed and configured.

However:

  • If you work with UTF-8 you can't, because only ASCII would be covered: everything outside of ASCII takes more than one byte.
  • If you want to support Eastern languages, all your assumptions about parsing would be wrong. For example, Chinese does not use spaces to separate words. Many languages do not even have upper or lower case, including alphabet-based ones like Hebrew or Arabic.

So, what exactly do you want to do?

I'd suggest taking a look at the ICU library, which has various break iterators, or at other toolkits like Qt that provide some basic boundary analysis.
