开发者

Does Lua support Unicode?

Based on the link below, I'm confused as to whether the Lua programming language supports Unicode.

http://lua-users.org/wi开发者_StackOverflowki/LuaUnicode

It appears it does but has limitations. I simply don't understand, are the limitation anything big/key or not a big deal?


You can certainly store unicode strings in lua, as utf8. You can use these as you would any string.

However Lua doesn't provide any default support for higher-level "unicode aware" operations on such strings—e.g., counting string length in characters, converting lower-to-upper-case, etc. Whether this lack is meaningful for you really depends on what you intend to do with these strings.

Possible approaches, depending on your use:

  1. If you just want to input/output/store strings, and generally use them as "whole units" (for table indexing etc), you may not need any special handling at all. In this case, you just treat these strings as binary blobs.

  2. Due to utf8's clever design, some types of string manipulation can be done on strings containing utf8 and will yield the correct result without taking any special care.

    For instance, you can append strings, split them apart before/after ascii characters, etc. As an example, if you have a string "開発.txt" and you search for "." in that string using string.find (string_var, "."), and then split it using the normal string.sub function into "開発" and ".txt", those result strings will be correct utf8 strings even though you're not using any kind of "unicode-aware" algorithm.

    Similarly, you can do case-conversions on only the ASCII characters in strings (those with the high bit zero), and treat the rest of the strings as binary without screwing them up.

  3. Some utf8-aware operations are so simple that it's easy to just write one's own functions to do them.

    For instance, to calculate the length in unicode-characters of a string, just count the number of characters with the high bit zero (ASCII characters), and the number of characters with the top two bits 11 ("leading bytes" for non-ASCII characters); the length is the sum of those two.

  4. For more complex operations—e.g., case-conversion on non-ASCII characters, etc.—you'll probably have to use a Lua unicode library, such as those on the (previously mentioned) Lua-users Unicode page


Lua does not have any support for unicode (other than accepting any byte value in strings). The library slnunicode has a lot of unicode string functions, however. For example unicode.utf8.len.

(note: this answer is completely stolen from grom's comment on another question - I just think it deserves its own answer)


If you want a short answer, it is 'yes and no' as put on the linked site.

Lua supports Unicode in the way that specifying, storing and querying arbitrary byte values in strings is supported, so you can store any kind of Unicode-encoding encoded string in a Lua string.

What is not supported is iteration by unicode character, there is no standard function for string length in unicode characters etc. So the higher-level kind of Unicode support (like what is available in Python with length, lower -> upper case conversion, encoding in arbitrary coding etc) is not available.


Lua 5.3 was released now. It comes with a basic UTF-8 library.

You can use the utf8 library to do things about UTF-8 encoding, like getting the length of a UTF-8 string (not number of bytes as string.len), matching each characters (not bytes), etc.

It doesn't provide native support other than encoding, like is this character a Chinese character?


It supports it in the sense that you can use Unicode in Lua strings. It depends specifically on what you're planning to do, but most of the limitations can be fairly easily worked around by extending Lua with your own functions.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜