Regular Expressions for Japanese in Lua
I want to process Japanese vocabulary in Lua (LuaTeX, to be specific). The vocabulary is stored in a text file that is read in, and the words on each line should then be matched by a regular expression. Lines are written like this:
| がくせい | student |
This is my code so far:
function readFile(fn)
  local file = assert(io.open(fn, "r"))
  local contents = file:read("*a")
  file:close()
  return contents
end

function processTest(contents)
  for line in contents:gmatch("%a+") do
    print(line)
  end
end

a = readFile("vocabulary.org")
processTest(a)
The problem now is that only the English words are printed:
student
I have to mention that I'm new to Lua and LuaTeX, so if there is a better approach to it I would be happy to know.
Anyway, is there any possibility to get the Japanese words?
You cannot use %a for this. It matches only a single byte, and which bytes count as letters depends on the locale (usually only bytes that encode a letter in ASCII or Latin-1).
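A small check makes the byte-wise behaviour visible. This is only a sketch: the sample line is spelled with decimal byte escapes so that it does not depend on the encoding of the source file, and the call to os.setlocale("C") is an assumption made here to keep %a deterministic (your LuaTeX setup may run under a different locale).

```lua
-- Assumption: force the "C" locale so that %a means ASCII letters only.
os.setlocale("C")

-- "| がくせい | student |" with the kana written as UTF-8 bytes.
local line = "| \227\129\140\227\129\143\227\129\155\227\129\132 | student |"

-- Collect everything %a+ matches, as the loop in the question does.
local matches = {}
for word in line:gmatch("%a+") do
  matches[#matches + 1] = word
end

-- Print how many matches were found and the first one.
print(#matches, matches[1])
```

None of the bytes that make up the kana are letters in the "C" locale, so the only match collected is the English word.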
To match UTF-8 encoded letters you would need to break them down into ranges of bytes, as in the example here.
For example, some patterns for UTF-8-encoded Hiragana might include:
(\227\129[\129-\191])
(\227\130[\128-\159])
A full list of patterns to match all Unicode letters (which would need to include hundreds of subranges) would be unwieldy.
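Here is a rough sketch of how such byte-range patterns could be applied to a vocabulary line. Lua patterns have no alternation, so the two Hiragana ranges cannot be merged into a single pattern. The sketch therefore walks the line one UTF-8 character at a time and tests each character against both ranges. The names is_hiragana and hiragana_words are made up for this example, and the input is assumed to be valid UTF-8.

```lua
-- True if ch is exactly one UTF-8 encoded Hiragana character (U+3041..U+309F).
-- The second range stops at \159 (U+309F), the end of the Hiragana block.
local function is_hiragana(ch)
  return ch:find("^\227\129[\129-\191]$") ~= nil
      or ch:find("^\227\130[\128-\159]$") ~= nil
end

-- Return a list of all runs of consecutive Hiragana characters in line.
local function hiragana_words(line)
  local words, current = {}, {}
  -- One lead byte (ASCII or 194..244) followed by any continuation bytes.
  for ch in line:gmatch("[\1-\127\194-\244][\128-\191]*") do
    if is_hiragana(ch) then
      current[#current + 1] = ch
    elseif #current > 0 then
      -- A non-Hiragana character ends the current run.
      words[#words + 1] = table.concat(current)
      current = {}
    end
  end
  if #current > 0 then
    words[#words + 1] = table.concat(current)
  end
  return words
end

-- "| がくせい | student |" with the kana written as UTF-8 bytes.
local line = "| \227\129\140\227\129\143\227\129\155\227\129\132 | student |"
for _, word in ipairs(hiragana_words(line)) do
  print(word)
end
```

Katakana and Kanji would need further ranges of their own, which is where this approach starts to become unwieldy.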
I'm not a Lua guru, but I think you are probably out of luck. Lua doesn't consume Unicode files "natively," as it were. It just treats what it reads as a series of bytes and doesn't do any interpretation on it. In particular, your gmatch() call isn't likely to do what you want.
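To illustrate the "series of bytes" point, here is a minimal sketch that compares the byte length of a Japanese word with its character count. The variable names are only illustrative, and the word is written with decimal byte escapes so the result does not depend on how the source file is saved.

```lua
-- "がくせい" as UTF-8 bytes: four kana, three bytes each.
local word = "\227\129\140\227\129\143\227\129\155\227\129\132"

-- The length operator counts bytes, not characters.
local byte_count = #word

-- Count characters by counting every byte that is not a UTF-8
-- continuation byte (continuation bytes lie in the range 128..191).
local _, char_count = word:gsub("[^\128-\191]", "")

print(byte_count, char_count)
```

The two numbers differ because every kana occupies several bytes, and the standard string functions see only those bytes.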
There was a big discussion about i18n on the mailing list recently here. This discussion here may also help.