Is there a library for Python that gives the script name for a given unicode character or string?

Is there a library that tells what script a particular unicode character belongs to?

For example, for the input u'ሕ' it should return Ethiopic or similar.


Maybe the data in the unicodedata module is what you are looking for:

import unicodedata
print(unicodedata.name(u"ሕ"))

prints

ETHIOPIC SYLLABLE HHE

The printed name can be used to look up the corresponding character:

unicodedata.lookup("ETHIOPIC SYLLABLE HHE")
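Since the character name usually starts with the script, a rough guess can be taken from its first word. This is only a heuristic sketch for Python 3, not the Unicode Script property: the helper name rough_script_from_name is made up for illustration, and the guess is wrong for characters whose names do not begin with a script (for example "DIGIT ONE").

```python
import unicodedata

def rough_script_from_name(char):
    """Guess a script label from the first word of the character's name.

    Heuristic only: this does not consult the Unicode Script property.
    """
    try:
        name = unicodedata.name(char)
    except ValueError:
        # Raised for characters that have no name, such as control characters.
        return "Unknown"
    return name.split(" ", 1)[0].title()

print(rough_script_from_name("ሕ"))
print(rough_script_from_name("A"))
```

For reliable results, use the actual script data rather than the name.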


You can parse the Scripts.txt file:

# -*- coding: utf-8; -*-

import bisect

script_file = "/path/to/Scripts.txt"
scripts = []

with open(script_file, "rt") as stream:
    for line in stream:
        line = line.split("#", 1)[0].strip()
        if line:
            rng, script = line.split(";", 1)
            elems = rng.split("..", 1)
            start = int(elems[0], 16)
            if len(elems) == 2:
                stop = int(elems[1], 16)
            else:
                stop = start
            scripts.append((start, stop, script.lstrip()))

scripts.sort()
indices = [elem[0] for elem in scripts]

def find_script(char):
    if not isinstance(char, int):
        char = ord(char)
    index = bisect.bisect(indices, char) - 1
    start, stop, script = scripts[index]
    if start <= char <= stop:
        return script
    else:
        return "Unknown"

print(find_script(u'A'))
print(find_script(u'Д'))
print(find_script(u'ሕ'))
print(find_script(0x1000))
print(find_script(0xE007F))
print(find_script(0xE0080))

Note that this code is neither robust nor optimized. You should test whether the argument denotes a valid character or code point, and you should coalesce adjacent equivalent ranges.
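Those two improvements could look roughly like the following for Python 3. It is a sketch under assumptions: the names coalesce, to_code_point and find_script_checked are made up here, and the sample rows are illustrative data in the same (start, stop, script) shape as the parsed list, not a copy of Scripts.txt.

```python
import bisect

# Illustrative rows in the (start, stop, script) shape used by the parser.
sample = [
    (0x0400, 0x0481, "Cyrillic"),
    (0x0482, 0x0482, "Cyrillic"),
    (0x0483, 0x0484, "Cyrillic"),
    (0x0041, 0x005A, "Latin"),
]

def coalesce(ranges):
    """Merge ranges that touch each other and carry the same script."""
    merged = []
    for start, stop, script in sorted(ranges):
        if merged and merged[-1][2] == script and merged[-1][1] + 1 == start:
            merged[-1] = (merged[-1][0], stop, script)
        else:
            merged.append((start, stop, script))
    return merged

def to_code_point(char):
    """Return a code point for an int or a one-character string, else raise."""
    if isinstance(char, str):
        if len(char) != 1:
            raise ValueError("expected exactly one character")
        return ord(char)
    if isinstance(char, int) and 0 <= char <= 0x10FFFF:
        return char
    raise ValueError("not a valid Unicode code point: %r" % (char,))

def find_script_checked(char, ranges, starts):
    """Look up the script of a validated code point in coalesced ranges."""
    cp = to_code_point(char)
    index = bisect.bisect(starts, cp) - 1
    if index >= 0:
        start, stop, script = ranges[index]
        if start <= cp <= stop:
            return script
    return "Unknown"

merged = coalesce(sample)
starts = [r[0] for r in merged]
print(merged)
print(find_script_checked("Д", merged, starts))
```

Coalescing shortens the list that bisect has to search, and the explicit check turns bad input into a ValueError instead of a misleading "Unknown".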
