Is there a library for Python that gives the script name for a given unicode character or string?
Is there a library that tells which script a particular Unicode character belongs to?
For example, for the input u'ሕ' it should return Ethiopic or similar.
Maybe the data in the unicodedata module is what you are looking for:
import unicodedata
print(unicodedata.name(u"ሕ"))
prints
ETHIOPIC SYLLABLE HHE
The printed name can be used to look up the corresponding character:
unicodedata.lookup("ETHIOPIC SYLLABLE HHE")
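If a rough label is enough, one heuristic is to take the first word of the character name. This is only a sketch: rough_script is a hypothetical helper name, not part of unicodedata, and the first word of a name is not always a script (digits and punctuation give words such as DIGIT), so it is no substitute for the real Script property.

```python
import unicodedata

def rough_script(char):
    """Guess a script label from the first word of the character's Unicode name.

    Heuristic only (an assumption of this sketch, not an official mapping):
    unnamed code points yield "Unknown", and characters such as digits or
    punctuation yield words that are not script names.
    """
    name = unicodedata.name(char, "")  # "" when the character has no name
    return name.split(" ", 1)[0] if name else "Unknown"

print(rough_script(u"ሕ"))  # first word of "ETHIOPIC SYLLABLE HHE"
print(rough_script(u"A"))
print(rough_script(u"1"))
```

The result is upper case and follows character naming rather than script assignment, so treat it as a quick approximation.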
You can parse the Scripts.txt file from the Unicode Character Database:
# -*- coding: utf-8; -*-
import bisect

script_file = "/path/to/Scripts.txt"

# Parse Scripts.txt into (start, stop, script) tuples.
scripts = []
with open(script_file, "rt") as stream:
    for line in stream:
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line:
            rng, script = line.split(";", 1)
            elems = rng.split("..", 1)
            start = int(elems[0], 16)
            if len(elems) == 2:
                stop = int(elems[1], 16)
            else:
                stop = start
            scripts.append((start, stop, script.lstrip()))

scripts.sort()
indices = [elem[0] for elem in scripts]

def find_script(char):
    if not isinstance(char, int):
        char = ord(char)
    # Locate the last range whose start is <= char.
    index = bisect.bisect(indices, char) - 1
    start, stop, script = scripts[index]
    if start <= char <= stop:
        return script
    else:
        return "Unknown"

print(find_script(u'A'))
print(find_script(u'Д'))
print(find_script(u'ሕ'))
print(find_script(0x1000))
print(find_script(0xE007F))
print(find_script(0xE0080))
Note that this code is neither robust nor optimized. You should test whether the argument denotes a valid character or code point, and you should coalesce adjacent equivalent ranges.
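The two improvements mentioned above could look roughly like the following sketch. The names coalesce and find_script_checked are hypothetical, and the sample ranges are toy values for illustration, not real entries from Scripts.txt.

```python
import bisect

def coalesce(ranges):
    """Merge touching (start, stop, script) ranges that share a script."""
    merged = []
    for start, stop, script in sorted(ranges):
        if merged and merged[-1][2] == script and merged[-1][1] + 1 == start:
            # Extend the previous range instead of adding a new one.
            merged[-1] = (merged[-1][0], stop, script)
        else:
            merged.append((start, stop, script))
    return merged

def find_script_checked(char, ranges):
    """Look up a script in sorted ranges, rejecting invalid arguments."""
    if not isinstance(char, int):
        if len(char) != 1:
            raise ValueError("expected a single character")
        char = ord(char)
    if not 0 <= char <= 0x10FFFF:
        raise ValueError("code point out of range")
    # The start list is rebuilt on every call to keep the sketch short;
    # real code would compute it once.
    starts = [r[0] for r in ranges]
    index = bisect.bisect(starts, char) - 1
    if index >= 0 and ranges[index][0] <= char <= ranges[index][1]:
        return ranges[index][2]
    return "Unknown"

# Toy data (assumed for illustration only).
sample = [(0x30, 0x3F, "B"), (0x10, 0x1F, "A"), (0x20, 0x2F, "A"), (0x50, 0x5F, "B")]
merged = coalesce(sample)
print(merged)
print(find_script_checked(0x25, merged))
print(find_script_checked(0x45, merged))
```

Only ranges that touch and carry the same script are merged, so a gap such as 0x40 to 0x4F in the toy data still reports "Unknown".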