开发者

Google AJAX Language API with Chinese language

Does anyone know if there is support for Chinese pinyin? I get the results here with c开发者_JS百科orrect Chinese pinyin (see "Show romanization" link).

Thank you.


I don't know if Google AJAX Language APIs have support for converting to pinyin, but if they don't it actually isn't too hard to do a passable conversion on your on. (The reverse conversion, from pinyin to hanzi (characters) is much more tricky, because pinyin is very lossy.)

To do the conversion yourself, grab the Unihan.zip, a downloaable verion of the Unihan database. The file you actually care about is Unihan_Readings.txt. It also contains a bunch of stuff you don't care about, and it's also stored in a pretty inefficient way, so don't be too worried about the large file sizes. You should extract the stuff you care about and store it in a more efficient way.

In it you'll find tab-delimited lines like this:

U+597D  kCantonese      hou2 hou3
U+597D  kDefinition     good, excellent, fine; well
U+597D  kHangul         호
U+597D  kHanyuPinlu     hao3(6060) hao1(142) hao4(115)
U+597D  kHanyuPinyin    21028.010:hǎo,hào
U+597D  kJapaneseKun    KONOMU SUKU YOI
U+597D  kJapaneseOn     KOU
U+597D  kKorean         HO
U+597D  kMandarin       HAO3 HAO4
U+597D  kTang           *xɑ̀u *xɑ̌u
U+597D  kVietnamese     háo
U+597D  kXHC1983        0445.030:hǎo 0448.030:hào

The left column ("U+597D") is the unicode codepoint, the middle column is an attribute name, and the right column is the attribute value. You can extract either the kHanyuPinyin attributes or the kMandarin attributes. They encode basically the same information -- just go with whichever is an easier format for you to deal with. (hǎo == HAO3, hào == HAO4, if that isn't obvious)

You'll note that for some characters (like the example I've chosen here) there are multiple pronunciations. This is the one tricky bit. Depending on how much precision you want, you may be able to get away with just using the first romanization listed, as they're in order of decreasing frequency. (Actually, this is one of the places where kHanyuPinyin is a bit different from kMandarin -- it actually has multiple lists of pronunciations, each ordered by frequency.)


You can trick the API into giving you Pinyin by translating from Chinese to Chinese. Sample link.


Google translate includes "show/hide romanization" which is BETTER than UNIHAN for two reasons. First, known words are logically grouped together in the proper manner (at least it tries to do that). Secondly, Chinese characters have more than one possible pronunciation. It is not a trivial problem to figure out which pinyin transliteration is the right one. That's what the translation engine does.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜