How do you sort CJK (Asian) characters in Perl, or with any other programming language?

2023-01-19 11:20 问答作者：

How do you sort Chinese, Japanese and Korean (CJK) characters in Perl?

As far as I can tell, sorting CJK characters by stroke count, then by radical, seems to be the way these languages are sorted. There are also some methods that sort by soun开发者_高级运维ds, but this seems less common.

I've tried using:

perl -e 'print join(" ", sort qw(工 然 一 人 三 古 二 )), "\n";'
# Prints: 一 三 二 人 古 工 然 which is incorrect

And I've tried using Unicode::Collate from CPAN, but it says:

By default, CJK Unified Ideographs are ordered in Unicode codepoint order...

If I could get a database of stroke count per character, I could easily sort all of the characters, but this doesn't seem to come with Perl nor is it encapsulated in any module I could find.

If you know how to sort CJK in other languages, it would be helpful to mention it in an answer to this question.

See TR38 for the dirty details and corner cases. It's not as easy as you think and as this code sample looks like.

use 5.010;
use utf8;
use Encode;
use Unicode::Unihan;
my $u = Unicode::Unihan->new;

say encode_utf8 sprintf "Character $_ has the radical #%s and %d residual strokes." , split /[.]/, $u->RSUnicode($_) for qw(工 然 一 人 三 古 二);
__END__
Character 工 has the radical #48 and 0 residual strokes.
Character 然 has the radical #86 and 8 residual strokes.
Character 一 has the radical #1 and 0 residual strokes.
Character 人 has the radical #9 and 0 residual strokes.
Character 三 has the radical #1 and 2 residual strokes.
Character 古 has the radical #30 and 2 residual strokes.
Character 二 has the radical #7 and 0 residual strokes.

See http://en.wikipedia.org/wiki/List_of_Kangxi_radicals for a mapping from radical ordinal number to stroke count.

A Japanese phonebook is sorted on a phonetic basis (gojûon collation). However, kanji character order is not based on phonetics, no matter whether in Unicode, JIS, S-JIS or EUC. Only kana are based on phonetic order. This means you can not collate meaningfully without phonetic conversion!

For example:

a) kanji:           東京駅
b) kana converted:  とうきょうえき
c) romanisation:    tôkyô eki

With b) or c), you can make a meaningful sort. But you can not do with only a). Of course, you can run the plain sort function, but it is not meaningful for Japanese.

Check out my rubygem toPinyin, which will convert a UTF-8 encoded chinese character to their PinYin (pronunciation). And then, a sort could be done on the Pinyin easily.

Simply, gem install toPinyin

require 'toPinyin'

words = "
人
没有
理想
跟
咸鱼
有
什么
区别
".split("\n")

words.sort! {|a ,b|   a.pinyin.join <=> b.pinyin.join }

https://github.com/pierrchen/toPinyin

继续阅读：cjk collation perl sorting unicode

How do you sort CJK (Asian) characters in Perl, or with any other programming language?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？