Printing out Japanese (Chinese) characters
I read Japanese, and want to try processing some Japanese text. I tried this using Python 3:
for i in range(1, 65535):
    print(chr(i), end='')
Python then gave me tons of errors. What went wrong?
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~Traceback (most recent call last):
File "C:\test\char.py", line 11, in <module>
print(chr(i), end='')
File "C:\Python31\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 0: character maps to <undefined>
My understanding is that the chr function goes on to convert Unicode numbers into the respective Japanese characters. If so, why are the Japanese characters not outputted? Why does it crash at the end of the list of Roman characters?
Please also correct me if I am mistaken in my understanding that the Unicode set was devised solely to cater for non-Western languages.
EDIT:
I tried the 3 lines suggested by John Machin in IDLE, and the output worked!
Before this, I had been using Programmer's Notepad, with the Tools set to capture python.exe compiler's output. Perhaps that is why the errors came about.
However, for most other things, the output is captured properly; then why does it fail particularly in this process? i.e. Why does the code work in the IDLE Python Shell, but not through Programmer's Notepad output capture? Shouldn't the output be the same, regardless of the interface?
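One way to see why the two environments differ is to ask Python what encoding it thinks stdout has (a hypothetical diagnostic, not part of my original script):

```python
import sys

# Each environment wraps stdout with its own character encoding.
# Under the Windows console / Programmer's Notepad capture this is
# typically 'cp1252'; under IDLE or a UTF-8 terminal it is 'utf-8'.
print(sys.stdout.encoding)
```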
If as you say you read Japanese, you must be aware that Japanese is written using FOUR different types of characters: (1) kanji (Chinese characters) (2) Katakana (3) Hiragana (4) Romaji ("Roman" letters). There are many tens of thousands of kanji of which only a few thousand are in common use.
Your code, had it worked as you imagined it might, would have printed not only the "Roman" characters, but also Greek, Arabic, Hebrew, Cyrillic (used in Russian etc.), Armenian, half a dozen or so different but related scripts used in India, many I've left out, about 11 thousand Hangul syllables (used in Korean), and a bunch of gibberish for code points that aren't used, and (depending on which shell you were running it in) may have crashed when it got to 0xD800 (the first surrogate).
A little less ambition will give you Hiragana, Katakana, and a few "CJK Unified Ideographs". The examples below were run in IDLE.
>>> for i in range(0x3040, 0x30a0): print(chr(i), end='')
ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをんゔゕゖ゙゚゛゜ゝゞゟ
>>> for i in range(0x30a0, 0x3100): print(chr(i), end='')
゠ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ・ーヽヾヿ
>>> for i in range(0x4e00, 0x4f00): print(chr(i), end='')
一丁丂七丄丅丆万丈三上下丌不与丏丐丑丒专且丕世丗丘丙业丛东丝丞丟丠両丢丣两严並丧丨丩个丫丬中丮丯丰丱串丳临丵丶丷丸丹为主丼丽举丿乀乁乂乃乄久乆乇么义乊之乌乍乎乏乐乑乒乓乔乕乖乗乘乙乚乛乜九乞也习乡乢乣乤乥书乧乨乩乪乫乬乭乮乯买乱乲乳乴乵乶乷乸乹乺乻乼乽乾乿亀亁亂亃亄亅了亇予争亊事二亍于亏亐云互亓五井亖亗亘亙亚些亜亝亞亟亠亡亢亣交亥亦产亨亩亪享京亭亮亯亰亱亲亳亴亵亶亷亸亹人亻亼亽亾亿什仁仂仃仄仅仆仇仈仉今介仌仍从仏仐仑仒仓仔仕他仗付仙仚仛仜仝仞仟仠仡仢代令以仦仧仨仩仪仫们仭仮仯仰仱仲仳仴仵件价仸仹仺任仼份仾仿
Update The reason you had a problem is that the shell/IDE that you were using supplies only the bog-standard Windows GUI stdout, for which the default encoding (in your neck of the woods) is cp1252 (remember the mention of cp1252 in your traceback?), which is adequate in your case for the Romaji but not much else. Available-anywhere-without-downloads alternatives: (1) IDLE (2) write a file encoded in UTF-8 and read it in Notepad. I'm sure others could suggest other IDEs.
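A minimal sketch of option (2), assuming a throwaway filename kana.txt: write the Hiragana block to a UTF-8 encoded file, then open it in Notepad (or any UTF-8-aware editor):

```python
# Write the Hiragana block (U+3040..U+309F) to a UTF-8 file;
# the filename "kana.txt" is just an example.
with open("kana.txt", "w", encoding="utf-8") as f:
    for i in range(0x3040, 0x30a0):
        f.write(chr(i))
```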
Your problem is your default terminal (output) encoding, probably latin-1 or even the perennial Python default, ASCII. Those can't encode Japanese characters (since it's assumed that the terminal can't display them).
If your terminal does UTF-8 (the most widely used Unicode encoding in the Western world), you can either "trick" Python into taking this as the default output encoding, or you can explicitly encode the text to UTF-8 yourself. Note that in Python 3, print()ing a bytes object shows its b'...' repr, so write the encoded bytes to the binary layer of stdout instead:
>>> import sys
>>> sys.stdout.buffer.write(chr(i).encode("utf-8"))
And as to the "solely", I think that's wrong. It was created to be the one encoding to bind them... ehm, sorry, the one and only encoding we'll ever need. The encoding (okay, that's using "encoding" not in the sense it's used in the Unicode definition) that can be used to encode all text documents.
You're attempting to encode a character (\x80) that isn't defined by your codec; there is no correct mapping so charmap_encode raises an exception. You could wrap the print statement in a try: block, then catch and ignore the exception to only print the characters that you can encode.
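A sketch of that suggestion (the helper name printable_chars is mine, not from the answer): attempt to encode each character and keep only the ones the codec accepts:

```python
def printable_chars(start, stop, encoding):
    """Return the characters in [start, stop) that `encoding` can encode."""
    chars = []
    for i in range(start, stop):
        try:
            chr(i).encode(encoding)   # raises UnicodeEncodeError if unmapped
        except UnicodeEncodeError:
            continue                  # skip characters the codec can't map
        chars.append(chr(i))
    return "".join(chars)

# cp1252 keeps the ASCII range but drops every kana:
ascii_part = printable_chars(0x20, 0x7F, "cp1252")
kana_part = printable_chars(0x3040, 0x30A0, "cp1252")
```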
No need to try all 65,536 code points of the BMP. Just use the code blocks used for Japanese text, e.g. (Python 3):
for i in range(0x3040, 0x30a0): print(chr(i), end='')
The above covers the Hiragana block. The same approach works for Katakana and Kanji as well.
Keep in mind that the average Japanese reader uses around 2,000-2,500 kanji characters. Chinese, however, probably uses around 5,000-6,000.
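Putting the three block ranges together (a sketch; the range boundaries come from the Unicode block charts, and the names BLOCKS and block_text are mine):

```python
# BMP blocks relevant to Japanese text (per the Unicode block charts).
BLOCKS = {
    "hiragana": range(0x3040, 0x30A0),
    "katakana": range(0x30A0, 0x3100),
    "cjk_unified": range(0x4E00, 0xA000),   # the kanji live here
}

def block_text(name):
    """All characters of the named block as one string."""
    return "".join(chr(i) for i in BLOCKS[name])
```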