Python - codec encoding ascii to unicode: error

I am trying to reverse the transliteration of an input file (currently in English) back to its original form (in Hindi).

A sample of the input file looks like this:

E-k- b-u-d-z*dhi-m-aan- p-ksii#

E-k- ghn-e- j-ngg-l- m-e-ng E-k- b-h-u-t- UUNNc-aa p-e-dr thaa#
U-s- k-ii p-t-z*t-o-ng s-e- l-d-ii shaakhaay-e-ng m-j-*zb-uut- b-aaj-u-O-ng k-ii t-r-h- pheil-ii h-u-II thiing#
w-n- h-NNs-o-ng k-aa E-k- jhu-nhz*D- I-s- p-e-dr p-r- n-i-w-aas- k-r-t-aa thaa#
w-e- s-b- y-h-aaNN s-u-r-ksi-t- the- AUr- b-dre- AAr-aam- s-e- r-h-t-e- the-#
U-n- m-e-ng s-e- E-k- p-ksii b-h-u-t- b-u-d-z*dhi-m-aan- thaa#
I-s- b-u-d-z*dhi-m-aan- p-ksii n-e- E-k- d-i-n- p-e-dr k-ii j-dr m-e-ng s-e- E-k- l-t-aa k-o- U-g-t-e- d-e-khaa# 
I-s- k-e- b-aar-e- m-e-ng U-s-n-e- d-uus-r-e- p-ksi-y-o-ng s-e- b-aat- k-ii#
"k-z*y-aa t-u-m-z*h-e-ng w-h- l-t-aa d-i-khaaII d-e-t-ii h-ei", U-s- n-e- U-n- s-e- p-uuchaa "t-u-m-z*h-e-ng I-s-e- n-Shz*T- k-r- d-e-n-aa c-aah-i-E-"#
"I-s-e- k-z*y-o-ng n-Shz*T- k-r- d-e-n-aa c-aah-i-E-?" h-NNs-o-ng n-e- AAshz*c-*ry- s-e- p-uuchaa "y-h- t-o- I-t-n-ii cho-T-ii s-e- h-ei#
h-m-e-ng y-h- k-z*y-aa h-aan-i- p-h-u-NNc-aa s-k-t-ii h-ei"#
"m-e-r-e- m-i-tro-ng," b-u-d-z*dhi-m-aan- p-ksii n-e- U-t-z*t-r- d-i-y-aa "w-h- cho-T-ii s-ii l-t-aa j-l-z*d-ii h-ii b-drii h-o- j-aay-e-g-ii#
y-h- h-m-aar-e- p-e-dr p-r- c-Dh*z k-r- U-s- s-e- l-i-p-T-t-ii j-aay-e-g-ii AUr- phi-r- m-o-T-ii AUr- m-j-*zb-uut- h-o- j-aay-e-g-ii"#
"t-o- k-z*y-aa h-u-AA"#

Its equivalent meaning in English is:

A WISE OLD BIRD.

Deep in the forest stood a very tall tree.
Its leafy branches spread out like long arms.
This was the home of a flock of wild geese.
They were safe there.
One of the geese was a wild old bird.
One  day this wise old bird noticed  a small creeper growing at the foot of the tree.
He spoke to the other birds about it.
"Do you see that creeper ?" he said to them.
"You must destroy it."
"Why must we destroy it ?" asked the geese in surprise.
"It is so small.
What harm can it do?"
"My friends," replied the wise old bird, " that little creeper will soon grow.

My script looks like this:

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys
CODEC = 'utf-8'
input_file=sys.argv[1]
output_file=sys.argv[2]
list1=[]



f=open(input_file,'r')
f1 = open(output_file,'w')

english_hindi_dict={'A' : u'अ' ,  'AA' : u'आ ' , 'I' : u'इ' , 'II' : u'ई ' , 'U' : u'उ ' ,\
                'UU' : u'ऊ' , 'r' : u'ऋ' , 'E' : u'ए' , 'ai' : u'ऐ' , 'O' : u'ओ' , 'AU' : u'औ' ,\
                'k' : u'क' , 'kh' : u'ख' , 'g' : u'ग' , 'gh' : u'घ' , 'c' : u'च' , 'ch' : u'छ',\
                'j': u'ज' , 'jh' : u'झ' , 'tr' : u'त्र' , 'T' : u'ट'  , 'Th' : u'ठ' , 'D' : u'ड',\
                'dr' : u'ड' , 'Dh' : u'ढ' , 'Na' : u'ण' , 'th' : u'त' ,  'tha' : u'थ',\
                'd' : u'द' , 'dh': u'ध' , 'n' : u'न' , 'p' : u'प' , 'ph' : u'फ' ,\
                'b' : u'ब' , 'bh' : u'भ' , 'm' : u'म' , 'y' : u'य' , 'r' : u'र' , 'l' : u'ल' ,\
                'w' : u'व' , 'sh' : u'श' , 'sha' : u'ष', 's' : u'स' , 'h' : u'ह' , 'ks' : u'क्ष' ,\
                'i' : u'ि' , 'ii' : u'ी' , 'u' : u'ु' , 'uu' : u'ू' , 'e' : u'े' ,\
                'aa' : u'ै' , 'o' : u'ो' , 'AU' : u'ौ' ,'H' : u'्' ,'mn' : u'ं' ,\
                'NN' : u'ँ' , 'AW' : u'ॅ' , 'rr' : u'ृ' , '4' : u'४' , '6': u'६'  , '8' : u'८',\
                '2' : u'२' , '5' : u'५' , '3' : u'३' , '7' : u'७' , '9' : u'९' , '1' : u'१'}
for line in f:
      #line=line.strip() to remove a line from its newline character....  
      #line=line.rstrip('.')   
      line=line.replace('-','')
      line=line.replace('#','|') # i am using the or symbol for poornviram
      #line=line.replace('।','')
      #line = line.lower()
for word in line:
    for ch in word:
        if (ch in english_hindi_dict) :
            translatedToken = english_hindi_dict[ch]
        else :
                translatedToken = ch

#{ translatedToken = english_hindi_dict[ch] }

#for ch in line:
    f1.write(translatedToken)
    #print translatedToken
    #line = line.replace( char,english_hindi_dict[char] )   
      #list1.append(line)
f.close()

f1.write(' '.join(list1))

f1.close()

The error that I am getting is:

python transliterate_eh_nw.py Hstory.txt op1.txt
Traceback (most recent call last):
  File "transliterate_eh_nw.py", line 43, in <module>
    f1.write(translatedToken)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u092f' in position 0: ordinal not in range(128)

Could you please tell me how to deal with this error? Thank you. :)


You have a few problems other than the one which you asked about.

(1) A conceptual problem: "E-k- b-u-d-z*dhi-m-aan- p-ksii#" is not "English". It is Hindi written in ASCII using some romanization scheme. It looks like ITRANS, but ITRANS doesn't have AA and A; it has only aa and a. Does the scheme have a name? Can you supply a URL? Your objective is better described as "transliterate some Hindi text from the unnamed romanization to Devanagari script".

(2) Showing the result of translating your text from Hindi to English ("A WISE OLD BIRD" etc) is only moderately useful. The expected Devanagari output would be a better idea.

(3) As remarked by @kaiser.se, the transliteration dictionary has multi-character (up to 3 characters!) keys, some of which are prefixes of others. Presumably AA must be recognised in priority to A, gh must be recognised before g, etc. Iterating over the items of a dictionary happens in an order that is predictable but for your purposes should be regarded as random. In the code that follows, I've given priority to longer "keys".

(4) Either the dictionary is missing some letter keys (a S t z) or the transliteration rules are more complicated than any of us has guessed so far.

(5) The meaning of the characters # * and - is not 100% obvious. It appears from your input text that z and * appear only in combination, as z* (or occasionally *z).

(6) It would be a good idea if you explained the interpretation of e.g. shaakhaay-e-ng ... does it start with sh then aa or does it start with sha then a? What are the rules?
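Point (3) can also be sketched as a single left-to-right pass in which longer keys win, instead of a chain of replace() calls. This is only an illustration under assumptions: SAMPLE_TABLE is a hypothetical four-entry excerpt of the table, build_pattern and transliterate are names I made up, and, as elsewhere in this answer, the hyphens are presumed to be separators that get dropped afterwards.

# -*- coding: utf-8 -*-
```python
# -*- coding: utf-8 -*-
import re

# Hypothetical four-entry excerpt; the real table is much larger.
SAMPLE_TABLE = {
    u'g': u'ग',
    u'gh': u'घ',
    u'n': u'न',
    u'e': u'े',
}

def build_pattern(table):
    """Compile an alternation that lists longer keys before shorter ones."""
    keys = sorted(table, key=len, reverse=True)
    return re.compile(u'|'.join(re.escape(k) for k in keys))

def transliterate(text, table):
    """Replace every key found in text, scanning once from left to right."""
    pattern = build_pattern(table)
    return pattern.sub(lambda m: table[m.group(0)], text)

# 'gh' is matched as one unit, so 'g' never gets a chance to split it.
result = transliterate(u'ghn-e-', SAMPLE_TABLE).replace(u'-', u'')
# result == u'घने'
```

Because regex alternation tries the alternatives in the order given, sorting the keys by length is what gives gh priority over g. Anything the table doesn't cover is left untouched, which makes the gaps easy to spot afterwards.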

The answer to the problem that you asked about is, as several others have pointed out, that you need to encode your unicode output using an encoding that is supported by your display device, e.g. UTF-8.

Here's some code:

#!/usr/bin/python
# -*- coding: UTF-8 -*-

input_data = """
E-k- b-u-d-z*dhi-m-aan- p-ksii#

E-k- ghn-e- j-ngg-l- m-e-ng E-k- b-h-u-t- UUNNc-aa p-e-dr thaa#
[snip]
"t-o- k-z*y-aa h-u-AA"#
"""

roman_devanagari_dict={'A' : u'अ' ,  'AA' : u'आ ' , 'I' : u'इ' , 'II' : u'ई ' , 'U' : u'उ ' ,\
[snip]
            '2' : u'२' , '5' : u'५' , '3' : u'३' , '7' : u'७' , '9' : u'९' , '1' : u'१'}

#Presuming we need to do the 3-letter cases then the 2-letter then the 1-letter
replacements = [(-len(k), unicode(k), v) for k, v in roman_devanagari_dict.items()]
replacements.sort()

data = input_data.decode('ascii')

for _junk, from_text, to_text in replacements:
    data = data.replace(from_text, to_text)

# Presuming the '-' are inter-character markers, delete them last, not first
data = data.replace(u'-', '')
data = data.replace(u'#', '')
print "untransliterated:", set(c for c in data if 0x20 < ord(c) < 0x7f)

BOM = u'\ufeff'
outf = open('devanagari.txt', 'w')
outf.write(BOM.encode('utf8')) # for the benefit of clueless Windows s/w
outf.write(data.encode('utf8'))
outf.close()

Output:

एक बुदz*धिमैन पक्षी

एक घने जनगगल मेनग एक बहुt ऊँचै पेड थa उ स की पtztोनग से लदी षaखैयेनग मजzबूt बैजुओनग की tरह फेिली हुई तीनग वन हँसोनग कै एक झुनहzड इस पेड पर निवैस करtै थa वे सब यहैँ सुरक्षिt ते ौर बडे आ रैम से रहtे ते उ न मेनग से एक पक्षी बहुt बुदzधिमैन थa इस बुदzधिमैन पक्षी ने एक दिन पेड की जड मेनग से एक लtै को उ गtे देखै इस के बैरे मेनग उ सने दूसरे पक्षियोनग से बैt की "कzयै tुमzहेनग वह लtै दिखैई देtी हेि", उ स ने उ न से पूछै "tुमzहेनग इसे नSहzट कर देनै चैहिए" "इसे कzयोनग नSहzट कर देनै चैहिए?" हँसोनग ने आ शzरय से पूछै "यह tो इtनी छोटी से हेि हमेनग यह कzयै हैनि पहुँचै सकtी हेि" "मेरे मित्रोनग," बुदzधिमैन पक्षी ने उ tztर दियै "वह छोटी सी लtै जलzदी ही बडी हो जैयेगी यह हमैरे पेड पर चढz कर उ स से लिपटtी जैयेगी ौर फिर मोटी ौर मजzबूt हो जैयेगी" "tो कzयै हुआ "

which has only a few recognisable words when shoved through Google Translate.

Update after examining the transliteration table more closely:

  • Three of the entries (AA, II, and U) have a space after the Devanagari equivalent. Perhaps the spaces should be removed.

  • The general pattern for consonants appears to be:

DEVANAGARI LETTER XA is represented by x
DEVANAGARI LETTER XXA is represented by X
DEVANAGARI LETTER XHA is represented by xh
DEVANAGARI LETTER XXHA is represented by Xh

However 3 entries break the pattern:
SSA -> sha but pattern says S
TA -> th but pattern says t
THA -> tha but pattern says th

Note: changing the above 3 entries stopped my code from complaining that S and t were left unchanged when transliterating your sample text, and removed the seemingly-anomalous sha and tha entries.

  • Two entries (D and dr) are mapped to the same character, DEVANAGARI LETTER DDA. D is the expected entry for that character; perhaps dr should be mapped elsewhere.

  • There is no entry for DEVANAGARI LETTER NGA (U+0919); perhaps it should be encoded as ng -- there are a few words ending in ng in the sample text.

  • Are the uncatered-for "z*" occurrences in the sample text anything to do with DEVANAGARI LETTER ZA (U+095B)?
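Checks like the ones in this update can be automated. Here is a rough sketch, assuming the table is kept as a list of (key, value) pairs rather than a dict literal. That matters because the question's literal lists r and AU twice, and a dict silently keeps only the last value for a repeated key. SAMPLE_PAIRS is a hypothetical excerpt and audit is a name I made up.

# -*- coding: utf-8 -*-
```python
# -*- coding: utf-8 -*-
import unicodedata
from collections import defaultdict

# Hypothetical excerpt, kept as pairs so that repeated keys stay visible.
SAMPLE_PAIRS = [
    (u'AA', u'आ '),   # trailing space, as in the question's table
    (u'D', u'ड'),
    (u'dr', u'ड'),    # same target as D
    (u'r', u'ऋ'),
    (u'r', u'र'),     # repeated key
]

def audit(pairs):
    """Return (keys with padded values, repeated keys, targets shared by several keys)."""
    padded = [k for k, v in pairs if v != v.strip()]
    seen, repeated = set(), []
    by_value = defaultdict(list)
    for key, value in pairs:
        if key in seen and key not in repeated:
            repeated.append(key)
        seen.add(key)
        if key not in by_value[value.strip()]:
            by_value[value.strip()].append(key)
    # Describe each shared target by its Unicode character name(s).
    shared = {u' + '.join(unicodedata.name(c) for c in v): ks
              for v, ks in by_value.items() if len(ks) > 1}
    return padded, repeated, shared

padded, repeated, shared = audit(SAMPLE_PAIRS)
# padded == ['AA'], repeated == ['r'],
# shared == {'DEVANAGARI LETTER DDA': ['D', 'dr']}
```

Reporting the Unicode names rather than the glyphs makes it easier to compare the table against the XA / XXA / XHA / XXHA pattern described above.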


f1.write(' '.join(list1))

list1, at this point, contains Unicode strings. You can't write Unicode directly to a file, it's a byte interface. You should either encode it explicitly (' '.join(list1).encode('utf-8')), or, as Ignacio suggests, use a codecs wrapper to implicitly encode Unicode strings you send to it. At the moment you are defining a variable CODEC, but not doing anything with it.
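The two options might look roughly like this. It is a sketch, not a drop-in fix: the sample text and file names are made up, and I've used io.open instead of a codecs wrapper because it behaves the same way on Python 2.6+ and Python 3.

# -*- coding: utf-8 -*-
```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

text = u' '.join([u'एक', u'पक्षी'])  # stand-in for ' '.join(list1)

out_dir = tempfile.mkdtemp()
explicit_path = os.path.join(out_dir, 'explicit.txt')  # hypothetical file names
wrapped_path = os.path.join(out_dir, 'wrapped.txt')

# Option 1: open in binary mode and encode by hand.
with open(explicit_path, 'wb') as out:
    out.write(text.encode('utf-8'))

# Option 2: let the file object encode on every write.
with io.open(wrapped_path, 'w', encoding='utf-8') as out:
    out.write(text)

# Both files now hold the same UTF-8 bytes.
with open(explicit_path, 'rb') as a, open(wrapped_path, 'rb') as b:
    same_bytes = a.read() == b.read()
```

Option 2 is usually less error-prone in a loop, because there is no way to forget the .encode() call on one of the writes.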


Are you sure you want to remove all the hyphens (-)? Looking at your input file, it looks like all replacements are two- or three-character codes, such as u'I-':u'इ'. If this is so, you could do something like the code below, but make sure you're using Unicode strings for all the keys and values in the dictionary:

import codecs

# read the whole file at once
f = codecs.open(input_file,'r','ascii')
data = f.read()
f.close()

# perform all the replacements
for k,v in english_hindi_dict.items():
    data = data.replace(k,v)

# write the whole file result
f = codecs.open(output_file,'w',CODEC)
f.write(data)
f.close()

Following that theory, I got the result below, which suggests that mappings for sequences such as 'z*', 't-', 'ng', and 'ei' are missing from the dictionary. I don't read Hindi, but Google Translate came up with some of the English words in your translation, so I think I'm on the right track.

-z*धिमैन पक्षी

एक घने जngगल मेng एक बहुt- ऊँचै पेड तै
उस की पt-z*t-ोng से लदी शैखैयेng मज*zबूt- बैजुओng की t-रह फeiली हुई तीng
वन हँसोng कै एक झुnhz*ड इस पेड पर निवैस करt-ै तै
वे सब यहैँ सुरक्षिt- ते ौर बडे आरैम से रहt-े ते
उन मेng से एक पक्षी बहुt- बुदz*धिमैन तै
इस बुदz*धिमैन पक्षी ने एक दिन पेड की जड मेng से एक लt-ै को उगt-े देखै 
इस के बैरे मेng उसने दूसरे पक्षियोng से बैt- की
"कz*यै t-ुमz*हेng वह लt-ै दिखैई देt-ी हei", उस ने उन से पूछै "t-ुमz*हेng इसे नShz*ट कर देनै चैहिए"
"इसे कz*योng नShz*ट कर देनै चैहिए?" हँसोng ने आशz*च*rय से पूछै "यह t-ो इt-नी छोटी से हei
हमेng यह कz*यै हैनि पहुँचै सकt-ी हei"
"मेरे मित्रोng," बुदz*धिमैन पक्षी ने उt-z*t-र दियै "वह छोटी सी लt-ै जलz*दी ही बडी हो जैयेगी
यह हमैरे पेड पर चढ*z कर उस से लिपटt-ी जैयेगी ौर फिर मोटी ौर मज*zबूt- हो जैयेगी"
"t-ो कz*यै हुआ"