Python string conversion (localization) question
source = '\xe3\xc7\x9f'
destination = u'\u0645\u0627\u06ba'
How do I get from the source, to the destination?
(The source and the destination are both the same 3 characters, in the same order, just represented differently.)
Technically, the source is Urdu text stored as encoded bytes, and the destination is the Unicode code points for the same 3 characters. See: https://www.codeaurora.org/git/projects/froyo-gb-dsds-7227/repository/revisions/39141d7a9dbdd2e9acf006430a7e7557ffd1efce/entry/external/icu4c/data/mappings/ibm-5352_P100-1998.ucm
If I do:
source.decode('cp1006')
I get:
u'\ufed9\ufb84\x9f'
Which is not what I'm looking for...
If I do:
source.decode('raw_unicode_escape')
I get:
u'\xe3\xc7\x9f'
Which is also not what I'm looking for...
How do I get from point A (source) to point B (destination) in Python?
In [129]: source = '\xe3\xc7\x9f'
In [130]: source.decode('cp1256')
Out[130]: u'\u0645\u0627\u06ba'
In [131]: destination
Out[131]: u'\u0645\u0627\u06ba'
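The session above is Python 2, where a plain '...' literal is a byte string. As a minimal sketch, assuming Python 3 instead (where byte strings need the b prefix and text literals are already Unicode), the same conversion and its round trip might look like this:

```python
# Python 3 sketch (assumption: the original session is Python 2).
source = b'\xe3\xc7\x9f'               # raw bytes in the legacy encoding
destination = '\u0645\u0627\u06ba'     # the same 3 characters as Unicode text

# Decode the bytes with the Windows Arabic code page.
decoded = source.decode('cp1256')
print(decoded == destination)

# Encoding the text back with the same codec should reproduce the bytes.
encoded = destination.encode('cp1256')
print(encoded == source)
```

Both comparisons should hold if cp1256 is indeed the right codec for this data.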
PS. The question "What codec transforms this str object into that unicode object?" comes up from time to time on SO. Here's a little script which can help answer these questions quickly (it simply tries to decode the str object with every possible encoding):
guess_encoding.py:
import binascii
import zlib
import codecs
import pkgutil
import os
import encodings

def all_encodings():
    modnames=set([modname for importer, modname, ispkg in pkgutil.walk_packages(
        path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases=set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

def main():
    encodings=all_encodings()
    while 1:
        text=raw_input()
        text=codecs.escape_decode(text)[0]
        # print('Attempting to decode {0!r}'.format(text))
        for enc in encodings:
            try:
                msg=text.decode(enc)
            except (IOError,UnicodeDecodeError,LookupError,
                    TypeError,ValueError,binascii.Error,zlib.error) as err:
                pass
                # print('{e} failed: {err}'.format(e=enc,err=err))
            else:
                if msg:
                    print('Decoding with {enc}:'.format(enc=enc))
                    print(msg)

if __name__=='__main__':
    main()
After running guess_encoding.py you type in the repr of the str object:
% guess_encoding.py
\xe3\xc7\x9f
It prints the resulting unicode object for every Python encoding that decodes the input successfully.
Since you told us the desired unicode object was
In [128]: print(destination)
ماں
you can quickly search the output for ماں and find the successful codec:
Decoding with cp1256:
ماں
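When the desired unicode object is already known, the search through the output can be automated. The following is a rough sketch of that idea, assuming Python 3 rather than the Python 2 used by guess_encoding.py. The function name find_codecs is hypothetical, and the broad except clause is a simplification that skips any codec that is not a text encoding, is unavailable on the platform, or rejects the bytes.

```python
import encodings
import encodings.aliases
import pkgutil


def all_encodings():
    """Collect candidate codec names from the stdlib encodings package."""
    modnames = {name for _, name, _ in pkgutil.iter_modules(encodings.__path__)}
    aliases = set(encodings.aliases.aliases.values())
    return modnames | aliases


def find_codecs(data, expected):
    """Return the codec names that decode `data` to exactly `expected`.

    `find_codecs` is a hypothetical helper, not part of the original script.
    """
    matches = []
    for enc in sorted(all_encodings()):
        try:
            text = data.decode(enc)
        except Exception:
            # Not a text codec, not available here, or invalid bytes for it.
            continue
        if text == expected:
            matches.append(enc)
    return matches


if __name__ == '__main__':
    print(find_codecs(b'\xe3\xc7\x9f', '\u0645\u0627\u06ba'))
```

The returned list should include cp1256. Other codecs may also appear if they happen to map these three bytes to the same characters.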