UTF-16 to ASCII, ignoring characters with decimal value greater than 127
I know there are quite a few solutions to this problem, but mine is peculiar in the sense that I might get truncated UTF-16 data and still have to make a best effort at conversion where decode and encode would otherwise fail with UnicodeDecodeError. So I came up with the following code in Python. Please let me know your comments on how I can improve it for faster processing.
def to_ascii(filename):
    try:
        # conversion to ASCII if the UTF-16 data is formatted correctly
        text = open(filename).read().decode('UTF-16')
        asciiStr = text.encode('ASCII', 'ignore')
        return asciiStr
    except UnicodeDecodeError:
        # if that fails with UnicodeDecodeError, use brute force
        # to decode the truncated data
        try:
            raw = open(filename).read()
            if ord(raw[0]) == 255 and ord(raw[1]) == 254:
                print("Little-Endian format, UTF-16")
                # keep the low-order byte of each little-endian code unit
                leAscii = "".join([raw[i] for i in range(2, len(raw), 2) if 0 < ord(raw[i]) < 127])
                return leAscii
            elif ord(raw[0]) == 254 and ord(raw[1]) == 255:
                print("Big-Endian format, UTF-16")
                # keep the low-order byte of each big-endian code unit
                beAscii = "".join([raw[i] for i in range(3, len(raw), 2) if 0 < ord(raw[i]) < 127])
                return beAscii
            else:
                return None
        except Exception:
            print("Error in converting to ASCII")
            return None
What about:
data = open(filename).read()
try:
    data = data.decode("utf-16")
except UnicodeDecodeError:
    data = data[:-1].decode("utf-16")
I.e. if it's truncated mid-way through a code unit, snip the last byte off, and do it again. That should get you back to a valid UTF-16 string, without having to try to implement a decoder yourself.
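If you also need the ASCII output the question asks for, the snippet wraps up into something like this (a sketch; the function name, the 'rb' mode, and the final encode step are my additions):

def utf16_to_ascii(filename):
    data = open(filename, 'rb').read()
    try:
        text = data.decode("utf-16")
    except UnicodeDecodeError:
        # truncated mid code unit: snip the last byte and retry
        text = data[:-1].decode("utf-16")
    # drop everything outside ASCII, as in the question
    return text.encode("ascii", "ignore")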
To tolerate errors you could use the optional second argument to the byte-string's decode method. In this example the dangling third byte ('c') is replaced with the "replacement character" U+FFFD:
>>> 'abc'.decode('UTF-16', 'replace')
u'\u6261\ufffd'
There is also an 'ignore' option which will simply drop bytes that can't be decoded:
>>> 'abc'.decode('UTF-16', 'ignore')
u'\u6261'
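Combined with an ASCII encode (again with 'ignore'), that handles both the truncation and the character filtering from the question in one pass. A sketch, using a made-up truncated little-endian input:

>>> raw = '\xff\xfeh\x00i\x00!'  # BOM + "hi" + a dangling byte
>>> raw.decode('UTF-16', 'ignore').encode('ASCII', 'ignore')
'hi'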
While it is common to desire a system that is "tolerant" of incorrectly encoded text, it is often quite difficult to define precisely what the expected behavior should be in these situations. You may find that whoever provided the requirement to "deal with" incorrectly encoded text does not fully grasp the concept of character encoding.
This just jumped out at me as a "best practice" improvement: file accesses should really be wrapped in with blocks, which handle opening and cleaning up for you.
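For example, the first branch of the question's code could look like this (a sketch; the error handling is trimmed to the happy path plus one fallback):

def to_ascii(filename):
    with open(filename) as f:  # closed automatically, even on error
        data = f.read()
    try:
        return data.decode('UTF-16').encode('ASCII', 'ignore')
    except UnicodeDecodeError:
        return None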