
UTF-16 to ASCII, ignoring characters with decimal value greater than 127

I know there are quite a few solutions to this problem, but mine is peculiar in that I might receive truncated UTF-16 data and still have to make a best effort at conversion, in cases where decode and encode would otherwise fail with UnicodeDecodeError. So I came up with the following code in Python. Please let me know your comments on how I can improve it for faster processing.

    def convertToAscii(filename):
        try:
            # straightforward conversion to ASCII if the UTF-16 data
            # is formatted correctly
            input = open(filename).read().decode('UTF16')
            asciiStr = input.encode('ASCII', 'ignore')
            open(filename).close()
            return asciiStr
        except:
            # if that fails with UnicodeDecodeError, fall back to brute
            # force to decode the truncated data
            try:
                unicode = open(filename).read()
                if (ord(unicode[0]) == 255 and ord(unicode[1]) == 254):
                    print("Little-Endian format, UTF-16")
                    leAscii = "".join([(unicode[i]) for i in range(2, len(unicode), 2) if 0 < ord(unicode[i]) < 127])
                    open(filename).close()
                    return leAscii
                elif (ord(unicode[0]) == 254 and ord(unicode[1]) == 255):
                    print("Big-Endian format, UTF-16")
                    beAscii = "".join([(unicode[i]) for i in range(3, len(unicode), 2) if 0 < ord(unicode[i]) < 127])
                    open(filename).close()
                    return beAscii
                else:
                    open(filename).close()
                    return None
            except:
                open(filename).close()
                print("Error in converting to ASCII")
                return None


What about:

    data = open(filename).read()
    try:
        data = data.decode("utf-16")
    except UnicodeDecodeError:
        data = data[:-1].decode("utf-16")

I.e. if it's truncated mid-way through a code unit, snip the last byte off, and do it again. That should get you back to a valid UTF-16 string, without having to try to implement a decoder yourself.
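
If the data might be missing more than one trailing byte, the same idea generalizes to a short loop that keeps snipping until the decode succeeds. A minimal sketch, not from the answer above; the name decode_truncated_utf16 and the retry cap are my own choices:

    def decode_truncated_utf16(raw):
        # Snip one trailing byte per failed attempt. Simple truncation
        # needs at most one snip (a UTF-16 code unit is 2 bytes), but a
        # small cap keeps badly corrupted input from stripping the
        # whole buffer.
        for trim in range(4):
            try:
                return raw[:len(raw) - trim].decode("utf-16")
            except UnicodeDecodeError:
                pass
        return None

Reading the file in binary mode ('rb') before passing the bytes in avoids any newline translation interfering with the byte stream.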


To tolerate errors you could use the optional second argument to the byte-string's decode method. In this example the dangling third byte ('c') is replaced with the "replacement character" U+FFFD:

    >>> 'abc'.decode('UTF-16', 'replace')
    u'\u6261\ufffd'

There is also an 'ignore' option which will simply drop bytes that can't be decoded:

    >>> 'abc'.decode('UTF-16', 'ignore')
    u'\u6261'

While it is common to desire a system that is "tolerant" of incorrectly encoded text, it is often quite difficult to define precisely what the expected behavior is in these situations. You may find that the one who provided the requirement to "deal with" incorrectly encoded text does not fully grasp the concept of character encoding.
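
For the task in the question, both steps can be combined in a single pass. A minimal sketch, assuming the whole file fits in memory and that silently dropping undecodable bytes is acceptable:

    # read raw bytes, drop anything that will not decode as UTF-16,
    # then drop anything outside the ASCII range on the way back out
    raw = open(filename, 'rb').read()
    asciiStr = raw.decode('utf-16', 'ignore').encode('ascii', 'ignore')

The 'ignore' handler on the decode absorbs a truncated trailing byte, and the one on the encode drops every character above 127, which is essentially the filtering the hand-rolled loops in the question perform.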


This just jumped out at me as a "best practice" improvement. File accesses should really be wrapped in with blocks, which take care of closing the file for you.
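
A minimal sketch of the pattern, replacing the question's paired open()/close() calls:

    # the file is guaranteed to be closed when the block exits,
    # even if read() or a later decode raises
    with open(filename, 'rb') as f:
        data = f.read()

Note that the question's open(filename).close() lines actually open a fresh handle and close that one immediately, so the handle the data was read from is never explicitly closed; the with form avoids that problem entirely.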
