
UTF-16 to ASCII, ignoring characters with decimal value greater than 127

I know there are quite a few solutions to this problem, but mine is peculiar in that I might receive truncated UTF-16 data and still have to make a best effort at conversion, in cases where decode and encode would otherwise fail with UnicodeDecodeError. So I came up with the following code in Python. Please let me know your comments on how I can improve it for faster processing.

    def convertToAscii(filename):
        try:
            # straightforward conversion to ASCII if the UTF-16 data
            # is formatted correctly
            input = open(filename).read().decode('UTF16')
            asciiStr = input.encode('ASCII', 'ignore')
            open(filename).close()
            return asciiStr
        except:
            # if that fails with UnicodeDecodeError, fall back to brute
            # force to decode the truncated data
            try:
                unicode = open(filename).read()
                if (ord(unicode[0]) == 255 and ord(unicode[1]) == 254):
                    print("Little-Endian format, UTF-16")
                    leAscii = "".join([(unicode[i]) for i in range(2, len(unicode), 2) if 0 < ord(unicode[i]) < 127])
                    open(filename).close()
                    return leAscii
                elif (ord(unicode[0]) == 254 and ord(unicode[1]) == 255):
                    print("Big-Endian format, UTF-16")
                    beAscii = "".join([(unicode[i]) for i in range(3, len(unicode), 2) if 0 < ord(unicode[i]) < 127])
                    open(filename).close()
                    return beAscii
                else:
                    open(filename).close()
                    return None
            except:
                open(filename).close()
                print("Error in converting to ASCII")
                return None


What about:

    data = open(filename).read()
    try:
        data = data.decode("utf-16")
    except UnicodeDecodeError:
        data = data[:-1].decode("utf-16")

I.e. if it's truncated mid-way through a code unit, snip the last byte off, and do it again. That should get you back to a valid UTF-16 string, without having to try to implement a decoder yourself.
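
If the data might be missing more than one trailing byte, the same idea generalizes to a short loop that keeps snipping until the decode succeeds. A minimal sketch, not from the answer above; the name decode_truncated_utf16 and the retry cap are my own choices:

    def decode_truncated_utf16(raw):
        # Snip one trailing byte per failed attempt. Simple truncation
        # needs at most one snip (a UTF-16 code unit is 2 bytes), but a
        # small cap keeps badly corrupted input from stripping the
        # whole buffer.
        for trim in range(4):
            try:
                return raw[:len(raw) - trim].decode("utf-16")
            except UnicodeDecodeError:
                pass
        return None

Reading the file in binary mode ('rb') before passing the bytes in avoids any newline translation interfering with the byte stream.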


To tolerate errors you could use the optional second argument to the byte-string's decode method. In this example the dangling third byte ('c') is replaced with the "replacement character" U+FFFD:

    >>> 'abc'.decode('UTF-16', 'replace')
    u'\u6261\ufffd'

There is also an 'ignore' option which will simply drop bytes that can't be decoded:

    >>> 'abc'.decode('UTF-16', 'ignore')
    u'\u6261'

While it is common to desire a system that is "tolerant" of incorrectly encoded text, it is often quite difficult to define precisely what the expected behavior is in these situations. You may find that the one who provided the requirement to "deal with" incorrectly encoded text does not fully grasp the concept of character encoding.
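
For the task in the question, both steps can be combined in a single pass. A minimal sketch, assuming the whole file fits in memory and that silently dropping undecodable bytes is acceptable:

    # read raw bytes, drop anything that will not decode as UTF-16,
    # then drop anything outside the ASCII range on the way back out
    raw = open(filename, 'rb').read()
    asciiStr = raw.decode('utf-16', 'ignore').encode('ascii', 'ignore')

The 'ignore' handler on the decode absorbs a truncated trailing byte, and the one on the encode drops every character above 127, which is essentially the filtering the hand-rolled loops in the question perform.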


This just jumped out at me as a "best practice" improvement. File accesses should really be wrapped in with blocks, which take care of closing the file for you.
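
A minimal sketch of the pattern, replacing the question's paired open()/close() calls:

    # the file is guaranteed to be closed when the block exits,
    # even if read() or a later decode raises
    with open(filename, 'rb') as f:
        data = f.read()

Note that the question's open(filename).close() lines actually open a fresh handle and close that one immediately, so the handle the data was read from is never explicitly closed; the with form avoids that problem entirely.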
