Python kludge to read UCS-2 (UTF-16?) as ASCII

2023-02-15 12:45 问答作者：

I'm in a little over my head on this one, so please pardon my terminology in advance.

I'm running this using Python 2.7 on Windows XP.

I found some Python code that reads a log file, does some stuff, then displays something.

What, that's not enough detail? Ok, here's a simplified version:

#!/usr/bin/python

import re
import sys

class NotSupportedTOCError(Exception):
    pass

def filter_toc_entries(lines):
    while True:
        line = lines.next()
        if re.match(r""" \s* 
                   .+\s+ \| (?#track)
                \s+.+\s+ \| (?#start)
                \s+.+\s+ \| (?#length)
                \s+.+\s+ \| (?#start sec)
                \s+.+\s*$   (?#end sec)
                """, line, re.X):
            lines.next()
            break

    while True:
        line = lines.next()
        m = re.match(r"""
            ^\s*
            (?P<num>\d+)
            \s*\|\s*
            (?P<start_time>[0-9:.]+)
            \s*\|\s*
            (?P<length_time>[0-9:.]+)
            \s*\|\s*
            (?P<start_sector>\d+)
            \s*\|\s*
            (?P<end_sector>\d+)
            \s*$
            """, line, re.X)
        if not m:
            break
        yield m.groupdict()

def calculate_mb_toc_numbers(eac_entries):
    eac = list(eac_entries)
    num_tracks = len(eac)

    tracknums = [int(e['num']) for e in eac]
    开发者_Go百科if range(1,num_tracks+1) != tracknums:
        raise NotSupportedTOCError("Non-standard track number sequence: %s", tracknums)

    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
    offsets = [(int(x['start_sector']) + 150) for x in eac]
    return [1, num_tracks, leadout_offset] + offsets

f = open(sys.argv[1])

mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))

print mb_toc_urlpart

The code works fine as long as the log file is "simple" text (I'm tempted to say ASCII although that may not be precise/accurate - for e.g. Notepad++ indicates it's ANSI).

However, the script doesn't work on certain log files (in these cases, Notepad++ says "UCS-2 Little Endian").

I get the following error:

Traceback (most recent call last):
  File "simple.py", line 55, in <module>
    mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_
toc_entries(f)))
  File "simple.py", line 49, in calculate_mb_toc_numbers
    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
IndexError: list index out of range

This log works

This log breaks

I believe it's the encoding that's breaking the script because if I simply do this at a command prompt:

type ascii.log > scrubbed.log

and then run the script on scrubbed.log, the script works fine (this is actually fine for my purposes since there's no loss of important information and I'm not writing back to a file, just printing to the console).

One workaround would be to "scrub" the log file before passing it to Python (e.g. using the type pipe trick above to a temporary file and then have the script run on that), but I would like to have Python "ignore" the encoding if it's possible. I'm also not sure how to detect what type of log file the script is reading so I can act appropriately.

I'm reading this and this but my eyes are still spinning around in their head, so while that may be my longer term strategy, I'm wondering if there's an interim hack I could use.

codecs.open() will allow you to open a file using a specific encoding, and it will produce unicodes. You can try a few, going from most likely to least likely (or the tool could just always produce UTF-16LE but ha ha fat chance).

Also, "Unicode In Python, Completely Demystified".

works.log appears to be encoded in ASCII:

>>> data = open('works.log', 'rb').read()
>>> all(d < '\x80' for d in data)
True

breaks.log appears to be encoded in UTF-16LE -- it starts with the 2 bytes '\xff\xfe'. None of the characters in breaks.log are outside the ASCII range:

>>> data = open('breaks.log', 'rb').read()
>>> data[:2]
'\xff\xfe'
>>> udata = data.decode('utf16')
>>> all(d < u'\x80' for d in udata)
True

If these are the only two possibilities, you should be able to get away with the following hack. Change your mainline code from:

f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart

to this:

f = open(sys.argv[1], 'rb')
data = f.read()
f.close()
if data[:2] == '\xff\xfe':
    data = data.decode('utf16').encode('ascii')
# ilines is a generator which produces newline-terminated strings
ilines = (line + '\n' for line in data.splitlines())
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(ilines))        )
print mb_toc_urlpart

Python 2.x expects normal strings to be ASCII (or at least one byte). Try this:

Put this at the top of your Python source file:

from __future__ import unicode_literals

And change all the str to unicode.

[edit]

And as Ignacio Vazquez-Abrams wrote, try codecs.open() to open the input file.

继续阅读：encoding python

Python kludge to read UCS-2 (UTF-16?) as ASCII

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？