
python3: readlines() indices issue?

Python 3.1.2 (r312:79147, Nov  9 2010, 09:41:54)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> open("/home/madsc13ntist/test_file.txt", "r").readlines()[6]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 2230: unexpected code byte

and yet...

Python 2.4.3 (#1, Sep  8 2010, 11:37:47)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> open("/home/madsc13ntist/test_file.txt", "r").readlines()[6]
'2010-06-14 21:14:43 613 xxx.xxx.xxx.xxx 200 TCP_NC_MISS 4198 635 GET http www.thelegendssportscomplex.com 80 /thumbnails/t/sponsors/145x138/007.gif - - - DIRECT www.thelegendssportscomplex.com image/gif http://www.thelegendssportscomplex.com/ "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; InfoPath.1; MS-RTC LM 8)" OBSERVED "Sports/Recreation" - xxx.xxx.xxx.xxx xxx.xxx.xxx.xxx\r\n'

Does anyone have any idea why .readlines()[6] fails in Python 3 but works in Python 2.4?

Also, I thought 0xAE was ®.


From the Python wiki:

The UnicodeDecodeError normally happens when decoding an str string from a certain coding. Since codings map only a limited number of str strings to unicode characters, an illegal sequence of str characters will cause the coding-specific decode() to fail

It appears that the file is in a different encoding than the one Python is using to decode it (UTF-8, according to the traceback).
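To make the point about 0xAE concrete, here is a minimal sketch. The sample bytes are made up for illustration and are not taken from the asker's file. It shows that 0xAE is indeed ® in ISO-8859-1, while a lone 0xAE is not valid UTF-8:

```python
# Hypothetical sample bytes containing the byte 0xAE (not the asker's data).
raw = b"Acme\xae widgets"

# ISO-8859-1 (Latin-1) maps every byte to a character; 0xAE is the registered sign.
as_latin1 = raw.decode("iso8859-1")

# In UTF-8, 0xAE on its own is a continuation byte without a lead byte,
# so strict decoding raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# The same character encoded as UTF-8 takes two bytes.
utf8_bytes = "\u00ae".encode("utf-8")

print(ascii(as_latin1), utf8_failed, utf8_bytes)
```

So the asker is right that 0xAE is ® in a single-byte encoding such as ISO-8859-1; the file just is not UTF-8.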


From the documentation of the open() function:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

Always specify the encoding when reading a text file:

open("/home/madsc13ntist/test_file.txt", "r", encoding='iso8859-1').readlines()[6]

To ignore decoding errors instead, pass errors='ignore'. The default value of errors is None, which behaves the same as 'strict'.
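A small sketch of how these options behave, assuming a throwaway temporary file that stands in for the log file (the file contents are invented for the example):

```python
import os
import tempfile

# Hypothetical stand-in for the log file: the second line contains the byte 0xAE.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"first line\r\nBrand\xae name\r\n")

# Explicit single-byte encoding: every byte decodes, 0xAE becomes U+00AE.
with open(path, "r", encoding="iso8859-1") as f:
    latin1_lines = f.readlines()

# Wrong encoding with errors='ignore': the undecodable byte is silently dropped.
with open(path, "r", encoding="utf-8", errors="ignore") as f:
    ignored_lines = f.readlines()

# errors='replace' leaves a visible marker (U+FFFD) where decoding failed.
with open(path, "r", encoding="utf-8", errors="replace") as f:
    replaced_lines = f.readlines()

os.remove(path)
```

Note that errors='ignore' loses data without any warning, so naming the correct encoding is usually the safer choice. Also, in text mode with the default newline=None, Python 3 translates '\r\n' to '\n', so the lines will not end in '\r\n' as they do in the Python 2 output.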


Since about two years have passed since the question was asked, you probably already know the reason. Basically, Python 3 strings are Unicode strings. To turn the bytes of a file into these abstract characters, you need to tell Python which encoding the file uses.

Python 2 strings are actually byte sequences, so Python 2 is happy to read whatever bytes the file contains. Some of the characters are interpreted (newlines, tabs, ...), but the rest are left untouched.

Python 3 open() is similar to Python 2 codecs.open().
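If the Python 2 behaviour is what you want, one option in Python 3 is to open the file in binary mode and decode only when needed. The following is a sketch under that assumption, using an invented temporary file rather than the original log:

```python
import os
import tempfile

# Hypothetical file contents; the second line contains the byte 0xAE.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"plain line\r\nBrand\xae name\r\n")

# Binary mode: lines come back as bytes objects and nothing is decoded,
# so no UnicodeDecodeError can occur here.
with open(path, "rb") as f:
    raw_lines = f.readlines()

# Decode explicitly at the point where text is actually needed.
text = raw_lines[1].decode("iso8859-1")

os.remove(path)
```

In binary mode the line endings are left as they are in the file, which roughly matches what the Python 2.4 session in the question shows.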

... the time has come ... to close the question by accepting one of the answers.
