Python os.walk and Japanese filename crash [duplicate]

This question already has answers elsewhere (see the possible duplicate below). Closed 12 years ago.

Possible Duplicate:

Python, Unicode, and the Windows console

I have a folder containing a file named "01 - ナナナン塊.txt".

I open Python at the interactive prompt in the same folder as the file and attempt to walk the folder hierarchy:

Python 3.1.2 (r312:79149, Mar 21 2010, 00:41:52) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> for x in os.walk('.'):
...     print(x)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\dev\Python31\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 17-21: character maps to <undefined>

Clearly the encoding I'm using can't deal with Japanese characters. Fine. But Python 3.1 is meant to be Unicode all the way down, as I understand it, so I'm at a loss as to what I'm meant to do about this. Does anyone have any ideas?


It seems like all answers so far are from Unix people who assume the Windows console is like a Unix terminal, which it is not.

The problem is that you can't write Unicode output to the Windows console using the normal underlying file I/O functions. The Windows API WriteConsole needs to be used. Python should probably be doing this transparently, but it isn't.
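
The failure in the question can be reproduced without a console, which may help when debugging. This is only a sketch: it encodes the file name with the cp850 codec directly, the same codec that appears in the traceback, and then shows a lossy workaround (the "backslashreplace" error handler) that is not part of the fix described here. The variable names are made up for illustration.

```python
name = "01 - ナナナン塊.txt"

try:
    # This is the step that print() performs for a cp850 console stream.
    name.encode("cp850")
except UnicodeEncodeError as exc:
    # exc.start and exc.end delimit the run of characters cp850 cannot represent.
    failed = (exc.start, exc.end)
    print("cannot encode characters at positions", exc.start, "to", exc.end - 1)

# Lossy workaround: replace unencodable characters with \uXXXX escapes.
escaped = name.encode("cp850", errors="backslashreplace").decode("cp850")
print(escaped)
```

The escaped form is readable on any console, but it does not display the actual Japanese text.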

There's a different problem if you redirect the output to a file: Windows text files are historically in the ANSI codepage, not Unicode. You can fairly safely write UTF-8 to text files in Windows these days, but Python doesn't do that by default.
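
When the script opens the output file itself rather than relying on shell redirection, the encoding can simply be passed to open(). A minimal sketch, assuming the names have already been collected (for example from os.walk); write_names_utf8 and listing.txt are hypothetical names chosen for this example:

```python
import os
import tempfile

def write_names_utf8(names, out_path):
    # encoding="utf-8" overrides the platform default (the ANSI code page on Windows).
    with open(out_path, "w", encoding="utf-8") as out:
        for name in names:
            out.write(name + "\n")

names = ["01 - ナナナン塊.txt"]
out_path = os.path.join(tempfile.mkdtemp(), "listing.txt")
write_names_utf8(names, out_path)

# Read the file back as raw bytes to confirm that it really contains UTF-8.
with open(out_path, "rb") as f:
    raw = f.read()
print(raw.decode("utf-8"))
```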

I think Python should do these things itself, but until it does, here's some code to make it happen. You don't have to worry about the details if you don't want to; just call ConsoleFile.wrap_standard_handles(). You do need PyWin (the pywin32 package) installed to get access to the necessary APIs.

import os, sys, io, win32api, win32console, pywintypes

def change_file_encoding(f, encoding):
    """
    TextIOWrapper is missing a way to change the file encoding, so we have to
    do it by creating a new one.
    """

    errors = f.errors
    line_buffering = f.line_buffering
    # f.newlines is not the same as the newline parameter to TextIOWrapper.
    # newlines = f.newlines

    buf = f.detach()

    # TextIOWrapper defaults newline to \r\n on Windows, even though the underlying
    # file object is already doing that for us.  We need to explicitly say "\n" to
    # make sure we don't output \r\r\n; this is the same as the internal function
    # create_stdio.
    return io.TextIOWrapper(buf, encoding, errors, "\n", line_buffering)


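class ConsoleFile:
    """File-like object that writes text to a Windows console via WriteConsole."""

    class FileNotConsole(Exception): pass

    def __init__(self, handle):
        handle = win32api.GetStdHandle(handle)
        self.screen = win32console.PyConsoleScreenBufferType(handle)
        try:
            # GetConsoleMode fails when the handle is not a console
            # (for example, when output is redirected to a file).
            self.screen.GetConsoleMode()
        except pywintypes.error:
            raise ConsoleFile.FileNotConsole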

    def write(self, s):
        self.screen.WriteConsole(s)

    def close(self): pass
    def flush(self): pass
    def isatty(self): return True

    @staticmethod
    def wrap_standard_handles():
        sys.stdout.flush()
        try:
            # There seems to be no binding for _get_osfhandle.
            sys.stdout = ConsoleFile(win32api.STD_OUTPUT_HANDLE)
        except ConsoleFile.FileNotConsole:
            sys.stdout = change_file_encoding(sys.stdout, "utf-8")

        sys.stderr.flush()
        try:
            sys.stderr = ConsoleFile(win32api.STD_ERROR_HANDLE)
        except ConsoleFile.FileNotConsole:
            sys.stderr = change_file_encoding(sys.stderr, "utf-8")

ConsoleFile.wrap_standard_handles()

print("English 漢字 Кири́ллица")

This is a little tricky: if stdout or stderr is the console, we need to output with WriteConsole; but if it's not (e.g. foo.py > file), that's not going to work, and we need to change the file's encoding to UTF-8 instead.

The opposite in either case will not work. You can't output to a regular file with WriteConsole (it's not actually a byte API, but a UTF-16 one; PyWin hides this detail), and you can't write UTF-8 to a Windows console.
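
The redirected-file half of that decision can be sketched with the standard library alone on newer interpreters. This assumes Python 3.7 or later, where TextIOWrapper has a reconfigure() method; it does not exist in the 3.1 interpreter from the question, and it is a different technique from change_file_encoding above. The helper name is hypothetical, and an in-memory buffer stands in for a redirected stdout.

```python
import io

def use_utf8_when_redirected(stream):
    """Switch a text stream to UTF-8 unless it is attached to a console."""
    if stream.isatty():
        # A console needs the WriteConsole route instead; leave it alone.
        return False
    stream.reconfigure(encoding="utf-8")
    return True

# Stand-in for stdout redirected to a file that was opened with cp850.
fake_file = io.TextIOWrapper(io.BytesIO(), encoding="cp850", newline="\n")
changed = use_utf8_when_redirected(fake_file)

fake_file.write("English 漢字\n")
fake_file.flush()
raw = fake_file.buffer.getvalue()
```

In a real script the call would be use_utf8_when_redirected(sys.stdout), and likewise for sys.stderr.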

Also, it really should be using _get_osfhandle to get the handle to stdout and stderr, rather than assuming they're assigned to the standard handles, but that API doesn't seem to have any PyWin binding.
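
One possible route is the standard library's msvcrt module, which is available only on Windows and exposes get_osfhandle(). The sketch below is untested with the ConsoleFile class above; whether the returned integer handle can be passed to the PyWin console wrapper is an assumption that would need checking. os_handle_for is a hypothetical helper name.

```python
import os
import sys

def os_handle_for(stream):
    """Return the Windows OS handle behind a Python stream, or None elsewhere."""
    if os.name != "nt":
        return None
    import msvcrt  # Windows-only module, so import it lazily
    return msvcrt.get_osfhandle(stream.fileno())

# None on non-Windows platforms; an integer handle on Windows when
# sys.stdout is backed by a real file descriptor.
handle = os_handle_for(sys.stdout) if hasattr(sys.stdout, "fileno") else None
```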


For hard-coded strings, you'll need to specify the encoding at the top of your source files. For byte strings that come from another source, such as os.walk, you need to specify the byte string's encoding when decoding them (see unutbu's answer).
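
Both cases can be sketched as follows. The example assumes the bytes are UTF-8, which will not be true of every source; raw_name is a stand-in for a name returned by something like os.walk(b"."), not real output.

```python
# -*- coding: utf-8 -*-
# The declaration above tells Python how this source file itself is encoded.

# Stand-in for a byte-string file name obtained from another source.
raw_name = "01 - ナナナン塊.txt".encode("utf-8")

# Decoding with the correct encoding recovers the text.
text_name = raw_name.decode("utf-8")

# Decoding with the wrong encoding fails (or, with some codecs, silently garbles).
try:
    raw_name.decode("ascii")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True
```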
