
Confirm that Python 2.6 ftplib does not support Unicode file names? Alternatives?

Can someone confirm that Python 2.6 ftplib does NOT support Unicode file names? Or must Unicode file names be specially encoded in order to be used with the ftplib module?

The following email exchange seems to support my conclusion that the ftplib module only supports ASCII file names.

Should ftplib use UTF-8 instead of latin-1 encoding? http://mail.python.org/pipermail/python-dev/2009-January/085408.html

Any recommendations on a 3rd party Python FTP module that supports Unicode file names? I've googled this question without success [1], [2].

The official Python documentation does not mention Unicode file names [3].

Thank you, Malcolm

[1] ftputil wraps ftplib and inherits ftplib's apparent ASCII only support?

[2] Paramiko's SFTP library does support Unicode file names, however I'm looking specifically for ftp (vs. sftp) support relative to our current project.

[3] http://docs.python.org/library/ftplib.html

WORKAROUND:

The encodings.idna.ToASCII and .ToUnicode methods can be used to convert Unicode path names to an ASCII format. If you wrap all your remote path names and the output of the dir/nlst methods with these functions, then you can create a way to preserve Unicode path names using the standard ftplib (and also preserve Unicode file names on file systems that don't support Unicode paths). The downside to this technique is that other processes on the server will also have to use encodings.idna when referencing the files that you upload to the server. BTW: I understand that this is an abuse of the encodings.idna library.
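A minimal sketch of this workaround, written so that it also runs on Python 3. The helper names encode_name and decode_name are hypothetical; only encodings.idna.ToASCII and ToUnicode come from the standard library. Two caveats follow from how IDNA works: non-ASCII names are lowercased during conversion, and each converted name is limited to 63 bytes.

```python
from encodings import idna


def encode_name(name):
    """Return an ASCII-only form of one Unicode file name (no path separators)."""
    # ToASCII returns a byte string; decode it so the result can be
    # concatenated into an FTP command string.
    return idna.ToASCII(name).decode("ascii")


def decode_name(ascii_name):
    """Reverse encode_name() for a name taken from an nlst()/dir() listing."""
    return idna.ToUnicode(ascii_name)


remote_name = encode_name(u"b\u00fccher")   # hypothetical file name
print(remote_name)                          # ASCII only, starts with "xn--"
print(decode_name(remote_name) == u"b\u00fccher")
```

Paths with several components would need each component converted separately, since the functions operate on a single label.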

Thank you Peter and Bob for your comments which I found very helpful.


ftplib has no knowledge of Unicode whatsoever. It is intended to be passed byte-strings for filenames, and it'll return byte strings when asked for a directory list. Those are the exact strings of bytes passed-to/returned-from the server.

If you pass a Unicode string to ftplib in Python 2.x, it'll end up getting coerced to bytes when it's sent to the underlying socket object. This coercion uses Python's ‘default’ encoding, i.e. US-ASCII for safety, so an exception is raised for any non-ASCII character.
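The failure described above can be reproduced without a server. This sketch spells out the ASCII encoding that Python 2 applies implicitly; the file name is a made-up example.

```python
name = u"r\u00e9sum\u00e9.txt"  # hypothetical non-ASCII file name

try:
    # Python 2 performs this ASCII encoding implicitly when a unicode
    # string reaches the socket; here it is done explicitly.
    command = ("STOR " + name).encode("ascii")
except UnicodeEncodeError as exc:
    command = None
    print("cannot send: %s" % exc)

# Encoding explicitly yields a byte string that can be sent as-is.
utf8_command = ("STOR " + name).encode("utf-8")
```

Whether the server interprets those bytes as UTF-8 is a separate question that the client cannot settle by encoding alone.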

The python-dev message to which you linked is talking about ftplib in Python 3.x, where strings are Unicode by default. This leaves modules like ftplib in a tricky situation because although they now use Unicode strings at their front-end, the actual protocol behind it is byte-based. An extra level of encoding/decoding therefore has to be involved, and without explicit intervention to specify which encoding is in use, there's a fair chance it'll choose wrong.

ftplib in 3.x chose to default to ISO-8859-1 in order to preserve each byte as a character inside the Unicode string. Unfortunately this will give unexpected results in the common case where the target server stores filenames as UTF-8 (whether or not the FTP daemon itself knows that filenames are UTF-8, which it commonly won't). There are a number of cases like this where the Python standard libraries have been brutally hacked to Unicode strings with negative consequences; Python 3's batteries-included are still leaking corrosive fluid IMO.
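To make the mismatch concrete, this sketch shows what a UTF-8 file name looks like after being decoded as ISO-8859-1, and how the original name can still be recovered because that decoding preserves every byte. The file name is illustrative only.

```python
# Bytes as a UTF-8 server would send them in a directory listing.
raw = u"r\u00e9sum\u00e9.txt".encode("utf-8")

# Decoding as ISO-8859-1 never fails, but each non-ASCII character
# turns into two unrelated characters.
garbled = raw.decode("latin-1")
print(repr(garbled))

# latin-1 maps bytes to code points one-to-one, so the damage is
# reversible: re-encode to get the original bytes, then decode as UTF-8.
recovered = garbled.encode("latin-1").decode("utf-8")
```

From Python 3.9 onwards, ftplib.FTP accepts an encoding argument and defaults to UTF-8, so this round trip is mainly relevant to earlier 3.x releases.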


Personally I would be more worried about what is on the other side of the FTP connection than about the library's support. FTP is a brittle protocol at the best of times without trying to be creative with filenames.

From RFC 959:

     Pathname is defined to be the character string which must be
     input to a file system by a user in order to identify a file.
     Pathname normally contains device and/or directory names, and
     file name specification.  FTP does not yet specify a standard
     pathname convention.  Each user must follow the file naming
     conventions of the file systems involved in the transfer.

To me that means that filenames should conform to the lowest common denominator. Nowadays the number of DOS servers, VAXes and IBM mainframes is negligible, and chances are you'll end up on a Windows or Unix box, so the common denominator is quite high. Still, making assumptions about which code page the remote site will accept seems pretty risky to me.


To get around this, I used the following code:

ftp.storbinary("STOR " + target_name.encode( "utf-8" ), open(file_name, 'rb'))

This assumes that the FTP server supports RFC 2640 (http://www.ietf.org/rfc/rfc2640.txt), which allows UTF-8 file names. In my case I used the SwiFTP server for Android, and it transferred the files with the proper names.


Can someone confirm that Python 2.6 ftplib does NOT support Unicode file names?

It doesn't.

Should ftplib use UTF-8 instead of latin-1 encoding?

It's debatable. UTF-8 is the preferred encoding as dictated by RFC-2640, but latin-1 is usually more friendly to misbehaving implementations (either server or client). If the server includes "UTF8" in its FEAT response then you should definitely use UTF-8.

 >>> utf8_server = 'UTF8' in ftp.sendcmd('FEAT')
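The substring test above also matches when "UTF8" merely appears inside another feature name or in the status text. A slightly stricter sketch compares each line of the reply instead. supports_utf8 is a hypothetical helper, and the sample reply is made up to follow the multi-line FEAT format of RFC 2389.

```python
def supports_utf8(feat_reply):
    """Return True if a FEAT reply lists UTF8 as a feature.

    feat_reply is the multi-line string returned by ftp.sendcmd('FEAT').
    """
    for line in feat_reply.splitlines():
        # Feature lines are indented; status lines such as
        # "211-Features:" and "211 End" can never equal "UTF8".
        if line.strip().upper() == "UTF8":
            return True
    return False


sample = "211-Features:\n MDTM\n SIZE\n UTF8\n211 End"
print(supports_utf8(sample))
```

A server without FEAT support replies with an error, which ftplib raises as an exception, so a real call still needs a try/except around sendcmd.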

To support unicode in Python 2.x you can use the following subclass of ftplib.FTP, which overrides putline:

import ftplib

class UnicodeFTP(ftplib.FTP):
    """An ftplib.FTP subclass supporting unicode file names as
    described by RFC-2640."""

    def putline(self, line):
        # Encode unicode commands as UTF-8 before they reach the socket.
        line = line + '\r\n'
        if isinstance(line, unicode):
            line = line.encode('utf8')
        self.sock.sendall(line)

...and pass unicode strings when using the remaining API as in:

>>> ftp = UnicodeFTP(host='ftp.site.com', user='foo', passwd='bar')
>>> ftp.delete(u'somefile')


We got UTF8-encoded filenames working with Python 2.7's ftplib.

Note 1: For background on the difference between UTF8 and Unicode, see: https://code.google.com/p/iqbox-ftp/wiki/ProgrammingGuide_UnicodeVsAscii

Note 2: You can take a look at the AGPL libraries we use for IQBox. You might be able to use those (or parts of those), and they support UTF8 over FTP. Look at filetransfer_abc.py

You do need to add code to (1) determine whether the server supports UTF8, (2) encode the unicode Python string as UTF8 before sending it, and (3) decode the names when you get file listings, using: if UTF8_support: name = name.decode('utf-8'). Full code for (3) is not shown, since everyone gets file listings differently.

# PART (1): DETERMINE IF SERVER HAS UTF8 SUPPORT:
import ftplib

# Get FTP features:
try:
    features_string_ftp = ftp.sendcmd('FEAT')
    print features_string_ftp

    # Determine UTF8 support:
    if 'UTF8' in features_string_ftp.upper():
        print "FTP>> Server supports international characters (UTF8)"
        UTF8_support = True
    else:
        print "FTP>> Server does NOT support international (non-ASCII) characters."
        UTF8_support = False
except ftplib.all_errors:
    # FEAT is optional; servers that lack it reply with an error.
    print "FTP>> Could not get list of features using FEAT command."
    print "FTP>> Server does NOT support international (non-ASCII) characters."
    UTF8_support = False


# PART (2): Encode FTP commands using UTF8 encoding, if it's supported.
def sendFTPcommand(ftp, command_string, UTF8_support):
    # Needed for UTF8 international file names etc.
    # command_string is expected to be a unicode object.
    if UTF8_support:
        c = command_string.encode('utf-8')
    else:
        c = command_string

    # TODO: Add try-except here and connection error retries.
    return ftp.sendcmd(c)

# If you just want to get a string with the UTF8 command and send it yourself, then use this:
def encodeFTPcommand(command_string, UTF8_support):
    # Needed for UTF8 international file names etc.
    # command_string is expected to be a unicode object.
    if UTF8_support:
        return command_string.encode('utf-8')
    return command_string
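Step (3), decoding the names in a listing, is not shown above. A minimal sketch, assuming the names arrive as byte strings (as they do from nlst() on Python 2). decode_listing is a hypothetical helper name; names that turn out not to be valid UTF-8 are left untouched rather than raising.

```python
def decode_listing(names, utf8_support):
    """Decode byte-string file names from a listing when the server uses UTF-8."""
    decoded = []
    for name in names:
        if utf8_support and isinstance(name, bytes):
            try:
                name = name.decode("utf-8")
            except UnicodeDecodeError:
                # Not valid UTF-8 after all: keep the raw bytes.
                pass
        decoded.append(name)
    return decoded


print(decode_listing([b"plain.txt", b"r\xc3\xa9sum\xc3\xa9.txt"], True))
```

Keeping undecodable names as raw bytes means they can still be passed back to the server unchanged, at the cost of a list with mixed types.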