Using urllib2 in Python. How do I get the name of the file I am downloading?
I am a Python beginner. I am using urllib2 to download files. When I download a file, I specify a filename with which to save it on my hard drive. However, if I download the file using my browser, a default filename is automatically provided.
Here is a simplified version of my code:
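import urllib2

def downloadmp3(url):
    webFile = urllib2.urlopen(url)
    filename = 'temp.zip'
    # Open in binary mode so the downloaded bytes are written unchanged
    localFile = open(filename, 'wb')
    localFile.write(webFile.read())
    localFile.close()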
The file downloads just fine, but if I type the string stored in the variable "url" into my browser, a default filename is given to the file when I download it. I want to use this filename for my downloaded file, not 'temp.zip' or whatever I assign it.
How do I use urllib2 (or some other Python library) to save the file with the filename that the server I am downloading from intends it to have?
If anyone doesn't understand this question, please say so, so that I can try to make it clearer.
The filename is usually included by the server through the content-disposition header:
content-disposition: attachment; filename=foo.pdf
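As a rough sketch of how the filename could be pulled out of such a header value, the standard library's email.message.Message class can do the parameter parsing. The function name filename_from_content_disposition is made up for this illustration, and stripping the directory part is an extra precaution that the answer itself does not mention:

```python
from email.message import Message
import posixpath


def filename_from_content_disposition(header_value):
    """Return the filename named in a Content-Disposition value, or None."""
    if not header_value:
        return None
    # Let the standard library's MIME parser handle quoting and parameters.
    message = Message()
    message["Content-Disposition"] = header_value
    name = message.get_filename()
    if not name:
        return None
    # Keep only the last path component so the server cannot steer the
    # download into another directory (an added precaution, not from the answer).
    name = posixpath.basename(name.replace("\\", "/"))
    return name or None


print(filename_from_content_disposition("attachment; filename=foo.pdf"))
```

The function takes the raw header string, so it does not matter which HTTP library fetched it.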
You have access to the headers through
result = urllib2.urlopen(...)
result.info()  # contains the headers
>>> import urllib2
>>> result = urllib2.urlopen('http://zopyx.com')
>>> print result
<addinfourl at 4302289808 whose fp = <socket._fileobject object at 0x1006dd5d0>>
>>> result.info()
<httplib.HTTPMessage instance at 0x1006fbab8>
>>> result.info().headers
['Date: Mon, 04 Apr 2011 02:08:28 GMT\r\n', 'Server: Zope/(unreleased version, python 2.4.6, linux2) ZServer/1.1 Plone/3.3.4\r\n', 'Content-Length: 15321\r\n', 'Content-Type: text/html; charset=utf-8\r\n', 'Via: 1.1 www.zopyx.com\r\n', 'Cache-Control: max-age=3600\r\n', 'Expires: Mon, 04 Apr 2011 03:08:28 GMT\r\n', 'Connection: close\r\n']
See
http://docs.python.org/library/urllib2.html
But be aware that this header is not always present. If it is missing, you need to generate a reasonable name yourself from the requested URL, e.g. from the last component of its path. Use the urlparse() function from Python's urlparse module in this case.
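A minimal sketch of that fallback is shown here. The helper name filename_from_url and the "download" default are assumptions for the example, and the try/except import is only there so the same code runs on Python 2 (urlparse module) and Python 3 (urllib.parse):

```python
import posixpath

try:
    from urllib.parse import urlparse, unquote  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2
    from urllib import unquote


def filename_from_url(url, default="download"):
    """Guess a filename from the last component of the URL path."""
    # urlparse separates the path from the query string and fragment,
    # and unquote turns escapes such as %20 back into characters.
    path = unquote(urlparse(url).path)
    name = posixpath.basename(path)
    # URLs that end in "/" or have no path yield an empty name.
    return name or default


print(filename_from_url("http://example.com/files/foo.pdf?session=1"))
```

Decoding before taking the basename means an escaped slash (%2F) cannot smuggle a directory into the result.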
My issue with the previous answers is that they use the original URL, which fails in the case of a redirect. Here's how I do it (note the use of result.url instead of url):
import os
import urllib2
result = urllib2.urlopen(url)
filename = os.path.basename(urllib2.urlparse.urlparse(result.url).path)
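Put together as a complete download, the idea might look roughly like the following. This is only a sketch: the download function, its directory parameter and the "download" default name are invented for the example, and the import fallback is there so it runs on both Python 2 and Python 3.

```python
import os
import posixpath
import shutil
from contextlib import closing

try:
    from urllib.request import urlopen  # Python 3
    from urllib.parse import urlparse, unquote
except ImportError:
    from urllib2 import urlopen  # Python 2
    from urlparse import urlparse
    from urllib import unquote


def download(url, directory=".", default_name="download"):
    """Save url into directory, named after the final (post-redirect) URL."""
    with closing(urlopen(url)) as response:
        # response.url is the address that finally served the file,
        # which differs from `url` if the server redirected the request.
        final_path = unquote(urlparse(response.url).path)
        name = posixpath.basename(final_path) or default_name
        target = os.path.join(directory, name)
        # Binary mode plus copyfileobj streams the body without
        # loading the whole file into memory.
        with open(target, "wb") as local_file:
            shutil.copyfileobj(response, local_file)
    return target
```

Calling download(url) returns the path of the saved file, so the caller can see which name was chosen.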
You can do that using urlretrieve, which returns the local filename together with the response headers:
http://docs.python.org/library/urllib.html
I had an issue where the server did not give me any content-disposition header. If that is also your case, you can extract the filename from the URL like this:
os.path.basename(urlparse.urlparse(file_url).path)
In my case, I used file_stream.headers.subtype, which holds the MIME subtype of the response and matched the file extension, and I renamed the files based on my Django model's slug. Here's an example:
import os
import urllib2
import urlparse
from tempfile import NamedTemporaryFile

from django.core.files import File

# file_url and my_model are assumed to be defined elsewhere
tmp_file = NamedTemporaryFile(delete=True)
file_stream = urllib2.urlopen(file_url)
tmp_file.write(file_stream.read())
tmp_file.flush()
new_file_name = "some_prefix_" + my_model.slug + "." + file_stream.headers.subtype
# You may prefer this:
# new_file_name = os.path.basename(urlparse.urlparse(file_url).path)
my_model.file.save(new_file_name, File(tmp_file))
The last line saves the file using Django's save method, which also handles duplicate file names by adding random characters at the end :)
Awesome.