Python: Downloading files from HTTP server
I have written some Python scripts to download images from an HTTP website, but because I'm using urllib2, it closes the connection after each request and opens a new one for the next. I don't really understand networking all that much, but this probably slows things down considerably, and grabbing 100 images at a time would take a considerable amount of time.
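For context, a minimal sketch of the kind of per-image urllib2 loop I mean (the host and filenames here are placeholders, not my actual script); each urlopen() call sets up and tears down its own TCP connection:

import urllib2

images = ['img1.png', 'img2.png', 'img3.png']  # placeholder filenames

for image in images:
    # Each urlopen() opens a fresh TCP connection to the server.
    resp = urllib2.urlopen('http://www.example.com/images/%s' % image)
    with open(image, 'wb') as f:
        f.write(resp.read())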
I started looking at other alternatives like pycurl or httplib, but found them complicated to figure out compared to urllib2 and haven't found a lot of code snippets that I could just take and use.
Simply, how would I establish a persistent connection to a website and download a number of files and then close the connection only when I am done? (probably an explicit call to close it)
since you asked for an httplib snippet:
import httplib
images = ['img1.png', 'img2.png', 'img3.png']
conn = httplib.HTTPConnection('www.example.com')
for image in images:
    conn.request('GET', '/images/%s' % image)
    resp = conn.getresponse()
    data = resp.read()
    with open(image, 'wb') as f:
        f.write(data)
conn.close()
this would issue multiple (sequential) GET requests for the images in the list, then close the connection.
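If you want the connection closed even when a download fails, a hedged variation of the same loop (same placeholder host and filenames) can check the status code and use try/finally:

import httplib

images = ['img1.png', 'img2.png', 'img3.png']
conn = httplib.HTTPConnection('www.example.com')
try:
    for image in images:
        conn.request('GET', '/images/%s' % image)
        resp = conn.getresponse()
        data = resp.read()  # read the full response before issuing the next request
        if resp.status == 200:
            with open(image, 'wb') as f:
                f.write(data)
finally:
    conn.close()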
I found urllib3 and it claims to reuse existing TCP connections.
As I already stated in a comment on the question, I disagree with the claim that this will not make a big difference: because of the TCP slow-start algorithm, every newly created connection will be slow at first, so reusing the same TCP socket will make a difference if the data is big enough. And for 100 images I think the total data will be between 10 and 100 MB.
Here is a code sample from http://code.google.com/p/urllib3/source/browse/test/benchmark.py
TO_DOWNLOAD = [
'http://code.google.com/apis/apps/',
'http://code.google.com/apis/base/',
'http://code.google.com/apis/blogger/',
'http://code.google.com/apis/calendar/',
'http://code.google.com/apis/codesearch/',
'http://code.google.com/apis/contact/',
'http://code.google.com/apis/books/',
'http://code.google.com/apis/documents/',
'http://code.google.com/apis/finance/',
'http://code.google.com/apis/health/',
'http://code.google.com/apis/notebook/',
'http://code.google.com/apis/picasaweb/',
'http://code.google.com/apis/spreadsheets/',
'http://code.google.com/apis/webmastertools/',
'http://code.google.com/apis/youtube/',
]
from urllib3 import HTTPConnectionPool

pool = HTTPConnectionPool.from_url(TO_DOWNLOAD[0])
for url in TO_DOWNLOAD:
    r = pool.get_url(url)
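Applied to the original problem, here is a hedged sketch using urllib3's current PoolManager interface (host, paths and filenames are placeholders); the pool keeps the underlying connection alive across requests to the same host:

import urllib3

images = ['img1.png', 'img2.png', 'img3.png']  # placeholder filenames
http = urllib3.PoolManager()

for image in images:
    # Requests to the same host reuse the pooled connection.
    r = http.request('GET', 'http://www.example.com/images/%s' % image)
    with open(image, 'wb') as f:
        f.write(r.data)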
If you are not going to make any complicated requests, you could open a socket and make the requests yourself, like:
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((server_name, server_port))
for url in urls:
    sock.sendall('GET %s HTTP/1.1\r\nHost: %s\r\n\r\n' % (url, server_name))
    # Parse HTTP header
    # Download picture (Size should be in the HTTP header)
sock.close()
But I do not think establishing 100 TCP sessions adds much overhead in general.
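For completeness, a hedged sketch of the header-parsing step left as a comment above; it assumes the server answers with a Content-Length header (no chunked encoding):

def read_response(sock):
    # Read byte-by-byte until the blank line that ends the HTTP headers.
    header = b''
    while b'\r\n\r\n' not in header:
        header += sock.recv(1)
    # Find Content-Length so we know how many body bytes to expect.
    length = 0
    for line in header.split(b'\r\n'):
        if line.lower().startswith(b'content-length:'):
            length = int(line.split(b':', 1)[1])
    # Read exactly that many bytes of picture data.
    body = b''
    while len(body) < length:
        body += sock.recv(length - len(body))
    return body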