Python: Downloading files from HTTP server

I have written some Python scripts to download images from an HTTP website, but because I'm using urllib2, the connection is closed after each download and a new one is opened for the next. I don't really understand networking all that much, but this probably slows things down considerably, and grabbing 100 images at a time would take a considerable amount of time.

I started looking at other alternatives like pycurl or httplib, but found them complicated to figure out compared to urllib2, and I haven't found a lot of code snippets that I could just take and use.

Simply put: how would I establish a persistent connection to a website, download a number of files over it, and close the connection only when I am done (probably with an explicit call to close it)?


Since you asked for an httplib snippet:

import httplib

images = ['img1.png', 'img2.png', 'img3.png']

# A single HTTP connection, opened once and reused for every request below.
conn = httplib.HTTPConnection('www.example.com')

for image in images:
    conn.request('GET', '/images/%s' % image)
    resp = conn.getresponse()
    data = resp.read()  # read the full body before sending the next request
    with open(image, 'wb') as f:
        f.write(data)

conn.close()

This issues multiple (sequential) GET requests for the images in the list over the same connection, then closes it.
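
For reference, on Python 3 the same approach works with http.client (the httplib module was renamed); a minimal sketch, using the same placeholder host and filenames:

from http.client import HTTPConnection

images = ['img1.png', 'img2.png', 'img3.png']

conn = HTTPConnection('www.example.com')
try:
    for image in images:
        conn.request('GET', '/images/%s' % image)
        resp = conn.getresponse()
        data = resp.read()
        with open(image, 'wb') as f:
            f.write(data)
finally:
    conn.close()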


I found urllib3, which claims to reuse existing TCP connections.

As I already stated in a comment on the question, I disagree with the claim that this will not make a big difference: because of the TCP slow-start algorithm, every newly created connection is slow at first, so reusing the same TCP socket makes a difference if the data is big enough. And I think for 100 images (say 100 KB to 1 MB each) the total will be between 10 and 100 MB.

Here is a code sample from http://code.google.com/p/urllib3/source/browse/test/benchmark.py

TO_DOWNLOAD = [
'http://code.google.com/apis/apps/',
'http://code.google.com/apis/base/',
'http://code.google.com/apis/blogger/',
'http://code.google.com/apis/calendar/',
'http://code.google.com/apis/codesearch/',
'http://code.google.com/apis/contact/',
'http://code.google.com/apis/books/',
'http://code.google.com/apis/documents/',
'http://code.google.com/apis/finance/',
'http://code.google.com/apis/health/',
'http://code.google.com/apis/notebook/',
'http://code.google.com/apis/picasaweb/',
'http://code.google.com/apis/spreadsheets/',
'http://code.google.com/apis/webmastertools/',
'http://code.google.com/apis/youtube/',
]

from urllib3 import HTTPConnectionPool

# All URLs are on the same host, so one pool (one reused connection) is enough.
pool = HTTPConnectionPool.from_url(TO_DOWNLOAD[0])
for url in TO_DOWNLOAD:
    r = pool.get_url(url)
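
The from_url / get_url calls above come from an early urllib3 release; with current urllib3 versions the rough equivalent, as far as I know, is a PoolManager, which keeps a connection pool per host and reuses sockets automatically:

import urllib3

# One PoolManager; connections are pooled and reused per host.
http = urllib3.PoolManager()
for url in TO_DOWNLOAD:
    r = http.request('GET', url)
    data = r.data  # full response body as bytes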


If you are not going to make any complicated requests, you could open a socket and make the requests yourself, like:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((server_name, server_port))

for url in urls:
    sock.sendall('GET %s HTTP/1.1\r\nHost: %s\r\n\r\n' % (url, server_name))
    # Parse the HTTP response header
    # Download the picture (its size should be in the Content-Length header)

sock.close()
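
Parsing the response yourself is the tricky part; here is a minimal sketch of one way to do it, assuming the server sends a Content-Length header and does not use chunked transfer encoding:

def recv_response(sock):
    # Read until the end of the HTTP header block.
    buf = b''
    while b'\r\n\r\n' not in buf:
        chunk = sock.recv(4096)
        if not chunk:
            break
        buf += chunk
    header, _, body = buf.partition(b'\r\n\r\n')

    # Find Content-Length, then read exactly that many body bytes.
    length = 0
    for line in header.split(b'\r\n')[1:]:
        name, _, value = line.partition(b':')
        if name.strip().lower() == b'content-length':
            length = int(value.strip())
    while len(body) < length:
        chunk = sock.recv(4096)
        if not chunk:
            break
        body += chunk
    return body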

But I do not think that establishing 100 TCP sessions adds much overhead in general.
