Python: Downloading files from HTTP server

I have written some Python scripts to download images from an HTTP website, but because I'm using urllib2, the connection is closed after each download and a new one is opened for the next. I don't really understand networking all that much, but this probably slows things down considerably, and grabbing 100 images at a time would take a considerable amount of time.

I started looking at other alternatives like pycurl or httplib, but found them complicated to figure out compared to urllib2, and I haven't found a lot of code snippets that I could just take and use.

Simply put: how would I establish a persistent connection to a website, download a number of files over it, and close the connection only when I am done (probably with an explicit call to close it)?


Since you asked for an httplib snippet:

import httplib

images = ['img1.png', 'img2.png', 'img3.png']

# A single HTTP connection, opened once and reused for every request below.
conn = httplib.HTTPConnection('www.example.com')

for image in images:
    conn.request('GET', '/images/%s' % image)
    resp = conn.getresponse()
    data = resp.read()  # read the full body before sending the next request
    with open(image, 'wb') as f:
        f.write(data)

conn.close()

This issues multiple (sequential) GET requests for the images in the list over the same connection, then closes it.
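
For reference, on Python 3 the same approach works with http.client (the httplib module was renamed); a minimal sketch, using the same placeholder host and filenames:

from http.client import HTTPConnection

images = ['img1.png', 'img2.png', 'img3.png']

conn = HTTPConnection('www.example.com')
try:
    for image in images:
        conn.request('GET', '/images/%s' % image)
        resp = conn.getresponse()
        data = resp.read()
        with open(image, 'wb') as f:
            f.write(data)
finally:
    conn.close()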


I found urllib3, which claims to reuse existing TCP connections.

As I already stated in a comment on the question, I disagree with the claim that this will not make a big difference: because of the TCP slow-start algorithm, every newly created connection is slow at first, so reusing the same TCP socket makes a difference if the data is big enough. And I think for 100 images (say 100 KB to 1 MB each) the total will be between 10 and 100 MB.

Here is a code sample from http://code.google.com/p/urllib3/source/browse/test/benchmark.py

TO_DOWNLOAD = [
'http://code.google.com/apis/apps/',
'http://code.google.com/apis/base/',
'http://code.google.com/apis/blogger/',
'http://code.google.com/apis/calendar/',
'http://code.google.com/apis/codesearch/',
'http://code.google.com/apis/contact/',
'http://code.google.com/apis/books/',
'http://code.google.com/apis/documents/',
'http://code.google.com/apis/finance/',
'http://code.google.com/apis/health/',
'http://code.google.com/apis/notebook/',
'http://code.google.com/apis/picasaweb/',
'http://code.google.com/apis/spreadsheets/',
'http://code.google.com/apis/webmastertools/',
'http://code.google.com/apis/youtube/',
]

from urllib3 import HTTPConnectionPool

# All URLs are on the same host, so one pool (one reused connection) is enough.
pool = HTTPConnectionPool.from_url(TO_DOWNLOAD[0])
for url in TO_DOWNLOAD:
    r = pool.get_url(url)
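
The from_url / get_url calls above come from an early urllib3 release; with current urllib3 versions the rough equivalent, as far as I know, is a PoolManager, which keeps a connection pool per host and reuses sockets automatically:

import urllib3

# One PoolManager; connections are pooled and reused per host.
http = urllib3.PoolManager()
for url in TO_DOWNLOAD:
    r = http.request('GET', url)
    data = r.data  # full response body as bytes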


If you are not going to make any complicated requests, you could open a socket and make the requests yourself, like:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((server_name, server_port))

for url in urls:
    sock.sendall('GET %s HTTP/1.1\r\nHost: %s\r\n\r\n' % (url, server_name))
    # Parse the HTTP response header
    # Download the picture (its size should be in the Content-Length header)

sock.close()
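
Parsing the response yourself is the tricky part; here is a minimal sketch of one way to do it, assuming the server sends a Content-Length header and does not use chunked transfer encoding:

def recv_response(sock):
    # Read until the end of the HTTP header block.
    buf = b''
    while b'\r\n\r\n' not in buf:
        chunk = sock.recv(4096)
        if not chunk:
            break
        buf += chunk
    header, _, body = buf.partition(b'\r\n\r\n')

    # Find Content-Length, then read exactly that many body bytes.
    length = 0
    for line in header.split(b'\r\n')[1:]:
        name, _, value = line.partition(b':')
        if name.strip().lower() == b'content-length':
            length = int(value.strip())
    while len(body) < length:
        chunk = sock.recv(4096)
        if not chunk:
            break
        body += chunk
    return body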

But I do not think that establishing 100 TCP sessions adds much overhead in general.
