Threading HTTP requests (with proxies)
I've looked at similar questions, but there always seems to be a whole lot of disagreement over the best way to handle threading with HTTP.
What I specifically want to do: I'm using Python 2.7, and I want to try and thread HTTP requests (specifically, POSTing something), with a SOCKS5 proxy for each. The code I have already works, but is rather slow since it's waiting for each request (to the proxy server, then the web server) to finish before starting another. Each thread would most likely be making a different request with a different SOCKS proxy.
So far I've been using urllib2 exclusively. I looked into modules like PycURL, but it's extremely difficult to install properly with Python 2.7 on Windows, which is what I'm coding on and want to support. I'd be willing to use any other module, though.
I've looked at these questions in particular:
Python urllib2.urlopen() is slow, need a better way to read several urls
Python - Example of urllib2 asynchronous / threaded request using HTTPS
Many of the examples received downvotes and sparked arguments. Assuming the commenters are correct, making a client with an asynchronous framework like Twisted sounds like it would be the fastest thing to use. However, I Googled ferociously, and Twisted doesn't seem to provide any support for SOCKS5 proxies. I'm currently using the SocksiPy module, and I could try something like:
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, IP, port)
socks.wrapmodule(twisted.web.client)
I have no idea if that would work though, and I also don't even know if Twisted is what I really want to use. I could also just go with the threading module and work that into my current urllib2 code (roughly the sketch below), but if that is going to be much slower than Twisted, I may not want to bother. Does anyone have any insight?
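For reference, here's roughly what I mean by the threading-module fallback. This is only a sketch: the URL, POST parameters, and proxy address are placeholders, and SocksiPy's setdefaultproxy() is process-wide, which is exactly why per-thread proxies are awkward here:
# Rough sketch of the threading fallback (placeholder URL/proxy, one global proxy shared by all threads)
import threading
import socket
import urllib
import urllib2
import socks

# SocksiPy's default proxy is process-wide, so every thread goes through the same one
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 1080)
socket.socket = socks.socksocket

def do_post(url, params):
    data = urllib.urlencode(params)        # encoded POST body
    response = urllib2.urlopen(url, data)  # supplying data makes this a POST
    print('%s returned %d bytes' % (url, len(response.read())))

threads = [threading.Thread(target=do_post,
                            args=('http://example.com/submit', {'key': 'value'}))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()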
Perhaps an easier way would be to just rely on gevent (or eventlet) to let you open lots of connections to the server. These libs monkeypatch the socket module (which urllib2 sits on top of) to make it async, whilst still letting you write code that is sync-ish. Their smaller overhead vs threads also means you can spawn lots more (1000s would not be unusual).
I've used something like this loads (plagiarized from here):
import gevent
from gevent import monkey

# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()

import urllib2

urls = ['http://www.google.com', 'http://www.yandex.ru', 'http://www.python.org']

def print_head(url):
    print ('Starting %s' % url)
    data = urllib2.urlopen(url).read()
    print ('%s: %s bytes: %r' % (url, len(data), data[:50]))

jobs = [gevent.spawn(print_head, url) for url in urls]
# wait for all greenlets to finish before the script exits
gevent.joinall(jobs)
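If you need the SOCKS5 part as well, you could in principle layer SocksiPy on top of this. The following is only a sketch under assumptions I haven't tested: that socksocket picks up gevent's patched socket when patching happens before the imports, with a placeholder proxy address, and with the same process-wide-proxy caveat as before:
import gevent
from gevent import monkey

monkey.patch_all()  # patch socket/ssl before socks and urllib2 are imported

import socks
import urllib2

# Placeholder proxy; setdefaultproxy() is still process-wide, so all greenlets share it
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 1080)
socks.wrapmodule(urllib2)

def fetch(url):
    print ('Starting %s' % url)
    return urllib2.urlopen(url).read()

jobs = [gevent.spawn(fetch, url) for url in ['http://www.python.org', 'http://www.google.com']]
gevent.joinall(jobs)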