pycurl and a lot of callback functions
I have a big list of URLs that I have to download in parallel, checking one of the headers returned with each response.
I can use CurlMulti for the parallelization.
I can use /dev/null as the output file, because I am not interested in the body, only the headers.
But how can I check each header?
To receive the headers, I must set a HEADERFUNCTION callback. I get that.
But in this callback function I only get a buffer with the headers. How can I distinguish one request from another?
I don't like the idea of creating as many callback functions as there are URLs. Should I create a class and as many instances of that class? That doesn't seem very clever either.
I would use Python's built-in httplib and threading modules. I don't see the need for a third-party module.
I know you're asking about pycurl, but I find it too hard and unpythonic to use. The API is weird.
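For what it's worth, the httplib-plus-threading idea could look roughly like the sketch below. It assumes Python 3, where httplib is named http.client. The names head_headers and fetch_all, the pool size of 8 and the 10-second timeout are made up for illustration, and only plain http:// URLs are handled.

```python
import queue
import threading
from http.client import HTTPConnection  # this module is called httplib on Python 2
from urllib.parse import urlsplit


def head_headers(url):
    """Send a HEAD request to a plain http:// URL and return its headers as a dict."""
    parts = urlsplit(url)
    conn = HTTPConnection(parts.netloc, timeout=10)  # timeout is an arbitrary choice
    try:
        conn.request('HEAD', parts.path or '/')
        return dict(conn.getresponse().getheaders())
    finally:
        conn.close()


def fetch_all(urls, fetch=head_headers, workers=8):
    """Call fetch(url) for every URL using a small pool of threads.

    Returns a dict mapping each URL to its headers, or to the exception raised.
    """
    todo = queue.Queue()
    for url in urls:
        todo.put(url)
    results = {}

    def worker():
        while True:
            try:
                url = todo.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            try:
                results[url] = fetch(url)
            except Exception as exc:  # keep going if one URL fails
                results[url] = exc

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each result is stored under its URL, there is no question of which headers belong to which request. Calling fetch_all(list_of_urls) returns the whole mapping once every worker has finished.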
Here's a Twisted example:
from twisted.web.client import Agent
from twisted.internet import reactor, defer

def get_headers(response, url):
    '''Extract a dict of headers from the response'''
    return url, dict(response.headers.getAllRawHeaders())

def got_everything(all_headers):
    '''print results and end program'''
    print(dict(all_headers))
    reactor.stop()

agent = Agent(reactor)
# one URL per line; blank lines are skipped below
urls = (line.strip() for line in open('urls.txt'))
reqs = [agent.request('HEAD', url).addCallback(get_headers, url) for url in urls if url]
defer.gatherResults(reqs).addCallback(got_everything)
reactor.run()
This example starts all the requests asynchronously and gathers the results. Here's the output for a file with 3 URLs:
{'http://debian.org': {'Content-Type': ['text/html; charset=iso-8859-1'],
                       'Date': ['Thu, 04 Mar 2010 13:27:25 GMT'],
                       'Location': ['http://www.debian.org/'],
                       'Server': ['Apache'],
                       'Vary': ['Accept-Encoding']},
 'http://google.com': {'Cache-Control': ['public, max-age=2592000'],
                       'Content-Type': ['text/html; charset=UTF-8'],
                       'Date': ['Thu, 04 Mar 2010 13:27:25 GMT'],
                       'Expires': ['Sat, 03 Apr 2010 13:27:25 GMT'],
                       'Location': ['http://www.google.com/'],
                       'Server': ['gws'],
                       'X-Xss-Protection': ['0']},
 'http://stackoverflow.com': {'Cache-Control': ['private'],
                              'Content-Type': ['text/html; charset=utf-8'],
                              'Date': ['Thu, 04 Mar 2010 13:27:24 GMT'],
                              'Expires': ['Thu, 04 Mar 2010 13:27:25 GMT'],
                              'Server': ['Microsoft-IIS/7.5']}}
The solution is to use a little bit of functional programming to 'stick' some additional information onto the callback function: functools.partial.
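To make that concrete, here is a minimal sketch, assuming Python 3. The names on_header and make_handle are hypothetical, and so is the choice of storing headers lower-cased in a shared dict. One ordinary function serves every request; functools.partial pre-binds the URL and the result dict, so each handle gets its own callable without a function or class being written per URL. The last few lines only simulate libcurl calling the callbacks, so the binding can be tried without pycurl or a network.

```python
import functools


def on_header(url, headers, line):
    """Header callback. url and headers are bound in advance with functools.partial;
    libcurl supplies only `line`, one header line per call."""
    if isinstance(line, bytes):  # pycurl passes bytes on Python 3
        line = line.decode('iso-8859-1')
    name, sep, value = line.partition(':')
    if sep:  # skips the status line and the blank line that ends the headers
        headers.setdefault(url, {})[name.strip().lower()] = value.strip()


def make_handle(url, headers):
    """Build a Curl handle whose header callback knows which URL it belongs to."""
    import pycurl  # imported here so on_header can be tried without pycurl installed
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.NOBODY, True)  # skip the body, like a HEAD request
    c.setopt(pycurl.HEADERFUNCTION, functools.partial(on_header, url, headers))
    return c


# Simulate libcurl invoking the callbacks of two different handles.
headers = {}
cb_a = functools.partial(on_header, 'http://a.example', headers)
cb_b = functools.partial(on_header, 'http://b.example', headers)
cb_a(b'HTTP/1.1 200 OK\r\n')
cb_a(b'Server: Apache\r\n')
cb_b(b'Server: gws\r\n')
cb_b(b'\r\n')
print(headers)
```

Each handle returned by make_handle can then be added to a CurlMulti object in the usual way. Once the transfers finish, headers[url] holds whatever that particular URL sent back.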