
Python Socket and Thread pooling, how to get more performance?

I am trying to implement a basic library to issue HTTP GET requests. My goal is to receive data through raw socket connections: a minimalistic design to improve performance, used from threads / a thread pool.

I have a bunch of links which I group by their hostnames, so here's a simple demonstration of input URLs:

hostname1.com - 500 links
hostname2.org - 350 links
hostname3.co.uk - 100 links
...

I intend to use sockets because of performance concerns: a number of sockets that stay connected (where possible, and it usually is) over which I issue HTTP GET requests. The idea came from urllib's low performance on consecutive requests; then I found urllib3, realized it uses httplib, and decided to try raw sockets. So here's what I have accomplished so far:

GETSocket class, SocketPool class, ThreadPool and Worker classes

GETSocket class is a minified, "HTTP GET only" version of Python's httplib.
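
To give an idea of the shape of it (a simplified sketch, not the real implementation; names and details are illustrative):

import socket

class GETSocket(object):
    # Sketch of a keep-alive, GET-only client over a plain socket.

    def __init__(self, host, port=80, timeout=5):
        self.host = host
        self.sock = socket.create_connection((host, port), timeout)

    def urlopen(self, path):
        # HTTP/1.1 defaults to persistent connections, so the same
        # socket can serve several requests to the same host.
        request = ("GET %s HTTP/1.1\r\n"
                   "Host: %s\r\n"
                   "Connection: keep-alive\r\n\r\n" % (path, self.host))
        self.sock.sendall(request.encode("ascii"))
        # Real code must parse the status line, the headers and either
        # Content-Length or chunked encoding to know where the body
        # ends; this sketch just reads whatever is currently available.
        return self.sock.recv(65536)

    def close(self):
        self.sock.close()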

So, I use these classes like this:

# one socket pool per hostname, shared by the worker threads
sp = Comm.SocketPool(host, size=self.poolsize, timeout=5)
for link in linklist:
    pool.add_task(self.__get_url_by_sp, self.count, sp, link, results)
    self.count += 1
# block until every queued task has finished
pool.wait_completion()

The __get_url_by_sp function is a wrapper which calls sp.urlopen and saves the result to the results list. I am using a pool of 5 threads which shares a socket pool of 5 GETSocket instances.
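
Roughly, the wrapper does something like this (a simplified sketch; the exact result format is up to you):

def __get_url_by_sp(self, count, sp, link, results):
    # Borrow a pooled connection, fetch the link, and store the
    # response so it can be collected after wait_completion().
    response = sp.urlopen(link)
    results.append((count, link, response))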

What I wonder is, is there any other possible way that I can improve performance of this system?

I've read about asyncore here, but I couldn't figure out how to reuse the same socket connection with the HTTPClient(asyncore.dispatcher) class provided there.
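
For reference, the HTTPClient example in the asyncore documentation looks roughly like the following (lightly adapted sketch). Note that it speaks HTTP/1.0, so the server closes the connection after each response, which seems to be why reusing the same connection doesn't fall out of it naturally:

import asyncore
import socket

class HTTPClient(asyncore.dispatcher):
    # Non-blocking GET client, adapted from the asyncore docs example.

    def __init__(self, host, path):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((host, 80))
        self.buffer = ('GET %s HTTP/1.0\r\nHost: %s\r\n\r\n'
                       % (path, host)).encode('ascii')
        self.data = []

    def handle_connect(self):
        pass

    def handle_close(self):
        self.close()

    def handle_read(self):
        # collect whatever the server sends back
        self.data.append(self.recv(8192))

    def writable(self):
        # only ask for write events while the request is unsent
        return len(self.buffer) > 0

    def handle_write(self):
        sent = self.send(self.buffer)
        self.buffer = self.buffer[sent:]

client = HTTPClient('hostname1.com', '/')
asyncore.loop()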

Another point: I don't know whether I'm using a blocking or a non-blocking socket, which of the two would be better for performance, or how to implement either one.

Please be specific about your experiences; I don't intend to import yet another library just to do HTTP GET, so I want to code my own tiny library.

Any help appreciated, thanks.


Do this.

Use multiprocessing. http://docs.python.org/library/multiprocessing.html.

  1. Write a worker Process which puts all of the URLs into a Queue.

  2. Write a worker Process which gets a URL from the Queue and does a GET, saving a file and putting the file information into another Queue. You'll probably want multiple copies of this Process; you'll have to experiment to find the correct number.

  3. Write a worker Process which reads file information from a Queue and does whatever it is you're trying to do. (A sketch of this pipeline follows below.)
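
A minimal sketch of that pipeline (Python 2.7; the worker functions and URLs are illustrative, and saving to a file is reduced to recording the body size to keep it short):

import multiprocessing
import urllib2

def producer(urls, url_queue, n_fetchers):
    # Step 1: push every URL into the queue, then one sentinel per fetcher.
    for url in urls:
        url_queue.put(url)
    for _ in range(n_fetchers):
        url_queue.put(None)

def fetcher(url_queue, info_queue):
    # Step 2: GET each URL and pass on a summary of what was fetched.
    while True:
        url = url_queue.get()
        if url is None:
            info_queue.put(None)
            break
        body = urllib2.urlopen(url, timeout=5).read()
        info_queue.put((url, len(body)))

def consumer(info_queue, n_fetchers):
    # Step 3: process the fetch results (here: just print them).
    done = 0
    while done < n_fetchers:
        item = info_queue.get()
        if item is None:
            done += 1
            continue
        url, size = item
        print url, size

if __name__ == '__main__':
    urls = ['http://hostname1.com/page%d' % i for i in range(10)]
    n_fetchers = 4  # tune this number experimentally
    url_queue = multiprocessing.Queue()
    info_queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=producer,
                                     args=(urls, url_queue, n_fetchers)),
             multiprocessing.Process(target=consumer,
                                     args=(info_queue, n_fetchers))]
    procs += [multiprocessing.Process(target=fetcher,
                                      args=(url_queue, info_queue))
              for _ in range(n_fetchers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()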


I finally found a path that solved my problems. I was using Python 3 for my project and my only option was pycurl, so I had to port my project back to the Python 2.7 series.

Using pycurl, I gained:

- Consistent responses to my requests (my script actually has to deal with a minimum of 10k URLs).
- With the ThreadPool class I am receiving responses as fast as my system can handle them (the received data is processed later, so multiprocessing is not much of an option here).

I tried httplib2 first, but I realized that it is not as solid on Python 3 as it is on Python 2. By switching to pycurl I lost caching support.
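
To illustrate what I mean (a simplified sketch, not my actual code; the function name is made up): reusing a single Curl handle inside each worker thread keeps connections to the same host alive between requests, and each thread should own its own handle since libcurl easy handles are not thread-safe.

import pycurl
from io import BytesIO

def fetch_all(urls):
    # One reused handle per thread: libcurl keeps the connection to the
    # same host open between perform() calls.
    curl = pycurl.Curl()
    results = []
    for url in urls:
        buf = BytesIO()
        curl.setopt(pycurl.URL, url)
        curl.setopt(pycurl.WRITEFUNCTION, buf.write)
        curl.setopt(pycurl.FOLLOWLOCATION, 1)
        curl.setopt(pycurl.TIMEOUT, 5)
        curl.perform()
        results.append((url, curl.getinfo(pycurl.RESPONSE_CODE), buf.getvalue()))
    curl.close()
    return results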

Final conclusion: when it comes to HTTP communication, one may need a tool like (py)curl at their disposal. It is a lifesaver, especially when dealing with loads of URLs (try it sometime for fun: you will get lots of weird responses from them).

Thanks for the replies, folks.
