
Python Socket and Thread pooling, how to get more performance?

I am trying to implement a basic library to issue HTTP GET requests. My goal is to receive data through raw socket connections: a minimalistic design to improve performance, used from threads / a thread pool.

I have a bunch of links which I group by their hostnames, so here's a simple demonstration of input URLs:

hostname1.com - 500 links
hostname2.org - 350 links
hostname3.co.uk - 100 links
...

I intend to use sockets because of performance concerns: a number of sockets that stay connected (where possible, and it usually is) over which I issue HTTP GET requests. The idea came from urllib's low performance on consecutive requests; then I found urllib3, realized it uses httplib, and decided to try raw sockets. So here's what I have accomplished so far:

GETSocket class, SocketPool class, ThreadPool and Worker classes

GETSocket class is a minified, "HTTP GET only" version of Python's httplib.
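
To give an idea of the shape of it (a simplified sketch, not the real implementation; names and details are illustrative):

import socket

class GETSocket(object):
    # Sketch of a keep-alive, GET-only client over a plain socket.

    def __init__(self, host, port=80, timeout=5):
        self.host = host
        self.sock = socket.create_connection((host, port), timeout)

    def urlopen(self, path):
        # HTTP/1.1 defaults to persistent connections, so the same
        # socket can serve several requests to the same host.
        request = ("GET %s HTTP/1.1\r\n"
                   "Host: %s\r\n"
                   "Connection: keep-alive\r\n\r\n" % (path, self.host))
        self.sock.sendall(request.encode("ascii"))
        # Real code must parse the status line, the headers and either
        # Content-Length or chunked encoding to know where the body
        # ends; this sketch just reads whatever is currently available.
        return self.sock.recv(65536)

    def close(self):
        self.sock.close()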

So, I use these classes like this:

# one socket pool per hostname, shared by the worker threads
sp = Comm.SocketPool(host, size=self.poolsize, timeout=5)
for link in linklist:
    pool.add_task(self.__get_url_by_sp, self.count, sp, link, results)
    self.count += 1
# block until every queued task has finished
pool.wait_completion()

The __get_url_by_sp function is a wrapper which calls sp.urlopen and saves the result to the results list. I am using a pool of 5 threads which shares a socket pool of 5 GETSocket instances.
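
Roughly, the wrapper does something like this (a simplified sketch; the exact result format is up to you):

def __get_url_by_sp(self, count, sp, link, results):
    # Borrow a pooled connection, fetch the link, and store the
    # response so it can be collected after wait_completion().
    response = sp.urlopen(link)
    results.append((count, link, response))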

What I wonder is, is there any other possible way that I can improve performance of this system?

I've read about asyncore here, but I couldn't figure out how to reuse the same socket connection with the HTTPClient(asyncore.dispatcher) class provided there.
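
For reference, the HTTPClient example in the asyncore documentation looks roughly like the following (lightly adapted sketch). Note that it speaks HTTP/1.0, so the server closes the connection after each response, which seems to be why reusing the same connection doesn't fall out of it naturally:

import asyncore
import socket

class HTTPClient(asyncore.dispatcher):
    # Non-blocking GET client, adapted from the asyncore docs example.

    def __init__(self, host, path):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((host, 80))
        self.buffer = ('GET %s HTTP/1.0\r\nHost: %s\r\n\r\n'
                       % (path, host)).encode('ascii')
        self.data = []

    def handle_connect(self):
        pass

    def handle_close(self):
        self.close()

    def handle_read(self):
        # collect whatever the server sends back
        self.data.append(self.recv(8192))

    def writable(self):
        # only ask for write events while the request is unsent
        return len(self.buffer) > 0

    def handle_write(self):
        sent = self.send(self.buffer)
        self.buffer = self.buffer[sent:]

client = HTTPClient('hostname1.com', '/')
asyncore.loop()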

Another point: I don't know whether I'm using a blocking or a non-blocking socket, which of the two would be better for performance, or how to implement either one.

Please be specific about your experiences; I don't intend to import yet another library just to do HTTP GET, so I want to code my own tiny library.

Any help appreciated, thanks.


Do this.

Use multiprocessing. http://docs.python.org/library/multiprocessing.html.

  1. Write a worker Process which puts all of the URLs into a Queue.

  2. Write a worker Process which gets a URL from the Queue and does a GET, saving a file and putting the file information into another Queue. You'll probably want multiple copies of this Process; you'll have to experiment to find the correct number.

  3. Write a worker Process which reads file information from a Queue and does whatever it is you're trying to do. (A sketch of this pipeline follows below.)
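
A minimal sketch of that pipeline (Python 2.7; the worker functions and URLs are illustrative, and saving to a file is reduced to recording the body size to keep it short):

import multiprocessing
import urllib2

def producer(urls, url_queue, n_fetchers):
    # Step 1: push every URL into the queue, then one sentinel per fetcher.
    for url in urls:
        url_queue.put(url)
    for _ in range(n_fetchers):
        url_queue.put(None)

def fetcher(url_queue, info_queue):
    # Step 2: GET each URL and pass on a summary of what was fetched.
    while True:
        url = url_queue.get()
        if url is None:
            info_queue.put(None)
            break
        body = urllib2.urlopen(url, timeout=5).read()
        info_queue.put((url, len(body)))

def consumer(info_queue, n_fetchers):
    # Step 3: process the fetch results (here: just print them).
    done = 0
    while done < n_fetchers:
        item = info_queue.get()
        if item is None:
            done += 1
            continue
        url, size = item
        print url, size

if __name__ == '__main__':
    urls = ['http://hostname1.com/page%d' % i for i in range(10)]
    n_fetchers = 4  # tune this number experimentally
    url_queue = multiprocessing.Queue()
    info_queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=producer,
                                     args=(urls, url_queue, n_fetchers)),
             multiprocessing.Process(target=consumer,
                                     args=(info_queue, n_fetchers))]
    procs += [multiprocessing.Process(target=fetcher,
                                      args=(url_queue, info_queue))
              for _ in range(n_fetchers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()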


I finally found a path that solved my problems. I was using Python 3 for my project and my only option was pycurl, so I had to port my project back to the Python 2.7 series.

Using pycurl, I gained:

- Consistent responses to my requests (my script actually has to deal with a minimum of 10k URLs).
- With the ThreadPool class I am receiving responses as fast as my system can handle them (the received data is processed later, so multiprocessing is not much of an option here).

I tried httplib2 first, but I realized that it is not as solid on Python 3 as it is on Python 2. By switching to pycurl I lost caching support.
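
To illustrate what I mean (a simplified sketch, not my actual code; the function name is made up): reusing a single Curl handle inside each worker thread keeps connections to the same host alive between requests, and each thread should own its own handle since libcurl easy handles are not thread-safe.

import pycurl
from io import BytesIO

def fetch_all(urls):
    # One reused handle per thread: libcurl keeps the connection to the
    # same host open between perform() calls.
    curl = pycurl.Curl()
    results = []
    for url in urls:
        buf = BytesIO()
        curl.setopt(pycurl.URL, url)
        curl.setopt(pycurl.WRITEFUNCTION, buf.write)
        curl.setopt(pycurl.FOLLOWLOCATION, 1)
        curl.setopt(pycurl.TIMEOUT, 5)
        curl.perform()
        results.append((url, curl.getinfo(pycurl.RESPONSE_CODE), buf.getvalue()))
    curl.close()
    return results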

Final conclusion: when it comes to HTTP communication, one may need a tool like (py)curl at their disposal. It is a lifesaver, especially when dealing with loads of URLs (try it sometime for fun: you will get lots of weird responses from them).

Thanks for the replies, folks.
