Does a multithreaded crawler in Python really speed things up?

2022-12-30 19:33 问答作者：

Was looking to write a little web crawler in python. I was starting to investigate writing it as a multithreaded script, one pool of threads downloading an开发者_运维技巧d one pool processing results. Due to the GIL would it actually do simultaneous downloading? How does the GIL affect a web crawler? Would each thread pick some data off the socket, then move on to the next thread, let it pick some data off the socket, etc..?

Basically I'm asking is doing a multi-threaded crawler in python really going to buy me much performance vs single threaded?

thanks!

The GIL is not held by the Python interpreter when doing network operations. If you are doing work that is network-bound (like a crawler), you can safely ignore the effects of the GIL.

On the other hand, you may want to measure your performance if you create lots of threads doing processing (after downloading). Limiting the number of threads there will reduce the effects of the GIL on your performance.

Look at how scrapy works. It can help you a lot. It doesn't use threads, but can do multiple "simultaneous" downloading, all in the same thread.

If you think about it, you have only a single network card, so parallel processing can't really help by definition.

What scrapy does is just not wait around for the response of one request before sending another. All in a single thread.

When it comes to crawling you might be better off using something event-based such as Twisted that uses non-blocking asynchronous socket operations to fetch and return data as it comes, rather than blocking on each one.

Asynchronous network operations can easily be and usually are single-threaded. Network I/O almost always has higher latency than that of CPU because you really have no idea how long a page is going to take to return, and this is where async shines because an async operation is much lighter weight than a thread.

Edit: Here is a simple example of how to use Twisted's getPage to create a simple web crawler.

Another consideration: if you're scraping a single website and the server places limits on the frequency of requests your can send from your IP address, adding multiple threads may make no difference.

Yes, multithreading scraping increases the process speed significantly. This is not a case where GIL is an issue. You are losing a lot of idle CPU and unused bandwidth waiting for a request to finish. If the web page you are scraping is in your local network (a rare scraping case) then the difference between multithreading and single thread scraping can be smaller.

You can try the benchmark yourself playing with one to "n" threads. I have written a simple multithreaded crawler on Discovering Web Resources and I wrote a related article on Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website. You can select how many threads to use changing the NWORKERS class variable in FocusedWebCrawler.

继续阅读：multithreading python

Does a multithreaded crawler in Python really speed things up?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？