
Designing a multi-process spider in Python

I'm working on a multi-process spider in Python. It should start by scraping one page for links and work from there. Specifically, the top-level page contains a list of categories, the second-level pages list the events in those categories, and the final, third-level pages list the participants in those events. I can't predict how many categories, events or participants there'll be.

I'm at a bit of a loss as to how best to design such a spider, and in particular, how to know when it's finished crawling (it's expected to keep going until it has discovered and retrieved every relevant page).

Ideally, the first scrape would be synchronous and everything else asynchronous, to maximise parallel parsing and writing to the DB, but I'm stuck on how to figure out when the crawling is finished.

How would you suggest I structure the spider, in terms of parallel processes and particularly the above problem?


You might want to look into Scrapy, an asynchronous web scraper built on Twisted. For your task, the XPath rules for the spider would be pretty easy to define!
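
As a rough sketch of what that could look like (the spider name, start URL and XPath expressions below are made up, since I don't know your markup), a three-level Scrapy spider is just three callbacks chained together with response.follow(). A nice side effect is that Scrapy stops by itself once there are no pending requests left, which also answers your "when is it finished" question:

```python
import scrapy

class EventSpider(scrapy.Spider):
    name = "events"
    # Hypothetical seed URL: the top-level page listing the categories.
    start_urls = ["http://example.com/categories"]

    def parse(self, response):
        # Level 1: follow every category link (the XPath is a placeholder).
        for href in response.xpath('//a[@class="category"]/@href').getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Level 2: follow every event link within the category.
        for href in response.xpath('//a[@class="event"]/@href').getall():
            yield response.follow(href, callback=self.parse_event)

    def parse_event(self, response):
        # Level 3: emit one item per participant on the event page.
        for name in response.xpath('//li[@class="participant"]/text()').getall():
            yield {"event": response.url, "participant": name.strip()}
```

You can run it with `scrapy runspider events_spider.py -o participants.json` and hang the DB writes off an item pipeline.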

Good luck!

(If you really want to do it yourself, maybe consider having a small sqlite db that keeps track of whether each page has been hit or not... or if it's a reasonable size, just do it in memory... Twisted in general might be your friend for this.)
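
To make that concrete, here's roughly what the "have I hit this page already?" bookkeeping could look like with sqlite. The table name and schema are my own invention, and with multiple processes you'd want one connection per worker, or a single process owning the check:

```python
import sqlite3

conn = sqlite3.connect("spider.db")
conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

def mark_seen(url):
    """Return True if the url was new, False if it has already been hit."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO seen (url) VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError:  # primary-key clash: already visited
        return False
```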


I presume you are putting items to visit in a queue, exhausting the queue with workers, and the workers find new items to visit and add them to the queue.

It's finished when all the workers are idle, and the queue of items to visit is empty.

If the workers call the queue's task_done() method for each item they process, the main thread can join() the queue to block until every item has been handled.
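
As a rough sketch of that design with multiprocessing.JoinableQueue (the fetching and link extraction below are deliberately naive placeholders rather than your real parsers, and there's no dedup): join() only returns once every put() has been matched by a task_done(), including items the workers themselves added, so it detects the end of the crawl for you.

```python
import multiprocessing as mp
import re
import urllib.request

def fetch_links(url):
    """Download a page and return the hrefs found in it (very naive)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return re.findall(r'href="([^"]+)"', html)

def worker(queue):
    while True:
        url, depth = queue.get()
        try:
            links = fetch_links(url)
            if depth < 3:                 # categories -> events -> participants
                for link in links:
                    queue.put((link, depth + 1))
            else:
                pass                      # leaf level: parse participants, write to DB
        except Exception:
            pass                          # a real spider should log/retry here
        finally:
            queue.task_done()             # this item is fully handled

if __name__ == "__main__":
    queue = mp.JoinableQueue()
    queue.put(("http://example.com/categories", 1))   # hypothetical seed URL

    for _ in range(4):                    # a handful of worker processes
        mp.Process(target=worker, args=(queue,), daemon=True).start()

    # Blocks until the unfinished-task count hits zero, i.e. the crawl is done;
    # the daemon workers are killed when the main process exits.
    queue.join()
```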

