
Organizing pools for massive download of multiple URLs

I'm developing a Polish blogosphere monitoring website and I'm looking for "best practices" for handling massive content downloads in Python.

Here is a sample scheme of the workflow:

Description:

I have a categorized database of RSS feeds (around 1000). Every hour or so I should check the feeds for newly posted items. If there are any, I should analyze each new item. The analyze process handles the metadata of each document and also downloads every image found inside.

Simplified single-threaded version of the code:

for url, etag, l_mod in rss_urls:
    rss_feed = process_rss(url, etag, l_mod) # Read url with the last etag, l_mod values
    if not rss_feed:
        continue

    for new_item in rss_feed: # Iterate over the *new* items in the feed
        element = fetch_content(new_item) # Direct HTTP request, download the HTML source
        if not element:
            continue

        images = extract_images(element)
        goodImages = []
        for img in images:
            if img_qualify(img): # Download and analyze the image; keep it if it could be used as a thumbnail
                goodImages.append(img)

So I iterate through the RSS feeds and download only the feeds with new items, then download each new item from a feed, then download and analyze each image in each item.

HTTP requests appear at the following stages:

- downloading the RSS XML document
- downloading the x items found in the RSS feed
- downloading all images of each item

I've decided to try the Python gevent (www.gevent.org) library to handle downloading content from multiple URLs.

What I want to gain as a result:

- the ability to limit the number of external HTTP requests
- the ability to download all the listed content items in parallel (see the sketch below)
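
Here is a minimal sketch of what I mean, assuming gevent's monkey patching and a fixed-size Pool; the feed URLs and the fetch helper are made up for illustration (Python 2-style urllib2/print):

from gevent import monkey
monkey.patch_all()  # make blocking socket calls cooperative

import urllib2
from gevent.pool import Pool

MAX_REQUESTS = 20          # hard cap on simultaneous HTTP requests
pool = Pool(MAX_REQUESTS)  # pool.spawn() blocks once the cap is reached

def fetch(url):
    try:
        return url, urllib2.urlopen(url, timeout=10).read()
    except Exception:
        return url, None

urls = ['http://example.com/feed1.xml', 'http://example.com/feed2.xml']

jobs = [pool.spawn(fetch, url) for url in urls]
pool.join()  # wait until every spawned job has finished

for job in jobs:
    url, body = job.value
    print url, 'OK' if body else 'failed'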

What is the best way to do it?

I'm not sure, because I'm completely new to parallel programming (well, these async requests probably have nothing to do with parallel programming at all), and I have no idea yet how such tasks are done in the mature world.

The only idea that comes to my mind is the following technique:

- Run the processing script via a cronjob every 45 minutes.
- At the very beginning, try to lock a file with the process pid written inside. If locking fails, check the process list for that pid. If the pid is not found, the previous run probably failed at some point and it is safe to start a new one.
- Via a wrapper around a gevent pool, run the download task for the RSS feeds; at every stage (new items found) add a new job to the queue to download the item, and for every item downloaded add jobs for image downloading (sketched below).
- Every few seconds, check the state of the currently running jobs and start a new job from the queue, in FIFO order, whenever free slots are available.
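
For the queue part, this is roughly what I imagine, assuming gevent's JoinableQueue and a fixed number of worker greenlets; the handle() branches are just placeholders for my real process_rss / fetch_content / img_qualify logic:

from gevent import monkey
monkey.patch_all()   # needed so the real download code yields on socket I/O

import gevent
from gevent.queue import JoinableQueue

NUM_WORKERS = 20          # number of free "slots", i.e. worker greenlets
tasks = JoinableQueue()   # unbounded FIFO queue of (kind, payload) jobs

def handle(kind, payload):
    if kind == 'feed':
        # download the RSS document; enqueue an 'item' job per new item
        for item_url in []:                 # placeholder for new items
            tasks.put(('item', item_url))
    elif kind == 'item':
        # download the HTML; enqueue an 'image' job per image found
        for img_url in []:                  # placeholder for extracted images
            tasks.put(('image', img_url))
    elif kind == 'image':
        pass                                # download and qualify the image

def worker():
    while True:
        kind, payload = tasks.get()
        try:
            handle(kind, payload)
        finally:
            tasks.task_done()

for _ in range(NUM_WORKERS):
    gevent.spawn(worker)

for feed_url in ['http://example.com/feed.xml']:   # placeholder feed list
    tasks.put(('feed', feed_url))

tasks.join()   # returns once every job, including ones enqueued later, is done

Because new jobs only go onto the queue (never into a bounded pool from inside a running job), the worker count alone limits the number of concurrent HTTP requests.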

Sounds OK to me, however maybe this kind of task already has some "best practice" and I'm reinventing the wheel. That's why I'm posting my question here.

Thx!


This approach sounds fine on an initial read. This example shows how to limit concurrency: https://bitbucket.org/denis/gevent/src/tip/examples/dns_mass_resolve.py
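
For what it's worth, the essential pattern is just a fixed-size Pool around the blocking calls, optionally wrapped in an overall gevent.Timeout so a whole batch can't hang. A rough sketch in the same spirit (not a copy of that file; hostnames and numbers are made up, Python 2-style print):

from gevent import monkey
monkey.patch_all()   # cooperative DNS/socket calls

import socket
import gevent
from gevent.pool import Pool

pool = Pool(10)   # at most 10 lookups in flight at any moment

def resolve(host):
    try:
        print host, socket.gethostbyname(host)
    except socket.error:
        print host, 'failed'

hosts = ['python.org', 'gevent.org', 'wikipedia.org']

with gevent.Timeout(10, False):   # silently stop waiting after 10 seconds
    for host in hosts:
        pool.spawn(resolve, host)
    pool.join()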

