
Organizing pools for massive download of multiple URLs

I'm developing a Polish blogosphere monitoring website and I'm looking for "best practices" for handling massive content downloads in Python.

Here is a sample scheme of the workflow:

Description:

I have a categorized database of RSS feeds (around 1000). Every hour or so I should check the feeds for newly posted items. If there are any, I should analyze each new item. The analyze process handles the metadata of each document and also downloads every image found inside.

Simplified single-threaded version of the code:

for url, etag, l_mod in rss_urls:
    rss_feed = process_rss(url, etag, l_mod) # Read url with the last etag, l_mod values
    if not rss_feed:
        continue

    for new_item in rss_feed: # Iterate over the *new* items in the feed
        element = fetch_content(new_item) # Direct HTTP request, download the HTML source
        if not element:
            continue

        images = extract_images(element)
        goodImages = []
        for img in images:
            if img_qualify(img): # Download and analyze the image; keep it if it could be used as a thumbnail
                goodImages.append(img)

So I iterate through the RSS feeds and download only the feeds with new items, then download each new item from a feed, then download and analyze each image in each item.

HTTP requests appear at the following stages:

- downloading the RSS XML document
- downloading the x items found in the RSS feed
- downloading all images of each item

I've decided to try the Python gevent (www.gevent.org) library to handle downloading content from multiple URLs.

What I want to gain as a result:

- the ability to limit the number of external HTTP requests
- the ability to download all the listed content items in parallel (see the sketch below)
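
Here is a minimal sketch of what I mean, assuming gevent's monkey patching and a fixed-size Pool; the feed URLs and the fetch helper are made up for illustration (Python 2-style urllib2/print):

from gevent import monkey
monkey.patch_all()  # make blocking socket calls cooperative

import urllib2
from gevent.pool import Pool

MAX_REQUESTS = 20          # hard cap on simultaneous HTTP requests
pool = Pool(MAX_REQUESTS)  # pool.spawn() blocks once the cap is reached

def fetch(url):
    try:
        return url, urllib2.urlopen(url, timeout=10).read()
    except Exception:
        return url, None

urls = ['http://example.com/feed1.xml', 'http://example.com/feed2.xml']

jobs = [pool.spawn(fetch, url) for url in urls]
pool.join()  # wait until every spawned job has finished

for job in jobs:
    url, body = job.value
    print url, 'OK' if body else 'failed'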

What is the best way to do it?

I'm not sure, because I'm completely new to parallel programming (well, these async requests probably have nothing to do with parallel programming at all), and I have no idea yet how such tasks are done in the mature world.

The only idea that comes to my mind is the following technique:

- Run the processing script via a cronjob every 45 minutes.
- At the very beginning, try to lock a file with the process pid written inside. If locking fails, check the process list for that pid. If the pid is not found, the previous run probably failed at some point and it is safe to start a new one.
- Via a wrapper around a gevent pool, run the download task for the RSS feeds; at every stage (new items found) add a new job to the queue to download the item, and for every item downloaded add jobs for image downloading (sketched below).
- Every few seconds, check the state of the currently running jobs and start a new job from the queue, in FIFO order, whenever free slots are available.
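
For the queue part, this is roughly what I imagine, assuming gevent's JoinableQueue and a fixed number of worker greenlets; the handle() branches are just placeholders for my real process_rss / fetch_content / img_qualify logic:

from gevent import monkey
monkey.patch_all()   # needed so the real download code yields on socket I/O

import gevent
from gevent.queue import JoinableQueue

NUM_WORKERS = 20          # number of free "slots", i.e. worker greenlets
tasks = JoinableQueue()   # unbounded FIFO queue of (kind, payload) jobs

def handle(kind, payload):
    if kind == 'feed':
        # download the RSS document; enqueue an 'item' job per new item
        for item_url in []:                 # placeholder for new items
            tasks.put(('item', item_url))
    elif kind == 'item':
        # download the HTML; enqueue an 'image' job per image found
        for img_url in []:                  # placeholder for extracted images
            tasks.put(('image', img_url))
    elif kind == 'image':
        pass                                # download and qualify the image

def worker():
    while True:
        kind, payload = tasks.get()
        try:
            handle(kind, payload)
        finally:
            tasks.task_done()

for _ in range(NUM_WORKERS):
    gevent.spawn(worker)

for feed_url in ['http://example.com/feed.xml']:   # placeholder feed list
    tasks.put(('feed', feed_url))

tasks.join()   # returns once every job, including ones enqueued later, is done

Because new jobs only go onto the queue (never into a bounded pool from inside a running job), the worker count alone limits the number of concurrent HTTP requests.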

Sounds OK to me, however maybe this kind of task already has some "best practice" and I'm reinventing the wheel. That's why I'm posting my question here.

Thx!


This approach sounds fine on an initial read. This example shows how to limit concurrency: https://bitbucket.org/denis/gevent/src/tip/examples/dns_mass_resolve.py
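
For what it's worth, the essential pattern is just a fixed-size Pool around the blocking calls, optionally wrapped in an overall gevent.Timeout so a whole batch can't hang. A rough sketch in the same spirit (not a copy of that file; hostnames and numbers are made up, Python 2-style print):

from gevent import monkey
monkey.patch_all()   # cooperative DNS/socket calls

import socket
import gevent
from gevent.pool import Pool

pool = Pool(10)   # at most 10 lookups in flight at any moment

def resolve(host):
    try:
        print host, socket.gethostbyname(host)
    except socket.error:
        print host, 'failed'

hosts = ['python.org', 'gevent.org', 'wikipedia.org']

with gevent.Timeout(10, False):   # silently stop waiting after 10 seconds
    for host in hosts:
        pool.spawn(resolve, host)
    pool.join()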

