
Downloading RSS using python

I have a list of 200 RSS feeds that I have to download. It's a continuous process: I have to download every post, nothing can be missing, but also no duplicates. So would best practice be to remember the last update of each feed and check it for changes at an x-hour interval? And how do I handle the downloader being restarted? The downloader should remember what was already downloaded and not download it again...

Is this implemented somewhere already? Or any tips or articles? Thanks


Typically this is what you'd want to do:

  • Fetch the feeds periodically, parse them using the Universal Feed Parser (feedparser), and store the entries somewhere.
  • Use ETag and If-Modified-Since headers when fetching feeds to avoid parsing feeds that have not changed since your last fetch. You'll have to keep the ETag and Last-Modified values received during the last fetch of each feed.
  • To avoid duplicates, store each entry with its unique guid, then check whether an entry with the same guid is already stored. (Fall back to the entry link, then to a hash of title + feed URL, to uniquely identify an entry when the feed entries have no guid.)
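The guid-with-fallback scheme from the last bullet can be sketched like this. `entry_key` and `is_new` are illustrative names, and `entry` is assumed to be the dict-like object feedparser produces for each feed entry:

```python
import hashlib

def entry_key(entry, feed_url):
    """Return a stable unique key for a feed entry.

    Prefers the entry's guid (exposed as 'id' by feedparser), falls back
    to the entry link, and finally to a hash of title + feed URL.
    """
    if entry.get("id"):        # the <guid> element, when the feed provides one
        return entry["id"]
    if entry.get("link"):      # fall back to the entry's permalink
        return entry["link"]
    # last resort: hash the title together with the feed URL
    raw = (entry.get("title", "") + feed_url).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

seen = set()  # in a real downloader this would be a persistent store

def is_new(entry, feed_url):
    """Record the entry's key and report whether it was seen before."""
    key = entry_key(entry, feed_url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

Since the `seen` set here lives in memory, a restart would lose it; to survive restarts (as the question asks), the keys need to go into a database or a shelve file instead.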


You can use feedparser to parse the feeds and store in a database the maximal published time per feed.

For a simple database you can use shelve.
