Feedparser - retrieve old messages from Google Reader

2022-12-11 01:07 问答作者：

I'm using the feedparser library in python to retrieve news from a local newspaper (my intent is to do Natural Language Processing over this corpus) and would like to be able to retrieve many past entries from the RSS feed.

I'm not very acquainted with the technical issues of RSS, but I think this should be possible (I can see that, e.g., Google Reader and Feedly can 开发者_JS百科do this ''on demand'' as I move the scrollbar).

When I do the following:

import feedparser

url = 'http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml'
feed = feedparser.parse(url)
for post in feed.entries:
   title = post.title

I get only a dozen entries or so. I was thinking about hundreds. Maybe all entries in the last month, if possible. Is it possible to do this only with feedparser?

I intend to get from the rss feed only the link to the news item and parse the full page with BeautifulSoup to obtain the text I want. An alternate solution would be a crawler that follows all local links in the page to get a lot of news items, but I want to avoid that for now.

One solution that appeared is to use the Google Reader RSS cache:

http://www.google.com/reader/atom/feed/http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml?n=1000

But to access this I must be logged in to Google Reader. Anyone knows how I do that from python? (I really don't know a thing about web, I usually only mess with numerical calculus).

You're only getting a dozen entries or so because that's what the feed contains. If you want historic data you will have to find a feed/database of said data.

Check out this ReadWriteWeb article for some resources on finding open data on the web.

Note that Feedparser has nothing to do with this as your title suggests. Feedparser parses what you give it. It can't find historic data unless you find it and pass it into it. It is simply a parser. Hope that clears things up! :)

To expand on Bartek's answer: You could also start storing all of the entries in the feed that you've already seen, and build up your own historical archive of the feed's content. This would delay your ability to start using it as a corpus (because you'd have to do this for a month to build up a collection of a month's worth of entries), but you wouldn't be dependent on anyone else for the data.

I may be mistaken, but I'm pretty sure that's how Google Reader can go back in time: They have each feed's past entries stored somewhere.

继续阅读：feedparser google-reader python rss

Feedparser - retrieve old messages from Google Reader

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？