Best algorithm to combine multiple RSS feeds using Python
I am writing a Python script to combine 20+ RSS feeds. I would like to use a custom solution instead of feedjack or planetfeed.
I use feedparser to parse the feeds and MySQL to cache them.
The problem I am running into is determining which feed items have already been cached and which haven't.
Some pseudo code for what I have tried:
- create a list of all feed items
- get the date of last item cached from db
- check which items in my list have a date greater than my item from the db and return this filtered list
- sort the returned filtered list by date the item was created
- add new items to the db
I feel like this would work, but my problem is that not all of the dates on the RSS feeds I am using are correct. Sometimes a publisher, for whatever reason, will have feed items with dates in the future. If such a future date gets added to the db, it will always be greater than the dates of the items in my list. The comparison then stops working and no new items get added to the db. I would like to come up with another solution that does not rely on the publishers' dates.
How would some of you pros do this? Assume you have to combine multiple RSS feeds, save them to a MySQL db and then return them ordered by date. I'm just looking for pseudo code to give me an idea of the best way to do this.
Thanks for your help.
Depending on how often the feeds are updated and how often you check, you could simply fix broken dates before adding them to the database: if a date is in the future, reset it to the current time.
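The date fix above can be sketched roughly as follows. This is only an illustration: the function name clamp_published is made up here, and it assumes you pass in the UTC struct_time that feedparser puts in an entry's published_parsed field (or None when the feed gives no date).

```python
import calendar
from datetime import datetime, timezone


def clamp_published(published_parsed, now=None):
    """Return an aware UTC datetime for an entry, never later than `now`.

    `published_parsed` is assumed to be a UTC time tuple (as feedparser
    provides) or None. The name and signature are hypothetical.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if published_parsed is None:
        # No usable date in the feed: fall back to the fetch time.
        return now
    # calendar.timegm interprets the time tuple as UTC.
    timestamp = calendar.timegm(published_parsed)
    published = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    # A date in the future is replaced by the current time.
    return min(published, now)
```

Storing the clamped value means a bogus future date can never become the "latest cached date" that blocks every later comparison.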
Other than that, you'd have to use some sort of ID. RSS 2.0 has an optional guid element on each item, which feedparser exposes as the entry's id. If your feeds are kept in order, you can get the most recent cached ID, find that in the feed items list, and then add everything newer. If they're out of order, you'd have to check each one against your cache, and add it if it's missing.
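A minimal sketch of the "check each one against your cache" variant, which works whether or not the feeds are in order. The helper names entry_key and new_entries are hypothetical, and the fallback to the link or a hash of the title is an assumption for feeds that omit an ID. Entries are treated as plain dict-like objects, which is how feedparser entries behave.

```python
import hashlib


def entry_key(entry):
    """Pick a stable key for an entry: id, else link, else a content hash."""
    key = entry.get("id") or entry.get("link")
    if key:
        return key
    # Assumed fallback for feeds with neither id nor link.
    raw = (entry.get("title", "") + entry.get("summary", "")).encode("utf-8")
    return hashlib.sha1(raw).hexdigest()


def new_entries(entries, seen_keys):
    """Return the entries whose key is not already cached.

    `seen_keys` stands in for the keys already stored in the database.
    Duplicates inside the same batch are skipped as well.
    """
    seen = set(seen_keys)
    fresh = []
    for entry in entries:
        key = entry_key(entry)
        if key not in seen:
            seen.add(key)
            fresh.append(entry)
    return fresh
```

If the key is stored in a column with a UNIQUE index, MySQL's INSERT IGNORE could probably do the same filtering on the database side. For ordering, sorting on the time you first saw an item rather than the publisher's date would sidestep the bad-date problem.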