What are good techniques for retrieving a list of keywords for the top news stories of the day?
I am working on an application where I would like to retrieve a list of the day's top news stories from some source (such as the BBC) and parse these for keywords that I can match against my own tag data. There are obviously lots of web services and APIs out there, but which routes would you suggest?
One thing I was considering is periodically downloading the RSS feed of BBC News and parsing the content using the Yahoo term extractor. This seems like a good solution to me, but the term extractor is for non-commercial use only and my application is commercial.
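As a rough illustration of the RSS route without the Yahoo term extractor, here is a minimal Python sketch that pulls the item titles out of a feed and ranks words by raw frequency. This is a plain term-frequency count, not the Yahoo service's algorithm, so it will be much cruder (for example, it cannot pick out phrases such as "new york"). The stop-word list, the function names and the sample feed are all made up for illustration; in practice the feed text would come from a periodic download.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "as", "at", "for", "in", "of", "on", "the", "to", "with"}


def titles_from_rss(rss_xml):
    """Return the text of every <item><title> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    # ./channel/item/title skips the feed's own <channel><title>.
    return [t.text.strip() for t in root.findall("./channel/item/title") if t.text]


def keywords(titles, top_n=10):
    """Rank lower-cased words by how often they occur across all titles."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[a-z']+", title.lower()):
            if word not in STOP_WORDS and len(word) > 2:
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_n)]


# Made-up sample feed so the sketch runs without network access.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>Yacht hijack talks continue</title></item>
    <item><title>Flu cases rise as yacht crew return</title></item>
  </channel>
</rss>"""

if __name__ == "__main__":
    print(keywords(titles_from_rss(SAMPLE_RSS)))
```

The resulting word list could then be intersected with your own tags. A proper keyword extractor would also handle multi-word phrases and stemming, which this sketch ignores.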
YQL looks promising but I'm not sure how easy it will be to condense the data down to keywords.
All suggestions welcome, both for the news source and the keyword/tag extraction, and for both commercial and non-commercial uses.
Update:
Building on a suggestion from one of the answers, here's the YQL for grabbing the keywords from the top UK news stories on the BBC:
select content
from search.termextract
where context in (
select title
from rss
where url='http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml'
)
which returns something like:
<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="46" yahoo:created="2009-11-13T11:49:05Z" yahoo:lang="en-US" yahoo:updated="2009-11-13T11:49:05Z" yahoo:uri="http://query.yahooapis.com/v1/yql?q=select+content+from+search.termextract+where+context+in+%28select+title+from+rss+where+url%3D%27http%3A%2F%2Fnewsrss.bbc.co.uk%2Frss%2Fnewsonline_uk_edition%2Ffront_page%2Frss.xml%27+%29">
<results>
<Result xmlns="urn:yahoo:cate">new york</Result>
<Result xmlns="urn:yahoo:cate">bolt gun</Result>
<Result xmlns="urn:yahoo:cate">stalker</Result>
<Result xmlns="urn:yahoo:cate">russia</Result>
<Result xmlns="urn:yahoo:cate">moon</Result>
<Result xmlns="urn:yahoo:cate">hijack</Result>
<Result xmlns="urn:yahoo:cate">yacht</Result>
<Result xmlns="urn:yahoo:cate">balloon</Result>
<Result xmlns="urn:yahoo:cate">parents</Result>
<Result xmlns="urn:yahoo:cate">bruce forsyth</Result>
<Result xmlns="urn:yahoo:cate">flu</Result>
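For matching the returned terms against your own tags, a response of this shape can be reduced to a plain list with Python's standard library. This is only a sketch: the function names are hypothetical, the sample is a shortened copy of the response with its closing tags assumed, and the case-insensitive comparison is just one possible matching rule.

```python
import xml.etree.ElementTree as ET

# Namespace carried by the <Result> elements in the response.
TERM_NS = "{urn:yahoo:cate}"


def extract_terms(response_xml):
    """Return the text of every term-extraction <Result>, in document order."""
    root = ET.fromstring(response_xml)
    return [el.text.strip() for el in root.iter(TERM_NS + "Result") if el.text]


def matching_tags(terms, my_tags):
    """Return the extracted terms that also appear in my_tags (case-insensitive)."""
    wanted = {tag.lower() for tag in my_tags}
    return [term for term in terms if term.lower() in wanted]


# Shortened copy of the response; the closing tags are an assumption.
SAMPLE_RESPONSE = """<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="3">
<results>
<Result xmlns="urn:yahoo:cate">new york</Result>
<Result xmlns="urn:yahoo:cate">russia</Result>
<Result xmlns="urn:yahoo:cate">flu</Result>
</results>
</query>"""

if __name__ == "__main__":
    terms = extract_terms(SAMPLE_RESPONSE)
    print(matching_tags(terms, ["Russia", "Flu", "Football"]))
```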
Ultimately, though, I don't think I can use this within a commercial app because of the restrictions on the term extraction service.
You say YQL looks promising, so I'm sure you've investigated this already. You can use two YQL services together: search.termextract will give you the keywords from the results of a query made with search.news:
select * from search.termextract where context in (select abstract from search.news where query="election")
You'd have to experiment with the where clause of the inner query to restrict it to the latest news.
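If you want to call these queries from code, the yahoo:uri attribute in the question's sample response suggests the query is passed as a URL-encoded q parameter. Here is a small Python sketch that builds such a URL. The endpoint is copied from that attribute, the helper names are hypothetical, and stripping double quotes from the topic is a naive stand-in for proper escaping. Nothing here makes a network request.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the yahoo:uri attribute of the sample response.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/yql"


def yql_url(query):
    """Return the request URL for a YQL statement, with the statement URL-encoded."""
    return YQL_ENDPOINT + "?" + urlencode({"q": query})


def news_terms_query(topic):
    """Build the termextract-over-search.news statement for a topic (hypothetical helper)."""
    safe_topic = topic.replace('"', "")  # naive: drop quotes rather than escape them
    return (
        "select * from search.termextract where context in "
        '(select abstract from search.news where query="%s")' % safe_topic
    )


if __name__ == "__main__":
    print(yql_url(news_terms_query("election")))
```

Fetching the URL and parsing the XML it returns would be separate steps, and the 5,000 queries per day limit would need to be respected when polling.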
From here: "The Term Extraction service is limited to 5,000 queries per IP address per day and to noncommercial use. See information on rate limiting."