Problem extracting text from RSS feeds
I am new to the world of Ruby and Rails.
I have watched Railscasts episode 190 and have just started playing with it. I used SelectorGadget to find the CSS and XPath selectors.
I have the following code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.telegraph.co.uk/sport/football/rss"
doc = Nokogiri::HTML(open(url))
doc.xpath('//a').each do |paragraph|
  puts paragraph.text
end
When I extracted text from a normal HTML page with CSS, the extracted text was printed to the console.
But when I try the same thing, with either CSS or XPath, on the RSS feed at the URL in the code above, I don't get any output.
How do you extract text from RSS feeds?
I also have another silly question.
Is there a way to extract text from two different feeds and display it on the console?
Something like:
url1 = "http://www.telegraph.co.uk/sport/football/rss"
url2 = "http://www.telegraph.co.uk/sport/cricket/rss"
Looking forward to your help and suggestions.
Thank You
Gautam
If you are processing feeds, you should use Feedzirra:
http://railscasts.com/episodes/168-feed-parsing
http://github.com/pauldix/feedzirra
Works like a charm.
Good luck!
An RSS page is not an HTML document; it is XML, so you should use Nokogiri::XML(open(url)).
Then view the source of the RSS page. There are no <a> elements.
All links in the document are created with the <link> tag:
<link>http://www.telegraph.co.uk/sport/football/world-cup-2010/teams/france/7769203/France-2-Costa-Rica-1-match-report.html</link>
Links to each article are also duplicated in the <guid> tag, because an article's ID in RSS is its URL.
<guid>http://www.telegraph.co.uk/sport/football/world-cup-2010/teams/france/7769203/France-2-Costa-Rica-1-match-report.html</guid>
So, if you need all the links in the document, use:
url = "http://www.telegraph.co.uk/sport/football/rss"
doc = Nokogiri::XML(open(url))
doc.xpath('//link').each do |link|
  puts link.text
end
If you need only the links to articles, use doc.xpath('//guid').
As for multiple feeds, just use a loop:
feeds = ["http://www.telegraph.co.uk/sport/football/rss", "http://www.telegraph.co.uk/sport/cricket/rss"]
feeds.each do |url|
  # the same parsing code as before goes here
end
Do you have these installed? Nokogiri builds against them: libxml2 libxml2-dev libxslt libxslt-dev
No need for the loop... simply
puts doc.xpath('//link/text()')
will print all link text.