
Problem extracting text from RSS feeds

I am new to the world of Ruby and Rails.

I watched Railscasts episode 190 and have just started playing with Nokogiri. I used SelectorGadget to find the CSS selectors and XPath expressions.

I have the following code:

require 'rubygems'  
require 'nokogiri'  
require 'open-uri'  

url = "http://www.telegraph.co.uk/sport/football/rss"  
doc = Nokogiri::HTML(open(url))  
doc.xpath('//a').each do |paragraph|
  puts paragraph.text
end

When I extracted text from a normal HTML page with CSS selectors, the extracted text was printed to the console.

But when I try the same thing, with either CSS or XPath, on the RSS feed at the URL in the code above, I don't get any output.

How do you extract text from RSS feeds?

I also have another silly question.

Is there a way to extract text from two different feeds and display it on the console? Something like:

url1 = "http://www.telegraph.co.uk/sport/football/rss"
url2 = "http://www.telegraph.co.uk/sport/cricket/rss"

Looking forward to your help and suggestions.

Thank You

Gautam


If you are processing feeds, you should use Feedzirra:

http://railscasts.com/episodes/168-feed-parsing

http://github.com/pauldix/feedzirra

Works like a charm.

Good luck!


An RSS page is not an HTML document; it is XML, so you should parse it with Nokogiri::XML(open(url)).

Then view the source of the RSS page. There are no <a> elements, which is why '//a' matches nothing.

All links in the document are written with the <link> tag:

<link>http://www.telegraph.co.uk/sport/football/world-cup-2010/teams/france/7769203/France-2-Costa-Rica-1-match-report.html</link> 

The link to each article is also duplicated in a <guid> tag, because in this feed an article's ID is its URL.

<guid>http://www.telegraph.co.uk/sport/football/world-cup-2010/teams/france/7769203/France-2-Costa-Rica-1-match-report.html</guid> 

So, if you need all links in document, use:

url = "http://www.telegraph.co.uk/sport/football/rss"
# URI.open comes from open-uri; on Ruby 3.0+ a bare open(url) no longer accepts URLs.
doc = Nokogiri::XML(URI.open(url))
doc.xpath('//link').each do |link|
  puts link.text
end

If you need only links to articles, use doc.xpath('//guid')

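As for multiple feeds, just loop over the URLs:

feeds = ["http://www.telegraph.co.uk/sport/football/rss", "http://www.telegraph.co.uk/sport/cricket/rss"]
feeds.each do |url|
  # same parsing code as for a single feed
  doc = Nokogiri::XML(URI.open(url))
  doc.xpath('//guid').each { |guid| puts guid.text }
end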


Do you have these installed? libxml2 libxml2-dev libxslt libxslt-dev


There is no need for an explicit loop. Simply:

puts doc.xpath('//link/text()')

will print the text of every link.
