Problem extracting text from RSS feeds

Posted: 2023-01-01 17:25

I am new to the world of Ruby and Rails.

I have watched Railscasts episode 190 and just started playing with it. I used SelectorGadget to find the CSS selectors and XPath expressions.

I have the following code:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://www.telegraph.co.uk/sport/football/rss"
doc = Nokogiri::HTML(open(url))
doc.xpath('//a').each do |paragraph|
  puts paragraph.text
end

When I extracted text from a normal HTML page with CSS selectors, the extracted text appeared on the console.

But when I try the same thing, with either CSS or XPath, on the RSS feed at the URL in the code above, I don't get any output.

How do you extract text from RSS feeds?

I also have another silly question.

Is there a way to extract text from two different feeds and display it on the console?

Something like:

url1 = "http://www.telegraph.co.uk/sport/football/rss"
url2 = "http://www.telegraph.co.uk/sport/cricket/rss"

Looking forward to your help and suggestions.

Thank You

Gautam

If you are processing feeds, you should use Feedzirra:

http://railscasts.com/episodes/168-feed-parsing

http://github.com/pauldix/feedzirra

Works like a charm.

Good luck!

An RSS page is not an HTML document; it is XML, so you should parse it with Nokogiri::XML instead of Nokogiri::HTML.

Then view the source of the RSS page. There are no <a> elements.

All links in the document are created with the <link> tag:

<link>http://www.telegraph.co.uk/sport/football/world-cup-2010/teams/france/7769203/France-2-Costa-Rica-1-match-report.html</link>

The link to each article is also duplicated in a <guid> tag, because an article's ID in this RSS feed is its URL.

<guid>http://www.telegraph.co.uk/sport/football/world-cup-2010/teams/france/7769203/France-2-Costa-Rica-1-match-report.html</guid>

So, if you need all links in document, use:

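require 'nokogiri'
require 'open-uri'

url = "http://www.telegraph.co.uk/sport/football/rss"
# URI.open is needed on Ruby 3.0+, where open-uri no longer patches Kernel#open
doc = Nokogiri::XML(URI.open(url))
doc.xpath('//link').each do |link|
  puts link.text
end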

If you need only links to articles, use doc.xpath('//guid')

As for multiple feeds, just use a loop:

feeds = ["http://www.telegraph.co.uk/sport/football/rss", "http://www.telegraph.co.uk/sport/cricket/rss"]
feeds.each do |url|
  # the same code as before goes here, using url
end

Do you have these installed? libxml2 libxml2-dev libxslt libxslt-dev

No need for the loop... simply

puts doc.xpath('//link/text()')

will print all link text.
