
Parsing an RSS item that has a colon in the tag with Ruby?

I'm trying to parse the info from an RSS feed that has this tag structure:

<dc:subject>foo bar</dc:subject>

using the built-in Ruby RSS library. Obviously, doing item.dc:subject is throwing errors, but I can't figure out any way to pull out that info. Is there any way to get this to work? Or is it possible with a different RSS library?


Tags with ':' in them are really XML tags with a namespace. I never had good results using the RSS module because the feed formats often don't meet the specs, causing the module to give up. I highly recommend using Nokogiri to parse the feed, whether it is RDF, RSS or ATOM.

Nokogiri can use either XPath accessors or CSS accessors, and both support namespaces. The last two lines below are equivalent:

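require 'nokogiri'
require 'open-uri'

# URI.open comes from open-uri; plain Kernel#open no longer fetches URLs on Ruby 3.0+
doc = Nokogiri::XML(URI.open('http://somehost.com/rss_feed'))
doc.at('//dc:subject').text  # XPath accessor
doc.at('dc|subject').text    # CSS accessor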

Nokogiri registers the namespaces declared on the root element for you. When the namespace is declared somewhere else, you'll need to pass the declaration to the XPath accessor:

doc.at('//dc:subject', 'dc' => 'link to dc declaration') 

See the "Namespaces" section of the Nokogiri documentation for more info.

Without a URL or a better sample I can't do more, but that should get you pointed in a better direction.

A couple of years ago I wrote a big RSS aggregator for my job using Nokogiri that handled RDF, RSS and ATOM. Ruby's RSS library wasn't up to the task but Nokogiri was awesome.

If you don't want to roll your own, Paul Dix's Feedzirra is a good gem for processing feeds.


The RSS module seems to be able to handle namespaced elements such as <dc:date>, like this:

feed.items.each { |item| puts "Date: #{item.dc_date}" }
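Following the same naming pattern, the question's <dc:subject> element should be reachable as dc_subject. Here is a minimal sketch, assuming the rss library is installed (it ships as a bundled gem on recent Ruby versions) and using a made-up inline feed:

```ruby
require 'rss'

# Hypothetical sample feed with the Dublin Core namespace declared on the root.
xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel>
      <title>Sample feed</title>
      <link>http://example.com/</link>
      <description>Sample description</description>
      <item>
        <title>First post</title>
        <link>http://example.com/1</link>
        <description>First description</description>
        <dc:subject>foo bar</dc:subject>
      </item>
    </channel>
  </rss>
XML

# The second argument turns validation off, so slightly off-spec feeds are less likely to be rejected.
feed = RSS::Parser.parse(xml, false)

# The dc: prefix becomes a dc_ method prefix on each item.
subjects = feed.items.map(&:dc_subject)
puts subjects.inspect
```

This only helps when the feed is well formed enough for the RSS parser to accept it; for messier feeds the Nokogiri approach above is more forgiving.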


I think item['dc:subject'] might work.
