Parsing an RSS item that has a colon in the tag with Ruby?
I'm trying to parse the info from an RSS feed that has this tag structure:
<dc:subject>foo bar</dc:subject>
using the built in Ruby RSS library. Obviously, doing item.dc:subject
开发者_运维技巧is throwing errors, but I can't figure out any way to pull out that info. Is there any way to get this to work? Or is it possible with a different RSS library?
Tags with ':' in them are really XML tags with a namespace. I never had good results using the RSS module because the feed formats often don't meet the specs, causing the module to give up. I highly recommend using Nokogiri to parse the feed, whether it is RDF, RSS or ATOM.
Nokogiri has the ability to use XPath accessors or CSS accessors, and, both support namespaces. The last two lines would be equivalent:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(open('http://somehost.com/rss_feed'))
doc.at('//dc:subject').text
doc.at('dc|subject').text
When dealing with namespaces you'll need to add the declaration to the XPath accessor:
doc.at('//dc:subject', 'dc' => 'link to dc declaration')
See the "Namespaces" section for more info.
Without a URL or a better sample I can't do more, but that should get you pointed in a better direction.
A couple years I wrote a big RSS aggregator for my job using Nokogiri that handled RDF, RSS and ATOM. Ruby's RSS library wasn't up to the task but Nokogiri was awesome.
If you don't want to roll your own, Paul Dix's Feedzirra is a good gem for processing feeds.
The RSS module seems to have the ability to do those XML namespace attributes, i.e. <dc:date>
like this:
feed.items.each do |item|
puts "Date: #{item.dc_date}"
end
I think item['dc:subject']
might work.
精彩评论