
How do I use Nokogiri::XML::Reader to parse large XML files?

I'm trying to use Ruby's Nokogiri to parse large (1 GB or more) XML files. I'm testing code on a smaller file, containing only 4 records available here. I'm using Nokogiri version 1.5.0, Ruby 1.8.7 on Ubuntu 10.10. Since I don't understand SAX very well, I'm trying Nokogiri::XML::Reader to start.

My first attempt, to retrieve the content of the PMID tag, looks like this:

#!/usr/bin/ruby
require "rubygems"
require "nokogiri"

file   = ARGV[0]
reader = Nokogiri::XML::Reader(File.open(file))
p      = []
reader.each do |node|
  if node.name == "PMID"
    p << node.inner_xml
  end
end

puts p.inspect

Here's what I hoped to see:

["21714156", "21693734", "21692271", "21692260"]

Here's what I actually saw:

["21714156", "", "21693734", "", "21692271", "", "21692260", ""]

It seems that, for some reason, my code is finding or generating an extra, empty PMID tag for every instance of PMID. Either that, or inner_xml does not work the way I thought.

I'd be grateful if anyone could confirm that my code and data generate the result shown and suggest where I'm going wrong.


Each element with content comes through the stream as two events: one when it opens and one when it closes. The opening event will have

node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

and the closing event will have

node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT

The empty strings you're seeing come from the element-closing events. Reader is a streaming pull parser rather than SAX proper, but the idea is the same: you're walking through the tree one node at a time, so you need the second event to tell you when you're going back up and closing an element.

You probably want something more like this:

reader.each do |node|
  if node.name == "PMID" && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    p << node.inner_xml
  end
end

Or perhaps:

reader.each do |node|
  next if node.name      != 'PMID'
  next if node.node_type != Nokogiri::XML::Reader::TYPE_ELEMENT
  p << node.inner_xml
end

Or some other variation on that.
