Can I write an XML reader that can cope with unclosed tags?

I'm parsing the Wikipedia XML dump using a REXML StreamListener. After a few million articles, it complains that it can't find a matching close tag, and skips the rest of the file.

Is there any way to get it to ignore the unclosed tag, and to resume parsing the stream after it?


The Nokogiri SAX mode is very similar to REXML's SAX (StreamListener) mode. Sample:

require 'nokogiri'

include Nokogiri

class PostCallbacks < XML::SAX::Document
  def start_element(element, attributes)
    if element == 'tag'
      # Process tag data here
    end
  end
end

parser = XML::SAX::Parser.new(PostCallbacks.new)
parser.parse_file("data.xml")

Nokogiri also has a Reader interface which yields every node, in case you don't like the SAX-style callback interface.

require 'nokogiri'

# xml may be a String or an IO object (for example, an open File)
reader = Nokogiri::XML::Reader(xml)
reader.each do |node|
  # node is an instance of Nokogiri::XML::Reader
  puts node.name
end

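The difference is that Nokogiri can recover from non-well-formed input better than pretty much any parser out there, thanks to the recover mode of the underlying libxml2. I believe recovery is on by default for Nokogiri's DOM parser; in SAX mode you may need to switch it on yourself through the parser context (ctx.recovery = true).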
