Repairing broken XML file - removing extra less-than/greater-than signs

I have a large XML file which in the middle contains the following:

<ArticleName>Article 1 <START  </ArticleName>

Obviously libxml and other XML libraries can't read this because the less-than sign opens a new tag which is never closed. My question is, is there anything I can do to fix issues like this automatically (preferably in Ruby)? The solution should of course work for any field which has an error like this. Someone said SAX parsing could do the trick but I'm not sure how that would work.


You could do a regular expression search-and-replace, looking for <(?=[^<>]*<) and replacing with &lt;.

In Ruby,

result = subject.gsub(/<(?=[^<>]*<)/, '&lt;')

The rationale is that you want to find each < that has no corresponding >. The regex therefore matches a < only if it is followed by another < with no > in between.
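As a rough sketch of how this behaves on the fragment from the question, the substitution can be wrapped in a small method. The name `escape_unclosed_lt` is made up for illustration and is not part of any library. It only handles stray `<` characters, not stray `>`, and it assumes the input contains no comments or CDATA sections with a raw `<` inside them.

```ruby
# Hypothetical helper: escapes every "<" that is followed by another "<"
# before any ">" appears, i.e. a "<" that never gets closed.
def escape_unclosed_lt(text)
  text.gsub(/<(?=[^<>]*<)/, '&lt;')
end

broken = '<ArticleName>Article 1 <START  </ArticleName>'
fixed  = escape_unclosed_lt(broken)
puts fixed
```

Well-formed tags are left alone because each of their < characters reaches a > before the next <.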

EDIT: Improved the regex by using a lookahead. I first thought Ruby didn't support lookahead, but it does; only lookbehind is missing (in Ruby 1.8, that is; Ruby 1.9 and later support it).


Nokogiri supports some options for handling bad XML. These might help:

http://rubyforge.org/pipermail/nokogiri-talk/2009-February/000066.html
http://nokogiri.org/tutorials/ensuring_well_formed_markup.html

I just messed around with the broken fragment and Nokogiri handles it very nicely:

#!/usr/bin/ruby

require 'rubygems'
require 'nokogiri'

doc = Nokogiri::XML('<?xml version="1.0"?><ArticleName>Article 1 <START  </ArticleName></xml>')
doc.to_s  # => "<?xml version=\"1.0\"?>\n<ArticleName>Article 1 <START/></ArticleName>\n"
doc.errors # => [#<Nokogiri::XML::SyntaxError: error parsing attribute name
