Repairing broken XML file - removing extra less-than/greater-than signs
I have a large XML file which in the middle contains the following:
<ArticleName>Article 1 <START </ArticleName>
Obviously libxml and other XML libraries can't read this because the less-than sign opens a new tag which is never closed. My question is, is there anything I can do to fix issues like this automatically (preferably in Ruby)? The solution should of course work for any field which has an error like this. Someone said SAX parsing could do the trick but I'm not sure how that would work.
You could do a regular expression search-and-replace, looking for <(?=[^<>]*<) and replacing with &lt; (the escaped entity, not a literal less-than sign).
In Ruby,
result = subject.gsub(/<(?=[^<>]*<)/, '&lt;')
The rationale is that you want to find every < that doesn't have a corresponding >. Therefore, the regex only matches a < if it is followed by another < without any > in between.
EDIT: Improved the regex by using lookahead. I first thought Ruby didn't support lookahead, but it does. Just not lookbehind (that applies to Ruby 1.8; Ruby 1.9 and later support lookbehind as well).
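To make the substitution concrete, here is a minimal sketch that applies the regex from this answer to the fragment in the question. The constant STRAY_LT and the helper escape_stray_lt are hypothetical names introduced only for illustration. It assumes the document contains no comments, CDATA sections, or attribute values with a literal < in them, and it does nothing about stray > characters.

```ruby
# Hypothetical names for illustration; only the regex comes from the answer.
# Matches a "<" that is followed by another "<" before any ">" appears.
STRAY_LT = /<(?=[^<>]*<)/

# Replace every unmatched "<" with the XML entity "&lt;".
def escape_stray_lt(xml)
  xml.gsub(STRAY_LT, '&lt;')
end

broken = '<ArticleName>Article 1 <START </ArticleName>'
puts escape_stray_lt(broken)
# prints: <ArticleName>Article 1 &lt;START </ArticleName>
```

Well-formed tags are left untouched because each of their < characters reaches a > before the next <. It is probably worth running this on a copy of the file and parsing the result before trusting it.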
Nokogiri supports some options for handling bad XML. These might help:
http://rubyforge.org/pipermail/nokogiri-talk/2009-February/000066.html
http://nokogiri.org/tutorials/ensuring_well_formed_markup.html
I just messed around with the broken fragment and Nokogiri handles it very nicely:
#!/usr/bin/ruby
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML('<?xml version="1.0"?><ArticleName>Article 1 <START </ArticleName></xml>')
doc.to_s # => "<?xml version=\"1.0\"?>\n<ArticleName>Article 1 <START/></ArticleName>\n"
doc.errors # => [#<Nokogiri::XML::SyntaxError: error parsing attribute name