Nokogiri HTML parse undefined method 'namespace_definitions' blows up on <o:p> tag
I have a Rails app that parses HTML using the nokogiri gem, version 1.4.0.
To parse and clean up the HTML fragment, I'm using this:
Nokogiri::HTML::DocumentFragment.parse(text).to_html
I'm getting this error on certain inputs that parsed fine with hpricot:
NoMethodError: undefined method `namespace_definitions' for nil:NilClass
from .../nokogiri-1.4.0/lib/nokogiri/xml/fragment_handler.rb:33:in `start_element'
from .../nokogiri-1.4.0/lib/nokogiri/html/sax/parser.rb:34:in `parse_with'
from .../nokogiri-1.4.0/lib/nokogiri/html/sax/parser.rb:34:in `parse_memory'
from .../nokogiri-1.4.0/lib/nokogiri/xml/sax/parser.rb:83:in `parse'
from .../nokogiri-1.4.0/lib/nokogiri/xml/document_fragment.rb:7:in `initialize'
from .../nokogiri-1.4.0/lib/nokogiri/html/document_fragment.rb:9:in `new'
from .../nokogiri-1.4.0/lib/nokogiri/html/document_fragment.rb:9:in `parse'
I've tracked it down to the <o:p> tag, which, as far as I can tell, is something MS Office uses to mark paragraph breaks.
<p class="MsoNormal"><span style="font-family:"Arial","sans-serif""><o:p></o:p></span></p>
Is there a way to get Nokogiri to not blow up on this tag? Ideally it would leave the tag unchanged, as hpricot did, if that's possible. If not, stripping the tag would at least be better than raising an error.
I was seeing this problem with Nokogiri 1.4.0. Nokogiri >= 1.4.1 solves the namespace definitions problem.
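If upgrading isn't an option, one rough workaround is to strip the Office-namespaced tags from the string before handing it to Nokogiri. The following is only a sketch: `strip_office_tags` and `OFFICE_TAG` are hypothetical names, not part of Nokogiri, and the regex assumes the offending tags all use the `o:` prefix as in the sample above.

```ruby
# Hypothetical helper (not part of Nokogiri): removes MS Office tags such as
# <o:p>, </o:p> and <o:p/>, leaving any text between them untouched.
# Assumption: the problematic tags all use the "o:" namespace prefix.
OFFICE_TAG = %r{</?o:[a-z0-9]+\b[^>]*>}i

def strip_office_tags(html)
  html.gsub(OFFICE_TAG, "")
end

sample = '<p class="MsoNormal"><span><o:p></o:p></span></p>'
puts strip_office_tags(sample)
# prints: <p class="MsoNormal"><span></span></p>
```

The cleaned string can then be passed to Nokogiri::HTML::DocumentFragment.parse as before. A regex like this is crude compared to a real parser, so upgrading to 1.4.1 or later is still the better fix.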