
Nokogiri HTML parse undefined method 'namespace_definitions' blows up on <o:p> tag

I have a Rails app that parses HTML using version 1.4.0 of the Nokogiri gem.

To parse and clean up the HTML fragment, I'm using this:

Nokogiri::HTML::DocumentFragment.parse(text).to_html

I get the following error on certain inputs, all of which parsed fine with hpricot:

NoMethodError: undefined method `namespace_definitions' for nil:NilClass
    from .../nokogiri-1.4.0/lib/nokogiri/xml/fragment_handler.rb:33:in `start_element'
    from .../nokogiri-1.4.0/lib/nokogiri/html/sax/parser.rb:34:in `parse_with'
    from .../nokogiri-1.4.0/lib/nokogiri/html/sax/parser.rb:34:in `parse_memory'
    from .../nokogiri-1.4.0/lib/nokogiri/xml/sax/parser.rb:83:in `parse'
    from .../nokogiri-1.4.0/lib/nokogiri/xml/document_fragment.rb:7:in `initialize'
    from .../nokogiri-1.4.0/lib/nokogiri/html/document_fragment.rb:9:in `new'
    from .../nokogiri-1.4.0/lib/nokogiri/html/document_fragment.rb:9:in `parse'

I've tracked it down to the <o:p> tag, which, from what I gather, is something MS Office uses to tag paragraph breaks.

<p class="MsoNormal"><span style="font-family:&quot;Arial&quot;,&quot;sans-serif&quot;"><o:p></o:p></span></p>

Is there a way to get Nokogiri to not blow up on this tag? Ideally it would just leave the tag unchanged, as hpricot would have, if that's possible. If not, then at least stripping the tags would be better than throwing an error.
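For reference, the stripping fallback mentioned above could be sketched roughly as follows. This is only a stopgap under stated assumptions: the helper name strip_office_tags and the regex are hypothetical, not part of Nokogiri, and the pattern only targets tags with the o: prefix seen in the sample input.

```ruby
# Hypothetical pre-processing step (not a Nokogiri API): remove Office
# namespace tags such as <o:p>, </o:p> and <o:p /> before parsing.
OFFICE_TAG = %r{</?o:[a-z]+\b[^>]*>}i

def strip_office_tags(html)
  # gsub returns a new string; the original input is left untouched.
  html.gsub(OFFICE_TAG, "")
end

input = '<p class="MsoNormal"><span><o:p></o:p></span></p>'
cleaned = strip_office_tags(input)
puts cleaned # prints <p class="MsoNormal"><span></span></p>
```

The cleaned string could then be handed to Nokogiri::HTML::DocumentFragment.parse as before. Note that this discards the tags rather than preserving them, so it only covers the "at least strip them" case.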


I was seeing this problem with Nokogiri 1.4.0. Nokogiri >= 1.4.1 solves the namespace definitions problem.
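If you want to check which side of that fix an installed version falls on, a minimal sketch using RubyGems' Gem::Version looks like this. The helper name namespace_fix_present? is hypothetical, and the 1.4.1 threshold is taken only from my observation above, not from a changelog.

```ruby
require "rubygems"

# Threshold assumed from the observation above: 1.4.0 fails, >= 1.4.1 works.
FIXED_IN = Gem::Version.new("1.4.1")

# Hypothetical helper: true when the version string is 1.4.1 or newer.
def namespace_fix_present?(version_string)
  Gem::Version.new(version_string) >= FIXED_IN
end

puts namespace_fix_present?("1.4.0") # prints false
puts namespace_fix_present?("1.4.1") # prints true
```

In an app with Nokogiri loaded you would pass Nokogiri::VERSION to the helper. With Bundler, a constraint such as gem "nokogiri", ">= 1.4.1" in the Gemfile keeps the older release from being installed.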
