Nokogiri leaving HTML entities untouched
I want Nokogiri to leave HTML entities untouched, but it seems to be co开发者_运维知识库nverting the entities into the actual symbol. For example:
Nokogiri::HTML.fragment('<p>®</p>').to_s
results in: "<p>®</p>"
Nothing seems to return the original HTML back to me.
The .inner_html, .text, .content methods all return '®'
instead of '®'
Is there a way for Nokogiri to leave these HTML entities untouched?
I've already searched stackoverflow and found similar questions, but nothing exactly like this one.
Not an ideal answer, but you can force it to generate entities (if not nice names) by setting the allowed encoding:
#encoding: UTF-8
require 'nokogiri'
html = Nokogiri::HTML.fragment('<p>®</p>')
puts html.to_html #=> <p>®</p>
puts html.to_html( encoding:'US-ASCII' ) #=> <p>®</p>
It would be nice if Nokogiri used 'nice' names of entities where defined, instead of always using the terse hexadecimal entity, but even that wouldn't be 'preserving' the original.
The root of the problem is that, in HTML, the following all describe the exact same content:
<p>®</p>
<p>®</p>
<p>®</p>
<p>®</p>
If you wanted the to_s
representation of a text node to be actually ®
then the markup describing that would really be: <p>&reg;</p>
.
If Nokogiri was to always return the same encoding per character as was used to enter the document it would need to store each character as a custom node recording the entity reference. There exists a class that might be used for this (Nokogiri::XML::EntityReference
):
require 'nokogiri'
html = Nokogiri::HTML.fragment("<p>Foo</p>")
html.at('p') << Nokogiri::XML::EntityReference.new( html.document, 'reg' )
puts html
#=> <p>Foo®</p>
However, I can't find a way to cause these to be created during parsing using Nokogiri v1.4.4 or v1.5.0. Specifically, the presence or absence of Nokogiri::XML::ParseOptions::NOENT
during parsing does not appear to cause one to be created:
require 'nokogiri'
html = "<p>Foo®</p>"
[ Nokogiri::XML::ParseOptions::NOENT,
Nokogiri::XML::ParseOptions::DEFAULT_HTML,
Nokogiri::XML::ParseOptions::DEFAULT_XML,
Nokogiri::XML::ParseOptions::STRICT
].each do |parse_option|
p Nokogiri::HTML(html,nil,'utf-8',parse_option).at('//text()')
end
#=> #<Nokogiri::XML::Text:0x810cca48 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc624 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc228 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cbe04 "Foo\u00AE">
精彩评论