How to make Nokogiri not convert &nbsp; to a space

2023-01-31 08:41

I fetch an HTML fragment like

"<li>市&nbsp;场&nbsp;价"

which contains "&nbsp;", but after calling to_s on the Nokogiri NodeSet, it becomes

"<li>市 场 价"

I want to keep the original HTML fragment, and tried setting the :save_with option on the to_s method, but that failed.

Has anyone encountered the same problem and can give me some help? Thank you in advance.

I encountered a similar situation, and what I came up with is a bit of a hack, but it seems to work well.

nbsp = Nokogiri::HTML("&nbsp;").text
text.gsub(nbsp, " ")

In my case, I wanted the nbsp to be a regular space. I think in your case you want them turned back into "&nbsp;", so you could do something like:

nbsp = Nokogiri::HTML("&nbsp;").text
html.gsub(nbsp, "&nbsp;")
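Here is a self-contained sketch of that second variant, in case it helps. It skips the Nokogiri::HTML("&nbsp;").text lookup and writes the character directly as "\u00A0", which assumes the serialized string is UTF-8. The restore_nbsp helper and the sample string are my own stand-ins, not part of Nokogiri:

```ruby
# Hypothetical helper (not a Nokogiri API): re-encode non-breaking spaces
# in an HTML string that has already been serialized.
NBSP = "\u00A0" # the character that &nbsp; decodes to

def restore_nbsp(html)
  html.gsub(NBSP, "&nbsp;")
end

# Stand-in for what NodeSet#to_s returned in the question.
serialized = "<li>市#{NBSP}场#{NBSP}价</li>"

puts restore_nbsp(serialized) # prints <li>市&nbsp;场&nbsp;价</li>
```

Note that this also rewrites any non-breaking space that was a literal character in the source rather than an entity, so it restores the original markup only if the input used &nbsp; throughout.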

I think the problem is how you're looking at the string. It will look like a space, but it's not quite the same:

require 'nokogiri'

doc = Nokogiri::HTML('"<li>市&nbsp;场&nbsp;价"')
(doc % 'li').content.chars.to_a[1].ord # => 160
(doc % 'li').to_html # => "<li>市 场 价\"</li>"

A regular space is 32, 0x20 or ' '. 160 is the decimal value for a non-breaking space, which is what &nbsp; converts to after you use Nokogiri's various inner_text, content, text or to_s methods. It's no longer an HTML entity encoding, but it's still a non-breaking space. I think Nokogiri's conversion from the entity encoding is the appropriate behavior when asking for a stringification.
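To see the difference without involving Nokogiri at all, here is a small plain-Ruby check. It assumes a UTF-8 source file and Ruby 2.4 or later (for String#match?):

```ruby
space = " "
nbsp  = "\u00A0" # non-breaking space, U+00A0

p space.ord   # => 32
p nbsp.ord    # => 160
p space == nbsp # => false

# In UTF-8 the non-breaking space is two bytes, not one.
p nbsp.bytes  # => [194, 160]

# Ruby's \s only covers ASCII whitespace, so it misses U+00A0;
# the POSIX bracket class [[:space:]] is Unicode-aware and matches it.
p nbsp.match?(/\s/)           # => false
p nbsp.match?(/[[:space:]]/)  # => true
```

That last pair is why the character can slip past code that looks for whitespace with \s even though it prints like a space.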

There might be a flag to tell Nokogiri NOT to decode the value, but I'm not aware of one off-hand. You can check on Nokogiri's mailing list, which I mentioned in the comment above, to see if there is a flag. I can also see an advantage in Nokogiri not doing the decode, so if there isn't such a flag, it would occasionally be nice to have one.

Now, all that said, I think the to_html method SHOULD return the value to its entity-encoded form, since a non-breaking space is a nasty thing to encounter in an HTML stream. That, I think, you should mention on the mailing list or maybe even file as a bug. I think it's an inappropriate result.

http://groups.google.com/group/nokogiri-talk/msg/0b81ef0dc180dc74

Okay, I can explain the behavior now. Basically, the problem boils down to encoding.

In Ruby 1.9, we examine the encoding of the string you're feeding to Nokogiri. If the input string is "utf-8", the document is assumed to be a UTF-8 document. When you output the document, since "&nbsp;" can be represented as a UTF-8 character, it is output as that UTF-8 character.

In 1.8, since we cannot detect the encoding of the document, we assume binary encoding and allow libxml2 to detect the encoding. If you set the encoding of the input document to binary, it will give you back the entities you want. Here is some code to demo:

require 'nokogiri'
html = '<body>hello &nbsp; world</body>'
f    = Nokogiri.HTML(html)
node = f.css('body')
p node.inner_html
f    = Nokogiri.HTML(html.encode('ASCII-8BIT'))
node = f.css('body')
p node.inner_html

I posted a youtube video too! :-)

http://www.youtube.com/watch?v=X2SzhXAt7V4

Aaron Patterson

Your sample text isn't ASCII-8BIT, so try changing that encoding string to the name of the Unicode character set and see whether inner_html returns an entity-encoded value.
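For what it's worth, here is what the encoding step looks like at the plain-Ruby level, without Nokogiri involved. This is only a sketch of String#encode versus String#force_encoding on the two sample strings from this thread. Whether Nokogiri then returns entities for relabeled input is something I have not confirmed, so check it against your installed version:

```ruby
ascii_only = "<body>hello &nbsp; world</body>" # sample from the quoted reply
with_cjk   = "<li>市&nbsp;场&nbsp;价</li>"      # sample from the question

# ASCII-only text can be transcoded to binary without complaint.
p ascii_only.encode("ASCII-8BIT").encoding == Encoding::ASCII_8BIT # => true

# Text containing non-ASCII characters cannot be transcoded to binary.
begin
  with_cjk.encode("ASCII-8BIT")
rescue Encoding::UndefinedConversionError => e
  puts "encode failed: #{e.class}"
end

# force_encoding only relabels the existing bytes; nothing is converted.
relabeled = with_cjk.dup.force_encoding("ASCII-8BIT")
p relabeled.encoding == Encoding::ASCII_8BIT # => true
p relabeled.bytesize == with_cjk.bytesize    # => true
```

So the encode call from the demo works for its ASCII-only input, but raises on the Chinese sample from the question.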
