How do I correctly deal with non-breaking spaces using Nokogiri?

2023-03-05 10:41

I am using Nokogiri to parse an HTML page, but I am having odd problems with non-breaking spaces. I tried different encodings, replacing the whitespace, and a few other headache-inducing attempts.

Here is the HTML snippet in question:

<td>Amount 15,300&nbsp;at&nbsp;dollars</td>

Note the change for the &nbsp; representation after I use Nokogiri:

<td>Amount 15,300&#xa0;at&#xa0;dollars</td>

And outputting the inner_text:

Amount 15,300Â atÂ dollars

This is my basic Nokogiri call. I tried a few alternatives to solve the problem, but failed miserably:

doc = Nokogiri::HTML(open(url))

And then I do a doc.search for the item in question.

Note that if I look at the doc, the line shows up with the &#xa0; on that line.

Clarification: I do not think I clearly stated the difficulty I am having. I can't get the inner_text to show up without the strange Â symbol.

Unless you really, really want to keep the &nbsp; notation, there shouldn't be a problem here.

A0 is the hex character code for a non-breaking space. As such, &#xa0; prints a non-breaking space, and is exactly equivalent to &nbsp;. &#160; does the same thing, too.

What Nokogiri is doing here is reading the text node, recognizing the entities, and converting them to their actual string representation internally. Then, when converting it back to an HTML-friendly version of the text node, it represents the non-breaking space by its hex code, rather than taking the performance overhead of looking it up in an entity table, since it's equivalent, anyway.

Assuming that Â was what you were seeing and wasn't just an issue pasting into Stack Overflow, this is a text encoding issue. When the text is written out as UTF-8, the non-breaking space becomes the two bytes C2 A0; the output software (browser?) isn't in UTF-8 mode, so it reads the C2 byte as a separate character, Â. If this is a browser, adding <meta charset="utf-8"> to the head will solve this issue, and will make the rest of the output more Unicode-friendly.
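The mismatch can be reproduced in plain Ruby, without Nokogiri. This is only a minimal sketch, assuming the reading side decodes the bytes as ISO-8859-1 (Latin-1); the variable names are illustrative.

```ruby
# A non-breaking space is the code point U+00A0.
nbsp = "\u00A0"

# In UTF-8 it is encoded as two bytes: 0xC2 0xA0.
utf8_bytes = nbsp.bytes

# Assumption: the consumer wrongly treats those bytes as ISO-8859-1.
# Each byte then becomes its own character: 0xC2 is "Â" and 0xA0 is
# a non-breaking space again.
misread = nbsp.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts utf8_bytes.inspect
# prints: [194, 160]
puts misread.chars.map { |c| format("U+%04X", c.ord) }.inspect
# prints: ["U+00C2", "U+00A0"]
```

That pair of characters is the "Â " that shows up in the inner_text output above.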

If you really, really want &nbsp;, use gsub to replace them in your final output. Otherwise, don't worry about it.
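If the named entity is required in the final markup, a substitution over the serialized string should be enough. A minimal sketch, assuming the output is already a Ruby string; restore_nbsp is a hypothetical helper name, not part of Nokogiri.

```ruby
# Hypothetical helper: rewrite every form of the non-breaking space
# (hex reference, decimal reference, or the raw U+00A0 character)
# as the named entity &nbsp;.
def restore_nbsp(html)
  html.gsub(/&#xa0;|&#160;|\u00A0/i, "&nbsp;")
end

serialized = "<td>Amount 15,300&#xa0;at&#xa0;dollars</td>"
puts restore_nbsp(serialized)
# prints: <td>Amount 15,300&nbsp;at&nbsp;dollars</td>
```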

I know this is old, but it took me an hour to find out how to solve this problem, and it is really easy once you know. Just pass your string to this function and it will be "de-nbsp-fied".

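require 'nokogiri'

# Removes every non-breaking space (U+00A0) from str.
def strip_html(str)
  nbsp = Nokogiri::HTML("&nbsp;").text # the decoded U+00A0 character
  str.gsub(nbsp, '')
end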

You could also replace it with a space if you wished. May many of you find this answer!
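Replacing rather than deleting works the same way on the extracted text. The sketch below assumes the text is a UTF-8 string such as the one returned by inner_text; normalize_spaces is a hypothetical name. Ruby's \s does not treat U+00A0 as whitespace, which is why the character is named explicitly.

```ruby
# Hypothetical helper: turn each non-breaking space (U+00A0) into an
# ordinary space so that later splitting and matching behave as expected.
def normalize_spaces(text)
  text.gsub("\u00A0", " ")
end

text = "Amount 15,300\u00A0at\u00A0dollars"
puts normalize_spaces(text)
# prints: Amount 15,300 at dollars

# \s does not match U+00A0 in Ruby; the POSIX class [[:space:]] does.
puts text.match?(/0\sat/)          # prints: false
puts text.match?(/0[[:space:]]at/) # prints: true
```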

As @sawa says, the main problem is what you see when writing to the console. It's not correctly displaying the non-breaking space after Nokogiri converts it to the appropriate binary value.

The usual way to fix the problem is to preprocess the content:

require 'nokogiri'

html = '<td>Amount 15,300&nbsp;at&nbsp;dollars</td>'
doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(/&(?:#xa0|#160|nbsp);/i, ' '))
puts doc.to_html

Which outputs:

<td>Amount 15,300 at dollars</td>
