Convert non-breaking spaces to spaces in Ruby

2022-12-25 16:37 问答作者：

I have cases where user-entered data from an html textarea or input is sometimes sent with \u00a0 (non-breaking spaces) instead of spaces when encoded as utf-8 json.

I believe that to be a bug in Firefox, as I know that the user isn't intentionally putting in non-breaking spaces instead of spaces.

There are also two bugs in Ruby, one of which can be used to combat the other.

For whatever reason \s doesn't match \u00a0.

However [^[:print:]], which definitely should not match) and \xC2\xA0 both will match, but I consider those to be less-than-ideal ways to deal wit开发者_如何学Goh the issue.

Are there other recommendations for getting around this issue?

Use /\u00a0/ to match non-breaking spaces. For instance s.gsub(/\u00a0/, ' ') converts all non-breaking spaces to regular spaces.

Use /[[:space:]]/ to match all whitespace, including Unicode whitespace like non-breaking spaces. This is unlike /\s/, which matches only ASCII whitespace.

See also: Ruby Regexp documentation

If you cannot use \s for Unicode whitespace, that’s a bug in the Ruby regex implementation, because according to UTS#18 “Unicode Regular Expressions” Annex C on Compatibility Properties a \s, is absolutely required to match any Unicode whitespace code point.

There is no wiggle-room allowed since the two columns detailing the Standard Recommendation and the POSIX Compatibility are the same for the \s case. You cannot document your way around this: you are out of compliance with The Unicode Standard, in particular, with UTS#18’s RL1.2a, if you do not do this.

If you do not meet RL1.2a, you do not meet the Level 1 requirements, which are the most basic and elementary functionality needed to use regular expressions on Unicode. Without that, you are pretty much lost. This is why standards exist. My recollection is that Ruby also fails to meet several other Level 1 requirements. You may therefore wish to use a programming language that meets at least Level 1 if you actually need to handle Unicode with regular expressions.

Note that you cannot use a Unicode General Category property like \p{Zs} to stand for \p{Whitespace}. That’s because the Whitespace property is a derived property, not a general category. There are also control characters included in it, not just separators.

Actual functioning IRB code examples that answer the question, with latest Rubies (May 2012)

Ruby 1.9

require 'rubygems'
require 'nokogiri'
RUBY_DESCRIPTION # => "ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]"
doc = '<html><body> &nbsp; </body></html>'
page = Nokogiri::HTML(doc)
s = page.inner_text
s.each_codepoint {|c| print c, ' ' } #=> 32 160 32
s.strip.each_codepoint {|c| print c, ' ' } #=> 160
s.gsub(/\s+/,'').each_codepoint {|c| print c, ' ' } #=> 160
s.gsub(/\u00A0/,'').strip.empty? #true

Ruby 1.8

require 'rubygems'
require 'nokogiri'
RUBY_DESCRIPTION # => "ruby 1.8.7 (2012-02-08 patchlevel 358) [x86_64-linux]"
doc = '<html><body> &nbsp; </body></html>'
page = Nokogiri::HTML(doc)
s = page.inner_text # " \302\240 "
s.gsub(/\s+/,'') # "\302\240"
s.gsub(/\302\240/,'').strip.empty? #true

For whatever reason \s doesn't match \u00a0.

I think the "whatever reason" is that is not supposed to. Only the POSIX and \p construct character classes are Unicode aware. The character-class abbreviations are not:

Sequence   As[...]        Meaning
     \d    [0-9]          ASCII decimal digit character
     \D    [^0-9]         Any character except a digit
     \h    [0-9a-fA-F]    Hexadecimal digit character
     \H    [^0-9a-fA-F]   Any character except a hex digit
     \s    [ \t\r\n\f]    ASCII whitespace character
     \S    [^ \t\r\n\f]   Any character except whitespace
     \w    [A-Za-z0-9\_]  ASCII word character
     \W    [^A-Za-z0-9\_] Any character except a word character

For the old versions of ruby (1.8.x), the fixes are the ones described in the question.

This is fixed in the newer versions of ruby 1.9+.

While not related to Ruby (and not directly to this question), the core of the problem might be that Alt+Space on Macs produces a non-breaking space.

This can cause all kinds of weird behaviour (especially in the terminal).

For those who are interested in more details, I wrote "Why chaining commands with pipes in Mac OS X does not always work" about this topic some time ago.

继续阅读：json ruby unicode utf-8 whitespace

Convert non-breaking spaces to spaces in Ruby

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？