How to remove whitespace but not UTF-8 characters in Ruby

I want to prevent users from submitting an empty comment (whitespace, &nbsp;, etc.), so I apply the following:

var.gsub(/^\s+|\s+\z|\s*&nbsp;\s*/, '')

However, a clever user then found a hole by using the \302 and \240 bytes (the UTF-8 encoding of a no-break space), so I filtered those out too.

Then I ran into a problem when I added support for several languages: a word like Déjà vu now triggers an error, because the encoding of the à character contains \240. Is there any way to remove the whitespace but leave the Latin characters untouched?


A way around this is to use iconv to discard invalid byte sequences (such as \240 on its own) before using your regexp to remove the whitespace:

require 'iconv'

var1 = "Déjà vu"
var2 = "\240"

ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid1 = ic.iconv(var1) # => "D\303\251j\303\240 vu" 
valid2 = ic.iconv(var2) # => ""
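
Iconv was removed from the standard library in Ruby 2.0, so on a current Ruby the same idea might be sketched with String#scrub (Ruby 2.1+) and the Unicode-aware [[:space:]] class, which also matches the no-break space that \s misses. The helper names clean_utf8, strip_unicode_space and blank_comment? are made up for illustration, and the sketch assumes the input is meant to be UTF-8:

```ruby
# Hypothetical helpers, not part of the original answer.

# Drop bytes that are not valid UTF-8 (for example a lone "\xA0"),
# leaving valid multibyte characters such as "à" untouched.
def clean_utf8(str)
  str.dup.force_encoding(Encoding::UTF_8).scrub('')
end

# Strip leading and trailing whitespace. [[:space:]] is Unicode-aware,
# so it also matches U+00A0 (no-break space), which \s does not.
def strip_unicode_space(str)
  clean_utf8(str).gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
end

# A comment counts as blank when nothing is left after stripping.
def blank_comment?(str)
  strip_unicode_space(str).empty?
end

only_spaces = blank_comment?(" \u00A0\t ")               # true
lone_byte   = blank_comment?("\xA0")                     # true
kept        = strip_unicode_space("\u00A0Déjà vu ")      # "Déjà vu"
```

Because the invalid bytes are removed before any pattern matching, the \240 byte inside à is never treated as a stray space. Literal "&nbsp;" text in the input is not handled here and would still need its own substitution.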