How to remove whitespace but not UTF-8 characters in Ruby
I want to prevent users from submitting an empty comment (whitespace, &nbsp;, etc.), so I apply the following:
var.gsub(/^\s+|\s+\z|\s* \s*/, '')
However, a clever user then found a hole by using the \302 or \240 Unicode characters, so I filtered out these characters too.
Then I ran into a problem when I introduced support for several languages: a word like Déjà vu now triggers an error, because part of the à character contains \240. Is there any way to remove the whitespace but leave the Latin characters untouched?
One way around this is to use iconv to discard invalid UTF-8 byte sequences (such as \240 on its own) before using your regexp to remove the whitespace:
require 'iconv'
var1 = "Déjà vu"
var2 = "\240"
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid1 = ic.iconv(var1) # => "D\303\251j\303\240 vu"
valid2 = ic.iconv(var2) # => ""
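
Note that Iconv was removed from Ruby's standard library in 2.0, so the snippet above only runs on older versions (or with the iconv gem). On Ruby 2.1 or later, roughly the same idea can be sketched with String#scrub, which drops invalid byte sequences, and the POSIX class [[:space:]], which unlike \s also matches the non-breaking space U+00A0 (bytes \302\240). The helper names strip_unicode_whitespace and blank_comment? are made up for illustration, and the sketch assumes the input is meant to be UTF-8:

```ruby
# Hypothetical helper: trims leading and trailing whitespace, including
# Unicode whitespace such as the non-breaking space, from UTF-8 input.
def strip_unicode_whitespace(text)
  # Treat the bytes as UTF-8 and drop invalid sequences (e.g. a lone "\240").
  cleaned = text.dup.force_encoding(Encoding::UTF_8).scrub('')
  # [[:space:]] is Unicode-aware; \s would only match ASCII whitespace.
  cleaned.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
end

# Hypothetical helper: true when nothing visible is left after trimming.
def blank_comment?(text)
  strip_unicode_whitespace(text).empty?
end

blank_comment?("\u00A0 \t\n")             # only whitespace, including U+00A0
blank_comment?("\240")                    # lone invalid byte, discarded by scrub
blank_comment?("D\303\251j\303\240 vu")   # accented text stays intact
```

Because the regexp works on characters rather than individual bytes, the \240 byte inside à is never touched; only a genuine non-breaking space or a stray invalid byte is removed.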