How to match accented characters with a regex?
I am running Ruby on Rails 3.0.10 and Ruby 1.9.2. I am using the following Regex in order to match names:
NAME_REGEX = /^[\w\s'"\-_&@!?()\[\]-]*$/u
validates :name,
:presence => true,
:format => {
:with => NAME_REGEX,
:message => "format is invalid"
}
However, if I try to save words like the following:
Oilalà
Pì
Rùby
...
# In short, those with accented characters
I get the validation error "Name format is invalid".
How can I change the above Regex so that it also matches accented characters like à, è, é, ì, ò, ù, ...?
Instead of \w, use the POSIX bracket expression [:alpha:]:
"blåbær dèjá vu".scan /[[:alpha:]]+/ # => ["blåbær", "dèjá", "vu"]
"blåbær dèjá vu".scan /\w+/ # => ["bl", "b", "r", "d", "j", "vu"]
In your particular case, change the regex to this:
NAME_REGEX = /^[[:alpha:]\s'"\-_&@!?()\[\]-]*$/u
This matches much more than just accented characters, though, which is a good thing. Make sure you read this blog entry about common misconceptions regarding names in software applications.
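As a quick check, here is a minimal sketch that runs the revised pattern against the names from the question. The sample list and the `results` variable are only illustrative. One thing worth noticing: \w also matches digits, while [[:alpha:]] does not, so a name containing a digit is rejected by the revised pattern. If digits should be allowed, [[:alnum:]] is probably the closer replacement.

```ruby
# encoding: utf-8

# Revised pattern from the answer above: [[:alpha:]] replaces \w.
NAME_REGEX = /^[[:alpha:]\s'"\-_&@!?()\[\]-]*$/u

# Illustrative inputs: the three names from the question, plus one
# hypothetical name containing digits.
samples = [
  "Oilal\u00E0",  # Oilalà
  "P\u00EC",      # Pì
  "R\u00F9by",    # Rùby
  "Agent 47"
]

# Pair each name with whether the whole string matches the pattern.
results = samples.map { |name| [name, !!(name =~ NAME_REGEX)] }

results.each { |name, ok| puts "#{name}: #{ok}" }
# The three accented names match; "Agent 47" does not, because
# [[:alpha:]] covers letters only.
```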
One solution would of course be to simply find all of the accented characters and list them in the character class as you normally would, although I assume there can be fairly many of them.
If you are using UTF-8 then you will find that such characters are often split into two parts: the "base" character itself, followed by the accent (U+0300 and U+0301, I believe), also called a combining character. However, this may not always be true, since some characters can also be written using a single precomposed ("hardcoded") character code, so you need to normalize the string to NFD form first.
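To make the two representations concrete, here is a small sketch. It assumes Ruby 2.2 or later, where String#unicode_normalize is available (it is not on the Ruby 1.9.2 from the question); the `hex` helper is just for illustration.

```ruby
# encoding: utf-8

# Assumption: Ruby 2.2+, for String#unicode_normalize.
precomposed = "\u00E0"   # "à" as a single code point
decomposed  = "a\u0300"  # "a" followed by the combining grave accent

# Illustrative helper: list a string's code points in U+XXXX notation.
hex = ->(s) { s.codepoints.map { |c| format("U+%04X", c) } }

puts hex.call(precomposed).inspect  # ["U+00E0"]
puts hex.call(decomposed).inspect   # ["U+0061", "U+0300"]

# The two strings look the same when printed but are not equal...
puts precomposed == decomposed                          # false
# ...until the precomposed one is normalized to NFD.
puts precomposed.unicode_normalize(:nfd) == decomposed  # true
```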
Of course, you could also convert any string you have into UTF-8 and then back into the original charset, but the overhead might become quite large if you are doing bulk operations.
EDIT: To answer your question specifically, the best solution is likely to normalize your strings into NFD form, and then simply add U+0300 and U+0301 to your list of acceptable characters, along with whatever other combining characters you want to allow (such as the marks in åäö; you can find them all in "charmap" on Windows, starting at 0x0300 and going up).
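A rough sketch of that approach follows, again assuming Ruby 2.2+ for String#unicode_normalize. The names NFD_NAME_REGEX and valid_name? are made up for this example, and the pattern allows the whole combining diacritical marks block (U+0300 to U+036F) rather than just the two accents mentioned above.

```ruby
# encoding: utf-8

# Assumption: Ruby 2.2+, for String#unicode_normalize.

# Hypothetical pattern: ASCII letters, the combining diacritical marks
# block (U+0300..U+036F), and the punctuation from the original regex.
NFD_NAME_REGEX = /\A[a-zA-Z\u0300-\u036F\s'"\-_&@!?()\[\]]*\z/

# Hypothetical helper: normalize to NFD, then test against the pattern.
def valid_name?(name)
  decomposed = name.unicode_normalize(:nfd)
  !!(decomposed =~ NFD_NAME_REGEX)
end

precomposed = "R\u00F9by"   # "Rùby" with "ù" as a single code point
combining   = "Ru\u0300by"  # "Rùby" with "u" + combining grave accent

puts valid_name?(precomposed)  # true
puts valid_name?(combining)    # true

# "blåbær": "å" decomposes to "a" + U+030A, but "æ" (U+00E6) has no
# decomposition, so it is still rejected by this pattern.
puts valid_name?("bl\u00E5b\u00E6r")  # false
```

The last case shows the main limitation: letters that are not a base letter plus an accent (æ, ø, ß and so on) would still have to be listed explicitly.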