Unicode regexp to match line-breaks?
I have this form from where I want to submit data to a database. The data is UTF8. I am having trouble with matching line breaks. The pattern I am using is something like this:
~^[\p{L}\p{M}\p{N} ]+$~u
This pattern works fine until the user puts a new line in the text box. I have tried using \p{Z} inside the class but with no success. I also tried the "s" modifier but it didn’t work.
Any help is much appreciated. Thanks!
A Unicode linebreak is either a carriage return immediately followed by a line feed, or else it is any character with the vertical whitespace property.
But it looks like you’re trying to match generic whitespace there. In Java, that would be
[\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u2028\u2029\u202F\u205F\u3000]
which can be shortened by using ranges to “only” this:
[\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]
to include both horizontal whitespace (\h) and vertical whitespace (\v), which may or may not be the same as general whitespace (\s).
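As a rough sketch of how that shortened class might be used from Java source, the following wraps it in a small helper. The class name `UnicodeWhitespace` and the method `isAllWhitespace` are made up for illustration; only the character class itself comes from the text above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class UnicodeWhitespace {
    // Horizontal plus vertical whitespace, using the ranged class shown above.
    // The backslashes are doubled so that the regex engine, rather than the
    // Java compiler, interprets the code point escapes.
    static final String WHITESPACE =
        "[\\u000A-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E\\u2000-\\u200A"
      + "\\u2028\\u2029\\u202F\\u205F\\u3000]";

    static final Pattern ONLY_WHITESPACE = Pattern.compile(WHITESPACE + "+");

    // True when the string is non-empty and consists solely of whitespace
    // characters from the class above.
    static boolean isAllWhitespace(String s) {
        return ONLY_WHITESPACE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllWhitespace(" \r\n\u00A0\u3000")); // true
        System.out.println(isAllWhitespace("a b"));               // false
    }
}
```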
It also looks like you’re trying to match alphanumerics.
- Alphabetics alone are usually [\pL\pM\p{Nl}].
- Numerics are less often all of \pN than they are either just \p{Nd} or else sometimes [\p{Nd}\p{Nl}].
- Identifier characters need connector punctuation and a bit more, so [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]], if your regex engine supports those sorts of operations (Java’s does). That’s what \w works out to in Unicode-aware regex languages (of which Java is not one).
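To make the identifier class concrete, here is a minimal Java sketch that compiles it and checks a few strings. The names `UnicodeIdentifier` and `isIdentifier` are invented for this example; the character class is the one given in the list above, with backslashes doubled for a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class UnicodeIdentifier {
    // Letters, marks, decimal digits, letter numbers, connector punctuation,
    // plus the "other symbol" characters of the Enclosed Alphanumerics block.
    // The nested [...&&...] is Java's character class intersection.
    static final String IDENT_CHAR =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    static final Pattern IDENTIFIER = Pattern.compile(IDENT_CHAR + "+");

    // True when every character of a non-empty string is an identifier character.
    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("na\u00EFve_var1")); // true
        System.out.println(isIdentifier("foo-bar"));         // false: hyphen is not connector punctuation
    }
}
```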
In older versions of Perl, you would likely write a linebreak as
(?:\r\n|\p{VertSpace})
although that’s now better written as
(?:(?>\r\n)|\v)
which is exactly what \R matches.
Java is very clumsy at these things. There you must write a linebreak as
(?:(?>\u000D\u000A)|[\u000A-\u000D\u0085\u2028\u2029])
which of course requires extra bbaacckkssllasshheess when written as a string.
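For illustration, this is roughly what the doubled backslashes look like in practice. The class `UnicodeLinebreak` and its two methods are hypothetical names for this sketch, and the `FIELD` pattern is only an assumption about how the question's letters, marks, numbers and spaces class could be combined with the linebreak alternation.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class UnicodeLinebreak {
    // The linebreak pattern from above as a Java string literal:
    // a CR LF pair taken atomically, or any single vertical whitespace character.
    static final String LINEBREAK =
        "(?:(?>\\u000D\\u000A)|[\\u000A-\\u000D\\u0085\\u2028\\u2029])";

    static final Pattern LINEBREAK_PATTERN = Pattern.compile(LINEBREAK);

    // Assumed adaptation of the question's pattern: letters, marks, numbers,
    // spaces, and now also linebreaks.
    static final Pattern FIELD =
        Pattern.compile("(?:[\\pL\\pM\\pN ]|" + LINEBREAK + ")+");

    // Splits text on Unicode linebreaks, keeping empty lines.
    static String[] splitLines(String text) {
        return LINEBREAK_PATTERN.split(text, -1);
    }

    // True when the whole input consists of allowed characters and linebreaks.
    static boolean acceptsMultiline(String text) {
        return FIELD.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(splitLines("a\r\nb\nc\u2028d").length);      // 4
        System.out.println(acceptsMultiline("first line\r\nsecond 2")); // true
        System.out.println(acceptsMultiline("bad!"));                   // false
    }
}
```

On Java 8 and later the regex engine also accepts \R directly, so `Pattern.compile("\\R")` should be usable in place of the long form there.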
The other Java equivalences for the 14 common character-class regex escapes so that they work with Unicode I give in this answer. You may have to use those in other Java-like regex languages that aren’t sufficiently Unicode-aware.