What are the whitespaces matched by \s in PHP?
What is the complete list of characters matched by the escape sequence \s in PHP? Some regex flavors include vertical space and other characters in this escape sequence.
From the pcrepattern specification page:
Generic character types
\s any white space character
For compatibility with Perl, \s did not used to match the VT character (code 11), which made it different from the POSIX "space" class. However, Perl added VT at release 5.18, and PCRE followed suit at release 8.34. The default \s characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space (32), which are defined as white space in the "C" locale. This list may vary if locale-specific matching is taking place. For example, in some locales the "non-breaking space" character (\xA0) is recognized as white space, and in others the VT character is not.
So \s will match at least 5 characters (HT, LF, FF, CR, and space), plus more depending on:
- PCRE library version
- Locale setting
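Because the answer depends on the PCRE version and the locale, it can be easier to ask the running PHP build directly. The following is a minimal sketch, not an authoritative test: the helper name whitespaceBytes() is made up for illustration, and it only checks single bytes (0 to 255) without the /u modifier.

```php
<?php
// Hypothetical helper: list the byte values (0-255) that \s matches
// under the current PCRE version and locale, without the /u modifier.
function whitespaceBytes(): array
{
    $matched = [];
    for ($code = 0; $code <= 255; $code++) {
        // The D modifier makes $ match only at the very end of the subject.
        if (preg_match('/^\s$/D', chr($code)) === 1) {
            $matched[] = $code;
        }
    }
    return $matched;
}

echo 'PCRE version: ', PCRE_VERSION, PHP_EOL;
echo 'Matched by \s: ', implode(', ', whitespaceBytes()), PHP_EOL;
```

Whether 11 (VT) appears in the output should depend on whether the bundled PCRE is older or newer than 8.34, as described in the quoted documentation.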
This test compares the result of preg_match across various versions of PHP.
PHP also has \h, which matches horizontal whitespace characters only: http://www.php.net/manual/en/regexp.reference.escape.php
According to http://www.pcre.org/pcre.txt :
For compatibility with Perl, \s does not match the VT character (code 11). This makes it different from the POSIX "space" class. The \s characters are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is included in a Perl script, \s may match the VT character. In PCRE, it never does.
So if "Vertical space" refers to vertical tab, the answer is no.
The sequences \h, \H, \v, and \V are features that were added to Perl at release 5.10. In contrast to the other sequences, which match only ASCII characters by default, these always match certain high-valued codepoints in UTF-8 mode, whether or not PCRE_UCP is set.
The horizontal space characters are:
- U+0009 Horizontal tab
- U+0020 Space
- U+00A0 Non-break space
- U+1680 Ogham space mark
- U+180E Mongolian vowel separator
- U+2000 En quad
- U+2001 Em quad
- U+2002 En space
- U+2003 Em space
- U+2004 Three-per-em space
- U+2005 Four-per-em space
- U+2006 Six-per-em space
- U+2007 Figure space
- U+2008 Punctuation space
- U+2009 Thin space
- U+200A Hair space
- U+202F Narrow no-break space
- U+205F Medium mathematical space
- U+3000 Ideographic space
The vertical space characters are:
- U+000A Linefeed
- U+000B Vertical tab
- U+000C Formfeed
- U+000D Carriage return
- U+0085 Next line
- U+2028 Line separator
- U+2029 Paragraph separator
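To see the \h and \v split in practice, a few of the listed code points can be checked in UTF-8 mode. This is only a sketch: the function name classifySpace() is hypothetical, and the "\u{...}" string escapes assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper: report whether a single character is matched
// by \h (horizontal space) and by \v (vertical space) in UTF-8 mode.
function classifySpace(string $char): array
{
    return [
        'h' => preg_match('/^\h$/Du', $char) === 1,
        'v' => preg_match('/^\v$/Du', $char) === 1,
    ];
}

$samples = [
    'U+0009 horizontal tab'    => "\t",
    'U+00A0 non-break space'   => "\u{00A0}",
    'U+3000 ideographic space' => "\u{3000}",
    'U+000A linefeed'          => "\n",
    'U+2028 line separator'    => "\u{2028}",
];

foreach ($samples as $label => $char) {
    $r = classifySpace($char);
    printf("%-26s \\h=%s \\v=%s\n", $label, $r['h'] ? 'yes' : 'no', $r['v'] ? 'yes' : 'no');
}
```

The patterns are written in single quotes on purpose: inside a double-quoted PHP string, "\v" would be turned into a literal vertical tab before PCRE ever sees it.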
From http://www.pcre.org/pcre.txt:
\s any character that \p{Z} matches, plus HT, LF, FF, CR
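That definition applies when Unicode properties are in effect (PCRE_UCP). In PHP the /u modifier turns on UTF-8 mode, and on common builds it is generally understood to enable PCRE_UCP as well, but treat that as an assumption to verify on your own installation. The sketch below compares \s under /u with an explicit class built from the quoted definition; compareWithDefinition() is a hypothetical name, and "\u{...}" assumes PHP 7.0 or later.

```php
<?php
// Hypothetical helper: compare \s under the /u modifier with an explicit
// character class spelled out from the quoted definition
// (\p{Z} plus HT, LF, FF, CR).
function compareWithDefinition(string $char): array
{
    return [
        's_with_u' => preg_match('/^\s$/Du', $char) === 1,
        'explicit' => preg_match('/^[\p{Z}\t\n\f\r]$/Du', $char) === 1,
    ];
}

$samples = [
    'U+0020 space'           => ' ',
    'U+00A0 non-break space' => "\u{00A0}",
    'U+2003 em space'        => "\u{2003}",
    'U+000B vertical tab'    => "\x0B",
];

foreach ($samples as $label => $char) {
    $r = compareWithDefinition($char);
    printf("%-24s \\s=%s explicit=%s\n", $label, $r['s_with_u'] ? 'yes' : 'no', $r['explicit'] ? 'yes' : 'no');
}
```

If the two columns disagree for a character such as VT, that points to a difference between the installed PCRE version and the version of the documentation quoted here.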