Mathematica regular expressions on unicode strings
This was a fascinating debugging experience. Can you spot the difference between the following two lines?
StringReplace["–", RegularExpression@"[\\s\\S]" -> "abc"]
StringReplace["-", RegularExpression@"[\\s\\S]" -> "abc"]
They do very different things when you evaluate them. It turns out it's because the string being replaced in the first line consists of a Unicode en dash, as opposed to a plain old ASCII dash in the second line.
In the case of the unicode string, the regular expression doesn't match. I meant the regex "[\s\S]" to mean "match any character (including newline)" but Mathematica apparently treats it as "match any ascii character".
How can I fix the regular expression so the first line above evaluates the same as the second? Alternatively, is there an asciify filter I can apply to the strings first?
PS: The Mathematica documentation says that its string pattern matching is built on top of the Perl-Compatible Regular Expressions library (http://pcre.org) so the problem I'm having may not be specific to Mathematica.
Here's an asciify function which I used as a workaround at first:
f[s_String] := s
f[x_] := FromCharacterCode[x]
asciify[s_String] :=
StringJoin[f /@ (ToCharacterCode[s] /. x_?(#>255&) :> "&"<>ToString[x]<>";")]
Then I realized, thanks to @Isaac's answer, that "." as a regular expression doesn't seem to have this Unicode problem. I learned from the answers to the question "Bug in Mathematica: regular expression applied to very long string" that "(.|\n)" is ill-advised but that "(?s)." is recommended. So I think the best fix is the following:
StringReplace["–", RegularExpression@"(?s)." -> "abc"]
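As a quick sanity check (only a sketch; the names inputs and results are mine, and \[Dash] is Mathematica's named character for the en dash), the same pattern can be tried on the ASCII dash, the en dash, and a newline. If the fix behaves as described, each one should come back as "abc":

```mathematica
(* test inputs: ASCII hyphen, en dash (U+2013), and a newline *)
inputs = {"-", "\[Dash]", "\n"};

(* "(?s)" turns on dot-all mode, so "." is also allowed to match a newline *)
results = StringReplace[#, RegularExpression["(?s)."] -> "abc"] & /@ inputs
```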
I would use a StringExpression in place of RegularExpression. This works as desired:
f[s_String] := StringReplace[s, _ -> "abc"]
In a StringExpression, Blank[] will match anything, including non-ASCII characters.
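To check that, here is a small sketch that runs the same replacement over both kinds of dash and over a mixed string. The names replaceEach and outputs are hypothetical, chosen only for this illustration, and \[Dash] stands for the en dash:

```mathematica
(* same idea as above: _ (Blank[]) matches any single character *)
replaceEach[s_String] := StringReplace[s, _ -> "abc"]

(* ASCII hyphen, en dash, and a string mixing ASCII with an en dash *)
outputs = replaceEach /@ {"-", "\[Dash]", "a\[Dash]b"}
```

Since every character is replaced one by one, the two single-character inputs should behave the same, and the three-character string should come back three times as long as "abc".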
EDIT in response to version updates: as of Mathematica 11.0.1, it looks like, for characters with character codes up to 2^16 - 1 (which is called out as the maximum value for FromCharacterCode), the results of StringMatchQ with LetterCharacter now match those of LetterQ:
AllTrue[FromCharacterCode /@ Range[2^16 - 1],
LetterQ@# === StringMatchQ[#, LetterCharacter] &]
(* True *)
Using "(.|\n)" for the input to RegularExpression seems to work for me. The pattern matches . (any non-newline character) or \n (a newline character).