Matching a specific Unicode character in a Haskell regexp
This is a Mac OS X related problem!
I have the following three-character Haskell string:
"a\160b"
I want to match and replace the middle character.
Several approaches, such as
ghci> :m +Text.Regex
ghci> subRegex (mkRegex "\160") "a\160b" "X"
"*** Exception: user error (Text.Regex.Posix.String died: (ReturnCode 17,"illegal byte sequence"))
ghci> subRegex (mkRegex "\\160") "a\160b" "X"
"a\160b"
did not yield the desired result.
How do I need to modify the regexp or my environment to replace the '\160' with 'X'?
The problem seems to have its root in the locale/encoding of the input.
bash> locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
I already modified my .bashrc to export the following environment variables:
bash> locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
But this did not change the behavior at all.
I was able to reproduce your problem by setting my locale to 'en_US.UTF-8'. (I am also using MacOSX.)
bash> export LANG=en_US.UTF-8
bash> ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/ :? for help
Prelude> :m +Text.Regex
Prelude Text.Regex> subRegex (mkRegex "\160") "a\160b" "X"
"*** Exception: user error (Text.Regex.Posix.String died: (ReturnCode 17,"illegal byte sequence"))
Setting your locale to 'C' should fix the problem:
bash> export LANG=C
bash> ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/ :? for help
Prelude> :m +Text.Regex
Prelude Text.Regex> subRegex (mkRegex "\160") "a\160b" "X"
"aXb"
Unfortunately, I don't have an explanation as to why the locale is causing this problem.
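One possible explanation, which I have not verified against the regex-posix source, so treat it as an assumption: Text.Regex hands the string to the C regex library as raw bytes, one byte per Char, rather than UTF-8 encoding it. If that is what happens, '\160' arrives as the single byte 0xA0, which is not a valid sequence on its own in a UTF-8 locale, while the 'C' locale accepts any byte. The sketch below only shows the two byte sequences involved; the helper names utf8Bytes and truncatedBytes are mine, not part of any library.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- Bytes of the string when it is properly encoded as UTF-8.
utf8Bytes :: String -> [Word8]
utf8Bytes = B.unpack . TE.encodeUtf8 . T.pack

-- Bytes of the string when every Char is cut down to its low 8 bits.
-- (Assumption: this mimics how the string reaches the C library.)
truncatedBytes :: String -> [Word8]
truncatedBytes = map (fromIntegral . fromEnum)

-- utf8Bytes      "a\160b"  gives [97,194,160,98]
-- truncatedBytes "a\160b"  gives [97,160,98]
```

In UTF-8, U+00A0 is the two-byte sequence 0xC2 0xA0 (194, 160). A lone 0xA0 is a continuation byte without a lead byte, which would match the "illegal byte sequence" message.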
Is there a specific reason you want to use regular expressions, and not simply map?
replace :: Char -> Char
replace '\160' = 'X'
replace c = c
test = map replace "a\160b" == "aXb"
Note that if you want to work with Unicode strings, it's probably easier to use the text package, which is designed to handle Unicode and is more efficient than String for larger strings.
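For what it's worth, here is a minimal sketch of the same replacement with Data.Text, assuming the text package is installed. The names nbsp, replaceNbsp and replaceNbspChar are just illustrative. No regex library or locale is involved, so it should behave the same under 'C' and 'en_US.UTF-8'.

```haskell
import qualified Data.Text as T

-- U+00A0 (non-breaking space), written '\160' in Haskell.
nbsp :: Char
nbsp = '\160'

-- Substring replacement: every occurrence of the one-character
-- needle is replaced by "X".
replaceNbsp :: T.Text -> T.Text
replaceNbsp = T.replace (T.singleton nbsp) (T.pack "X")

-- Character-level variant, the Text counterpart of mapping over a String.
replaceNbspChar :: T.Text -> T.Text
replaceNbspChar = T.map (\c -> if c == nbsp then 'X' else c)

-- replaceNbsp (T.pack "a\160b") == T.pack "aXb"
```

T.replace is the one to use if the replacement is longer than one character; T.map is enough for a one-to-one character swap.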