Matching a specific Unicode character in a Haskell regexp

This is a Mac OS X related problem!

I have the following three-character Haskell string:

"a\160b"

I want to match and replace the middle character.

Several approaches like

ghci> :m +Text.Regex
ghci> subRegex (mkRegex "\160") "a\160b" "X"
  "*** Exception: user error (Text.Regex.Posix.String died: (ReturnCode 17,"illegal byte sequence"))
ghci> subRegex (mkRegex "\\160") "a\160b" "X"
  "a\160b"

did not yield the desired result.

How do I need to modify the regexp or my environment to replace the '\160' with 'X'?

The problem seems to have its root in the locale/encoding of the input.

bash> locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

I have already modified my .bashrc to export the following environment variables:

bash> locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

But this did not change the behavior at all.


I was able to reproduce your problem by setting my locale to 'en_US.UTF-8'. (I am also using Mac OS X.)

bash> export LANG=en_US.UTF-8
bash> ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
Prelude> :m +Text.Regex
Prelude Text.Regex> subRegex (mkRegex "\160") "a\160b" "X"
"*** Exception: user error (Text.Regex.Posix.String died: (ReturnCode 17,"illegal byte sequence"))

Setting your locale to 'C' should fix the problem:

bash> export LANG=C
bash> ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
Prelude> :m +Text.Regex
Prelude Text.Regex> subRegex (mkRegex "\160") "a\160b" "X"
"aXb"

Unfortunately, I don't have an explanation as to why the locale is causing this problem.


Is there a specific reason you want to use regular expressions rather than simply using map?

replace :: Char -> Char
replace '\160' = 'X'
replace c      = c

test = map replace "a\160b" == "aXb"

Note that if you want to work with Unicode strings, it is probably easier to use the text package, which is designed to handle Unicode and is more efficient than String for larger strings.
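
For what it's worth, here is a minimal sketch of the same replacement using the text package, assuming it is installed and imported qualified as T. The helper names replaceNbsp and replaceNbsp' are made up for this example and are not part of the library; T.map and T.replace are the library functions doing the work.

```haskell
import qualified Data.Text as T

-- Replace every '\160' (U+00A0, no-break space) with 'X', one Char at a time.
-- replaceNbsp is a hypothetical helper name chosen for this sketch.
replaceNbsp :: T.Text -> T.Text
replaceNbsp = T.map swap
  where
    swap '\160' = 'X'
    swap c      = c

-- The same result via substring replacement.
-- T.replace takes the needle, the replacement, and then the haystack.
replaceNbsp' :: T.Text -> T.Text
replaceNbsp' = T.replace (T.pack "\160") (T.pack "X")

main :: IO ()
main = do
  let input = T.pack "a\160b"
  print (replaceNbsp input)   -- prints "aXb"
  print (replaceNbsp' input)  -- prints "aXb"
```

Neither version goes through the POSIX regex engine, so the locale setting should not matter here. T.map is enough for single characters; T.replace would be the one to reach for if the thing being replaced is longer than one character.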
