Matching a specific Unicode character in a Haskell regexp

2023-02-11 04:32 Q&A author:

This is a Mac OS X related problem!

I have the following three-character Haskell string:

"a\160b"

I want to match and replace the middle character.

Several approaches like

ghci> :m +Text.Regex
ghci> subRegex (mkRegex "\160") "a\160b" "X"
  "*** Exception: user error (Text.Regex.Posix.String died: (ReturnCode 17,"illegal byte sequence"))
ghci> subRegex (mkRegex "\\160") "a\160b" "X"
  "a\160b"

did not yield the desired result.

How do I need to modify the regexp or my environment to replace the '\160' with 'X'?

The problem seems to have its root in the locale/encoding of the input.

bash> locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

I already modified my .bashrc to export the following env-vars:

bash> locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

But this did not change the behavior at all.

I was able to reproduce your problem by setting my locale to 'en_US.UTF-8'. (I am also using Mac OS X.)

bash> export LANG=en_US.UTF-8
bash> ghci                   
GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
Prelude> :m +Text.Regex
Prelude Text.Regex> subRegex (mkRegex "\160") "a\160b" "X"
"*** Exception: user error (Text.Regex.Posix.String died: (ReturnCode 17,"illegal byte sequence"))

Setting your locale to 'C' should fix the problem:

bash> export LANG=C
bash> ghci                   
GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
Prelude> :m +Text.Regex
Prelude Text.Regex> subRegex (mkRegex "\160") "a\160b" "X"
"aXb"

Unfortunately, I don't have an explanation as to why the locale is causing this problem.

Is there a specific reason you want to use regular expressions rather than a simple map?

replace :: Char -> Char
replace '\160' = 'X'
replace c      = c

test = map replace "a\160b" == "aXb"

Note that if you want to work with Unicode strings, it's probably easier to use the text package, which is designed to handle Unicode and is more efficient than String for larger strings.
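As a rough illustration of the text-package route, here is a minimal sketch. It assumes the text package is installed; the names replaceNbsp and replaceNbspMap are made up for this example and are not part of any library. Neither function goes through the POSIX regex engine.

```haskell
import qualified Data.Text as T

-- Hypothetical helper: replace every occurrence of U+00A0
-- (decimal 160, the no-break space) with "X" using T.replace.
-- Argument order of T.replace is: needle, replacement, haystack.
replaceNbsp :: T.Text -> T.Text
replaceNbsp = T.replace (T.singleton '\160') (T.singleton 'X')

-- Hypothetical helper: the same substitution done one character
-- at a time, mirroring the String-based map version.
replaceNbspMap :: T.Text -> T.Text
replaceNbspMap = T.map (\c -> if c == '\160' then 'X' else c)
```

In ghci, something like replaceNbsp (T.pack "a\160b") == T.pack "aXb" can be used to check the result. T.replace is the more general choice if the pattern may grow beyond a single character, while T.map is enough for a one-character swap.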
