开发者

PCRE seems to be removing particular characters

I have a piece of text (part French part English) that has the European style Canadian Dollar symbols ($C) in it multiple times. When I attempt to use a regex using either traditional or unicode characters, the symbols have been removed from the text and cannot be matched with. I used a lazy regex so that if it doesn't find the expected symbols it still works.

Additionally the tex开发者_高级运维t is in an xml utf-8 doc and being displayed from a web interface(made in house).


Escape the $ inside the RegExp, the dollar-sign has a special meaning in RegExp.


In perl, regex's and code are displayed in ascii, but if you want to embed unicode in your text, first you have to have an editor that does unicode, second you have to tell Perl your source code contains unicode (with a use utf8' pragma).

If you don't want to do that you can embed (in Perl) code points in strings (regex's) with a construct like this $regex = /this is some text, this: is \x{1209} a codepoint unicode character/;

It matches the character IF the data source is decoded Unicode (internalized) and contains that character.

Edit - I don't think there is a unicode for canadian dollar, rather '$C', like someone said you have to escape the $ if the regex is interpolated. If you keep the $C, the character class [$C] matches $ or C, not the combination. Maybe (?:\$|\$C) would be a better anchor.


The issue turned out to be a bug in code just before i called eval(). Something in the french unicode was screwing with the code passed to eval, so by not combining the text and regex it worked fine.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜