How to represent Unicode characters in an ASCII regex pattern?

2023-02-06 09:26 问答作者：

RegEx flavor: wxRegEx in C++.

One of the strings that I need to match contains characters like '…' (U+2026, Horizontal Ellipsis) which translates to \205 when pasted to Emac开发者_如何转开发s and '»' (U+00BB, Right-Pointing Double Angle Quotation Mark) which remains » when pasted to Emacs (ASCII source code mode).

In the regex pattern itself I tried representing '…' as both \205 and \\205 to no avail.

What is the right way of approaching this problem?

Update: The wxRegEx documentation states that to represent a Unicode character you use \uwxyz (where wxyz is exactly four hexadecimal digits) the Unicode character U+wxyz in the local byte ordering.

I tried that, but for some reason it doesn't work for me (yet).

It depends on the language. In many languages there’s no need to escape non-ASCII, but you may have to tell the compiler what encoding the source is in. For example:

$ java -encoding UTF-8 SomeThing.java

$ perl -Mutf8 somescript

Although with things like Perl, Python, and Ruby, you can put the declaration inside the file, providing it’s upwards compatible with ASCII. For example:

#!/usr/bin/perl

use utf8;
use strict;
use warnings;
use autodie;

my $s = "Où se trouve mon élève?";

if ($s =~ /élève/) { ... }

# although of course this also works fine:

while ($s =~ /\b(\w+)\b/g) {
     print "Found <$1>\n";  
}

That’s the easiest way to do it, and I highly recommend it: just put the real UTF-8 characters in your source code. If you have to figure out to escape things, well, it’s far less convenient.

If you are going to use escapes, well, how you specify non-ASCII symbolically also varies by language. In Java you can use the asquerous Java preprocessor via \uXXXX:

String s = "e\u0301le\u0300ve";

although I do not recommend that way. If it’s going to be used in a pattern, you can delay interpolation, which is cleaner and messier at the same time:

String s = "e\\u0301le\\u0300ve";

That second mechanism spares you from the trying to figure out what it is after the Java preprocessor has its way with it (you can’t use \u0022 but can use \\0022), but then it screws up your Pattern.CANON_EQ flag.

Most other languages have a more straightforward way to do it that Java — which also insists on ugly UTF-16 unless you use java -encoding UTF-8 for your source. Hardcoding UTF-16 surrogates is absolutely idiotic. Do not do it!!

In Perl you could use:

my $s = "e\x{301}le\x{300}ve";  # NFD form
my $s = "\xE9l\xE8ve";          # NFC form

but you can also name them symbolically

use charnames qw< :full >;
my $s_as_NFD = "e\N{COMBINING ACUTE ACCENT}le\N{COMBINING GRAVE ACCENT}e";
my $s_as_NFC = "\N{LATIN SMALL LETTER E WITH ACUTE}l\N{LATIN SMALL LETTER E WITH GRAVE}ve";

The last one can be made much shorter if you’d prefer:

use charnames qw< :full latin >;
my $s_as_NFC = "\N{e WITH ACUTE}l\N{e WITH GRAVE}ve";

All of those are just about infinitely superior to hardcoding magic numbers into your code.

This all assumes your language supports Unicode, but many do not.

继续阅读：ascii escaping regex unicode wxwidgets

How to represent Unicode characters in an ASCII regex pattern?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？