开发者

How to represent Unicode characters in an ASCII regex pattern?

RegEx flavor: wxRegEx in C++.

One of the strings that I need to match contains characters like '' (U+2026, Horizontal Ellipsis) which translates to \205 when pasted to Emac开发者_如何转开发s and '»' (U+00BB, Right-Pointing Double Angle Quotation Mark) which remains » when pasted to Emacs (ASCII source code mode).

In the regex pattern itself I tried representing '' as both \205 and \\205 to no avail.

What is the right way of approaching this problem?

Update: The wxRegEx documentation states that to represent a Unicode character you use \uwxyz (where wxyz is exactly four hexadecimal digits) the Unicode character U+wxyz in the local byte ordering.

I tried that, but for some reason it doesn't work for me (yet).


It depends on the language. In many languages there’s no need to escape non-ASCII, but you may have to tell the compiler what encoding the source is in. For example:

$ java -encoding UTF-8 SomeThing.java

or

$ perl -Mutf8 somescript

Although with things like Perl, Python, and Ruby, you can put the declaration inside the file, providing it’s upwards compatible with ASCII. For example:

#!/usr/bin/perl

use utf8;
use strict;
use warnings;
use autodie;

my $s = "Où se trouve mon élève?";

if ($s =~ /élève/) { ... }

# although of course this also works fine:

while ($s =~ /\b(\w+)\b/g) {
     print "Found <$1>\n";  
}

That’s the easiest way to do it, and I highly recommend it: just put the real UTF-8 characters in your source code. If you have to figure out to escape things, well, it’s far less convenient.

If you are going to use escapes, well, how you specify non-ASCII symbolically also varies by language. In Java you can use the asquerous Java preprocessor via \uXXXX:

String s = "e\u0301le\u0300ve";

although I do not recommend that way. If it’s going to be used in a pattern, you can delay interpolation, which is cleaner and messier at the same time:

String s = "e\\u0301le\\u0300ve";

That second mechanism spares you from the trying to figure out what it is after the Java preprocessor has its way with it (you can’t use \u0022 but can use \\0022), but then it screws up your Pattern.CANON_EQ flag.

Most other languages have a more straightforward way to do it that Java — which also insists on ugly UTF-16 unless you use java -encoding UTF-8 for your source. Hardcoding UTF-16 surrogates is absolutely idiotic. Do not do it!!

In Perl you could use:

my $s = "e\x{301}le\x{300}ve";  # NFD form
my $s = "\xE9l\xE8ve";          # NFC form

but you can also name them symbolically

use charnames qw< :full >;
my $s_as_NFD = "e\N{COMBINING ACUTE ACCENT}le\N{COMBINING GRAVE ACCENT}e";
my $s_as_NFC = "\N{LATIN SMALL LETTER E WITH ACUTE}l\N{LATIN SMALL LETTER E WITH GRAVE}ve";

The last one can be made much shorter if you’d prefer:

use charnames qw< :full latin >;
my $s_as_NFC = "\N{e WITH ACUTE}l\N{e WITH GRAVE}ve";

All of those are just about infinitely superior to hardcoding magic numbers into your code.

This all assumes your language supports Unicode, but many do not.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜