Remove all exclusive Latin characters using regex

2023-02-17 10:16

I'm developing software in Portuguese, so many of my entities have names like 'maça' or 'lição', and I want to use the entity name as a resource key. So I want to keep every character except 'ç', 'ã', 'õ' and so on.

Is there an optimal solution using regex? My current regex is (as suggested in "Remove characters using Regex"):

Regex regex = new Regex(@"[\W_]+");
string cleanText = regex.Replace(messyText, "").ToUpper();

Just to emphasize: I'm only concerned with Latin characters.

A simple option is to white-list the accepted characters:

string clean = Regex.Replace(messy, @"[^a-zA-Z0-9!@#]+", "");

If you want to remove all non-ASCII letters but keep all other characters, you can use character class subtraction:

string clean = Regex.Replace(messy, @"[\p{L}-[a-zA-Z]]+", "");

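It can also be written as the more standard but more complicated [^\P{L}a-zA-Z]+, which reads "match every character that is neither a non-letter nor an ASCII letter", which leaves exactly the letters we're looking for. (A similar pattern can be built on \W, as [^\Wa-zA-Z0-9_]+, although it also matches other non-ASCII word characters such as digits and combining marks.)
Just some context for \W: it stands for "not a word character". In ASCII terms, that means anything other than a-z, A-Z, 0-9 and underscore _. In .NET, however, \w is Unicode-aware by default, so letters such as ç and ã count as word characters, and \W does not match them unless RegexOptions.ECMAScript is set.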
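To make the difference between these patterns concrete, here is a minimal sketch that runs the question's [\W_]+ pattern (with and without RegexOptions.ECMAScript) and the character class subtraction on the same input. The class and method names are hypothetical, chosen only for illustration, and the input is written with \u escapes so the example does not depend on the source file's encoding.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordClassSketch
{
    // The pattern from the question: .NET's \w is Unicode-aware by default,
    // so accented letters are word characters and are NOT removed.
    public static string StripWithUnicodeWord(string text)
    {
        return Regex.Replace(text, @"[\W_]+", "");
    }

    // Same pattern, but ECMAScript mode limits \w to [a-zA-Z0-9_],
    // so accented letters now match \W and are removed.
    public static string StripWithAsciiWord(string text)
    {
        return Regex.Replace(text, @"[\W_]+", "", RegexOptions.ECMAScript);
    }

    // Character class subtraction: any letter except the ASCII letters.
    public static string StripNonAsciiLetters(string text)
    {
        return Regex.Replace(text, @"[\p{L}-[a-zA-Z]]+", "");
    }

    public static void Main()
    {
        string messy = "li\u00E7\u00E3o_1"; // "lição_1"

        // The accented letters survive the first call only.
        Console.WriteLine(StripWithUnicodeWord(messy) == "li\u00E7\u00E3o1"); // True
        Console.WriteLine(StripWithAsciiWord(messy));   // lio1
        Console.WriteLine(StripNonAsciiLetters(messy)); // lio_1
    }
}
```

Note that the subtraction pattern leaves the underscore and the digit alone, because it only targets letters.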

You may also find the following approach more useful: "How do I remove diacritics (accents) from a string in .NET?"
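For reference, a common way to remove diacritics in .NET is to decompose the string and drop the combining marks. The sketch below is one such approach, not necessarily the one given in the linked question; RemoveDiacritics and the class name are hypothetical names used for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticsSketch
{
    // Hypothetical helper: decompose each accented letter into a base letter
    // plus combining marks (FormD), then drop the combining marks.
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }

        // Recompose whatever is left into the usual precomposed form.
        return builder.ToString().Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        string name = "li\u00E7\u00E3o"; // "lição"
        Console.WriteLine(RemoveDiacritics(name).ToUpperInvariant()); // LICAO
    }
}
```

Compared with deleting the accented letters, this keeps the base letters, so keys such as LICAO stay readable and are less likely to collide.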

Another option might be to convert from Unicode to ASCII. This will not drop characters, but convert them to ?s. That might be better than dropping them (for use as keys).

string suspect = "lição";
byte[] suspectBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, Encoding.Unicode.GetBytes(suspect));
string purged = Encoding.ASCII.GetString(suspectBytes);
Console.WriteLine(purged); // li??o

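Note that each question mark stands in for a character that ASCII cannot represent, so the string keeps its original length, and you may get fewer collisions than if the characters were dropped. Different accented characters still map to the same ?, though.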

Does this work?

Regex regex = new Regex(@"[^a-zA-Z0-9_]");

I think the best regex to use would be:

[^\x00-\x7F]

This is the negation of all ASCII characters, so it matches every non-ASCII character. \x00 and \x7F (127) are hexadecimal character codes, and - means a range. The ^ inside the [ and ] means negation.

Replace the matches with the empty string, and you should have what you want. It also frees you from worrying about non-ASCII punctuation and the like, which can cause subtle but annoying (and hard to track down) errors.

If you want to accept the extended (Latin-1) set as legal characters, you can use \xFF instead of \x7F. Note that this range includes ç, ã and õ, so they would no longer be removed.
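As a rough illustration, the sketch below strips everything outside the ASCII range, taking ASCII to be 0x00 through 0x7F, and then shows what happens when the upper bound is raised to \xFF. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonAsciiSketch
{
    // Remove every character outside the ASCII range 0x00-0x7F.
    public static string StripNonAscii(string text)
    {
        return Regex.Replace(text, @"[^\x00-\x7F]+", "");
    }

    // Remove every character outside 0x00-0xFF (the Latin-1 range).
    public static string StripNonLatin1(string text)
    {
        return Regex.Replace(text, @"[^\x00-\xFF]+", "");
    }

    public static void Main()
    {
        string messy = "ma\u00E7a, li\u00E7\u00E3o!"; // "maça, lição!"

        Console.WriteLine(StripNonAscii(messy)); // maa, lio!

        // ç (U+00E7) and ã (U+00E3) fall inside 0x80-0xFF, so the wider
        // range leaves the input unchanged.
        Console.WriteLine(StripNonLatin1(messy) == messy); // True
    }
}
```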

The goal should be to simply include ASCII characters A-Z and numbers and punctuation. Just exclude everything outside of that range using RegEx.

string clean = Regex.Replace(messy, @"[^\x20-\x7e]", String.Empty);

To be clear, the regex I'm using is:

[^\x20-\x7e]

You may need to escape the \ character - I haven't tested this in anything but RegEx buddy :)

That excludes everything outside the ASCII characters 0x20 through 0x7e, which is the printable ASCII range, decimal 32-126.
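One consequence of starting the range at 0x20 is that control characters such as tabs and newlines are removed along with the accented letters. A minimal sketch, with a hypothetical helper name:

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrintableAsciiSketch
{
    // Keep only printable ASCII (space through tilde); drop everything else.
    public static string KeepPrintableAscii(string text)
    {
        return Regex.Replace(text, @"[^\x20-\x7e]", String.Empty);
    }

    public static void Main()
    {
        // "lição", a tab, then "1": the two accented letters and the tab are removed.
        Console.WriteLine(KeepPrintableAscii("li\u00E7\u00E3o\t1")); // lio1
    }
}
```

The pattern is written as a verbatim string (@"..."), so the backslashes reach the regex engine without any extra escaping.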

Good luck!

Best,

-Auri

This is more useful to me:

([\p{L}]+)
