Remove all exclusive Latin characters using regex
I'm developing software in Portuguese, so many of my entities have names like 'maça' or 'lição', and I want to use the entity name as a resource key. So I want to keep every character except 'ç', 'ã', 'õ' and so on.
Is there an optimal solution using regex? My current regex (as suggested in "Remove characters using Regex") is:
Regex regex = new Regex(@"[\W_]+");
string cleanText = regex.Replace(messyText, "").ToUpper();
Just to emphasize: I'm only concerned with Latin characters.
A simple option is to white-list the accepted characters:
string clean = Regex.Replace(messy, @"[^a-zA-Z0-9!@#]+", "");
If you want to remove all non-ASCII letters but keep all other characters, you can use character class subtraction:
string clean = Regex.Replace(messy, @"[\p{L}-[a-zA-Z]]+", "");
It can also be written as the more standard but more complicated [^\P{L}a-zA-Z]+ (or, more loosely, with \W in place of \P{L}), which reads "select all characters that are neither non-letters nor ASCII letters", which ends up with the letters we're looking for.
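Just some context for \W: it stands for "not a word character". With RegexOptions.ECMAScript that means anything other than a-z, A-Z, 0-9 and underscore _. By default, though, .NET also counts Unicode letters such as ç and ã as word characters, so [\W_]+ on its own leaves them in place.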
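To make the difference between these patterns concrete, here is a minimal sketch. The class and method names are made up for illustration; only the patterns come from the posts above. It assumes the input uses precomposed characters (ç as a single code point), which is what you normally get from keyboard input.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class NonAsciiLetterDemo
{
    // Character class subtraction: any Unicode letter except ASCII a-z / A-Z.
    public static string StripBySubtraction(string input) =>
        Regex.Replace(input, @"[\p{L}-[a-zA-Z]]+", "");

    // The same set written as a negated class:
    // "not (a non-letter or an ASCII letter)".
    public static string StripByNegation(string input) =>
        Regex.Replace(input, @"[^\P{L}a-zA-Z]+", "");

    // The pattern from the question with .NET's default rules:
    // Unicode letters are word characters, so they are NOT matched by \W.
    public static string StripNonWordDefault(string input) =>
        Regex.Replace(input, @"[\W_]+", "");

    // The same pattern with ECMAScript rules:
    // only a-z, A-Z, 0-9 and _ are word characters.
    public static string StripNonWordEcmaScript(string input) =>
        Regex.Replace(input, @"[\W_]+", "", RegexOptions.ECMAScript);
}
```

For the input "lição 1!", both StripBySubtraction and StripByNegation should give "lio 1!". For "lição_1!", StripNonWordDefault should give "lição1" while StripNonWordEcmaScript should give "lio1", which is probably what the question was after.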
You may also consider the following approach more useful: How do I remove diacritics (accents) from a string in .NET?
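A rough sketch of that diacritics approach, in case it helps: decompose the string, drop the combining marks, recompose. This is the commonly used normalization technique, not necessarily the exact code from the linked question; the names RemoveDiacritics and ToResourceKey are hypothetical, and the use of ToUpperInvariant is my assumption about how the key should be cased.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class DiacriticsDemo
{
    public static string RemoveDiacritics(string input)
    {
        // FormD splits accented letters into base letter + combining mark,
        // e.g. 'ç' becomes 'c' followed by a combining cedilla.
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Keep everything except the combining (non-spacing) marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }

        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Assumed key format: accent-free and upper-cased.
    public static string ToResourceKey(string entityName) =>
        RemoveDiacritics(entityName).ToUpperInvariant();
}
```

With this, "lição" should become "licao" (key "LICAO") instead of "lio", so the key stays readable and two names that differ only in the base letters under the accents do not collapse into the same key.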
Another option might be to convert from Unicode to ASCII. This will not drop characters, but convert them to ?s. That might be better than dropping them (for use as keys).
string suspect = "lição";
byte[] suspectBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, Encoding.Unicode.GetBytes(suspect));
string purged = Encoding.ASCII.GetString(suspectBytes);
Console.WriteLine(purged); // li??o
Note that each question mark stands in for a character that ASCII cannot represent, so the key keeps its original length and you may get fewer collisions than by dropping those characters outright.
Does this work?
Regex regex = new Regex(@"[^a-zA-Z0-9_]");
I think the best regex to use would be:
[^\x00-\x7F]
This is the negation of all ASCII characters, so it matches all non-ASCII characters: \x00 and \x7F (127) are hexadecimal character codes, and - means range. The ^ inside the [ and ] means negation.
Replace the matches with the empty string, and you should have what you want. It also frees you from worrying about non-ASCII punctuation and the like, which can cause subtle but annoying (and hard to track down) errors.
If you want to treat the extended ASCII set as legal characters, you can use \xFF instead of \x7F.
The goal should be simply to keep ASCII letters, numbers and punctuation. Just exclude everything outside of that range using a regex.
string clean = Regex.Replace(messy, @"[^\x20-\x7e]", String.Empty);
To be clear, the regex I'm using is:
[^\x20-\x7e]
You may need to escape the \ character; I haven't tested this in anything but RegexBuddy :)
That excludes everything outside the ASCII range 0x20 to 0x7e, which is decimal 32-126 (the printable characters).
Good luck!
Best,
-Auri
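For what it's worth, the [^\x20-\x7e] filter above can be wrapped up like this. It is only a sketch; the class and method names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class PrintableAsciiDemo
{
    // Keep only U+0020 (space) through U+007E (~) and drop everything else,
    // including tabs, newlines and other control characters.
    public static string KeepPrintableAscii(string input) =>
        Regex.Replace(input, @"[^\x20-\x7E]", string.Empty);
}
```

For example, KeepPrintableAscii("lição\t#1") should give "lio#1": the ç and ã are removed, and so is the tab, because control characters sit below 0x20. If you need to keep tabs or line breaks, the range has to be widened accordingly.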
This is more useful to me:
([\p{L}]+)