Regular Expression: Split English and Non-English words with Comma?
Is there any regular expression pattern to change this string
This is a mix string of üößñ and English. üößñ üößñ are Unicode words.
to this?
This is a mix string of, üößñ, and English., üößñ üößñ, are Unicode words.
Actually, I want to split English words and non-English words with comma.
Thanks.
javascript
/((?: [^\x00-\x7F]+)+)/g
'This is a mix string of üößñ and English. üößñ üößñ are Unicode words.'.replace(/((?: [^\x00-\x7F]+)+)/g, ',$1,')
This is a mix string of, üößñ, and English., üößñ üößñ, are Unicode words.
Mark
No regular expression can detect which language a string is written in, but you can certainly match characters inside (or outside) a range of code points by using Unicode escapes, such as
/[\u0900-\u097F]+/
which matches a sequence of Devanagari characters.
Remember that a Script (a collection of characters) can be used by many languages.
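As a rough illustration of matching by code-point range, the sketch below pulls every run of Devanagari out of a mixed string, using the same range as the pattern above. The function name extractDevanagari and the sample string are made up for this example.

```javascript
// Hypothetical helper: returns every run of Devanagari characters in the input.
function extractDevanagari(text) {
  // U+0900 to U+097F is the Devanagari block, as in the pattern above.
  const matches = text.match(/[\u0900-\u097F]+/g);
  // String.prototype.match returns null when nothing matches.
  return matches === null ? [] : matches;
}

const sample = "Hello नमस्ते world हिन्दी";
console.log(extractDevanagari(sample)); // logs the two Devanagari runs
```

Note that this only tells you the script, not the language: the same range would match text in any language written in Devanagari.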
Sure, you can use \x escapes to match characters outside the ASCII range.
For example (in JavaScript):
var x = "This is a mix string of üößñ and English. üößñ üößñ are Unicode characters.";
// [^\x00-\x7F] matches any character outside the ASCII range (0-127)
var result = x.replace(/([^\x00-\x7F]+\s)+/g, function(match) { return match.slice(0,-1)+", "; } );
Output:
This is a mix string of üößñ, and English. üößñ üößñ, are Unicode characters.
I'm sure another regex savvy person can optimize further, but this is the best I can think of half-awake :)
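If you want commas on both sides of each run, as in the question, another option is to split instead of replace. This is only a sketch, assuming "non-English" just means "outside the ASCII range"; the function name splitByAscii is hypothetical.

```javascript
// Hypothetical helper: splits text into alternating ASCII / non-ASCII segments.
// Assumes "non-English" simply means "outside the ASCII range".
function splitByAscii(text) {
  return text
    // The capturing group makes split() keep the non-ASCII runs in the result.
    .split(/((?:[^\x00-\x7F]+\s*)+)/)
    // Drop the whitespace left at the edges of each segment.
    .map(function (segment) { return segment.trim(); })
    // Remove empty segments (e.g. when the text starts with a non-ASCII run).
    .filter(function (segment) { return segment.length > 0; });
}

var text = "This is a mix string of üößñ and English. üößñ üößñ are Unicode words.";
console.log(splitByAscii(text).join(", "));
// This is a mix string of, üößñ, and English., üößñ üößñ, are Unicode words.
```

Because the segments come back as an array, you can also join them with something other than a comma, or process the two kinds of segment separately.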
String s = "This is a mix string of üößñ and English. üößñ üößñ are Unicode words.";
System.out.println(s.replaceAll("((?: ?[\\p{L}&&[^A-Za-z]]+)+)", ",$1,"));
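Unicode defines a large number of scripts (well over a hundred in current versions). The pattern above does not try to identify any of them; it simply matches any Unicode letter (\p{L}) other than the ASCII letters A-Z and a-z.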
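Since the question is tagged javascript, here is roughly how the same idea might look there. JavaScript has no && class intersection like Java, so this sketch uses a negative lookahead instead; it assumes an engine that supports Unicode property escapes with the u flag (ES2018 or later), and the function name is hypothetical.

```javascript
// Hypothetical helper: wraps runs of non-ASCII letters in commas.
// Assumes an engine with Unicode property escapes (ES2018 or later).
function commaWrapNonAsciiLetters(text) {
  // (?![A-Za-z])\p{L} : any Unicode letter except the ASCII letters,
  // approximating Java's [\p{L}&&[^A-Za-z]].
  var pattern = /((?: ?(?:(?![A-Za-z])\p{L})+)+)/gu;
  return text.replace(pattern, ",$1,");
}

var s = "This is a mix string of üößñ and English. üößñ üößñ are Unicode words.";
console.log(commaWrapNonAsciiLetters(s));
// This is a mix string of, üößñ, and English., üößñ üößñ, are Unicode words.
```

Unlike the \x-based patterns, this leaves non-ASCII punctuation and symbols alone, because \p{L} only covers letters.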