Regex to count English words as a single char inside a char count of Asian words
Need some help from a regex Jedi master:
If I have a string of multibyte characters (specifically Japanese, Korean or Chinese) with English words sprinkled throughout, I would like to count:
- asian characters as 1 per single char
- English "words" (no dictionary check needed - just a string of consecutive English letters) as a single char.
English only is fine - don't worry about special Spanish, Swedish, etc. chars.
I am looking for a regex pattern I can use to count these strings that will work in both PHP and JS.
Example:
これは猫です、けどKittyも大丈夫。
should count as 13 chars.
thanks for your help!
jeff
Whatever you are trying to achieve, this should help:
To count only Hiragana+Katakana+Kanji (Japanese) Chars (excluding punctuation marks):
var x = "これは猫です、けどKittyも大丈夫。";
x.match(/[ぁ-ゖァ-ヺー一-龯々]/g).length; //Result: 12 : これは猫ですけども大丈夫
Updated:
To count only words written in the Latin alphabet (note that \w also matches digits and underscores):
x.match(/\w+/g).length; //Result: 1 : "Kitty"
All in one line (as function):
function myCount(str) {
    // match() returns null when nothing matches, so fall back to an empty array
    return (str.match(/[ぁ-ゖァ-ヺー一-龯々]|\w+/g) || []).length;
}
alert(myCount("これは猫です、けどKittyも大丈夫。")); //13
alert(myCount("これは犬です。DogとPuppyもOKですね!")); //14
These are the arrays returned by match:
["こ", "れ", "は", "猫", "で", "す", "け", "ど", "Kitty", "も", "大", "丈", "夫"]
["こ", "れ", "は", "犬", "で", "す", "Dog", "と", "Puppy", "も", "OK", "で", "す", "ね"]
Updated (Japanese, Korean, Chinese):
function myCount(str) {
    // match() returns null when nothing matches, so fall back to an empty array
    return (str.match(/[ぁ-ㆌㇰ-䶵一-鿃々가-힣-豈ヲ-ン]|\w+/g) || []).length;
}
This should cover around 99% of Japanese, Chinese and Korean characters. You may need to manually add extra characters that are not included, such as "〶".
A very good reference is:
http://www.tamasoft.co.jp/en/general-info/unicode.html
This should solve your question.
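If your JavaScript engine supports Unicode property escapes (ES2018, an assumption about your environment), the hand-maintained ranges could be swapped for script properties. This is only a sketch of that alternative, not the method used above, and the name countByScript is made up for illustration:

```javascript
// Hypothetical helper, not part of the answer above. Relies on ES2018
// Unicode property escapes, which require the "u" flag.
function countByScript(str) {
    // One Han/Hiragana/Katakana/Hangul character, or one run of ASCII letters.
    // "ー" (U+30FC) is listed explicitly because its script is Common.
    const unit = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}ー]|[A-Za-z]+/gu;
    // match() returns null when nothing matches
    return (str.match(unit) || []).length;
}

console.log(countByScript("これは猫です、けどKittyも大丈夫。"));
console.log(countByScript("これは犬です。DogとPuppyもOKですね!"));
```

For the two sample strings this should print 13 and 14. Unlike \w+, the [A-Za-z]+ branch ignores digits and underscores, which matches the "consecutive English letters" wording of the question.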
OK, so I would do two passes: first count the English words, then the Asian characters. This is a JS example; it might behave differently in PHP. In JS, \w matches only ASCII letters, digits and the underscore.
string = "これは猫です、けどKittyも大丈夫";
var m = string.match(/\w+/gm);
var e_count = m.length; // is 1
Next count the Asian chars.
m = string.match(/([^\w\s\d])/gm); // any non-whitespace, non-word, non-digit chars
var a_count = m.length; // is 13
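You will have to tweak it a bit: as written, the second pattern also matches punctuation such as 、, so e_count + a_count comes to 14 here rather than 13. Once punctuation is excluded from the second pattern, you can add up e_count and a_count in JS and you should be good to go.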
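A minimal sketch of the two-pass idea with punctuation excluded, assuming an engine with ES2018 Unicode property escapes. The function name countTwoPass is hypothetical:

```javascript
// Hypothetical two-pass variant of the snippet above.
function countTwoPass(str) {
    // Pass 1: runs of ASCII letters, each run counted once.
    const words = str.match(/[A-Za-z]+/g) || [];
    // Pass 2: non-ASCII characters that are neither whitespace nor
    // punctuation. \p{P} needs the "u" flag (ES2018).
    const asian = str.match(/[^\x00-\x7F\s\p{P}]/gu) || [];
    return words.length + asian.length;
}

console.log(countTwoPass("これは猫です、けどKittyも大丈夫。"));
```

Note that the second pass counts any non-ASCII letter, not only CJK ones, so accented Latin characters would each count as one as well.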
Also check out Rubular: http://www.rubular.com
- Johannes
Something like /[[:ascii:]]+|./
will match one non-ASCII character or one or more ASCII characters. Problem is that'll give 15. So it seems that you want to ignore punctuation. So possibly: /[A-Za-z]+|[^[:punct:]]/
$ perl -E 'use utf8; $f = "これは猫です、けどKittyも大丈夫。"; ++$c while $f =~ /[A-Za-z]+|[^[:punct:]]/g; say $c'
13
So, that works in Perl at least. Probably in PHP as well, provided its [[:punct:]] understands Unicode. JavaScript regexes have no POSIX classes such as [[:punct:]], so the pattern needs adapting there.
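One way the Perl pattern might be adapted for JavaScript, sketched under the assumption that Unicode property escapes (\p{P} with the u flag, ES2018) are available. \p{P} is only a rough stand-in for [[:punct:]], and countSinglePass is a made-up name:

```javascript
// Hypothetical JavaScript port of /[A-Za-z]+|[^[:punct:]]/.
function countSinglePass(str) {
    // A run of ASCII letters, or any single character that is not
    // Unicode punctuation. As in the Perl version, whitespace and
    // digits each count as one.
    const unit = /[A-Za-z]+|[^\p{P}]/gu;
    return (str.match(unit) || []).length;
}

console.log(countSinglePass("これは猫です、けどKittyも大丈夫。"));
```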
The alternative approach is to filter out stuff instead.
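A rough sketch of what filtering could look like in JavaScript: collapse each English word to one placeholder, strip punctuation and whitespace, then count what is left. The name countByFiltering and the "x" placeholder are illustrative choices, and \p{P} again assumes ES2018 support:

```javascript
// Hypothetical filter-based counter.
function countByFiltering(str) {
    const filtered = str
        .replace(/[A-Za-z]+/g, "x")   // each run of ASCII letters becomes one char
        .replace(/[\p{P}\s]/gu, "");  // drop punctuation and whitespace
    // Array.from splits by code point, so characters outside the BMP
    // are not counted twice.
    return Array.from(filtered).length;
}

console.log(countByFiltering("これは猫です、けどKittyも大丈夫。"));
```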