
Regex to count English words as a single char within a char count of Asian text

need some help from a regex jedi master:

If I have a string of multibyte chars (specifically Japanese, Korean or Chinese) with English words sprinkled throughout, I would like to count:

  • Asian characters as 1 per single char
  • English "words" (no dictionary check needed - just a string of consecutive English letters) as a single char.

English only is fine - don't worry about special Spanish, Swedish, etc. chars.

I am searching for a regex pattern I can use to count these strings, one that will work in both PHP and JS.

Example:

これは猫です、けどKittyも大丈夫。

should count as 13 chars.

thanks for your help!

jeff


Whatever you are trying to achieve, this will help you:

To count only Hiragana+Katakana+Kanji (Japanese) Chars (excluding punctuation marks):

var x = "これは猫です、けどKittyも大丈夫。";
x.match(/[ぁ-ゖァ-ヺー一-龯々]/g).length; //Result: 12 : これは猫ですけども大丈夫

Updated:

To count only words written in the Latin alphabet (note that \w also matches digits and underscores):

x.match(/\w+/g).length; //Result: 1 : "Kitty"

All in one line (as function):

function myCount(str) {
   return str.match(/[ぁ-ゖァ-ヺー一-龯々]|\w+/g).length;
}
alert(myCount("これは猫です、けどKittyも大丈夫。")); //13    
alert(myCount("これは犬です。DogとPuppyもOKですね!")); //14

These are the arrays returned by match:

["こ", "れ", "は", "猫", "で", "す", "け", "ど", "Kitty", "も", "大", "丈", "夫"]
["こ", "れ", "は", "犬", "で", "す", "Dog", "と", "Puppy", "も", "OK", "で", "す", "ね"]

Updated (JAP, KOR, CH):

function myCount(str) {
   return str.match(/[ぁ-ㆌㇰ-䶵一-鿃々가-힣-豈ヲ-ン]|\w+/g).length;
}

This will cover around 99% of Japanese, Chinese and Korean text. You may need to manually add extra characters that are not included, such as "〶".
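As an alternative to hand-picked ranges, engines that support ES2018 Unicode property escapes can name the scripts directly. The following is only a sketch under that assumption: the function name countUnits is made up for illustration, [A-Za-z]+ is used instead of \w+ so that digits and underscores are not treated as words, and ー is listed explicitly because Unicode classifies it as a Common-script character. It also guards against match returning null when nothing matches.

```javascript
// Hypothetical helper: counts each CJK character as 1 and each run of
// ASCII letters as 1. Requires the "u" flag (ES2018 property escapes).
function countUnits(str) {
  const pattern =
    /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}ー]|[A-Za-z]+/gu;
  const matches = str.match(pattern);
  // match() returns null when there is no match at all
  return matches === null ? 0 : matches.length;
}

console.log(countUnits("これは猫です、けどKittyも大丈夫。")); // 13
console.log(countUnits("これは犬です。DogとPuppyもOKですね!")); // 14
```

Punctuation such as 、 and 。 is skipped because it belongs to none of the listed scripts.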

A very good reference is:

http://www.tamasoft.co.jp/en/general-info/unicode.html

This should solve your question.


OK, so I would do two runs: First count the occurrences of the English words and then of the Asian ones. This is a JS example, it might be different in PHP. In JS, only ASCII chars match \w.

var string = "これは猫です、けどKittyも大丈夫";
var m = string.match(/\w+/gm);
var e_count = m.length; // is 1

Next count the Asian chars.

m = string.match(/([^\w\s\d])/gm); // any non-whitespace, non-word, non-digit chars
var a_count = m.length; // is 13 (this includes the punctuation mark 、)

You might have to tweak it a bit. But in JS, you can add up e_count and a_count, and you should be good to go.
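One possible tweak, sketched below, keeps the two-run idea but leaves punctuation out of the second run. The function name countTwoPass is made up for illustration, and the sketch assumes an engine with ES2018 Unicode property escapes (\p{P} with the u flag) and that punctuation should not be counted.

```javascript
// Hypothetical two-pass counter based on the approach described above.
function countTwoPass(str) {
  // Pass 1: runs of ASCII letters, one unit per run.
  const words = str.match(/[A-Za-z]+/g) || [];
  // Pass 2: single non-ASCII characters that are neither whitespace
  // nor Unicode punctuation.
  const asian = str.match(/[^\x00-\x7F\s\p{P}]/gu) || [];
  return words.length + asian.length;
}

console.log(countTwoPass("これは猫です、けどKittyも大丈夫。")); // 13
```

The `|| []` fallback avoids a TypeError when either run finds no match, since match returns null in that case.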

Also check out Rubular: http://www.rubular.com

  • Johannes


Something like /[[:ascii:]]+|./ will match one non-ASCII character or one or more ASCII characters. Problem is that'll give 15. So it seems that you want to ignore punctuation. So possibly: /[A-Za-z]+|[^[:punct:]]/

$ perl -E 'use utf8; $f = "これは猫です、けどKittyも大丈夫。"; ++$c while $f =~ /[A-Za-z]+|[^[:punct:]]/g; say $c'
13

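So, that works in Perl at least. It may work in PHP as well, provided its [[:punct:]] understands Unicode. JavaScript regexes have no POSIX bracket classes, so there you would need something like the Unicode property escape \p{P} (with the u flag) instead.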

The alternative approach is to filter out stuff instead.
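A rough JavaScript sketch of that filtering idea: collapse every English word to a single placeholder character, strip whitespace and punctuation, and take the length of what is left. The function name countByFiltering is made up for illustration, and the sketch assumes ES2018 Unicode property escapes are available. Digits and symbols are not filtered here, so each one would still count as one char.

```javascript
// Hypothetical filter-based counter: remove what should not be counted,
// then measure what remains.
function countByFiltering(str) {
  const collapsed = str
    .replace(/[A-Za-z]+/g, "x")   // each English word becomes one placeholder
    .replace(/[\s\p{P}]/gu, "");  // drop whitespace and Unicode punctuation
  // Array.from splits by code point, so astral characters count once
  return Array.from(collapsed).length;
}

console.log(countByFiltering("これは猫です、けどKittyも大丈夫。")); // 13
```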
