Regex to count English words as a single char inside a char count of Asian words

2023-02-05 04:55

Need some help from a regex Jedi master:

If I have a string of multibyte characters (specifically Japanese, Korean or Chinese) with English words sprinkled throughout, I would like to count:

Asian characters as 1 per single char
English "words" (no dictionary check needed, just a string of consecutive English letters) as a single char.

English only is fine - don't worry about special Spanish, Swedish, etc. chars.

I am looking for a regex pattern I can use to count these strings that will work in both PHP and JS.

Example:

これは猫です、けどKittyも大丈夫。

should count as 13 chars.

thanks for your help!

jeff

Whatever you are trying to achieve, this will help you:

To count only Hiragana+Katakana+Kanji (Japanese) Chars (excluding punctuation marks):

var x = "これは猫です、けどKittyも大丈夫。";
x.match(/[ぁ-ゖァ-ヺー一-龯々]/g).length; //Result: 12 : これは猫ですけども大丈夫

Updated:

To count only words in the Latin alphabet (note that \w also matches digits and underscores):

x.match(/\w+/g).length; //Result: 1 : "Kitty"

All in one line (as function):

function myCount(str) {
   // match() returns null when nothing matches, so fall back to an empty array
   return (str.match(/[ぁ-ゖァ-ヺー一-龯々]|\w+/g) || []).length;
}
alert(myCount("これは猫です、けどKittyも大丈夫。")); //13    
alert(myCount("これは犬です。DogとPuppyもOKですね！")); //14

These are the arrays returned by match:

["こ", "れ", "は", "猫", "で", "す", "け", "ど", "Kitty", "も", "大", "丈", "夫"]
["こ", "れ", "は", "犬", "で", "す", "Dog", "と", "Puppy", "も", "OK", "で", "す", "ね"]

Updated (Japanese, Korean, Chinese):

function myCount(str) {
   return str.match(/[ぁ-ㆌㇰ-䶵一-鿃々가-힣-豈ｦ-ﾝ]|\w+/g).length;
}

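This will cover around 99% of Japanese, Chinese and Korean text. You may need to manually add extra characters that are not included, such as "〶". Note that the "-" between 힣 and 豈 is not part of a range, so it is read as a literal hyphen and a "-" in the input is counted too.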

A very good reference is:

http://www.tamasoft.co.jp/en/general-info/unicode.html

This should solve your question.
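
As a hedged alternative to hand-listed ranges, JavaScript engines that support ES2018 accept Unicode property escapes when the u flag is set, so the pattern can name scripts instead of code point ranges. The sketch below is an assumption about how that could look, not part of the original answer. The name countUnits is made up for illustration, ー is listed explicitly because its Script property is Common rather than Katakana, and [A-Za-z]+ is used instead of \w+ so that digits and underscores are not treated as English words.

```javascript
// Sketch: one CJK character or one run of ASCII letters counts as one unit.
// Requires an engine with Unicode property escapes (ES2018+).
const UNIT = /[\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}\p{Script=Hangul}ー]|[A-Za-z]+/gu;

// countUnits is a hypothetical helper name, not from the original answer.
function countUnits(str) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return (str.match(UNIT) || []).length;
}

console.log(countUnits("これは猫です、けどKittyも大丈夫。")); // 13
console.log(countUnits("これは犬です。DogとPuppyもOKですね！")); // 14
```

Punctuation such as 、 。 and ！ has the Script value Common, so it is skipped without having to be listed.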

OK, so I would do two runs: first count the occurrences of the English words, then the Asian characters. This is a JS example; it might be different in PHP. In JS, \w matches only ASCII word characters (letters, digits and underscore).

var string = "これは猫です、けどKittyも大丈夫";
var m = string.match(/\w+/gm);
var e_count = m.length; // is 1

Next count the Asian chars.

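m = string.match(/([^\w\s\d])/gm); // any non-whitespace, non-word char (\d is redundant, \w already covers digits)
var a_count = m.length; // is 13, because the punctuation mark 、 is counted too

You might have to tweak it a bit: as written, a_count includes punctuation such as 、, so e_count + a_count comes to 14 for this string rather than 13. Once punctuation is excluded, you can add up e_count and a_count in JS and you should be good to go.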
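
The two-run idea can be sketched as follows, with punctuation excluded so that the sample string adds up to 13. This is an assumption about one way to do the tweak, not the answerer's code: it relies on the \p{P} (Unicode punctuation) property escape, which needs the u flag and an ES2018+ engine, and the name countTwoPass is hypothetical.

```javascript
// Sketch of the two-run approach; countTwoPass is a made-up name.
function countTwoPass(str) {
  // Run 1: each run of ASCII letters is one English "word".
  const words = str.match(/[A-Za-z]+/g) || [];
  // Run 2: every character that is not an ASCII word char, whitespace or punctuation.
  const others = str.match(/[^\w\s\p{P}]/gu) || [];
  return words.length + others.length;
}

console.log(countTwoPass("これは猫です、けどKittyも大丈夫")); // 13
```

In this sketch ASCII digits and underscores are counted by neither run, so they add nothing to the total.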

Also check out Rubular: http://www.rubular.com

Johannes

Something like /[[:ascii:]]+|./ will match one non-ASCII character or one or more ASCII characters. Problem is that'll give 15, because the two punctuation marks are counted as well. So it seems that you want to ignore punctuation. So possibly: /[A-Za-z]+|[^[:punct:]]/

$ perl -E 'use utf8; $f = "これは猫です、けどKittyも大丈夫。"; ++$c while $f =~ /[A-Za-z]+|[^[:punct:]]/g; say $c'
13

So, that works in Perl at least. Probably in PHP as well, provided its [[:punct:]] understands Unicode. JavaScript regexes have no POSIX bracket classes such as [[:punct:]], so the pattern needs adapting there.
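
For JavaScript, a rough equivalent of the Perl pattern can be written with \P{P} (any character that is not Unicode punctuation) standing in for [^[:punct:]]. This is a sketch that assumes an ES2018+ engine and the u flag; countLikePerl is a hypothetical name, and the two classes are close but not guaranteed to be identical for every symbol character.

```javascript
// Sketch: a run of ASCII letters, or any single non-punctuation character.
// countLikePerl is a hypothetical name used only for this example.
function countLikePerl(str) {
  const matches = str.match(/[A-Za-z]+|\P{P}/gu) || [];
  return matches.length;
}

console.log(countLikePerl("これは猫です、けどKittyも大丈夫。")); // 13
```

As in the Perl version, whitespace and digits each count as one character here.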

The alternative approach is to filter out stuff instead.
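
A minimal sketch of that filtering idea, under the same assumptions (ES2018+ engine, u flag) and with a made-up function name countByFiltering: collapse each run of English letters into one placeholder character, strip punctuation and whitespace, then count what is left by code point.

```javascript
// Sketch of the filter-then-count approach; countByFiltering is hypothetical.
function countByFiltering(str) {
  const filtered = str
    .replace(/[A-Za-z]+/g, "x")    // each English word becomes one placeholder char
    .replace(/[\p{P}\s]/gu, "");   // drop punctuation and whitespace
  // Spreading a string iterates by code point, so astral characters count once.
  return [...filtered].length;
}

console.log(countByFiltering("これは猫です、けどKittyも大丈夫。")); // 13
```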
