Using JavaScript, how can I count a mix of Asian characters and English words?

2022-12-20 19:56 Question author:

I need to take a string that mixes Asian characters (for now, assume only Chinese hanzi or Japanese kanji/hiragana/katakana) with "alphanumeric" text (e.g., English or French) and count it in the following way:

1) count each Asian CHARACTER as 1;
2) count each alphanumeric WORD as 1.

A few examples:

株式会社myCompany = 4 chars + 1 word = 5 total
株式会社マイコ = 7 chars

My only idea so far is to use:

var wordArray=val.split(/\w+/);

and then check each element to see whether its contents are alphanumeric (so count it as 1) or not (so take its length). But that doesn't feel very clever, and the text being counted might be up to 10,000 words, so it may not be very quick.

Ideas?

Unfortunately, JavaScript's RegExp has no support for Unicode character classes unless the engine implements ES2018 property escapes (\p{…} with the u flag); \w only applies to ASCII characters (modulo some browser bugs).

You can use Unicode characters in character classes, though, so you can do it if you can isolate each set of characters you are interested in as a range, e.g.:

var r = new RegExp(
    '[A-Za-z0-9_]+|' +                             // ASCII letters, digits, underscore (no accents)
    '[\u3040-\u309F]+|' +                          // Hiragana
    '[\u30A0-\u30FF]+|' +                          // Katakana
    '[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]',   // Single CJK ideographs
'g');

// match() returns null when nothing matches, so fall back to an empty array.
var nwords = (str.match(r) || []).length;

(This attempts to give a more realistic count of ‘words’ for Japanese, counting each run of one type of kana as a word. That's still not right, of course, but it's probably closer than treating each syllable as one word.)
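As a quick check, the same expression can be wrapped in a small helper and run against the two sample strings from the question. This is only a sketch: the name countRuns is made up here, and the `|| []` fallback keeps a string with no matches from throwing. Because a run of katakana counts as one word under this scheme, the second sample comes out as 5 rather than the 7 the question asks for.

```javascript
// Hypothetical helper wrapping the range-based expression from this answer.
function countRuns(str) {
  var r = new RegExp(
    '[A-Za-z0-9_]+|' +                            // run of ASCII word characters
    '[\u3040-\u309F]+|' +                         // run of hiragana
    '[\u30A0-\u30FF]+|' +                         // run of katakana
    '[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]',  // one CJK ideograph
    'g');
  // match() returns null when nothing matches.
  return (str.match(r) || []).length;
}

console.log(countRuns('株式会社myCompany')); // 5 (4 ideographs + 1 ASCII run)
console.log(countRuns('株式会社マイコ'));      // 5 (4 ideographs + 1 katakana run)
console.log(countRuns(''));                  // 0
```

To count every kana character individually instead, the `+` after the two kana classes would have to go.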

Obviously there are many more characters that would have to be accounted for if you wanted to ‘do it properly’. Let's hope you don't have characters outside the basic multilingual plane, for one!
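On engines that implement ES2018, Unicode property escapes (\p{…} together with the u flag) avoid listing ranges by hand and also cover ideographs outside the Basic Multilingual Plane. A minimal sketch, assuming such an engine; the name countMixed is invented for illustration, and it follows the question's rule (each Han, hiragana or katakana character counts as 1, each run of other letters or digits counts as 1):

```javascript
// Hypothetical helper; requires \p{...} support (ES2018) and the u flag.
function countMixed(str) {
  var cjk = '\\p{Script=Han}\\p{Script=Hiragana}\\p{Script=Katakana}';
  var re = new RegExp(
    '[' + cjk + ']|' +                              // one Han or kana character
    '(?:(?![' + cjk + '])[\\p{L}\\p{N}\\p{M}_])+',  // run of other letters, digits, marks
    'gu');
  // match() returns null when nothing matches.
  return (str.match(re) || []).length;
}

console.log(countMixed('株式会社myCompany')); // 5
console.log(countMixed('株式会社マイコ'));      // 7
console.log(countMixed('café déjà'));        // 2 (accented letters stay inside the word)
```

The negative lookahead is there because Han and kana characters are letters too; without it, a Latin run directly followed by an ideograph would swallow the ideograph. Characters that Unicode assigns to the Common script, such as the prolonged sound mark ー, are not matched by the script classes and would need extra handling.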

You can iterate over each character in the text, examining each one to look for word breaks. The following example does this, counting each Chinese/Japanese/Korean (CJK) ideograph as a single word, and treating all alphanumeric strings as single words.

Some notes on my implementation:

It probably doesn't handle accented characters correctly. They will probably trigger word breaks. You can modify wordBreakRegEx to fix this.
cjkRegEx doesn't include some of the more esoteric code point ranges, since they need five hex digits and a \uXXXX escape only takes four (the ES2015 u flag adds \u{…} for this). You probably don't need to worry about them, though, since I don't think most fonts even include them.
I deliberately left Japanese hiragana and katakana out of cjkRegEx, since I'm not sure how you want to handle them. Depending on the type of text you're dealing with, it might make more sense to treat strings of them as single words. In that case, you'd need to add logic to distinguish being in a "kana word" from being in an "alphanumeric word". If you don't care, then you just need to add their code point ranges to cjkRegEx. Of course, you could try to recognize word breaks within kana strings, but that quickly becomes Very Hard.

Example implementation:

function getWordCount(text) {
  // This matches all CJK ideographs.
  var cjkRegEx = /[\u3400-\u4db5\u4e00-\u9fa5\uf900-\ufa2d]/;

  // This matches all characters that "break up" words.
  var wordBreakRegEx = /\W/;

  var wordCount = 0;
  var inWord = false;
  var length = text.length;
  for (var i = 0; i < length; i++) {
    var curChar = text.charAt(i);
    if (cjkRegEx.test(curChar)) {
      // Character is a CJK ideograph.
      // Count it as a word.
      wordCount += inWord ? 2 : 1;
      inWord = false;
    } else if (wordBreakRegEx.test(curChar)) {
      // Character is a "word-breaking" character.
      // If a word was started, increment the word count.
      if (inWord) {
        wordCount += 1;
        inWord = false;
      }
    } else {
      // All other characters are "word" characters.
      // Indicate that a word has begun.
      inWord = true;
    }
  }

  // If the text ended while in a word, make sure to count it.
  if (inWord) {
    wordCount += 1;
  }

  return wordCount;
}

The Unihan Database is very helpful for learning about CJK in Unicode. And of course the Unicode home page has loads of info.

I think you want to loop over all characters, and increase a counter every time the current character is in a different word (according to your definition) than the previous one.
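One way to read this suggestion is to classify each character and bump the counter whenever the class changes, with ideographs always starting a new word. A rough sketch under that reading; classify, countByTransitions and the class names are made up for illustration, and here a run of hiragana or katakana is treated as one word (return 'cjk' for kana in classify to count them per character instead):

```javascript
// Hypothetical classifier: decides which kind of "word" a character belongs to.
function classify(ch) {
  if (/[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]/.test(ch)) return 'cjk';
  if (/[\u3040-\u309f]/.test(ch)) return 'hiragana';
  if (/[\u30a0-\u30ff]/.test(ch)) return 'katakana';
  if (/\w/.test(ch)) return 'alnum';
  return 'break';
}

function countByTransitions(text) {
  var wordCount = 0;
  var prev = 'break';
  for (var i = 0; i < text.length; i++) {
    var cur = classify(text.charAt(i));
    // An ideograph always starts a new word; any other non-break
    // character starts one only when the kind of run changes.
    if (cur === 'cjk' || (cur !== 'break' && cur !== prev)) {
      wordCount += 1;
    }
    prev = cur;
  }
  return wordCount;
}

console.log(countByTransitions('株式会社myCompany')); // 5
console.log(countByTransitions('これはペンです'));      // 3 (hiragana, katakana, hiragana)
```

This is a single pass with no intermediate arrays, so a text of 10,000 words should not be a problem.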
