A PHP Library / Class to Count Words in Various Languages?
Some time in the near future I will need to implement a cross-language word count, or if that is not possible, a cross-language character count.
By word count I mean an accurate count of the words contained within the given text, taking the language of the text into account. The language of the text is set by a user, and will be assumed to be correct.
By character count I mean a count of the "possibly in a word" characters contained within the given text, with the same language information described above.
I would much prefer the former count, but I am aware of the difficulties involved. I am also aware that the latter count is much easier, but very much prefer the former, if at all possible.
I'd love it if I only had to look at English, but I need to consider every language here: Chinese, Korean, English, Arabic, Hindi, and so on.
I would like to know if Stack Overflow has any leads on where to start looking for an existing product / method to do this in PHP, as I am a good lazy programmer.*
A simple test showing how str_word_count with setlocale doesn't work, and a function from php.net's str_word_count page.
*http://blogoscoped.com/archive/2005-08-24-n14.html
Counting chars is easy:
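echo strlen('一个有十的字符的句子'); // 30 (WRONG! strlen counts bytes, not characters)
echo strlen(utf8_decode('一个有十的字符的句子')); // 10 (utf8_decode is deprecated as of PHP 8.2)
echo mb_strlen('一个有十的字符的句子', 'UTF-8'); // 10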
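If the goal is the "possibly in a word" character count described in the question, rather than the total number of characters, one option is to count only letters, combining marks and digits with Unicode property classes. This is a minimal sketch: the function name count_word_chars is made up for illustration, and treating \p{L}, \p{M} and \p{N} as the "word" characters is an assumption that may need adjusting per language.

```php
<?php
// Hypothetical helper (not from the original post): counts characters that
// can be part of a word, i.e. letters, combining marks and digits.
function count_word_chars(string $text): int
{
    // \p{L} = letters, \p{M} = combining marks (used heavily by scripts such
    // as Devanagari), \p{N} = numbers. The u modifier treats $text as UTF-8.
    $n = preg_match_all('/[\p{L}\p{M}\p{N}]/u', $text);

    // preg_match_all() returns false on error, e.g. invalid UTF-8 input.
    return $n === false ? 0 : $n;
}

echo count_word_chars('一个有十的字符的句子'), "\n"; // 10
echo count_word_chars('Hello, world!'), "\n";        // 10 (comma, space and "!" are skipped)
```

Spaces and punctuation are ignored, so the result does not depend on how the text happens to be formatted.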
Counting words is where things start to get tricky, especially for Chinese, Japanese and other languages that don't use spaces (or other common "word boundary" characters) as word separators. I don't speak Chinese and I don't understand how word counting works in Chinese, so you'll have to educate me a bit: what makes a word in these languages? Is it any specific char or set of chars? I remember reading something about how hard it was to identify Japanese words in T9 writing, but I can't find it anymore.
The following should correctly return the number of words in languages that use spaces or punctuation chars as word separators (the limit argument is -1 rather than null, since passing null there is deprecated as of PHP 8.1):
count(preg_split('~[\p{Z}\p{P}]+~u', $string, -1, PREG_SPLIT_NO_EMPTY));
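For languages that do not separate words with spaces, one possible starting point is the word break iterator from PHP's intl extension, which wraps ICU and uses dictionaries to segment scripts such as Chinese, Japanese and Thai. The sketch below assumes the intl extension is installed; count_words_icu is a made-up name, and how well the segmentation matches a native speaker's idea of a "word" depends on the ICU version and its dictionaries.

```php
<?php
// Sketch only: requires PHP's intl extension. count_words_icu() is a
// hypothetical helper name, not part of any library.
function count_words_icu(string $text, string $locale): int
{
    $it = IntlBreakIterator::createWordInstance($locale);
    $it->setText($text);

    $count = 0;
    $it->first();
    while ($it->next() !== IntlBreakIterator::DONE) {
        // The rule status describes the segment that ends at the current
        // boundary. Values below WORD_NONE_LIMIT are spaces and punctuation.
        if ($it->getRuleStatus() >= IntlBreakIterator::WORD_NONE_LIMIT) {
            $count++;
        }
    }
    return $count;
}

if (extension_loaded('intl')) {
    echo count_words_icu('Hello, world!', 'en'), "\n"; // 2
    // The count for Chinese depends on ICU's dictionary, so no value is claimed here.
    echo count_words_icu('一个有十的字符的句子', 'zh'), "\n";
}
```

Note that this counts numbers as words too, because their rule status is at or above WORD_NONE_LIMIT.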
A quick trick, if you only want an approximate rather than an exact word count, is
<?php echo count(explode(' ',$string)); ?>
It works by splitting the text on spaces, so it applies to any language that separates words with spaces. I have used this for a translator script. Again, it will not give the exact number of words, only an approximate count for a paragraph.
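One caveat with splitting on a single space is that repeated spaces, tabs and newlines skew the result. A slightly more robust variant, still only an approximation, splits on runs of whitespace and drops empty pieces. The function name approx_word_count is made up for this sketch.

```php
<?php
// Hypothetical helper: approximate word count for space-delimited languages.
function approx_word_count(string $text): int
{
    // Split on runs of whitespace and discard empty pieces, so repeated
    // spaces, tabs and newlines do not inflate the count.
    $parts = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on error, e.g. invalid UTF-8 input.
    return $parts === false ? 0 : count($parts);
}

echo count(explode(' ', 'one  two')), "\n";        // 3: the double space yields an empty piece
echo approx_word_count('one  two'), "\n";          // 2
echo approx_word_count("one two\nthree\t"), "\n";  // 3
```

Punctuation that stands on its own (such as a comma surrounded by spaces) is still counted as a word here.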
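Well, try:
<?php
function count_words($str) {
    $words = 0;
    // Collapse runs of spaces (eregi_replace was removed in PHP 7; use preg_replace).
    $str = preg_replace('/ +/', ' ', $str);
    $array = explode(' ', $str);
    foreach ($array as $token) {
        // Count a token only if it contains at least one digit or Latin letter
        // (eregi was removed in PHP 7; use preg_match with the i and u modifiers).
        if (preg_match('/[0-9A-Za-zÀ-ÖØ-öø-ÿ]/iu', $token)) {
            $words++;
        }
    }
    return $words;
}
echo count_words('This is the second one , it will count wrong as well" , it will count 12 instead of 11 because the comma is counted too.');
?>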