Most used words in text with PHP
I found the code below on Stack Overflow, and it works well for finding the most common words in a string. But can I exclude common words like "a", "if", "you", "have", etc. from the count? Or would I have to remove those elements after counting? How would I do this? Thanks in advance.
<?php
$text = "A very nice to tot to text. Something nice to think about if you're into text.";
$words = str_word_count($text, 1);
$frequency = array_count_values($words);
arsort($frequency);
echo '<pre>';
print_r($frequency);
echo '</pre>';
?>
This is a function that extracts common words from a string. It takes three parameters: the string, an array of stop words, and the number of keywords to return. Load the stop words from a text file with PHP's file() function, which reads a file into an array of lines:
$stop_words = file('stop_words.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$keywords = extract_common_words($text, $stop_words);
You can use this file stop_words.txt as your primary stop words file, or create your own file.
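If it helps to see what those two flags do, here is a small sketch. The file contents are made up for illustration (one word per line, with a blank line in the middle), and the temporary file only stands in for a real stop_words.txt.

```php
<?php
// Hypothetical stop-word list, written to a temporary file for illustration.
$path = tempnam(sys_get_temp_dir(), 'stop');
file_put_contents($path, "a\nif\n\nyou\nhave\n");

// FILE_IGNORE_NEW_LINES strips the trailing newline from each entry;
// FILE_SKIP_EMPTY_LINES drops the blank line in the middle.
$stop_words = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
unlink($path);

print_r($stop_words); // the four words, with no newlines and no empty entry
```

Without FILE_IGNORE_NEW_LINES every entry would keep its trailing newline, so in_array() would never match a word against the list.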
function extract_common_words($string, $stop_words, $max_count = 5) {
    $string = preg_replace('/\s+/', ' ', $string); // collapse each run of whitespace into a single space
    $string = trim($string); // trim the string
    $string = preg_replace('/[^a-zA-Z -]/', '', $string); // only take alphabet characters, but keep the spaces and dashes too
    $string = strtolower($string); // make it lowercase
    preg_match_all('/\b.*?\b/i', $string, $match_words);
    $match_words = $match_words[0];
    foreach ($match_words as $key => $item) {
        // drop empty matches, stop words, and words of three characters or fewer
        if ($item == '' || in_array(strtolower($item), $stop_words) || strlen($item) <= 3) {
            unset($match_words[$key]);
        }
    }
    $word_count = str_word_count(implode(" ", $match_words), 1);
    $frequency = array_count_values($word_count);
    arsort($frequency); // most frequent first
    $keywords = array_slice($frequency, 0, $max_count);
    return $keywords;
}
Here is my solution using built-in PHP functions:
most_frequent_words: find the most frequent word(s) in a string
function most_frequent_words($string, $stop_words = [], $limit = 5) {
    $string = strtolower($string); // Make the string lowercase
    $words = str_word_count($string, 1); // Get an array of all the words found in the string
    $words = array_diff($words, $stop_words); // Remove the stop words from the array
    $words = array_count_values($words); // Count the occurrences of each word
    arsort($words); // Sort by count, highest first
    return array_slice($words, 0, $limit); // Return at most $limit words with their counts
}
Returns an array containing the word(s) that appear most frequently in the string, keyed by word with the count as the value.
Parameters:
string $string - The input string.
array $stop_words (optional) - List of words to filter out of the result. Defaults to an empty array.
int $limit (optional) - Maximum number of words returned. Defaults to 5.
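For a quick check, here is a sketch that runs the function on the sample text from the question. The stop-word list is only an assumption for this example; use whatever list suits your text. The function is repeated so the snippet runs on its own.

```php
<?php
function most_frequent_words($string, $stop_words = [], $limit = 5) {
    $string = strtolower($string);
    $words = str_word_count($string, 1);
    $words = array_diff($words, $stop_words);
    $words = array_count_values($words);
    arsort($words);
    return array_slice($words, 0, $limit);
}

$text = "A very nice to tot to text. Something nice to think about if you're into text.";

// Illustrative stop words, all lowercase because the text is lowercased first.
$stop_words = ['a', 'to', 'if', "you're", 'into', 'about', 'very'];

$top = most_frequent_words($text, $stop_words, 2);
print_r($top); // "nice" and "text", each with a count of 2
```

Note that str_word_count() treats apostrophes as part of a word, so "you're" has to appear in the stop list in that exact form.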
There are no additional parameters, and no native PHP function, that let you pass in words to exclude. As such, I would just use what you have and ignore a custom set of words in the array returned by str_word_count().
You can do this easily by using array_diff():
$words = array("if", "you", "do", "this", 'I', 'do', 'that');
$stopwords = array("a", "you", "if");
print_r(array_diff($words, $stopwords));
gives
Array
(
[2] => do
[3] => this
[4] => I
[5] => do
[6] => that
)
But you have to take care of lower and upper case yourself. The easiest way here would be to convert the text to lowercase beforehand.
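A minimal sketch of the case handling, using the same arrays as above but with a capitalised "If" to show the problem. Lowercasing with array_map() is the simple route; array_udiff() with strcasecmp is an alternative if you want to keep the original capitalisation.

```php
<?php
$words = array("If", "you", "do", "this", "I", "do", "that");
$stopwords = array("a", "you", "if");

// Option 1: lowercase everything first, then use plain array_diff().
$lowered = array_values(array_diff(array_map('strtolower', $words), $stopwords));

// Option 2: compare case-insensitively and keep the original spelling.
$kept = array_values(array_udiff($words, $stopwords, 'strcasecmp'));

print_r($lowered); // do, this, i, do, that
print_r($kept);    // do, this, I, do, that
```

array_values() is only there to reindex the result from 0; drop it if you want to keep the original keys.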