Find 3-8 word common phrases in body of text using PHP

2023-02-07 05:10 问答作者：

I'm looking for a way to find common phrases within a body of text using PHP. If it's not possible in php, I'd be interested in other web languages that would help me complete this.

Memory or speed开发者_C百科 are not an issues.

Right now, I'm able to easily find keywords, but don't know how to go about searching phrases.

I've written a PHP script that does just that, right here. It first splits the source text into an array of words and their occurrence count. Then it counts common sequences of those words with the specified parameters. It's old code and not commented, but maybe you'll find it useful.

Using just PHP? The most straightforward I can come up with is:

Add each phrase to an array
Get the first phrase from the array and remove it
Find the number of phrases that match it and remove those, keeping a count of matches
Push the phrase and the number of matches to a new array
Repeat until initial array is empty

I'm trash for formal CS, but I believe this is of n^2 complexity, specifically involving n(n-1)/2 comparisons in the worst case. I have no doubt there is some better way to do this, but you mentioned that efficiency is a non-issue, so this'll do.

Code follows (I used a new function to me, array_keys that accepts a search parameter):

// assign the source text to $text
$text = file_get_contents('mytext.txt');

// there are other ways to do this, like preg_match_all,
// but this is computationally the simplest
$phrases = explode('.', $text);

// filter the phrases
// if you're in PHP5, you can use a foreach loop here
$num_phrases = count($phrases);
for($i = 0; $i < $num_phrases; $i++) {
  $phrases[$i] = trim($phrases[$i]);
}

$counts = array();

while(count($phrases) > 0) {
  $p = array_shift($phrases);
  $keys = array_keys($phrases, $p);
  $c = count($keys);
  $counts[$p] = $c + 1;

  if($c > 0) {
    foreach($keys as $key) {
      unset($phrases[$key]);
    }
  }
}

print_r($counts);

View it in action: http://ideone.com/htDSC

I think you should go for

str_word_count

$str = "Hello friend, you're
       looking          good today!";

print_r(str_word_count($str, 1));

will give

Array
(
    [0] => Hello
    [1] => friend
    [2] => you're
    [3] => looking
    [4] => good
    [5] => today
)

Then you can use array_count_values()

$array = array(1, "hello", 1, "world", "hello");
print_r(array_count_values($array));

which will give you

Array
(
    [1] => 2
    [hello] => 2
    [world] => 1
)

An ugly solution, since you said ugly is ok, would be to search for the first word for any of your phrases. Then, once that word is found, check if the next word past it matches the next expected word in the phrase. This would be a loop that would keep going so long as the hits are positive until either a word is not present or the phrase is completed.

Simple, but exceedingly ugly and probably very, very slow.

Coming in late here, but since I stumbled upon this while looking to do a similar thing, I thought I'd share where I landed in 2019:

https://packagist.org/packages/yooper/php-text-analysis

This library made my task downright trivial. In my case, I had an array of search phrases that I wound up breaking up into single terms, normalizing, then creating two and three-word ngrams. Looping through the resulting ngrams, I was able to easily summarize the frequency of specific phrases.

$words   = tokenize($searchPhraseText);
$words   = normalize_tokens($words);
$ngram2  = array_unique(ngrams($words, 2));
$ngram3  = array_unique(ngrams($words, 3));

Really cool library with a lot to offer.

If you want fulltext search in html files, use Sphinx - powerful search server. Documentation is here

继续阅读：data-mining php text-mining

Find 3-8 word common phrases in body of text using PHP

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？