
Find 3-8 word common phrases in body of text using PHP

I'm looking for a way to find common phrases within a body of text using PHP. If it's not possible in PHP, I'd be interested in other web languages that would help me accomplish this.

Memory and speed are not issues.

Right now, I'm able to easily find keywords, but I don't know how to go about searching for phrases.


I've written a PHP script that does just that, right here. It first splits the source text into an array of words and their occurrence count. Then it counts common sequences of those words with the specified parameters. It's old code and not commented, but maybe you'll find it useful.


Using just PHP? The most straightforward I can come up with is:

  • Add each phrase to an array
  • Get the first phrase from the array and remove it
  • Find the number of phrases that match it and remove those, keeping a count of matches
  • Push the phrase and the number of matches to a new array
  • Repeat until initial array is empty

I'm trash at formal CS, but I believe this is O(n^2): in the worst case, when no two phrases match, it makes n(n-1)/2 comparisons. I have no doubt there is some better way to do this, but you mentioned that efficiency is a non-issue, so this'll do.

Code follows (I used a feature that was new to me: array_keys accepts a search parameter and returns only the keys whose value matches it):

// assign the source text to $text
$text = file_get_contents('mytext.txt');

// split on periods, so each "phrase" here is a whole sentence;
// there are other ways to do this, like preg_match_all,
// but this is computationally the simplest
$phrases = explode('.', $text);

// trim each phrase and drop empty ones
// (e.g. the piece left over after the final period)
$phrases = array_filter(array_map('trim', $phrases), function ($p) {
  return $p !== '';
});

$counts = array();

while(count($phrases) > 0) {
  $p = array_shift($phrases);
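  // third argument true = strict comparison, so numeric-looking
  // strings such as "1e1" and "10" are not treated as equal
  $keys = array_keys($phrases, $p, true);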
  $c = count($keys);
  $counts[$p] = $c + 1;

  if($c > 0) {
    foreach($keys as $key) {
      unset($phrases[$key]);
    }
  }
}

print_r($counts);

View it in action: http://ideone.com/htDSC
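
For what it's worth, the same tally can probably be produced in a single pass with the built-in array_count_values, which avoids the repeated scans of the loop above. This is only a minimal sketch under the same assumption (a "phrase" is a period-delimited sentence); the function name count_sentences is made up for illustration:

```php
<?php
// Hypothetical helper (not part of the answer above): tally identical
// sentences in one pass instead of rescanning the array for each one.
function count_sentences(string $text): array
{
    // Split on periods, trim whitespace, and drop empty pieces.
    $phrases = array_map('trim', explode('.', $text));
    $phrases = array_filter($phrases, function ($p) {
        return $p !== '';
    });

    // array_count_values builds a phrase => count map.
    $counts = array_count_values($phrases);

    // Most frequent first.
    arsort($counts);
    return $counts;
}

print_r(count_sentences('It works. It fails. It works.'));
```

One caveat: array_count_values uses the phrases as array keys, so a phrase that is a plain integer string (such as "42") ends up as an integer key.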


I think you should go for

str_word_count

$str = "Hello friend, you're
       looking          good today!";

print_r(str_word_count($str, 1));

will give

Array
(
    [0] => Hello
    [1] => friend
    [2] => you're
    [3] => looking
    [4] => good
    [5] => today
)

Then you can use array_count_values()

$array = array(1, "hello", 1, "world", "hello");
print_r(array_count_values($array));

which will give you

Array
(
    [1] => 2
    [hello] => 2
    [world] => 1
)
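
On their own, those two functions only count single words. To get phrases, one option is to slide a window of 3 to 8 words over the word list, join each window into a string, and count those strings instead. A rough sketch, assuming lowercasing is acceptable and that phrases may span sentence boundaries; count_phrases and its parameters are names invented for this example:

```php
<?php
// Hypothetical helper: count every run of $min..$max consecutive words
// and return the runs that occur more than once.
function count_phrases(string $text, int $min = 3, int $max = 8): array
{
    // Lowercase so "The cat" and "the cat" count as the same phrase.
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);

    $phrases = [];
    for ($n = $min; $n <= $max; $n++) {
        // Slide a window of $n words across the text.
        for ($i = 0; $i + $n <= $total; $i++) {
            $phrases[] = implode(' ', array_slice($words, $i, $n));
        }
    }

    // Tally the phrases and keep only those seen more than once.
    $counts = array_filter(array_count_values($phrases), function ($c) {
        return $c > 1;
    });

    // Most frequent first.
    arsort($counts);
    return $counts;
}

$text = 'The quick brown fox jumps. Later the quick brown fox sleeps.';
print_r(count_phrases($text));
```

Note that str_word_count drops digits and punctuation by default, so phrases containing numbers would need a different tokenizer (preg_split, for instance).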


An ugly solution, since you said ugly is ok, would be to search for the first word for any of your phrases. Then, once that word is found, check if the next word past it matches the next expected word in the phrase. This would be a loop that would keep going so long as the hits are positive until either a word is not present or the phrase is completed.

Simple, but exceedingly ugly and probably very, very slow.
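
Roughly, that idea might look like the sketch below. It assumes you already know which phrase you are looking for and that case can be ignored; count_phrase_hits is a hypothetical name, not an existing function:

```php
<?php
// Hypothetical helper: count occurrences of a known phrase by walking
// the text word by word.
function count_phrase_hits(string $text, string $phrase): int
{
    $words  = str_word_count(strtolower($text), 1);
    $target = str_word_count(strtolower($phrase), 1);
    $len    = count($target);
    if ($len === 0) {
        return 0; // nothing to search for
    }

    $hits = 0;
    $last = count($words) - $len;
    for ($i = 0; $i <= $last; $i++) {
        // Keep comparing the following words until one differs
        // or the whole phrase has been matched.
        $j = 0;
        while ($j < $len && $words[$i + $j] === $target[$j]) {
            $j++;
        }
        if ($j === $len) {
            $hits++;
        }
    }
    return $hits;
}

echo count_phrase_hits('To be or not to be, that is the question.', 'to be');
```

This only helps when the phrases are known up front; it does not discover new common phrases by itself.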


Coming in late here, but since I stumbled upon this while looking to do a similar thing, I thought I'd share where I landed in 2019:

https://packagist.org/packages/yooper/php-text-analysis

This library made my task downright trivial. In my case, I had an array of search phrases that I wound up breaking up into single terms, normalizing, then creating two and three-word ngrams. Looping through the resulting ngrams, I was able to easily summarize the frequency of specific phrases.

$words   = tokenize($searchPhraseText);
$words   = normalize_tokens($words);
$ngram2  = array_unique(ngrams($words, 2));
$ngram3  = array_unique(ngrams($words, 3));

Really cool library with a lot to offer.


If you want full-text search over HTML files, use Sphinx, a powerful search server. Documentation is here
