Extracting the common words between two paragraphs?

How can I extract the common words between two or more paragraphs in PHP 5? I guess it might work to summarize each text to create a list of highly ranked words and then compare them.


I guess the most basic way would be to:

  • split each paragraph into an array of words, using either explode or preg_split
    • the first one might be a bit faster
    • the second one gives you more options (for example, splitting on any run of non-word characters)
  • maybe do some filtering on the list of words:
    • clean each word
      • removing special characters, like accented letters
      • converting everything to lower- or upper-case, to help the comparisons you'll be doing later
    • remove words that are too common
    • remove words that are too short
    • array_filter could probably help here
  • and then, get the list of words that are in both arrays, using something like array_intersect
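
The steps in the list above could be sketched roughly like this. The function name extractWords, the stop-word list and the minimum length of 3 are all made up for illustration, and the splitting assumes plain ASCII text (accented letters would need extra handling, e.g. transliteration, before this step).

<?php
// Hypothetical helper: split a paragraph into cleaned, filtered, unique words.
// Assumes ASCII text; $stopWords and $minLength are illustrative choices.
function extractWords($text, array $stopWords, $minLength = 3) {
    // Lowercase first so later comparisons are case-insensitive.
    $text = strtolower($text);
    // Split on any run of characters that is not an ASCII letter or digit.
    $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    // Drop words that are too short or too common.
    $words = array_filter($words, function ($w) use ($stopWords, $minLength) {
        return strlen($w) >= $minLength && !in_array($w, $stopWords, true);
    });
    // Remove duplicates and reindex from 0.
    return array_values(array_unique($words));
}

$stopWords = array('the', 'and', 'this', 'that', 'with');
$a = extractWords('The quick brown fox jumps over the lazy dog.', $stopWords);
$b = extractWords('A lazy dog and a quick cat sleep with the fox.', $stopWords);

// array_intersect keeps the values of $a that also appear in $b.
$common = array_values(array_intersect($a, $b));
print_r($common); // quick, fox, lazy, dog
?>

array_intersect accepts more than two arrays, so the same approach should extend to three or more paragraphs by passing each word list as an extra argument.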


There is probably a faster way, but you could strip punctuation like !?-./\@#$%^&* with a regular expression, explode each of the two paragraphs into an array of words, and then call array_intersect() on both arrays. Every word in array 1 that also appears in array 2 should come back as a match.

http://php.net/manual/en/function.array-intersect.php

Theoretically you should receive back an array of matching words. From there, ranking is up to you and how you choose to do it.
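
As one possible way to do the ranking, here is a minimal sketch that orders the shared words by how often they occur across both paragraphs. The helper name wordCounts and the "sum of occurrences" score are my own assumptions, not anything prescribed above; any other scoring would slot in the same way.

<?php
// Hypothetical helper: map each lowercased word to its number of occurrences.
function wordCounts($text) {
    preg_match_all('/[a-z0-9]+/', strtolower($text), $matches);
    return array_count_values($matches[0]);
}

$counts1 = wordCounts('Sample text is used to test. Sample text is short.');
$counts2 = wordCounts('This text is another sample. The text differs.');

// Keep only the words present in both paragraphs (the array keys are the words).
$shared = array_intersect_key($counts1, $counts2);

// Assumed ranking score: total occurrences across both paragraphs.
$ranked = array();
foreach ($shared as $word => $count) {
    $ranked[$word] = $count + $counts2[$word];
}
arsort($ranked); // highest score first
print_r($ranked); // "text" scores 4; "sample" and "is" both score 3
?>

The relative order of words with equal scores is not something to rely on, since PHP's sort functions were not stable before PHP 8.0.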


Something like this might work...

<?php
  $paragraph = "hello this is some sample text. Sample text is usually used to test a program. For example, this sample text will be used to test the script below.";
  $words = array();
  preg_match_all('/\w+/', $paragraph, $matches);
  foreach($matches[0] as $w){
    $w = strtolower($w);
    if(!array_key_exists($w, $words)){
      $words[$w] = 0;
    }
    $words[$w]++;
  }
  asort($words);
  echo print_r($words, true);

  /* Output (the order of words with equal counts may vary by PHP version)
  Array (
      [hello] => 1
      [will] => 1
      [example] => 1
      [a] => 1
      [program] => 1
      [usually] => 1
      [script] => 1
      [below] => 1
      [some] => 1
      [the] => 1
      [be] => 1
      [for] => 1
      [to] => 2
      [is] => 2
      [test] => 2
      [used] => 2
      [this] => 2
      [sample] => 3
      [text] => 3
  ) */

?>


<?php
/**
 * Gets all the words as an array for a given text blob
 *
 * @param string $paragraph The paragraph in question
 * @return string[] Words found
 */
function getWords($paragraph) {
   //only lowercase
   $paragraph = strtolower($paragraph);
   //replace every character that is not a lowercase letter (digits included) with a
   //space, so that periods, commas and line breaks don't end up inside our words
   $paragraph = preg_replace("/[^a-z]/", " ", $paragraph);
   $paragraph = explode(" ", $paragraph);
   //get rid of empty words
   $paragraph = array_flip($paragraph);
   unset($paragraph[""]);
   $paragraph = array_flip($paragraph);
   return $paragraph;
}

$paragraph1 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque sit amet ante
nisl. Morbi tempor varius semper. Suspendisse vel nisi dui. Sed tristique consectetur imperdiet.
Morbi nulla diam, lobortis non eleifend eget, ullamcorper nec tortor. Duis quis lectus felis.
In vulputate varius luctus. Maecenas gravida laoreet massa quis faucibus. Duis dictum, dui sit
amet pharetra laoreet, tortor nisi mattis tortor, et ornare purus dolor vitae ligula. Sed id
orci ut dolor fermentum imperdiet. Nulla non justo urna, in suscipit nunc. Donec ut nibh risus,
ut tempus mi. Proin fringilla pretium urna sed faucibus. Proin et porttitor sem. Nulla eros
arcu, sodales et aliquam in, pharetra et mauris. Duis placerat blandit justo at tincidunt.
Etiam eu rutrum arcu.";

$paragraph2 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam sit amet leo id
arcu feugiat tempus quis a risus. Proin non nisi augue. Cras ultricies dignissim augue vel gravida.
Vivamus sed orci sed leo sollicitudin aliquet non at dui. Nulla facilisi. Suspendisse nunc nibh,
sollicitudin vitae tincidunt eget, aliquet vitae magna. Aliquam vehicula cursus ante, vitae rhoncus
orci egestas et. Fusce condimentum metus at metus auctor pellentesque. Suspendisse potenti. Morbi
blandit, leo sed eleifend pretium, augue dui interdum eros, vel faucibus felis dolor id elit. Nam
condimentum, odio at mattis consequat, sem eros molestie risus, a tempus dolor arcu sit amet justo.";

$common = array_intersect(getWords($paragraph1), getWords($paragraph2));
sort($common);
var_dump($common);
?>


  1. Split each paragraph on spaces
  2. Select a token from paragraph A; if it is in paragraph B, put it in a 'matches' array.
  3. Repeat step 2 until there are no more tokens in paragraph A.
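
A literal sketch of those three steps, in case it helps. The function name commonTokens is hypothetical, and flipping paragraph B into a lookup table is my own addition so that each membership test is a key lookup instead of a scan.

<?php
// Hypothetical helper implementing the three steps above.
function commonTokens($paragraphA, $paragraphB) {
    // Step 1: split each paragraph on runs of whitespace.
    $tokensA = preg_split('/\s+/', trim($paragraphA), -1, PREG_SPLIT_NO_EMPTY);
    $tokensB = preg_split('/\s+/', trim($paragraphB), -1, PREG_SPLIT_NO_EMPTY);

    // Flip B so that tokens become keys and can be tested with isset().
    $lookupB = array_flip($tokensB);

    // Steps 2 and 3: walk every token of A and keep those found in B.
    $matches = array();
    foreach ($tokensA as $token) {
        if (isset($lookupB[$token])) {
            $matches[$token] = true; // keyed by token to avoid duplicates
        }
    }
    return array_keys($matches);
}

print_r(commonTokens('red green blue green', 'green yellow red')); // red, green
?>

Because this splits only on whitespace, punctuation stays attached to the tokens and the comparison is case-sensitive, so "text." and "Text" would not match "text" unless the paragraphs are cleaned first.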