How can I get the most popular phrases from a lot of text?
I'm setting up a Twitter-style "trending topics" box for my forum. I've got the most popular /words/, but can't even begin to think how I will get popular phrases, like Twitter does.
As it stands I just get all the content of the last 200 posts into a string and split them into words, then sort by which words are used the most. How can I turn this from most popular words into the most popular phrases?
One technique you might consider is using sorted sets (ZSETs) in Redis. Even with very large sets of data, you can do something like this:
$words = explode(" ", $input); // Naive split of a block of text into individual words.
$word_count = count($words);
$r = new Redis(); // Owlient's PHPRedis PECL extension
$r->connect("127.0.0.1", 6379);
// Join the words of one candidate phrase and bump its score in the sorted set.
function process_phrase($phrase) {
    global $r;
    $phrase = implode(" ", $phrase);
    $r->zIncrBy("trending_phrases", 1, $phrase);
}
// Count every run of $j consecutive words starting at position $i.
for ($i = 0; $i < $word_count; $i++)
    for ($j = 1; $j <= $word_count - $i; $j++) // <= so runs ending on the last word are included
        process_phrase(array_slice($words, $i, $j));
To retrieve the top phrases, you'd use this:
// Assume $r is instantiated like it is above
$trending_phrases = $r->zRevRange("trending_phrases", 0, 9); // indexes 0 through 9 are the ten highest scores
$trending_phrases will be an array of the top ten trending phrases. To track recent trending phrases (as opposed to a persistent, global set of phrases), repeat the Redis calls above with a key that identifies the current day, say the number of days since Jan 1, 1970. When retrieving the results, fetch both today's and yesterday's keys and use array_merge and array_unique to find the union.
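Here is a rough sketch of the per-day idea. It is a variation on the above, not the same thing: instead of array_merge and array_unique it sums the scores, so the combined list stays ranked. It assumes you fetch each day's key with scores (PhpRedis's zRevRange takes a fourth argument for that and then returns a phrase => score array). day_key() and merge_scores() are hypothetical helper names, and the sample arrays stand in for real Redis results.

```php
<?php
// Hypothetical helper: build a per-day key such as "trending_phrases:14610",
// where the suffix is the number of whole days since Jan 1, 1970 (UTC).
function day_key(int $timestamp, string $prefix = "trending_phrases"): string {
    return $prefix . ":" . intdiv($timestamp, 86400);
}

// Hypothetical helper: combine phrase => score maps from several days
// by summing the scores, then sort with the highest total first.
function merge_scores(array ...$days): array {
    $total = [];
    foreach ($days as $scores) {
        foreach ($scores as $phrase => $score) {
            $total[$phrase] = ($total[$phrase] ?? 0) + $score;
        }
    }
    arsort($total);
    return $total;
}

// Stand-in data (assumption). With a live connection these maps would come
// from $r->zRevRange(day_key(time()), 0, 9, true) and the previous day's key.
$today = ["redis zset" => 3, "trending topics" => 5];
$yesterday = ["trending topics" => 2, "php arrays" => 4];

$merged = merge_scores($today, $yesterday);
print_r(array_slice($merged, 0, 10, true));
```

When writing, you would pass day_key(time()) to zIncrBy instead of the fixed "trending_phrases" key.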
Hope this helps!
I'm not sure what type of answer you were looking for, but there's Laconica:
http://status.net/?source=laconica
It's an open source Twitter clone (a much simpler version).
Maybe you could use part of its code to build your own popular phrases feature?
Good luck!
Instead of splitting the text into individual words, split it into individual phrases. It's as simple as that.
$popular = array();
foreach ($tweets as $tweet)
{
    // split by common punctuation chars, dropping empty pieces
    $sentences = preg_split('~[.!?]+~', $tweet, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($sentences as $sentence)
    {
        $sentence = strtolower(trim($sentence)); // normalize sentences
        if ($sentence === '')
        {
            continue; // skip whitespace-only fragments
        }
        if (isset($popular[$sentence]) === false)
        //if (array_key_exists($sentence, $popular) === false)
        {
            $popular[$sentence] = 0;
        }
        $popular[$sentence]++;
    }
}
arsort($popular); // most frequent phrases first
echo '<pre>';
print_r($popular);
echo '</pre>';
It'll be a lot slower if you consider a phrase as an aggregation of n consecutive words.
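To give an idea of what that n-consecutive-words version could look like, here is a minimal sketch. It assumes phrases of two to three words are enough and that splitting on whitespace is an acceptable tokenizer (punctuation is not stripped). count_ngrams() and the sample posts are made up for illustration.

```php
<?php
// Hypothetical helper: count every run of $min to $max consecutive words
// across a list of posts. The work grows roughly with
// (number of words) * ($max - $min + 1), so keep the range small.
function count_ngrams(array $posts, int $min = 2, int $max = 3): array {
    $counts = [];
    foreach ($posts as $post) {
        // Lowercase and split on runs of whitespace, dropping empty pieces.
        $words = preg_split('~\s+~', strtolower($post), -1, PREG_SPLIT_NO_EMPTY);
        $total = count($words);
        for ($n = $min; $n <= $max; $n++) {
            for ($i = 0; $i + $n <= $total; $i++) {
                $phrase = implode(" ", array_slice($words, $i, $n));
                $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
            }
        }
    }
    arsort($counts); // most frequent phrases first
    return $counts;
}

// Sample posts (assumption, for illustration only).
$posts = [
    "I love trending topics",
    "Trending topics are fun",
    "Nobody reads trending topics anyway",
];

$top = count_ngrams($posts);
print_r(array_slice($top, 0, 5, true));
```

Capping the phrase length is what keeps this manageable; counting every possible run of words in a post grows quadratically with the post length.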