Get the two most frequent words within several strings

I have a list of phrases and I want to know which two words occurred the most often in all of my phrases.

I tried playing with regex and other code, and I just cannot find the right way to do this.

Can anyone help?

E.g., given these phrases:

I am purchasing a wallet
a wallet for 20$
purchasing a bag

I'd want to learn that:

  • a wallet occurred 2 times
  • purchasing a occurred 2 times


<?php
// single quotes so the $ is never treated as the start of a variable
$string = 'I am purchasing a wallet a wallet for 20$ purchasing a bag';
// split the string into words
$words = explode(' ', $string);

// chunk into pairs starting at the first word, i.e. [0,1][2,3]...
$chunks = array_chunk($words, 2);

// remove the first array element
unset($words[0]);
// chunk again; since the first element is gone, the pairs are now [1,2][3,4]...
$alternateChunks = array_chunk($words, 2);
// merge both sets of chunks
$totalChunks = array_merge($chunks, $alternateChunks);

$finalChunks = array();
foreach ($totalChunks as $t)
{
    // skip a trailing chunk that holds only one word
    if (count($t) < 2) continue;
    // join each chunk into a phrase using +
    // (+ can be replaced with a space if needed; both work as array keys)
    $finalChunks[] = implode('+', $t);
}
// count how often each phrase occurs
$result = array_count_values($finalChunks);
echo "<pre>";
print_r($result);
echo "</pre>";


I hesitate to suggest this, as it's an extremely brute force way to go about it:

Take your string of words, split it with explode(" ", $string), then run the result through a nested for loop that checks every pair of consecutive words against every pair of consecutive words in the string.

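$string = 'I am purchasing a wallet a wallet for 20$ purchasing a bag';
$words = explode(" ", $string);
$count = array();
$last = count($words) - 1; // a pair starts at every index except the last
for ($t = 0; $t < $last; $t++)
{
    // join with a space so that "ab c" and "a bc" stay distinct
    $pair = $words[$t] . " " . $words[$t+1];
    $count[$pair] = 0; // recounted from scratch each time this pair comes up
    for ($i = 0; $i < $last; $i++)
    {
        if ($pair == ($words[$i] . " " . $words[$i+1])) { $count[$pair]++; }
    }
}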

So the nested for loop steps in, grabs the first two words, compares them to every set of two consecutive words, then grabs the next two words and does it again. Every pair will have a count of at least 1 (it will always match itself), but sorting the resulting array by count will give you the most repeated pairs.

Note that this will run (n-1)*(n-1) iterations, which could get unwieldy FAST.


Place them all into an array, and access them by the current word index and next word index.

I think this should do the trick. It will grab pairs of words, unless you are at the end of the string, where you'll get only one word.

$str = "I purchased a wallet because I wanted a wallet a wallet a wallet";
$words = explode(" ", $str);

$array_results = array();
for ($i = 0; $i < count($words); $i++) {
  if ($i < count($words) - 1) {
     $pair = $words[$i] . " " . $words[$i+1];
     // Have to check if the key is in use yet to avoid a notice
     $array_results[$pair] = isset($array_results[$pair]) ? $array_results[$pair] + 1 : 1;
  }
  // At the end of the array, just use a single word
  else {
     $array_results[$words[$i]] = isset($array_results[$words[$i]]) ? $array_results[$words[$i]] + 1 : 1;
  }
}

// Sort the results
// use arsort() instead to get the highest first
asort($array_results);
print_r($array_results);

// Prints:
Array
(
    [I wanted] => 1
    [wanted a] => 1
    [wallet] => 1
    [because I] => 1
    [wallet because] => 1
    [I purchased] => 1
    [purchased a] => 1
    [wallet a] => 2
    [a wallet] => 4
)

Update: changed ++ to + 1 above since it wasn't working when tested.
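
To get just the two most frequent pairs the question asks for, one option is to sort by count and slice off the top. This is only a sketch: it assumes a counts array shaped like $array_results above, and the helper name topPairs is made up for illustration.

```php
<?php
// Hypothetical helper (not part of the answer above): returns the $n highest
// counts from an array shaped like [pair => count].
function topPairs(array $counts, $n = 2)
{
    arsort($counts);                          // highest count first, keys preserved
    return array_slice($counts, 0, $n, true); // string keys are kept
}

$counts = array('I wanted' => 1, 'wallet a' => 2, 'a wallet' => 4);
print_r(topPairs($counts));
```

Note that if several pairs tie for second place, the slice keeps only one of them.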


Try splitting it into an array with explode and counting the values with array_count_values.

<?php
$text = "whatever";

$text_array = explode( ' ', $text);
$double_words = array();

for($c = 1; $c < count($text_array); $c++)
{ 
  $double_words[] = $text_array[$c -1] . ' ' . $text_array[$c];
}

$result = array_count_values($double_words);

?>

I've now updated it to the two-word version. Does this work for you? For the string from the question, it gives:

array(9) { 
  ["I am"]=> int(1) 
  ["am purchasing"]=> int(1) 
  ["purchasing a"]=> int(2) 
  ["a wallet"]=> int(2) 
  ["wallet a"]=> int(1) 
  ["wallet for"]=> int(1) 
  ["for 20$"]=> int(1) 
  ["20$ purchasing"]=> int(1) 
  ["a bag"]=> int(1) 
} 


Since you used the Excel tag, I thought I'd give it a shot, and it's actually really easy.

  1. Split string using space as delimiter. Data > Text to Columns... > Delimited > Delimiter: Space. Each word is now in its own cell.
  2. Transpose the result (not strictly required but much easier to visualize). Copy, Edit > Paste Special... > Transpose.
  3. Make cells containing consecutive word pairs. So if your words are in cells B5:B15, cell C5 should be =B5&" "&B6 (and drag down).
  4. Count occurrences of each word pair: In cell D5, =COUNTIF($C$5:$C$15,"="&C5), drag down.
  5. Highlight the winner(s). Select C5:D15, Format > Conditional Formatting... > Formula Is =$D5=MAX($D$5:$D$15) and choose e.g. a yellow background.

Note that there is some inefficiency in step 4 because the count of each word pair will be calculated multiple times if that word pair occurs multiple times. If this is a concern, then you can first make a list of unique word pairs using Data > Filter > Advanced Filter... > Unique records only.
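
For comparison with the PHP answers, the same steps (split, pair up, count, pick the winners) can be sketched in PHP. This is an illustration rather than a translation of the spreadsheet: the function name mostFrequentPairs is hypothetical, and it assumes all the words sit in one space-separated string.

```php
<?php
// Hypothetical helper mirroring the steps above: returns every pair tied for
// the highest count in a single string of space-separated words.
function mostFrequentPairs($text)
{
    $words = explode(' ', $text);                 // step 1: split on spaces
    $pairs = array();
    for ($i = 0; $i + 1 < count($words); $i++) {  // step 3: consecutive pairs
        $pairs[] = $words[$i] . ' ' . $words[$i + 1];
    }
    if (!$pairs) {
        return array();                           // fewer than two words
    }
    $counts = array_count_values($pairs);         // step 4: each unique pair counted once
    return array_keys($counts, max($counts));     // step 5: keep the winner(s)
}

print_r(mostFrequentPairs('I am purchasing a wallet a wallet for 20$ purchasing a bag'));
```

Because array_count_values tallies each distinct pair once, the repeated counting mentioned for step 4 does not arise here.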

An automated VBA solution could easily be crafted by recording a macro of the above followed by some minor editing.


One way to go about it is to use SPLIT or a regex to split the sentences into words and store each into an array. Then take the array and create a dictionary object. When you add a term to the dictionary, if it's already there, add 1 to the .value to tally the count.

Here is some example code (far from perfect, as it's just to show the underlying concept) that will take all the strings in column A and generate a word frequency list in columns B and C. It counts single words rather than word pairs, so it's not exactly what you want, but I hope it gives you some ideas on how to go about it:

Sub FrequencyList()

Dim vArray As Variant
Dim myDict As Variant
Set myDict = CreateObject("Scripting.Dictionary")
Dim i As Long
Dim cell As range

With myDict
    For Each cell In range("A1", cells(Rows.count, "A").End(xlUp))
        vArray = Split(cell.Value, " ")
        For i = LBound(vArray) To UBound(vArray)
            If Not .exists(vArray(i)) Then
                .Add vArray(i), 1
            Else
                .Item(vArray(i)) = .Item(vArray(i)) + 1
            End If
        Next
    Next
    range("B1").Resize(.count).Value = Application.Transpose(.keys)
    range("C1").Resize(.count).Value = Application.Transpose(.items)
    End With

End Sub
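
The same dictionary idea carries over to word pairs. Below is a minimal sketch in PHP (to match the other answers), assuming the phrases arrive as an array of strings and that a pair should not span two phrases. The function name countPairs is hypothetical.

```php
<?php
// Hypothetical helper: tallies consecutive word pairs phrase by phrase,
// using an associative array as the dictionary.
function countPairs(array $phrases)
{
    $counts = array();
    foreach ($phrases as $phrase) {
        // split on any run of whitespace and drop empty pieces
        $words = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
        for ($i = 0; $i + 1 < count($words); $i++) {
            $pair = $words[$i] . ' ' . $words[$i + 1];
            // add the pair if it is new, otherwise add 1 to its tally
            $counts[$pair] = isset($counts[$pair]) ? $counts[$pair] + 1 : 1;
        }
    }
    return $counts;
}

$counts = countPairs(array(
    'I am purchasing a wallet',
    'a wallet for 20$',
    'purchasing a bag',
));
arsort($counts); // most frequent pairs first
print_r($counts);
```

Counting per phrase avoids pairs such as "wallet a" that only appear when the phrases are glued into one long string, so "a wallet" and "purchasing a" each come out at 2, as in the question.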