Creating a simple text file based search engine

I need to create a simple text-file-based search engine in PHP as soon as possible. It has to read the files in a directory, remove stop words and other useless words, and index each remaining word along with the number of times it appears in each document.

I guess the pseudo code for this is:

for each file in directory:
    read in contents,
    compare to stop words,
    add each remaining word to array,
    count how many times that word appears in document,
    add that number to the array,
    add the id/name of the file to the array,

I also need to count the total number of words (after removing the useless ones, I guess) in the whole file. I'm guessing that can be done afterwards, as long as I can get the file id from that array and then count the words inside?

Can anyone help, maybe by providing a barebones structure? I think the main part I need help with is getting the number of times each word appears in the document and adding it to the index array.

Thanks


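// Stop words to drop from the index (placeholder list; substitute your own).
$useless=array('the','a','an','and','of');
$words=array();
foreach (glob('*') as $file) {
    if (!is_file($file)) continue; // glob('*') also returns directories
    $contents=file_get_contents($file);
    $words[$file]=array();
    // Every run of non-whitespace characters is treated as a word.
    preg_match_all('/\S+/',$contents,$matches,PREG_SET_ORDER);
    foreach ($matches as $match) {
        if (!isset($words[$file][$match[0]]))
            $words[$file][$match[0]]=0;
        $words[$file][$match[0]]++;
    }
    // Remove the stop words once counting is done.
    foreach ($useless as $value)
        if (isset($words[$file][$value]))
            unset($words[$file][$value]);
    // count() gives the number of distinct words left;
    // array_sum($words[$file]) would give the total number of occurrences.
    $count=count($words[$file]);
    var_dump($words[$file]);
    echo 'Number of distinct words: '.$count;
}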


Take a look at str_word_count. It counts words, but can also extract them to an array (each value in the array being a word). You can then post-process this array to remove stop words, count occurrences, etc.
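
A minimal sketch of that approach, assuming PHP 5.3 or later. The sample text and the stop-word list are made up for illustration, and lowercasing the text first is my own choice so that "The" and "the" count as the same word.

```php
<?php
// Hypothetical sample text and stop-word list, for illustration only.
$text = 'The car is at the airport and the car is red';
$stopWords = array('the', 'is', 'at', 'and');

// Format 1 makes str_word_count() return the words as an array.
$allWords = str_word_count(strtolower($text), 1);

// Drop the stop words, then count how often each remaining word occurs.
$useful = array_diff($allWords, $stopWords);
$counts = array_count_values($useful);
$total  = count($useful); // total useful words, duplicates included

print_r($counts);
echo "Total useful words: $total\n";
```

Here $counts maps each remaining word to its number of occurrences, and $total is the word count after stop-word removal.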


Getting each file in the directory should be simple using glob.
The files can then be read with file_get_contents.

/**
 * This is how you will add extra rows
 * 
 * $index[] = array(
 *  'filename' => 'airlines.txt',
 *  'word' => 'JFK',
 *  'count' => 3,
 *  'all_words_count' => 42
 * );
*/
$index = array();

$words = array('jfk', 'car');

foreach( $words as $word ) {

  // All files with a .txt extension
  // Alternate way would be "/path/to/dir/*"
  foreach (glob("test_files/*.txt") as $filename) {

    // Read the whole file into a string. The second argument (true)
    // also searches the include_path for the file.
    $content = file_get_contents($filename, true);

    $count = 0;

    $totalCount = str_word_count($content);

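    // Case-insensitive substring match; preg_quote() escapes any regex
    // metacharacters in $word.
    if( preg_match_all('/' . preg_quote($word, '/') . '/i', $content, $matches) ) {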
      $count = count($matches[0]);
    }

    // Add another item to the list
    $index[] = array(
        'filename' => $filename,
        'word' => $word,
        'count' => $count,
        'all_words_count' => $totalCount
      );

  }

}

// Debug and look at the index array,
// make sure it looks the way you want it.
echo '<pre>';
print_r($index);
echo '</pre>';

When I tested the above code, this is what I got.

Array
(
    [0] => Array
        (
            [filename] => test_files/airlines.txt
            [word] => jfk
            [count] => 2
            [all_words_count] => 38
        )

    [1] => Array
        (
            [filename] => test_files/rentals.txt
            [word] => jfk
            [count] => 0
            [all_words_count] => 47
        )

    [2] => Array
        (
            [filename] => test_files/airlines.txt
            [word] => car
            [count] => 0
            [all_words_count] => 38
        )

    [3] => Array
        (
            [filename] => test_files/rentals.txt
            [word] => car
            [count] => 3
            [all_words_count] => 47
        )

)

I think I have solved your question :D Add this after the above script and you should be able to sort by count: $sorted starts at zero and goes up, and $sorted_desc starts from the highest.

function sorter($a, $b) {
  if( $a['count'] == $b['count'] )
    return 0;

  return ($a['count'] < $b['count']) ? -1 : 1;
}

// Clone the original list
$sorted = $index;

// Run a custom sort function
uasort($sorted, 'sorter');

// Reverse the array to find the highest first
$sorted_desc = array_reverse($sorted);

// Debug and look at the index array,
// make sure it looks the way you want it.
echo '<h1>Ascending</h1><pre>';
print_r($sorted);
echo '</pre>';

echo '<h1>Descending</h1><pre>';
print_r($sorted_desc);
echo '</pre>';
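
As an alternative sketch, assuming PHP 7 or later, the descending order can be produced directly with usort and the <=> operator, without reversing afterwards. The rows below are hypothetical stand-ins in the same shape as the $index built earlier, and $sortedDesc is a name I picked for the example.

```php
<?php
// Hypothetical rows in the same shape as the $index built earlier.
$index = array(
    array('filename' => 'airlines.txt', 'word' => 'jfk', 'count' => 2),
    array('filename' => 'rentals.txt',  'word' => 'jfk', 'count' => 0),
    array('filename' => 'rentals.txt',  'word' => 'car', 'count' => 3),
);

$sortedDesc = $index; // arrays are copied by value, so $index is untouched

// Comparing $b against $a sorts from the highest count down.
usort($sortedDesc, function ($a, $b) {
    return $b['count'] <=> $a['count'];
});

print_r(array_column($sortedDesc, 'count'));
```

Unlike uasort, usort reindexes the result from 0, which is usually what you want when printing a ranked list.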


Here's a basic structure:

  1. Create an $index array
  2. Use scandir (or glob, if you need to only get files of a certain type) to get the files in the directory.
  3. For each file:
    1. Get contents with file_get_contents
    2. Use str_word_count (with format 1) to get an array $word_stream of the words in the file
    3. Create an array $word_array to hold word counts
    4. For each word in $word_stream:
      1. If it is in a $ignored_words array, skip it
      2. If it is not already in $word_array as a key, add $word_array[$word] = 1
      3. If it is already in $word_array, increment $word_array[$word]++
    5. Get the total word count of $word_array with array_sum, or the number of unique words with count; you can add them to $word_array with keys "_count" and "_unique" (which will not be words), if you like
    6. Add the filename as a key to the $index array, with the value being $word_array
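
The steps above can be sketched roughly as follows. This is only one way to do it: the function name build_index, the .txt filter, the lowercasing, and the test_files directory in the usage lines are assumptions of mine, not part of the outline.

```php
<?php
// Build a per-file word-count index for every .txt file in $dir.
// build_index is a hypothetical helper name; $ignoredWords is a
// caller-supplied list of lowercase stop words.
function build_index($dir, array $ignoredWords)
{
    $index = array();
    $ignored = array_flip($ignoredWords); // flip for fast isset() lookups

    // glob() can return false on error, so fall back to an empty list.
    $paths = glob($dir . '/*.txt') ?: array();

    foreach ($paths as $path) {
        $contents = file_get_contents($path);
        if ($contents === false) {
            continue; // unreadable file: skip it
        }

        // Lowercase first so "Car" and "car" share one entry.
        $wordStream = str_word_count(strtolower($contents), 1);
        $wordArray = array();

        foreach ($wordStream as $word) {
            if (isset($ignored[$word])) {
                continue; // stop word
            }
            if (!isset($wordArray[$word])) {
                $wordArray[$word] = 0;
            }
            $wordArray[$word]++;
        }

        // Compute both totals before adding the summary keys.
        $total  = array_sum($wordArray);
        $unique = count($wordArray);
        $wordArray['_count']  = $total;
        $wordArray['_unique'] = $unique;

        $index[basename($path)] = $wordArray;
    }

    return $index;
}

// Usage sketch: index a hypothetical directory and print the result.
$index = build_index('test_files', array('the', 'a', 'and'));
print_r($index);
```

Both totals are computed before the "_count" and "_unique" keys are added, so the summary values do not inflate each other.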