Creating a simple text file based search engine

2023-01-23 14:11

I need to create a simple text-file-based search engine quickly, using PHP. It has to read the files in a directory, remove stop words and other useless words, and index each remaining word along with how many times it appears in each document.

I guess the pseudo code for this is:

for each file in directory:
    read in contents,
    compare to stop words,
    add each remaining word to array,
    count how many times that word appears in document,
    add that number to the array,
    add the id/name of the file to the array,

I also need to count the total number of words (after removing the useless ones, I guess) in the whole file. I'm guessing that can be done afterwards, as long as I can get the file id from that array and then count the words inside?

Can anyone help, maybe with a bare-bones structure? The main part I need help with is getting the number of times each word appears in the document and adding it to the index array.

Thanks

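// Stop words to drop from the index (example list only; use your own).
$useless=array('the','a','an','and','of');
$words=array();
foreach (glob('*') as $file) {
    if (!is_file($file)) continue; // skip directories
    $contents=file_get_contents($file);
    $words[$file]=array();
    // Every run of non-whitespace characters counts as one word.
    preg_match_all('/\S+/',$contents,$matches,PREG_SET_ORDER);
    foreach ($matches as $match) {
        if (!isset($words[$file][$match[0]]))
            $words[$file][$match[0]]=0;
        $words[$file][$match[0]]++;
    }
    // Remove the stop words from this file's counts.
    foreach ($useless as $value)
        if (isset($words[$file][$value]))
            unset($words[$file][$value]);
    // Total remaining words; count() here would give distinct words only.
    $count=array_sum($words[$file]);
    var_dump($words[$file]);
    echo 'Number of words: '.$count;
}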

Take a look at str_word_count. It counts words, but can also extract them to an array (each value in the array being a word). You can then post-process this array to remove stop words, count occurrences, etc.
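
As a rough sketch of that approach: with format 1, str_word_count returns the words as an array, which array_diff and array_count_values can then post-process. The stop-word list and the sample text are made up for illustration, and lowercasing the text first is an assumption about how you want words compared.

```php
<?php
// Hypothetical stop-word list for illustration; replace with your own.
$stop_words = array('the', 'a', 'to', 'of');

$text = 'The flight to JFK and the car to the JFK terminal';

// Format 1 returns every word as an array element. Lowercase first so
// that "The" and "the" are treated as the same word.
$all_words = str_word_count(strtolower($text), 1);

// Drop the stop words, then count how often each remaining word occurs.
$kept = array_diff($all_words, $stop_words);
$counts = array_count_values($kept);

print_r($counts);
echo 'Total words kept: ' . count($kept) . "\n";
```

Note that str_word_count only treats letters, apostrophes and hyphens as word characters by default, so digits are dropped unless you pass them in its third argument.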

Getting each file in the directory is simple with glob, and the files can then be read with file_get_contents.

/**
 * This is how you will add extra rows
 * 
 * $index[] = array(
 *  'filename' => 'airlines.txt',
 *  'word' => 'JFK',
 *  'count' => 3,
 *  'all_words_count' => 42
 * );
*/
$index = array();

$words = array('jfk', 'car');

foreach( $words as $word ) {

  // All files with a .txt extension
  // Alternate way would be "/path/to/dir/*"
  foreach (glob("test_files/*.txt") as $filename) {

    // Read the whole file into a string
    $content = file_get_contents($filename);

    $count = 0;

    $totalCount = str_word_count($content);

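    // preg_quote() escapes regex metacharacters in $word. Note that this
    // still matches substrings, so 'car' also counts 'cars' and 'card'.
    if( preg_match_all('/' . preg_quote($word, '/') . '/i', $content, $matches) ) {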
      $count = count($matches[0]);
    }

    // Add another item to the list
    $index[] = array(
        'filename' => $filename,
        'word' => $word,
        'count' => $count,
        'all_words_count' => $totalCount
      );

  }

}

// Debug and look at the index array,
// make sure it looks the way you want it.
echo '<pre>';
print_r($index);
echo '</pre>';

When I tested the above code, this is what I got.

Array
(
    [0] => Array
        (
            [filename] => test_files/airlines.txt
            [word] => jfk
            [count] => 2
            [all_words_count] => 38
        )

    [1] => Array
        (
            [filename] => test_files/rentals.txt
            [word] => jfk
            [count] => 0
            [all_words_count] => 47
        )

    [2] => Array
        (
            [filename] => test_files/airlines.txt
            [word] => car
            [count] => 0
            [all_words_count] => 38
        )

    [3] => Array
        (
            [filename] => test_files/rentals.txt
            [word] => car
            [count] => 3
            [all_words_count] => 47
        )

)

I think I have solved your question :D Add this after the above script and you can sort by count: $sorted runs from the lowest count up, and $sorted_desc from the highest down.

function sorter($a, $b) {
  if( $a['count'] == $b['count'] )
    return 0;

  return ($a['count'] < $b['count']) ? -1 : 1;
}

// Clone the original list
$sorted = $index;

// Run a custom sort function
uasort($sorted, 'sorter');

// Reverse the array to find the highest first
$sorted_desc = array_reverse($sorted);

// Debug and look at the index array,
// make sure it looks the way you want it.
echo '<h1>Ascending</h1><pre>';
print_r($sorted);
echo '</pre>';

echo '<h1>Descending</h1><pre>';
print_r($sorted_desc);
echo '</pre>';
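
If the script runs on PHP 7 or later (an assumption, since the thread does not state a version), the descending order can also be produced directly with usort and the <=> operator, without a separate array_reverse step. The sample rows below are illustrative and only mirror the shape of $index.

```php
<?php
// Sample rows in the same shape as $index (values are illustrative).
$index = array(
    array('filename' => 'a.txt', 'word' => 'jfk', 'count' => 2),
    array('filename' => 'b.txt', 'word' => 'jfk', 'count' => 0),
    array('filename' => 'b.txt', 'word' => 'car', 'count' => 3),
);

$sorted_desc = $index;
// Comparing $b to $a (rather than $a to $b) sorts from highest to lowest.
usort($sorted_desc, function ($a, $b) {
    return $b['count'] <=> $a['count'];
});

print_r(array_column($sorted_desc, 'count'));
```

Unlike uasort, usort re-indexes the array from 0, which is usually what you want when printing a ranked list.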

Here's a basic structure:

Create an $index array
Use scandir (or glob, if you need to only get files of a certain type) to get the files in the directory.
For each file:
1. Get contents with file_get_contents
2. Use str_word_count to get an array $word_stream of the words in the file
3. Create an array $word_array to hold word counts
4. For each word in $word_stream:
  1. If it is in a $ignored_words array, skip it
  2. If it is not already in $word_array as a key, add $word_array[$word] = 1
  3. If it is already in $word_array, increment $word_array[$word]++
5. Get the total number of words with array_sum, or the number of unique words with count; you can add them to $word_array with keys "_count" and "_unique" (which will not be words), if you like
6. Add the filename as a key to the $index array, with the value being $word_array
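
The steps above could be sketched roughly as follows. The function name build_index and its parameters are hypothetical, and lowercasing the text before counting is an extra assumption that is not part of the steps.

```php
<?php
// Sketch of the steps above. build_index() and its parameter names are
// hypothetical, not part of PHP.
function build_index($dir, array $ignored_words)
{
    $index = array();
    foreach (scandir($dir) as $name) {
        $path = $dir . '/' . $name;
        if (!is_file($path)) {
            continue; // skips ".", ".." and subdirectories
        }
        // Assumption: lowercase so "Car" and "car" count as one word.
        $contents = strtolower(file_get_contents($path));
        $word_stream = str_word_count($contents, 1);
        $word_array = array();
        foreach ($word_stream as $word) {
            if (in_array($word, $ignored_words, true)) {
                continue;
            }
            if (!isset($word_array[$word])) {
                $word_array[$word] = 1;
            } else {
                $word_array[$word]++;
            }
        }
        // Compute both totals before storing them, so the extra keys
        // are not themselves included in the sums.
        $unique = count($word_array);
        $total = array_sum($word_array);
        $word_array['_unique'] = $unique;
        $word_array['_count'] = $total;
        $index[$name] = $word_array;
    }
    return $index;
}
```

It could then be called as, for example, build_index('test_files', array('the', 'and')), where both arguments are placeholders for your own directory and stop-word list.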
