Creating a simple text file based search engine
I need to create a simple text-file-based search engine ASAP (using PHP). Basically, it has to read the files in a directory, remove stop words and other useless words, and index each remaining useful word along with how many times it appears in each document.
I guess the pseudo code for this is:
for each file in directory: read in contents, compare to stop words, add each remaining word to an array, count how many times that word appears in the document, add that number to the array, add the id/name of the file to the array.
I also need to count the total number of words (after removing the useless ones, I guess) in the whole file, which I'm guessing can be done afterwards, as long as I can get the file id from that array and then count the words inside?
Can anyone help, maybe provide a barebones structure? I think the main bit I need help with is getting the number of times each word appears in the document and adding it to the index array.
Thanks
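// Stop words to drop from the index; this short list is only a placeholder
$useless=array('the','a','an','and','of');
$words=array();
foreach (glob('*') as $file) {
    if (!is_file($file)) continue; // glob('*') also returns directories
    $contents=file_get_contents($file);
    $words[$file]=array();
    // Each run of non-whitespace characters is treated as one word
    preg_match_all('/\S+/',$contents,$matches,PREG_SET_ORDER);
    foreach ($matches as $match) {
        if (!isset($words[$file][$match[0]]))
            $words[$file][$match[0]]=0;
        $words[$file][$match[0]]++;
    }
    // Remove the stop words after counting
    foreach ($useless as $value)
        if (isset($words[$file][$value]))
            unset($words[$file][$value]);
    // Total remaining words; count($words[$file]) would give distinct words instead
    $count=array_sum($words[$file]);
    var_dump($words[$file]);
    echo 'Number of words: '.$count;
}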
Take a look at str_word_count. It counts words, but can also extract them to an array (each value in the array being a word). You can then post-process this array to remove stop words, count occurrences, etc.
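A minimal sketch of that approach, assuming a made-up sample string and a short stop-word list ($text, $stopWords, $kept and $counts are illustrative names, not from the question):

```php
<?php
// Hypothetical sample text and stop-word list, for illustration only.
$text = 'The car is at the airport and the car is red';
$stopWords = array('the', 'is', 'at', 'and');

// Format 1 makes str_word_count() return an array of the words it found.
$allWords = str_word_count(strtolower($text), 1);

// Drop the stop words, then count how often each remaining word occurs.
$kept = array_diff($allWords, $stopWords);
$counts = array_count_values($kept);

print_r($counts);
echo 'Words after stop-word removal: ' . count($kept) . "\n";
```

Lowercasing first means "The" and "the" are treated as the same word; leave that out if case matters for your index.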
Well, getting each file in the directory should be simple using glob. Then reading the files can be done with file_get_contents.
/**
* This is how you will add extra rows
*
* $index[] = array(
* 'filename' => 'airlines.txt',
* 'word' => 'JFK',
* 'count' => 3,
* 'all_words_count' => 42
* );
*/
$index = array();
$words = array('jfk', 'car');
foreach( $words as $word ) {
// All files with a .txt extension
// Alternate way would be "/path/to/dir/*"
foreach (glob("test_files/*.txt") as $filename) {
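// The second argument (true) also searches the include_path for the file
$content = file_get_contents($filename, true);
$count = 0;
$totalCount = str_word_count($content);
// preg_quote() escapes any regex special characters in the word.
// Note: this matches substrings too, so 'car' is also found inside 'cargo'.
if( preg_match_all('/' . preg_quote($word, '/') . '/i', $content, $matches) ) {
$count = count($matches[0]);
}
// Add another item to the list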
$index[] = array(
'filename' => $filename,
'word' => $word,
'count' => $count,
'all_words_count' => $totalCount
);
}
}
// Debug and look at the index array,
// make sure it looks the way you want it.
echo '<pre>';
print_r($index);
echo '</pre>';
When I tested the above code, this is what I got.
Array
(
[0] => Array
(
[filename] => test_files/airlines.txt
[word] => jfk
[count] => 2
[all_words_count] => 38
)
[1] => Array
(
[filename] => test_files/rentals.txt
[word] => jfk
[count] => 0
[all_words_count] => 47
)
[2] => Array
(
[filename] => test_files/airlines.txt
[word] => car
[count] => 0
[all_words_count] => 38
)
[3] => Array
(
[filename] => test_files/rentals.txt
[word] => car
[count] => 3
[all_words_count] => 47
)
)
I think I have solved your question :D Add this after the above script and you should be able to sort by count: ascending from zero with $sorted, and descending from the highest with $sorted_desc.
function sorter($a, $b) {
if( $a['count'] == $b['count'] )
return 0;
return ($a['count'] < $b['count']) ? -1 : 1;
}
// Clone the original list
$sorted = $index;
// Run a custom sort function
uasort($sorted, 'sorter');
// Reverse the array to find the highest first
$sorted_desc = array_reverse($sorted);
// Debug and look at the index array,
// make sure it looks the way you want it.
echo '<h1>Ascending</h1><pre>';
print_r($sorted);
echo '</pre>';
echo '<h1>Descending</h1><pre>';
print_r($sorted_desc);
echo '</pre>';
Here's a basic structure:
- Create an $index array
- Use scandir (or glob, if you need to only get files of a certain type) to get the files in the directory.
- For each file:
  - Get contents with file_get_contents
  - Use str_word_count to get an array $word_stream of the word stream
  - Create an array $word_array to hold word counts
  - For each word in $word_stream:
    - If it is in a $ignored_words array, skip it
    - If it is not already in $word_array as a key, add $word_array[$word] = 1
    - If it is already in $word_array, increment $word_array[$word]++
  - Get the sum of $word_array with array_sum, or the sum of unique words with count; you can add them to $word_array with keys "_unique" and "_count" (which will not be words), if you like
  - Add the filename as a key to the $index array, with the value being $word_array
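The outline above can be sketched roughly as follows. This is only one way to do it: build_index() is a hypothetical helper name, the 'test_files' directory and the *.txt filter are assumptions, and words are lowercased before counting, which the outline does not require.

```php
<?php
// Sketch of the outline above; build_index() is a hypothetical helper.
function build_index($dir, array $ignored_words)
{
    $index = array();
    // glob() may return false on error, so fall back to an empty list.
    $files = glob($dir . '/*.txt') ?: array();
    foreach ($files as $path) {
        // Format 1 returns the words of the file as an array.
        $word_stream = str_word_count(strtolower(file_get_contents($path)), 1);
        $word_array = array();
        foreach ($word_stream as $word) {
            if (in_array($word, $ignored_words)) {
                continue; // stop word: skip it
            }
            if (!isset($word_array[$word])) {
                $word_array[$word] = 1;
            } else {
                $word_array[$word]++;
            }
        }
        // Compute both totals before storing them, so the extra keys
        // are not included in the sums themselves.
        $unique = count($word_array);
        $total = array_sum($word_array);
        $word_array['_unique'] = $unique;
        $word_array['_count'] = $total;
        $index[basename($path)] = $word_array;
    }
    return $index;
}

// Usage: 'test_files' is an assumed directory name.
print_r(build_index('test_files', array('the', 'a', 'and')));
```

Since str_word_count only returns letters, apostrophes and hyphens, the "_unique" and "_count" keys cannot collide with a real word.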