Assistance with building an inverted-index
It's part of an information retrieval thing I'm doing for school. The plan is to create a hashmap of words using the the first two letters of the word as a key and any words with the two letters save开发者_Go百科d as a string value. So,
hashmap["ba"] = "bad barley base"
Once I'm done tokenizing a line I take that hashmap, serialize it, and append it to the text file named after the key.
The idea is that if I take my data and spread it over hundreds of files I'll lessen the time it takes to fulfill a search by lessening the density of each file. The problem I am running into is when I'm making 100+ files in each run it happens to choke on creating a few files for whatever reason and so those entries are empty. Is there any way to make this more efficient? Is it worth continuing this, or should I abandon it?
I'd like to mention I'm using PHP. The two languages I know relatively intimately are PHP and Java. I chose PHP because the front end will be very simple to do and I will be able to add features like autocompletion/suggested search without a problem. I also see no benefit in using Java. Any help is appreciated, thanks.
I would use a single file to get and put the serialized string. I would also use json as the serialization.
Put the data
$string = "bad barley base";
$data = explode(" ",$string);
$hashmap["ba"] = $data;
$jsonContent = json_encode($hashmap);
file_put_contents("a-z.txt",$jsonContent);
Get the data
$jsonContent = file_get_contents("a-z.txt");
$hashmap = json_decode($jsonContent);
foreach($hashmap as $firstTwoCharacters => $value) {
if ($firstTwoCharacters == 'ba') {
$wordCount = count($value);
}
}
You didn't explain the problem you are trying to solve. I'm guessing you are trying to make a full text search engine, but you don't have document ids in your hashmap so I'm not sure how you are using the hashmap to find matching documents.
Assuming you want a full text search engine, I would look into using a trie for the data structure. You should be able to fit everything in it without it growing too large. Nodes that match a word you want to index would contain the ids of the documents containing that word.
精彩评论