Assistance with building an inverted-index

2022-12-25 06:16 问答作者：

It's part of an information retrieval thing I'm doing for school. The plan is to create a hashmap of words using the the first two letters of the word as a key and any words with the two letters save开发者_Go百科d as a string value. So,

hashmap["ba"] = "bad barley base"

Once I'm done tokenizing a line I take that hashmap, serialize it, and append it to the text file named after the key.

The idea is that if I take my data and spread it over hundreds of files I'll lessen the time it takes to fulfill a search by lessening the density of each file. The problem I am running into is when I'm making 100+ files in each run it happens to choke on creating a few files for whatever reason and so those entries are empty. Is there any way to make this more efficient? Is it worth continuing this, or should I abandon it?

I'd like to mention I'm using PHP. The two languages I know relatively intimately are PHP and Java. I chose PHP because the front end will be very simple to do and I will be able to add features like autocompletion/suggested search without a problem. I also see no benefit in using Java. Any help is appreciated, thanks.

I would use a single file to get and put the serialized string. I would also use json as the serialization.

Put the data

$string = "bad barley base";
$data = explode(" ",$string);
$hashmap["ba"] = $data;

$jsonContent = json_encode($hashmap);
file_put_contents("a-z.txt",$jsonContent);

Get the data

$jsonContent = file_get_contents("a-z.txt");
$hashmap = json_decode($jsonContent);

foreach($hashmap as $firstTwoCharacters => $value) {
    if ($firstTwoCharacters == 'ba') {
        $wordCount = count($value);
    }
}

You didn't explain the problem you are trying to solve. I'm guessing you are trying to make a full text search engine, but you don't have document ids in your hashmap so I'm not sure how you are using the hashmap to find matching documents.

Assuming you want a full text search engine, I would look into using a trie for the data structure. You should be able to fit everything in it without it growing too large. Nodes that match a word you want to index would contain the ids of the documents containing that word.

继续阅读：information-retrieval inverted-index php search search-engine

Assistance with building an inverted-index

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？