Efficiently generating a document index for a large number of small documents in a large file
Goal
I have a very large corpus of the following format:
<entry id=1>
Some text
...
Some more text
</entry>
...
<entry id=k>
Some text
...
Some more text
</entry>
There are tens of millions of entries in this corpus, and even more in the other corpora I want to handle.
I want to treat each entry as a separate document and build a mapping from each word in the corpus to the list of documents it occurs in.
Problem
Ideally, I would just split the file into separate files for each entry and run something like a Lucene indexer over the directory with all the files. However, creating millions and millions of files seems to crash my lab computer.
Question
Is there a relatively simple way of solving this problem? Should I keep all the entries in a single file? If so, how can I track where each entry sits in the file for use in an index? Or should I use something other than separate files for each entry?
If it's relevant, I do most of my coding in Python, but solutions in another language are welcome.
Well, keeping all the entries in a single file is not a good idea. You can process your big file with a generator, entry by entry, to avoid memory issues, and then I'd recommend storing each entry in a database. In the process, you can incrementally build all the relevant structures, such as term frequencies, document frequencies, and posting lists, and save those in the database as well.
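A minimal sketch of that approach, assuming the entry markers look exactly like the sample above; the tokenizer, the SQLite schema, and names like build_index are just illustrative:

import re
import sqlite3
from collections import defaultdict

def iter_entries(path):
    """Yield (entry_id, text) pairs from the big corpus file, one entry at a time."""
    entry_id, lines = None, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = re.match(r'<entry id=([^>]+)>', line)
            if m:
                entry_id, lines = m.group(1), []
            elif line.startswith('</entry>'):
                yield entry_id, ''.join(lines)
            elif entry_id is not None:
                lines.append(line)

def build_index(corpus_path, db_path="index.db", batch_size=10000):
    """Stream the corpus and store a (term, entry_id, tf) posting table in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS postings (term TEXT, entry_id TEXT, tf INTEGER)")
    batch = []
    for entry_id, text in iter_entries(corpus_path):
        # Crude whitespace/word tokenizer; swap in whatever analysis you need.
        counts = defaultdict(int)
        for token in re.findall(r'\w+', text.lower()):
            counts[token] += 1
        batch.extend((term, entry_id, tf) for term, tf in counts.items())
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", batch)
            batch = []
    if batch:
        conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", batch)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_term ON postings(term)")
    conn.commit()
    conn.close()

The posting list for a word is then a single query, e.g. SELECT entry_id, tf FROM postings WHERE term = 'someword'.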
This question might have some useful info.
Also take a look at this to get an idea.