
Efficiently generating a document index for a large number of small documents in a large file

Goal

I have a very large corpus of the following format:

<entry id=1>
Some text
...
Some more text
</entry>

...

<entry id=k>
Some text
...
Some more text
</entry>

There are tens of millions of entries for this corpus, and more for other corpora I want to deal with.

I want to treat each entry as a separate document and have a mapping from words of the corpus to the list of documents they occur in.

Problem

Ideally, I would just split the file into separate files for each entry and run something like a Lucene indexer over the directory with all the files. However, creating millions and millions of files seems to crash my lab computer.

Question

Is there a relatively simple way of solving this problem? Should I keep all the entries in a single file? How can I track where they are in the file for use in an index? Should I use some other tool than separate files for each entry?

If it's relevant, I do most of my coding in Python, but solutions in another language are welcome.


Well, keeping all the entries in a single file is not a good idea. You can process your big file entry by entry with generators, so as to avoid memory issues, and then I'd recommend storing each entry in a database. In the process you can incrementally build all the relevant structures, such as term frequencies, document frequencies and posting lists, which you can also save in the database.
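
A minimal sketch of that idea, assuming the corpus sits in a single file (called corpus.txt here) with the <entry id=...> delimiters shown above; the tokenizer, the SQLite schema and the iter_entries/build_index helpers are illustrative choices, not a drop-in solution:

import re
import sqlite3
from collections import defaultdict

ENTRY_OPEN = re.compile(r'<entry id=(\S+)>')

def iter_entries(path):
    """Yield (entry_id, text) pairs one at a time, never loading the whole file."""
    entry_id, lines = None, []
    with open(path, encoding='utf-8') as f:
        for line in f:
            m = ENTRY_OPEN.match(line)
            if m:
                entry_id, lines = m.group(1), []
            elif line.startswith('</entry>'):
                yield entry_id, ' '.join(lines)
                entry_id, lines = None, []
            elif entry_id is not None:
                lines.append(line.strip())

def build_index(corpus_path, db_path):
    """Build a word -> list-of-entry-ids posting list and store it in SQLite."""
    postings = defaultdict(list)
    for entry_id, text in iter_entries(corpus_path):
        # use a set so each entry is recorded at most once per word
        for word in set(re.findall(r'\w+', text.lower())):
            postings[word].append(entry_id)

    con = sqlite3.connect(db_path)
    con.execute('CREATE TABLE IF NOT EXISTS postings (word TEXT PRIMARY KEY, doc_ids TEXT)')
    con.executemany('INSERT OR REPLACE INTO postings VALUES (?, ?)',
                    ((w, ','.join(ids)) for w, ids in postings.items()))
    con.commit()
    con.close()

build_index('corpus.txt', 'index.db')

Note that this sketch still keeps the whole posting dictionary in memory; for tens of millions of entries you would flush partial posting lists to the database in batches instead, or feed the entries to a purpose-built indexer (Lucene, Whoosh, etc.) rather than rolling your own.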

This question might have some useful info.

Also take a look at this to get an idea.

