Efficiently generating a document index for a large number of small documents in a large file
Goal
I have a very large corpus of the following format:
<entry id=1>
Some text
...
Some more text
</entry>
...
<entry id=k>
Some text
...
Some more text
</entry>
There are tens of millions of entries in this corpus, and even more in the other corpora I want to handle.
I want to treat each entry as a separate document and build a mapping from each word in the corpus to the list of documents it occurs in.
Problem
Ideally, I would just split the file into separate files for each entry and run something like a Lucene indexer over the directory with all the files. However, creating millions and millions of files seems to crash my lab computer.
Question
Is there a relatively simple way of solving this problem? Should I keep all the entries in a single file? If so, how can I track where each entry sits in the file for use in an index? Or should I use something other than separate files for each entry?
If it's relevant, I do most of my coding in Python, but solutions in another language are welcome.
Well, keeping all the entries in a single file is not a good idea. You can process your big file with a generator, entry by entry, to avoid memory issues, and then I'd recommend storing each entry in a database. In the process, you can incrementally build all the relevant structures, such as term frequencies, document frequencies, and posting lists, and save those in the database as well.
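A minimal sketch of that approach, assuming the entry markers look exactly like the sample above; the tokenizer, the SQLite schema, and names like build_index are just illustrative:

import re
import sqlite3
from collections import defaultdict

def iter_entries(path):
    """Yield (entry_id, text) pairs from the big corpus file, one entry at a time."""
    entry_id, lines = None, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = re.match(r'<entry id=([^>]+)>', line)
            if m:
                entry_id, lines = m.group(1), []
            elif line.startswith('</entry>'):
                yield entry_id, ''.join(lines)
            elif entry_id is not None:
                lines.append(line)

def build_index(corpus_path, db_path="index.db", batch_size=10000):
    """Stream the corpus and store a (term, entry_id, tf) posting table in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS postings (term TEXT, entry_id TEXT, tf INTEGER)")
    batch = []
    for entry_id, text in iter_entries(corpus_path):
        # Crude whitespace/word tokenizer; swap in whatever analysis you need.
        counts = defaultdict(int)
        for token in re.findall(r'\w+', text.lower()):
            counts[token] += 1
        batch.extend((term, entry_id, tf) for term, tf in counts.items())
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", batch)
            batch = []
    if batch:
        conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", batch)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_term ON postings(term)")
    conn.commit()
    conn.close()

The posting list for a word is then a single query, e.g. SELECT entry_id, tf FROM postings WHERE term = 'someword'.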
This question might have some useful info.
Also take a look at this to get an idea.