Fast in-memory inverted index

2023-03-18 01:00 Q&A author:

I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.

All other attributes of entities I can store in some fast key-value store.

I hoped I could use Lucene just as an inverted index, but cannot see how I can associate with a document my own custom feature vector with precomputed weights. Any recommendations would be much appreciated!

Thank you.

I have been doing some similar work and have discovered that redis' zset is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory mapped files).

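Basically, a zset is a set of unique members, each with an associated score, kept sorted by score.

So you can have one sorted set per feature, where each member is a document id and its score is that document's weight for the feature:
feature->[ { docid, score }, { docid, score } ..]
i.e.
zadd feature score docid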

redis then has some nice operators to merge, extract ranges etc. See zunionstore, zrange (http://redis.io/commands/zunionstore).

Very fast (supposedly) and all in memory etc ... (though redis is not an embedded db).
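To make the layout concrete, here is a minimal pure-Python sketch of the same idea, with no Redis server involved. The names `postings`, `zadd` and `weighted_union` are made up for illustration; `weighted_union` only imitates what ZUNIONSTORE does with WEIGHTS and its default SUM aggregate, which amounts to a dot product between the query weights and each document's weights.

```python
from collections import defaultdict

# Hypothetical in-process stand-in for one sorted set per feature:
# postings[feature][docid] = score
postings = defaultdict(dict)

def zadd(feature, score, docid):
    """Imitates `zadd feature score docid` (same argument order)."""
    postings[feature][docid] = score

def weighted_union(query):
    """Imitates ZUNIONSTORE with WEIGHTS and the default SUM aggregate.

    `query` maps feature -> query weight. Each posting score is multiplied
    by the query weight of its feature and summed per document.
    """
    totals = defaultdict(float)
    for feature, q_weight in query.items():
        for docid, score in postings.get(feature, {}).items():
            totals[docid] += q_weight * score
    # Highest total first, similar to reading the result set in reverse order.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

zadd("red", 0.5, "doc1")
zadd("red", 1.0, "doc2")
zadd("round", 2.0, "doc1")

ranking = weighted_union({"red": 1.0, "round": 0.5})
print(ranking)
```

With a real Redis instance the per-feature sets would live on the server and the union would be computed there; this sketch only shows the data layout and the arithmetic.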

Have you looked at Terrier? I'm not quite sure it has in-memory indexes, but it is far more extensible regarding indexing and scoring than Lucene.

Lucene lets you store pretty much any data associated with a document. It also has a feature called "payloads" that allows you to store arbitrary data in the index associated with a term in a document. So I think what you want is to store your "features" as terms in the index, and the weights as payloads, and you should be able to make Lucene do what you want. It does have an in-memory index implementation.

If the pairs of entities you want to compare are already given in advance, and you are interested in the pairwise scores, I don't think Lucene will give you any advantage. Just look up the vectors in some key-value store and compute the similarity. Consider using a sparse vector representation for space and time efficiency.
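As a rough sketch of that pairwise case, assuming each entity's vector is stored as a `{feature: weight}` dict and that cosine is the similarity you want (both are assumptions for illustration; the `store` dict stands in for whatever key-value store you use):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector only
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

# Hypothetical stand-in for the key-value store: entity id -> sparse vector.
store = {
    "e1": {"x": 1.0, "y": 1.0},
    "e2": {"x": 1.0},
    "e3": {"z": 2.0},
}

print(cosine(store["e1"], store["e2"]))
print(cosine(store["e1"], store["e3"]))
```

Only features with non-zero weight are stored, so the cost of one comparison grows with the number of non-zero entries rather than with the size of the feature space.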

If only one entity is given in advance, and you are more interested in a ranking-like scenario, Lucene may be worth a try. The right place to look would be

org.apache.lucene.search.Similarity

You should be able to adapt it to your needs and set your version as the default with

setDefault(Similarity similarity)

However, I would be careful with expectations of speed gains (compared with iterating through all entities), as they largely depend on the sparsity of the query and the scoring function you choose to implement. Also note that Lucene uses a two-stage retrieval scheme: first boolean matching ("are all of the AND terms contained? any of the OR terms?"), then scoring of whatever passes. With tf.idf you lose nothing along the way, but with other scoring functions you might.
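The following toy sketch illustrates that last point. It is not Lucene code: `docs`, `index`, `candidates` and `search` are made-up names, and Euclidean distance is just one example of a scoring function for which the boolean stage can drop the true best match.

```python
import math
from collections import defaultdict

# Toy collection: entity id -> sparse feature vector.
docs = {
    "d1": {"a": 1.0, "b": 2.0},
    "d2": {"b": 1.0},
    "d3": {"c": 0.1},
}

# Inverted index: feature -> ids of documents containing it.
index = defaultdict(set)
for docid, vec in docs.items():
    for feature in vec:
        index[feature].add(docid)

def candidates(query):
    """Stage 1: boolean OR, i.e. every document sharing a feature with the query."""
    found = set()
    for feature in query:
        found |= index.get(feature, set())
    return found

def euclidean(u, v):
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

def search(query, score):
    """Stage 2: score only what passed stage 1."""
    return {d: score(query, docs[d]) for d in candidates(query)}

query = {"a": 0.1}
two_stage = search(query, euclidean)
exhaustive = {d: euclidean(query, vec) for d, vec in docs.items()}
nearest = min(exhaustive, key=exhaustive.get)

print(sorted(two_stage))  # documents that were scored at all
print(nearest)            # closest document when nothing is filtered out
```

Here the closest document by Euclidean distance shares no feature with the query, so the boolean stage never passes it on for scoring. For a dot-product style score such a document would contribute zero anyway, which is why nothing is lost in that case.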

For more general approaches for efficient approximate nearest neighbor search it might be worthwhile to look into LSH:

http://en.wikipedia.org/wiki/Locality-sensitive_hashing
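For a feel of how this works, here is a minimal sketch of one LSH family, random hyperplanes for cosine similarity, over sparse `{feature: weight}` vectors. The class name `HyperplaneLSH` and its parameters are invented for this example, and it uses a single hash table; practical setups usually combine several tables and re-rank the candidates with the exact similarity.

```python
import random
from collections import defaultdict

class HyperplaneLSH:
    """Single-table random-hyperplane LSH (illustrative sketch only)."""

    def __init__(self, num_bits=8, seed=0):
        self.num_bits = num_bits
        self.rng = random.Random(seed)
        self.planes = {}                  # feature -> num_bits Gaussian coefficients
        self.buckets = defaultdict(set)   # signature -> entity ids

    def _coeffs(self, feature):
        # Coefficients are drawn lazily the first time a feature is seen,
        # so the feature space does not have to be known up front.
        if feature not in self.planes:
            self.planes[feature] = [
                self.rng.gauss(0.0, 1.0) for _ in range(self.num_bits)
            ]
        return self.planes[feature]

    def signature(self, vec):
        # One bit per hyperplane: which side of the plane the vector lies on.
        sums = [0.0] * self.num_bits
        for feature, weight in vec.items():
            for i, c in enumerate(self._coeffs(feature)):
                sums[i] += weight * c
        return tuple(s >= 0.0 for s in sums)

    def add(self, entity_id, vec):
        self.buckets[self.signature(vec)].add(entity_id)

    def query(self, vec):
        """Return ids whose signature equals the query's (candidate neighbours)."""
        return set(self.buckets.get(self.signature(vec), set()))

lsh = HyperplaneLSH(num_bits=8, seed=42)
lsh.add("e1", {"x": 1.0, "y": 2.0})
hits = lsh.query({"x": 2.0, "y": 4.0})  # same direction, different length
print(hits)
```

Vectors pointing in the same direction always land in the same bucket, and the smaller the angle between two vectors, the more likely their signatures agree. More bits make buckets more selective at the cost of missing more true neighbours.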
