Getting the Vector Space Model (tf-idf) from a query on a Lucene index
I need to get the Vector Space Model (with tf-idf weighting) from the results of a Lucene query, and can't figure out how to do it. It seems like it should be simple, and at this stage maybe one of you guys can point me in the right direction.
I have been trying to figure out how to do this for a good while, and either I haven't yet copped how the stuff I have read gives me what I need (more than likely), or a solution hasn't been posted to my particular problem. I even tried computing the VSM myself directly from the query results, but my solution has hideous complexity.
Edit: For anyone else who stumbles upon this, there is a solution at the much clearer question here. What I need can be obtained with the IndexReader.getTermFreqVector(int docNumber, String field) method.
Unfortunately this doesn't work for me as the index I am working off hasn't stored the term frequency vectors, so I guess I'm still looking for more help on this!
To answer this question: there is no way of directly getting the VSM for a set of results in Lucene, but you can compute a TF-IDF weighted vector space model for a set of Lucene results yourself using the IndexReader.getTermFreqVector() and Searcher.docFreq() methods.
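To make the weighting step concrete, here is a minimal sketch in plain Java. It assumes you have already pulled the per-document term frequencies and the per-term document frequencies out of the index into maps (for example via the two methods mentioned above); the class name TfIdfSketch and its method names are made up for illustration and are not part of Lucene. The formula used, sqrt(tf) * (1 + ln(numDocs / (docFreq + 1))), is, as far as I know, the one Lucene's DefaultSimilarity uses, but treat that as an assumption and swap in whatever weighting you actually want.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a Lucene class: turns raw counts into tf-idf weights.
public class TfIdfSketch {

    // Assumed idf form: 1 + ln(numDocs / (docFreq + 1)).
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // termFreqs: term -> occurrences in one document
    // docFreqs:  term -> number of documents in the index containing the term
    // numDocs:   total number of documents in the index
    public static Map<String, Double> tfIdfVector(Map<String, Integer> termFreqs,
                                                  Map<String, Integer> docFreqs,
                                                  int numDocs) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            // A term missing from docFreqs is treated as having document frequency 0.
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            // Dampen raw term frequency with a square root (assumed tf form).
            double tf = Math.sqrt(e.getValue());
            vector.put(e.getKey(), tf * idf(df, numDocs));
        }
        return vector;
    }
}
```

You would call tfIdfVector once per hit, giving one sparse vector (term to weight) per result document.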
Maybe I'm misunderstanding what you're trying to do, but Lucene's scoring uses the vector space model. If you want more details on how the scores are calculated, given a document and a query, use Searcher.explain(Query query, int doc).
If I understand correctly from your comment, you want to compute the VSM cosine similarity between documents rather than between a query and a document. I don't know exactly how to do this, but I'd point you to the Lucene API page for the Similarity class. You'd probably have to derive and use a custom subclass of Similarity that changes the coord and queryNorm methods and find a way to turn documents into query objects.
(No guarantees; I'm just trying to figure out this scoring myself.)
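If subclassing Similarity turns out to be awkward, the document-to-document cosine can also be computed outside Lucene once each document is available as a sparse term-to-weight map. The following is only a sketch under that assumption; CosineSketch and its methods are illustrative names, not Lucene API.

```java
import java.util.Map;

// Hypothetical helper, not a Lucene class: cosine similarity of two sparse vectors.
public class CosineSketch {

    // Euclidean length of a sparse vector.
    public static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // cos(a, b) = (a . b) / (|a| * |b|); returns 0 if either vector is empty or all zeros.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            // Only terms present in both documents contribute to the dot product.
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }
}
```

Two documents with identical weights score 1, and two documents with no terms in common score 0.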