How to calculate "OnTopicness" of documents using Lucene.NET
Imagine I have a huge database of threads and posts (about 10.000.000 records) from different forum sites including several subforums that serve as my lucene documents.
Now I am trying to calculate a feature called "OnTopicness" for each post based on the terms used in it. In fact, this feature is not much more than a simple cosine similarity between two document vectors that will be stored in the database and therefore has to be calculated only once per post. :
- Forum-OnTopicness: cosine similarity between my post and a virtual document consisting of all other posts in the specified forum (including all threads in the forum)
- Thread-OnTopicness: cosine similarity between my post and a virtual document consisting of all other posts in the specified thread
Since the Lucene.NET API doesn't offer a method to calculate a document-document or document-index cosine similarity, I read that I could either parse one of the documents as query and search for the other document in the results or that I could manually calculate the similarity usi开发者_如何学Gong TermFreqVectors and DocFrequencies.
I tried the second attempt because it sounds faster but ran into a problem: The IndexReader.GetTermFreqVector() method takes the internal docNumber as parameter which I don't know if I just pass two documents to my GetCosineSimilarity method:
public void GetCosineSimilarity(Document doc1, Document doc2)
{
using (IndexReader reader = IndexReader.Open(FSDirectory.Open(indexDir), true))
{
// how do I get the docNumbers?
TermFreqVector tfv1 = reader.GetTermFreqVector(???, "PostBody");
TermFreqVector tfv2 = reader.GetTermFreqVector(???, "PostBody");
...
// assuming that I have the TermFreqVectors, how would I continue here?
}
}
Besides that, how would you create the mentioned "virtual document" for either a whole forum or a thread? Should I just concatenate the PostBody fields of all contained posts and parse them into a new document or can I just create an index them for them and somehow compare my post to this entire index?
As you can see, as a Lucene newbie, I am still not sure about my overall index design and could definitely use some general advice. Help is highly appreciated - thanks!
Take a look at MoreLikeThisQuery in https://svn.apache.org/repos/asf/incubator/lucene.net/trunk/src/contrib/Queries/Similar/
Its source may be useful.
Take a look at S-Space. It is a free open-source Java package that does a lot of the things you want to do, e.g. compute cosine similarity between documents.
精彩评论