How to calculate "OnTopicness" of documents using Lucene.NET

Imagine I have a huge database of threads and posts (about 10,000,000 records) from different forum sites, including several subforums, which serve as my Lucene documents.

Now I am trying to calculate a feature called "OnTopicness" for each post based on the terms used in it. This feature is little more than a cosine similarity between two document vectors; it will be stored in the database and therefore has to be calculated only once per post. There are two variants:

  • Forum-OnTopicness: cosine similarity between my post and a virtual document consisting of all other posts in the specified forum (including all threads in the forum)
  • Thread-OnTopicness: cosine similarity between my post and a virtual document consisting of all other posts in the specified thread
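
Written out, both variants reduce to the same formula. This is only a sketch of the standard cosine similarity: the symbols p_t and v_t (the weight of term t in the post and in the virtual document) are notation introduced here, and whether the weights are raw term frequencies or tf-idf values is an open choice, not something fixed by the definitions above.

```latex
\[
\mathrm{OnTopicness}(p, v)
  = \cos(\vec{p}, \vec{v})
  = \frac{\sum_{t} p_t \, v_t}
         {\sqrt{\sum_{t} p_t^{2}} \; \sqrt{\sum_{t} v_t^{2}}}
\]
```

The sums run over all terms t; a term missing from one of the two documents contributes nothing to the numerator.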

Since the Lucene.NET API doesn't offer a method to calculate a document-document or document-index cosine similarity, I read that I could either parse one of the documents as a query and search for the other document in the results, or manually calculate the similarity using TermFreqVectors and DocFrequencies.

I tried the second approach because it sounds faster, but ran into a problem: the IndexReader.GetTermFreqVector() method takes the internal docNumber as a parameter, which I don't know if I only pass two documents to my GetCosineSimilarity method:

public void GetCosineSimilarity(Document doc1, Document doc2)
{
    using (IndexReader reader = IndexReader.Open(FSDirectory.Open(indexDir), true))
    {
        // how do I get the docNumbers?
        TermFreqVector tfv1 = reader.GetTermFreqVector(???, "PostBody");
        TermFreqVector tfv2 = reader.GetTermFreqVector(???, "PostBody");
        ...
        // assuming that I have the TermFreqVectors, how would I continue here?
    }
}
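
A minimal sketch of how the computation could continue once both vectors are available. It assumes each vector can be read as parallel arrays of terms and frequencies (which is how TermFreqVector appears to expose them through GetTerms() and GetTermFrequencies()) and that the PostBody field was indexed with term vectors enabled. The class and method names (CosineSketch, ToMap, Cosine) are hypothetical, and raw term frequencies are used as weights; tf-idf weighting would need the DocFrequencies as well. The sketch does not touch Lucene at all, so it does not answer the docNumber question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET.
public static class CosineSketch
{
    // Builds a term -> frequency map from two parallel arrays.
    public static Dictionary<string, int> ToMap(string[] terms, int[] freqs)
    {
        var map = new Dictionary<string, int>();
        for (int i = 0; i < terms.Length; i++)
        {
            map[terms[i]] = freqs[i];
        }
        return map;
    }

    // Cosine similarity of two raw term-frequency vectors.
    // Returns 0 when either vector has zero length, to avoid dividing by zero.
    public static double Cosine(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        foreach (var pair in a)
        {
            normA += (double)pair.Value * pair.Value;
            int other;
            if (b.TryGetValue(pair.Key, out other))
            {
                // Only terms present in both documents add to the dot product.
                dot += (double)pair.Value * other;
            }
        }
        foreach (var pair in b)
        {
            normB += (double)pair.Value * pair.Value;
        }
        if (normA == 0.0 || normB == 0.0)
        {
            return 0.0;
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

With non-negative frequencies the result lies between 0 (no shared terms) and 1 (same term distribution).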

Besides that, how would you create the mentioned "virtual document" for either a whole forum or a thread? Should I just concatenate the PostBody fields of all contained posts and parse them into a new document, or can I create an index for them and somehow compare my post to this entire index?
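
One way to picture the "virtual document" without concatenating any text is as a summed term-frequency vector. The sketch below assumes each post is already available as a term -> frequency map. It adds every post of a thread (or forum) into one aggregate once, then derives "all other posts" for a given post by subtracting that post's own counts. The names VirtualDocumentSketch, AddPost and WithoutPost are hypothetical, and this is an illustration of the idea rather than a recommended index design.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET.
public static class VirtualDocumentSketch
{
    // Adds the term counts of one post to a running aggregate (modified in place).
    public static void AddPost(Dictionary<string, int> aggregate,
                               Dictionary<string, int> post)
    {
        foreach (var pair in post)
        {
            int current;
            aggregate.TryGetValue(pair.Key, out current); // current is 0 if the term is new
            aggregate[pair.Key] = current + pair.Value;
        }
    }

    // Returns a copy of the aggregate with one post's counts removed,
    // i.e. the vector of "all other posts". Terms that drop to zero are removed.
    public static Dictionary<string, int> WithoutPost(Dictionary<string, int> aggregate,
                                                      Dictionary<string, int> post)
    {
        var rest = new Dictionary<string, int>(aggregate);
        foreach (var pair in post)
        {
            int current;
            if (!rest.TryGetValue(pair.Key, out current))
            {
                continue;
            }
            int remaining = current - pair.Value;
            if (remaining > 0)
            {
                rest[pair.Key] = remaining;
            }
            else
            {
                rest.Remove(pair.Key);
            }
        }
        return rest;
    }
}
```

Building the aggregate once per thread or forum and subtracting per post avoids re-summing all other posts for each of the posts being scored.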

As you can see, as a Lucene newbie, I am still not sure about my overall index design and could definitely use some general advice. Help is highly appreciated - thanks!


Take a look at MoreLikeThisQuery in https://svn.apache.org/repos/asf/incubator/lucene.net/trunk/src/contrib/Queries/Similar/

Its source may be useful.


Take a look at S-Space. It is a free open-source Java package that does a lot of the things you want to do, e.g. compute cosine similarity between documents.
