
group similar documents

This question relates to grouping/clustering similar documents in Information Retrieval.

I have a set of documents D1, D2, ..., Dn. For each document Di, I also have a set of keywords Di_k1, Di_k2, ..., Di_km. The similarity between two documents Di and Dj is given by a function of their keyword sets, i.e. similarity(Di, Dj) = f(Di_K, Dj_K).

Now, I want to place each of these documents into a set of groups/clusters such that each cluster contains similar documents, for a given threshold on the similarity between the elements of a cluster.

One easy way is to look at every possible pair of documents, which I obviously want to avoid because the number of documents I have is fairly large, in the millions. I went through the Introduction to Information Retrieval book, but I could not find any scalable algorithm mentioned there.

My question is: what kind of algorithm can help me cluster the documents efficiently? I am especially interested in the computational complexity of the algorithm.

Thanks in advance for any pointers.


Okay, off the top of my head, you can use a language-model-based approach. First, use machine learning to build a language model (LM) for each possible class, say a bigram LM. Then, for each new document you see, calculate P(new document | class) for all classes and choose the class with the maximum probability. Use Bayes' rule to simplify the above formula.
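To make the idea above concrete, here is a minimal sketch, not a definitive implementation. It assumes you already have some labeled example documents per class (the answer does not say where the classes come from), uses whitespace tokenization, and applies add-one smoothing. The names `BigramLM`, `train_models`, and `classify` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


class BigramLM:
    """Add-one smoothed bigram language model over whitespace tokens."""

    def __init__(self):
        self.bigrams = Counter()   # (previous token, token) -> count
        self.contexts = Counter()  # previous token -> count
        self.vocab = set()

    def train(self, text):
        tokens = ["<s>"] + text.lower().split()
        self.vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            self.bigrams[(prev, cur)] += 1
            self.contexts[prev] += 1

    def log_prob(self, text):
        """Return log P(text | this class) under the smoothed bigram model."""
        tokens = ["<s>"] + text.lower().split()
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen tokens
        return sum(
            math.log((self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + v))
            for prev, cur in zip(tokens, tokens[1:])
        )


def train_models(labeled_docs):
    """labeled_docs: iterable of (class label, text). Assumed to be available."""
    models = defaultdict(BigramLM)
    class_counts = Counter()
    for label, text in labeled_docs:
        models[label].train(text)
        class_counts[label] += 1
    total = sum(class_counts.values())
    log_priors = {c: math.log(n / total) for c, n in class_counts.items()}
    return models, log_priors


def classify(text, models, log_priors):
    # Bayes' rule: P(class | doc) is proportional to P(doc | class) * P(class),
    # so the argmax can be taken over the sum of the two log terms.
    return max(models, key=lambda c: log_priors[c] + models[c].log_prob(text))


if __name__ == "__main__":
    training = [
        ("sports", "the team won the match"),
        ("sports", "the player scored a goal"),
        ("cooking", "boil the pasta in salted water"),
        ("cooking", "chop the onion and fry it"),
    ]
    models, log_priors = train_models(training)
    print(classify("the team scored a goal", models, log_priors))
```

Once the models are trained, assigning a document costs time linear in its length times the number of classes, so no pairwise document comparison is needed. The trade-off is that this is classification into known classes, not unsupervised clustering.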


One option is to relax the requirement of similarity between ALL documents in a cluster. Pick an arbitrary center for each cluster and require only similarity to that center.

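The complexity is roughly

(n / avgClusterSize) * (n / 2)

similarity computations: there are about n / avgClusterSize centers, and each of the n documents is compared against half of them on average.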
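The center-based scheme can be sketched as follows. This is a minimal illustration under stated assumptions: the question leaves f unspecified, so Jaccard similarity on keyword sets stands in for it, and the first document that matches no existing center becomes a new center. The names `jaccard` and `leader_cluster` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword sets (an assumed stand-in for f)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_cluster(docs, threshold, similarity=jaccard):
    """Assign each document to the first center it is similar enough to.

    docs: dict mapping document id -> set of keywords.
    Returns a dict mapping center id -> list of member ids (center included).
    """
    centers = []   # document ids acting as cluster centers, in creation order
    clusters = {}  # center id -> member ids
    for doc_id, keywords in docs.items():
        for center in centers:
            if similarity(keywords, docs[center]) >= threshold:
                clusters[center].append(doc_id)
                break
        else:
            # No existing center is similar enough: start a new cluster.
            centers.append(doc_id)
            clusters[doc_id] = [doc_id]
    return clusters


if __name__ == "__main__":
    docs = {
        "D1": {"a", "b", "c"},
        "D2": {"a", "b", "d"},
        "D3": {"x", "y"},
        "D4": {"x", "y", "z"},
    }
    print(leader_cluster(docs, threshold=0.5))
```

Each document is compared only against centers, never against every other document. The result depends on the order in which documents are processed, and two members of the same cluster are guaranteed to be similar to the center but not necessarily to each other.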
