Algorithm for returning similar documents represented in Vector space model
I have a DB containing tf-idf vectors of about 30,000 documents.
I would like to return for a given document a set of similar documents - about 4 or so.
I thought about implementing k-means (a clustering algorithm) on the data with cosine similarity, but I'm not sure it's the best choice, for several reasons: I don't know how to seed the initial clusters, I don't know how many clusters to create, I fear the clusters will be too unbalanced, and I'm not sure the result quality will be good.
Any advice and help from experienced users will be greatly appreciated.
Thank you,
Katie
I would like to return for a given document a set of similar documents - about 4 or so.
Then don't do k-means. Just return the four closest documents by tf-idf similarity, as any search engine would do. You can implement this as a k-nearest neighbor search, or more easily by installing a search engine library and using the initial document as a query. Lucene comes to mind.
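A brute-force k-nearest-neighbor search over 30k tf-idf vectors is small enough to scan directly. Here is a minimal sketch, assuming the vectors are stored as sparse `{term: weight}` dicts (the names `cosine`, `most_similar`, and the corpus layout are illustrative, not from the original post):

```python
import math

def cosine(a, b):
    # Cosine similarity between two sparse tf-idf vectors ({term: weight}).
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(query, corpus, k=4):
    # corpus: {doc_id: tfidf_vector}. A linear scan is fine at ~30k documents.
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in corpus.items())
    return [doc_id for score, doc_id in sorted(scored, reverse=True)[:k]]
```

If the vectors are already L2-normalized, the norms can be precomputed once and the per-query cost drops to a single sparse dot product per document.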
If I understand correctly, you:
- read 30k records from a bigger DB into a cache file / into memory
- compute cosine similarity over 10 terms * 30k records -> keep the best 4.
Can you estimate the runtimes of these phases separately?
- read or cache: how often will this be done, and how big are the 30k vectors all together?
- 10 * 30k multiply-adds: in your C / Java / ... or in some opaque DB? In C or Java, that should take < 1 second.
In general, make some back-of-the-envelope estimates before getting fancy.
(By the way, I find best-4 faster and simpler in straight-up C than std::partial_sort; YMMV.)
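The "best 4" step doesn't need a full sort of all 30k scores: a bounded heap keeps only the k best seen so far. A minimal sketch (the `scores` list here is a made-up stand-in for the real (similarity, doc_id) pairs):

```python
import heapq

# Hypothetical (similarity, doc_id) pairs; in practice these would come
# from the cosine-similarity pass over the 30k documents.
scores = [(0.1 * (i % 10), i) for i in range(30000)]

# O(n log k) instead of O(n log n): only a 4-element heap is maintained.
top4 = heapq.nlargest(4, scores)
```

This is the Python analogue of C++'s std::partial_sort / std::nth_element; in tight C code a hand-rolled best-4 loop can be faster still, as the answer notes.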