How to calculate similarity between a query and documents?

2023-02-16 20:31

I have a set of documents, and for them I have calculated:

Term frequency (TF) score
Inverse document frequency (IDF) score
TF-IDF score

Now I need to calculate the similarity between a specific query and each document, producing a score that ranks the documents from highest to lowest similarity to the query.

I have searched for a lot of information, but I do not understand the formula.

source : http://en.wikipedia.org/wiki/Vector_space_model

Can anyone guide me? I just need to know how to proceed from my current progress.

Lucene is an open-source library that does all this for you.

Pangea has already given the correct answer: don't reinvent the wheel, especially a complex wheel like document similarity. That being said, understanding how document similarity is computed is an interesting and worthwhile thing to do if you are going to be working in the field. I'll see if I can help a bit.

The basic assumption of the vector space model you have linked is that each document can be represented as a vector in N-dimensional space, where each dimension corresponds to a different word in the universe of documents. A document's value along a given dimension is its weight for that word, for example the TF-IDF score you have already computed. In this model, a query can be thought of as a very short document, and thus can also be represented as a vector in the same N-dimensional space. The cosine measure is simply the cosine of the angle between the query vector and a given document vector.

Deriving N-dimensional trigonometry is probably a math course in and of itself, but if you understand the basic idea, you can take the Wikipedia formula on faith for the actual computation (or look it up in a standard text if you prefer). The computational steps (vector dot products and norms) are also well documented individually and not terribly hard to implement. I'm sure there are also standard library implementations available.
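To make the computation concrete, here is a minimal sketch in Python. It assumes you already have TF-IDF weights stored as {term: weight} dictionaries, and it uses the standard formula cos(theta) = (q . d) / (|q| |d|). The function names, document ids, and weights below are made up for illustration; they are not taken from Lucene or any other library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as {term: weight} dicts."""
    # Dot product: only terms present in both vectors contribute.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    # Euclidean norms (lengths) of the two vectors.
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    # An all-zero vector has no direction; treat it as "not similar".
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical TF-IDF weights, invented for this example.
docs = {
    "d1": {"cat": 0.8, "sat": 0.3},
    "d2": {"dog": 0.9, "sat": 0.2},
    "d3": {"cat": 0.4, "dog": 0.4},
}
query = {"cat": 1.0}

ranking = rank_documents(query, docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

With these made-up weights, d1 ranks first because most of its weight is on the query term, d3 comes second, and d2 scores zero because it shares no terms with the query.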

The logic behind the cosine is that, as the similarity between the documents increases, the angle between the two vectors approaches zero (and thus the cosine approaches 1). You can verify this by hand with a universe of two words on the Cartesian plane. All the vector math does there is extrapolate the same concept into N dimensions.
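If you would rather check the two-word case numerically than on paper, a small sketch like the following should do. Each vector is an (x, y) pair holding the weights of two hypothetical words; the vectors themselves are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two 2-D vectors given as (x, y) tuples."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v))

def angle_degrees(u, v):
    """Angle between u and v in degrees; the clamp guards against rounding above 1."""
    return math.degrees(math.acos(min(1.0, cosine(u, v))))

# Invented weights for a universe of two words: (weight of word 1, weight of word 2).
query = (2.0, 1.0)
doc_a = (4.0, 2.0)   # same direction as the query, just a longer vector
doc_b = (1.0, 2.0)   # same words, different proportions
doc_c = (0.0, 3.0)   # contains only the second word

for name, doc in [("doc_a", doc_a), ("doc_b", doc_b), ("doc_c", doc_c)]:
    print(name, round(cosine(query, doc), 3), round(angle_degrees(query, doc), 1))
```

The cosine falls as the angle grows: doc_a points the same way as the query, doc_b is off by a moderate angle, and doc_c is furthest away. Note that doc_a scores the same as the query itself even though its vector is twice as long, since the cosine depends only on direction.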

I hope this clears up some confusion on this interesting topic. For actual implementation, I once again refer you to Pangea's suggestion to use Lucene.
