Cosine similarity and tf-idf

2023-03-11 15:49 问答作者：

I am confused by the following comment about TF-IDF and Cosine Similarity.

I was reading up on both and then on wiki under Cosine Similarity I find this sentence "In case of of information retrieval, the cosine similarity of two documents will开发者_如何学C range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90."

Now I'm wondering....aren't they 2 different things?

Is tf-idf already inside the cosine similarity? If yes, then what the heck - I can only see the inner dot products and euclidean lengths.

I thought tf-idf was something you could do before running cosine similarity on the texts. Did I miss something?

Tf-idf is a transformation you apply to texts to get two real-valued vectors. You can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing that by the product of their norms. That yields the cosine of the angle between the vectors.

If d₂ and q are tf-idf vectors, then

Cosine similarity and tf-idf

where θ is the angle between the vectors. As θ ranges from 0 to 90 degrees, cos θ ranges from 1 to 0. θ can only range from 0 to 90 degrees, because tf-idf vectors are non-negative.

There's no particularly deep connection between tf-idf and the cosine similarity/vector space model; tf-idf just works quite well with document-term matrices. It has uses outside of that domain, though, and in principle you could substitute another transformation in a VSM.

(Formula taken from the Wikipedia, hence the d₂.)

TF-IDF is just a way to measure the importance of tokens in text; it's just a very common way to turn a document into a list of numbers (the term vector that provides one edge of the angle you're getting the cosine of).

To compute cosine similarity, you need two document vectors; the vectors represent each unique term with an index, and the value at that index is some measure of how important that term is to the document and to the general concept of document similarity in general.

You could simply count the number of times each term occurred in the document (Term Frequency), and use that integer result for the term score in the vector, but the results wouldn't be very good. Extremely common terms (such as "is", "and", and "the") would cause lots of documents to appear similar to each other. (Those particular examples can be handled by using a stopword list, but other common terms that are not general enough to be considered a stopword cause the same sort of issue. On Stackoverflow, the word "question" might fall into this category. If you were analyzing cooking recipes, you'd probably run into issues with the word "egg".)

TF-IDF adjusts the raw term frequency by taking into account how frequent each term occurs in general (the Document Frequency). Inverse Document Frequency is usually the log of the number of documents divided by the number of documents the term occurs in (image from Wikipedia):

Cosine similarity and tf-idf

Think of the 'log' as a minor nuance that helps things work out in the long run -- it grows when it's argument grows, so if the term is rare, the IDF will be high (lots of documents divided by very few documents), if the term is common, the IDF will be low (lots of documents divided by lots of documents ~= 1).

Say you have 100 recipes, and all but one requires eggs, now you have three more documents that all contain the word "egg", once in the first document, twice in the second document and once in the third document. The term frequency for 'egg' in each document is 1 or 2, and the document frequency is 99 (or, arguably, 102, if you count the new documents. Let's stick with 99).

The TF-IDF of 'egg' is:

1 * log (100/99) = 0.01    # document 1
2 * log (100/99) = 0.02    # document 2
1 * log (100/99) = 0.01    # document 3

These are all pretty small numbers; in contrast, let's look at another word that only occurs in 9 of your 100 recipe corpus: 'arugula'. It occurs twice in the first doc, three times in the second, and does not occur in the third document.

The TF-IDF for 'arugula' is:

1 * log (100/9) = 2.40  # document 1
2 * log (100/9) = 4.81  # document 2
0 * log (100/9) = 0     # document 3

'arugula' is really important for document 2, at least compared to 'egg'. Who cares how many times egg occurs? Everything contains egg! These term vectors are a lot more informative than simple counts, and they will result in documents 1 & 2 being much closer together (with respect to document 3) than they would be if simple term counts were used. In this case, the same result would probably arise (hey! we only have two terms here), but the difference would be smaller.

The take-home here is that TF-IDF generates more useful measures of a term in a document, so you don't focus on really common terms (stopwords, 'egg'), and lose sight of the important terms ('arugula').

The complete mathematical procedure for cosine similarity is explained in these tutorials

part-I
part-II
part-III

Suppose if you want to calculate cosine similarity between two documents, first step will be to calculate the tf-idf vectors of the two documents. and then find the dot product of these two vectors. Those tutorials will help you :)

tf/idf weighting has some cases where they fail and generate NaN error in code while computing. It's very important to read this: http://www.p-value.info/2013/02/when-tfidf-and-cosine-similarity-fail.html

Tf-idf is just used to find the vectors from the documents based on tf - Term Frequency - which is used to find how many times the term occurs in the document and inverse document frequency - which gives the measure of how many times the term appears in the whole collection.

Then you can find the cosine similarity between the documents.

TFIDF is inverse documet frequency matrix and finding cosine similarity against document matrix returns similar listings

继续阅读：cosine-similarity information-retrieval tf-idf vsm

Cosine similarity and tf-idf

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？