tf-idf: am I understanding it right?
I am interested in doing some document clustering, and right now I am considering using TF-IDF for this.
If I am not wrong, TF-IDF is mainly used for evaluating the relevance of a document given a query. If I do not have a particular query, how can I apply tf-idf to clustering?
For document clustering, a good approach is to use the k-means algorithm. If you know how many types of documents you have, you know what k is.
To make it work on documents:
a) Choose k initial documents at random to act as the cluster centers.
b) Assign each document to the cluster whose center is at the minimum distance from it.
c) After the documents are assigned, make k new documents to act as cluster centers by taking the centroid of each cluster.
d) Repeat b) and c) until the assignments stop changing.
Now, the questions are:
a) How to calculate the distance between two documents: it is nothing but the cosine similarity between the terms of the document and those of the cluster center (use 1 minus the similarity if you need an actual distance). The terms here are nothing but the TF-IDF weights calculated earlier for each document.
b) What the centroid should be: for a given term, the sum of its TF-IDF weights over the cluster's documents divided by the number of documents in the cluster. Do this for all the possible terms in a cluster. This will give you another n-dimensional document.
Hope that helps!
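The steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the function names (tfidf_vectors, cosine, centroid, kmeans), the toy documents, and the particular TF-IDF variant (term count divided by document length, times log(N / document frequency)) are all assumptions made for the example.

```python
import math
import random
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized, non-empty documents into sparse {term: weight} dicts.
    Assumed weighting: (count / doc length) * log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is all zero)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(members):
    """Per-term sum of TF-IDF weights divided by the number of documents."""
    total = {}
    for vec in members:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(members) for t, w in total.items()}

def kmeans(vectors, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)          # a) k initial documents at random
    assignment = None
    for _ in range(iterations):
        # b) put each document in the cluster with the most similar center
        new_assignment = [
            max(range(k), key=lambda c: cosine(vec, centroids[c]))
            for vec in vectors
        ]
        if new_assignment == assignment:        # stop once nothing changes
            break
        assignment = new_assignment
        # c) replace each center by the centroid of its members
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment

# Toy corpus (made up for illustration): two pet documents, two finance documents.
docs = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "cat", "cat"],
    ["stock", "market", "trade"],
    ["market", "trade", "stock", "price"],
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)
```

On this toy corpus the two pet documents should end up in one cluster and the two finance documents in the other; which cluster gets which number depends on the random initial choice.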
Not exactly, actually: tf-idf gives you the relevance of a term in a given document.
So you can perfectly well use it for your clustering by computing a proximity, which would be something like
proximity(document_i, document_j) = sum(tf_idf(t,i) * tf_idf(t,j))
for each term t that occurs in both doc i and doc j.
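As a rough sketch of that proximity in Python: the tf_idf helper, the toy documents, and the exact weighting (term count divided by document length, times log(N / document frequency)) are assumptions for illustration; only the proximity formula itself comes from the answer above.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document.
    Assumed weighting: (count / doc length) * log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]

def proximity(weights_i, weights_j):
    """Sum of tf_idf(t, i) * tf_idf(t, j) over terms t present in both documents."""
    shared = weights_i.keys() & weights_j.keys()
    return sum(weights_i[t] * weights_j[t] for t in shared)

# Toy corpus, made up for the example.
docs = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "cat", "cat"],
    ["stock", "market", "trade"],
]
weights = tf_idf(docs)
print(proximity(weights[0], weights[1]))  # documents sharing terms: positive
print(proximity(weights[0], weights[2]))  # no shared terms: 0
```

Note that this raw sum is not normalized by document length, so longer documents can score higher; dividing by the two vector norms turns it into cosine similarity.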
TF-IDF serves a different purpose; unless you intend to reinvent the wheel, you are better off using a tool like Carrot. Googling for document clustering will turn up many algorithms if you wish to implement one on your own.