Clustering with scipy - clusters via distance matrix, how to get back the original objects

2023-04-12 17:00

I can't seem to find any simple tutorials or descriptions of clustering in scipy, so I'll try to explain my problem:

I am trying to cluster documents (hierarchical agglomerative clustering). I have created a vector for each document and produced a symmetric distance matrix. vector_list contains the (really long) vectors representing each document. The order of this list is the same as the order of my list of input documents, so that I'll (hopefully) be able to match the clustering results with the corresponding documents.

from scipy.spatial import distance
distances = distance.cdist(vector_list, vector_list, 'euclidean')

This gives a matrix like this, where the diagonal holds each document's distance to itself (always 0):

[0 5 4]
[5 0 4]
[4 4 0]

I feed this distance matrix to scipy's linkage() function.

import scipy.cluster.hierarchy as hier
clusters = hier.linkage(distances, method='centroid', metric='euclidean')

This returns something I can't quite interpret, but its type is numpy.ndarray. According to the docs, I can feed this into fcluster to get 'flat clusters'. I use half of the maximum distance in the distance matrix as the threshold.

idx = hier.fcluster(clusters, 0.5*distances.max(), 'distance')

This returns a numpy.ndarray that again does not make much sense to me. An example is [6 3 1 7 1 8 9 4 5 2]

So my question: what do I get back from the linkage and fcluster functions, and how can I go from there back to the documents I created the distance matrix for in the first place, to see whether the clusters make any sense? Am I doing this right?

First off, you don't need to go through the entire process with cdist and linkage if you use fclusterdata instead of fcluster; you can feed that function an (n_documents, n_features) array of term counts, tf-idf values, or whatever your features are.
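As a minimal sketch of that shortcut, here is a call on a small made-up feature array. The values in `vectors` and the threshold `t=2.0` are illustrative assumptions, not taken from the question:

```python
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# Hypothetical (n_documents, n_features) array; real features would be
# term counts, tf-idf values, and so on.
vectors = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [5.1, 5.0],
    [10.0, 0.0],
])

# fclusterdata computes the pairwise distances, builds the hierarchy
# and flattens it in a single call.
labels = fclusterdata(vectors, t=2.0, criterion='distance',
                      metric='euclidean', method='centroid')
print(labels)  # one cluster label per row of `vectors`
```

If you do call linkage yourself, note that it expects either the raw observation vectors or a condensed distance matrix (as produced by scipy.spatial.distance.pdist), so a square cdist result would first need converting, for example with scipy.spatial.distance.squareform.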

The output from fclusterdata is the same as that of fcluster: an array T such that "T[i] is the flat cluster number to which original observation i belongs." In other words, the cluster.hierarchy module flattens the clustering according to a threshold, which you set to 0.5*distances.max(). In your case, the third and fifth documents are clustered together, but all the others form clusters of their own, so you might want to set the threshold higher or use a different criterion.
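Since T[i] refers to the i-th input row, getting back to the documents comes down to grouping them by label. A sketch using the example labels from the question; the document names are hypothetical placeholders for your own list of documents, which must be in the same order as the vectors:

```python
from collections import defaultdict

# Flat cluster labels as returned by fcluster / fclusterdata
# (the example array from the question).
labels = [6, 3, 1, 7, 1, 8, 9, 4, 5, 2]

# Hypothetical document names, in the same order as the input vectors.
documents = [f"doc{i}" for i in range(len(labels))]

# Group the documents by the cluster label assigned to each of them.
clusters = defaultdict(list)
for doc, label in zip(documents, labels):
    clusters[label].append(doc)

for label in sorted(clusters):
    print(label, clusters[label])
```

Here cluster 1 contains the third and fifth documents, and every other cluster holds a single document.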
