python hcluster, distance matrix and condensed distance matrix

Posted: 2023-02-26 14:39

I'm using the module hcluster to calculate a dendrogram from a distance matrix. My distance matrix is an array of arrays generated like this:

import hcluster
import numpy as np

mols = ...  # (a list of molecules)
distMatrix = np.zeros((10, 10))
for i in range(10):
    for j in range(10):
        # OETanimoto calculates the similarity between two molecules
        sim = OETanimoto(mols[i], mols[j])
        distMatrix[i][j] = 1 - sim

I then use the command distVec = hcluster.squareform(distMatrix) to convert the matrix into a condensed vector and calculate the linkage matrix with vecLink = hcluster.linkage(distVec).
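For reference, here is a minimal sketch of what the condensed form looks like. It uses scipy.spatial.distance.squareform as a stand-in, on the assumption that hcluster.squareform behaves the same way; the 4x4 matrix is made-up illustration data, not my real molecules.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hypothetical 4x4 symmetric distance matrix with a zero diagonal.
dist_matrix = np.array([
    [0.0, 0.2, 0.7, 0.9],
    [0.2, 0.0, 0.6, 0.8],
    [0.7, 0.6, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
])

# Condensed form: the upper triangle read row by row,
# giving n*(n-1)/2 entries (6 for n = 4).
dist_vec = squareform(dist_matrix)
print(dist_vec)
print(dist_vec.shape)

# Passing the vector back to squareform rebuilds the square matrix.
round_trip = squareform(dist_vec)
print(np.allclose(round_trip, dist_matrix))
```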

All this works fine, but if I calculate the linkage matrix from the distance matrix rather than the condensed vector, matLink = hcluster.linkage(distMatrix), I get a different linkage matrix (the distances between the nodes are a lot larger and the topology is slightly different).

Now I'm not sure whether this is because hcluster only works with condensed vectors or whether I'm making mistakes on the way there.

Thanks for your help!

I knocked up a quick random example similar to yours and experienced the same problem. The docstring does say:

Performs hierarchical/agglomerative clustering on the condensed distance matrix y. y must be a :math:`{n \choose 2}` sized vector where n is the number of original observations paired in the distance matrix.

However, having had a quick look at the code, it seems the intent is for it to work with both vector-shaped and matrix-shaped input: in hierarchy.py there is a switch based on the shape of the array. The key bit of info, though, seems to be in the docstring of the linkage function:

   - Q : ndarray
       A condensed or redundant distance matrix. A condensed
       distance matrix is a flat array containing the upper
       triangular of the distance matrix. This is the form that
       ``pdist`` returns. Alternatively, a collection of
       :math:`m` observation vectors in n dimensions may be passed as
       a :math:`m` by :math:`n` array.

So I think the interface doesn't allow you to pass a square distance matrix. Instead, it assumes you are passing it m observation vectors in n dimensions and computes new distances between the rows. Hence the difference in the result?

Does that seem reasonable?
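One way to check this interpretation is the sketch below. It uses scipy.cluster.hierarchy and scipy.spatial.distance rather than hcluster itself, on the assumption that the two behave alike here, and the random points are hypothetical stand-ins for your molecules. If the square matrix really is treated as observation vectors, linkage(distMatrix) should match linkage(pdist(distMatrix)) and differ from the condensed-vector result.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: 10 random points, turned into a 10x10 distance matrix.
rng = np.random.default_rng(0)
points = rng.random((10, 3))
dist_matrix = squareform(pdist(points))

# Intended usage: condensed vector in, linkage matrix out.
dist_vec = squareform(dist_matrix)
vec_link = linkage(dist_vec)

# Square matrix passed directly: treated as 10 observations in 10 dimensions.
with warnings.catch_warnings():
    # Recent SciPy versions may warn that the input looks like a distance matrix.
    warnings.simplefilter("ignore")
    mat_link = linkage(dist_matrix)

# Explicitly compute Euclidean distances between the rows of the matrix.
rows_link = linkage(pdist(dist_matrix))

print("condensed vs square:", np.allclose(vec_link, mat_link))
print("square vs pdist(rows):", np.allclose(mat_link, rows_link))
print("first merge heights:", vec_link[0, 2], mat_link[0, 2])
```

Rows i and j of a distance matrix each hold a zero where the other holds their mutual distance, so the Euclidean distance between the two rows is at least sqrt(2) times the original distance. That would be consistent with the larger node distances you are seeing.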

Otherwise, just take a look at the code itself. I'm sure you'll be able to debug it and figure out why your examples are different.

Cheers Matt
