
How do I use a sparse matrix with Python hcluster?

I'm trying to use the hcluster library in Python, but I don't know enough Python to use a sparse matrix with it. Can anybody help? Here is what I'm doing:

import os.path
import numpy
import scipy
import scipy.io 
from hcluster import squareform, pdist, linkage, complete 
from hcluster.hierarchy import linkage, from_mlab_linkage 
from numpy import savetxt 
from StringIO import StringIO 

data.dmp contains a matrix that looks like this:

  A B C D
A 0 1 0 1 
B 1 0 0 1 
C 0 0 0 0 
D 1 1 0 0 

The file contains only the upper-right part of the matrix, that is, the numbers above the main diagonal, so data.dmp contains: 1 0 1, 0 1 , 0

# Read the first line of the file; the with-block closes it automatically.
with open('data.dmp', 'r') as f:
    s = f.readline()

matrix = numpy.asarray(eval("["+s+"]"))

For a reason I don't understand, hcluster uses inverted values compared with my data: I use 0 if A != C and 1 if A == D.

sqfrm = squareform(matrix)
Y = pdist(sqfrm, metric="cosine")

Then I run linkage on Y:

Z = linkage(Y, method="complete")

So the matrix Z is what I need (assuming I used hcluster correctly).

But I have the following problems:

  1. I want to use a sparse matrix for the huge amount of input data, because generating the input the way I do now is time-consuming. The data is exported from another language, which is why I need to read it from a text file. Could a Python guru suggest how to do this?

  2. For people who have used Python hcluster: I need to process a large amount of data, hundreds of rows. Is that feasible in hcluster? Does the algorithm really produce a correct HAC?

Thank you for reading, I appreciate any help!


Represent each input as a dictionary mapping feature name to value. Zeros are not stored in the dictionary.
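As a rough illustration of that representation, here is a minimal sketch that turns the dense rows from the question into dictionaries. The helper name to_sparse is made up for this example, and it assumes the rows are already in memory as lists of numbers.

```python
def to_sparse(feature_names, row):
    """Return {feature name: value} for the non-zero entries of a dense row."""
    return dict(
        (name, float(value))
        for name, value in zip(feature_names, row)
        if value != 0
    )

# The 4x4 matrix from the question, one row per input.
names = ["A", "B", "C", "D"]
dense = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]

reprs = [to_sparse(names, row) for row in dense]
```

An all-zero row such as C simply becomes an empty dictionary.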

Compute the Y matrix yourself instead of using hcluster.pdist. The following code computes a sparse squared error. If you L2-normalize all feature vectors, squared error equals twice the cosine distance, so the two rank pairs of inputs identically.

def sqrerr(repr1, repr2):
    """
    Compute the squared error between two reprs.
    The reprs are each a dict from feature to feature value.
    """
    # Union of the feature names; works on both Python 2 and Python 3.
    keys = frozenset(repr1) | frozenset(repr2)
    total = 0.
    for k in keys:
        # A feature missing from a dict counts as zero.
        diff = repr1.get(k, 0.) - repr2.get(k, 0.)
        total += diff * diff
    return total
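The link to cosine distance only holds once the vectors have unit length. The sketch below checks this on two rows from the question. The names l2_normalize, cosine_distance and squared_error are made up for this example (squared_error restates the function above so the snippet runs on its own). Note that cosine distance is undefined for an all-zero input such as row C.

```python
import math

def l2_normalize(rep):
    """Scale a sparse vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in rep.values()))
    if norm == 0.0:
        return dict(rep)  # all-zero input: nothing to scale
    return dict((k, v / norm) for k, v in rep.items())

def cosine_distance(a, b):
    """1 - cos(angle) between two non-zero sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def squared_error(a, b):
    """Sparse squared error; a missing feature counts as zero."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)

a = {"B": 1.0, "D": 1.0}  # row A of the matrix in the question
b = {"A": 1.0, "D": 1.0}  # row B

lhs = squared_error(l2_normalize(a), l2_normalize(b))
rhs = 2.0 * cosine_distance(a, b)
```

Up to floating-point rounding, lhs and rhs agree, which is the identity ||a - b||^2 = 2 * (1 - cos) for unit vectors.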

You should call sqrerr for every Y[i,j] element you want to compute.

Make Y a square matrix with Y[i,j] == Y[j,i] and Y[i,i] == 0 (sqrerr gives both automatically). Then use hcluster.squareform to convert Y to the condensed form that hcluster.linkage accepts.
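If the square matrix is too large to build, one option is to produce the condensed vector directly. This is a sketch, not tested against hcluster itself. It assumes the condensed form lists the upper triangle row by row, i.e. pairs (0,1), (0,2), ..., (n-2,n-1), which is the order itertools.combinations yields. The name condensed_distances and the three toy inputs are invented for the example.

```python
from itertools import combinations

def condensed_distances(reprs, dist):
    """Distances for pairs (0,1), (0,2), ..., (n-2,n-1), in that order."""
    return [dist(a, b) for a, b in combinations(reprs, 2)]

def squared_error(a, b):
    """Sparse squared error; a missing feature counts as zero."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)

# Three toy inputs with hypothetical feature names "x" and "y".
reprs = [{"x": 1.0}, {"x": 1.0, "y": 2.0}, {"y": 2.0}]
Y = condensed_distances(reprs, squared_error)

# With hcluster installed, Y could then be clustered as in the question
# (wrap it in numpy.asarray first if linkage requires an array):
# from hcluster import linkage
# Z = linkage(Y, method="complete")
```

For n inputs this needs n*(n-1)/2 distance values, so a few hundred rows means on the order of tens of thousands of sqrerr calls.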
