How to use a sparse matrix in Python hcluster?
I'm trying to use the hcluster library in Python, but I don't know Python well enough to use a sparse matrix with it. Can anybody help? Here is what I'm doing:
import os.path
import numpy
import scipy
import scipy.io
from hcluster import squareform, pdist, linkage, complete
from hcluster.hierarchy import linkage, from_mlab_linkage
from numpy import savetxt
from StringIO import StringIO
data.dmp contains a matrix that looks like this:
A B C D
A 0 1 0 1
B 1 0 0 1
C 0 0 0 0
D 1 1 0 0
The file contains only the upper-right part of the matrix (the upper triangle), that is, all the numbers above the main diagonal, so data.dmp contains: 1 0 1, 0 1, 0
with open('data.dmp', 'r') as f:
    s = f.readline()  # the whole upper triangle is on the first line
matrix = numpy.asarray(eval("[" + s + "]"))
For a reason I don't understand, hcluster seems to use values inverted relative to mine: I use 0 if A != C and 1 if A == D.
sqfrm = squareform(matrix)
Y = pdist(sqfrm, metric="cosine")
# linkage Y
Z = linkage(Y, method="complete")
So the matrix Z is what I need (if I have used hcluster correctly).
But I have the following problems:
I want to use a sparse matrix for the huge amount of input data, because generating the input the way I do now is time-consuming. I need to import the data into Python from another language, which is why I have to read a text file. Python gurus, could you please suggest how to do this?
To people who have used hcluster: I need to process a huge amount of data, hundreds of rows. Is that possible in hcluster? Does this algorithm really produce a correct HAC?
Thank you for reading, I appreciate any help!
Represent the inputs each as a dictionary, from feature name to value. Zeros are not present in the dictionary.
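As a minimal sketch of that representation, assume data.dmp holds the upper triangle on one line, with rows separated by commas and values separated by whitespace (the layout shown in the question; adjust the splitting if your file differs). The helper name parse_upper_triangle is hypothetical, and column indices stand in for feature names:

```python
def parse_upper_triangle(text):
    """Turn 'v01 v02 v03, v12 v13, v23' into a list of sparse row dicts."""
    # Hypothetical helper; the file layout is an assumption, not a known format.
    rows = [[float(v) for v in part.split()] for part in text.split(",")]
    n = len(rows) + 1  # an n x n matrix has n - 1 rows above the diagonal
    reprs = [{} for _ in range(n)]
    for i, row in enumerate(rows):
        for offset, value in enumerate(row):
            j = i + 1 + offset  # column index of this entry
            if value != 0:
                # Store only non-zero entries, mirrored for symmetry.
                reprs[i][j] = value
                reprs[j][i] = value
    return reprs


reprs = parse_upper_triangle("1 0 1, 0 1 , 0")
# reprs[2] is an empty dict because row C is all zeros.
```

Only the non-zero entries are stored, so memory grows with the number of non-zeros rather than with the square of the number of rows.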
Compute the Y matrix yourself instead of using hcluster.pdist. The following code computes the sparse squared error. If you l2-normalize all feature vectors, the squared error equals twice the cosine distance, so the two are equivalent up to a constant factor.
def sqrerr(repr1, repr2):
    """
    Compute the squared error between two reprs.
    The reprs are each a dict from feature to feature value.
    """
    # Union of the features present in either repr; missing ones count as 0.
    keys = frozenset(repr1) | frozenset(repr2)
    total = 0.
    for k in keys:
        diff = repr1.get(k, 0.) - repr2.get(k, 0.)
        total += diff * diff
    return total
You should call sqrerr for every Y[i,j] element you want to compute.
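A rough sketch of that loop might look like the following. The helpers l2_normalize and distance_matrix are hypothetical names, sqrerr is repeated in compact form so the snippet runs on its own, and leaving an all-zero vector unnormalized is an assumption (cosine distance is undefined for it):

```python
import math


def sqrerr(repr1, repr2):
    """Squared error between two sparse dict vectors."""
    keys = set(repr1) | set(repr2)
    return sum((repr1.get(k, 0.0) - repr2.get(k, 0.0)) ** 2 for k in keys)


def l2_normalize(rep):
    """Scale a sparse dict vector to unit length; all-zero vectors stay as-is."""
    norm = math.sqrt(sum(v * v for v in rep.values()))
    if norm == 0:
        return dict(rep)
    return {k: v / norm for k, v in rep.items()}


def distance_matrix(reprs):
    """Fill a symmetric n x n list of lists with pairwise squared errors."""
    unit = [l2_normalize(r) for r in reprs]
    n = len(unit)
    Y = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Compute each pair once and mirror it so Y[i][j] == Y[j][i].
            Y[i][j] = Y[j][i] = sqrerr(unit[i], unit[j])
    return Y


# Rows A..D of the example matrix, zeros omitted.
reprs = [{"B": 1.0, "D": 1.0}, {"A": 1.0, "D": 1.0}, {}, {"A": 1.0, "B": 1.0}]
Y = distance_matrix(reprs)
```

This still does n*(n-1)/2 distance computations, but each one only touches the non-zero features of the two rows involved.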
Make Y a square matrix, and make sure that Y[i,j] == Y[j,i]. Use hcluster.squareform to convert Y to a form that is good for hcluster.linkage.
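As a hedged illustration of this last step, the sketch below uses SciPy's squareform and linkage, assuming they behave like the hcluster functions of the same names; if you are using hcluster itself, the imports from the question (from hcluster import squareform, linkage) should be the ones to swap in. The distance values are made up for the example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Made-up symmetric distance matrix with a zero diagonal.
Y = np.array([
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
])

# squareform turns the square matrix into the condensed vector of its
# upper triangle, which is the input format linkage expects.
condensed = squareform(Y)

# Complete-linkage clustering; Z has one row per merge (n - 1 rows).
Z = linkage(condensed, method="complete")
```

For n rows the condensed vector has n*(n-1)/2 entries, so a few hundred rows means only tens of thousands of distances, which is small enough to hold in memory.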