hierarchical clustering with gene expression matrix in python
how can I do a hierarchical clustering (in this case for gene expression data) in Python in a way that shows the matrix of gene expression values along with the dendrogram? What I mean is like the example here:
http://www.mathworks.cn/access/helpdesk/help/toolbox/bioinfo/ug/a1060813239b1.html
shown after bullet point 6 (Figure 1), where the dendrogram is plotted to the left of the gene expression matrix, where the rows have been reordered to reflect the clustering.
How can I do this in Python using numpy/scipy or other tools? Also, is it computationally practical to do this with a matrix of about 11,000 genes, using euclidean distance as a metric?
EDIT: Many have suggested clustering packages, but I am still unsure how to plot the kind of im开发者_C百科age I linked to above in Python. How can I overlay a dendrogram alongside a heatmap matrix, using Matplotlib for example?
thanks.
Many clustering methods including scipy.cluster
start by sorting all pairwise distances,
~ 60 million in your case, not too big.
How long does the following take for you ?
import scipy.cluster.hierarchy as hier
import pylab as pl
def fcluster( pts, ncluster, method="average", criterion="maxclust" ):
""" -> (pts, Y pdist, Z linkage, T fcluster, clusterlists)
ncluster = n1 + n2 + ... (including n1 singletons)
av cluster size = len(pts) / ncluster
"""
pts = np.asarray(pts)
Y = scipy.spatial.distance.pdist( pts ) # ~ N^2 / 2
Z = hier.linkage( Y, method ) # N-1
T = hier.fcluster( Z, ncluster, criterion=criterion )
# clusters = clusterlists(T)
return (pts, Y, Z, T)
hier.dendrogram( Z )
How to permute the matrix and plot nicely was asked here in So in March, with a partial answer.
You can do this with scipy's cluster.hierarchy module. The commands are actually even very similar. However, you will have to use correlation
instead of corr
as a parameter to pdist
and rather than cluster
the name of the function scipy's cluster module is fcluster
. Also, for the dendrogram, the function is dendrogram
in scipy as opposed to clustergram
in Matlab.
You can definitely use a euclidean metric (think it is the default for pdist
). I think that it should be feasible to do this with 11,000 genes because that will be 11000*(11000-1)/2 = 60494500 (11000 choose 2) distances to be computed. That's a large number, but certainly feasible I would think.
A couple of people have made some decent progress in creating a prototype module for hierarchical clustering and heatmap visualization using scipy and matplotlib:
How to get flat clustering corresponding to color clusters in the dendrogram created by scipy
I've been adapting this code to make a full-fledged hierarchical clustering module that I can integrate into one of my transcriptome analysis packages. I'm pretty happy with the final product which will produce a heatmap using various clustering metrics and methods and coloring gradients. The code and an example output is shown here:
http://altanalyze.blogspot.com/2012/06/hierarchical-clustering-heatmaps-in.html
精彩评论