Computing z-scores for 2D matrices in scipy/numpy in Python
How can I compute the z-score for matrices in Python?
Suppose I have the array:
a = array([[   1,    2,    3],
           [  30,   35,   36],
           [2000, 6000, 8000]])
and I want to compute the z-score for each row. The solution I came up with is:
array([zs(item) for item in a])
where zs is in scipy.stats.stats. Is there a better built-in vectorized way to do this?
Also, is it always good to z-score numbers before using hierarchical clustering with euclidean or seuclidean distance? Can anyone discuss the relative advantages/disadvantages?
thanks.
scipy.stats.stats.zs is defined like this:
def zs(a):
    mu = mean(a, None)
    sigma = samplestd(a)
    return (array(a) - mu) / sigma
So to extend it to work on a given axis of an ndarray, you could do this:
import numpy as np
import scipy.stats.stats as sss
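def my_zs(a, axis=-1):
    # Move the target axis to the end so the statistics broadcast cleanly
    b = np.array(a).swapaxes(axis, -1)
    mu = np.mean(b, axis=-1)[..., np.newaxis]
    sigma = sss.samplestd(b, axis=-1)[..., np.newaxis]
    # Swap the axes back so the result has the same layout as the input
    return ((b - mu) / sigma).swapaxes(axis, -1)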
a = np.array([[   1,    2,    3],
              [  30,   35,   36],
              [2000, 6000, 8000]])
result = np.array([sss.zs(item) for item in a])
my_result = my_zs(a)
print(my_result)
# [[-1.22474487 0. 1.22474487]
# [-1.3970014 0.50800051 0.88900089]
# [-1.33630621 0.26726124 1.06904497]]
assert(np.allclose(result,my_result))
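If samplestd is not available in your SciPy version, the same idea can be sketched with NumPy alone. This is only an illustration, not the SciPy implementation: the name zscore_along_axis is made up here, and it assumes samplestd means the population standard deviation (ddof=0), which is what the printed values above correspond to.

```python
import numpy as np

def zscore_along_axis(a, axis=-1, ddof=0):
    """Hypothetical helper: z-score an array along one axis.

    Assumes ddof=0 (population standard deviation) by default.
    """
    a = np.asarray(a, dtype=float)
    # keepdims=True keeps the reduced axis with length 1, so the
    # subtraction and division broadcast without any axis swapping.
    mu = a.mean(axis=axis, keepdims=True)
    sigma = a.std(axis=axis, ddof=ddof, keepdims=True)
    return (a - mu) / sigma

a = np.array([[   1,    2,    3],
              [  30,   35,   36],
              [2000, 6000, 8000]])
z = zscore_along_axis(a, axis=1)  # one z-score vector per row
print(z)
```

Note that a row with zero variance would give a division by zero here; how to handle that case depends on your data.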
The new zscore function in scipy, available in the next release, accepts arrays of arbitrary dimension:
http://projects.scipy.org/scipy/changeset/6169
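For the array in the question, a call might look like the sketch below. It assumes a SciPy release that ships scipy.stats.zscore with an axis argument; the default there is axis=0 (per column), so axis=1 is passed to get per-row scores.

```python
import numpy as np
from scipy import stats

a = np.array([[   1,    2,    3],
              [  30,   35,   36],
              [2000, 6000, 8000]])

# axis=1 standardizes each row independently; ddof=0 divides by N,
# matching the row-by-row results shown in the other answer.
row_z = stats.zscore(a, axis=1, ddof=0)
print(row_z)
```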