Frequency of non-zero or specific number in a column
My input file:
x <- read.table(textConnection('
t0 t1 t2 t3 t4
aa 0 1 0 1 0
bb 1 0 1 0 1
cc 0 0 0 0 0
dd 1 1 1 0 1
ee 1 1 1 0 0
ff 0 0 1 0 1
gg -1 -1 -1 -1 0
hh -1 1 -1 1 -1
'), header=TRUE)
First, I want to calculate the frequency of non-zero values in each column, i.e.
t0 t1 t2 t3 t4
frequency 5/8 5/8 6/8 3/8 4/8
Then I want to multiply each column of x by its frequency, to obtain the new matrix as follows:
t0 t1 t2 t3 t4
aa 0 5/8 0 3/8 0
bb 5/8 0 6/8 0 4/8
cc 0 0 0 0 0
dd 5/8 5/8 6/8 0 4/8
ee 5/8 5/8 6/8 0 0
ff 0 0 6/8 0 4/8
gg -5/8 -5/8 -6/8 -3/8 0
hh -5/8 5/8 -6/8 3/8 -4/8
How can I do this in R? I learned from the manuals that prop.table(x) gives the overall proportions for the whole table, but how can I compute them for each column individually? Any help is appreciated.
In the same spirit as the answer from @Joris, this is where the wonderful sweep() function comes into its own:
> sweep(x, MARGIN = 2, colMeans(abs(x)), "*")
t0 t1 t2 t3 t4
aa 0.000 0.625 0.00 0.375 0.0
bb 0.625 0.000 0.75 0.000 0.5
cc 0.000 0.000 0.00 0.000 0.0
dd 0.625 0.625 0.75 0.000 0.5
ee 0.625 0.625 0.75 0.000 0.0
ff 0.000 0.000 0.75 0.000 0.5
gg -0.625 -0.625 -0.75 -0.375 0.0
hh -0.625 0.625 -0.75 0.375 -0.5
What is happening here is that colMeans(abs(x)) is a vector of length 5. We sweep() these values column-wise (indicated by MARGIN = 2 in the call) over the data x, applying the function * as we go. So the values in column t0 all get multiplied by colMeans(abs(x))[1], the values in column t1 all get multiplied by colMeans(abs(x))[2], and so on.
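To make the column-wise sweep concrete, here is a minimal sketch that compares sweep() with an explicit loop over the columns. The names means, swept and by_hand are just illustrative, and the data are the same as in the question.

```r
x <- read.table(text = "t0 t1 t2 t3 t4
aa 0 1 0 1 0
bb 1 0 1 0 1
cc 0 0 0 0 0
dd 1 1 1 0 1
ee 1 1 1 0 0
ff 0 0 1 0 1
gg -1 -1 -1 -1 0
hh -1 1 -1 1 -1", header = TRUE)

# One value per column: the mean of the absolute values
means <- colMeans(abs(x))

# Multiply every column by its own entry of `means`
swept <- sweep(x, MARGIN = 2, STATS = means, FUN = "*")

# The same computation written as an explicit loop over columns
by_hand <- x
for (j in seq_along(means)) {
  by_hand[[j]] <- x[[j]] * means[j]
}

# Compare the two results as matrices
all.equal(as.matrix(swept), as.matrix(by_hand))
```

The loop is only there to show what sweep() does for MARGIN = 2; in practice the single sweep() call is the more readable choice.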
The advantage of sweep() is that it is very fast when given a matrix:
> means <- colMeans(abs(x))
> X <- data.matrix(x)
> system.time(replicate(1000, sweep(X, 2, means, "*")))
user system elapsed
0.115 0.000 0.118
> system.time(replicate(1000, mapply(`*`, x, means)))
user system elapsed
0.308 0.001 0.309
> system.time(replicate(1000, mapply(`*`, X, means)))
user system elapsed
0.204 0.000 0.205
It is much slower when given a data frame:
> system.time(replicate(1000, sweep(x, 2, means, "*")))
user system elapsed
2.072 0.000 2.074
But that is just the way things are in R.
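If speed matters and a data frame is still wanted at the end, one option is to do the arithmetic on a matrix and convert back afterwards. The following is a rough sketch along those lines, not a benchmark; res_mat, res_df and res_alt are illustrative names, and the t(t(X) * means) variant is an alternative based on vector recycling rather than something used above.

```r
x <- read.table(text = "t0 t1 t2 t3 t4
aa 0 1 0 1 0
bb 1 0 1 0 1
cc 0 0 0 0 0
dd 1 1 1 0 1
ee 1 1 1 0 0
ff 0 0 1 0 1
gg -1 -1 -1 -1 0
hh -1 1 -1 1 -1", header = TRUE)

# Work on a numeric matrix instead of the data frame
X <- data.matrix(x)
means <- colMeans(abs(X))

# Column-wise multiplication on the matrix
res_mat <- sweep(X, 2, means, "*")

# Convert back; the dimnames become row and column names
res_df <- as.data.frame(res_mat)

# Alternative: transpose so that recycling of `means` lines up with columns
res_alt <- t(t(X) * means)

all.equal(res_mat, res_alt)
```

Timings will depend on the size of the data and the R version, so it is worth measuring on your own data before choosing.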
Try this:
> colMeans(abs(x))
t0 t1 t2 t3 t4
0.625 0.625 0.750 0.375 0.500
for the frequencies and
> mapply(`*`,x,colMeans(abs(x)))
t0 t1 t2 t3 t4
[1,] 0.000 0.625 0.00 0.375 0.0
[2,] 0.625 0.000 0.75 0.000 0.5
[3,] 0.000 0.000 0.00 0.000 0.0
[4,] 0.625 0.625 0.75 0.000 0.5
[5,] 0.625 0.625 0.75 0.000 0.0
[6,] 0.000 0.000 0.75 0.000 0.5
[7,] -0.625 -0.625 -0.75 -0.375 0.0
[8,] -0.625 0.625 -0.75 0.375 -0.5
to get the scaled values (as a matrix, so the row names of x are dropped). mapply applies the function * to every column of x, pairing each column with the corresponding element of colMeans(abs(x)). See also ?mapply
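Two caveats may be worth sketching. colMeans(abs(x)) equals the non-zero frequency only because the values here are -1, 0 and 1, and the mapply() result loses the row names. A possible variant, assuming you want the share of non-zero entries (or of one specific value) for arbitrary numbers, with freq, freq_one and res as illustrative names:

```r
x <- read.table(text = "t0 t1 t2 t3 t4
aa 0 1 0 1 0
bb 1 0 1 0 1
cc 0 0 0 0 0
dd 1 1 1 0 1
ee 1 1 1 0 0
ff 0 0 1 0 1
gg -1 -1 -1 -1 0
hh -1 1 -1 1 -1", header = TRUE)

# Share of non-zero entries per column (works for any numeric values)
freq <- colMeans(as.matrix(x) != 0)

# Share of entries equal to one specific number, here 1
freq_one <- colMeans(as.matrix(x) == 1)

# Multiply each column by its frequency; the result is a matrix
res <- mapply(`*`, x, freq)

# Put the original row names back
rownames(res) <- rownames(x)
res
```

Use as.data.frame(res) afterwards if a data frame is needed.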