
Frequency of non-zero or specific number in a column

My input file:

 x <- read.table(textConnection('
      t0  t1  t2  t3  t4
  aa  0   1   0   1   0
  bb  1   0   1   0   1
  cc  0   0   0   0   0
  dd  1   1   1   0   1
  ee  1   1   1   0   0
  ff  0   0   1   0   1
  gg  -1  -1  -1  -1  0
  hh  -1  1   -1  1   -1
 '), header=TRUE)

First I want to calculate the frequency of non-zero values in each column, i.e.

          t0   t1   t2   t3   t4
frequency 5/8  5/8  6/8  3/8  4/8

Then I want to multiply each column of x by its frequency, to obtain a new matrix as follows:

       t0    t1     t2     t3     t4
  aa   0     5/8    0      3/8    0
  bb   5/8   0      6/8    0      4/8
  cc   0     0      0      0      0
  dd   5/8   5/8    6/8    0      4/8
  ee   5/8   5/8    6/8    0      0
  ff   0     0      6/8    0      4/8
  gg  -5/8  -5/8   -6/8   -3/8    0
  hh  -5/8   5/8   -6/8    3/8   -4/8

How can I do this in R? I learnt from the manuals that prop.table(x) gives the proportions over the whole table, but how can I do it for each column individually? Any help would be appreciated.

In the same spirit as the answer from @Joris, this is where the wonderful sweep() function comes into its own:

> sweep(x, MARGIN = 2, colMeans(abs(x)), "*")
       t0     t1    t2     t3   t4
aa  0.000  0.625  0.00  0.375  0.0
bb  0.625  0.000  0.75  0.000  0.5
cc  0.000  0.000  0.00  0.000  0.0
dd  0.625  0.625  0.75  0.000  0.5
ee  0.625  0.625  0.75  0.000  0.0
ff  0.000  0.000  0.75  0.000  0.5
gg -0.625 -0.625 -0.75 -0.375  0.0
hh -0.625  0.625 -0.75  0.375 -0.5

What is happening here is that colMeans(abs(x)) is a vector of length 5. We sweep() these values, column-wise (indicated by the MARGIN = 2 in the call), over the data x applying the function * as we go. So, the values in column t0 all get multiplied by colMeans(abs(x))[1], the values in column t1 all get multiplied by colMeans(abs(x))[2] and so on.
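To make the column-wise sweep concrete, here is a minimal sketch that writes the same multiplication as an explicit loop over columns and checks it against sweep(). The names freq, by_hand and swept are only illustrative; they do not appear in the answer above.

```r
# Rebuild the example data from the question
x <- read.table(textConnection('
      t0  t1  t2  t3  t4
  aa  0   1   0   1   0
  bb  1   0   1   0   1
  cc  0   0   0   0   0
  dd  1   1   1   0   1
  ee  1   1   1   0   0
  ff  0   0   1   0   1
  gg  -1  -1  -1  -1  0
  hh  -1  1   -1  1   -1
'), header = TRUE)

# One frequency per column (a named vector of length 5)
freq <- colMeans(abs(x))

# What sweep(x, 2, freq, "*") amounts to: column j is multiplied by freq[j]
by_hand <- x
for (j in seq_along(freq)) {
  by_hand[[j]] <- x[[j]] * freq[[j]]
}

swept <- sweep(x, MARGIN = 2, freq, "*")

# Both approaches should give the same numbers
stopifnot(isTRUE(all.equal(as.matrix(swept), as.matrix(by_hand))))
```

The loop is only there to show the logic; sweep() expresses the same thing in a single call.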

The advantage of sweep() is that it is very fast when given a matrix:

> means <- colMeans(abs(x))
> X <- data.matrix(x)
> system.time(replicate(1000, sweep(X, 2, means, "*")))
   user  system elapsed 
  0.115   0.000   0.118 
> system.time(replicate(1000, mapply(`*`, x, means)))
   user  system elapsed 
  0.308   0.001   0.309 
> system.time(replicate(1000, mapply(`*`, X, means)))
   user  system elapsed 
  0.204   0.000   0.205

It is much slower when given a data frame:

> system.time(replicate(1000, sweep(x, 2, means, "*")))
   user  system elapsed 
  2.072   0.000   2.074

But that is just the way things are in R.
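If the data frame overhead matters in your case, one possible workaround is to do the arithmetic on a matrix and convert back at the end. This is only a sketch: the name scaled is made up for illustration, and whether the round trip pays off will depend on the size of your data, so time it yourself.

```r
# Rebuild the example data from the question
x <- read.table(textConnection('
      t0  t1  t2  t3  t4
  aa  0   1   0   1   0
  bb  1   0   1   0   1
  cc  0   0   0   0   0
  dd  1   1   1   0   1
  ee  1   1   1   0   0
  ff  0   0   1   0   1
  gg  -1  -1  -1  -1  0
  hh  -1  1   -1  1   -1
'), header = TRUE)

means <- colMeans(abs(x))

# data.matrix() gives a numeric matrix and keeps the row and column names
X <- data.matrix(x)

# Sweep over the matrix, then convert back to a data frame once at the end
scaled <- as.data.frame(sweep(X, 2, means, "*"))

# The row and column names survive the round trip
stopifnot(identical(rownames(scaled), rownames(x)))
stopifnot(identical(colnames(scaled), colnames(x)))
```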

Try this:

> colMeans(abs(x))
   t0    t1    t2    t3    t4 
0.625 0.625 0.750 0.375 0.500 

for the frequencies and

> mapply(`*`,x,colMeans(abs(x)))
         t0     t1    t2     t3   t4
[1,]  0.000  0.625  0.00  0.375  0.0
[2,]  0.625  0.000  0.75  0.000  0.5
[3,]  0.000  0.000  0.00  0.000  0.0
[4,]  0.625  0.625  0.75  0.000  0.5
[5,]  0.625  0.625  0.75  0.000  0.0
[6,]  0.000  0.000  0.75  0.000  0.5
[7,] -0.625 -0.625 -0.75 -0.375  0.0
[8,] -0.625  0.625 -0.75  0.375 -0.5

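to get the scaled values. mapply() applies the function * to each column of x, pairing it with the corresponding element of colMeans(abs(x)). Note that the result is a matrix rather than a data frame, so the row names are lost. See also ?mapply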
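Two follow-ups, sketched under assumptions. First, colMeans(abs(x)) only equals the non-zero frequency because the data contain nothing but -1, 0 and 1; for other values, colMeans(x != 0) counts non-zero entries directly, and colMeans(x == 1) gives the frequency of a specific number. Second, the mapply() result can be turned back into a data frame with the original row names. The variable names below are illustrative only.

```r
# Rebuild the example data from the question
x <- read.table(textConnection('
      t0  t1  t2  t3  t4
  aa  0   1   0   1   0
  bb  1   0   1   0   1
  cc  0   0   0   0   0
  dd  1   1   1   0   1
  ee  1   1   1   0   0
  ff  0   0   1   0   1
  gg  -1  -1  -1  -1  0
  hh  -1  1   -1  1   -1
'), header = TRUE)

# Share of non-zero entries per column; works for any numeric values
freq_nonzero <- colMeans(x != 0)

# Share of entries equal to one specific number (here 1)
freq_ones <- colMeans(x == 1)

# mapply() returns a matrix; wrap it to restore the data frame and row names
scaled <- as.data.frame(mapply(`*`, x, freq_nonzero),
                        row.names = rownames(x))
```

For this particular data set freq_nonzero matches colMeans(abs(x)), so scaled holds the same numbers as the matrix shown above.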






