开发者

Cross tabulate true values only in R

I have a dataframe "dataAll" and some vars (A,B) on it, I want a table like

   A   >10    >20     >30

B

>1

>10

>100

To do that, I coded like

with(dataAll, table(A=A>10,B=B>1)) which gives me

         A
               FALSE    TRUE
B

        FALSE  220357   4798

        TRUE  596开发者_高级运维618 210080

So, here the value 210080 should go to first cell in the upper table

I tried something like this with no success

with(dataAll, table(A=A>c(10,20,30),B=B>c(1,10,100)))

and even

with(dataAll, table(A=c(A>10,A>20,A>30),B=c(B>1,B>10,B>100)))

but no success,

I know there should be a way to do this...


A.categ <- cut(A, breaks = c(-Inf, 10, 20, 30, Inf), right=FALSE)
B.categ <- cut(B, breaks = c(-Inf, 1, 10, 100, Inf), right=FALSE)
table(A.categ, B.categ)

The trick with cut is remembering to set right=FALSE, since that is the way most people expect it to work. In fact when Frank Harrell made his version of cut2 for Hmisc, he set that as the default option.

When you do that with the example cosntructed by Tommy you get

> A.categ <- cut(d$A, breaks = c(-Inf, 10, 20, 30, Inf), right=FALSE)
>     B.categ <- cut(d$B, breaks = c(-Inf, 1, 10, 100, Inf), right=FALSE)
>     table(A.categ, B.categ)
           B.categ
A.categ     [-Inf,1) [1,10) [10,100) [100, Inf)
  [-Inf,10)        0      1        1          9
  [10,20)          0      2        3          2
  [20,30)          0      5        4          1
  [30, Inf)        0     17       11         44

Not every understands the open/closed convention so sometimes you need to go in and rework the labels of a factored variable that you constructed with cut so the less mathematically inclined client can map it to his conventions. You use the factor function and specify the labels argument (and do NOT specify the levels argument or you will "break the variable")

> A.categ <- factor(A.categ, labels=c(" Less than 1", "1-9.9", "10-99.9" , "100+") )
>     table(A.categ, B.categ)
              B.categ
A.categ        [-Inf,1) [1,10) [10,100) [100, Inf)
   Less than 1        0      1        1          9
  1-9.9               0      2        3          2
  10-99.9             0      5        4          1
  100+                0     17       11         44


try the cut function.

?cut

it works with breaks like you want.

cut(x,breaks,labels,...)


 table(cut(A[which(B<1)],breaks=c(0,10,20,30)))


 table(cut(A[which(B>1)],breaks=c(0,10,20,30)))


Here's a little vapply - based solution. Is assumes you want the total count for each cell (for example, A > 20 & B > 100) - not the count for (A > 20 & A < 30) & (B > 100 & B < 1000).

# Create some data
set.seed(42)
n <- 100;
dataAll <- data.frame(A=runif(n, 1,100), B=10^runif(n, 0, 4))

# And some break points    
a <- 1:10*10 # 10, 20 etc...
b <- 10^(0:4) # 1, 10, 100, 1000


f <- function(A, a, B, b) {
    structure(t(vapply(b, function(bb) {
    vapply(a, function(aa, A) sum(A > aa), 1, A[B > bb])    
    }, a)), dimnames=list(B=b, A=a))
}

f(dataAll$A, a, dataAll$B, b)

Which gives the following table:

       A
B       10 20 30 40 50 60 70 80 90 100
  1     89 82 72 63 55 46 34 23 16   0
  10    65 60 55 47 41 34 26 18 12   0
  100   47 45 44 39 34 28 21 14 10   0
  1000  20 19 18 17 16 12  8  5  5   0
  10000  0  0  0  0  0  0  0  0  0   0

The meat of it all is to count the TRUE values for one a condition sum(A > aa), and then do that for all a conditions with a call to vapply.

vapply(a, function(aa, A) sum(A > aa)    

Then do it again for each b condition, and add some dimnames to the (transposed) result.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜