Cross tabulate true values only in R
I have a dataframe "dataAll" and some vars (A,B) on it, I want a table like
A >10 >20 >30
B
>1
>10
>100
To do that, I coded like
with(dataAll, table(A=A>10,B=B>1)) which gives me
A
FALSE TRUE
B
FALSE 220357 4798
TRUE 596开发者_高级运维618 210080
So, here the value 210080 should go to first cell in the upper table
I tried something like this with no success
with(dataAll, table(A=A>c(10,20,30),B=B>c(1,10,100)))
and even
with(dataAll, table(A=c(A>10,A>20,A>30),B=c(B>1,B>10,B>100)))
but no success,
I know there should be a way to do this...
A.categ <- cut(A, breaks = c(-Inf, 10, 20, 30, Inf), right=FALSE)
B.categ <- cut(B, breaks = c(-Inf, 1, 10, 100, Inf), right=FALSE)
table(A.categ, B.categ)
The trick with cut
is remembering to set right=FALSE, since that is the way most people expect it to work. In fact when Frank Harrell made his version of cut2
for Hmisc, he set that as the default option.
When you do that with the example cosntructed by Tommy you get
> A.categ <- cut(d$A, breaks = c(-Inf, 10, 20, 30, Inf), right=FALSE)
> B.categ <- cut(d$B, breaks = c(-Inf, 1, 10, 100, Inf), right=FALSE)
> table(A.categ, B.categ)
B.categ
A.categ [-Inf,1) [1,10) [10,100) [100, Inf)
[-Inf,10) 0 1 1 9
[10,20) 0 2 3 2
[20,30) 0 5 4 1
[30, Inf) 0 17 11 44
Not every understands the open/closed convention so sometimes you need to go in and rework the labels
of a factored variable that you constructed with cut
so the less mathematically inclined client can map it to his conventions. You use the factor
function and specify the labels
argument (and do NOT specify the levels
argument or you will "break the variable")
> A.categ <- factor(A.categ, labels=c(" Less than 1", "1-9.9", "10-99.9" , "100+") )
> table(A.categ, B.categ)
B.categ
A.categ [-Inf,1) [1,10) [10,100) [100, Inf)
Less than 1 0 1 1 9
1-9.9 0 2 3 2
10-99.9 0 5 4 1
100+ 0 17 11 44
try the cut function.
?cut
it works with breaks like you want.
cut(x,breaks,labels,...)
table(cut(A[which(B<1)],breaks=c(0,10,20,30)))
table(cut(A[which(B>1)],breaks=c(0,10,20,30)))
Here's a little vapply
- based solution. Is assumes you want the total count for each cell (for example, A > 20 & B > 100) - not the count for (A > 20 & A < 30) & (B > 100 & B < 1000).
# Create some data
set.seed(42)
n <- 100;
dataAll <- data.frame(A=runif(n, 1,100), B=10^runif(n, 0, 4))
# And some break points
a <- 1:10*10 # 10, 20 etc...
b <- 10^(0:4) # 1, 10, 100, 1000
f <- function(A, a, B, b) {
structure(t(vapply(b, function(bb) {
vapply(a, function(aa, A) sum(A > aa), 1, A[B > bb])
}, a)), dimnames=list(B=b, A=a))
}
f(dataAll$A, a, dataAll$B, b)
Which gives the following table:
A
B 10 20 30 40 50 60 70 80 90 100
1 89 82 72 63 55 46 34 23 16 0
10 65 60 55 47 41 34 26 18 12 0
100 47 45 44 39 34 28 21 14 10 0
1000 20 19 18 17 16 12 8 5 5 0
10000 0 0 0 0 0 0 0 0 0 0
The meat of it all is to count the TRUE values for one a
condition sum(A > aa)
, and then do that for all a
conditions with a call to vapply
.
vapply(a, function(aa, A) sum(A > aa)
Then do it again for each b
condition, and add some dimnames to the (transposed) result.
精彩评论