List of (nearly) equal columns from a data.frame by condition in R

First without the details

I have data.frames like this one:

  val1 val2 val3 val4 val5
1  1.1    2  1.1  2.1  4.2
2  5.7    5  5.6  4.9  9.9
3  3.1    3  3.2  2.9  5.9
4  9.6    1  9.5  1.0  2.0

and want to find the (nearly) equal columns. The desired result would be something like

[1] "val1" "val2" "val5"

because the column val3 is almost equal to val1, val4 is almost equal to val2 and val5 is different.

Details:

  • What does "nearly" equal mean (just one of the options listed below):

    • the absolute difference of the values is smaller than a fixed number (0.2 for the sample above)
    • the relative difference of the values is smaller than a fixed number (~11% for the sample)
    • other metrics which make sense ;-)
  • a listing of linearly dependent columns would be even better (but I think that's way more complicated) (that would mean that val5 is also part of the group which is formed by val2 and val4 since it's roughly twice the value)
  • it does not have to be really fast; O(n^2) would be okay (my frames are only about 12 rows and 300 columns)
  • if that is not possible, a list of exactly equal columns would work, too; I would then apply the round() function beforehand


It's not well defined how to choose which columns are equal; for instance, you could have three columns where A and B are "equal" and B and C are "equal" but A and C are not. What to do then? One way around that might be to use hierarchical clustering, maybe like this:

Using the data from Andrie's answer (stored here as d), first transpose it and make it into a matrix, so that each original column becomes a row. I'll also standardize each row (what was a column) by dividing it by its sum, as a start at finding linear combinations; this will group rows that are exact multiples of each other, but not more complex combinations.

d <- t(as.matrix(d))
s <- rowSums(d)
ds <- sweep(d, 1, s, `/`)

We now make a tree, and for interest, plot it. This uses the default distance function (Euclidean) but others are possible.

tree <- hclust(dist(ds))
plot(tree)
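Other distances can be swapped in at this step. As a minimal sketch, assuming the question's absolute-difference criterion is what is wanted: use the maximum (Chebyshev) distance on the unscaled columns together with complete linkage, so that each merge height is the largest absolute difference between any two columns in the merged group. The data frame x is retyped from the question; the names d_raw, tree_abs and grp_abs are made up for this illustration, and cutree() is explained in the next step.

```r
# Data as given in the question
x <- data.frame(
  val1 = c(1.1, 5.7, 3.1, 9.6),
  val2 = c(2, 5, 3, 1),
  val3 = c(1.1, 5.6, 3.2, 9.5),
  val4 = c(2.1, 4.9, 2.9, 1.0),
  val5 = c(4.2, 9.9, 5.9, 2.0)
)

# One row per original column, left unscaled so distances stay in data units
d_raw <- t(as.matrix(x))

# Maximum distance = largest absolute difference over all positions;
# complete linkage merges groups at their largest pairwise distance
tree_abs <- hclust(dist(d_raw, method = "maximum"), method = "complete")

# Cut at the question's threshold of 0.2
grp_abs <- cutree(tree_abs, h = 0.2)
print(grp_abs)
```

With this combination, any two columns that end up in the same group differ by at most the cut height at every position, which is one way of settling the A/B/C ambiguity. It does not detect multiples, so val5 is not grouped with val2 and val4 here.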

We then choose where to cut the tree into groups (this is where you choose how close two columns have to be to count as "equal"); I output the grouping together with the column sums to see whether any columns are multiples of another.

> grp <- cutree(tree, h=0.1)
> cbind(grp, s)

     grp    s
val1   1 19.5
val2   2 11.0
val3   1 19.4
val4   2 10.9
val5   2 22.0
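To get from the group numbers to a list of column names, something like the following might work. It is only a sketch: it rebuilds the same tree from data retyped from the question, and the names groups and keep are made up here. Taking the first column of each group as its representative is an arbitrary choice.

```r
# Data as given in the question
x <- data.frame(
  val1 = c(1.1, 5.7, 3.1, 9.6),
  val2 = c(2, 5, 3, 1),
  val3 = c(1.1, 5.6, 3.2, 9.5),
  val4 = c(2.1, 4.9, 2.9, 1.0),
  val5 = c(4.2, 9.9, 5.9, 2.0)
)

# Same steps as above: transpose, scale each row by its sum, cluster, cut
d   <- t(as.matrix(x))
ds  <- sweep(d, 1, rowSums(d), `/`)
grp <- cutree(hclust(dist(ds)), h = 0.1)

# Column names that fall into the same group
groups <- split(names(grp), grp)
print(groups)

# First column of each group as its representative
keep <- names(grp)[!duplicated(grp)]
print(keep)
```

Because the rows were scaled by their sums, val5 lands in the same group as val2 and val4, so only val1 and val2 remain as representatives.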


Replicate your data:

x <- structure(list(val1 = c(1.1, 5.7, 3.1, 9.6), val2 = c(2L, 5L, 
3L, 1L), val3 = c(1.1, 5.6, 3.2, 9.5), val4 = c(2.1, 4.9, 2.9, 
1), val5 = c(4.2, 9.9, 5.9, 2)), .Names = c("val1", "val2", "val3", 
"val4", "val5"), class = "data.frame", row.names = c("1", "2", 
"3", "4"))
x
  val1 val2 val3 val4 val5
1  1.1    2  1.1  2.1  4.2
2  5.7    5  5.6  4.9  9.9
3  3.1    3  3.2  2.9  5.9
4  9.6    1  9.5  1.0  2.0

Create a function. It wraps the base R function duplicated(), whose method for arrays can also compare columns (via MARGIN), unlike the method for data.frames, which only compares rows. Also, I took you at your word and round the values first; you can specify the number of digits as a parameter.

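not_duplicated <- function(x, round_digits = 0, margin = 2) {
  # Round first so that nearly equal values compare as identical
  x2 <- round(as.matrix(x), round_digits)
  # duplicated() on a matrix compares whole columns when MARGIN = 2
  dimnames(x2)[[margin]][!duplicated(x2, MARGIN = margin)]
}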

The results are as you specified:

x <- as.matrix(x)
not_duplicated(x, 0)
[1] "val1" "val2" "val5"
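One caveat with rounding: two values can be very close and still fall on opposite sides of a rounding boundary (1.49 and 1.51 round to 1 and 2 with zero digits). A possible alternative, sketched here under the assumption that the question's absolute-difference threshold of 0.2 is the criterion, is to compare each column against the columns already kept. The helper name drop_near_duplicates and its tol argument are hypothetical, not part of base R, and the data is retyped from the question.

```r
# Data as given in the question
x <- data.frame(
  val1 = c(1.1, 5.7, 3.1, 9.6),
  val2 = c(2, 5, 3, 1),
  val3 = c(1.1, 5.6, 3.2, 9.5),
  val4 = c(2.1, 4.9, 2.9, 1.0),
  val5 = c(4.2, 9.9, 5.9, 2.0)
)

# Hypothetical helper: keep a column unless every value in it is within
# `tol` of the corresponding value in a column that was already kept
drop_near_duplicates <- function(x, tol = 0.2) {
  m <- as.matrix(x)
  keep <- integer(0)
  for (j in seq_len(ncol(m))) {
    # Largest absolute difference between column j and each kept column
    max_diff <- vapply(keep, function(k) max(abs(m[, j] - m[, k])), numeric(1))
    if (!any(max_diff < tol)) keep <- c(keep, j)
  }
  colnames(m)[keep]
}

result <- drop_near_duplicates(x, tol = 0.2)
print(result)
```

This does O(n^2) column comparisons in the worst case, which the question says is acceptable for about 300 columns. The result depends on column order, since a column is only compared with columns kept before it.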