List of (nearly) equal columns from a data.frame by condition in R

2023-03-28 17:00

First without the details

I have data.frames like this one:

  val1 val2 val3 val4 val5
1  1.1    2  1.1  2.1  4.2
2  5.7    5  5.6  4.9  9.9
3  3.1    3  3.2  2.9  5.9
4  9.6    1  9.5  1.0  2.0

and want to find the (nearly) equal columns. The desired result would be something like

[1] "val1" "val2" "val5"

because the column val3 is almost equal to val1, val4 is almost equal to val2 and val5 is different.

Details:

What does "nearly" equal mean? Any one of the options below would do:
- the absolute difference of the values is smaller than a fixed number (0.2 for the sample above)
- the relative difference of the values is smaller than a fixed number (~11% for the sample)
- other metrics which make sense ;-)
A listing of linearly dependent columns would be even better (but I think that's way more complicated). That would mean that val5 is also part of the group formed by val2 and val4, since it's roughly twice their value.
It does not have to be really fast; O(n^2) would be okay. (My frames are only about 12 rows and 300 columns.)
If that is not possible, a list of exactly equal columns would work, too. I would then apply the round() function first.

It's not quite well-defined how to decide which columns are equal; for instance, you could have three columns where A and B are "equal" and B and C are "equal" but A and C are not. What to do then? One way around that might be to use hierarchical clustering, maybe like this:
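
As an aside before the clustering steps, the ambiguity can be made concrete with a minimal sketch. The vectors A, B and C and the helper near() are hypothetical, made up only for illustration; the tolerance 0.2 is the absolute difference mentioned in the question.

```r
# Hypothetical columns, chosen so that neighbours differ by 0.15
A <- c(1.00, 2.00, 3.00)
B <- c(1.15, 2.15, 3.15)
C <- c(1.30, 2.30, 3.30)
tol <- 0.2  # absolute tolerance taken from the question

# Two columns count as "nearly equal" if no element differs by tol or more
near <- function(u, v) max(abs(u - v)) < tol

near(A, B)  # TRUE
near(B, C)  # TRUE
near(A, C)  # FALSE: the relation is not transitive
```

Any grouping therefore has to decide whether A and C end up together, which is the decision a clustering method makes for you.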

Using the data from Andrie's answer, first transpose it and turn it into a matrix. As a start at finding linear combinations, I'll also standardize each row (what was a column) by dividing it by its sum; this groups rows that are exact multiples of each other, but not more complex combinations.

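# d starts out as the data.frame
d <- t(as.matrix(d))       # columns become rows, so dist() will compare the original columns
s <- rowSums(d)            # sum of each original column
ds <- sweep(d, 1, s, `/`)  # divide each row by its sum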
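
To see why dividing by the row sum helps, here is a small sketch using only val2 and val5 from the sample data (the reduced data.frame x2 is my own shortcut, not part of the answer). Since val5 is roughly twice val2, the raw rows are far apart but the standardized rows nearly coincide.

```r
# Two columns from the sample data; val5 is roughly 2 * val2
x2 <- data.frame(val2 = c(2, 5, 3, 1), val5 = c(4.2, 9.9, 5.9, 2))

d2  <- t(as.matrix(x2))                 # one row per original column
ds2 <- sweep(d2, 1, rowSums(d2), `/`)   # each row now sums to 1

dist(d2)   # Euclidean distance on the raw values, roughly 6.2
dist(ds2)  # after standardizing, roughly 0.011
```

Note that this only removes a common scale factor; columns related by an offset or by a combination of several other columns would not be pulled together this way.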

We now make a tree, and for interest, plot it. This uses the default distance function (Euclidean) but others are possible.

tree <- hclust(dist(ds))
plot(tree)
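
As one example of a different distance, the sketch below uses the maximum (Chebyshev) distance on the unstandardized data, which corresponds to the "absolute difference smaller than a fixed number" criterion from the question. The names x, d_raw, tree_max and grp_max are my own, and the cut height 0.2 is the tolerance quoted in the question; this is an illustration under those assumptions, not a replacement for the approach above.

```r
# Sample data from the question
x <- data.frame(val1 = c(1.1, 5.7, 3.1, 9.6),
                val2 = c(2, 5, 3, 1),
                val3 = c(1.1, 5.6, 3.2, 9.5),
                val4 = c(2.1, 4.9, 2.9, 1),
                val5 = c(4.2, 9.9, 5.9, 2))

d_raw <- t(as.matrix(x))  # one row per original column, no standardizing

# "maximum" distance = largest absolute element-wise difference;
# with complete linkage, every pair inside a group cut at h is within h
tree_max <- hclust(dist(d_raw, method = "maximum"), method = "complete")
grp_max  <- cutree(tree_max, h = 0.2)
grp_max   # val1 and val3 share a group, val2 and val4 share a group, val5 is alone
```

Without the standardizing step, val5 is no longer merged with val2 and val4, because the data are compared on their original scale.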

We then choose where to cut the tree into groups (this is where you decide how close two columns have to be to count as "equal"); I output the groups together with the sum of values to see whether any columns are multiples of another.

> grp <- cutree(tree, h=0.1)
> cbind(grp, s)

     grp    s
val1   1 19.5
val2   2 11.0
val3   1 19.4
val4   2 10.9
val5   2 22.0
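
To turn the group numbers into a list of columns, something like the following should work. It is only a sketch: grp is re-typed here from the output above so that the snippet runs on its own.

```r
# Group membership copied from the cutree() output above
grp <- c(val1 = 1, val2 = 2, val3 = 1, val4 = 2, val5 = 2)

# All column names, listed per group
split(names(grp), grp)

# One representative (the first column) per group
names(grp)[!duplicated(grp)]  # "val1" "val2"
```

Because of the standardizing step, val5 lands in the same group as val2 and val4, so this gives two representatives rather than the three in the question's desired output.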

Replicate your data:

x <- structure(list(val1 = c(1.1, 5.7, 3.1, 9.6), val2 = c(2L, 5L, 
3L, 1L), val3 = c(1.1, 5.6, 3.2, 9.5), val4 = c(2.1, 4.9, 2.9, 
1), val5 = c(4.2, 9.9, 5.9, 2)), .Names = c("val1", "val2", "val3", 
"val4", "val5"), class = "data.frame", row.names = c("1", "2", 
"3", "4"))
x
  val1 val2 val3 val4 val5
1  1.1    2  1.1  2.1  4.2
2  5.7    5  5.6  4.9  9.9
3  3.1    3  3.2  2.9  5.9
4  9.6    1  9.5  1.0  2.0

Create a function. It wraps the base R function duplicated, which has a method for arrays that can also compare columns (via MARGIN), unlike the method for data.frames, which only compares rows. Also, I took you at your word and round each column; the number of digits is a parameter.

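not_duplicated <- function(x, round_digits, margin=2){
  # round every column (margin=2) to the requested number of digits
  x2 <- apply(x, margin, round, round_digits)  
  # keep the name of the first column in each set of identical rounded columns
  colnames(x)[!duplicated(x2, MARGIN=margin)]
}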

The results are as you specified:

x <- as.matrix(x)
not_duplicated(x, 0)
[1] "val1" "val2" "val5"
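
One caveat with rounding is that two close values can fall on different sides of a rounding boundary (1.49 and 1.51 round to 1 and 2). If that matters, a tolerance-based variant could look like the sketch below. The function not_near_duplicated and its argument tol are hypothetical names of mine, not base R, and the greedy "compare with the columns kept so far" rule is just one possible choice.

```r
# Sample data from the question
x <- data.frame(val1 = c(1.1, 5.7, 3.1, 9.6),
                val2 = c(2, 5, 3, 1),
                val3 = c(1.1, 5.6, 3.2, 9.5),
                val4 = c(2.1, 4.9, 2.9, 1),
                val5 = c(4.2, 9.9, 5.9, 2))

# Hypothetical helper: keep a column only if it is not within `tol`
# (largest absolute difference) of a column that was already kept
not_near_duplicated <- function(x, tol = 0.2) {
  x <- as.matrix(x)
  keep <- integer(0)
  for (j in seq_len(ncol(x))) {
    # distance from column j to every column kept so far
    d <- vapply(keep, function(k) max(abs(x[, j] - x[, k])), numeric(1))
    if (!any(d < tol)) keep <- c(keep, j)
  }
  colnames(x)[keep]
}

not_near_duplicated(x)  # "val1" "val2" "val5"
```

This is O(n^2) in the number of columns, which the question says is acceptable for about 300 columns.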
