Subset dataframe by an unusual relation between columns
I want to subset a dataframe which has an ID column (v1, all unique) and a "linked" ID column (v2). The expectation of v2 is that it may contain NAs, but where it does, the corresponding element of v1 does not appear elsewhere in v2. Also, the relation between the columns is expected to be symmetric: where there is an entry x in v2, the v1 entry of that row, y, is mirrored in another row where v1 has x and v2 has y. The last criterion is that the relation is not reflexive, i.e. x != y.
I want to subset the dataframe to the elements which don't fit the expected criteria.
Here is some sample data to illustrate:
set.seed(1)
dfr <- data.frame(v1=letters,v2=rev(letters))
dfr[sample(26,10),2]<-NA
dfr[sample(26,5),2]<-sample(letters,5)
dfr
v1 v2
1 a z
2 b <NA>
3 c x
4 d w
5 e <NA>
6 f u
7 g <NA>
8 h s
9 i i
10 j <NA>
11 k p
12 l <NA>
13 m f
14 n <NA>
15 o l
16 p k
17 q j
18 r e
19 s <NA>
20 t g
21 u <NA>
22 v e
23 w <NA>
24 x q
25 y x
26 z a
So rows 1, 2, 11, 14, 16, and 26 all meet the criteria, and I want to identify the rest.
I have attempted some solutions using match, but the NAs are causing problems. My attempts also probably rely on the fact that in this case v2 is based on rev(v1), whereas a more general solution can't make that assumption.
If I understand correctly, here is an example:
> subset(dfr, (is.na(v2) & !(v1%in%dfr$v2)) | !is.na(v2) & paste(v1, v2) %in% paste(dfr$v2, dfr$v1))
v1 v2
1 a z
2 b <NA>
9 i i
11 k p
14 n <NA>
16 p k
26 z a
# or if v1 == v2 is not included:
> subset(dfr, (is.na(v2) & !(v1%in%dfr$v2)) | !is.na(v2) & (v1 != v2 & paste(v1, v2) %in% paste(dfr$v2, dfr$v1)))
v1 v2
1 a z
2 b <NA>
11 k p
14 n <NA>
16 p k
26 z a
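Since the question asks for the rows that do not fit, the same condition can be negated. Here is a minimal sketch of the three checks written with match() instead of paste(). The helper name fits_criteria is hypothetical, and the sketch assumes v1 is unique and contains no NA, as the question states. The data is retyped from the printed table above so the result does not depend on what sample() returns in a given R version.

```r
# Sample data copied from the printed table in the question.
dfr <- data.frame(
  v1 = letters,
  v2 = c("z", NA, "x", "w", NA, "u", NA, "s", "i", NA, "p", NA, "f",
         NA, "l", "k", "j", "e", NA, "g", NA, "e", NA, "q", "x", "a"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a row meets all expectations.
# Assumes v1 is unique and has no NA.
fits_criteria <- function(v1, v2) {
  v1 <- as.character(v1)
  v2 <- as.character(v2)

  # Row index whose v1 equals this row's v2 (NA if v2 is NA or unmatched).
  partner <- match(v2, v1, incomparables = NA)

  # Case 1: v2 is NA and this v1 is not referenced anywhere in v2.
  na_ok <- is.na(v2) & !(v1 %in% v2)

  # Case 2: the partner row points back at this row (symmetry) ...
  mirrored <- !is.na(partner) & !is.na(v2[partner]) & v2[partner] == v1
  # ... and the row does not point at itself (not reflexive).
  not_reflexive <- v1 != v2

  na_ok | (mirrored & not_reflexive)
}

ok  <- fits_criteria(dfr$v1, dfr$v2)
bad <- dfr[!ok, ]   # the rows that do NOT fit the expected criteria

which(ok)           # 1 2 11 14 16 26
```

Because the explicit is.na() guards keep NA out of the logical vector, dfr[!ok, ] can be indexed directly without subset() having to drop NA conditions.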