R: Finding patterns across multiple columns - possibly duplicated()?
I am trying to isolate entries in a dataframe which share common values: see below to reconstruct a portion of my df:
Stand<-c("MY","MY","MY","MY","MY")
Plot<-c(12,12,12,12,12)
StumpNumber<-c(1,2,3,3,7)
TreeNumber<-c(1,2,3,4,8)
sample<-data.frame(Stand,Plot,StumpNumber,TreeNumber)
sample
I would like an output that tells me which entries have common values. In other words, I want to quickly isolate situations where there is more than one TreeNumber (or more than one row) for a given Stand, Plot, StumpNumber combination. In the example code, that would be StumpNumber 3, which has TreeNumber 3 and TreeNumber 4.
My understanding of duplicated() is that it can find instances where duplicated values occur within a single column. What can I do to find situations where a common combination of columns occurs?
Thanks.
The Description section of ?duplicated indicates that it works on rows of data frames, and the fourth paragraph of the Details section says:
The data frame method works by pasting together a character
representation of the rows separated by ‘\r’, so may be imperfect
if the data frame has characters with embedded carriage returns or
columns which do not reliably map to characters.
How did you come to understand that it only works on single columns?
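To illustrate the point about rows, here is a minimal sketch on a made-up data frame (the name df and its values are assumptions for illustration, not part of the question's data). It contrasts calling duplicated() on one column with calling it on the whole data frame:

```r
# Hypothetical toy data, invented for this sketch
df <- data.frame(a = c(1, 1, 2, 1),
                 b = c("x", "y", "x", "x"))

# On a single column: flags repeated values of `a` only
dupColumn <- duplicated(df$a)

# On the data frame: flags rows where the (a, b) combination was seen before
dupRows <- duplicated(df)

dupColumn  # FALSE  TRUE FALSE  TRUE
dupRows    # FALSE FALSE FALSE  TRUE
```

Only the last row repeats an earlier (a, b) pair, so it is the only one flagged by the data frame method.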
Assuming TreeNumber is unique within Stand, Plot, and StumpNumber, you just need to exclude it from the call to duplicated.
> duplicated(sample[,1:3])
[1] FALSE FALSE FALSE TRUE FALSE
> duplicated(sample[,1:3], fromLast=TRUE)
[1] FALSE FALSE TRUE FALSE FALSE
Update - If you would like all the duplicated rows, you could do something like:
> allDups <- duplicated(sample[,1:3],fromLast=TRUE) | duplicated(sample[,1:3])
> sample[allDups,]
  Stand Plot StumpNumber TreeNumber
3    MY   12           3          3
4    MY   12           3          4
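As an alternative sketch (not the approach above, just another base R way to reach the same rows), you could count how many rows share each Stand/Plot/StumpNumber combination with ave() and keep the groups larger than one. The variable name groupSize is introduced here for illustration:

```r
# Rebuild the example data from the question
sample <- data.frame(Stand       = c("MY", "MY", "MY", "MY", "MY"),
                     Plot        = c(12, 12, 12, 12, 12),
                     StumpNumber = c(1, 2, 3, 3, 7),
                     TreeNumber  = c(1, 2, 3, 4, 8))

# For every row, the number of rows sharing its Stand/Plot/StumpNumber key
groupSize <- ave(seq_len(nrow(sample)),
                 sample$Stand, sample$Plot, sample$StumpNumber,
                 FUN = length)

# Keep every row whose key combination occurs more than once
repeated <- sample[groupSize > 1, ]
repeated
```

This keeps all members of a repeated combination in one step, without combining two duplicated() calls.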
For convenience, I'm going to assume you have a nesting scheme going on. So, let's say Trees are nested in Stumps, Stumps in Plots, and Plots in Stands. I also assumed the problem you're trying to solve is that some trees are attached to the same stump, which means the problematic entries are those where Stand/Plot/Stump identifiers are repeated for different TreeNumbers.
What I did was:
- Order the data
- Wrap a slightly customized function around duplicated()
- Use ddply() (in the plyr package) to split and analyze your data
- Print out the problematic entries
Ordering the Data
I ordered first by Stand, then Plot, and finally StumpNumber:
sampleOrdered <- sample[order(sample$Stand, sample$Plot, sample$StumpNumber), ]
Wrapping my own duplicated() function
Assuming the issue is that some trees are attached to the same stump, we can write the following function:
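findTreesAttachedToTheSameStump <- function(data) {
  # TRUE for every row whose StumpNumber already appeared earlier in `data`
  x <- duplicated(data[ , "StumpNumber"])
  # Return only the flagged rows
  data[x, ]
}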
This function will select out and return (implicitly) whatever entries pass the duplicated() test.
Using ddply
I did a bit of split-apply-combine here. I instruct ddply to break the dataset by Stand and Plot (since your data is nested, and StumpNumber might only be unique within a plot). Then, I apply the function we created above:
library(plyr)
sampleDuplicated <- ddply(sampleOrdered, .(Stand, Plot), findTreesAttachedToTheSameStump)
Print out the problematic stumps
Now all we need to do is call sampleDuplicated, which contains the entries for every Plot/Stand/Stump combination that was repeated.
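If plyr is not available, the same split-apply-combine idea can be sketched in base R with split(), lapply() and rbind(). This is a stand-in for the ddply() call, not the same code; the names pieces and sampleDuplicatedBase are made up for the example:

```r
# Rebuild the example data from the question
sample <- data.frame(Stand       = c("MY", "MY", "MY", "MY", "MY"),
                     Plot        = c(12, 12, 12, 12, 12),
                     StumpNumber = c(1, 2, 3, 3, 7),
                     TreeNumber  = c(1, 2, 3, 4, 8))

# Same helper as above: rows whose StumpNumber already appeared in the piece
findTreesAttachedToTheSameStump <- function(data) {
  x <- duplicated(data[, "StumpNumber"])
  data[x, ]
}

# Split by Stand and Plot, dropping combinations that never occur
pieces <- split(sample, list(sample$Stand, sample$Plot), drop = TRUE)

# Apply the helper to each piece and stack the results back together
sampleDuplicatedBase <- do.call(rbind,
                                lapply(pieces, findTreesAttachedToTheSameStump))
sampleDuplicatedBase
```

Because duplicated() flags only the second and later occurrences, this returns the row for TreeNumber 4 but not the row for TreeNumber 3. To see every row of a repeated stump, the helper could combine duplicated() with duplicated(..., fromLast=TRUE).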