R: Finding patterns across multiple columns- possibly duplicated()?

I am trying to isolate entries in a dataframe which share common values: see below to reconstruct a portion of my df:

Stand<-c("MY","MY","MY","MY","MY")
Plot<-c(12,12,12,12,12)
StumpNumber<-c(1,2,3,3,7)
TreeNumber<-c(1,2,3,4,8)
sample<-data.frame(Stand,Plot,StumpNumber,TreeNumber)
sample

I would like an output that tells me which entries have common values. In other words, I want to quickly isolate situations where there is more than one TreeNumber (or more than one row) for a given Stand, Plot, StumpNumber combination. In the example code, that would be StumpNumber 3, which has both TreeNumber 3 and TreeNumber 4.

My understanding of duplicated() is that it can find instances where duplicated values occur within a single column. What can I do to find situations where a common combination of columns occurs?

Thanks.


The Description of ?duplicated indicates that it works on rows of data.frames and the fourth paragraph of the Details section says:

 The data frame method works by pasting together a character
 representation of the rows separated by ‘\r’, so may be imperfect
 if the data frame has characters with embedded carriage returns or
 columns which do not reliably map to characters.
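
As a rough illustration of the mechanism that passage describes (a sketch only, not the actual source of duplicated.data.frame, which may differ between R versions), each row of the key columns can be pasted into one string and duplicated() run on the resulting vector. The name row_keys is made up for this example.

```r
# Rebuild the example data from the question.
Stand <- c("MY", "MY", "MY", "MY", "MY")
Plot <- c(12, 12, 12, 12, 12)
StumpNumber <- c(1, 2, 3, 3, 7)
TreeNumber <- c(1, 2, 3, 4, 8)
sample <- data.frame(Stand, Plot, StumpNumber, TreeNumber)

# Paste the three key columns of each row into a single string,
# separated by a carriage return, as the help page describes.
row_keys <- do.call(paste, c(sample[, 1:3], sep = "\r"))

# duplicated() on the pasted keys flags the same rows as the
# data frame method applied to the same three columns.
duplicated(row_keys)
duplicated(sample[, 1:3])
```

Both calls should flag the same rows for this data, which is why passing several columns to duplicated() tests the combination rather than each column separately.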

How did you come to understand that it only works on single columns?

Assuming TreeNumber is unique within Stand, Plot, and StumpNumber you just need to exclude it from the call to duplicated.

> duplicated(sample[,1:3])
[1] FALSE FALSE FALSE  TRUE FALSE
> duplicated(sample[,1:3], fromLast=TRUE)
[1] FALSE FALSE  TRUE FALSE FALSE

Update - If you would like all the duplicated rows, you could do something like:

> allDups <- duplicated(sample[,1:3],fromLast=TRUE) | duplicated(sample[,1:3])
> sample[allDups,]
  Stand Plot StumpNumber TreeNumber
3    MY   12           3          3
4    MY   12           3          4
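
Another way to get the same rows, sketched here as an alternative rather than as part of the answer above, is to count how many rows share each Stand/Plot/StumpNumber combination with base R's ave() and keep the rows whose count exceeds one. The names n_per_combo and shared are invented for this example.

```r
# Rebuild the example data from the question.
Stand <- c("MY", "MY", "MY", "MY", "MY")
Plot <- c(12, 12, 12, 12, 12)
StumpNumber <- c(1, 2, 3, 3, 7)
TreeNumber <- c(1, 2, 3, 4, 8)
sample <- data.frame(Stand, Plot, StumpNumber, TreeNumber)

# For every row, count the rows that share its
# Stand/Plot/StumpNumber combination.
n_per_combo <- ave(seq_len(nrow(sample)),
                   sample$Stand, sample$Plot, sample$StumpNumber,
                   FUN = length)

# Keep every row whose combination occurs more than once.
shared <- sample[n_per_combo > 1, ]
shared
```

The count vector can also be kept as a column if you want to see how many trees share each stump.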


For convenience, I'm going to assume you have a nesting scheme going on. So, let's say Trees are nested in Stumps, Stumps in Plots, and Plots in Stands. I also assumed the problem you're trying to solve is that some trees are attached to the same stump, which means the problematic entries are those where Stand/Plot/Stump identifiers are repeated for different TreeNumbers.

What I did was:

  • Order the data
  • Wrap a slightly customized function around duplicated()
  • Use ddply() (in the plyr package) to split and analyze your data
  • Print out the problematic entries

Ordering the Data

I ordered first by Stand, then Plot, and finally StumpNumber

    sampleOrdered <- sample[order(sample$Stand, sample$Plot, sample$StumpNumber), ]

Wrapping my own duplicated() function

Assuming the issue is that some trees are attached to the same stump, we can write the following function:

    findTreesAttachedToTheSameStump <- function(data) {
        x <- duplicated(data[ , "StumpNumber"])
        data[x, ]
    }

This function will select out and return (implicitly) whatever entries pass the duplicated() test.

Using ddply

I did a bit of split-apply-combine here. I instruct ddply to break the dataset by Stand and Plot (since your data is nested, and StumpNumber might only be unique within a plot). Then, I apply the function we created above:

    library(plyr)  # provides ddply() and .()
    sampleDuplicated <- ddply(sampleOrdered, .(Stand, Plot), findTreesAttachedToTheSameStump)

Print out the problematic stumps

Now all we need to do is print sampleDuplicated, which contains the repeated rows (the second and any later occurrence) for each Stand/Plot/StumpNumber combination.
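
For readers without plyr installed, the same split-apply-combine steps can be approximated in base R. This is a sketch under that assumption, swapping split(), lapply() and rbind() in for ddply(). The name pieces is made up here, and the row names of the result will differ from what ddply() returns.

```r
# Rebuild the example data from the question.
Stand <- c("MY", "MY", "MY", "MY", "MY")
Plot <- c(12, 12, 12, 12, 12)
StumpNumber <- c(1, 2, 3, 3, 7)
TreeNumber <- c(1, 2, 3, 4, 8)
sample <- data.frame(Stand, Plot, StumpNumber, TreeNumber)

# Order by Stand, then Plot, then StumpNumber (note the trailing comma,
# which selects rows rather than columns).
sampleOrdered <- sample[order(sample$Stand, sample$Plot, sample$StumpNumber), ]

# Return the rows whose StumpNumber already appeared earlier in the group.
findTreesAttachedToTheSameStump <- function(data) {
  x <- duplicated(data[, "StumpNumber"])
  data[x, ]
}

# Split by Stand and Plot, apply the function to each piece, recombine.
pieces <- split(sampleOrdered,
                list(sampleOrdered$Stand, sampleOrdered$Plot),
                drop = TRUE)
sampleDuplicated <- do.call(rbind, lapply(pieces, findTreesAttachedToTheSameStump))
sampleDuplicated
```

Because duplicated() marks only the second and later occurrences, this returns the TreeNumber 4 row for the example data but not the TreeNumber 3 row. Combining it with duplicated(..., fromLast = TRUE) inside the function would return both.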
