Assigning values based on the number of character duplicates

2023-03-16 22:41

Sorry for the burst of question after question. I am trying my best to search, but I have the arduous task of writing a very, very large program and I am still very new to R, so I appreciate all the quick help I have gotten thus far.

Fake example to demonstrate the problem:

> Gene <- c("A","B","C","A","B","C","A","B","C")
> IntensityValue <- c(1,10,20,3,NA,23,NA,NA,22)
> ProceedTest <- c(2,2,2,2,-1,2,-1,-1,2)
> ExampleData <- list(Gene=Gene, IntensityValue=IntensityValue, ProceedTest=ProceedTest)
> ExampleData <- as.data.frame(ExampleData)
> ExampleData
Gene IntensityValue ProceedTest
 A              1           2
 B             10           2
 C             20           2
 A              3           2
 B             NA          -1
 C             23           2
 A             NA          -1
 B             NA          -1
 C             22           2

ProceedTest is a score that indicates whether the test should proceed. A score of 2 means it will take the data into account, a score of -1 means that the test will not take the data into account.

You'll notice that gene B has NA appear twice, while A has NA appear only once. I would like R to recognize that NA appears twice for gene B, so that any time NA appears twice for a given gene (here B), a zero replaces each NA and the corresponding -1 is turned into a 2. I want R to ignore the NA for A and leave its ProceedTest values as is.

The changed data should look like:

Gene IntensityValue ProceedTest
  A              1           2
  B             10           2
  C             20           2
  A              3           2
  B              0           2
  C             23           2
  A             NA          -1
  B              0           2
  C             22           2

This may not be possible, but if it is, I would like to be able to say that if there are no NA's for the gene then the ProceedTest value becomes a -1.

Final Dataset
 Gene IntensityValue ProceedTest
  A              1           2
  B             10           2
  C             20          -1
  A              3           2
  B              0           2
  C             23          -1
  A             NA          -1
  B              0           2
  C             22          -1

In summary: gene A has only one NA, so nothing changes. Gene B has two NA values, so its ProceedTest values all become 2 and its NAs become zeros in the IntensityValue column. Gene C gets a ProceedTest of -1 because it does not contain any NA (its intensity values don't really need to change).

I hope this is clear. I know my other questions have been a little bit easier, so I hope this one isn't so straightforward that I should have found the answer on my own with more research.

Thanks for the help in advance,

Joe

If you don't care about the order of your data.frame, ddply from the plyr package can do the trick:

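library(plyr)

ddply(ExampleData, "Gene", function(dfr){
        # here, dfr is the part of your original data.frame
        # only for the 'current value' of Gene
        isNA <- is.na(dfr$IntensityValue)
        numNA <- sum(isNA)
        if(numNA > 1)
        {
            # two or more NAs: zero out only the NAs and let every row proceed
            dfr$IntensityValue[isNA] <- 0
            dfr$ProceedTest <- 2
        }
        else if(numNA == 0)
        {
            # no NAs at all: mark every row of this gene as not proceeding
            dfr$ProceedTest <- -1
        }
        dfr
    })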

There are many other solutions though.
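
For instance, if the row order does matter, a base R sketch along these lines avoids plyr altogether. It assumes the rules from the question (two or more NAs for a gene: the NAs become 0 and ProceedTest becomes 2; no NAs for a gene: ProceedTest becomes -1). The names na_count and result are only illustrative.

```r
# Rebuild the example data from the question
ExampleData <- data.frame(
  Gene = c("A","B","C","A","B","C","A","B","C"),
  IntensityValue = c(1,10,20,3,NA,23,NA,NA,22),
  ProceedTest = c(2,2,2,2,-1,2,-1,-1,2)
)

# ave() returns, for every row, the number of NA intensities in that row's gene
na_count <- ave(as.numeric(is.na(ExampleData$IntensityValue)),
                ExampleData$Gene, FUN = sum)

result <- ExampleData
# Two or more NAs for the gene: replace only the NAs with 0, set ProceedTest to 2
result$IntensityValue[na_count >= 2 & is.na(result$IntensityValue)] <- 0
result$ProceedTest[na_count >= 2] <- 2
# No NAs for the gene: set ProceedTest to -1
result$ProceedTest[na_count == 0] <- -1

print(result)
```

Because ave() works row by row within each gene, the rows stay in their original order and the non-NA value for B (10) is left alone.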

With the caveat that there are almost certainly more efficient ways of doing this (if your data has many repeats for each gene, merging the small per-gene table of counts back onto the full data.frame copies each count onto every row, which will eat up a lot of memory):

Gene <- c("A","B","C","A","B","C","A","B","C")
IntensityValue <- c(1,10,20,3,NA,23,NA,NA,22)
ProceedTest <- c(2,2,2,2,-1,2,-1,-1,2)
ExampleData <- list(Gene=Gene, IntensityValue=IntensityValue, ProceedTest=ProceedTest)
ExampleData <- as.data.frame(ExampleData)
ExampleData

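# Count the NA values in a vector
num.na <- function(x) {
  sum(is.na(x))
}
# Number of NA intensities per gene; naming the index "Gene" lets merge() match on it
ED.numna <- by(data=ExampleData, list(Gene=ExampleData$Gene), function(d) num.na(d$IntensityValue))
# res.name is what you want the result column to be named
  #ideally would pull this from the call via something like as.character(attr(x,"call"))
as.data.frame.by <- function(x,res.name=NA,...) {
  stopifnot(length(dimnames(x))==1) # Only 1d case handled for now
  df <- data.frame(by = names(x), res = as.numeric(x) )
  names(df)[names(df)=="by"] <- names(dimnames(x))
  if(!is.na(res.name)) {
    names(df)[names(df)=="res"] <- res.name
  }
  df
}
# Note: merge() sorts the result by Gene, so the original row order is lost
ExampleData <- merge(ExampleData,as.data.frame(ED.numna,"count"))
# Two or more NAs for a gene: zero out only the NAs and set ProceedTest to 2
ExampleData$IntensityValue[ExampleData$count > 1 & is.na(ExampleData$IntensityValue)] <- 0
ExampleData$ProceedTest[ExampleData$count > 1] <- 2
# No NAs for a gene: set ProceedTest to -1
ExampleData$ProceedTest[ExampleData$count == 0] <- -1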
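
If the memory cost of the merge is a concern, the per-gene counts can also be looked up by name instead of merged. This is a minimal sketch, assuming the same example data; na_per_gene is an illustrative name, and count is reused as the column name only to match the code above.

```r
# Rebuild the example data from the question
ExampleData <- data.frame(
  Gene = c("A","B","C","A","B","C","A","B","C"),
  IntensityValue = c(1,10,20,3,NA,23,NA,NA,22),
  ProceedTest = c(2,2,2,2,-1,2,-1,-1,2)
)

# One NA count per gene, as a vector named by gene
na_per_gene <- tapply(is.na(ExampleData$IntensityValue), ExampleData$Gene, sum)

# Look up each row's gene in that named vector; row order is left untouched
ExampleData$count <- as.vector(na_per_gene[as.character(ExampleData$Gene)])

print(ExampleData)
```

The count column can then drive the same kind of indexed replacement as above, without re-sorting the rows.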
