
Assigning values based on the number of character duplicates

Sorry for the burst of question after question. Trying my best to search, but I have the arduous task of coming up with a very, very large program and I am still very new to R so I appreciate all the quick help I have got thus far.

Fake example to demonstrate Problem

Gene <- c("A","B","C","A","B","C","A","B","C")
IntensityValue <- c(1,10,20,3,NA,23,NA,NA,22)
ProceedTest <- c(2,2,2,2,-1,2,-1,-1,2)
ExampleData <- list(Gene=Gene, IntensityValue=IntensityValue, ProceedTest=ProceedTest)
ExampleData <- as.data.frame(ExampleData)
ExampleData
Gene IntensityValue ProceedTest
 A              1           2
 B             10           2
 C             20           2
 A              3           2
 B             NA          -1
 C             23           2
 A             NA          -1
 B             NA          -1
 C             22           2

ProceedTest is a score that indicates whether the test should proceed. A score of 2 means it will take the data into account, a score of -1 means that the test will not take the data into account.

You'll notice that the gene B has NA appear twice, and A has NA appear only once. I would like R to be able to recognize that for gene B, NA appears twice. Such that any time NA appears twice for a given gene (B), a value of zero replaces the NA, and the subsequent -1 is turned into a 2. I want R to ignore the NA for A and continue to leave the Proceed test values as is.

The changed data should look like:

Gene IntensityValue ProceedTest
  A              1           2
  B             10           2
  C             20           2
  A              3           2
  B              0           2
  C             23           2
  A             NA          -1
  B              0           2
  C             22           2

This may not be possible, but if it is, I would like to be able to say that if there are no NA's for the gene then the ProceedTest value becomes a -1.

Final Dataset
 Gene IntensityValue ProceedTest
  A              1           2
  B             10           2
  C             20          -1
  A              3           2
  B              0           2
  C             23          -1
  A             NA          -1
  B              0           2
  C             22          -1

In summary: gene A has only one NA, so nothing changes. Gene B has two NA values, so all of its ProceedTest values become 2 and its NA's become zeros in the IntensityValue column. Gene C gets a ProceedTest of -1 on every row because it does not contain any NA (its intensity values don't need to change).

I hope this is clear, I also know that my other questions have been a little bit easier, so I hope this particular question isn't so straightforward where I should have done more research to find the answer on my own.

Thanks for the help in advance,

Joe


If you don't care about the order of your data.frame, ddply from the plyr package can do the trick:

library(plyr)

ddply(ExampleData, "Gene", function(dfr){
        #here, dfr is the part of your original data.frame
        #only for the 'current value' of Gene
        numNA<-sum(is.na(dfr$IntensityValue))
        if(numNA>1)
        {
            #replace only the missing intensities, so measured values (e.g. 10 for B) are kept
            dfr$IntensityValue[is.na(dfr$IntensityValue)]<-0
            dfr$ProceedTest<-2
        }
        else if(numNA==0)
        {
            dfr$ProceedTest<- -1
        }
        dfr
    })

There are many other solutions though.
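One of those alternatives, if the original row order matters, is a base R version built on ave(). This is only a sketch under the rules stated in the question (more than one NA for a gene: NA becomes 0 and ProceedTest becomes 2; no NA at all: ProceedTest becomes -1). The names na_count, many_na, no_na and Result are made up for illustration.

```r
# Rebuild the example data from the question
ExampleData <- data.frame(
  Gene = c("A","B","C","A","B","C","A","B","C"),
  IntensityValue = c(1,10,20,3,NA,23,NA,NA,22),
  ProceedTest = c(2,2,2,2,-1,2,-1,-1,2),
  stringsAsFactors = FALSE
)

# Per-gene count of NA intensities, repeated on every row of that gene
# (na_count is a hypothetical helper name)
na_count <- ave(as.numeric(is.na(ExampleData$IntensityValue)),
                ExampleData$Gene, FUN = sum)

many_na <- na_count > 1   # genes with two or more NA values
no_na   <- na_count == 0  # genes with no NA values

Result <- ExampleData
# Only the missing intensities are replaced; measured values stay as they are
Result$IntensityValue[many_na & is.na(Result$IntensityValue)] <- 0
Result$ProceedTest[many_na] <- 2
Result$ProceedTest[no_na] <- -1
Result
```

Because ave() returns a vector aligned with the input rows, nothing is reordered, and genes with exactly one NA (A here) are left untouched.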


There are almost certainly more efficient ways of doing this: if your data has many repeats for each gene, the merge copies the small per-gene count table onto every row, which can eat up a lot of memory. With that caveat:

Gene <- c("A","B","C","A","B","C","A","B","C")
IntensityValue <- c(1,10,20,3,NA,23,NA,NA,22)
ProceedTest <- c(2,2,2,2,-1,2,-1,-1,2)
ExampleData <- list(Gene=Gene, IntensityValue=IntensityValue, ProceedTest=ProceedTest)
ExampleData <- as.data.frame(ExampleData)
ExampleData

num.na <- function(x) {
  sum(is.na(x))
}
ED.numna <- by(data=ExampleData,Gene,num.na)
# res.name is what you want the result column to be named
# (ideally would pull this from the call via something like as.character(attr(x,"call")))
as.data.frame.by <- function(x,res.name=NA) {
  stopifnot(length(dimnames(x))==1) # Only 1d case handled for now
  df <- data.frame(by = names(x), res = as.numeric(x) )
  names(df)[names(df)=="by"] <- names(dimnames(x))
  if(!is.na(res.name)) {
    names(df)[names(df)=="res"] <- res.name
  }
  df
}
ExampleData <- merge(ExampleData,as.data.frame(ED.numna,"count"))
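# zero out only the missing intensities for genes with more than one NA
ExampleData$IntensityValue[is.na(ExampleData$IntensityValue) & ExampleData$count > 1] <- 0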
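
The remaining rules from the question (ProceedTest set to 2 or -1) can be applied to a merged frame in the same way. Here is a self-contained sketch of the whole count-and-merge idea, not necessarily how this answer would have continued. It uses aggregate() for the counts instead of by(), and a made-up row_id column to undo the reordering that merge() introduces; counts, merged and Final are likewise illustrative names.

```r
# Rebuild the example data from the question
ExampleData <- data.frame(
  Gene = c("A","B","C","A","B","C","A","B","C"),
  IntensityValue = c(1,10,20,3,NA,23,NA,NA,22),
  ProceedTest = c(2,2,2,2,-1,2,-1,-1,2),
  stringsAsFactors = FALSE
)
# Hypothetical helper column so the original row order can be restored later
ExampleData$row_id <- seq_len(nrow(ExampleData))

# One row per gene holding the number of NA intensities
counts <- aggregate(list(count = is.na(ExampleData$IntensityValue)),
                    by = list(Gene = ExampleData$Gene), FUN = sum)

# Attach the per-gene count to every row, then put rows back in input order
merged <- merge(ExampleData, counts, by = "Gene")
merged <- merged[order(merged$row_id), ]

# Apply the rules from the question
merged$IntensityValue[merged$count > 1 & is.na(merged$IntensityValue)] <- 0
merged$ProceedTest[merged$count > 1] <- 2
merged$ProceedTest[merged$count == 0] <- -1

# Drop the helper columns
Final <- merged[, c("Gene", "IntensityValue", "ProceedTest")]
rownames(Final) <- NULL
Final
```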
