
R-thonic replacement for simple for loops containing a condition

I'm using R, and I'm a beginner. I have two large lists (30K elements each). One is called descriptions, where each element is (maybe) a tokenized string. The other is called probes, where each element is a number. I need to make a dictionary that maps each probe to something in descriptions, if that something is there. Here's how I'm going about this:

probe2gene <- list()
for (i in 1:length(probes)){
 strings<-strsplit(descriptions[i], '//')
 if (length(strings[[1]]) > 1){ 
  probe2gene[probes[i]] = strings[[1]][2]
 }
}

Which works fine, but seems slow, much slower than the roughly equivalent python:

probe2gene = {}
for p,d in zip(probes, descriptions):
    try:
     probe2gene[p] = d.split('//')[1]
    except IndexError:
     pass

My question: is there an "R-thonic" way of doing what I'm trying to do? The R manual entry on for loops suggests that such loops are rare. Is there a better solution?

Edit: a typical good "description" looks like this:

"NM_009826 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421 /// AB070619 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421 /// ENSMUST00000027040 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421"

A bad "description" looks like this:

"-----"

though it can quite easily be some other not-very-helpful string. Each probe is simply a number. The probe and description vectors are the same length, and completely correspond to each other, i.e. probe[i] maps to description[i].


It's usually better in R if you use the various apply-like functions, rather than a loop. I think this solves your problem; the only drawback is that you have to use string keys.

> descriptions <- c("foo//bar", "")
> probes <- c(10, 20)
> probe2gene <- lapply(strsplit(descriptions, "//"), function (x) x[2])
> names(probe2gene) <- probes
> probe2gene <- probe2gene[!is.na(probe2gene)]
> probe2gene[["10"]]
[1] "bar"

Unfortunately, R doesn't have a good dictionary/map type. The closest I've found is using lists as a map from string-to-value. That seems to be idiomatic, but it's ugly.
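As a rough sketch of two alternatives to a list keyed by strings, the same lookup can be held in a named character vector or in an environment. The sample data below is made up for illustration, and the whitespace trimming with trimws is an assumption based on the "NM_009826 // Rb1cc1 // ..." format shown in the question.

```r
# Illustrative data only (not from the original question)
descriptions <- c("NM_1 // Rb1cc1 // x", "-----", "NM_2 // Abc1 // y")
probes <- c(10, 20, 30)

# Second field of each description; NA where there is no second field
genes <- sapply(strsplit(descriptions, "//", fixed = TRUE), `[`, 2)
genes <- trimws(genes)  # assumption: surrounding spaces are not wanted

# Named character vector: the probe numbers become string keys
keep <- !is.na(genes)
probe2gene <- setNames(genes[keep], probes[keep])
probe2gene[["10"]]

# Environment: keys are still strings, but lookup is by name
probe_env <- list2env(as.list(probe2gene))
probe_env[["30"]]
exists("20", envir = probe_env, inherits = FALSE)
```

Either way the keys are still strings, so this does not remove the drawback mentioned above; it only avoids carrying a list of length-one elements around.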


If I understand correctly, you want to keep each probe-description pair where the description splits into more than one value?

Are the probes and descriptions the same length?

This is a bit messy, but here is a quick first pass at it:

a <- list("a","b","c")
b <- list(c("a","b"),c("DEF","ABC"),c("Z"))

names(b) <- a
matches <- which(lapply(b, length)>1) #several ways to do this
b <- lapply(b[matches], function(x) x[2]) #keeps the second element only

That's my first attempt. If you have a sample dataset that would be very useful.

Best regards,

Jay


Another way.

probe<-c(4,3,1)
gene<-c('red//hair','strange','blue//blood')
probe2gene<-character()
probe2gene[probe]<-sapply(strsplit(gene,'//'),'[',2)
probe2gene
[1] "blood" NA      NA      "hair" 

In the sapply, we take advantage of the fact that in R the subsetting operator is also a function named '[' to which we can pass the index as an argument. Also, an out-of-range index does not cause an error but gives a NA value. On the left hand of the same line, we use the fact that we can pass a vector of indices in any order and with gaps.
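The three behaviours described above can be checked separately. This is a small sketch using made-up vectors named x, parts, second and out, none of which come from the answer itself.

```r
x <- c("red", "hair")

# `[` is an ordinary function, so these two calls are equivalent
via_function <- `[`(x, 2)
via_operator <- x[2]

# An out-of-range index yields NA rather than an error
out_of_range <- x[5]

# So sapply can pull the second piece of every split string,
# with NA where the string had no separator
parts <- strsplit(c("red//hair", "strange"), "//", fixed = TRUE)
second <- sapply(parts, `[`, 2)

# Assigning to arbitrary positions grows the vector and fills gaps with NA
out <- character()
out[c(4, 1)] <- c("d", "a")
```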


Here's another approach that should be fast. Note that this doesn't remove the empty descriptions. It could be adapted to do that or you could clean those in a post processing step using lapply. Is it the case that you'll never have a valid description of length one?

make_desc <- function(n)
{
    word <- function(x) paste(sample(letters, 5, replace=TRUE), collapse = "")
    if (runif(1) < 0.70)
        paste(sapply(seq_len(n), word), collapse = "//")
    else
        "----"
}

description <- sapply(seq_len(10), make_desc)
probes <- seq_along(description)

desc_parts <- strsplit(description, "//", fixed=TRUE, useBytes=TRUE)
lens <- sapply(desc_parts, length)
probes_expand <- rep(probes, lens)
ans <- split(unlist(desc_parts), probes_expand)


> description
 [1] "fmbec"                                                               
 [2] "----"                                                                
 [3] "----"                                                                
 [4] "frrii//yjxsa//wvkce//xbpkc"                                          
 [5] "kazzp//ifrlz//ztnkh//dtwow//aqvcm"                                   
 [6] "stupm//ncqhx//zaakn//kjymf//swvsr//zsexu"                            
 [7] "wajit//sajgr//cttzf//uagwy//qtuyh//iyiue//xelrq"                     
 [8] "nirex//awvnw//bvexw//mmzdp//lvetr//xvahy//qhgym//ggdax"              
 [9] "----"                                                                
[10] "ubabx//tvqrd//vcxsp//rjshu//gbmvj//fbkea//smrgm//qfmpy//tpudu//qpjbu"


> ans[[3]]
[1] "----"
> ans[[4]]
[1] "frrii" "yjxsa" "wvkce" "xbpkc"
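One possible way to adapt this so that unhelpful descriptions are dropped is sketched below. It assumes, as the loop in the question does, that a description with fewer than two fields is never valid; the three sample strings are taken from the printed output above, and has_gene, second and probe2gene are names introduced here for illustration.

```r
description <- c("fmbec", "----", "frrii//yjxsa//wvkce//xbpkc")
probes <- seq_along(description)

desc_parts <- strsplit(description, "//", fixed = TRUE)
lens <- lengths(desc_parts)  # number of fields per description

# Assumption: only descriptions with at least two fields are useful
has_gene <- lens > 1

# Take the second field of each remaining description
second <- vapply(desc_parts[has_gene], `[`, character(1), 2)

# Key the result by probe (the names are stored as strings)
probe2gene <- setNames(second, probes[has_gene])
```

If single-field descriptions can be valid, the has_gene filter would need a different test, for example comparing against the "----" placeholder.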