R-thonic replacement for simple for loops containing a condition

2022-12-19 16:57

I'm using R, and I'm a beginner. I have two large lists (30K elements each). One is called descriptions, where each element is (maybe) a tokenized string. The other is called probes, where each element is a number. I need to make a dictionary that maps probes to something in descriptions, if that something is there. Here's how I'm going about this:

probe2gene <- list()
for (i in 1:length(probes)){
 strings <- strsplit(descriptions[i], '//')
 if (length(strings[[1]]) > 1){ 
  probe2gene[probes[i]] = strings[[1]][2]
 }
}

This works fine but seems slow, much slower than the roughly equivalent Python:

probe2gene = {}
for p,d in zip(probes, descriptions):
    try:
        probe2gene[p] = d.split('//')[1]
    except IndexError:
        pass

My question: is there an "R-thonic" way of doing what I'm trying to do? The R manual entry on for loops suggests that such loops are rare. Is there a better solution?

Edit: a typical good "description" looks like this:

"NM_009826 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421 /// AB070619 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421 /// ENSMUST00000027040 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421"

A bad "description" looks like this:

"-----"

though it can quite easily be some other not-very-helpful string. Each probe is simply a number. The probes and descriptions vectors are the same length and correspond element by element, i.e. probes[i] maps to descriptions[i].

It's usually better in R if you use the various apply-like functions, rather than a loop. I think this solves your problem; the only drawback is that you have to use string keys.

> descriptions <- c("foo//bar", "")
> probes <- c(10, 20)
> probe2gene <- lapply(strsplit(descriptions, "//"), function (x) x[2])
> names(probe2gene) <- probes
> probe2gene <- probe2gene[!is.na(probe2gene)]
> probe2gene[["10"]]
[1] "bar"

Unfortunately, R doesn't have a good dictionary/map type. The closest I've found is using lists as a map from string-to-value. That seems to be idiomatic, but it's ugly.
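As a rough sketch of how the same idea might look on the description format shown in the question: the probe IDs below are made up for illustration, and trimming with trimws and storing the result in a named character vector are assumptions of this example, not part of the answer above.

```r
# Sample data modelled on the question; the probe IDs are hypothetical.
descriptions <- c(
  "NM_009826 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421",
  "-----"
)
probes <- c(101, 102)

# Split every description at once; fixed = TRUE treats "//" as a literal string.
parts <- strsplit(descriptions, "//", fixed = TRUE)

# Take the second field of each split (NA when there is none) and trim spaces.
genes <- trimws(vapply(parts, function(x) x[2], character(1)))

# A named character vector can serve as a string-keyed lookup table.
probe2gene <- setNames(genes, probes)
probe2gene <- probe2gene[!is.na(probe2gene)]

probe2gene[["101"]]
```

As with the list version, the keys are strings, so a numeric probe has to be looked up as "101" rather than 101.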

If I understand correctly, you are looking to save each probe-description combination where there is more than one (split) value in the description?

Probe and Description are the same length?

This is kind of messy, but here's a quick first pass at it:

a <- list("a","b","c")
b <- list(c("a","b"),c("DEF","ABC"),c("Z"))

names(b) <- a
matches <- which(lapply(b, length)>1) #several ways to do this
b <- lapply(b[matches], function(x) x[2]) #keeps the second element only

That's my first attempt. If you have a sample dataset that would be very useful.

Best regards,

Jay

Another way.

probe<-c(4,3,1)
gene<-c('red//hair','strange','blue//blood')
probe2gene<-character()
probe2gene[probe]<-sapply(strsplit(gene,'//'),'[',2)
probe2gene
[1] "blood" NA      NA      "hair"

In the sapply call, we take advantage of the fact that in R the subsetting operator is also a function named '[', to which we can pass the index as an argument. Also, an out-of-range index does not cause an error but gives an NA value. On the left-hand side of the same line, we use the fact that we can assign to a vector of indices in any order and with gaps.
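The three behaviours described above can be checked in isolation. This is only a small sketch with made-up values; the variable names are chosen for illustration.

```r
x <- c("red", "hair")

# `[` is an ordinary function, so x[2] and `[`(x, 2) are the same call.
same <- identical(x[2], `[`(x, 2))

# Indexing past the end of a vector yields NA instead of raising an error.
out_of_range <- c("strange")[2]

# Assigning to arbitrary positions grows the vector; untouched slots become NA.
v <- character()
v[c(4, 1)] <- c("hair", "blood")
v
```

Note that this indexes by position, so a very large probe number would create a correspondingly long, mostly NA vector.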

Here's another approach that should be fast. Note that this doesn't remove the empty descriptions. It could be adapted to do that, or you could clean those in a post-processing step using lapply. Is it the case that you'll never have a valid description of length one?

make_desc <- function(n)
{
    word <- function(x) paste(sample(letters, 5, replace=TRUE), collapse = "")
    if (runif(1) < 0.70)
        paste(sapply(seq_len(n), word), collapse = "//")
    else
        "----"
}

description <- sapply(seq_len(10), make_desc)
probes <- seq_along(description)

desc_parts <- strsplit(description, "//", fixed=TRUE, useBytes=TRUE)
lens <- sapply(desc_parts, length)
probes_expand <- rep(probes, lens)
ans <- split(unlist(desc_parts), probes_expand)


> description
 [1] "fmbec"                                                               
 [2] "----"                                                                
 [3] "----"                                                                
 [4] "frrii//yjxsa//wvkce//xbpkc"                                          
 [5] "kazzp//ifrlz//ztnkh//dtwow//aqvcm"                                   
 [6] "stupm//ncqhx//zaakn//kjymf//swvsr//zsexu"                            
 [7] "wajit//sajgr//cttzf//uagwy//qtuyh//iyiue//xelrq"                     
 [8] "nirex//awvnw//bvexw//mmzdp//lvetr//xvahy//qhgym//ggdax"              
 [9] "----"                                                                
[10] "ubabx//tvqrd//vcxsp//rjshu//gbmvj//fbkea//smrgm//qfmpy//tpudu//qpjbu"


> ans[[3]]
[1] "----"
> ans[[4]]
[1] "frrii" "yjxsa" "wvkce" "xbpkc"
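The post-processing step mentioned above might look something like the following sketch. It assumes that a description with only one field is never valid (the open question in this answer), and it uses a small fixed input instead of the random data so the result is reproducible.

```r
# Small fixed example standing in for the random data above.
description <- c("fmbec", "----", "frrii//yjxsa//wvkce")
probes <- seq_along(description)

desc_parts <- strsplit(description, "//", fixed = TRUE)
names(desc_parts) <- probes

# Keep only the entries that split into more than one field ...
has_gene <- lengths(desc_parts) > 1

# ... and reduce each of them to its second field.
probe2gene <- vapply(desc_parts[has_gene], function(x) x[2], character(1))
probe2gene
```

Here only probe 3 survives, mapped to its second field; lengths() is used instead of sapply(desc_parts, length) purely for brevity.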
