R: combine same identifiers in dataframe

2023-03-04 15:27 Q&A author:

I have a data frame with two columns: an identifier column (ID) and a column of names. Each identifier appears several times in the ID column (see below).

 ID           Names
uc001aag.1  DKFZp686C24272
uc001aag.1  DQ786314
uc001aag.1  uc001aag.1
uc001aah.2  AK056232
uc001aah.2  FLJ00038
uc001aah.2  uc001aah.1
uc001aah.2  uc001aah.2
uc001aai.1  AY217347

Now I want to create a dataframe like this:

 ID           Names
uc001aag.1  DKFZp686C24272 | DQ786314 | uc001aag.1
uc001aah.2  AK056232 | FLJ00038 | uc001aah.1 | uc001aah.2
uc001aai.1  AY217347

Can anyone help me?

aggregate is quite fast, but you can also parallelize the work with an sapply-based solution. This is easy to do on Windows using snowfall:

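library(snowfall)
sfInit(parallel=TRUE, cpus=2)   # start two worker processes
sfExport("Data")                # copy Data to every worker

ID <- unique(Data$ID)
# for each ID, paste its names together on a worker
CombNames <- sfSapply(ID, function(i){
    paste(Data$Names[Data$ID==i], collapse=" | ")
})
data.frame(ID, CombNames)
sfStop()                        # shut the workers down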

The parallel version gives an extra speedup, but the plain (non-parallel) sapply solution is actually slower than aggregate. tapply is a bit faster than aggregate, but it can't be parallelized with snowfall. Timings on my computer:

n <- 3000
m <- 3
Data <- data.frame( ID = rep(1:n,m),
                    Names=rep(LETTERS[1:m],each=n))
 # using snowfall for parallel sapply    
 system.time({
   ID <- unique(Data$ID)
   CombNames <- sfSapply(ID,function(i){
     paste(Data$Names[Data$ID==i],collapse=" | ")
   })
   data.frame(ID,CombNames)
 }) 
   user  system elapsed 
   0.02    0.00    0.33 

 # using tapply
 system.time({
   CombNames <- tapply(Data$Names,Data$ID,paste,collapse=" | ")
   data.frame(ID=names(CombNames),CombNames)
 })
   user  system elapsed 
   0.44    0.00    0.44 

 # using aggregate
 system.time(
   aggregate(Names ~ ID, data=Data, FUN=paste, collapse=" | ")
 )
   user  system elapsed 
   0.47    0.00    0.47 

 # using the normal sapply
 system.time({
   ID <- unique(Data$ID)
   CombNames <- sapply(ID,function(i){
     paste(Data$Names[Data$ID==i],collapse=" | ")
   })
   data.frame(ID,CombNames)
 })
   user  system elapsed 
   0.75    0.00    0.75

Note:

For the record, a better sapply solution would be:

CombNames <- sapply(split(Data$Names,Data$ID),paste,collapse=" | ")
data.frame(ID=names(CombNames),CombNames)

This is equivalent to tapply. Parallelizing this version is actually slower, though, because more data has to be moved around within sfSapply. The speed of the parallel version above comes from copying the whole dataset to every CPU. Keep this in mind when your dataset is huge: you pay for the speed with more memory usage.
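
To make the equivalence with tapply concrete, here is a small sketch that runs both versions on the example data from the question and compares the results. The variable names via_split and via_tapply are only illustrative, and stringsAsFactors = FALSE is an assumption added so that the behaviour is the same across R versions.

```r
# Example data copied from the question
Data <- data.frame(
  ID = c("uc001aag.1", "uc001aag.1", "uc001aag.1",
         "uc001aah.2", "uc001aah.2", "uc001aah.2", "uc001aah.2",
         "uc001aai.1"),
  Names = c("DKFZp686C24272", "DQ786314", "uc001aag.1",
            "AK056232", "FLJ00038", "uc001aah.1", "uc001aah.2",
            "AY217347"),
  stringsAsFactors = FALSE
)

# split + sapply: one character vector per ID, then collapse each one
via_split <- sapply(split(Data$Names, Data$ID), paste, collapse = " | ")

# tapply: does the grouping and the collapsing in a single call
via_tapply <- tapply(Data$Names, Data$ID, paste, collapse = " | ")

# tapply returns a 1-d array, so compare the plain values and the names
same_values <- identical(unname(via_split), as.vector(via_tapply))
same_names  <- identical(names(via_split), names(via_tapply))
print(c(same_values, same_names))
```

Both approaches group Names by ID and collapse each group, so the only difference here is the container: sapply gives a named character vector, tapply a one-dimensional array.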

You can use aggregate:

R> aggregate(Names ~ ID, data=tmp, FUN=paste, collapse=" | ")
          ID                                         Names
1 uc001aag.1        DKFZp686C24272 | DQ786314 | uc001aag.1
2 uc001aah.2 AK056232 | FLJ00038 | uc001aah.1 | uc001aah.2
3 uc001aai.1                                      AY217347
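
The call above uses a data frame named tmp that is never defined in the thread. As a reproducible sketch, assuming tmp simply holds the example data from the question, it could be built with read.table and then passed to the same aggregate call (the name res is only illustrative):

```r
# Assumption: tmp holds the example data from the question
tmp <- read.table(text = "
ID Names
uc001aag.1 DKFZp686C24272
uc001aag.1 DQ786314
uc001aag.1 uc001aag.1
uc001aah.2 AK056232
uc001aah.2 FLJ00038
uc001aah.2 uc001aah.1
uc001aah.2 uc001aah.2
uc001aai.1 AY217347
", header = TRUE, stringsAsFactors = FALSE)

# Group Names by ID; collapse = " | " is passed through to paste
res <- aggregate(Names ~ ID, data = tmp, FUN = paste, collapse = " | ")
print(res)
```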
