Using R to compare data frames to find the first occurrence of df1$columnA in df2$columnB when df2$columnB is a single space-separated list

2023-02-25 21:26 Q&A author:

I have a question regarding data frames in R. I want to take a data.frame, dfy, and find the first occurrence of dfy$workerId in dfx$workers, to create a new data frame, dfz, a copy of dfy that also contains the first occurrence of dfy$workerId in dfx$workers as dfz$highestRankingGroup. It's a little tricky because dfx$workers is a single space-separated string. My original plan was to do this in Perl, but I would like to find a way to work in R and avoid having to write out to temp files.

Thank you for your time.

y <- "name,workerId,aptitude  
joe,4,34
steve,5,42 
jon,7,23 
nick,8,122"


x <- "workers,projectScore
1 2 3 8 ,92
1 2 5 9 ,89
3 5 7 ,85  
1 8 9 10 ,82  
4 5 7 8 ,83  
1 3 5 7 8 ,79" 

z <- "name,workerId,aptitude,highestRankingGroup
joe,4,0.34,5
steve,5,0.42,2
jon,7,0.23,3
nick,8,0.122,1"

dfy <- read.csv(textConnection(y), header=TRUE, sep=",", stringsAsFactors=FALSE)  
dfx <- read.csv(textConnection(x), header=TRUE, sep=",", stringsAsFactors=FALSE)  
dfz <- read.csv(textConnection(z), header=TRUE, sep=",", stringsAsFactors=FALSE)

First, add the highestRankingGroup column to your dataset dfx, numbering its rows from 1 in their existing order:

dfx$highestRankingGroup <- seq(1, length(dfx$projectScore))

Since you mentioned Perl, you can do a familiar Perl thing and simply split the workers column on whitespace. I combined the splitting with functions from the plyr package, which are always nice to work with.

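library(plyr)
## Split dfx into a list with one data frame per projectScore
df.l <- dlply(dfx, "projectScore")

## Turn one piece of dfx into long format: one row per worker.
## Assumes each piece holds a single row, i.e. projectScore values are unique.
f.reshape <- function(x) {
  wrk <- strsplit(x$workers, "\\s", perl = TRUE)
  data.frame(worker = wrk[[1]]
             , projectScore = x$projectScore
             , highestRankingGroup = x$highestRankingGroup
             )
}

## Apply f.reshape to every piece and stack the results
df.tmp <- ldply(df.l, f.reshape)

## Attach name and aptitude from dfy by matching worker ids
df.z1 <- merge(df.tmp, dfy, by.x = "worker", by.y = "workerId")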

Now, for each name, keep the row with the maximum value in the projectScore column:

df.z2 <- ddply(df.z1, "name", function(x) x[x$projectScore == max(x$projectScore), ])

This produces:

R> df.z2
  worker .id projectScore highestRankingGroup  name aptitude
1      4  83           83                   5   joe       34
2      7  85           85                   3   jon       23
3      8  92           92                   1  nick      122
4      5  89           89                   2 steve       42
R>

You can reshape the df.z2 data frame according to your personal taste. Simply look at the different steps and the objects they produce to see at which step each column (such as .id) gets introduced.
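For comparison, the same split, reshape and lookup can be sketched in base R without plyr. This is only a minimal sketch, not the answer's method: it assumes R 3.2 or later (for trimws and lengths), rebuilds the question's data inline so it runs on its own, and introduces the names ids and long purely for illustration. It also picks the first group by row order, as the question asks, rather than by maximum projectScore; for this data both give the same result.

```r
# Self-contained base-R sketch; the data mirror the question's dfx and dfy.
dfx <- data.frame(
  workers = c("1 2 3 8 ", "1 2 5 9 ", "3 5 7 ", "1 8 9 10 ", "4 5 7 8 ", "1 3 5 7 8 "),
  projectScore = c(92, 89, 85, 82, 83, 79),
  stringsAsFactors = FALSE
)
dfy <- data.frame(
  name = c("joe", "steve", "jon", "nick"),
  workerId = c(4L, 5L, 7L, 8L),
  aptitude = c(34, 42, 23, 122),
  stringsAsFactors = FALSE
)

# Split each workers string on whitespace: one character vector per row of dfx.
ids <- strsplit(trimws(dfx$workers), "\\s+")

# Long format: one row per (worker, group), where group is the row index of dfx.
long <- data.frame(
  workerId = as.integer(unlist(ids)),
  group = rep(seq_len(nrow(dfx)), lengths(ids))
)

# match() returns the position of the first hit, so this selects the first
# (lowest-numbered) group in which each worker appears; NA if never listed.
dfz <- dfy
dfz$highestRankingGroup <- long$group[match(dfy$workerId, long$workerId)]
dfz
```

Because long is built in row order, no sorting or grouping step is needed; if dfx were not already ordered by rank, it would have to be sorted first.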

Before I start, I recommend that you go with @mropa's answer. This answer is a bit of fun I had messing about with your question. On the plus side, it does involve a bit of fun with function closures ;)

Essentially, I create a function that returns two functions.

updateDFz = function(dfy) {
  ## Create a default dfz data frame
  dfz = dfy
  dfz$HRG = 10000 ## Big max value
  counter = 0
  ## Update the dfz data frame after every row of dfx
  update = function(x) {
    counter <<- counter + 1
    for(i in seq_along(x)) {
        if(is.element(x[i], dfz$workerId))
           dfz[dfz$workerId == x[i],]$HRG <<- min(dfz[dfz$workerId == x[i],]$HRG, counter)
    }
    return(dfz)
  }
  ## Get the dfz data frame
  getDFz = function()
    return(dfz)
  list(getDFz=getDFz, update=update)
}

f = updateDFz(dfy)
lapply(strsplit(dfx$workers, " "), f$update)
f$getDFz()

As I said, a bit of fun ;)
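If the closure mechanism is unfamiliar, a stripped-down sketch may help. It shows how <<- lets an inner function modify a variable that lives in the enclosing function's environment, which is what update does with counter and dfz. The name makeCounter and everything inside it are hypothetical and exist only for this illustration.

```r
# makeCounter is a hypothetical name used only for this illustration.
makeCounter <- function() {
  count <- 0                 # lives in the environment created by this call
  increment <- function() {
    count <<- count + 1      # <<- assigns to count in the enclosing environment
    invisible(count)
  }
  get <- function() count    # reads the same shared variable
  list(increment = increment, get = get)
}

counter <- makeCounter()
counter$increment()
counter$increment()
counter$get()            # 2

other <- makeCounter()   # a second call gets its own, independent state
other$get()              # 0
```

Each call to makeCounter creates a fresh environment, so the state is private to the pair of functions it returns, just as each call to updateDFz gets its own dfz and counter.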

Hopefully someone finds this useful.

# Receives a data.frame and the name of a search column
# Returns a data.frame holding the first occurrence of each unique value in the "search" column

getfirsts <- function(data, searchcol) {
  # match() gives the row index of the first occurrence of each unique value
  rows <- match(unique(data[[searchcol]]), data[[searchcol]])
  firsts <- data[rows, ]
  return(firsts)
}
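A possible way to apply getfirsts to the original question is sketched below. It assumes the workers column has already been split into a long table with one row per (worker, group) pair, listed in group order; that table, the name long and the variable res are hypothetical and typed in by hand here from the question's data. The function is repeated in an equivalent compact form so the sketch runs on its own.

```r
# Equivalent to the getfirsts definition above, repeated so this runs standalone.
getfirsts <- function(data, searchcol) {
  data[match(unique(data[[searchcol]]), data[[searchcol]]), ]
}

# Hypothetical long-format table: one row per (worker, group) pair in group
# order, as obtained by splitting dfx$workers from the question.
long <- data.frame(
  workerId = c(1, 2, 3, 8, 1, 2, 5, 9, 3, 5, 7, 1, 8, 9, 10, 4, 5, 7, 8, 1, 3, 5, 7, 8),
  group    = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6)
)

# First row for every distinct worker id.
firsts <- getfirsts(long, "workerId")

# Keep only the workers listed in dfy (ids 4, 5, 7 and 8 in the question).
res <- firsts[firsts$workerId %in% c(4, 5, 7, 8), ]
res
```

The group column of res can then be merged back onto dfy by workerId to give highestRankingGroup.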
