开发者

using R to compare data frames to find first occurrence to df1$columnA in df2$columnB when df2$columnB is a single space separated list

I have a question regarding data frames in R. I want to take a data.frame, dfy, and find the first occurrence of dfy$workerId in dfx$workers, to create a new dataframe, dfz, a copy of dfx that also contains the first occurance of dfy$workerId in dfx$wokers as dfz$highestRankingGroup. Its a little tricky becuase dfx$workers is a single spaced seperated string. My original plan was to do this in Perl, but I would like to find a way to work in R and avoid having to write out to temp. files.

thank you for your time.

y <- "name,workerId,aptitude  
joe,4,34
steve,5,42 
jon,7,23 
nick,8,122"


x <- "workers,projectScore
1 2 3 8 ,92
1 2 5 9 ,89
3 5 7 ,85  
1 8 9 10 ,82  
4 5 7 8 ,83  
1 3 5 7 8 ,79" 

z <- "name,workerId,aptitude,highestRankingGr开发者_运维问答oup
joe,4,0.34,5
steve,5,0.42,2
jon,7,0.23,3
nick,8,0.122,1"

dfy <- read.csv(textConnection(y), header=TRUE, sep=",", stringsAsFactors=FALSE)  
dfx <- read.csv(textConnection(x), header=TRUE, sep=",", stringsAsFactors=FALSE)  
dfz <- read.csv(textConnection(z), header=TRUE, sep=",", stringsAsFactors=FALSE)


First, add the highestRankingGroup column to your dataset dfx

dfx$highestRankingGroup <- seq(1, length(dfx$projectScore))

Since you have mentioned perl you can do a familar perl thing and simple split the workers column in whitespaces. I combined the splitting with functions from the plyr package which are always nice to work with.

library(plyr)
df.l <- dlply(dfx, "projectScore")

f.reshape <- function(x) {
  wrk <- strsplit(x$workers, "\\s", perl = TRUE)
  data.frame(worker = wrk[[1]]
             , projectScore = x$projectScore
             , highestRankingGroup = x$highestRankingGroup
             )
}

df.tmp <- ldply(df.l, f.reshape)

df.z1 <- merge(df.tmp, dfy, by.x = "worker", by.y = "workerId")

Now you have to look for the max values in the projectScore column:

df.z2 <- ddply(df.z1, "name", function(x) x[x$projectScore == max(x$projectScore), ])

This produces:

R> df.z2
  worker .id projectScore highestRankingGroup  name aptitude
1      4  83           83                   5   joe       34
2      7  85           85                   3   jon       23
3      8  92           92                   1  nick      122
4      5  89           89                   2 steve       42
R> 

You can reshape the df.z2 dataframe according to your personal taste. Simply look at the different steps and the produced objects in order to see at which step different columns, etc get introduced.


Before I start, I recommend that you go with @mropa's answer. This answer is a bit of fun I had messing about with your question. On the plus side, it does involve a bit of fun with function closures ;)

Essentially, I create a function that returns two functions.

updateDFz = function(dfy) {
  ## Create a default dfz matrix
  dfz = dfy
  dfz$HRG = 10000 ## Big max value
  counter = 0
  ## Update the dfz matrix after every row
  update = function(x) {
    counter <<- counter + 1
    for(i in seq_along(x)) {
        if(is.element(x[i], dfz$workerId))
           dfz[dfz$workerId == x[i],]$HRG <<- min(dfz[dfz$workerId == x[i],]$HRG, counter)
    }
    return(dfz)
  }
  ## Get the dfz matrix
  getDFz = function()
    return(dfz)
  list(getDFz=getDFz, update=update)
}

f = updateDFz(dfy)
lapply(strsplit(dfx$workers, " "), f$update)
f$getDFz()

As I said, a bit of fun ;)


Hopefully someone finds this useful.

# Recieves a data.frame and a search column
# Returns a data.frame of the first occurances of all unique values of the "search" column

getfirsts <- function(data, searchcol){


rows <- as.data.frame(match(unique(data[[searchcol]]), data[[searchcol]]))  
firsts = data[rows[[1]],]

return(firsts)
}
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜