using R to compare data frames to find first occurrence to df1$columnA in df2$columnB when df2$columnB is a single space separated list
I have a question regarding data frames in R. I want to take a data.frame, dfy, and find the first occurrence of dfy$workerId in dfx$workers, to create a new dataframe, dfz, a copy of dfx that also contains the first occurance of dfy$workerId in dfx$wokers as dfz$highestRankingGroup. Its a little tricky becuase dfx$workers is a single spaced seperated string. My original plan was to do this in Perl, but I would like to find a way to work in R and avoid having to write out to temp. files.
thank you for your time.
y <- "name,workerId,aptitude
joe,4,34
steve,5,42
jon,7,23
nick,8,122"
x <- "workers,projectScore
1 2 3 8 ,92
1 2 5 9 ,89
3 5 7 ,85
1 8 9 10 ,82
4 5 7 8 ,83
1 3 5 7 8 ,79"
z <- "name,workerId,aptitude,highestRankingGr开发者_运维问答oup
joe,4,0.34,5
steve,5,0.42,2
jon,7,0.23,3
nick,8,0.122,1"
dfy <- read.csv(textConnection(y), header=TRUE, sep=",", stringsAsFactors=FALSE)
dfx <- read.csv(textConnection(x), header=TRUE, sep=",", stringsAsFactors=FALSE)
dfz <- read.csv(textConnection(z), header=TRUE, sep=",", stringsAsFactors=FALSE)
First, add the highestRankingGroup
column to your dataset dfx
dfx$highestRankingGroup <- seq(1, length(dfx$projectScore))
Since you have mentioned perl
you can do a familar perl thing and simple split the workers
column in whitespaces. I combined the splitting with functions from the plyr
package which are always nice to work with.
library(plyr)
df.l <- dlply(dfx, "projectScore")
f.reshape <- function(x) {
wrk <- strsplit(x$workers, "\\s", perl = TRUE)
data.frame(worker = wrk[[1]]
, projectScore = x$projectScore
, highestRankingGroup = x$highestRankingGroup
)
}
df.tmp <- ldply(df.l, f.reshape)
df.z1 <- merge(df.tmp, dfy, by.x = "worker", by.y = "workerId")
Now you have to look for the max values in the projectScore
column:
df.z2 <- ddply(df.z1, "name", function(x) x[x$projectScore == max(x$projectScore), ])
This produces:
R> df.z2
worker .id projectScore highestRankingGroup name aptitude
1 4 83 83 5 joe 34
2 7 85 85 3 jon 23
3 8 92 92 1 nick 122
4 5 89 89 2 steve 42
R>
You can reshape the df.z2
dataframe according to your personal taste. Simply look at the different steps and the produced objects in order to see at which step different columns, etc get introduced.
Before I start, I recommend that you go with @mropa's answer. This answer is a bit of fun I had messing about with your question. On the plus side, it does involve a bit of fun with function closures ;)
Essentially, I create a function that returns two functions.
updateDFz = function(dfy) {
## Create a default dfz matrix
dfz = dfy
dfz$HRG = 10000 ## Big max value
counter = 0
## Update the dfz matrix after every row
update = function(x) {
counter <<- counter + 1
for(i in seq_along(x)) {
if(is.element(x[i], dfz$workerId))
dfz[dfz$workerId == x[i],]$HRG <<- min(dfz[dfz$workerId == x[i],]$HRG, counter)
}
return(dfz)
}
## Get the dfz matrix
getDFz = function()
return(dfz)
list(getDFz=getDFz, update=update)
}
f = updateDFz(dfy)
lapply(strsplit(dfx$workers, " "), f$update)
f$getDFz()
As I said, a bit of fun ;)
Hopefully someone finds this useful.
# Recieves a data.frame and a search column
# Returns a data.frame of the first occurances of all unique values of the "search" column
getfirsts <- function(data, searchcol){
rows <- as.data.frame(match(unique(data[[searchcol]]), data[[searchcol]]))
firsts = data[rows[[1]],]
return(firsts)
}
精彩评论