"smoothing" time data - can it be done more efficient?
I have a data frame containing an ID, a start date and an end date. My data is ordered by ID, start, end (in this sequence).
Now I want all rows with the same ID having an overlapping time span (or have a start date that is right the day after the end date of another row) to be merged together.
Merging them means that they end up in one row having the same ID, the min(start date) and the max(end date) (I hope you understand what I mean).
I have written a function for that (it is not fully t开发者_如何学编程ested, but it looks fine for the moment). The problem is, as my data frame has nearly 100.000 observations, the function is very slow.
Can you help me improve my function in terms of efficiency?
Here is the function
smoothingEpisodes <- function (theData) {
theOutput <- data.frame()
curId <- theData[1, "ID"]
curStart <- theData[1, "START"]
curEnd <- theData[1, "END"]
for(i in 2:nrow(theData)) {
nextId <- theData[i, "ID"]
nextStart <- theData[i, "START"]
nextEnd <- theData[i, "END"]
if (curId != nextId | (curEnd + 1) < nextStart) {
theOutput <- rbind(theOutput, data.frame("ID" = curId, "START" = curStart, "END" = curEnd))
curId <- nextId
curStart <- nextStart
curEnd <- nextEnd
} else {
curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
}
}
theOutput <- rbind(theOutput, data.frame("ID" = curId, "START" = curStart, "END" = curEnd))
theOutput
}
Thank you!
[edit]
test data:
ID START END
1 1 2000-01-01 2000-03-31
2 1 2000-04-01 2000-05-31
3 1 2000-04-15 2000-07-31
4 1 2000-09-01 2000-10-31
5 2 2000-01-15 2000-03-31
6 2 2000-02-01 2000-03-15
7 2 2000-04-01 2000-04-15
8 3 2000-06-01 2000-06-15
9 3 2000-07-01 2000-07-15
(START and END have data type "Date", ID is a numeric)
A dput of the data:
structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957,
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"),
END = structure(c(11047, 11108, 11169, 11261, 11047, 11031,
11062, 11123, 11153), class = "Date")), .Names = c("ID",
"START", "END"), class = "data.frame", row.names = c(NA, 9L))
The first [without really thinking to hard about what you are trying to do] optimisation I would suggest is to allocate storage for theOutput
. At the moment, you are growing theOutput
at each iteration of the loop. In R that is an absolute no no!! That is something you never do, unless you like woefully slow code. R has to copy the object and expand it during each iteration and that is slow.
Looking at the code, we know that theOutput
needs to have nrow(theData) - 1
rows, and 3 columns. So create that before the loop starts:
theOutput <- data.frame(matrix(ncol = 3, nrow = nrow(theData) - 1))
then fill in this object during the loop:
theOutput[i, ] <- data.frame("ID" = curId, "START" = curStart, "END" = curEnd))
for example.
It isn't clear what START
and END
are? if these are numerics, then working with a matrix and not a data frame could also improve speed efficiency.
Also, creating a data frame each iteration is going to be slow. I can't time this without spending a lot of my own time, but you could just fill in the bits you want directly, without incurring the data.frame()
call during each iteration:
theOutput[i, "ID"] <- curId
theOutput[i, "START"] <- curStart
theOutput[i, "END"] <- curEnd
The best tip I can give you however, is to profile your code. See where the bottlenecks are and speed those up. Run your function on a smaller subset of the data; the size of which is sufficient to give you a bit of run-time to gather useful profiling data without having to wait for ages to get the profiling run completed. To profile in R, use Rprof()
:
Rprof(filename = "my_fun_profile.Rprof")
## run your function call here on a subset of the data
Rprof(NULL)
The you can look at the output using
summaryRprof("my_fun_profile.Rprof")
Hadley Wickham (@hadley) has a package to make this a bit easier. It is called profr. And as Dirk reminds me in the comments, there is also Luke Tierney's proftools package.
Edit: as the OP provided some test data I knocked up something quick to show the speed-up achieved by just following good loop practice:
smoothingEpisodes2 <- function (theData) {
curId <- theData[1, "ID"]
curStart <- theData[1, "START"]
curEnd <- theData[1, "END"]
nr <- nrow(theData)
out1 <- integer(length = nr)
out2 <- out3 <- numeric(length = nr)
for(i in 2:nrow(theData)) {
nextId <- theData[i, "ID"]
nextStart <- theData[i, "START"]
nextEnd <- theData[i, "END"]
if (curId != nextId | (curEnd + 1) < nextStart) {
out1[i-1] <- curId
out2[i-1] <- curStart
out3[i-1] <- curEnd
curId <- nextId
curStart <- nextStart
curEnd <- nextEnd
} else {
curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
}
}
out1[i] <- curId
out2[i] <- curStart
out3[i] <- curEnd
theOutput <- data.frame(ID = out1,
START = as.Date(out2, origin = "1970-01-01"),
END = as.Date(out3, origin = "1970-01-01"))
## drop empty
theOutput <- theOutput[-which(theOutput$ID == 0), ]
theOutput
}
Using the test dataset provide in object testData
, I get:
> res1 <- smoothingEpisodes(testData)
> system.time(replicate(100, smoothingEpisodes(testData)))
user system elapsed
1.091 0.000 1.131
> res2 <- smoothingEpisodes2(testData)
> system.time(replicate(100, smoothingEpisodes2(testData)))
user system elapsed
0.506 0.004 0.517
a 50% speed up. Not dramatic but simple to achieve just by not growing an object at each iteration.
I did it slightly different to avoid deleting empty rows in the end:
smoothingEpisodes <- function (theData) {
curId <- theData[1, "ID"]
curStart <- theData[1, "START"]
curEnd <- theData[1, "END"]
theLength <- nrow(theData)
out.1 <- integer(length = theLength)
out.2 <- out.3 <- numeric(length = theLength)
j <- 1
for(i in 2:nrow(theData)) {
nextId <- theData[i, "ID"]
nextStart <- theData[i, "START"]
nextEnd <- theData[i, "END"]
if (curId != nextId | (curEnd + 1) < nextStart) {
out.1[j] <- curId
out.2[j] <- curStart
out.3[j] <- curEnd
j <- j + 1
curId <- nextId
curStart <- nextStart
curEnd <- nextEnd
} else {
curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
}
}
out.1[j] <- curId
out.2[j] <- curStart
out.3[j] <- curEnd
theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))
theOutput
}
quite a big improvement to my original version!
Marcel, I thought I'd just try to improve your code a little. The version below is about 30x faster (from 3 seconds to 0.1 seconds)... The trick is to first extract the three columns to integer and double vectors.
As a side note, I try to use [[
where applicable, and try to keep integers as integers by writing j <- j + 1L
etc. That does not make any difference here, but sometimes coercing between integers and doubles can take quite some time.
smoothingEpisodes3 <- function (theData) {
theLength <- nrow(theData)
if (theLength < 2L) return(theData)
id <- as.integer(theData[["ID"]])
start <- as.numeric(theData[["START"]])
end <- as.numeric(theData[["END"]])
curId <- id[[1L]]
curStart <- start[[1L]]
curEnd <- end[[1L]]
out.1 <- integer(length = theLength)
out.2 <- out.3 <- numeric(length = theLength)
j <- 1L
for(i in 2:nrow(theData)) {
nextId <- id[[i]]
nextStart <- start[[i]]
nextEnd <- end[[i]]
if (curId != nextId | (curEnd + 1) < nextStart) {
out.1[[j]] <- curId
out.2[[j]] <- curStart
out.3[[j]] <- curEnd
j <- j + 1L
curId <- nextId
curStart <- nextStart
curEnd <- nextEnd
} else {
curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
}
}
out.1[[j]] <- curId
out.2[[j]] <- curStart
out.3[[j]] <- curEnd
theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))
theOutput
}
Then, the following code will show the speed difference. I just took your data and replicated it 1000 times...
x <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957,
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"),
END = structure(c(11047, 11108, 11169, 11261, 11047, 11031,
11062, 11123, 11153), class = "Date")), .Names = c("ID",
"START", "END"), class = "data.frame", row.names = c(NA, 9L))
r <- 1000
y <- data.frame(ID=rep(x$ID, r) + rep(1:r, each=nrow(x))-1, START=rep(x$START, r), END=rep(x$END, r))
system.time( a1 <- smoothingEpisodes(y) ) # 2.95 seconds
system.time( a2 <- smoothingEpisodes3(y) ) # 0.10 seconds
all.equal( a1, a2 )
精彩评论