joining data based on a moving time window in R
I have weather data that was recorded every hour, and location data (X,Y) that was recorded every 4 hours. I want to know what the temperature was at location X,Y. 开发者_如何学JAVA The weather data isn't exactly at the same time. So, I have written this loop for every location to scan through the weather data looking for the "closest" in Date/TIME and extracting the data from that time. The problem is the way Ive written it, for location #2, it scans through the weather data but will not allow the closest time information to be assigned that was assigned for location#1. Say location #1 & 2 are taken within 10 minutes at 6pm and 6:10pm, the closest weather time is 6pm. I can't get it to allow the weather data at 6pm as an option. I kind of set it up like this because 200 locations into my location data set (say 3 months into it), I do not want it starting at time 0 from the weather data, when I know that the closest weather data was just calculated for the last location and that happens to be 3 months into that data set too. Below is some sample data and my code. I don't know if this makes sense.
<h6>####Location data</h6>
<p>X Y DateTime <br />
1 2 4/2/2003 18:01:01
3 2 4/4/2003 17:01:33
2 3 4/6/2003 16:03:07
5 6 4/8/2003 15:03:08
3 7 4/10/2003 14:03:06
4 5 4/2/2003 13:02:00
4 5 4/4/2003 12:14:43
4 3 4/6/2003 11:00:56
3 5 4/8/2003 10:02:06</p>
<h2>2 4 4/10/2003 9:02:19</h2>
<p>Weather Data
DateTime WndSp WndDir Hgt
4/2/2003 17:41:00 8.17 102.86 3462.43
4/2/2003 20:00:00 6.70 106.00 17661.00
4/2/2003 10:41:00 6.18 106.00 22000.00
4/2/2003 11:41:00 5.78 106.00 22000.00
4/2/2003 12:41:00 5.48 104.00 22000.00
4/4/2003 17:53:00 7.96 104.29 6541.00
4/4/2003 20:53:00 6.60 106.00 22000.00
4/4/2003 19:41:00 7.82 105.00 7555.00
4/4/2003 7:41:00 6.62 105.00 14767.50
4/4/2003 8:41:00 6.70 106.00 17661.00
4/4/2003 9:41:00 6.60 106.00 22000.00
4/5/2003 20:41:00 7.38 106.67 11156.67
4/6/2003 18:07:00 7.82 105.00 7555.00
4/6/2003 21:53:00 6.18 106.00 22000.00
4/6/2003 21:41:00 6.62 105.00 14767.50
4/6/2003 4:41:00 7.96 104.29 6541.00
4/6/2003 5:41:00 7.82 105.00 7555.00
4/6/2003 6:41:00 7.38 106.67 11156.67
4/8/2003 18:53:00 7.38 106.67 11156.67
4/8/2003 22:53:00 5.78 106.00 22000.00
4/8/2003 1:41:00 5.78 106.00 22000.00
4/8/2003 2:41:00 5.48 104.00 22000.00
4/8/2003 3:41:00 8.17 102.86 3462.43
4/10/2003 19:53:00 6.62 105.00 14767.50
4/10/2003 23:53:00 5.48 104.00 22000.00
4/10/2003 22:41:00 6.70 106.00 17661.00
4/10/2003 23:41:00 6.60 106.00 22000.00
4/10/2003 0:41:00 6.18 106.00 22000.00
4/11/2003 17:41:00 8.17 102.86 3462.43</p>
<h2>4/12/2003 18:41:00 7.96 104.29 6541.0</h2>
.
weathrow = 1
for (i in 1:nrow(SortLoc)) {
t = 0
while (t < 1) {
timedif1 = difftime(SortLoc$DateTime[i], SortWeath$DateTime[weathrow], units="auto")
timedif2 = difftime(SortLoc$DateTime[i], SortWeath$DateTime[weathrow+1], units="auto")
if (timedif2 < 0) {
if (abs(timedif1) < abs(timedif2)) {
SortLoc$WndSp[i]=SortWeath$WndSp[weathrow]
SortLoc$WndDir[i]=SortWeath$WndDir[weathrow]
SortLoc$Hgt[i]=SortWeath$Hgt[weathrow]
} else {
SortLoc$WndSp[i]=SortWeath$WndSp[weathrow+1]
SortLoc$WndDir[i]=SortWeath$WndDir[weathrow+1]
SortLoc$Hgt[i]=SortWeath$Hgt[weathrow+1]
}
t = 1
}
if (abs(SortLoc$DateTime[i] - SortLoc$DateTime[i+1] < 50)) {
weathrow=weathrow
} else {
weathrow = weathrow+1
#if(weathrow = nrow(SortWeath)){t=1}
}
} #end while
}
You could use findInterval
function to find nearest value:
# example data:
x <- rnorm(120000)
y <- rnorm(71000)
y <- sort(y) # second vector must be sorted
id <- findInterval(x, y, all.inside=TRUE) # finds position of last y smaller then x
id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1) # to find nearest
In your case some as.numeric
might be needed.
# assumed that SortWeath is sorted, if not then SortWeath <- SortWeath[order(SortWeath$DateTime),]
x <- as.numeric(SortLoc$DateTime)
y <- as.numeric(SortWeath$DateTime)
id <- findInterval(x, y, all.inside=TRUE)
id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1)
SortLoc$WndSp <- SortWeath$WndSp[id_min]
SortLoc$WndDir <- SortWeath$WndDir[id_min]
SortLoc$Hgt <- SortWeath$Hgt[id_min]
Some addition: you should never, ABSOLUTELY NEWER add values to data.frame
in for-loop. Check this comparison:
N=1000
x <- numeric(N)
X <- data.frame(x=x)
require(rbenchmark)
benchmark(
vector = {for (i in 1:N) x[i]<-1},
data.frame = {for (i in 1:N) X$x[i]<-1}
)
# test replications elapsed relative
# 2 data.frame 100 4.32 22.74
# 1 vector 100 0.19 1.00
data.frame
version is over 20 times slower, and if more rows it contain then difference is bigger.
So if you change you script and first initialize result vectors:
tmp_WndSp <- tmp_WndDir <- tmp_Hg <- rep(NA, nrow(SortLoc))
then update values in loop
tmp_WndSp[i] <- SortWeath$WndSp[weathrow+1]
# and so on...
and at the end (outside the loop) update proper columns:
SortLoc$WndSp <- tmp_WndSp
SortLoc$WndDir <- tmp_WndDir
SortLoc$Hgt <- tmp_Hgt
It should run much faster.
Here's an example of one strategy you might use. This goes through the weather times one-by-one, then takes the absolute value of the difference between that and every location time, then grabs the lowest difference in times. That solves your look ahead/look back problem. Your dataset seems small enough that moving to a half-vectorized solution should be all the speed gain you need, but if not it should be relatively simple to add on a moving window which only passes some +/- N observations around the last loc.match.index.
w <- as.POSIXct(strptime( c("4/2/2003 17:41:00","4/2/2003 20:00:00","4/2/2003 10:41:00","4/2/2003 11:41:00","4/2/2003 12:41:00"),format="%m/%d/%Y %H:%M:%S"))
l <- as.POSIXct(strptime( c("4/2/2003 18:01:01","4/2/2003 17:01:33","4/2/2003 16:03:07","4/2/2003 15:03:08","4/2/2003 14:03:06","4/2/2003 13:02:00"),format="%m/%d/%Y %H:%M:%S"))
window.size <- 5
findClosest <- function(w.i,l) {
which.min(abs(w.i-l))
}
makeWindow <- function(loc.match.index,i,window.size,n) {
win.max <- loc.match.index[i-1] + window.size
if(win.max > n) {
win.max <- n
}
win.min <- loc.match.index[i-1] - window.size
if(win.min < 1) {
win.min <- 1
}
return(seq(win.min,win.max))
}
loc.match.index <- integer()
n <- length(w)
# Initialize on whole vector
i <- 1
loc.match.index[i] <- findClosest(w[i],l)
# Continue on window
for(i in seq(2,n)) {
wndw <- makeWindow(loc.match.index,i,window.size,n)
loc.match.index[i] <- findClosest(w[i],l[wndw])
# Add the start of the window back to the index that was returned
loc.match.index[i] <- loc.match.index[i] + min(wndw)-1
}
> loc.match.index
[1] 1 1 5 5 5
There are still areas that could be made more efficient here, but this should be reasonably speedy given that the comparisons are vectorized and the window can be tuned.
精彩评论