How to correlate two time series with gaps and different time bases?
I have two time series of 3D accelerometer data that have different time bases (clocks started at different times, with some very slight creep during the sampling time), as well as containing many gaps of different size (due to delays associated with writing to separate flash devices).
The accelerometers I'm using are the inexpensive GCDC X250-2. I'm running the accelerometers at their highest gain, so the data has a significant noise floor.
The time series each have about 2 million data points (over an hour at 512 samples/sec), and contain about 500 events of interest, where a typical event spans 100-150 samples (200-300 ms each). Many of these events are affected by data outages during flash writes.
So, the data isn't pristine, and isn't even very pretty. But my eyeball inspection shows it clearly contains the information I'm interested in. (I can post plots, if needed.)
The accelerometers are in similar environments but are only moderately coupled, meaning that I can tell by eye which events match from each accelerometer, but I have been unsuccessful so far doing so in software. Due to physical limitations, the devices are also mounted in different orientations, where the axes don't match, but they are as close to orthogonal as I could make them. So, for example, for 3-axis accelerometers A & B, +Ax maps to -By (up-down), +Az maps to -Bx (left-right), and +Ay maps to -Bz (front-back).
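For concreteness, the example mapping above can be written as a fixed re-ordering and sign flip of B's columns. This is only a sketch: it assumes each recording is an (N, 3) NumPy array with columns x, y, z, and the function name `b_to_a_frame` is made up for illustration.

```python
import numpy as np

def b_to_a_frame(b):
    """Re-express B's samples (columns Bx, By, Bz) in A's axis convention.

    Uses the example mapping from the text: Ax ~ -By, Ay ~ -Bz, Az ~ -Bx.
    """
    b = np.asarray(b, dtype=float)
    bx, by, bz = b[:, 0], b[:, 1], b[:, 2]
    # Output columns are ordered as (Ax, Ay, Az).
    return np.column_stack((-by, -bz, -bx))
```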
My initial goal is to correlate shock events on the vertical axis, though I would eventually like to a) automatically discover the axis mapping, b) correlate activity on the mapped axes, and c) extract behavior differences between the two accelerometers (such as twisting or flexing).
The nature of the time series data makes Python's numpy.correlate() unusable. I've also looked at R's zoo package, but have made no headway with it. I've looked to different fields of signal analysis for help, but I've made no progress.
Anyone have any clues for what I can do, or approaches I should research?
Update 28 Feb 2011: Added some plots here showing examples of the data.
My interpretation of your question: Given two very long, noisy time series, find a shift of one that matches large 'bumps' in one signal to large bumps in the other signal.
My suggestion: interpolate the data so it's uniformly spaced, rectify and smooth the data (assuming the phase of the fast oscillations is uninteresting), and do a one-point-at-a-time cross correlation (assuming a small shift will line up the data).
import numpy
from scipy.ndimage import gaussian_filter
"""
sig1 and sig2 are assumed to be large, 1D numpy arrays
sig1 is sampled at times t1, sig2 is sampled at times t2
t_start, t_end, is your desired sampling interval
t_len is your desired number of measurements
"""
t = numpy.linspace(t_start, t_end, t_len)
sig1 = numpy.interp(t, t1, sig1)
sig2 = numpy.interp(t, t2, sig2)
#Now sig1 and sig2 are sampled at the same points.
"""
Rectify and smooth, so 'peaks' will stand out.
This makes big assumptions about your data;
these assumptions seem true-ish based on your plots.
"""
sigma = 10 #Tune this parameter to get the right smoothing
sig1, sig2 = abs(sig1), abs(sig2)
sig1, sig2 = gaussian_filter(sig1, sigma), gaussian_filter(sig2, sigma)
"""
Now sig1 and sig2 should look smoothly varying, with humps at each 'event'.
Hopefully we can search a small range of shifts to find the maximum of the
cross-correlation. This assumes your data are *nearly* lined up already.
"""
max_xc = 0
best_shift = 0
for shift in range(-10, 10): #Tune this search range
    #Note: numpy.roll wraps around, so samples shifted off one end reappear at the other
    xc = (numpy.roll(sig1, shift) * sig2).sum()
    if xc > max_xc:
        max_xc = xc
        best_shift = shift
print('Best shift:', best_shift)
"""
If best_shift is at the edges of your search range,
you should expand the search range.
"""
If the data contains gaps of unknown sizes that are different in each time series, then I would give up on trying to correlate entire sequences, and instead try cross correlating pairs of short windows on each time series, say overlapping windows twice the length of a typical event (300 samples long). Find potential high cross correlation matches across all possibilities, and then impose a sequential ordering constraint on the potential matches to get sequences of matched windows.
From there you have smaller problems that are easier to analyze.
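A rough sketch of that windowed approach, in case it helps. It assumes both inputs are 1D NumPy arrays that have already been rectified and smoothed into envelopes; the function names, the 150-sample step, and the 0.8 score threshold are placeholders of mine, not anything from the question. Each window of `a` is matched against every position in `b` with a normalized cross-correlation, and the ordering constraint is imposed by keeping the longest chain of matches whose positions in `b` increase.

```python
import numpy as np

def best_match(window, b):
    """Return (start index in b, normalized correlation) of the best match."""
    n = len(window)
    wz = window - window.mean()
    norm = np.linalg.norm(wz)
    if norm == 0:
        return 0, 0.0
    wz = wz / norm
    # wz is zero-mean, so correlating with raw b equals correlating with
    # each mean-removed segment of b.
    num = np.correlate(b, wz, mode="valid")
    # Sliding sum and sum of squares over every length-n segment of b.
    c1 = np.concatenate(([0.0], np.cumsum(b)))
    c2 = np.concatenate(([0.0], np.cumsum(b * b)))
    s1 = c1[n:] - c1[:-n]
    s2 = c2[n:] - c2[:-n]
    den = np.sqrt(np.maximum(s2 - s1 * s1 / n, 1e-12))
    score = num / den
    k = int(np.argmax(score))
    return k, float(score[k])

def match_windows(a, b, win=300, step=150, min_score=0.8):
    """Candidate matches (start_a, start_b, score) for overlapping windows of a."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    candidates = []
    for start in range(0, len(a) - win + 1, step):
        k, score = best_match(a[start:start + win], b)
        if score >= min_score:
            candidates.append((start, k, score))
    return candidates

def enforce_order(candidates):
    """Keep the longest chain whose positions in b strictly increase."""
    n = len(candidates)
    if n == 0:
        return []
    best_len = [1] * n
    prev = [-1] * n
    for i in range(n):
        for j in range(i):
            if candidates[j][1] < candidates[i][1] and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                prev[i] = j
    i = max(range(n), key=best_len.__getitem__)
    chain = []
    while i != -1:
        chain.append(candidates[i])
        i = prev[i]
    return chain[::-1]
```

Searching all of `b` for every window is expensive on 2 million points, so in practice you would probably restrict each search to a neighborhood around the previous match.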
This isn't a technical answer, but it might help you come up with one:
- Convert the plot to an image, and stick it into a decent image program like GIMP or Photoshop
- break the plots into discrete images whenever there's a gap
- put the first series of plots in a horizontal line
- put the second series in a horizontal line right underneath it
- visually identify the first correlated event
- if the two events are not lined up vertically:
  - select whichever instance is further to the left and everything to the right of it on that row
  - drag those things to the right until they line up
This is pretty much how an audio editor works, so if you converted it into a simple audio format like an uncompressed WAV file, you could manipulate it directly in something like Audacity. (It'll sound horrible, of course, but you'll be able to move the data plots around pretty easily.)
Actually, Audacity has a scripting language called Nyquist, too, so if you don't need the program to detect the correlations (or you're at least willing to defer that step for the time being) you could probably use some combination of Audacity's markers and Nyquist to automate the alignment and export the clean data in your format of choice once you tag the correlation points.
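If you want to try the WAV route, something like the following should produce a file an audio editor can open. It is a sketch only: `accel_to_wav` is a made-up helper, it handles one axis at a time, and the default rate of 512 just mirrors the sampling rate in the question. I haven't checked how well Audacity copes with a rate that low, so you may prefer to write the file at a conventional audio rate and keep track of the scaling yourself.

```python
import numpy as np
from scipy.io import wavfile

def accel_to_wav(path, samples, rate=512):
    """Write one accelerometer axis as a 16-bit mono WAV file."""
    samples = np.asarray(samples, dtype=float)
    samples = samples - samples.mean()      # remove the DC offset (e.g. gravity)
    peak = np.max(np.abs(samples))
    if peak > 0:
        samples = samples / peak            # scale to the range [-1, 1]
    wavfile.write(path, rate, (samples * 32767).astype(np.int16))
```

Note that the normalization throws away the absolute amplitude, so this is for lining events up by eye, not for measuring them.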
My guess is, you'll have to manually build an offset table that aligns the "matches" between the series. Below is an example of a way to get those matches. The idea is to shift the data left-right until it lines up and then adjust the scale until it "matches". Give it a try.
library(rpanel)
#Generate the x1 and x2 data
n1 <- rnorm(500)
n2 <- rnorm(200)
x1 <- c(n1, rep(0,100), n2, rep(0,150))
x2 <- c(rep(0,50), 2*n1, rep(0,150), 3*n2, rep(0,50))
#Build the panel function that will draw/update the graph
lvm.draw <- function(panel) {
  plot(x=(1:length(panel$dat3))+panel$off, y=panel$dat3, ylim=panel$dat1, xlab="", ylab="y", main=paste("Alignment Graph Offset = ", panel$off, " Scale = ", panel$sca, sep=""), type="l")
  lines(x=1:length(panel$dat3), y=panel$sca*panel$dat4, col="red")
  grid()
  panel
}
#Build the panel
xlimdat <- c(1, length(x1))
ylimdat <- c(-5, 5)
panel <- rp.control(title = "Eye-Ball-It", dat1=ylimdat, dat2=xlimdat, dat3=x1, dat4=x2, off=100, sca=1.0, size=c(300, 160))
rp.slider(panel, var=off, from=-500, to=500, action=lvm.draw, title="Offset", pos=c(5, 5, 290, 70), showvalue=TRUE)
rp.slider(panel, var=sca, from=0, to=2, action=lvm.draw, title="Scale", pos=c(5, 70, 290, 90), showvalue=TRUE)
It sounds like you want to minimize the function (Ax'+By) + (Az'+Bx) + (Ay'+Bz) over a pair of values: a time offset t0 and a time scale factor tr, where Ax' = tr*(Ax + t0), etc.
I would look into SciPy's bivariate optimize functions. And I would use a mask or temporarily zero the data (both Ax' and By for example) over the "gaps" (assuming the gaps can be programmatically determined).
To make the process more efficient, start with a coarse sampling of A and B, but set the precision in fmin (or whatever optimizer you've selected) to be commensurate with your sampling. Then proceed with progressively finer-sampled windows of the full dataset until your windows are narrow and are not down-sampled.
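Here is a minimal sketch of that idea for a single axis pair, using `scipy.optimize.fmin`. A few things in it are my own assumptions rather than part of the recipe above: I read tr*(t + t0) as a mapping applied to A's time stamps, I minimize the mean of the squared sum (A' + B) so that positive and negative residuals can't cancel, the inputs are assumed to be smoothed so that plain decimation doesn't alias, and the function names and decimation steps are placeholders. Gap masking is left out.

```python
import numpy as np
from scipy.optimize import fmin

def misfit(params, t_a, a, t_b, b):
    """Mean squared value of A' + B, with A's clock mapped to tr * (t + t0)."""
    t0, tr = params
    # Resample A, with its warped time stamps, onto B's time stamps.
    a_on_b = np.interp(t_b, tr * (t_a + t0), a, left=0.0, right=0.0)
    return float(np.mean((a_on_b + b) ** 2))

def fit_clock(t_a, a, t_b, b, guess=(0.0, 1.0)):
    """Refine (t0, tr) with the Nelder-Mead simplex search behind fmin."""
    return fmin(misfit, guess, args=(t_a, a, t_b, b),
                xtol=1e-4, ftol=1e-10, disp=False)

def coarse_to_fine(t_a, a, t_b, b, guess=(0.0, 1.0), steps=(16, 4, 1)):
    """Fit on decimated data first, then reuse the result on finer data."""
    for s in steps:
        guess = fit_clock(t_a[::s], a[::s], t_b[::s], b[::s], guess)
    return guess
```

The simplex search only finds a local minimum, so the starting guess needs to be within roughly one event width of the true offset.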
Edit - matching axes
Regarding the issue of trying to identify which axis is co-linear with a given axis, and not knowing a thing about the characteristics of your data, I can point towards a similar question. Look into pHash or any of the other methods outlined in this post to help identify similar waveforms.
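A simpler thing to try first (this is plain correlation, not pHash) is to correlate every axis of A against every axis of B once the two recordings are time-aligned, and pick the strongest match and its sign for each axis. The sketch below assumes both inputs are aligned (N, 3) NumPy arrays on the same time grid; `guess_axis_map` is a made-up name, and with only moderately coupled sensors the correlations may be too weak for this to be reliable.

```python
import numpy as np

def guess_axis_map(a, b):
    """For each column of a, return (best-matching column of b, sign)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    az = (a - a.mean(axis=0)) / a.std(axis=0)
    bz = (b - b.mean(axis=0)) / b.std(axis=0)
    # corr[i, j] is the correlation coefficient of A axis i with B axis j.
    corr = az.T @ bz / len(a)
    mapping = []
    for i in range(a.shape[1]):
        j = int(np.argmax(np.abs(corr[i])))
        mapping.append((j, 1 if corr[i, j] > 0 else -1))
    return mapping
```

Nothing here stops two A axes from picking the same B axis; if that happens, treat the result as inconclusive.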