MongoDB as a Time Series Database
I'm trying to use MongoDB as a time series database and was wondering if anyone could suggest how best to set it up for that scenario.
The time series data is very similar to a stock price history. I have a collection of data from a variety of sensors taken from different machines. There are values at billions of timestamps, and I would like to ask the following questions (preferably from the database rather than at the application level):
For a given set of sensors and time interval, I want all the timestamps and sensor values that lie within that interval, in order by time. Assume all the sensors share the same timestamps (they were all sampled at the same time).
For a given set of sensors and time interval, I want every kth item (timestamp and corresponding sensor values) within the given interval, in order by time.
Any recommendation on how to best set this up and achieve the queries?
Thanks for the suggestions.
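For concreteness, the two queries I have in mind would look something like this against a naive one-document-per-reading layout (the collection name, field names, and the seq counter below are just placeholders, not something I've settled on):
// Assumed schema: { timestamp: ISODate(...), sensor: "temp_01", value: 23.4, seq: 12345 }
// Index to support per-sensor range queries on time
db.readings.createIndex({ sensor: 1, timestamp: 1 })
// Query 1: all readings for a set of sensors within an interval, ordered by time
db.readings.find({
    sensor: { $in: ["temp_01", "pressure_02"] },
    timestamp: { $gte: ISODate("2013-10-10T00:00:00Z"), $lt: ISODate("2013-10-11T00:00:00Z") }
}).sort({ timestamp: 1 })
// Query 2: every kth reading in the interval, assuming each reading carries a
// monotonically increasing sample counter `seq`
var k = 10;
db.readings.find({
    sensor: { $in: ["temp_01", "pressure_02"] },
    timestamp: { $gte: ISODate("2013-10-10T00:00:00Z"), $lt: ISODate("2013-10-11T00:00:00Z") },
    seq: { $mod: [k, 0] }
}).sort({ timestamp: 1 })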
Obviously this is an old question, but I came across it when I was researching MongoDB for time series data. I thought it might be worth sharing the following approach for allocating complete documents in advance and performing update operations, as opposed to new insert operations. Note, this approach was documented here and here.
Imagine you are storing data every minute. Consider the following document structure:
{
    timestamp: ISODate("2013-10-10T23:06:00.000Z"),
    type: "spot_EURUSD",
    value: 1.2345
},
{
    timestamp: ISODate("2013-10-10T23:07:00.000Z"),
    type: "spot_EURUSD",
    value: 1.2346
}
This is comparable to a standard relational approach. In this case, you produce one document per value recorded, which causes a lot of insert operations. We can do better. Consider the following:
{
    timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
    type: "spot_EURUSD",
    values: {
        0: 1.2345,
        …
        37: 1.2346,
        38: 1.2347,
        …
        59: 1.2343
    }
}
Now we can write one document and perform 59 updates. This is much better because updates are atomic, individual writes are smaller, and there are other performance and concurrency benefits. But what if we wanted to store an entire day in one document rather than just a single hour? With a flat structure we would then have to walk through up to 1440 entries to get to the last value. To improve on this, we can nest one level further:
{
    timestamp_day: ISODate("2013-10-10T00:00:00.000Z"),
    type: "spot_EURUSD",
    values: {
        0: { 0: 1.2343, 1: 1.2343, …, 59: 1.2343 },
        1: { 0: 1.2343, 1: 1.2343, …, 59: 1.2343 },
        …,
        22: { 0: 1.2343, 1: 1.2343, …, 59: 1.2343 },
        23: { 0: 1.2343, 1: 1.2343, …, 59: 1.2343 }
    }
}
Using this nested approach, we now only have to walk over at most 24 + 60 entries to reach the very last value of the day.
If we build the documents in advance, with all the values filled in with padding, we can be sure that the document will not change size and therefore will not be moved.
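As a rough sketch of the pre-allocation and in-place-update pattern (the collection name and padding value are arbitrary; the document shape follows the day-level example above):
// Pre-allocate a padded document for the whole day so later updates never grow it
function preallocateDay(day, type) {
    var values = {};
    for (var h = 0; h < 24; h++) {
        values[h] = {};
        for (var m = 0; m < 60; m++) {
            values[h][m] = 0.0;   // padding; overwritten when the real sample arrives
        }
    }
    db.prices.insert({ timestamp_day: day, type: type, values: values });
}
preallocateDay(ISODate("2013-10-10T00:00:00Z"), "spot_EURUSD");
// Record the value for 23:38 as an in-place update (same-size field, so no document move)
db.prices.update(
    { timestamp_day: ISODate("2013-10-10T00:00:00Z"), type: "spot_EURUSD" },
    { $set: { "values.23.38": 1.2347 } }
)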
If you don't need to keep the data forever (i.e. you don't mind it 'ageing out'), you may want to consider a capped collection. Capped collections have a number of restrictions that in turn provide some interesting benefits, which sound like they fit what you want quite well.
Basically, a capped collection has a specified size, and documents are written to it in insertion order until it fills up, at which point it wraps around and begins overwriting the oldest documents with the newest. You are slightly limited in what updates you can perform on the documents in a capped collection: you cannot perform an update that changes the size of a document (as this would mean it would need to be moved on disk to find the extra space). I can't see this being a problem for what you describe.
The upshot is that you are guaranteed that the data in your capped collection will be written to, and will stay on, disk in insertion order, which makes queries on insertion order very fast.
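Creating one is a one-liner; the collection name and sizes here are just placeholders:
// Create a capped collection of roughly 1 GB; once full, the oldest documents
// are overwritten in insertion order. `max` additionally caps the document count.
db.createCollection("sensorReadings", { capped: true, size: 1024 * 1024 * 1024, max: 100000000 })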
How different are the sensors and the data they produce, by the way? If they're relatively similar I would suggest storing them all in the same collection for ease of use - otherwise split them up.
Assuming you use a single collection, both of your queries sound very doable. One thing to bear in mind is that to get the benefit of the capped collection you need to query according to the collection's 'natural' order, so querying by your timestamp key will not be as fast. If the readings are taken at regular intervals (so you know how many of them fall in a given time interval), I would suggest something like the following for query 1:
db.myCollection.find().limit(100000).sort({ $natural : -1 })
Assuming, for example, that you store 100 readings a second, the above will return the last 100 seconds' worth of data. If you wanted the previous 100 seconds, you could add .skip(100000).
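That is:
db.myCollection.find().limit(100000).skip(100000).sort({ $natural : -1 })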
For your second query, it sounds to me like you'll need MapReduce, but it doesn't sound particularly difficult. You can select the range of documents you're interested in with a query similar to the one above, then pick out only the ones at the intervals you're interested in with the map function.
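For example, something along these lines, assuming the readings land on exact one-second boundaries (the collection name, interval, and k are placeholders):
// Emit only readings whose timestamp falls on a k * sampleInterval boundary
var k = 10;                    // keep every 10th reading
var sampleIntervalMs = 1000;   // readings are one second apart in this sketch
var mapFn = function () {
    if (this.timestamp.getTime() % step === 0) {
        emit(this.timestamp, this.value);
    }
};
var reduceFn = function (key, values) { return values[0]; };  // keys are unique, so this is a no-op
db.myCollection.mapReduce(mapFn, reduceFn, {
    query: { timestamp: { $gte: ISODate("2013-10-10T00:00:00Z"),
                          $lt:  ISODate("2013-10-10T01:00:00Z") } },
    scope: { step: k * sampleIntervalMs },
    out: { inline: 1 }
})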
Here are the MongoDB docs on capped collections: http://www.mongodb.org/display/DOCS/Capped+Collections
Hope this helps!
I know that this is an old question, but I found these blogs that helped me a lot:
- Time series data & MongoDB | Introduction
- Schema design best practices
- Querying, analyzing & presenting time series data