
Best NoSQL for querying date ranges?

Given a store which is a collection of JSON documents in the (approximate) form of:

{
    PeriodStart: 18/04/2011 17:10:49,
    PeriodEnd: 18/04/2011 17:15:54,
    Count: 12902,
    Max: 23041,
    Min: 0,
    Mean: 102.86,
    StdDev: 560.97
},
{
    PeriodStart: 18/04/2011 17:15:49,
    PeriodEnd: 18/04/2011 17:20:54,
    Count: 10000,
    Max: 23041,
    Min: 0,
    Mean: 102.86,
    StdDev: 560.97
}
... etc

If I want to query the collection for a given date range (say, all documents from the last 24 hours), which store would give me the easiest query operations for this?

To further elaborate on requirements:

  • It's for an application monitoring service, so strict CAP/ACID isn't necessarily required
  • Performance isn't a primary consideration either. Reads/writes would be at most tens per second, which an RDBMS could handle anyway
  • The ability to handle a changing document schema would be desirable
  • Ease of querying lists/sets is important (ad-hoc query support is an advantage)


I may not have your query requirements down exactly, as you didn't specify. However, if you need to find any documents that start or end in a particular range, then you can apply most of what is written below. If that isn't quite what you're after, I can be more helpful with a bit more direction. :)

If you use CouchDB, you can create your indexes by splitting the parts of your date into an array ([year, month, day, hour, minute, second, ...]).

Your map function would probably look similar to:

function (doc) {
    // Parse the stored date and emit its components as the view key.
    var date = new Date(doc.PeriodStart);
    // getMonth() is zero-based, so add 1 to get calendar months.
    emit([date.getFullYear(), date.getMonth() + 1, date.getDate(), date.getHours(), date.getMinutes()], null);
}

To perform any sort of range query, you'd need to convert your start and end times into this same array structure. From there, your view query takes two params, startkey and endkey, which receive the array forms of your start and end times respectively.

So, to find the documents that started in the past 24 hours, you would send a querystring like this in addition to the full URI for the view itself:

// start: Apr 17, 2011 12:30pm ("24 hours ago")
// end:   Apr 18, 2011 12:30pm ("today")
startkey=[2011,4,17,12,30]&endkey=[2011,4,18,12,30]

(Leading zeros are dropped here because the keys are parsed as JSON, where 04 is not a valid number.)

Or if you want everything from this current year:

startkey=[2011]&endkey=[2011,{}]

Note the {}. When used in an endkey, [2011,{}] collates identically to [2012], so either format will work.

The extra components of the array will simply be ignored, but the more specificity you add to your arrays, the narrower your range can be. Adding reduce functions can be really powerful here, if you add in the group_level parameter, but that's beyond the scope of your question.
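That said, a quick illustration of the rollup idea: assuming the map above is saved in a design document named _design/stats as a view called by_start with the built-in "_count" reduce (the database and design document names here are made up), you could count documents per day like this in Python with the requests library:

import requests

# Hypothetical database and design document names.
url = 'http://localhost:5984/mydb/_design/stats/_view/by_start'

# group_level=3 groups rows by the first three key components: [year, month, day].
resp = requests.get(url, params={'group_level': 3})
for row in resp.json()['rows']:
    print(row['key'], row['value'])  # e.g. [2011, 4, 18] 42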


[Update edited to match edit to original question]

Short answer: (almost) any of them will work.

BigTable databases are a great platform for monitoring services (log analysis, etc). I prefer Cassandra (Super Column Families, secondary indexes, atomic increment coming soon), but HBase will work for you too. Structure the date value so that its lexicographic ordering is the same as the date ordering. Fixed-length strings following the format "YYYYMMDDHHmmss" work nicely for this. If you use this string as your key, range queries will be very simple to perform.
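As a small sketch of that key scheme (plain Python, standard library only; the helper name is made up):

from datetime import datetime, timedelta

def row_key(dt):
    # Fixed-width digits mean lexicographic order == chronological order.
    return dt.strftime('%Y%m%d%H%M%S')

now = datetime(2011, 4, 18, 12, 30)
start = row_key(now - timedelta(hours=24))  # '20110417123000'
end = row_key(now)                          # '20110418123000'

# Any row whose key falls between start and end, compared as plain
# strings, is inside the 24-hour window.
print(start <= '20110418090000' <= end)  # True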

Handling a changing schema is a breeze: just add more columns to the column family. They don't need to be defined ahead of time.

I probably wouldn't use graph databases for this problem, as it would likely amount to traversing a linked list. However, I don't have a ton of experience with graph databases, so take this advice with a grain of salt.

[Update: some of this is moot since the question was edited, but I'm keeping it for posterity]
Is this all you're doing with this database? The big problem with selecting a NoSQL database isn't finding one that supports one query requirement well. The problem is finding one that supports all of your query requirements well. Also, what are your operational requirements? Can you accept a single point of failure? What kind of setup/maintenance overhead are you willing to tolerate? Can you sacrifice low latency for high-throughput batch operations, or is realtime your gig?

Hope this helps!


It seems to me that the easiest way to implement what you want is to perform a range query in a search engine like ElasticSearch.

I, for one, certainly would not want to write all the map/reduce code for CouchDB (because I have in the past). Also, in my experience (YMMV), range queries will outperform CouchDB's views and use far fewer resources on large datasets.

Not to mention you can compute interesting statistics with "date histogram" facets in ElasticSearch.

ElasticSearch is schema-free, JSON based, so you should be able to evaluate it for your case in a very short time.
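As a rough sketch of such a range query against the ElasticSearch REST API (the index name and date format are assumptions here; adjust them to your mapping):

import json
import requests

query = {
    'query': {
        'range': {
            'PeriodStart': {
                'gte': '2011-04-17T12:30:00',  # 24 hours ago
                'lte': '2011-04-18T12:30:00',  # now
            }
        }
    }
}

# Hypothetical index name "metrics".
resp = requests.post('http://localhost:9200/metrics/_search',
                     data=json.dumps(query),
                     headers={'Content-Type': 'application/json'})
for hit in resp.json()['hits']['hits']:
    print(hit['_source'])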


I've decided to go with Mongo for the time being.

I found that setup/deployment was relatively easy, and the C# wrapper was adequate for what we're trying to do (and in the cases where it's not, we can resort to JavaScript queries easily).
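For reference, the date-range query itself is a one-liner in most drivers. A sketch with pymongo (the database and collection names are made up, and this assumes PeriodStart is stored as a native BSON date rather than a string):

from datetime import datetime, timedelta
from pymongo import MongoClient

# Hypothetical database/collection names.
db = MongoClient()['monitoring']
since = datetime.utcnow() - timedelta(hours=24)

# Find every document whose period started in the last 24 hours.
for doc in db.stats.find({'PeriodStart': {'$gte': since}}):
    print(doc)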


What you want is whichever one gives you access to some kind of spatial index. Most of these work off of B-Trees and/or hashes, neither of which is particularly good for spatial indexing.

Now, if your definition of "last 24 hours" is simply "starts or ends within the last 24 hours", then a B-Tree may be fine (you do two queries, one on PeriodStart and one on PeriodEnd, each constrained to the time window).

But if the PeriodStart-to-PeriodEnd span is longer than the time window, then neither of these will be much help to you.

Either way, that's what you're looking for.
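For completeness, the general interval-overlap test, which also covers periods longer than the window, is just two comparisons (sketched in Python):

def overlaps(period_start, period_end, window_start, window_end):
    # Two intervals overlap iff each one starts before the other ends.
    return period_start <= window_end and period_end >= window_start

# A long period that spans the whole window still matches:
print(overlaps(1, 100, 40, 50))  # True

The catch is that this predicate constrains two different fields at once, which is exactly what a single B-Tree index handles poorly.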


This question explains how to query a date range in CouchDB. In all the examples I've seen, you would need your data to be in a lexicographically sortable form.


Since this is tagged Redis and nobody has answered that aspect, I'm going to put forth a solution for it.

Step one: store each document under its own Redis key, as a hash or perhaps as a JSON string. Step two: add that key (let's call it a DocID) to a sorted set, with the document's timestamp converted to a UNIX timestamp as the score. For example, where r is a Redis connection instance in the Python redis client library:

mydocs:Doc12 => [JSON string of the doc]

In Python: r.set('mydocs:Doc12', JSONStringOfDocument)

timeindex:documents => sorted set of (DocID, timestamp) pairs:

In Python: r.zadd('timeindex:documents', {'Doc12': timestamp}) (redis-py 3.x expects a mapping of member to score; older 2.x versions took positional arguments)

In effect you are building an index of documents based on UNIX timestamps.

To get documents from a window of time, you use zrangebyscore (or zrevrangebyscore if you want the order reversed), which selects members by score, i.e. by timestamp, to get the list of Document IDs in that window. Then you can retrieve the documents from the db as normal. Sorted sets are pretty fast in Redis. Further advantages are that you can do set operations such as "documents in this window but not that window", and even store the results in Redis automatically for later use.

One example of how this would be useful: in your example documents you have a start and an end time. If you indexed each as above, you could get the intersection of the set of documents that start in a given range and the set of documents that end in a given range, and store the resulting set in a new key for later re-use. This would be done via zinterstore.
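Putting the pieces together, a minimal end-to-end sketch (Python with redis-py 3.x; the key names mirror the examples above and are otherwise arbitrary):

import json
import redis

r = redis.Redis()

def store_doc(doc_id, doc, start_ts, end_ts):
    # The document body, as a JSON string under its own key.
    r.set('mydocs:%s' % doc_id, json.dumps(doc))
    # Two time indexes: sorted sets scored by UNIX timestamp.
    r.zadd('timeindex:starts', {doc_id: start_ts})
    r.zadd('timeindex:ends', {doc_id: end_ts})

def docs_started_between(window_start, window_end):
    # Select DocIDs whose start timestamp falls inside the window.
    ids = r.zrangebyscore('timeindex:starts', window_start, window_end)
    return [json.loads(r.get('mydocs:%s' % i.decode())) for i in ids]

# Intersect both indexes and keep the result for later re-use:
r.zinterstore('timeindex:both', ['timeindex:starts', 'timeindex:ends'])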

Hopefully, that helps someone using Redis for this.


MongoDB is also a strong option for queries like this; I find it useful because it has a rich set of query features. I use MongoDB for GPS distance calculations, text search, and the aggregation pipeline.
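As a taste of that aggregation pipeline on documents shaped like the question's (pymongo again; the names are illustrative and PeriodStart is assumed to be a BSON date):

from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient()['monitoring']
since = datetime.utcnow() - timedelta(hours=24)

pipeline = [
    {'$match': {'PeriodStart': {'$gte': since}}},
    {'$group': {
        '_id': {'$hour': '$PeriodStart'},   # bucket by hour of day
        'totalCount': {'$sum': '$Count'},
        'avgMean': {'$avg': '$Mean'},
    }},
]
for row in db.stats.aggregate(pipeline):
    print(row)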
