开发者

Filtering a sub set of (potentially) 1.000.000+ items

I have a large dataset with possibly over a million entries. All items have an assigned time stamp and items are added to the set at runtime (usually, but not always, with a newer time stamp). I need to show a sub set of this data given a certain time range. This time range is usually quite small compared to the total data set, i.e. of the 1.000.000+ items not more than about 1000 are in that given time range. This time range moves at a constant pace, e.g. every second the time range is moved by one second. Addi开发者_JAVA百科tionally, the user may adjust the time range at any time ("move" through the data set) or set additional filters (e.g. filter by some text).

So far I wasn't worried about performance, trying to get the other things right, and only worked with smaller test sets. I am not quite sure how to tackle this problem efficiently and would be glad for every input. Thanks.

Edit: Used language is C# 4.

Update: I am now using a interval tree, implementation can be found here: https://github.com/mbuchetics/RangeTree

It also comes with an asynchronous version which rebuilds the tree using the Task Parallel Library (TPL).


We had similar problem in our development - had to collect several million items sorted by some key and then export one page on demand from it. I see that your problem is somehow similar.

For the purpose, we adapted the red-black tree structure, in the following ways:

  • we added the iterator to it, so we could get 'next' item in o(1)
  • we added finding the iterator from the 'index', and managed to do that in O(log n)

RB Tree has O(log n) insertion complexity, so I guess that your insertions will fit in there nicely.

next() on the iterator was implemented by adding and maintaining the linked list of all leaf nodes - our original adopted RB Tree implementation didn't include this.

RB Tree is also cool because it allows you to fine-tune the node size according to your needs. By experimenting you'll be able to figure right numbers that just fit your problem.


Use SortedList sorted by timestamp.

All you have to is to have a implement a binary search on the sorted keys inside the sorted list to find the boundary of your selection which is pretty easy.


Insert new items into a sorted list. This would let you select a range pretty easily. You could potentially use linq as well if you're familiar with it.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜