Data structure for large ranges of consecutive integers?
Suppose you have a large range of consecutive integers in memory, each of which belongs to exactly one category. Two operations must be O(log n): moving a range from one category to another, and finding the category counts for a given range.
I'm pretty sure the second operation is solved trivially given a correct implementation for the first operation.
Each integer begins in a category, so I started with a set of balanced BSTs. Moving a subtree from one BST to another (eg, moving a range to a different cat开发者_如何学Goegory) has a runtime equivalent to merging two BSTs, which is O(n1 * n2)[1].
This is too slow (in python, and C is not an option), and I couldn't figure out a way to take advantage of my data's inherent structure to create an efficient BST merge operation.
I'm now looking at AVL, red-black, and interval trees, binary heaps, and treaps. Comparing their properties is overwhelming. Which structure should I use?
Edit (problem clarification): I am flexible on how I store these values and create my data structures. I am inflexible about how I receive my input, which comes from another application, and looks like the following: CATEGORY(cat3, I, J)
. My current solution creates a tree with a node for each integer in the range. This is too slow for the size of my dataset, so I'm happy to re-architect if given a better method.
Any given request can move any possible range of integers into any category. In other words, ranges are overlapping in the sense of CATEGORY(cat1, 1, 10)
followed by CATEGORY(cat3, 5, 15)
, but non-overlapping in the sense that every integer will be in exactly one category at any given time.
As far as I understood the question you have a range [A, B] and queries of the form -
- Assign a particular range to a category
E.g. R1 R2 C1 R3 R4 C2
- Query a range for the total number of items in particular categories. E.g. find count of categories in R1 R4
A simple implementation using dictionaries as given above would not work as I describe by this example -
suppose we have a range [1000, 5000]
and we make assignment as follows -
1 2 C1 2 3 C2 3 4 C3 ...... 4999 5000 C4999
Now we make the following assignment
1 5000 C5555
This will make updation/changes/deletion to previously assigned ranges of the ranges O(N) where N is maximum size of range (B - A)
D['category'] = set(of_all_the_ranges_you_have_in_category)
In this case deletion of previous ranges from corresponding sets in categories C1,C2...C4999 will be needed for last assignment ( 1 5000 C5555 )
OR
1 : { "stop" : 5, "category" : "C1"}, 6 : { "stop" : 19, "category" : "C23"},
Here updation of category for each starting value (1,2,3,4...,4999) will be required for last assignment ( 1 5000 C5555 )
A better option that will make updation of ranges in O(lg n) would be segment trees (http://en.wikipedia.org/wiki/Segment_tree )
For the above example the segment tree will look something like this
1000:5000
|
---------------------
| |
1000:3000 3001:5000
| |
---------------- --------------------
| | | |
1000:2000 2001:3000 3001:4000 4001:5000
................................................................. ............................................................... and so on
The leaf nodes will have ranges (1:2, 2:3,...)
You can assign a value "category" to each node and given an interval traverse the tree dividing the interval appropriately ( e.g for 2500 to 4500 divide in 2500:3000 and 3001:4500 and proceed in both directions until nodes with matching ranges are found).
Now an interesting thing is update the children of the nodes when you need them. For example do not proceed updating the children immediately when doing assignments like 1 5000 C5555. This thing is called lazy propagation and you can learn more about it here (http://www.spoj.pl/forum/viewtopic.php?f=27&t=8296 ).
Now for query part. If the number of categories are very small, you can maintain a frequency table at each node and update the ranges when required and lazy propagate when needed otherwise, you will have to traverse the whole tree from leaf to node, counting and complexity will become O(n).
I think a better solution to querying may exist. It's not coming to my mind.
UPDATE Lets take a small example.
Range [1,8]
Categories allowed {C1, C2}
1:8
1:4 5:8
1:2 3:4 5:6 7:8
1:1 2:2 3:3 4:4 5:5 6:6 7:7 8:8
Each node will have 3 fields [category, category_counts[], children_update_required = false]
1 5 C1
Query will get divided and nodes 1:4's category will be set to C1 and children_update_required will be set to true, its children will not be updated now (remember update only when required or lazy propagation). Also node 5:5's category will be set to C1
3 4 C2
Query will propagate along the tree towards 3:4 (and in the process to reach 3:4, 1:2 and 3:4's categories will be updated to C1, 1:4's children_update_required will be set to false, 1:2's and 3:4's children_update_required will be set to true) and now will update the category of 3:4 to C2 according to current requirement. Next it will set children_update required of 3:4 to be true for future updation of its children (which is already set in this case..no harm done).
You can build a tree structure on arrays of consecutive integers quite easily, which should help with your constant factor. First renumber the sequence to start at 0, and work out which is the smallest power of two that is larger than the range of the sequence.
Now consider the following tree formed from the integers 0-7, which I can hold as four arrays, each array going horizontally.
(all)
0-3 4-7
0-1 2-3 4-5 6-7
0 1 2 3 4 5 6 7
Given a number and a level, I can find out an offset in the array for that level, just by shifting the number right an amount depending on the level.
In each element I can put a marker that either says "mixed" or gives the category for every element at or under that node of the tree. I can work out what category a node is in by following the path down from the root of the tree to a leaf, stopping as soon as I see a node that does not say "mixed". I can change the category for an interval of numbers in time lg n, because I have at most two nodes at each level to represent the category: if I had three, two of them would have the same parent and I could merge them. You may have to fiddle a bit with the edges to get neighbouring ranges correct, but I think this works in lg n time.
Assumptions:
- Any integer can belong to exactly one category, so ranges cannot intersect.
- All integers in an incoming range belong to one category.
- Sometimes you need to split a range to move a subrange to a different category.
Represent ranges by (start, end, category)
tuples. Ranges don't intersect so you can build a tree of them. It's far more economical than a tree of individual integers. To order ranges (that is, nodes) you can just use the start value, for it's unique and cannot belong to another range.
If you have to move a range [a, b]
to another category, you have to:
Scan your tree and update every node/range that is fully included in [a, b]
range. It's a simple depth-first traversal. During traversal
- If
current_node.start < a or current_node.start > b
, stop the search. - If
current_node.start >= a and current_node.end > b
, you have to splitcurrent_node
in two;[current_node.start, b]
will belong to a new category, the rest will be of its original category. - If
current_node.start < a and current_node.end <= b
, you split the other way around.
Tree search is logarithmic, and node splits are O(1).
If your tree ever gets too fragmented, you could consider merging nodes that have adjacent ranges. This will always be a parent and a child resulting either from an insert or a split; checks and joins seem to be always O(1).
We can represent the current states as something like:
0:cat1 200:cat2 500: cat0 700:cat6 800:cat1
This means 0-200 is cat1, 200-500 is cat2, etc. We store the values in a binary search tree keyed on the starting number. Each node will also contain a dictionary mapping categories to counts for all children of that node.
Those dictionaries should make it easy to obtain the counts in O(log) time. We simply have to add the correct numbers on the paths to the start and stop of the sequence in question.
What do we do when we need set the sequence X-Y to category Z?
- Determine the current category of Y O(logn )
- Remove all values between X -Y O(k), but since the cost of inserting those nodes is more expensive, we can call it O(1) amortized.
- Insert new node at X O(log n)
- Update category count dictionaries. We should only need to update the O(log n) parents of the affected sections so this is O(log n)
Altogether this is O(log n) time.
You can have a plain python dictionary of the following form
1 : { "stop" : 5, "category" : "C1"},
6 : { "stop" : 19, "category" : "C23"},
etc
The keys here are the start of your range and the values contain the end of the range and the category this range falls into.
Because dictionaries have constant time for accessing items, you can write some code that moves a range to another category easily and efficiently: In the worst case, you'll have to restructure somehow your dictionary, if your range splits the previous ranges in two or more. For example, if you want to assign the range of (4, 8) into another category, you'll end up with:
1 : { "stop" : 3, "category" : "C1"},
4 : { "stop" : 8, "category" : "new category"},
9 : { "stop" : 19, "category" : "C23"},
etc
Finding the category count is trivial, just collect all your desired ranges in constant time and count the categories..
EDIT: To successfully find the lowest (highest) key to start performing computations/alterations, you also need a plain python list with all keys sorted, and the bisect module. This will help locating the index in the list to "put" the start of the range in O(logn) time, then everything can be done in constant time, except of the O(n) time needed to insert the new key into the list with bisect.insort_left.
精彩评论