Optimizing a bin-placement algorithm
Alright, I've got two collections, and I need to place elements from collection1 into the bins (elements) of collection2, based on whether their value falls within a given bin's range.
For a concrete example, assume I have a sorted collection objects (bins) which have an int range ([1...4], [5..10], etc). I need to determine the range an int falls in, and place it in the appropriate bin.
foreach(element n in collection1) {
foreach(bin m in collection2) {
if (m.inRange(n)) {
m.add(n);
break;
}
}
}
So the obvious NxM complexity algorithm is there, but I really would like to see Nxlog(M). To do this I'd like to use BinarySearch in place of the inner foreach loop. To use BinarySearch, I need to implement an IComparer class to do the searching for me. The problem I'm running into is this approach would require me to make an IComparer.Compare function that compares two different types of objects (an element to its bin), and that doesn't seem possible or correct. So I'm asking, how should I wri开发者_高级运维te this algorithm?
Let's restate the problem. You wish to write
foreach(int item in items)
bins[GetBinIndex(item)].Add(item);
such that the performance of GetBinIndex is better than O(n) in the number of bins.
It depends on the topology of the bins.
If the bins are simply contiguous non-negative integer ranges, say, 0..4, 5..9, 10..14, and so on, then just divide item by 5, done. That's O(1).
If the bins are contiguous integer ranges of different sizes, say, 0..4, 5..14, 15..16, 17..17, 18..32, ... then:
- Make a
List<int>
that stores the top of each range. So in this case, {4, 14, 16, 17, 32, ...} - BinarySearch the list for the item.
- if the result is a non-negative number then that is the index of the bin, and you have an item that is at the top of its bin.
- if the result is a negative number then that is the complement of the best bin whose top element is larger than the item. Take the complement of the index, and that's the bin.
This is O(lg n) to search, and O(n) to build the list in the first place.
If the bins are noncontiguous integer ranges -- that is, if the ranges have holes, or if they overlap -- then the data structure you want to build to efficiently find the best range is an interval tree. Interval trees are typically O(lg n) to search in nonpathological situations, and O(n lg n) to build the tree in the first place.
I'm not sure I understand the question completely because I didn't really get this part:
The problem I'm running into is this approach would require me to make an IComparer.Compare function that compares two different types of objects (an element to its bin)
Nevertheless, I'll try to do my best:
IComparer is used to sort the collection so that you can perform a binary search. Take a look at the MSDN article: http://msdn.microsoft.com/en-us/library/system.collections.icomparer.aspx
So basically, you want to make sure that you first sort Collection2 using your IComparer which basically just sorts the Bins from lowest to highest range. Judging by the fact that you're doing a break inside the second foreach, it seems you don't have any overlap so that shouldn't be an issue.
You're not going to use the Array.BinarySearch (http://msdn.microsoft.com/en-us/library/system.array.binarysearch.aspx) method because it searches for a specific object in the array (maybe this is what you were referring to with that quote above?), but you can implement your own binary search without much difficulty.
Binary search will only work if the elements in Bin2 are sorted. So, change the Bin2 collection to a sorted collection (array, for example). Sort it in m*logm
time, and then use binary search to place the new items in logm
time. All in all: m*logm + n*logm
.
This can be optimized further - but it's a start.
IF (that's a big if) your bins have computable upper and lower indices then your problem translates into the relatively easy, and efficient, one of devising a hashing algorithm and running through your collection-of-items-to-be-binned once. And if your bins don't have computable indices, why not transform your problem so that they do have ?
FURTHER to OP's comment:
It's not so much whether your bins have fixed bounds as whether there is a rule to compute the bounds given the bin number. So, if your bins had bounds 1..5, 6..10, 11..15, etc, the rule is that
bin_bounds = (bin_number-1)*5+1..(bin_number*5)
The hashing function is simply the inverse of this function, ie given an integer, compute the index number of the bin.
BUT if the bounds on your bins are essentially arbitrary it will be essentially impossible to find such a hash function. In my experience it's relatively unusual for bins to be of arbitrary sizes. Of course, I don't know your problem in any detail so all of this may not help you a jot.
精彩评论