Efficient algorithm for merging small overlaping blocks into larger contiguous blocks?
i am faced with a rather interesting开发者_如何学JAVA problem. I have a (fairly large) number of blocks. A block is simply something starting at an offset and having a length and a color. The offset and length are limited - the space where these blocks exist is <0-N> where N ranges from hundreds of thousands to a few million. An invalid block is any block with offset larger than N or a sum of offset and length larger than N. There are about 16 different colors a block might have (just one of them).
There might be thousands of blocks and there are always situations like this:
block_X: off: 100, len: 50, color: blue
block_Y: off: 148, len: 50, color: blue block_Z: off: 200, len: 30, color: redAs you can see, the X and Y blocks can be connected into one single larger block, resulting in this:
block_XY: off: 100, len 98, color: blue
block_Z: off 200, len 30, color: redI would like to create an algorithm that would go through all the blocks and connect those that overlap and have the same color. Actually, if the gap between the blocks is reasonably small (a constant i could choose, say about 16 or so, but may be any number...) then i would like to connect those blocks anyway. Note that in the above example, there are just two blocks that get connected. In reality, the sequence of blocks that can be connected is most of the time much longer.
There is also an interesting twist, consider this:
block_A: off: 100, len: 200, color: blue
block_B: off: 200, len: 200, color: blue block_C: off: 300, len: 150, color: red block_D: off: 400, len: 200, color: blue block_E: off: 500, len: 200, color: blueAs you can see, we have a nice sequence of blue blocks that can be merged into one large blue block. However, there is also a small red block in the middle of it. This should not fool the algorithm, the correct result is:
block_ABDE: off: 100, len: 600, color: blue
block_C: off: 300, len: 150, color: redThe data are stored in a std::multimap where the key is the offset and the value is a structure containing length and color of the block, something like: multimap<uint32_t,pair<uint32_t, uint8_t> >
. Note that there might be blocks starting at the same offset. It is guaranteed however, that if such thing happens, the blocks starting at the same offset have different colors AND different lengths.
Can anyone come up with a nice idea how to solve this problem efficiently? The goal of the algorithm is to reduce the number of blocks we have, but it is NOT required to reduce it to the smallest amount possible.
I would suggest the following:
- Create a separate list for each of the colors
- For each specific color, calculate the beginning (offset) and the end (offset + length) of all blocks
- Sort each color-specific list by value of beginning
- Now, traverse each of the lists, merging the items: if the next item's beginning is less or equal than the previous item's end (plus the allowed gap), remove the next item and extend the previous one.
Separate the list of blocks into a list for each color, and sort all of those lists by the starting offset of the block. (Actually, you'll probably want to do an insertion sort as you're filtering based on color, or sort by offset and then do a stable sort by color and work with partitions of your array.)
Once you have the per-color lists, you'll iterate over each one: Starting with the first block, check if the offset of the next block is close enough to the end of your current block, and if so, merge them. Then continue to the end of the list.
Now, you've combined all the blocks you can of each color, so you can just concatenate your lists to get the final list of all blocks of all colors. The most expensive step (asymptotically) will be the sorting, so this is probably efficient enough. You might be able to come up with something that is faster on average using more advanced data structures than arrays and linked lists, but won't be worth the trouble until you measure the performance of this simpler approach.
Note that because the rules for whether a two blocks can be merged depend only on the end point of one block and the start point of the other, and do not depend on the size of the blocks, any solution that can identify any potential merges will probably identify all merges, and it won't matter what order the merges are done in (that is, merging is an associative operation).
You don't have to split the blocks into one list for each colour. You can sort the whole list by block start address, and then process it in a single pass, by maintaining a separate {start,length} pair for each colour.
This is fairly straightforward as for each colour it is just like a numbered range, eg
blue: 100-299, 200-400, 500-670, 671-702
which you merge into:
blue: 100-400, 500-702
Just sort by the first value in the range and look at the end value (actually one past the last element). If that is greater than or equal to the start of the next one then they merge.
In the above 300>200 so they merge, 401 < 500 so they do not, 671==671 so they merge.
Do the same for each colour.
精彩评论