开发者

Searching huge sorted chunks of data

I have a huge set data records in disk that arranged in sorted order based on some key(s). The data is read into memory a block (thousands of records) at a time. I have to search and display all records matching a key. I was thinking of some binary search based algorithm, but I have some restrictions here.

  1. Records can be only sequentially looked up within a block from the start of the block.
  2. Records with the same key can span multiple blocks (as shown in the figure - 8 spans). In binary search, if I am loading the middle block and if the first record matches, then I have to scan the blocks previous to the matched block.

Can someone help me devise an efficient strategy that could work in C++. Will it be efficient to go with the linear search method开发者_JAVA技巧.

+---+
| 1 | Block1
| 3 |
| 3 |
| 4 |
+---+
| 4 | Block2
| 6 |
| 7 |
| 8 |
+---+
| 8 | Block3
| 8 |
| 8 |
| 8 |
+---+
| 8 | Block4
| 14|
| 15|
| 16|
+---+


You could build a secondary array that consists of the first entry in each block then run binary search on that array. The indices for the array should corresponding directly with the block indices making it an O(1) lookup to get the corresponding block.

It cuts the worst case from O(n) to O(logn) and is still relatively simple.


Your idea, using a binary search, is correct. And you can avoid the linear scans alltogether by saving both the minimum and maximum value in each node. In your example the constructed binary-search-tree will look like this:

Block1    <- (1,4)
                    (1,8)
Block2    <- (4,8)
                          (1,16)
Block3    <- (8,8)
                    (8,16)
Block4    <- (8,16)

....

Having both the max and min values makes it efficient to compute the higher nodes. In your example, if you want to search for "8" you will go ...->...->...->(1,16)->(1,8)->(4,8), so you found the correct block without seeking backward and in the most efficent (log(n)) correct way.


It's a well-known problem, and there is a well-known data structure to solve it, employed mostly in databases :)

The idea is to use a B+ Tree.

The idea is to superpose a kind of Binary Search Tree (except that there'll more than 2 children per node) on top of the structure to search in.


If you have any idea of key distribution you could improve a binary search by guesstimating the first location to check. As a sample, with a "Name" key and the value "Bob" you can approximate where "B" is located, either simply based on position in alphabet, or more complex with domain-specific knowledge of the key (distribution frequency of first character in english first names e.g.)

Either way, a binary search is the way to go, optionally with first-key-in-block preloading or caching-as-you-see-them.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜