Suitable Data Structure for storing and calculating Highest Scoring K items

I need to store W items. Each item has a string attribute and a double attribute (the item's score). In each iteration, C additional items are added to the set. After the iteration is complete, the scores of some of the items are updated by a small amount. Then, out of the W+C items, only the W highest-scoring items are carried forward to the next iteration. A different set of C items is added in every iteration.

W is of the order of 10,000. C is of the order of 600.

What is the best data structure for this in terms of time complexity: a hash table, a heap, or a binary search tree? I am using C++. Some Boost references would be appreciated.


I would store these values in two parallel structures. First, keep an array of the double values, each paired with a pointer. Next, store all the strings in a hash table, each with an auxiliary integer. The pointers in the array point to the hash table nodes holding the string associated with each double, while the integer stored with each string is the index of the double paired with that string.

To insert a string/double pair into this structure, you add the string to the hash table, append the double to the array, then store a pointer to the new string in the array and the index of the double in the hash table. This takes expected O(k) time, where k is the length of the string.

To change a priority, look up the string in the hash table, then get the index of the double in the array. You can then modify that element to change the associated priority. This also takes expected O(k) time.

To discard all but the top W key/value pairs, run a selection algorithm on the array to put the top W elements in one part of the array and the remaining C elements in the other. Whenever you perform a swap, follow the pointers out of the array and into the hash table and update the indices of the elements you just swapped. Finally, iterate across the last C elements of the array, follow their pointers back into the hash table, and remove the elements they point at from the table. The selection step takes expected O(n) time (or worst-case O(n) time using the median-of-medians algorithm), and removing the discarded elements from the hash table takes at most expected O(n) time, for an expected runtime of O(n) overall, where n = W+C is the number of elements in the structure.

To summarize, this gives you O(k) insertion and lookup of any string, where k is the string length, and O(n) retaining of the best elements, where n is the total number of elements.
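As a rough illustration of the two parallel structures, here is a minimal C++17 sketch. The names ScoreTable, upsert, adjust, prune and find are made up for this example and do not come from any library. It also takes a shortcut: instead of updating indices on every swap inside a hand-written selection routine, it lets std::nth_element reorder the array and then rewrites the stored indices in one extra pass, which keeps the pruning step at expected O(n).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical class: an array of (score, pointer-to-name) slots plus a
// hash table mapping each name to the index of its slot in the array.
class ScoreTable {
public:
    // Insert a new item, or overwrite the score of an existing one.
    void upsert(const std::string& name, double score) {
        auto [it, inserted] = index_.try_emplace(name, slots_.size());
        if (inserted) {
            // Pointers to unordered_map elements stay valid across rehashing.
            slots_.push_back({score, &it->first});
        } else {
            slots_[it->second].score = score;
        }
    }

    // Add delta to an existing item's score; returns false if it is absent.
    bool adjust(const std::string& name, double delta) {
        auto it = index_.find(name);
        if (it == index_.end()) return false;
        slots_[it->second].score += delta;
        return true;
    }

    // Keep only the `keep` highest-scoring items.
    void prune(std::size_t keep) {
        if (slots_.size() <= keep) return;
        std::nth_element(slots_.begin(), slots_.begin() + keep, slots_.end(),
                         [](const Slot& a, const Slot& b) { return a.score > b.score; });
        // Remove the discarded names from the hash table.
        for (std::size_t i = keep; i < slots_.size(); ++i) {
            index_.erase(index_.find(*slots_[i].name));
        }
        slots_.resize(keep);
        // Assumption of this sketch: refresh indices in one pass after the
        // selection rather than on every swap.
        for (std::size_t i = 0; i < keep; ++i) {
            index_.find(*slots_[i].name)->second = i;
        }
    }

    // Returns a pointer to the score, or nullptr if the name is not stored.
    const double* find(const std::string& name) const {
        auto it = index_.find(name);
        return it == index_.end() ? nullptr : &slots_[it->second].score;
    }

    std::size_t size() const { return slots_.size(); }

private:
    struct Slot {
        double score;
        const std::string* name;  // points at the key stored in index_
    };
    std::vector<Slot> slots_;
    std::unordered_map<std::string, std::size_t> index_;
};
```

In each iteration you would call upsert for the C new items, adjust for the scores that change, and then prune(W) once at the end.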


Well, I think you will be fine just using a std::vector<Item> and calling std::nth_element (on the score) once at the end of each iteration. For example, if you want to keep 10,000 items, do it like this:

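#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Item {
    double score;
    std::string name;
};

// Orders items from highest score to lowest.
bool comparator(const Item& a, const Item& b) {
    return a.score > b.score;
}

// Call once at the end of each iteration.
void keepBest(std::vector<Item>& items, std::size_t keep = 10000) {
    if (items.size() > keep) {
        // Make sure the first `keep` elements contain the highest scores
        // (they are not sorted among themselves).
        std::nth_element(items.begin(), items.begin() + keep, items.end(),
                         comparator);
        // Only keep the first `keep` elements.
        items.resize(keep);
    }
}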

Actually, if you do it like this, updating scores (by linear search with string comparisons) will probably be slower than the nth_element call. You can speed up the comparisons by storing a hash of the string in each Item and comparing hashes instead of the full strings.

If you want even faster updating: before updating, sort the items by string hash. Then you can use a binary search instead of a linear search to find the item you want to update.
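A possible sketch of that idea follows. HashedItem, make_item, sort_by_hash and find_item are hypothetical names chosen for this example, and std::hash<std::string> is just one convenient hash. Because two different strings can share a hash value, the lookup still compares the names within the run of equal hashes.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical item type that carries a precomputed hash of its name.
struct HashedItem {
    std::size_t hash;
    double score;
    std::string name;
};

HashedItem make_item(std::string name, double score) {
    const std::size_t h = std::hash<std::string>{}(name);
    return {h, score, std::move(name)};
}

// Sort by hash so that find_item can use binary search.
void sort_by_hash(std::vector<HashedItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const HashedItem& a, const HashedItem& b) { return a.hash < b.hash; });
}

// Requires `items` to be sorted by hash. Returns nullptr if the name is absent.
HashedItem* find_item(std::vector<HashedItem>& items, const std::string& name) {
    const std::size_t h = std::hash<std::string>{}(name);
    auto it = std::lower_bound(
        items.begin(), items.end(), h,
        [](const HashedItem& item, std::size_t value) { return item.hash < value; });
    // Scan the (usually length-one) run of items with the same hash.
    for (; it != items.end() && it->hash == h; ++it) {
        if (it->name == name) return &*it;
    }
    return nullptr;
}
```

Note that std::nth_element reorders the vector, so the items have to be sorted by hash again in each iteration before the updates start.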
