
Optimal data structure for a special dictionary

Which data structure is best, in terms of computational complexity, to implement a dictionary of (key, val) items that must support only the following commands:

  1. Insert(key) - appends an item (key,val) with val=1
  2. Increment(key) - increments the val of an existing (key,val)
  3. Find(key) - returns the val of (key,val)
  4. Select(part_of_key) - returns a list of all items (key,val) for which strstr(key, part_of_key) != NULL, in the form of a new dictionary of the same type (without allocating new memory if possible); for example, if the dictionary is {(red,3), (blue,4), (green,1)}, then Select(re) = {(red,3), (green,1)}
  5. Max(i) - returns the item that has the i-th maximal value among all items; for example, if the dictionary is {(red,3), (blue,4), (green,1)}, then Max(1)=blue, Max(2)=red, Max(3)=green

The keys are strings and the values are positive integers. The number of items in the dictionary is expected to be very large.

I think it must be a synthesis of two different data structures. But should it be a hash table + a binary tree, a trie + a sorted array, or something else?


A combination of a balanced tree (such as a red-black tree) and a suffix tree (or suffix array).

  • Balanced tree: operations 1, 2 (implemented as remove + insert), 3 and 5.
  • Suffix tree: operation 4.

NOTE: A hash table will not be able to support operation 5 efficiently.

I think you'll have a hard time implementing the suffix tree. You could possibly use Mark Nelson's C++ implementation of Ukkonen's algorithm, but it has memory leaks and is essentially a singleton, so you'll need to clean it up before it's ready for production use. Even after you fix it, you'll need to adjust it so it works with your "other" data structure (the balanced tree in my proposal) instead of one big plain string.

If you do operation 1 more frequently than operation 4, and/or you can live with a linear-time operation 4, I recommend you skip the whole complication with the suffix tree and just traverse your data structure linearly.
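A minimal sketch of this simpler variant, assuming `std::map` (a red-black tree in typical implementations) keyed by string for operations 1-3, a value-ordered `std::multimap` for Max(i) with Increment done as "remove + insert", and a linear scan for Select; the names `Dict`, `byKey` and `byVal` are my own:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <map>
#include <string>

// Sketch: two balanced trees, one keyed by string, one ordered by value
// (descending), plus a linear-scan Select as suggested above.
struct Dict {
    std::map<std::string, int> byKey;                         // key -> val
    std::multimap<int, std::string, std::greater<int>> byVal; // val -> key, descending

    void insert(const std::string& key) {
        byKey[key] = 1;
        byVal.emplace(1, key);
    }
    void increment(const std::string& key) {
        int old = byKey[key];
        byKey[key] = old + 1;
        // "remove + insert" in the value-ordered tree
        auto range = byVal.equal_range(old);
        for (auto it = range.first; it != range.second; ++it)
            if (it->second == key) { byVal.erase(it); break; }
        byVal.emplace(old + 1, key);
    }
    int find(const std::string& key) const { return byKey.at(key); }

    Dict select(const std::string& part) const {  // linear, O(n * S)
        Dict out;
        for (const auto& kv : byKey)
            if (kv.first.find(part) != std::string::npos) {
                out.byKey.insert(kv);
                out.byVal.emplace(kv.second, kv.first);
            }
        return out;
    }
    std::string max(std::size_t i) const {        // 1-based i-th largest value
        auto it = byVal.begin();
        std::advance(it, i - 1);
        return it->second;
    }
};
```

Note that this copies the matching items into a new dictionary, so it does not satisfy the "without allocating new memory" wish; the suffix-tree route is what buys you that.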


For the first three operations, a hash table might be a good idea.

For the 4th operation (select by part of key), a hash function won't help: the hash is computed from the whole key, so it cannot support partial matches. Since your keys are strings and you want substring matching, you may want to use a suffix tree or a trie instead.

For the 5th operation (i-th max element), you may want to maintain a heap or a sorted linked list (or skip list) that interacts with the hash table.

You will also have to look at your use cases and decide which operation should be optimized. For example: if you have lots of queries on the part_of_key operation, you should use a suffix-tree/LC-trie kind of structure, and that will give good results. However, your Find operation will no longer be a constant-time lookup; it will take O(log N).

To summarize, you need to integrate a hash table, a heap, and a suffix tree to achieve all the operations.
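The hash-table part alone (operations 1-3) is the easy piece; a sketch assuming `std::unordered_map` with expected O(1) per operation (`HashDict` is a made-up name, and Select/Max would need the extra structures described above):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hash table handling only Insert/Increment/Find; Select and Max
// require the suffix-tree and heap components discussed above.
struct HashDict {
    std::unordered_map<std::string, int> m;
    void insert(const std::string& k)    { m[k] = 1; }     // op 1: val starts at 1
    void increment(const std::string& k) { ++m.at(k); }    // op 2: existing key only
    int  find(const std::string& k) const { return m.at(k); }  // op 3
};
```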


While the exact answer depends on the frequency of your operations, your list of options should include a suffix array.


I think a trie with several shortcuts would be best.

Operations 1-3 are already supported by a trie as fast as possible: O(length of the key).

For 4, I would add to each trie node an additional 256-entry table of the nodes that can be entered from that character. This way I can do fast lookup of parts of the string, without ever explicitly constructing or storing the suffixes as in a suffix tree.

For 5, I would add a linked list over the leaves that gets updated in sorted order when data is inserted. You would maybe need another reference to the head of the list, and backlinks in the trie to find the right value and recover the key.

The memory complexity does not change from the plain trie with these additions, since it is bounded by the number of trie nodes. However, the time insertions take might change because of the inefficient sort of the linked list.

The final structure should probably be called a hash-leaf-trie. But I would not want to be the one implementing such a beast.
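The underlying trie part (operations 1-3 only) can be sketched as follows; the per-node lookup tables for 4 and the sorted leaf list for 5 are omitted, and the node layout with a 256-way child array is my own assumption:

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

// Plain trie: one node per character, val > 0 marks a terminating node.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 256> child;
    int val = 0;  // 0 means "no item ends here"
};

struct Trie {
    TrieNode root;

    // Walk the key's characters from the root; optionally create nodes.
    TrieNode* walk(const std::string& key, bool create) {
        TrieNode* n = &root;
        for (unsigned char c : key) {
            if (!n->child[c]) {
                if (!create) return nullptr;
                n->child[c] = std::make_unique<TrieNode>();
            }
            n = n->child[c].get();
        }
        return n;
    }
    void insert(const std::string& key)    { walk(key, true)->val = 1; }  // O(S)
    void increment(const std::string& key) { ++walk(key, true)->val; }    // O(S)
    int  find(const std::string& key) {                                   // O(S)
        TrieNode* n = walk(key, false);
        return n ? n->val : 0;
    }
};
```

The 256-pointer array per node is wasteful for sparse alphabets; a real implementation would likely use a smaller child map, but the asymptotics are the same.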


I think an efficient structure would be a modified trie with bidirectional links [so you can go from the leaves to the root and reconstruct the word], where each terminating node has an additional 'value' field.
You will also need a multimap [i.e. a map in which each value is a set of addresses].
The keys will be arranged in a tree, and each set of values will be hash-based. [In Java, it would be something like TreeMap<Integer, HashSet<Node>>.]

Pseudo-code: [well, very pseudo... it just shows the general idea of each op].

Insert(key):
1. Insert the word into the trie.
2. Set the value of the terminating node to `1`.
3. Add the terminating node to the multimap [i.e. map.add(1, terminating_node_address)].
Increment(key):
1. Read the word in the trie, and store the value from the terminating node as temp.
2. Delete the entry (temp, terminating_node_address) from the multimap.
3. Insert the entry (temp+1, terminating_node_address) into the multimap.
4. Increase the value at the terminating node's address by 1.
Find(key):
1. Find the terminating node for the key in the trie and return its value.
Select(part_of_key):
1. Go to the last node you can reach with part_of_key.
2. From this node, do a BFS and retrieve all possible words [store them in an empty set you will later return].
Max(i):
1. Find the i-th biggest key in the multimap.
2. Choose an arbitrary address from its set and get the relevant terminating node.
3. Build the string by following the uplinks to the root, and return it.

Complexities:
Let n be the number of strings, k the maximum value, and S a string's length.
Insert: O(S) [trie insertion is O(S)]. The smallest key (1) in the map can be cached, and thus accessed in O(1).
Increment: O(S + log k): finding the string in the trie is O(S), finding the relevant key in the map is O(log k), deleting an element from a hash set is O(1), finding the next element in the tree is also O(log k) [it can actually be O(1) for a skip-list based map], and inserting a value is O(1). Note that by linking the trie node to the relevant key in the map, the complexity can even be improved to O(S), since no search is actually required.
Find: O(S): just find the key in the trie and return its value.
Select: O(S*n): in the worst case you need to retrieve all elements, which forces you to iterate over the whole trie to find all words.
Max: O(log k + S): find the key in the sorted map, then reconstruct the word.

EDIT: Note that here I assumed Max(i) means the i-th distinct biggest value. If that is not the case, you will need to modify the structure a bit and use a multimap that allows duplicate keys [instead of a set as the value]. This will make all the tree-based ops O(log n) instead of O(log k); still efficient, in my opinion...
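The multimap bookkeeping can be sketched on its own, with an `unordered_map` standing in for the trie for brevity (`ValueIndex` is a hypothetical name; `max(i)` here returns a key holding the i-th distinct largest value, matching the EDIT's assumption):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <unordered_map>

// value -> set-of-keys index, the C++ analogue of TreeMap<Integer, HashSet<Node>>.
// The unordered_map plays the trie's role of key -> value lookup.
struct ValueIndex {
    std::unordered_map<std::string, int> vals;                 // stand-in for the trie
    std::map<int, std::set<std::string>, std::greater<int>> byVal;  // descending

    void insert(const std::string& k) { vals[k] = 1; byVal[1].insert(k); }

    void increment(const std::string& k) {
        int old = vals[k];
        byVal[old].erase(k);                       // step 2: delete old entry
        if (byVal[old].empty()) byVal.erase(old);  // drop empty buckets
        vals[k] = old + 1;                         // step 4: bump the value
        byVal[old + 1].insert(k);                  // step 3: insert new entry
    }

    // i-th distinct maximal value (1-based); returns an arbitrary key from its set.
    std::string max(std::size_t i) const {
        auto it = byVal.begin();
        std::advance(it, i - 1);
        return *it->second.begin();
    }
};
```

With the trie in place, the set would hold node addresses instead of strings, and step 3 of Max would rebuild the word via the uplinks.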
