C++ - Efficient container for large amounts of searchable data?

I am implementing a text-based version of Scrabble for a college project.

My dictionary is quite large, weighing in at around 400,000 words (std::string).

Searching for a valid word will suck, big time, in terms of efficiency if I go for a vector<string> ( O(n) ). Are there any good alternatives? Keep in mind, I'm enrolled in freshman year. Nothing TOO complex!

Thanks for your time!

Francisco


If you wanted to go for something that is in the standard library, you could use std::set with the word as the key. That would give you logarithmic search time.
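
A minimal sketch of that approach (the file name and the lookup word are placeholders for illustration):

#include <fstream>
#include <iostream>
#include <set>
#include <string>

int main() {
    std::set<std::string> dictionary;

    // Load the word list; "words.txt" is a placeholder path.
    std::ifstream in("words.txt");
    std::string word;
    while (in >> word) {
        dictionary.insert(word);
    }

    // Lookup costs O(log n) string comparisons.
    std::string candidate = "tempest";
    if (dictionary.count(candidate)) {
        std::cout << candidate << " is a valid word\n";
    }
}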

Since your dictionary is presumably static (i.e. created once and not modified), you could also use a std::vector, sort it using std::sort, and then use std::binary_search on the sorted vector to find a word. This would also give logarithmic search time, and would possibly be more space efficient than a set.
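
Since the dictionary is built once, the sorted-vector variant could look like this (again just a sketch; where the word list comes from is assumed):

#include <algorithm>
#include <string>
#include <vector>

// 'words' is the raw word list, loaded elsewhere.
std::vector<std::string> build_dictionary(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    return words;
}

bool is_valid_word(const std::vector<std::string>& dict, const std::string& word) {
    // O(log n) comparisons over contiguous, cache-friendly storage.
    return std::binary_search(dict.begin(), dict.end(), word);
}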

If you want to implement your own data structure, a trie would be a good choice.
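
If you do decide to roll your own, a minimal lowercase-ASCII trie (a sketch, not a tuned implementation) only needs a node with 26 children and an end-of-word flag:

#include <array>
#include <memory>
#include <string>

class Trie {
    struct Node {
        std::array<std::unique_ptr<Node>, 26> children;
        bool is_word = false;
    };
    Node root;

public:
    void insert(const std::string& word) {
        Node* node = &root;
        for (char c : word) {
            int i = c - 'a';                 // assumes lowercase a-z input
            if (!node->children[i])
                node->children[i] = std::make_unique<Node>();
            node = node->children[i].get();
        }
        node->is_word = true;
    }

    bool contains(const std::string& word) const {
        const Node* node = &root;
        for (char c : word) {
            int i = c - 'a';
            if (!node->children[i])
                return false;                // no path for this prefix
            node = node->children[i].get();
        }
        return node->is_word;
    }
};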


std::set is a natural fit here because it takes almost zero work on top of using a vector. Even so, I will teach you something you don't usually learn until you are a professional: don't optimize prematurely. I'd bet that on a modern computer a linear dictionary lookup in a 400K-string vector takes about .001 seconds.

In a set, the lookup is O(log n) and probably takes around .00001 seconds.

Anything not in the STL is a total waste of time here. Don't spend $10 of work on a 10-cent problem.


A trie or radix tree would give you search and insertion times that are linear in the length of the string you are searching for.

(Note that linear in the length of the string you are searching for is the best you can do with any search algorithm, because comparing or hashing strings is linear in the length of the string -- therefore the component of running time that's linear in the length of the string is usually left out of the running times for binary search, binary trees, or linear search.)

These solutions are probably overkill if you don't have them in your library already.


I did some profiling and got the following results (MSVS 2008, /O2, release build, launching .exe separately).

EDIT - I realize now that I messed up my first test, because I didn't separate the build phase from the search phase. That didn't change the "winners", but I ran new tests anyway. Here are the results with the two phases split.

First, the case where there are almost no failed search requests (4 million successful search attempts):

[ RUN      ] Containers.DictionaryPrepare
[       OK ] Containers.DictionaryPrepare (234 ms)
[ RUN      ] Containers.VectorPrepare
[       OK ] Containers.VectorPrepare (704 ms)
[ RUN      ] Containers.SetPrepare
[       OK ] Containers.SetPrepare (593 ms)
[ RUN      ] Containers.MultisetPrepare
[       OK ] Containers.MultisetPrepare (578 ms)
[ RUN      ] Containers.UnorderedSetPrepare
[       OK ] Containers.UnorderedSetPrepare (266 ms)
[ RUN      ] Containers.UnorderedMultisetPrepare
[       OK ] Containers.UnorderedMultisetPrepare (375 ms)
[ RUN      ] Containers.VectorSearch
[       OK ] Containers.VectorSearch (4484 ms)
[ RUN      ] Containers.SetSearch
[       OK ] Containers.SetSearch (5469 ms)
[ RUN      ] Containers.MultisetSearch
[       OK ] Containers.MultisetSearch (5485 ms)
[ RUN      ] Containers.UnorderedSet
[       OK ] Containers.UnorderedSet (1078 ms)
[ RUN      ] Containers.UnorderedMultiset
[       OK ] Containers.UnorderedMultiset (1250 ms)
[----------] 11 tests from Containers (20516 ms total)

This profiling suggests you should use the "normal" container variants rather than the "multi" ones, and that unordered_set is the clear choice: it is fastest both to build and to search.

Here are the results for another case (probably not representative of your app, but included for completeness), where the number of failed searches equals the number of successful ones (2 million each). The winner remains the same.

Also note that for a static dictionary a sorted vector searches faster than a set (though it needs more time to initialize), but it becomes painful if you later have to add elements.

[ RUN      ] Containers.DictionaryPrepare
[       OK ] Containers.DictionaryPrepare (235 ms)
[ RUN      ] Containers.VectorPrepare
[       OK ] Containers.VectorPrepare (718 ms)
[ RUN      ] Containers.SetPrepare
[       OK ] Containers.SetPrepare (578 ms)
[ RUN      ] Containers.MultisetPrepare
[       OK ] Containers.MultisetPrepare (579 ms)
[ RUN      ] Containers.UnorderedSetPrepare
[       OK ] Containers.UnorderedSetPrepare (265 ms)
[ RUN      ] Containers.UnorderedMultisetPrepare
[       OK ] Containers.UnorderedMultisetPrepare (375 ms)
[ RUN      ] Containers.VectorSearch
[       OK ] Containers.VectorSearch (3375 ms)
[ RUN      ] Containers.SetSearch
[       OK ] Containers.SetSearch (3656 ms)
[ RUN      ] Containers.MultisetSearch
[       OK ] Containers.MultisetSearch (3766 ms)
[ RUN      ] Containers.UnorderedSet
[       OK ] Containers.UnorderedSet (875 ms)
[ RUN      ] Containers.UnorderedMultiset
[       OK ] Containers.UnorderedMultiset (1016 ms)
[----------] 11 tests from Containers (15438 ms total)

Testing code:

TEST(Containers, DictionaryPrepare) {
   EXPECT_FALSE(strings_initialized);
   for (size_t i = 0; i < TOTAL_ELEMENTS; ++i) {
      strings.push_back(generate_string());
   }
}

TEST(Containers, VectorPrepare) {
   for (size_t i = 0; i < TOTAL_ELEMENTS; ++i) {
      vec.push_back(strings[i]);
   }
   sort(vec.begin(), vec.end());
}

TEST(Containers, SetPrepare) {
   for (size_t i = 0; i < TOTAL_ELEMENTS; ++i) {
      set.insert(strings[i]);
   }
}

TEST(Containers, MultisetPrepare) {
   for (size_t i = 0; i < TOTAL_ELEMENTS; ++i) {
      multiset.insert(strings[i]);
   }
}

TEST(Containers, UnorderedSetPrepare) {
   for (size_t i = 0; i < TOTAL_ELEMENTS; ++i) {
      uo_set.insert(strings[i]);
   }
}

TEST(Containers, UnorderedMultisetPrepare) {
   for (size_t i = 0; i < TOTAL_ELEMENTS; ++i) {
      uo_multiset.insert(strings[i]);
   }
}

TEST(Containers, VectorSearch) {
   for (size_t i = 0; i < TOTAL_SEARCHES; ++i) {
      std::binary_search(vec.begin(), vec.end(), strings[rand() % TOTAL_ELEMENTS]);
   }
   for (size_t i = 0; i < TOTAL_BAD_SEARCHES; ++i) {
      std::binary_search(vec.begin(), vec.end(), NONEXISTENT_ELEMENT);
   }
}

TEST(Containers, SetSearch) {
   for (size_t i = 0; i < TOTAL_SEARCHES; ++i) {
      set.find(strings[rand() % TOTAL_ELEMENTS]);
   }
   for (size_t i = 0; i < TOTAL_BAD_SEARCHES; ++i) {
      set.find(NONEXISTENT_ELEMENT);
   }
}

TEST(Containers, MultisetSearch) {
   for (size_t i = 0; i < TOTAL_SEARCHES; ++i) {
      multiset.find(strings[rand() % TOTAL_ELEMENTS]);
   }
   for (size_t i = 0; i < TOTAL_BAD_SEARCHES; ++i) {
      multiset.find(NONEXISTENT_ELEMENT);
   }
}

TEST(Containers, UnorderedSet) {
   for (size_t i = 0; i < TOTAL_SEARCHES; ++i) {
      uo_set.find(strings[rand() % TOTAL_ELEMENTS]);
   }
   for (size_t i = 0; i < TOTAL_BAD_SEARCHES; ++i) {
      uo_set.find(NONEXISTENT_ELEMENT);
   }
}

TEST(Containers, UnorderedMultiset) {
   for (size_t i = 0; i < TOTAL_SEARCHES; ++i) {
      uo_multiset.find(strings[rand() % TOTAL_ELEMENTS]);
   }
   for (size_t i = 0; i < TOTAL_BAD_SEARCHES; ++i) {
      uo_multiset.find(NONEXISTENT_ELEMENT);
   }
}
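
For reference, the tests rely on a handful of globals and constants that are not shown above; a minimal set of declarations (names taken from the test bodies, values assumed to roughly match the quoted numbers) might look like:

#include <set>
#include <string>
#include <unordered_set>
#include <vector>

// Sizes are assumptions, not from the original post.
const size_t TOTAL_ELEMENTS     = 400000;     // matches the question's dictionary size
const size_t TOTAL_SEARCHES     = 4000000;    // 2000000 in the second run
const size_t TOTAL_BAD_SEARCHES = 0;          // 2000000 in the second run
const std::string NONEXISTENT_ELEMENT = "zzzzzzzzzz";

bool strings_initialized = false;
std::vector<std::string> strings;             // the raw generated dictionary
std::vector<std::string> vec;
std::set<std::string> set;
std::multiset<std::string> multiset;
std::unordered_set<std::string> uo_set;
std::unordered_multiset<std::string> uo_multiset;

// Assumed to produce a random word; not shown in the original.
std::string generate_string();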


What is the purpose of your data structure? What sort of things do you want to do with it?

  • Check if a complete word like "tempest" is in the list?
  • Find all seven letter words starting and ending with "t"?
  • Find all seven letter words starting and ending with "t" that can be made with a given set of letters?

The first question is all you need if you're implementing some sort of referee for a group of human players, that is, some entity that checks proposed words against the official dictionary. The basic data structures already suggested by others (std::map, std::set, a sorted std::vector) suffice for this purpose.

The second and third questions are the kind you need answered if you're writing a player. Here you might want to consider 26 sets for each letter position, each set holding the words with a given letter in a given position. You'll need extra code to compute intersections when needed, and maybe check words against the letters available on your rack.
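
A sketch of that per-position index, assuming lowercase words and a maximum word length (both of which are my assumptions, not part of the original answer):

#include <array>
#include <set>
#include <string>
#include <vector>

const size_t MAX_WORD_LEN = 15;    // board width; an assumption

// by_position[i][c] holds every word whose i-th letter is ('a' + c).
std::array<std::array<std::set<std::string>, 26>, MAX_WORD_LEN> by_position;

void index_word(const std::string& word) {
    for (size_t i = 0; i < word.size() && i < MAX_WORD_LEN; ++i) {
        by_position[i][word[i] - 'a'].insert(word);
    }
}

// Example query: seven-letter words starting and ending with 't'.
std::vector<std::string> seven_letter_t_to_t() {
    std::vector<std::string> result;
    for (const std::string& w : by_position[0]['t' - 'a']) {
        if (w.size() == 7 && by_position[6]['t' - 'a'].count(w)) {
            result.push_back(w);
        }
    }
    return result;
}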

Update: In a comment on the original question, the OP made it clear that he only has to check if a word is in the dictionary. This is the first question I asked above, and any of the standard efficient data structures is fine.


If the vector is sorted, you can use binary_search to test if a given word exists in the dictionary.


Use a std::tr1::unordered_set (std::unordered_set in C++11 and later), which gives you expected constant-time lookup. (Linear in the length of the string, as per my other answer.)
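
A minimal sketch using the C++11 spelling (on a pre-C++11 compiler, substitute std::tr1::unordered_set):

#include <string>
#include <unordered_set>

bool is_valid_word(const std::unordered_set<std::string>& dict,
                   const std::string& word) {
    // Expected O(1) lookups, i.e. O(length of word) for hashing and comparison.
    return dict.find(word) != dict.end();
}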


I would echo Ken's suggestion of using a trie, and further suggest that you could dramatically shrink it by letting it behave more like a finite-state machine for common prefixes and suffixes. For example, "nation", "national", "nationalize", "nationalization", "nationalizing", and "denationalize" could all share common structure. You would have to watch out for dubious words like "nationalizationalizationalizing". I used this approach ages ago in a spelling-correction program.


I suggest using the first letter of the word and its length as the first two steps of the search, and organizing the data to support this algorithm.

First, let us define a container for all the words that are the same length:

typedef std::vector<std::string> Word_Container;

Each of these containers is kept sorted so that a binary search can be used.

Next, create an index table that maps a word length to a pointer to the corresponding word container:

typedef std::map<unsigned int, Word_Container *> Index_Table;

And finally have an array of index tables, using the first letter of the word as an index:

Index_Table alpha_array[26]; // ASCII A - Z.

The lookup algorithm is then:

Calculate the index into alpha_array (the code assumes uppercase words):

index = word[0] - 'A';

Use the index to get the associated index table:

Index_Table& table = alpha_array[index];

Use the length of the word as the key into that table to get the word container (using find() rather than operator[] avoids inserting an empty entry when no words of that length exist):

Index_Table::iterator it = table.find(word.length());
Word_Container* p_word_container = (it != table.end()) ? it->second : NULL;

Search the container for the exact word:

bool found = false;
if (p_word_container)
{
    found = std::binary_search(p_word_container->begin(), p_word_container->end(), word);
}

There are more efficient, but more complex, methods for searching a dictionary for a word. The above algorithm has the benefit of quick "escape" points when no word can possibly match. The same concept can be extended to database tables.
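
The answer doesn't show how the structure is built; a sketch of the construction step (restating the typedefs so it is self-contained, and assuming uppercase words as above) could be:

#include <algorithm>
#include <map>
#include <string>
#include <vector>

typedef std::vector<std::string> Word_Container;
typedef std::map<unsigned int, Word_Container*> Index_Table;

Index_Table alpha_array[26];   // ASCII A - Z

void add_word(const std::string& word) {
    Index_Table& table = alpha_array[word[0] - 'A'];
    Word_Container*& p_container = table[word.length()];
    if (!p_container)
        p_container = new Word_Container;   // cleanup omitted in this sketch
    p_container->push_back(word);
}

void finalize() {
    // Sort every container once, after all words are added, so binary_search works.
    for (int i = 0; i < 26; ++i)
        for (Index_Table::iterator it = alpha_array[i].begin(); it != alpha_array[i].end(); ++it)
            std::sort(it->second->begin(), it->second->end());
}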
