开发者

find repeated word in infinite stream of words

You are given an infinite supply of words, which are coming one by one, and length of words, can be huge and is unknown how big it is. How will you find if the new word is repeated, what data structure will you use to s开发者_如何转开发tore.This was the question asked to me in the interview .please help me to verify my answer.


Normally use a hash-table to keep track of the count of each word. Since you only have to answer whether the words are duplicated, you can reduce the word count to a bitmask, so that you only store a single bit for each hash index.

If the question is related to big data, like how to write a search engine for Google, your answer may need to relate to MapReduce or similar distributed techniques (which takes root somewhat in same hash table techniques as described above)


As with most sequential data, a trie would be a good choice here. Using a trie you can store new words very cost efficiently and still be sure to find new words. Tries can actually be seen as a form of multiple hashing of the words. If this still leads to problems, because the size of the words is to big, you can make it more efficient by producing a directed acyclic word graph (DAWG) from the words in order to reduce common suffixes as well as prefixes.


If all you need to do is efficiently detect if each word is one you've seen before, a Bloom filter is one nice option. It's kind of like a set and a hash table combined in one, and therefore can result in false positives -- for this reason they are sometimes adapted to use additional techniques to reduce that risk. The advantage of Bloom filters is that they are very space efficient (important if you really don't know how large the list will be). They are also fast. On the downside, you can't get the words out again, you can only tell whether you've seen them or not.

There's a nice description at: http://en.wikipedia.org/wiki/Bloom_filter.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜