开发者

Unique word count

This is a generic question that applies to (probably) any high-level programming language. Here is the situation:

Suppose I have an array of strings. Say, I managed to put 500 000 strings from a short story into an array (just suppose you don't have an option for input format). Consequently, there will most likely be an arbitrary number of duplicated items.

I want to take this array of strings and create another array that contains a unique subset(?) of that array (ie: no duplicates). In this scenario, both the input and 开发者_开发问答output must be arrays, so that may restrict you from various options.

Performance-wise, what's the fastest way to accomplish this? I'm currently using a linear search to check whether a word exists already, but as it is a linear search I feel that there might be faster ways especially if I have unreasonable amounts of strings to work with. Like a bigger novel!


Using a hashset might be the most sensible thing to do - complexity should be O(N).

Note: most high-level programming languages contain an implementation of a function that removes duplicates from an array, e.g. PHP.


If you are going to be putting gazillions of words into it, a directed acyclic word graph is the most efficient data structure I know of.

And yet it is conceptually a very simple data structure.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜