Unique word count

2023-02-27 05:11

This is a generic question that applies to (probably) any high-level programming language. Here is the situation:

Suppose I have an array of strings. Say I managed to put the 500,000 words of a short story into an array (just suppose you don't have an option for input format). Consequently, there will most likely be a large number of duplicated items.

I want to take this array of strings and create another array that contains a unique subset(?) of that array (i.e. no duplicates). In this scenario, both the input and output must be arrays, so that may restrict you from various options.

Performance-wise, what's the fastest way to accomplish this? I'm currently using a linear search to check whether a word exists already, but as it is a linear search I feel that there might be faster ways especially if I have unreasonable amounts of strings to work with. Like a bigger novel!

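Using a hash set is probably the most sensible thing to do: the average complexity should be O(N) for N strings, compared with O(N^2) in the worst case when every word is checked with a linear search.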
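To make the hash set idea concrete, here is a minimal sketch in Python (the question is language-agnostic, so the language and the name unique_words are just assumptions for illustration). It takes an array in and hands an array back, and it keeps the words in the order they first appear.

```python
def unique_words(words):
    """Return the distinct strings in `words`, keeping first-seen order."""
    seen = set()    # hash set: membership test is O(1) on average
    result = []     # the output array
    for word in words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


words = ["the", "cat", "and", "the", "hat", "and", "the", "cat"]
print(unique_words(words))  # ['the', 'cat', 'and', 'hat']
```

Each word is hashed once and looked up once, so the whole pass is linear in the number of strings on average, at the cost of extra memory for the set.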

Note: most high-level programming languages ship a function that removes duplicates from an array, e.g. array_unique() in PHP.
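As an illustration of such a built-in, and assuming Python rather than PHP, the standard types already cover this. The variable names below are made up for the example.

```python
words = ["b", "a", "b", "c", "a"]

# dict keys are unique and keep insertion order (guaranteed since Python 3.7),
# so this deduplicates while preserving first-seen order.
unique_ordered = list(dict.fromkeys(words))

# A plain set also deduplicates, but the resulting order is arbitrary.
unique_unordered = list(set(words))

print(unique_ordered)  # ['b', 'a', 'c']
```

Both variants are hash-based under the hood, so they have the same average linear cost as a hand-written hash set loop.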

If you are going to be putting gazillions of words into it, a directed acyclic word graph is the most efficient data structure I know of.

And yet it is conceptually a very simple data structure.
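A full directed acyclic word graph merges shared suffixes as well as shared prefixes, and building one takes more code than fits here. As a rough sketch of the underlying idea, the Python below uses the simpler trie (prefix tree), which shares prefixes only; it is a stand-in, not a DAWG. The names TrieNode and unique_words_trie are hypothetical.

```python
class TrieNode:
    """One node per character; `is_word` marks the end of a stored word."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False


def unique_words_trie(words):
    """Return the distinct strings in `words`, keeping first-seen order."""
    root = TrieNode()
    result = []
    for word in words:
        node = root
        # Walk down the trie, creating nodes for characters not seen yet.
        for ch in word:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = TrieNode()
                node.children[ch] = nxt
            node = nxt
        # The word is new only if no earlier word ended at this node.
        if not node.is_word:
            node.is_word = True
            result.append(word)
    return result


print(unique_words_trie(["car", "cart", "car", "care", "cart"]))
# ['car', 'cart', 'care']
```

Words with common prefixes share nodes, which is where the memory saving comes from. Whether this beats a hash set in practice depends on the language and the data, so it is worth measuring on the real input.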
