
Counting the number of unique words in a document with Python

I am a Python newbie trying to understand the answer given here to the question of counting unique words in a document. The answer is:

print len(set(w.lower() for w in open('filename.dat').read().split()))

It reads the entire file into memory, splits it into words on whitespace, converts each word to lower case, builds a set of the (unique) lowercase words, counts them, and prints the result.

To try to understand that, I am trying to implement it in Python step by step. I can load the text file using open and read, divide it into individual words using split, and make them all lower case using lower. I can also create a set of the unique words in the list. However, I cannot figure out how to do the last part - count the number of unique words.
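For reference, the one-liner unpacks into the same steps, one per line (a minimal sketch; filename.dat is the file name from the quoted answer):

text = open('filename.dat').read()          # read the whole file into one string
words = text.split()                        # split on whitespace into a list of words
lower_words = [w.lower() for w in words]    # lower-case every word
unique_words = set(lower_words)             # a set keeps only one copy of each word
print(len(unique_words))                    # len() counts the elements of the set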

I thought I could finish by iterating through the items in the set of unique words and counting them in the original lower-case list, but I find that the set construct is not indexable.

So I guess I am trying to do something that in natural language is like, for all the items in the set, tell me how many times they occur in the lower case list. But I cannot quite figure out how to do that, and I suspect some underlying misunderstanding of Python is holding me back.
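In plain Python that idea maps directly onto list.count() (a minimal sketch, assuming the unique_words set and lower_words list built above):

for word in unique_words:
    print(word, "occurs", lower_words.count(word), "times")

This works, although it rescans the whole list once per word; the Counter-based answers below do the same job in a single pass.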

EDIT:

Guys thanks for the answers. I have just realised I did not explain myself correctly - I wanted to find not only the total number of unique words (which I understand is the length of the set) but also the number of times each individual word was used, e.g. 'the' was used 14 times, 'and' was used 9 times, 'it' was used 20 times and so on. Apologies for the confusion.


I believe that Counter is all that you need in this case:

from collections import Counter

print(Counter(yourtext.split()))
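To match the lower-casing in the question, the words can be lower-cased on their way into Counter (a small sketch, again assuming filename.dat):

from collections import Counter

counts = Counter(w.lower() for w in open('filename.dat').read().split())
print(len(counts))             # number of distinct words
print(counts.most_common(3))   # the three most frequent words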


You can calculate the number of items in a set, list or tuple all the same with len(my_set) or len(my_list).

Edit: Calculating the number of times a word is used is something different.
Here is the obvious approach:

count = {}
for w in open('filename.dat').read().split():
    if w in count:
        count[w] += 1
    else:
        count[w] = 1
for word, times in count.items():
    print "%s was found %d times" % (word, times)

If you want to avoid the if-clause, you can look at collections.defaultdict.
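Another way to drop the if-clause without importing anything is dict.get with a default (a minor variation on the loop above, not part of the original answer):

count = {}
for w in open('filename.dat').read().split():
    count[w] = count.get(w, 0) + 1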


A set, by definition, contains unique elements (in your case, you can't find the same 'lower cased string' twice there). So, what you have to do is simply get the count of elements in the set = the length of the set = len(set(...))


Your question already contains the answer. If s is the set of unique words in the document, then len(s) gives the number of elements in the set, i.e. the number of unique words in the document.
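For instance (a tiny illustrative sketch with made-up words):

s = set(['the', 'cat', 'sat', 'the'])
print(len(s))   # prints 3, because 'the' is stored only once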


You can use Counter

from collections import Counter
c = Counter(['mama','papa','mama'])

The result of c will be

Counter({'mama': 2, 'papa': 1})
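Indexing the Counter gives the count for a single word, and len() gives the number of distinct words (a short follow-up sketch):

print(c['mama'])   # 2 -- occurrences of that word
print(len(c))      # 2 -- number of distinct words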


The easiest way:

len(set(open(file_path).read().lower().split()))
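The same idea with the file closed explicitly via a with block (a minor variation, not part of the original answer; file_path is the same hypothetical path):

with open(file_path) as f:
    n_unique = len(set(f.read().lower().split()))
print(n_unique)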


I suppose this can be used to get a unique word count. Works fine with Python 3.10.2:

from collections import Counter

def get_count_of_unique_words(lines):
    # keep only the purely alphabetic tokens
    selected_words = []
    for word in lines:
        if word.isalpha():
            selected_words.append(word)

    # count the words that occur exactly once
    unique_count = 0
    for word, count in Counter(selected_words).items():
        if count == 1:
            unique_count += 1

    print(unique_count)
    return unique_count

Docs https://docs.python.org/3/library/collections.html#collections.Counter
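A possible usage sketch (the function takes an iterable of words despite its lines parameter name; filename.dat is assumed as in the question):

words = open('filename.dat').read().lower().split()
get_count_of_unique_words(words)   # prints the number of words that appear exactly once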


I would say that code counts the number of distinct words, not the number of unique words, i.e. the number of words which occur only once.

This counts the number of times that each word occurs:

from collections import defaultdict

word_counts = defaultdict(int)

for w in open('filename.dat').read().split():
    word_counts[w.lower()] += 1

for w, c in word_counts.items():
    print(w, "occurs", c, "times")
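If what you want is the number of unique words in that strict sense (words that occur exactly once), it can be read straight off the same dictionary (a short sketch):

singletons = sum(1 for c in word_counts.values() if c == 1)
print(singletons, "words occur exactly once")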