
Counting the number of unique words in a document with Python

I am a Python newbie trying to understand the answer given here to the question of counting unique words in a document. The answer is:

print len(set(w.lower() for w in open('filename.dat').read().split()))

It reads the entire file into memory, splits it into words on whitespace, converts each word to lower case, builds a set of the (unique) lowercase words, counts them, and prints the result.

To try to understand that, I am trying to implement it in Python step by step. I can load the text file using open and read, divide it into individual words using split, and make them all lower case using lower. I can also create a set of the unique words in the list. However, I cannot figure out how to do the last part - count the number of unique words.
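For reference, the one-liner unpacks into the same steps, one per line (a minimal sketch; filename.dat is the file name from the quoted answer):

text = open('filename.dat').read()          # read the whole file into one string
words = text.split()                        # split on whitespace into a list of words
lower_words = [w.lower() for w in words]    # lower-case every word
unique_words = set(lower_words)             # a set keeps only one copy of each word
print(len(unique_words))                    # len() counts the elements of the set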

I thought I could finish by iterating through the items in the set of unique words and counting them in the original lower-case list, but I find that the set construct is not indexable.

So I guess I am trying to do something that in natural language is like, for all the items in the set, tell me how many times they occur in the lower case list. But I cannot quite figure out how to do that, and I suspect some underlying misunderstanding of Python is holding me back.
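In plain Python that idea maps directly onto list.count() (a minimal sketch, assuming the unique_words set and lower_words list built above):

for word in unique_words:
    print(word, "occurs", lower_words.count(word), "times")

This works, although it rescans the whole list once per word; the Counter-based answers below do the same job in a single pass.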

EDIT:

Guys thanks for the answers. I have just realised I did not explain myself correctly - I wanted to find not only the total number of unique words (which I understand is the length of the set) but also the number of times each individual word was used, e.g. 'the' was used 14 times, 'and' was used 9 times, 'it' was used 20 times and so on. Apologies for the confusion.


I believe that Counter is all that you need in this case:

from collections import Counter

print(Counter(yourtext.split()))
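To match the lower-casing in the question, the words can be lower-cased on their way into Counter (a small sketch, again assuming filename.dat):

from collections import Counter

counts = Counter(w.lower() for w in open('filename.dat').read().split())
print(len(counts))             # number of distinct words
print(counts.most_common(3))   # the three most frequent words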


You can calculate the number of items in a set, list or tuple all the same with len(my_set) or len(my_list).

Edit: Calculating the number of times a word is used is something different.
Here is the obvious approach:

count = {}
for w in open('filename.dat').read().split():
    if w in count:
        count[w] += 1
    else:
        count[w] = 1
for word, times in count.items():
    print "%s was found %d times" % (word, times)

If you want to avoid the if-clause, you can look at collections.defaultdict.
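Another way to drop the if-clause without importing anything is dict.get with a default (a minor variation on the loop above, not part of the original answer):

count = {}
for w in open('filename.dat').read().split():
    count[w] = count.get(w, 0) + 1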


A set, by definition, contains unique elements (in your case, you can't find the same 'lower cased string' twice there). So, what you have to do is simply get the count of elements in the set = the length of the set = len(set(...))


Your question already contains the answer. If s is the set of unique words in the document, then len(s) gives the number of elements in the set, i.e. the number of unique words in the document.
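For instance (a tiny illustrative sketch with made-up words):

s = set(['the', 'cat', 'sat', 'the'])
print(len(s))   # prints 3, because 'the' is stored only once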


You can use Counter

from collections import Counter
c = Counter(['mama','papa','mama'])

The result of c will be

Counter({'mama': 2, 'papa': 1})
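Indexing the Counter gives the count for a single word, and len() gives the number of distinct words (a short follow-up sketch):

print(c['mama'])   # 2 -- occurrences of that word
print(len(c))      # 2 -- number of distinct words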


The easiest way:

len(set(open(file_path).read().lower().split()))
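The same idea with the file closed explicitly via a with block (a minor variation, not part of the original answer; file_path is the same hypothetical path):

with open(file_path) as f:
    n_unique = len(set(f.read().lower().split()))
print(n_unique)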


I suppose this can be used to get a unique word count. Works fine with Python 3.10.2:

from collections import Counter

def get_count_of_unique_words(lines):
    # keep only the purely alphabetic tokens
    selected_words = []
    for word in lines:
        if word.isalpha():
            selected_words.append(word)

    # count the words that occur exactly once
    unique_count = 0
    for word, count in Counter(selected_words).items():
        if count == 1:
            unique_count += 1

    print(unique_count)
    return unique_count

Docs https://docs.python.org/3/library/collections.html#collections.Counter
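A possible usage sketch (the function takes an iterable of words despite its lines parameter name; filename.dat is assumed as in the question):

words = open('filename.dat').read().lower().split()
get_count_of_unique_words(words)   # prints the number of words that appear exactly once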


I would say that code counts the number of distinct words, not the number of unique words, i.e. the number of words which occur only once.

This counts the number of times that each word occurs:

from collections import defaultdict

word_counts = defaultdict(int)

for w in open('filename.dat').read().split():
    word_counts[w.lower()] += 1

for w, c in word_counts.items():
    print(w, "occurs", c, "times")
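If what you want is the number of unique words in that strict sense (words that occur exactly once), it can be read straight off the same dictionary (a short sketch):

singletons = sum(1 for c in word_counts.values() if c == 1)
print(singletons, "words occur exactly once")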