开发者

How to find most common elements of a list? [duplicate]

This question already has answers here: Find the item with maximum occurrences in a list [duplicate] (14 answers) Closed 2 years ago.

The community reviewed whether to reopen this question 9 months ago and left it closed:

开发者_如何学Python

Original close reason(s) were not resolved

Given the following list

['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 
 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 
 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 
 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 
 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 
 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 
 'Moon', 'to', 'rise.', '']

I am trying to count how many times each word appears and display the top 3.

However I am only looking to find the top three that have the first letter capitalized and ignore all words that do not have the first letter capitalized.

I am sure there is a better way than this, but my idea was to do the following:

  1. put the first word in the list into another list called uniquewords
  2. delete the first word and all its duplicated from the original list
  3. add the new first word into unique words
  4. delete the first word and all its duplicated from original list.
  5. etc...
  6. until the original list is empty....
  7. count how many times each word in uniquewords appears in the original list
  8. find top 3 and print


In Python 2.7 and above there is a class called Counter which can help you:

from collections import Counter
words_to_count = (word for word in word_list if word[:1].isupper())
c = Counter(words_to_count)
print c.most_common(3)

Result:

[('Jellicle', 6), ('Cats', 5), ('And', 2)]

I am quite new to programming so please try and do it in the most barebones fashion.

You could instead do this using a dictionary with the key being a word and the value being the count for that word. First iterate over the words adding them to the dictionary if they are not present, or else increasing the count for the word if it is present. Then to find the top three you can either use a simple O(n*log(n)) sorting algorithm and take the first three elements from the result, or you can use a O(n) algorithm that scans the list once remembering only the top three elements.

An important observation for beginners is that by using builtin classes that are designed for the purpose you can save yourself a lot of work and/or get better performance. It is good to be familiar with the standard library and the features it offers.


If you are using an earlier version of Python or you have a very good reason to roll your own word counter (I'd like to hear it!), you could try the following approach using a dict.

Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> word_list = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
>>> word_counter = {}
>>> for word in word_list:
...     if word in word_counter:
...         word_counter[word] += 1
...     else:
...         word_counter[word] = 1
... 
>>> popular_words = sorted(word_counter, key = word_counter.get, reverse = True)
>>> 
>>> top_3 = popular_words[:3]
>>> 
>>> top_3
['Jellicle', 'Cats', 'and']

Top Tip: The interactive Python interpretor is your friend whenever you want to play with an algorithm like this. Just type it in and watch it go, inspecting elements along the way.


To just return a list containing the most common words:

from collections import Counter
words=["i", "love", "you", "i", "you", "a", "are", "you", "you", "fine", "green"]
most_common_words= [word for word, word_count in Counter(words).most_common(3)]
print most_common_words

this prints:

['you', 'i', 'a']

the 3 in "most_common(3)", specifies the number of items to print. Counter(words).most_common() returns a a list of tuples with each tuple having the word as the first member and the frequency as the second member.The tuples are ordered by the frequency of the word.

`most_common = [item for item in Counter(words).most_common()]
print(str(most_common))
[('you', 4), ('i', 2), ('a', 1), ('are', 1), ('green', 1), ('love',1), ('fine', 1)]`

"the word for word, word_counter in", extracts only the first member of the tuple.


Is't it just this ....

word_list=['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 
 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 
 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 
 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 
 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 
 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 
 'Moon', 'to', 'rise.', ''] 

from collections import Counter
c = Counter(word_list)
c.most_common(3)

Which should output

[('Jellicle', 6), ('Cats', 5), ('are', 3)]


nltk is convenient for a lot of language processing stuff. It has methods for frequency distribution built in. Something like:

import nltk
fdist = nltk.FreqDist(your_list) # creates a frequency distribution from a list
most_common = fdist.max()    # returns a single element
top_three = fdist.keys()[:3] # returns a list


A simple, two-line solution to this, which does not require any extra modules is the following code:

lst = ['Jellicle', 'Cats', 'are', 'black', 'and','white,',
       'Jellicle', 'Cats','are', 'rather', 'small;', 'Jellicle', 
       'Cats', 'are', 'merry', 'and','bright,', 'And', 'pleasant',    
       'to','hear', 'when', 'they', 'caterwaul.','Jellicle', 
       'Cats', 'have','cheerful', 'faces,', 'Jellicle',
       'Cats','have', 'bright', 'black','eyes;', 'They', 'like',
       'to', 'practise','their', 'airs', 'and', 'graces', 'And', 
       'wait', 'for', 'the', 'Jellicle','Moon', 'to', 'rise.', '']

lst_sorted=sorted([ss for ss in set(lst) if len(ss)>0 and ss.istitle()], 
                   key=lst.count, 
                   reverse=True)
print lst_sorted[0:3]

Output:

['Jellicle', 'Cats', 'And']

The term in squared brackets returns all unique strings in the list, which are not empty and start with a capital letter. The sorted() function then sorts them by how often they appear in the list (by using the lst.count key) in reverse order.


There's two standard library ways to find the most frequent value in a list:

statistics.mode:

from statistics import mode
most_common = mode([3, 2, 2, 2, 1, 1])  # 2
most_common = mode([3, 2])  # StatisticsError: no unique mode
  • Raises an exception if there's no unique most frequent value
  • Only returns single most frequent value

collections.Counter.most_common:

from collections import Counter
most_common, count = Counter([3, 2, 2, 2, 1, 1]).most_common(1)[0]  # 2, 3
(most_common_1, count_1), (most_common_2, count_2) = Counter([3, 2, 2]).most_common(2)  # (2, 2), (3, 1)
  • Can return multiple most frequent values
  • Returns element count as well

So in the case of the question, the second one would be the right choice. As a side note, both are identical in terms of performance.


The simple way of doing this would be (assuming your list is in 'l'):

>>> counter = {}
>>> for i in l: counter[i] = counter.get(i, 0) + 1
>>> sorted([ (freq,word) for word, freq in counter.items() ], reverse=True)[:3]
[(6, 'Jellicle'), (5, 'Cats'), (3, 'to')]

Complete sample:

>>> l = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
>>> counter = {}
>>> for i in l: counter[i] = counter.get(i, 0) + 1
... 
>>> counter
{'and': 3, '': 1, 'merry': 1, 'rise.': 1, 'small;': 1, 'Moon': 1, 'cheerful': 1, 'bright': 1, 'Cats': 5, 'are': 3, 'have': 2, 'bright,': 1, 'for': 1, 'their': 1, 'rather': 1, 'when': 1, 'to': 3, 'airs': 1, 'black': 2, 'They': 1, 'practise': 1, 'caterwaul.': 1, 'pleasant': 1, 'hear': 1, 'they': 1, 'white,': 1, 'wait': 1, 'And': 2, 'like': 1, 'Jellicle': 6, 'eyes;': 1, 'the': 1, 'faces,': 1, 'graces': 1}
>>> sorted([ (freq,word) for word, freq in counter.items() ], reverse=True)[:3]
[(6, 'Jellicle'), (5, 'Cats'), (3, 'to')]

With simple I mean working in nearly every version of python.

if you don't understand some of the functions used in this sample, you can always do this in the interpreter (after pasting the code above):

>>> help(counter.get)
>>> help(sorted)


The answer from @Mark Byers is best, but if you are on a version of Python < 2.7 (but at least 2.5, which is pretty old these days), you can replicate the Counter class functionality very simply via defaultdict (otherwise, for python < 2.5, three extra lines of code are needed before d[i] +=1, as in @Johnnysweb's answer).

from collections import defaultdict
class Counter():
    ITEMS = []
    def __init__(self, items):
        d = defaultdict(int)
        for i in items:
            d[i] += 1
        self.ITEMS = sorted(d.iteritems(), reverse=True, key=lambda i: i[1])
    def most_common(self, n):
        return self.ITEMS[:n]

Then, you use the class exactly as in Mark Byers's answer, i.e.:

words_to_count = (word for word in word_list if word[:1].isupper())
c = Counter(words_to_count)
print c.most_common(3)


I will like to answer this with numpy, great powerful array computation module in python.

Here is code snippet:

import numpy
a = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 
 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 
 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 
 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 
 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 
 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 
 'Moon', 'to', 'rise.', '']
dict(zip(*numpy.unique(a, return_counts=True)))

Output

{'': 1, 'And': 2, 'Cats': 5, 'Jellicle': 6, 'Moon': 1, 'They': 1, 'airs': 1, 'and': 3, 'are': 3, 'black': 2, 'bright': 1, 'bright,': 1, 'caterwaul.': 1, 'cheerful': 1, 'eyes;': 1, 'faces,': 1, 'for': 1, 'graces': 1, 'have': 2, 'hear': 1, 'like': 1, 'merry': 1, 'pleasant': 1, 'practise': 1, 'rather': 1, 'rise.': 1, 'small;': 1, 'the': 1, 'their': 1, 'they': 1, 'to': 3, 'wait': 1, 'when': 1, 'white,': 1}

Output is in dictionary object in format of (key, value) pairs, where value is count of particular word

This answer is inspire by another answer on stackoverflow, you can view it here


If you are using Count, or have created your own Count-style dict and want to show the name of the item and the count of it, you can iterate around the dictionary like so:

top_10_words = Counter(my_long_list_of_words)
# Iterate around the dictionary
for word in top_10_words:
        # print the word
        print word[0]
        # print the count
        print word[1]

or to iterate through this in a template:

{% for word in top_10_words %}
        <p>Word: {{ word.0 }}</p>
        <p>Count: {{ word.1 }}</p>
{% endfor %}

Hope this helps someone

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜