Finding the common elements of a list

2023-01-13 19:13 问答作者：

Hi as per the earlier post.

Given the following list:

['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']

I am t开发者_Go百科rying to count how many times each word THAT STARTS WITH A CAPITAL appears and display the top 3.

I am not interested in words that do not start with a capital.

If a word appears multiple times, sometimes starting with a capital and sometimes not, only count the times it does with a capital.

THis is what my code looks like currently:

words = ""
for word in open('novel.txt', 'rU'):
      words += word
words = words.split(' ')
words= list(words)
words = ('\n'.join(words)).split('\n')

word_counter = {}

for word in words:

        if word in word_counter:
            word_counter[word] += 1
        else:
            word_counter[word] = 1      
popular_words = sorted(word_counter, key = word_counter.get, reverse = True)
top_3 = popular_words[:3]

matches = []

for i in range(3):

      print word_counter[top_3[i]], top_3[i]

#uncomment to produce the word file
##words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
##open('novel.txt','w').write('\n'.join(words))

import string
cap_words = [word.strip(string.punctuation) for word in open('novel.txt').read().split() if word.istitle()]
##print(cap_words) # debug
try:
    from collections import Counter # Python >= 2.7
    print('Counter')
    print(Counter(cap_words).most_common(3))
except ImportError:
    print('Normal dict')
    wordcount= dict()
    for word in cap_words:
         wordcount[word] = (wordcount[word] + 1
                            if word in wordcount
                            else 1)
    print(sorted(wordcount.items(), key = lambda x: x[1], reverse = True)[:3])

I do not get it why you would like to keep different kinds of line termination with 'rU' mode. I would use normally normal open as I wrote in edited code above. EDIT: You had words together with punctuation, so cleaned up those with strip()

print "\n".join(sorted(["%d %s" % (lst.count(i), i) \
             for i in set(lst) if i.istitle()])[-3:])
2 And
5 Cats
6 Jellicle

Here are some additional comments:

words = ""
for word in open('novel.txt', 'rU'):
      words += word
words = words.split(' ')
words= list(words)
words = ('\n'.join(words)).split('\n')

can be replaced with:

text = open('novel.txt', 'rU').read() # read everything
wordlist = text.split() # split on all whitespace

But you don't use your 'must start with a capital letter' requirement yet. Time to add:

capwordlist = (word for word in wordlist if word.istitle())

istitle() means word[0].isupper() and word[1:].islower(). This means 'SO'.istitle() -> False.

That might work for you, but maybe you just want the word[0].isupper() instead.

This part is good if you can't use collections.Counter (new in 2.7)

word_counter = {}

for word in capwordlist:

        if word in word_counter:
            word_counter[word] += 1
        else:
            word_counter[word] = 1      
popular_words = sorted(word_counter, key = word_counter.get, reverse = True)
top_3 = popular_words[:3]

else this simply becomes:

from collections import Counter

word_counter = Counter(capwords)
top_3 = word_counter.most_common(3) # gives `word, count` pairs!

And this:

for i in range(3):
      print word_counter[top_3[i]], top_3[i]

can be this:

for word in top_3:
    print word_counter[word], word

One thing i'd avoid is reading all the words in before processing. It will work, but IMHO, it's better not to do that if you don't need to, and you don't. Here's my solution (with elements liberally stolen from previous ones!), done with 2.6.2:

import sys

# a generator function which iterates over the words in a file
def words(f):
    for line in f:
        for word in line.split():
            yield word

# returns a generator expression filtering an iterator down to titlecase words
def titles(s):
    return (word for word in s if word.istitle())

# count the titlecase words in the file
count = {}
for word in titles(words(file(sys.argv[1]))):
    count[word] = count.get(word, 0) + 1

# build a list of tuples with the count for each word
countsAndWords = [(kv[1], kv[0]) for kv in count.iteritems()]

# put them in decreasing order
countsAndWords.sort()
countsAndWords.reverse()

# print the top three
for count, word in countsAndWords[:3]:
    print word, count

I do a sort of decorate-sort-undecorate on the counts rather than doing the sort with a comparator which does lookups in the count dictionary; it's less elegant, but i believe it will be faster. That's probably a sinful thing to do.

In general, word[0].isupper() will tel you if a word starts with an uppercase letter. Combine this into a list comprehension (or your loop)

[x for x in my_list if x[0].isupper()]

(assuming there are no empty strings)

and you get all words starting with an uppercase letter.

Since you aren't using Python2.7 and don't have Counter

from collections import defaultdict
counter = defaultdict(int)
words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
for word in (word for word in words if word[0].isupper()):
    counter[word]+=1
print counter

you could use itertools

import itertools

words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
capwords = (word for word in words if len(word) > 1 and word[0].isupper())
capwordssorted = sorted(capwords)
wordswithcounts = ((k,len(list(g))) for (k,g) in itertools.groupby(capwordssorted))
print sorted(wordswithcounts,key=lambda x:x[1],reverse=True)[:3]

继续阅读：python

Finding the common elements of a list

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？