Finding the common elements of a list
Hi as per the earlier post.
Given the following list:
['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
I am t开发者_Go百科rying to count how many times each word THAT STARTS WITH A CAPITAL appears and display the top 3.
I am not interested in words that do not start with a capital.
If a word appears multiple times, sometimes starting with a capital and sometimes not, only count the times it does with a capital.
THis is what my code looks like currently:
words = ""
for word in open('novel.txt', 'rU'):
words += word
words = words.split(' ')
words= list(words)
words = ('\n'.join(words)).split('\n')
word_counter = {}
for word in words:
if word in word_counter:
word_counter[word] += 1
else:
word_counter[word] = 1
popular_words = sorted(word_counter, key = word_counter.get, reverse = True)
top_3 = popular_words[:3]
matches = []
for i in range(3):
print word_counter[top_3[i]], top_3[i]
#uncomment to produce the word file
##words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
##open('novel.txt','w').write('\n'.join(words))
import string
cap_words = [word.strip(string.punctuation) for word in open('novel.txt').read().split() if word.istitle()]
##print(cap_words) # debug
try:
from collections import Counter # Python >= 2.7
print('Counter')
print(Counter(cap_words).most_common(3))
except ImportError:
print('Normal dict')
wordcount= dict()
for word in cap_words:
wordcount[word] = (wordcount[word] + 1
if word in wordcount
else 1)
print(sorted(wordcount.items(), key = lambda x: x[1], reverse = True)[:3])
I do not get it why you would like to keep different kinds of line termination with 'rU' mode. I would use normally normal open as I wrote in edited code above. EDIT: You had words together with punctuation, so cleaned up those with strip()
print "\n".join(sorted(["%d %s" % (lst.count(i), i) \
for i in set(lst) if i.istitle()])[-3:])
2 And
5 Cats
6 Jellicle
Here are some additional comments:
words = ""
for word in open('novel.txt', 'rU'):
words += word
words = words.split(' ')
words= list(words)
words = ('\n'.join(words)).split('\n')
can be replaced with:
text = open('novel.txt', 'rU').read() # read everything
wordlist = text.split() # split on all whitespace
But you don't use your 'must start with a capital letter' requirement yet. Time to add:
capwordlist = (word for word in wordlist if word.istitle())
istitle()
means word[0].isupper() and word[1:].islower()
. This means 'SO'.istitle() -> False
.
That might work for you, but maybe you just want the word[0].isupper()
instead.
This part is good if you can't use collections.Counter
(new in 2.7)
word_counter = {}
for word in capwordlist:
if word in word_counter:
word_counter[word] += 1
else:
word_counter[word] = 1
popular_words = sorted(word_counter, key = word_counter.get, reverse = True)
top_3 = popular_words[:3]
else this simply becomes:
from collections import Counter
word_counter = Counter(capwords)
top_3 = word_counter.most_common(3) # gives `word, count` pairs!
And this:
for i in range(3):
print word_counter[top_3[i]], top_3[i]
can be this:
for word in top_3:
print word_counter[word], word
One thing i'd avoid is reading all the words in before processing. It will work, but IMHO, it's better not to do that if you don't need to, and you don't. Here's my solution (with elements liberally stolen from previous ones!), done with 2.6.2:
import sys
# a generator function which iterates over the words in a file
def words(f):
for line in f:
for word in line.split():
yield word
# returns a generator expression filtering an iterator down to titlecase words
def titles(s):
return (word for word in s if word.istitle())
# count the titlecase words in the file
count = {}
for word in titles(words(file(sys.argv[1]))):
count[word] = count.get(word, 0) + 1
# build a list of tuples with the count for each word
countsAndWords = [(kv[1], kv[0]) for kv in count.iteritems()]
# put them in decreasing order
countsAndWords.sort()
countsAndWords.reverse()
# print the top three
for count, word in countsAndWords[:3]:
print word, count
I do a sort of decorate-sort-undecorate on the counts rather than doing the sort with a comparator which does lookups in the count dictionary; it's less elegant, but i believe it will be faster. That's probably a sinful thing to do.
In general, word[0].isupper() will tel you if a word starts with an uppercase letter. Combine this into a list comprehension (or your loop)
[x for x in my_list if x[0].isupper()]
(assuming there are no empty strings)
and you get all words starting with an uppercase letter.
Since you aren't using Python2.7 and don't have Counter
from collections import defaultdict
counter = defaultdict(int)
words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
for word in (word for word in words if word[0].isupper()):
counter[word]+=1
print counter
you could use itertools
import itertools
words = ['Jellicle', 'Cats', 'are', 'black', 'and', 'white,', 'Jellicle', 'Cats', 'are', 'rather', 'small;', 'Jellicle', 'Cats', 'are', 'merry', 'and', 'bright,', 'And', 'pleasant', 'to', 'hear', 'when', 'they', 'caterwaul.', 'Jellicle', 'Cats', 'have', 'cheerful', 'faces,', 'Jellicle', 'Cats', 'have', 'bright', 'black', 'eyes;', 'They', 'like', 'to', 'practise', 'their', 'airs', 'and', 'graces', 'And', 'wait', 'for', 'the', 'Jellicle', 'Moon', 'to', 'rise.', '']
capwords = (word for word in words if len(word) > 1 and word[0].isupper())
capwordssorted = sorted(capwords)
wordswithcounts = ((k,len(list(g))) for (k,g) in itertools.groupby(capwordssorted))
print sorted(wordswithcounts,key=lambda x:x[1],reverse=True)[:3]
精彩评论