开发者

Word Frequency in text using Python but disregard stop words

This gives me a frequency of words in a text:

 fullWords = re.findall(r'\w+', allText)

 d = defaultdict(int)

 for word in fullWords :
          d[word] += 1

 finalFreq = sorted(d.iteritems(),开发者_开发问答 key = operator.itemgetter(1), reverse=True)

 self.response.out.write(finalFreq)

This also gives me useless words like "the" "an" "a"

My question is, is there a stop words library available in python which can remove all these common words? I want to run this on google app engine


You can download lists of stopwords as files in various formats, e.g. from here -- all Python needs to do is to read the file (and these are in csv format, easily read with the csv module), make a set, and use membership in that set (probably with some normalization, e.g., lowercasing) to exclude words from the count.


There's an easy way to handle this by slightly modifying the code you have (edited to reflect John's comment):

stopWords = set(['a', 'an', 'the', ...])
fullWords = re.findall(r'\w+', allText)
d = defaultdict(int)
for word in fullWords:
    if word not in stopWords:
        d[word] += 1
finalFreq = sorted(d.iteritems(), key=lambda t: t[1], reverse=True)
self.response.out.write(finalFreq)

This approach constructs the sorted list in two steps: first it filters out any words in your desired list of "stop words" (which has been converted to a set for efficiency), then it sorts the remaining entries.


I know that NLTK has a package with a corpus and the stopwords for many languages, including English, see here for more information. NLTK has also a word frequency counter, it's a nice module for natural language processing that you should consider to use.


stopwords = set(['an', 'a', 'the']) # etc...
finalFreq = sorted((k,v) for k,v in d.iteritems() if k not in stopwords,
                      key = operator.itemgetter(1), reverse=True)

This will filter out any keys which are in the stopwords set.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜