
How do I find the frequency count of a word in English using WordNet?

Is there a way to find the frequency of usage of a word in the English language using WordNet or NLTK, in Python?

NOTE: I do not want the frequency count of a word in a given input file. I want the frequency count of a word in general, based on how it is used today.


In WordNet, every Lemma has a frequency count that is returned by the method lemma.count() and that is stored in the file nltk_data/corpora/wordnet/cntlist.rev.

Code example:

from nltk.corpus import wordnet
syns = wordnet.synsets('stack')
for s in syns:
    for l in s.lemmas():
        print l.name + " " + str(l.count())

Result:

stack 2
batch 0
deal 1
flock 1
good_deal 13
great_deal 10
hatful 0
heap 2
lot 13
mass 14
mess 0
...

However, many counts are zero (see the sketch after the Python 3 example below), and neither the source file nor the documentation says which corpus was used to create this data. According to the book Speech and Language Processing by Daniel Jurafsky and James H. Martin, the sense frequencies come from the SemCor corpus, which is a subset of the already small and outdated Brown Corpus.

So it's probably best to choose the corpus that best fits your application and create the data yourself, as Christopher suggested.

To make this Python 3.x compatible, just do:

Code example:

from nltk.corpus import wordnet
syns = wordnet.synsets('stack')
for s in syns:
    for l in s.lemmas():
        print(l.name() + " " + str(l.count()))
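
To see how sparse those lemma counts actually are, a rough sketch like the following tallies zero-count lemmas over all of WordNet using the same lemma.count() API (iterating every synset takes a little while):

from nltk.corpus import wordnet

total = zero = 0
for synset in wordnet.all_synsets():
    for lemma in synset.lemmas():
        total += 1
        if lemma.count() == 0:
            zero += 1

print(zero, "of", total, "lemmas have a count of 0")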


You can sort of do it using the brown corpus, though it's out of date (last revised in 1979), so it's missing lots of current words.

import nltk
from nltk.corpus import brown
from nltk.probability import FreqDist

# Python 2 / NLTK 2.x style; FreqDist.inc() no longer exists in NLTK 3
# (a Python 3 version of this answer appears further down).
words = FreqDist()

for sentence in brown.sents():
    for word in sentence:
        words.inc(word.lower())

print words["and"]       # raw count of "and"
print words.freq("and")  # relative frequency of "and"

You could then cPickle the FreqDist off to a file for faster loading later.
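
In Python 3 that pickling step would look roughly like this (a minimal sketch; the filename word_freq.pkl is arbitrary, and a fuller version appears in the last answer below):

from pickle import dump, load

# Save the FreqDist once...
with open("word_freq.pkl", "wb") as f:
    dump(words, f)

# ...and load it back later instead of re-counting the corpus.
with open("word_freq.pkl", "rb") as f:
    words = load(f)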

A corpus is basically just a file full of sentences, one per line, and there are lots of other corpora out there, so you could probably find one that fits your purpose. A couple of other sources of more current corpora: Google, American National Corpus.

You can also supposedly get a current list of the top 60,000 words and their frequencies from the Corpus of Contemporary American English.


Check out this site for word frequencies: http://corpus.byu.edu/coca/

Somebody compiled a list of words taken from opensubtitles.org (movie scripts). A free plain-text file formatted like this is available for download, in many different languages.

you 6281002
i 5685306
the 4768490
to 3453407
a 3048287
it 2879962

http://invokeit.wordpress.com/frequency-word-lists/
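
If you grab one of those lists, turning it into a lookup table takes only a few lines. A minimal sketch, assuming the downloaded file is named en.txt and keeps the "word count" format shown above:

# Load an opensubtitles-style frequency list: one "word count" pair per line.
word_counts = {}
with open("en.txt", encoding="utf-8") as f:  # filename is an assumption
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            word_counts[parts[0]] = int(parts[1])

print(word_counts.get("you"))  # e.g. 6281002 in the sample above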


You can't really do this, because it depends so much on the context. Not only that, for less frequent words the frequency will be wildly dependent on the sample.

Your best bet is probably to find a large corpus of text of the given genre (e.g. download a hundred books from Project Gutenberg) and count the words yourself.
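
As a rough sketch of that approach, here is the counting part using the small Project Gutenberg sample that ships with NLTK (only 18 texts, so far from a hundred books, but the pattern is the same for any plain-text collection you download yourself):

from collections import Counter

import nltk
from nltk.corpus import gutenberg

nltk.download('gutenberg', quiet=True)

# Count lowercased alphabetic tokens across the bundled Gutenberg texts.
counts = Counter(w.lower() for w in gutenberg.words() if w.isalpha())

print(counts["and"])                         # raw count
print(counts["and"] / sum(counts.values()))  # relative frequency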


Take a look at the Information Content section of the WordNet Similarity project at http://wn-similarity.sourceforge.net/. There you will find databases of word frequencies (or, rather, information content, which is derived from word frequency) of WordNet lemmas, calculated from several different corpora. The source code is in Perl, but the databases are provided independently and can easily be used with NLTK.
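
NLTK itself bundles two such information-content files (computed from the Brown corpus and from SemCor) in its wordnet_ic corpus, so a minimal sketch of loading one and using it looks like this:

import nltk
from nltk.corpus import wordnet, wordnet_ic

nltk.download('wordnet_ic', quiet=True)

# Information content derived from Brown corpus frequencies.
brown_ic = wordnet_ic.ic('ic-brown.dat')

dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
print(dog.res_similarity(cat, brown_ic))  # Resnik similarity uses the IC values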


The Wiktionary project has a few frequency lists based on TV scripts and Project Gutenberg, but their format is not particularly nice for parsing.


You can download the word vectors glove.6B.zip from https://github.com/stanfordnlp/GloVe, unzip them and look at the file glove.6B.50d.txt.

There, you will find 400,000 English words, one per line (plus 50 numbers per word on the same line), lowercased and sorted from most frequent (the) to least frequent. You can build a frequency rank of words by reading this file as raw text or with pandas.

It's not perfect, but I have used it in the past. The same website provides other files with up to 2.2m English words, cased.
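
A minimal sketch of turning that file into a frequency rank (it assumes glove.6B.50d.txt sits in the working directory; the word is the first token on each line, and the 50 numbers after it can be ignored for this purpose):

# Build word -> rank (0 = most frequent) from the GloVe vocabulary ordering.
rank = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        rank[line.split(" ", 1)[0]] = i

print(rank["the"])       # 0, the most frequent word
print(rank.get("stack"))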


Python 3 version of Christopher Pickslay's solution (incl. saving frequencies to tempdir):

from pathlib import Path
from pickle import dump, load
from tempfile import gettempdir

from nltk.probability import FreqDist


def get_word_frequencies() -> FreqDist:
  tmp_path = Path(gettempdir()) / "word_freq.pkl"
  if tmp_path.exists():
    with tmp_path.open(mode="rb") as f:
      word_frequencies = load(f)
  else:
    from nltk import download
    download('brown', quiet=True)
    from nltk.corpus import brown
    word_frequencies = FreqDist(word.lower() for sentence in brown.sents()
                                for word in sentence)
    with tmp_path.open(mode="wb") as f:
      dump(word_frequencies, f)

  return word_frequencies

Usage:

word_frequencies = get_word_frequencies()

print(word_frequencies["and"])
print(word_frequencies.freq("and"))

Output:

28853
0.02484774266443448
