How do I count words in an nltk plaintextcorpus faster?

2023-01-19 16:03 问答作者：

I have a set of documents, and I want to return a list of tuples where each tuple has the date of a given document and the number of times a given search term appears in that document. My code (below) works, but is slow, and I'm a n00b. Are there obvious ways to make this faster? Any help would be much appreciated, mostly so that I can learn better coding, but also so that I can get this project done faster!

def searchText(searchword):
    counts = []
    corpus_root =开发者_StackOverflow 'some_dir'
    wordlists = PlaintextCorpusReader(corpus_root, '.*')
    for id in wordlists.fileids():
        date = id[4:12]
        month = date[-4:-2]
        day = date[-2:]
        year = date[:4]
        raw = wordlists.raw(id)
        tokens = nltk.word_tokenize(raw)
        text = nltk.Text(tokens)
        count = text.count(searchword)
        counts.append((month, day, year, count))

    return counts

If you just want a frequency of word counts, then you don't need to create nltk.Text objects, or even use nltk.PlainTextReader. Instead, just go straight to nltk.FreqDist.

files = list_of_files
fd = nltk.FreqDist()
for file in files:
    with open(file) as f:
        for sent in nltk.sent_tokenize(f.lower()):
            for word in nltk.word_tokenize(sent):
                fd.inc(word)

Or, if you don't want to do any analysis - just use a dict.

files = list_of_files
fd = {}
for file in files:
    with open(file) as f:
        for sent in nltk.sent_tokenize(f.lower()):
            for word in nltk.word_tokenize(sent):
                try:
                    fd[word] = fd[word]+1
                except KeyError:
                    fd[word] = 1

These could be made much more efficient with generator expressions, but I'm used for loops for readability.

继续阅读：corpus nltk python

How do I count words in an nltk plaintextcorpus faster?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？