
Extracting a set of words with Python/NLTK, then comparing it to a standard English dictionary

I have:

from __future__ import division
import nltk, re, pprint
# Read Finnegans Wake, tokenize it, and lowercase every token
f = open('/home/a/Desktop/Projects/FinnegansWake/JamesJoyce-FinnegansWake.txt')
raw = f.read()
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
words = [w.lower() for w in text]

# Do the same for the English reference corpus
f2 = open('/home/a/Desktop/Projects/FinnegansWake/catted-several-long-Russian-novels-and-the-NYT.txt')
englishraw = f2.read()
englishtokens = nltk.wordpunct_tokenize(englishraw)
englishtext = nltk.Text(englishtokens)
englishwords = [w.lower() for w in englishtext]

which is straight from the NLTK manual. What I want to do next is to compare vocab to an exhaustive set of English words, like the OED, and extract the difference -- the set of Finnegans Wake words that have not been, and probably never will be, in the OED. I'm much more of a verbal person than a math-oriented person, so I haven't figured out how to do that yet; the manual goes into way too much detail about stuff I don't actually want to do. I'm assuming it's just one or two more lines of code, though.


If your English dictionary is indeed a set (hopefully of lowercased words),

set(vocab) - english_dictionary

gives you the set of words which are in the vocab set but not in the english_dictionary one. (It's a pity that you turned vocab into a list with that sorted call, since you need to turn it back into a set to perform operations such as this set difference!)
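For instance, with a tiny made-up example (not your actual data, of course):

vocab = set(['riverrun', 'past', 'eve', 'adam'])
english_dictionary = set(['past', 'eve', 'adam', 'the'])
print(vocab - english_dictionary)   # -> set(['riverrun'])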

If your English dictionary is in some different format, not really a set or not comprised only of lowercased words, you'll have to tell us what that format is for us to be able to help!-)
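For example, if it happens to be a plain-text wordlist with one word per line, building the set is just a couple of lines (the path below is purely illustrative -- substitute whichever file you actually have):

# Build a set of lowercased words from a one-word-per-line wordlist
f3 = open('/usr/share/dict/words')
english_dictionary = set(line.strip().lower() for line in f3)
f3.close()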

Edit: given the OP's edit shows that both words (what was previously called vocab) and englishwords (what I previously called english_dictionary) are in fact lists of lowercased words, then

newwords = set(words) - set(englishwords)

or

newwords = set(words).difference(englishwords)

are two ways to express "the set of words that are not englishwords". The former is slightly more concise; the latter is perhaps a bit more readable (since it uses the word "difference" explicitly instead of a minus sign) and perhaps a bit more efficient (since it doesn't explicitly transform the list englishwords into a set). If speed is crucial, though, this needs to be checked by measurement, since "internally" difference still needs to do some kind of "transformation-to-set"-like operation.
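If you do want to measure, a quick sketch along these lines with timeit would do (the numbers will of course vary with your data; the made-up lists below are just for illustration):

import timeit

setup = "words = ['w%d' % i for i in range(50000)]; englishwords = ['w%d' % i for i in range(25000)]"
print(timeit.timeit("set(words) - set(englishwords)", setup=setup, number=100))
print(timeit.timeit("set(words).difference(englishwords)", setup=setup, number=100))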

If you're keen to have a list as the result instead of a set, sorted(newwords) will give you an alphabetically sorted list (list(newwords) would give you a list a bit faster, but in totally arbitrary order, and I suspect you'd rather wait a tiny extra amount of time and get, in return, a nicely alphabetized result;-).
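E.g., a typical way to use the result (just a sketch of follow-up usage) would be to peek at the alphabetized list:

newwords = set(words) - set(englishwords)
for w in sorted(newwords)[:20]:   # first twenty, alphabetically
    print(w)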
