
Large free block of English non-pronoun text

As part of teaching myself Python I've written a script which lets a user play hangman. At the moment, the word to be guessed is simply hard-coded at the start of the script.

Instead, I want the script to choose randomly from a large list of English words. This I know how to do. My problem is finding that list of words in the first place.

Does anyone know of a source on the net for, say, 1000 common English words that can be downloaded as a block of text, or something similar that I can work with?

(My initial thought was grabbing a chunk of a novel from Project Gutenberg [this project is only for my own amusement and won't be available anywhere else, so copyright etc. doesn't matter hugely to me, btw], but anything like that is likely to contain too many names or non-standard words that wouldn't be suitable for hangman. Basically, I need text that only has words legal for use in Scrabble.)

It's a slightly odd question for here, I suppose, but I thought the answer might be of use not just to me but to anyone else working on a word game or similar project that needs a large seed list of words to work from.

Many thanks for any links or suggestions :)


Would this be useful?


Have you tried /usr/share/dict/words?
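On many Unix-like systems that file is a plain-text word list with one word per line. Here is a minimal sketch of loading it for hangman, assuming the file exists at that path (it is not present on every system) and that you only want lowercase, purely alphabetic words. The function names and length limits are illustrative, not part of any library:

```python
def filter_words(lines, min_len=5, max_len=10):
    """Keep lowercase, purely alphabetic ASCII words within the length limits.

    Skipping capitalised entries drops most proper nouns; skipping entries
    with apostrophes or hyphens drops possessives and compounds.
    """
    words = []
    for line in lines:
        word = line.strip()
        if (min_len <= len(word) <= max_len
                and word.isascii() and word.isalpha() and word.islower()):
            words.append(word)
    return words


def load_words(path="/usr/share/dict/words"):
    """Read the system word list (assumed location; it varies by system)."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return filter_words(f)
```

In the hangman script, something like random.choice(load_words()) would then pick the secret word.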


Create text list manually

Grab text from Project Gutenberg, Wikipedia or some other source. Go through the text and count how many times each word occurs. The most frequent words will be pronouns, conjunctions, etc. Just throw them out.
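A rough sketch of that counting step, assuming the source text is already in a string. The helper names and the cutoff n_drop are made up for illustration, and the right cutoff depends on your text:

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphabetic tokens in a block of text.

    The simple pattern splits on apostrophes too, so "don't" becomes
    "don" and "t"; good enough for a first pass.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def drop_most_frequent(counts, n_drop):
    """Return the words left after discarding the n_drop most frequent ones."""
    too_common = {word for word, _ in counts.most_common(n_drop)}
    return [word for word in counts if word not in too_common]
```

For example, drop_most_frequent(word_counts(text), 100) would discard the 100 most frequent words and keep the rest.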

Proper nouns will likely be the least frequent words, unless of course your text is a story, in which case the character names will likely occur quite often. Probably the best way to handle proper nouns is to use many sources and count how many sources each word is found in. Essentially, words that are common to a lot of different sources are unlikely to be proper nouns. Words that are specific to one source you can throw out. This idea is related to TF-IDF.
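The "how many sources" count could be sketched like this. It is only the document-frequency half of TF-IDF, not a full TF-IDF score, and the function names and the min_sources threshold are assumptions for illustration:

```python
import re
from collections import Counter


def document_frequency(sources):
    """Map each word to the number of source texts it appears in."""
    df = Counter()
    for text in sources:
        # set() so a word counts once per source, however often it repeats
        df.update(set(re.findall(r"[a-z]+", text.lower())))
    return df


def words_in_many_sources(sources, min_sources=2):
    """Keep words found in at least min_sources different texts."""
    df = document_frequency(sources)
    return sorted(word for word, n in df.items() if n >= min_sources)
```

With three short texts where only one mentions "Alice", that name appears in a single source and is dropped, while words shared between texts survive.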

Once you have calculated these word frequencies, it's also easy to just look over the words, and tweak your list as necessary.

Use WordNet

Another idea is to download words from WordNet. WordNet records the part of speech for a lot of words. You could just stick to nouns and verbs for your purpose.
