Where can I find a dump of raw text on the web?
I am looking to do some text analysis in a program I am writing. I am looking for alternate sources of raw text, similar to what is provided in the Wikipedia dumps (download.wikimedia.com).
I'd rather not have to go through the trouble of crawling websites, parsing the HTML, extracting the text, and so on.
What sort of text are you looking for?
There are many free e-books (fiction and non-fiction) in .txt format available at Project Gutenberg.
They also have large DVD images full of books available for download.
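If you go the Gutenberg route, the .txt files usually wrap the book in a license header and footer, which you probably want to drop before analysis. Here is a rough sketch of that, assuming the wrapper is delimited by lines starting with "*** START OF THE/THIS PROJECT GUTENBERG EBOOK" and "*** END OF ..." (check your files, the wording has varied over the years). The function names are my own, not part of any library.

```python
import re
from collections import Counter

# Assumed marker format; adjust the patterns if your files differ.
_START = re.compile(r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
                    re.IGNORECASE | re.MULTILINE)
_END = re.compile(r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
                  re.IGNORECASE | re.MULTILINE)


def strip_gutenberg_boilerplate(raw):
    """Return the text between the start and end marker lines.

    If a marker is missing, fall back to the start or end of the input.
    """
    start = _START.search(raw)
    begin = start.end() if start else 0
    # Only look for the end marker after the start marker.
    end = _END.search(raw, begin)
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()


def word_counts(text):
    """Count lowercase word tokens (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


if __name__ == "__main__":
    sample = (
        "License header\n"
        "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
        "Call me Ishmael. Call me.\n"
        "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
        "License footer\n"
    )
    body = strip_gutenberg_boilerplate(sample)
    print(body)
    print(word_counts(body).most_common(3))
```

For a real book, read the downloaded file with open(path, encoding="utf-8") and pass its contents to strip_gutenberg_boilerplate; the tokenizer here is deliberately crude and only meant as a starting point.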
NLTK provides a simple Python API to access many text corpora, including Gutenberg, Reuters, Shakespeare, and others.
>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
Project Gutenberg has a huge number of e-books in various formats, including plain text.