Where can I find a dump of raw text on the web?

I want to do some text analysis in a program I am writing, and I am looking for alternative sources of raw text, similar to what is provided in the Wikipedia dumps (download.wikimedia.com).

I'd rather not have to go through the trouble of crawling websites, parsing the HTML, extracting the text, and so on.


What sort of text are you looking for?

There are many free e-books (fiction and non-fiction) in .txt format available at Project Gutenberg.

They also have large DVD images full of books available for download.
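Once you have one of those .txt files, a rough sketch like the one below can pull out the book text and count words. It assumes the file wraps the book between "*** START OF ..." and "*** END OF ..." marker lines, which is how Gutenberg plain-text files are commonly laid out; the exact marker wording varies between files, so treat the patterns as an assumption. The function names and the sample string are made up for illustration.

```python
import re
from collections import Counter

# Assumed marker lines; the wording differs between Gutenberg files,
# so these patterns only match on the leading "*** START OF" / "*** END OF".
START_RE = re.compile(r"^\*\*\* ?START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF.*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, if present."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    # Ignore an end marker that appears before the start marker.
    stop = end.start() if end and end.start() >= begin else len(text)
    return text[begin:stop].strip()


def word_counts(text):
    """Count lowercase word tokens (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Tiny stand-in for a downloaded book, so the sketch runs without a network.
sample = (
    "Header and licence notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "The cat sat. The cat ran.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Trailing licence text\n"
)

body = strip_gutenberg_boilerplate(sample)
counts = word_counts(body)
print(counts.most_common(2))
```

For a real book, read the file with open(path, encoding="utf-8") and pass its contents in place of sample. If neither marker is found, the whole text is returned unchanged.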


NLTK provides a simple Python API to access many text corpora, including Gutenberg, Reuters, Shakespeare, and others.

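>>> # One-time setup if the corpus is not installed: import nltk; nltk.download('brown')
>>> from nltk.corpus import brown
>>> brown.words()  # the Brown corpus as a sequence of word tokens
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]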


Project Gutenberg has a huge number of e-books in various formats (including plain text).
