Possible anticrawler
For an educational NLP project I need a list of all Italian words. I thought I would write a crawler that gets the words from www.wordreference.com. I use Python with the mechanize crawler framework, but when I use the code:
import mechanize

br = mechanize.Browser()
br.open("http://www.wordreference.com/iten/abaco")
# Read the body of the response to the last request
html = br.response().get_data()
print(html)
I get some page from "yahoo.com". Is it possible that this website has an anti-crawler mechanism?
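One way to confirm that the page really came from a different host is to compare the requested URL with the URL of the document that was finally returned. The following is only a minimal sketch using the Python 3 standard library; `same_site` is a hypothetical helper, and the final URL is hard-coded here as an assumption (with mechanize it could be read from `br.geturl()` after `br.open(...)`).

```python
from urllib.parse import urlparse


def same_site(requested_url, final_url):
    """Return True if both URLs share the same host, ignoring a leading 'www.'."""
    def host(url):
        name = (urlparse(url).hostname or "").lower()
        return name[4:] if name.startswith("www.") else name
    return host(requested_url) == host(final_url)


requested = "http://www.wordreference.com/iten/abaco"
# Assumed value for illustration; in a real run this would be br.geturl()
final = "http://www.yahoo.com/"
print(same_site(requested, final))
```

If this prints False, the request was redirected to another site somewhere along the way. That alone does not show why it happened.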
I would suggest using existing datasets. Here are a few examples from this ACL wiki page:
Corpora:
- ...
- Oxford Text Archive Corpus of Italian Newspapers ...
- ...
WordNets:
- EuroWordNet
- MultiWordNet - a multilingual lexical database in which the Italian WordNet is strictly aligned with Princeton WordNet 1.6 ...
Please check the full list on the ACL wiki page. I think you should be able to find an Italian corpus there that lets you build your list of Italian words.
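Once you have the text of a corpus, collecting the word forms it contains is straightforward. Here is a rough sketch, assuming the corpus is available as plain Unicode text; `word_counts` and the sample sentence are hypothetical stand-ins, and the tokenization is deliberately naive (it splits elided forms such as "L'abaco" at the apostrophe).

```python
import re
from collections import Counter

# Runs of letters only, including accented ones; digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+", re.UNICODE)


def word_counts(text):
    """Count lowercase word forms in a string of corpus text."""
    return Counter(match.group(0).lower() for match in WORD_RE.finditer(text))


# Hypothetical sample standing in for text read from a corpus file.
sample = "L'abaco è uno strumento. Lo strumento è antico."
counts = word_counts(sample)
print(sorted(counts))
```

Keeping the counts rather than a plain set makes it possible to drop forms that occur only once or twice, which in a newspaper corpus are often typos or proper names.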