Possible anti-crawler mechanism?

For an educational NLP project I need a list of all Italian words. I thought I would write a crawler that collects the words from www.wordreference.com. I use Python with the mechanize crawler framework, but when I run this code:

 import mechanize

 br = mechanize.Browser()
 br.open("http://www.wordreference.com/iten/abaco")
 html = br.response().get_data()  # body of the page that was actually returned
 print(html)

I get a page from "yahoo.com" instead. Is it possible that this website has an anti-crawler mechanism?
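
One way to narrow this down is to compare the URL that was requested with the URL the response finally came from. The following is only a minimal sketch: the helper names `_site` and `redirected_off_site` are made up for illustration, and the "yahoo.com" URL in the usage note is a placeholder rather than an observed redirect target.

```python
from urllib.parse import urlsplit


def _site(url):
    """Return the lower-cased host name of a URL, ignoring a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def redirected_off_site(requested_url, final_url):
    """True if the response was served by a different host than the one requested."""
    return _site(requested_url) != _site(final_url)
```

With mechanize, the final URL should be available from `br.geturl()` after `br.open(...)`, so `redirected_off_site("http://www.wordreference.com/iten/abaco", br.geturl())` would tell you whether the request was sent somewhere else. A redirect to another host does not prove that crawlers are being blocked, but it is a hint worth checking before looking at headers such as the User-Agent.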


I would suggest using existing datasets instead. Here are a few examples from this ACL wiki page:

Corpora:

  • ...
  • Oxford Text Archive Corpus of Italian Newspapers ...
  • ...

WordNets

  • EuroWordNet
  • MultiWordNet - a multilingual lexical database in which the Italian WordNet is strictly aligned with Princeton WordNet 1.6 ...

Please check the full list on the ACL wiki page. You should be able to find an Italian corpus there from which you can derive a list of Italian words.
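
Once you have the text of a corpus, deriving a word list can be sketched roughly as follows. This is an illustration under assumptions: the function name `build_vocabulary` and the `min_count` parameter are hypothetical, the tokenizer is a plain regular expression (so "l'abaco" is split into "l" and "abaco"), and loading a specific corpus is left out because each one has its own format.

```python
import re
from collections import Counter

# Runs of Unicode letters: any word character that is not a digit or underscore.
WORD_RE = re.compile(r"[^\W\d_]+")


def build_vocabulary(text, min_count=1):
    """Return the sorted list of distinct lower-cased words in `text`.

    Words occurring fewer than `min_count` times are dropped, which can
    filter out some typos and one-off tokens in a large corpus.
    """
    counts = Counter(word.lower() for word in WORD_RE.findall(text))
    return sorted(word for word, n in counts.items() if n >= min_count)
```

For example, `build_vocabulary(corpus_text, min_count=2)` would keep only words that appear at least twice, where `corpus_text` is whatever string you read from the corpus files.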
