I am looking to develop a management and administration solution around our web-crawling Perl scripts. Right now the scripts are stored in SVN and are kicked off manually by sysadmins, developers, etc.
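In the absence of a dedicated scheduler, one common first step is a cron-driven wrapper that refreshes the SVN working copy and then runs each script with logging. A minimal Perl sketch, in which every path and script name is a hypothetical placeholder:

    #!/usr/bin/perl
    # Cron wrapper sketch: update the SVN working copy, run one crawler
    # script, and log start/finish. All paths below are placeholders.
    use strict;
    use warnings;

    my $wc     = '/opt/crawlers';          # SVN working copy (assumption)
    my $script = "$wc/bin/crawl_site.pl";  # hypothetical crawler script

    system('svn', 'update', $wc) == 0 or die "svn update failed: $?";

    open my $log, '>>', '/var/log/crawlers/crawl_site.log'
        or die "cannot open log: $!";
    print {$log} scalar(localtime), " starting $script\n";
    my $rc = system($^X, $script) >> 8;
    print {$log} scalar(localtime), " finished, exit code $rc\n";
    close $log;

From a wrapper like this, scheduling and status reporting can be layered on without touching the crawl scripts themselves.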
What are the best practices, and which libraries can I use, to type a query into the search box of an external website and collect the search results?
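In Perl, WWW::Mechanize is a common choice for exactly this: it fetches a page, fills in a form field, submits, and exposes the links on the result page. A sketch, assuming (both are assumptions) that the target site's search form is the first form on the page and that its text input is named q:

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new( agent => 'Mozilla/5.0' );
    $mech->get('https://example.com/');       # hypothetical target site

    $mech->submit_form(
        form_number => 1,                     # assumes the search form is first
        fields      => { q => 'my query' },   # assumes the box is named "q"
    );

    # Result extraction depends on the site's markup; this just dumps links.
    for my $link ( $mech->find_all_links() ) {
        printf "%s => %s\n", $link->text // '', $link->url_abs;
    }

Whatever library you pick, check the target site's robots.txt and terms of service before scraping its results.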
I am using the Java-based Nutch web-search software. In order to prevent duplicate (URL) results from being returned in my search query results, I am trying to remove (a.k.a. normalize) the expression
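In Nutch, this kind of rewriting is typically done with the urlnormalizer-regex plugin, whose rules live in conf/regex-normalize.xml. Since the expression in the question is cut off, the rule below is only an illustration: it strips a jsessionid path parameter so session-variant URLs collapse to a single form:

    <?xml version="1.0"?>
    <!-- conf/regex-normalize.xml (urlnormalizer-regex plugin).
         Hypothetical rule: drop ";jsessionid=..." from URLs. -->
    <regex-normalize>
      <regex>
        <pattern>;jsessionid=[0-9A-Za-z]+</pattern>
        <substitution></substitution>
      </regex>
    </regex-normalize>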
I am using Nutch to crawl websites, and strangely, for one of my websites the crawl returns only two URLs: the home page URL (http://mysite.com/) and one other.
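A frequent cause of a crawl that stops at the seed is a URL filter that admits only the home page, so the filter file is worth checking first (conf/crawl-urlfilter.txt for the Nutch 1.x crawl command, conf/regex-urlfilter.txt otherwise). A tutorial-style sketch, with the domain as a placeholder:

    # conf/crawl-urlfilter.txt -- accept everything under the site, skip the rest.
    # "mysite.com" stands in for the actual domain.
    +^http://([a-z0-9]*\.)*mysite.com/
    -.

Crawl -depth and -topN limits, robots.txt restrictions, and pages reachable only through JavaScript-generated links are the other usual suspects.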
I am using Nutch-1.0 and I am getting this log entry:

    2009-11-12 22:13:11,093 INFO httpclient.HttpMethodDirector - Redirect requested but followRedirects is disabled.
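This message comes from the underlying HttpClient library: Nutch's protocol-httpclient plugin deliberately disables the library's automatic redirect handling and manages redirects itself, governed by the http.redirect.max property. If redirected pages are being skipped, raising that value in conf/nutch-site.xml is the usual fix; a sketch:

    <!-- conf/nutch-site.xml: follow up to 3 redirects during the fetch
         itself. With a value of 0, the redirect target is only recorded
         for a later fetch round instead of being followed immediately. -->
    <property>
      <name>http.redirect.max</name>
      <value>3</value>
    </property>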
I want to know how I can crawl PDF files served on the internet over HTTP using Nutch-1.0.
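Two settings usually control this: the parse-pdf plugin must be listed in plugin.includes, and .pdf must not be excluded by the URL filters (check regex-urlfilter.txt and suffix-urlfilter.txt for a pdf entry). A conf/nutch-site.xml sketch based on a stock Nutch 1.0 plugin list, which may differ from yours:

    <!-- conf/nutch-site.xml: the default plugin list with parse-pdf added. -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>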