Regarding crawling of short URLs using nutch
I am using the Nutch crawler for an application that needs to crawl a set of URLs which I place in the urls
directory, and to fetch only the contents of those URLs.
I am not interested in the contents of the internal or external links.
So I have used the Nutch crawler and run the crawl command with a depth of 1.
bin/nutch crawl urls -dir crawl -depth 1
Nutch crawls the URLs and gives me their contents.
I am reading the content using the readseg utility.
bin/nutch readseg -dump crawl/segments/* arjun -nocontent -nofetch -nogenerate -noparse -noparsedata
With this I am able to fetch the content of each webpage.
The problem I am facing is if I give direct urls like
http://isoc.org/wp/worldipv6day/ http://openhackindia.eventbrite.com http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/ http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php http://bangalore.yahoo.com/labs/summerschool.html http://riadevcamp.eventbrite.com http://www.sleepingtime.org/
then I am able to get the contents of the webpages. But when I give a set of short URLs such as
http://is.gd/jOoAa9 http://is.gd/ubHRAF http://is.gd/GiFqj9 http://is.gd/H5rUhg http://is.gd/wvKINL http://is.gd/K6jTNl http://is.gd/mpa6fr http://is.gd/fmobvj http://is.gd/s7uZf***
I am not able to fetch the contents.
When I read the segments, it is not showing any content. Please find below the content of dump file read from segments.
Recno:: 0
URL:: http://is.gd/0yKjO6
CrawlDatum:: Version: 7 Status: 1 (db_unfetched) Fetch time: Tue Jan 25 20:56:07 IST 2011 Modified time: Thu Jan 01 05:30:00 IST 1970 Retries since fetch: 0 Retry interval: 2592000 seconds (30 days) Score: 1.0 Signature: null Metadata: _ngt_: 1295969171407
Content:: Version: -1 url: http://is.gd/0yKjO6 base: http://is.gd/0yKjO6 contentType: text/html metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14 Content:

Recno:: 1
URL:: http://is.gd/1tpKaN
Content:: Version: -1 url: http://is.gd/1tpKaN base: http://is.gd/1tpKaN contentType: text/html metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14 Content:
CrawlDatum:: Version: 7 Status: 1 (db_unfetched) Fetch time: Tue Jan 25 20:56:07 IST 2011 Modified time: Thu Jan 01 05:30:00 IST 1970 Retries since fetch: 0 Retry interval: 2592000 seconds (30 days) Score: 1.0
I have also tried setting the max.redirects property in nutch-default.xml to 4, but didn't see any progress. Kindly provide me a solution for this problem.
Thanks and regards, Arjun Kumar Reddy
Using Nutch 1.2, try editing the file conf/nutch-default.xml:
find http.redirect.max and change its value to at least 1 instead of the default 0.
<property>
  <name>http.redirect.max</name>
  <value>2</value><!-- instead of 0 -->
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>
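If you prefer not to edit nutch-default.xml directly, the same property can be overridden in conf/nutch-site.xml, which takes precedence over nutch-default.xml. A minimal sketch (the value 2 here is just an example):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.redirect.max</name>
    <value>2</value>
  </property>
</configuration>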
Good luck
You will have to set a depth of 2 or more, because the first fetch returns a 301 (or 302) code. The redirection will be followed on the next iteration, so you have to allow more depth.
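For example, the crawl command from the question could be re-run with a higher depth; something like the following (the exact depth depends on how many redirect hops you expect):

bin/nutch crawl urls -dir crawl -depth 2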
Also, make sure that your regex-urlfilter.txt allows all of the URLs that will be followed (in this case, the redirect targets).
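As a sketch, a restrictive regex-urlfilter.txt that accepts the is.gd short URLs plus the redirect target seen in the dump above might look like this (holykaw.alltop.com is taken from that dump; extend the list for your own targets):

# accept the short-URL host and the hosts it redirects to
+^http://(www\.)?is\.gd/
+^http://holykaw\.alltop\.com/
# reject everything else
-.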