How to speed up crawling in Nutch
I am trying to develop an application in which I'll give a constrained set of URLs to the urls file in Nutch. I am able to crawl these URLs and get their contents by reading the data from the segments.
I have crawled with depth 1, as I am in no way concerned about the outlinks or inlinks of the webpages. I only need the contents of the webpages listed in the urls file.
But performing this crawl takes time. Please suggest a way to decrease the crawl time and increase the crawl speed. I also don't need indexing, because I am not concerned with the search part.
Does anyone have suggestions on how to speed up the crawl?
The main thing for getting speed is configuring nutch-site.xml:
<property>
<name>fetcher.threads.per.queue</name>
<value>50</value>
<description></description>
</property>
You can scale up the threads in nutch-site.xml. Increasing fetcher.threads.per.host and fetcher.threads.fetch will both increase the speed at which you crawl. I have noticed drastic improvements. Use caution when increasing these, though. If you do not have the hardware or connection to support the increased traffic, the number of errors in crawling can significantly increase.
For me, this property helped a lot, because a slow domain can slow down the whole fetch phase:
<property>
<name>generate.max.count</name>
<value>50</value>
<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.
</description>
</property>
For example, if you respect robots.txt (the default behaviour) and a domain takes too long to crawl, the delay applied will be fetcher.max.crawl.delay. Many URLs from such a domain in one queue will slow down the whole fetch phase, so it is better to limit them with generate.max.count.
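As a side note, the counting mode mentioned in that description is controlled by a separate property (generate.count.mode in recent Nutch versions, with values such as host or domain). A minimal sketch, assuming you want the per-fetchlist cap applied per host; check your nutch-default.xml for the exact name and default in your version:
<property>
<name>generate.count.mode</name>
<value>host</value>
<description>Count URLs per host (use 'domain' to count per domain) when applying generate.max.count.</description>
</property>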
You can add this property to limit the duration of the fetch phase in the same way:
<property>
<name>fetcher.throughput.threshold.pages</name>
<value>1</value>
<description>The threshold of minimum pages per second. If the fetcher downloads fewer
pages per second than the configured threshold, the fetcher stops, preventing slow queues
from stalling the throughput. This threshold must be an integer. This can be useful when
fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
</description>
</property>
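If a hard time cap is easier to reason about, fetcher.timelimit.mins itself can be set instead of (or alongside) the throughput threshold. A minimal sketch, with a purely illustrative 60-minute limit:
<property>
<name>fetcher.timelimit.mins</name>
<value>60</value>
<description>Maximum number of minutes the fetch phase is allowed to run; remaining entries are skipped once the limit is reached.</description>
</property>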
But please, don't touch the fetcher.threads.per.queue property, or you will end up on a blacklist... It's not a good way to improve the crawl speed...
Hello, I am also new to crawling, but I have used some methods and got good results; maybe they will help you. I have changed my nutch-site.xml with these properties:
<property>
<name>fetcher.server.delay</name>
<value>0.5</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server. Note that this might get
overridden by a Crawl-Delay from a robots.txt and is used ONLY if
fetcher.threads.per.queue is set to 1.
</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>400</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection).</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>25</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
Kindly suggest some more options. Thanks.
I had similar issues and was able to improve the speed with the help of https://wiki.apache.org/nutch/OptimizingCrawls
It has useful information about what can be slowing down your crawl and what you can do to address each of those issues.
Unfortunately, in my case the queues are quite unbalanced, and I can't request the bigger one too fast or I get blocked, so I probably need to go to a cluster solution or Tor before I speed up the threads further.
If you don't need to follow links, I see no reason to use Nutch. You can simply take your list of URLs and fetch them with an HTTP client library or a simple script using curl.
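For instance, here is a minimal sketch of that approach in Python using only the standard library. The file name urls.txt, the worker count, and the timeout are assumptions; adjust them to your list and bandwidth, and keep the concurrency modest so you don't hammer any single host.
# Minimal sketch: fetch a fixed list of URLs with plain HTTP requests instead of Nutch.
# Assumptions: URLs live one per line in urls.txt; 10 workers and a 30 s timeout are arbitrary.
import concurrent.futures
import urllib.request

def fetch(url, timeout=30):
    """Download one page and return (url, body-or-error-string)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except Exception as exc:
        return url, f"ERROR: {exc}"

if __name__ == "__main__":
    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    # Modest thread pool; raising max_workers increases speed but also the load per host.
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
        for url, body in pool.map(fetch, urls):
            size = len(body) if isinstance(body, (bytes, bytearray)) else body
            print(url, size)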