How to speed up crawling in Nutch
I am trying to develop an application in which I'll give a constrained set of URLs to the urls file in Nutch. I am able to crawl these URLs and get their contents by reading the data from the segments.
I have crawled with depth 1, as I am in no way concerned about the outlinks or inlinks of the webpages. I only need the contents of the webpages listed in the urls file.
But performing this crawl takes time. Please suggest a way to decrease the crawl time and increase the crawl speed. I also don't need indexing, because I am not concerned with the search part.
Does anyone have suggestions on how to speed up the crawl?
The main thing for getting speed is configuring nutch-site.xml:
<property>
<name>fetcher.threads.per.queue</name>
<value>50</value>
<description></description>
</property>
You can scale up the threads in nutch-site.xml. Increasing fetcher.threads.per.host and fetcher.threads.fetch will both increase the speed at which you crawl. I have noticed drastic improvements. Use caution when increasing these, though. If you do not have the hardware or connection to support the increased traffic, the number of errors in crawling can significantly increase.
For me, this property helped a lot, because a slow domain can slow down the whole fetch phase:
<property>
<name>generate.max.count</name>
<value>50</value>
<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.
</description>
</property>
For example, if you respect robots.txt (the default behaviour) and a domain takes too long to crawl, the delay applied will be fetcher.max.crawl.delay. Many URLs from such a domain in one queue will slow down the whole fetch phase, so it is better to limit them with generate.max.count.
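As a side note, the counting mode mentioned in that description is controlled by a separate property (generate.count.mode in recent Nutch versions, with values such as host or domain). A minimal sketch, assuming you want the per-fetchlist cap applied per host; check your nutch-default.xml for the exact name and default in your version:
<property>
<name>generate.count.mode</name>
<value>host</value>
<description>Count URLs per host (use 'domain' to count per domain) when applying generate.max.count.</description>
</property>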
You can add this property to limit the duration of the fetch phase in the same way:
<property>
<name>fetcher.throughput.threshold.pages</name>
<value>1</value>
<description>The threshold of minimum pages per second. If the fetcher downloads fewer
pages per second than the configured threshold, the fetcher stops, preventing slow queues
from stalling the throughput. This threshold must be an integer. This can be useful when
fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
</description>
</property>
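If a hard time cap is easier to reason about, fetcher.timelimit.mins itself can be set instead of (or alongside) the throughput threshold. A minimal sketch, with a purely illustrative 60-minute limit:
<property>
<name>fetcher.timelimit.mins</name>
<value>60</value>
<description>Maximum number of minutes the fetch phase is allowed to run; remaining entries are skipped once the limit is reached.</description>
</property>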
But please, don't touch the fetcher.threads.per.queue property, or you will end up on a blacklist... It's not a good way to improve the crawl speed...
Hello, I am also new to crawling, but I have used some methods and got good results; maybe they will help you. I have changed my nutch-site.xml with these properties:
<property>
<name>fetcher.server.delay</name>
<value>0.5</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server. Note that this might get
overridden by a Crawl-Delay from a robots.txt and is used ONLY if
fetcher.threads.per.queue is set to 1.
</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>400</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection).</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>25</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
Kindly suggest some more options. Thanks.
I had similar issues and was able to improve the speed with the help of https://wiki.apache.org/nutch/OptimizingCrawls
It has useful information about what can be slowing down your crawl and what you can do to address each of those issues.
Unfortunately, in my case the queues are quite unbalanced, and I can't request the bigger one too fast or I get blocked, so I probably need to go to a cluster solution or Tor before I speed up the threads further.
If you don't need to follow links, I see no reason to use Nutch. You can simply take your list of URLs and fetch them with an HTTP client library or a simple script using curl.
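For instance, here is a minimal sketch of that approach in Python using only the standard library. The file name urls.txt, the worker count, and the timeout are assumptions; adjust them to your list and bandwidth, and keep the concurrency modest so you don't hammer any single host.
# Minimal sketch: fetch a fixed list of URLs with plain HTTP requests instead of Nutch.
# Assumptions: URLs live one per line in urls.txt; 10 workers and a 30 s timeout are arbitrary.
import concurrent.futures
import urllib.request

def fetch(url, timeout=30):
    """Download one page and return (url, body-or-error-string)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except Exception as exc:
        return url, f"ERROR: {exc}"

if __name__ == "__main__":
    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    # Modest thread pool; raising max_workers increases speed but also the load per host.
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
        for url, body in pool.map(fetch, urls):
            size = len(body) if isinstance(body, (bytes, bytearray)) else body
            print(url, size)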