
anemone Ruby with focus_crawl

I'm working on a crawl, but before I crawl an entire website I would like to run a quick test on ten or so pages. So I was thinking something like the code below would work, but I keep getting a NoMethodError...

Anemone.crawl(self.url) do |anemone|
  anemone.focus_crawl do |crawled_page|
    crawled_page.links.slice(0..10)
    page = pages.find_or_create_by_url(crawled_page.url)
    logger.debug(page.inspect)
    page.check_for_term(self.term, crawled_page.body)
  end
end

NoMethodError (private method `select' called for true:TrueClass):
    app/models/site.rb:14:in `crawl'
    app/controllers/sites_controller.rb:96:in `block in crawl'
    app/controllers/sites_controller.rb:95:in `crawl'

Basically I want a way to crawl only 10 pages first, but I don't seem to be understanding the basics here. Can someone help me out? Thanks!!
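
For what it's worth, a minimal sketch of the intent above, assuming (as the `select` in the error suggests) that Anemone treats the focus_crawl block's return value as the list of links to follow, so the link slice needs to be the last expression in the block:

Anemone.crawl(self.url) do |anemone|
  anemone.focus_crawl do |crawled_page|
    page = pages.find_or_create_by_url(crawled_page.url)
    logger.debug(page.inspect)
    page.check_for_term(self.term, crawled_page.body)
    # Return the (first 10) links as the block's value so Anemone can follow them
    crawled_page.links.slice(0..10)
  end
end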


Add this monkeypatch to your crawling file.

module Anemone
  class Core
    # Forcefully stop all of the crawler's worker threads
    def kill_threads
      @tentacles.each { |thread|
        Thread.kill(thread) if thread.alive?
      }
    end
  end
end

Here is an example of how to use it once you've added it to your crawling file: in the file where you run the crawl, keep a counter in your anemone.on_every_page block and call kill_threads once it passes your limit.

@counter = 0
Anemone.crawl("http://stackoverflow.com", :obey_robots => true) do |anemone|
  anemone.on_every_page do |page|
    @counter += 1
    if @counter > 10
      anemone.kill_threads
    end
  end
end

Source: https://github.com/chriskite/anemone/issues/24


So I found the :depth_limit param and that will be OK, but I would rather limit it by the number of links.
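
For reference, a minimal sketch of the :depth_limit option (the value 1 here is just an example; it limits how many levels deep Anemone follows links, not how many pages it fetches):

Anemone.crawl(self.url, :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    logger.debug(page.url)
  end
end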


I found your question while I was googling for anemone.

I had the same problem, and with Anemone what I did was:

As soon as I reach the URL limit that I want, I raise an exception. The whole Anemone block is inside a begin/rescue block.
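
A minimal sketch of that approach, assuming a limit of 10 pages and an illustrative CrawlLimitReached error (both the limit and the error name are made up here):

class CrawlLimitReached < StandardError; end

visited = 0
begin
  Anemone.crawl("http://example.com") do |anemone|
    anemone.on_every_page do |page|
      visited += 1
      # process the page here...
      raise CrawlLimitReached if visited >= 10
    end
  end
rescue CrawlLimitReached
  # limit reached, crawl aborted
end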

In your specific case I would take another approach: download the page that you want to parse and bind it to FakeWeb. I wrote a blog entry about it a long time ago; maybe it will be useful: http://blog.bigrails.com/scraper-guide.html
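
A rough sketch of that idea, assuming you have a locally saved copy of the page (the URL and file name below are placeholders); FakeWeb intercepts Net::HTTP requests, which is what Anemone uses under the hood:

require 'fakeweb'

# Serve the saved copy instead of hitting the network
FakeWeb.register_uri(:get, "http://example.com/",
                     :body => File.read("saved_page.html"))

Anemone.crawl("http://example.com/") do |anemone|
  anemone.on_every_page do |page|
    # run your parsing/term-checking against the canned response
  end
end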
