Niocchi crawler - how to add URLs to crawl during the crawling process (crawling a whole website)
Has anyone experience with the Niocchi library? I start crawling with the domain URL. In the Worker method processResource(), I parse the resource I get and extract all internal links on that page, and I need to add them to the crawl. But I can't find how. Should I add them to the URLPool, the ResourcePool, or somewhere else?
Thanks!
You can add them to an existing URLPool. The existing URLPool implementations are not expandable, so you have to create your own URLPool class that is expandable. I called my class ExpandableURLPool.
The URLPool.setProcessed method is called by the framework upon completion of processing, and it is there that you can add additional URLs to the URL list. I will follow with an example, but first, the URLPool documentation states:
setProcessed(Query) is called by the crawler to inform the URLPool when a Query has been crawled and its resource processed. This is typically used by the URLPool to check the crawl status and log the error in case of a failure or to get more URL to crawl in case of success. A typical example where getNextQuery() returns null but hasNextQuery() returns true is when the URLPool is waiting for some processed resources from which more URL to crawl have been extracted to come back. Check the urlpools package for examples of implementation.
This implies that the tricky part in your implementation of ExpandableURLPool is that hasNextQuery should return true if there is an outstanding query being processed that MAY result in new URLs being added to the pool. Similarly, getNextQuery must return null in cases where there is an outstanding query that has not finished yet and MAY result in new URLs being added to the pool. [I dislike the way niocchi is put together in this regard.]
Here is my very preliminary version of ExpandableURLPool:
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
// plus the niocchi imports for Query, URLPool, and URLPoolException

class ExpandableURLPool implements URLPool {

    private final List<String> urlList = new ArrayList<String>();
    private int cursor = 0;
    // Number of queries handed out but not yet reported back via setProcessed().
    private int outstandingQueries = 0;

    public ExpandableURLPool(Collection<String> seedURLs) {
        urlList.addAll(seedURLs);
    }

    @Override
    public boolean hasNextQuery() {
        // Keep reporting true while there are unvisited URLs, or while an
        // outstanding query may still add more URLs to the pool.
        return cursor < urlList.size() || outstandingQueries > 0;
    }

    @Override
    public Query getNextQuery() throws URLPoolException {
        try {
            if (cursor >= urlList.size()) {
                // Nothing available right now; the crawler will keep polling
                // as long as hasNextQuery() returns true.
                return null;
            } else {
                outstandingQueries++;
                return new Query(urlList.get(cursor++));
            }
        } catch (MalformedURLException e) {
            throw new URLPoolException("invalid url", e);
        }
    }

    @Override
    public void setProcessed(Query query) {
        outstandingQueries--;
    }

    public void addURL(String url) {
        urlList.add(url);
    }
}
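One thing to watch out for (and part of why I call this preliminary): depending on how you configure the crawler, calls into the pool and calls to addURL() may not all happen on the same thread. If that applies to your setup, a minimal sketch of a hardened variant, assuming the ExpandableURLPool above (same imports), is to guard every method that touches the shared state with the pool's intrinsic lock:
// Sketch only: same logic as above, with the shared state guarded by
// synchronized methods in case the crawler and workers use different threads.
class SynchronizedExpandableURLPool extends ExpandableURLPool {

    public SynchronizedExpandableURLPool(Collection<String> seedURLs) {
        super(seedURLs);
    }

    @Override
    public synchronized boolean hasNextQuery() {
        return super.hasNextQuery();
    }

    @Override
    public synchronized Query getNextQuery() throws URLPoolException {
        return super.getNextQuery();
    }

    @Override
    public synchronized void setProcessed(Query query) {
        super.setProcessed(query);
    }

    @Override
    public synchronized void addURL(String url) {
        super.addURL(url);
    }
}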
I also created a Worker class, derived from DiskSaveWorker to test the above implementation:
class MyWorker extends org.niocchi.gc.DiskSaveWorker {
Crawler mCrawler = null;
ExpandableURLPool pool = null;
int maxepansion = 10;
public MyWorker(Crawler crawler, String savePath, ExpandableURLPool aPool) {
super(crawler, savePath);
mCrawler = crawler;
pool = aPool;
}
@Override
public void processResource(Query query) {
super.processResource(query);
// The following is a test
if (--maxepansion >= 0 ) {
pool.addURL("http://www.somewhere.com");
}
}
}
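In a real crawl you would replace that test with link extraction from the fetched page. How you get at the page content depends on which Resource implementation you use (DiskSaveWorker writes it to disk; a memory-backed resource keeps it in RAM), so the sketch below simply assumes you already have the HTML as a String. The LinkExtractor class and its extractInternalLinks helper are my own illustration, not part of niocchi:
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper, not part of niocchi: pulls href values out of the HTML
// and keeps only absolute http(s) links on the same host as the page itself.
class LinkExtractor {

    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> extractInternalLinks(String html, String pageUrl) {
        List<String> links = new ArrayList<String>();
        URI base;
        try {
            base = new URI(pageUrl);
        } catch (URISyntaxException e) {
            // Unparsable base URL; nothing to extract in this sketch.
            return links;
        }
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            // Drop any fragment part and skip fragment-only links.
            int hash = href.indexOf('#');
            if (hash >= 0) {
                href = href.substring(0, hash);
            }
            if (href.isEmpty()) {
                continue;
            }
            try {
                URI resolved = base.resolve(href);
                if (resolved.getHost() != null
                        && resolved.getHost().equalsIgnoreCase(base.getHost())
                        && resolved.getScheme() != null
                        && resolved.getScheme().startsWith("http")) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException ignored) {
                // Malformed href; skip it.
            }
        }
        return links;
    }
}
Back in processResource(), you would then pass the page's own URL (via whatever accessor Query exposes for it) and the HTML to extractInternalLinks() and feed each returned link into pool.addURL().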