Exclude Some URL from getting crawled

Posted 2023-03-20 22:45

I am writing a crawler, and there are some pages I do not want it to crawl, so I added an exclusion list for them. Is there anything wrong with this code? The URL http://www.host.com/technology/ is still being crawled despite the exclusion. I do not want any URL that starts with http://www.host.com/technology/ to be crawled.

public class MyCrawler extends WebCrawler {

    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    List<String> exclusions;

    public MyCrawler() {

        exclusions = new ArrayList<String>();
        // Add all your exclusions here.
        // I do not want this URL to get crawled.
        exclusions.add("http://www.host.com/technology/");

    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        System.out.println(href);
        if (filters.matcher(href).matches()) {
            System.out.println("noooo");
            return false;
        }

        if (exclusions.contains(href)) { // why is this loop not working?
            System.out.println("Yes2");
            return false;
        }

        if (href.startsWith("http://www.host.com/")) {
            System.out.println("Yes1");
            return true;
        }



        System.out.println("No");
        return false;
    }

    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();         
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();

        System.out.println("=============");
        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Text length: " + text.length());
        System.out.println("Number of links: " + links.size());
        System.out.println("Docid of parent page: " + parentDocid);
        System.out.println("=============");
    }   
}

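exclusions.contains(href) is true only when href is exactly equal to one of the entries in the list, so a URL that merely begins with an excluded prefix still passes. If you don't want to crawl any URL that starts with one of the exclusions, you have to check each prefix, something like this: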

for(String exclusion : exclusions){
    if(href.startsWith(exclusion)){
        return false;
    }
}

Also, an if statement is not a loop.
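
To see the difference between an exact match and a prefix match outside the crawler, here is a minimal standalone sketch. The class ExclusionFilter and the method isExcluded are hypothetical names made up for this illustration; they are not part of the crawler library. The sketch assumes that both the URL and the exclusions should be compared in lower case, as the question's shouldVisit already lower-cases href.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of any crawler library.
final class ExclusionFilter {

    // Returns true if href starts with any of the excluded prefixes.
    // Both sides are lower-cased so the comparison ignores case.
    static boolean isExcluded(String href, List<String> exclusions) {
        String normalized = href.toLowerCase(Locale.ROOT);
        for (String exclusion : exclusions) {
            if (normalized.startsWith(exclusion.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> exclusions = Arrays.asList("http://www.host.com/technology/");
        String deepLink = "http://www.host.com/technology/news.html";

        // Exact match: contains() only finds the prefix itself, not pages under it.
        System.out.println(exclusions.contains(deepLink));                            // false
        // Prefix match: startsWith() catches everything under /technology/.
        System.out.println(isExcluded(deepLink, exclusions));                         // true
        System.out.println(isExcluded("http://www.host.com/sports/", exclusions));    // false
    }
}
```

In shouldVisit, a call such as isExcluded(href, exclusions) would replace the exclusions.contains(href) check, and it should stay before the href.startsWith("http://www.host.com/") test so that excluded pages are rejected first.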
