
How do we build a website crawler using Java?

Posting this question again. I have started with the crawler, but I am stuck with the indexing part. I want an efficient and fast way to index the links. Currently I am inserting the links into a database, but checking for unique links there is an overhead, so can anyone suggest a better way to do this?
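Would something like an in-memory set of the URLs seen so far make sense here, so the uniqueness check is just a hash lookup and the database only ever receives new links? A rough sketch of what I mean (assuming the set of URLs fits in memory; the class and method names are only illustrative):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LinkIndex {

    // URLs already seen during this crawl; membership checks are in-memory hash lookups.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /**
     * Returns true only the first time a URL is offered. The caller inserts the
     * link into the database only when this returns true, so the database is never
     * queried just to test for duplicates.
     */
    public boolean addIfNew(String url) {
        // Real code would normalize more carefully (fragments, trailing slashes, host case).
        return seen.add(url.trim());
    }
}

Alternatively, a unique constraint on the URL column would let the database itself reject duplicates on insert, avoiding the separate existence check.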


Hi, I am trying to build a website crawler which will crawl a whole website and get all of the links within it, something very similar to "XENU". But I am not able to figure out how to go about it. I have one algorithm in mind, but it would be very slow; it is mentioned below.

  1. Get the source of the home page.
  2. Get all the anchor tags from the source.
  3. Get the URLs from the anchor tags.
  4. Check whether each URL belongs to the same site or to an external site.
  5. Get the source for the URLs found in the above step and mark those URLs as checked.
  6. Repeat the process until there are no unmarked URLs left.

This is a somewhat rough idea of what I came up with, but it will be very slow. Can anyone suggest another approach or enhance this algorithm? A sketch of the idea in code follows below.
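Only a sketch; it assumes the Jsoup library for fetching and parsing pages, and the class and variable names are just illustrative:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class SimpleCrawler {

    public static Set<String> crawl(String startUrl) throws Exception {
        String host = new URI(startUrl).getHost();
        Set<String> checked = new HashSet<>();        // URLs already marked (step 5)
        Deque<String> unchecked = new ArrayDeque<>(); // URLs still waiting to be fetched
        unchecked.add(startUrl);
        checked.add(startUrl);

        while (!unchecked.isEmpty()) {                // step 6: repeat until nothing is unmarked
            String url = unchecked.poll();
            Document page;
            try {
                page = Jsoup.connect(url).get();      // step 1: get the page source
            } catch (Exception e) {
                continue;                             // skip pages that fail to load
            }
            for (Element a : page.select("a[href]")) {   // step 2: all anchor tags
                String link = a.absUrl("href");          // step 3: absolute URL of the anchor
                if (link.isEmpty() || !checked.add(link)) {
                    continue;                            // empty or already seen
                }
                try {
                    if (host.equals(new URI(link).getHost())) { // step 4: same site only
                        unchecked.add(link);                    // step 5: fetch it later
                    }
                } catch (Exception ignore) {
                    // malformed link: skip it
                }
            }
        }
        return checked;
    }
}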

Regards, Sagar.


The method you have described is pretty much the only thing you can do. The only way to make it faster is to process multiple URLs in parallel in separate threads. This can be done relatively easily and on a large scale: you only need to synchronize access to the pool of URLs to be processed and to wherever you save the results, so having 1000 threads working in parallel should work quite fine.
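For example, a fixed thread pool can drain a shared set of pending URLs. A sketch of that shape, assuming a hypothetical fetchAndExtractLinks helper that downloads one page and returns the links found in it (the helper and the pool size are illustrative, not from the post above):

import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelCrawler {

    private final Set<String> seen = ConcurrentHashMap.newKeySet(); // thread-safe "already processed" check
    private final ExecutorService pool = Executors.newFixedThreadPool(50);
    private final AtomicInteger pending = new AtomicInteger();      // URLs submitted but not finished yet

    public Set<String> crawl(String startUrl) throws InterruptedException {
        submit(startUrl);
        pool.awaitTermination(1, TimeUnit.HOURS);  // the pool shuts itself down when pending hits 0
        return seen;
    }

    private void submit(String url) {
        if (!seen.add(url)) {          // atomic check-and-mark: only the first caller wins
            return;
        }
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                for (String link : fetchAndExtractLinks(url)) {
                    submit(link);
                }
            } finally {
                if (pending.decrementAndGet() == 0) {
                    pool.shutdown();   // nothing left to crawl: let awaitTermination return
                }
            }
        });
    }

    private List<String> fetchAndExtractLinks(String url) {
        // Placeholder: fetch the page (e.g. with Jsoup) and return the hrefs of its anchor tags.
        return List.of();
    }
}

The only shared state is the seen set and the pending counter; ConcurrentHashMap.newKeySet() makes the check-and-mark atomic, so no URL is handed to two threads.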


I did something similar on J2ME three years ago. The idea was to implement a simple HTML parser that detects all the anchor and media tags. Every link is put into a synchronized collection, and the collection's elements are consumed by a number of threads that explore the next URL, and so on. That was three years ago, on a limited J2ME device. Now there is Lucene, which is a very powerful Java full-text search engine. I recommend you read this link, which describes using Lucene for crawling web pages: http://www.codeproject.com/KB/java/JSearch_Engine.aspx

Example (from that article):

// Note: HTMLDocument, LinkParser, indexed (a collection of already-indexed paths),
// writer (a Lucene IndexWriter) and beginDomain (the home domain) are defined in the
// linked CodeProject sample.
private static void indexDocs(String url) throws Exception {

    // index the page itself
    Document doc = HTMLDocument.Document(url);
    System.out.println("adding " + doc.get("path"));
    try {
        indexed.add(doc.get("path"));
        writer.addDocument(doc);          // add docs unconditionally
        // TODO: only add HTML docs and create other doc types

        // get all links on the page, then index them recursively
        LinkParser lp = new LinkParser(url);
        URL[] links = lp.ExtractLinks();

        for (URL l : links) {
            // make sure the URL hasn't already been indexed,
            // stays within the home domain,
            // and has no query string (no "?")
            if ((!indexed.contains(l.toURI().toString())) &&
                (l.toURI().toString().contains(beginDomain)) &&
                (!l.toURI().toString().contains("?"))) {
                // don't index zip files
                if (!l.toURI().toString().endsWith(".zip")) {
                    System.out.print(l.toURI().toString());
                    indexDocs(l.toURI().toString());
                }
            }
        }

    } catch (Exception e) {
        System.out.println(e.toString());
    }
}


Assuming this is for learning purposes, I suggest you read more about web crawlers here; that will give you more info and context than you already have. You don't need to implement all of it, just pick the most important parts. Then go through this article, which provides a simple implementation.

Divide your problem into small logical chunks so that you can work on them in parallel, and look at MapReduce implementations such as Hadoop and GridGain.


I think you should look at the Apache Nutch project.
