
How do we build a website crawler using Java?

Posting this question again. I have started with the crawler, but I am stuck with the indexing part. I want an efficient and fast way to index the links. Currently I am inserting the links into a database, but checking for unique links there is an overhead, so can anyone suggest a better way to do this?
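Would something like an in-memory set of the URLs seen so far make sense here, so the uniqueness check is just a hash lookup and the database only ever receives new links? A rough sketch of what I mean (assuming the set of URLs fits in memory; the class and method names are only illustrative):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LinkIndex {

    // URLs already seen during this crawl; membership checks are in-memory hash lookups.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /**
     * Returns true only the first time a URL is offered. The caller inserts the
     * link into the database only when this returns true, so the database is never
     * queried just to test for duplicates.
     */
    public boolean addIfNew(String url) {
        // Real code would normalize more carefully (fragments, trailing slashes, host case).
        return seen.add(url.trim());
    }
}

Alternatively, a unique constraint on the URL column would let the database itself reject duplicates on insert, avoiding the separate existence check.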


Hi, I am trying to build a website crawler which will crawl a whole website and get all of the links within it, something very similar to "XENU". But I am not able to figure out how to go about it. I have one algorithm in mind, but it would be very slow; it is mentioned below.

  1. Get the source of the home page.
  2. Get all the anchor tags from the source.
  3. Get the URLs from the anchor tags.
  4. Check whether each URL belongs to the same site or to an external site.
  5. Get the source for the URLs found in the above step and mark those URLs as checked.
  6. Repeat the process until there are no unmarked URLs left.

This is a somewhat rough idea of what I came up with, but it will be very slow. Can anyone suggest another approach or enhance this algorithm? A sketch of the idea in code follows below.
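Only a sketch; it assumes the Jsoup library for fetching and parsing pages, and the class and variable names are just illustrative:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class SimpleCrawler {

    public static Set<String> crawl(String startUrl) throws Exception {
        String host = new URI(startUrl).getHost();
        Set<String> checked = new HashSet<>();        // URLs already marked (step 5)
        Deque<String> unchecked = new ArrayDeque<>(); // URLs still waiting to be fetched
        unchecked.add(startUrl);
        checked.add(startUrl);

        while (!unchecked.isEmpty()) {                // step 6: repeat until nothing is unmarked
            String url = unchecked.poll();
            Document page;
            try {
                page = Jsoup.connect(url).get();      // step 1: get the page source
            } catch (Exception e) {
                continue;                             // skip pages that fail to load
            }
            for (Element a : page.select("a[href]")) {   // step 2: all anchor tags
                String link = a.absUrl("href");          // step 3: absolute URL of the anchor
                if (link.isEmpty() || !checked.add(link)) {
                    continue;                            // empty or already seen
                }
                try {
                    if (host.equals(new URI(link).getHost())) { // step 4: same site only
                        unchecked.add(link);                    // step 5: fetch it later
                    }
                } catch (Exception ignore) {
                    // malformed link: skip it
                }
            }
        }
        return checked;
    }
}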

Regards, Sagar.


The method you have described is pretty much the only thing you can do. The only way to make it faster is to process multiple URLs in parallel in separate threads. This can be done relatively easily and on a large scale: you only need to synchronize access to the pool of URLs to be processed and to wherever you save the results, so having 1000 threads working in parallel should work quite fine.
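For example, a fixed thread pool can drain a shared set of pending URLs. A sketch of that shape, assuming a hypothetical fetchAndExtractLinks helper that downloads one page and returns the links found in it (the helper and the pool size are illustrative, not from the post above):

import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelCrawler {

    private final Set<String> seen = ConcurrentHashMap.newKeySet(); // thread-safe "already processed" check
    private final ExecutorService pool = Executors.newFixedThreadPool(50);
    private final AtomicInteger pending = new AtomicInteger();      // URLs submitted but not finished yet

    public Set<String> crawl(String startUrl) throws InterruptedException {
        submit(startUrl);
        pool.awaitTermination(1, TimeUnit.HOURS);  // the pool shuts itself down when pending hits 0
        return seen;
    }

    private void submit(String url) {
        if (!seen.add(url)) {          // atomic check-and-mark: only the first caller wins
            return;
        }
        pending.incrementAndGet();
        pool.execute(() -> {
            try {
                for (String link : fetchAndExtractLinks(url)) {
                    submit(link);
                }
            } finally {
                if (pending.decrementAndGet() == 0) {
                    pool.shutdown();   // nothing left to crawl: let awaitTermination return
                }
            }
        });
    }

    private List<String> fetchAndExtractLinks(String url) {
        // Placeholder: fetch the page (e.g. with Jsoup) and return the hrefs of its anchor tags.
        return List.of();
    }
}

The only shared state is the seen set and the pending counter; ConcurrentHashMap.newKeySet() makes the check-and-mark atomic, so no URL is handed to two threads.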


I did something similar on J2ME three years ago. The idea was to implement a simple HTML parser that detects all the anchor and media tags. Every link is put into a synchronized collection, and the collection's elements are consumed by a number of threads that explore the next URL, and so on. That was three years ago, on a limited J2ME device. Now there is Lucene, which is a very powerful Java full-text search engine. I recommend you read this link, which describes using Lucene for crawling web pages: http://www.codeproject.com/KB/java/JSearch_Engine.aspx

Example (from that article):

// Note: HTMLDocument, LinkParser, indexed (a collection of already-indexed paths),
// writer (a Lucene IndexWriter) and beginDomain (the home domain) are defined in the
// linked CodeProject sample.
private static void indexDocs(String url) throws Exception {

    // index the page itself
    Document doc = HTMLDocument.Document(url);
    System.out.println("adding " + doc.get("path"));
    try {
        indexed.add(doc.get("path"));
        writer.addDocument(doc);          // add docs unconditionally
        // TODO: only add HTML docs and create other doc types

        // get all links on the page, then index them recursively
        LinkParser lp = new LinkParser(url);
        URL[] links = lp.ExtractLinks();

        for (URL l : links) {
            // make sure the URL hasn't already been indexed,
            // stays within the home domain,
            // and has no query string (no "?")
            if ((!indexed.contains(l.toURI().toString())) &&
                (l.toURI().toString().contains(beginDomain)) &&
                (!l.toURI().toString().contains("?"))) {
                // don't index zip files
                if (!l.toURI().toString().endsWith(".zip")) {
                    System.out.print(l.toURI().toString());
                    indexDocs(l.toURI().toString());
                }
            }
        }

    } catch (Exception e) {
        System.out.println(e.toString());
    }
}


Assuming this is for learning purposes, I suggest you read more about web crawlers here; that will give you more info and context than you already have. You don't need to implement all of it, just pick the most important parts. Then go through this article, which provides a simple implementation.

Divide your problem into small logical chunks so that you can work on them in parallel, and look at MapReduce implementations such as Hadoop and GridGain.


I think you should look at the Apache Nutch project.
