
Java crawler library - recursive HTTP subtree download with directory listing parser

My application currently reads data by copying a filesystem tree from a remote machine over a shared disk, so from the application's point of view it is a filesystem deep copy.

This solution is somewhat limiting, and I want to support a second option as well: copying the subtree via HTTP.

The library should do something like wget --recursive: parse the directory listing and use it to traverse down the tree.

I could not find any Java library doing this.

I am able to implement such functionality myself (with NekoHTML or something similar), but I don't like reinventing the wheel.
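
For reference, this is roughly the kind of code I would rather not write and maintain myself. A minimal sketch (it uses jsoup purely for illustration instead of NekoHTML; the server URL is made up, and it assumes a plain auto-index page whose sub-directory links end with '/' and whose file names need no URL decoding):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.io.InputStream;
    import java.net.URI;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    // Recursively mirrors an HTTP auto-index (directory listing) subtree
    // into a local directory, in the spirit of "wget --recursive".
    // baseUrl is expected to end with '/'.
    public class HttpTreeCopier {

        public static void copyTree(String baseUrl, Path targetDir) throws Exception {
            Files.createDirectories(targetDir);
            Document listing = Jsoup.connect(baseUrl).get();
            for (Element link : listing.select("a[href]")) {
                String href = link.attr("href");
                // Skip parent-directory, absolute, and sort/query links typical of auto-index pages.
                if (href.isEmpty() || href.startsWith("..") || href.startsWith("/") || href.startsWith("?")) {
                    continue;
                }
                String childUrl = baseUrl + href;
                if (href.endsWith("/")) {
                    // Sub-directory: recurse into it.
                    copyTree(childUrl, targetDir.resolve(href));
                } else {
                    // Plain file: stream it to disk.
                    try (InputStream in = URI.create(childUrl).toURL().openStream()) {
                        Files.copy(in, targetDir.resolve(href), StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            }
        }

        public static void main(String[] args) throws Exception {
            copyTree("http://interim-server.example/data/", Paths.get("local-copy"));
        }
    }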

Is there such a library that I can easily use within my application?

Ideally:

  • published in the Maven Central Repository, as I am using Maven for builds
  • with as few dependencies on other libraries as possible
  • no need for robots exclusion support, as it will operate on a limited set of interim servers only

Thanks.

Note: please post pointers to the homepages of libraries which you have personally used.


The Norconex HTTP Collector traverses websites like a tree, given one or more start URLs. It can be used as a Java library in your application, or as a command-line application. You can decide what to do with each document it crawls. Being a full-blown web crawler, it probably does more than what you are after, but you can configure it to suit your needs.

For instance, it will by default extract the text found in your documents, and it lets you decide what to do with that text by plugging in a "Committer" (i.e. where to "commit" the extracted content). In your case I think you want the raw documents only and want to ignore the text conversion part. You can do so by plugging in your own document processor, followed by "filtering out" documents so they stop being processed once you have dealt with them in your own way.

The project is open source, hosted on GitHub, and is fully "mavenized". It supports robots.txt, but you can turn that off if you want. The only downside for you is that it has more than a few dependencies, but since you are using Maven, those should get resolved automatically without effort. You'll find Maven repository info on the product site.
