Java robots.txt parser with wildcard support

I'm looking for a robots.txt parser in Java that supports the same pattern-matching rules as Googlebot.

I've found some libraries that parse robots.txt files, but none of them supports Googlebot-style pattern matching:

  • Heritrix (there is an open issue on this subject)
  • Crawler4j (looks like the same implementation as Heritrix)
  • jrobotx

Does anyone know of a Java library that can do this?
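For context, by Googlebot-style matching I mean that `*` matches any sequence of characters and a trailing `$` anchors the rule to the end of the URL. A minimal, hypothetical Java sketch of that matching (my own illustration, not taken from any of the libraries above):

    import java.util.regex.Pattern;

    // Hypothetical helper illustrating Googlebot-style path matching:
    // '*' matches any character sequence, a trailing '$' anchors to the URL end.
    public class GooglebotPathMatcher {

        // Compile a robots.txt path pattern into an equivalent regex.
        static Pattern compile(String robotsPattern) {
            StringBuilder regex = new StringBuilder("^");
            for (int i = 0; i < robotsPattern.length(); i++) {
                char c = robotsPattern.charAt(i);
                if (c == '*') {
                    regex.append(".*");
                } else if (c == '$' && i == robotsPattern.length() - 1) {
                    regex.append('$'); // trailing '$' anchors to end of URL
                } else {
                    regex.append(Pattern.quote(String.valueOf(c)));
                }
            }
            return Pattern.compile(regex.toString());
        }

        // A rule matches if the pattern matches a prefix of the path.
        static boolean matches(String robotsPattern, String path) {
            return compile(robotsPattern).matcher(path).lookingAt();
        }

        public static void main(String[] args) {
            System.out.println(matches("/private*/", "/private-stuff/page")); // true
            System.out.println(matches("/*.php$", "/index.php"));            // true
            System.out.println(matches("/*.php$", "/index.php?x=1"));        // false
        }
    }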


Nutch seems to be using a combination of crawler-commons with some custom code (see RobotsRulesParser.java). I'm not sure of the current state of affairs, though.
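If crawler-commons on its own fits your needs, basic usage looks roughly like the sketch below. This assumes crawler-commons' SimpleRobotRulesParser API; check the version you depend on, since the parseContent signature has varied between releases:

    import java.nio.charset.StandardCharsets;

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsCheck {
        public static void main(String[] args) {
            String robotsTxt = "User-agent: *\n"
                             + "Disallow: /*.php$\n";

            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            // Parse the raw robots.txt bytes; the last argument is the robot
            // name to match against User-agent lines.
            BaseRobotRules rules = parser.parseContent(
                    "http://example.com/robots.txt",
                    robotsTxt.getBytes(StandardCharsets.UTF_8),
                    "text/plain",
                    "mybot");

            System.out.println(rules.isAllowed("http://example.com/index.php"));
            System.out.println(rules.isAllowed("http://example.com/index.html"));
        }
    }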

In particular, the issue NUTCH-1455 looks to be quite related to your needs:

If the user-agent name(s) configured in http.robots.agents contains spaces, it is not matched even if it is exactly contained in the robots.txt: http.robots.agents = "Download Ninja,*"

Perhaps it's worth it to try/patch/submit the fix :)
