
Efficiently detect broken URLs in Java

What's the most efficient way to detect a broken URL (HTTP 404) in Java? I'd like to do this in a loop and take as little time as possible.


You can only detect a 404 after you've requested the URL: you'll get a header back with the code (200, or 301 for redirect, or 404 for missing file), and you can check that.

So you'll have to do the request and wait for a possible 404.

There's a rather good comment below that should not be skipped, so I'm repeating it here: Possible optimization (in the case of existing URLs): use a HEAD request instead of a GET.
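A minimal sketch of that approach with HttpURLConnection follows. The five-second timeouts and the helper name isNotFound are illustrative choices on my part, not something from the original answer:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class LinkChecker {

    // Send a HEAD request and report whether the server answered with HTTP 404.
    // A HEAD request asks for the headers only, so the body is never downloaded.
    static boolean isNotFound(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            int code = conn.getResponseCode(); // e.g. 200, 301, 404
            conn.disconnect();
            return code == HttpURLConnection.HTTP_NOT_FOUND;
        } catch (java.io.IOException e) {
            // Unreachable host, refused connection, etc. -- broken, but not a 404.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isNotFound("https://example.com/some-page"));
    }
}
```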


There are many different ways in which a URL can be broken:

  • Syntactically invalid
  • Contains a non-existent domain
  • Server is not reachable
  • Server does not accept connections
  • Server responds with an error

Except for the first, all of these can take a relatively long time (probably well over a second on average), and there is no way to speed it up since you're communicating with another computer.

The only thing you can do is to check many URLs in parallel using a thread pool.
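A rough sketch of that thread-pool idea using an ExecutorService; the pool size, timeouts, and example URLs are assumptions for illustration only:

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;

public class ParallelLinkChecker {

    // Check many URLs concurrently so slow or unreachable hosts
    // don't serialize the whole run.
    public static void main(String[] args) throws Exception {
        List<String> urls = List.of(
                "https://example.com/",
                "https://example.com/missing-page");

        ExecutorService pool = Executors.newFixedThreadPool(20);
        Map<String, Future<Integer>> results = new ConcurrentHashMap<>();

        for (String url : urls) {
            results.put(url, pool.submit(() -> {
                HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                conn.setRequestMethod("HEAD");
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                return conn.getResponseCode();
            }));
        }

        for (Map.Entry<String, Future<Integer>> e : results.entrySet()) {
            try {
                System.out.println(e.getKey() + " -> " + e.getValue().get());
            } catch (ExecutionException ex) {
                System.out.println(e.getKey() + " -> failed: " + ex.getCause());
            }
        }
        pool.shutdown();
    }
}
```

Each Future collects one status code (or the exception for an unreachable host), so one slow server only blocks its own worker thread rather than the whole loop.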


You can open a URL connection and verify whether the URL is broken by catching exceptions and checking the HTTP status code. If no exception is thrown and the HTTP status is 200, the URL is OK.

But be careful! Sometimes a URL is broken but the application returns a human-readable error page with status 200. For example, the site www.somecompany.com exists, but the page www.somecompany.com/foo.html does not exist any more. When you request it you get a "page does not exist" message, yet the HTTP status is 200. This can (sometimes) only be solved by parsing the page content.
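A sketch of that "parse the page content" workaround: fetch the body with GET and look for a site-specific error phrase. The phrase "page does not exist" below is only a placeholder assumption; every site words its error page differently:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SoftErrorCheck {

    // "Soft 404" heuristic: the status is 200, so download the body
    // and search it for a phrase the target site uses on its error page.
    static boolean looksLikeErrorPage(String url, String errorPhrase) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
            return true; // a non-200 status is already a hard failure
        }
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString().toLowerCase().contains(errorPhrase.toLowerCase());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(looksLikeErrorPage("https://www.somecompany.com/foo.html",
                                              "page does not exist"));
    }
}
```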


I wrote a GitHub Action that can help with continuous integration by testing all links before any merge or update. It reads all files with the given extensions, extracts all the links, and tests them one by one. The action is also available on the GitHub Marketplace for use in GitHub-hosted projects:

https://github.com/marketplace/actions/urls-checker

The scripts are in Python, so with very few changes you can also use them locally: https://github.com/SuperKogito/URLs-checker

Feel free to fork and star the repository if you find this useful ;)
