Efficiently detect broken URLs in Java
What's the most efficient way to detect a broken URL (HTTP 404) in Java? I'd like to do this in a loop and take as little time as possible.
You can only detect a 404 after you've requested the URL: the response comes back with a status code (200 for success, 301 for a redirect, 404 for a missing resource), and you can check that.
So you'll have to do the request and wait for a possible 404.
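There's a rather good comment below that should not be skipped, so I'm repeating it here: Possible optimization (in the case of existing URLs): use a HEAD request instead of a GET. A HEAD response carries the same status code and headers but no body, so nothing is downloaded for pages that do exist.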
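As a rough sketch of the HEAD approach, something like the following should work with the JDK's built-in HttpURLConnection. The class name HeadCheck, the method names, and the 5-second timeout are my own choices for illustration, not anything from the question.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class HeadCheck {

    /** Sends a HEAD request and returns the HTTP status code of the response. */
    static int headStatus(String address, int timeoutMillis) throws IOException {
        // Assumes an http:// or https:// address; other schemes would fail the cast.
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");          // headers only, no response body
            conn.setConnectTimeout(timeoutMillis);  // give up on unreachable hosts
            conn.setReadTimeout(timeoutMillis);     // give up on stalled responses
            return conn.getResponseCode();          // this call performs the request
        } finally {
            conn.disconnect();
        }
    }

    /** True only when the server explicitly answers 404. */
    static boolean isNotFound(String address) throws IOException {
        // 5000 ms is an arbitrary timeout picked for this sketch.
        return headStatus(address, 5000) == HttpURLConnection.HTTP_NOT_FOUND;
    }
}
```

Note that some servers handle HEAD poorly (for example by answering 405), so falling back to GET in that case may be worth considering.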
There are many different ways in which a URL can be broken:
- Syntactically invalid
- Contains a non-existing domain
- Server is not reachable
- Server does not accept connections
- Server responds with an error
Except for the first, all of these can take a relatively long time (probably well over a second on average), and there is no way to speed it up since you're communicating with another computer.
The only thing you can do is to check many URLs in parallel using a thread pool.
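A minimal sketch of the thread-pool idea, assuming a fixed-size pool from java.util.concurrent and a HEAD request per URL. ParallelLinkChecker, statusOf, checkAll, the -1 marker for failed requests, and the 5-second timeouts are all names and values I made up for the example.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelLinkChecker {

    /** Returns the HTTP status of one URL, or -1 if no response was obtained. */
    static int statusOf(String address) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(address).toURL().openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000); // arbitrary timeouts for this sketch
            conn.setReadTimeout(5000);
            return conn.getResponseCode();
        } catch (Exception e) {
            return -1; // invalid syntax, unknown host, refused connection, timeout, ...
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    /** Checks all addresses concurrently; the result map keeps the input order. */
    static Map<String, Integer> checkAll(List<String> addresses, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String address : addresses) {
                tasks.add(() -> statusOf(address));
            }
            // invokeAll blocks until every task has finished.
            List<Future<Integer>> futures = pool.invokeAll(tasks);
            Map<String, Integer> results = new LinkedHashMap<>();
            for (int i = 0; i < addresses.size(); i++) {
                results.put(addresses.get(i), futures.get(i).get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the work is almost entirely waiting on the network, the pool can reasonably be much larger than the number of CPU cores; the total time then approaches that of the slowest single URL rather than the sum of all of them.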
You can open a connection to the URL and verify whether it is broken by catching exceptions and checking the HTTP status code. If no exception is thrown and the HTTP status is 200, the URL is OK.
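Roughly, that could look like the sketch below, which maps the exceptions thrown by HttpURLConnection onto the different ways a URL can be broken. The LinkDiagnosis class and its Result enum are hypothetical names invented for this example, and treating only status 200 as OK simply follows the rule stated above.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLConnection;
import java.net.UnknownHostException;

class LinkDiagnosis {

    enum Result { OK, MALFORMED, NOT_HTTP, UNKNOWN_HOST, CONNECTION_FAILED, TIMEOUT, HTTP_ERROR, OTHER_IO_ERROR }

    static Result diagnose(String address, int timeoutMillis) {
        URL url;
        try {
            url = new URI(address).toURL(); // syntax check, no network traffic yet
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            return Result.MALFORMED;
        }
        HttpURLConnection conn = null;
        try {
            URLConnection raw = url.openConnection();
            if (!(raw instanceof HttpURLConnection)) {
                return Result.NOT_HTTP; // e.g. file: or ftp: addresses
            }
            conn = (HttpURLConnection) raw;
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            int code = conn.getResponseCode(); // performs the request
            return code == HttpURLConnection.HTTP_OK ? Result.OK : Result.HTTP_ERROR;
        } catch (UnknownHostException e) {
            return Result.UNKNOWN_HOST;        // domain does not resolve
        } catch (ConnectException | NoRouteToHostException e) {
            return Result.CONNECTION_FAILED;   // refused or unreachable
        } catch (SocketTimeoutException e) {
            return Result.TIMEOUT;             // no answer within the timeout
        } catch (IOException e) {
            return Result.OTHER_IO_ERROR;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }
}
```

HttpURLConnection follows same-protocol redirects by default, so a 301 that leads to a working page would normally come back as 200 here.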
But be careful! Sometimes a URL is broken but the application returns a human-readable error page with status 200. For example, the site www.somecompany.com exists but the page www.somecompany.com/foo.html does not exist any more. When you request it, you get the message "page does not exist", but the HTTP status is 200. The only way to catch this (and it only works sometimes) is to parse the page content.
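One possible way to sketch the content check is a plain GET followed by a search for typical error phrases. This is only a heuristic: the phrase list, the SoftNotFound class, and the assumption that the page is UTF-8 are all mine and would need tuning per site.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;

class SoftNotFound {

    // Hypothetical phrases; real sites word their error pages differently.
    static final List<String> ERROR_PHRASES =
            List.of("page does not exist", "page not found");

    /** True if the page text contains one of the known error phrases. */
    static boolean looksLikeErrorPage(String body) {
        String lower = body.toLowerCase(Locale.ROOT);
        for (String phrase : ERROR_PHRASES) {
            if (lower.contains(phrase)) {
                return true;
            }
        }
        return false;
    }

    /** True if the status is not 200, or the status is 200 but the body reads like an error page. */
    static boolean isBroken(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(5000); // arbitrary timeouts for this sketch
        conn.setReadTimeout(5000);
        try {
            if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return true;
            }
            try (InputStream in = conn.getInputStream()) {
                // Assumes UTF-8; a fuller version would read the charset from Content-Type.
                String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                return looksLikeErrorPage(body);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

This needs a GET rather than a HEAD, since the body has to be downloaded, so it is noticeably slower and probably only worth running on URLs that already returned 200.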
I wrote a GitHub Action that can help with continuous integration by testing all links before any merge or update. The action reads all files with the given extensions, extracts the links, and tests them one by one. It is also available on the GitHub Marketplace for use in GitHub-hosted projects:
https://github.com/marketplace/actions/urls-checker
The scripts are written in Python, so with very little change you can also run them locally: https://github.com/SuperKogito/URLs-checker
Feel free to fork and star the repository if you find this useful ;)