Java regex very slow (translate nested quantifiers to possessive quantifiers)
I found this regular expression for matching URLs (originally written in JavaScript by Daring Fireball). It works in Java, but in some cases it is extremely slow:
private final static String pattern =
"\\b" +
"(" + // Capture 1: entire matched URL
"(?:" +
"[a-z][\\w-]+:" + // URL protocol and colon
"(?:" +
"/{1,3}" + // 1-3 slashes
"|" + // or
"[a-z0-9%]" + // Single letter or digit or '%'
// (Trying not to match e.g. "URI::Escape")
")" +
"|" + // or
"www\\d{0,3}[.]" + // "www.", "www1.", "www2." … "www999."
"|" + // or
"[a-z0-9.\\-]+[.][a-z]{2,4}/" + // looks like domain name followed by a slash
")" +
"(?:" + // One or more:
"[^\\s()<>]+" + // Run of non-space, non-()<>
"|" + // or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" + // balanced parens, up to 2 levels
")+" +
"(?:" + // End with:
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" + // balanced parens, up to 2 levels
"|" + // or
"[^\\s`!\\-()\\[\\]{};:'\".,<>?«»“”‘’]" + // not a space or one of these punct chars (updated to add a 'dash')
")" +
")";
I found in the topic "Java Regular Expression running very slow" that the problem is in this block of code:
"(?:" + // One or more:
"[^\\s()<>]+" + // Run of non-space, non-()<>
"|" + // or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" + // balanced parens, up to 2 levels
")+"
It seems that to solve the problem I need to make the inner quantifiers (which are nested) possessive, but I don't know how to do that. Thanks in advance, and sorry for my bad English!
You can avoid all of this by using java.net.URL or java.net.URI to parse the URLs. java.net.URI does a better job of parsing than java.net.URL. Try that one.
Once you've parsed the URL, you can check each of the components; e.g. check that the hostname can be resolved.
If you want URLs that will resolve, you need to distinguish between absolute and non-absolute URLs, and check that the "scheme" is one that you can cope with.
You cannot check that a URL works (i.e. that it corresponds to a retrievable resource) without actually attempting to open the resource. And even that isn't a definitive test, for a number of possible reasons.
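As a rough illustration of the parsing approach described above, here is a minimal sketch using java.net.URI. The class name UriCheck, the method name looksLikeWebUrl, and the choice to accept only http and https are assumptions made for the example; it checks syntax only and does not try to resolve the host or open the resource.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriCheck {

    // Hypothetical helper: true if the text parses as an absolute
    // http/https URI that has a host component.
    static boolean looksLikeWebUrl(String text) {
        try {
            URI uri = new URI(text);
            String scheme = uri.getScheme();
            return uri.isAbsolute()
                    && uri.getHost() != null
                    && ("http".equalsIgnoreCase(scheme)
                        || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            // Not a syntactically valid URI (e.g. it contains spaces).
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeWebUrl("http://example.com/a_(b)"));
        System.out.println(looksLikeWebUrl("www.example.com/page"));
        System.out.println(looksLikeWebUrl("not a url"));
    }
}
```

Note that this validates a string you already have in hand; it does not find URLs inside free text the way the regex does, so you would still need to split the input into candidate tokens first.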
You might have a case of catastrophic backtracking: Check that your regex doesn't match the same characters in multiple groups, causing a runaway number of combinations that must be checked.
See this article for an explanation.
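To show what possessive quantifiers look like in Java, here is a small sketch applied only to the "balanced parens" sub-pattern from the question, not to the whole URL regex. The class and field names are made up for the example. A possessive quantifier is written by appending + to the quantifier (++ and *+), and it never gives back what it has matched. That is safe here because the text after the group must be a literal ")", which the character class can never match anyway. In the full pattern the outer ")+" probably cannot simply be made possessive, since the final "End with" group relies on the previous group giving back its last character.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {

    // Greedy version, as in the question: nested quantifiers of the
    // form (x+|...)* can backtrack exponentially when ")" is missing.
    static final Pattern GREEDY = Pattern.compile(
            "\\((?:[^\\s()<>]+|\\([^\\s()<>]+\\))*\\)");

    // Possessive version: ++ and *+ never give characters back,
    // so a failed match is reported without retrying every split.
    static final Pattern POSSESSIVE = Pattern.compile(
            "\\((?:[^\\s()<>]++|\\([^\\s()<>]++\\))*+\\)");

    // Builds "(" followed by n letters and no closing parenthesis.
    static String unclosed(int n) {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ok = "(foo(bar)baz)";
        System.out.println(GREEDY.matcher(ok).matches());
        System.out.println(POSSESSIVE.matcher(ok).matches());

        // Only the possessive pattern is run on the pathological input;
        // the greedy one may take a very long time on it.
        System.out.println(POSSESSIVE.matcher(unclosed(30)).matches());
    }
}
```

Both patterns accept the same well-formed input; the difference only shows up in how quickly a non-matching input is rejected.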