Need to prevent PHP regex segfault
Why does the following segfault, and how can I prevent it?
<?php
$str = ' <fieldset> <label for="go-to">Go to: </label> '
. str_repeat(' ', 10000)
. '<input type="submit" value="Go" /> </fieldset> </form>';
preg_match_all("@
</?(?![bisa]\b)(?!em\b)[^>]*> # starting tag, must not be one of several inline tags
(?:[^<]|</?(?:(?:[bisau]|em|strong|sup)\b)[^>]*>)* #allow text and some inline tags
[\?\!\.]+
@ix", $str, $matches);
?>
I believe it's causing a .... wait for it .... stack overflow.
EDIT:
The above is a simplified version the pattern that demonstrates the problem. A more complete version:
@
</?(?![bisa]\b)(?!em\b)[^>]*> # starting tag, must not be one of several inline tags
(?:[^<]|</?(?:(?:[bisau]|em|strong|sup)\b)[^>]*>)* # continue, allow text content and some inline tags
# normal sentence ending
[\?\!\.]+ # valid ending characters -- note elipses allowed
(?<!\b[ap]m\.)(?<!\b[ap]\.m\.)(?<!digg this\!)(?<!Stumble This\!) # disallow some false positives that we don't care about
\s*
(?:'|&\#0*34;|'|‘)?\s* # closing single quotes, in the unusual case like "he said: 'go away'".
(?:"|"|&\#0*34;|&\#x0*22;|”|&\#0*8221;|&\#x0*201D;|''|``|\xe2\x80\x9d|&\#0*148;|&\#x0*94;|\x94|\))?\s* # followed by any kind of close-quote char
(?=\<) # should be followed by a tag.
@ix
The purpose is to find html blocks that appear to end at what looks like a valid English sentence ending. I have found that this method is very good at telling the difference between 'content' text (like an article body) a开发者_开发百科nd 'layout' text (i.e., like navigational elements). Sometimes if there's a vast amount of white space in between tags it blows up, however.
The first thing I would try is making all the quantifiers possessive and all the groups atomic:
"@</?+(?![bisa]\b)(?!em\b)[^>]*+>
(?>[^<]++|</?+(?>(?>[bisau]|em|strong|sup)\b)[^>]*+>)*+
[?!.]+
@ix"
I think Jeremy's right: it's not backtracking per se that's killing you, it's all the state info the regex engine has to save to make backtracking possible. The regex seems to be constructed in such a way that if it ever has to backtrack, it's going to fail anyway. So use possessive quantifiers and atomic groups and don't bother saving all that useless info.
EDIT: to allow for the sentence-ending punctuation, you could add another alternative to the second line:
(?>[^<?!.]++|(?![^?!.\s<]++<)[?!.]++|</?+(?>(?>[bisau]|em|strong|sup)\b)[^>]*+>)*+
The addition matches one or more of said characters, unless they're the last non-whitespace characters in the element.
I'm fairly sure that even newer versions of PHP are bundled with PCRE 7.0 which has known segment fault issues. I don't think that there are any intentions on correcting the issue as it is technically a PCRE issue, not an issue with PHP.
If you tell us what you are attempting to accomplish your best bet would be to try to write an alternate expression.
The bug in question is: http://bugs.php.net/bug.php?id=40909
Does this still do what you want?
</?(?![bisa]\b)(?!em\b)[^>]*> # starting tag, must not be one of several inline tags
(?:(?>[^<\?\!\.]*)|</?(?:(?:[bisau]|em|strong|sup)\b)[^>]*>)* #allow text and some inline tags
[\?\!\.]+
Your regular expression causes massive amounts of backtracking. With 10000 characters in the middle, it would get pretty messy and slow. Still, I would not expect it to crash...!
精彩评论