PHP regular expression again
Is there any difference between:
preg_replace( '@<(script|style)[^>]*?>.*?</\\1>@si', '', $string );
and
preg_replace( '@<(script|style)[^>]*>.*</\\1>@si', '', $string );
?
Yes...
Consider this example string...
<script>bla</script><script>hello</script>
- The first one will stop matching as soon as it is satisfied; this is known as an ungreedy (lazy) match. In the above example, each match covers a single script element, starting with <script>bla</script>.
- The second one will match everything between the first opening tag and the last closing tag, possibly swallowing other elements in between. This is known as a greedy match, as it consumes as much as it can. It will match <script>bla</script><script>hello</script> as a single match.
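The difference is easy to see by running both patterns. This is a minimal sketch; the variable names and the second test string ($html, with a paragraph between the two script elements) are made up for illustration and are not from the question.

```php
<?php
// The two patterns from the question.
$lazy   = '@<(script|style)[^>]*?>.*?</\\1>@si';
$greedy = '@<(script|style)[^>]*>.*</\\1>@si';

// Example string from above.
$string = '<script>bla</script><script>hello</script>';

preg_match_all($lazy, $string, $lazyMatches);
preg_match_all($greedy, $string, $greedyMatches);

print_r($lazyMatches[0]);   // two matches, one per script element
print_r($greedyMatches[0]); // one match spanning the whole string

// Hypothetical string with content between the two script elements.
$html = '<script>bla</script><p>keep me</p><script>hello</script>';

$lazyResult   = preg_replace($lazy, '', $html);   // the paragraph survives
$greedyResult = preg_replace($greedy, '', $html); // the paragraph is removed too

var_dump($lazyResult, $greedyResult);
```

For the original example string both calls to preg_replace return an empty string, so the difference only becomes visible once there is something worth keeping between the elements.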
The first non-greedy quantifier (the ? in [^>]*?) probably doesn't need to be there: [^>]* can only consume characters other than >, so it has to stop at the > that closes the opening tag whether it is greedy or not.
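A quick check of that claim, as a sketch: the opening tag below is a made-up sample, and both variants of the character class should capture exactly the same text.

```php
<?php
// Hypothetical opening tag with attributes.
$html = '<script type="text/javascript" src="a.js">x</script>';

// Lazy and greedy versions of the opening-tag part of the pattern.
preg_match('@<(script|style)[^>]*?>@si', $html, $lazyTag);
preg_match('@<(script|style)[^>]*>@si', $html, $greedyTag);

// [^>] can never step over a >, so both stop at the same place.
var_dump($lazyTag[0] === $greedyTag[0]);
```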
I should also mention that using something like DOMDocument is a much better method of getting script and style elements.
// Parse the markup into a DOM tree.
$dom = new DOMDocument;
$dom->loadHTML($string);
// Collect every script and style element in the document.
$scripts = $dom->getElementsByTagName('script');
$styles = $dom->getElementsByTagName('style');
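Assuming the goal is to strip those elements (which is what the preg_replace calls in the question do), a sketch of the DOM version could look like the following. The sample markup and the $clean variable are invented for the example, and saveHTML() returns a full document, so the output includes a doctype and the html/body wrappers.

```php
<?php
// Hypothetical input document.
$string = '<html><head><style>p{color:red}</style></head>'
        . '<body><script>bla</script><p>keep me</p></body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($string);
libxml_clear_errors();

foreach (['script', 'style'] as $tag) {
    // The node list is live, so copy it to an array before removing nodes.
    $nodes = iterator_to_array($dom->getElementsByTagName($tag));
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
}

$clean = $dom->saveHTML();
echo $clean;
```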
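The extra ? inverts the greediness of the quantifier it follows (quantifiers are greedy by default in PHP):
/a+/ will match aaa in aaab
/a+?/ will match a in aaab
/a.*b/ will match aabab in aabab
/a.*?b/ will match aab in aabab
Laziness does not change where a match starts, though: /a*?b/ and /a+?b/ still match aaab in aaab, because the engine tries the leftmost starting position first and extends the lazy quantifier until b can match.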
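Quantifier behaviour is easy to verify with preg_match. In this sketch, firstMatch() is a hypothetical helper written for the example, not a PHP built-in.

```php
<?php
// Hypothetical helper: returns the first match of $pattern in $subject, or null.
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

var_dump(firstMatch('/a+/', 'aaab'));     // greedy: "aaa"
var_dump(firstMatch('/a+?/', 'aaab'));    // lazy: "a"
var_dump(firstMatch('/a.*b/', 'aabab'));  // greedy: "aabab"
var_dump(firstMatch('/a.*?b/', 'aabab')); // lazy: "aab"

// A lazy quantifier does not move the start of the match:
// the engine begins at the first a and extends until b can match.
var_dump(firstMatch('/a*?b/', 'aaab'));   // "aaab"
var_dump(firstMatch('/a+?b/', 'aaab'));   // "aaab"
```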
So, in your particular example, the non-greedy expression will catch one script tag and its contents at a time, while the greedy version will start matching at the first script tag and grab everything (including non-script areas) up to the very last matching closing tag.
Don't rely on either, though:
http://ha.ckers.org/xss.html