PHP regular expression again

Is there any difference between:

preg_replace( '@<(script|style)[^>]*?>.*?</\\1>@si', '', $string );

and

preg_replace( '@<(script|style)[^>]*>.*</\\1>@si', '', $string );

?


Yes...

Consider this example string...

<script>bla</script><script>hello</script>
  • The first one stops matching as soon as the pattern is satisfied; this is known as an ungreedy (or lazy) match.

In the above example, its first match covers only the first script element; preg_replace() then matches the second element separately.

  • The second one matches everything from the first opening tag to the last matching closing tag, possibly swallowing other elements and any content between them. This is known as a greedy match, as it consumes as much as it can.

It will match <script>bla</script><script>hello</script> in one go.
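
The difference only becomes visible in the output when something sits between the two elements. A minimal sketch, assuming a made-up input string with the text "keep me" between the scripts:

```php
<?php
// Hypothetical input: two script elements with plain text between them.
$string = '<script>bla</script>keep me<script>hello</script>';

// Lazy version: each script element is matched and removed on its own.
$lazy = preg_replace('@<(script|style)[^>]*?>.*?</\\1>@si', '', $string);

// Greedy version: one match runs from the first <script> to the last </script>.
$greedy = preg_replace('@<(script|style)[^>]*>.*</\\1>@si', '', $string);

var_dump($lazy);   // string(7) "keep me"
var_dump($greedy); // string(0) ""
```

The greedy pattern silently deletes the text between the elements, which is rarely what you want.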

The first non-greedy quantifier (the one in [^>]*?) probably doesn't need to be there: [^>] can never match a >, so the class runs up to the closing > of the opening tag whether it is greedy or not.
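
One way to convince yourself of this is to compare what the two opening-tag patterns capture. A small sketch, with a hypothetical $html string made up for the test:

```php
<?php
// Hypothetical markup with attributes on the opening tags.
$html = '<script type="text/javascript">a</script><style media="all">b</style>';

// Opening-tag part of each pattern only: lazy class vs. greedy class.
preg_match_all('@<(script|style)[^>]*?>@si', $html, $lazyTags);
preg_match_all('@<(script|style)[^>]*>@si', $html, $greedyTags);

// Both quantifiers yield the same list of opening tags.
var_dump($lazyTags[0] === $greedyTags[0]); // bool(true)
```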

I should also mention that using something like DOMDocument is a much better way of getting at script and style elements.

$dom = new DOMDocument;

$dom->loadHTML($string);

$scripts = $dom->getElementsByTagName('script');

$styles = $dom->getElementsByTagName('style');
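
If the goal is to strip those elements rather than just find them, the same approach can be extended along these lines. This is only a sketch: the sample $string, the $clean variable and the use of libxml_use_internal_errors() to silence parser warnings are my own additions, not part of the question.

```php
<?php
// Hypothetical input used only for this sketch.
$string = '<div><p>Hello</p><script>alert(1)</script><style>p{color:red}</style></div>';

// Collect libxml warnings internally instead of emitting them for imperfect HTML.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($string);
libxml_clear_errors();

foreach (['script', 'style'] as $tagName) {
    // Copy the live node list to an array first; removing nodes while
    // iterating a live DOMNodeList would skip elements.
    $nodes = iterator_to_array($dom->getElementsByTagName($tagName));
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
}

// saveHTML() returns the whole document, including the wrapper
// elements that loadHTML() adds around a fragment.
$clean = $dom->saveHTML();
```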


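The extra ? will invert the greediness of the quantifier (quantifiers are greedy by default in PHP). Keep in mind that the engine always takes the leftmost match first, so laziness only changes where a match ends, never where it starts:

  • /a+b/ will match aaab in aaab
  • /a*b/ will match aaab in aaab
  • /a*?b/ will also match aaab in aaab, because the match starts at the first a and the lazy a*? has to expand until b can match
  • /a+?b/ will likewise match aaab in aaab
  • /a.*b/ will match all of aabab, while /a.*?b/ will match only aab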
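
These cases are easy to check with preg_match(). Note that the regex engine always reports the leftmost match, so a lazy quantifier only shortens a match when a shorter one exists from the same starting position. A quick sketch (the aabab test string is my own addition):

```php
<?php
// Collect what each pattern matches in "aaab".
$results = [];
foreach (['/a+b/', '/a*b/', '/a*?b/', '/a+?b/'] as $pattern) {
    preg_match($pattern, 'aaab', $m);
    $results[$pattern] = $m[0];
    echo $pattern, ' => ', $m[0], "\n"; // every pattern prints aaab
}

// Greediness matters once a shorter match exists from the same start.
preg_match('/a.*b/', 'aabab', $greedy);
preg_match('/a.*?b/', 'aabab', $lazy);
echo $greedy[0], "\n"; // aabab
echo $lazy[0], "\n";   // aab
```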

So, in your particular example, the non-greedy expression will catch a single script tag and its contents, while the greedy version will start matching at the first script tag and grab everything (including non-script areas) up to the very last closing script tag.

Don't rely on either, though:

http://ha.ckers.org/xss.html
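
As one illustration of why, here is a sketch with a made-up payload that contains no script or style element at all, so neither pattern touches it:

```php
<?php
// Hypothetical payload: the JavaScript lives in an event-handler attribute.
$payload = '<img src="x" onerror="alert(1)">';

$afterLazy   = preg_replace('@<(script|style)[^>]*?>.*?</\\1>@si', '', $payload);
$afterGreedy = preg_replace('@<(script|style)[^>]*>.*</\\1>@si', '', $payload);

// Both calls return the payload unchanged.
var_dump($afterLazy === $payload);   // bool(true)
var_dump($afterGreedy === $payload); // bool(true)
```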
