Removing all script tags from html with JS Regular Expression

2023-03-19 08:34

I want to strip script tags out of this HTML at Pastebin:

http://pastebin.com/mdxygM0a

I tried using the below regular expression:

html.replace(/<script.*>.*<\/script>/ims, " ")

But it does not remove all of the script tags in the HTML. It only removes in-line scripts. I'm looking for some regex that can remove all of the script tags (in-line and multi-line). It would be highly appreciated if a test is carried out on my sample http://pastebin.com/mdxygM0a

jQuery uses a regex to remove script tags in some cases and I'm pretty sure its devs had a damn good reason to do so. Probably some browser does execute scripts when inserting them using innerHTML.

Here's the regex:

/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi

And before people start crying "but regexes for HTML are evil": yes, they are, but for script tags they are safe because of a special behaviour: a <script> section may not contain </script> at all unless it is meant to end at that position. So matching it with a regex is easily possible. However, from a quick look, the regex above does not account for trailing whitespace inside the closing tag, so you'd have to test whether </script   > etc. will still work.

Attempting to remove HTML markup using a regular expression is problematic. You don't know what's in there as script or attribute values. One way is to insert it as the innerHTML of a div, remove any script elements and return the innerHTML, e.g.

  function stripScripts(s) {
    var div = document.createElement('div');
    div.innerHTML = s;
    var scripts = div.getElementsByTagName('script');
    var i = scripts.length;
    while (i--) {
      scripts[i].parentNode.removeChild(scripts[i]);
    }
    return div.innerHTML;
  }

alert(
 stripScripts('<span><script type="text/javascript">alert(\'foo\');<\/script><\/span>')
);

Note that at present, browsers will not execute the script if inserted using the innerHTML property, and likely never will especially as the element is not added to the document.

Regexes are beatable, but if you have a string version of HTML that you don't want to inject into a DOM, they may be the best approach. You may want to put it in a loop to handle something like:

<scr<script>Ha!</script>ipt> alert(document.cookie);</script>

Here's what I did, using the jquery regex from above:

var SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;
while (SCRIPT_REGEX.test(text)) {
    text = text.replace(SCRIPT_REGEX, "");
}
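As a minimal sketch of why the loop matters, the snippet below runs the same regex once and then repeatedly on the nested example. The helper name stripScriptsRepeatedly is hypothetical; it simply keeps replacing until a pass changes nothing, which also avoids calling test() on a global regex.

```javascript
// Same pattern as the jQuery regex quoted above.
var SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical helper: repeat the replacement until the text stops changing.
function stripScriptsRepeatedly(text) {
  var previous;
  do {
    previous = text;
    text = text.replace(SCRIPT_REGEX, "");
  } while (text !== previous);
  return text;
}

var nested = "<scr<script>Ha!</script>ipt> alert(document.cookie);</script>";

// A single pass removes the inner block and thereby assembles a new one.
var onePass = nested.replace(SCRIPT_REGEX, "");
console.log(onePass); // <script> alert(document.cookie);</script>

// Repeating the pass removes the newly assembled block as well.
var cleaned = stripScriptsRepeatedly(nested);
console.log(JSON.stringify(cleaned)); // ""
```

This only shows that repeated passes close this particular hole; it does not make the regex approach safe in general.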

This Regex should work too:

<script(?:(?!\/\/)(?!\/\*)[^'"]|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\/\/.*(?:\n)|\/\*(?:(?:.|\s))*?\*\/)*?<\/script>

It even allows to have "problematic" variable strings like these inside:

<script type="text/javascript">
   var test1 = "</script>";
   var test2 = '\'</script>';
   var test1 = "\"</script>";
   var test1 = "<script>\"";
   var test2 = '<scr\'ipt>';
   /* </script> */
   // </script>
   /* ' */
   // var foo=" '
</script>

It seems that jQuery and Prototype fail on these ones...

Edit July 31 '17: Added a) non-capturing groups for better performance (and no empty groups) and b) support for JavaScript comments.

Whenever you have to resort to regex-based script tag cleanup, at least allow for whitespace in the closing tag, in the form of

</script\s*>

Otherwise things like

<script>alert(666)</script   >

would remain since trailing spaces after tagnames are valid.
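A rough sketch of that advice applied to the jQuery regex quoted earlier in this thread. The \s* additions are an adaptation for illustration, not jQuery's own pattern, and the variable names are made up.

```javascript
// The pattern quoted earlier, unchanged.
var SCRIPT_REGEX_ORIGINAL = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Assumed adaptation: allow whitespace before the ">" of the closing tag,
// both in the lookahead and in the final closing-tag match.
var SCRIPT_REGEX_WS = /<script\b[^<]*(?:(?!<\/script\s*>)<[^<]*)*<\/script\s*>/gi;

var sample = "<p>a</p><script>alert(666)</script   ><p>b</p>";

// The original pattern finds no match, so the script survives.
var withOriginal = sample.replace(SCRIPT_REGEX_ORIGINAL, "");
console.log(withOriginal === sample); // true

// The adapted pattern removes the script including its padded closing tag.
var withWhitespace = sample.replace(SCRIPT_REGEX_WS, "");
console.log(withWhitespace); // <p>a</p><p>b</p>
```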

If you want to remove all JavaScript code from some HTML text, then removing <script> tags isn't enough, because JavaScript can still live in "onclick", "onerror", "href" and other attributes.

Try out this npm module which handles all of this: https://www.npmjs.com/package/strip-js
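To illustrate the point about attributes, here is a small sketch using the jQuery regex from earlier in the thread on a made-up input. It only demonstrates what is left behind; it is not a sanitizer.

```javascript
// Same script-tag pattern as quoted earlier in the thread.
var SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Illustrative input: one inline handler, one javascript: URL, one script tag.
var html =
  '<img src="x" onerror="alert(1)">' +
  '<a href="javascript:alert(2)">link</a>' +
  '<script>alert(3)</script>';

var stripped = html.replace(SCRIPT_REGEX, "");

// The <script> element is gone, but both attribute-based payloads remain.
console.log(stripped);
// <img src="x" onerror="alert(1)"><a href="javascript:alert(2)">link</a>
```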

Why not use jQuery.parseHTML() http://api.jquery.com/jquery.parsehtml/?

You can do this without a regular expression. Simply cast your HTML string to an HTML node using document.createElement(), find all scripts with element.getElementsByTagName('script'), and then just remove() them!

Fun fact: SO's demo does not like it when you create an element with a <script> tag! The snippet below will not run, but it does work at: Full Working Demo at JSBin.com.

var el = document.createElement( 'html' );
el.innerHTML = "<p>Valid paragraph.</p><p>Another valid paragraph.</p><script>Dangerous scripting!!!</script><p>Last final paragraph.</p>";

var scripts = el.getElementsByTagName( 'script' ); // Live HTMLCollection of the script elements

// Iterate backwards: the collection is live, so removing an element re-indexes the rest
for(var i = scripts.length - 1; i >= 0; i--) {
  scripts[i].remove();
}

console.log(el.innerHTML);

This is a much cleaner solution than a regex, imho.

In my case, I needed to parse out the page title AND have all the other goodness of jQuery, minus it firing scripts. Here is my solution that seems to work.

        $.get('/somepage.htm', function (data) {
            // excluded code to extract title for simplicity
            var bodySI = data.indexOf('<body>') + '<body>'.length,
                bodyEI = data.indexOf('</body>'),
                body = data.substr(bodySI, bodyEI - bodySI),
                $body;

            body = body.replace(/<script[^>]*>/gi, ' <!-- ');
            body = body.replace(/<\/script>/gi, ' --> ');

            //console.log(body);

            $body = $('<div>').html(body);
            console.log($body.html());
        });

This kind of shortcut sidesteps the worries about scripts: you are not trying to remove the script tags and their content. Instead, you replace the tags with comment delimiters, so each script declaration ends up inside an HTML comment and is rendered useless.

Let me know if that still presents a problem as it will help me too.
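A small sketch of what those two replacements produce, pulled out into a hypothetical helper named commentOutScripts so it can be tried on plain strings without jQuery. The sample inputs are made up.

```javascript
// Hypothetical helper wrapping the two replace() calls from the answer above.
function commentOutScripts(body) {
  return body
    .replace(/<script[^>]*>/gi, " <!-- ")
    .replace(/<\/script>/gi, " --> ");
}

// An external script: the tag pair becomes an empty comment.
var simple = commentOutScripts('<p>a</p><script src="x.js"></script><p>b</p>');
console.log(simple); // <p>a</p> <!--  --> <p>b</p>

// A script body that itself contains "-->" passes through unchanged.
var tricky = commentOutScripts("<script>if (a-->b) evil()</script>");
console.log(tricky); //  <!-- if (a-->b) evil() --> 
```

One thing to watch for: an HTML comment ends at the first `-->`, so in the second case the comment would close early and the rest of the script body would become ordinary page text rather than stay hidden.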

Try this:

var text = text.replace(/<script[^>]*>(?:(?!<\/script>)[^])*<\/script>/g, "")

Here are a variety of shell scripts you can use to strip out different elements.

# doctype
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/<\!DOCTYPE\s\+html[^>]*>/<\!DOCTYPE html>/gi" {} \;

# meta charset
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/<meta[^>]*content=[\"'][^\"']*utf-8[\"'][^>]*>/<meta charset=\"utf-8\">/gi" {} \;

# script text/javascript
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<script[^>]*\)\(\stype=[\"']text\/javascript[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

# style text/css
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<style[^>]*\)\(\stype=[\"']text\/css[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

# html xmlns
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<html[^>]*\)\(\sxmlns=[\"'][^\"']*[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

# html xml:lang
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<html[^>]*\)\(\sxml:lang=[\"'][^\"']*[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

/(?:(?!</s\w)<[^<])</s\w*/gi; - Removes any sequence in any combination with

Don't use regex to parse HTML.

Consider the following string:

var str = "<script>var false_closing_tag = '</script>';</script>";
var stripped = str.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
console.log(stripped); // Logs: ';</script>

The current, top voted regex answer will fail to fully remove this. (Try it). I can't even run that in the SO editor or JSFiddle because both of them are using insufficient means to parse the code before running it.

And the other option which involves adding it to a <div> element and then pulling the innerText of the div has negative side effects as well: It will actually run the code (which is a security concern) and it will remove ALL HTML and not just script tags.

The Solution

You need to actually parse the text:

function stripScriptTags(str){
  if(typeof str !== 'string') {
    return false;
  }
  var opened_quote_type = null;
  var in_script_tag = false;
  var string_buffer = [];
  for (let i = 0; i < str.length; i++) {
    if(!in_script_tag){
      if(str.slice(i, i+7).toUpperCase() === '<SCRIPT'){
        // Enter script mode and skip the tag name; quote tracking starts fresh
        in_script_tag = true;
        opened_quote_type = null;
        i += 6;
      }else{
        string_buffer.push(str[i]);
      }
      continue;
    }
    // Inside a script: track quotes so that a quoted '</script>' does not end it
    if(opened_quote_type === null && ["'", '"', '`'].includes(str[i])){
      opened_quote_type = str[i];
    }else if(opened_quote_type === str[i]){
      opened_quote_type = null;
    }
    if(opened_quote_type === null && str.slice(i, i+9).toUpperCase() === '</SCRIPT>'){
      // Skip the closing tag; the loop's i++ steps past its final '>'
      i += 8;
      in_script_tag = false;
    }
  }
  return string_buffer.join('');
}

You can try

$("#your_div_id").remove();

$("#your_div_id").html("");
