.NET Remove/Strip JavaScript and CSS code blocks from HTML page

2023-03-13 11:57

I have an HTML string that contains JavaScript and CSS code blocks:

<script type="text/javascript">

  alert('hello world');

</script>

<style type="text/css">
  A:link {text-decoration: none}
  A:visited {text-decoration: none}
  A:active {text-decoration: none}
  A:hover {text-decoration: underline; color: red;}
</style>

How can I strip those blocks? Any suggestions for regular expressions that could be used to remove them?

The quick 'n' dirty method would be a regex like this:

var regex = new Regex(
   "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)", 
   RegexOptions.Singleline | RegexOptions.IgnoreCase
);

string output = regex.Replace(input, "");

The better* (but possibly slower) option would be to use HtmlAgilityPack:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlInput);

// SelectNodes returns null (not an empty collection) when nothing matches.
var nodes = doc.DocumentNode.SelectNodes("//script|//style");

if (nodes != null)
{
    foreach (var node in nodes)
        node.ParentNode.RemoveChild(node);
}

string htmlOutput = doc.DocumentNode.OuterHtml;

*) For a discussion about why it's better, see this thread.
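
To illustrate one way the regex can go wrong (my own example, not taken from the linked thread): if a tag name shows up inside an attribute value, the pattern starts matching there and swallows real content up to the next closing tag. A minimal sketch, where RegexPitfallDemo and the sample input are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPitfallDemo
{
    // Same pattern as the quick 'n' dirty regex above.
    private static readonly Regex ScriptOrStyle = new Regex(
        "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Removes everything the pattern considers a script or style block.
    public static string Strip(string input)
    {
        return ScriptOrStyle.Replace(input, "");
    }
}
```

For the input `<p title="<script>">keep me</p><script>x()</script>`, Strip returns `<p title="`: the match begins at the `<script` inside the title attribute and runs to the first `</script>`, so the paragraph text is lost. A parser that understands attribute quoting should not be fooled this way.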

Use HtmlAgilityPack for better results,

or try this function:

public string RemoveScriptAndStyle(string HTML)
{
    string Pat = "<(script|style)\\b[^>]*?>.*?</\\1>";
    return Regex.Replace(HTML, Pat, "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
}

Just look for an opening <script tag, and then remove everything from it through the closing </script> tag.

Do the same for the style block. See Google for string manipulation tips.
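
A rough sketch of that string-search idea, assuming blocks are not nested and the closing tag is written exactly as `</script>` or `</style>` with no extra whitespace. The names TagBlockStripper, RemoveBlocks and RemoveScriptAndStyleBlocks are made up for this example:

```csharp
using System;
using System.Text;

public static class TagBlockStripper
{
    // Removes every <tagName ...> ... </tagName> block using plain string searches.
    // Caveat: the prefix search also matches longer names such as "<scripture".
    public static string RemoveBlocks(string html, string tagName)
    {
        string open = "<" + tagName;
        string close = "</" + tagName + ">";
        var result = new StringBuilder(html.Length);
        int pos = 0;

        while (true)
        {
            int start = html.IndexOf(open, pos, StringComparison.OrdinalIgnoreCase);
            if (start < 0) break;

            int end = html.IndexOf(close, start, StringComparison.OrdinalIgnoreCase);
            if (end < 0) break; // unterminated block: keep the rest unchanged

            result.Append(html, pos, start - pos); // copy the text before the block
            pos = end + close.Length;              // skip past the closing tag
        }

        result.Append(html, pos, html.Length - pos); // copy whatever is left
        return result.ToString();
    }

    public static string RemoveScriptAndStyleBlocks(string html)
    {
        return RemoveBlocks(RemoveBlocks(html, "script"), "style");
    }
}
```

Calling `TagBlockStripper.RemoveScriptAndStyleBlocks(html)` gives the stripped string. This has the same blind spots as the regex versions (comments, attribute values), so it is only worth it if you cannot use a parser.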

I reinvented the wheel here. It may not be as correct as HtmlAgilityPack, but it is about 5-6 times faster on a page of around 400 KB. It also removes digits and can convert characters to lowercase (it was made for a tokenizer).

    private static readonly List<byte[]> SPECIAL_TAGS = new List<byte[]>
    {
        Encoding.ASCII.GetBytes("script"),
        Encoding.ASCII.GetBytes("style"),
        Encoding.ASCII.GetBytes("noscript")
    };

    private static readonly List<byte[]> SPECIAL_TAGS_CLOSE = new List<byte[]>
    {
        Encoding.ASCII.GetBytes("/script"),
        Encoding.ASCII.GetBytes("/style"),
        Encoding.ASCII.GetBytes("/noscript")
    };

    public static string StripTagsCharArray(string source, bool toLowerCase)
    {
        var array = new char[source.Length]; // output buffer; never longer than the input
        var arrayIndex = 0;
        var inside = false;                  // true between '<' and '>'
        var haveSpecialTags = false;         // true while inside a script/style/noscript block
        var compareIndex = -1;               // position of the current character after '<'
        var singleQuoteMode = false;
        var doubleQuoteMode = false;
        var matchMemory = SetDefaultMemory(SPECIAL_TAGS);
        for (int i = 0; i < source.Length; i++)
        {
            var let = source[i];
            if (inside && !singleQuoteMode && !doubleQuoteMode)
            {
                compareIndex++;
                if (haveSpecialTags)
                {
                    // Inside a special block: look for its closing tag.
                    var endTag = CheckSpecialTags(let, compareIndex, SPECIAL_TAGS_CLOSE, ref matchMemory);
                    if (endTag) haveSpecialTags = false;
                }
                if (!haveSpecialTags)
                {
                    // Otherwise look for the start of a special block.
                    haveSpecialTags = CheckSpecialTags(let, compareIndex, SPECIAL_TAGS, ref matchMemory);
                }
            }
            // Quotes are only tracked inside special blocks.
            if (haveSpecialTags && let == '"')
            {
                doubleQuoteMode = !doubleQuoteMode;
            }
            if (haveSpecialTags && let == '\'')
            {
                singleQuoteMode = !singleQuoteMode;
            }
            if (let == '<')
            {
                // A new tag starts: reset the matching state.
                matchMemory = SetDefaultMemory(SPECIAL_TAGS);
                compareIndex = -1;
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (inside) continue;            // skip characters that belong to a tag
            if (char.IsDigit(let)) continue; // digits are always dropped
            if (haveSpecialTags) continue;   // skip script/style/noscript content
            array[arrayIndex] = toLowerCase ? char.ToLowerInvariant(let) : let;
            arrayIndex++;
        }
        return new string(array, 0, arrayIndex);
    }

    private static bool[] SetDefaultMemory(List<byte[]> specialTags)
    {
        var memory = new bool[specialTags.Count];
        for (int i = 0; i < memory.Length; i++)
        {
            memory[i] = true;
        }
        return memory;
    }
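
The code above calls CheckSpecialTags, which is not included in the post. Below is a guess at what such a helper could look like, not the author's original. It assumes that memory[i] stays true while specialTags[i] still matches character by character, that the method returns true on the character that completes a tag name, and that tag names should match case-insensitively. The wrapper class SpecialTagMatcher is made up so the snippet compiles on its own; in practice the method would sit next to StripTagsCharArray as a private static method.

```csharp
using System.Collections.Generic;

public static class SpecialTagMatcher
{
    // Hypothetical reconstruction, not the author's code.
    // let:    the current character inside a tag
    // index:  zero-based position of that character after '<'
    // memory: memory[i] is true while specialTags[i] is still a candidate
    public static bool CheckSpecialTags(char let, int index, List<byte[]> specialTags, ref bool[] memory)
    {
        for (int i = 0; i < specialTags.Count; i++)
        {
            if (!memory[i]) continue; // this tag was already ruled out

            byte[] tag = specialTags[i];
            if (index >= tag.Length || char.ToLowerInvariant(let) != (char)tag[index])
            {
                memory[i] = false;    // mismatch, or the tag name is already used up
                continue;
            }
            if (index == tag.Length - 1) return true; // last character of the name matched
        }
        return false;
    }
}
```

Note that this only compares prefixes, so a tag such as <scripts> would also count as a match. That may or may not be what the original helper did.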

Similar to Elian Ebbing's answer and Rajeev's answer, I opted for the more stable solution of using an HTML library rather than regular expressions. But instead of HtmlAgilityPack I used AngleSharp, which gave me jQuery-like selectors, in .NET Core 3:

//using AngleSharp;
var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(sourceHtml)); // generate HTML DOM from source html string
var elems = document.QuerySelectorAll("script, style"); // get script and style elements
foreach(var elem in elems)
{
    var parent = elem.Parent;
    parent.RemoveChild(elem); // remove element from DOM
}
var resultHtml = document.DocumentElement.OuterHtml; // HTML result as a string
