What is a regular expression and C# code to strip any HTML tag except links?

2023-01-11 20:33

I'm creating a CLR user-defined function in SQL Server 2005 to clean up a lot of database tables.

The task is to remove almost all tags except links ('a' tags and their 'href' attributes). So I divided the problem into two stages: 1. creating a user-defined SQL Server function, and 2. creating a SQL Server script that updates all the involved tables by calling the CLR function.

For the user-defined function, and given the restricted environment, I prefer to do this with native libraries. That means not using the Html Agility Pack, for example.

In JavaScript, this regular expression apparently does the job:

 <\s*a[^>]\s*href=(.*)>(.*?)<\s*/\s*a>

At least, according to http://www.pagecolumn.com/tool/regtest.htm

But I don't know how to translate that (especially the capturing groups) into C# code so that the captured text becomes part of the output.

For instance, if the input is <a href="http://example.com">some text</a>, how do I keep the text "http://example.com" and "some text" as part of the output in C# code while stripping any other HTML tags (and their content)?

Your regular expression is completely wrong:

<\s*a[^>]\s*href=(.*)>(.*?)<\s*/\s*a>
      ↑            ↑
      1.           2.

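1. [^>] matches exactly one character that is not >, and that character can be a letter. This causes <aa..., <ab..., <ac... etc. to match too.

2. (.*) is greedy: it takes as much as it can and only backs off as far as the last > that still allows a match. This causes you to overmatch. For example, consider this input: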

<a href='/one'>One</a> <a href='/two'>Two</a>
        ├───────────────────────────┤ ├─┤
                   group 1            grp2
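
One way to address both problems, as a minimal sketch rather than a definitive fix: put a word boundary after the tag name, read the URL only up to its closing quote, and keep the anchor text lazy. The pattern, the class name LinkFinder and the method FindLinks are assumptions made for illustration. The sketch also assumes the href value is quoted and that tags are not nested in unusual ways.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkFinder
{
    // Assumed pattern, not the asker's original:
    //  - a\b rejects <aa..., <ab..., <ac...
    //  - [^>]*? allows other attributes before href without leaving the tag
    //  - the URL is read up to its closing quote instead of with a greedy .*
    //  - .*? ends the anchor text at the first closing </a>
    private static readonly Regex Link = new Regex(
        @"<\s*a\b[^>]*?\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')[^>]*>(?<text>.*?)<\s*/\s*a\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns one (url, text) pair per link found in the input.
    public static List<KeyValuePair<string, string>> FindLinks(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in Link.Matches(html))
        {
            links.Add(new KeyValuePair<string, string>(
                m.Groups["url"].Value, m.Groups["text"].Value));
        }
        return links;
    }
}
```

Called on the input above, FindLinks returns two pairs, ("/one", "One") and ("/two", "Two"), and a tag such as <ab href='/x'> is ignored.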

Not quite as bomb-proof as Jordan's, but an example using Matches instead:

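using System;
using System.Text.RegularExpressions;

string input = "<a href=\"http://example.com\">some text</a>";
// [^>]* and [^""]* keep the match inside one tag and one attribute value;
// .*? stops the name at the first </a>, so several links on a line stay separate.
var pattern = @"<a\b[^>]*\bhref=""(?<url>[^""]*)""[^>]*>(?<name>.*?)</a>";
var matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
{
    var groups = match.Groups;
    Console.WriteLine("{0}, {1}", groups["url"].Value, groups["name"].Value);
}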
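
Extracting the groups does not by itself produce the cleaned string the question asks for. A possible next step, sketched here under stated assumptions, is to run Regex.Replace with a MatchEvaluator over every tag, rebuilding 'a' tags with only their href and dropping everything else. The names HtmlCleaner and StripTagsExceptLinks are hypothetical. The sketch keeps the inner text of removed tags (an assumption about what "stripping" should mean), expects quoted href values, and does not handle comments, script content, or a stray < in ordinary text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Matches any opening or closing tag. "close" captures the slash of a
    // closing tag and "name" captures the tag name.
    private static readonly Regex AnyTag = new Regex(
        @"<\s*(?<close>/)?\s*(?<name>[a-zA-Z][^\s/>]*)[^>]*>",
        RegexOptions.Compiled);

    // Finds a double-quoted or single-quoted href attribute inside one tag.
    private static readonly Regex Href = new Regex(
        @"\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string StripTagsExceptLinks(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }

        return AnyTag.Replace(html, tag =>
        {
            bool isAnchor = string.Equals(
                tag.Groups["name"].Value, "a", StringComparison.OrdinalIgnoreCase);
            if (!isAnchor)
            {
                return string.Empty;      // drop every non-anchor tag
            }
            if (tag.Groups["close"].Success)
            {
                return "</a>";            // normalise the closing tag
            }
            // Rebuild the opening tag with the href attribute only.
            Match href = Href.Match(tag.Value);
            return href.Success
                ? "<a href=\"" + href.Groups["url"].Value + "\">"
                : "<a>";
        });
    }
}
```

For the input <p>See <a class="x" href="http://example.com">some <b>text</b></a>.</p> this returns See <a href="http://example.com">some text</a>. Inside a CLR user-defined function, the same method could presumably be called from a static method marked with the SqlFunction attribute that takes and returns SqlString.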

In the end, I wrote a separate .NET console program that combines HtmlAgilityPack (HAP) with queries against SQL Server. In the program I used a naive regular expression to isolate the fragments, retrieved the href values and anchor texts with HAP, and then composed the final output, stripping out every character except letters, numbers, and some punctuation.
