What is a regular expression and C# code to strip any html tag except links?
I'm creating a CLR user-defined function in SQL Server 2005 to do some cleaning in a lot of database tables.
The task is to remove almost all tags except links ('a' tags and their 'href' attributes). So I divided the problem into two stages: 1. creating a user-defined SQL Server function, and 2. creating a SQL Server script that updates all the involved tables by calling the CLR function.
For the user-defined function, and given the restricted environment, I prefer to do this with native libraries. That means not using the Html Agility Pack, for example.
In JavaScript, this regular expression apparently does the right job:
<\s*a[^>]\s*href=(.*)>(.*?)<\s*/\s*a>
At least, according to http://www.pagecolumn.com/tool/regtest.htm
But, I don't know how to translate that (especially, the capturing groups part) into C# code to use the text as part of the output.
For instance, if the input is <a href="http://example.com">some text</a>, how do I save the text "http://example.com" and "some text" as part of the output in C# code while at the same time stripping any other possible HTML tags (and their content)?
Your regular expression is completely wrong:
<\s*a[^>]\s*href=(.*)>(.*?)<\s*/\s*a>
      ↑           ↑
      1.          2.
1. [^>] matches exactly one character that is not >, so this causes <aa..., <ab..., <ac... etc. to match too.
2. (.*) is greedy, so this causes you to overmatch. For example, consider this input:
<a href='/one'>One</a> <a href='/two'>Two</a>
        ├───────────────────────────┤ ├─┤
                   group 1            grp2
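Both problems in the pattern above can be avoided by requiring whitespace after the a and by using lazy quantifiers. What follows is only a minimal sketch, not a bomb-proof HTML parser: the names LinkExtractor, KeepOnlyLinks, AnchorPattern and AnyTag are made up for illustration, it assumes the href value is wrapped in single or double quotes, and it assumes that everything outside the links (tags and their text) should be dropped. It uses only System.Text.RegularExpressions, so no external library is needed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LinkExtractor
{
    // "\s" after "a" rejects <aa...>, <ab...>, etc.
    // (?<q>["']) remembers the opening quote and \k<q> requires the same closing quote.
    // The lazy ".*?" keeps each capture inside a single link.
    private static readonly Regex AnchorPattern = new Regex(
        @"<\s*a\s[^>]*?href\s*=\s*(?<q>[""'])(?<url>.*?)\k<q>[^>]*>(?<text>.*?)<\s*/\s*a\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Any remaining tag, used to clean nested markup out of the link text.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Returns only the links found in html, rebuilt as <a href="url">text</a>
    // and separated by a single space. Everything else is discarded (assumption).
    public static string KeepOnlyLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in AnchorPattern.Matches(html))
        {
            string url = m.Groups["url"].Value;
            string text = AnyTag.Replace(m.Groups["text"].Value, "");
            links.Add("<a href=\"" + url + "\">" + text + "</a>");
        }
        return string.Join(" ", links.ToArray());
    }
}
```

Calling LinkExtractor.KeepOnlyLinks on the two-link input above yields both links separately instead of one overlong match, and the captured values are read through m.Groups["url"].Value and m.Groups["text"].Value. Inside a CLR function the same method body could be reused, with the SqlString conversion added around it.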
Not quite as bomb-proof as Jordan's, but an example using Matches instead:
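// Requires: using System; using System.Text.RegularExpressions;
// Negated character classes and lazy quantifiers keep each match inside a single <a> element.
var pattern = @"<a\s[^>]*?href=""(?<url>[^""]*)""[^>]*>(?<name>.*?)</a>";
var matches = Regex.Matches(input, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
foreach (Match match in matches)
{
    var groups = match.Groups;
    Console.WriteLine("{0}, {1}", groups["url"].Value, groups["name"].Value);
}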
In the end, I wrote a separate .NET console program that combines HtmlAgilityPack (HAP) with queries against SQL Server. In the program I used a naive regular expression to isolate the fragments, then used HAP to retrieve the href values and anchor texts, and finally composed the output from those, stripping out any characters other than text, numbers, and some punctuation.