
Regular expression to strip everything between anchor tags

I am trying to strip out all the links, including the text between the anchor tags, from an HTML string, as below:

 string LINK_TAG_PATTERN = "/<a\b[^>]*>(.*?)<\\/a>";

 htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, string.Empty);

This is not working. Does anyone have an idea why?

Thanks a lot,

Edit: the regex was from this link: Extract text and links from HTML using Regular Expressions


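Problems in your string: an unnecessary slash at the beginning (that's Perl delimiter syntax), an unescaped backslash in \b (in a regular C# string literal it becomes a backspace character rather than a word boundary), and an unnecessarily escaped backslash (\\) before the slash in the closing tag.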
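To make the \b problem concrete, here is a minimal sketch comparing a regular and a verbatim string literal. The class and field names (EscapeDemo, Regular, Verbatim) are made up for illustration and are not from the question.

```csharp
using System;

public static class EscapeDemo
{
    // Regular literal: the C# compiler turns \b into a backspace (U+0008).
    public const string Regular = "<a\b";

    // Verbatim literal: the backslash and the 'b' both survive,
    // so the regex engine sees \b and treats it as a word boundary.
    public const string Verbatim = @"<a\b";

    public static void Main()
    {
        Console.WriteLine(Regular.Length);   // 3
        Console.WriteLine(Verbatim.Length);  // 4
        Console.WriteLine((int)Regular[2]);  // 8, the backspace character
    }
}
```

With the regular literal, the pattern would only match an anchor tag that has a literal backspace character right after the a, which real HTML never contains.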

So, if it has to be a regex (keeping in mind the warnings that other people have already linked to), try:

string LINK_TAG_PATTERN = @"<a\b[^>]*>(.*?)</a>";
htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, string.Empty, RegexOptions.IgnoreCase);

The \b is necessary to prevent other tags whose names start with a (such as <abbr> or <address>) from matching.
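A small sketch of the corrected pattern in use, showing that <abbr> is left alone. The class and method names (LinkStripper, StripLinks) are hypothetical. RegexOptions.Singleline is my own addition, on the assumption that link text may span several lines; the answer above uses only IgnoreCase.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkStripper
{
    // Verbatim string so that \b reaches the regex engine as a word boundary.
    private const string LinkTagPattern = @"<a\b[^>]*>(.*?)</a>";

    public static string StripLinks(string html)
    {
        // IgnoreCase also matches <A ...>...</A>.
        // Singleline (an assumption, not in the answer) lets . match newlines.
        return Regex.Replace(html, LinkTagPattern, string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        string html = "<p>See <a href=\"x.html\">this page</a> and "
                    + "<abbr title=\"t\">HTML</abbr>.</p>";

        // The anchor and its text are removed; the <abbr> element stays.
        Console.WriteLine(StripLinks(html));
    }
}
```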


Use an HTML Parser and not Regular Expressions to parse HTML.

HTML Agility Pack
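A rough sketch of what the parser approach could look like. It assumes the third-party HtmlAgilityPack NuGet package is referenced; the class and method names (ParserLinkStripper, StripLinks) are made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class ParserLinkStripper
{
    public static string StripLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a");
        if (anchors != null)
        {
            // Copy to a list so removing nodes does not disturb the enumeration.
            foreach (var anchor in anchors.ToList())
            {
                // Removes the <a> element together with everything inside it.
                anchor.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        Console.WriteLine(StripLinks("<p>See <a href=\"x.html\">this page</a>.</p>"));
    }
}
```

Unlike the regex, this works on the parsed tree, so unusual attribute quoting or line breaks inside the tag should not matter.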


I recommend Expresso to troubleshoot regular expressions. You can find a library of regular expressions here.

You might consider using JavaScript to walk the DOM tree for your replacements instead of a regex.


string LINK_TAG_PATTERN = @"(<a\s+[^>]*>)(.*?)(</a>)";

htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, "$1$3", RegexOptions.IgnoreCase);
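Note that this variant behaves differently from the others: it keeps the anchor tags and drops only the text between them. A short sketch of that behavior, using hypothetical names (AnchorTextStripper, StripAnchorText):

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorTextStripper
{
    private const string LinkTagPattern = @"(<a\s+[^>]*>)(.*?)(</a>)";

    public static string StripAnchorText(string html)
    {
        // "$1$3" writes back group 1 (opening tag) and group 3 (closing tag),
        // dropping group 2 (the text between them).
        return Regex.Replace(html, LinkTagPattern, "$1$3", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // Prints: Go <a href="x.html"></a> now
        Console.WriteLine(StripAnchorText("Go <a href=\"x.html\">here</a> now"));

        // \s+ requires whitespace after the tag name, so a bare <a> is not matched.
        // Prints: Go <a>here</a> now
        Console.WriteLine(StripAnchorText("Go <a>here</a> now"));
    }
}
```

To remove the tags as well, the replacement would be string.Empty instead of "$1$3".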


Conceptually, this only strips links of a very special kind. For example, your regex does not match an upper-case A, which is perfectly valid in HTML: <A ...>bla</A>. The replacement wouldn't work for JavaScript links either. Is your code relevant to user security?
