Regular expression to strip everything between anchor tags
I am trying to strip out all the links, including the text between the anchor tags, from an HTML string, as below:
string LINK_TAG_PATTERN = "/<a\b[^>]*>(.*?)<\\/a>";
htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, string.Empty);
This is not working. Does anyone have an idea why?
Thanks a lot,
Edit: the regex was from this link: Extract text and links from HTML using Regular Expressions
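Problems in your string: an unnecessary slash at the beginning (that's Perl syntax), an unescaped backslash (\b), which C# reads as a backspace character in a regular string literal, and an unnecessary escaped backslash (\\) before the closing slash.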
So, if it has to be a regex (and keeping in mind the warnings about parsing HTML with regular expressions that others have linked to), try
string LINK_TAG_PATTERN = @"<a\b[^>]*>(.*?)</a>";
htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, string.Empty, RegexOptions.IgnoreCase);
The \b is necessary to prevent other tags that start with a (such as <abbr> or <address>) from matching.
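To make the effect of the word boundary and the IgnoreCase option concrete, here is a minimal sketch. The class and method names (AnchorStripper, StripAnchors) and the sample HTML are hypothetical, and RegexOptions.Singleline is an addition that is not part of the answer above; it is only there so that a link whose text spans several lines is still matched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorStripper
{
    // Verbatim string (@"...") so that \b reaches the regex engine as a word
    // boundary instead of being turned into a backspace character by C#.
    private const string LinkTagPattern = @"<a\b[^>]*>(.*?)</a>";

    public static string StripAnchors(string html)
    {
        // IgnoreCase also catches <A ...>...</A>.
        // Singleline (an assumption, not in the original answer) lets '.'
        // match newlines inside the link text.
        return Regex.Replace(html, LinkTagPattern, string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        string html = "<p>See <A HREF=\"x.html\">this page</A> and <abbr>HTML</abbr>.</p>";
        // The anchor and its text are removed; <abbr> is left alone because
        // there is no word boundary between "a" and "b".
        Console.WriteLine(StripAnchors(html));
        // prints: <p>See  and <abbr>HTML</abbr>.</p>
    }
}
```

Note that nested or malformed markup can still defeat this pattern, which is why the parser-based answers are worth considering.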
Use an HTML Parser and not Regular Expressions to parse HTML.
HTML Agility Pack
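As a rough sketch of what the parser-based approach could look like, the following assumes the HtmlAgilityPack NuGet package is installed. The class and method names (AnchorRemover, RemoveAnchors) and the sample HTML are hypothetical and only serve as illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

public static class AnchorRemover
{
    public static string RemoveAnchors(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a");
        if (anchors != null)
        {
            // Copy to a list first so that removing nodes cannot disturb the enumeration.
            foreach (var anchor in anchors.ToList())
            {
                anchor.Remove(); // removes the element together with its inner text
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        string html = "<p>See <a href=\"x.html\">this page</a> for details.</p>";
        Console.WriteLine(RemoveAnchors(html));
    }
}
```

Unlike the regex, this works on the parsed tree, so attribute quoting, line breaks inside the tag, and nested markup inside the link do not need special handling.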
I recommend Expresso to troubleshoot regular expressions. You can find a library of regular expressions here.
You might consider using JavaScript to walk the DOM tree for your replacements instead of a regex.
string LINK_TAG_PATTERN = @"(<a\s+[^>]*>)(.*?)(</a>)";
htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, "$1$3", RegexOptions.IgnoreCase);
Conceptually, this only strips links of a very special kind (e.g. your regex does not match an upper-case A, which is perfectly valid in HTML: <A ...>bla</A>). The replacement wouldn't work for JavaScript links either. Is your code relevant to user security?