Remove non-HTML special tags from text

2023-01-10 12:59 Asked by:

I'm having trouble matching non-HTML tags in text, mainly because the tags start with &lt; and end with &gt; rather than < and >. So instead of <ref>xxx</ref> I have &lt;ref&gt;xxx&lt;/ref&gt;. What I need to do is remove all such tags, including their content.

The problem is that some tags may have attributes. I found a nice answer here, but there is still a problem.

Assuming I have a tag like <gallery src=sss>xxx</gallery>, this expression suits perfectly:

@"<(?<Tag>\w+)[^>)]*>.*?</\k<Tag>>"

In reality all special characters are escaped, so the tag looks like &lt;gallery src=sss&gt;xxx&lt;/gallery&gt;. My problem is matching this kind of tag. So far I have this expression: @"\&lt\;(?<Tag>\w+)[^\&)]*\&gt\;.*?\&lt\;/\k<Tag>\&gt\;". It matches tags with no attributes, but not the one mentioned above. What am I missing?

The second issue is matching {| |} tags, because they can be nested. Can you help me with this as well? This expression doesn't do the job: @"\{\|(?:[^\|\}]|\{\|[^\|\}]*\|\})*\|\}"

Edit: To clarify the second issue: I have to match strings that start with the opening tag {|, continue with some text, and end with the closing tag |}. This structure can be nested, so {| xxx {| yyy |} xxx |} is allowed. Unfortunately I don't know the maximum nesting level, but let's say that 1 should suit most cases.

Edit 2: This expression works for my first issue: @"\&lt\;(?<Tag>\w+).*?\&lt\;/\k<Tag>\&gt\;". I have noticed that it fails if there is a newline between the opening and closing tags.

Edit 3: This does the job for the second issue: @"\{\|(?>(?!\{\||\|\}).|\{\|(?<N>)|\|\}(?<-N>))*(?(N)(?!))\|\}"
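For readers unfamiliar with .NET balancing groups, here is a minimal sketch of how the Edit 3 pattern can be applied. The class name WikiTableStripper and the choice of RegexOptions.Singleline (so that a block may span several lines) are assumptions for illustration, not part of the original question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WikiTableStripper
{
    // Pattern from Edit 3. Each inner "{|" pushes onto group N, each inner "|}" pops it.
    // (?(N)(?!)) fails the match if an inner "{|" is still unclosed at the end.
    private static readonly Regex NestedBlock = new Regex(
        @"\{\|(?>(?!\{\||\|\}).|\{\|(?<N>)|\|\}(?<-N>))*(?(N)(?!))\|\}",
        RegexOptions.Singleline); // assumption: blocks may contain newlines

    // Removes every balanced {| ... |} block, including nested ones.
    public static string Strip(string input)
    {
        return NestedBlock.Replace(input, "");
    }
}

public static class WikiTableDemo
{
    public static void Main()
    {
        // The nested sample from the question; only "a" and "b" survive.
        Console.WriteLine(WikiTableStripper.Strip("a {| xxx {| yyy |} xxx |} b"));
    }
}
```

Because the group stack tracks depth, this is not limited to one nesting level, unlike the hand-unrolled expression from the first attempt.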

So you have HTML-escaped text in which you want to find elements? Why not just unescape it first and then use the code you already have? You can use HttpServerUtility.HtmlDecode() for that.
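A minimal sketch of the decode-first idea, under some assumptions: it uses System.Net.WebUtility.HtmlDecode instead of HttpServerUtility.HtmlDecode so that it runs without a System.Web reference, the class name DecodedTagStripper is made up, and RegexOptions.Singleline is added so tag content may span lines. The pattern is the one from the question for unescaped tags.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class DecodedTagStripper
{
    // The question's pattern for unescaped tags, taken as-is.
    private static readonly Regex Element = new Regex(
        @"<(?<Tag>\w+)[^>)]*>.*?</\k<Tag>>",
        RegexOptions.Singleline);

    public static string Strip(string escaped)
    {
        // Turn &lt; and &gt; back into < and > first.
        // Note: this also decodes every other entity in the text (e.g. &amp;).
        string decoded = WebUtility.HtmlDecode(escaped);
        return Element.Replace(decoded, "");
    }
}

public static class DecodedTagDemo
{
    public static void Main()
    {
        string text = "PLAIN-TEXT&lt;gallery src=sss&gt;xxx&lt;/gallery&gt;PLAIN-TEXT";
        Console.WriteLine(DecodedTagStripper.Strip(text));
    }
}
```

One trade-off to keep in mind: after decoding, escaped pseudo-tags and any real markup in the text can no longer be told apart, so this only fits input where that distinction does not matter.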

Edit: try this then:

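// requires: using System; using System.Text.RegularExpressions;
string text = "PLAIN-TEXT&lt;gallery src=sss&gt;xxx&lt;/gallery&gt;PLAIN-TEXT";
string previous;
do { // repeat until a pass removes nothing, so a leftover "&lt;" cannot loop forever
    previous = text;
    text = Regex.Replace(text, "&lt;\\w+.*?&lt;/\\w+&gt;", "");
} while (text != previous);
Console.WriteLine(text);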

In case it is confusing: the loop is for nested tags. You could handle them with Regex too, but that gets complicated.

This regex should (partially) work:

@"&lt;.+?&gt;(.*?)&lt;/.+?&gt;"

That being said, regex is not an appropriate tool for parsing (X)HTML. Here's a better solution:

Add an identifier after the &lt;, e.g. BOGUS000: YourStr.Replace("&lt;", "&lt;BOGUS000")
Now convert the &lt; and &gt; to < and > using HttpServerUtility.HtmlDecode()
Parse the file using an XML parser
Now you know all elements which have a name starting with your identifier (here BOGUS000) are, well, bogus. They can be removed.
Profit ! :)

I am not sure I understand your second issue.

Add RegexOptions.Singleline to the Regex.Replace() call (yes, I know, it feels backward) to fix tags that span multiple lines not matching.
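As a sketch of that suggestion, the pattern from Edit 2 of the question can be compiled with RegexOptions.Singleline, which lets "." match newlines as well. The class name EscapedTagStripper and the sample input are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapedTagStripper
{
    // Edit 2 pattern without the redundant backslashes; Singleline makes "." match '\n'.
    private static readonly Regex Element = new Regex(
        @"&lt;(?<Tag>\w+).*?&lt;/\k<Tag>&gt;",
        RegexOptions.Singleline);

    public static string Strip(string input)
    {
        return Element.Replace(input, "");
    }
}

public static class EscapedTagDemo
{
    public static void Main()
    {
        // The tag body spans several lines; without Singleline nothing would be removed.
        string text = "A&lt;gallery src=sss&gt;\nxxx\n&lt;/gallery&gt;B";
        Console.WriteLine(EscapedTagStripper.Strip(text));
    }
}
```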

Second issue: how is it not exactly the same problem? The regex is given to you; just substitute the bounding strings and you're done.
