Remove non-HTML special tags from text
I'm having a problem matching non-HTML tags in text, mainly because the tags start with &lt; and end with &gt; rather than < and >. So instead of <ref>xx</ref> I have &lt;ref&gt;xxx&lt;/ref&gt;. What I need to do is remove all such tags, including their content.
The problem is that some tags may have attributes. I found a nice answer here, but there's still a problem.
Assuming I have a tag like <gallery src=sss>xxx</gallery>, this expression works perfectly:
@"<(?<Tag>\w+)[^>)]*>.*?</\k<Tag>>"
In reality all special characters are escaped, so the tag looks like &lt;gallery src=sss&gt;xxx&lt;/gallery&gt;. My problem is matching this kind of tag. So far I have this expression:
@"\&lt\;(?<Tag>\w+)[^\&)]*\&gt\;.*?\&lt\;/\k<Tag>\&gt\;"
It matches tags with no attributes, but not the one mentioned above. What am I missing?
The second issue is matching {| |} tags, because they can be nested. Can you help me with this as well? This expression doesn't do the job: @"\{\|(?:[^\|\}]|\{\|[^\|\}]*\|\})*\|\}"
Edit: To clarify the second issue: I have to match strings that start with an opening {| tag, followed by some text, and end with a |} tag. This structure can be nested, so {| xxx {| yyy |} xxx |} is allowed. Unfortunately I don't know the maximum nesting level, but let's say that 1 should suit most cases.
Edit 2: This expression works for my first issue: @"\&lt\;(?<Tag>\w+).*?\&lt\;/\k<Tag>\&gt\;"
However, I have noticed that it fails if there's a newline between the opening and closing tags.
Edit 3: This does the job for the second issue: @"\{\|(?>(?!\{\||\|\}).|\{\|(?<N>)|\|\}(?<-N>))*(?(N)(?!))\|\}"
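To illustrate how the Edit 3 pattern behaves, here is a small sketch that runs it against the nested sample from the first edit. The class name and the surrounding "a ... b" text are made up for the example; the pattern itself is copied unchanged. The named group N acts as a depth counter: each inner {| pushes onto it, each inner |} pops from it, and (?(N)(?!)) rejects the match if anything is left on the stack.

```csharp
using System;
using System.Text.RegularExpressions;

class NestedTableDemo
{
    static void Main()
    {
        // Pattern from Edit 3: N is pushed on every inner "{|" and popped on every inner "|}".
        string pattern = @"\{\|(?>(?!\{\||\|\}).|\{\|(?<N>)|\|\}(?<-N>))*(?(N)(?!))\|\}";

        // Hypothetical sample text with one level of nesting.
        string text = "a {| xxx {| yyy |} xxx |} b";

        // The whole outer block, including the nested one, is removed in one pass.
        string stripped = Regex.Replace(text, pattern, "");

        Console.WriteLine("[" + stripped + "]"); // prints [a  b]
    }
}
```

Note that "." does not match a newline by default, so a block that spans several lines would presumably also need RegexOptions.Singleline.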
So you have HTML-escaped text in which you want to find elements? Why not just unescape it first and then use the code you already have? You can use HttpServerUtility.HtmlDecode() for that.
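A minimal sketch of the unescape-first idea, assuming the whole text can safely be decoded in one go. It uses System.Net.WebUtility.HtmlDecode instead of HttpServerUtility.HtmlDecode() so that it runs without a reference to System.Web; the class name and sample string are invented for the example, and the pattern is the one from the question.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

class DecodeThenStrip
{
    static void Main()
    {
        // Hypothetical input: the tag is HTML-escaped, as described in the question.
        string escaped = "before &lt;gallery src=sss&gt;xxx&lt;/gallery&gt; after";

        // Step 1: turn &lt; and &gt; back into < and >.
        string decoded = WebUtility.HtmlDecode(escaped);

        // Step 2: reuse the expression that already works on unescaped tags.
        // Singleline lets "." cross line breaks inside a tag's content.
        string pattern = @"<(?<Tag>\w+)[^>)]*>.*?</\k<Tag>>";
        string stripped = Regex.Replace(decoded, pattern, "", RegexOptions.Singleline);

        Console.WriteLine("[" + stripped + "]"); // prints [before  after]
    }
}
```

One caveat with this approach: decoding also converts every other entity in the text (for example &amp;), so if the output has to stay escaped it would need to be encoded again afterwards, for instance with WebUtility.HtmlEncode.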
edit: try this then
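string text = "PLAIN-TEXT&lt;gallery src=sss&gt;xxx&lt;/gallery&gt;PLAIN-TEXT";
string previous;
do
{
    previous = text;
    text = Regex.Replace(text, "&lt;\\w+.*?&lt;/\\w+&gt;", "");
} while (text != previous); // stop once nothing changes, so a stray &lt; cannot loop forever
Console.WriteLine(text);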
In case it is confusing: the loop is there for nested tags. You could handle them with a regex too, but that gets complicated.
This regex should (partially) work:
@"<.+?>(.*?)</.+?>"
That being said, regex is not an appropriate tool for parsing (X)HTML. Here's a better solution:
- Add an identifier after the &lt;, e.g. BOGUS000: YourStr.Replace("&lt;", "&lt;BOGUS000")
- Now convert the &lt; and &gt; to < and > using HttpServerUtility.HtmlDecode()
- Parse the file using an XML parser
- Now you know that all elements whose name starts with your identifier (here BOGUS000) are, well, bogus. They can be removed.
- Profit! :)
I am not sure I understand your second issue.
Add RegexOptions.Singleline to the Regex.Replace() call (yes, I know, it feels backward) to fix the problem of tags that span multiple lines not matching.
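As a quick illustration of the difference, the sketch below runs the Edit 2 expression with and without RegexOptions.Singleline on a tag whose content contains a line break. The sample text and class name are assumptions made for the example, and the backslashes before & and ; are dropped because neither character needs escaping in a .NET regex.

```csharp
using System;
using System.Text.RegularExpressions;

class SinglelineDemo
{
    static void Main()
    {
        // Hypothetical input: an escaped tag whose content spans two lines.
        string text = "a &lt;ref&gt;line one\nline two&lt;/ref&gt; b";
        string pattern = @"&lt;(?<Tag>\w+).*?&lt;/\k<Tag>&gt;";

        // Default options: "." stops at the newline, so nothing is removed.
        string withoutOption = Regex.Replace(text, pattern, "");
        Console.WriteLine(withoutOption == text); // prints True

        // Singleline: "." also matches the newline, so the whole tag is removed.
        string withOption = Regex.Replace(text, pattern, "", RegexOptions.Singleline);
        Console.WriteLine("[" + withOption + "]"); // prints [a  b]
    }
}
```

The name is what makes it feel backward: Singleline treats the input as one long line for ".", whereas Multiline only changes how ^ and $ behave.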
Second issue: how is it not exactly the same problem? The regex is given to you; just substitute the bounding strings and you're done.