
Remove non-HTML special tags from text

I'm having trouble matching non-HTML tags in text, mainly because the tags start with &lt; and end with &gt; rather than < and >. So instead of <ref>xxx</ref> I have &lt;ref&gt;xxx&lt;/ref&gt;. What I need to do is remove all such tags, including their content.

The problem is that some tags may have attributes. I found a nice answer here, but there's still a problem.

Assuming I have a tag like <gallery src=sss>xxx</gallery>, this expression fits perfectly:

@"<(?<Tag>\w+)[^>)]*>.*?</\k<Tag>>"

In reality all special characters are escaped, so the tag looks like &lt;gallery src=sss&gt;xxx&lt;/gallery&gt;. My problem is matching this kind of tag. So far I have this expression: @"\&lt\;(?<Tag>\w+)[^\&)]*\&gt\;.*?\&lt\;/\k<Tag>\&gt\;". It matches tags with no attributes, but not the one mentioned above. What am I missing?

The second issue is matching {| |} tags, because they can be nested. Can you help me with this as well? This expression doesn't do the job: @"\{\|(?:[^\|\}]|\{\|[^\|\}]*\|\})*\|\}"

Edit: To clarify the second issue: I have to match strings that start with the opening tag {|, followed by some text, and end with the closing tag |}. This structure can be nested, so {| xxx {| yyy |} xxx |} is allowed. Unfortunately I don't know the maximum nesting level, but let's say that 1 should suit most cases.


Edit 2: This expression works for my first issue: @"\&lt\;(?<Tag>\w+).*?\&lt\;/\k<Tag>\&gt\;". I have noticed that it fails if there's a newline between the opening and closing tags.

Edit 3: This does the job for the second issue: @"\{\|(?>(?!\{\||\|\}).|\{\|(?<N>)|\|\}(?<-N>))*(?(N)(?!))\|\}"
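For readers who want to try the Edit 3 expression, here is a minimal sketch of how it might be wrapped. The class name WikiTables and the method name Strip are hypothetical, and RegexOptions.Singleline is an added assumption so that "." also matches newlines inside a {| ... |} block; the pattern itself is the one from Edit 3. The group N acts as a depth counter: each inner {| pushes a capture, each |} pops one, and (?(N)(?!)) rejects the match if anything is left on the stack.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WikiTables
{
    // Pattern taken from Edit 3. Singleline is an assumption added here
    // so that blocks spanning several lines are matched as well.
    private static readonly Regex NestedBlock = new Regex(
        @"\{\|(?>(?!\{\||\|\}).|\{\|(?<N>)|\|\}(?<-N>))*(?(N)(?!))\|\}",
        RegexOptions.Singleline);

    // Removes every balanced {| ... |} block, including nested ones.
    public static string Strip(string text)
    {
        return NestedBlock.Replace(text, "");
    }
}
```

Under these assumptions, WikiTables.Strip("a {| xxx {| yyy |} xxx |} b") returns "a  b" (the two spaces that surrounded the block remain).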


So you have HTML-escaped text in which you want to find elements? Why not just unescape it first and then use the code you already have? You can use HttpServerUtility.HtmlDecode() for that.

Edit: try this then:
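A rough sketch of the decode-first idea follows. It swaps in System.Net.WebUtility.HtmlDecode, a static method that does not need an HttpContext, in place of HttpServerUtility.HtmlDecode; the class name DecodeFirst and the method name RemoveTags are hypothetical. Note one side effect of this approach: decoding also unescapes every other entity (such as &amp;) in the text that is kept.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class DecodeFirst
{
    // The question's expression for unescaped tags; Singleline is an
    // added assumption so tag content may span several lines.
    private static readonly Regex Tag = new Regex(
        @"<(?<Tag>\w+)[^>]*>.*?</\k<Tag>>",
        RegexOptions.Singleline);

    public static string RemoveTags(string escaped)
    {
        // Turn &lt;ref&gt; back into <ref> before matching.
        string decoded = WebUtility.HtmlDecode(escaped);
        return Tag.Replace(decoded, "");
    }
}
```

With the sample string from the question, DecodeFirst.RemoveTags("PLAIN-TEXT&lt;gallery src=sss&gt;xxx&lt;/gallery&gt;PLAIN-TEXT") returns "PLAIN-TEXTPLAIN-TEXT".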

string text = "PLAIN-TEXT&lt;gallery src=sss&gt;xxx&lt;/gallery&gt;PLAIN-TEXT";
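string previous;
do
{
    previous = text;  // stop once a pass removes nothing, so a stray &lt; cannot loop forever
    text = Regex.Replace(text, "&lt;\\w+.*?&lt;/\\w+&gt;", "");
} while (text != previous);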
Console.WriteLine(text);

In case it is confusing: the loop is for nested tags. You could handle them with a regex too, but that gets complicated.


This regex should (partially) work:

@"&lt;.+?&gt;(.*?)&lt;/.+?&gt;"

That being said, regex is not an appropriate tool for parsing (X)HTML. Here's a better solution:

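  1. Add an identifier after the &lt;, e.g. BOGUS000: YourStr.Replace("&lt;", "&lt;BOGUS000"). (For closing tags the identifier has to go after the slash, i.e. &lt;/BOGUS000, so those need separate handling.)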
  2. Now convert the &lt; and &gt; to < and > using HttpServerUtility.HtmlDecode()
  3. Parse the file using an XML parser
  4. Now you know all elements which have a name starting with your identifier (here BOGUS000) are, well, bogus. They can be removed.
  5. Profit ! :)

I am not sure I understand your second issue.


Add RegexOptions.Singleline to the Regex.Replace() call (yes, I know, it feels backward) to address the issue with a tag spanning multiple lines not matching.

Second issue: how is it not exactly the same problem? The regex is given to you; just substitute the bounding strings and you're done.
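As an illustration of that suggestion, here is a small sketch using the expression from Edit 2 (with the unnecessary backslashes before & and ; dropped). The class name EscapedTags and the method name Remove are hypothetical. With RegexOptions.Singleline, "." matches newline characters too, so the lazy .*? can run across line breaks up to the closing tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapedTags
{
    // Matches &lt;name ...&gt; ... &lt;/name&gt;, where \k<Tag> requires
    // the closing name to equal the opening one.
    private const string Pattern = @"&lt;(?<Tag>\w+).*?&lt;/\k<Tag>&gt;";

    public static string Remove(string text)
    {
        // Singleline lets '.' match '\n', so tags may span several lines.
        return Regex.Replace(text, Pattern, "", RegexOptions.Singleline);
    }
}
```

For example, EscapedTags.Remove("before &lt;gallery src=sss&gt;\nxxx\n&lt;/gallery&gt; after") returns "before  after"; without the Singleline option the same input would be left unchanged.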
