Regular expression to remove <br> from <pre>

I am trying to remove the <br /> tags that appear in between the <pre></pre> tags. My string looks like

string str = "Test<br/><pre><br/>Test<br/></pre><br/>Test<br/>---<br/>Test<br/><pre><br/>Test<br/></pre><br/>Test";

string temp = "`##`";
while (Regex.IsMatch(result, @"\<pre\>(.*?)\<br\>(.*?)\</pre\>", RegexOptions.IgnoreCase))
{
    result = System.Text.RegularExpressions.Regex.Replace(result, @"\<pre\>(.*?)\<br\>(.*?)\</pre\>", "<pre>$1" + temp + "$2</pre>", RegexOptions.IgnoreCase);
}
str = str.Replace(temp, System.Environment.NewLine);

But this replaces every <br> tag between the first <pre> and the last </pre> in the whole text, not just the ones inside each <pre> block. Thus my final outcome is:

str = "Test<br/><pre>\r\nTest\r\n</pre>\r\nTest\r\n---\r\nTest\r\n<pre>\r\nTest\r\n</pre><br/>Test"

I expect my outcome to be

str = "Test<br/><pre>\r\nTest\r\n</pre><br/>Test<br/>---<br/>Test<br/><pre>\r\nTest\r\n</pre><br/>Test"


If you are parsing whole HTML pages, RegEx is not a good choice - see here for a good demonstration of why.

Use an HTML parser such as the HTML Agility Pack for this kind of work. It also works with fragments like the one you posted.


Don't use regex to do it.

"Be lazy, use CPAN and use HTML::Sanitizer." -Jeff Atwood, Parsing Html The Cthulhu Way


string input = "Test<br/><pre><br/>Test<br/></pre><br/>Test<br/>---<br/>Test<br/><pre><br/>Test<br/></pre><br/>Test";
string pattern = @"<pre>(.*)<br/>(([^<][^/][^p][^r][^e][^>])*)</pre>";
while (Regex.IsMatch(input, pattern))
{
    input = Regex.Replace(input, pattern, "<pre>$1\r\n$2</pre>");
}

This will probably work, but you should use the HTML Agility Pack. Note that this pattern only matches <br/>; it will not match <br> or <br /> etc.


OK, so I found the issue with my code. The problem was that Regex.IsMatch was considering just the first occurrence of <pre> and the last occurrence of </pre>, whereas I wanted each <pre>...</pre> pair to be handled individually for the replacements. So I modified my code as

foreach (Match regExp in Regex.Matches(str, @"\<pre\>(.*?)\<br\>(.*?)\</pre\>", RegexOptions.IgnoreCase)) 
{
    matchFound = true;
    str = str.Replace(regExp.Value, regExp.Value.Replace("<br>", temp));
}

and it worked well. Anyway, thanks to all for your replies.
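
For anyone landing here later, the same per-block idea can be sketched with a single Regex.Replace call and a MatchEvaluator. The reason the original while loop over-matches is that a lazy (.*?) is still allowed to run past a </pre> when no <br> is left inside the first block, so the match ends up spanning two blocks. Matching each <pre>...</pre> block first and only then replacing the <br> tags inside it avoids that. This is only a rough sketch under some assumptions: the class and method names (PreBreaks, ReplaceBreaksInPre) are made up for illustration, nested <pre> tags are not handled, and for full HTML pages a real parser is still the safer option.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from the posts above.
public static class PreBreaks
{
    // Matches one <pre>...</pre> block. The lazy .*? stops at the first </pre>,
    // so every block is matched separately. Singleline lets '.' cross newlines.
    private static readonly Regex PreBlock = new Regex(
        @"<pre\b[^>]*>.*?</pre>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches <br>, <br/>, <br /> and similar spellings.
    private static readonly Regex Br = new Regex(
        @"<br\s*/?>",
        RegexOptions.IgnoreCase);

    public static string ReplaceBreaksInPre(string html, string newline)
    {
        // The outer evaluator runs once per <pre> block, so <br> tags outside
        // a block are never touched. The inner evaluator returns the newline
        // literally (no '$' substitution handling).
        return PreBlock.Replace(html, block => Br.Replace(block.Value, br => newline));
    }
}
```

Called as PreBreaks.ReplaceBreaksInPre(str, Environment.NewLine) on the string from the question, this should leave the <br/> tags outside the <pre> blocks alone and turn the ones inside into line breaks, without needing the temporary `##` placeholder.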
