
Remove all '\n' between '<' and '>' in C# with regexp

I need to remove all '\n' characters between '<' and '>' in an HTML file with C#.

My code is below:

Regex.Replace(text, "(<[^<>)]*)\\n+([^><]*>$)", "\1\2");

But it obviously doesn't work. Any suggestions?

Example:

< style="



">

Detailed example:

<td colspan="3" rowspan="2">
      <table cellpadding="0" cellspacing="0" class="a10" cols="13" id="t_5" lang="en-AU">
       <tr id="t_5_FNHR">
        <td class="a26" style="HEIGHT:5.00mm">
         <div class="r11">LAKOTA - PINK PANTHER RETURNS-V</div>
        </td>
        <td class="a27" style="



">
         <div class="r11">5c</div>
        </td>

Another:

<td class="a34" style="



">
             <div class="r11">7,390.62</div>
            </td>
            <td class="a35" style="



">
             <div class="r11">617.81</div>
            </td>
            <td class="a36" style="



">


An easy but brittle way would be to remove every run of line breaks where the next angle bracket is a > (brittle because a literal > in text content or inside an attribute value triggers it too):

Regex.Replace(text, @"[\r\n]+(?=[^<>]*>)", "");

Explanation:

[\r\n]+  # Match one or more CR or LF characters
(?=      # if the following can be matched at the current position:
 [^<>]*  # any number of characters except angle brackets
 >       # and one closing angle bracket
)        # (End of lookahead).

Might be good enough for your case (if it isn't, regex probably is not the right tool anyway).
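
To see what the lookahead does and does not catch, here is a minimal self-contained sketch. The class and method names (LookaheadDemo, StripNewlinesInTags) and the sample strings are made up for illustration; only the pattern comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Hypothetical helper: removes each run of CR/LF characters whose
    // next angle bracket (if any) is '>'.
    public static string StripNewlinesInTags(string html)
    {
        return Regex.Replace(html, @"[\r\n]+(?=[^<>]*>)", "");
    }

    public static void Main()
    {
        // Line breaks inside the tag are removed; those between tags stay,
        // because the next angle bracket after them is '<'.
        string html = "<td class=\"a27\" style=\"\n\n\n\">\n<div class=\"r11\">5c</div>\n</td>";
        Console.WriteLine(StripNewlinesInTags(html));

        // The brittle case: a bare '>' in text content also satisfies the
        // lookahead, so this line break is removed although it is not in a tag.
        string text = "<p>1\n> 0</p>";
        Console.WriteLine(StripNewlinesInTags(text)); // prints <p>1> 0</p>
    }
}
```

The second call shows why this is only safe when the text between tags never contains an unescaped >.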


First create a regex that matches an HTML tag, something like <[^>]+>, and then use a match evaluator to strip the line breaks from each match.

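   // '<', then one or more characters other than '>' (line breaks included), then '>'
   string pattern = @"<[^>]+>";
   Regex r = new Regex(pattern);
   var result = r.Replace(input, new MatchEvaluator(ReplaceNewline));

   // Called once per matched tag; returns the tag with CR and LF removed
   public static string ReplaceNewline(Match m)
   {
      return m.Value.Replace("\r", "").Replace("\n", "");
   }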

http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.matchevaluator.aspx
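
For reference, a compact self-contained variant of the same idea, written with a lambda in place of the named evaluator method. The class name TagNewlineStripper, the method name Strip, and the sample input are assumptions for illustration; the pattern is the one suggested above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagNewlineStripper
{
    // Pattern suggested in the answer: '<', one or more non-'>' characters, '>'.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    // Hypothetical helper: strips CR and LF only inside matched tags.
    public static string Strip(string html)
    {
        // The lambda converts implicitly to a MatchEvaluator delegate.
        return Tag.Replace(html, m => m.Value.Replace("\r", "").Replace("\n", ""));
    }

    public static void Main()
    {
        string html = "<td class=\"a35\" style=\"\n\n\n\">\n<div class=\"r11\">617.81</div>\n</td>";
        string expected = "<td class=\"a35\" style=\"\">\n<div class=\"r11\">617.81</div>\n</td>";
        Console.WriteLine(Strip(html) == expected); // prints True
    }
}
```

Because only text matched as a tag is rewritten, line breaks between tags are left alone. The pattern is still an approximation: a > inside an attribute value ends the match early, and a bare < in text content starts one, so an HTML parser is the safer choice for arbitrary input.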
