Regex for extracting only TR with TDs

2023-01-27 02:33

Good morning

I'm trying to get a table row (TR) that must have one or more table cells (TDs):

Given this string:

<TABLE>
<TR valign="top">
  <TH>First</TH>
  <TH>2nd</TH>
  <TH>3rd</TH>
  <TH>4th</TH>
</TR>
<TR valign="top">
  <TD width="15%">Michael Jackson</TD>
  <TD width="5%">Cramberries</TD>
  <TD width="25%">Pixies</TD>
  <TD width="45%">The Ramones</TD>
</TR>
</TABLE>

I would like to get:

<TR valign="top">
  <TD width="15%">Michael Jackson</TD>
  <TD width="5%">Cramberries</TD>
  <TD width="25%">Pixies</TD>
  <TD width="45%">The Ramones</TD>
</TR>

What would be the best pattern for extracting one or more TRs with nested TDs?

<tr(\s[^>]*)?>.*?<td(\s[^>]*)?>.*?</tr(\s[^>]*)?> should work, but set the case-insensitive flag and whichever flag lets . match line breaks (Singleline in .NET, s in PHP).

But I fully agree with Jan's comment above. Use an HTML parser, which will be far more robust and readable.

This one works for me:

Regex.Matches(sourceHtmlString, @"(?<1><TR[^>]*>\s*<td.*?</tr>)", 
              RegexOptions.Singleline | RegexOptions.IgnoreCase)
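
For readers outside .NET, the same pattern can be tried in Python as a rough sketch. This is only an approximation of the call above: `re.DOTALL` stands in for `RegexOptions.Singleline`, the named group is dropped, and the names `source_html`, `pattern` and `rows` are made up for illustration.

```python
import re

source_html = """<TABLE>
<TR valign="top">
  <TH>First</TH>
  <TH>2nd</TH>
</TR>
<TR valign="top">
  <TD width="15%">Michael Jackson</TD>
  <TD width="5%">Cramberries</TD>
</TR>
</TABLE>"""

# \s* allows only whitespace between the <TR ...> tag and the first <td,
# so a row whose first cell is a <TH> cannot match.
pattern = re.compile(r"<TR[^>]*>\s*<td.*?</tr>", re.DOTALL | re.IGNORECASE)

rows = pattern.findall(source_html)
print(len(rows))
print(rows[0])
```

Note that this relies on a TD following the TR tag directly. A row that starts with a TH or a comment and only then contains TDs would be skipped.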

Where is this running, exactly? If you're running this in the browser, in JavaScript, there are better ways than regular expressions (e.g. jQuery selectors such as tr:has(td), as a random example).

If you're running it in a server-side environment, e.g. PHP, a regular expression can work.

Something like: (]+>.?)

The reason I'm suggesting that, as opposed to anything else: you want to get the entire content, so wrap the entire thing in parentheses. The TR and TD may or may not have attributes such as width, and it never hurts to be sure about such things.

The .*? construction is non-greedy in most regexp engines, so it matches the smallest string that conforms, which should prevent ... being matched. You would still need case insensitivity and a way for . to match line breaks, usually the i and s flags. (I haven't tested this, however.)
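
As a small illustration of the greedy versus non-greedy difference, here is a sketch in Python (chosen only for convenience; the sample strings and variable names are invented). In Python's `re` module, the flag that lets `.` cross line breaks is `re.DOTALL`.

```python
import re

one_line = "<tr><td>a</td></tr><tr><td>b</td></tr>"

# Greedy .* runs to the last </tr>; lazy .*? stops at the first one.
greedy = re.findall(r"<tr>.*</tr>", one_line)
lazy = re.findall(r"<tr>.*?</tr>", one_line)

multi_line = "<tr>\n<td>a</td>\n</tr>"

# Without DOTALL, "." refuses to match "\n", so a row spread over lines is missed.
without_dotall = re.findall(r"<tr>.*?</tr>", multi_line)
with_dotall = re.findall(r"<tr>.*?</tr>", multi_line, re.DOTALL)

print(len(greedy), len(lazy))
print(len(without_dotall), len(with_dotall))
```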

But as Robert points out, on the server side a proper HTML parser would be better; either the DOM or XML extensions should be able to deal with it.

This is not something regular expressions will do. For example, trying to match your text with <tr[^>]*>.*?<td[^>]*>.*?</tr> will match the <th> row and the first <td> row. You should first match rows and then try to search each row for <td>.
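
To make the over-match and the two-step alternative concrete, here is a minimal sketch in Python (an arbitrary choice of language; `html`, `single`, `all_rows` and `td_rows` are illustrative names). It assumes flat tables with no nesting; nested tables would defeat the row pattern.

```python
import re

html = """<TABLE>
<TR valign="top">
  <TH>First</TH>
</TR>
<TR valign="top">
  <TD width="15%">Michael Jackson</TD>
</TR>
</TABLE>"""

flags = re.DOTALL | re.IGNORECASE

# Single pattern: the lazy .*? still runs past the first </TR> to reach a <td>,
# so the match starts at the header row and ends after the data row.
single = re.findall(r"<tr[^>]*>.*?<td[^>]*>.*?</tr>", html, flags)

# Two steps: isolate each row first, then keep the rows that contain a <td> tag.
all_rows = re.findall(r"<tr[^>]*>.*?</tr>", html, flags)
td_rows = [row for row in all_rows if re.search(r"<td[\s>]", row, re.IGNORECASE)]

print(len(single), "<TH>" in single[0])
print(len(all_rows), len(td_rows))
```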

Or, better yet, use an HTML parser. HTML is not a regular language and can't really be parsed by a regular expression.
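
For comparison, a parser-based version might look like the sketch below, which uses `html.parser` from the Python standard library. The class name `TdRowCollector` and its attributes are hypothetical, and the sketch collects the cell texts of each qualifying row rather than the raw row markup. It also assumes tables are not nested.

```python
from html.parser import HTMLParser


class TdRowCollector(HTMLParser):
    """Collects the cell texts of every <tr> that has at least one <td>."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of cell texts per qualifying row
        self._cells = None   # cells of the row being read, None outside a row
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> arrives as "tr".
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._in_td = True
            self._cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr":
            if self._cells:  # rows without any <td> (e.g. header rows) are skipped
                self.rows.append(self._cells)
            self._cells = None
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1] += data


html = """<TABLE>
<TR valign="top">
  <TH>First</TH>
  <TH>2nd</TH>
</TR>
<TR valign="top">
  <TD width="15%">Michael Jackson</TD>
  <TD width="5%">Cramberries</TD>
</TR>
</TABLE>"""

collector = TdRowCollector()
collector.feed(html)
collector.close()
print(collector.rows)
```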
