Regex for extracting only TRs that contain TDs
Good morning
I'm trying to get a table row (TR) that must have one or more table cells (TDs):
Given this string:
<TABLE>
<TR valign="top">
<TH>First</TH>
<TH>2nd</TH>
<TH>3rd</TH>
<TH>4th</TH>
</TR>
<TR valign="top">
<TD width="15%">Michael Jackson</TD>
<TD width="5%">Cramberries</TD>
<TD width="25%">Pixies</TD>
<TD width="45%">The Ramones</TD>
</TR>
</TABLE>
I would like to get:
<TR valign="top">
<TD width="15%">Michael Jackson</TD>
<TD width="5%">Cramberries</TD>
<TD width="25%">Pixies</TD>
<TD width="45%">The Ramones</TD>
</TR>
What would be the best pattern for extracting one or more TRs with nested TDs?
<tr(\s[^>]*)?>.*?<td(\s[^>]*)?>.*?</tr(\s[^>]*)?>
should work, but set the case-insensitive and single-line (dot matches newline) flags, since the cells sit on separate lines.
But I fully agree with Jan's comment above. Use an HTML parser, which will be far more robust and readable.
This one works:
Regex.Matches(sourceHtmlString, @"(?<1><TR[^>]*>\s*<td.*?</tr>)",
RegexOptions.Singleline | RegexOptions.IgnoreCase)
Where is this running, exactly? If you're running this in the browser, in JavaScript, there are better ways than regular expressions (e.g. a jQuery selector such as tr:has(td), as a random example).
If you're running it in a server-side environment, e.g. PHP, regular expressions can work.
Something like: (]+>.?)
The reason I'm suggesting that rather than anything else: you want to get the entire content, so wrap the whole thing in parentheses. The TR and TD may or may not have a width attribute, and it never hurts to allow for such things.
The .*? construction should be non-greedy in most regexp engines, so it matches the smallest string that conforms, which should prevent ... being matched. You would still need the dot-matches-newline and case-insensitive flags, usually s and i, to be set as well. (I haven't tested this, however.)
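To make the greedy versus non-greedy point concrete, here is a minimal sketch in Python (the thread does not name Python, so the language and the tiny sample string are assumptions for illustration). It also shows which flag matters: in Python, re.DOTALL lets "." cross newlines, while re.MULTILINE only changes how ^ and $ behave.

```python
import re

# Two rows spread over several lines, similar in shape to the question's table.
text = "<tr>\na\n</tr>\n<tr>\nb\n</tr>"

# Greedy .* runs to the LAST </tr>, so both rows come back as one match.
greedy = re.findall(r"<tr>.*</tr>", text, re.DOTALL)

# Non-greedy .*? stops at the FIRST </tr>, giving one match per row.
lazy = re.findall(r"<tr>.*?</tr>", text, re.DOTALL)

# Without DOTALL, "." does not cross newlines, so nothing matches at all.
no_flag = re.findall(r"<tr>.*?</tr>", text)

print(len(greedy), len(lazy), len(no_flag))  # prints: 1 2 0
```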
But as robert points out, on the server side a proper HTML parser would be better; either the DOM or XML extensions should be able to deal with it.
This is not something regular expressions will do. For example, trying to match your text with
<tr[^>]*>.*?<td[^>]*>.*?</tr>
will match the <th> row and the first <td> row. You should first match rows and then try to search each row for <td>.
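The two-step idea can be sketched roughly as follows, in Python (an assumption, since the question does not say which language is in use). The names SAMPLE, ROW_RE, CELL_RE and rows_with_cells are made up for this illustration, and the sample is a shortened copy of the table from the question. The sketch assumes no nested tables and no "</tr>" text inside attributes or comments.

```python
import re

# Shortened copy of the table from the question.
SAMPLE = """<TABLE>
<TR valign="top">
<TH>First</TH>
<TH>2nd</TH>
</TR>
<TR valign="top">
<TD width="15%">Michael Jackson</TD>
<TD width="5%">Cramberries</TD>
</TR>
</TABLE>"""

FLAGS = re.IGNORECASE | re.DOTALL

# The single-pattern attempt: it starts at the first <TR> and lazily expands
# until it reaches a <TD>, swallowing the header row on the way.
single = re.search(r"<tr[^>]*>.*?<td[^>]*>.*?</tr>", SAMPLE, FLAGS)

# Step 1: match each row on its own (\b keeps <tr from matching longer tag names).
ROW_RE = re.compile(r"<tr\b[^>]*>.*?</tr\s*>", FLAGS)
# Step 2: check whether a row contains at least one opening <td> tag.
CELL_RE = re.compile(r"<td\b[^>]*>", re.IGNORECASE)

def rows_with_cells(html):
    """Return the markup of every row that has at least one <td>."""
    return [row for row in ROW_RE.findall(html) if CELL_RE.search(row)]

rows = rows_with_cells(SAMPLE)
print("<TH>" in single.group(0))  # prints: True (header row was swallowed)
print(len(rows))                  # prints: 1 (only the <TD> row survives)
```

Splitting the work this way keeps each pattern simple: the row pattern never needs to know what is inside a row, and the cell check never needs to know where a row ends.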
Or, better yet, use an HTML parser. HTML is not a regular language and can't really be parsed by a regular expression.
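As one possible illustration of the parser route, here is a minimal sketch using Python's standard-library html.parser (the language and the class name RowCollector are assumptions, not something from the question). Note that it collects the text of each cell rather than the raw row markup, and it assumes tables are not nested.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collects the cell texts of every <tr> that contains at least one <td>."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of cell texts per qualifying row
        self._cells = None   # cells of the row being read, None outside a row
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <TR> and <tr> look the same here.
        if tag == "tr":
            self._cells = []
            self._in_td = False
        elif tag == "td" and self._cells is not None:
            self._in_td = True
            self._cells.append("")

    def handle_data(self, data):
        # Only text inside a <td> is kept; <th> text and whitespace are ignored.
        if self._in_td and self._cells:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr":
            if self._cells:  # rows without any <td> are skipped
                self.rows.append(self._cells)
            self._cells = None
            self._in_td = False

# Shortened copy of the table from the question.
SAMPLE = """<TABLE>
<TR valign="top"><TH>First</TH><TH>2nd</TH></TR>
<TR valign="top">
<TD width="15%">Michael Jackson</TD>
<TD width="5%">Cramberries</TD>
</TR>
</TABLE>"""

collector = RowCollector()
collector.feed(SAMPLE)
print(collector.rows)  # prints: [['Michael Jackson', 'Cramberries']]
```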