C# Regex parsing HTML [duplicate]
<tr><TD><FONT size="2">My Value 1</FONT></TD></tr>
I need to extract "My Value 1" from the HTML above using C#. Please help.
Because real-world HTML is very unpredictable, I would recommend using an HTML parsing library. Which programming language do you use? In .NET I have used HTML Agility Pack with great success. In Java, HTML Parser might be handy (though I have not worked with it yet).
You cannot properly parse HTML with regular expressions, because regexps can't handle the arbitrary nesting that HTML allows. To do it properly, use a real parser. For the one line you show, a regexp will work, but you can't count on that line staying identical, so in general you should use SAX/DOM for the task.
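To make the nesting problem concrete, here is a small JavaScript sketch. The sample markup and the helper name `firstDivContents` are made up for illustration and are not from the question. A lazy pattern stops at the first closing tag it finds, so with nested elements of the same name it pairs the outer opening tag with the inner closing tag:

```javascript
// Hypothetical helper: returns what a naive regex thinks is inside the first <div>.
function firstDivContents(html) {
    const match = /<div>(.*?)<\/div>/i.exec(html);
    return match ? match[1] : null;
}

// Flat markup: the regex gets it right.
console.log(firstDivContents('<div>flat</div>'));
// logs: flat

// Nested markup: the lazy match ends at the inner </div>,
// so the captured "contents" are truncated and unbalanced.
console.log(firstDivContents('<div>outer <div>inner</div> tail</div>'));
// logs: outer <div>inner
```

Switching to a greedy `.*` happens to fix this particular input, but it then captures too much when two sibling elements sit next to each other, which is why a parser is the safer choice in general.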
I think parsing HTML with regexes is not a wise idea, as highlighted by spa. A classic previous answer to a similar question is "RegEx match open tags except XHTML self-contained tags".
In C#:
using System.Text.RegularExpressions;
string input = "<tr><TD><FONT size=\"2\">My Value 1</FONT></TD></tr>";
string pattern = @"<[^>]*?>"; // matches any single tag, opening or closing
string output = Regex.Replace(input, pattern, ""); // "My Value 1"
This simply removes all HTML tags.
function stripTags(markup) {
    return markup.replace(/\s*<[^>]*?>\s*/gim, '');
}
This assumes all you really want is the inner text represented by "My Value 1" above.
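A quick check of how the function above behaves may help before relying on it. The sketch repeats `stripTags` so that it runs on its own; the second and third inputs are invented for illustration. Note that the surrounding `\s*` also swallows whitespace next to tags, and that a `>` inside an attribute value is mistaken for the end of the tag:

```javascript
// Same function as in the answer, repeated so this snippet is self-contained.
function stripTags(markup) {
    return markup.replace(/\s*<[^>]*?>\s*/gim, '');
}

// The sample row from the question.
console.log(stripTags('<tr><TD><FONT size="2">My Value 1</FONT></TD></tr>'));
// logs: My Value 1

// Whitespace next to a tag is removed too, so adjacent words run together.
console.log(stripTags('<b>Hello</b> <i>world</i>'));
// logs: Helloworld

// A ">" inside an attribute value is taken as the end of the tag.
console.log(stripTags('<a title="a > b">link</a>'));
// logs: b">link
```

So this is fine for simple, predictable fragments like the one in the question, but probably not for arbitrary markup.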
Try:
/<tr>\s*<td>\s*<font.*?>(.*?)<\/font>\s*<\/td>\s*<\/tr>/i
Used in PHP:
<?php
if (preg_match('/<tr>\s*<td>\s*<font.*?>(.*?)<\/font>\s*<\/td>\s*<\/tr>/i',
               '<tr><TD><FONT size="2">My Value 1</FONT></TD></tr>', $matches))
    echo $matches[1]; // prints My Value 1
?>
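If you are using PHP, you can split on </FONT>:
$string = '<tr><TD><FONT size="2">My Value 1</FONT></TD></tr>';
$s = explode('</FONT>', $string); // case-sensitive: only matches upper-case </FONT>
foreach ($s as $v) {
    // keep only the chunks that contain an opening <FONT ...> tag
    if (strpos($v, "<FONT") !== FALSE) {
        $t = explode(">", $v);
        print end($t) . "\n"; // text after the last ">" in the chunk
    }
}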
Output:
$ php test.php
My Value 1
In Perl I would use
my $string = '<tr><TD><FONT size="2">My Value 1</FONT></TD></tr>';
$string =~ m/(<.*?>)*([^<]*)(<.*?>)*/;
print $2;
to get the desired result. The last part is not strictly necessary;
(<.*?>)*([^<]*)
will work as well.
If you want to get the contents within the tags, I think the following regexp is enough:
^<.*>([^<>]+)<.*>$
It only works if there actually is some text between the tags; otherwise it will not match.
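As a rough check of that behavior, the JavaScript sketch below runs the pattern against the sample row and against a row with no text. The helper name `textBetweenTags` and the extra inputs are illustrative assumptions. Because the leading `.*` is greedy, the pattern keeps only the last run of text when there are several:

```javascript
// Hypothetical helper wrapping the pattern from the answer.
function textBetweenTags(html) {
    const match = /^<.*>([^<>]+)<.*>$/.exec(html);
    return match ? match[1] : null;
}

// The sample row from the question.
console.log(textBetweenTags('<tr><TD><FONT size="2">My Value 1</FONT></TD></tr>'));
// logs: My Value 1

// No text between the tags: the pattern does not match at all.
console.log(textBetweenTags('<tr><td></td></tr>'));
// logs: null

// Several text runs: the greedy leading .* leaves only the last one.
console.log(textBetweenTags('<td>first</td><td>second</td>'));
// logs: second
```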