
C# Regex parsing HTML [duplicate]

This question already has answers here: How do I remove all HTML tags from a string without knowing which tags are in it? (5 answers) Closed 2 years ago.

<tr><TD><FONT size="2">My Value 1</FONT></TD></tr>

I need to extract "My Value 1" from this markup. Please help. I am using C#.


Because HTML is very "unpredictable", I would recommend using an HTML parsing library. Which programming language do you use? In .NET I have used HTML Agility Pack with great success. In Java, HTML Parser might be handy (though I have not worked with it yet).


You cannot properly parse HTML with regular expressions, because regexps can't handle the nesting that HTML allows. To do it properly, use a real parser. For the one line you show, a regexp will work, but you can't count on that line staying identical, so in general you should use SAX/DOM for the task.
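To make the nesting problem concrete, here is a minimal JavaScript sketch. The helper name `innerOfFirstDiv` and the sample markup are made up for illustration and do not come from the question. A lazy pattern stops at the first closing tag it sees, so nested content gets cut short.

```javascript
// Hypothetical helper: try to grab the content of the first <div> with a regex.
function innerOfFirstDiv(markup) {
  const m = /<div>(.*?)<\/div>/i.exec(markup);
  return m ? m[1] : null;
}

// Flat markup: the regex returns the whole content.
console.log(innerOfFirstDiv('<div>hello</div>'));          // hello

// Nested markup: the lazy match ends at the FIRST </div>,
// so the outer element's content is truncated.
console.log(innerOfFirstDiv('<div>a<div>b</div>c</div>')); // a<div>b
```

Switching to a greedy `.*` does not fix this. It just moves the failure to inputs with two sibling elements, where the match runs on to the last closing tag.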


I think parsing HTML with regexes is not a wise idea, as spa pointed out. A classic earlier answer to a similar question is "RegEx match open tags except XHTML self-contained tags".


In C#:

using System.Text.RegularExpressions;

string input = "<tr><TD><FONT size=\"2\">My Value 1</FONT></TD></tr>";
string pattern = @"<[^>]*?>"; // any tag: "<", then everything up to the next ">"
string output = Regex.Replace(input, pattern, ""); // My Value 1

This simply removes all HTML tags.


function stripTags(markup){
  return markup.replace(/\s*<[^>]*?>\s*/gim,'');
}

This assumes all you really want is the inner text represented by "My Value 1" above.
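As a rough check of where tag stripping holds up and where it does not, the sketch below runs the same pattern over a few inputs. Only the first sample comes from the question. The other three are invented edge cases.

```javascript
// Same pattern as the stripTags answer above, repeated so this sketch runs on its own.
function stripTags(markup) {
  return markup.replace(/\s*<[^>]*?>\s*/gim, '');
}

const samples = [
  '<tr><TD><FONT size="2">My Value 1</FONT></TD></tr>', // the line from the question
  'Hello <b>world</b>',        // whitespace next to a tag is swallowed too
  '<a title="a > b">text</a>', // a ">" inside an attribute ends the match early
  '<p>Fish &amp; chips</p>',   // entities are not decoded
];

for (const s of samples) {
  console.log(JSON.stringify(stripTags(s)));
}
```

This should print "My Value 1", "Helloworld", "b\">text" and "Fish &amp; chips". The approach is fine for the single line in the question, but the last three cases are the kind of input a real parser handles and this pattern does not.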


Try:

/<tr>\s*<td>\s*<font.*?>(.*?)<\/font>\s*<\/td>\s*<\/tr>/i

Used in PHP:

<?php

if(preg_match('/<tr>\s*<td>\s*<font.*?>(.*?)<\/font>\s*<\/td>\s*<\/tr>/i',
              '<tr><TD><FONT size="2">My Value 1</FONT></TD></tr>',$matches))
        echo $matches[1]; // prints My Value 1
?>


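If you are using PHP, split on </FONT>:

$string = '<tr><TD><FONT size="2">My Value 1</FONT></TD></tr>';
$s = explode('</FONT>', $string);
foreach ($s as $v) {
    // Only pieces that contain an opening <FONT tag hold a value.
    if (strpos($v, "<FONT") !== FALSE) {
        // The value is whatever follows the last ">" in the piece.
        $t = explode(">", $v);
        print end($t) . "\n";
    }
}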

output

$ php test.php
My Value 1


In Perl I would use

my $string='<tr><TD><FONT size="2">My Value 1</FONT></TD></tr>';
$string =~ m/(<.*?>)*([^<]*)(<.*?>)*/;
print $2;

to get the desired result. The last part is not strictly necessary,

(<.*?>)*([^<]*)

will work as well


If you want to get the contents within the tags I think the following Regexp is enough:

^<.*>([^<>]+)<.*>$

It will only work if there really is any data between the tags somewhere, otherwise it will give a no-match.
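A small JavaScript sketch of how this pattern behaves, including the no-match case. The function name `textBetweenTags` is hypothetical, and the second and third inputs are invented examples.

```javascript
// The anchored pattern from the answer above; group 1 is the captured text.
function textBetweenTags(markup) {
  const m = /^<.*>([^<>]+)<.*>$/.exec(markup);
  return m ? m[1] : null;
}

// The line from the question.
console.log(textBetweenTags('<tr><TD><FONT size="2">My Value 1</FONT></TD></tr>')); // My Value 1

// No text between any pair of tags: no match at all.
console.log(textBetweenTags('<tr><td></td></tr>')); // null

// Several text runs: the leading greedy .* pushes the capture to the LAST one.
console.log(textBetweenTags('<b>one</b><i>two</i>')); // two
```

The third case is worth noting. If the markup ever gains a second cell, the pattern quietly returns a different value rather than failing.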
