Regex HTML help

Hey all, I need some help working out a regex that extracts the values inside HTML tags like these:

<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span> 

I only need 1993, R, 2.8 and 94% from that HTML above.

Any help would be great, as I don't have much experience writing regular expressions.


Don't use a regular expression to parse HTML. Use an HTML parser instead. A good one for .NET is the HTML Agility Pack, which the examples here use.


If you already have the HTML in a string:

string html = @"
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span>
";

Or you can load a page directly from the web (this saves you about five lines of stream and request code):

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.rottentomatoes.com/m/source_code/");

To parse the string with the HTML Agility Pack:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span");

Now you can iterate over them, or simply get the text of each node:

// Requires "using System.Linq;". Note that SelectNodes returns null when nothing matches.
IEnumerable<string> texts = spans.Select(span => span.InnerText).ToList();
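
As a minimal sketch of the "iterate over them" option, the loop below prints each span's class attribute next to its text. It assumes the HtmlAgilityPack NuGet package is referenced; the `Program` class and the shortened two-span sample are only illustrative.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Shortened version of the sample markup from the question.
        string html = @"
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (spans == null)
        {
            Console.WriteLine("No <span> elements found.");
            return;
        }

        foreach (HtmlNode span in spans)
        {
            // GetAttributeValue falls back to the second argument if the attribute is missing.
            string cssClass = span.GetAttributeValue("class", "");
            Console.WriteLine(cssClass + ": " + span.InnerText.Trim());
        }
    }
}
```

The null check matters mostly for pages loaded from the web, where the markup may change and the expected elements may disappear.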

Alternatively, you can search for the node you're after:

HtmlNode nodeReleaseYear = doc.DocumentNode
                              .SelectSingleNode("//span[@class='releaseYear']");
string year = nodeReleaseYear.InnerText;
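
Putting the pieces together, here is a rough sketch that pulls all four values from the question's markup. It assumes the HtmlAgilityPack NuGet package is referenced. One detail to watch: `@class='...'` compares against the whole attribute string, so the Rotten Tomatoes span, which carries two classes, is matched with `contains()` instead. The `Program` class and the `xpaths` array are illustrative names, not part of the library.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string html = @"
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span>
";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // One XPath query per value we want to extract.
        string[] xpaths =
        {
            "//span[@class='releaseYear']",
            "//span[@class='mpaa']",
            "//span[@class='average-rating']",
            // The class attribute here is "rt-fresh-small rt-fresh",
            // so an exact comparison would not match.
            "//span[contains(@class, 'rt-fresh')]"
        };

        foreach (string xpath in xpaths)
        {
            // SelectSingleNode returns null when no node matches the query.
            HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
            Console.WriteLine(node != null ? node.InnerText.Trim() : "(not found)");
        }
    }
}
```

Note that `contains()` is a plain substring test, so it would also match any other class name that happens to include `rt-fresh`; that is fine for this markup but worth tightening on a real page.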