Regex HTML help
Hey all, I need some help figuring out a RegEx pattern
for extracting the values inside the tags of HTML markup like this:
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span>
I only need 1993, R, 2.8 and 94% from that HTML above.
Any help would be great, as I don't have much experience writing these expressions.
Don't use a regular expression to parse HTML. Use an HTML parser. A good one for .NET is the HTML Agility Pack.
If you already have the HTML in a string:
string html = @"
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span>
";
Or you can load a page from the web directly, which saves you about five lines of stream and request code:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.rottentomatoes.com/m/source_code/");
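Using the HTML Agility Pack, parse the string and select every span (if you loaded the page with HtmlWeb you already have doc, so only the last line applies):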
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span");
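Now you can iterate over them, or simply get the text of each node (SelectNodes returns null when nothing matches, so check for that first):
// Requires: using System.Linq;
IEnumerable<string> texts = spans.Select(span => span.InnerText).ToList();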
Alternatively, you can search for the node you're after:
HtmlNode nodeReleaseYear = doc.DocumentNode
    .SelectSingleNode("//span[@class='releaseYear']");
string year = nodeReleaseYear.InnerText;
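Putting the pieces together for the four values in the question, a minimal sketch could look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the TextOrNull helper is a hypothetical name introduced here for illustration, not part of the library. Matching the last span on its title attribute is also just one option, chosen because that span carries two class names and so has no single class to test against.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

class Program
{
    // Hypothetical helper (not part of the library): returns the trimmed text
    // of the first node matching the XPath, or null when nothing matches.
    static string TextOrNull(HtmlDocument doc, string xpath)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText.Trim();
    }

    static void Main()
    {
        // Same markup as in the question, as a verbatim string.
        string html = @"
<span class=""releaseYear"">1993</span>
<span class=""mpaa"">R</span>
<span class=""average-rating"">2.8</span>
<span class=""rt-fresh-small rt-fresh"" title=""Rotten Tomatoes score"">94%</span>
";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        string year   = TextOrNull(doc, "//span[@class='releaseYear']");
        string mpaa   = TextOrNull(doc, "//span[@class='mpaa']");
        string rating = TextOrNull(doc, "//span[@class='average-rating']");
        // This span has two classes, so select it by its title attribute instead.
        string score  = TextOrNull(doc, "//span[@title='Rotten Tomatoes score']");

        Console.WriteLine("{0}, {1}, {2}, {3}", year, mpaa, rating, score);
    }
}
```

The null check in the helper matters on a real page: if the site changes its markup and a span disappears, you get a null you can handle rather than a NullReferenceException.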