Can someone define a regular expression to match the following HTML code?
I was doing some web scraping and was looking for div elements with particular class names and markup.
My objective is to extract everything within the div that has the class s_specs_box s_box_4.
Could someone please provide a regular expression in .NET terms (i.e., one that can be passed straight into the Regex constructor) to match one such div (given below)?
<div class=\"s_specs_box s_box_4\"><h3>Display</h3><ul><li><strong><span class='s_tooltip_anchor'>Display:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Display</b> - Phone's main display</p></span></strong><ul>\n<li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Type:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Type</b> - Refers to the type of the display. There are four major display types: Greyscale, Black&White, LCD:STN-color and LCD:TFT-color</p></span></strong><ul><li>Color</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Technology:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Technology</b> - Refers to the type of the color displays. There are five major types: LCD, TFT, TFD, STN and OLED</p></span></strong><ul><li>Super AMOLED</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Size:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Size</b> - Refers to the width and the height of the display</p></span></strong><ul><li><span title='Big display' class=\"s_display_rating s_size_1 s_mr_5\"><span></span></span>480 x 800 pixels</li></ul>\n</li><li class='clear clearfix'><strong>Physical Size:</strong><ul><li>4.00 inches</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Colors:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Colors</b> - Shows the number of colors that the display supports</p></span></strong><ul><li>16 777 216</li></ul>\n</li><li class='clear clearfix'><strong>Touch Screen:</strong><ul>\n<li class='clear clearfix'><strong>Type:</strong><ul><li>Capacitive</li></ul>\n</li>\n</ul></li><li class='clear clearfix'><strong>Multi-touch:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Proximity Sensor:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Light sensor:</strong><ul><li>Yes</li></ul>\n</li>\n</ul></li></ul>\n</div>
Thanks in advance,
Vijay
You cannot reliably parse HTML using regular expressions.
Instead, you should use an HTML parser, such as the HTML Agility Pack in C# or jQuery in JavaScript.
For example:
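// Assumes the HtmlAgilityPack package and System.Linq; pageSource (a hypothetical name) holds the downloaded HTML.
var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(pageSource);
var html = document.DocumentNode.Descendants("div")
    .First(div => div.GetAttributeValue("class", "") == "s_specs_box s_box_4")
    .InnerHtml;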
OK, if nobody else wants to link this outright for a better description, I will (although @SLaks really helped you out better than this could):
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
This works for your provided sample data:
// Requires: using System; using System.Text.RegularExpressions;
string subject = "<div class=\"s_specs_box s_box_4\"><h3>Display</h3><ul><li><strong><span class='s_tooltip_anchor'>Display:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Display</b> - Phone's main display</p></span></strong><ul>\n<li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Type:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Type</b> - Refers to the type of the display. There are four major display types: Greyscale, Black&White, LCD:STN-color and LCD:TFT-color</p></span></strong><ul><li>Color</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Technology:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Technology</b> - Refers to the type of the color displays. There are five major types: LCD, TFT, TFD, STN and OLED</p></span></strong><ul><li>Super AMOLED</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Size:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Size</b> - Refers to the width and the height of the display</p></span></strong><ul><li><span title='Big display' class=\"s_display_rating s_size_1 s_mr_5\"><span></span></span>480 x 800 pixels</li></ul>\n</li><li class='clear clearfix'><strong>Physical Size:</strong><ul><li>4.00 inches</li></ul>\n</li><li class='clear clearfix'><strong><span class='s_tooltip_anchor'>Colors:</span>\n<span class='s_tooltip_content'><p class='s_help'><b>Colors</b> - Shows the number of colors that the display supports</p></span></strong><ul><li>16 777 216</li></ul>\n</li><li class='clear clearfix'><strong>Touch Screen:</strong><ul>\n<li class='clear clearfix'><strong>Type:</strong><ul><li>Capacitive</li></ul>\n</li>\n</ul></li><li class='clear clearfix'><strong>Multi-touch:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Proximity Sensor:</strong><ul><li>Yes</li></ul>\n</li><li class='clear clearfix'><strong>Light sensor:</strong><ul><li>Yes</li></ul>\n</li>\n</ul></li></ul>\n</div>";
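// Capture everything between the opening tag of the target div and the first closing </div>.
// Singleline lets '.' match newlines, so the capture can span several lines.
Match match = Regex.Match(subject,
    @"<div[^>]+class\s*=\s*""s_specs_box s_box_4""[^>]*>(.*?)<\s*/\s*div\s*>",
    RegexOptions.Singleline);
Console.WriteLine(match.Success);
// Group 1 holds the inner HTML of the div (an empty string if nothing matched).
string result = match.Groups[1].Value;
Console.WriteLine(result);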
Disclaimer 1: Don't parse HTML with regex. It is particularly bad at matching nested tags of the same type. If, for example, your main <div> had a <div> child, my code would almost certainly not yield the results you desire. This is not the only problem with using regex to parse HTML, just the first of many.
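To make the nested-tag problem concrete, here is a minimal sketch that runs the same pattern against a made-up input in which the target div contains a child div. The input string and the variable names (nested, m, captured) are assumptions for illustration only, and the snippet uses top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input (not from the question): the target <div> has a nested <div> child.
string nested = "<div class=\"s_specs_box s_box_4\"><div>inner</div><p>after</p></div>";

// Same pattern and options as in the answer above.
Match m = Regex.Match(nested,
    @"<div[^>]+class\s*=\s*""s_specs_box s_box_4""[^>]*>(.*?)<\s*/\s*div\s*>",
    RegexOptions.Singleline);

// The lazy (.*?) stops at the first closing </div>, which belongs to the child,
// so the rest of the outer div's content is silently dropped.
string captured = m.Groups[1].Value;
Console.WriteLine(captured); // prints "<div>inner"
```

The match still reports success, which is what makes this failure easy to miss: the result is truncated HTML rather than an error.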
Disclaimer 2: Don't use regex to parse HTML in production code or with unknown, future inputs. It's kind of OK if you are just going to use it to batch-transform a few dozen HTML files on your hard drive and you are going to manually verify the results. It's not OK to trust it for new, unknown inputs.