How can I use the Html Agility Pack to grab everything between <b> and <br>?

I asked about this same project poorly last week and didn't receive any suggestions, so I will try to be clearer. I am trying to work with data from the website www.gtin13.com. For example, if you enter peanut butter into the search, I am trying to grab the description: Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct, the size: 12 oz, the GTIN: 0044000003562, the ean: 00-44000-00356-2, the upc: 044000003562 and the upca: 04400000356.

I have tried using a node collection with SelectNodes("<b>") and all I get are errors. Is it even possible using the Html Agility Pack to grab the data between the <b> and <br> tags and then parse between the /s? With my lack of experience I just can't make any headway on this. The returned page doesn't appear to have what I would consider true nodes. If the Html Agility Pack can't do this, can anyone suggest a better approach? Eventually I would like to send each piece of the data to a SQL table. I hope I have presented this in a way that makes better sense.

The page returns the information in this source format:

<b><a href="/product/nabisco+nutter+butter+sandwich+cookies+chocolate+peanut+butter+4+ct/">Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct</a></b><br />

Size: 12 oz<br />

GTIN/EAN-13: 0044000003562 / 00-44000-00356-2<br />

UPC-A: 044000003562 / 04400000356<br />



Tags:

<a href="/tag/chocolate/">Chocolate</a>, 

<a href="/tag/cookies/">Cookies</a>, 
 ..<br />

<br >


It's not that easy because the original document is quite unstructured (not using a hierarchical layout, but a flat one), but here is how you can extract the main text fields with the Html Agility Pack:

        HtmlDocument doc = new HtmlDocument();
        doc.Load("yourDoc.Htm");

        // Get A nodes that have an HREF attribute
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//b/a[@href]"))
        {
            // This will contain anchor's displayed text
            string title = node.InnerText;
            Console.WriteLine("title=" + title);

            // Get the 1st BR, and then its next sibling of TEXT type.
            HtmlNode sizeNode = node.SelectSingleNode("../following-sibling::br[1]/following-sibling::text()");
            Console.WriteLine(" size=" + sizeNode.InnerText.Trim());

            // Get the 2nd BR, and then its next sibling of TEXT type.
            HtmlNode eanNode = node.SelectSingleNode("../following-sibling::br[2]/following-sibling::text()");
            Console.WriteLine(" ean=" + eanNode.InnerText.Trim());

            // Get the 3rd BR, and then its next sibling of TEXT type.
            HtmlNode upcNode = node.SelectSingleNode("../following-sibling::br[3]/following-sibling::text()");
            Console.WriteLine(" upc=" + upcNode.InnerText.Trim());
        }

This will display:

title=Peanut Delight Peanut Butter & Grape Jelly
 size=Size: 18 oz
 ean=GTIN/EAN-13: 0041498143909 / 00-41498-14390-9
 upc=UPC-A: 041498143909 / 04149814390
title=Nabisco Nutter Butter Sandwich Cookie Bites Peanut Butter
 size=Size: 10 oz
 ean=GTIN/EAN-13: 0044000046118 / 00-44000-04611-8
 upc=UPC-A: 044000046118 / 04400004611
title=Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct
 size=Size: 12 oz
 ean=GTIN/EAN-13: 0044000003562 / 00-44000-00356-2
 upc=UPC-A: 044000003562 / 04400000356

etc...

NOTE: It's not 100% finished, as you'll still have to parse the size, ean and upc strings using standard string manipulation (IndexOf, Substring, etc.) or Regex, but the HTML side of things is done.
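As a minimal sketch of that remaining string-manipulation step, the helper below splits a line such as "GTIN/EAN-13: 0044000003562 / 00-44000-00356-2" into its values. The class and method names (ProductFieldParser, ValueAfterLabel, SplitPair) are hypothetical and not part of the answer above, and the sketch assumes every line has the "Label: value" or "Label: a / b" shape shown in the output.

```csharp
using System;

public static class ProductFieldParser
{
    // Returns the text after the first ':' in a "Label: value" line, trimmed.
    // If there is no ':', the whole line is returned trimmed.
    public static string ValueAfterLabel(string line)
    {
        int colon = line.IndexOf(':');
        return colon < 0 ? line.Trim() : line.Substring(colon + 1).Trim();
    }

    // Splits an "a / b" value on the " / " separator and trims each part.
    public static string[] SplitPair(string value)
    {
        string[] parts = value.Split(new string[] { " / " }, StringSplitOptions.None);
        for (int i = 0; i < parts.Length; i++)
        {
            parts[i] = parts[i].Trim();
        }
        return parts;
    }

    public static void Main()
    {
        // Sample lines copied from the output above.
        string size = ValueAfterLabel("Size: 12 oz");
        string[] ean = SplitPair(ValueAfterLabel("GTIN/EAN-13: 0044000003562 / 00-44000-00356-2"));

        Console.WriteLine("size=" + size);
        Console.WriteLine("gtin=" + ean[0]);
        Console.WriteLine("ean=" + ean[1]);
    }
}
```

The label is cut off at the first colon before splitting, so the "/" inside "GTIN/EAN-13" is never treated as a separator.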


Using HTQL, the query to extract the whole table from the page is:

<div (CLASS='BGC')>1.<div (CLASS='CON')>1.<div (CLASS='SC')>1.<div (ID='post-20')>1.<div (CLASS='PostContent')>1.<b sep>2-0 {
  title=<a>1:tx; 
  size=/'Size:'~'<br />'/;
  gtin=/'GTIN/EAN-13:'~'<br />'/;
  upc=/'UPC-A:'~'<br />'/;
  tags=/'Tags:'~'<br />'/;
}

If you only need to send the results to a SQL database, then I suggest you use the IRobotSoft web scraper.


String.Split takes strings in addition to characters:

String[] Sections = HTML.Split(new string[] {"<b>", "<br />"}, StringSplitOptions.RemoveEmptyEntries);
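To make the effect of that call concrete, here is a small sketch run against a shortened copy of the markup from the question. The SplitDemo class and SplitSections method are hypothetical names, and the trimming step is an addition that is not in the one-liner above.

```csharp
using System;
using System.Collections.Generic;

public static class SplitDemo
{
    // Splits on "<b>" and "<br />", then trims each piece and drops
    // pieces that contain only whitespace (for example bare newlines).
    public static List<string> SplitSections(string html)
    {
        string[] raw = html.Split(new string[] { "<b>", "<br />" }, StringSplitOptions.RemoveEmptyEntries);
        var sections = new List<string>();
        foreach (string piece in raw)
        {
            string trimmed = piece.Trim();
            if (trimmed.Length > 0)
            {
                sections.Add(trimmed);
            }
        }
        return sections;
    }

    public static void Main()
    {
        // Shortened sample based on the page source in the question.
        string html =
            "<b><a href=\"/product/nutter-butter/\">Nutter Butter</a></b><br />\n" +
            "Size: 12 oz<br />\n" +
            "UPC-A: 044000003562 / 04400000356<br />";

        foreach (string section in SplitSections(html))
        {
            Console.WriteLine(section);
        }
    }
}
```

Note that the first section still contains the anchor markup and the closing </b>, so the title needs a further cleanup pass, and the page's other break variant (<br >) is not covered by these two separators.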


I'd suggest using Regex for that, since HtmlAgilityPack is probably going to want properly formed HTML tags: a <b> without a </b> is an improper tag pair, which is probably why you are getting errors. Neither <br > nor <br /> is the end tag for the opening <b> (bold) tag.

If you are not the generator of the HTML, then, as I said, I'd suggest Regex: tell it you want everything between the <b> and the <br > tags. But since the page uses a couple of different break tags (<br > and <br />), you might have problems with that too.
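A rough sketch of the Regex idea follows. It assumes the pattern `<br\s*/?>` is enough to cover both break variants seen on the page, and it strips any remaining tags from the captured text. The BoldToBreakExtractor class and its Extract method are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BoldToBreakExtractor
{
    // Lazily captures everything from "<b>" up to the next "<br>", "<br >" or "<br />".
    private static readonly Regex BoldToBreak = new Regex(
        @"<b>(.*?)<br\s*/?>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Matches any remaining tag, such as <a ...>, </a> or </b>.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match match in BoldToBreak.Matches(html))
        {
            string inner = AnyTag.Replace(match.Groups[1].Value, "");
            results.Add(inner.Trim());
        }
        return results;
    }

    public static void Main()
    {
        // Shortened sample based on the page source in the question.
        string html =
            "<b><a href=\"/product/nutter-butter/\">Nutter Butter</a></b><br />\n" +
            "Size: 12 oz<br >";

        foreach (string title in Extract(html))
        {
            Console.WriteLine(title);
        }
    }
}
```

On this markup the capture yields only the product title, because the size and code lines come after the first break rather than between <b> and <br>.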


I'm guessing SelectNodes uses XPath? If so, you should be doing something like SelectNodes("//b") to get all the b nodes.

Simple examples of xpath are here: http://www.w3schools.com/xpath/xpath_syntax.asp

You could also select the links and only look at the ones whose href starts with '/product/' to group the nodes.
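A minimal sketch of that link-filtering idea, assuming the HtmlAgilityPack NuGet package is referenced and that product links always start with "/product/" as in the sample markup. The ProductLinkDemo class and ProductTitles method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ProductLinkDemo
{
    // Returns the display text of every anchor whose href starts with "/product/".
    public static List<string> ProductTitles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var titles = new List<string>();
        HtmlNodeCollection links =
            doc.DocumentNode.SelectNodes("//a[starts-with(@href, '/product/')]");

        // SelectNodes returns null rather than an empty collection when nothing matches.
        if (links == null)
        {
            return titles;
        }

        foreach (HtmlNode link in links)
        {
            titles.Add(link.InnerText.Trim());
        }
        return titles;
    }

    public static void Main()
    {
        // Shortened sample based on the page source in the question.
        string html =
            "<b><a href=\"/product/nutter-butter/\">Nutter Butter</a></b><br />" +
            "Tags: <a href=\"/tag/cookies/\">Cookies</a><br />";

        foreach (string title in ProductTitles(html))
        {
            Console.WriteLine(title);
        }
    }
}
```

Filtering on the href prefix keeps the tag links (those under "/tag/") out of the result, which a plain "//a" query would include.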
