How can I use html agility to grab everything between <b> and <br>

Posted: 2023-02-21 09:32

I asked about this same project poorly last week and didn't receive any suggestions, so I will try to be clearer. I am trying to work with data from the website www.gtin13.com. For example, if you enter peanut butter into the search, I am trying to grab the description: Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct, the size: 12 oz, the GTIN: 0044000003562, the ean: 00-44000-00356-2, the upc: 044000003562 and the upca: 04400000356. I have tried using a node collection with SelectNodes("") and all I get are errors. Is it even possible to use html agility to grab the data between the <b> and <br> tags and then split the values on the slashes? With my lack of experience I just can't make any headway on this. The returned page doesn't appear to have what I would consider true nodes. If html agility can't do this, can anyone suggest a better approach? Eventually I would like to send each piece of the data to a SQL table. I hope I have presented this in a way that makes better sense.

The page returns the information in this source format:

<b><a href="/product/nabisco+nutter+butter+sandwich+cookies+chocolate+peanut+butter+4+ct/">Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct</a></b><br />

Size: 12 oz<br />

GTIN/EAN-13: 0044000003562 / 00-44000-00356-2<br />

UPC-A: 044000003562 / 04400000356<br />



Tags:

<a href="/tag/chocolate/">Chocolate</a>, 

<a href="/tag/cookies/">Cookies</a>, 
 ..<br />

<br >

It's not that easy because the original document is quite unstructured (not using a hierarchical layout, but a flat one), but here is how you can extract the main text fields with the Html Agility Pack:

        HtmlDocument doc = new HtmlDocument();
        doc.Load("yourDoc.Htm");

        // Get A nodes that have an HREF attribute
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//b/a[@href]"))
        {
            // This will contain anchor's displayed text
            string title = node.InnerText;
            Console.WriteLine("title=" + title);

            // Get the 1st BR, and then its next sibling of TEXT type.
            HtmlNode sizeNode = node.SelectSingleNode("../following-sibling::br[1]/following-sibling::text()");
            Console.WriteLine(" size=" + sizeNode.InnerText.Trim());

            // Get the 2nd BR, and then its next sibling of TEXT type.
            HtmlNode eanNode = node.SelectSingleNode("../following-sibling::br[2]/following-sibling::text()");
            Console.WriteLine(" ean=" + eanNode.InnerText.Trim());

            // Get the 3rd BR, and then its next sibling of TEXT type.
            HtmlNode upcNode = node.SelectSingleNode("../following-sibling::br[3]/following-sibling::text()");
            Console.WriteLine(" upc=" + upcNode.InnerText.Trim());
        }

This will display:

title=Peanut Delight Peanut Butter & Grape Jelly
 size=Size: 18 oz
 ean=GTIN/EAN-13: 0041498143909 / 00-41498-14390-9
 upc=UPC-A: 041498143909 / 04149814390
title=Nabisco Nutter Butter Sandwich Cookie Bites Peanut Butter
 size=Size: 10 oz
 ean=GTIN/EAN-13: 0044000046118 / 00-44000-04611-8
 upc=UPC-A: 044000046118 / 04400004611
title=Nabisco Nutter Butter Sandwich Cookies Chocolate Peanut Butter 4 Ct
 size=Size: 12 oz
 ean=GTIN/EAN-13: 0044000003562 / 00-44000-00356-2
 upc=UPC-A: 044000003562 / 04400000356

etc...

NOTE: It's not 100% finished, as you'll still have to parse the size, ean and upc variables using standard string manipulation (IndexOf, Substring, etc...) or Regex, but the HTML side of things is done.
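The remaining string parsing could look something like the sketch below. It is only an illustration, not part of the answer above: the class name FieldParser and the helpers AfterLabel and SplitPair are made up for this example, and it assumes every line has the "Label: value / value" shape shown in the output.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FieldParser
{
    // Returns the text after a "Label:" prefix, or null if the label is absent.
    // (Hypothetical helper, not part of Html Agility Pack.)
    public static string AfterLabel(string line, string label)
    {
        Match m = Regex.Match(line, "^" + Regex.Escape(label) + @":\s*(.*)$");
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    // Splits "a / b" on the slash and trims each part.
    public static string[] SplitPair(string value)
    {
        string[] parts = value.Split('/');
        for (int i = 0; i < parts.Length; i++)
        {
            parts[i] = parts[i].Trim();
        }
        return parts;
    }

    public static void Main()
    {
        // Sample strings copied from the output shown above.
        string size = AfterLabel("Size: 18 oz", "Size");
        string[] ean = SplitPair(AfterLabel("GTIN/EAN-13: 0041498143909 / 00-41498-14390-9", "GTIN/EAN-13"));
        string[] upc = SplitPair(AfterLabel("UPC-A: 041498143909 / 04149814390", "UPC-A"));

        Console.WriteLine("size=" + size);
        Console.WriteLine("gtin=" + ean[0] + " ean=" + ean[1]);
        Console.WriteLine("upc=" + upc[0] + " upca=" + upc[1]);
    }
}
```

AfterLabel returns null when the label is missing, so real code should check for that before calling SplitPair. Each value is then a plain string that can go into its own SQL column.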

Using HTQL, the query to extract the whole table from the page is:

<div (CLASS='BGC')>1.<div (CLASS='CON')>1.<div (CLASS='SC')>1.<div (ID='post-20')>1.<div (CLASS='PostContent')>1.<b sep>2-0 {
  title=<a>1:tx; 
  size=/'Size:'~'<br />'/;
  gtin=/'GTIN/EAN-13:'~'<br />'/;
  upc=/'UPC-A:'~'<br />'/;
  tags=/'Tags:'~'<br />'/;
}

If you only need to send the results to a SQL database, then I suggest you use the IRobotSoft web scraper.

String.Split takes strings as delimiters in addition to characters:

String[] Sections = HTML.Split(new string[] {"<b>", "<br />"}, StringSplitOptions.RemoveEmptyEntries);
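To show what that call produces, here is a rough sketch run against a shortened copy of the markup from the question. The class name SplitDemo and the method SplitSections are invented for the example. Note that RemoveEmptyEntries does not drop whitespace-only pieces, so the sketch trims and filters them itself.

```csharp
using System;
using System.Collections.Generic;

public static class SplitDemo
{
    // Splits on "<b>" and "<br />" and keeps only the non-blank, trimmed pieces.
    public static List<string> SplitSections(string html)
    {
        string[] raw = html.Split(new string[] { "<b>", "<br />" }, StringSplitOptions.RemoveEmptyEntries);
        var sections = new List<string>();
        foreach (string piece in raw)
        {
            string trimmed = piece.Trim();
            if (trimmed.Length > 0)
            {
                sections.Add(trimmed);
            }
        }
        return sections;
    }

    public static void Main()
    {
        // Shortened sample in the same shape as the page source in the question.
        string html =
            "<b><a href=\"/product/x/\">Nabisco Nutter Butter</a></b><br />\n" +
            "Size: 12 oz<br />\n" +
            "GTIN/EAN-13: 0044000003562 / 00-44000-00356-2<br />\n" +
            "UPC-A: 044000003562 / 04400000356<br />\n";

        foreach (string section in SplitSections(html))
        {
            Console.WriteLine(section);
        }
    }
}
```

The first piece still contains the anchor markup and the closing </b>, so the title needs one more cleanup step. The other pieces come out as plain "Label: value" lines.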

I'd suggest using Regex for that, since HtmlAgilityPack is probably going to want properly formed HTML tags: a <b> without a </b> is an improper tag pair, which is why you are getting errors. <br> and <br /> are not the end tag for the opening <b> (bold) tag.

If you are not the generator of the HTML, then like I said, I'd suggest Regex, and tell it you want everything between the <b> and the <br /> tags. But since you have a couple of different break tags, you might have problems with that too.
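A minimal sketch of that Regex idea follows, assuming the markup looks like the sample in the question. The class name BetweenTags and the method Extract are made up for illustration. The pattern <br\s*/?> is one way to cope with the different break tags, since it accepts <br>, <br/>, <br /> and <br >.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BetweenTags
{
    // Returns the tag-free text found between each <b> and the next <br> variant.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        // Lazy match so each <b> pairs with the nearest following break tag.
        MatchCollection matches = Regex.Matches(
            html,
            @"<b>(.*?)<br\s*/?>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        foreach (Match m in matches)
        {
            // Drop any remaining markup such as <a ...>, </a> and </b>.
            string text = Regex.Replace(m.Groups[1].Value, "<[^>]+>", "").Trim();
            results.Add(text);
        }
        return results;
    }

    public static void Main()
    {
        // Two made-up records in the same shape as the page source.
        string html =
            "<b><a href=\"/product/a/\">Alpha</a></b><br />\nSize: 1 oz<br />\n" +
            "<b><a href=\"/product/b/\">Beta</a></b><br >\nSize: 2 oz<br />\n";

        foreach (string title in Extract(html))
        {
            Console.WriteLine("title=" + title);
        }
    }
}
```

This only picks up the part that starts at <b>, which is the product title here. The Size, GTIN and UPC lines sit between break tags and would need their own pattern.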

I'm guessing SelectNodes uses XPath? So you should be doing something like SelectNodes("//b") to get all the b nodes.

Simple examples of xpath are here: http://www.w3schools.com/xpath/xpath_syntax.asp

You could select the links and only look at ones that have href starting with 'product' to group nodes?
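As a rough sketch of that idea, the XPath starts-with() function can pick out only the product links. This assumes the Html Agility Pack NuGet package is referenced. The class name ProductLinks and the method Titles are invented for the example. In the sample source the href values begin with "/product/" (with a leading slash), so that is the prefix used here.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ProductLinks
{
    // Returns the displayed text of every anchor whose href starts with "/product/".
    public static List<string> Titles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var titles = new List<string>();
        HtmlNodeCollection links =
            doc.DocumentNode.SelectNodes("//a[starts-with(@href, '/product/')]");

        // SelectNodes returns null rather than an empty collection when nothing matches.
        if (links == null)
        {
            return titles;
        }

        foreach (HtmlNode link in links)
        {
            titles.Add(link.InnerText.Trim());
        }
        return titles;
    }

    public static void Main()
    {
        // One product link and one tag link, shaped like the page source.
        string html =
            "<b><a href=\"/product/nabisco+nutter+butter/\">Nabisco Nutter Butter</a></b><br />\n" +
            "Tags:\n<a href=\"/tag/chocolate/\">Chocolate</a>, ..<br />\n";

        foreach (string title in Titles(html))
        {
            Console.WriteLine("title=" + title);
        }
    }
}
```

The tag link is skipped because its href starts with "/tag/", so each returned title marks the start of one product record.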
