Can't get proper information from amazon.com using C#/HtmlAgilityPack

2023-02-28 20:26

I want to get book information (author name, page count, publication year, etc.) from Amazon using HtmlAgilityPack, but the Amazon pages seem to be problematic and I can't reach the relevant fields.

Here is what I've done:

I use Firefox with Firebug + FirePath to obtain the XPath I want, then in my code I have HtmlAgilityPack query the document with that XPath. No luck so far: I still can't reach the "Product Details" section of the amazon.com page.

And this is my XPath (which works only with HtmlAgilityPack):

HtmlAgilityPack.HtmlNodeCollection cnt = doc.DocumentNode.SelectNodes("//*[@class='content']");
int i=1;
foreach (HtmlAgilityPack.HtmlNode content in cnt)
{
    if (i != 3)
    {
        i++;
        continue;
    }
    if (i == 3) // i==3 means I've reached the product details but I can't go any further :(
    {

        s = content.SelectSingleNode("").OuterHtml;

      //  break;
    }

}

How can I access Product Details using an XPath that HtmlAgilityPack understands?

And why is the XPath syntax produced by Firebug + FirePath different from what HtmlAgilityPack accepts?

As @Mystere said, I suggest using the API. But if you are doing this for testing, or simply want to use web scraping to obtain the info (I'm not sure whether Amazon allows it, so you should check before doing this), here is the thing:

Why are you doing this?

s = content.SelectSingleNode("").OuterHtml;

If you want the HTML source of that part of the page, this is what you are looking for:

s = content.OuterHtml;

When you are scraping, I suggest you first identify the part of the page you need and look at what is particular about that block of content.

If you use:

var node = doc.DocumentNode.SelectSingleNode("//td[@class='bucket']/div[@class='content']");

that will give you the Product Details block you are looking for. If you want to get some fields like Paperback, Publisher, ... you can do:

string paperback = node.SelectSingleNode("./ul/li[1]/text()").InnerText;
string publisher = node.SelectSingleNode("./ul/li[2]/text()").InnerText;
string language = node.SelectSingleNode("./ul/li[3]/text()").InnerText;
...
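
A minimal, self-contained sketch of the approach above. The sample HTML is a hypothetical stand-in for the Product Details markup (the real page structure is an assumption and may well have changed), and ProductDetailsSketch and ReadFields are illustrative names, not part of HtmlAgilityPack. It null-checks every lookup, because SelectSingleNode and SelectNodes return null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ProductDetailsSketch
{
    // Hypothetical markup modelled on the structure the XPath above assumes.
    public const string SampleHtml = @"
<table><tr><td class='bucket'>
  <div class='content'>
    <ul>
      <li><b>Paperback:</b> 320 pages</li>
      <li><b>Publisher:</b> Example Press (2010)</li>
      <li><b>Language:</b> English</li>
    </ul>
  </div>
</td></tr></table>";

    // Returns label -> value pairs, one per <li> in the Product Details block.
    public static Dictionary<string, string> ReadFields(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new Dictionary<string, string>();
        HtmlNode block = doc.DocumentNode.SelectSingleNode(
            "//td[@class='bucket']/div[@class='content']");
        if (block == null) return fields; // block not found: layout differs

        HtmlNodeCollection items = block.SelectNodes("./ul/li");
        if (items == null) return fields; // no list items under the block

        foreach (HtmlNode li in items)
        {
            HtmlNode label = li.SelectSingleNode("./b");      // e.g. "Paperback:"
            HtmlNode value = li.SelectSingleNode("./text()"); // e.g. " 320 pages"
            if (label == null || value == null) continue;

            string key = label.InnerText.Trim().TrimEnd(':');
            fields[key] = HtmlEntity.DeEntitize(value.InnerText).Trim();
        }
        return fields;
    }
}
```

Calling ProductDetailsSketch.ReadFields(ProductDetailsSketch.SampleHtml) gives a dictionary keyed by label, which avoids depending on the position of each li as li[1], li[2], ... does.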

If you want to be sure that the XPath you are using will work with HtmlAgilityPack, open the page in Internet Explorer 8 (or 9) and use the Developer Tools (F12) to get the XPath. The thing is that each browser builds its DOM from the HTML in its own way. For example, in Firefox you will always see a <tbody> tag right after a <table>, even when the page source has none, whereas HtmlAgilityPack parses the markup as written and does not add it. That simple detail of a /tbody/ step in your XPath can make your program fail.
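
To see the <tbody> difference concretely, here is a small sketch. The HTML string is a made-up example whose source contains no <tbody>, and TbodySketch and Matches are illustrative names. A browser would add the <tbody> to its DOM, while HtmlAgilityPack works on the raw markup, so a browser-style path should find nothing.

```csharp
using System;
using HtmlAgilityPack;

public static class TbodySketch
{
    // Made-up source markup with no <tbody>, as many pages are written.
    public const string Html =
        "<table id='t'><tr><td class='bucket'>details</td></tr></table>";

    // True when the XPath matches at least one node in the raw markup.
    public static bool Matches(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.DocumentNode.SelectSingleNode(xpath) != null;
    }
}
```

With this markup, TbodySketch.Matches("//table[@id='t']/tbody/tr/td") should be false, while TbodySketch.Matches("//table[@id='t']/tr/td") should be true. Writing //table[@id='t']//td sidesteps the question, since it matches whether or not a <tbody> is present.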

Why don't you just use Amazon's web service API, which is designed to do this?
