Scraping content from webpage

I need to scrape a remote HTML page for images and links. Specifically, I need to find the image that is "most likely" the product image on the page, plus the links that are "near" that image. I currently do this with a JavaScript bookmarklet, which gives me the rendered x/y coordinates of images and links so I can decide whether they are the ones I want.

What I want is the ability to get this information using just a URL, without the bookmarklet. The issue is that if I fetch the HTML on the server with something like HttpWebRequest, I will not have location values, since the page was never rendered in a browser. I need the locations of images and links to help me determine which ones I want.

So how can I get the HTML from a remote site on the server AND use the rendered location values of the DOM elements to help me locate images and links?


As you indicate, doing this purely through inspection of the HTML is a royal pain (especially when CSS gets involved). You could try using the WebBrowser control (which hosts IE), but I wonder if looking for an appropriate, supported API might be better (and less likely to get you blocked). If there isn't an API or similar, you probably shouldn't be doing this. So don't.


You can download the page with HttpWebRequest and then use the HtmlAgilityPack to parse out the data that you need.

You can download it from http://htmlagilitypack.codeplex.com/
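
For illustration, here is a rough sketch of the same download-then-parse idea. It is written in Python with only the standard library, standing in for HttpWebRequest and HtmlAgilityPack, so it is not the HtmlAgilityPack API. The names `ImageLinkCollector`, `collect`, and `fetch` are made up for this example. Note that it only gives you the images and links in document order; it does not give you rendered x/y positions, which still require a real browser engine.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ImageLinkCollector(HTMLParser):
    """Collect <img src> and <a href> values in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Resolve relative URLs against the page URL; skip tags without a target.
        if tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))


def collect(html, base_url):
    """Return (images, links) found in an HTML string."""
    parser = ImageLinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.images, parser.links


def fetch(url):
    """Download a page and decode it (assumes UTF-8 if no charset is declared)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Usage would be something like `images, links = collect(fetch(url), url)`. Since document order is a weak proxy for "near", you would likely still need some extra heuristic (for example, links that wrap or directly follow the candidate image) to replace the coordinate check.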
