
Retrieving the first picture with an HTML parser

(Not a native English speaker)

I'm working on a personal project in PHP in which I use the Simple HTML Parser to parse the HTML of a given URL and retrieve the first image in a DIV that has a specific ID or class (maincontent, content, main, wrapper, etc. - they are all in an array), ignoring ads. The goal is to take this image and make a thumbnail from it, much like Digg and other sites do.

I thought everything was working fine until I tried my script on Snopes (this page, to be exact: "http://www.snopes.com/photos/animals/luckycoyote.asp").

The source of the first image it gets is: " graphics/luckycoyote1.jpg ". So far, to correct this problem, I have written a little function that gets the domain name of the given URL and inserts it before the IMG's source attribute. For sites like Snopes.com, that gives me "http://www.snopes.com/graphics/luckycoyote1.jpg", while the real URL on Snopes for this image is "http://www.snopes.com/photos/animals/graphics/luckycoyote1.jpg" (or, more precisely, "http://graphics1.snopes.com/photos/animals/graphics/luckycoyote1.jpg" - note the subdomain here).

So my main question is: how can I externally/dynamically retrieve the full URL address of an image ("absolute path") when I am only given the "relative path"? I'm pretty sure this is possible, since when I paste the link in Facebook's "What are you doing?" field for example, it gives me the correct path to the image while on the website, the source of the image is only (example) "image/photo/example.jpg".

Thank you for your time.


When you get a relative image URL such as graphics/luckycoyote1.jpg, where the src="" attribute DOESN'T start with a /, you should prepend the path of the page you are currently on instead of the bare domain name.

  • URL: http://www.snopes.com/photos/animals/luckycoyote.asp
  • URL Path: http://www.snopes.com/photos/animals/

To get this in PHP, run dirname('http://www.snopes.com/photos/animals/luckycoyote.asp'). Note that it returns the path without the trailing slash (http://www.snopes.com/photos/animals), so append a / and then graphics/luckycoyote1.jpg, and you'll get your image.

The switch to graphics1.snopes.com happens automatically on the server, so you shouldn't need to worry about it. When the image src="" starts with a /, use the domain name http://www.snopes.com instead.
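
The two rules above (directory of the current page for a plain relative src, domain name for a src starting with /) could be combined roughly as follows. This is only a sketch: the function name resolve_image_url is made up for illustration, it uses parse_url() rather than dirname() so that a page URL with no path is handled too, and it does not collapse ../ segments or honour a <base> tag.

```php
<?php
// Hypothetical helper (not part of Simple HTML Parser): turn an <img> src
// into an absolute URL, given the URL of the page it was found on.
function resolve_image_url(string $pageUrl, string $src): string
{
    // The scraped src may carry stray whitespace, e.g. " graphics/x.jpg ".
    $src = trim($src);
    if ($src === '') {
        return $pageUrl;
    }

    // Already absolute (has a scheme): nothing to do.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $src)) {
        return $src;
    }

    $parts  = parse_url($pageUrl);
    $scheme = $parts['scheme'] ?? 'http';
    $root   = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    // Protocol-relative src ("//host/path"): reuse the page's scheme.
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;
    }

    // src starts with "/": relative to the domain root.
    if (strpos($src, '/') === 0) {
        return $root . $src;
    }

    // Otherwise: relative to the directory of the current page.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $root . $dir . $src;
}

echo resolve_image_url(
    'http://www.snopes.com/photos/animals/luckycoyote.asp',
    ' graphics/luckycoyote1.jpg '
), "\n";
```

With the page from the question, the call at the bottom builds the /photos/animals/graphics/ URL rather than the /graphics/ one.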


In your case, my guess is that a server redirect is going on. The only reliable way to find the final address would be to make a web request for the image using the "default domain" URL you built initially, and then see where it gets redirected to along the way.
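
One way to check for such a redirect is to let cURL follow it and then ask for the effective URL. This is a minimal sketch that assumes the cURL extension is enabled; the function name final_url_after_redirects is made up for illustration, and whether Snopes actually redirects this image is not something the sketch can tell you in advance.

```php
<?php
// Hypothetical helper: request $url, follow any redirects, and return the
// URL the request finally landed on (or null if the request failed).
function final_url_after_redirects(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true, // headers only, skip the image body
        CURLOPT_FOLLOWLOCATION => true, // follow Location: redirects
        CURLOPT_MAXREDIRS      => 5,    // give up on redirect loops
        CURLOPT_RETURNTRANSFER => true, // do not echo the response
        CURLOPT_TIMEOUT        => 10,   // seconds
    ]);

    if (curl_exec($ch) === false) {
        return null;
    }

    // CURLINFO_EFFECTIVE_URL is the last URL cURL actually requested.
    return curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
}
```

Some servers answer HEAD requests differently from GET, so if the result looks wrong it may be worth dropping CURLOPT_NOBODY and retrying.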
