Why am I not getting back any images here?
$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$html = @file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xml = @simplexml_import_dom($doc);
$images = $xml->xpath('//img');
var_dump($images);
die();
Output is:
array(0) { }
However, in the page source I see this:
<img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" />
Edit: It appears $html's contents stop at the <body> tag for this page. Any idea why?
It appears $html's contents stop at the <body> tag for this page. Any idea why?
Yes, you must provide this page with a valid user agent.
$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_exec($ch);
This outputs everything up to the closing </html> tag, including your requested <img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" />, whereas a plain wget or curl without the user agent returns only up to the <body> tag.
$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
$images = $xml->xpath('//img');
var_dump($images);
die();
EDIT: My first post stated that there was still an issue with XPath. I was just not doing my due diligence, and the updated code above works great. I had forgotten to make curl return the output as a string rather than print it to the screen (as it does by default).
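To confirm whether a fetched response was cut short at the <body> tag, a quick check for the closing tag can help. This is only a rough sketch: the helper name has_closing_html_tag is made up for illustration, and a missing </html> is a hint of truncation, not proof.

```php
<?php
// Hypothetical helper: reports whether the markup contains a closing
// </html> tag (case-insensitive). A missing tag suggests the response
// may have been truncated by the server.
function has_closing_html_tag(string $html): bool
{
    return stripos($html, '</html>') !== false;
}

// A complete document versus one that stops at <body>.
var_dump(has_closing_html_tag('<html><body><p>x</p></body></html>'));
var_dump(has_closing_html_tag('<html><head></head><body>'));
```

Running it on the string returned by curl_exec() before handing it to loadHTML() makes a cut-off response visible early.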
Why bring simplexml into the mix? You're already loading the HTML from w3fools into the DOM class, which has a perfectly good XPath query engine in it already.
[...snip...]
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$images = $xpath->query('//img'); // DOMXPath uses query(), not xpath()
[...snip...]
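For a fuller picture of the DOM-only approach, here is a minimal sketch that runs against an inline HTML string instead of the live page. The helper name extract_image_sources and the second image path are made up for illustration. DOMXPath::query() returns a DOMNodeList of DOMElement nodes, so attributes are read with getAttribute().

```php
<?php
// Hypothetical helper: returns the src attribute of every <img> in the markup.
function extract_image_sources(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $sources = [];
    foreach ($xpath->query('//img') as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}

// Sample markup: the first image comes from the question,
// the second path is invented for this example.
$sample = '<html><body>'
    . '<img src="/images/w3schoolslogo.gif" alt="W3Schools.com" />'
    . '<p>Some text</p>'
    . '<img src="/images/example.gif" alt="" />'
    . '</body></html>';

print_r(extract_image_sources($sample));
```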
The IMG tag is generated by JavaScript. If you'd downloaded this page via wget, you'd realize there is no IMG tag in the HTML.
Update #1
I believe it is because of the user agent string. If I supply "Mozilla/5.0 (X11; Linux i686 on x86_64; rv:2.0) Gecko/20100101 Firefox/4.0" as the user agent, I get the whole page.
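If you would rather stay with file_get_contents() than switch to curl, a user agent can be supplied through a stream context. The following is a sketch, not a tested fix for this particular site: the helper names build_user_agent_context and fetch_html are hypothetical, the 10-second timeout is an arbitrary choice, and whether the server then returns the full page is an assumption based on the behaviour described above.

```php
<?php
// Hypothetical helper: builds an HTTP stream context that sends the
// given User-Agent header with each request.
function build_user_agent_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; arbitrary value for this sketch
        ],
    ]);
}

// Hypothetical helper: fetches a URL with that context and returns
// the body, or null if the request failed.
function fetch_html(string $url, string $userAgent): ?string
{
    $html = file_get_contents($url, false, build_user_agent_context($userAgent));
    return $html === false ? null : $html;
}
```

Calling fetch_html('http://www.w3schools.com/js/js_loop_for.asp', 'Mozilla/5.0 (X11; Linux i686 on x86_64; rv:2.0) Gecko/20100101 Firefox/4.0') would then take the place of the bare file_get_contents($url) call from the question.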