Anyone have a good solution for scraping the HTML source of a page with content (in this case, HTML tables) generated with Javascript? [closed]
An embarrassingly simple, though workable, solution using Crowbar:
<?php
function get_html($url) // $url must already be urlencode()d
{
    $context = stream_context_create(array(
        'http' => array('timeout' => 120) // HTTP timeout in seconds
    ));
    $response = file_get_contents('http://127.0.0.1:10000/?url=' . $url . '&delay=3000&view=browser', false, $context);
    // substr strips the wrapper markup that the Crowbar web service adds,
    // returning only the rendered HTML of $url
    return substr($response, 730, -32);
}
?>
The advantage of using Crowbar is that the tables will actually be rendered (and therefore accessible), thanks to its headless Mozilla-based browser. Edit: it turned out that the problem with Crowbar was a conflicting app, not server downtime; the downtime was just a coincidence.
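Once the rendered source comes back, the tables can be pulled out with PHP's DOM extension. Here is a minimal usage sketch of the function above; the target URL and its table markup are hypothetical:
<?php
// Hypothetical target page; Crowbar expects the URL urlencode()d
$html = get_html(urlencode('http://example.com/stats'));

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings from imperfect real-world markup

// Walk every <table> that the headless browser rendered
foreach ($doc->getElementsByTagName('table') as $table) {
    foreach ($table->getElementsByTagName('tr') as $row) {
        $cells = array();
        foreach ($row->getElementsByTagName('td') as $cell) {
            $cells[] = trim($cell->textContent);
        }
        echo implode("\t", $cells), "\n";
    }
}
?>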
Well, Java provides some convenient solutions, such as HtmlUnit, which correctly interprets JavaScript and as a consequence should make the generated HTML visible.
This is a more robust version of the example in the OP using cURL/Crowbar:
<?php
function get_html($url) // $url must already be urlencode()d
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, 'http://127.0.0.1:10000/?url=' . $url . '&delay=3000&view=as-is');
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
    $html = curl_exec($curl);
    curl_close($curl);
    return $html;
}
?>
I was getting frequent "failed to open stream: HTTP request failed!" errors when using file_get_contents() with multiple URLs.
Also, remember to urlencode() the $url (i.e. pass 'http%3A%2F%2Fwww.google.com', not 'http://www.google.com').
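A short sketch of driving the cURL version across several pages, with the encoding step shown explicitly; the URLs here are hypothetical:
<?php
// Hypothetical list of JavaScript-heavy pages to fetch through Crowbar
$urls = array(
    'http://example.com/table1',
    'http://example.com/table2',
);

foreach ($urls as $url) {
    // urlencode() turns 'http://example.com/table1' into
    // 'http%3A%2F%2Fexample.com%2Ftable1' before it is appended
    // to the Crowbar query string
    $html = get_html(urlencode($url));
    echo strlen($html) . " bytes fetched from $url\n";
}
?>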