Extracting JavaScript Variable Values via Web Scraping

For a company project, I need to create a web scraping application with PHP and JavaScript (including jQuery) that will extract specific data from each page of our clients' websites. The scraping app needs to get two types of data for each page: 1) determine whether certain HTML elements with specific IDs are present, and 2) extract the value of a specific JavaScript variable. The JS variable name is the same on each page, but the value is usually different.

I believe I know how to meet the first requirement: use the PHP file_get_contents() function to get each page's HTML, then use JavaScript/jQuery to parse that HTML and search for elements with specific IDs. However, I'm not sure how to get the second piece of data, the JavaScript variable values. The variable isn't even found within each page's HTML; instead, it is in an external JavaScript file that is linked from the page. And even if the JavaScript were embedded in the page's HTML, I know that file_get_contents() would only return the JavaScript code (and other HTML), not any variable values.

Can anyone suggest a good approach to getting this variable value for each page of a given website?

EDIT: Just to clarify, I need the values of the JavaScript variables after the JavaScript code has been run. Is such a thing even possible?


You say you need the value of the variable after the JS has executed. I assume it's always the same JS, with just initial variable values being the thing that changes. Your best bet is to port the JS to PHP, which lets you extract the initial JS variable values and then pretend you executed the JS.

Here's a function for extracting variable values from JavaScript:


/**
 * Extracts a variable's value given its name and type. Makes certain assumptions about the source,
 * e.g. it can't handle strings containing escaped quotes.
 *
 * @param string $jsText    the JavaScript source
 * @param string $name      the name of the variable
 * @param string $type      the variable type, either 'string' (default), 'float' or 'int'
 * @return string|int|float|false    the extracted value, or false if there is no match or the type is unknown
 */
function extractVar($jsText, $name, $type = 'string') {
    if ($type == 'string') {
        // Group 1 is the opening quote; \1 requires the same quote to close the string.
        $valueMatch = "(\"|')(.*?)\\1";
    } else if ($type == 'float' || $type == 'int') {
        // Greedy match so the whole number is captured, not just its first character.
        $valueMatch = "([0-9.]+)";
    } else {
        return false;
    }

    // Escape the name so characters such as "$" are matched literally.
    $pattern = '/' . preg_quote($name, '/') . '\s*=\s*' . $valueMatch . '/';
    if (!preg_match($pattern, $jsText, $matches)) {
        return false;
    }
    if ($type == 'string') {
        return $matches[2];
    } else if ($type == 'float') {
        return (float)$matches[1];
    } else {
        return (int)$matches[1];
    }
}


Presumably this is impossible because it seems so simple, but if it's your own .js file you're trying to detect, why not just have that .js write something to the page that a scrape can detect?

Use the JS to populate a tag like this somewhere (via element.innerHTML, presumably):

<span><!--Important js thing has been activated!--></span>

Edit: alternatively, maybe use document.write if the script needs to be detectable at load time.
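
A minimal sketch of that marker idea, under assumptions: the helper name publishValue, the "js-value:" prefix, the element ID "js-marker" and the variable name importantVar are all hypothetical and not from the post. It puts the variable's value into the comment so a scraper can read it back.

```javascript
// Hypothetical helper: writes an HTML comment carrying the value into an element.
function publishValue(element, value) {
  // encodeURIComponent turns ">" into %3E, so the value cannot close the comment early.
  element.innerHTML = '<!--js-value:' + encodeURIComponent(String(value)) + '-->';
  return element.innerHTML;
}

// Browser usage (assumes a <span id="js-marker"> exists on the page).
if (typeof document !== 'undefined') {
  const marker = document.getElementById('js-marker');
  // "importantVar" is a placeholder for the real variable name.
  if (marker) publishValue(marker, window.importantVar);
}
```

Keep in mind that the marker only exists once the script has run, so whatever reads it has to execute the page's JavaScript first; a plain HTTP fetch of the HTML will not contain it.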


Can't you use a JS script that gets sent to your clients and have that script send the info to your server?
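
Roughly what such a reporting script could look like, as a sketch only: the collector URL, the query parameter names, the helper buildReportUrl and the variable name trackedVar are assumptions made for illustration.

```javascript
// Hypothetical helper: builds the URL that reports one page's value to the collector.
function buildReportUrl(endpoint, pageUrl, value) {
  const url = new URL(endpoint);
  url.searchParams.set('page', pageUrl);      // which page the value came from
  url.searchParams.set('value', String(value));
  return url.toString();
}

// Browser usage: once the page has loaded, request the URL as an image (a simple GET).
if (typeof window !== 'undefined' && typeof Image !== 'undefined') {
  window.addEventListener('load', function () {
    // "trackedVar" and the endpoint are placeholders, not names from the original post.
    new Image().src = buildReportUrl(
      'https://collector.example.com/report',
      window.location.href,
      window.trackedVar
    );
  });
}
```

The server side would then just log the page and value parameters of each incoming request.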


You may be able to use Zombie.js, a Node.js library: http://zombie.labnotes.org/

It can click links, walk the DOM tree, and should be able to run the page's JS, since it is JavaScript that's running it all.
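
This is not Zombie.js's own API. As a stand-in, the sketch below uses Node's built-in vm module to show the same principle: run the script, then read the variable afterwards. It assumes the script text has already been downloaded and does not need a DOM; readVarAfterRun and pageVar are hypothetical names.

```javascript
const vm = require('vm');

// Hypothetical helper: runs a script in a separate context and returns one global's value.
function readVarAfterRun(jsText, name) {
  const sandbox = {};                 // becomes the script's global object
  vm.createContext(sandbox);
  // Abort scripts that run longer than one second.
  vm.runInContext(jsText, sandbox, { timeout: 1000 });
  return sandbox[name];               // top-level "var" declarations land on the sandbox
}

const source = 'var pageVar = 2; pageVar = pageVar * 21;';
console.log(readVarAfterRun(source, 'pageVar')); // prints 42
```

Note that the vm module is not a security boundary, so this is only reasonable for scripts you trust, and variables declared with let or const at the top level will not show up on the sandbox object.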
