开发者

Download html of URL with Python - but with javascript enabled

I am trying to download this page so that I can scrape the search results. However, when I download the page and try to process it w开发者_如何学JAVAith BeautifulSoup, I find that parts of the page (for example, the search results) aren't included as the site has detected that javascript is not enabled.

Is there a way to download the HTML of a URL with javascript enabled in Python?


@kstruct: My preferred way, instead of writing a full browser with QtWebKit and PyQt4, is to use one already written. There's the PhantomJS (C++) project, or PyPhantomJS (Python). Basically the Python one is QtWebKit and Python.

They're both headless browsers which you can control directly from JavaScript. The Python version has a plug-in system which allows you to extend the core too, to allow additional functionalities should you need.

Here's an example script for PyPhantomJS (with the saveToFile plugin)

// create new webpage
var page = new WebPage();

// open page, set callback
page.open('url', function(status) {
    // exit if page couldn't load
    if (status !== 'success') {
        console.log('FAIL to load!');
        phantom.exit(1);
    }

    // save page content to file
    phantom.saveToFile(page.content, 'myfile.txt');
    phantom.exit();
});

Useful links:
API reference | How to write plugins


I'd look into using the QtWebKit module in the PyQt4 library. The module will let the JS code run on the page and once it's done, you can save the HTML using standard methods I believe.

Otherwise, Selenium is the way to go. It lets you control a web browser from your Python script to pull up the page and then extract all the DOM stuff.


Once you wanta javascript enabled, what you're asking for is very close to a browser. You can use jython and then use HtmlUnit, which is a headless java based browser. It's pretty fast but not very stable (because is imitates a browser and isn't really a browser). I think the fastest and easiest way is to use selenium (ide or preferably rc). Selenium gives you the ability to control your favorite browser (FF, IE, chrome,..). Although it's meant for testing puposes, it'll probably work for you. It's stable and pretty fast (I think it's even faster than HtmlUnit).


You can use htql at http://htql.net.

import htql;
browser=htql.Browser(2);
page, url=browser.goUrl('http://docs.python.org/search.html?q=chdir&check_keywords=yes&area=default');
import time; 
time.sleep(2);
page, url=browser.getUpdatedPage();

BTW, you will need to install IRobot at http://irobotsoft.com/

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜