Download html of URL with Python - but with javascript enabled

2023-03-18 20:58

I am trying to download this page so that I can scrape the search results. However, when I download the page and try to process it with BeautifulSoup, I find that parts of the page (for example, the search results) aren't included, as the site has detected that JavaScript is not enabled.

Is there a way to download the HTML of a URL with javascript enabled in Python?

@kstruct: My preferred way, instead of writing a full browser with QtWebKit and PyQt4, is to use one already written. There's the PhantomJS (C++) project, or PyPhantomJS (Python). Basically the Python one is QtWebKit and Python.

They're both headless browsers which you can control directly from JavaScript. The Python version also has a plug-in system which lets you extend the core, should you need additional functionality.

Here's an example script for PyPhantomJS (with the saveToFile plugin):

// create new webpage
var page = new WebPage();

// open page, set callback
page.open('url', function(status) {
    // exit if page couldn't load
    if (status !== 'success') {
        console.log('FAIL to load!');
        phantom.exit(1);
        return; // exit may not stop the callback at once, so don't fall through to the save
    }

    // save page content to file
    phantom.saveToFile(page.content, 'myfile.txt');
    phantom.exit();
});

Useful links:
API reference | How to write plugins
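
Since the question asks for a Python solution, here is a minimal sketch of driving such a script from Python: run the headless browser as a subprocess, then read back the file the script saved. The function name `render_to_file`, the default executable name `phantomjs`, and the timeout are assumptions made for illustration; adjust the executable to however PhantomJS or PyPhantomJS is launched on your system.

```python
import subprocess
from pathlib import Path


def render_to_file(script_path, output_path, executable="phantomjs", timeout=60):
    """Run a headless-browser script and return the HTML it saved.

    `executable` is an assumption: use whatever command launches
    PhantomJS (or PyPhantomJS) on your machine.
    """
    # Equivalent to running `<executable> <script_path>` in a shell.
    # check=True raises CalledProcessError on a non-zero exit code,
    # such as the phantom.exit(1) in the script's failure branch.
    subprocess.run([executable, str(script_path)], check=True, timeout=timeout)

    # The script is expected to have written the rendered page here.
    return Path(output_path).read_text(encoding="utf-8")
```

With the script above saved as, say, `save_page.js`, `render_to_file("save_page.js", "myfile.txt")` would return the rendered HTML as a string, ready to hand to BeautifulSoup.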

I'd look into using the QtWebKit module in the PyQt4 library. The module will let the JS code run on the page, and once it's done, I believe you can save the HTML using standard methods.

Otherwise, Selenium is the way to go. It lets you control a web browser from your Python script to pull up the page and then extract all the DOM stuff.
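
A minimal sketch of the Selenium approach, assuming the `selenium` package and a Firefox driver are installed. The helper name `fetch_rendered_html` and the fixed `settle_seconds` pause are illustrative choices, not part of Selenium; a real scraper would more likely wait for a specific element to appear.

```python
import time


def fetch_rendered_html(url, driver=None, settle_seconds=2.0):
    """Return a page's HTML after the browser has run its JavaScript.

    If no driver is passed in, a Firefox instance is started and shut
    down again afterwards (assumes Selenium and Firefox are installed).
    """
    own_driver = driver is None
    if own_driver:
        # Imported lazily so a caller can supply its own driver object.
        from selenium import webdriver
        driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Crude wait so scripts on the page have time to fill in content.
        time.sleep(settle_seconds)
        # page_source reflects the DOM as the browser currently sees it.
        return driver.page_source
    finally:
        if own_driver:
            driver.quit()
```

The returned string can then be passed to BeautifulSoup as usual, for example `BeautifulSoup(fetch_rendered_html(url), "html.parser")`.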

Once you want JavaScript enabled, what you're asking for is very close to a browser. You can use Jython with HtmlUnit, which is a headless Java-based browser. It's pretty fast but not very stable (because it imitates a browser and isn't really one). I think the fastest and easiest way is to use Selenium (IDE or, preferably, RC). Selenium gives you the ability to control your favorite browser (Firefox, IE, Chrome, ...). Although it's meant for testing purposes, it'll probably work for you. It's stable and pretty fast (I think it's even faster than HtmlUnit).

You can use htql at http://htql.net.

import time
import htql

browser = htql.Browser(2)
page, url = browser.goUrl('http://docs.python.org/search.html?q=chdir&check_keywords=yes&area=default')
time.sleep(2)  # pause before asking for the updated page
page, url = browser.getUpdatedPage()

BTW, you will need to install IRobot from http://irobotsoft.com/
