
Options for handling javascript heavy pages while screen scraping

Disclaimer here: I'm really not a programmer. I'm eager to learn, but my experience is pretty much BASIC on a C64 20 years ago and a couple of days of learning Python.

I'm just starting out on a fairly large (for me as a beginner) screen scraping project. So far I have been using Python with mechanize+lxml for my browsing/parsing. Now I'm encountering some really JavaScript-heavy pages that don't show anything without JavaScript enabled, which means trouble for mechanize.

From my searching I've kind of come to the conclusion that I basically have a few options:

  1. Trying to figure out what the JavaScript is doing and emulating that in my code (I don't quite know where to start with this. ;-))

  2. Using pywin32 to control Internet Explorer, or something similar, like using the WebKit browser from PyQt4, or even using telnet and MozRepl (this seems really hard)

  3. Switching language to Perl, since WWW::Mechanize seems to be a lot more mature on Perl (add-ons and such for JavaScript). I don't know too much about this at all.

If anyone has some pointers here, that would be great. I understand that I need to do a lot of trial and error, but it would be nice if I didn't stray too far from the "true" answer, if there is such a thing.


You might be able to find the data you are looking for elsewhere. Try using the web-developer toolbar in Firefox to see what is being loaded by JavaScript. It might be that you can find the data in the JS files themselves.
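For example, a lot of "JavaScript-heavy" pages just embed their data as a JSON literal inside a script tag, or fetch it from a separate URL you can request directly and parse without running any JavaScript at all. A minimal sketch of the idea (the variable name `listings` and the markup here are made up for illustration):

```python
import json
import re

# Suppose viewing the page source (or the JS files the toolbar shows you)
# reveals the data embedded as JSON, something like this made-up example:
page_source = '''
<script>
var listings = {"items": [{"name": "first", "price": 10},
                          {"name": "second", "price": 25}]};
</script>
'''

# Pull the JSON literal out of the script and parse it with the standard
# json module -- no JavaScript execution needed.
match = re.search(r'var listings = (\{.*\});', page_source, re.DOTALL)
data = json.loads(match.group(1))

for item in data["items"]:
    print(item["name"], item["price"])
```

The same trick works when the data comes from a separate request: the toolbar shows you the URL, and you fetch that URL directly instead of the page around it.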

Otherwise, you probably do need to use Mechanize. There are two tutorials that you might find useful here:

http://scraperwiki.com/help/tutorials/python/


A fourth option might be to use browserjs.

This is supposed to be a way to run a browser environment in Mozilla Rhino or some other command-line JavaScript engine. Presumably you could (at least in theory) load the page in that environment and dump the HTML after the JS has had its way with it.

I haven't really used it myself; I tried it a couple of times but found it way too slow for my purposes. I didn't try very hard, though; there might be an option you need to set or some such.


I use Chickenfoot for simple tasks and python-webkit for more complex ones. I've had good experiences with both.

Here is a snippet to render a webpage (including executing any JavaScript it contains) and return the resulting HTML:

import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
  """Load a URL, let its JavaScript run, and keep the final HTML."""
  def __init__(self, url):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    self.app.exec_()  # blocks until _loadFinished calls quit()

  def _loadFinished(self, result):
    self.html = str(self.mainFrame().toHtml())
    self.app.quit()

html = Render(url).html  # url is the page you want to scrape
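Once you have the rendered HTML, you can feed it to lxml exactly as you would with mechanize's output. A quick sketch using a sample string in place of the rendered page (the markup and class names here are made up):

```python
from lxml import html as lxml_html

# Sample of what the rendered HTML might look like after the page's
# JavaScript has filled it in (made-up markup for illustration):
rendered = '''
<html><body>
  <div class="result"><a href="/item/1">First item</a></div>
  <div class="result"><a href="/item/2">Second item</a></div>
</body></html>
'''

tree = lxml_html.fromstring(rendered)
# XPath works the same whether the HTML came from mechanize or from QWebPage.
links = [(a.text, a.get('href'))
         for a in tree.xpath('//div[@class="result"]/a')]
print(links)
```

In other words, the QWebPage step only replaces the fetching half of the pipeline; the parsing half stays the same.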


For non-programmers, I recommend using IRobotSoft. It is visually oriented and has full JavaScript support. The shortcoming is that it runs only on Windows. The good thing is that you can become an expert at the software just through trial and error.
