Setting up a Python screen scraper that can run on Google App Engine

I am looking to set up an automated screen scraper that will run on Google App Engine using Python. I want it to scrape a site and put the specified results into an entity in App Engine. I am looking for some direction on what to use. I have seen BeautifulSoup, but wonder if people could recommend anything else that could run on Google App Engine.


BeautifulSoup runs fine on App Engine (just make sure to use 3.0.8, not the iffy 3.1.0). The main alternative, I think, would be html5lib. I haven't tried it on App Engine, but I believe it does run there (quite slowly; if that's a problem, I think you need to stick with BeautifulSoup), e.g. this service runs on App Engine and is based on html5lib.


I have had good (although slow) results using mechanize and BeautifulSoup. In fact, to save code space on Google App Engine, I use the (old) version of BeautifulSoup included in mechanize.

I have mechanize in a zip file, mechanize.zip. The index of this zip file looks like:

mechanize/
mechanize/__init__.py
mechanize/_auth.py
mechanize/_beautifulsoup.py
mechanize/_clientcookie.py
... etc

Then in my Python code,

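import sys

# Put the zip archive on the import path so Python can import from inside it.
sys.path.insert(0, 'mechanize.zip')

import mechanize
# Reuse the (old) BeautifulSoup copy bundled with mechanize.
from mechanize._beautifulsoup import BeautifulSoup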
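
For anyone unfamiliar with importing from a zip, here is a minimal, self-contained sketch of the same mechanism using only the standard library. The package name `demo_pkg` and its `GREETING` attribute are hypothetical and exist only for this illustration; the real case would use mechanize.zip as described above.

```python
import os
import sys
import tempfile
import zipfile

# Build a throwaway zip containing a tiny package. "demo_pkg" is a
# hypothetical name used only for this illustration.
zip_path = os.path.join(tempfile.mkdtemp(), "demo_pkg.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("demo_pkg/__init__.py", "GREETING = 'loaded from zip'\n")

# Putting the archive itself on sys.path lets Python's zip import
# support look inside it, the same way mechanize.zip is used above.
sys.path.insert(0, zip_path)

import demo_pkg

print(demo_pkg.GREETING)
print(demo_pkg.__file__)  # the reported path points inside the archive
```

Note that the archive must contain the package directory at its top level (as in the mechanize/ listing above), otherwise the import will not find it.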


The other choice is lxml, but it uses C code and so does not work on GAE.


I have used BeautifulSoup with great success for parsing HTML. The problem is that parsing HTML is all BeautifulSoup does, so I ended up writing all the HTTP interactions using urlfetch.
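
A rough sketch of that split between fetching and parsing might look like the following. It is only an illustration under some assumptions: the function names (`fetch_with_urlfetch`, `scrape_links`, `fake_fetch`) are made up here, and the standard library's HTMLParser is swapped in for BeautifulSoup so the sketch runs without third-party packages. The urlfetch import is done lazily, so the parsing part can be tried outside App Engine with a fake fetcher.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (a stand-in for BeautifulSoup)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch_with_urlfetch(url):
    """Fetch a page body using App Engine's urlfetch service."""
    # Imported lazily: this module only exists on App Engine.
    from google.appengine.api import urlfetch
    result = urlfetch.fetch(url)
    if result.status_code != 200:
        raise IOError("fetch failed with status %d" % result.status_code)
    return result.content


def scrape_links(url, fetch=fetch_with_urlfetch):
    """Fetch `url` with the given fetcher and return the links it contains."""
    html = fetch(url)
    if isinstance(html, bytes) and not isinstance(html, str):
        # Assumption: pages are UTF-8; undecodable bytes are replaced.
        html = html.decode("utf-8", "replace")
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def fake_fetch(url):
    """Hypothetical fetcher returning canned HTML, for trying this locally."""
    return '<html><body><a href="/a">A</a><a href="/b">B</a></body></html>'


print(scrape_links("http://example.com/", fetch=fake_fetch))
```

Keeping the fetcher as a parameter means the parsing code can be exercised without any network access, and the urlfetch-specific part stays in one small function.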

To scrape my target, I need a full-fledged, code-driven browser that can execute JavaScript on the target site's pages. I think I'll have to drop the Python app and switch to Java so I can use HTMLUnit; prototyping is underway. - mattb
