Running a web crawler for selected sites on Google App Engine?
I need to write a crawler to extract some info from a few pre-selected websites only.
I know this is a straightforward job, but I am thinking of using Google App Engine to get it done.
Maybe I can try Nutch to do this for me.
How feasible is this way of getting it done?
1) Hosting a crawler on Google infrastructure 2) Nutch + App Engine: will it be possible?
Just glancing over the Nutch docs, I see comments like "[t]his is the second release of Nutch based entirely on the underlying Hadoop platform", which make me suspect it will not run on App Engine. App Engine apps run in a Python or Java sandbox.
That said, you should be able to put a basic crawler together on App Engine. A basic implementation would probably involve launching tasks that use urlfetch to grab pages and then, optionally, insert additional tasks to process the links each document contains. You can kick the crawl off using scheduled tasks.
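To make the fetch-then-enqueue idea concrete, here is a rough sketch in plain Python rather than actual App Engine code. It is only an illustration under a few assumptions: `crawl`, `LinkExtractor`, `fetch`, and `allowed_hosts` are hypothetical names, the `deque` stands in for a task queue (each entry plays the role of one task), and `fetch` is any callable that returns a page's HTML, which on App Engine would presumably wrap urlfetch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, allowed_hosts, max_pages=50):
    """Breadth-first crawl restricted to the pre-selected hosts.

    fetch is a hypothetical callable: URL in, HTML text (or None) out.
    The deque is a stand-in for a task queue, not a real one.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # fetch failed or page unknown; skip it
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            # Only follow links that stay on the selected sites.
            if urlparse(link).netloc in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Since you only care about a few sites, the `allowed_hosts` check is what keeps the crawl from wandering off, and `max_pages` is a crude guard against running up quota. In a real deployment each loop iteration would be its own task, and the `seen` set would have to live somewhere shared, such as the datastore.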