Crawler on App Engine
I want to run a program continuously on App Engine. This program will automatically crawl some websites continuously and store the data in its database. Is it possible for the program to keep doing this continuously on App Engine, or will App Engine kill the process?
Note: The website to be crawled is not hosted on App Engine.
I want to run a program continuously on App Engine.
Can't.
The closest you can get is background-running scheduled tasks that last no more than 30 seconds:
Notably, this means that the lifetime of a single task's execution is limited to 30 seconds. If your task's execution nears the 30 second limit, App Engine will raise an exception which you may catch and then quickly save your work or log progress.
A friend of mine suggested the following:
- Create a task queue.
- Start the queue by adding a task with some initial data.
- Use an exception handler to catch DeadlineExceededException.
- In that handler, add a new task to the queue that continues the same work.
This way the job can run indefinitely. You only need to keep an eye on the CPU time and storage it uses.
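The chaining idea above can be sketched in plain Python. This is only a simulation under stated assumptions, not App Engine code: an in-memory `deque` stands in for the task queue, a tiny `TIME_BUDGET_SECONDS` stands in for the 30 second limit, and `DeadlineExceeded`, `process_item`, `run_task` and `drain` are hypothetical names. On App Engine the runtime raises the deadline exception itself; here the task checks the clock and raises it.

```python
import time
from collections import deque

TIME_BUDGET_SECONDS = 0.05  # assumption: stand-in for the 30 second limit


class DeadlineExceeded(Exception):
    """Hypothetical stand-in for App Engine's deadline exception."""


def process_item(item, results):
    # Placeholder for the real work (fetch one page and store it).
    time.sleep(0.01)
    results.append(item * 2)


def run_task(payload, queue, results):
    """Process items from the saved cursor until the time budget runs out."""
    started = time.monotonic()
    items = payload["items"]
    cursor = payload["cursor"]
    try:
        while cursor < len(items):
            if time.monotonic() - started > TIME_BUDGET_SECONDS:
                raise DeadlineExceeded()
            process_item(items[cursor], results)
            cursor += 1
    except DeadlineExceeded:
        # Save progress and hand the remaining work to a fresh task.
        queue.append({"items": items, "cursor": cursor})


def drain(queue, results):
    """Run queued tasks one after another; return how many tasks ran."""
    runs = 0
    while queue:
        run_task(queue.popleft(), queue, results)
        runs += 1
    return runs
```

Starting the queue with `deque([{"items": list(range(20)), "cursor": 0}])` and calling `drain` finishes all 20 items across several short tasks. The point is that each task carries its own cursor, so nothing is lost when one task is cut off.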
You might want to consider Backends, introduced in newer versions of GAE.
These can run continuous processes.
Yes, it is possible. I have already built a solution on App Engine: wowprice.
Sharing all the details here would make this answer too long, so here is the outline.
Problem - Suppose I want to crawl walmart.com. I know that I can't crawl it in one shot (millions of products).
Solution - I designed my spider to break the job into smaller tasks.
- Step 1: I submit a job for walmart.com, and the job scheduler creates a task.
- Step 2: My spider picks up the job and notices that it is the index page, so it creates more jobs with the category pages as their starting pages. It now enqueues 20 more tasks.
- Step 3: The spider then creates smaller jobs for the subcategories, and keeps going until it reaches the product list pages, creating a task for each.
- Step 4: For each product list page, it extracts the products and makes a call to store the product data. If there is a next page, it creates one more task to crawl it.
Advantages - We can crawl without breaking the 30-second rule, the crawl speed depends on the backend machine, and a single target gets crawled in parallel.
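A rough sketch of the steps above, in plain Python rather than real App Engine code. Everything here is an assumption made for illustration: the `SITE` dictionary is a made-up site layout that replaces real HTTP fetches, a `deque` stands in for the task queue, and `handle_task` and `crawl` are hypothetical names. Each task handles exactly one page and either fans out into child tasks or stores products.

```python
from collections import deque

# Hypothetical site layout: url -> (page kind, page content).
SITE = {
    "/": ("index", ["/electronics", "/toys"]),
    "/electronics": ("category", ["/electronics/tv"]),
    "/toys": ("category", ["/toys/lego"]),
    "/electronics/tv": (
        "list",
        {"products": ["tv-1", "tv-2"], "next": "/electronics/tv?page=2"},
    ),
    "/electronics/tv?page=2": ("list", {"products": ["tv-3"], "next": None}),
    "/toys/lego": ("list", {"products": ["lego-1"], "next": None}),
}


def handle_task(url, queue, store):
    """Handle one page: fan out into child tasks or store its products."""
    kind, content = SITE[url]
    if kind in ("index", "category"):
        # Steps 2 and 3: one new task per category or subcategory page.
        for child_url in content:
            queue.append(child_url)
    else:
        # Step 4: store the products, then queue the next page if any.
        store.extend(content["products"])
        if content["next"]:
            queue.append(content["next"])


def crawl(start_url):
    """Step 1: seed the queue with one job, then run tasks until it is empty."""
    queue = deque([start_url])
    store = []
    tasks_run = 0
    while queue:
        handle_task(queue.popleft(), queue, store)
        tasks_run += 1
    return store, tasks_run
```

Calling `crawl("/")` walks the whole made-up site in small tasks. Because every task is short and independent, a real queue could run many of them in parallel, which is where the speed advantage comes from.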
They fixed this for you: you can run background threads on a manually scaled instance.
See https://developers.google.com/appengine/docs/python/modules/#Python_Background_threads
You cannot literally run one continuous process for more than 30 seconds. However, you can use the Task Queue to have one process call another in a continuous chain. Alternatively you can schedule jobs to run with the Cron service.
Use a cron job to periodically check for pages which have not been scraped in the past n hours/days/whatever, and put scraping tasks for some subset of these pages onto a task queue. This way your processes don't get killed for taking too long, and you don't hammer the server you're scraping with excessive bursts of traffic.
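The selection step that such a cron job would run can be sketched like this. It is a minimal example under assumptions, not a fixed recipe: `select_stale_pages` is a hypothetical helper, the `last_scraped` dictionary stands in for whatever the datastore holds, and `None` marks a page that has never been scraped.

```python
from datetime import datetime, timedelta


def select_stale_pages(last_scraped, now, max_age, batch_size):
    """Return up to batch_size URLs not scraped within max_age, oldest first."""
    cutoff = now - max_age
    stale = [
        url
        for url, scraped_at in last_scraped.items()
        if scraped_at is None or scraped_at < cutoff
    ]
    # Never-scraped pages (None) sort ahead of everything else.
    stale.sort(key=lambda url: last_scraped[url] or datetime.min)
    return stale[:batch_size]
```

Each cron run would then put one scraping task per returned URL onto the task queue. The `batch_size` cap is what keeps the load on the scraped server steady.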
I've done this, and it works pretty well. Watch out for task timeouts; if things take too long, split them into multiple phases and be sure to use memcached liberally.
Try this:
Run any program on App Engine. Connect from your browser and click to send the start URL via Ajax. The Ajax call hits the server, which downloads some data from the internet and returns the next URL to your browser. This is not one request: each URL is a separate request. All you have to work out is how the JavaScript calls the URLs in a loop.
You can use the latest GAE service, called Backends. See http://code.google.com/appengine/docs/java/backends/ Backends are special App Engine instances that have no request deadlines, higher memory and CPU limits, and persistent state across requests. They are started automatically by App Engine and can run continuously for long periods. Each backend instance has a unique URL to use for requests, and you can load-balance requests across multiple instances.