crawler on appengine

2023-01-04 01:41 Q&A author:

I want to run a program continuously on App Engine. This program will automatically crawl some websites continuously and store the data in its database. Is it possible for the program to keep doing this continuously on App Engine, or will App Engine kill the process?

Note: The website to be crawled is not hosted on App Engine.

I want to run a program continuously on App Engine.

Can't.

The closest you can get is background-running scheduled tasks that last no more than 30 seconds:

Notably, this means that the lifetime of a single task's execution is limited to 30 seconds. If your task's execution nears the 30 second limit, App Engine will raise an exception which you may catch and then quickly save your work or log progress.

A friend of mine suggested the following:

Create a task queue.
Start the queue by passing it some data.
Use an exception handler to catch DeadlineExceededException.
In your handler, create a new task on the queue for the same purpose.

This way you can run your job indefinitely. You only need to keep an eye on CPU time and storage usage.
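The chaining idea above can be sketched in plain Python. This is only an illustration under stated assumptions: the in-memory `task_queue`, the `TIME_BUDGET` value, and the `fetch_and_store` helper are hypothetical stand-ins, not App Engine APIs. Instead of catching DeadlineExceededException, the sketch checks elapsed time and hands the unfinished work to a follow-up task before the limit is reached.

```python
import time
from collections import deque

# Hypothetical stand-in for the App Engine task queue: a plain in-memory deque.
task_queue = deque()

# Hypothetical time budget in seconds; stands in for the 30-second limit.
TIME_BUDGET = 0.05


def fetch_and_store(url, store):
    """Placeholder for the real fetch-and-save step."""
    time.sleep(0.01)
    store.append(url)


def crawl_task(pending_urls, store, budget=TIME_BUDGET):
    """Process URLs until the budget is spent, then re-enqueue the rest."""
    started = time.monotonic()
    pending = list(pending_urls)
    while pending:
        if time.monotonic() - started > budget:
            # Out of time: hand the remaining work to a follow-up task.
            task_queue.append(pending)
            return
        fetch_and_store(pending.pop(0), store)


def run_all(seed_urls):
    """Drain the queue one task at a time, as a queue service would."""
    store = []
    task_queue.append(list(seed_urls))
    runs = 0
    while task_queue:
        crawl_task(task_queue.popleft(), store)
        runs += 1
    return store, runs
```

Calling `run_all` with more URLs than fit in one budget produces several short task runs instead of one long one, which is the point of the chain.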

You might want to consider Backends, introduced in a newer version of GAE.

These can run continuous processes.

Yes, it is possible. I have already built a solution on App Engine: wowprice.

Sharing all the details here would make my answer too long, so here is the outline.

Problem: Suppose I want to crawl walmart.com. I know that I can't crawl it in one shot (millions of products).

Solution: I designed my spider to break the job into smaller tasks.

Step 1: I submit a job for walmart.com, and the job scheduler creates a task.
Step 2: My spider picks up the job and notices that it is the index page. It then creates more jobs with the category pages as starting pages, which adds about 20 more tasks.
Step 3: The spider creates smaller jobs for the subcategories and keeps going until it reaches a product list page, for which it creates a task.
Step 4: For product list pages, it extracts the products and makes a call to store the product data. If there is a next page, it creates one more task to crawl it.

Advantages: We can crawl without breaking the 30-second rule, the crawl speed depends on the backend machines, and a single target can be crawled in parallel.
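The step-by-step decomposition above might look roughly like this in Python. It is a sketch, not the wowprice implementation: the `SITE` dictionary is a made-up miniature site, and the deque stands in for the real task queue. Each call to `handle_task` does a small amount of work and schedules follow-up tasks, so no single task has to run long.

```python
from collections import deque

# Hypothetical miniature site: page type, child links, products, next page.
SITE = {
    "/": {"type": "index", "links": ["/cat/a", "/cat/b"]},
    "/cat/a": {"type": "category", "links": ["/cat/a/list?page=1"]},
    "/cat/b": {"type": "category", "links": ["/cat/b/list?page=1"]},
    "/cat/a/list?page=1": {
        "type": "list",
        "products": ["a1", "a2"],
        "next": "/cat/a/list?page=2",
    },
    "/cat/a/list?page=2": {"type": "list", "products": ["a3"], "next": None},
    "/cat/b/list?page=1": {"type": "list", "products": ["b1"], "next": None},
}


def handle_task(url, queue, products):
    """Handle one page: either fan out into more tasks or store products."""
    page = SITE[url]
    if page["type"] in ("index", "category"):
        # Steps 2 and 3: create one smaller task per child page.
        for link in page["links"]:
            queue.append(link)
    else:
        # Step 4: store the products, then schedule the next page if any.
        products.extend(page["products"])
        if page["next"]:
            queue.append(page["next"])


def crawl(start_url):
    """Step 1: seed the queue with one job and process tasks until done."""
    queue = deque([start_url])
    products = []
    tasks_run = 0
    while queue:
        handle_task(queue.popleft(), queue, products)
        tasks_run += 1
    return products, tasks_run
```

In a real deployment the tasks would be picked up by several workers at once, which is where the parallel crawling of a single target comes from.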

They fixed it for you. You can now run background threads on a manually scaled instance.

Check https://developers.google.com/appengine/docs/python/modules/#Python_Background_threads
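As a rough illustration of the background-thread approach, here is a sketch that uses only Python's standard `threading` module, not the App Engine API described at the link above. The `BackgroundCrawler` class, its URL list, and the `interval` parameter are hypothetical names chosen for the example.

```python
import threading


class BackgroundCrawler:
    """Visits URLs in a loop on a background thread until asked to stop."""

    def __init__(self, urls, interval=0.01):
        self._urls = list(urls)
        self._interval = interval
        self._stop = threading.Event()
        self.visited = []
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        i = 0
        while not self._stop.is_set():
            # Placeholder for the real fetch-and-store step.
            self.visited.append(self._urls[i % len(self._urls)])
            i += 1
            # Sleep between visits, but wake up early if stop() is called.
            self._stop.wait(self._interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Waiting on the event rather than calling a plain sleep lets the thread shut down promptly when the instance is asked to stop.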

You cannot literally run one continuous process for more than 30 seconds. However, you can use the Task Queue to have one process call another in a continuous chain. Alternatively you can schedule jobs to run with the Cron service.

Use a cron job to periodically check for pages which have not been scraped in the past n hours/days/whatever, and put scraping tasks for some subset of these pages onto a task queue. This way your processes don't get killed for taking too long, and you don't hammer the server you're scraping with excessive bursts of traffic.
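The selection step that such a cron job performs could be sketched like this. The function name, the shape of the `last_scraped` mapping, and the batch size are assumptions made for the example; the real data would come from your datastore.

```python
from datetime import datetime, timedelta


def select_stale_pages(last_scraped, now, max_age, batch_size):
    """Return up to batch_size URLs whose last scrape is older than max_age.

    last_scraped maps URL -> datetime of the last scrape, or None if the
    page has never been scraped. Never-scraped pages come first, then the
    oldest ones.
    """
    cutoff = now - max_age
    stale = [
        (ts or datetime.min, url)
        for url, ts in last_scraped.items()
        if ts is None or ts < cutoff
    ]
    stale.sort()
    return [url for _, url in stale[:batch_size]]
```

Capping each run at `batch_size` is what keeps the scraped server from seeing a burst of traffic: the remaining stale pages are simply picked up by later cron runs.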

I've done this, and it works pretty well. Watch out for task timeouts; if things take too long, split them into multiple phases and be sure to use memcached liberally.

Try this:

Run any program on App Engine. Connect from your browser and click to request a start URL via Ajax. The Ajax call hits the server, which downloads some data from the internet and returns the next URL to your browser. This is not one long request; each URL is a separate request. You only need to work out in JavaScript how the Ajax call requests the URLs in a loop.

You can use the latest GAE service, called Backends. Check http://code.google.com/appengine/docs/java/backends/ for details. Backends are special App Engine instances that have no request deadlines, higher memory and CPU limits, and persistent state across requests. They are started automatically by App Engine and can run continuously for long periods. Each backend instance has a unique URL to use for requests, and you can load-balance requests across multiple instances.
