cron job periodicity and amount of work
I am working on a blog-aggregation project. One of the main tasks is fetching the blogs' RSS feeds and processing them. I currently have about 500 blogs, but the number will increase steadily over time (it should reach the thousands soon).
Currently (still in beta), I have a cron job that fetches all the RSS feeds once a day. But this concentrates all the processing and network I/O into a single daily run.
Should I:
- Keep the current situation (all at once)
- Fetch number_of_blogs / 24 feeds every hour (constant cron job timing)
- Change the cron periodicity so that each run fetches a constant number of feeds (e.g. 10 blogs at a shorter interval)
Or are there any other ideas?
I am on shared hosting, so anything that reduces CPU and network I/O is much appreciated :)
I have used a system that adapts the update frequency of the feed, described in this answer.
You can save resources by using conditional HTTP GETs to retrieve feeds from servers that support them. Keep the values of the Last-Modified and ETag headers from the HTTP response. On the next request, supply their values in the If-Modified-Since and If-None-Match request headers.
Now if you receive the HTTP 304 response code, you know the feed hasn't changed. In this case the complete feed hasn't been sent again, only the headers telling you there are no new posts. This reduces bandwidth and data processing.
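A conditional GET along these lines could look like the sketch below. It uses only Python's standard library; the function names `build_conditional_headers` and `fetch_feed` are made up for illustration, and storing the returned ETag and Last-Modified values per feed (for example in the database row of each blog) is left to the caller.

```python
import urllib.error
import urllib.request


def build_conditional_headers(etag=None, last_modified=None):
    """Build the request headers for a conditional GET (hypothetical helper)."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_feed(url, etag=None, last_modified=None, timeout=30):
    """Return (body, etag, last_modified).

    body is None when the server answers 304 Not Modified, in which case
    the previously stored validators are returned unchanged.
    """
    request = urllib.request.Request(
        url, headers=build_conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as error:
        # urllib reports 304 as an HTTPError; it just means "nothing new".
        if error.code == 304:
            return None, etag, last_modified
        raise
```

When `fetch_feed` returns `None` as the body, the feed can be skipped entirely for this run, so no parsing work is done for unchanged blogs.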
I had a similar situation, but not so many blogs :) I used to import them once every 24 hours, but to reduce CPU load I called sleep() after every blog, like sleep(10), and it kept me safe.
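As a rough sketch of that throttling idea in Python (the answer does not say which language was used; `fetch_all` and its parameters are hypothetical names, and `fetch` stands for whatever function downloads and processes a single feed):

```python
import time


def fetch_all(feed_urls, fetch, pause_seconds=10, sleep=time.sleep):
    """Fetch feeds one at a time, pausing between consecutive fetches.

    `sleep` is injectable only so the pause can be replaced in tests.
    """
    results = {}
    for index, url in enumerate(feed_urls):
        if index > 0:
            # Spread the load out instead of hitting CPU and network at once.
            sleep(pause_seconds)
        try:
            results[url] = fetch(url)
        except Exception as error:
            # One broken feed should not abort the whole run.
            results[url] = error
    return results
```

Keep in mind that the pauses add up: with 500 blogs and a 10-second pause, a full run takes roughly 83 minutes, which may exceed the script time limits some shared hosts impose.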
I would consider using Google App Engine to retrieve and process the 'raw' information and have it POST the data out in manageable-size packets to the web server. GAE has its own cron job system and can run independently 24/7.
I am currently using a similar system to retrieve job information from several websites and compile it for another; it is a brilliant way to offload the bandwidth and processing requirements as well.