Rails Queue Management
I am building a job that fetches and re-validates information from a remote website. I already have it implemented with a queue that works roughly like this: a text file is read, sliced up into 5k increments, and handed off to thread processors; each processor then quits and a new worker is generated.
I am looking into Resque, but I have a general design question about problems like this. If I have a job that could potentially be 5-20M units of work, what is the best practice for storing the queue? For instance, I could chunk the work up and store it, then create a job for each chunk, or I could have 5-20M individual line items in the queue. It seems like there would be a lot of overhead in the work being fetched/regenerated. But there is also decent overhead, and more coding, in chunking the work.
Based on what we've done and seen, a good approach is to chunk the work at run time rather than ahead of time. In other words, use a master/slave pattern, event- or time-driven, in which the master slices the work/data space into granular tasks/chunks when the job gets queued and run.
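As a minimal sketch of that slicing step (the names ChunkJob and slice_work are hypothetical, not part of Resque or any other library), the master can describe each chunk by an offset and a count instead of storing millions of individual line items:

```ruby
# Hypothetical sketch: the master turns one coarse job into chunk descriptors.
# Each descriptor is just (offset, count); the worker reads its own slice later.
ChunkJob = Struct.new(:offset, :count)

def slice_work(total_units, chunk_size)
  raise ArgumentError, "chunk_size must be positive" unless chunk_size.positive?

  (0...total_units).step(chunk_size).map do |offset|
    # The last chunk may be shorter than chunk_size.
    ChunkJob.new(offset, [chunk_size, total_units - offset].min)
  end
end

# Example: 12,000 units sliced into chunks of 5,000.
slice_work(12_000, 5_000).each do |job|
  puts "offset=#{job.offset} count=#{job.count}"
end
```

With Resque, each descriptor could then be enqueued as its own job, for example via Resque.enqueue with the offset and count as arguments, so the queue holds thousands of small entries rather than millions.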
The reason for this is that viewing jobs in the schedule is much easier when done at a coarse grain level. At this level, the jobs correspond to the units that you're tracking (webpages, a user profile, or streaming data from a sensor, for example).
We often see work sliced at a fine-grained level, with each worker then handling a reasonable collection of tasks. We've found that having each worker process multiple tasks (roughly 20-1000, depending on the type and length of the task) provides a good balance between:
- optimizing setup (establishing a database connection for example)
- providing good introspection into the jobs
- making retries and exception handling more manageable
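A rough illustration of that balance, assuming a hypothetical process_batch helper with caller-supplied setup and handler lambdas: the setup cost is paid once per batch, and failures are recorded per task so that only those tasks need a retry.

```ruby
# Hypothetical sketch: one worker handles a whole batch of tasks.
BatchResult = Struct.new(:succeeded, :failed)

def process_batch(tasks, setup:, handler:)
  resource = setup.call # e.g. open one database connection for the batch
  result = BatchResult.new([], [])

  tasks.each do |task|
    begin
      handler.call(resource, task)
      result.succeeded << task
    rescue StandardError => e
      # Record the failure and keep going; only this task needs a retry.
      result.failed << [task, e.message]
    end
  end

  result
end

# Example with a fake connection and one task that fails validation.
setup_calls = 0
result = process_batch(
  %w[a b bad c],
  setup: -> { setup_calls += 1; :fake_connection },
  handler: ->(_conn, task) { raise "cannot validate #{task}" if task == "bad" }
)
puts "succeeded=#{result.succeeded.inspect} failed=#{result.failed.inspect}"
```

The failed list is also what makes introspection cheap: it says which tasks in the batch went wrong and why, without rerunning the rest.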
You'd want the processing time for each worker to be measured in minutes rather than hours, so that you have more visibility into worker performance and so that a retry only affects a limited part of the work space. Making use of a NoSQL solution (especially database-as-a-service ones like MongoHQ or MongoLab) lets you easily track and manage the chunking and the in-process work.
Another recommendation is to create workers that are independent of your application environment. This means writing each worker to be reasonably self-contained, as well as using callbacks, database flags, and other asynchronous approaches. It may be a bit more work, but just like an MVC application design, it gives you much greater agility and allows the work to be distributed over elastic worker systems.
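To make the tracking idea concrete, here is a small in-memory sketch. ChunkTracker and its methods are hypothetical; in practice the same records would live in a shared store such as the hosted databases mentioned above, and claiming a chunk would need to be an atomic update there.

```ruby
# Hypothetical in-memory tracker for chunk state.
# Not safe for concurrent use; a shared store would need an atomic claim.
class ChunkTracker
  def initialize(chunk_ids)
    @state = chunk_ids.map { |id| [id, :pending] }.to_h
  end

  # Hand out the next pending chunk and mark it in progress; nil if none left.
  def claim
    id, _state = @state.find { |_, s| s == :pending }
    @state[id] = :in_progress if id
    id
  end

  def complete(id)
    @state[id] = :done
  end

  # Put a failed chunk back so another worker can pick it up.
  def release(id)
    @state[id] = :pending
  end

  # Number of chunks in each state, e.g. { done: 1, pending: 2 }.
  def counts
    @state.values.each_with_object(Hash.new(0)) { |s, acc| acc[s] += 1 }
  end
end

tracker = ChunkTracker.new([0, 5_000, 10_000])
first = tracker.claim
tracker.complete(first)
second = tracker.claim
tracker.release(second) # simulate a failed worker
puts tracker.counts.inspect
```

Because a released chunk simply goes back to pending, a retry redoes one chunk rather than the whole job.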
(Full disclosure: I'm on the team at Iron.io, maker of IronMQ, IronWorker, and IronCache.)