How to use database server for distributed job scheduling?
I have around 100 computers and few workers on each of them. The already connect to a central database to query for job parameters.
Now I have to do job schedulin开发者_开发问答g for them. One job for one worker takes few minutes, doesn't require network connection (except for dealing jobs and reporting) and can be done at any time in any order.
Constraints:
- no job will be taken/done twice
- there will be a timeeout in case the worker dies.
I thought that I could use a separate table in DB to schedule jobs. How to create and access job scheduling table?
Break it down into pieces:
You have a job description - which may have some constraints on where its performed - and you have a queue of jobs to be performed, broadly speaking if there are no other constraints you'd expect the jobs to be done in order i.e. you take them from the front of the queue and add them to the end.
If we run the queue as a single table then jobs will have 3 states:
- Not Started
- In Progress
- Completed
So looking for work to do is simply a matter of finding the first job (or the first n jobs if work is to be assigned in a batch) that are not started and then flagging them as started. If the constraints are more complex than simply the first n available jobs then it becomes the first n available jobs that meet the constraints, but it should still be fairly simple.
That implies the following fields in the queue table:
- Status
- DateQueued (date and time) for sort
- DateStarted (date and time) for timeout
- AssignedTo
One should probably add a DateCompleted - but if the work is batched that's not going to be strictly accurate (it'll be time reported as complete).
So for a worker (the worker "app") the process becomes:
- Connect to server
- Report completed work - set status and completion time
- Request new work
- Search for new work for worker (first n jobs not started that worker can do)
- Assign new work to worker (set status, Date Started and assigned to) - search and assign as a transaction.
- List work and disconnect
Separately you'd need processes to queue work, to look for jobs that have "timed out" so that the status can be reset and to archive or otherwise clear out completed jobs from the queue.
Full table would have the following plus any audit fields required.
- ID
- JobID -- Assuming that jobs are defined elsewhere
- StatusID
- DateQueued
- DateStarted
- AssignedToID
- DateCompleted
Hope that helps...
The interesting part, and the part where all of the difficulties lie, is in wrapping things up in transactions.
You'll need two tables: A table of available work, and a table recording work that is in progress. The "work in progress" table has a unique foreign key to the work-available table.
A process wishing to do work first locates a row from the table of work to be done. This should be done using a random sort order, in order to reduce contention.
That process the removes the "work in progress" row. It was never meant to be persist outside the transaction. It's only for locking.
That process then starts a transaction.
That process then creates a row in the "work in progress" table, with a foreign key referencing the work that is being done. It should then do the work. As a part of doing that work, it should change the state of the item being worked on (e.g., making it "finished" and no longer available to be worked on).
The process the commits its transaction.
If some other process has grabbed the work, then this process's transaction will fail due to its attempt to commit a duplicate foreign key to the "work in progress" table. In that case, the process should back off for a short, random interval, and to back to the start, trying to locate some work to do.
Monitor the "work in progress" table carefully. Some databases, or some versions of some database, don't expect a table such work-in-progress table to be used as a queue, with rows constantly being created and deleted. Specifically, older versions of Postgresql had difficulty cleaning up the old, no longer used rows, causing table bloat and poor performance.
精彩评论