Managing a Distributed Cluster?
Suppose you have set up a Cassandra cluster: a 10 TB database distributed evenly across 10 nodes, and everything runs smoothly.
Now suppose you have 100 machines at your disposal, each reading (different) data from the Cassandra cluster. In addition, you have many jobs that constantly need to be run, each at a different time (and, obviously, each on a different machine).
How do you manage all these tasks/jobs? How do you distribute the tasks between the machines? How do you keep track of the jobs/machines in the process?
Are there any open-source tools (preferably with a Python client) that help do this in a Linux environment?
What you need is a grid/HPC framework to handle your distributed infrastructure and run your jobs.
On Unix/Linux there are two systems that might be of good use to you: the Portable Batch System (PBS) and Condor.
How do you manage all these tasks/jobs?
Both Condor and PBS have a master node that acts as the receptor of every job/task. With each job/task you can associate a priority level and discriminators, and the cluster administrator sets up scheduling rules based on those discriminators.
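For example, with Condor the priority and the discriminators live in the job's submit description as ClassAd attributes. A minimal sketch using the htcondor Python bindings (the executable, arguments, requirements expression, and priority value below are placeholders, and attribute handling may differ slightly between Condor versions):

```python
import htcondor  # HTCondor Python bindings

# Describe one job: what to run, where to log, and the discriminators
# (priority, requirements) the scheduler can use to place it.
job = htcondor.Submit({
    "executable": "/usr/bin/python3",           # placeholder executable
    "arguments": "process_chunk.py --shard 7",  # placeholder arguments
    "output": "chunk7.out",
    "error": "chunk7.err",
    "log": "chunk7.log",
    "request_cpus": "1",
    "request_memory": "2GB",
    "priority": "10",                    # higher runs earlier among your own jobs
    "requirements": 'OpSys == "LINUX"',  # example discriminator
})
```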
How do you distribute the tasks between the machines?
Condor or PBS will do this for you; you only need to submit the job to the master node and specify its priority, inputs, outputs, etc.
You can periodically check whether a job has finished, subscribe to notifications via different mechanisms, or do a sort of job.wait() to block until it finishes.
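Continuing the sketch above, submitting and then blocking on the job from Python might look like this (the exact API differs between Condor versions, and wait_for_job is a hypothetical polling helper, not a library call):

```python
import time
import htcondor

schedd = htcondor.Schedd()  # talk to the local condor_schedd

def wait_for_job(cluster_id, poll_seconds=30):
    """Hypothetical helper: poll the queue until the cluster leaves it."""
    while True:
        ads = schedd.query(
            constraint=f"ClusterId == {cluster_id}",
            projection=["JobStatus"],
        )
        if not ads:            # completed jobs drop out of the queue
            return
        time.sleep(poll_seconds)

# 'job' is the Submit object sketched earlier
result = schedd.submit(job, count=1)   # newer bindings; older ones use transactions
wait_for_job(result.cluster())
```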
How do you keep track of the jobs / machines in the process?
Both PBS and Condor have top-like commands to list jobs that are queued (waiting), running, or cancelled. They also have utilities to stop or cancel a job if the process allows snapshots.
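From Python, a rough equivalent of such a listing could look like this (assuming the htcondor bindings; the JobStatus codes are the standard Condor ones, but double-check them against your version):

```python
import htcondor

# Standard Condor JobStatus codes
STATUS = {1: "idle", 2: "running", 3: "removed", 4: "completed", 5: "held"}

schedd = htcondor.Schedd()
for ad in schedd.query(projection=["ClusterId", "ProcId", "Owner", "JobStatus"]):
    print(ad["ClusterId"], ad["ProcId"], ad["Owner"],
          STATUS.get(ad["JobStatus"], "unknown"))
```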
For a large cluster, my advice is to try Condor. It has been around for ages to solve problems exactly like the one you have. Here are some examples for Condor + Python.
Other more recent solutions to consider are:
- Celery, a distributed task queue for Python (see the sketch after this list).
- DiscoProject, a distributed computing framework based on the MapReduce paradigm.
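For the "many jobs that constantly need to run, each at a different time" part, a Celery setup is a natural fit: each of the 100 machines runs a worker, and tasks are pushed onto a shared broker. A minimal sketch, assuming a Redis broker/backend at localhost and a hypothetical process_shard task:

```python
# tasks.py -- run "celery -A tasks worker" on each of the 100 machines
from celery import Celery
from celery.schedules import crontab

app = Celery("tasks",
             broker="redis://localhost:6379/0",   # assumed broker URL
             backend="redis://localhost:6379/1")  # stores results / job state

@app.task
def process_shard(shard_id):
    # hypothetical task: read this shard's rows from Cassandra and crunch them
    return f"shard {shard_id} processed"

# Periodic jobs, each on its own schedule, driven by "celery -A tasks beat"
app.conf.beat_schedule = {
    "nightly-shard-7": {
        "task": "tasks.process_shard",
        "schedule": crontab(hour=3, minute=0),  # every day at 03:00
        "args": (7,),
    },
}
```

Submitting and tracking a job from a client is then result = process_shard.delay(7) followed by result.get() or result.status.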