Managing a Distributed Cluster?
Suppose you have set up a Cassandra cluster: a 10 TB database distributed evenly across 10 nodes, and everything runs smoothly.
Now suppose you have 100 machines at your disposal, each reading (different) data from the Cassandra cluster. In addition, you have many jobs that constantly need to be run, each at a different time (and, obviously, each on a different machine).
How do you manage all these tasks/jobs? How do you distribute the tasks between the machines? How do you keep track of the jobs/machines in the process?
Are there any open-source tools (preferably with a Python client) that help do this in a Linux environment?
What you need is a grid/HPC framework to handle your distributed infrastructure and run your jobs.
On Unix/Linux there are two systems that might be of good use to you: the Portable Batch System (PBS) and Condor.
How do you manage all these tasks/jobs?
Both Condor and PBS have a master node that acts as the receptor of every job/task. With each job/task you can associate a priority level and discriminators, and the cluster administrator sets up scheduling rules based on those discriminators.
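For example, with Condor the priority and the discriminators live in the job's submit description as ClassAd attributes. A minimal sketch using the htcondor Python bindings (the executable, arguments, requirements expression, and priority value below are placeholders, and attribute handling may differ slightly between Condor versions):

```python
import htcondor  # HTCondor Python bindings

# Describe one job: what to run, where to log, and the discriminators
# (priority, requirements) the scheduler can use to place it.
job = htcondor.Submit({
    "executable": "/usr/bin/python3",           # placeholder executable
    "arguments": "process_chunk.py --shard 7",  # placeholder arguments
    "output": "chunk7.out",
    "error": "chunk7.err",
    "log": "chunk7.log",
    "request_cpus": "1",
    "request_memory": "2GB",
    "priority": "10",                    # higher runs earlier among your own jobs
    "requirements": 'OpSys == "LINUX"',  # example discriminator
})
```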
How do you distribute the tasks between the machines?
Condor or PBS will do this for you; you only need to submit the job to the master node and specify its priority, inputs, outputs, etc.
You can periodically check whether a job has finished, subscribe to notifications via different mechanisms, or do a sort of job.wait() to block until it finishes.
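Continuing the sketch above, submitting and then blocking on the job from Python might look like this (the exact API differs between Condor versions, and wait_for_job is a hypothetical polling helper, not a library call):

```python
import time
import htcondor

schedd = htcondor.Schedd()  # talk to the local condor_schedd

def wait_for_job(cluster_id, poll_seconds=30):
    """Hypothetical helper: poll the queue until the cluster leaves it."""
    while True:
        ads = schedd.query(
            constraint=f"ClusterId == {cluster_id}",
            projection=["JobStatus"],
        )
        if not ads:            # completed jobs drop out of the queue
            return
        time.sleep(poll_seconds)

# 'job' is the Submit object sketched earlier
result = schedd.submit(job, count=1)   # newer bindings; older ones use transactions
wait_for_job(result.cluster())
```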
How do you keep track of the jobs / machines in the process?
Both PBS and Condor have top-like commands to list jobs that are queued (waiting), running, or cancelled. They also have utilities to stop or cancel a job if the process allows snapshots.
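From Python, a rough equivalent of such a listing could look like this (assuming the htcondor bindings; the JobStatus codes are the standard Condor ones, but double-check them against your version):

```python
import htcondor

# Standard Condor JobStatus codes
STATUS = {1: "idle", 2: "running", 3: "removed", 4: "completed", 5: "held"}

schedd = htcondor.Schedd()
for ad in schedd.query(projection=["ClusterId", "ProcId", "Owner", "JobStatus"]):
    print(ad["ClusterId"], ad["ProcId"], ad["Owner"],
          STATUS.get(ad["JobStatus"], "unknown"))
```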
For a large cluster, my advice is to try Condor. It has been around for ages to solve problems exactly like the one you have. Here are some examples for Condor + Python.
Other more recent solutions to consider are:
- Celery, a distributed task queue for Python (see the sketch after this list).
- DiscoProject, a distributed computing framework based on the MapReduce paradigm.
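For the "many jobs that constantly need to run, each at a different time" part, a Celery setup is a natural fit: each of the 100 machines runs a worker, and tasks are pushed onto a shared broker. A minimal sketch, assuming a Redis broker/backend at localhost and a hypothetical process_shard task:

```python
# tasks.py -- run "celery -A tasks worker" on each of the 100 machines
from celery import Celery
from celery.schedules import crontab

app = Celery("tasks",
             broker="redis://localhost:6379/0",   # assumed broker URL
             backend="redis://localhost:6379/1")  # stores results / job state

@app.task
def process_shard(shard_id):
    # hypothetical task: read this shard's rows from Cassandra and crunch them
    return f"shard {shard_id} processed"

# Periodic jobs, each on its own schedule, driven by "celery -A tasks beat"
app.conf.beat_schedule = {
    "nightly-shard-7": {
        "task": "tasks.process_shard",
        "schedule": crontab(hour=3, minute=0),  # every day at 03:00
        "args": (7,),
    },
}
```

Submitting and tracking a job from a client is then result = process_shard.delay(7) followed by result.get() or result.status.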