fault tolerance in MPICH/OpenMPI

2023-02-18 17:14 Q&A author:

I have two questions:

Q1. Is there a more efficient way to handle errors in MPI than checkpoint/rollback? I see that if a node "dies", the program halts abruptly. Is there any way to continue execution after a node dies? (It is fine if this comes at the cost of accuracy.)

Q2. I read in "http://stackoverflow.com/questions/144309/what-is-the-best-mpi-implementation" that OpenMPI has better fault tolerance and that MPICH-2 has recently added similar features. Does anybody know what they are and how to use them? Is it a "mode"? Can they help in the situation described in Q1?

Kindly reply. Thank you.

MPI implementations, all of them, have been able to continue after an error for some time. The default is to die (that is, the default error handler is MPI_ERRORS_ARE_FATAL), but that can be changed (eg, see the discussion here). The standard doesn't currently specify much beyond that, though, so it's hard to recover and continue after such an error. If your program is sufficiently simple, such as some sort of master-worker setup, it may be possible to continue this way.

The MPI Forum is currently working on what will become MPI-3, and error handling and fault tolerance will be an important component of the new standard (there's a working group dedicated to the topic). Until that work is complete, however, the only way to get stronger fault tolerance out of MPI is to use earlier, nonstandard extensions. FT-MPI was a project that developed a very robust MPI, but unfortunately it's based on MPI 1.2, a very early version of the standard. The claim here is that they're now working with OpenMPI, but I don't know what's become of that. There's also MPICH-V, based on MPI-2, but that's more checkpoint-restart based than what I think you're looking for.

Updated to add: Fault tolerance didn't make it into MPI-3, but the working group continues its work, and the expectation is that something will come of it before too long.
