MPI jobs within MPI jobs: A practical way of enabling task-level fault-tolerance in HPC workflows.
Justin M. Wozniak, Matthieu Dorier, Robert B. Ross, Tong Shu, Tahsin M. Kurç, Li Tang, Norbert Podhorszki, Matthew Wolf
Published in: Future Gener. Comput. Syst. (2019)
Keyphrases
- fault tolerance
- high performance computing
- fault tolerant
- message passing interface
- distributed systems
- load balancing
- processing times
- distributed computing
- response time
- high availability
- message passing
- computational grids
- peer to peer
- group communication
- parallel computing
- grid computing
- parallel machines
- parallel algorithm
- massively parallel
- replicated databases
- database replication
- parallel implementation
- data sets
- shared memory
- flowshop
- mobile agents
- computing resources
- multimedia
- database
- workflow management systems
- energy efficiency
- sensor nodes
- fine grained
- component failures