Supporting task-level fault-tolerance in HPC workflows by launching MPI jobs inside MPI jobs.
Matthieu Dorier, Justin M. Wozniak, Robert B. Ross
Published in: WORKS@SC (2017)
Keyphrases
- fault tolerance
- high performance computing
- fault tolerant
- message passing interface
- distributed systems
- load balancing
- processing times
- high availability
- response time
- message passing
- parallel machines
- parallel algorithm
- distributed computing
- database replication
- peer to peer
- parallel computing
- group communication
- flowshop
- replicated databases
- shared memory
- database
- mobile agents
- single point of failure
- fault management
- scheduling problem
- computational grids
- web services
- data replication
- error detection
- massively parallel
- computing environments
- business processes
- energy efficiency
- mobile agent system
- business process
- intelligent agents