Evaluating and extending user-level fault tolerance in MPI applications.
Ignacio Laguna, David F. Richards, Todd Gamblin, Martin Schulz, Bronis R. de Supinski, Kathryn Mohror, Howard Pritchard
Published in: Int. J. High Perform. Comput. Appl. (2016)
Keyphrases
- fault tolerance
- fault tolerant
- high performance computing
- distributed systems
- distributed computing
- response time
- load balancing
- group communication
- peer to peer
- high availability
- database replication
- replicated databases
- mobile agents
- failure recovery
- message passing
- cooperative
- error detection
- fault management
- parallel algorithm
- single point of failure
- massively parallel
- parallel implementation
- knowledge acquisition
- high scalability
- fine grained