The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI.
Joshua Hursey, Jeffrey M. Squyres, Timothy Mattox, Andrew Lumsdaine
Published in: IPDPS (2007)
Keyphrases
- fault tolerance
- fault tolerant
- high performance computing
- distributed systems
- design process
- load balancing
- distributed computing
- high availability
- response time
- database replication
- group communication
- error detection
- failure recovery
- data sets
- replicated databases
- random walk
- database
- sensor nodes
- mobile agents
- peer to peer
- multimedia
- knowledge base
- fault management