A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems.
Ifeanyi P. Egwutuoha, David Levy, Bran Selic, Shiping Chen
Published in: J. Supercomput. (2013)
Keyphrases
- fault tolerance
- computing systems
- fault tolerant
- computer systems
- distributed computing
- distributed systems
- load balancing
- computing technologies
- response time
- peer to peer
- autonomic computing
- heterogeneous systems
- mobile agents
- database replication
- single point of failure
- fault management
- parallel computing
- replicated databases
- high performance computing
- graphics processing units
- artificial intelligence
- component failures