Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance.
José Carlos Sancho, Fabrizio Petrini, Kei Davis, Roberto Gioiosa, Song Jiang
Published in: IPDPS (2005)
Keyphrases
- fault tolerance
- current practice
- fault tolerant
- distributed computing
- distributed systems
- load balancing
- peer to peer
- random walk
- response time
- high availability
- group communication
- database replication
- fault management
- replicated databases
- mobile agents
- high performance computing
- sensor nodes
- distributed query processing
- allocation policy
- single point of failure
- data sets
- database
- component failures
- failure recovery
- error detection
- grid computing
- database systems