System-Level Fault-Tolerance in Large-Scale Parallel Machines with Buffered Coscheduling.
Fabrizio Petrini, Kei Davis, José Carlos Sancho
Published in: IPDPS (2004)
Keyphrases
- fault tolerance
- parallel machines
- fault tolerant
- distributed computing
- load balancing
- distributed systems
- high scalability
- scheduling problem
- total tardiness
- response time
- replicated databases
- massively parallel
- database replication
- group communication
- fault management
- single server
- high performance computing
- minimize total
- peer to peer
- mobile agents
- unrelated parallel machines
- data sets
- data replication
- failure recovery
- error detection
- parallel computing
- reinforcement learning