System-Level Fault-Tolerance in Large-Scale Parallel Machines with Buffered Coscheduling.

Fabrizio Petrini, Kei Davis, José Carlos Sancho

Published in: IPDPS (2004)

Keyphrases

fault tolerance
parallel machines
fault tolerant
distributed computing
load balancing
distributed systems
high scalability
scheduling problem
total tardiness
response time
replicated databases
massively parallel
database replication
group communication
fault management
single server
high performance computing
minimize total
peer to peer
mobile agents
unrelated parallel machines
data sets
data replication
failure recovery
error detection
parallel computing
reinforcement learning