Algorithmic Based Fault Tolerance Applied to High Performance Computing
George Bosilca, Remi Delmas, Jack J. Dongarra, Julien Langou
Published in: CoRR (2008)
Keyphrases
- fault tolerance
- high performance computing
- fault tolerant
- scientific computing
- distributed systems
- load balancing
- computational science
- response time
- massively parallel
- distributed computing
- peer to peer
- computing systems
- computing environments
- database replication
- mobile agents
- parallel computing
- computing resources
- grid computing
- energy efficiency
- failure recovery
- knowledge base
- cost effective
- management system