Fault tolerance of MPI applications in exascale systems: The ULFM solution.
Nuria Losada, Patricia González, María J. Martín, George Bosilca, Aurélien Bouteiller, Keita Teranishi
Published in: Future Gener. Comput. Syst. (2020)
Keyphrases
- fault tolerance
- high performance computing
- fault tolerant
- distributed systems
- scientific computing
- fault management
- single point of failure
- load balancing
- high availability
- management system
- computing systems
- storage systems
- peer to peer
- high scalability
- computer systems
- distributed computing
- replicated databases
- database replication
- group communication
- grid computing
- data sets
- computing environments
- mobile agents
- expert systems
- massively parallel
- multi agent systems