A failure detector for HPC platforms.
George Bosilca, Aurélien Bouteiller, Amina Guermouche, Thomas Hérault, Yves Robert, Pierre Sens, Jack J. Dongarra
Published in: Int. J. High Perform. Comput. Appl. (2018)
Keyphrases
- high performance computing
- fault tolerance
- detection algorithm
- detection method
- fault tolerant
- root cause
- three dimensional
- component failures
- failure recovery
- feature detectors
- failure rate
- failure detection
- messaging service
- genetic algorithm
- success or failure
- scientific computing
- highly reliable
- massively parallel
- parallel computing
- social networks