Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters.
Saurabh Jha, Valerio Formicola, Catello Di Martino, Mark Dalton, William T. Kramer, Zbigniew Kalbarczyk, Ravishankar K. Iyer
Published in: IEEE Trans. Dependable Secur. Comput. (2018)
Keyphrases
- failure recovery
- fault tolerance
- failure detection
- power dissipation
- disaster recovery
- high performance computing
- fault tolerant
- input output
- high speed
- case study
- recovery algorithm
- single link
- massively parallel
- test bed
- fiber optic
- root cause
- lower cost
- low power
- databases
- power consumption
- signal processing
- peer to peer
- distributed systems
- image processing
- information systems