Characterizing and Understanding HPC Job Failures Over The 2K-Day Life of IBM BlueGene/Q System.

Sheng Di, Hanqi Guo, Eric Pershey, Marc Snir, Franck Cappello
Published in: DSN (2019)
Keyphrases
  • fault tolerance
  • high performance computing
  • root cause
  • data sets
  • databases
  • daily life
  • future development
  • failure rate
  • batch processing