Characterizing and Understanding HPC Job Failures Over The 2K-Day Life of IBM BlueGene/Q System.

Sheng Di, Hanqi Guo, Eric Pershey, Marc Snir, Franck Cappello
Published in: DSN (2019)
Keyphrases
  • fault tolerance
  • high performance computing
  • root cause
  • data sets
  • databases
  • daily life
  • future development
  • failure rate
  • batch processing