Improving scalability and reliability of MPI-agnostic transparent checkpointing for production workloads at NERSC.
Prashant Singh Chouhan, Harsh Khetawat, Neil Resnik, Twinkle Jain, Rohan Garg, Gene Cooperman, Rebecca Hartman-Baker, Zhengji Zhao
Published in: CoRR (2021)
Keyphrases
- fault tolerance
- distributed databases
- database workloads
- production system
- database systems
- general purpose
- high performance computing
- parallel algorithm
- shared memory
- production process
- fault tolerant
- distributed database systems
- real time
- computer systems
- message passing
- quality control
- neural network
- parallel computing
- database management systems
- failure rate
- raw material
- low overhead
- distributed systems