Coping with silent and fail-stop errors at scale by combining replication and checkpointing.

Published in: J. Parallel Distributed Comput. (2018)

Keyphrases