GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints.
Zhuang Wang
Zhen Jia
Shuai Zheng
Zhen Zhang
Xinwei Fu
T. S. Eugene Ng
Yida Wang
Published in: SOSP (2023)
Keyphrases
failure recovery
fault tolerance
distributed systems
load balancing
fault tolerant
single link
multi agent
distributed environment
information systems
peer to peer
mobile agents
np complete
sensor data
distributed computing
grid computing