Swift: Expedited Failure Recovery for Large-Scale DNN Training.
Yuchen Zhong
Guangming Sheng
Juncheng Liu
Jinhui Yuan
Chuan Wu
Published in:
IEEE Trans. Parallel Distributed Syst. (2024)
Keyphrases
failure recovery
training process
load balancing
fault tolerance
single link