Swift: Expedited Failure Recovery for Large-Scale DNN Training.

Yuchen Zhong, Guangming Sheng, Juncheng Liu, Jinhui Yuan, Chuan Wu
Published in: IEEE Trans. Parallel Distributed Syst. (2024)
Keyphrases
  • failure recovery
  • training process
  • load balancing
  • fault tolerance
  • single link