TRANSOM: An Efficient Fault-Tolerant System for Training LLMs.

Baodong Wu, Lei Xia, Qingping Li, Kangyu Li, Xu Chen, Yongqiang Guo, Tieyao Xiang, Yuheng Chen, Shigang Li
Published in: CoRR (2023)
Keyphrases
  • fault tolerant
  • fault tolerance
  • distributed systems
  • load balancing
  • state machine
  • training set
  • supervised learning
  • training process
  • high availability
  • safety critical
  • artificial intelligence
  • digital libraries