TRANSOM: An Efficient Fault-Tolerant System for Training LLMs.
Baodong Wu
Lei Xia
Qingping Li
Kangyu Li
Xu Chen
Yongqiang Guo
Tieyao Xiang
Yuheng Chen
Shigang Li
Published in:
CoRR (2023)
Keyphrases
fault tolerant
fault tolerance
distributed systems
load balancing
state machine
training set
supervised learning
training process
high availability
safety critical
artificial intelligence
digital libraries